The following exception may occur when you try to read large records:
OverrunException: Reader exceeded the read limit '1024000'
When this happens, check the buffer limit (Buffer Limit (KB)) or any other limit imposed by the data format in the pipeline (for example, Max Record Length (chars) for the Delimited data format).
These limits are configured separately for each pipeline.
(For examples of these exceptions with the Directory and Amazon S3 origins, see Parser Overrun Errors or Max record length (chars) property doesn't take effect.)
If changing the buffer limit does not help, you may need to raise the parser limit at the SDC level, which defaults to 1 MB (1048576 bytes). In that case you may see the following exception:
...java.lang.IllegalArgumentException: overRunLimit '4280000' must be
greater than 0 and less than or equal to 1048576
The overrun limit controls the maximum number of bytes that can be read for a single record. To increase it, follow the steps for your installation:
- For SDC 2.6 and earlier versions, add this line to the SDC environment configuration file (sdc-env.sh/sdcd-env.sh):
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -DDataFactoryBuilder.OverRunLimit=2097152"
- For SDC versions later than 2.6, set the parser.limit property in the sdc.properties file (or, depending on the installation, in Cloudera Manager):
parser.limit=2097152
- For Cloudera Manager installations, set parser.limit in "Data Collector Advanced Configuration Snippet (Safety Valve) for sdc.properties":
parser.limit=2097152
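In all cases, you must restart Data Collector for the change to take effect. With the value shown, the limit becomes 2 MB (2097152 bytes). The exception message states that the requested overRunLimit must be less than or equal to the parser limit, so choose a value at least as large as the overRunLimit reported in the exception.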
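The value 2097152 is 2 MB expressed in bytes (2 * 1024 * 1024), on the same scale as the 1048576 default. As a minimal sketch, assuming a POSIX shell, a small helper can produce the value for other sizes. The function name mb_to_bytes and the variable LIMIT_MB are hypothetical and not part of SDC:

```shell
#!/bin/sh
# Convert a size in MB to bytes (1 MB = 1024 * 1024 bytes,
# which matches the 1048576 default shown in the exception).
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical target size; adjust to your largest expected record.
LIMIT_MB=2

# Print the line to place in sdc.properties.
echo "parser.limit=$(mb_to_bytes "$LIMIT_MB")"
```

With LIMIT_MB=2 this prints parser.limit=2097152, the same value used in the steps above.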
--------------------------------------------------------------------------------------------------------------
parser.limit - Maximum parser buffer size that origins can use to process data. Limits the size of the data that can be parsed and converted to a record.
Buffer Limit (KB) - Maximum buffer size. The buffer size determines the size of the record that can be processed. Decrease when memory on the Data Collector machine is limited. Increase to process larger records when memory is available.
Max Record Length (chars) - The maximum number of characters in a record. Longer records are diverted to the pipeline for error handling.
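The relationship between a requested read limit and parser.limit, as implied by the IllegalArgumentException message, can be sketched as a simple range check. This is an illustration only, not SDC code. The function name fits_parser_limit is hypothetical, and both arguments are assumed to be in bytes:

```shell
#!/bin/sh
# Illustrative check only; not taken from SDC.
# Mirrors the range stated in the exception message:
# the requested limit must be > 0 and <= the parser limit.
fits_parser_limit() {
  requested=$1
  parser_limit=$2
  [ "$requested" -gt 0 ] && [ "$requested" -le "$parser_limit" ]
}

# Values from the exception above: requested 4280000, default limit 1048576.
if fits_parser_limit 4280000 1048576; then
  echo "within parser.limit"
else
  echo "exceeds parser.limit"
fi
```

For these values the check fails, which corresponds to the exception. Note that 4280000 would also exceed a parser.limit of 2097152, so the configured value should be sized to the largest limit any pipeline requests.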
March 01, 2021 12:28