If you run into an exception similar to the one below, it means that Data Collector allocated all available private classloaders.
You can estimate how many private classloaders an SDC instance needs by counting the Hadoop stages used across all pipelines that run simultaneously (stopping a pipeline should release its private classloaders).
java.lang.RuntimeException: Could not get a private ClassLoader for
'streamsets-datacollector-cdh_5_7-lib', for stage
'com_streamsets_pipeline_stage_destination_hdfs_HdfsDTarget',
active private ClassLoaders='50': java.util.NoSuchElementException: Pool exhausted
If your running pipelines use more than 50 Hadoop stages, you need to increase max.stage.private.classloaders in the sdc.properties file; it is set to 50 by default.
From the sdc.properties file:
#Maximum number of private classloaders to allow in the data collector.
#Stages that have configuration singletons (i.e. Hadoop FS & HBase) require private classloaders
max.stage.private.classloaders=50
For example, if 100 pipelines within one SDC are running and writing to an HDFS destination, you have to set this value to at least 100. If a single pipeline contains an HDFS destination, a Hive Metadata processor, and a Hive Metastore destination, that one pipeline alone needs 3 private classloaders.
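As a rough sizing sketch combining those two examples (the figures are illustrative, not a recommendation): 100 simultaneously running pipelines that each use those 3 Hadoop stages would need 100 x 3 = 300 private classloaders, so you could set:
max.stage.private.classloaders=300
Changes to sdc.properties take effect only after Data Collector is restarted.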
Other stages that use a private classloader include (this is not an exhaustive list): MapR DB origin, MapR DB target, HDFS origin, HDFS metadata executor, Hadoop FS target, BigTable target, Hive Metadata processor, Hive Query executor, Hive target, Hive Metastore target, Spark processor, Amazon S3 target, Kudu Lookup processor, Kudu target, MapReduce executor, HBase target, and HBase Lookup processor.
If you installed SDC with Cloudera Manager, you can set the value in the Cloudera Manager UI under StreamSets configuration > "Data Collector Advanced Configuration Snippet (Safety Valve) for sdc.properties".
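For example, you could paste a line like the following into that safety valve field (300 is an illustrative value sized for the sketch above):
max.stage.private.classloaders=300
Cloudera Manager applies safety valve entries by overriding the corresponding properties in sdc.properties, and the Data Collector role typically needs to be restarted for the change to take effect.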
Leaving max.stage.private.classloaders unlimited, or set to a value much higher than you need, is not recommended. Each private classloader consumes memory, so an oversized limit drives up memory consumption and puts pressure on JVM heap management and garbage collection, which in turn hurts performance. These problems can become very difficult to trace when you hit an issue with a pipeline.
What if my running pipelines do not use more than 50 Hadoop stages?
If you run only a few pipelines using Hadoop stages simultaneously in one Data Collector and the total is not higher than 50, yet you still see this exception, please contact our support team, as Data Collector might not be releasing the private classloaders properly.