Kudu: Recommended Number of Worker Threads - SDC 3.3.1 and later.



Data Collector 3.3.1 adds a new parameter, "Maximum Number of Worker Threads", to the Kudu Destination and the Kudu Lookup Processor. It controls how many worker threads the stage creates. Use it to balance performance against load on Data Collector, especially as the number of Kudu stages in running pipelines grows.

 

Each worker thread processes acknowledgments from the Kudu service that a given write has completed. If no worker thread is available to accept an acknowledgment, the request simply waits in a queue until a worker thread becomes available. The parameter is therefore a way to tune the performance of the Kudu stages and has no impact on the “correctness” of the pipeline.

 

The number of write requests depends on many environmental factors and is specific to each table and even to each StreamSets batch. The theoretical maximum number of write requests per batch is determined by the number of buckets for the given table and the number of partitions that the batch writes to. If the table has 5 buckets and the batch writes to two partitions, the theoretical maximum is 5*2 = 10 write requests.
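The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula; the function and parameter names are hypothetical and are not part of Data Collector or the Kudu client.

```python
def max_write_requests(buckets: int, partitions_touched: int) -> int:
    """Theoretical upper bound on write requests for a single batch.

    buckets: number of buckets of the target table.
    partitions_touched: number of partitions the batch writes to.
    Both names are illustrative, not Data Collector settings.
    """
    if buckets < 1 or partitions_touched < 1:
        raise ValueError("both counts must be at least 1")
    return buckets * partitions_touched


# The example from the text: 5 buckets, batch writes to 2 partitions.
print(max_write_requests(5, 2))  # prints 10
```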

 

Configuring the worker threads to this theoretical maximum is usually unnecessary, because processing a write acknowledgment is a relatively fast operation. We suggest starting with a conservative number such as 2 and increasing it only if performance is not sufficient.

If one Kudu stage accesses multiple tables, use the largest theoretical maximum number of write requests among those tables as the upper bound.
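As a rough sketch of that guidance, the snippet below takes the largest per-table bound for a stage that writes to several tables, and keeps the suggested starting value of 2 separate from that ceiling. The table names and bucket/partition counts are made up for illustration, and the helper is hypothetical rather than anything Data Collector provides.

```python
from typing import Dict, Tuple


def worker_thread_ceiling(tables: Dict[str, Tuple[int, int]]) -> int:
    """Largest per-batch write-request bound across all tables of one stage.

    tables maps a table name to (buckets, partitions written per batch).
    """
    if not tables:
        raise ValueError("at least one table is required")
    return max(buckets * partitions for buckets, partitions in tables.values())


# Hypothetical tables accessed by a single Kudu stage.
tables = {
    "orders": (5, 2),     # bound: 10
    "customers": (3, 1),  # bound: 3
    "events": (4, 4),     # bound: 16
}

ceiling = worker_thread_ceiling(tables)
# Start conservatively, but never above the ceiling itself.
starting_value = min(2, ceiling)
print(ceiling, starting_value)  # prints 16 2
```

The ceiling is only the point beyond which extra threads cannot help; the value to configure first is still the conservative one.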

 

The "Maximum Number of Worker Threads" parameter is not available in Data Collector releases before 3.3.1. Those releases always use the default value, which is 2 times the number of available processors. If one SDC instance runs many pipelines with Kudu stages on a machine with many processors, a massive number of worker threads may be created. We therefore strongly encourage users to upgrade to 3.3.1 or a later version.
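To get a feel for the scale, here is a back-of-the-envelope estimate in Python. It assumes that every Kudu stage creates its own pool of worker threads, which is an assumption made for this sketch; the processor and stage counts are invented, and the function names are hypothetical.

```python
def default_worker_threads(processors: int, kudu_stages: int) -> int:
    """Threads created when each stage uses the default of 2 x processors.

    Assumes one independent worker pool per Kudu stage (sketch assumption).
    """
    return 2 * processors * kudu_stages


def capped_worker_threads(max_worker_threads: int, kudu_stages: int) -> int:
    """Threads created when each stage is capped by the new parameter."""
    return max_worker_threads * kudu_stages


# Hypothetical deployment: 32 processors, 20 Kudu stages across pipelines.
print(default_worker_threads(32, 20))  # prints 1280
print(capped_worker_threads(2, 20))    # prints 40
```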

