
Hi, I am using the JDBC Multitable Consumer origin and I want to run 200+ pipelines in parallel while keeping their processing time as low as possible. I have been tuning the sdc.properties file but I still cannot find the right values. I have configured max.stage.private.classloaders, runner.thread.pool.size, and pipeline.max.runners.count, but I still cannot minimize the run time. Whenever the pipelines run in parallel, each one's individual run time is longer than usual. Also, some pipelines sit in the STARTING state and do not begin running immediately. What configurations should I consider so I can run these 200+ pipelines concurrently while keeping run times low?
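For context, the entries I have been adjusting look like this in sdc.properties (the values shown here are examples only, not tuned recommendations):

```
# sdc.properties -- example values only
max.stage.private.classloaders=200
runner.thread.pool.size=50
pipeline.max.runners.count=50
```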

@ianjoshua 

I would be able to help you better if you let us know how many Data Collectors you are using, the configuration of your engines, etc.

 

Also, are you ingesting from 200 tables in the same database? There may be better ways to design a pipeline for that.

Some help on config here:

https://docs.streamsets.com/portal/platform-controlhub/controlhub/UserGuide/Engines/ResourceThresh.html


@saleempothiwala

 

Hello, thank you for the reply. I am using only one Data Collector. The server is an AWS EC2 r4.4xlarge instance.

 

I am using 200 databases: 1 database = 1 pipeline, and I am ingesting 8 tables from each database. That is why I used the JDBC Multitable Consumer as my origin.

 

I am struggling with which configurations to adjust so that resources are distributed properly to each pipeline, to prevent longer run times when running the 200+ pipelines concurrently.


@saleempothiwala 

 

Hello, just an update. I have found where the 200+ pipelines running in parallel are spending most of their time. Each pipeline has a Start Event - Shell and a Stop Event - Shell configured. The shell scripts just move files to different folders (Amazon S3). The Stop Event - Shell is where most of the time goes, since it runs the script that moves the files. Is there a configuration in StreamSets to speed up this step when running 200+ pipelines in parallel?
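For reference, one thing that can help independently of StreamSets settings is parallelizing the S3 transfer inside the Stop Event script itself. A sketch (the paths, bucket name, and concurrency level below are hypothetical placeholders; it assumes the AWS CLI is installed on the Data Collector host):

```shell
#!/bin/sh
# Sketch of a Stop Event - Shell script that moves staged files to S3.
# All names below are placeholders -- substitute your own paths and bucket.

move_s3_files() {
  src="$1"     # e.g. /data/staging/pipeline_042
  dest="$2"    # e.g. s3://my-bucket/processed/pipeline_042/
  # Fan out up to 8 concurrent transfers instead of moving files one by one.
  # (xargs -P is supported by GNU and BSD xargs, not strict POSIX.)
  ls "$src" | xargs -P 8 -I {} aws s3 mv "$src/{}" "$dest{}"
}

# Example call (commented out in this sketch):
# move_s3_files /data/staging/pipeline_042 s3://my-bucket/processed/pipeline_042/
```

A single `aws s3 mv --recursive` is also internally parallel, and its concurrency can be raised with `aws configure set default.s3.max_concurrent_requests 20`; but with 200+ pipelines each spawning its own script, the host's CPU and network bandwidth remain the ceiling.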


Hi @ianjoshua ,

 

Please have a look at this video: 

 

Your Data Collector has a fixed amount of memory available. For every pipeline you run, you spend roughly:

memory ≈ record size × batch size × number of destinations + other overhead

So only a certain number of pipelines can run in parallel; any others must wait for resources to free up before they can run. There is no magic property that will let you run 200 pipelines in one go. The STARTING status you see is actually the pipelines waiting for resources, and more runners waiting consumes more resources.

The best approach would be to add more Data Collectors with the same configuration and labels, create jobs from the pipelines, and assign the same tags to the Data Collectors; Control Hub (SCH) will then distribute the load across them.

The number of SDCs again depends on the calculation above. Assuming you can run 20 pipelines in parallel on 1 SDC, you will need 10 SDCs to run all 200 in parallel.
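That sizing calculation can be sketched as a back-of-the-envelope script. Every number below is a hypothetical placeholder; measure your actual record sizes, batch sizes, and JVM heap before relying on the result:

```shell
#!/bin/sh
# Rough SDC capacity estimate -- all inputs are example values, not recommendations.
RECORD_BYTES=4096    # average record size
BATCH_SIZE=10000     # max batch size per pipeline
DESTINATIONS=2       # destinations per pipeline
OVERHEAD_MB=100      # per-pipeline overhead (runners, buffers, classloaders)
HEAP_MB=4096         # JVM heap available to one SDC
TOTAL_PIPELINES=200

# memory per pipeline = record size x batch size x destinations + overhead
PER_PIPELINE_MB=$(( RECORD_BYTES * BATCH_SIZE * DESTINATIONS / 1048576 + OVERHEAD_MB ))
MAX_PARALLEL=$(( HEAP_MB / PER_PIPELINE_MB ))
# number of SDCs needed, rounded up
SDC_COUNT=$(( (TOTAL_PIPELINES + MAX_PARALLEL - 1) / MAX_PARALLEL ))

echo "per pipeline: ${PER_PIPELINE_MB} MB; max parallel per SDC: ${MAX_PARALLEL}; SDCs needed: ${SDC_COUNT}"
# prints: per pipeline: 178 MB; max parallel per SDC: 23; SDCs needed: 9
```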


Hello saleempothiwala,

 

Thank you for the response. I will consider adding more Data Collectors as my next step in order to run these 200 pipelines in parallel. Very much appreciated.



@ianjoshua :-)

Feel free to reach out if you need any help or support.

