Solved

Running 200+ Pipelines Concurrently

2 years ago
October 19, 2022
6 replies
162 views

ianjoshua
Fan
3 replies

Hi, I am using the JDBC Multitable Consumer origin and I want to run 200+ pipelines in parallel while keeping their processing time as low as possible. I have been tuning the sdc.properties file but still cannot find the proper values. I have configured max.stage.private.classloaders, runner.thread.pool.size, and pipeline.max.runners.count, but I still cannot reduce the run time. When the pipelines run in parallel, each one seems to take longer than it does when run on its own. Also, some pipelines do not start immediately and stay in the STARTING state. Which configuration settings should I consider in order to run these 200+ pipelines concurrently with the lowest possible run time?

Best answer by saleempothiwala

Hi @ianjoshua ,

Please have a look at this video:

https://www.youtube.com/watch?v=cYilVwoIJ4E

Your Data Collector has a fixed amount of memory available. For every pipeline you run, you spend roughly:

memory = record size x batch size x destinations + other overheads

So only a certain number of pipelines can run in parallel. Any others have to wait for resources to free up before they can run. There is no magic property that will let you run 200 pipelines in one go. The STARTING status you see is the pipelines waiting for resources. Having more runners waiting will also consume more resources.
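As a rough illustration of the formula above, here is a minimal Python sketch that estimates per-pipeline memory and how many pipelines would fit in a given heap. The function names and every number in the example are hypothetical placeholders, not measured values; in particular, the thread does not quantify "other overheads", so the overhead figure is purely an assumption.

```python
def pipeline_memory_bytes(record_size, batch_size, destinations, overhead=0):
    """Estimate memory for one pipeline using the rule of thumb from the answer:
    record size x batch size x destinations + other overheads."""
    return record_size * batch_size * destinations + overhead


def max_parallel_pipelines(heap_bytes, per_pipeline_bytes):
    """Number of pipelines that fit in the heap at the estimated per-pipeline cost."""
    if per_pipeline_bytes <= 0:
        raise ValueError("per_pipeline_bytes must be positive")
    return heap_bytes // per_pipeline_bytes


# Hypothetical inputs: 1,000-byte records, batches of 1,000 records,
# 2 destinations, and an assumed 400 MB of fixed overhead per pipeline.
per_pipeline = pipeline_memory_bytes(1_000, 1_000, 2, overhead=400_000_000)

# Hypothetical 8 GB heap for the Data Collector JVM.
fits = max_parallel_pipelines(8_000_000_000, per_pipeline)

print(per_pipeline)  # 402000000
print(fits)          # 19
```

With real workloads, the inputs would have to come from measuring an actual pipeline, since record sizes and overheads vary widely.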

The best approach would be to add more Data Collectors with the same configuration and labels, create jobs from the pipelines, and give the jobs the same labels as the Data Collectors. Control Hub (SCH) will then distribute the load accordingly.

The number of Data Collectors (SDCs) you need again depends on the calculation above. Assuming you can run 20 pipelines in parallel on one SDC, you will need 10 SDCs to run all 200 in parallel.
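The sizing arithmetic in the last paragraph can be sketched as a ceiling division. The helper name below is hypothetical, and the figure of 20 pipelines per SDC is the answer's own assumption rather than a measured capacity.

```python
def collectors_needed(total_pipelines, pipelines_per_sdc):
    """Smallest number of Data Collectors so that every pipeline has a slot."""
    if pipelines_per_sdc <= 0:
        raise ValueError("pipelines_per_sdc must be positive")
    # Ceiling division using integers only.
    return -(-total_pipelines // pipelines_per_sdc)


print(collectors_needed(200, 20))  # 10
print(collectors_needed(201, 20))  # 11 (one extra pipeline needs one more SDC)
```

Rounding up matters here: any remainder means one more Data Collector, otherwise the leftover pipelines would sit waiting in STARTING.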

saleempothiwala
Headliner
258 replies
2 years ago
October 19, 2022

@ianjoshua

I would be able to help you better if you let us know how many Data Collectors you are using, how your engines are configured, and so on.

Also, are you ingesting 200 tables from the same database? There could be better ways to build a pipeline for that.

Some help on config here:

https://docs.streamsets.com/portal/platform-controlhub/controlhub/UserGuide/Engines/ResourceThresh.html

ianjoshua
Author
Fan
3 replies
2 years ago
October 20, 2022

@saleempothiwala

Hello, thank you for the reply. I am using only one Data Collector. The server is an AWS EC2 r4.4xlarge instance.

I am using 200 databases. 1 database = 1 pipeline, and I am ingesting 8 tables from each database. That is why I used JDBC Multitable Consumer as my origin.

I am struggling to work out which configuration settings I should change to allocate the right resources to each pipeline and avoid longer run times when running 200+ pipelines concurrently.

ianjoshua
Author
Fan
3 replies
2 years ago
October 20, 2022

@saleempothiwala

Hello, just an update. It seems that I have found the step where the 200+ parallel pipelines are spending most of their time. Each pipeline has a Start Event - Shell and a Stop Event - Shell configured. The purpose of the shell scripts is just to move files to different folders (Amazon S3). The Stop Event - Shell is where the time is going, since it contains the script that moves the files. Is there a configuration in StreamSets to speed up this step when running 200+ pipelines in parallel?

saleempothiwala
Headliner
258 replies
Answer
2 years ago
October 20, 2022

Hi @ianjoshua ,

Please have a look at this video:

https://www.youtube.com/watch?v=cYilVwoIJ4E

Your Data Collector has a fixed amount of memory available. For every pipeline you run, you spend roughly:

memory = record size x batch size x destinations + other overheads

The number of Data Collectors (SDCs) you need again depends on the calculation above. Assuming you can run 20 pipelines in parallel on one SDC, you will need 10 SDCs to run all 200 in parallel.

ianjoshua
Author
Fan
3 replies
2 years ago
October 20, 2022

Hello saleempothiwala,

Thank you for the response. I will add more Data Collectors as my next step in order to run these 200 pipelines in parallel. Very much appreciated.

saleempothiwala
Headliner
258 replies
2 years ago
October 20, 2022

@ianjoshua :-)

Feel free to contact me if you need any help or support.
