
Hello, 

I was wondering if there is a way to increase the throughput of the JDBC Multitable Consumer origin. I know that the multitable origin runs only one table per thread, so increasing the number of threads doesn't do anything. Is there another way to increase the throughput without increasing the number of threads?

Hi Joe. I am guessing that you are loading from just one large table, and that is why you say the number of threads won't have an effect? If that is the case, have you tried multithreaded partition processing? Increasing the batch size will also improve performance. The data is streamed through memory, so please ensure you have enough memory available to fit the batch size. You can increase Max Batch Size (records) and Fetch Size. https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Origins/MultiTableJDBCConsumer.html#task_kst_m4w_4y


Hi @JoeMeszaros,

Actually, there are a couple of things you can try:
1) You could try table-level multi-threading (https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Origins/MultiTableJDBCConsumer.html#concept_gvy_yws_p1b). I have had mixed results with this: if your offset column is a numeric, uniformly distributed key, it can work well, but when I tried it on tables with severe skew in the data, the results weren't great.
2) You can define multiple configurations for the same table (still in the JDBC multi-table origin) and add a different additional offset condition to each configuration to manually (well, of course you can parameterize this) partition your data.

In both cases, configure the Number of Threads on the “JDBC” tab, and also set the per-batch strategy to “Process all available rows per table” (also on the “JDBC” tab).
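For option 2, the per-configuration offset conditions can be generated rather than typed by hand. A minimal sketch, assuming a made-up ORDER_ID offset column, known bounds, and a four-way split (none of these names come from the thread):

```python
# Sketch of option 2: manually partitioning one table by building one
# "additional offset condition" per table configuration. The column
# name (ORDER_ID) and bounds are assumptions for illustration.

def partition_conditions(offset_col, min_id, max_id, partitions):
    """Return one SQL condition per table configuration / thread."""
    step = (max_id - min_id + 1) // partitions
    conditions = []
    for i in range(partitions):
        lo = min_id + i * step
        # Last partition absorbs any remainder so all rows are covered.
        hi = max_id if i == partitions - 1 else lo + step - 1
        conditions.append(f"{offset_col} >= {lo} AND {offset_col} <= {hi}")
    return conditions

# Four table configurations covering ORDER_IDs 1..1,000,000:
for cond in partition_conditions("ORDER_ID", 1, 1_000_000, 4):
    print(cond)
```

Each generated condition goes into the Additional Offset Column Conditions field of one table configuration, so the threads read disjoint key ranges in parallel.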

Another alternative is a single pipeline, again using the JDBC Multi-Table origin, or even the JDBC Query origin (I've had very good results with this recently), with a Job Template created on top. If you configure your additional offset conditions (or WHERE clause conditions in the JDBC Query origin) to use pipeline parameters, you can then start instances of the Job Template passing different parameter values that partition your parallel read.
 

A couple more things: don't configure the pipeline for enormous batches. About 5,000 records is usually right (even smaller if the records are large); increase the JDBC fetch size instead.
In the JDBC Query origin you can increase the fetch size by adding the additional JDBC property defaultRowPrefetch and setting it to a large enough number (e.g. 20000).
In JDBC Multi-table origin, you can set “Fetch Size” directly on the JDBC tab.
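For intuition on what the fetch size buys you (rows pulled from the database per driver round trip, independent of the pipeline batch size), here is a rough Python DB-API analog using an in-memory SQLite table. The table and sizes are made up, and SQLite has no network round trips, so this only illustrates the mechanism, not the performance effect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO events (id) VALUES (?)",
                [(i,) for i in range(1, 20001)])

# arraysize is the DB-API analog of the JDBC fetch size: how many rows
# fetchmany() pulls per call (and, for network drivers, per round trip).
cur.execute("SELECT id FROM events ORDER BY id")
cur.arraysize = 5000          # like Fetch Size / defaultRowPrefetch

total = 0
while True:
    rows = cur.fetchmany()    # fetches cur.arraysize rows at a time
    if not rows:
        break
    total += len(rows)

print(total)  # 20000 rows streamed in 4 fetches of 5000
```

The pipeline batch size plays the same role as `fetchmany()` here: the origin hands records downstream in batches, while the fetch size governs how efficiently the driver fills those batches.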

 


@JoeMeszaros 

The number of threads can improve extraction speed if the table is partitioned; multiple threads can then read from the partitions in parallel.

 

Using a Transformer pipeline is another option. There you can provide a field (such as ID) and specify it as the offset column. Specifying Number of Rows per Batch and Number of Partitions will create separate connections to the database that query based on the partitions.

If you have 10 million records, specifying 1 million records per batch and 10 partitions will use 10 queries.
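The arithmetic behind that can be sketched as follows; the 1..10,000,000 ID bounds are an assumption for the example (Transformer derives the real boundaries from the offset column):

```python
# Rough sketch of how a partitioned read splits an offset column into
# parallel range scans, one query per partition.
def partition_bounds(min_id, max_id, num_partitions):
    step = (max_id - min_id + 1) // num_partitions
    return [
        (min_id + i * step,
         max_id if i == num_partitions - 1 else min_id + (i + 1) * step - 1)
        for i in range(num_partitions)
    ]

bounds = partition_bounds(1, 10_000_000, 10)
print(len(bounds))      # 10 queries, one per partition
print(bounds[0])        # (1, 1000000)
print(bounds[-1])       # (9000001, 10000000)
```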

