Hi Joe. I am guessing that you are just loading from one large table and that is why you say the number of threads won't have an effect? If that is the case, have you tried multithreaded partition processing? Increasing the batch size will also improve performance. The data is streamed through memory, so please ensure you have enough memory available to fit the batch size. You can increase Max Batch Size (records) and Fetch Size. https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Origins/MultiTableJDBCConsumer.html#task_kst_m4w_4y
hi @JoeMeszaros,
actually there are a couple of things you can try:
1) You could try table-level multi-threading (https://docs.streamsets.com/portal/platform-datacollector/latest/datacollector/UserGuide/Origins/MultiTableJDBCConsumer.html#concept_gvy_yws_p1b) - I have had mixed results with this: if your offset column is a numeric, uniformly distributed key, it can work well, but when I tried it on tables with severe skew in the data, the results weren’t great
2) You can define multiple configurations for the same table (still in the JDBC Multi-Table origin) and add different additional offset conditions to each configuration to partition your data manually (and of course you can parameterize this)
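To illustrate option 2), here is a small sketch of how you might generate non-overlapping offset conditions for a numeric key. The column name `ID`, the key range, and the condition format are illustrative assumptions - StreamSets does not generate these for you; you would paste one condition into each table configuration:

```python
def partition_conditions(min_id, max_id, parts):
    """Split [min_id, max_id] into `parts` contiguous half-open ranges,
    one SQL condition per table configuration."""
    # Ceiling division so the last range picks up any remainder.
    step = (max_id - min_id + 1 + parts - 1) // parts
    conds = []
    for i in range(parts):
        lo = min_id + i * step
        hi = min(lo + step, max_id + 1)
        conds.append(f"ID >= {lo} AND ID < {hi}")
    return conds

for cond in partition_conditions(1, 10_000_000, 4):
    print(cond)
```

Each condition covers a disjoint slice of the key space, so the configurations can be read in parallel without overlap.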
In both cases, configure the Number of Threads on the “JDBC” tab, and also set the per-batch strategy to “Process all available rows per table” (on the “JDBC” tab too).
Another alternative is a single pipeline, again using the JDBC Multi-Table origin, or even the JDBC Query origin (I’ve had very good results with this recently), with a Job Template on top. If you configure your additional offset conditions (or the WHERE-clause conditions in the JDBC Query origin) to use pipeline parameters, you can then start instances of the Job Template, passing different parameter values that partition your parallel read.
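As a sketch of the Job Template approach with the JDBC Query origin in incremental mode, a parameterized query might look something like this - the table and column names are hypothetical, `${PART_MIN}`/`${PART_MAX}` would be pipeline parameters supplied per job instance, and `${OFFSET}` is the origin's offset placeholder:

```sql
SELECT * FROM orders
WHERE order_id > ${OFFSET}
  AND order_id >= ${PART_MIN}
  AND order_id <  ${PART_MAX}
ORDER BY order_id
```

Each job instance then reads only its own slice of the key range, and the instances run in parallel.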
A couple more things: don’t configure the pipeline for enormous batches - about 5000 records is about right most of the time (even smaller if the records are large) - and instead increase the JDBC fetch size.
In the JDBC Query origin you can increase the fetch size by adding the additional property defaultRowPrefetch - set it to a large enough number (e.g. 20000).
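For what it's worth, defaultRowPrefetch is an Oracle JDBC driver property; with other drivers the property name will differ. It goes in as a plain name/value pair under the origin's additional JDBC configuration properties, e.g.:

```
defaultRowPrefetch=20000
```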
In JDBC Multi-table origin, you can set “Fetch Size” directly on the JDBC tab.
@JoeMeszaros
Number of threads can improve the speed of extraction if the table is partitioned. Then multiple threads can read from each partition in parallel.
Using a Transformer pipeline is another option. Here you can provide a field (such as ID) and specify it as the offset column. Specifying Number of rows per batch and Number of Partitions will create separate connections to the database and query based on partitions.
If you have 10 million records, specifying 1 million records per batch and 10 partitions will use 10 queries.
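The arithmetic above can be sketched as follows - the half-open row ranges are illustrative (Transformer computes its own partition splits from the offset column); the point is that each range maps to one parallel query:

```python
import math

def partition_ranges(total_rows, rows_per_batch):
    """Half-open row ranges, one query per range."""
    parts = math.ceil(total_rows / rows_per_batch)
    return [(i * rows_per_batch, min((i + 1) * rows_per_batch, total_rows))
            for i in range(parts)]

# From the example: 10 million rows, 1 million rows per batch.
ranges = partition_ranges(10_000_000, 1_000_000)
print(len(ranges))  # 10 separate queries, read in parallel
```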