Product: StreamSets Data Collector

Scenario:

When running Oracle CDC against a large number of tables (for example, close to 100), here are some best practices to consider.

Goal:

Improve the throughput of the Oracle CDC pipeline and decrease record lag when dealing with a relatively large volume of changes (close to 2 million DML records) across close to 100 tables.

Solution:

  1. Schedule your Oracle CDC pipelines to run on dedicated SDC nodes that are not already running many concurrent pipelines processing large batch sizes.
  2. Monitor heap usage and ensure there is headroom of at least 40% at all times while CDC pipelines are running. For example, if a CDC pipeline runs for 8 hours, check heap usage every hour to confirm a minimum of 40% headroom.
  3. In one customer scenario, about 2 million changes arrived over a 24-hour span, and a single table contributed roughly 500,000 of those records.
  4. Moving that large table to a separate pipeline improved overall record throughput and reduced the lag in the pipelines.
  5. Breaking the workload into multiple pipelines, each working on a subset of the tables, improves throughput considerably.
  6. It is very important to understand and optimize the Fetch Size parameter, as tuning it can reduce the overall lag.
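
The 40% headroom rule in step 2 is simple arithmetic over the JVM's used and max heap values, which you could read from a tool such as `jstat -gc` or from the JVM's JMX memory metrics. Here is a minimal sketch of that check; the helper names and the 8 GiB example figures are illustrative, not taken from SDC itself:

```python
def heap_headroom_pct(used_bytes: int, max_bytes: int) -> float:
    """Percentage of the JVM max heap that is still free."""
    return 100.0 * (max_bytes - used_bytes) / max_bytes

def check_headroom(used_bytes: int, max_bytes: int,
                   min_headroom_pct: float = 40.0) -> bool:
    """True when the heap still has the required headroom (default 40%)."""
    return heap_headroom_pct(used_bytes, max_bytes) >= min_headroom_pct

GiB = 1024 ** 3
print(check_headroom(int(4.5 * GiB), 8 * GiB))  # 43.75% headroom -> True
print(check_headroom(int(5.5 * GiB), 8 * GiB))  # 31.25% headroom -> False
```

Running this check hourly over an 8-hour pipeline run, as suggested above, would flag the second case as a breach of the 40% target.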
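
Steps 3 through 5 amount to balancing the daily change volume across pipelines so that no single pipeline carries a hot table plus everything else. One way to plan such a split is a greedy assignment of the largest tables first, each to the least-loaded pipeline. This is only a planning sketch under the assumption that you know each table's approximate daily DML volume; the table names and counts below are hypothetical:

```python
import heapq

def split_tables(table_sizes: dict[str, int], num_pipelines: int) -> list[list[str]]:
    """Greedily assign tables to pipelines so daily change volume is balanced.

    Largest tables are placed first, each onto the currently least-loaded
    pipeline (tracked with a min-heap of (load, pipeline_index) pairs).
    """
    heap = [(0, i) for i in range(num_pipelines)]
    heapq.heapify(heap)
    groups: list[list[str]] = [[] for _ in range(num_pipelines)]
    for table, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        load, idx = heapq.heappop(heap)
        groups[idx].append(table)
        heapq.heappush(heap, (load + size, idx))
    return groups

# Hypothetical workload: one hot table and three quiet ones, two pipelines.
sizes = {"ORDERS": 500_000, "CUSTOMERS": 100, "ITEMS": 100, "SHIPMENTS": 100}
print(split_tables(sizes, 2))
# -> [['ORDERS'], ['CUSTOMERS', 'ITEMS', 'SHIPMENTS']]
```

As in the customer scenario above, the 500,000-record table naturally ends up isolated in its own pipeline while the remaining tables share another.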

Here is some background on Fetch Size.

JDBC Fetch Size: The JDBC Fetch Size is the primary tunable that controls how many records the origin waits for before returning results from the query.

Recommendations: In low-transaction-rate environments, we highly recommend keeping this value low, or at 1 (the default). In high-transaction environments, the system will benefit from a higher value, both in network throughput and in overall performance.

Setting a high value in a low-transaction environment will give the impression that the pipeline is hung. As an example, if you set the JDBC Fetch Size to 1000 and only 100 changes arrive, the origin will not return results until 1000 records are available. This is highly undesirable both from a change-replication perspective and from an end-user experience standpoint.
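
The worst-case wait in that example is just the fetch size divided by the change rate. A small back-of-the-envelope helper makes the trade-off concrete; the change rate of 100 changes per hour is a hypothetical figure chosen to match the example above:

```python
def seconds_until_batch(fetch_size: int, changes_per_second: float) -> float:
    """Worst-case wait before the origin has fetch_size rows to return."""
    return fetch_size / changes_per_second

# Low-transaction environment: roughly 100 changes per hour.
rate = 100 / 3600.0
hours = seconds_until_batch(1000, rate) / 3600.0
print(round(hours, 1))  # -> 10.0 (hours before a fetch size of 1000 fills)

# Same environment with the default fetch size of 1.
print(seconds_until_batch(1, rate))  # -> 36.0 (seconds, on average)
```

At 100 changes per hour, a fetch size of 1000 would keep the pipeline silent for about 10 hours, which is why the default of 1 is recommended when the transaction rate is low.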