Question

audit logic for CDC pipelines

  • 25 February 2022
  • 4 replies
  • 104 views

We are ingesting data from an Oracle source using CDC pipelines (StreamSets), so the pipeline picks up any INSERT/UPDATE/DELETE operations happening at the source. The pipelines run 24x7. I wanted to know how an audit can be performed on this, and what the logic for the audit checks could be.

We are using Databricks for the audit checks.


4 replies


@harshith, that makes a lot of sense: you want to audit the loads and have some level of reconciliation against the source systems. You can get the record count from the UI, but that only helps if you’re working interactively; for automation and reporting, you can use the REST API.

Another easy option is to use the Data Collector “Control Hub API” stage, as below:


 

 

Note that you’ll need to pass the API credentials in the header, as follows:
 

 


Hi @harshith, the pipeline itself keeps collecting metrics about its performance; the histograms in the pipeline give you a graphical view of that performance (e.g. record throughput).
 

 

Note that the detailed information can be extracted from the Control Hub repository using the REST API; you can use the following endpoint:

/jobrunner/rest/v1/metrics/job/{jobId}

 

Given a specific jobId, it returns a JSON document with all the metrics related to that execution.
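
As a rough illustration of calling that endpoint from a script, here is a minimal Python sketch using only the standard library. The base URL and the contents of `auth_headers` are assumptions: the thread does not spell out the credential header names, so they are left as a parameter you fill in from your own Control Hub setup. The function names are hypothetical, and the shape of the returned JSON is not assumed here.

```python
import json
import urllib.request
from urllib.parse import quote

# Endpoint path quoted in the post above.
METRICS_PATH = "/jobrunner/rest/v1/metrics/job/{job_id}"


def build_metrics_url(base_url: str, job_id: str) -> str:
    """Join the Control Hub base URL (an assumption) with the metrics path."""
    # Keep ':' unescaped in case the job ID contains it; escape everything else.
    return base_url.rstrip("/") + METRICS_PATH.format(job_id=quote(job_id, safe=":"))


def fetch_job_metrics(base_url: str, job_id: str, auth_headers: dict,
                      timeout: float = 30.0) -> dict:
    """GET the metrics document for one job and parse it as JSON.

    auth_headers is a placeholder: supply whatever credential headers
    your Control Hub instance requires.
    """
    request = urllib.request.Request(
        build_metrics_url(base_url, job_id),
        headers=auth_headers,
        method="GET",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

You would call `fetch_job_metrics(...)` on a schedule and pick the record counts out of the returned document once you have inspected its structure for your own jobs.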

 

@Giuseppe Mura, thanks for the reply. Since this will run in production on a daily basis, we want to capture the record count every day, match it against the source, and store the result in a Delta table.
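
One possible shape for that daily check, as a minimal Python sketch rather than a definitive design: compare a source count with a target count and build one audit row per table per day. The `AuditRow` fields, the `reconcile` function and the `tolerance` parameter are all hypothetical names introduced for illustration; how the two counts are obtained (for example a source query and the pipeline metrics) is left open.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class AuditRow:
    """One audit record per table per day (hypothetical schema)."""
    audit_date: str
    table_name: str
    source_count: int
    target_count: int
    difference: int
    status: str


def reconcile(table_name: str, source_count: int, target_count: int,
              audit_date: date, tolerance: int = 0) -> AuditRow:
    """Compare the two counts and flag the row as MATCH or MISMATCH."""
    difference = source_count - target_count
    # A non-zero tolerance allows for records still in flight on a 24x7 feed.
    status = "MATCH" if abs(difference) <= tolerance else "MISMATCH"
    return AuditRow(audit_date.isoformat(), table_name,
                    source_count, target_count, difference, status)


# In a Databricks notebook the row could then be appended to a Delta table,
# for example (table name is a placeholder):
#   spark.createDataFrame([asdict(row)]).write.format("delta") \
#        .mode("append").saveAsTable("audit.cdc_reconciliation")
```

For example, `reconcile("orders", 100, 98, date(2022, 2, 25))` yields a row with a difference of 2 and status `MISMATCH`.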


@harshith, were @Giuseppe Mura’s suggestions helpful? If so, please mark “Best Answer”.
