
We are ingesting data from an Oracle source using CDC pipelines (StreamSets), so the pipeline ingests any INSERT/UPDATE/DELETE happening at the source, and it runs 24x7. I wanted to know how an audit can be performed on this, and what the logic for the audit checks could be.

We are using Databricks for the audit checks.

@harshith, that makes a lot of sense: you want to audit the loads and have some level of reconciliation against the source systems. You can get the record count from the UI, but that's only helpful for an interactive user; for automation and reporting purposes, you can use the REST API.

Another easy option is to use the Data Collector "Control Hub API" stage.
Note that you'll need to pass the API credentials in the request headers.
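A minimal sketch of building those headers, assuming Control Hub API credentials (credential ID + auth token); the header names below are the standard Control Hub auth headers, and the URL and credential values are placeholders:

```python
# Sketch: constructing the auth headers for a StreamSets Control Hub REST call.
# The credential ID and token are placeholders -- use API credentials
# generated in Control Hub.

def controlhub_headers(cred_id: str, token: str) -> dict:
    """Return the headers expected by the Control Hub REST API."""
    return {
        "Content-Type": "application/json",
        "X-Requested-By": "audit-script",   # any non-empty value
        "X-SS-REST-CALL": "true",
        "X-SS-App-Component-Id": cred_id,   # API credential ID
        "X-SS-App-Auth-Token": token,       # API credential auth token
    }

# Usage (not executed here -- needs network access and real credentials):
# import requests
# resp = requests.get(
#     "https://<controlhub-host>/jobrunner/rest/v1/metrics/job/<jobId>",
#     headers=controlhub_headers("<credId>", "<token>"),
# )
```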

Hi @harshith, the pipeline itself keeps collecting metrics about its performance; the histograms in the pipeline UI give you a graphical view of that performance (e.g. record throughput).
Note that the detailed information can be extracted from the Control Hub repository using REST APIs; you can use the following endpoint:

/jobrunner/rest/v1/metrics/job/{jobId}

 

Given a specific jobId, it returns a JSON document with all the metrics related to that execution.
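As a sketch of how that JSON could feed an audit check, the helper below pulls record counts out of a metrics response. The counter names (`pipeline.batchInputRecords.counter`, etc.) follow the usual Data Collector metrics layout, but treat them as assumptions and verify them against a real response from `/jobrunner/rest/v1/metrics/job/{jobId}`:

```python
# Sketch: extracting record counts from a job metrics response.
# Counter names are assumptions based on typical Data Collector metrics
# payloads -- confirm against an actual response before relying on them.

def record_counts(metrics: dict) -> dict:
    """Return input/output/error record counts from a metrics JSON dict."""
    counters = metrics.get("counters", {})

    def count(name: str) -> int:
        return counters.get(name, {}).get("count", 0)

    return {
        "input": count("pipeline.batchInputRecords.counter"),
        "output": count("pipeline.batchOutputRecords.counter"),
        "error": count("pipeline.batchErrorRecords.counter"),
    }

# Example with a stubbed (illustrative) response:
sample = {
    "counters": {
        "pipeline.batchInputRecords.counter": {"count": 1000},
        "pipeline.batchOutputRecords.counter": {"count": 998},
        "pipeline.batchErrorRecords.counter": {"count": 2},
    }
}
print(record_counts(sample))  # {'input': 1000, 'output': 998, 'error': 2}
```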

 


Giuseppe Mura thanks for the reply. Since this will be running in production on a daily basis, we want to capture the record count and match it with the source every day, storing the result in a Delta table.
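A hedged sketch of that daily reconciliation logic: the function below builds one audit row from a source count and a target count. Table and column names are illustrative; on Databricks, the counts would come from a query against the source and the ingested Delta table, and the row would be appended to a Delta audit table.

```python
# Sketch: building one daily audit record comparing source vs target counts.
# All names here are illustrative. On Databricks the resulting row could be
# appended to a Delta audit table, e.g.:
#   spark.createDataFrame([row]).write.mode("append").saveAsTable("audit.cdc_recon")

from datetime import date


def audit_row(table: str, source_count: int, target_count: int,
              run_date=None) -> dict:
    """Compare source vs target counts for one table and return an audit row."""
    run_date = run_date or date.today()
    diff = source_count - target_count
    return {
        "run_date": run_date.isoformat(),
        "table": table,
        "source_count": source_count,
        "target_count": target_count,
        "diff": diff,
        "status": "MATCH" if diff == 0 else "MISMATCH",
    }


print(audit_row("ORDERS", 1000, 998, date(2023, 1, 1)))
```

A mismatch row (`status == "MISMATCH"`) can then drive an alert, while the full history in the audit table gives you the daily reconciliation trail.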


@harshith, were @Giuseppe Mura's suggestions helpful? If so, please mark one as "Best Answer".

