We are ingesting data using CDC pipelines (StreamSets) from an Oracle source, so the pipeline captures any INSERT/UPDATE/DELETE happening at the source. I wanted to know how an audit can be performed on this pipeline, given that it runs 24x7, and what the logic for the audit checks could be.
We are using Databricks for the audit checks.
Hi @harshith, the pipeline itself keeps collecting metrics about its performance; the histograms in the pipeline UI provide a graphical view of that performance (e.g. record throughput).
Note that detailed information can be extracted from the Control Hub repository using REST APIs; you can use the following endpoint:
/jobrunner/rest/v1/metrics/job/{jobId}
Given a specific jobId, it returns a JSON document with all the metrics related to that execution.
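A minimal sketch of calling that endpoint from Python, e.g. from a Databricks notebook. The base URL is a placeholder, the authentication header names are the standard Control Hub API-credential headers, and the key names inside the metrics document (`inputRecords`, `outputRecords`) are an assumption — inspect a real response for your Control Hub version and adjust:

```python
import json
import urllib.request

# Placeholder base URL -- substitute your own Control Hub instance.
SCH_BASE_URL = "https://cloud.streamsets.com"

def fetch_job_metrics(job_id: str, component_id: str, auth_token: str) -> dict:
    """GET the metrics JSON for one job execution from Control Hub."""
    url = f"{SCH_BASE_URL}/jobrunner/rest/v1/metrics/job/{job_id}"
    req = urllib.request.Request(url, headers={
        "X-SS-App-Component-Id": component_id,  # API credential id
        "X-SS-App-Auth-Token": auth_token,      # API credential token
        "X-Requested-By": "SCH",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def extract_record_counts(metrics: dict) -> tuple:
    """Pull input/output record counters out of the metrics document.
    The key names here are an assumption -- check a real response and adjust."""
    return metrics.get("inputRecords", 0), metrics.get("outputRecords", 0)
```

You could schedule this daily and append the counts to your audit table.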
Giuseppe Mura thanks for the reply. Since it will be running in production on a daily basis, we want to capture the record count and match it against the source every day in a Delta table.
@harshith, that makes a lot of sense: you want an audit of loads and some level of reconciliation against the source systems. You can get the record count from the UI, but that is only helpful for an interactive user; for automation purposes and to facilitate reporting, use the REST API.
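The daily reconciliation logic itself can be very simple: compare the count from the Oracle source with the count landed by the pipeline, and append one audit row per table per day to a Delta table. A hedged sketch (the column names and MATCH/MISMATCH statuses are my own illustration, not a prescribed schema):

```python
from datetime import date

def build_audit_row(table: str, source_count: int, target_count: int,
                    run_date: date = None) -> dict:
    """Compare source vs. target counts and build one audit record,
    suitable for appending to a Delta audit table."""
    run_date = run_date or date.today()
    diff = source_count - target_count
    return {
        "audit_date": run_date.isoformat(),
        "table_name": table,
        "source_count": source_count,
        "target_count": target_count,
        "diff": diff,
        "status": "MATCH" if diff == 0 else "MISMATCH",
    }
```

In a Databricks job you would then append the row with something like `spark.createDataFrame([row]).write.format("delta").mode("append").saveAsTable("audit_log")`, and alert on any `MISMATCH` rows.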
Another easy option is to use the Data Collector's "Control Hub API" stage, as below:
Note that you'll need to pass the API credentials in the header, as follows:
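The original post illustrated this with a screenshot; for reference, the credential headers would look like the following (shown here as a Python dict). The header names are the standard Control Hub API-credential headers, and the values are placeholders for a credential you create in Control Hub:

```python
# Header entries for the Control Hub API stage (or any HTTP client).
# Replace the placeholder values with your own API credential.
SCH_API_HEADERS = {
    "X-SS-App-Component-Id": "<component-id>",  # credential id
    "X-SS-App-Auth-Token": "<auth-token>",      # credential token
    "X-Requested-By": "SCH",
    "Content-Type": "application/json",
}
```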
@harshith, were @Giuseppe Mura's suggestions helpful? If so, please mark "Best Answer".