Question

Unable to ingest data from Azure SQL (CDC) to Azure Databricks using StreamSets.

  • 11 August 2022

I am trying to build a data pipeline with Azure SQL Server DB (CDC) as the source and Azure Databricks (Delta tables) as the destination.

I followed the sample pipeline from:
https://github.com/streamsets/pipeline-library/tree/master/datacollector/sample-pipelines/pipelines/SQLServer%20CDC%20to%20Delta%20Lake

 

I am getting the error below for a few records, in Schema preview as well:

DELTA_LAKE_34 - Databricks Delta Lake load request failed: 'DELTA_LAKE_32 - Could not copy staged file 'sdc-4a076fce-7a73-45ba-8dd7-29e58848cf23.csv': java.sql.SQLException: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
 

Note: On Preview/Draft Run, the pipeline captures changes from the source DB, successfully creates files in the stage (ADLS container), and creates the Delta tables at the destination, but it fails to ingest records into them.

 


4 replies

@gkognole I have seen this kind of error when the file name starts with something like _ or the file is empty. From the looks of it, your file names start with sdc-, so it could be a good idea to check whether any temp files are being created and read.
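
The check suggested above could be sketched roughly as follows. This is only an illustration, not part of StreamSets or Databricks: `find_suspect_staged_files` and the sample listing are hypothetical, and you would need to obtain the real file names and sizes from your ADLS staging container yourself. It assumes, as described above, that files whose names start with `_` (or `.`) and zero-byte files are the ones likely to leave Spark with nothing to infer a CSV schema from.

```python
from typing import Iterable, List, Tuple


def find_suspect_staged_files(files: Iterable[Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Return (path, reason) pairs for staged files worth inspecting.

    `files` is an iterable of (path, size_in_bytes) pairs. How the listing
    is produced is up to you; this helper is hypothetical.
    """
    suspects = []
    for path, size_bytes in files:
        base_name = path.rsplit("/", 1)[-1]
        if base_name.startswith(("_", ".")):
            # Assumption: such names are treated as hidden/temporary files.
            suspects.append((path, "name starts with '_' or '.'"))
        elif size_bytes == 0:
            suspects.append((path, "empty file"))
    return suspects


# Made-up listing for illustration only: (path, size in bytes).
listing = [
    ("stage/sdc-example-1.csv", 2048),
    ("stage/sdc-example-2.csv", 0),
    ("stage/_tmp-example.csv", 512),
]

for path, reason in find_suspect_staged_files(listing):
    print(f"{path}: {reason}")
```

If this reports nothing for the real staging folder, empty or hidden-style files are probably not the cause and other explanations are worth checking.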

@gkognole

Could it be that you are using an unsupported cluster version? (We support 6.x, 7.x, and 8.x only.)

Thank you @saleempothiwala for your reply.

Yes, my staged file names start with sdc-, and no temp files starting with _ are created.

Thank you @alex.sanchez for your reply.

I am using Databricks Runtime version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12).

I will try an 8.x version and see whether it resolves the issue.
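
For anyone comparing their own cluster against the versions mentioned in this thread, here is a minimal sketch of the comparison. The supported set (6, 7, 8) is taken only from the reply above and may be out of date, so treat it as an assumption and check the current StreamSets documentation; the function names are hypothetical.

```python
import re

# Assumption: supported major versions as stated in this thread (6.x, 7.x, 8.x).
SUPPORTED_MAJOR_VERSIONS = {6, 7, 8}


def runtime_major_version(runtime: str) -> int:
    """Extract the major version from a string such as '10.4 LTS'."""
    match = re.match(r"\s*(\d+)\.", runtime)
    if match is None:
        raise ValueError(f"Cannot parse runtime version: {runtime!r}")
    return int(match.group(1))


def is_supported_runtime(runtime: str) -> bool:
    """Return True if the major version is in the assumed supported set."""
    return runtime_major_version(runtime) in SUPPORTED_MAJOR_VERSIONS


print(is_supported_runtime("10.4 LTS"))  # False
print(is_supported_runtime("7.3 LTS"))   # True
```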
