
In this article, we will explore the process of creating a pipeline to transfer data from a MySQL database to Delta Lake tables on Databricks.

Environment:

  • StreamSets Data Collector on the DataOps Platform.
  • Destination: Databricks Delta Lake.

These are the steps we need to follow to develop the pipeline: 

  1. Log in to the Databricks UI » Compute » Create new compute, then fill in the details such as the Spark runtime version, cluster type, and worker type. Once everything is finalised, click Create compute. Creating the compute will take some time. (If you prefer to script this step, see the REST API sketch after this list.)
  2. Next, we have to fetch the following information from the cluster and populate it in the pipeline configuration section (a quick connectivity check using these details is sketched after this list):
    1. Token: In your Databricks workspace, click your Databricks username in the top bar, select User Settings from the dropdown » Develop » Token » Generate a new token.
    2. JDBC String: Click on the newly created Databricks cluster, scroll down and expand the Advanced Options tab » JDBC/ODBC.
    3. For staging purposes, we need AWS S3, Google Cloud Storage, or Azure ADLS. I am going to use Google Cloud Storage in this test.
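For repeatable setups, the same compute can also be created programmatically through the Databricks Clusters REST API instead of the UI. The following is a minimal sketch, assuming placeholder values for the workspace URL, token, cluster name, runtime version, and node type; replace them with your own.

```python
import requests

# Placeholder values -- substitute your workspace URL and personal access token.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "cluster_name": "mysql-to-delta-demo",   # assumed name for this test
    "spark_version": "13.3.x-scala2.12",     # pick a current runtime version
    "node_type_id": "n2-highmem-4",          # worker type; varies by cloud provider
    "num_workers": 1,
}

# Clusters API: create a new all-purpose compute cluster.
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])
```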
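Once the token and the JDBC/ODBC details (server hostname and HTTP path) are in hand, a quick connectivity check helps confirm they are valid before plugging them into the Databricks Delta Lake destination. This is a minimal sketch using the databricks-sql-connector Python package; the hostname, HTTP path, and token are placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholder connection details -- copy them from the cluster's JDBC/ODBC tab.
conn = sql.connect(
    server_hostname="<workspace-host>.gcp.databricks.com",
    http_path="sql/protocolv1/o/<org-id>/<cluster-id>",
    access_token="<personal-access-token>",
)

with conn.cursor() as cursor:
    # A trivial query to confirm the token and JDBC details are valid.
    cursor.execute("SELECT 1 AS connectivity_check")
    print(cursor.fetchone())

conn.close()
```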

After collecting all the necessary details, populate them in the Databricks Delta Lake destination and validate the pipeline before running it.

The pipeline processed around 9 records. 

Here is one way to confirm that the data was written to the destination successfully. In the screenshot below, only the first 6 records are visible in the UI; you can also download the record files containing all of the table's records.

Query output.
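As a further check, a row-count query run through the databricks-sql-connector package should match the record count reported by the pipeline. The table name and connection details below are placeholders used only for this sketch.

```python
from databricks import sql  # pip install databricks-sql-connector

# Placeholder table name and connection details -- adjust to your workspace.
TABLE = "default.mysql_customers"

with sql.connect(
    server_hostname="<workspace-host>.gcp.databricks.com",
    http_path="sql/protocolv1/o/<org-id>/<cluster-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        # The row count should match the record count reported by the pipeline.
        cursor.execute(f"SELECT COUNT(*) FROM {TABLE}")
        print("rows written:", cursor.fetchone()[0])

        # Optionally inspect the rows themselves.
        cursor.execute(f"SELECT * FROM {TABLE} LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```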


Attaching the Data Collector pipeline for reference.

[1]. https://docs.streamsets.com/portal/platform-datacollector/5.7.x/index.html?contextID=task_bv5_3wz_vkb

[2]. https://docs.databricks.com/en/getting-started/quick-start.html
