
Hello Experts,

I am building an SDC pipeline using the HTTP Client origin. The output of the API call is JSON (and can be nested JSON).

After a few stages like rename/pivot/flatten, I am writing the data into S3. The next stage is a Hive query that reads the data from an external table over S3 and inserts it into the final Parquet table.

The source API has batch size limitations (some allow only 1000 records at a time, some 2000).

The SDC pipeline works fine, pulling data in batches and inserting it into the final table using offsets when the data exceeds the batch size.

For every batch, the S3 stage creates a file, an S3 event is fired, and the remaining stages of the pipeline execute.

Issue: 

This takes a long time when we have more than 100,000 records.

Is there a better way to improve the performance of this SDC pipeline? Most of the time is spent in the Hive query stage. Is there a way to not execute the Hive query and/or the other stages after the S3 stage until all data from the API call has been received in S3?

Here is a typical pipeline:

 

Thanks in advance 

Meghraj

@meghraj 

In this case you can split this pipeline into two pipelines.

Pipeline 1 :

This will fetch data from the HTTP client and store it in the S3 bucket.

Pipeline 2 :

This will fetch data from the S3 bucket and send it to the destination.

 

If there are dependencies between the pipelines, then create jobs for pipelines 1 and 2 and build an orchestration pipeline; this will help increase performance.

 

The link below covers orchestration pipeline creation:

https://docs.streamsets.com/platform-datacollector/latest/datacollector/UserGuide/Orchestration_Pipelines/OrchestrationPipelines_Title.html

 

Thanks & Regards

Bikram_


@meghraj 

In the case of S3 as a destination, after each batch the file is closed and a new one is created for the next batch. There are a few ways to handle this:

  1. You can increase the batch size to, say, 100,000 so the files will be bigger.
  2. You can write to another location where there are no triggers. Once the pipeline finishes, you can run another pipeline that reads the files from location A and does a ‘Whole File’ transfer to location B, where events are triggered, as suggested by @Bikram.
  3. You can write to a local file system and, on pipeline finish, run a script that merges the individual files into one big file and transfers it to S3 (see the sketch after this list).
  4. etc.
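
Here is a minimal sketch of option 3, assuming the pipeline writes JSON-lines files to a local staging directory and that boto3 is available; the directory, bucket, and key names below are placeholders, not anything SDC creates for you.

```python
import glob
import os

import boto3  # AWS SDK for Python; credentials come from the usual env/profile/role chain

STAGING_DIR = "/data/sdc_staging"             # hypothetical local directory the pipeline writes to
MERGED_FILE = "/data/sdc_staging/merged.json"
BUCKET = "my-target-bucket"                   # hypothetical bucket watched by the S3 event
KEY = "landing/api_dump/merged.json"


def merge_and_upload():
    # Collect the per-batch output files in a stable order.
    parts = sorted(glob.glob(os.path.join(STAGING_DIR, "sdc-*.json")))
    if not parts:
        return

    # Concatenate them into one large file, keeping one record per line.
    with open(MERGED_FILE, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                data = f.read()
            out.write(data)
            if not data.endswith(b"\n"):
                out.write(b"\n")

    # Upload the single merged object; only this one upload fires the downstream S3 event.
    boto3.client("s3").upload_file(MERGED_FILE, BUCKET, KEY)


if __name__ == "__main__":
    merge_and_upload()
```

A script like this could, for example, be triggered once the ingest pipeline stops, so the Hive stage runs once per full API extract instead of once per 1000-2000 record batch.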

Hope it helps.


These are the alternatives we have considered.

We cannot increase the batch size beyond the API batch size (which is 2000).

I was looking for an end-to-end solution inside SDC without Control Hub or any additional scripts.

Thanks for your quick response  

