SFTP Pipeline, Job, and Scheduler

  • 9 December 2021
  • 4 replies

I have a StreamSets pipeline that is part of a scheduled job. The pipeline reads a CSV file stored at an AWS SFTP location. That CSV file gets overwritten every night, and the scheduled job is supposed to read the file well after it has been overwritten. The scheduler does run the job at the specified hour; however, the pipeline only reads the first line of the CSV file and then ends, even though there are many records left to process. If I manually run the job and reset the origin, the job runs as expected. I have only been working with StreamSets for about 5 months. Can anyone suggest what I might be missing in my pipeline or possibly the scheduler?


Best answer by Ranjith P 9 December 2021, 17:52




As a general rule, overwriting files is not supported. Please try recreating this file with a different name each time.

@Dimas Cabré i Chacón Thank you very much for the answer. The suggestion adds a fair amount of complexity to a relatively simple process. Given that I am new-ish to StreamSets, if I were to implement a solution that creates a new file each time, how do I tell the pipeline to process only the latest version of the CSV file?


@uacate I believe you can achieve this use case with the Pipeline Finisher Executor (with Reset Origin).

Use case: let's say we are writing the data from SFTP → HDFS (or any destination).

Pipeline Design:

  • Enable event generation on the SFTP origin and connect the event stream to a Pipeline Finisher Executor (with Reset Origin), using a precondition that matches the no-more-data event. (Refer to the docs.)

Once the CSV file has been written to the destination, the SFTP origin will generate a no-more-data event that triggers the Pipeline Finisher. The Pipeline Finisher will stop the pipeline and reset the origin.

On the next scheduled job run, the pipeline will read the file again.
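For reference, the precondition on the Pipeline Finisher Executor is usually written so the finisher only reacts to the no-more-data event (the event stream can also carry other event types, such as new-file). A sketch of the relevant settings, assuming a recent Data Collector version; exact field names may vary:

```
# Pipeline Finisher Executor configuration (sketch)
Preconditions:  ${record:eventType() == 'no-more-data'}
On Record Error: Discard        # so filtered events are dropped, not sent to error handling
Reset Origin:    enabled        # clears the offset so the next run re-reads the file
```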

(OR) Along with the above, you can also try the Post Processing feature on the SFTP origin (with the delete option). Instead of the file being overwritten at the origin, a new file will be created (with the same name).

Note: as Dimas mentioned, overwriting files is not supported (you may face issues, so creating a uniquely named file each time would be better).

@Ranjith P I think your solution may be exactly what I was looking for. Also, I may have mis-described the pipeline. The source CSV file is overwritten every night by another enterprise system that we use. My pipeline only consumes that file and then updates a database that sits somewhere else. Thank you very much!