Solved

SFTP Pipeline, Job, and Scheduler

  • 9 December 2021
  • 4 replies

I have a StreamSets pipeline that is part of a scheduled job. The pipeline reads a CSV file stored at an AWS SFTP location. That CSV file gets overwritten every night. The scheduled job is supposed to read the file well after the file is overwritten. The scheduler does run the job at the specified hour; however, the pipeline only reads the first line of the CSV file and ends, even though there are many records for it to process. If I manually run the job and reset the origin, the job runs as expected. I have only been working with StreamSets for about 5 months. Can anyone suggest what I might be missing in my pipeline or possibly the scheduler?


Best answer by Ranjith P 9 December 2021, 17:52


4 replies


As a general rule, overwriting files is not supported. Please try recreating this file with a different name each time.

@Dimas Cabré i Chacón Thank you very much for the answer. The suggestion adds a bunch of complexity to a relatively simple process. Given that I am new-ish to StreamSets, if I were to try to implement a solution that involves creating a new file each time, how do I tell the pipeline to process only the latest version of the CSV file?


@uacate I believe you can achieve this use case by using the Pipeline Finisher Executor (with Reset Origin).

Use case: Let's say we are writing data from SFTP → HDFS (or any destination).

Pipeline Design:

  • Enable events on the SFTP origin and connect the event stream to a Pipeline Finisher Executor (with Reset Origin enabled), with a precondition on the no-more-data event (refer to the documentation; see the sketch below).

Once the CSV file is written to the destination, the SFTP origin will generate a no-more-data event that flows to the Pipeline Finisher. The Pipeline Finisher will stop the pipeline and reset the origin.

On the next scheduled job run, the pipeline will read the file again.
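For reference, the precondition on the Pipeline Finisher Executor is a StreamSets expression along these lines (a sketch; confirm against the documentation for your Data Collector version):

    ${record:eventType() == 'no-more-data'}

This makes the finisher act only on the no-more-data event and ignore any other event types the origin may emit.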

Alternatively, along with the above, you can also try the Post Processing feature on the SFTP origin (with the Delete option). That way, instead of the file in the origin being overwritten, a new file gets created (with the same name).

Note: As Dimas mentioned, overwriting files is not supported (you may face issues, so creating a uniquely named file would be better).
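If you do go the unique-name route, one common approach (a sketch; the file name is hypothetical, and the property names are from the SFTP/FTP Client origin, so verify them against your version) is to have the upstream system date-stamp each nightly file and match it with a glob:

    Nightly file name:       sales_export_2021-12-09.csv
    File Name Pattern:       sales_export_*.csv
    File Name Pattern Mode:  Glob

Since the origin tracks which files it has already processed, each scheduled run then picks up only the new file, with no origin reset needed.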

@Ranjith P I think your solution may be exactly what I was looking for. Also, I may have mis-described the pipeline. The source CSV file is overwritten every night by another enterprise system that we use. My pipeline only consumes that file and then updates a database that sits somewhere else. Thank you very much!!!
