I have a StreamSets pipeline that is part of a scheduled job. The pipeline reads a CSV file stored at an AWS SFTP location. That CSV file gets overwritten every night, and the scheduled job is supposed to read the file well after it has been overwritten. The scheduler does run the job at the specified hour; however, the pipeline reads only the first line of the CSV file and ends, even though there are many more records for it to process. If I manually run the job and reset the origin, the job runs as expected. I have only been working with StreamSets for about 5 months. Can anyone suggest what I might be missing in my pipeline or possibly in the scheduler?
As a general rule, overwriting files is not supported. Please try recreating this file with a different name each time.
Use case: Let's say we are writing data from SFTP → HDFS (or any destination).
Pipeline Design:
- Enable events on the SFTP origin and connect it to a Pipeline Finisher executor (with Reset Origin enabled), using a precondition that matches the no-more-data event. (Refer to the documentation.)
Once the CSV file has been written to the destination, the SFTP origin will send a no-more-data event to the Pipeline Finisher. The Pipeline Finisher will then stop the pipeline and reset the origin.
On the next scheduled run, the pipeline will read the file again from the beginning.
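For reference, the precondition on the Pipeline Finisher (so it only acts on the no-more-data event and ignores other event types) would look something like this, using the standard StreamSets expression language; verify the exact syntax against your Data Collector version's documentation:

```
${record:eventType() == 'no-more-data'}
```

This goes in the Pipeline Finisher's Preconditions property, typically with "On Record Error" set to Discard so other event records are silently dropped rather than failing the pipeline.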
(OR) Along with the above, you can also try the Post Processing feature on the SFTP origin (with the delete option). Then, instead of the file being overwritten in the origin, a new file will be created (with the same name).
Note: As Dimas mentioned, overwriting files is not supported (you may run into issues, so creating a uniquely named file each time would be better).