We currently have a C# console application for processing large CSV files and are exploring StreamSets as a replacement. Below are my high-level requirements.
- Currently, 50+ consumers invoke our application by dropping a CSV file into a folder on an on-premises shared drive.
- Files range from 10,000 to 2 million rows.
- The application must be able to process up to 5 million rows per hour.
- Each file drop triggers an individual processing run, so multiple files are processed simultaneously.
- Below are the high-level steps involved in processing a file:
  - Write the file to a database table.
  - Enrich every record with additional data attributes.
  - Make an API call for each record.
  - Update the DB record with the response received from the API call.
  - Also write the API response to a result CSV file for the consumer.
  - After all records are processed, publish the result CSV file to the shared drive for the consumer to pick up.
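For scale, note that 5 million rows per hour is roughly 1,400 rows per second, so the per-record API call is the likely bottleneck and needs concurrency. The per-file steps above can be sketched as a single pipeline function; this is a minimal, hypothetical sketch (the `enrich` and `call_api` helpers, the SQLite database, and the thread-pool size are all assumptions, not part of the existing application):

```python
import csv
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def enrich(record):
    # Hypothetical enrichment step: derive extra attributes per record.
    record["enriched"] = record["name"].upper()
    return record

def call_api(record):
    # Stand-in for the real per-record API call; in production this would
    # be an HTTP request (e.g. requests.post(API_URL, json=record)).
    return {"id": record["id"], "status": "OK"}

def process_file(in_path, out_path, db_path=":memory:"):
    """Per-file run: load -> DB insert -> enrich -> API call -> DB update -> result CSV."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(id TEXT PRIMARY KEY, name TEXT, enriched TEXT, api_status TEXT)"
    )

    with open(in_path, newline="") as f:
        records = [enrich(row) for row in csv.DictReader(f)]

    # Steps 1-2: persist the (enriched) records to the database table.
    conn.executemany(
        "INSERT INTO records (id, name, enriched) VALUES (:id, :name, :enriched)",
        records,
    )

    # Steps 3-4: the per-record API call dominates runtime (~1,400 calls/sec
    # at the target rate), so issue calls concurrently, then update the rows.
    with ThreadPoolExecutor(max_workers=16) as pool:
        responses = list(pool.map(call_api, records))
    conn.executemany(
        "UPDATE records SET api_status = :status WHERE id = :id", responses
    )
    conn.commit()
    conn.close()

    # Steps 5-6: write the result CSV for the consumer to pick up.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "status"])
        writer.writeheader()
        writer.writerows(responses)
```

In a StreamSets pipeline the same stages would map onto an origin, processors, and destinations; the sketch only illustrates the data flow and the need for concurrent API calls.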
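The file-drop trigger itself can be sketched as a polling watcher that starts an independent run per new file. StreamSets provides a Directory origin for exactly this; the sketch below only illustrates the trigger semantics, and the `handle` callback, poll interval, and worker count are placeholders:

```python
import time
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor

def scan_once(drop_dir, seen):
    """Return CSV files that appeared in the folder since the last scan."""
    new = [p for p in sorted(Path(drop_dir).glob("*.csv")) if p not in seen]
    seen.update(new)
    return new

def watch(drop_dir, handle, poll_seconds=5.0, max_parallel_files=8):
    """Each newly dropped CSV triggers its own processing run, so
    multiple files are handled simultaneously."""
    seen = set()
    with ProcessPoolExecutor(max_workers=max_parallel_files) as pool:
        while True:
            for path in scan_once(drop_dir, seen):
                pool.submit(handle, path)  # independent run per file
            time.sleep(poll_seconds)
```

A production watcher would also need to wait for files to finish being written (e.g. by checking that the size is stable across two scans) before handing them off.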