In the file-system origins, when SDC hits the batch size limit, it sends the batch down the pipeline. But there is another condition in an FS origin that ends a batch: reaching the end of the current file.
When processing a file with an FS origin, batches are produced at the configured batch size until you reach the end of the file; when the file is closed, whatever records are "leftover" in the current batch are sent down the pipeline. The pipeline processes that batch, SDC returns to the origin, and the origin opens a new file; if there are enough records, batches are again produced at the specified batch size.
So closing the file takes precedence over filling a batch to the configured size: reaching end-of-file always ends the current batch.
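This flush-on-EOF behavior can be sketched in Python. The generator below is an illustration of the described behavior, not SDC's actual implementation:

```python
def read_batches(files, batch_size):
    """Yield batches of records, flushing a partial batch at each end-of-file.

    `files` is a list of record lists, one list per file. Illustrative
    sketch of the FS origin's batching behavior, not SDC code.
    """
    for records in files:
        batch = []
        for record in records:
            batch.append(record)
            if len(batch) == batch_size:
                yield batch  # batch size limit reached
                batch = []
        if batch:
            yield batch  # end of file: flush the leftover records

# Two files of 6 records, batch size 5 -> batches of 5, 1, 5, 1
files = [list(range(6)), list(range(6))]
print([len(b) for b in read_batches(files, 5)])   # [5, 1, 5, 1]

# Same files, batch size 25 -> batches of 6, 6
print([len(b) for b in read_batches(files, 25)])  # [6, 6]
```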
Assume the following scenarios:
Scenario 1
Batch size = 5
Each example file contains 6 records (there are two files)
Expected outcome:
1st batch = 5 records
2nd batch = 1 record
3rd batch = 5 records
4th batch = 1 record
Scenario 2
Batch size = 25
Each example file contains 6 records (there are two files)
Expected outcome: two output batches, each containing 6 records
So how can I get output files that each contain an exact number of records?
The best way to handle this is to use the Local FS destination, which can be configured to roll files after a certain number of records. In that case the incoming batch size is irrelevant: Local FS counts the output records and rolls the files when the configured limit (a number of records, or a size in MB) is reached.
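The record-count rolling that the destination performs can be sketched as follows. This is a simplified illustration; the file-naming scheme here is an assumption, not how SDC names its output files:

```python
import os
import tempfile


class RollingWriter:
    """Write records to files, starting a new file every `max_records`.

    Simplified sketch of record-count-based file rolling, not SDC code.
    File names follow the hypothetical pattern "<prefix>-<index>.txt".
    """

    def __init__(self, prefix, max_records):
        self.prefix = prefix
        self.max_records = max_records
        self.count = 0
        self.file_index = 0
        self.current = None

    def write(self, record):
        # Roll to a new file when the current one is full (or none is open).
        if self.current is None or self.count >= self.max_records:
            if self.current:
                self.current.close()
            self.current = open(f"{self.prefix}-{self.file_index}.txt", "w")
            self.file_index += 1
            self.count = 0
        self.current.write(record + "\n")
        self.count += 1

    def close(self):
        if self.current:
            self.current.close()
            self.current = None


# Example: 12 records with max_records=5 -> three files of 5, 5, and 2 records
out_dir = tempfile.mkdtemp()
writer = RollingWriter(os.path.join(out_dir, "out"), max_records=5)
for i in range(12):
    writer.write(f"record-{i}")
writer.close()
print(sorted(os.listdir(out_dir)))  # ['out-0.txt', 'out-1.txt', 'out-2.txt']
```

Because rolling is driven purely by the output record count, the sizes of the incoming batches never affect where the file boundaries fall.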
After the files are created, you can use a second pipeline in Whole File mode with a Directory origin to pick up the completed files and send them to the appropriate destination, such as GCS or S3.