We have two StreamSets jobs. One job continuously generates a token. The other job moves data from a processed Kafka topic to an API endpoint, sending the token in its request header. The endpoint validates the token and, if it is valid, processes the data; if the token is invalid, the data is lost. The token is regenerated every 8 minutes, so if data reaches the endpoint more than 8 minutes later, the token it carries will be invalid. We want a solution that provides near-zero data loss.
One possible solution we are considering is below. Please advise if there is a better option.
- Add retries to the Kafka-topic-to-API-endpoint job using the ‘HTTP Client’ processor.
- Write errored records to a local file, S3, Oracle, or an HDFS table.
- Run a recurring job every 3 hours to move data from the file/table back to the API endpoint.
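The retry-plus-dead-letter flow above can be sketched as below. This is an illustrative sketch only, not StreamSets code: `send_fn`, `refresh_token_fn`, and the dead-letter list are hypothetical stand-ins for the HTTP Client processor call, the token-generator job, and the file/S3/Oracle/HDFS error store.

```python
def send_with_retry(record, send_fn, refresh_token_fn,
                    max_retries=3, dead_letter=None):
    """Try to deliver one record, refreshing the token before each retry.

    send_fn(record, token) -> bool stands in for the HTTP Client call;
    refresh_token_fn() -> str stands in for the token-generator job.
    Records that still fail after max_retries go to the dead-letter store,
    to be replayed later by the recurring 3-hourly job.
    """
    token = refresh_token_fn()
    for _ in range(max_retries):
        if send_fn(record, token):
            return True          # delivered with a valid token
        token = refresh_token_fn()  # token may have rotated; fetch a fresh one
    if dead_letter is not None:
        dead_letter.append(record)  # park for the replay job instead of dropping
    return False
```

Because the token is refreshed immediately before each attempt, a send should only land in the dead-letter store for non-auth failures (endpoint down, network issues), which the replay job then drains.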