We have two StreamSets jobs. One job generates a token continuously. The other moves data from a processed Kafka topic to an API endpoint, sending the token along in a request header. The endpoint validates the token and, if it is valid, processes the data; if the token is invalid, the data is lost. The token is regenerated every 8 minutes, so if data reaches the endpoint more than 8 minutes after its token was issued, the token it carries will be invalid. We want a solution that provides near-ZERO data loss.
One possible solution we are considering is below. Please advise if there is a better option.
- Enable retries on the Kafka-topic-to-API-endpoint job using the 'HTTP Client' processor.
- Write errored records to a local file, S3, Oracle, or HDFS table.
- Run a job every 3 hours to move data from the file/table to the API endpoint.
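As a rough sketch of the retry-then-spool pattern in the list above (plain Python for illustration only; `send_to_endpoint` is a hypothetical stand-in for the HTTP Client processor, and the JSON-lines spool file stands in for the S3/Oracle/HDFS error table):

```python
import json

def send_to_endpoint(record, token):
    """Hypothetical stand-in for the HTTP Client call; raises on failure."""
    raise ConnectionError("endpoint unavailable")

def process(records, token, spool_path="errored_records.jsonl", max_retries=3):
    """Try each record a few times; spool failures for the replay job."""
    sent, spooled = 0, 0
    for record in records:
        for attempt in range(max_retries):
            try:
                send_to_endpoint(record, token)
                sent += 1
                break
            except ConnectionError:
                continue
        else:
            # All retries failed: append the record to the spool so the
            # 3-hourly replay job can resend it with a fresh token.
            with open(spool_path, "a") as f:
                f.write(json.dumps(record) + "\n")
            spooled += 1
    return sent, spooled
```

The key property for near-zero loss is that a record is only ever dropped from the pipeline after it has been durably written to the spool, so the replay job can always pick it up with a fresh token.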
@rajkar, one option you could explore, and one I've used in the past, is using the Jython Scripting origin instead of an HTTP origin; within the Jython origin you can use the requests library to obtain the token, which you can then pass on to the subsequent HTTP processor.
The Jython script can then implement the logic to manage the token expiry; i.e. you get the token on initialisation, then keep emitting records (empty, with just the token in a header attribute) until it's time to renew the token.
There may be a bit more to it than that, especially if you need to keep track of offsets, so worth discussing in a call.