ISSUE:
When you configure a clustered Kafka pipeline and the Kafka topic retention is set to a low value, for example 30 minutes, you may run into kafka.common.OffsetOutOfRangeException errors. This exception occurs when the requested offset points to data that has already been deleted because it is older than the retention period.
SOLUTION:
The kafka.common.OffsetOutOfRangeException you are seeing occurs because the requested offsets no longer exist on the broker, since the topic retention time is set to 30 minutes. Typical deployments we have seen in the field use retention settings in the range of 24 hours up to a full week.
In this use case, suppose a pipeline starts with the Kafka Consumer origin at offset X while the retention time is set to 30 minutes. If the pipeline then goes down for a couple of hours, offset X will have been deleted by the time the pipeline restarts, because retention only keeps 30 minutes' worth of data, and the pipeline hits the OffsetOutOfRangeException when it tries to resume from that offset.
Due to a limitation in the spark-kafka library we are using, there is no way to issue an OffsetRequest to find out the earliest/latest offset currently available. You can see the related JIRA here:
https://issues.apache.org/jira/browse/SPARK-12693
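Because the pipeline itself cannot perform this lookup, you can check whether a stored offset is still within range by querying the broker directly with the standalone Kafka consumer API before restarting the pipeline. The following is a minimal sketch only; the broker address, topic, partition, and stored offset value are placeholders that you would replace with your own values.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class OffsetRangeCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address
            props.put("key.deserializer", ByteArrayDeserializer.class.getName());
            props.put("value.deserializer", ByteArrayDeserializer.class.getName());

            TopicPartition tp = new TopicPartition("my_topic", 0);   // placeholder topic/partition
            long storedOffset = 123456L;                              // offset X from offset.json

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Ask the broker for the earliest and latest offsets it still has for the partition.
                Map<TopicPartition, Long> earliest = consumer.beginningOffsets(Collections.singletonList(tp));
                Map<TopicPartition, Long> latest = consumer.endOffsets(Collections.singletonList(tp));

                long lo = earliest.get(tp);
                long hi = latest.get(tp);
                if (storedOffset < lo || storedOffset > hi) {
                    // The stored offset has aged out of retention; resuming from it
                    // will trigger the OffsetOutOfRangeException.
                    System.out.printf("Offset %d is outside the available range [%d, %d]%n",
                            storedOffset, lo, hi);
                } else {
                    System.out.printf("Offset %d is still available (range [%d, %d])%n",
                            storedOffset, lo, hi);
                }
            }
        }
    }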
WORKAROUND:
1) For now, as a workaround, you can clear the stored offset and restart the pipeline (see the offset-clearing sketch after this list).
Offset file location in cluster mode: hdfs://<user_home>/.streamsets-spark-streaming/<sdcId>/<topic>/<consumer_group>/<pipeline_name>/offset.json
2) For a longer-term solution, increase the topic retention time, which reduces the probability of hitting the error (see the retention sketch below).
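For workaround 1, the offset file can be removed with hdfs dfs -rm, or programmatically with the Hadoop FileSystem API. The sketch below is only an illustration and assumes the pipeline is stopped first; every path component is a placeholder matching the pattern above and must be replaced with your own values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClearStoredOffset {
        public static void main(String[] args) throws Exception {
            // Placeholder values; substitute the ones from your own deployment.
            String userHome = "/user/sdc";
            String sdcId = "<sdcId>";
            String topic = "<topic>";
            String consumerGroup = "<consumer_group>";
            String pipelineName = "<pipeline_name>";

            Path offsetFile = new Path(String.format(
                    "%s/.streamsets-spark-streaming/%s/%s/%s/%s/offset.json",
                    userHome, sdcId, topic, consumerGroup, pipelineName));

            // Assumes the default HDFS NameNode is configured in core-site.xml/hdfs-site.xml.
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                if (fs.exists(offsetFile)) {
                    fs.delete(offsetFile, false);   // non-recursive delete of the single file
                    System.out.println("Deleted " + offsetFile);
                } else {
                    System.out.println("No offset file at " + offsetFile);
                }
            }
        }
    }

With the offset file gone, the pipeline no longer tries to resume from the expired offset when it restarts.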
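For workaround 2, topic retention can be raised with the kafka-configs command-line tool, or, if you are on a newer Kafka client, with the Admin API as sketched below. The broker address, topic name, and the 24-hour value are placeholders, not recommendations from this article.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class IncreaseRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder broker address

            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my_topic"); // placeholder
                // Raise retention.ms to 24 hours (value is in milliseconds).
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "86400000"),
                        AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> update =
                        Collections.singletonMap(topic, Collections.singletonList(setRetention));
                admin.incrementalAlterConfigs(update).all().get();
                System.out.println("retention.ms set to 24 hours for my_topic");
            }
        }
    }

Keep in mind that a longer retention only lowers the probability of the error; an outage longer than the retention window will still expire the stored offset.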