Question:
How do cluster (YARN streaming) mode Kafka pipelines store offsets?
Answer:
In cluster (YARN streaming) mode, the SDC application tracks Kafka offsets for the pipeline itself. It does not store offsets in Kafka, which is the default for standalone Kafka consumer pipelines.
Offsets are persisted to HDFS at the following location:
/user/<sdc_user>/.streamsets-spark-streaming/<streamsets_id>/<topic_name>/<consumer_group>/<pipeline_name>/offset.json
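To locate the offset file for a particular pipeline, the placeholders in the path above can be filled in programmatically. The following is a minimal Python sketch, not part of SDC: the function name offset_file_path and the sample values are hypothetical and only illustrate how the path segments fit together.

```python
def offset_file_path(sdc_user, streamsets_id, topic_name,
                     consumer_group, pipeline_name):
    """Build the HDFS path of the offset file from its components.

    Hypothetical helper: it only mirrors the path layout described
    above and does not contact HDFS or SDC.
    """
    segments = [
        "user",
        sdc_user,
        ".streamsets-spark-streaming",
        streamsets_id,
        topic_name,
        consumer_group,
        pipeline_name,
        "offset.json",
    ]
    # HDFS paths always use forward slashes, regardless of the local OS.
    return "/" + "/".join(segments)


if __name__ == "__main__":
    # All values below are made-up examples.
    print(offset_file_path("sdc", "my-sdc-id", "my_topic",
                           "my_group", "my_pipeline"))
```

The resulting path can then be inspected with the standard HDFS client, for example hdfs dfs -cat <path>, assuming the user running the command has read access to that directory.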