Solved

Kafka consumer Offset from beginning every batch

  • 5 March 2024
  • 4 replies
  • 41 views

Hello,

I'm currently working on a simple pipeline to ingest Kafka messages into a log file.

I'm trying to consume all the data from the beginning of a topic, but I'm only getting the new data added to the topic. Once consumed, the earlier messages in the topic are no longer accessible.

I've already tested all the different "Auto Offset Reset" options, for both the single-topic and multi-topic consumers.
The official documentation (docs.streamsets.com) lists the following Kafka consumer properties:

  • auto.commit.interval.ms
  • bootstrap.servers
  • enable.auto.commit
  • group.id
  • max.poll.records

If I understand correctly, all of those parameters are locked, so I can't disable the offset management and process all the data from the beginning of the topic.

Is there an additional Kafka configuration property to use, or do I need to configure the topic directly via the Kafka CLI?

StreamSets Data Collector version : 3.14.0
Kafka Consumer version : 2.0.0

Regards.


Best answer by Clément Vi 11 March 2024, 11:12


4 replies

You're right, Kafka keeps track of the offsets of all its consumers. If you want to reset offsets in Kafka, you need to call Kafka directly. But if you have the Kafka client installed on your SDC machine, you can use a Shell command in the Start Event of your pipeline to perform this reset.

Then, in the Script field of the Start Event tab, put something like:

/foobar/kafka/kafka_2.13-3.5.1/bin/kafka-consumer-groups.sh --bootstrap-server mykafka:9093 --group streamsetsDataCollector --topic mytopic --reset-offsets --to-earliest --execute;
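
For reference, one way to check that the reset took effect is to describe the consumer group before starting the pipeline. This is only a sketch that reuses the placeholder installation path, broker address, and group name from the command above; adapt them to your environment:

# Show the committed offset and lag of each partition for the StreamSets consumer group
/foobar/kafka/kafka_2.13-3.5.1/bin/kafka-consumer-groups.sh --bootstrap-server mykafka:9093 --group streamsetsDataCollector --describe

After a reset to earliest, the committed offsets should point back to the earliest available offset of each partition.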


Hello @roma,

 

Thank you for the provided solution.

Indeed this is working, but I’ve encountered a side effect.

Because the script is executed for each record, Kafka is overwhelmed with offset reset requests.

Is there a solution to only execute it once per batch?

 

Regards.

Hmm, why would the script execute for each record? It's supposed to run only once, when you start the pipeline. How many instances of this pipeline are you running? Do you start them from another pipeline using the Start Jobs stage?

My bad, I was trying that via a Shell Executor Stage.

 

You are right, the Start Event is a viable solution if you want to execute a command each time the pipeline is started.

 

What I need is to execute a command once per batch, so I've done it like this:

On the Jython Processor :

  • Produce Events : Activated
  • Record Processing Mode : Batch by Batch
# Create the event with the specified type and version
new_event = sdc.createEvent('custom_event', 1)
# Send the event to StreamSets Data Collector
sdc.toEvent(new_event)

# Pass every record of the batch through unchanged
for record in sdc.records:
    try:
        sdc.output.write(record)
    except Exception as e:
        # Send the record to the error stream
        sdc.error.write(record, str(e))

This code generates an event for every batch, and the event is used to execute the previously provided Kafka command (to be adapted as needed):

/foobar/kafka/kafka_2.13-3.5.1/bin/kafka-consumer-groups.sh --bootstrap-server mykafka:9093 --group streamsetsDataCollector --topic mytopic --reset-offsets --to-earliest --execute;

It also outputs all the records of the batch.
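
Since the reset now runs repeatedly, it can be worth previewing what it will do before wiring the event to an executor. A minimal sketch, assuming the same placeholder installation path, broker, group, and topic as above; with --dry-run the tool only prints the offsets it would reset, without applying them:

# Preview the reset without changing any committed offsets
/foobar/kafka/kafka_2.13-3.5.1/bin/kafka-consumer-groups.sh --bootstrap-server mykafka:9093 --group streamsetsDataCollector --topic mytopic --reset-offsets --to-earliest --dry-run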

 

Thank you for your support.

 

Regards.

 
