Delivery Guarantee in Data Collector

December 27, 2021

Sami
StreamSets Employee

When you configure a pipeline, you define how you want data to be treated.

At Least Once delivery guarantee:

When you select At Least Once, Data Collector ensures that all data is processed and written to the destination, which might result in duplicate rows. Duplicates can occur if a batch was already written to the destination and the pipeline stopped before the next offset was saved.

With the At Least Once guarantee, the offset is committed as soon as the batch finishes, that is, once it has been written to the destination.

a) An error occurs during the SQL Server query (server connection dropped/killed): In this scenario, the offset is not affected at all, because the pipeline fails in the origin itself.

b) An error occurs during the transformation (Spark cluster unavailable): In this scenario, the offset is not affected, because the batch has not finished processing.

c) An error occurs during the write to the destination (Kerberos authentication denied): In this scenario, the offset is not affected, because the batch has not finished processing.

d) An error occurs after the write to the destination (Hive permission denied): In this scenario, the offset is affected, because a new offset is written once the batch has completed writing to the destination. In the rare case that an error happens after the batch is written to the destination but before the offset is changed, the batch is processed again on restart and records are duplicated (hence "at least once").

When the destination generates events and you have selected the At Least Once delivery guarantee, the offset changes only after the Hive Query executor completes processing of the batch. So if anything fails in the executor, the offset is not affected.
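
To make the commit ordering concrete, here is a minimal Python sketch of the At Least Once behavior described above. It is not Data Collector code: the function name run_at_least_once, the in-memory list used as a destination, and the crash_after_write_at parameter are all hypothetical and exist only to simulate a failure between the write and the offset commit.

```python
def run_at_least_once(records, destination, offset=0, batch_size=2,
                      crash_after_write_at=None):
    """Toy pipeline run: the offset is committed only AFTER each batch is written.

    Returns the last committed offset. If crash_after_write_at equals the
    starting offset of a batch, the run stops after that batch is written
    but before its offset is committed (hypothetical failure injection).
    """
    committed = offset
    while committed < len(records):
        batch = records[committed:committed + batch_size]
        destination.extend(batch)          # write the batch to the destination
        if crash_after_write_at == committed:
            return committed               # crash: the offset was never advanced
        committed += len(batch)            # commit the offset after the write
    return committed


records = ["r1", "r2", "r3", "r4"]
destination = []

# First run: the second batch is written, then the pipeline stops before the commit.
saved = run_at_least_once(records, destination, crash_after_write_at=2)

# Restart from the saved offset: the second batch is read and written again.
final = run_at_least_once(records, destination, offset=saved)

print(saved, final, destination)
```

In this sketch the restart begins at offset 2, so r3 and r4 appear twice in the destination. No record is lost, but duplicates are possible, which matches scenario d).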

At Most Once delivery guarantee:

When you select At Most Once, Data Collector ensures that data is not reprocessed. This prevents duplicate data from being written to the destination, but might result in missing rows.

With the At Most Once guarantee, the offset is committed before the batch is completely processed. In the current implementation, the offset is committed just before a batch enters the destination.

a) An error occurs during the SQL Server query (server connection dropped/killed): In this scenario, the offset is not affected at all, because the pipeline fails in the origin itself.

b) An error occurs during the transformation (Spark cluster unavailable): In this scenario, the offset is not affected at all, because the pipeline fails in a processor stage.

c) An error occurs during the write to the destination (Kerberos authentication denied): In the current implementation of At Most Once, a new offset is committed just before a batch enters the destination. So if the pipeline is restarted after an error in the destination, it resumes from the new offset and you may see a loss of data.

d) An error occurs after the write to the destination (Hive permission denied): In this scenario, the offset has already been changed, because it is committed as soon as the batch enters the destination. You might see a loss of data, because the commit does not wait for the destination to complete the current batch.

When the destination generates events and you have selected the At Most Once delivery guarantee, the behavior is the same: the offset is committed just before the batch enters the destination stage, so a failure in the executor does not roll the offset back.
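
The At Most Once ordering can be illustrated with a similar minimal Python sketch. Again, this is not Data Collector code: run_at_most_once, the in-memory list used as a destination, and the fail_write_at parameter are hypothetical names used only to simulate a destination failure that happens after the offset has been committed.

```python
def run_at_most_once(records, destination, offset=0, batch_size=2,
                     fail_write_at=None):
    """Toy pipeline run: the offset is committed BEFORE each batch is written.

    Returns the last committed offset. If fail_write_at equals the starting
    offset of a batch, the write of that batch fails and the run stops
    (hypothetical failure injection).
    """
    committed = offset
    while committed < len(records):
        start = committed
        batch = records[start:start + batch_size]
        committed = start + len(batch)     # commit the offset before the write
        if fail_write_at == start:
            return committed               # write failed: the offset already moved on
        destination.extend(batch)          # write the batch to the destination
    return committed


records = ["r1", "r2", "r3", "r4", "r5", "r6"]
destination = []

# First run: the write of the second batch fails after its offset was committed.
saved = run_at_most_once(records, destination, fail_write_at=2)

# Restart from the saved offset: the second batch is skipped.
final = run_at_most_once(records, destination, offset=saved)

print(saved, final, destination)
```

Here the restart begins at offset 4, so r3 and r4 never reach the destination. Nothing is written twice, but data can be lost, which matches scenario c).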

You can find more information in the StreamSets Data Collector documentation: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Pipeline_Design/DatainMotion.html#concept_ffz_hhw_kq
