In this article, I walk through example scenarios that show how the failover retry configurations behave. When a Data Collector job is enabled for failover, Control Hub retries the failover an infinite number of times by default. However, you may want the failover to stop after a given number of retries. In such cases, you need to define the maximum number of retries to perform.
To determine the maximum number of retries, you can configure one or both of the following properties when you configure the job:
- Failover Retries per Data Collector (FRPD): This refers to the maximum number of pipeline failover retries to attempt on each available Data Collector. The initial start of a pipeline instance on a Data Collector counts as the first retry attempt.
  Control Hub maintains the failover retry count for each available Data Collector. When a Data Collector reaches the maximum number of failover retries, Control Hub does not attempt to restart additional failed pipelines for the job on that Data Collector. However, this does not affect the retry counts for other Data Collectors running pipeline instances for the same job.
- Global Failover Retries (GFR): This refers to the maximum number of pipeline failover retries to attempt across all available Data Collectors.
  Control Hub maintains the global failover retry count across all available Data Collectors. When the maximum number of global failover retries is reached, Control Hub stops the job.
Note that Control Hub increments the failover retry count and applies the retry limit only when the pipeline encounters an error and transitions to a Start_Error or Run_Error state. If the engine running the pipeline shuts down, failover always occurs and Control Hub does not increment the failover retry count.
It's important to keep in mind that when the FRPD limit is reached for all available Data Collectors, Control Hub does not stop the job. Instead, the job remains in a red active status until another Data Collector becomes available to run the pipeline. Whether the job is stopped depends entirely on whether the GFR limit is reached.
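The counting and stopping rules described above can be condensed into a small decision function. This is a minimal sketch of the behavior as this article describes it, not Control Hub's actual implementation. The names `counts_toward_limits`, `decide`, the `ENGINE_SHUTDOWN` label, and the returned action strings are all hypothetical, and the sketch assumes that a failover always targets a Data Collector other than the current one.

```python
# States that count toward the retry limits, per the note above.
# The upper-case spelling is an assumption made for this sketch.
COUNTED_STATES = {"START_ERROR", "RUN_ERROR"}


def counts_toward_limits(cause):
    """Only pipeline error states increment the failover retry counts."""
    return cause in COUNTED_STATES


def decide(frpd_counts, frpd_max, gfr_current, gfr_max, current_sdc, cause):
    """Decide what happens after the pipeline stops on current_sdc.

    frpd_counts maps each available Data Collector to its retry count.
    Returns one of the hypothetical labels
    'FAIL_OVER', 'STOP_JOB', or 'WAIT_RED_ACTIVE'.
    """
    if not counts_toward_limits(cause):
        # For example an engine shutdown: failover always occurs
        # and the retry limits are not applied.
        return "FAIL_OVER"
    if gfr_current >= gfr_max:
        # Global limit reached: the job is stopped.
        return "STOP_JOB"
    # Look for another Data Collector that is still below its FRPD limit.
    has_target = any(
        sdc != current_sdc and attempts < frpd_max
        for sdc, attempts in frpd_counts.items()
    )
    # No eligible target: the job stays active (red) until a
    # Data Collector becomes available.
    return "FAIL_OVER" if has_target else "WAIT_RED_ACTIVE"


# Both Data Collectors exhausted, global limit not yet reached.
counts = {"sdc-1": 2, "sdc-2": 2}
print(decide(counts, 2, 3, 4, "sdc-2", "RUN_ERROR"))  # WAIT_RED_ACTIVE
```

Keeping the "does this event count" check separate from the limit checks mirrors the note above: an engine shutdown bypasses both limits, while an error state is subject to the global limit first and the per-Data-Collector limits second.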
To illustrate how these configurations work in practice, let's consider the following assumptions:
- We have two Data Collectors available with the same label for the deployment in question. Let's call them sdc-1 and sdc-2.
- We tested this with a pipeline that always ends up in RUNNING_ERROR when it runs.
Now, let's look at some example scenarios:
Scenario-1:
Pipeline Retry attempts: -1
Enable Failover: true
Failover Retries per Data Collector: 2
Global Failover Retries: 3
The pipeline retries indefinitely on the same Data Collector. Because the pipeline never gives up retrying on its assigned Data Collector, failover does not happen, and the job is not stopped until the pipeline succeeds.
Scenario-2:
Pipeline Retry attempts: 1
Enable Failover: true
Failover Retries per Data Collector: 2
Global Failover Retries: 3
Job starts on sdc-1: FRPD(sdc-1) = 1, FRPD(sdc-2) = 0, GFR_current = 0, GFR_max = 3
Job fails over from sdc-1 to sdc-2: FRPD(sdc-1) = 1, FRPD(sdc-2) = 1, GFR_current = 1, GFR_max = 3
Job fails over from sdc-2 to sdc-1: FRPD(sdc-1) = 2, FRPD(sdc-2) = 1, GFR_current = 2, GFR_max = 3
Job fails over from sdc-1 to sdc-2: FRPD(sdc-1) = 2, FRPD(sdc-2) = 2, GFR_current = 3, GFR_max = 3
The job is stopped because the global failover retries limit of 3 has been reached. Because the pipeline did not run successfully, the job ends in the inactive-red state.
Scenario-3:
Pipeline Retry attempts: 1
Enable Failover: true
Failover Retries per Data Collector: 2
Global Failover Retries: 2
Job starts on sdc-1: FRPD(sdc-1) = 1, FRPD(sdc-2) = 0, GFR_current = 0, GFR_max = 2
Job fails over from sdc-1 to sdc-2: FRPD(sdc-1) = 1, FRPD(sdc-2) = 1, GFR_current = 1, GFR_max = 2
Job fails over from sdc-2 to sdc-1: FRPD(sdc-1) = 2, FRPD(sdc-2) = 1, GFR_current = 2, GFR_max = 2
The job does not fail over again because the global failover retries limit of 2 has been reached. The job is stopped and moves to the inactive-red state.
Scenario-4:
Pipeline Retry attempts: 1
Enable Failover: true
Failover Retries per Data Collector: 2
Global Failover Retries: 4
Job starts on sdc-1: FRPD(sdc-1) = 1, FRPD(sdc-2) = 0, GFR_current = 0, GFR_max = 4
Job fails over from sdc-1 to sdc-2: FRPD(sdc-1) = 1, FRPD(sdc-2) = 1, GFR_current = 1, GFR_max = 4
Job fails over from sdc-2 to sdc-1: FRPD(sdc-1) = 2, FRPD(sdc-2) = 1, GFR_current = 2, GFR_max = 4
Job fails over from sdc-1 to sdc-2: FRPD(sdc-1) = 2, FRPD(sdc-2) = 2, GFR_current = 3, GFR_max = 4
The job cannot fail over from sdc-2 because both Data Collectors have reached their FRPD limit, so no instance is available for failover. GFR_current = 3, GFR_max = 4. In this case, the job stays red-active until a new Data Collector becomes available or is added to the related deployment.
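Scenarios 2 to 4 can be reproduced with a short simulation, which may help when reasoning about other combinations of limits. This is only a sketch under the assumptions used in this article: the pipeline fails on every run, the Data Collectors stay available, and each failover moves to a different Data Collector with the fewest attempts so far. The function name `simulate` and the status labels `INACTIVE_RED` and `ACTIVE_RED` are hypothetical and do not come from Control Hub.

```python
def simulate(collectors, frpd_max, gfr_max):
    """Trace failovers for a pipeline that fails on every run.

    Returns (trace, final_status). Each trace entry is
    (data_collector, copy_of_frpd_counts, gfr_current).
    """
    frpd = {sdc: 0 for sdc in collectors}
    gfr_current = 0

    current = collectors[0]
    frpd[current] += 1  # the initial start counts as the first attempt
    trace = [(current, dict(frpd), gfr_current)]

    while True:
        # The pipeline has just failed on `current`.
        if gfr_current >= gfr_max:
            return trace, "INACTIVE_RED"  # global limit reached, job stops

        # Other Data Collectors that are still below their FRPD limit.
        candidates = [
            sdc for sdc in collectors
            if sdc != current and frpd[sdc] < frpd_max
        ]
        if not candidates:
            # Nothing to fail over to: job waits for a new Data Collector.
            return trace, "ACTIVE_RED"

        # Fail over and update both counters.
        current = min(candidates, key=lambda sdc: frpd[sdc])
        frpd[current] += 1
        gfr_current += 1
        trace.append((current, dict(frpd), gfr_current))


# Scenario-4: FRPD limit 2, GFR limit 4.
trace, status = simulate(["sdc-1", "sdc-2"], frpd_max=2, gfr_max=4)
for sdc, counts, gfr in trace:
    print(sdc, counts, "GFR_current =", gfr)
print("final status:", status)
```

The loop always terminates because every iteration either returns or increments the global count, which is bounded by the GFR limit.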