
Hi there,

 

Recently, none of our job instances in the DataOps Platform executed, due to the following error:

 

JOBRUNNER_73 - Insufficient resources to run job. All matching Data Collectors '[https://XXXXXX:18600]' have reached their maximum memory limits.

 

We understand this happened because memory utilization on the engine was above 80% and the garbage collection process did not kick in.
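To make the 80% figure concrete, here is a minimal sketch of the kind of check we assume is being applied. The function names and the sample byte values are hypothetical, and the exact rule the platform uses to decide that an engine has reached its memory limit is an assumption on our part, not something we have confirmed.

```python
def heap_utilization_percent(used_bytes: int, max_bytes: int) -> float:
    """Return used heap as a percentage of the maximum heap."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return 100.0 * used_bytes / max_bytes


def over_memory_limit(used_bytes: int, max_bytes: int,
                      limit_percent: float = 80.0) -> bool:
    """True when utilization is strictly above the limit.

    The 80% default is the figure quoted in this post; treating the
    comparison as strict is an assumption.
    """
    return heap_utilization_percent(used_bytes, max_bytes) > limit_percent


if __name__ == "__main__":
    # Hypothetical readings: 3.5 GiB used out of a 4 GiB heap.
    used, maximum = 3584 * 1024**2, 4096 * 1024**2
    print(f"{heap_utilization_percent(used, maximum):.1f}% used,",
          "over limit" if over_memory_limit(used, maximum) else "within limit")
```

If garbage collection does not bring the used figure back under the limit, a check like this would keep reporting the engine as unavailable even when no pipeline has failed.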

 

We are using Data Collector v4.3, installed from the tarball on an AWS EC2 instance. We had allocated enough memory for the few job instances that are scheduled to run; the data volume for each job instance was less than 5000 records.

 

We noticed the issue only after 12+ hours, because neither the job instance nor the pipeline failed.

 

The platform is about to be released to the wider team, so we have a few questions:

  1. How do we get an alert when a job instance does not execute in time? Note that it did not have any errors. It was just in an ACTIVE state but color-coded in RED with the above message. The message was not even shown as an ERROR.
     
  2. How do we set an alert to monitor the memory utilization on the Data Collector engine in DataOps Platform?
     
  3. Do we need to do any extra tweaking for the garbage collection process?

 

NOTE:

 

  1. I had failover enabled for all job instances, but we did not have a failover engine set up. Our deployment has just one engine running.
  2. The maximum number of retries and the global retries were both set to -1.

 

Can someone confirm whether this is the reason the job instance was NOT marked as FAILED and an alert was NOT sent out?

 

Thanks,

Srini

 

 

@Srinivasan Sankar 

In the scenario you describe, it is expected that the job remains in the ACTIVE RED state. Please refer to the following docs for more details on job status: https://docs.streamsets.com/portal/platform-controlhub/controlhub/UserGuide/Jobs/Jobs-Monitoring.html#concept_qll_fbn_gy 

Active (red): The job is active, but there are some issues you must look into. For example, a red active status can indicate one of the following issues:
  • One of the assigned execution engines is not currently running.
  • One of the assigned execution engines encountered an error while running the pipeline.
  • All assigned execution engines have exceeded their resource thresholds.

 

To get an alert, you can set up a subscription that triggers when the job status color changes to RED.

 

As for whether you need to do any extra tweaking for the garbage collection process: make sure you have enabled the G1GC algorithm for garbage collection, for improved performance.
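As a quick sanity check, here is a minimal sketch that inspects a JVM options string for the standard HotSpot flag `-XX:+UseG1GC`. The helper name `uses_g1gc` is hypothetical, and I am assuming you pass it whatever option string your tarball install hands to the JVM (for example the value of `SDC_JAVA_OPTS`). It only looks for the explicit flag, so it will not detect a newer JVM that picks G1 by default without any flag.

```python
import shlex


def uses_g1gc(java_opts: str) -> bool:
    """Return True if the options string explicitly enables G1GC.

    For a boolean -XX flag given more than once, the last occurrence
    is treated as the effective one.
    """
    enabled = False
    for token in shlex.split(java_opts):
        if token == "-XX:+UseG1GC":
            enabled = True
        elif token == "-XX:-UseG1GC":
            enabled = False
    return enabled


if __name__ == "__main__":
    # Hypothetical option strings, for illustration only.
    print(uses_g1gc("-Xmx4g -Xms4g -XX:+UseG1GC"))
    print(uses_g1gc("-Xmx4g -Xms4g -XX:+UseConcMarkSweepGC"))
```

If this reports that the flag is missing, adding `-XX:+UseG1GC` to the engine's Java options and restarting the engine is the usual way to switch collectors; remove any flag that selects a different collector at the same time, since the JVM rejects conflicting collector choices.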

 

