Job Instance Failed to Execute - NO ERRORS / ALERTS REPORTED

Question

Hi there,

Recently, all our Job Instances in the DataOps Platform did not execute due to the following,

JOBRUNNER_73 - Insufficient resources to run job. All matching Data Collectors '[https://XXXXXX:18600]' have reached their maximum memory limits.

We understand this is because, memory utilization on the engine was more than 80% and Garbage Collection process did not kick-in.

We are using Data Collector v4.3; we use Tarball installation type and AWS EC2 instance. We had allocated enough memory for the few job instances that are scheduled to run; the data volume for each job instance was less than 5000 records.

We noticed the issue only after 12+ hours since the Job Instance or the pipeline did not fail,

The platform is to released to the wider team and how do we get an alert when,

A Job instance did not execute in-time. Note - it did not have any errors - it was just in an ACTIVE state but color-coded in RED with the above message. The message was NOT even showing as ERROR.
How do we set an alert to monitor the memory utilization on the Data Collector engine in DataOps Platform.
Do we need to do any extra tweaking for the Garbage Collection Process ?

NOTE:

I has fail over “enabled” for all job instances. But we did not have a “failover” engine set-up. Our deployment just has 1 engine running.
The maximum number of retries and the global retries was set to -1

Can someone confirm, whether this is the reason the JOB INSTANCE was NOT marked as FAILED and an alert NOT sent out?

Thanks,

Srini

Rishi · Answer

@Srinivasan SankarFrom explained scenario it’s expected that Job will remain in ACTIVE RED state. Please refer the following docs from more details on Job Status :https://docs.streamsets.com/portal/platform-controlhub/controlhub/UserGuide/Jobs/Jobs-Monitoring.html#concept_qll_fbn_gyJob isactive, but there are some issues you must look into.For example, aredactivestatus can indicate one of the following issues:One of the assigned execution engines is not currently running.One of the assigned execution engines encounteredan error while running the pipeline.All assigned execution engines have exceeded their resource thresholds.To get an ALERT , you can set the subscription which will trigger when job goes to RED colori.eDo we need to do any extra tweaking for the Garbage Collection Process ? : Makesure you have enabled g1gc algorithm for garbage collection for improved performance.

Job Instance Failed to Execute - NO ERRORS / ALERTS REPORTED

1 reply

Reply

Reply

Couldn't find what you're looking for?

Sign up

Social Login

Login to the community

Social Login

Scanning file for viruses.

This file cannot be downloaded