ISSUE:
We have two Data Collectors (sdc1 and sdc2) on different hosts, both configured with the same label (prod234). This morning, Control Hub routed all jobs with the job label prod234 to a single Data Collector (sdc2), even though sdc1 had not failed.
Troubleshooting:
We found that the issue is likely related to the default resource thresholds set in the engine configuration: CPU (80%), memory (100%), and max pipelines (10K).
The running jobs are memory intensive, and allocating new jobs to a Data Collector that is already close to 100% memory usage is asking for trouble.
We adjusted the thresholds to CPU (80%), memory (75%), and max pipelines (10), and the issue has not reappeared.
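
To illustrate why the memory threshold matters, here is a minimal Python sketch of a threshold check. It is not Control Hub's actual scheduling code; the class names, the is_eligible helper, the "at or below the limit" comparison, and the load figures are all assumptions made up for illustration. Only the threshold values come from the notes above.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_cpu_pct: float
    max_memory_pct: float
    max_pipelines: int


@dataclass
class EngineLoad:
    name: str
    cpu_pct: float
    memory_pct: float
    running_pipelines: int


def is_eligible(load: EngineLoad, limits: Thresholds) -> bool:
    """Return True if the engine is at or below every threshold.

    Assumption: an engine is skipped only when it exceeds a limit.
    """
    return (
        load.cpu_pct <= limits.max_cpu_pct
        and load.memory_pct <= limits.max_memory_pct
        and load.running_pipelines <= limits.max_pipelines
    )


# Threshold values taken from the notes above.
default_limits = Thresholds(max_cpu_pct=80, max_memory_pct=100, max_pipelines=10_000)
adjusted_limits = Thresholds(max_cpu_pct=80, max_memory_pct=75, max_pipelines=10)

# Hypothetical load figures for a memory-heavy engine (not measured values).
busy_engine = EngineLoad(name="sdc2", cpu_pct=40, memory_pct=95, running_pipelines=6)

print(is_eligible(busy_engine, default_limits))
print(is_eligible(busy_engine, adjusted_limits))
```

With a 100% memory limit, an engine at 95% memory still counts as eligible for new jobs. With the 75% limit, the same engine is skipped until its memory usage drops.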