There have been cases where Data Collector "disappears" and is no longer a running process. In some cases this is due to a JVM crash; in others, the Linux Out Of Memory (OOM) Killer has killed Data Collector.
The Linux “Out Of Memory Killer” is a kernel feature in the virtual memory subsystem that tries to prevent the system from running out of memory and swap space. The OOM Killer evaluates memory use patterns - mostly which processes are growing, and how fast - and will kill a process that is “taking too much memory” or is growing “too fast”. There are several kernel configuration tunables for this, but it is best to leave them alone; the defaults are generally fine.
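Without changing any of those tunables, you can still see how the kernel currently scores a process. The commands below are a read-only sketch; the `pgrep` pattern is an assumption and should be adjusted to match how Data Collector is started on your machine:

```
# Find the Data Collector PID (the pattern below is an example; adjust it
# to match the Java main class or launch script used in your install)
SDC_PID=$(pgrep -f DataCollectorMain)

# The kernel's current "badness" score for the process (higher = more likely to be killed)
cat /proc/$SDC_PID/oom_score

# Any administrator-set adjustment (default is 0)
cat /proc/$SDC_PID/oom_score_adj
```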
In a case where the Data Collector heap is set too high, the OOM Killer may kill the process. The Data Collector then appears to simply “stop”: the last sdc.log entries show nothing special, but the process is no longer running.
There is another item to quickly check for: a JVM crash report. The files are typically named hs_err_pid<pid>.log and are found in the working directory of the JVM. If you find a crash report file, the Linux OOM Killer did not stop the process; review the crash file instead - it may point to something like a SIGSEGV or a SIGBUS.
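A quick way to look for such a file, assuming you know roughly where the JVM was started from (the paths below are examples only; adjust them to your installation):

```
# Check the JVM's working directory first - hs_err files land there
# unless -XX:ErrorFile points somewhere else
ls -l hs_err_pid*.log 2>/dev/null

# Or search likely locations for crash reports written in the last week
find /opt /var/log -name 'hs_err_pid*.log' -mtime -7 2>/dev/null
```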
Back to the Linux OOM Killer - the machine in the case study is:
- A machine with 44g of memory and a minimal 4g of swap space.
- The Data Collector heap is set to 35g, e.g. -Xmx35g -Xms10g.
On a machine with 44g, say 2g is used by Linux. Let's also say Data Collector uses 4g of non-heap memory - for its code, NIO buffers, thread stacks, etc. Not counting the rest of the processes on the machine, that leaves roughly 38g free, and 75% of that is about 28g. If the machine is only running Data Collector, a 28g heap is very large and still a safe choice.
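As a rough sanity check of the same arithmetic on your own machine (the 2g OS overhead and 4g non-heap figures are the assumptions from this case study):

```
# Total and available physical memory, in gigabytes
free -g

# Rough heap budget: total - OS overhead - JVM non-heap, then take ~75% as a margin
# 44g - 2g - 4g = 38g available; 38g * 0.75 ≈ 28g maximum heap
echo "$(( (44 - 2 - 4) * 75 / 100 ))g"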
In this case, setting Data Collector's JVM heap to 35g ultimately turns out to be too high. When the heap grows toward that size (after starting at 10g), the Linux OOM Killer kicks in, applies its heuristics, and in this specific case decides to kill Data Collector's JVM to reclaim all the memory that has been allocated to it.
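The fix is to lower the heap ceiling to something the machine can actually satisfy. As a sketch, in a typical tarball install the heap flags are passed through SDC_JAVA_OPTS in libexec/sdc-env.sh; the exact file and variable may differ depending on how Data Collector was installed, so treat the snippet below as an example rather than the definitive location:

```
# libexec/sdc-env.sh (location and variable name may vary by install type)
# 28g is the safe ceiling worked out above for this 44g machine; setting
# -Xms equal to -Xmx reserves the full allocation up front.
export SDC_JAVA_OPTS="-Xmx28g -Xms28g ${SDC_JAVA_OPTS}"
```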
There is a trace of this in the system messages file: running `dmesg | grep -i kill` will indicate whether the OOM Killer has killed Data Collector. There is also a lot of additional information (on CentOS) if you open the messages file and find the point where the Linux OOM Killer ran.
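For example, the following should surface the kill event if it happened (the /var/log/messages path is the CentOS default; other distributions may use /var/log/syslog or journald):

```
# Kernel ring buffer - shows "Killed process <pid> (java) ..." entries if the OOM killer fired
dmesg | grep -i kill

# CentOS system log - the surrounding lines include the per-process memory table
# the kernel dumped when the OOM killer ran
sudo grep -i -A 20 'out of memory' /var/log/messages

# On systemd-based systems, the kernel journal holds the same information
journalctl -k | grep -i -E 'out of memory|killed process'
```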