
In this article we discuss the situation where a HANA server crashes without leaving a crash dump or trace file.
Symptoms:
- A HANA server process, such as the indexserver, restarts by itself.
- No indexserver crash dump is found in the HANA Studio Diagnosis Files tab or in the trace folder on the HANA machine.
Cause:
The cause of this abnormal crash could be the Linux OOM Killer mechanism: when the system runs low on memory, the kernel kills a process, possibly an important one, in order to remain operational.
What is the OOM Killer mechanism in Linux:
Linux has a protection mechanism called the OOM Killer, which sacrifices one or more processes to free up memory for the system. The victim is chosen based on each process's oom_score.
The function badness() does the actual scoring to find the best process to terminate. It is reached through the call chain:
alloc_pages -> out_of_memory() -> select_bad_process() -> badness()
/*
* oom_badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
* @uptime: current uptime in seconds
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
* to kill when we run out of memory.
*
* Good in this context means that:
* 1) we lose the minimum amount of work done
* 2) we recover a large amount of memory
* 3) we don't kill anything innocent of eating tons of memory
* 4) we want to kill the minimum amount of processes (one)
* 5) we try to kill the process the user expects us to kill, this
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
*/
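The scores produced by this heuristic are exposed per process under /proc/<pid>/oom_score. As a minimal sketch, assuming a Linux host with a readable /proc, the following ranks processes by that score; the helper name top_oom_candidates is hypothetical and only for illustration.

```python
from pathlib import Path


def top_oom_candidates(proc_root="/proc", limit=5):
    """Return up to `limit` (score, pid, name) tuples, highest oom_score first.

    Hypothetical helper: it only reads /proc/<pid>/oom_score and
    /proc/<pid>/comm, which the Linux kernel provides for every process.
    """
    candidates = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories correspond to processes
        try:
            score = int((entry / "oom_score").read_text().strip())
            name = (entry / "comm").read_text().strip()
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the entry is unreadable
        candidates.append((score, int(entry.name), name))
    # Highest score first; ties are broken by the lower pid.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return candidates[:limit]
```

On a node where HANA holds most of the memory, one would expect the indexserver process to appear near the top of this list, which is why it tends to be the process that gets terminated.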
Once the OOM Killer has selected the process with the highest score, it sends that process the SIGKILL signal (the equivalent of "kill -9").
SIGKILL cannot be caught or handled, so the process is terminated immediately and the indexserver cannot write a crash dump before exiting.
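The reason no dump can be written is that a process is not allowed to register a handler for SIGKILL at all. The short sketch below illustrates this on a POSIX system; the helper name can_install_handler is hypothetical, and the code is assumed to run in the main thread of the interpreter.

```python
import signal


def can_install_handler(signum):
    """Return True if this process may register a handler for signum."""
    try:
        previous = signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # The operating system rejects handlers for SIGKILL (and SIGSTOP).
        return False
    # Put back whatever was installed before so the check has no side effect.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    print("SIGTERM handler allowed:", can_install_handler(signal.SIGTERM))
    print("SIGKILL handler allowed:", can_install_handler(signal.SIGKILL))
```

A signal such as SIGTERM gives a process the chance to clean up or write diagnostics, whereas SIGKILL gives it none.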
An OOM kill can be confirmed in the system logs located at:
/var/log/messages or /var/log/warn
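As a rough sketch, these logs can be filtered for OOM Killer activity. The exact wording of the kernel messages varies between kernel versions, so the pattern below is an assumption that matches a few commonly seen phrases rather than a complete list; the function names are hypothetical.

```python
import re

# Assumed phrases; adjust to the messages your kernel version actually writes.
OOM_PATTERN = re.compile(r"oom-killer|out of memory|killed process", re.IGNORECASE)


def find_oom_events(lines):
    """Return the lines that look like OOM Killer messages."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file, e.g. /var/log/messages, for OOM Killer messages."""
    with open(path, errors="replace") as handle:
        return find_oom_events(handle)
```

If the matching lines name a HANA process shortly before the observed restart, the OOM Killer is the likely cause of the missing crash dump. Reading these files usually requires root privileges.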
Solution:
Linux sends SIGKILL only as a last-resort protective measure.
The real concern is the memory usage of the HANA processes.
Ensure that all HANA instances running on the affected node have a global allocation limit (GAL) set, which is controlled by the parameter
global.ini [memorymanager] -> global_allocation_limit
The limit should be set so that:
- Each instance is capped at an upper threshold when multiple HANA instances are running on the node.
- Sum of the GALs of all HANA instances + OS tools + OS page cache ≤ available memory.
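The sizing rule above is simple arithmetic and can be checked before changing any limits. The sketch below assumes all figures are given in GB; the function names and the numbers in the usage example are hypothetical and not taken from a real system.

```python
def memory_headroom_gb(gal_per_instance_gb, os_tools_gb, page_cache_gb, available_gb):
    """Return the memory left over after all planned consumers, in GB.

    A negative result means the planned limits exceed the available memory.
    """
    planned = sum(gal_per_instance_gb) + os_tools_gb + page_cache_gb
    return available_gb - planned


def budget_fits(gal_per_instance_gb, os_tools_gb, page_cache_gb, available_gb):
    """True if sum of GALs + OS tools + page cache <= available memory."""
    return memory_headroom_gb(
        gal_per_instance_gb, os_tools_gb, page_cache_gb, available_gb
    ) >= 0


if __name__ == "__main__":
    # Hypothetical node: 512 GB available, two HANA instances.
    print(memory_headroom_gb([200, 150], os_tools_gb=10, page_cache_gb=20, available_gb=512))
    print(budget_fits([300, 250], os_tools_gb=10, page_cache_gb=20, available_gb=512))
```

In the first call the plan leaves spare memory, while in the second the combined limits exceed what the node has, which is the situation in which the OOM Killer may eventually step in.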
The correct page cache value should be calculated as described in our previous article:
As administrators, we need to plan memory allocation carefully to avoid memory bottlenecks.