Is a reboot recommended when OOM killer eats your dashd?

dashameter · Jul 6, 2018

The web is full of opinions that when the OOM killer eats your process, it might have eaten other system processes as well, so it's best to reboot the machine entirely. I see scripts being discussed here that simply call `./dashd` to restart the process. What is the consensus on whether that alone is good enough, or should the machine be restarted?

If the machine does need a restart, should the watchdog script run from root's crontab, or should the user be granted permission to reboot via visudo:

Code:

user  ALL=NOPASSWD: /sbin/reboot

While having

Code:

@reboot "/home/user/.dashcore/dashd"

in your crontab?

Adapting the script from the other thread:

Code:

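#!/bin/bash
# dash-cli prints nothing on stdout when it cannot reach dashd, so the line count is 0.
DASHD_RUNNING=$( /home/user/.dashcore/dash-cli help 2>/dev/null | wc -l )
if [ "$DASHD_RUNNING" -eq 0 ]; then
    # Needs root: run from root's crontab or allow it via the sudoers entry above.
    sudo /sbin/reboot
fi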

I prefer calling dash-cli rather than using ps, because I might have other scripts with "dashd" in their name running. If dashd isn't running, dash-cli will throw an error.

TroyDASH · Jul 6, 2018

I have always used a script that simply restarts dashd without reboot and have almost never had problems. The most common reasons for irrecoverable dashd crashes for me have been (1) the hosting provider (VPS) having a problem, or (2) insufficient SWAP memory or insufficient disk space -- neither of those issues would be truly fixed by rebooting (frequently rebooting due to low memory might only band-aid a symptom of a bigger problem that needs to be fixed). Of course, in the rare instance when dashd does crash and does not recover in time, then I will do a reboot regardless for good measure, since the masternode is already kicked out of the queue anyway.
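For illustration, a restart-only watchdog along these lines might look like the sketch below. It is not the actual script referred to above: the paths are borrowed from the first post, the function names are hypothetical, and it assumes `dash-cli help` fails whenever dashd is not answering.

```shell
#!/bin/bash
# Hypothetical paths; adjust to your own install.
DASH_CLI="${DASH_CLI:-/home/user/.dashcore/dash-cli}"
DASHD="${DASHD:-/home/user/.dashcore/dashd}"

# Succeeds when the daemon answers an RPC call, fails otherwise.
dashd_is_up() {
    "$DASH_CLI" help >/dev/null 2>&1
}

# Start dashd only when the RPC check fails; no reboot involved.
ensure_dashd() {
    if dashd_is_up; then
        echo "running"
    else
        echo "restarting"
        "$DASHD" -daemon
    fi
}

# Only act when invoked with --run, so the functions can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    ensure_dashd
fi
```

Saved as, say, `watchdog.sh` (an assumed name), it could be called from the user's crontab with an entry like `*/10 * * * * /home/user/watchdog.sh --run`.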

dashameter · Jul 6, 2018

Thanks for the response. The OOM killer ate my dashd once in a year, and I doubled the RAM and swap in response. It happened during a tx spike, so it should probably be fine now.

In what scenario does your dashd crash because of a fault at the VPS host? I'm assuming that either the host's network goes out, in which case dashd hopefully doesn't crash but keeps running, or the server gets rebooted without your consent, in which case dashd should auto-start on boot. In neither case would a watchdog script calling dashd solve the issue. Is there a scenario that I'm missing?

If rebooting the machine is not necessary to recover from an OOM kill, do we need a script at all? Just run dashd every 10 minutes via cron; if it is already running it won't start a second time, and done. Is there a reason to parse the process list instead?

Thanks a lot!

TroyDASH · Jul 6, 2018

Sometimes the host provider needs to reboot the node due to scheduled maintenance or because of some other problem -- usually those are fast enough for dashd to recover. Full outages are rarer, but they do happen sometimes, even to some of the most reliable services like AWS. Those tend to knock out a chunk of masternodes at a time.

Good point about dashd. I don't really know of any downside to just running dashd repeatedly instead of parsing the process list, if dashd won't start again when it is already running. Unless maybe dashd working out that it shouldn't run is a more expensive operation than parsing the process list? Even if it works either way, I'm probably not going to touch mine, because it's been going so well for so long the way it is.

xkcd · Jul 24, 2018

@dashameter What are the specs of your VPS, RAM and swap? Can you post the output of `free -m`? Have you looked at `journalctl` to see what other procs got OOM-killed? You only need to reboot if the OOM killer got some other proc and you are no longer sure of the state of your machine.

That said, there are two settings that should help you with an overzealous OOM killer. Depending on the `vm.overcommit_memory` mode, the kernel can refuse an allocation that asks for more than it thinks it can back, even if the program would never touch all of that memory (the default mode 0 is a heuristic that rejects obvious over-allocations, and mode 2 is strict accounting). This is not desired behaviour IMO, because many programs over-allocate RAM but never end up using it: under strict accounting, if a proc grabs 1GB but only uses 500MB, the unused 500MB is still reserved, and if you only have 1GB the next malloc() fails.

Great news, there is another way: the kernel can be configured to always allow over-allocation of RAM, so if you have 2GB and malloc() 3GB the kernel will say, yeah OK, have it, and the OOM killer won't come for you until you actually touch more memory than the machine has. How to turn this on?

Code:

# Memory management.
sudo sysctl -w vm.overcommit_memory=1

# Make it permanent.
sudo bash -c "echo \"vm.overcommit_memory=1\">>/etc/sysctl.conf"

Run the above only once!
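If remembering to run it only once is a concern, the append can be made safe to repeat. The following is a minimal sketch; the function name `add_sysctl_once` is hypothetical, and it needs to run as root to write to `/etc/sysctl.conf`.

```shell
# Append a sysctl setting to a config file only if that exact line is absent.
# $1 is the setting line, $2 the file (defaults to /etc/sysctl.conf).
add_sysctl_once() {
    local line="$1" conf="${2:-/etc/sysctl.conf}"
    grep -qxF -- "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}
```

Calling `add_sysctl_once "vm.overcommit_memory=1"` twice leaves a single line in the file.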

Next, each process has an OOM score. It is a sort of ugly meter: the higher the score, the more likely the OOM killer will terminate that proc first when memory runs out. Great news! You can adjust the OOM score of your dashd so the kernel prefers to kill some other process over your money maker.

Code:

sudo bash -c "echo -1000 >/proc/$$/oom_score_adj"
dashd

In the above code, you add the echo bit to the shell script that launches your dashd. You can check your OOM score with htop (gotta add the column first) or via proc fs.

Code:

cat /proc/$(pidof dashd)/oom_score

Your dashd should have a score of 0.
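For the `journalctl` check mentioned at the top of this post, a rough sketch of how to list which processes were OOM-killed is shown below. It assumes the kernel logs lines containing "Killed process <pid> (<name>)", which is the usual wording but may differ between kernel versions; the function name `oom_victims` is hypothetical.

```shell
# Read kernel log lines on stdin and print the name of each OOM-killed process.
oom_victims() {
    sed -n 's/.*[Kk]illed process [0-9][0-9]* (\([^)]*\)).*/\1/p'
}
```

Typical use would be `journalctl -k --no-pager | oom_victims | sort | uniq -c`. If anything other than dashd shows up, that is the case where a reboot is worth considering.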

UdjinM6 · Aug 1, 2018

The OOM killer usually kills dashd when it eats way too much memory. Make sure you have enough RAM or that you have swap configured. On a 1 GB VPS, for example, a swap file is a must. You can add one this way:

Code:

# create 4gb swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# to make changes permanent open fstab...
sudo vi /etc/fstab
# ... and add this line
/swapfile   none    swap    sw    0   0
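To confirm that the swap is actually active afterwards, something like the sketch below could be used. It assumes the standard `SwapTotal:` line in `/proc/meminfo` (reported in kB); the function name `swap_total_mb` is hypothetical.

```shell
# Print total swap in MiB, read from /proc/meminfo (or a file given as $1).
swap_total_mb() {
    awk '/^SwapTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

A result of 0 would mean no swap is enabled; `free -m` shows the same figure in its Swap row.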
