Is a reboot recommended when OOM killer eats your dashd?

dashameter

New Member
Jan 31, 2018
10
7
3
50
The web is full of opinions that when OOM killer eats your process it might have eaten other system processes as well and it's best to reboot the machine entirely. I see scripts being discussed here that simply call `./dashd` to restart the process. What is the consensus on whether that alone is good enough or if the machine should be restarted?

If the machine needs a restart, should the watchdog script be run from root's crontab, or should the user be allowed to run reboot via visudo:

Code:
user  ALL=NOPASSWD: /sbin/reboot
While having

Code:
@reboot "/home/user/.dashcore/dashd"
in your crontab?

Adapting the script from the other thread:

Code:
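#!/bin/bash
# dash-cli prints nothing on stdout when it cannot reach dashd, so 0 lines means "not running".
DASHD_RUNNING=$( /home/user/.dashcore/dash-cli help 2>/dev/null | wc -l )
if [ "$DASHD_RUNNING" -eq 0 ] ; then
       # sudo matches the NOPASSWD rule above when this runs from the user's crontab.
       sudo /sbin/reboot
fi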
I prefer calling dash-cli rather than using ps: I might have other scripts with dashd in their name running, and if dashd isn't running, dash-cli will throw an error.
 

TroyDASH

Well-known Member
Jul 31, 2015
1,251
794
183
I have always used a script that simply restarts dashd without reboot and have almost never had problems. The most common reasons for irrecoverable dashd crashes for me have been (1) the hosting provider (VPS) having a problem, or (2) insufficient SWAP memory or insufficient disk space -- neither of those issues would be truly fixed by rebooting (frequently rebooting due to low memory might only band-aid a symptom of a bigger problem that needs to be fixed). Of course, in the rare instance when dashd does crash and does not recover in time, then I will do a reboot regardless for good measure, since the masternode is already kicked out of the queue anyway.
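A restart-only watchdog of the kind described here could look like the sketch below. It is not the script from this post: the paths, the `DASH_CLI` and `DASHD` variables, the function name `ensure_dashd`, and the `--run` switch are all assumptions for illustration. It reuses the `dash-cli help` check from the first post and starts `dashd -daemon` only when that check fails.

```shell
#!/bin/bash
# Hypothetical restart-only watchdog; paths are assumptions, adjust to your install.
DASH_CLI="${DASH_CLI:-/home/user/.dashcore/dash-cli}"
DASHD="${DASHD:-/home/user/.dashcore/dashd}"

# Returns 0 if dashd answers RPC; otherwise tries to start it and returns 1.
ensure_dashd() {
    if "$DASH_CLI" help >/dev/null 2>&1; then
        echo "running"
        return 0
    fi
    echo "restarting"
    "$DASHD" -daemon >/dev/null 2>&1
    return 1
}

# Only act when called with --run (a made-up switch), so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    ensure_dashd
fi
```

Saved under a name such as `watchdog.sh` (hypothetical), it could be called from the user's crontab with `--run`. Note that `dash-cli` also fails while dashd is still loading, so a short cron interval may trigger a harmless second start attempt.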
 

dashameter

Thanks for the response. The OOM killer ate my dashd once in a year, and I doubled the RAM and swap in response. It happened during a tx spike, so it should probably be fine now.

In what scenario does your dashd crash because of the VPS host? I'm assuming that either the host's network goes out, in which case dashd hopefully doesn't crash but keeps running, or the server gets rebooted without your say-so, in which case dashd should auto-start on boot. In neither case would a watchdog script calling dashd solve the issue. Is there a scenario that I'm missing?

If rebooting the machine is not necessary to recover from an OOM kill, do we need a script at all? Just run dashd every 10 mins via cron; if it is already running it won't start up a second time, and done. Is there a reason to parse the process list instead?
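The cron-only approach could be as small as a single crontab entry. The sketch below just builds that entry. The path and the helper name `cron_entry` are assumptions, and it relies on the behaviour described above, that a second `dashd` exits when one is already running.

```shell
#!/bin/bash
# Hypothetical path; adjust to where dashd lives on your machine.
DASHD="${DASHD:-/home/user/.dashcore/dashd}"

# Print a crontab entry that tries to start dashd every 10 minutes.
cron_entry() {
    printf '*/10 * * * * %s -daemon >/dev/null 2>&1\n' "$DASHD"
}
```

One way to install it would be `(crontab -l 2>/dev/null; cron_entry) | crontab -`, run once, since running it again would append a duplicate line.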

Thanks a lot!
 

TroyDASH

Sometimes the host provider needs to reboot the node for scheduled maintenance or because of some other problem. Those reboots are usually fast enough for dashd to recover. Full outages are rarer, but they do happen even to some of the most reliable services like AWS, and they tend to knock out a chunk of masternodes at a time.

Good point about dashd. I don't really know of any downside to just continually running dashd instead of parsing the process list, if dashd won't start again when it is already running. Unless maybe dashd figuring out not to run is a more expensive operation than parsing the process list? Even if it works either way, I'm probably not going to touch mine because it's been going so well for so long the way it is :)
 

xkcd

Member
Masternode Owner/Operator
Feb 19, 2017
103
72
78
australia
mnowatch.org
Dash Address
XpoZXRfr2iFxWhfRSAK3j1jww9xd4tJVez
@dashameter What are the specs of your VPS? RAM and swap, can you post them? Have you looked at journalctl to see what other procs got OOM-killed? You only need to reboot if the OOM killer got some other proc and you are no longer sure of the state of your machine.

That said, there are two settings that should help you with an overzealous OOM killer. By default your VPS may deny memory to a proc that merely tries to allocate another X bytes over and above the available RAM, which is not desired behaviour IMO. The reason is that many programs over-allocate RAM but never end up using it: if a proc grabs 1GB but only uses 500MB, then 500MB is lost, and if you only have 1GB, the next malloc() fails.

Great news, there is another way: your kernel can be configured to allow over-allocation of RAM, so if you have 2GB and malloc() 3GB, the kernel will say, yeah OK, have it, and the OOM killer won't get you until you initialise more than 2GB. How to turn this on?


Code:
# Memory management.
sudo sysctl -w vm.overcommit_memory=1

# Make it permanent.
sudo bash -c "echo \"vm.overcommit_memory=1\">>/etc/sysctl.conf"
Run the above only once!
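Because the `echo ... >>` line appends every time it runs, a guarded variant can make the step safe to repeat. This is only a sketch: `ensure_line` is a made-up helper, not part of sysctl, and it checks for an exact matching line only.

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already present.
# ensure_line is a hypothetical helper name used for illustration.
ensure_line() {
    local line="$1" file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Run as root, it might be used as `ensure_line "vm.overcommit_memory=1" /etc/sysctl.conf`, followed by `sysctl -p` to load the file.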

Next, each process has an OOM score. It is a sort of ugly meter: the higher the score, the more likely the OOM killer will terminate that proc first when memory runs out. Great news! You can adjust the OOM score of your dashd so the kernel prefers to kill some other process over your money maker.


Code:
sudo bash -c "echo -1000 >/proc/$$/oom_score_adj"
dashd
In the above code, you add the echo bit to the shell script that launches your dashd. You can check your OOM score with htop (gotta add the column first) or via proc fs.

Code:
cat /proc/$(pidof dashd)/oom_score
Your dashd should have a score of 0.
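To follow up on the journalctl suggestion, the kernel log can be filtered for OOM kills to see whether anything besides dashd was taken. This is a minimal sketch that assumes the kernel logs lines of the form `Killed process <pid> (<name>)`; the exact wording varies between kernel versions, and `oom_victims` is a hypothetical helper name.

```shell
#!/bin/bash
# Read kernel log lines on stdin and print the unique names of processes
# that the OOM killer terminated (assumes "Killed process <pid> (<name>)").
oom_victims() {
    sed -nE 's/.*Killed process [0-9]+ \(([^)]*)\).*/\1/p' | sort -u
}
```

A possible invocation is `journalctl -k --no-pager | oom_victims`, which covers kernel messages from the current boot. If only dashd shows up, that supports restarting dashd alone rather than rebooting.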
 

UdjinM6

Official Dash Dev
Dash Core Team
Moderator
May 20, 2014
3,638
3,538
1,183
OOM Killer usually kills dashd when it eats way too much memory. Make sure you have enough RAM or that you have swap configured. On a 1 GB VPS, for example, a swap file is a must. You can add one this way:
Code:
# create 4gb swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# to make changes permanent open fstab...
sudo vi /etc/fstab
# ... and add this line
/swapfile   none    swap    sw    0   0
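
After adding the swap file it is worth confirming that the kernel actually sees it. The sketch below reads `SwapTotal` from `/proc/meminfo`; the function name and the optional file argument are illustrative only.

```shell
#!/bin/bash
# Print total swap in MiB, read from a meminfo-style file (default: /proc/meminfo).
# The SwapTotal value in that file is reported in kB.
swap_total_mib() {
    awk '/^SwapTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}
```

Calling `swap_total_mib` with no argument reports the live value, and `swapon --show` lists the active swap areas if more detail is needed.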