The watchdog is a service that monitors various conditions and triggers a hardware reset when one of them fails. Much of the system must already be running correctly before the watchdog comes alive, so it does not guard against hardware failures; its purpose is to detect software problems and clear them with a hardware reboot.
It can also send e-mails, but this requires a complex e-mail setup and offers little control over what is sent.
To install it, run sudo apt-get install watchdog, then read man watchdog and man watchdog.conf.
For the Raspberry Pi, the kernel module bcm2835_wdt creates the file /dev/watchdog. For devices without a hardware watchdog, the kernel source includes a software watchdog driver.
basename $(readlink /sys/class/watchdog/watchdog0/device/driver) shows the device driver in use.
Writing to the device file starts the watchdog: echo 1 | sudo tee /dev/watchdog or sudo bash -c 'echo 1 > /dev/watchdog'. A reset occurs after the default timeout of 60s.
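The timeout that actually applies can usually be read from sysfs before the device is armed. The following is a minimal sketch, assuming the kernel exposes the attributes identity, state, timeout and timeleft under /sys/class/watchdog/watchdog0 (this depends on the kernel configuration); the function name wd_attrs is made up for illustration.

```shell
# Print selected attributes of a watchdog device from sysfs.
# $1: sysfs directory of the device (default: /sys/class/watchdog/watchdog0)
# Attributes that the kernel does not expose are silently skipped.
wd_attrs() {
    wd_dir="${1:-/sys/class/watchdog/watchdog0}"
    for wd_attr in identity state timeout timeleft; do
        if [ -r "$wd_dir/$wd_attr" ]; then
            printf '%s=%s\n' "$wd_attr" "$(cat "$wd_dir/$wd_attr")"
        fi
    done
}
```

Calling wd_attrs without an argument reads the first watchdog device. Reading these files does not open /dev/watchdog, so it does not start the watchdog.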
The watchdog daemon is not limited to kicking (petting) the watchdog, as man watchdog and the configuration file /etc/watchdog.conf show.
To start the watchdog automatically and prevent resets, the watchdog daemon is required.
In /etc/watchdog.conf, set (or uncomment) the device to use:
watchdog-device = /dev/watchdog
The file /etc/watchdog.conf must have some tests configured, otherwise the watchdog will not do anything.
Make sure the repair and test scripts use absolute paths, even if they are in /etc/watchdog.d.
Do not put comments after the commands; put them on separate lines.
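A quick way to spot comments that follow a setting on the same line is a grep over the configuration file. This is only a sketch: the function name find_trailing_comments is made up, and the pattern flags every non-comment line that contains a #, so a value that legitimately contains # would also be reported.

```shell
# Print (with line numbers) every line that has a '#' after a setting.
# Full-line comments, which start with '#', are not reported.
# $1: file to check (default: /etc/watchdog.conf)
find_trailing_comments() {
    grep -nE '^[[:space:]]*[^#[:space:]][^#]*#' "${1:-/etc/watchdog.conf}"
}
```

If find_trailing_comments prints nothing, no setting is followed by a comment on the same line.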
Then run sudo systemctl enable watchdog and sudo systemctl start watchdog.
Check with sudo systemctl status watchdog to see what can and what will trigger the watchdog
Stopping the service with sudo systemctl stop watchdog is not a useful test: stopping the watchdog service is an allowed operation, so nothing will happen.
As sudo lsof /dev/watchdog shows (install with sudo apt install lsof), there is a wd_keepal process running that pets the dog (lsof shortens the name wd_keepalive):
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
wd_keepal 1788 root    4w   CHR 10,130      0t0  130 /dev/watchdog
sudo kill -9 <PID> kills the watchdog process wd_keepal; a reset then occurs.
Once started, the watchdog requires periodic kicks or it resets the machine; this makes sure the watchdog process is alive. In addition, the daemon can run tests such as:
CPU load too high
Memory usage (used and free)
Machine temperature too high (can also create warnings)
Status of a file
Check for running processes and daemons using files containing the PID
Ping IPv4 addresses
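Several of these tests map to settings in watchdog.conf. The sketch below writes a sample file to a scratch path so it can be reviewed before anything is copied into /etc/watchdog.conf. The function name write_sample_conf, the default output path and every value (thresholds, file names, the IP address) are illustrative assumptions; man watchdog.conf remains the reference for the options supported by the installed version.

```shell
# Write a sample configuration to the given path.
# $1: output file (default: ./watchdog.conf.sample)
# Comments sit on their own lines, never after a setting.
write_sample_conf() {
    cat > "${1:-./watchdog.conf.sample}" <<'EOF'
watchdog-device = /dev/watchdog
# seconds between two rounds of checks
interval = 10
# fail if the 1-minute load average exceeds this value
max-load-1 = 24
# minimum amount of free memory, counted in pages
min-memory = 1
# fail if this file is missing or unchanged for 300 seconds
file = /var/log/syslog
change = 300
# PID file of a daemon that must be running
pidfile = /var/run/sshd.pid
# IPv4 address that must answer pings
ping = 192.168.1.1
EOF
}
```

Running write_sample_conf /tmp/watchdog.conf.sample produces a file that can be compared with the existing configuration, for example with diff.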
journalctl -u watchdog -f shows live (follows) what the watchdog unit logs.
The /var/log/watchdog directory is used for what the repair and test scripts write. A common place for the scripts is /etc/watchdog.d
The test script is called on every interval to find out if something is wrong. The watchdog will not tell if it has found an error.
The test script is called by the watchdog with $1 = test. This means: please run the test.
Return 0 if everything is OK, otherwise 1.
A return value of 1 indicates that there is a problem. In this case the watchdog calls the script again with $1 = repair. The script can then attempt a fix and return 0 once fixed, or return 1 and simply wait until the watchdog triggers.
The script may use $# (number of arguments), $@ (all arguments), or $2, $3 to get additional information from the watchdog. The content depends on the error.
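The test/repair protocol described above can be sketched as a single script. This is a minimal illustration, not a finished script: MARKER_FILE and its default path /run/myapp/alive are hypothetical, the function name watchdog_hook is made up, and the repair branch only logs its arguments instead of fixing anything.

```shell
#!/bin/sh
# Sketch of a combined test/repair script for the watchdog daemon.
# MARKER_FILE is a hypothetical health marker, not a watchdog setting.
MARKER_FILE="${MARKER_FILE:-/run/myapp/alive}"

watchdog_hook() {
    case "$1" in
        test)
            # 0 = everything OK, 1 = problem found
            [ -e "$MARKER_FILE" ] && return 0
            return 1
            ;;
        repair)
            # $2, $3, ... may carry details about the error.
            echo "repair requested, arguments: $*" >&2
            # A real script would attempt a fix here and return 0 on success.
            # Returning 1 leaves the problem to the watchdog.
            return 1
            ;;
        *)
            echo "usage: $0 test|repair" >&2
            return 2
            ;;
    esac
}

# Dispatch only when arguments were passed, as the watchdog daemon does.
# The exit status of the script is then the return value of watchdog_hook.
if [ "$#" -gt 0 ]; then
    watchdog_hook "$@"
fi
```

Keeping both branches in one function makes it easy to try the script by hand, for example by running it with the argument test and checking the exit status with echo $?.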