Hard disk


dev files for hard disks

As disk interfaces evolved, the device names for hard disks changed:

/dev/nvme0n1p1 Non-Volatile Memory Express device at controller 0, namespace 1, partition 1
/dev/sda1 SCSI disk (also used for SATA and USB disks), a=first disk, partition 1
/dev/hda1 IDE/parallel ATA disk, a=first disk, partition 1
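
The naming schemes above can be told apart by their prefix. A minimal sketch, assuming only the three schemes listed here; disk_kind is a hypothetical helper name, not an existing tool:

```shell
#!/bin/bash
# Classify a device path by its naming scheme (hypothetical helper).
# Only the three schemes listed above are recognized.
disk_kind() {
    case "$1" in
        /dev/nvme[0-9]*n[0-9]*) echo "nvme" ;;     # NVMe controller/namespace
        /dev/sd[a-z]*)          echo "scsi" ;;     # SCSI, SATA, USB
        /dev/hd[a-z]*)          echo "ide" ;;      # old IDE/parallel ATA
        *)                      echo "unknown" ;;
    esac
}
```

For example, disk_kind /dev/nvme0n1p1 prints nvme.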

Frequent disk access

Since disks are wear parts, periodic blinking of the disk access LED is not a good sign.

Check whether this only happens while a user is logged in. If so, find out which files were written during the last minute:

find ~ -type f -mmin -1 -ls 2>/dev/null | tail -20

File indexing might be the cause.

DMA (Direct Memory Access)

DMA allows copying data from the hard disk to memory without passing it through the CPU bottleneck. Enabling DMA increases hard disk access performance significantly (measured with hdparm -tT, by more than a factor of 10).

Enabling DMA is done with the hdparm command, but also the kernel has to be configured to support DMA and the DMA controller on the motherboard has to be selected.

hdparm /dev/hda shows data about your hard disk

hdparm -I /dev/hda shows more data about your hard disk

hdparm -i /dev/hda displays DMA mode info

hdparm -d /dev/hda shows DMA status

hdparm -d 1 /dev/hda turns DMA on

hdparm -d 0 /dev/hda turns DMA off

hdparm -tT /dev/hda tests it

Partitions

To create partitions the basic fdisk, or the slightly more advanced cfdisk, parted or sfdisk can be used. The hard disk must not be mounted, and be aware that when you write the partition table you lose your data.

To see what you have (before doing the damage), lsblk -p, fdisk -l, parted -l, df -h, or blkid list all the devices.

For parted there is the GUI qtparted.

A typical sequence to partition an SSD for a UEFI-only system that does not swap to the SSD could look as follows:

parted

print devices

parted -a optimal /dev/sdg

mklabel gpt

unit mib

mkpart primary 1 64 (starting at 0 instead of 1 might be critical, not possible, or not performant)

name 1 boot

mkpart primary 64 -1

name 2 rootfs

set 1 boot on

print

Model: Jmicron Corp. (scsi)
Disk /dev/sdg: 114473MiB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start    End        Size       File system  Name    Flags
 1      1.00MiB  64.0MiB    63.0MiB                 boot    boot, esp
 2      64.0MiB  114472MiB  114408MiB               rootfs

quit

mkfs.ext4 /dev/sdg2

mkfs.vfat /dev/sdg1
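
The interactive sequence above can also be given to parted in one non-interactive call. The following is only a sketch that prints the command instead of running it, so nothing is written to any disk. parted_plan is a hypothetical helper name, and two details differ from the interactive session as assumptions: units are written explicitly as MiB, and the end of the second partition is given as 100% instead of -1.

```shell
#!/bin/bash
# Print (do NOT execute) a non-interactive parted call for the layout above.
# parted_plan is a hypothetical helper; the target device is the first argument.
# Assumption: 100% is used instead of -1 for the end of the last partition.
parted_plan() {
    local dev="$1"
    echo "parted -s -a optimal $dev" \
         "mklabel gpt" \
         "mkpart primary 1MiB 64MiB" \
         "name 1 boot" \
         "mkpart primary 64MiB 100%" \
         "name 2 rootfs" \
         "set 1 boot on"
}
```

Calling parted_plan /dev/sdg shows the command for review; run the printed line as root only after double checking the device name, since it destroys the existing partition table.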

parted

parted -l to list all the devices

parted /dev/sdb to start parted for working on /dev/sdb

Commands to be used at the parted command line prompt:

mklabel msdos or mklabel gpt to create partition tables

print to see what parted sees

select /dev/sdc to switch to /dev/sdc

resizepart to resize a partition

rm to remove partition

set to set partition flags such as boot, swap, esp and others

Defragment

e4defrag /home/user/directory/ defragments a directory

e4defrag /dev/sda5 defragments all files on a partition

Test programs

More advanced test programs than hdparm -tT /dev/hda are the classic bonnie or the C++ version bonnie++.

To find out which big files and directories have filled up your hard disk:

emerge filelight

filelight then shows the disk usage as a graphical picture.

Show bad blocks: dumpe2fs -b /dev/hda3

Show superblock: dumpe2fs -h /dev/hda3

Show all stuff: dumpe2fs /dev/hda3

Show what mke2fs would do, without creating a filesystem: mke2fs -n /dev/hda3

To change the parameters for automatic checking use tune2fs -c (maximum mount count) or tune2fs -i (time interval between checks)

The following programs check unmounted file systems. Boot from a live CD such as http://www.knopper.net/knoppix/ and run them:

A front end for various checkers: fsck /dev/hda3

Force a check even if the file system is marked clean: e2fsck -f /dev/hda3

Look for bad blocks: e2fsck -c /dev/hda3

Automatic repair: e2fsck -p /dev/hda3

Search for bad blocks (-s shows the progress): badblocks -s /dev/hda3

A mounted file system can be checked using the read-only option -n, although the result may be unreliable while the file system is in use:

e2fsck -n /dev/hda3

The ext3 file system has a journaling function, so a manual check should normally not be required.

kjournald is the kernel thread that handles the journal in the background.

To recover lost partitions or repair a damaged partition table: emerge testdisk

It is also interesting to know how often your disks are accessed. The command vmstat -d shows that, and the result can be used to decide which data to move from one disk to another, e.g. to reduce write access to SSD devices.
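
To see quickly which disk receives the most writes, the vmstat -d table can be filtered. A small sketch, assuming the usual layout of two header lines followed by one line per disk, with the disk name in column 1 and the total number of writes in column 6; writes_per_disk is a hypothetical helper name:

```shell
#!/bin/bash
# Read `vmstat -d` output on stdin and print "<disk> <total writes>",
# busiest writer first.
# Assumes: two header lines, disk name in column 1, total writes in column 6.
writes_per_disk() {
    awk 'NR > 2 { print $1, $6 }' | sort -k2,2nr
}
```

Use it as vmstat -d | writes_per_disk.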

e2label /dev/<disk> shows the label of an ext2/ext3/ext4 partition

e2label /dev/<disk> <name> sets it

Desktop environments also come with GUIs such as gnome-disk-utility

SMART (Self-Monitoring Analysis and Reporting Technology)

Drives are wear parts. They have a lot of recovery features to prevent disasters. This works so well that most users do not check the health of their drives regularly. But everything has its limits, so when a failure finally happens the disaster is big.

Important

Keep a broken hard disk around to test SMART

As root, check whether the drives support SMART. List the drives with lsblk

For hard disks run hdparm -I /dev/<sd?> | grep SMART; /dev/nvme<n> devices do not work with hdparm

Install smartmontools and run smartctl -i /dev/nvme0

It is wise to replace disks before a disaster happens. SMART is built into the drives so that you can act before that point.

Note

USB sticks and SD cards usually do not support SMART

The data is stored in the device and cannot be altered by a user. So an old device cannot be made new. Logged errors remain.

/usr/share/smartmontools/drivedb.h is the drive database and can be updated with the update-smart-drivedb program

To get all attributes supported by the device (all the checks) run smartctl -A /dev/nvme0n1. See https://en.wikipedia.org/wiki/S.M.A.R.T.#ATA_S.M.A.R.T._attributes

Run a test to get updated data before looking at the drive health.

smartctl -c /dev/nvme0n1 shows which tests are supported.

To start a self-test (smartctl reports when the test is expected to finish, and the drive can still be used normally while it runs):

smartctl -t short /dev/nvme0n1

Important

smartctl -t long /dev/nvme0n1 can take a couple of hours

When done, run smartctl -H /dev/nvme0n1 to see the overall health.
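
For use in scripts, the health verdict can be turned into an exit status. A minimal sketch, assuming the two result lines that smartctl -H commonly prints ("test result: PASSED" for ATA and NVMe drives, "SMART Health Status: OK" for SCSI drives); smart_healthy is a hypothetical helper name:

```shell
#!/bin/bash
# Succeeds (exit status 0) if `smartctl -H` output read from stdin
# reports a healthy drive, fails otherwise.
# Assumed result lines: "...test result: PASSED" (ATA, NVMe)
#                       "SMART Health Status: OK" (SCSI)
smart_healthy() {
    grep -Eq 'test result: PASSED|SMART Health Status: OK'
}
```

Use it as smartctl -H /dev/nvme0n1 | smart_healthy || echo "check the drive". Any output that does not contain one of the two lines, including an error message from smartctl itself, counts as not healthy.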

To see more details use smartctl -l selftest /dev/nvme0n1 or smartctl -l error /dev/nvme0n1. If there are errors, look at when they happened: a single error caused by powering a USB drive up and down might not be a reason to put the drive into the garbage.

gsmartcontrol is a GUI front end that quickly shows what is going on https://gsmartcontrol.shaduri.dev/

SMART and USB devices

smartctl on USB devices might fail with Unsupported USB bridge, indicating the USB ID, e.g. 04b4:6830. Run lsusb to get a hint about the USB bridge, such as: Bus 010 Device 006: ID 04b4:6830 Cypress Semiconductor Corp. CY7C68300A EZ-USB AT2 USB 2.0 to ATA/ATAPI

Then look up the -d option in man smartctl and run smartctl -d usbcypress -A /dev/sdi
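
When many USB devices are attached, it helps to reduce the lsusb listing to the vendor:product IDs and compare them with the ID in the smartctl error message. A small sketch; usb_ids is a hypothetical helper name:

```shell
#!/bin/bash
# Print the vendor:product ID of each lsusb line read from stdin.
usb_ids() {
    sed -n 's/.*ID \([0-9a-f]\{4\}:[0-9a-f]\{4\}\).*/\1/p'
}
```

Use it as lsusb | usb_ids. For the Cypress bridge line shown above it prints 04b4:6830.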

SMART Attributes

There are many attributes that can indicate end of life. Unfortunately those attributes are not well coordinated between the different manufacturers.

Attribute 5, Reallocated_Sector_Ct, shows in the RAW_VALUE column how many bad sectors have been replaced by spare sectors. A rising value means the drive keeps using up its spares. TYPE Pre-fail marks an attribute as critical: when its normalized value drops to the threshold, a failure is considered imminent, whereas Old_age attributes describe normal wear.

Available_Reservd_Space decreases over time and shows how much spare space is still available to replace bad space.
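
To watch a single attribute over time, its raw value can be cut out of the attribute table. A minimal sketch, assuming the ATA table layout of smartctl -A with the attribute ID in column 1 and RAW_VALUE starting in column 10 (NVMe drives print a different key/value list, which this does not handle); smart_raw is a hypothetical helper name:

```shell
#!/bin/bash
# Print the RAW_VALUE of one attribute from `smartctl -A` output on stdin.
# Argument: attribute ID, e.g. 5 for Reallocated_Sector_Ct.
# Assumes the ATA table layout: ID# in column 1, RAW_VALUE from column 10.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}
```

Use it as smartctl -A /dev/sda | smart_raw 5. Only the first word of the raw value is printed, so additional text that some drives append is dropped.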

SMART daemon

There is the smartd daemon

SMART daemon and systemd

For systemd, check the daemon with systemctl status smartd.service or systemctl is-active smartd.service, and enable and start it with systemctl enable smartd.service and systemctl start smartd.service

journalctl -u smartd.service -b shows the logs since the last boot, and journalctl -u smartd.service --since today shows the logs of today

SMART daemon and OpenRC

For OpenRC, start it with /etc/init.d/smartd start and add it to the default runlevel with rc-update add smartd default

For the logs use cat /var/log/syslog | grep smartd or look in /var/log/messages.

smartd.conf

The behavior is configured in /etc/smartd.conf (see man smartd.conf), and how the daemon is started is configured in /etc/conf.d/smartd

smartd.conf can have lines starting with:

DEFAULT sets default parameters that are valid for the following lines until the next DEFAULT line appears. A use case is setting the email address for notifications
/dev/sd<x> or another device name
DEVICESCAN: every line in /etc/smartd.conf after this line is ignored. Device-specific configuration must therefore be placed before the DEVICESCAN line.
```
DEVICESCAN -H -m root@example.com
```

DEVICESCAN monitors all devices (except the ones configured in the lines before it) and can take parameters such as -H to report only the health status and -m to set where problems are emailed.

Important

By default smartd does not run self-tests periodically; it only reads the health data of the drives and reacts to it.

smartd runs and polls the drives periodically. The default period is 30 min = 1800 s.

This can be changed with the command line parameter -i <seconds> when calling smartd. Command line parameters are not set in /etc/smartd.conf but in /etc/conf.d/smartd

To test whether /etc/smartd.conf is OK, run smartd -q onecheck

To run tests, /etc/smartd.conf needs the -s <regular expression> directive. It can be set per drive as /dev/sd<x>, or on a DEFAULT or DEVICESCAN line. The <regular expression> starts with a letter indicating the test type (S for short, L for long), followed by a cron-like expression of the form MM/DD/d/HH (month, day of month, day of week, hour) telling when to run it

/dev/sda -a -s S/../.././..

This requests a short test every hour.

Note

A started test takes some time, so the result is not immediately available. It will show up during one of the following smartd periods.

Testing every hour might be good when testing smartd itself, but it puts stress on mechanical hard disks and causes annoying noise for no real benefit. Once a week or once a month is good enough to catch an upcoming disk problem.
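
Before putting a schedule into /etc/smartd.conf, the regular expression can be tried against sample dates. smartd compares the expression with a string of the form T/MM/DD/d/HH, where T is the test type and d the day of the week (1 = Monday to 7 = Sunday). The sketch below imitates that comparison in bash, assuming the whole string has to match; schedule_matches is a hypothetical helper name:

```shell
#!/bin/bash
# Succeeds if a smartd -s schedule would fire for the given test type and time.
# Arguments: REGEX TYPE MONTH DAY WEEKDAY HOUR
#   TYPE: S (short) or L (long); WEEKDAY: 1 = Monday ... 7 = Sunday
# Assumption: the expression must match the whole T/MM/DD/d/HH string.
schedule_matches() {
    local pattern='^('"$1"')$'
    local stamp="$2/$3/$4/$5/$6"
    [[ "$stamp" =~ $pattern ]]
}
```

The hourly expression from above matches any time, e.g. schedule_matches 'S/../.././..' S 01 15 3 14 succeeds. A weekly long test on Saturdays at 03:00 could be written as L/../../6/03, which only matches when the weekday is 6 and the hour is 03.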

cat /var/log/syslog | grep smartd or journalctl -u smartd.service -b should show, within 0 to 30 min of the expected time, that the test starts: Device: /dev/sdi [USB JMicron], starting scheduled Short Self-Test

and hopefully afterwards:

Device: /dev/sdb [SAT], previous self-test completed without error

cat /var/log/syslog | grep smartd or journalctl -u smartd.service -b will probably also show a lot of messages about changing attributes, such as #194, the temperature. This is useful for testing and for seeing that smartd runs every 30 min, but it is annoying during regular operation.

These temperature messages can be suppressed with

DEFAULT -I 194

To avoid errors about missing drives, add -d removable for USB devices that can be plugged in and out.

Warnings such as /dev/sdb, type changed from 'scsi' to 'sat' can be removed by adding -d sat. There is also -d nvme.

smartd messages

smartd has been designed to send emails in case of an issue. Today sending emails is problematic, since there is so much spam around and a smartd email must not end up in a spam filter.

Scripts other than the default one can be passed with the -M option, see the examples in /usr/share/doc/smartmontools-<xx>/examplescripts

Not sending smartd emails

It is also possible to send no email and just run a script: -m <nomailer> -M exec <script>

The script can use the environment variables set by smartd and might look as follows

#!/bin/bash
# Script to log smartd notifications

LOGFILE="/var/log/smartd-notify.log"

# Log the date and detailed information from smartd
echo "$(date): Device: $SMARTD_DEVICE, Failure: $SMARTD_FAILTYPE, Message: $SMARTD_MESSAGE" >> "$LOGFILE"

Attach a damaged disk or simulate a disk failure with:

SMARTD_DEVICE="/dev/sda" SMARTD_FAILTYPE="Test Failure" SMARTD_MESSAGE="This is a test message" <path and name of the above script>

Sending smartd emails

For emails smartd calls the warning script /etc/smartd_warning.sh. This script creates the warning message and then calls the program that mails it.

Emails are sent to the address given with -m <email>. Usually mailutils is used for that, see https://mailutils.org/. Set up mailutils, test that it is possible to send mail from root to a user, and then use DEFAULT -m <username>@<hostname>.localdomain

Today mail servers use authentication and certificates. Without them the emails will not pass spam filters and will hardly arrive, so it is not easy to send mail that actually reaches its destination. The way out is to use a mail account on a certified mail server such as https://mail.google.com together with an application password.

See https://linuxconfig.org/how-to-configure-smartd-and-be-notified-of-hard-disk-problems-via-email or write a Python script for it.

