MEW-DOS v1.1

Setting up a homegrown NAS written by neko

Preface

About a year back I decided to install a new NAS in my lab and documented the process. The notes have been lying around on my wiki for a while in pretty raw shape.

Since I was asked about it recently, I decided to rework the notes into a proper document.

Please let me know if any information here is incorrect or imprecise, or if you have ideas on how to improve it!

Intro

This document details the installation process of a homegrown NAS system running an md+lvm combination. md does the heavy lifting of the software RAID, and lvm is used to keep volumes tidy.

In this system, the operating system runs off its own SSD and is not stored on the array, therefore no further steps need to be taken to make the md array bootable.

The decision to use software RAID is often an economic one, but I think there’s more to it than just saving some money for a proper RAID controller.

Unless you are ready to pay for a proper server-grade RAID controller with enough cache and a working battery backup unit, software RAID is actually preferable in most situations. Software RAID is often called “slow”, but the reality is that modern processors are more than fast enough for the parity work a software RAID requires. The flexibility gain should also be taken into account. The only real drawback is the lack of a dedicated cache for the RAID.

Most off-the-shelf consumer-grade NAS devices use software RAID as well.

Installation sequence

  1. Install Debian
  2. Set up md (create)
  3. Set up md config for safety
  4. Initialise lvm
  5. Add lvm volumes
  6. Set up smb
  7. Set up exim and mdadm monitoring
  8. Set up hdparm and S.M.A.R.T. checks

RAID setup with mdadm

First, initialise a new RAID 5 on the chosen devices: mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc (note that --create needs the name of the md device to create, /dev/md0 here).

The RAID metadata is written to the drives themselves, so the array will automatically be detected by the md module and assembled accordingly.
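
The initial resync of a new array can take a while. You can watch its progress at any time; a quick check, assuming the array was created as /dev/md0 as above:

cat /proc/mdstat            # shows the state and resync progress of all md arrays
mdadm --detail /dev/md0     # detailed status of the array, including its member drives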

Once the RAID has been initialised, mdadm --detail --scan will list the currently detected md arrays. For safety, copy the output into mdadm.conf to make sure the device node doesn’t change one day.[1]
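
A minimal sketch of that step, assuming the stock Debian config location /etc/mdadm/mdadm.conf:

mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # append the detected array definition
update-initramfs -u                              # optional here (the OS is not on the array), keeps the initramfs copy of the config in sync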

Finally, set up md monitoring in the config. (On Debian, the configuration already contains placeholders for this.)[2]

# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR notifications@example.com

# definitions of existing MD arrays
# this is a copy of mdadm --detail --scan
ARRAY /dev/md/0  metadata=1.2 UUID=83a1806c:d3d5461e:550b91bb:cd59045b name=nas-test:0

# This configuration was auto-generated on Thu, 20 Feb 2020 09:58:14 +0000 by mkconf
MAILFROM nas@example.com

If you are not going to use LVM, your RAID is now up and running. You can format your md array, set up hdparm and the system mounts, and finally your SMB server. (SMB setup is not covered in this documentation.)

  • fdisk /dev/md0 to partition the array
  • use gdisk /dev/md0 instead if the array is bigger than 2 TB, to create a GPT table rather than an MBR table (a full example follows below)
  • mkfs.xfs /dev/md0p1 to format the first partition with XFS (recommended)
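
For reference, the whole sequence for an array larger than 2 TB might look like this (gdisk is interactive; the letters are the commands typed at its prompt):

gdisk /dev/md0        # n = new partition (accept the defaults), w = write the GPT and exit
mkfs.xfs /dev/md0p1   # create an XFS filesystem on the new partition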

Now we need to change some parameters for the drives. This is tricky.

Since we are running software RAID, data written to the array first goes through the system’s page cache, and the filesystem journal does its job at that level.

However, after that there is a second layer of cache: the hard drive’s own write cache. Since the journal covers the md array and not the individual drives, we cannot be certain that data handed to a drive has actually reached the platter rather than just its cache (which is the likely case).

In case of a power loss, it is impossible to tell whether data was lost because the transfer from the drive cache to the platter never occurred.

The safe route here is to disable the write cache on all hard drives. This costs performance, since there is no dedicated RAID cache like a real hardware RAID controller would have, but data safety usually matters more. For more info, check out this link (serverfault.com).

Add every drive of the RAID array to /etc/hdparm.conf to disable its write cache:

/dev/sdX { 
	write_cache = off
}
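
hdparm.conf is normally only applied at boot, so you may also want to switch the cache off immediately and verify the setting. A sketch, with /dev/sdX standing in for each member drive:

hdparm -W 0 /dev/sdX   # disable the drive's write cache right away
hdparm -W /dev/sdX     # read back the current write-cache setting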

Finally, create a systemd.mount unit file so the mount is set up automatically (this replaces the traditional fstab entry).

[Unit]
Description=Mount RAID partition

[Mount]
What=/dev/disk/by-uuid/86fef3b2-bdc9-47fa-bbb1-4e528a89d222
Where=/your/mount/point
Type=xfs
Options=defaults

[Install]
WantedBy=multi-user.target

A few things to note here:

  • What= should reference the disk by UUID, not by device file like /dev/md0p1. Device names can change; the UUID will not. You can find the UUID by running blkid. Make sure you use the UUID of the partition, not the whole device.
  • The mount file must be named after the mount point. So if you mount to /mnt/storage, the file should be /etc/systemd/system/mnt-storage.mount, i.e. the / characters in the path are replaced with - (systemd-escape can generate the name for you, see below).
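
If you are unsure about the escaping rules, systemd can generate the correct unit name for you:

systemd-escape -p --suffix=mount /mnt/storage   # prints: mnt-storage.mount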

After setting up the mount file, reload the systemd daemon and enable the mount (so it is reloaded after a reboot):

systemctl daemon-reload
systemctl start mnt-storage.mount
systemctl enable mnt-storage.mount

  • Note: When using this in combination with SMB, add a Before= clause to the [Unit] section of the mount file so the mount is ordered before smbd.service (as sketched below). If the mount point is not available by the time Samba starts, the shares will not work correctly.[3]
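
A minimal sketch of that addition, assuming Samba’s service is named smbd.service on your system:

[Unit]
Description=Mount RAID partition
Before=smbd.service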

LVM configuration

If you decide to run a more elaborate partition setup on your array, LVM is highly recommended. While it adds complexity to the disk setup and might be harder to restore after a catastrophic failure (you should keep off-site backups regardless of your setup), it gives you a lot of flexibility when managing the data stored on the array.

In my case, I want a single partition (a logical volume, in LVM terminology) per SMB share.

First, install LVM:

apt install lvm2

Now initialise a physical volume. A physical volume (PV) is the device that will hold the data stored on the logical volumes. Note that the device must be free of any prior filesystem, since the PV’s metadata is written to the beginning of the device (so again, no configuration file is needed here):

pvcreate /dev/md0

If pvcreate fails due to a filter, it is because pvcreate detected an existing filesystem signature. This is a safety check to make sure you don’t accidentally wipe a filesystem off a drive.

To wipe all remains of a filesystem you can use: wipefs -a /dev/md0

Next, create a volume group. A volume group (VG) is a named entity consisting of one or more physical volumes. All logical volumes have to be associated with a volume group.

vgcreate name /dev/md0

name is the name for the volume group. This is used to identify the volume group later.

Now create your first logical volume (LV):

lvcreate -L 30G -n partname name

partname is the name of the partition you want. Change this to reflect what you will store on it. name is the name of your volume group.

Once this is finished, you can use the logical volume just like any partition on a hard drive. Next, create a filesystem on it.

mkfs.xfs /dev/name/partname

Note that instead of raw device file names, you use the readable names chosen when creating the VG and the LV, respectively.
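
This is also where LVM’s flexibility pays off: a logical volume can be grown later, and XFS can be grown while mounted. A sketch using the example names from above (assumes free space is left in the volume group):

lvextend -L +10G /dev/name/partname   # grow the logical volume by 10 GiB
xfs_growfs /your/mount/point          # grow the XFS filesystem to fill the enlarged volume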

Monitoring

When running a RAID, you always want to monitor your hard drives for failure and for signs of an impending failure (via S.M.A.R.T., covered later). In case of a drive failure, the defective drive has to be replaced as soon as possible to avoid data loss.

mdadm on Debian automatically monitors the configured arrays. To actually receive the alerts, set up a mail address in /etc/mdadm/mdadm.conf. If you followed the instructions above, this will already be somewhere in your mdadm.conf:

MAILADDR notifications@example.com
MAILFROM nas@example.com
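
Once the addresses are configured, you can ask mdadm to send a test alert for every array to verify that mail delivery works (it picks up MAILADDR from the config):

mdadm --monitor --scan --test --oneshot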

It is recommended to use a smarthost configuration with exim4 to relay these emails to an external mailbox. (Not covered in this documentation.)

In addition to monitoring the array for failures with mdadm, one should always run S.M.A.R.T. checks on the drives to determine their current health. This can help identify an upcoming drive failure beforehand.

On Linux, smartmontools (smartd) provides this functionality. The details are not covered in this documentation, since the configuration file is very complex. A manual page (smartd.conf(5)) is installed with every copy of smartmontools.

In my configuration, I added the following:

/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
/dev/sdc -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify
/dev/sdd -a -o on -S on -s (S/../.././02|L/../../6/03) -m notifications@example.com -M exec /usr/local/bin/smartdnotify

This runs a short self-test on every drive every night at 02:00, as well as a long self-test every Saturday night at 03:00.

Notifications are sent via the shell script given after the -M exec directive. The script looks as follows:

#!/bin/sh
# Send email: subject header, blank line, then the message body
printf 'Subject: %s\n\n%s\n' "$SMARTD_FAILTYPE" "$SMARTD_MESSAGE" | msmtp "$SMARTD_ADDRESS"
# Notify logged-in users on their terminals
wall "$SMARTD_MESSAGE"
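
Remember to make the script executable. smartd also understands a -M test directive that sends a test notification on startup, which is handy for verifying the mail path (remove it again afterwards):

chmod +x /usr/local/bin/smartdnotify
# temporarily add "-M test" to one of the /dev/sdX lines in smartd.conf, then restart
# the smartd daemon (service name smartd or smartmontools, depending on the Debian release)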

Benchmarking tips

  • gnome-disk-utility has a benchmark, but it misbehaves with mdadm RAID arrays and shows implausibly slow write speeds
  • hdparm -T /dev/sda measures cached reads (effectively testing the cache/RAM path)
  • hdparm -t /dev/sda measures buffered device reads (without the cache)
  • mind that hdparm issues raw disk operations and therefore does not work on mdadm or LVM arrays
  • dd works fine, but oflag=sync should be specified to bypass caching (the HDD cache isn’t the only cache involved in file I/O on Linux); mind that the write examples below go straight to the raw device and destroy its contents, so only run them on empty drives
  • dd if=/dev/zero of=/dev/sda oflag=sync bs=512 count=10000
  • dd if=/dev/zero of=/dev/sda oflag=sync bs=1M count=1000
  • dd if=/dev/zero of=/dev/sda oflag=sync bs=10M count=100
  • dd if=/dev/zero of=/dev/sda oflag=sync bs=100M count=10
  • dd if=/dev/zero of=/dev/sda oflag=sync bs=2G count=1
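
The read side can be measured the same way; iflag=direct bypasses the page cache so the drive itself is measured (reading is non-destructive, unlike the write tests above):

dd if=/dev/sda of=/dev/null iflag=direct bs=1M count=1000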

Partition alignment details

When using pvcreate on a raw device like /dev/md0, you might want to check that the physical alignment was done correctly. All modern versions of lvm2 should handle this automatically, as do all modern partitioning tools.

The issue arises when accessing a 512B sector of a drive that is running 512e (4K physical sectors, but 512B logical sectors exposed externally). Reading a single 512B sector causes a full 4K sector to be read, and the drive controller has to extract the required 512B; writes are worse, since an unaligned write forces a read-modify-write of the whole 4K physical sector.

Another issue arises when filesystem blocks span more than one sector. Imagine a filesystem storing data in 4 KiB blocks. On a 512e drive, a misaligned 4 KiB block (8 512B logical sectors) can straddle two 4K physical sectors, so reading it causes both physical sectors to be read. This can lead to heavy performance loss on 512e drives.

You can check manually that the alignment is correct for Advanced Format (512e) disks: partitions should start at a logical sector number that is a multiple of 8.[4]

gdisk can help by showing your hard drive’s sector layout by running gdisk -l on your disk:

root@nas-test:~# gdisk -l /dev/sda 
GPT fdisk (gdisk) version 1.0.3

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries.
Disk /dev/sda: 7814037168 sectors, 3.6 TiB
Model: TOSHIBA HDWQ140 
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): A16EAA69-C571-436B-9C02-041A7B761572
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 7814037101 sectors (3.6 TiB)

Mind the second to last line, stating the partition alignment.
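
Once a partition exists, parted can also check its alignment directly (here partition 1 on /dev/sda); it reports whether the partition matches the optimal alignment the kernel advertises:

parted /dev/sda align-check optimal 1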

To check the alignment of an LVM physical volume, use pvs -o +pe_start --units m (the lowercase unit m makes sure the number you see is in MiB, i.e. base-2 IEC units rather than SI-prefixed ones):

root@nas:~# pvs -o +pe_start --units m
  PV         VG    Fmt  Attr PSize       PFree       1st PE
  /dev/md127 share lvm2 a--  7630636.00m 3895084.00m   1.00m

The 1st PE column shows that the first physical extent starts 1 MiB into the device.

The standard 1 MiB alignment used by most tools lines up perfectly with 4K sectors[5] as well as with conventional 512B sectors.

Most modern tools align partitions correctly by default, since 4K alignment does no harm to 512B-sector drives apart from losing a single MiB at the beginning of the disk and between partitions.

  1. The array definition does not strictly need to be stored in the config file; it is written to the drives themselves. It is however useful for making sure your RAID does not one day get renamed from md0 to md1, for example. (This should only happen if a new RAID array is autodetected.)

  2. This is useful in case a drive fails; you want to know as soon as possible. Even better: set up a proper monitoring system such as Influx.

  3. This might not be strictly necessary, since local filesystems are mounted before network services are started, as a systemd-analyze plot indicates.

  4. 8 512B sectors is 4096B - a single 4K sector. 

  5. 2048 / 8 = 256, a whole number (8 being the number of 512B logical sectors per physical 4K sector), so a 2048-sector boundary is also a 4K boundary.