ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Super-cheap Computing Cluster for Learning

Fri Dec 15, 2017 2:46 am

The use of Linux in high performance computing dates back to 1993 when Donald Becker wrote the first network device drivers to create Beowulf clusters of desktop computers. Today almost all supercomputers run Linux in some sort of cluster arrangement. The aim of this thread is to document setting up a super-cheap cluster with the Raspberry Pi that will be used to understand how supercomputers work.

There are many tutorials which describe how to set up a cluster of Raspberry Pi computers. For example, the OctaPi projects created by GCHQ focus on clustering Pi 3B computers using the built-in WiFi and running parallel codes using Python. The setup described here combines the following features:

1. Possibly the cheapest cluster.

2. Uses centralised storage.

3. Explains the why as well as the how.

4. Configured as a real supercomputer.

5. Includes parallel processing examples.

We first describe the hardware used and attempt to explain why this is the cheapest cluster. Again, it should be emphasized that our goal is to learn about clustering by creating a realistic model of a supercomputer, not to actually build a fast machine. For this purpose we have the following equipment budget:

Code: Select all

 
    One Raspberry Pi B+                  $25
    Samsung 32GB EVO micro sd card       $10
    Five Raspberry Pi Zeros              $25
    Sabrent 7 port USB Hub               $20
    Sabrent 6-pack 1ft Micro USB Cables   $8
    One Belkin 6" USB Device Cable        $6
    Five more sd cards (optional)        $12
    Aquarium tubing                       $2
    Two coat hangers                    free
    Piece of wood                       free
                               TOTAL    $108
We note that the Sabrent USB hub comes with a 4 amp power supply, which is enough to power all six Pi computers. Using a Pi B+ takes less power and creates a more realistic cluster in which the head node has less compute power than the compute nodes. The Pi Zeros run in gadget mode. They boot over USB using rpiboot and only need an SD card for local swap and scratch storage because the root file systems are mounted over NFS from the Pi B+. The price would be under $100 if the Zeros were operated without SD cards; in that case the Zeros would not have any local swap or scratch storage. The coat hangers and aquarium tubing are used to mount the components to the wood.

While it may be possible to assemble a cheaper cluster with used and donated equipment, using new equipment has the advantage of low power consumption and a very small footprint. The next post will include photographs of the assembled cluster in action.
Last edited by ejolson on Fri Dec 02, 2022 2:48 pm, edited 6 times in total.

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Fri Dec 15, 2017 3:02 am

Why not use a Cluster hat?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Fri Dec 15, 2017 3:18 am

Gavinmc42 wrote:
Fri Dec 15, 2017 3:02 am
Why not use a Cluster hat?
A Cluster HAT would work just as well, except with one fewer Zero. If you have one, you could follow along with the main part of this thread--software, configuration and programming--with almost no changes.
Last edited by ejolson on Fri Dec 15, 2017 5:07 am, edited 1 time in total.

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Fri Dec 15, 2017 3:38 am

I use Zero's with no SD cards all the time now with USBboot.
The Cluster Hat is a bit $$ but no USB hubs needed etc and it is very small.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Fri Dec 15, 2017 5:24 am

Gavinmc42 wrote:
Fri Dec 15, 2017 3:38 am
I use Zero's with no SD cards all the time now with USBboot.
The Cluster Hat is a bit $$ but no USB hubs needed etc and it is very small.
The cluster hat does look suitable, though maybe it blocks the HDMI ports. You are right that the total cost after adding up the cables and USB hub is similar. I've edited my previous post to remove the concerns about additional cost. At the same time, a powered USB hub eliminates the need for a separate power supply.

I didn't know the Pi Zero could boot without an SD card. I found this link which looks very promising. Is that what you are talking about? If you have multiple Pi Zeros waiting, is there a way to tell the rpiboot command which Pi is which?

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Sat Dec 16, 2017 1:25 am

https://8086.support/category/23/cluster-hat.html
https://github.com/burtyb/usbboot

I am no Cluster Hat expert yet; I got it to make a 5-camera security system.

I mostly use Zero's and USB boot for Ultibo code development now.
https://ultibo.org/forum/viewtopic.php? ... core#p4954
It saves wear and tear on SD cards and my readers; I go through more than one reader/writer per year.
Plus I save the cost of 4 uSd cards ;)

USBboot and Ultibo is quick but Ultibo still lacks the OTG comms required for the next step.
Zero's as peripheral devices for PC's and Pi3's 8-)

I have also got the PiCore kernel booting via USBboot, just not the full OS yet.
This is much smaller than Raspbian but it is not Debian so it is tricky.
Gadget mode g_ether I struggle with, Zero gadgets are expert level, not for beginners ;)

Making an OpenCL Free Pascal interface to run in Ultibo on the Zero's VC4 QPU?
So much to learn, which is the point :D

Raspbian is getting bloated, PiCore trims back Linux but is still perfectly usable, Ultibo throws away the OS.
I want to learn how to do CV/ML/AI etc without Linux.

I am hoping one day for a Zero with a 64bit Quad core on it so I can use ARM's Compute library with NEON..
Even a dual core A35 Zero would be great, ditch that nasty 32bit code and go cleaner Aarch64 :D
But still plenty to learn in the meantime ;)
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Sat Dec 16, 2017 4:56 pm

Gavinmc42 wrote:
Sat Dec 16, 2017 1:25 am
Raspbian is getting bloated, PiCore trims back Linux but is still perfectly usable, Ultibo throws away the OS.
I want to learn how to do CV/ML/AI etc without Linux.
According to the link, it would appear the cluster hat people have a modified version of rpiboot.
Cluster Hat Support wrote:My version of rpiboot has been modified to support overlay boot directories based on the USB "path" the device is connected via. This allows the use of a custom configuration file for each slot on the Cluster HAT (or ZeroStem/USB Cable).
It would be nice if this change could be merged upstream into the official version. It is exactly the modification I need, since it allows distinguishing one Pi Zero from another at boot.

I plan to keep SD cards in my Zeros because the cards I found are big enough to hold a swap and scratch partition. It sure would be convenient, however, to put all the boot partitions on the B+ and boot using rpiboot. Thanks for drawing my attention to this option. Now the question is whether to go with the patched version of rpiboot or use the official one and instead create a single customized initramfs that differentiates each Pi by its serial number after booting.
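For what it is worth, the initramfs idea would amount to a small boot-time hook keyed on the SoC serial number. A sketch of what such a hook might look like, where the serial numbers are made up:

Code: Select all

# hypothetical initramfs hook: pick a hostname from the SoC serial
serial=$(awk '/^Serial/ {print $3}' /proc/cpuinfo)
case "$serial" in
    0000000012345678) echo s0 >/proc/sys/kernel/hostname ;;
    00000000abcdef99) echo s1 >/proc/sys/kernel/hostname ;;
esac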

I understand your concern about how large Raspbian is becoming. I'll be using the light version, but it still has a full systemd with dbus running. Getting the Ethernet gadgets to configure properly has repeatedly convinced me that udev and dhcpcd are poorly documented.

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Sun Dec 17, 2017 1:24 pm

ejolson wrote:
Sat Dec 16, 2017 4:56 pm
I understand your concern about how large Raspbian is becoming. I'll be using the light version, but it still has a full systemd with dbus running. Getting the Ethernet gadgets to configure properly has repeatedly convinced me that udev and dhcpcd are poorly documented.
Yep, Zero's, networking and Linux, not for the faint-hearted, yet I still see people wanting to buy lots of Zero's for schools, must have really smart kids.
With Raspbian it started with 4GB cards, soon it was 8GB, I mostly get 16GB/32GB cards now and my Pi3s boot from 32GB USB sticks.

Raspbian is great for learning and Lite ditto for headless, without them I would not have learned how to use Linux.
PiCore gave me a smaller, easier to use Linux; at the time I had no Internet access so it was manual installs.
Manual installs with dependencies, yuk, so I learned how to awk, sed, shell script etc to do everything with whatever comes pre-installed with PiCore.
Still amazed what can be done with shell script and busybox.

But I could see the writing on the wall and started looking for something better and found it in June 2016.
Ultibo, based on Free Pascal, has native multithread/multicore support, and multi-dimensional arrays are native too.
All the basics built into the language to go beyond C/C++.

Then Sep 2016 - Brian Krebs - DDoS etc, the last nail for Linux; I will never learn enough in what time I have to secure it.
Even Ultibo got DDoS in Dec 2016 :o
Besides I want to learn CV/AI/ML/multicore stuff without having to learn how to do it on Linux.
Learning OpenCV is about installing it and figuring out how to run it on Linux.

It takes longer to figure out how to run this stuff baremetal but I figure I will learn the basics better that way.
Maybe one day I will figure out how to make a super lite Debian Raspbian.
Buildroot is fun, a weekend to learn, but it is still Linux and takes a long time to build; a single mistake can cost a day.

Does Clustering need Linux?
The top 500 supercomputers now all run Linux.
With a Pi3, Cluster hat and 4 Zero's, I can play with alternatives, anyone can for about $100.

I have spent most of my life playing with bleeding edge hardware, time for me to try bleeding edge software.
Actually I'm just following the bloody footsteps of the early explorers, mostly in the baremetal forums ;)
$7.48 AUD, so much brain pain for so little money.
I have never studied so hard, even in my 13 years of tertiary education decades ago, as I have in the last 6 years :o

So, about clustering:
USB 2 ports are fast, 480Mb/s (about 300Mb/s real); the LAN on Pi's is not that fast.
Could we go faster by using the SMI parallel interface, just a ribbon cable to connect them all?

Or SMI to IDE drives, Cluster storage?
viewtopic.php?f=45&t=197875
Could make an interesting RAID system.

What else is Clustering good for?
Stick one of these on each Zero, roll out personal "Person of Interest" systems.
https://hackaday.com/2017/12/17/googles ... processor/

This makes a DIY home camera system as good as or better than off the shelf stuff.
Just two weeks ago one of my old gen-two 5MP V1-camera Pikrellcams got some great videos that even impressed the local cops with the quality.
It would have been better if the two old gen-1 and other gen-2 cameras had also caught video; motion settings are hard to set properly.
People recognition would add interesting capabilities; I don't really need 2000+ videos per day of passing cars.
Perhaps my finance department could be convinced VisionBonnets are needed for a gen 4 system :lol:
A solution until quad A35 Zero's appear?

What else can clustering do?
Of course Vision bonnets don't have to run vision code ;)
How many MA2485's can fit on a Zero hat?
OpenCL is about 50% usable on QPU's, what can that be used for?

Anyway having the Zero's USB boot allows quick app changes.
Big clusters can solve many problems.
With good coding a small cluster can solve one small problem?
Perhaps snail recognition to protect the vegi patch?
Mobile clusters in farm equipment, anyone told John Deere?
When I was working at Leica, vision on autonomous tractors was a pipe dream.

DIY smarter home autonomous mowers?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Sun Dec 17, 2017 6:36 pm

Gavinmc42 wrote:
Sun Dec 17, 2017 1:24 pm
With good coding a small cluster can solve one small problem?
Perhaps snail recognition to protect the vegi patch?
Last week I decided to name my zero cluster "snail" for similar reasons. If I build another, maybe it will be called vegi.

Before proceeding to software configuration and setup, here are photographs to show how I used the coat hangers, aquarium tubing and block of wood.

Image

Image

Note that the coat hangers have been bent into brackets and the tubing placed as spacers between each Pi Zero. Holding the bracket works well to stabilize the Pi's when plugging and unplugging cables. Again, the Pi B+ and Zeros are all powered by the USB hub, so only one power connection is required.

The software configuration described in what follows will work with an even simpler arrangement, such as plugging the Pi Zeros directly into the B+ or using a Cluster HAT. The USB hub I selected was cheap and came with a 4 amp power supply; however, internally I think it is actually two 4-port hubs chained together. As the Pi B+, 2B and 3B all place restrictive limits on the number of USB2 devices which can be active, similar clusters using more Zeros may not work.

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Mon Dec 18, 2017 1:17 am

I guess that explains the aquarium tubing ;)
It also gives access to the HDMI ports which the Cluster hat does not.

Glass cockpits are one option I have started exploring.
https://github.com/Gavinmc42/PFD
Ultibo 2.0 gives access to OpenVG on the VC4.
It is not bad, and if a Zero is dedicated to each display the GPU processing power is split across them.
This would also allow for redundancies etc.
Good for low cost flight sims?

Clusters for visual displays.
USB cables can be up to 5m, so a distributed cluster would still work.
It just looks more like a real cluster when stacked; going to add some blinking LEDs?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Mon Jan 01, 2018 7:38 am

In this post I will begin setting up the B+, which will function as the head node of the cluster. In a real supercomputer the compute nodes are built using the fastest technology available. As a result, the compute nodes are typically faster than the head node. Since we want the performance characteristics of the learning cluster to approximate a real cluster, a B+ was chosen rather than a faster 2B or 3B to ensure the head node has less computational power than the compute nodes. It is also convenient for the head node to run the same kernel and kernel modules as the computational nodes. In what follows I have also included details on how to use a 2B, 3B or 3B+ for the head node. However, as the resulting setup is more complicated, my recommendation is to use another Pi Zero or Pi Zero W for the head node if a suitable B+ is not available.

Another characteristic of a supercomputer is that persistent storage is provided by a dedicated data appliance connected through a fast interconnect. This allows a global storage cache to be shared by processes running on separate compute nodes. In practice modern storage appliances provide reliability and resiliency through snapshots, redundancy and data scrubbing. In our learning cluster we use USB2 as the fast interconnect and represent the data appliance and global storage cache as a shared BTRFS file system served by the head node.

There are a number of different technologies that could have been used for our shared file system: the layered approach represented by EXT4 or XFS over LVM RAID with thin provisioning, and the all-in-one approach represented by BTRFS or ZFS. I chose BTRFS because it performs better on low-memory systems than ZFS and is easier to set up than the layered approach. While I'm not aware of any real supercomputer installations that use BTRFS, it has the snapshot and data integrity features needed to approximate how a data appliance might work.

The rest of this post describes how to configure Raspbian to use BTRFS as the root file system. These instructions are modified from David Korben's excellent Raspbian BTRFS Root Filesystem Guide. Since the B+ is short on memory and compute power, we will not be using the transparent compression and RAID features of BTRFS but only the copy-on-write snapshot capability. The resulting configuration will allow each of the compute nodes to mount a read-write snapshot of the same root file system. In particular, multiple copy-on-write snapshots of a single Raspbian image will be shared among all compute nodes in the cluster through a single file-system cache.
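As a quick illustration of the copy-on-write behaviour, once the BTRFS root described in the rest of this post is in place, a snapshot of the running system can be created, inspected and discarded in seconds because no file data is actually copied. The subvolume name /x/demo below is just an example:

# mkdir -p /x
# btrfs subvolume snapshot / /x/demo
# btrfs filesystem df /
# btrfs subvolume delete /x/demo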

While it is possible to convert an EXT4 file system to a BTRFS file system using the btrfs-convert command to copy some meta-data around, slightly better results are obtained by creating a new BTRFS from scratch and then copying the files from the EXT4 file system to it. This is the approach used by David's guide. In his approach the old EXT4 file system remains on the SD card unused. Our modification is to place the new BTRFS file system on a new SD card to create a BTRFS-only SD card.

First, copy 2017-11-29-raspbian-stretch-lite.img to a temporary SD card, then boot and configure it. You may want to set the keyboard type, change the GPU memory split to 16MB, create some new user accounts, delete the default pi user or change the default password, configure the networking and enable ssh remote login. Next, we make the changes necessary to boot with a BTRFS root file system.

Edit /boot/cmdline.txt and change the "root=PARTUUID..." option to read "root=/dev/mmcblk0p2" instead. While the original PARTUUID root option may allow booting from media other than the SD card, this mechanism breaks when partitions on the card are modified or an initramfs is used. Since the root file system will always be the second partition on the SD card we can directly specify root as mmcblk0p2.
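For reference, this edit can be scripted with a one-line sed, assuming the stock single-line cmdline.txt:

# sed -i 's|root=PARTUUID=[^ ]*|root=/dev/mmcblk0p2|' /boot/cmdline.txt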

Since swap files can't be used with BTRFS, disable the swap file by logging in as root and entering the commands

# dphys-swapfile swapoff
# dphys-swapfile uninstall
# systemctl stop dphys-swapfile
# systemctl disable dphys-swapfile

Add "btrfs" to /etc/initramfs-tools/modules. Since we will have to add the Ethernet gadget for the Pi Zero's later, let's add those modules as well. The last lines of this file should now read

btrfs
g_ether
libcomposite
u_ether
udc-core
usb_f_ecm
usb_f_rndis

Now create the initramfs and make a copy of the corresponding kernel with the commands

# update-initramfs -c -k `uname -r`
# cd /boot
# mv initrd.img-`uname -r` myinitrd.img
# cp kernel.img mykernel.img

and enable it by adding the following lines to /boot/config.txt

kernel=mykernel.img
initramfs myinitrd.img

If you are using a 2B or 3B instead of a B+ for the server, then copy kernel7.img instead of kernel.img to mykernel.img in the above commands. Note that we have created mykernel.img so updates to the kernel don't get out of sync with the kernel modules in the initial ramdisk. This is important because having the correct btrfs module is required to mount the root filesystem at boot. At this point it is worth shutting down the system and checking that the SD card still boots with the EXT4 filesystem as root. If this is the case, it is time to create the BTRFS-only SD card.

Place a new SD card in a USB card reader and connect it to the B+. In what follows, I shall assume that the new SD card appears as /dev/sda. It may happen that it appears as a different device such as /dev/sdb or /dev/sdc. In that case, substitute the appropriate device in what appears below.
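If in doubt, one way to identify the correct device is to list the block devices before and after plugging in the reader and see which entry appears:

# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT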

Now, use fdisk to create partitions similar to

Code: Select all

    Device    Boot    Start      End  Sectors  Size Id Type
    /dev/sda1          2048   198655   196608   96M  c W95 FAT32 (LBA)
    /dev/sda2        198656 60424191 60225536 28.7G 83 Linux
    /dev/sda4      60424192 62521343  2097152    1G 82 Linux swap / Solaris
The /boot partition has been made twice as large as usual in order to hold our custom initial ramdisk and backups of the corresponding kernel. Note that a swap partition occupies the last 1GB of the SD card because BTRFS doesn't support swap files.
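If you prefer a non-interactive approach, sfdisk should be able to create this layout from a dump-style description. This is only a sketch: the sector counts below match the table above and assume a card of the same size, so adjust them or simply use fdisk interactively.

Code: Select all

# sfdisk /dev/sda <<'EOF'
label: dos
unit: sectors
/dev/sda1 : start=2048, size=196608, type=c
/dev/sda2 : start=198656, size=60225536, type=83
/dev/sda4 : start=60424192, size=2097152, type=82
EOF
Next, format the partitions and copy the boot and root filesystems to the new SD card using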

# apt-get install btrfs-tools rsync
# mkdosfs -n BOOT /dev/sda1
# mkfs.btrfs -L ROOT -d single -m single /dev/sda2
# mkswap -L SWAP /dev/sda4
# mkdir -p /z/a1 /z/a2
# mount /dev/sda1 /z/a1
# mount /dev/sda2 /z/a2
# cd /z/a1
# rsync -glopqrtxDH --numeric-ids --delete /boot/ /z/a1
# cd /z/a2
# rsync -glopqrtxDH --numeric-ids --delete / /z/a2

Now, edit /z/a2/etc/fstab to read as

LABEL=BOOT /boot vfat defaults 0 2
LABEL=ROOT / btrfs defaults,noatime 0 1
LABEL=SWAP none swap sw 0 0

Edit /z/a1/cmdline.txt to change "rootfstype=ext4" to "rootfstype=btrfs" and add the "net.ifnames=0" option. For reference the cmdline.txt should now look like

Code: Select all

boot=local dwc_otg.lpm_enable=0 net.ifnames=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=btrfs elevator=deadline fsck.repair=yes rootwait
After making these changes unmount the SD card using

# cd
# umount /z/a1
# umount /z/a2

Finally, remove the USB card reader, shutdown the B+, remove the power, place the new SD card in the B+ and reboot. If everything goes well, you should now be running Raspbian Lite with a BTRFS root file system and a 1GB swap partition. If you have any difficulties, please refer to David's guide or directly follow that guide instead.
Last edited by ejolson on Sat Mar 31, 2018 7:32 pm, edited 4 times in total.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Sat Jan 06, 2018 6:27 am

In this post we set up the Zero nodes to boot in device mode over USB using the rpiboot protocol. After loading the kernel and initial RAM disk, each Zero loads the Ethernet gadget driver and then mounts its root file system and home directories over NFS. As this method does not require the Zeros to have SD cards, I could have saved $12 on the original bill of materials; however, I originally thought that each Zero would require an SD card with a boot partition. Information on how to use rpiboot was obtained thanks to posts by Gavinmc42 on this thread, by ajlitt on the Hackaday blog the Terrible Cluster as well as by Chris Burton on the Cluster HAT blog How do I setup usbboot. Although the instructions provided here allow the Zeros to boot without an SD card, we will later install the SD cards to use as local swap and scratch storage.

We first create the boot directories and install rpiboot. Connect the Pi Zeros to the powered hub through their USB data ports. Without SD cards they will just sit there, no green light or display, apparently doing nothing. However, power is being back-fed through the data port and they are actually powered up, waiting for rpiboot to feed them the kernel and initial RAM disk. Before installing and configuring rpiboot we need to determine the physical USB addresses for each Zero. Note that if you switch hubs or plug the Zeros into different ports, this step will need to be repeated. Therefore, make sure everything is wired the way you want before proceeding and then use the following commands to probe the USB hub.

Code: Select all

# lsusb -t
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc_otg/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/5p, 480M
        |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx, 480M
        |__ Port 5: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 1: Dev 31, If 0, Class=Vendor Specific Class, Driver=, 12M
            |__ Port 2: Dev 6, If 0, Class=Hub, Driver=hub/4p, 480M
                |__ Port 3: Dev 35, If 0, Class=Vendor Specific Class, Driver=, 12M
                |__ Port 1: Dev 32, If 0, Class=Vendor Specific Class, Driver=, 12M
                |__ Port 4: Dev 33, If 0, Class=Vendor Specific Class, Driver=, 12M
                |__ Port 2: Dev 34, If 0, Class=Vendor Specific Class, Driver=, 12M
This information tells us that there are three hubs currently installed on the system: 1-1 is the built-in 5-port hub on the B+ and 1-1.5 and 1-1.5.2 are two 4-port hubs which are internal to the Sabrent 7-port USB hub shown in the picture. At this point, it is worth mentioning that we chose our USB hub because it was the cheapest available with a 4 amp power supply. As far as hubs go, an internal topology consisting of two 4-port hubs chained together is not ideal. Moreover, this particular hub does not support smart hub power switching, which would have allowed us to set up the power on-demand capability of the Slurm Workload Manager. Though not important for a super-cheap cluster, it would have been nice to implement this feature, as real supercomputers take megawatts of power and learning about power saving is important. However, in other respects, the hub appears to work well.

We can determine the USB addresses for each Zero in the above USB device tree by looking for the "Class=Vendor Specific Class" devices. In the above case we find that the five Pi Zeros are at addresses 1-1.5.1, 1-1.5.2.1, 1-1.5.2.2, 1-1.5.2.3 and 1-1.5.2.4. We now configure a boot directory for each Pi Zero. First create a BTRFS subvolume to hold the master copy of the boot directory using the following commands

# apt-get install m4 rpiboot
# mkdir -p /x/sboot /x/sproto
# cd /x/sboot
# btrfs sub create boot
# cp -rp /boot/* boot

Now edit /x/sboot/boot/config.txt so the last three lines read

dtoverlay=dwc2
kernel=mykernel.img
initramfs myinitrd.img

Note the last two lines should already be present from the changes made earlier to the boot directory of the Pi B+ server, so only one new line needs to be added. Also note that if instead of a Pi B+ you are using a 2B or 3B for the server, the mykernel.img and myinitrd.img copied from the FAT filesystem will need to be replaced by appropriate files for the BCM2835 used in the Zero. Therefore, if you are using a 2B or 3B also type

# update-initramfs -c -k `uname -r | sed "s/-v7//"`
# mv /boot/initrd.img-`uname -r | sed "s/-v7//"` /x/sboot/boot/myinitrd.img
# cp /boot/kernel.img /x/sboot/boot/mykernel.img

We now create the script /x/sboot/bupdate to make five copy-on-write snapshots of the boot subvolume, one for each Zero. The script should read

Code: Select all

#!/bin/bash
let s=0
for i in 1-1.5.2.4 1-1.5.2.3 1-1.5.2.2 1-1.5.2.1 1-1.5.1 
do
    let ip33=4*$s+33
    let ip34=4*$s+34
    myip=`printf "192.168.7.%d" $ip33`
    mymac=`printf "02:34:33:3c:50:%02x" $ip33`
    yourmac=`printf "02:34:33:3c:50:%02x" $ip34`
    btrfs sub del $i
    btrfs sub snap boot $i
    m4 -D MYIP=$myip -D MYNAME=s$s \
        -D MYMAC=$mymac -D YOURMAC=$yourmac ../sproto/cmdline.m4 \
        >$i/cmdline.txt
    let s=$s+1
done
The script algorithmically generates IP numbers for the Zeros and MAC addresses for the USB Ethernet gadgets. The order of the USB addresses in the for loop determines which Zero is given which hostname. Given the way I ran the wires, the Zero at the top of the stack is named s0 and the one at the bottom s4. In this example we are using private IP numbers of the form 192.168.7.xxx for the USB interconnect between the nodes and MAC addresses of the form 02:34:33:3c:50:xx with the locally administered bit set. The script substitutes the IP and MAC addresses into the template /x/sproto/cmdline.m4 which should look like

Code: Select all

boot=nfs dwc_otg.lpm_enable=0 net.ifnames=0 console=serial0,115200 console=tty1 root=/dev/nfs nfsroot=192.168.7.2:/x/MYNAME rw ip=MYIP::192.168.7.2:255.255.255.0:MYNAME:usb0:off elevator=deadline rootwait modules-load=dwc2,g_ether g_ether.host_addr=YOURMAC g_ether.dev_addr=MYMAC
Note that the above file consists of only one very long line.
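For example, substituting the values the script computes for the Zero named s0 (IP 192.168.7.33, device MAC ending 21, host MAC ending 22), the generated cmdline.txt should come out as the following, again all on one line:

Code: Select all

boot=nfs dwc_otg.lpm_enable=0 net.ifnames=0 console=serial0,115200 console=tty1 root=/dev/nfs nfsroot=192.168.7.2:/x/s0 rw ip=192.168.7.33::192.168.7.2:255.255.255.0:s0:usb0:off elevator=deadline rootwait modules-load=dwc2,g_ether g_ether.host_addr=02:34:33:3c:50:22 g_ether.dev_addr=02:34:33:3c:50:21
Now run the script with the commands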

# cd /x/sboot
# ./bupdate
ERROR: cannot access subvolume 1-1.5.2.4: No such file or directory
Create a snapshot of 'boot' in './1-1.5.2.4'
ERROR: cannot access subvolume 1-1.5.2.3: No such file or directory
Create a snapshot of 'boot' in './1-1.5.2.3'
ERROR: cannot access subvolume 1-1.5.2.2: No such file or directory
Create a snapshot of 'boot' in './1-1.5.2.2'
ERROR: cannot access subvolume 1-1.5.2.1: No such file or directory
Create a snapshot of 'boot' in './1-1.5.2.1'
ERROR: cannot access subvolume 1-1.5.1: No such file or directory
Create a snapshot of 'boot' in './1-1.5.1'

Ignore the error messages as they result from trying to delete a snapshot that wasn't made yet. If you run the script again, for example after making changes to the /x/sboot/boot subvolume, the errors will not appear.

At this point we could start rpiboot and the Zeros would load the initial RAM file systems and then try to mount their root file systems as /x/s0, /x/s1 and so forth. However, at that point they would get stuck because these root file systems haven't been created or exported through NFS yet. The next post will discuss creating the root file systems using a similar BTRFS copy-on-write snapshot technique as was used for the boot directories.
Last edited by ejolson on Thu Jul 07, 2022 6:49 am, edited 3 times in total.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Mon Jan 08, 2018 6:57 am

In this post I describe how to finish setting up the software on the Raspberry Pi B+ so the attached Pi Zero nodes can boot entirely from the USB interface. This is a rather complicated procedure and I would greatly appreciate feedback from anyone who can confirm whether I got all the details correct. Before making snapshots of the root filesystem, we first move the home directories to a separate subvolume so they don't appear in the individual root snapshots. This can be done with the following commands

# cd /x
# btrfs sub create snaa
# mv /home/* snaa
# rmdir /home
# ln -s /x/snaa /home

Now, add the following lines to /etc/hosts listing the IP address of the B+ on the bridge and the addresses of the Zeros' Ethernet gadgets on the USB network.

Code: Select all

192.168.7.2 snail
192.168.7.33 s0
192.168.7.37 s1
192.168.7.41 s2
192.168.7.45 s3
192.168.7.49 s4
Since the cluster is small, and because the hosts file on the server is automatically replicated by the BTRFS copy-on-write snapshot technique used to create the root file systems for each of the Zeros, it is easy to resolve the node names and corresponding IP numbers using files. When the size of the cluster is much larger than 50, it may be better to set up a BIND server instead.

Next, modify /etc/rc.local to conditionally load the bridge device and start rpiboot if running on the B+, or to set the MTU of the Ethernet gadget if running on a Zero. To do this, add the following lines just before the "exit 0" line at the end of the file

Code: Select all

case `hostname` in
snail*)
    echo Loading san bridge device...
    ip link add name san type bridge
    ip link set san up
    echo Starting rpiboot to boot nodes...
    /usr/bin/rpiboot -m 500000 -l -d /x/sboot -o \
        >>/var/log/rpiboot.log &
    ;;
s[0-4])
    echo Setting usb0 mtu to 7418
    ip link set usb0 mtu 7418
    ;;
esac
Next, configure /etc/dhcpcd.conf so it assigns an IP number to the bridge but doesn't mess with the Ethernet gadgets. Do this by adding

denyinterfaces usb0 usb1 usb2 usb3 usb4 usb5 s0 s1 s2 s3 s4

as the first line of the file. A udev rule will be used to name the Ethernet gadgets s0, s1 and so forth; however, we also include usb0, usb1 and so forth in the denyinterfaces line to avoid a possible race condition. At the end of the same file add the lines

interface san
static ip_address=192.168.7.2/24

to assign an IP number to the bridge device.

Create the udev rules file /etc/udev/rules.d/70-gadget.rules to identify which Zero is which by MAC address and name the corresponding Ethernet gadgets accordingly. The file should look like

Code: Select all

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="02:34:33:3c:50:22", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="usb*", NAME="s0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="02:34:33:3c:50:26", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="usb*", NAME="s1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="02:34:33:3c:50:2a", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="usb*", NAME="s2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="02:34:33:3c:50:2e", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="usb*", NAME="s3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="02:34:33:3c:50:32", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="usb*", NAME="s4"
Now, create a file /etc/network/interfaces.d/zeros to configure the gadgets on the B+ by adding them to the bridge. The file should look like

Code: Select all

allow-hotplug s0 s1 s2 s3 s4

iface s0 inet manual
    up ip link set s0 up
    post-up ip link set s0 mtu 7418
    post-up ip link set s0 master san

iface s1 inet manual
    up ip link set s1 up
    post-up ip link set s1 mtu 7418
    post-up ip link set s1 master san

iface s2 inet manual
    up ip link set s2 up
    post-up ip link set s2 mtu 7418
    post-up ip link set s2 master san

iface s3 inet manual
    up ip link set s3 up
    post-up ip link set s3 mtu 7418
    post-up ip link set s3 master san

iface s4 inet manual
    up ip link set s4 up
    post-up ip link set s4 mtu 7418
    post-up ip link set s4 master san
Note that the MTU is set at both ends of the USB Ethernet gadget. A jumbo packet size of 7418 was chosen; even this conservative value increases throughput about ten-fold under load. I may explore further tuning in subsequent posts.
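Once the Zeros are up, one quick way to verify the jumbo MTU end to end is a don't-fragment ping sized just under it; 7418 bytes minus 28 bytes of IP and ICMP headers leaves a payload of 7390 which should pass unfragmented:

# ping -c 3 -M do -s 7390 s0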

Next, install the NFS server using

# apt-get install nfs-kernel-server

and configure the /etc/exports file to allow each Zero to mount its root file system as /x/s0, /x/s1 and so forth, along with the home directories which are now in /x/snaa. The resulting exports file should look like

Code: Select all

/x/s0 192.168.7.33(rw,no_root_squash,async,no_subtree_check)
/x/snaa 192.168.7.33(rw,no_root_squash,async,no_subtree_check)
/x/s1 192.168.7.37(rw,no_root_squash,async,no_subtree_check)
/x/snaa 192.168.7.37(rw,no_root_squash,async,no_subtree_check)
/x/s2 192.168.7.41(rw,no_root_squash,async,no_subtree_check)
/x/snaa 192.168.7.41(rw,no_root_squash,async,no_subtree_check)
/x/s3 192.168.7.45(rw,no_root_squash,async,no_subtree_check)
/x/snaa 192.168.7.45(rw,no_root_squash,async,no_subtree_check)
/x/s4 192.168.7.49(rw,no_root_squash,async,no_subtree_check)
/x/snaa 192.168.7.49(rw,no_root_squash,async,no_subtree_check)
Note that the no_root_squash flag is essential for the root file systems and we have also included it for the home file system. The async and no_subtree_check options have been added for performance reasons.

We are now ready to create the root filesystems that the Zeros will mount over NFS. This will be done using the same copy-on-write snapshot technique that was used for creating the individual boot directories. We emphasize that the copy-on-write semantics imply that only one copy of the root filesystem will be stored on the SD card even though there logically appear to be five additional copies, one for each Pi Zero. Since the fstab of each Zero will be different from that of the B+, we create the file /x/sproto/fstab to look like

Code: Select all

LABEL=BOOT /boot vfat nofail 0 2
/dev/nfs / nfs noatime 1 1 
LABEL=SWAP none swap sw,nofail 0 0
LABEL=SCRATCH /x/scratch ext4 nofail 0 2
snail:/x/snaa /x/snaa nfs vers=3,noacl,async,bg 0 0
The nofail option has been included so that the Pi Zeros boot whether or not they have a suitably formatted SD card. Note that at this point there are no SD cards present in any of the Pi Zeros.
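For reference, when an SD card is added to a Zero later it only needs partitions carrying the labels this fstab looks for. Assuming the card shows up as /dev/sda in a USB reader and already contains two partitions, preparing it would look something like

# mkswap -L SWAP /dev/sda1
# mkfs.ext4 -L SCRATCH /dev/sda2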

Finally, we describe the script /x/supdate that creates and updates the root file systems for the Zeros using a snapshot of the current root file system of the B+. This script reads

Code: Select all

#!/bin/bash
for i in s0 s1 s2 s3 s4
do
    echo Configuring $i...
    btrfs sub del $i
    btrfs sub snap / $i
    (
        if cd $i/x
        then
            rmdir scratch 
            mkdir scratch
        fi
    )
    echo $i >$i/etc/hostname
    cp /x/sproto/fstab $i/etc/fstab
    rm $i/etc/exports
done
Note that the new fstab is copied into each snapshot and the exports file deleted. The script is complicated by the "rmdir scratch; mkdir scratch" sequence of commands, which for reasons I don't know seems necessary for creating a valid mount point for the SD card later. Run the script as

# cd /x
# ./supdate
Configuring s0...
ERROR: cannot access subvolume s0: No such file or directory
Create a snapshot of '/' in './s0'
Configuring s1...
ERROR: cannot access subvolume s1: No such file or directory
Create a snapshot of '/' in './s1'
Configuring s2...
ERROR: cannot access subvolume s2: No such file or directory
Create a snapshot of '/' in './s2'
Configuring s3...
ERROR: cannot access subvolume s3: No such file or directory
Create a snapshot of '/' in './s3'
Configuring s4...
ERROR: cannot access subvolume s4: No such file or directory
Create a snapshot of '/' in './s4'

As when creating the boot subvolumes, the error messages can be ignored. They will not appear when the script is run again. Every time we change or update the root file system on the B+ using, for example, the commands "apt-get update; apt-get upgrade" the corresponding snapshots in /x/s0, /x/s1 and so forth will need to be updated using the supdate script. Note, however, that the snapshots should not be updated while the Zeros are booted. We will add additional scripts in a subsequent post that automatically halt the Pi Zeros before performing such an update.

At this point, it should be possible to bring the entire cluster up by rebooting the B+ with the command

# /sbin/reboot

After doing this, you can check that the nodes booted by examining the log file /var/log/rpiboot.log and checking the network status. For reference, the ifconfig should report something like

Code: Select all

root@snail:/etc/ssh# /sbin/ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.46.34  netmask 255.255.255.0  broadcast 192.168.46.255
        inet6 fe80::6fdf:e7e8:a7cc:c6ba  prefixlen 64  scopeid 0x20<link>
        ether b8:27:eb:0b:0d:c2  txqueuelen 1000  (Ethernet)
        RX packets 31301  bytes 2044970 (1.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3889  bytes 506946 (495.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 171  bytes 29500 (28.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 171  bytes 29500 (28.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7418
        inet6 fe80::34:33ff:fe3c:5022  prefixlen 64  scopeid 0x20<link>
        ether 02:34:33:3c:50:22  txqueuelen 1000  (Ethernet)
        RX packets 19536  bytes 2173734 (2.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 31902  bytes 37517334 (35.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

s1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7418
        inet6 fe80::34:33ff:fe3c:5026  prefixlen 64  scopeid 0x20<link>
        ether 02:34:33:3c:50:26  txqueuelen 1000  (Ethernet)
        RX packets 31298  bytes 3124166 (2.9 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 52993  bytes 68332352 (65.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

s2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7418
        inet6 fe80::34:33ff:fe3c:502a  prefixlen 64  scopeid 0x20<link>
        ether 02:34:33:3c:50:2a  txqueuelen 1000  (Ethernet)
        RX packets 20024  bytes 2191754 (2.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 30993  bytes 37127532 (35.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7418
        inet6 fe80::34:33ff:fe3c:502e  prefixlen 64  scopeid 0x20<link>
        ether 02:34:33:3c:50:2e  txqueuelen 1000  (Ethernet)
        RX packets 20349  bytes 2207350 (2.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 31348  bytes 36773794 (35.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

s4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7418
        inet6 fe80::34:33ff:fe3c:5032  prefixlen 64  scopeid 0x20<link>
        ether 02:34:33:3c:50:32  txqueuelen 1000  (Ethernet)
        RX packets 20014  bytes 2184902 (2.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 31372  bytes 36539786 (34.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

san: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7418
        inet 192.168.7.2  netmask 255.255.255.0  broadcast 192.168.7.255
        inet6 fe80::f0a9:37ff:fe8a:f827  prefixlen 64  scopeid 0x20<link>
        ether 02:34:33:3c:50:22  txqueuelen 1000  (Ethernet)
        RX packets 1056152  bytes 207951908 (198.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 974310  bytes 866505756 (826.3 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
Last edited by ejolson on Fri Jun 01, 2018 5:02 am, edited 5 times in total.

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Mon Jan 08, 2018 7:36 am

Wow, getting closer.
Looks like a lot of work needed to sort it all out.
Would you need to use separate boot/root file systems for each Zero?
There should not be much difference between them unless you want to run different stuff on different Zeros.
Even then there will be lots in common.

Noticed there is now Pi-Server for x86 boxes.
Been meaning to find an old box that could do this for my home network.

Boot your Pi3 from network off the Pi-Server with no SD card, then boot the Zero's.
Basically clusters with no SD cards at all :D
I would then start worrying about power supplies, but it means clean boots from power up.
Some nasty virus finds your super cluster, power off kills it dead.
Assuming your x86 is clean ;)

Going to have to try this with PiCore; Raspbian, even Lite, is big.
Any idea on the time to boot them all up?
I think I need some more Zero's and 3's :lol:
$100 a cluster?
Do the new Pi2's also netboot?
They should, don't really need WiFi for this app and it will run a bit cooler too.
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Mon Jan 08, 2018 6:26 pm

Gavinmc42 wrote:
Mon Jan 08, 2018 7:36 am
Looks like a lot of work needed to sort it all out.
Would you need to use separate boot/root file systems for each Zero?
The reason to write everything out rather than create a single automated script is so people can understand the ideas and adapt the methods more easily. Learning how to configure a cluster as well as use it is a large part of the educational aim for this thread.

By creating separate copy-on-write snapshots for the individual boot and root filesystems, the actual data on the SD card is shared between the Zeros and can be quickly updated with a single subvolume delete followed by a snapshot command. The only difference between the individual boot filesystems for the different Zeros in my setup is cmdline.txt, which sets the MAC address, IP address and the path of the NFS root. The root filesystems start out identical except for the hostname files. However, during operation Raspbian writes to the root filesystem, so additional differences are stored over time thanks to the copy-on-write semantics.

Using BTRFS subvolumes mounted as root over NFS simplifies things tremendously by automatically sharing common data and only allocating additional storage to hold the changes. In particular, there is no need to use filesystem overlays, large amounts of RAM or extremely small Linux distributions for this method to work.
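For the curious, the du subcommand of btrfs-progs (assuming a version recent enough to have it) can show how little exclusive space each snapshot consumes compared with what it shares:

# btrfs filesystem du -s /x/s0 /x/s1 /x/s2 /x/s3 /x/s4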
Gavinmc42 wrote:
Mon Jan 08, 2018 7:36 am
Boot your Pi3 from network off the Pi-Server with no SD card, then boot the Zero's.
Basically clusters with no SD cards at all :D
Network booting should also work with the Pi 2B revision 2, but I don't have one. Also, I'm more interested in building a self-contained cluster mounted to my piece of wood that I can take with me for demonstrations.

From a maintenance point of view only one copy of Raspbian needs to be updated. After updating, one simply makes new snapshots for the Zeros. Note however, since the Zeros interact more closely with the B+, the security benefits of PiNet and PiServer are not obtained. At the same time, since all the Zeros are physically in the same place and communicate through a private network made out of USB gadgets, many of the security issues that PiNet and PiServer address don't exist in the first place.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Mon Jan 08, 2018 8:42 pm

This post configures the root account to enable easy shutdown and updating of the cluster. We first create a public ssh key for the root user. Login to the B+ as root and type

# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.

Press enter at each of the prompts above to accept the defaults, in particular the empty passphrase. Now add the public key to the authorized keys so that root can ssh to root without entering a password.

# cd /root/.ssh
# cat id_rsa.pub >>authorized_keys
# chmod 600 authorized_keys

Now comes the ugly, but perhaps educational, part. Since /root is included in the root filesystem, the separate root filesystem snapshots for each Zero have not been updated with the ssh key made above. Thus, we will need to manually reboot the Zero computers before it is possible to remotely reboot them using the authorized public keys. First type

# killall rpiboot

to stop the Zeros from automatically rebooting until we are ready. Now, pull the USB cables powering the Pi Zeros and then plug them back in again. We will create new root file-system snapshots for the Zeros in the next step, so there is no danger in abruptly turning off the power on the Zeros. Obviously, don't pull the plug on the B+. Wait a few minutes and then check that the Zeros are all down by typing

# /sbin/ifconfig

to make sure that all the Ethernet gadgets are gone and only the bridge is left. At this point the bridge will also be empty and have reverted back to an MTU of 1500. You can check the bridge by typing

# bridge link

which should return nothing. Now update the snapshots with the new root directories.

# cd /x
# ./supdate

At this point snapshots of the updated root filesystem from the B+ have been created for the Zeros. Now restart rpiboot so the Zeros will boot up again. To make this easier in the future, create an executable script called /x/szeroup which contains the lines

Code: Select all

#!/bin/bash
echo Starting rpiboot to boot nodes...
killall rpiboot
/usr/bin/rpiboot -m 500000 -l -d /x/sboot -o \
    >>/var/log/rpiboot.log &
and then type

# cd /x
# ./szeroup
Starting rpiboot to boot nodes...
rpiboot: no process found

Ignore the "no process found" message. The script tries to kill rpiboot just in case it is still running before starting another copy. After waiting about five minutes the Zero nodes should be back online. You can check this with the bridge command

Code: Select all

# bridge link
9: s0 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7418 master san state forwarding priority 32 cost 100 
10: s1 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7418 master san state forwarding priority 32 cost 100 
11: s2 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7418 master san state forwarding priority 32 cost 100 
12: s3 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7418 master san state forwarding priority 32 cost 100 
13: s4 state UP : <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 7418 master san state forwarding priority 32 cost 100
You should now be able to ssh from root to root between the B+ and each Zero. Do this now to populate the /root/.ssh/known_hosts file with details of each computer in the cluster.

Code: Select all

# ssh snail
The authenticity of host 'snail (192.168.7.2)' can't be established.
ECDSA key fingerprint is SHA256:EBU8GWlqWheBfgEzVA/ezWmkxalYUgJxtrqpkfk8vqc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'snail,192.168.7.2' (ECDSA) to the list of known hosts.
Linux snail 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan  8 11:26:20 2018 from 192.168.7.2
# exit
logout
Connection to snail closed.
root@snail:~# ssh s0
The authenticity of host 's0 (192.168.7.33)' can't be established.
ECDSA key fingerprint is SHA256:EBU8GWlqWheBfgEzVA/ezWmkxalYUgJxtrqpkfk8vqc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 's0,192.168.7.33' (ECDSA) to the list of known hosts.
Linux s0 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan  8 11:26:20 2018 from 192.168.7.2
# exit
logout
Connection to s0 closed.
root@snail:~# ssh s1
The authenticity of host 's1 (192.168.7.37)' can't be established.
ECDSA key fingerprint is SHA256:EBU8GWlqWheBfgEzVA/ezWmkxalYUgJxtrqpkfk8vqc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 's1,192.168.7.37' (ECDSA) to the list of known hosts.
Linux s1 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan  8 11:26:20 2018 from 192.168.7.2
# exit
logout
Connection to s1 closed.
root@snail:~# ssh s2
The authenticity of host 's2 (192.168.7.41)' can't be established.
ECDSA key fingerprint is SHA256:EBU8GWlqWheBfgEzVA/ezWmkxalYUgJxtrqpkfk8vqc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 's2,192.168.7.41' (ECDSA) to the list of known hosts.
Linux s2 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan  8 11:26:20 2018 from 192.168.7.2
# exit
logout
Connection to s2 closed.
root@snail:~# ssh s3
The authenticity of host 's3 (192.168.7.45)' can't be established.
ECDSA key fingerprint is SHA256:EBU8GWlqWheBfgEzVA/ezWmkxalYUgJxtrqpkfk8vqc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 's3,192.168.7.45' (ECDSA) to the list of known hosts.
Linux s3 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan  8 11:26:20 2018 from 192.168.7.2
# exit
logout
Connection to s3 closed.
# ssh s4
The authenticity of host 's4 (192.168.7.49)' can't be established.
ECDSA key fingerprint is SHA256:EBU8GWlqWheBfgEzVA/ezWmkxalYUgJxtrqpkfk8vqc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 's4,192.168.7.49' (ECDSA) to the list of known hosts.
Linux s4 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan  8 11:26:20 2018 from 192.168.7.2
# exit
logout
Connection to s4 closed.
In some sense, requiring human intervention when adding a new host to the list of known hosts is a security feature. However, less manual ways to populate the known_hosts file would be needed for clusters with thousands of nodes (a sketch of one appears below). Fortunately, we only have 5 nodes. At any rate, /root/.ssh/known_hosts should now consist of 12 lines--two for each computer in the cluster including the B+. So that this manual procedure never has to be done again, move the known_hosts file to the system directory with the command

# mv /root/.ssh/known_hosts /etc/ssh/ssh_known_hosts
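As an aside, the scripted approach mentioned above could use ssh-keyscan to collect all the host keys without any interactive prompts, along the lines of

# ssh-keyscan snail s0 s1 s2 s3 s4 >/etc/ssh/ssh_known_hosts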

Now, you should be able to login and logout as root to any of the nodes simply by typing

# ssh s0
Linux s0 4.9.59+ #1047 Sun Oct 29 11:47:10 GMT 2017 armv6l
Last login: Mon Jan 8 11:44:26 2018 from 192.168.7.2
# exit
logout
Connection to s0 closed.

with no further prompts. We use this ability to create one final script /x/sreboot which automates putting the Zeros into rpiboot mode.

Code: Select all

#!/bin/bash
killall rpiboot
for i in s0 s1 s2 s3 s4
do
    echo ssh -n root@$i /sbin/reboot
    ssh -n root@$i /sbin/reboot &
done
Note that sreboot kills rpiboot so the Pi Zeros reboot only to the point of waiting for rpiboot to feed them the kernel and initial RAM file system. In the next post we'll update the system and add a tool for measuring network speed and then automatically propagate those updates to the Zeros.

Gavinmc42
Posts: 7314
Joined: Wed Aug 28, 2013 3:31 am

Re: Super-cheap Computing Cluster for Learning

Tue Jan 09, 2018 3:29 am

OK I'm convinced, will get some more Zero's and give this a try.
Thanks for documenting it all; the Cluster HAT page is a bit hard to understand.
Just download and copy is not learning ;)

Portable Cluster, hmm nice case, Brass and wood Steam Punk?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Tue Jan 09, 2018 5:50 am

Gavinmc42 wrote:
Tue Jan 09, 2018 3:29 am
OK I'm convinced, will get some more Zero's and give this a try.
Thanks for documenting it all, Cluster hat's page is a bit hard to understand.
Just download and copy is not learning ;)

Portable Cluster, hmm nice case, Brass and wood Steam Punk?
I like the steam punk idea. Please post photos if you go that route. The method for setting up a cluster described here should also work fine with the cluster hat. I'm looking forward to hearing how you get on with my instructions. Please let me know of any omissions or places where they are unclear.
Last edited by ejolson on Fri Jan 19, 2018 5:11 pm, edited 3 times in total.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Tue Jan 09, 2018 6:00 am

At this point the cluster boots. To install updates and system software one can update the B+ using apt-get, halt the Zeros, create new snapshots of the updated system and then reboot the Zeros. While apt-get is clever enough to install most software without rebooting, we must reboot the Zeros, though not the B+, in order to propagate the changes. For a large cluster consisting of thousands of machines, rebooting each computational node for a software update could be problematic. While new system software is seldom installed on a production system, security updates are. In our current configuration, the compute nodes are completely isolated from the Internet, secured behind the B+ server. Thus, security updates can likely wait until a scheduled maintenance period. Moreover, since there are so few nodes in our cluster, there is no problem rebooting them to install new system software. An alternative would be a NAT firewall that enables each node to masquerade as the server and fetch updates over the Internet using apt-get without rebooting. While the nodes on many supercomputers are set up so they can reach the Internet, for simplicity the present configuration avoids this.
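For reference, the NAT alternative on the B+ would amount to little more than enabling forwarding and masquerading on the outside interface, something like the following, where eth0 is the interface that reaches the Internet:

# sysctl -w net.ipv4.ip_forward=1
# iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE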

As an example we proceed by updating, installing iperf and then testing the network speed of the USB Ethernet gadgets. Begin with the commands

# cd /x
# ./sreboot
ssh -n root@s0 /sbin/reboot
ssh -n root@s1 /sbin/reboot
ssh -n root@s2 /sbin/reboot
ssh -n root@s3 /sbin/reboot
ssh -n root@s4 /sbin/reboot
Connection to s1 closed by remote host.
Connection to s4 closed by remote host.
Connection to s2 closed by remote host.
Connection to s0 closed by remote host.
Connection to s3 closed by remote host.
# apt-get update
# apt-get upgrade
# apt-get install iperf

Now check that the Zeros are all down by typing

# bridge link

If there are still active links, wait and try again in a minute (or use the polling loop sketched below). After "bridge link" prints nothing, continue by typing

# cd /x
# ./supdate
# ./szeroup
Starting rpiboot to boot nodes...
rpiboot: no process found
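
As an aside, the wait before running supdate can be automated: the following one-line loop polls bridge link until no links remain. It is only a convenience sketch using the same bridge command as above.

# while [ -n "$(bridge link)" ]; do sleep 5; done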

We are now ready to test the network speed. While everything related to configuring the cluster so far has been performed as the superuser, we can test the network speed as a regular user. If you haven't already done so, now might be a good time to enable ssh login without passwords from the B+ to the Zeros for your user account. This is similar to, but easier than, how we configured the superuser account, because user home directories are shared as the NFS export /x/snaa, which is mounted on all the Zeros. Type

$ ssh-keygen
$ cd
$ cd .ssh
$ cat id_rsa.pub >>authorized_keys
$ chmod 600 authorized_keys
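
Before running iperf, it may be worth confirming that passwordless login now works on every node. The following one-line check is only a sketch; the BatchMode option makes ssh fail rather than prompt if a password would still be required.

$ for i in s0 s1 s2 s3 s4; do ssh -o BatchMode=yes $i hostname; done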

Now, open two windows and log into the cluster from each window using your regular username. In the first window type

$ iperf -s

and in the second window type

$ ssh s0
$ iperf -c snail -d -t 240

The resulting output looks like

Code: Select all

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to snail, TCP port 5001
TCP window size:  246 KByte (default)
------------------------------------------------------------
[  3] local 192.168.7.33 port 47790 connected with 192.168.7.2 port 5001
[  5] local 192.168.7.33 port 5001 connected with 192.168.7.2 port 43010
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-240.0 sec  2.36 GBytes  84.4 Mbits/sec
[  5]  0.0-240.0 sec  2.46 GBytes  87.9 Mbits/sec
which indicates about 85 Mbits/sec simultaneously in each direction, for a total bidirectional bandwidth of 172.3 Mbits/sec between a Zero and the B+. The observed bandwidth is essentially the same as what one would expect from a real 100 Mbit/sec Ethernet connection.
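
As a sanity check on these numbers, recall that iperf reports transfer in binary gigabytes, so 2.36 GBytes over 240 seconds works out to 2.36 × 2^30 × 8 / 240 ≈ 84 × 10^6 bits/sec, consistent with the reported bandwidth.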

We now measure the bandwidth between two Zeros. This is important for parallel MPI jobs in which the computational nodes exchange data with one another. Return to the first window, press control-C to stop the iperf server and then type

$ ssh s1
$ iperf -s

to restart it on the second Pi Zero. Now, in the second window type

$ iperf -c s1 -d -t 240

The resulting output looks like

Code: Select all

------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to s1, TCP port 5001
TCP window size:  164 KByte (default)
------------------------------------------------------------
[  3] local 192.168.7.33 port 40830 connected with 192.168.7.37 port 5001
[  5] local 192.168.7.33 port 5001 connected with 192.168.7.37 port 40060
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-240.0 sec  2.38 GBytes  85.0 Mbits/sec
[  3]  0.0-240.1 sec  1.16 GBytes  41.5 Mbits/sec
which indicates a factor-of-two drop in performance in one direction. This is unsurprising because each packet travelling from one Zero to the other must cross the USB bus to the network bridge on the B+ and then travel back out over USB.

We finish by checking the ping latency between two Zeros and between the B+ and a Zero. In the second window type

Code: Select all

$ hostname
s0
$ ping s0
PING s0 (192.168.7.33) 56(84) bytes of data.
64 bytes from s0 (192.168.7.33): icmp_seq=1 ttl=64 time=0.168 ms
64 bytes from s0 (192.168.7.33): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from s0 (192.168.7.33): icmp_seq=3 ttl=64 time=0.162 ms
^C
--- s0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.162/0.164/0.168/0.015 ms
$ ping snail
PING snail (192.168.7.2) 56(84) bytes of data.
64 bytes from snail (192.168.7.2): icmp_seq=1 ttl=64 time=0.620 ms
64 bytes from snail (192.168.7.2): icmp_seq=2 ttl=64 time=0.489 ms
64 bytes from snail (192.168.7.2): icmp_seq=3 ttl=64 time=0.522 ms
^C
--- snail ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 0.489/0.543/0.620/0.061 ms
$ ping s1
PING s1 (192.168.7.37) 56(84) bytes of data.
64 bytes from s1 (192.168.7.37): icmp_seq=1 ttl=64 time=0.797 ms
64 bytes from s1 (192.168.7.37): icmp_seq=2 ttl=64 time=0.678 ms
64 bytes from s1 (192.168.7.37): icmp_seq=3 ttl=64 time=0.731 ms
^C
--- s1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 0.678/0.735/0.797/0.053 ms
For reference, it is worth mentioning that the ping time between two Pi 3Bs over regular Ethernet is about the same as between the B+ and the Zero. Unsurprisingly, the latency between two Zeros is about twice that.

We close by noting that we have used jumbo packets with an MTU size of 7418 to obtain the measurements reported here. The resulting bandwidth is much lower if the default MTU size of 1500 is used instead. It would be interesting to tune the cluster by finding the MTU size that yields the greatest bandwidth and to see how that MTU affects ping latency when the network is under load.
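
For anyone who wants to try, a rough sketch of such a scan appears below. It assumes the gadget interface on the Zero is named usb0, that an iperf server is already running on snail, and that the corresponding bridge port on the B+ is set to a matching MTU; all of these are assumptions about your setup, so adapt the names as needed.

Code: Select all

#!/bin/bash
# sketch: measure iperf bandwidth from s0 to the server at several MTU sizes
# assumes iperf -s is already running on snail and that the gadget
# interface on the Zero is named usb0 (both are assumptions)
for mtu in 1500 3000 5000 7418
do
    ssh -n root@s0 ip link set dev usb0 mtu $mtu
    echo "MTU $mtu:"
    ssh -n s0 iperf -c snail -t 30 | tail -n 1
done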

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Tue Jan 09, 2018 8:24 pm

In the previous post we measured networking latency using ping and bandwidth using iperf. In conclusion, the USB Ethernet gadget performs similarly to the physical 100 Mbit/sec Ethernet built into all B models of the Raspberry Pi. This is not surprising because the physical Ethernet on the B models is internally connected over the USB bus. At the same time, the ping latency between two Pis is about double the ping latency between two 15-year-old 1.4 GHz AMD Athlon computers.

The original bill of materials for this project included five SD cards for the Pi Zero computers. Although the Zeros work fine without cards, there may be a benefit in using them for extra swap and scratch space. I have updated the original post to indicate that purchasing SD cards for the Zero computers is optional. If you have the cards please keep reading; otherwise, the rest of this post is optional.

We shall format the cards one by one using an external USB SD card reader plugged into the B+. Please insert the card and reader, then type

# fdisk /dev/sda

Make sure that /dev/sda refers to the correct device. Then delete any existing partitions and create new ones according to this table

Code: Select all

Disk /dev/sda: 7.5 GiB, 8019509248 bytes, 15663104 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x37665771

Device    Boot    Start      End  Sectors  Size Id Type
/dev/sda1          2048   100351    98304   48M  c W95 FAT32 (LBA)
/dev/sda2        100352 13565951 13465600  6.4G 83 Linux
/dev/sda4      13565952 15663103  2097152    1G 82 Linux swap / Solaris
Make sure the swap partition is 1G in size. The FAT partition should be small and the regular Linux partition /dev/sda2 should fill the rest of the card. We do not use the FAT partition; however, its presence is required because otherwise the Zero gets confused and locks up before it has a chance to switch to the USB rpiboot protocol. After writing this partition table to the SD card, format the partitions using

# mkdosfs -n BOOT /dev/sda1
# mke2fs -t ext4 -L SCRATCH /dev/sda2
# mkswap -L SWAP /dev/sda4
# sync; sync; sync
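
If you would rather not repeat the interactive fdisk session for every card, the same table can be written non-interactively with sfdisk. The sketch below encodes exactly the layout shown above for an 8GB card; adjust the sizes if your cards differ.

Code: Select all

# sketch: write the partition table above without the interactive fdisk session
sfdisk /dev/sda <<EOF
label: dos
/dev/sda1 : start=2048, size=98304, type=c
/dev/sda2 : start=100352, size=13465600, type=83
/dev/sda4 : start=13565952, size=2097152, type=82
EOF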

There is no need to copy any files to the SD card. Simply remove it and repeat the above process for each card. Now shut down the cluster, insert the cards and power it back on. To shut the cluster down type

# cd /x
# ./sreboot

wait for all Ethernet gadgets to detach from the bridge and then type

# /sbin/halt

and wait for the B+ to halt. Unplug the power from the USB hub, insert the SD cards into the Zero computers and then plug the power back in. The cluster should boot up, and the fstab we created earlier should automatically mount the SD cards in each Zero. You can check that the SD cards are mounted as a regular user by typing the following script directly into the terminal

$ for i in s0 s1 s2 s3 s4
> do
> echo Checking Zero node $i...
> ssh $i "mount | grep scratch; cat /proc/swaps"
> done

Above is an example of how to execute the same sequence of commands on each of the Pi Zeros without logging in to each one individually. We will set up the Slurm batch scheduling system later to make launching computational jobs on the Zeros easier. If everything is fine, the output should look like

Code: Select all

Checking Zero node s0...
/dev/mmcblk0p2 on /x/scratch type ext4 (rw,relatime,data=ordered)
Filename                Type        Size    Used    Priority
/dev/mmcblk0p4                          partition   1048572 0   -1
Checking Zero node s1...
/dev/mmcblk0p2 on /x/scratch type ext4 (rw,relatime,data=ordered)
Filename                Type        Size    Used    Priority
/dev/mmcblk0p4                          partition   1048572 0   -1
Checking Zero node s2...
/dev/mmcblk0p2 on /x/scratch type ext4 (rw,relatime,data=ordered)
Filename                Type        Size    Used    Priority
/dev/mmcblk0p4                          partition   1048572 0   -1
Checking Zero node s3...
/dev/mmcblk0p2 on /x/scratch type ext4 (rw,relatime,data=ordered)
Filename                Type        Size    Used    Priority
/dev/mmcblk0p4                          partition   1048572 0   -1
Checking Zero node s4...
/dev/mmcblk0p2 on /x/scratch type ext4 (rw,relatime,data=ordered)
Filename                Type        Size    Used    Priority
/dev/mmcblk0p4                          partition   1048572 0   -1
Please check that /dev/mmcblk0p2 and /dev/mmcblk0p4 are listed for each Zero.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Fri Jan 19, 2018 5:16 am

For completeness, here is a summary of a discussion concerning powering the Pi Zeros that occurred outside this forum thread. My suggestions for powering a Pi Zero are

1. If using gadget mode, power the Zero by back-feeding power through the USB data port and leave the USB power connector empty.

2. If using host mode with an OTG adapter, power the Zero through the USB power connector.

The goal is to prevent separate sources of power from entering the Zero through both the USB data and USB power connectors at the same time.

For the cluster depicted above, the Pi Zeros run in gadget mode, so power is back-fed through the USB data port. This was done by connecting the USB data ports of the Zeros to a powered USB hub. An additional USB cable was connected from one of the charging ports on the hub to the power connector of the B+. A similar arrangement should work when a Raspberry Pi 2B or 3B is used in place of the B+. In the case of the 3B, I would recommend underclocking the CPU to 900 MHz for reliability and power saving by placing the line arm_freq=900 in /boot/config.txt, as shown below. Note, however, that the Zeros should still run at their default frequency, so only change the version of config.txt in the FAT-formatted boot partition and not the one in /x/sboot/boot used for the rpiboot subvolumes.
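
In other words, the only change on the head node amounts to one line:

Code: Select all

# /boot/config.txt on the 3B head node only, not in /x/sboot/boot
arm_freq=900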

I believe a non-powered hub may be used instead of a powered hub. It would be great if someone on this forum could confirm that the following setup works. First, power the Pi B+, 2B or 3B directly from a sufficient power supply. Enable max_usb_current=1 in /boot/config.txt and then connect the Zeros, using their USB data connectors, to the ports of the non-powered USB hub. Still leave the USB power connector empty on each of the Zeros. Power from the Pi functioning as the head node will flow into the non-powered USB hub and then back-feed the Zeros through their data ports. Zeros don't take much power, so multiple Zeros should be able to draw enough power from a single USB port on the head node through the non-powered hub. Again, it would be nice if someone could confirm that this works.

It may be possible to provide power through the USB power connector while the Zero is in gadget mode. In this case you must figure out a way to prevent power from also back-feeding through the USB data connector. One idea would be to modify each of the USB data cables leading to the Zeros by surgically cutting the 5V wire in each cable. If you don't, you may set up a current loop that damages something. As far as I know this method is untested, and in my opinion modifying cables is too much work and prone to error. If anyone on this forum would like to experiment with this method, please post your results here. Otherwise, it is probably best to leave the USB power connector of the Zero empty when running in gadget mode and to employ a powered hub if necessary.
Last edited by ejolson on Fri Jan 19, 2018 5:20 pm, edited 1 time in total.

User avatar
rpdom
Posts: 21505
Joined: Sun May 06, 2012 5:17 am
Location: Chelmsford, Essex, UK

Re: Super-cheap Computing Cluster for Learning

Fri Jan 19, 2018 5:56 am

ejolson wrote:
Fri Jan 19, 2018 5:16 am
I believe a non-powered hub may be used instead of a powered hub. It would be great if someone on this forum could confirm that the following setup works. First, power the Pi B+, 2B or 3B directly from a sufficient power supply. Enable max_usb_current=1 in /boot/config.txt and then connect the Zeros, using their USB data connectors, to the ports of the non-powered USB hub. Still leave the USB power connector empty on each of the Zeros. Power from the Pi functioning as the head node will flow into the non-powered USB hub and then back-feed the Zeros through their data ports. Zeros don't take much power, so multiple Zeros should be able to draw enough power from a single USB port on the head node through the non-powered hub. Again, it would be nice if someone could confirm that this works.
I am pretty sure that you are right. I am working towards a similar setup, but I haven't had time to finish it yet.

The max_usb_current=1 parameter is obsolete; it has been the default setting in Raspbian for some time now.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Fri Jan 19, 2018 5:01 pm

rpdom wrote:
Fri Jan 19, 2018 5:56 am
The max_usb_current=1 parameter is obsolete now. It has been the default setting in Raspbian for some time now.
Thanks for the update. I hope you get your Zero cluster working soon. Please let me know if you find any problems with what I've written here or you discover a better way to do things.

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Fri Jan 19, 2018 5:08 pm

It is important for security and usability that time be synchronized between the nodes of the cluster. We do this using the ntp time synchronization utility. Since the Zeros are on a separate network, they can't reach the Debian time-server pool. Therefore, they synchronize their clocks from the B+, which is named snail in my cluster. Note that if a NAT firewall had been configured earlier so the computational nodes could reach the Internet, then a customized ntp.conf file would not be needed for the Zeros. First type

# apt-get install ntpdate ntp
# cp /etc/ntp.conf /x/sproto

Then customize /x/sproto/ntp.conf so the relevant lines read as

Code: Select all

#pool 0.debian.pool.ntp.org iburst
#pool 1.debian.pool.ntp.org iburst
#pool 2.debian.pool.ntp.org iburst
#pool 3.debian.pool.ntp.org iburst
server snail
Finally, edit the /x/supdate script to add a line which copies the customized ntp.conf file into the root filesystem snapshot of each Zero. The new version of the script should read

Code: Select all

#!/bin/bash
# Recreate the root filesystem snapshot for each Pi Zero and
# customize the per-node configuration files inside it.
for i in s0 s1 s2 s3 s4
do
    echo Configuring $i...
    btrfs sub del $i    # delete the stale snapshot
    btrfs sub snap / $i # take a fresh snapshot of the B+ root
    (
        # make sure /x/scratch exists in the snapshot as an
        # empty mount point for the SD card
        if cd $i/x
        then
            rmdir scratch
            mkdir scratch
        fi
    )
    echo $i >$i/etc/hostname           # per-node hostname
    cp sproto/fstab $i/etc/fstab       # fstab with the SD-card mounts
    cp sproto/ntp.conf $i/etc/ntp.conf # synchronize time from the B+
    rm $i/etc/exports                  # the Zeros export nothing over NFS
done
The time across the cluster will be synchronized after shutting down the cluster, updating the root filesystem snapshots and rebooting. Do this with

# cd /x
# ./sreboot

Now, wait for all the Zeros to disconnect from the bridge. The command

# bridge link

will report nothing when all the Zeros have finished shutting down and are waiting for rpiboot. Finally, create new snapshots of the root filesystem for the Zeros and then reboot the B+ server with

# cd /x
# ./supdate
# sync; sync; sync
# /sbin/reboot

After the system reboots, time across the cluster should be synchronized. A rough demonstration of this can be obtained by typing

$ for i in s0 s1 s2 s3 s4; do ssh $i "date" & done
Fri 19 Jan 08:45:10 PST 2018
Fri 19 Jan 08:45:10 PST 2018
Fri 19 Jan 08:45:10 PST 2018
Fri 19 Jan 08:45:10 PST 2018
Fri 19 Jan 08:45:10 PST 2018

As indicated, the dates should agree to within a second. We remark that ntp could have been installed earlier, at the same time as nfs-kernel-server. That was not done in this thread due to a lack of foresight; however, this post functions as another example of system maintenance and how to install software.
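
For a more precise check than comparing dates, the ntpq utility included with the ntp package can query the daemon on each Zero; the offset column of its output shows the clock offset from snail in milliseconds. A quick sketch:

$ for i in s0 s1 s2 s3 s4; do echo Checking $i...; ssh $i ntpq -p; done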

ejolson
Posts: 10725
Joined: Tue Mar 18, 2014 11:47 am

Re: Super-cheap Computing Cluster for Learning

Sat Jan 20, 2018 5:20 am

In this post we set up the Slurm Workload Manager. Slurm is a scalable cluster management and job scheduling system that is used on some of the largest supercomputers in the world and which also works well on very small Linux clusters. The method for installing Slurm will be similar to how we installed ntp. An outline of this method is to first configure the B+, then shut down the Zeros, create new snapshots of the root file system and finally reboot the Zeros. Begin by typing

# apt-get install slurm-llnl
# cd /etc/slurm-llnl
# cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz .
# gunzip slurm.conf.simple.gz
# mv slurm.conf.simple slurm.conf

Now edit the file /etc/slurm-llnl/slurm.conf to make the following changes

Code: Select all

ControlMachine=snail
FastSchedule=2
SelectType=select/cons_res
SelectTypeParameters=CR_Core
. . .
NodeName=s[0-4] Procs=1 State=UNKNOWN
PartitionName=zero Nodes=s[0-4] Default=YES MaxTime=INFINITE State=UP
Shut down the Zeros with

# cd /x
# ./sreboot

wait for the Zeros to disconnect from the network bridge and then type

# cd /x
# ./supdate
# /sbin/reboot

After the system reboots, check the status of Slurm by typing

Code: Select all

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
zero*        up   infinite      5   idle s[0-4]
If any of the Zeros are down, you can set their states to idle using the commands

# scontrol
Slurmctld(primary/backup) at snail/(NULL) are UP/DOWN
scontrol: update nodename=s0 state=idle
scontrol: update nodename=s1 state=idle
scontrol: update nodename=s2 state=idle
scontrol: update nodename=s3 state=idle
scontrol: update nodename=s4 state=idle
scontrol: quit
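
The same update can be issued non-interactively using Slurm's hostlist syntax; this sketch should be equivalent to the interactive session above:

# scontrol update nodename=s[0-4] state=idle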

At this point check that everything is working properly with a simple test

$ srun -N5 hostname
s2
s1
s4
s3
s0

Note that the order in which the hostnames appear is not important. If resources are unavailable, srun will immediately exit with an error message. Since there are only five nodes in the cluster, we obtain

$ srun -N6 hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

The main advantage of Slurm over simply logging into the Zeros using ssh is the sbatch command, which waits until resources are available before starting long-running computations. That will be the topic of the next post.
Last edited by ejolson on Sat Feb 10, 2018 8:57 pm, edited 1 time in total.
