1. Remote Machine Learning

State-of-the-art machine learning is compute hungry. Often our local notebooks or desktops are insufficient to power recent models. In this class we’ll learn how to use a dedicated compute cluster for our machine learning workloads. This is a crucial skill to have in our toolbox and is best learned by starting small. We’ll harness the Draco cluster at Friedrich Schiller University Jena to learn about remote machine learning.

1.1. Raspberry Pis

In class we will use a set of Raspberry Pis which act as our desktops. To access them you need a KSZ account, which you might still have to apply for. For further information, please visit KSZ’s homepage. The KSZ account allows you to log into the Raspberry Pis in the lab room:

  1. You should see an Ubuntu login screen. Don’t enter your credentials here. First, press Ctrl + Alt + F2.

  2. Now a shell interface opens. Here, you can provide your username, press enter, put in your KSZ password and press enter again.

  3. A few lines of text will pop up. You can ignore them. Press enter one more time.

  4. Next, type the command startx and press enter. Now you are all set up. Have fun! 😀

After finishing your work, you need to log out of the device:

  1. In the bottom right corner of the screen, press the red power button.

  2. A pop-up will open. Press Logout.

  3. Now you are back in the shell. Just type exit, press enter and you’re done!

Attention

Please don’t shut the Pis down!

Tasks

  1. Log into one of the lab room’s Raspberry Pis.

  2. Open a terminal and run the two commands hostname and lscpu.

  3. Log out from the Pi.
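The information printed by hostname and lscpu can also be queried from Python. The following is a minimal sketch using only the standard library; the function name describe_machine is a hypothetical helper chosen for illustration, and the values reported depend on the machine you run it on.

```python
import os
import platform
import socket


def describe_machine() -> dict:
    """Collect a few facts similar to what `hostname` and `lscpu` report."""
    return {
        "hostname": socket.gethostname(),     # same name the hostname command prints
        "architecture": platform.machine(),   # for example aarch64 on a 64-bit ARM system
        "logical_cpus": os.cpu_count(),       # may be None if it cannot be determined
    }


if __name__ == "__main__":
    for key, value in describe_machine().items():
        print(f"{key}: {value}")
```

Running the script on a Pi and later on a Draco compute node gives a quick side-by-side view of how different the two machines are.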

1.2. Draco

Requesting Access

As with all other compute clusters, we have to go through a few steps before we are able to log into the machine. We’ll get started on the process early so that our accounts are ready once we need them.

Tasks

  1. Add your URZ username to the list shared in the labs.

  2. Wait for the confirmation that you are good to go, then test your account as outlined in First Steps.

First Steps

Perfect, everything is set up. On paper we are good to go: Let’s take the machine for a spin!

Tasks

  1. Use your URZ username to log into Draco:

    ssh -X <username>@login1.draco.uni-jena.de
    
  2. Allocate a compute node for three hours:

    salloc -N 1 -p short -t 03:00:00
    

    Get the id of your job and the name of your node through the squeue command. Use squeue -u <username> to only get your user’s jobs.

  3. Connect to the compute node in a second terminal. For example, if you got node node009, do the following:

    ssh node009
    
  4. Print basic information on the node from /proc/cpuinfo and /proc/meminfo. Also try the lscpu tool.

  5. Try the module system. module help should get you started.

  6. Leave the compute node and (if needed) release your job by providing its id to the scancel command.
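The files /proc/cpuinfo and /proc/meminfo from step 4 are plain text, so they are easy to summarize programmatically. The sketch below assumes the usual Linux layout of "key: value" lines; the helper names parse_meminfo and count_processors are hypothetical and not part of any Draco tooling.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Map each field of /proc/meminfo-style text to its integer value.

    Most fields are given in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.strip():
            continue  # skip blank or malformed lines
        values[key.strip()] = int(rest.split()[0])
    return values


def count_processors(text: str) -> int:
    """Count the 'processor' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in text.splitlines():
        key, sep, _ = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    cpuinfo = Path("/proc/cpuinfo")
    if meminfo.exists() and cpuinfo.exists():  # only present on Linux
        mem = parse_meminfo(meminfo.read_text())
        print(f"MemTotal: {mem.get('MemTotal', 0) / 1024**2:.1f} GiB")
        print(f"Logical CPUs: {count_processors(cpuinfo.read_text())}")
```

On a compute node the reported CPU count should agree with what lscpu shows.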

Installing PyTorch through Pip

In the coming weeks we’ll use PyTorch, which is one of the most popular machine learning frameworks. This section gets us started: we install PyTorch locally through pip using one of Draco’s compute nodes.

Tasks

  1. Allocate a compute node in the short queue and connect to the node.

  2. Load the tools/python/3.8 module:

    module load tools/python/3.8
    
  3. Install the virtualenv tool which we’ll use for the creation of virtual environments:

    pip install virtualenv
    
  4. Make the local tools available in your session:

    export PATH=${HOME}/.local/bin:${PATH}
    
  5. Create a new virtual environment with the name venv_pytorch, activate it and install PyTorch:

    virtualenv venv_pytorch
    source venv_pytorch/bin/activate
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
    
  6. Start a Python command prompt, load PyTorch, check that the reported version is 2.0 or newer and exit the prompt:

    python
    import torch
    print( torch.__version__ )
    exit()
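The interactive check in step 6 can also be written as a small script, which is handy when testing the environment from a batch job. This is only a sketch: the helper names version_tuple and check_torch are hypothetical, and the minimum version (2, 0) is an assumption taken from the task above.

```python
import importlib.util


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.0.1+cpu' into (2, 0).

    The patch level and the local suffix after '+' are ignored.
    """
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def check_torch(minimum=(2, 0)) -> bool:
    """Report whether torch is importable and at least `minimum`."""
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
        return False
    import torch  # imported late so the script also runs without torch

    ok = version_tuple(torch.__version__) >= minimum
    print(f"torch {torch.__version__}: {'ok' if ok else 'too old'}")
    return ok


if __name__ == "__main__":
    check_torch()
```

Run the script with the virtual environment activated; otherwise it will most likely report that torch is not installed.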