4. Multilayer Perceptron
One of the simplest neural network architectures is the multilayer perceptron. Fashion MNIST is one of the simplest commonly available datasets. We get started with “real” neural networks in PyTorch by training a multilayer perceptron on Fashion MNIST 😎.
Before going into the actual ML-related parts, we typically have to work on our data first. Preparing the data before doing anything in a machine learning framework might turn out to be one of the most time-consuming tasks. This is especially true for “unique” datasets common in scientific machine learning, where we often have to find custom solutions. Luckily, Fashion MNIST is one of the standard examples and somebody else, i.e., Zalando SE, prepared everything for us. Still, we have to find a way to load our data into PyTorch and to potentially preprocess and postprocess it. Once all of this infrastructure is in place, we’ll define our model and train it.
Hint
The Quickstart tutorial conceptually does the same thing as this lab: Train a multilayer perceptron on Fashion MNIST.
4.1. Datasets and Data Loaders
As in Section 3, we’ll first have a brief look at our data. Then we wrap the Fashion MNIST dataset in a data loader.
Tasks
Create a training and test dataset by calling torchvision.datasets.FashionMNIST. Use the transformation torchvision.transforms.ToTensor() to convert the data to PyTorch tensors.
Visualize a few images and show their labels. Use Matplotlib for your plots. Save your visualizations in a PDF file by using PdfPages. An example is given in the Multipage PDF demo.
Wrap the datasets into data loaders. Use torch.utils.data.DataLoader for this.
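As a rough sketch, the first and third task could be approached as follows. The root directory data, the batch size of 64, and the variable names are assumptions, not requirements:

import torch
import torchvision

# training and test splits of Fashion MNIST (storage location "data" is an assumption)
l_data_train = torchvision.datasets.FashionMNIST( root      = "data",
                                                  train     = True,
                                                  download  = True,
                                                  transform = torchvision.transforms.ToTensor() )
l_data_test  = torchvision.datasets.FashionMNIST( root      = "data",
                                                  train     = False,
                                                  download  = True,
                                                  transform = torchvision.transforms.ToTensor() )

# wrap the datasets into data loaders (batch size 64 is an assumption)
l_loader_train = torch.utils.data.DataLoader( l_data_train,
                                              batch_size = 64,
                                              shuffle    = True )
l_loader_test  = torch.utils.data.DataLoader( l_data_test,
                                              batch_size = 64 )

For the second task, a few images could be written to a PDF along these lines; the file name data_samples.pdf and the number of plotted images are assumptions:

from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

with PdfPages( "data_samples.pdf" ) as l_pdf_file:
    for l_id in range( 4 ):
        l_image, l_label = l_data_train[l_id]
        l_fig = plt.figure()
        plt.imshow( l_image.squeeze(), cmap = "gray" )
        plt.title( "label: " + str(l_label) )
        l_pdf_file.savefig( l_fig )
        plt.close( l_fig )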
4.2. Training and Validation
We are ready to train our first “real” neural net in PyTorch! We’ll train a MultiLayer Perceptron (MLP). Our MLP assumes input images with 28 \(\times\) 28 pixels and 10 output classes. Initially, our MLP will use the following network architecture:
Fully-connected layer with 28 \(\times\) 28 input features and 512 output features;
ReLU activation;
Fully-connected layer with 512 input features and 512 output features;
ReLU activation; and
Fully-connected layer with 512 input features and 10 output features.
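A possible realization of this architecture as a torch.nn.Module is sketched below. Treat it as orientation only; the class name, member names, and the use of torch.nn.Sequential are assumptions, not the required implementation:

import torch

## Multilayer perceptron for Fashion MNIST (sketch, names are assumptions).
class Model( torch.nn.Module ):
    def __init__( self ):
        super().__init__()
        # flatten the 28 x 28 input images to vectors with 784 entries
        self.m_flatten = torch.nn.Flatten()
        # fully-connected layers with ReLU activations in between
        self.m_layers = torch.nn.Sequential( torch.nn.Linear( 28*28, 512 ),
                                             torch.nn.ReLU(),
                                             torch.nn.Linear( 512, 512 ),
                                             torch.nn.ReLU(),
                                             torch.nn.Linear( 512, 10 ) )

    ## Forward pass of the model.
    # @param i_input batch of input images.
    # @return logits for the ten classes.
    def forward( self, i_input ):
        l_tmp = self.m_flatten( i_input )
        return self.m_layers( l_tmp )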
For the training procedure we’ll also have to think about a loss function and an optimizer. For now:
Use torch.nn.CrossEntropyLoss as your loss function; and
Use torch.optim.SGD as your optimizer.
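Instantiating the two could look as follows; the learning rate is an assumption which you will likely want to tune:

import torch

l_model = Model()

# cross-entropy loss for the ten-class classification problem
l_loss_func = torch.nn.CrossEntropyLoss()

# stochastic gradient descent on the model's parameters (learning rate is an assumption)
l_optimizer = torch.optim.SGD( l_model.parameters(),
                               lr = 1e-3 )

Next, let’s structure the training and testing procedures. The following two listings provide templates: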
## Trains the given MLP model.
# @param i_loss_func used loss function.
# @param io_data_loader data loader containing the data to which the model is applied (single epoch).
# @param io_model model which is trained.
# @param io_optimizer used optimizer.
# @return summed loss over all training samples.
def train( i_loss_func,
           io_data_loader,
           io_model,
           io_optimizer ):
    # switch model to training mode
    io_model.train()

    l_loss_total = 0

    # TODO: finish implementation

    return l_loss_total
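For orientation, the TODO in the training template could be completed along the following lines. This is one possible sketch; it assumes that the data loader yields batches of (inputs, labels):

# possible completion of the TODO in train (assumes batches of (inputs, labels))
for l_inputs, l_labels in io_data_loader:
    # forward pass
    l_predictions = io_model( l_inputs )
    l_loss = i_loss_func( l_predictions, l_labels )
    l_loss_total += l_loss.item()

    # backward pass and parameter update
    io_optimizer.zero_grad()
    l_loss.backward()
    io_optimizer.step()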
import torch

## Tests the model.
# @param i_loss_func used loss function.
# @param io_data_loader data loader containing the data to which the model is applied.
# @param io_model model which is tested.
# @return summed loss over all test samples, number of correctly predicted samples.
def test( i_loss_func,
          io_data_loader,
          io_model ):
    # switch model to evaluation mode
    io_model.eval()

    l_loss_total = 0
    l_n_correct = 0

    # TODO: finish implementation

    return l_loss_total, l_n_correct
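The TODO in the test template could be sketched analogously; gradient tracking is disabled as recommended in the hint below:

# possible completion of the TODO in test (assumes batches of (inputs, labels))
with torch.no_grad():
    for l_inputs, l_labels in io_data_loader:
        l_predictions = io_model( l_inputs )
        l_loss_total += i_loss_func( l_predictions, l_labels ).item()

        # the class with the highest score is the prediction
        l_n_correct += ( l_predictions.argmax(1) == l_labels ).sum().item()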
Listing 4.2.1 and Listing 4.2.2 contain two templates for the modules eml.mlp.trainer and eml.mlp.tester. We’ll use these to avoid spaghetti code by separating our training and testing procedures from the main code.
Tasks
Implement the class Model in the module eml.mlp.model which contains the MultiLayer Perceptron (MLP).
Implement the training loop and print the total training loss after every epoch. For the time being, implement the training loop directly in your main function.
Move the training loop to the module eml.mlp.trainer. Use the template in Listing 4.2.1 to guide your implementation.
Implement the module eml.mlp.tester. Use the template in Listing 4.2.2 to guide your implementation. The module’s only function test simply applies the MLP to the given data and returns the obtained total loss and number of correctly predicted samples.
Hint
When testing your model, switch to evaluation mode through nn.Module.eval() and locally disable gradient tracking through torch.no_grad. Don’t forget to switch back to training mode afterwards if needed. Further information is available from the article Autograd mechanics.
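In code, the pattern from the hint boils down to a few lines; l_inputs stands for an arbitrary input batch:

io_model.eval()        # switch to evaluation mode
with torch.no_grad():  # locally disable gradient tracking
    l_predictions = io_model( l_inputs )
io_model.train()       # switch back to training mode if needed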
4.3. Visualization
Now let’s once again professionalize our visualization efforts. This means that we provide some data to our MLP and infer Fashion MNIST’s classes by applying our trained model. The input data are then plotted together with the obtained labels.
import torch
import matplotlib.pyplot as plt

## Converts a Fashion MNIST numeric id to a string.
# @param i_id numeric value of the label.
# @return string corresponding to the id.
def toLabel( i_id ):
    l_labels = [ "T-Shirt",
                 "Trouser",
                 "Pullover",
                 "Dress",
                 "Coat",
                 "Sandal",
                 "Shirt",
                 "Sneaker",
                 "Bag",
                 "Ankle Boot" ]

    return l_labels[i_id]

## Applies the model to the data and plots the data.
# @param i_off offset of the first image.
# @param i_stride stride between the images.
# @param io_data_loader data loader from which the data is retrieved.
# @param io_model model which is used for the predictions.
# @param i_path_to_pdf optional path to an output file; if given, nothing is shown at runtime.
def plot( i_off,
          i_stride,
          io_data_loader,
          io_model,
          i_path_to_pdf = None ):
    # switch to evaluation mode
    io_model.eval()

    # create pdf if required
    if( i_path_to_pdf is not None ):
        import matplotlib.backends.backend_pdf
        l_pdf_file = matplotlib.backends.backend_pdf.PdfPages( i_path_to_pdf )

    # TODO: finish implementation

    # close pdf if required
    if( i_path_to_pdf is not None ):
        l_pdf_file.close()
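The TODO could, for instance, be filled by walking over the dataset behind the data loader with the given offset and stride. The sketch below is one possibility; it assumes that the underlying dataset yields (image, label) pairs:

# possible completion of the TODO in plot
l_dataset = io_data_loader.dataset

with torch.no_grad():
    for l_id in range( i_off, len(l_dataset), i_stride ):
        l_image, _ = l_dataset[l_id]

        # predict the label (a batch dimension is added for the model)
        l_prediction = io_model( l_image.unsqueeze(0) ).argmax(1).item()

        l_fig = plt.figure()
        plt.imshow( l_image.squeeze(), cmap = "gray" )
        plt.title( "prediction: " + toLabel( l_prediction ) )
        if( i_path_to_pdf is not None ):
            l_pdf_file.savefig( l_fig )
        else:
            plt.show()
        plt.close( l_fig )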
Tasks
Implement a Fashion MNIST visualization module in eml.vis.fashion_mnist. Use the template in Listing 4.3.1 to guide your implementation. The module’s function plot takes the argument i_off for the offset of the first visualized image and the argument i_stride for the stride between images. For example, if i_off=5 and i_stride=17, the function would plot the images with ids 5, 22, 39, and so on.
Monitor your training process by visualizing the test data after every ten epochs. Use the stride feature of eml.vis.fashion_mnist.plot to keep the file sizes small.
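A call every tenth epoch could look as follows; the variable names and the chosen offset, stride, and file name are assumptions carried over from the earlier sketches:

if( (l_epoch + 1) % 10 == 0 ):
    eml.vis.fashion_mnist.plot( i_off          = 0,
                                i_stride       = 500,
                                io_data_loader = l_loader_test,
                                io_model       = l_model,
                                i_path_to_pdf  = "test_epoch_" + str(l_epoch + 1) + ".pdf" )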
4.4. Batch Jobs
All puzzle pieces for training the MLP are in place.
Further, in Section 1 we have seen that we can get a node from the cluster through the salloc command.
This is great!
We can already use dedicated resources to train our MLP.
However, interactive resources require us to monitor the training process manually.
What happens if we lose the connection to the machine?
Can we still grab a coffee, or should we wait until completion and duly release our compute node afterwards?
This is inconvenient: Our software is mature enough to run on its own 💪.
Batch jobs help us with exactly this issue! In simple words, we can write a shell script which automatically starts our training once resources of the machine are available and releases the occupied node(s) once we are done.
#!/usr/bin/env bash
##
# Example Draco job script.
##
#SBATCH --job-name=mlp_training
#SBATCH --output=mlp_training_%j.out
#SBATCH -p short
#SBATCH -N 1
#SBATCH --cpus-per-task=96
#SBATCH --time=01:00:00
#SBATCH --mail-type=all
#SBATCH --mail-user=alex.breuer@uni-jena.de

echo "submit host:"
echo $SLURM_SUBMIT_HOST
echo "submit dir:"
echo $SLURM_SUBMIT_DIR
echo "nodelist:"
echo $SLURM_JOB_NODELIST

# activate conda environment
module load tools/anaconda3/2021.05
source "$(conda info -a | grep CONDA_ROOT | awk -F ' ' '{print $2}')"/etc/profile.d/conda.sh
conda activate pytorch_x86

# train MLP
cd $HOME/mlp_fashion_mnist
export PYTHONUNBUFFERED=TRUE
python mlp_fashion_mnist.py
An exemplary job script is given in Listing 4.4.1. The script is separated into two parts:
A section sharing details of the job with the job scheduler, i.e., Slurm. Everything related to Slurm starts with #SBATCH. In this example we ask for a single node in the “short” queue for an hour. Further, we specify the name of the job, the location for the job’s output, and request emails on all changes of the job status. Now, when the job is starting (or done), we conveniently get an email. Ok, to be precise, it’s not you who gets these emails in the example script, so please do adjust this part. 😅
A section with the commands which are executed once the job starts running. This includes not only the pure call to Python but also everything else we would do in an interactive job. For example, we might want to load a customized Conda environment.
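Assuming the script is stored as mlp_fashion_mnist.slurm (the file name is an assumption), it is submitted and monitored with the standard Slurm commands:

# submit the job script to the scheduler
sbatch mlp_fashion_mnist.slurm

# check the status of your jobs
squeue -u $USER

# cancel a job if something went wrong (the job id is printed by sbatch)
scancel <job_id>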
Tasks
Write a job script which powers the training of your MLP.
Submit your job and maybe grab a coffee. ☕