.. _ch:pytorch_mlp:

Multilayer Perceptron
=====================

One of the simplest neural network architectures is the multilayer perceptron.
Fashion MNIST is one of the simplest commonly available datasets.
We get started with "real" neural networks in PyTorch by training a multilayer perceptron on Fashion MNIST 😎.

Before going into the actual ML-related parts, we typically have to work on our data first.
Preparing the data before doing anything in a machine learning framework might turn out to be one of the most time-consuming tasks.
This is especially true for "unique" datasets common in scientific machine learning, where we often have to find custom solutions.
Luckily, Fashion MNIST is one of the standard examples, and somebody else, i.e., `Zalando SE `__, prepared everything for us.
Still, we have to find a way to load our data into PyTorch and to potentially preprocess and postprocess it.
Once all of this infrastructure is in place, we'll define our model and train it.

.. hint::

   The `Quickstart `__ tutorial conceptually does the same thing as this lab: Train a multilayer perceptron on Fashion MNIST.

Datasets and Data Loaders
-------------------------

As in :numref:`ch:linear_perceptron`, we'll first have a brief look at our data.
Then we wrap the Fashion MNIST dataset in a data loader.

.. admonition:: Tasks

   #. Create a training and a test dataset by calling `torchvision.datasets.FashionMNIST `__.
      Use the transformation `torchvision.transforms.ToTensor() `__ to convert the data to PyTorch tensors.
   #. Visualize a few images and show their labels.
      Use `Matplotlib `_ for your plots.
      Save your visualizations in a PDF file by using `PdfPages `__.
      An example is given in the `Multipage PDF demo `__.
   #. Wrap the datasets into data loaders.
      Use `torch.utils.data.DataLoader `__ for this.

Training and Validation
-----------------------

We are ready to train our first "real" neural net in PyTorch!
We'll train a multilayer perceptron (MLP).
Our MLP assumes input images with 28 :math:`\times` 28 pixels and 10 output classes.
Initially, our MLP will use the following network architecture:

* Fully-connected layer with 28 :math:`\times` 28 input features and 512 output features;
* ReLU activation;
* Fully-connected layer with 512 input features and 512 output features;
* ReLU activation; and
* Fully-connected layer with 512 input features and 10 output features.

For the training procedure, we'll also have to think about a loss function and an optimizer.
For now:

* Use `torch.nn.CrossEntropyLoss `__ as your loss function; and
* Use `torch.optim.SGD `__ as your optimizer.

.. literalinclude:: data_mlp/trainer.py
   :linenos:
   :language: python
   :caption: Template for the module ``eml.mlp.trainer``.
   :name: lst:mlp_trainer

.. literalinclude:: data_mlp/tester.py
   :linenos:
   :language: python
   :caption: Template for the module ``eml.mlp.tester``.
   :name: lst:mlp_tester

:numref:`lst:mlp_trainer` and :numref:`lst:mlp_tester` contain two templates for the modules ``eml.mlp.trainer`` and ``eml.mlp.tester``.
We'll use these to avoid spaghetti code by separating our training and testing procedures from the main code.

.. admonition:: Tasks

   #. Implement the class ``Model`` in the module ``eml.mlp.model`` which contains the multilayer perceptron (MLP).
   #. Implement the training loop and print the total training loss after every epoch.
      For the time being, implement the training loop directly in your main function.
   #. Move the training loop to the module ``eml.mlp.trainer``.
      Use the template in :numref:`lst:mlp_trainer` to guide your implementation.
   #. Implement the module ``eml.mlp.tester``.
      Use the template in :numref:`lst:mlp_tester` to guide your implementation.
      The module's only function ``test`` simply applies the MLP to the given data and returns the obtained total loss and the number of correctly predicted samples.
      A sketch showing how all of these pieces could fit together is given after the hint below.

.. hint::

   When testing your model, switch to evaluation mode through `nn.Module.eval() `__ and locally disable gradient tracking through `torch.no_grad `__.
   Don't forget to switch back to training mode afterwards if needed.
   Further information is available in the article `Autograd mechanics `__.
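The sketch below shows one way in which these pieces could fit together before the code is split into modules.
It is meant as orientation only, not as a reference solution: the batch size, the learning rate, the number of epochs, the ``root='data'`` download directory, and the signatures of ``train_epoch`` and ``test`` are assumptions; the templates in :numref:`lst:mlp_trainer` and :numref:`lst:mlp_tester` take precedence.

.. code-block:: python

   import torch
   import torchvision

   # Fashion MNIST datasets; ToTensor() converts the PIL images to float
   # tensors of shape [1, 28, 28] with values in [0, 1].
   transform = torchvision.transforms.ToTensor()
   data_train = torchvision.datasets.FashionMNIST(root='data', train=True,
                                                  download=True, transform=transform)
   data_test = torchvision.datasets.FashionMNIST(root='data', train=False,
                                                 download=True, transform=transform)

   # data loaders; the batch size is a placeholder
   loader_train = torch.utils.data.DataLoader(data_train, batch_size=64, shuffle=True)
   loader_test = torch.utils.data.DataLoader(data_test, batch_size=64)


   class Model(torch.nn.Module):
       """MLP for 28 x 28 input images and 10 output classes."""
       def __init__(self):
           super().__init__()
           self.flatten = torch.nn.Flatten()
           self.layers = torch.nn.Sequential(torch.nn.Linear(28 * 28, 512),
                                             torch.nn.ReLU(),
                                             torch.nn.Linear(512, 512),
                                             torch.nn.ReLU(),
                                             torch.nn.Linear(512, 10))

       def forward(self, x):
           # flatten the [batch, 1, 28, 28] images to [batch, 784]
           return self.layers(self.flatten(x))


   def train_epoch(loss_func, loader, model, optimizer):
       """Performs a single epoch of training and returns the summed loss."""
       model.train()
       loss_total = 0.0
       for x, y in loader:
           prediction = model(x)
           loss = loss_func(prediction, y)

           optimizer.zero_grad()
           loss.backward()
           optimizer.step()

           loss_total += loss.item()
       return loss_total


   def test(loss_func, loader, model):
       """Applies the model to the given data and returns the summed loss
          and the number of correctly predicted samples."""
       model.eval()              # evaluation mode, see the hint above
       loss_total = 0.0
       n_correct = 0
       with torch.no_grad():     # no gradient tracking while testing
           for x, y in loader:
               prediction = model(x)
               loss_total += loss_func(prediction, y).item()
               n_correct += (prediction.argmax(dim=1) == y).sum().item()
       return loss_total, n_correct


   model = Model()
   loss_func = torch.nn.CrossEntropyLoss()
   optimizer = torch.optim.SGD(model.parameters(), lr=1E-3)

   for epoch in range(10):
       loss_train = train_epoch(loss_func, loader_train, model, optimizer)
       loss_test, n_correct = test(loss_func, loader_test, model)
       print(f'epoch {epoch}: train loss {loss_train:.1f}, '
             f'test loss {loss_test:.1f}, '
             f'test accuracy {n_correct / len(data_test):.3f}')

In the final version of the lab, ``Model`` lives in the module ``eml.mlp.model``, and the two functions are replaced by the modules ``eml.mlp.trainer`` and ``eml.mlp.tester`` as described in the tasks above.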
Visualization
-------------

Now, let's once again professionalize our visualization efforts.
This means that we provide some data to our MLP and infer Fashion MNIST's classes by applying our trained model.
The input data are then plotted together with the obtained labels.

.. literalinclude:: data_mlp/fashion_mnist.py
   :linenos:
   :language: python
   :caption: Template for the module ``eml.vis.fashion_mnist``.
   :name: lst:vis_fashion_mnist

.. admonition:: Tasks

   #. Implement a Fashion MNIST visualization module in ``eml.vis.fashion_mnist``.
      Use the template in :numref:`lst:vis_fashion_mnist` to guide your implementation.
      The module's ``plot`` function takes the argument ``i_off`` for the offset of the first visualized image and the argument ``i_stride`` for the stride between images.
      For example, if ``i_off=5`` and ``i_stride=17``, the function would plot the images with ids 5, 22, 39, and so on.
      A sketch of one possible implementation is given below.
   #. Monitor your training process by visualizing the test data every ten epochs.
      Use the stride feature of ``eml.vis.fashion_mnist.plot`` to keep the file sizes small.
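If you are unsure how to approach the ``plot`` function, the following sketch shows one possible implementation.
Only ``i_off`` and ``i_stride`` are prescribed by the description above; the remaining parameter names, the hard-coded class names, and the one-image-per-page layout are assumptions, so once again the template in :numref:`lst:vis_fashion_mnist` takes precedence.
Note that the model is applied in evaluation mode and without gradient tracking, just as in the tester.

.. code-block:: python

   import torch
   import matplotlib.pyplot as plt
   from matplotlib.backends.backend_pdf import PdfPages

   # Fashion MNIST class names in label order
   CLASS_NAMES = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                  'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']


   def plot(i_off, i_stride, io_data_loader, io_model, i_path_to_pdf):
       """Applies the model to every i_stride-th sample of the data loader's
          dataset, starting at id i_off, and writes one page per image to a PDF."""
       io_model.eval()

       with PdfPages(i_path_to_pdf) as pdf, torch.no_grad():
           for sample_id in range(i_off, len(io_data_loader.dataset), i_stride):
               image, label = io_data_loader.dataset[sample_id]
               prediction = io_model(image.unsqueeze(0)).argmax(dim=1).item()

               fig = plt.figure()
               plt.imshow(image.squeeze(), cmap='gray')
               plt.title(f'id: {sample_id}, '
                         f'true: {CLASS_NAMES[label]}, '
                         f'predicted: {CLASS_NAMES[prediction]}')
               pdf.savefig(fig)
               plt.close(fig)

A call such as ``plot(5, 17, loader_test, model, 'fashion_mnist_test.pdf')`` after every tenth epoch would then write test images 5, 22, 39, and so on to a single PDF.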
Batch Jobs
----------

All puzzle pieces for training the MLP are in place.
Further, in :numref:`ch:remote_ml` we have seen that we can get a node from the cluster through the ``salloc`` command.
This is great!
We can already use dedicated resources to train our MLP.
However, interactive resources require us to monitor the training process manually.
What happens if we lose the connection to the machine?
Can we still grab a coffee or should we wait until completion and duly release our compute node afterwards?
This is inconvenient: Our software is mature enough to run on its own 💪.

Batch jobs help us with exactly this issue!
In simple words, we can write a shell script which automatically starts our training once resources of the machine are available and releases the occupied node(s) once we are done.

.. literalinclude:: data_mlp/batch_job_draco.slurm
   :linenos:
   :language: bash
   :caption: Example script for a batch job on the Draco cluster.
   :name: lst:batch_job

An exemplary job script is given in :numref:`lst:batch_job`.
The script is separated into two parts:

#. A section sharing details of the job with the job scheduler, i.e., Slurm.
   Everything related to Slurm starts with ``#SBATCH``.
   In this example, we ask for a single node in the "short" queue for an hour.
   Further, we specify the name of the job, the location for the job's output, and request emails on all changes of the job status.
   Now, when the job is starting (or done), we conveniently get an email.
   Ok, to be precise, it's not you who gets these emails in the example script, so please do adjust this part. 😅
#. A section which contains the commands that are executed once the job starts running.
   This includes not only the pure call to Python but also everything else we would do in an interactive job.
   For example, we might want to load a customized Conda environment.

.. admonition:: Tasks

   #. Write a job script which powers the training of your MLP.
   #. Submit your job and maybe grab a coffee. ☕