AI Model Optimization Has Never Been Easier

Author Profile

Michal Oleszak

Machine Learning Engineer with a statistics background. Has worn all the hats, having worked for a consultancy, an AI startup, and a software house. A traveler, polyglot, data science blogger and instructor, and lifelong learner.



Compress the model for your edge device without losing accuracy

As the popular cliché has it, data scientists spend 80% of their time preparing the data and only 20% developing the models. While there might be some truth to it, we should never underestimate the effort needed for the remaining 20%. Choosing the architecture, training, fine-tuning, and evaluating the model is no mean feat, especially when developing models for edge devices, where criteria other than performance metrics need to be considered. I recently got to use NetsPresso, a platform that promises to take care of all the model optimization in an automated manner. Let me show you how it works.

Optimizing machine learning models

The typical machine learning pipeline has become a more or less established process these days. We query or download the raw data, parse and clean it, and extract and engineer features to finally obtain a dataset ready for training. Then, we iterate over model architectures and a multitude of training and data-processing hyperparameters to hopefully arrive at a model that satisfies the relevant performance metrics.

But what if the model is destined to run on edge devices? In this case, performance metrics aren't the only criteria for model selection. We should also pay attention to latency, memory footprint, and power consumption, as measured on the particular device we are interested in. After all, our cool AI-powered app won't bring any value to a user whose mobile phone dies due to insufficient memory or battery drain.

When building models for edge devices, latency, memory footprint, and power consumption are important criteria for model selection.
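To make the latency criterion concrete, here is a minimal PyTorch sketch of how one might time a model's inference (my own illustration, not part of NetsPresso): a few warm-up passes to stabilize caches and lazy initialization, then an average over repeated runs.

```python
import time
import torch

def measure_latency(model, input_shape=(1, 3, 640, 640), warmup=10, runs=50):
    """Average single-batch inference latency in milliseconds."""
    model.eval()
    dummy = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):   # warm-up iterations stabilize the timings
            model(dummy)
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0

# Example: any torch.nn.Module works here
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())
print(f"~{measure_latency(model):.1f} ms per image")
```

On a real edge device, the same measurement would be run on the target hardware itself, and power consumption would additionally require a hardware-level profiler.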

However, as we compress our models to be faster and lighter, we would like to sacrifice as little accuracy as possible, if any. I refer to the process of finding the balance between machine learning performance (accuracy metrics) and computational performance (latency, memory usage) as model optimization. Let’s see how to optimize our models with NetsPresso.

Uno cards recognition

At the moment of writing, NetsPresso is a computer-vision-focused platform that only supports detection tasks. Future releases, however, will add functionality such as classification and segmentation. Let's try it out on an object detection and recognition task. We will be using the Uno deck dataset, freely available from Roboflow.

The dataset contains photos of playing cards from the popular game called Uno. In each photo, there are three cards, laid on top of one another, pictured on a bright or patterned background designed to make it harder for machine learning models to recognize the cards. Here is a small sample of the images.

Uno deck images. Dataset source: roboflow. Image by the author.

There are 15 distinct types of cards, each recognizable by the symbol in the middle as well as in two of the card’s corners. The task is to find the bounding box around the symbol in the top-left corner (the one that is always visible on all cards) and to classify the symbol as one of the 15 classes.

Searching for the best model

Let’s try to find a good model architecture for our task. Before we can do this, however, we first need to upload the dataset to the NetsPresso platform. To do that, we will use the platform’s in-browser GUI. Currently, this is the only way to interact with it, but API and CLI interfaces will be provided in future releases.

Uploading dataset

NetsPresso supports a number of data formats. When downloading the Uno deck dataset from Roboflow, I chose the YOLO format, named after the famous object detection model.
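For reference, the YOLO format stores one plain-text label file per image, with one row per object: a class id followed by the box center coordinates, width, and height, all normalized to the image dimensions. A small sketch of parsing such a file (the file name and values below are hypothetical, shown only to illustrate the format):

```python
from pathlib import Path

def read_yolo_labels(label_path):
    """Parse one YOLO label file into (class_id, x_center, y_center, w, h) tuples."""
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes

# e.g. read_yolo_labels("labels/card_001.txt") -> [(7, 0.183, 0.247, 0.091, 0.120)]
# (hypothetical file name and values, shown only to illustrate the format)
```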

To verify that the data on my local drive is in the correct format, NetsPresso offers a validator app that scans the data directory for any issues. If none are found, we can upload the data to the platform.
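I used the platform's validator for this, but a bare-bones stand-in for such a format check might look as follows (the images/labels directory layout and the 15-class count are assumptions based on our dataset):

```python
from pathlib import Path

def validate_yolo_dir(root, num_classes=15):
    """Collect format issues: missing label files, malformed rows, out-of-range values."""
    issues = []
    for img in Path(root, "images").glob("*.jpg"):
        label = Path(root, "labels", img.stem + ".txt")
        if not label.exists():
            issues.append(f"missing label for {img.name}")
            continue
        for i, line in enumerate(label.read_text().splitlines(), 1):
            fields = line.split()
            if len(fields) != 5:
                issues.append(f"{label.name}:{i}: expected 5 fields")
                continue
            cls, *coords = fields
            if not (0 <= int(cls) < num_classes):
                issues.append(f"{label.name}:{i}: class id out of range")
            if any(not (0.0 <= float(c) <= 1.0) for c in coords):
                issues.append(f"{label.name}:{i}: coordinate outside [0, 1]")
    return issues

print(validate_yolo_dir("uno_deck") or "dataset looks OK")  # hypothetical directory
```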

Uno deck dataset uploaded to NetsPresso. Image by the author.

The platform displays some summary statistics of the data. At first glance, everything looks fine, so let's proceed to the model search.

Running model search

Finding the right model architecture for our task and dataset is straightforward with NetsPresso. We just need to tell the platform what we expect of the model.

First, we need to state the target latency in milliseconds. The automated model search will try to find an architecture that performs the inference in a time close to this threshold.

Second, we need to select the target device. We can choose from multiple versions of NVIDIA Jetson and Raspberry Pi, and Intel Server — the three most popular devices for AI model development. Future releases will extend the support to Arm Virtual Hardware, Renesas RZ series, NVIDIA Jetson Orin, and more.

Finally, there are other options we can set, such as the batch size to be used at inference, the image size, and the data type for storing model weights. Based on all this information, NetsPresso recommends a couple of base model architectures to choose from.

Suggested model architectures. Image by the author.

I selected the YOLOv5s network, with an expected latency of 746 milliseconds. Training on our Uno deck dataset took slightly over an hour. As a result, we get a trained model accompanied by a report on its performance, both in terms of machine learning accuracy metrics and computational efficiency metrics.

Trained model report. Image by the author.

We can also browse through the predictions made on the test set. If the model were not doing what it is supposed to do, we would be able to spot it here.

Model predictions on the test set. Image by the author.

As a result, we have managed to get a model of just over 13 MB, with a mean average precision of 99.2% and an inference latency of 167 milliseconds. Can we do any better? Let's find out!

Compressing the model

Having found a good model, we might want to try to compress it. Model compression is the process of simplifying the model in order to improve its computational performance, hopefully without losing too much of its accuracy. Two popular methods for model compression are pruning and filter decomposition.

Model pruning boils down to removing those neurons, filters, or channels from a neural network that contribute little to the model's accuracy but are a substantial burden in terms of memory usage or inference speed.
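To illustrate the general idea with PyTorch's built-in pruning utilities (not NetsPresso's own method), the following zeroes out the 30% of convolutional filters with the smallest L2 norm:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3)
# Zero out the 30% of output filters with the smallest L2 norm (dim=0 = filters)
prune.ln_structured(conv, name="weight", amount=0.3, n=2, dim=0)
prune.remove(conv, "weight")  # bake the pruning mask into the weight tensor
# Note: this only zeroes the filters; real speedups require rebuilding the
# layer without the pruned channels, which is what dedicated tools automate.
print(f"zeroed filters: {(conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()}")
```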

Filter decomposition, on the other hand, seeks to approximate the original network’s weights via a simpler, lightweight representation.
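A toy example of the same principle applied to a dense layer: truncated SVD factorizes the weight matrix into two low-rank factors, replacing one large matrix multiplication with two much smaller ones. Again, this is only an illustration, not NetsPresso's exact algorithm.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate W with U @ (S V^T) truncated to the given rank."""
    U, S, Vh = torch.linalg.svd(linear.weight, full_matrices=False)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=True)
    first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
    second.weight.data = U[:, :rank].contiguous()
    second.bias.data = linear.bias.data.clone()
    return nn.Sequential(first, second)

layer = nn.Linear(512, 512)
approx = low_rank_factorize(layer, rank=64)  # 512*512 -> 2*512*64 weights
x = torch.randn(1, 512)
print((layer(x) - approx(x)).abs().max())    # approximation error
```

With rank 64, the factorized layer holds roughly 66k parameters instead of the original 262k, at the cost of some approximation error.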

NetsPresso supports various customizable types of these two generic approaches, as well as automatic one-click compression which doesn’t require an understanding of the methods’ hyperparameters; it’s enough to choose the compression ratio, a value between 0% and 100%. The stronger we compress, the more efficient the model, but at the cost of an increased risk of accuracy loss. I went for the default compression ratio of 50% to obtain the following results.

Compression results. Image by the author.

According to the resulting visualization, the number of the model's parameters decreased almost fourfold! Could it be that we got a model this much lighter without accuracy loss? To find out, we first need to retrain the compressed model to recover the accuracy lost to pruning and decomposition.
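NetsPresso performs this retraining for us, but conceptually it is ordinary fine-tuning at a modest learning rate. A bare-bones sketch with a placeholder network and random stand-in data:

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the compressed model, and random data
# standing in for the Uno images; both are illustrations only.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 15))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

model.train()
for step in range(100):  # a short recovery run, not training from scratch
    x = torch.randn(32, 128)         # placeholder batch of features
    y = torch.randint(0, 15, (32,))  # placeholder labels (15 Uno classes)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

Here are the retraining results.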

Retrained model report after compression. Image by the author.

It turns out the mean average precision went up from 99.2% to 99.5%! This small increase, however, is likely due to the stochasticity inherent in training neural networks; the two models can be deemed equivalent in terms of accuracy. The point is that the compressed model is no less accurate than the original one.

At the same time, the model size went down from 13MB to less than 4MB and the memory footprint was halved. The latency improved as well, which is an expected consequence of model compression since the resulting model has fewer parameters, and so during inference, fewer matrix multiplications are performed.
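These numbers are easy to sanity-check: the weight storage of a float32 model is roughly its parameter count times four bytes, as in this quick helper (my own utility, not part of the platform):

```python
import torch.nn as nn

def model_size_mb(model: nn.Module, bytes_per_param: int = 4) -> float:
    """Approximate weight storage: parameter count x bytes per parameter."""
    return sum(p.numel() for p in model.parameters()) * bytes_per_param / 1e6

print(f"{model_size_mb(nn.Linear(512, 512)):.2f} MB")  # ~1.05 MB for a toy layer
```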

Final remarks

Model optimization is a complex process, especially when strict accuracy, latency, and memory requirements are to be met. When carried out manually, the subsequent iterations can easily take months before the right model is found. NetsPresso can make this process much faster and easier. Naturally, it is no silver bullet: to achieve the best possible results, one still needs to understand the various training and compression hyperparameters and how they influence the model's learning and inference. But thanks to the platform, one can iterate faster, which is of utmost importance in any machine learning project.
