Integrating LaunchX with NVIDIA TAO Toolkit for Running on Various Edge Devices

Author

Hoin Na

CoS Tech Part Manager, Nota AI


Introduction

As real-time processing, cost efficiency, and security become increasingly important for AI models, demand for Edge AI is growing rapidly, and there is an active effort to deploy these models onto various edge devices. To get the best performance from an AI model at the edge, the model has to be customized and quantized for each device. However, these tasks differ depending on the frameworks and data types each device requires, so validating which device performs best consumes significant time and resources. LaunchX, an AI deployment tool that converts and benchmarks AI models, offers a service for checking benchmark results on devices manufactured by multiple semiconductor companies.

The TAO Toolkit is an open-source AI model optimization tool developed by NVIDIA, designed for easy training and optimization of AI models based on TensorFlow and PyTorch. Initially, models optimized with the TAO Toolkit were only usable on NVIDIA Jetson devices. With the release of version 5.0, however, a feature was added to export models to ONNX, so TAO-optimized models can now be deployed on non-NVIDIA devices as well.

Figure 1. Flow of NVIDIA TAO Toolkit (TAO Toolkit | NVIDIA Developer)

LaunchX can convert ONNX models into the model framework that is compatible with each device. This allows users to benchmark TAO-optimized models on various devices using LaunchX. In addition, PyNetsPresso, a Python development tool for optimizing AI models for target devices, makes it possible to use LaunchX with just a few lines of code.

Figure 2. LaunchX

To support users who want to deploy TAO-optimized models on other edge devices, we have integrated LaunchX into the TAO Toolkit pipeline through PyNetsPresso.

The integration follows these steps:

  1. Use the TAO Toolkit to obtain a pre-trained MobileNetV2 (TF1) base model, then train, prune, and evaluate it using the Pascal VOC dataset.

  2. Export the optimized TAO model to ONNX format.

  3. Upload the exported ONNX model and benchmark it on multiple devices served by LaunchX using PyNetsPresso.

Install TAO Toolkit and set up the environment

Before using the TAO Toolkit, download the TAO getting-started resources from NGC and move into the downloaded directory.

!ngc registry resource download-version nvidia/tao/tao-getting-started:5.1.0 --dest ./
cd ./getting_started_v5.1.0

Get pre-trained model from TAO Model Zoo

NVIDIA provides a variety of models through the Model Zoo that can be downloaded and used directly with the TAO Toolkit. You can list the available pre-trained classification models with the command below.

!ngc registry model list nvidia/tao/pretrained_classification:*

We chose MobileNetV2 as the base model and downloaded it using the command below.

!ngc registry model download-version nvidia/tao/pretrained_classification:mobilenet_v2 --dest $LOCAL_EXPERIMENT_DIR/pretrained_mobilenet_v2
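
The download command writes into the directory given by `--dest`, so it is safest to create that directory before running it. A minimal sketch, assuming `LOCAL_EXPERIMENT_DIR` is already set in the notebook environment; the helper name `ensure_dest_dir` is ours and is not part of TAO or the NGC CLI:

```python
import os
from pathlib import Path


def ensure_dest_dir(experiment_dir: str, name: str = "pretrained_mobilenet_v2") -> Path:
    """Create <experiment_dir>/<name> if it does not exist and return its path."""
    dest = Path(experiment_dir) / name
    # parents=True also creates the experiment directory itself when missing.
    dest.mkdir(parents=True, exist_ok=True)
    return dest


# In the notebook, this would run before the ngc download cell, for example:
# ensure_dest_dir(os.environ["LOCAL_EXPERIMENT_DIR"])
```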

Get optimized ONNX model with the TAO Toolkit

The whole optimization process can be run from the Jupyter Notebook provided with the TAO Toolkit at tao-getting-started_v5.1.0/notebooks/tao_launcher_starter_kit/classification_tf1/tao_voc/classification.ipynb. We trained on the Pascal VOC dataset that the notebook already uses, so no major code changes were needed. The notebook documents each step clearly, which made it straightforward to run model training, pruning, evaluation, and export. Because ONNX export has been available since TAO Toolkit 5.0.0, the final optimized model is produced in ONNX format.
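
Before uploading, it can help to confirm that the export step actually produced an ONNX file. A minimal sketch using only the standard library; the function name `find_onnx_models` is ours, and it checks only file names and sizes, not whether the model itself is valid:

```python
from pathlib import Path
from typing import List


def find_onnx_models(root: str) -> List[Path]:
    """Return all non-empty *.onnx files under root, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*.onnx")
        if p.is_file() and p.stat().st_size > 0
    )
```

Calling it on the experiment directory should list the exported model; an empty list suggests the export cell did not finish.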

Upload ONNX model and get benchmark results with PyNetsPresso

LaunchX is served as a web application in which you can upload ONNX models by drag and drop, convert them for your target devices, and check the benchmark results. However, because it requires a web browser, it is inconvenient to use from the Jupyter Notebook provided by the TAO Toolkit. To address this, we used PyNetsPresso, which makes LaunchX's functionality available directly within the notebook. PyNetsPresso is a Python package distributed as a wheel on PyPI and must be installed before use. It requires Python 3.8 or later.

!pip install netspresso==1.1.7

PyNetsPresso also needs credentials, so you must first sign up for NetsPresso®. Then set the email and password of your account in the code below.

import os

# Replace the placeholder strings with your NetsPresso® account credentials.
os.environ["NETSPRESSO_EMAIL"] = "YOUR_EMAIL"
os.environ["NETSPRESSO_PASSWORD"] = "YOUR_PASSWORD"
!mkdir -p $LOCAL_EXPERIMENT_DIR/np_converted
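
Credentials written directly into a notebook are easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the two environment variables are set outside the notebook, is to read them and fail early if either is missing. The helper name `load_credentials` is ours and is not part of PyNetsPresso:

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (email, password) from the environment, or raise if either is unset."""
    required = ("NETSPRESSO_EMAIL", "NETSPRESSO_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["NETSPRESSO_EMAIL"], env["NETSPRESSO_PASSWORD"]
```

The returned pair can then be passed as the `email` and `password` arguments used in the cells that follow.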

Next, we convert the TAO model exported in ONNX format using PyNetsPresso. After uploading the exported model, you can convert it to the model format that matches the target device. The supported model formats and devices are defined by the ModelFramework and DeviceName enums imported below.

import os
from netspresso.launcher import ModelConverter, ModelFramework, DeviceName

# Sign in to LaunchX with the credentials set earlier.
converter = ModelConverter(email=os.environ["NETSPRESSO_EMAIL"], password=os.environ["NETSPRESSO_PASSWORD"])
# Input: ONNX model exported by TAO. Output: converted TensorFlow Lite model.
model_path = os.path.join(os.environ["LOCAL_EXPERIMENT_DIR"], "experiment_dir_final/mobilenetv2_classification_tf1.onnx")
output_path = os.path.join(os.environ["LOCAL_EXPERIMENT_DIR"], "np_converted/mobilenetv2_classification_tf1.tflite")
model = converter.upload_model(model_path)
# Convert for the target device and block until the conversion finishes.
conversion_task = converter.convert_model(
    model=model,
    input_shape=model.input_shape,
    target_framework=ModelFramework.TENSORFLOW_LITE,
    target_device_name=DeviceName.RASPBERRY_PI_4B,
    wait_until_done=True
)
converter.download_converted_model(conversion_task, dst=output_path)

!ls -rlt $LOCAL_EXPERIMENT_DIR/np_converted/
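
To compare several devices, the same conversion call can be repeated once per target. The sketch below wraps the `convert_model` and `download_converted_model` calls shown above in a loop. It takes the converter, the uploaded model, and the list of targets as arguments, so it assumes nothing about which `ModelFramework` or `DeviceName` members exist beyond the two used above. The helper `convert_for_targets` and its output naming are ours, not part of PyNetsPresso:

```python
import os
from typing import Any, Dict, Iterable, Tuple


def convert_for_targets(
    converter: Any,
    model: Any,
    targets: Iterable[Tuple[Any, Any, str]],
    out_dir: str,
) -> Dict[str, str]:
    """Convert one uploaded model for each (framework, device, filename) target.

    Returns a mapping from output filename to the path it was downloaded to.
    """
    outputs = {}
    for framework, device, filename in targets:
        # Same call as in the single-device cell, with the target swapped in.
        task = converter.convert_model(
            model=model,
            input_shape=model.input_shape,
            target_framework=framework,
            target_device_name=device,
            wait_until_done=True,
        )
        dst = os.path.join(out_dir, filename)
        converter.download_converted_model(task, dst=dst)
        outputs[filename] = dst
    return outputs
```

With the objects from the cell above, one entry in `targets` might be `(ModelFramework.TENSORFLOW_LITE, DeviceName.RASPBERRY_PI_4B, "mobilenetv2_rpi4b.tflite")`, where the file name is a hypothetical choice.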

Once the model has been converted, you can run the benchmark on your target device with PyNetsPresso.

import os
from netspresso.launcher import ModelBenchmarker, DeviceName

# Sign in to LaunchX with the credentials set earlier.
benchmarker = ModelBenchmarker(email=os.environ["NETSPRESSO_EMAIL"], password=os.environ["NETSPRESSO_PASSWORD"])
# Upload the converted TensorFlow Lite model from the previous step.
model_path = os.path.join(os.environ["LOCAL_EXPERIMENT_DIR"], "np_converted/mobilenetv2_classification_tf1.tflite")
model = benchmarker.upload_model(model_path)
# Run the benchmark on the target device and block until it finishes.
benchmark_task = benchmarker.benchmark_model(
    model=model,
    target_device_name=DeviceName.RASPBERRY_PI_4B,
    wait_until_done=True
)
print("Latency(ms):", benchmark_task.latency / 1000)
print("CPU Memory Footprint(MB):", benchmark_task.memory_footprint_cpu)
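
When benchmarking on more than one device, it is convenient to collect the results in one place. A minimal sketch, assuming each result exposes the numeric `latency` and `memory_footprint_cpu` attributes used above; the `summarize` helper and its column layout are ours. It prints the values as returned, without converting units:

```python
from typing import Any, Mapping


def summarize(results: Mapping[str, Any]) -> str:
    """Format {device_name: benchmark_task} as a plain-text table, lowest latency first."""
    rows = sorted(results.items(), key=lambda item: item[1].latency)
    lines = [f"{'device':<24}{'latency':>12}{'cpu_mem':>12}"]
    for name, task in rows:
        lines.append(
            f"{name:<24}{task.latency:>12.2f}{task.memory_footprint_cpu:>12.2f}"
        )
    return "\n".join(lines)
```

The dictionary keys are free-form labels, so the same helper works whether the results come from one benchmarker or several.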

Benchmark results for TAO models with LaunchX

Figure 3. Multiple benchmarks of MobileNetV2 (TF1) model by LaunchX

We also used LaunchX to benchmark a variety of pre-trained models from the Model Zoo provided through the TAO Toolkit.

Conclusion

As the results above show, an optimized AI model generated by the TAO Toolkit can be converted into the framework each device requires and benchmarked on various devices through PyNetsPresso. Currently, LaunchX provides benchmark results for Raspberry Pi boards, the NVIDIA Jetson series, Intel Xeon, Renesas RA8D1, and Alif Ensemble DevKit Gen2. LaunchX also plans to offer its service for more edge devices from different manufacturers, including those mentioned in the benchmark results above. Reliably deploying AI models on diverse devices requires advanced and refined optimization techniques, so our research team is stepping up its efforts to support stable deployment of a wide range of models on different devices. Stay tuned for more supported devices and new features from LaunchX!
