The power of different Edge AI environments

For Edge computing, we use single-board computers rather than a desktop GPU, as they take up less space, draw less power and are cheaper. A variety of single-board computers can run AI workloads, with or without hardware acceleration, from the Raspberry Pi 3 to the Nvidia AGX Xavier.

Note that Google Coral and Intel also offer USB accelerators: the Coral USB Accelerator and the Intel NCS (Neural Compute Stick), based on the Movidius Myriad processor.

We decided not to use a USB accelerator, as we prefer a single unit to a unit with an attached dongle. Furthermore, the following benchmark of AI accelerators shows that the USB accelerators perform worse than a single-board computer like the Jetson Nano when an optimizer is used.

[Figure: benchmark of AI accelerators]

Thus, we chose to test both the Google Coral Dev Board and the Nvidia Jetson Nano, as their performance and price are quite similar. Our test consisted of building an object detection project and seeing what we could achieve with each platform.

The devices in the Jetson family come with a dedicated GPU and with the Jetpack SDK, which bundles the CUDA Toolkit, TensorRT (its model optimizer, which supports most ML frameworks), and a variety of libraries and SDKs. Thanks to that, and even though those devices have an ARM64 processor, it is possible to run all types of Machine Learning projects, whether they were developed on a Cloud Server or on a desktop with an Nvidia GPU.

Once again, thanks to the Jetpack SDK, you can choose either the Jetson Nano or any other device in the Jetson family, depending on the needs of your project. The configuration and the optimization of the models differ, but otherwise all ML projects that run on the Jetson Nano also run on the other Jetson devices.

It is not the same story for the Edge TPU, which supports only one ML framework: Tensorflow Lite. It is a variant of Tensorflow that supports only a limited number of neural network layers. In addition, the Edge TPU only supports models that are quantized to INT8 precision. These two limiting factors drastically reduce the number of models that can run on the device. For instance, it is impossible to use YoloV4 on the Edge TPU without modifying its layers, while on the Jetson Nano it runs without any issue.
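
To make the INT8 constraint more concrete, here is a minimal sketch of what quantizing to INT8 means arithmetically. It assumes a simplified per-tensor affine scheme; the function names are hypothetical, and the actual Tensorflow Lite tooling may differ in its details.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to INT8; returns (q_values, scale, zero_point)."""
    lo = min(min(values), 0.0)  # the range includes 0 so that 0.0 stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # 256 INT8 levels; fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q_values, scale, zero_point):
    """Map INT8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


# Illustrative weights, not taken from a real model.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Every value is squeezed into one of 256 levels, so each weight carries a small rounding error. This is one reason why a model that is compatible with the device can still lose accuracy.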

Due to these limiting factors, there is a risk with the Coral Edge TPU that no compatible AI model exists for our applications, or that the compatible models are not accurate enough.

Those two issues do not exist with the Jetson family. We can run every type of AI model on the Jetson devices, and if the Jetson Nano does not have enough resources to run a model, we can use the Jetson Xavier NX or Jetson AGX Xavier, as both offer more computing power. Thus, we chose the Jetson Nano and other devices from its family to run our Edge AI applications.

The Jetson family currently consists of the Jetson Nano, the Jetson Xavier NX, the Jetson AGX Xavier and the Jetson TX2.

The Jetson Nano is Nvidia's entry-level SBC, with a quad-core ARM A57 (ARM64) clocked at 1.43 GHz and a Maxwell GPU (an architecture similar to that of Nvidia's 700-900 series GPUs) with 128 CUDA cores. There are two versions of its development kit, one with 4 GB of RAM and the other with only 2 GB. For both versions, storage is provided by an SD card, onto which the image built with the Jetpack SDK must be flashed.

We mainly use the Jetson Nano for Object Detection and to extract metrics from the detections. To do so, we run YoloV4 trained on the Coco dataset, which we optimized through TensorRT with FP16 precision. On the Jetson Nano, we only use the versions of YoloV4 with an input shape of 288x288 pixels and 416x416 pixels. With these, we reach a frame rate of 7.9 FPS (for input 288) and 4.6 FPS (for input 416).
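
For readers who want to reproduce this kind of figure, the sketch below shows one common way to measure a frame rate: time a fixed number of inference calls after a short warm-up. This is an assumption about methodology, not a description of how the numbers above were obtained, and `fake_infer` is a hypothetical stand-in for the real detector call.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Return frames per second for `infer` over `frames`, excluding warm-up calls."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up runs are not timed
    timed = frames[warmup:]
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed


def fake_infer(frame):
    """Hypothetical stand-in for a detector call; sleeps for 5 ms."""
    time.sleep(0.005)
    return []


fps = measure_fps(fake_infer, list(range(12)))
```

Excluding the first calls matters on GPU devices, where the initial inferences typically include one-off setup costs that would drag the average down.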

There are some limitations though. With YoloV4, a batch size of 1 is the maximum the GPU memory can handle, so we cannot increase the batch size. To process the streams from 2 cameras, we therefore have to process the frame of only one camera at a time, which halves the frame rate of each camera. Another memory-related issue is that we cannot run several models concurrently with YoloV4, although this would be useful, for example to run the DeepSort model alongside YoloV4 to track people.
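
The alternating scheme described above can be sketched as a simple round-robin schedule. The camera names and helper functions are hypothetical; the frame rates are the ones quoted for the Jetson Nano.

```python
from itertools import cycle, islice


def round_robin(camera_ids):
    """Yield camera ids in turn; each id names the camera whose next frame is processed."""
    return cycle(camera_ids)


def per_camera_fps(total_fps, num_cameras):
    """With one frame processed at a time, the cameras share the throughput equally."""
    return total_fps / num_cameras


# The first six detector calls alternate between the two cameras.
schedule = list(islice(round_robin(["cam0", "cam1"]), 6))

nano_288 = per_camera_fps(7.9, 2)  # input 288x288
nano_416 = per_camera_fps(4.6, 2)  # input 416x416
```

With two cameras, this works out to roughly 3.95 FPS per camera at 288x288 and 2.3 FPS at 416x416.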

To overcome this issue, we looked at using the Jetson Xavier NX. The Xavier NX is based on a six-core ARM v8.2 processor (ARM64) clocked at 1.5 GHz for 2 cores and 1.2 GHz for the other 4 cores. Its GPU is based on Nvidia's Volta architecture (the same as the V100 or Titan V) and has 384 CUDA cores and 48 Tensor cores. It also has 8 GB of RAM, shared between the CPU and GPU.

The Development Kit uses an SD card for storage, and the installation can be done by flashing the Jetpack image onto it, or by flashing the Jetson Xavier NX (with the SD card inserted) via the Nvidia SDK Manager. The industrial version of the Xavier NX has a dedicated 16 GB eMMC in place of the SD card.

As it uses the same Jetpack SDK version as the Jetson Nano, we can directly reuse the projects from the Jetson Nano. The only thing to do is to redo the TensorRT optimization of the AI model. Doing so with YoloV4 (FP16), we managed to reach 36 FPS and 25 FPS with input shapes of 288x288 pixels and 416x416 pixels respectively.
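
As an illustration of what redoing the optimization can look like, the sketch below builds a command line for `trtexec`, the command-line tool that ships with TensorRT. It assumes the model has first been exported to ONNX, which is only one possible route and not necessarily the one we used. The file names are hypothetical, and the `trtexec` path is its usual location on Jetpack images.

```python
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"  # usual location on Jetpack images; adjust if needed


def trtexec_command(onnx_path, engine_path, fp16=True):
    """Build the trtexec argument list that converts an ONNX model into a TensorRT engine."""
    args = [TRTEXEC, "--onnx=" + onnx_path, "--saveEngine=" + engine_path]
    if fp16:
        args.append("--fp16")  # allow FP16 precision
    return args


# Hypothetical file names. One engine is built per device, on that device.
cmd = trtexec_command("yolov4-416.onnx", "yolov4-416-xavier-nx.engine")
command_line = " ".join(cmd)

# To actually build the engine on the Jetson:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The rebuild is needed because a TensorRT engine is tuned for the GPU it was built on, so an engine generated on the Jetson Nano should not be copied over to the Xavier NX.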

In contrast with the Jetson Nano, we can also run other AI models concurrently with YoloV4, and even run YoloV4 with a batch size higher than 1 (up to the limit of the memory).

Doing so, we reach 56 FPS with a batch size of 4 with YoloV4 (input shape 288x288 pixels). If each frame in the batch comes from a different stream, we get 14 FPS per stream. Thus, we are able to process the video streams of 4 cameras at the same time.
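
The batching idea can be sketched as follows: the frames captured at the same step on each camera are grouped into a single batch, and the batched throughput is shared equally between the streams. The frame labels and helper names are hypothetical.

```python
def make_batches(streams):
    """Group the i-th frame of every stream into one batch (batch size = number of streams)."""
    return [list(frames) for frames in zip(*streams)]


def per_stream_fps(batched_fps, batch_size):
    """Each batch carries one frame per stream, so every stream gets an equal share."""
    return batched_fps / batch_size


# Four hypothetical camera streams, three frames each, labeled "<camera>-<frame index>".
streams = [["cam{}-f{}".format(c, i) for i in range(3)] for c in range(4)]
batches = make_batches(streams)

xavier_nx_288 = per_stream_fps(56, 4)  # 56 FPS measured with a batch size of 4
```

Here `batches[0]` holds the first frame of each of the four cameras, and `xavier_nx_288` gives the 14 FPS per stream quoted above.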

By choosing the Jetson family, we are able to scale the Edge AI hardware according to the needs of our projects, and there is no limitation on what we can use or do with these devices. We will detail how we do all this in future blog posts.

If you are interested in discussing Machine Learning and Artificial Intelligence further, and how Aptus can leverage these technologies for you, let's get in touch!