Edge computing is going through an exciting phase in which several chip vendors are looking to enter the ecosystem with accelerators. As compute requirements at the edge increase, choosing the right compute architecture, coupled with a suitable accelerator, will be key, and domain-specific acceleration will be needed to maximize gains.
Deep learning inference models are key to interpreting sensory inputs such as motion, audio, and vision at the edge, and a comprehensive study of accelerators across different models is needed. This paper surveys a representative set of accelerators for deep learning inference: the Coral (RPi 4B) and Coral Dev accelerators by Google, the NCS2 (RPi 4B) by Intel, and the Jetson Nano (GPU) and Jetson Nano (RT) by NVIDIA.
The experimental design evaluates the following deep learning models to infer motion, audio, and vision activities:
- Motion: Aroma;
- Audio: Emotion, deep keyword spotting (DKWS); and
- Vision: MobileNet V1, EfficientNet-EdgeTPU, Inception V1, DenseNet121.
The authors evaluate the execution of these models on the accelerators to study the following metrics:
- Memory footprint;
- Execution time;
- Energy consumption; and
- Overall performance.
They analyze these metrics during model load, warmup (first inference), and subsequent inferences.
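The three phases can be captured with a simple timing harness. The sketch below is illustrative only: `load_model`, `make_input`, and the model's `infer` method are hypothetical stand-ins for an accelerator runtime (e.g., a TFLite interpreter), not the paper's actual toolkit.

```python
import time

def time_phase(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def profile(load_model, make_input, n_inferences=10):
    """Measure model load, warmup (first inference), and steady-state inference.

    load_model and the returned model's .infer() are hypothetical
    placeholders for a real accelerator runtime.
    """
    model, load_s = time_phase(load_model)
    x = make_input()
    _, warmup_s = time_phase(model.infer, x)      # first inference (warmup)
    steady = []
    for _ in range(n_inferences):                  # subsequent inferences
        _, dt = time_phase(model.infer, x)
        steady.append(dt)
    return {
        "load_s": load_s,
        "warmup_s": warmup_s,
        "mean_inference_s": sum(steady) / len(steady),
    }
```

Separating warmup from steady-state matters because the first inference typically includes one-time costs (weight transfer to on-chip memory, kernel compilation) that later inferences avoid.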
The authors find that devices with dedicated on-chip memory, coupled with software pipelines (including compiler optimizations and TensorFlow runtimes), can reduce memory footprint requirements. This works in favor of the Coral/Coral Dev and NCS2 accelerators, compared to the Jetson Nano, which shares memory between the central processing unit (CPU) and the graphics processing unit (GPU).
Also, the on-chip memory on the Coral Dev tensor processing units (TPUs) results in faster execution times, except when large models like DenseNet121 cannot fit into it. This supports their hypothesis that on-chip memory makes a difference in both memory requirements and execution time.
They find some exceptions to this trend, however, with the NCS2 Movidius chip when models have a large kernel size in their first convolution layer. Also, lower loading and warmup times make the Coral Dev TPUs better suited for reactive scenarios in which multiple models must be loaded and unloaded as sensory inputs change in a dynamic usage model.
With their power measurements, they discover a few important patterns:
- The Coral/Coral Dev boards consistently draw less power than the other accelerators;
- TensorFlow optimizations affect energy efficiency; and
- Self-contained accelerators draw less power than those that need a host interface.
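These energy comparisons boil down to integrating measured power over each phase's duration and normalizing per inference. A minimal sketch, assuming a fixed-rate power sampler (the function names and any sample values are illustrative, not the paper's measurements):

```python
def energy_joules(power_samples_w, sample_period_s):
    """Approximate energy (J) as the sum of power samples (W) times the sampling period (s)."""
    return sum(power_samples_w) * sample_period_s

def energy_per_inference(power_samples_w, sample_period_s, n_inferences):
    """Energy-efficiency metric: joules consumed per inference."""
    return energy_joules(power_samples_w, sample_period_s) / n_inferences
```

Joules per inference is the quantity that lets a lower-power but slower accelerator be compared fairly against a faster but hungrier one.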
Finally, the authors compare accelerators attached to the RPi 3B+ with those attached to the RPi 4B and find that the RPi 4B configurations generally perform better (due to the USB 3.0 interface) but tend to consume more energy, which has implications for battery capacity.
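The battery implication follows from simple arithmetic: continuous runtime scales inversely with average power draw. A hypothetical helper (the numbers in the test are illustrative, not from the paper):

```python
def battery_life_hours(battery_capacity_wh, avg_power_w):
    """Hours of continuous operation from a battery of the given capacity (Wh)
    at the given average power draw (W)."""
    return battery_capacity_wh / avg_power_w
```

So a faster RPi 4B setup that draws more watts may still drain a fixed battery sooner, even while completing more inferences per second.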
In summary, this paper describes a well-designed experimental methodology for analyzing edge accelerators and their efficacy on inference workloads. The authors have automated this methodology into a toolkit to analyze additional edge accelerators in the future.