Deep neural networks (DNNs) are both computation- and storage-intensive, which prevents their wide deployment on power-budgeted mobile, embedded, and IoT systems. To overcome this limitation, research has focused on weight pruning (reducing the number of weights in a DNN), which simultaneously reduces the model size (storage requirement) and accelerates computation. Current pruning methods fall into two mainstream categories that represent two extremes of pruning regularity: non-structured, fine-grained pruning achieves high accuracy but is not hardware-friendly; structured, coarse-grained pruning exploits hardware-efficient structures but suffers an accuracy drop when the pruning rate is high.
Technology Overview
PCONV combines two kinds of sparsity: pattern pruning, which is intra-convolution kernel pruning, and connectivity pruning, which is inter-convolution kernel pruning. In pattern pruning, a fixed number of weights is pruned in each convolution kernel. Unlike non-structured weight pruning, pattern pruning produces the same sparsity ratio in each filter and a limited number of pattern shapes. Essentially, the designed patterns correspond to the computer vision concept of key convolution filters, such as the Gaussian filter for smoothing or the Laplacian of Gaussian filter for smoothing and sharpening. In connectivity pruning, the key insight is to cut the connections between certain input and output channels, which is equivalent to removing the corresponding kernels and makes each filter "shorter" than in the original model. Connectivity pruning further increases the compression rate and provides greater DNN acceleration potential while maintaining a balanced workload in the filter-wise computation of DNNs. Pattern and connectivity pruning can be combined at the algorithm level and accelerated under a unified compiler-assisted acceleration framework. As a result, real-time inference on representative DNNs can be achieved for the first time without compromising accuracy.
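The two pruning steps described above can be illustrated with a minimal sketch in plain Python. It is not the PCONV implementation: the four-pattern library, the choice of keeping 4 of 9 weights per 3x3 kernel, the magnitude-based selection rule, and the function names `pattern_prune`, `connectivity_prune`, and `pconv_prune` are all assumptions made for illustration. Keeping the same number of kernels in every filter is likewise only one simple way to picture a balanced filter-wise workload.

```python
import random

# Hypothetical pattern library (an assumption, not PCONV's actual patterns):
# each pattern lists the 4 (row, col) positions kept in a 3x3 kernel.
# The centre weight (1, 1) is kept in every pattern.
PATTERNS = [
    ((0, 1), (1, 0), (1, 1), (1, 2)),
    ((0, 1), (1, 0), (1, 1), (2, 1)),
    ((0, 1), (1, 1), (1, 2), (2, 1)),
    ((1, 0), (1, 1), (1, 2), (2, 1)),
]


def kept_energy(kernel, pattern):
    """Sum of squared weights that a pattern would keep."""
    return sum(kernel[r][c] ** 2 for r, c in pattern)


def pattern_prune(kernel):
    """Intra-kernel pruning: keep only the weights of the best-fitting pattern."""
    best = max(PATTERNS, key=lambda p: kept_energy(kernel, p))
    keep = set(best)
    return [[kernel[r][c] if (r, c) in keep else 0.0 for c in range(3)]
            for r in range(3)]


def connectivity_prune(filt, kernels_to_keep):
    """Inter-kernel pruning: zero out the weakest kernels of one filter.

    `filt` is a list of 3x3 kernels, one per input channel. Zeroing a kernel
    cuts the connection between that input channel and this output channel.
    """
    norms = [sum(w * w for row in k for w in row) for k in filt]
    ranked = sorted(range(len(filt)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:kernels_to_keep])
    return [k if i in keep else [[0.0] * 3 for _ in range(3)]
            for i, k in enumerate(filt)]


def pconv_prune(layer, kernels_to_keep):
    """Apply connectivity pruning, then pattern pruning, to every filter.

    `layer` is indexed as layer[filter][input_channel][row][col].
    """
    pruned = []
    for filt in layer:
        filt = connectivity_prune(filt, kernels_to_keep)
        pruned.append([pattern_prune(k) for k in filt])
    return pruned


if __name__ == "__main__":
    random.seed(0)
    # Toy layer: 4 filters, 6 input channels, 3x3 kernels.
    layer = [[[[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(3)]
              for _ in range(6)] for _ in range(4)]
    pruned = pconv_prune(layer, kernels_to_keep=3)
    for f, filt in enumerate(pruned):
        nonzero = sum(1 for k in filt for row in k for w in row if w != 0.0)
        print(f"filter {f}: {nonzero} nonzero weights")
```

In this toy setting every filter ends up with at most 3 x 4 = 12 nonzero weights out of the original 54, and every surviving kernel uses one of only four shapes. That regularity is the property a compiler could exploit when generating code for the pruned layer.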
- PCONV outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow-Lite, TVM, and Alibaba Mobile Neural Network (MNN), with speedups of up to 39.2X, 11.4X, and 6.3X, respectively, and no accuracy loss.
- These tests were performed using three widely used DNNs (VGG-16, ResNet-50, and MobileNet-V2) and two benchmark datasets (ImageNet and CIFAR-10).
- Using an Adreno 640 embedded GPU (in a state-of-the-art smartphone), PCONV achieves an unprecedented 19.1 ms inference time for VGG-16 on the ImageNet dataset.
- Achieves real-time inference execution of representative large-scale DNNs on mobile devices.
- Applicable to any application that requires real-time, fast implementation of deep learning and AI systems; promotes the wide application of DNNs on embedded, mobile, and IoT systems.
- Autonomous driving systems, unmanned aerial vehicles (UAVs), and intelligent robotic systems.
- Real-time medical imaging applications. 
- Cloud-based AI and deep learning accelerators. 
- Field testing, road scanning, and sensor-based intelligent systems.
- License
- Research collaboration
- Partnering
Patent Information:
For Information, Contact:
Colin Sullivan
Commercialization Consultant
Northeastern University
Yanzhi Wang
Xiaolong Ma
Artificial intelligence
Deep learning
Mobile devices
Model Compression
Real-Time Implementation