It remains challenging to deploy Deep Neural Networks (DNNs) on mobile devices, especially when both real-time execution and high inference accuracy are required. This is because the increasingly large model sizes and complex structures of high-accuracy DNNs usually demand tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning approaches are either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation.
Technology Overview
Northeastern University researchers have designed a novel, fine-grained, structured pruning scheme termed Block-based Column-Row pruning (BCR pruning), a general method that works for both CNNs and RNNs. For a weight matrix in a convolutional (CONV) or fully-connected (FC) layer, BCR pruning divides the matrix into a number of equal-sized blocks and applies independent row and column pruning to each block. The remaining weights in each block still form a full matrix. As a result, the hardware acceleration performance on a mobile device can approach that of coarse-grained structured pruning and far exceed that of non-structured pruning. This is achieved through the code optimization capability of compilers for inference acceleration. Based on the BCR pruning scheme, the researchers have further developed an end-to-end BPDNN (BCR Pruning-based DNN) acceleration framework consisting of two parts: (1) an execution code generation stage with compiler-based optimizations enabled by BCR pruning, which accelerates inference for a given BCR-pruned DNN (CNN or RNN) model; and (2) an optimization framework that determines the block size for each layer and other hyperparameters, and performs BCR pruning accordingly. This second part is carried out during the training phase.
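The block-wise column-row pruning step described above can be sketched as follows. This is a minimal NumPy illustration, not the researchers' implementation: the function name, the L2-norm ranking criterion, and the fixed per-block keep counts are assumptions made for the example.

```python
import numpy as np

def bcr_prune(weight, block_rows, block_cols, row_keep, col_keep):
    """Sketch of Block-based Column-Row (BCR) pruning.

    Divides `weight` into equal-sized blocks, then independently zeroes
    the lowest-magnitude rows and columns within each block. The surviving
    weights in every block still form a dense (full) sub-matrix, which is
    what keeps the result friendly to compiler-optimized parallel execution.
    """
    pruned = weight.copy()
    n_rows, n_cols = weight.shape
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            block = pruned[r0:r0 + block_rows, c0:c0 + block_cols]
            # Rank rows and columns by L2 norm (an assumed criterion)
            # and keep only the strongest ones in each block.
            keep_r = np.argsort(np.linalg.norm(block, axis=1))[::-1][:row_keep]
            keep_c = np.argsort(np.linalg.norm(block, axis=0))[::-1][:col_keep]
            mask = np.zeros_like(block)
            mask[np.ix_(keep_r, keep_c)] = 1.0
            pruned[r0:r0 + block_rows, c0:c0 + block_cols] = block * mask
    return pruned
```

For example, pruning an 8×8 matrix with 4×4 blocks while keeping 2 rows and 2 columns per block leaves a dense 2×2 sub-matrix in each block, i.e. a 75% sparsity ratio with a regular, block-structured nonzero pattern.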
- Evaluation experiments demonstrate that BPDNN outperforms three state-of-the-art end-to-end DNN acceleration frameworks, Alibaba Mobile Neural Network, TVM, and TensorFlow Lite, as well as an optimized baseline built on CSR, with speedups of up to 5.72×, 7.54×, 11.76×, and 4.19×, respectively, without any accuracy degradation
- BPDNN also outperforms a state-of-the-art FPGA approach for RNN execution
- Can execute high-accuracy DNNs (e.g., VGG-16) on mobile devices in real time
- These evaluations were performed using three widely used DNNs, VGG-16, ResNet-50, and MobileNet-V2, and two benchmark datasets, ImageNet and CIFAR-10
- This invention is generally applicable to any application that requires real-time, fast implementation of deep learning and AI systems, and will promote the wide adoption of DNNs on embedded, mobile, and IoT systems
- Autonomous driving systems, unmanned aerial vehicles (UAVs), and intelligent robotic systems
- Real-time medical imaging applications
- Cloud-based AI and deep learning accelerators
- Field testing, road scan, and sensor-based intelligent systems
- License
- Partnering
- Research collaboration
Patent Information:
For Information, Contact:
Colin Sullivan
Commercialization Consultant
Northeastern University
Inventors:
Yanzhi Wang
Zhengang Li
Bin Ren
Wei Niu
Keywords:
Artificial intelligence
Deep learning
Mobile devices
Model Compression
Real-Time Implementation