Multi-person pose estimation, the computer-vision task of locating each person's body parts in a single image, has long been a challenging problem in computer science. Recently, with the development of deep learning, multi-person pose estimation has begun to make significant advances.
Despite recent advances in this field, several challenges remain in running these models on mobile and embedded systems. Many systems follow a top-down approach that first detects each person and then detects that person's body parts. Because this two-step pipeline uses person detection as its first stage, the system depends heavily on the detector's output, and applying single-person pose estimation to every detected person is time-consuming. One alternative is a real-time multi-person 2D pose estimation system that runs on a workstation with powerful GPUs. Another approach uses associative embedding, built on the stacked hourglass network, to associate body parts. However, these systems are still too heavy to run on mobile or embedded systems.
Technology Overview
This invention presents a lightweight framework that directly predicts body parts as well as the human skeleton, and employs separable convolutions in the network structure. A deep convolutional neural network directly generates confidence maps for body parts; each map gives the probability that a body part occurs at each location. One limb of the human skeleton is defined as the rectangular area between two associated body parts. In the ideal case, the response on the limb area is 1, while all other areas have a response of 0.
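The separable convolution mentioned above is a standard lightweight building block: a depthwise spatial filter per channel followed by a pointwise 1x1 channel mix, which cuts the parameter count from k·k·C·C_out to k·k·C + C·C_out. The sketch below (plain NumPy; all names and shapes are illustrative assumptions, not the patent's actual layers) shows the idea:

```python
import numpy as np

def separable_conv(x, depthwise_k, pointwise_w):
    """Depthwise-separable convolution (illustrative sketch).
    x:           (H, W, C) input feature map
    depthwise_k: (k, k, C) one spatial filter per input channel
    pointwise_w: (C, C_out) 1x1 convolution mixing channels
    """
    h, w, c = x.shape
    k = depthwise_k.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # same-size output
    dw = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            patch = dw_patch = xp[i:i + k, j:j + k, :]        # (k, k, C)
            dw[i, j] = (patch * depthwise_k).sum(axis=(0, 1))  # per-channel filter
    return dw @ pointwise_w                                    # 1x1 channel mixing
```

For a 3x3 kernel with 64 input and 64 output channels, this is 3·3·64 + 64·64 = 4,672 weights versus 36,864 for a full convolution, which is why the block suits mobile and embedded deployment.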
The human skeleton confidence map contains all limbs of all people in the image, and a Euclidean loss is applied to regress it. The cost of adding this one extra confidence map is small, yet it solves the body-part association problem well. At inference time, a limb that passes through two correctly paired body parts should produce a high response in the human skeleton confidence map; otherwise, the two body parts should not be associated.
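The association test described above can be sketched as follows: rasterize the rectangle between two candidate body parts and average the skeleton map's response inside it. A true limb scores near 1, a wrong pairing near 0. This is a minimal NumPy sketch under assumed conventions (function names, the `width` parameter, and (x, y) keypoint order are all illustrative):

```python
import numpy as np

def limb_mask(shape, p1, p2, width=2.0):
    """Ideal limb response: 1 inside the rectangle spanning p1 -> p2
    with the given half-width, 0 elsewhere (hypothetical sketch)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    v = np.array(p2, dtype=float) - np.array(p1, dtype=float)
    length = np.linalg.norm(v) + 1e-8
    u = v / length
    dx, dy = xs - p1[0], ys - p1[1]          # displacement from p1
    along = dx * u[0] + dy * u[1]            # projection onto the limb axis
    perp = np.abs(dx * u[1] - dy * u[0])     # perpendicular distance
    return ((along >= 0) & (along <= length) & (perp <= width)).astype(np.float32)

def association_score(skeleton_map, p1, p2, width=2.0):
    """Mean skeleton-map response over the candidate limb area; a high
    score means the two detected body parts should be linked."""
    mask = limb_mask(skeleton_map.shape, p1, p2, width)
    return float((skeleton_map * mask).sum() / max(mask.sum(), 1.0))
```

Masks of this form are also how the ideal training targets for the Euclidean loss would be built: the target skeleton map is the union of the limb rectangles of every annotated person.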
Advantages
- The system can reach 100 fps on a workstation using GPUs
- The system can also be deployed to mobile and embedded systems
- Can detect all human limbs using a single output confidence map
- Utilizes a lightweight convolutional neural network designed to predict body parts and the human skeleton simultaneously
Applications
- Human action recognition
- Human-computer interaction
- Virtual reality
Opportunity
- License
- Research collaboration
- Partnering
Patent Information:
For Information, Contact:
Mark Saulich
Associate Director of Commercialization
Northeastern University
Inventors:
Yun Fu
Yue Wu