Introduction to human pose estimation

Ilyes Talbi
5 min read · Jan 11, 2022

Pose estimation is a computer vision technique that detects the posture of a person in an image or a video, in order to extract a biomechanical model. This technique has many applications: it helps us better understand the behavior of people visible in an image and detect their actions.

In sports, it enables a more detailed analysis of the gestures made by athletes. In security, it can be used to detect suspicious actions. Its great advantage is that it extracts biomechanical models without markers.

Estimating human poses in 2D from a single camera is a complex challenge. There is an infinite number of possible postures, an indeterminate number of people in each image, and numerous contacts and overlaps. In addition, people's morphology and clothing can affect the model's predictions.

Human pose estimation has occupied computer vision researchers for many years, and several methods have been proposed.

Human body models

Before presenting the different pose estimation approaches, it is necessary to define what a human pose is.

The human body is a complex object, made of 206 different bones, 639 muscles and 360 joints, with a large number of degrees of freedom. Several representations exist, offering more or less complex models of the human body.

These models are divided into two categories. Keypoint-based representations model the body as a set of keypoints corresponding to its joints and extremities. Model-based representations are 3D models built from an assembly of shapes, giving a richer description of the body.

Human body models
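To make the keypoint-based representation concrete, here is a minimal sketch loosely following the 17-keypoint convention popularized by the COCO dataset (the names and skeleton below are illustrative, not an official API):

```python
# 17 named keypoints, in the COCO ordering.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# A pose is then simply one (x, y, confidence) triple per keypoint.
pose = {name: (0.0, 0.0, 0.0) for name in COCO_KEYPOINTS}

# The "skeleton" is a list of edges between keypoint indices; drawing
# tools use it to render the familiar stick figure over the image.
SKELETON = [
    (5, 7), (7, 9),      # left arm: shoulder -> elbow -> wrist
    (6, 8), (8, 10),     # right arm
    (11, 13), (13, 15),  # left leg: hip -> knee -> ankle
    (12, 14), (14, 16),  # right leg
    (5, 6), (11, 12),    # shoulders, hips
]
```

A model-based representation would instead fit a parametric 3D body mesh, at the cost of a much heavier model.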

Principle of human pose estimation

In pose estimation, we differentiate between cases with several people in the image (multi-person) and cases with only one person (single-person).

Single vs multi-person

In a single-person context, most models share a similar architecture: the model takes an image as input and is composed of two parts.

The first part plays the role of an encoder: using convolutional neural networks, it extracts features in the form of a vector. The second part, a regression model, plays the role of the decoder: it estimates the person's pose from the feature vector.

Architecture of a single-person pose estimation model
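The encoder/decoder pipeline above can be sketched with stand-in functions; the shapes (a 64×64 RGB image, 17 keypoints) and the trivial pooling/linear layers are assumptions for illustration, in place of a real CNN backbone and regression head:

```python
import numpy as np

def encoder(image):
    """Stand-in for a CNN backbone: maps an image to a feature vector.
    (Here just a per-channel average; a real model uses e.g. ResNet.)"""
    return image.mean(axis=(0, 1))  # shape: (channels,)

def decoder(features, weights, bias):
    """Stand-in for the regression head: maps features to 2D keypoints."""
    return features @ weights + bias  # shape: (num_keypoints * 2,)

# Hypothetical setup: a 64x64 RGB image and 17 keypoints.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
weights = rng.random((3, 17 * 2))
bias = np.zeros(17 * 2)

keypoints = decoder(encoder(image), weights, bias).reshape(17, 2)
print(keypoints.shape)  # (17, 2): one (x, y) per keypoint
```

The point is the division of labor: everything image-specific lives in the encoder, and the decoder only ever sees a fixed-size feature vector.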

In most of the proposed solutions, the encoder is a very deep network architecture called a residual neural network (ResNet).

ResNets were designed to make very deep networks trainable by solving the vanishing (or exploding) gradient problem.

This problem arises because backpropagation chains many weight multiplications: when the per-layer factors are mostly below 1, the gradient tends toward 0 (vanishing); when they are mostly above 1, it tends toward infinity (exploding). Either way, the most distant layers stop training.
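A toy numeric illustration of this effect, with a made-up constant per-layer factor:

```python
def gradient_after(depth, per_layer_factor):
    """Multiply a unit gradient by the same factor once per layer,
    mimicking how backpropagation chains per-layer multiplications."""
    g = 1.0
    for _ in range(depth):
        g *= per_layer_factor
    return g

print(gradient_after(50, 0.9))  # ~0.005 -> vanishing gradient
print(gradient_after(50, 1.1))  # ~117   -> exploding gradient
```

Even a modest deviation from 1.0 per layer, compounded over 50 layers, drives the gradient to near-zero or to a huge value.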

To avoid this, ResNets include skip connections, which pass information from a neuron in one layer directly to a neuron in a much deeper layer.
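The effect of a skip connection can be shown with a deliberately attenuating toy "layer" (the 0.1 factor is an assumption for illustration): the plain stack collapses toward zero, while the residual stack, which adds the input back to each layer's output, preserves the signal.

```python
def plain_layer(x, transform):
    return transform(x)

def residual_layer(x, transform):
    # Skip connection: the input is added back to the layer's output,
    # so an identity path bypasses the transformation.
    return transform(x) + x

attenuate = lambda v: 0.1 * v  # a layer that heavily shrinks its input

x_plain = x_res = 1.0
for _ in range(10):
    x_plain = plain_layer(x_plain, attenuate)
    x_res = residual_layer(x_res, attenuate)

print(x_plain)  # 0.1 ** 10 = 1e-10: the signal vanished
print(x_res)    # 1.1 ** 10 ~ 2.59: the skip path preserved it
```

The same identity path exists in the backward pass, which is why gradients can reach early layers even in very deep ResNets.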

Top-down and bottom-up approaches for human pose estimation

Two approaches exist for pose estimation in a multi-person context.

In the first approach, called top-down, the model first detects and locates the people present in the image, then extracts the pose of each person by analyzing only the part of the image in which that person is visible, as in a single-person context.

Although the top-down approach seems quite intuitive, it has some notable flaws.

First, pose estimation depends heavily on the performance of the detection model: when person detection fails, pose estimation fails too.

Thus, in scenes where several people overlap or stand close together, this approach is not well suited. Moreover, the model's execution time is likely to be high, since it is proportional to the number of people in the image.
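The top-down pipeline can be sketched with stub functions (everything here is a hypothetical placeholder; a real system would plug in an object detector and a single-person pose model):

```python
def detect_people(image):
    """Stub detector: returns bounding boxes (x, y, w, h)."""
    return [(0, 0, 4, 4), (4, 0, 4, 4)]

def crop(image, box):
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def estimate_single_pose(patch):
    """Stub: a real model would regress keypoints from the crop."""
    return {"num_pixels": sum(len(row) for row in patch)}

def top_down_pose_estimation(image):
    # Cost grows with the number of detected people: one
    # pose-model forward pass per crop.
    return [estimate_single_pose(crop(image, box))
            for box in detect_people(image)]

image = [[0] * 8 for _ in range(4)]  # tiny 8x4 "image"
poses = top_down_pose_estimation(image)
print(len(poses))  # one pose per detected person
```

The loop structure makes the two flaws visible: a missed box means a missed pose, and runtime scales linearly with the number of people.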

The second approach is the bottom-up approach. It first locates the joints of the human body in isolation, then assembles them into poses using an association model.

Contrary to the top-down approach, the bottom-up approach achieves good results even in crowded scenes with several overlapping people. The algorithmic complexity of such a model is also lower, which reduces inference times.
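The association step can be illustrated with a deliberately simple greedy matcher: all joints are detected first, independently of any person, then paired into limbs. Here each elbow is matched to its nearest wrist; this is only an illustration of the grouping idea, since real bottom-up models learn the association (OpenPose, for instance, uses part affinity fields).

```python
import math

# Joint detections as (x, y) coordinates, with no person identity yet.
elbows = [(1.0, 1.0), (5.0, 1.0)]
wrists = [(5.2, 2.0), (1.1, 2.1)]

def pair_joints(elbows, wrists):
    """Greedily pair each elbow with the nearest still-unmatched wrist."""
    limbs, free = [], list(wrists)
    for e in elbows:
        w = min(free, key=lambda w: math.dist(e, w))
        free.remove(w)
        limbs.append((e, w))
    return limbs

print(pair_joints(elbows, wrists))
# [((1.0, 1.0), (1.1, 2.1)), ((5.0, 1.0), (5.2, 2.0))]
```

Because detection runs once over the whole image and only the cheap association step depends on the joints found, the cost grows much more slowly with the number of people than in the top-down case.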

OpenPose, a bottom-up framework

One of the best frameworks for human pose estimation is OpenPose. It is an open-source model for automatic, real-time multi-person pose estimation. Although initially developed in C++, OpenPose has Python implementations with TensorFlow and PyTorch.

The designers of the model chose a bottom-up approach, which relies on the "direct" detection of the body parts of the people present in the image, without giving the model any prior information on the number or location of these people.

Examples of predictions with OpenPose

Conclusion

Estimating human poses from a single camera remains a major challenge. Current solutions are very sensitive to certain problems, such as occlusions or depth ambiguities.

Attempts have been made to reconstruct 3D human poses from 2D images, but it is clear that the trained models are not suitable for all applications.

Ilyes Talbi

HEY! I am Ilyes, a freelance computer vision engineer and French blogger. I will help you discover the world of AI :)