Multi-person pose estimation is one of the major challenges in computer vision. Although several current approaches have made significant progress by fusing multi-scale feature maps, little attention has been paid to enhancing the channel-wise and spatial information of those feature maps.
Last week, researchers from Southeast University, Nanjing, China published a paper, Multi-Person Pose Estimation with Enhanced Channel-wise and Spatial Information, in which they introduce two novel modules for enhancing the information used in multi-person pose estimation.
First, the researchers propose a Channel Shuffle Module (CSM), which applies the channel shuffle operation to feature maps from different levels to promote cross-channel information communication among them.
Second, they design a Spatial, Channel-wise Attention Residual Bottleneck (SCARB), which augments the original residual unit with an attention mechanism to further highlight the information in the feature maps in both the spatial and channel-wise contexts.
Finally, they evaluate the effectiveness of the proposed modules on the COCO keypoint benchmark, and the experimental results show that their approach achieves state-of-the-art results. Below, we discuss the modules introduced by the researchers in detail.
Modules proposed by the researchers
Channel Shuffle Module (CSM)
Deep convolutional neural networks produce feature maps at increasingly rich levels as the layers get deeper, and many visual tasks have improved substantially as a result.
However, in multi-person pose estimation there are still limitations in the trade-off between low-level and high-level feature maps. Channel information with different characteristics can complement and reinforce each other, so the researchers propose the Channel Shuffle Module (CSM) to further calculate the interdependencies between the low-level and high-level feature maps.
Assume that the pyramid features extracted from the ResNet backbone are denoted as Conv-2∼5 (as shown in the figure). Conv-3∼5 are first upsampled to the same resolution as Conv-2, and these feature maps are then concatenated together.
The channel shuffle operation is then performed on the concatenated features to fuse the complementary channel information among different levels. The shuffled features are then split and separately downsampled to their original resolutions; the results are denoted as C-Conv-2∼5.
Next, the researchers apply a 1×1 convolution to further fuse C-Conv-2∼5 and obtain the shuffled features, denoted as S-Conv-2∼5. They then concatenate the shuffled feature maps S-Conv-2∼5 with the original pyramid feature maps Conv-2∼5 to obtain the final enhanced pyramid feature representations.
These enhanced pyramid feature maps contain the information from the original pyramid features and fused cross-channel information from the shuffled pyramid feature maps.
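The shuffle-and-split step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a ShuffleNet-style shuffle (view the channels as a groups-by-channels grid, transpose, flatten), uses string labels in place of real feature maps, and leaves out the upsampling, downsampling, and 1×1 convolution. The names channel_shuffle and split_evenly, and the choice of one group per pyramid level, are hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across `groups` equal-sized groups.

    Equivalent to reshaping to (groups, per_group), transposing, and
    flattening. This is the ShuffleNet-style shuffle, assumed here.
    """
    if groups <= 0 or len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def split_evenly(channels, parts):
    """Split a channel list into `parts` consecutive chunks of equal size."""
    size = len(channels) // parts
    return [channels[p * size:(p + 1) * size] for p in range(parts)]


# Toy stand-ins for Conv-2..5 after upsampling to a common resolution:
# two labeled "channels" per pyramid level.
levels = [
    ["c2_a", "c2_b"],  # Conv-2
    ["c3_a", "c3_b"],  # Conv-3
    ["c4_a", "c4_b"],  # Conv-4
    ["c5_a", "c5_b"],  # Conv-5
]

# Concatenate along the channel axis, shuffle, then split back per level.
concatenated = [ch for level in levels for ch in level]
shuffled = channel_shuffle(concatenated, groups=len(levels))
c_conv = split_evenly(shuffled, parts=len(levels))

for name, chunk in zip(["C-Conv-2", "C-Conv-3", "C-Conv-4", "C-Conv-5"], c_conv):
    print(name, chunk)
```

In this toy run, each split chunk ends up holding channels that originated from different pyramid levels, which is the cross-level mixing the module is designed to provide.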
Attention Residual Bottleneck (ARB)
The researchers introduce the Attention Residual Bottleneck on top of the enhanced pyramid feature representations described above. With it, they enhance the feature responses in both the spatial and channel-wise contexts.
The figure shows the schema of the original Residual Bottleneck alongside the Spatial, Channel-wise Attention Residual Bottleneck, which is composed of spatial attention and channel-wise attention. The dashed links in the figure indicate the identity mapping. The ARB learns the spatial attention weights β and the channel-wise attention weights α, respectively.
Using the whole feature maps can lead to sub-optimal results because of irrelevant regions, whereas the spatial attention mechanism attempts to highlight the task-related regions in the feature maps.
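To make the roles of β and α concrete, here is a deliberately simplified sketch. It is an assumption-heavy illustration rather than the module from the paper: the learned layers that produce the attention weights are replaced by parameter-free average pooling followed by a sigmoid, the residual branch is taken as a given input, and all function names are hypothetical. Feature maps are nested lists indexed as [channel][row][column].

```python
import math


def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def spatial_attention(feat):
    """Scale every location by a weight beta[y][x] shared across channels.

    beta is a sigmoid of the cross-channel mean, a stand-in for a learned layer.
    """
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    beta = [[_sigmoid(sum(feat[c][y][x] for c in range(channels)) / channels)
             for x in range(width)] for y in range(height)]
    return [[[feat[c][y][x] * beta[y][x] for x in range(width)]
             for y in range(height)] for c in range(channels)]


def channel_attention(feat):
    """Scale every channel by a single weight alpha.

    alpha is a sigmoid of the channel's spatial mean, again a stand-in
    for learned layers.
    """
    out = []
    for channel in feat:
        count = len(channel) * len(channel[0])
        alpha = _sigmoid(sum(map(sum, channel)) / count)
        out.append([[v * alpha for v in row] for row in channel])
    return out


def attention_residual(identity, residual):
    """Spatial attention, then channel-wise attention, then the identity add."""
    attended = channel_attention(spatial_attention(residual))
    return [[[i + a for i, a in zip(irow, arow)]
             for irow, arow in zip(ich, ach)]
            for ich, ach in zip(identity, attended)]


# One channel, one row, two locations: a strong response and a weak one.
print(spatial_attention([[[4.0, -4.0]]]))
```

Swapping the two calls inside attention_residual gives the channel-first ordering, which is the kind of variation the ordering experiment in the evaluation compares.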
Evaluating the models on COCO keypoint
The team trains the models on the COCO dataset, which includes 57K images and 150K person instances, with no extra data involved, and evaluates them on the challenging COCO keypoint benchmark. The ablation studies are validated on the COCO minival dataset, and the final results are reported on the COCO test-dev dataset and compared with the public state-of-the-art results. The team uses the official evaluation metric, which reports the OKS-based AP (average precision). Here, OKS (object keypoint similarity) measures the similarity between the ground truth pose and the predicted pose.
In the Spatial, Channel-wise Attention Residual Bottleneck (SCARB) experiment, the team explores the effect of the order in which the spatial attention and the channel-wise attention are applied in the Attention Residual Bottleneck, i.e., SCARB (spatial first) and CSARB (channel-wise first).
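For readers unfamiliar with OKS, the following sketch shows how it can be computed for a single person, following the standard COCO definition: each labeled keypoint contributes exp(-d² / (2·s²·k²)), where d is the distance between the predicted and ground truth keypoint, s² is the object area, and k is a per-keypoint constant, and the contributions are averaged over labeled keypoints. The function name and argument layout are hypothetical, and the per-keypoint constants are supplied by the caller rather than hard-coded.

```python
import math


def oks(gt, pred, visibility, area, k):
    """Object keypoint similarity between one ground truth and one predicted pose.

    gt, pred:   lists of (x, y) keypoint coordinates
    visibility: ground truth flags; keypoints with a flag <= 0 are unlabeled
    area:       object area (s squared), assumed to be positive
    k:          per-keypoint constants controlling the falloff
    """
    total, labeled = 0.0, 0
    for (gx, gy), (px, py), v, ki in zip(gt, pred, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints do not count toward the score
        d2 = (gx - px) ** 2 + (gy - py) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0


# A perfect prediction scores 1.0; the score decays as keypoints drift.
pose = [(10.0, 20.0), (30.0, 40.0)]
print(oks(pose, pose, visibility=[2, 2], area=100.0, k=[0.5, 0.5]))
```

AP is then obtained by treating a prediction as correct when its OKS exceeds a threshold, much as IoU thresholds are used in object detection.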
To learn more, check out the paper, Multi-Person Pose Estimation with Enhanced Channel-wise and Spatial Information.