Detection
- You Only Look Once: Unified, Real-Time Object Detection
- Recurrent Attention Models for Depth-Based Person Identification
Pose estimation
- Personalizing Human Video Pose Estimation
- Human Pose Estimation with Iterative Error Feedback

Detection

You Only Look Once: Unified, Real-Time Object Detection

A fully connected (FC) layer generates the bounding box coordinates and confidence scores for each grid cell. The effective receptive field of every grid prediction node is therefore the full image.

Each grid cell predicts B bounding boxes (B=2). During training, each object is assigned to one specific predictor of its grid cell (say, the first predictor): the one whose predicted box has the highest IOU with the ground truth.
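
The assignment rule can be illustrated with a minimal sketch. The box format (x_min, y_min, x_max, y_max), the function names, and the example numbers are assumptions made for illustration and are not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(predicted_boxes, gt_box):
    """Index of the predictor (out of B) whose box overlaps the ground truth most."""
    ious = [iou(p, gt_box) for p in predicted_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# B = 2 hypothetical predictions from one grid cell, plus one ground-truth box.
predictions = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
ground_truth = (1.0, 1.0, 3.0, 2.5)
best = responsible_predictor(predictions, ground_truth)
```

Here the second predictor overlaps the ground truth more, so it would be the one that receives the coordinate loss for this object.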

Comment: I am really curious how B affects the performance.

Recurrent Attention Models for Depth-Based Person Identification

Treating a classification task on a video sequence as a reinforcement learning problem is interesting.

Using an attention-based method to attend to different regions of the point cloud increases the number of training samples and decreases the dimension of the CNN input.
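
A rough sketch of why local glimpses have this effect: one point cloud yields several smaller inputs, one per attended location. The cube-shaped crop and the names `glimpse`, `cloud`, and `attended_locations` are illustrative assumptions; the paper's actual glimpse encoding is not reproduced here.

```python
def glimpse(points, center, half_size):
    """Return the points inside an axis-aligned cube around `center`."""
    return [
        p for p in points
        if all(abs(p[i] - center[i]) <= half_size for i in range(3))
    ]


# A toy point cloud and two hypothetical attention locations.
cloud = [(0.1, 0.2, 0.3), (0.9, 0.8, 0.7), (0.2, 0.1, 0.4), (0.5, 0.5, 0.5)]
attended_locations = [(0.15, 0.15, 0.35), (0.7, 0.65, 0.6)]

# Each location produces its own (smaller) training input from the same cloud.
glimpses = [glimpse(cloud, c, half_size=0.25) for c in attended_locations]
```

Each glimpse contains fewer points than the full cloud, and the number of glimpses per cloud grows with the number of attended locations.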

The video sequence is judged as a whole.

The assumption here is that it is possible (necessary) to identify the person from the point clouds by looking at one local region at a time.

Pose estimation

Personalizing Human Video Pose Estimation

It is an iterative refinement process.

First, a few examples are detected with a generic body part detector, which has high precision and low recall.
Spatial matching:
1. Train random forests for body part detection and find candidates in the un-annotated frames.
2. Generate the matches.
3. Use SIFT Flow to correct the matches.
Temporal propagation with optical flow.
Evaluation (check whether different annotation strategies agree with each other; train an SVM to detect occlusion).
Fine-tune the generic pose estimation network with the newly generated annotations.
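
The temporal propagation and agreement check can be sketched as follows. This is a toy illustration, assuming a dense flow field stored as `flow[y][x] = (dx, dy)` and a simple distance threshold for agreement; the function names, the threshold, and the numbers are hypothetical and not taken from the paper.

```python
def propagate_joint(joint, flow):
    """Move a joint (x, y) by the flow vector stored at its rounded pixel location.

    `flow[y][x]` is assumed to hold the (dx, dy) displacement to the next frame.
    """
    x, y = joint
    dx, dy = flow[int(round(y))][int(round(x))]
    return (x + dx, y + dy)


def annotations_agree(a, b, tolerance):
    """True when two joint estimates lie within `tolerance` pixels of each other."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 <= tolerance


# Hypothetical 3x3 flow field: every pixel moves 1 px right and 0.5 px down.
flow = [[(1.0, 0.5)] * 3 for _ in range(3)]

wrist_from_flow = propagate_joint((1.0, 1.0), flow)
wrist_from_matching = (2.2, 1.4)  # hypothetical output of the spatial matching stage

# Keep the new annotation only if both strategies roughly agree.
keep = annotations_agree(wrist_from_flow, wrist_from_matching, tolerance=0.5)
```

Annotations that pass such a check would be the ones added to the fine-tuning set in the last step.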

Human Pose Estimation with Iterative Error Feedback

Use the RGB image and heatmaps generated from an initial pose as input.
Train CNNs to predict the offset between the initial pose and the ground truth.
Feed the “corrected pose” into the next-stage CNN. The CNNs in different stages share the same weights, but the stages are treated independently.
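
The feedback loop can be sketched as below. The predictor is a hypothetical stand-in that reads the target pose from a toy `image` dictionary so the loop runs without a trained model; the step-size cap `max_step` and all names are assumptions, not details from the notes above.

```python
def render_heatmap_input(pose):
    """Placeholder for rendering heatmaps from a pose; here it returns a copy of the pose."""
    return list(pose)


def predict_correction(image, pose_input, max_step=1.0):
    """Hypothetical stand-in for the CNN: step toward a target stored in `image`.

    A real model would regress the correction from pixels and heatmaps; this toy
    version reads the target directly and caps each step at `max_step`.
    """
    correction = []
    for (px, py), (tx, ty) in zip(pose_input, image["target_pose"]):
        dx, dy = tx - px, ty - py
        norm = (dx * dx + dy * dy) ** 0.5
        scale = min(1.0, max_step / norm) if norm > 0 else 0.0
        correction.append((dx * scale, dy * scale))
    return correction


def iterative_error_feedback(image, initial_pose, num_stages=4):
    """Apply the same predictor at every stage, feeding back the corrected pose."""
    pose = list(initial_pose)
    for _ in range(num_stages):
        correction = predict_correction(image, render_heatmap_input(pose))
        pose = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(pose, correction)]
    return pose


image = {"target_pose": [(3.0, 4.0), (0.0, 2.0)]}
final_pose = iterative_error_feedback(image, [(0.0, 0.0), (0.0, 0.0)], num_stages=6)
```

Because the same `predict_correction` is called at every stage, the sketch mirrors the weight sharing described above; each call only sees the current pose, not the history of earlier stages.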
