You Only Look Once: Unified, Real-Time Object Detection

Uses fully connected (FC) layers to generate the bounding box coordinates and confidence scores for each grid cell. Therefore, the effective receptive field of each grid prediction node is the full image.

Each grid cell predicts B bounding boxes (B=2). During training, an object is assigned to a single predictor of the grid cell containing its center: the predictor whose box has the highest IOU with the ground truth (say, the first predictor).
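A minimal sketch of the assignment rule, assuming boxes are given as (x_min, y_min, x_max, y_max) corner tuples. The helper names iou and responsible_predictor and the sample boxes are hypothetical and only illustrate the idea; they are not taken from the paper's code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_predictor(predicted_boxes, gt_box):
    """Index of the predicted box with the highest IOU against gt_box."""
    ious = [iou(p, gt_box) for p in predicted_boxes]
    return max(range(len(ious)), key=ious.__getitem__)


# Hypothetical grid cell with B = 2 predictions and one ground-truth box.
predictions = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
ground_truth = (1.0, 1.0, 3.0, 2.5)
best = responsible_predictor(predictions, ground_truth)
```

Only the selected predictor would receive the coordinate loss for that object, so the B predictors of a cell can specialize over the course of training.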

Comment: I am really curious how B affects the performance.

Recurrent Attention Models for Depth-Based Person Identification

Treating a classification task on video sequences as a reinforcement learning problem is interesting.

Using an attention-based method to attend to different regions of the point cloud increases the number of training samples and decreases the dimension of the CNN input.

The video sequence is judged as a whole.

The assumption here is that it is possible (and necessary) to identify the person from the point cloud by looking at one local region at a time.

Pose estimation

Personalizing Human Video Pose Estimation

It is an iterative refinement process.

Human Pose Estimation with Iterative Error Feedback