[Deep Learning Paper Notes][Depth Estimation] Depth Map Prediction from a Single Image using a Multi-Scale Deep Network


Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” Advances in neural information processing systems. 2014. (Citations: 161).


1 Depth Estimation From a Single Image
Monocular depth estimation must rely on cues such as line angles and perspective, object sizes, image position, and atmospheric effects. The task is inherently ambiguous: given an image, an infinite number of possible world scenes may have produced it.


2 Architecture

In a Nutshell See Fig. 14.
• A coarse-scale network first predicts the depth of the scene at a global level.
• This is then refined within local regions by a fine-scale network.
• Both stacks are applied to the original input.

• The coarse network’s output is passed to the fine network as additional first-layer image features.



Global Coarse-Scale Network Uses fully connected (fc) layers so that the entire image is contained in the field of view. A local view (as is commonly used for stereo matching) is insufficient to notice important global cues such as those listed above.

Note that the spatial dimensions of the output are larger than those of conv5 (coarse5); this works better than limiting the output size and upsampling.

The coarse network is pre-trained on ImageNet.


Local Fine-Scale Network A fine-scale network edits the coarse prediction to align it with local details such as object and wall edges. This network does not use fc layers. Zero-padding is applied on conv2 (fine2) and conv3 (fine3) to maintain the spatial sizes.
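The two stacks can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact configuration: the channel counts, kernel sizes, and the 228×304 → 55×74 resolution loosely follow the NYU Depth setup, but the fc width and pooling schedule here are simplified assumptions.

```python
import torch
import torch.nn as nn

class CoarseNet(nn.Module):
    """Global coarse-scale stack: conv layers followed by fc layers,
    so the entire image falls in the field of view."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(),          # width is illustrative
            nn.Linear(4096, 74 * 55),                # coarse depth map, 55x74
        )

    def forward(self, x):
        return self.fc(self.features(x)).view(-1, 1, 55, 74)

class FineNet(nn.Module):
    """Local fine-scale stack: the coarse prediction enters as an extra
    first-layer feature map; no fc layers."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 63, kernel_size=9, stride=2)   # fine1
        self.pool = nn.MaxPool2d(2)
        # fine2/fine3 are zero-padded so the spatial size stays fixed
        self.conv2 = nn.Conv2d(64, 64, kernel_size=5, padding=2)  # fine2
        self.conv3 = nn.Conv2d(64, 1, kernel_size=5, padding=2)   # fine3
    def forward(self, x, coarse_depth):
        f = self.pool(torch.relu(self.conv1(x)))
        # resize coarse output to the fine feature resolution, then concatenate
        c = nn.functional.interpolate(coarse_depth, size=f.shape[-2:])
        f = torch.cat([f, c], dim=1)                 # 63 + 1 = 64 channels
        return self.conv3(torch.relu(self.conv2(f)))
```

Both stacks take the original RGB input; only the coarse stack sees the whole image at once through its fc layers, while the fine stack stays fully convolutional.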


Data Augmentation
• Scale: Input and target images are scaled by s ∈ [1, 1.5], and the depths are divided by s (making the image s times larger effectively moves the camera s times closer).
• Translation: Input and target are randomly cropped to the network input size.
• Rotation: Input and target are rotated by r ∈ [−5, 5] degrees.
• Color: Input values are multiplied globally by a random RGB value c ∈ [0.8, 1.2] .
• Flips: Input and target are horizontally flipped with 0.5 probability. 
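The augmentations above can be sketched in numpy. This is a simplified illustration (rotation is omitted, and nearest-neighbor indexing stands in for proper image resampling); the function name and crop size are assumptions, not the authors' code.

```python
import numpy as np

def augment(rng, image, depth, out_hw=(228, 304)):
    """Apply scale / crop / color / flip augmentation to an RGB image
    (H, W, 3) and its depth map (H, W)."""
    # Scale: enlarge by s in [1, 1.5] and divide depths by s
    # (an s-times larger image looks s times closer, so depths shrink).
    s = rng.uniform(1.0, 1.5)
    H, W = image.shape[:2]
    ys = (np.arange(int(H * s)) / s).astype(int).clip(0, H - 1)
    xs = (np.arange(int(W * s)) / s).astype(int).clip(0, W - 1)
    image = image[ys][:, xs]
    depth = depth[ys][:, xs] / s

    # Translation: random crop to the target size.
    h, w = out_hw
    top = rng.integers(0, image.shape[0] - h + 1)
    left = rng.integers(0, image.shape[1] - w + 1)
    image = image[top:top + h, left:left + w]
    depth = depth[top:top + h, left:left + w]

    # Color: global multiplication by a random RGB vector in [0.8, 1.2].
    image = image * rng.uniform(0.8, 1.2, size=3)

    # Flips: horizontal mirror with probability 0.5 (depth flips too).
    if rng.random() < 0.5:
        image, depth = image[:, ::-1], depth[:, ::-1]
    return image, depth
```

Note that every geometric transform is applied jointly to image and depth, but the depth values themselves change only under scaling.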


Scale-Invariant Error The global scale of a scene is a fundamental ambiguity in depth prediction. Thus, just finding the average scale of the scene accounts for a large fraction of the total error. 

Scale-invariant MSE:

D(y, y*) = (1/2n) Σᵢ (log yᵢ − log y*ᵢ + α(y, y*))²

Where

α(y, y*) = (1/n) Σᵢ (log y*ᵢ − log yᵢ)

We use the scale e^α that best aligns the prediction to the ground truth. All cŷ with c > 0 have the same error, hence the scale invariance.
The output of the network is log ŷ , i.e., the top layer predicts the log depth.
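A short numpy sketch of this error, taking the network's log-depth output directly as input (the function name is my own):

```python
import numpy as np

def scale_invariant_error(pred_log, gt):
    """Scale-invariant MSE between predicted log depths and ground truth."""
    d = pred_log - np.log(gt)      # per-pixel log difference
    alpha = -d.mean()              # e^alpha best aligns prediction to gt
    return 0.5 * np.mean((d + alpha) ** 2)
```

Multiplying the predicted depths by any constant c > 0 only adds log c to pred_log, which α absorbs exactly, so the error is unchanged.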


3 Training Details
First train the coarse network against the ground-truth targets, then train the fine-scale network while keeping the coarse-scale output fixed.


4 Experiments
Datasets
• NYU Depth: 464 color video sequences of indoor scenes.
• KITTI: outdoor scenes.

5 Future Work
Incorporate 3D geometry information such as surface normals, which turns out to improve performance.
