Focal Loss for Dense Object Detection


Key Idea

Rather than addressing outliers, the focal loss is designed to address class imbalance by down-weighting inliers (easy examples) so that their contribution to the total loss stays small even when they are very numerous.

Key Question

  • The extreme foreground-background class imbalance encountered during training of dense detectors is identified as the central cause of one-stage detectors trailing the accuracy of two-stage detectors.
  • How the two detector families handle class imbalance:
    • R-CNN-like (two-stage) detectors address it through a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [34], EdgeBoxes [37], DeepMask [23, 24], RPN [27]) rapidly narrows the candidate object locations down to a small number (e.g., 1-2k), filtering out most background samples. The second, classification stage then applies sampling heuristics, such as a fixed foreground-to-background ratio (1:3) or online hard example mining (OHEM) [30], to maintain a manageable balance between foreground and background.
    • A one-stage detector, in contrast, must process a much larger set of candidate object locations regularly sampled across an image, in practice often ∼100k locations densely covering spatial positions, scales, and aspect ratios. Similar sampling heuristics can be applied, but they are inefficient because training is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection, typically addressed via techniques such as bootstrapping [32, 28] or hard example mining [36, 8, 30].

Method

  • Cross entropy (CE) loss for binary classification:
    CE(p, y) = −log(p) if y = 1, and −log(1 − p) otherwise,
    where y ∈ {±1} is the ground-truth class and p ∈ [0, 1] is the model's estimated probability for the class y = 1.
    For notational convenience, define pt = p if y = 1 and pt = 1 − p otherwise,
    so that CE(p, y) = CE(pt) = −log(pt).

  • Balanced Cross Entropy:
    CE(pt) = −αt log(pt),
    where αt = α for y = 1 and αt = 1 − α otherwise, with a weighting factor α ∈ [0, 1] for the rare class.

  • Focal Loss Definition:
    FL(pt) = −(1 − pt)^γ log(pt),
    i.e., a modulating factor (1 − pt)^γ added to the cross entropy, with a tunable focusing parameter γ ≥ 0 (see the sketch after this list).

    • 2 properties of FL
      • When an example is misclassified and pt is small, the modulating factor is near 1 and the loss is unaffected. As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted.
      • The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted.
  • α-balanced variant of the focal loss (the form used in the paper's experiments):
    FL(pt) = −αt (1 − pt)^γ log(pt)

  • Class Imbalance and Model Initialization
    Introduce the concept of a ‘prior’ for the value of p estimated by the model for the rare class (foreground) at the start of training. We denote the prior by π and set it so that the model’s estimated p for examples of the rare class is low.

  • Class Imbalance and Two-stage Detectors

    • a two-stage cascade
      • reduces the nearly infinite set of possible object locations down to one or two thousand
      • selected proposals are not random, but are likely to correspond to true object locations
    • biased minibatch sampling
      • 1:3 ratio of positive to negative examples
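
Below is a minimal sketch of the α-balanced focal loss defined above, assuming PyTorch, per-anchor binary targets in {0, 1}, and raw logits. The function name focal_loss is illustrative; the defaults α = 0.25 and γ = 2 are the settings the paper reports working best.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sketch of FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt).

    logits  : raw (pre-sigmoid) scores, any shape
    targets : float tensor of the same shape, 1.0 = foreground, 0.0 = background
    """
    # Plain cross entropy, i.e. -log(pt), computed stably from logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")

    # pt = p for positives, 1 - p for negatives.
    p = torch.sigmoid(logits)
    pt = p * targets + (1 - p) * (1 - targets)

    # alpha_t = alpha for positives, 1 - alpha for negatives.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)

    # The modulating factor (1 - pt)^gamma -> 0 as pt -> 1, so easy,
    # well-classified examples contribute almost nothing to the loss.
    return alpha_t * (1 - pt) ** gamma * ce


# Toy usage: an easy negative (large negative logit) is down-weighted far more
# aggressively by FL than by plain cross entropy.
logits = torch.tensor([2.0, -4.0, -0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
per_example = focal_loss(logits, targets)   # elementwise losses
# The paper normalizes the summed loss by the number of anchors assigned to
# ground-truth boxes rather than by the total number of anchors.
```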

Architecture

[Figure: RetinaNet architecture. A ResNet-FPN backbone with two parallel subnetworks attached to each pyramid level, one for anchor classification and one for box regression.]
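
Below is a rough sketch, assuming PyTorch, of the classification subnet the paper attaches to each FPN level: four 3×3 convs with 256 channels and ReLU, then a 3×3 conv with K·A outputs (K classes, A anchors per location) read through a sigmoid. The class and argument names here are illustrative; the bias initialization anticipates the "Other Questions" section below.

```python
import math
import torch
from torch import nn

class ClsSubnet(nn.Module):
    """Illustrative RetinaNet-style classification head (names are not official)."""

    def __init__(self, in_channels=256, num_classes=80, num_anchors=9, prior=0.01):
        super().__init__()
        layers = []
        for _ in range(4):                              # 4 x (3x3 conv, 256 ch, ReLU)
            layers += [nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = 256
        self.tower = nn.Sequential(*layers)
        # Final 3x3 conv predicts K*A per-location logits (one binary score per
        # class per anchor), later passed through a sigmoid.
        self.cls_logits = nn.Conv2d(256, num_classes * num_anchors,
                                    kernel_size=3, padding=1)
        # Prior initialization: bias b = -log((1 - pi) / pi) so that every anchor
        # starts with foreground confidence ~pi.
        nn.init.constant_(self.cls_logits.bias, -math.log((1 - prior) / prior))

    def forward(self, fpn_feature):
        return self.cls_logits(self.tower(fpn_feature))
```

In the paper the same head is shared across all pyramid levels, and a parallel box-regression subnet has the same structure but outputs 4·A linear values per location.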

Experiments

[Result tables and figures from the paper are not reproduced here.]

Contribution

  • The scaling factor (1 − pt)^γ automatically down-weights the contribution of easy examples during training and rapidly focuses the model on hard examples.
  • The proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows efficient training on all examples, without sampling and without easy negatives overwhelming the loss and computed gradients.
  • It focuses training on a sparse set of hard examples.

Other Questions

  • Binary classification models are by default initialized to output y = −1 or y = 1 with equal probability. Under such an initialization, in the presence of class imbalance, the loss due to the frequent class can dominate the total loss and cause instability in early training.
    • The fix is the ‘prior’ π introduced above: set it so that the model’s estimated p for examples of the rare class (foreground) is low at the start of training.
    • For the final conv layer of the classification subnet, the bias is initialized to b = −log((1 − π)/π), so that at the start of training every anchor is labeled as foreground with confidence of ∼π. The paper uses π = 0.01 in all experiments, although results are robust to the exact value. As explained in §3.4 of the paper, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training; see the worked check below.
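
A quick worked check of that initialization in plain Python (variable names illustrative): with π = 0.01 the bias comes out to about −4.6, and passing it back through a sigmoid recovers a foreground confidence of exactly π.

```python
import math

pi = 0.01                                 # prior for the rare (foreground) class
b = -math.log((1 - pi) / pi)              # bias of the final conv of the cls subnet
print(round(b, 3))                        # -4.595

# sigmoid(b) = 1 / (1 + (1 - pi)/pi) = pi, so at the first iteration every
# anchor is scored foreground with confidence ~0.01 and the ~100k background
# anchors no longer produce a huge, destabilizing loss.
print(round(1 / (1 + math.exp(-b)), 3))   # 0.01
```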