SRILM文档分析之Prob.h

来源：互联网发布：php快速排序算法编辑：程序博客网时间：2024/05/17 17:57

Prob.h Prob.cc

文档作者：jianzhu
创立时间：08.09.11

--------------------------------------
1、概述
--------------------------------------
这两个文件定义了一组函数用于处理浮点数和对数的加减操作。
同时定义一个用于将字符串浮点数转换为浮点数的函数。

--------------------------------------
2、函数功能解释
--------------------------------------
a) LogPtoProb函数

<src>
0  inline Prob LogPtoProb(LogP2 prob)
1  {
2      if (prob == LogP_Zero) {
3       return 0;
4      } else {
5      return exp(prob * M_LN10);
6      }
7  }
</src>

    功能：将以10为底的对数概率值转化为概率值本身

    细解：第2行判断对数概率值是否为LogP_Zero即-HUGE_VAL，若为该值，则
    直接返回0；否则执行第5行。
    第5行首先将prob*M_LN10获得概率的ln值，然后对其求以自然数为底的运算，
    则获得概率值本身，同时返回该概率值。

    注： prob    --> log10(a)
         M_LN10 --> ln10
         exp^X   --> e^X

         log10(a)    = ln(a)/ln10
      -> prob*M_LN10 = ln(a)/ln10 * ln10 = ln(a)
      -> exp(prob * M_LN10) = exp(ln(a)) = a

b) LogPtoPPL函数

<src>
0  inline Prob LogPtoPPL(LogP prob)
1  {
2      return exp(- prob * M_LN10);
3  }
</src>

    功能：将以10为底的对数概率值转化为概率值对应的perplexity。
    对于至含一个概率的情况，即为将以10为底的对数概率值转化为概率值倒数
          perplexity = P(W1W2...Wn)^-1/N
          当只有一个概率值P时，上式简化为:
          perplexity = P^-1

     细解：
         prob    --> log10(a)
         M_LN10 --> ln10
         exp^X   --> e^X

         log10(a)    = ln(a)/ln10
      -> prob*M_LN10 = ln(a)/ln10 * ln10 = ln(a)
      -> exp(-prob * M_LN10) = exp(-ln(a)) = exp(ln(a^-1)) = a^-1 = 1/a

c) ProbToLogP函数

<src>
0  inline LogP ProbToLogP(Prob prob)
1  {
2      return log10(prob);
3  }
</src>
    功能：将概率值转化为以10为底的对数值

    细解：第2行通过调用log10函数，直接将prob转换为以10为底的对数值，同时返回该值

d) MixLogP函数

<src>
0  inline LogP MixLogP(LogP prob1, LogP prob2, double lambda)
1  {
2      return ProbToLogP(lambda * LogPtoProb(prob1) +
3             (1 - lambda) * LogPtoProb(prob2));
4  }
</src>

    功能：对prob1和prob2概率对数值求lambda线性插值

    细解：第2行首先通过调用LogPtoProb函数分别获取prob1和prob2中的概率值，然后对该
    概率值使用lambda参数进行线性插值运算，同时通过调用ProbToLogP函数将运算结果转换
    为以10为底的概率对数值，同时返回该值。

e)AddLogP函数

<src>
0  inline LogP2 AddLogP(LogP2 x, LogP2 y)
1  {
2      if (x<y) {
3      LogP2 temp = x; x = y; y = temp;
4      }
5      if (y == LogP_Zero) {
6      return x;
7      } else {
8      LogP2 diff = y - x;
9      return x + log10(1.0 + exp(diff * M_LN10));
10     }
11 }
</src>

    功能：对概率对数值中的概率进行求和运算，同时返回求和运算结果的以10为底的对数值

    细解：第2-4行用于处理x小于y的情况，同时将较大的值存储到x中，而将较小的值存储到y中。
    第5-7行用于处理当y为longP_Zero的情况，此时直接返回x值。
    第7-10行用于处理剩下的情况
           x --> log10(a)
           y --> log10(b)

           diff = y-x = log10(b) - log10(a) = log10(b/a)
           diff * M_LN10 = ln(b/a)
           exp(diff*M_LN10) = b/a
           log10(1.0 + exp(diff * M_LN10)) = log10(1.0 + b/a) = log10( (a+b)/a )
           x + log10(1.0 + exp(diff * M_LN10)) = log10(a) + log10( (a+b)/a )
                                               = log10(a+b);

f) SubLogP函数

<src>
0  inline LogP2 SubLogP(LogP2 x, LogP2 y)
1  {
2      assert(x >= y);
3      if (x == y) {
4      return LogP_Zero;
5      } else if (y == LogP_Zero) {
6       return x;
7      } else {
8      LogP2 diff = y - x;
9      return x + log10(1.0 - exp(diff * M_LN10));
10     }
11 }
</src>

  功能：对概率对数值中的概率进行求差运算，同时返回求差运算结果的以10为底的对数值

  细解：第2行用于处理x小于y情况，由于log10(A)是一个递增函数，且定义域为(0, 正无穷)
  因此需要保证 x >= y。
  第3-5行处理当x==y的情况，此时直接返回LogP_Zero，即负无穷小。
  第5-7行处理当y为负无穷小的情况，此时直接返回x。
  第7-10行处理剩余的情况
         x --> log10(a)
           y --> log10(b)

           diff = y-x = log10(b) - log10(a) = log10(b/a)
           diff * M_LN10 = ln(b/a)
           exp(diff*M_LN10) = b/a
           log10(1.0 - exp(diff * M_LN10)) = log10(1.0 - b/a) = log10( (a-b)/a )
           x + log10(1.0 - exp(diff * M_LN10)) = log10(a) + log10( (a-b)/a )
                                               = log10(a-b);

g) weightLogP函数

<src>
0  inline LogP weightLogP(double weight, LogP prob)
1  {
2      /*
3      * avoid NaN if weight == 0 && prob == -Infinity
4      */
5      if (weight == 0.0) {
6      return 0.0;
7      } else {
8      return weight * prob;
9      }
10 }
</src>

  功能：对将权重weight乘到prob上，并返回运算结果

  细解：第5-7行处理当weight为0.0的情况，此时直接返回0.0；否则执行第7-9行
  第7-9行对weight和prob进行乘法运算，并返回运算结果。

h) rint函数

<src>
0  inline double rint(double x)
1  {
2    if (x >= 0) {
3      return (double)(int)(x + 0.5);
4    } else {
5      return (double)(int)(x - 0.5);
6    }
7  }
</src>

  功能：对浮点数x进行求顶或求底操作

  细解：第2-4行处理当x >= 0时进行求顶运算，同时返回运算结果
  第4-6行处理当x < 0时进行求底运算，同时返回运算结果

i) finite函数

<src>
0  inline int finite (double x)
1  {
2    if (x < 1.e+300 && x > -1.e+300)
3      return 1;
4    else
5      return 0;
6  }
</src>
  功能：判断浮点数是否足够大或足够小，同时在满足条件时返回1，否则返回0

  细解：第2-3行处理x大于10的300次方或小于-10的300次方的情况，此时认为
  x为一个无穷值，直接返回1；否则执行第5行返回0。

j) parseLogP函数

<src>
0  Boolean
1  parseLogP(const char *str, LogP &result)
2  {
3      const unsigned maxDigits = 8; // number of decimals in an integer
4
5      const char *cp = str;
6      const char *cp0;
7      Boolean minus = false;
8
9      /*
10      * Log probabilties are typically negative values of magnitude > 0.0001,
11      * and thus are usually formatted without exponential notation.
12      * We parse this type of format using integer arithmetic for speed,
13      * and fall back onto scanf() in all other cases.
14      * We also use scanf() when there are too many digits to handle with
15      * integers.
16      * Finally, we also parse +/- infinity values as they are printed by
17      * printf(). These are "[Ii]nf" or "[Ii]nfinity".
18      */
19
20     /*
21      * Parse optional sign
22      */
23     if (*cp == '-') {
24     minus = true;
25     cp++;
26     } else if (*cp == '+') {
27     cp++;
28     }
29     cp0 = cp;
30
31     unsigned digits = 0;  // total value of parsed digits
32     unsigned decimals = 1;  // scaling factor from decimal point
33     unsigned precision = 0;  // total number of parsed digits
34
35     /*
36      * Parse digits before decimal point
37      */
38     while (isdigit(*cp)) {
39     digits = digits * 10 + (*(cp++) - '0');
40     precision ++;
41     }
42
43     if (*cp == '.') {
44     cp++;
45
46     /*
47     * Parse digits after decimal point
48       */
49     while (isdigit(*cp)) {
50         digits = digits * 10 + (*(cp++) - '0');
51           precision ++;
52         decimals *= 10;
53     }
54     }
55
56     /*
57     * If we're at the end of the string then we're done.
58      * Otherwise there was either an error or some format we can't
59      * handle, so fall back on scanf(), after checking for infinity
60      * values.
61      */
62      if (*cp == '\0' && precision <= maxDigits) {
63     result = (minus ? - (LogP)digits : (LogP)digits) / (LogP)decimals;
64     return true;
65     } else if ((*cp0 == 'i' || *cp0 == 'I') &&
66           (strncmp(cp0, "Inf", 3) == 0 || strncmp(cp0, "inf", 3) == 0))
67     {
68     result = (minus ? LogP_Zero : LogP_Inf);
69     return true;
70     } else {
71     return (sscanf(str, "%f", &result) == 1);
72     }
73 }
</src>

  功能：用于分析字符串表示的浮点数并将分析出的浮点数存储到result中。

  细解：第23-28行用于分析str中表示的浮点数是否为负值，若为负值，则将minus设为true。
  第38-41行分析str表示的浮点数小数点之前的数值，并将其保存到digits中；
  第43-54行分析小数点之后的数值，同时将小数点之后的位数值记录到decimals中；
  第62-65行处理当str被正确分析，且precision小于8的情况，即浮点数的数值位数小于8，此时
  直接将result设为digits和decimals相除运算的结果。若minus为true，则将result设为负的结
  果，同时返回true；
  第65-70行通过分析剩下的无法分析的字符串是否为Inf或inf，同时结合minus来决定将result
  设为LogP_Zero(负无穷小)还是LogP_Inf(正无穷大)，同时返回true；
  第70-72行用于处理其他情况，此时直接调用sscanf函数，从str中读出一个浮点数并将其保存在
  result中。同时返回调用sscanf是否成功的判断结果。

k) ProbToBytelog函数

<src>
0  inline Bytelog ProbToBytelog(Prob prob)
1  {
2     return (int)rint(log(prob) * (10000.5 / 1024.0));
3  }
</src>

  功能：将prob转换为Bytelog

  细解：
  Bytelog介绍：A bytelog is a logarithm to base 1.0001, divided by 1024 and rounded to
  an integer。
  因此第2行通过使用ln(prob)*10000.5来模拟log1.0001(prob)
  举例：prob == 2.0
  则
       ln(prob)*10000.5 = 6931.8183791897330668270298306425
       log1.0001(prob) = ln(prob) / ln(1.0001) = 6931.8183734137953551959678499998

  因此可见 10000.5 约等于 1/ln(1.0001) 因此这里直接用10000.5来模拟1/ln(1.0001)
  通过使用1024除上面结果，并将结果取整，即得到相应的Bytelog值

l) ProbToIntlog函数
<src>
0  inline Intlog ProbToIntlog(Prob prob)
1  {
2      return (int)rint(log(prob) * 10000.5);
3  }
</src>
  功能：将prob转化为Intlog

  细解：
  Intlog介绍：A Intlog is a logarithm to base 1.0001, and rounded to
  an integer。
  因此第2行通过使用ln(prob)*10000.5来模拟log1.0001(prob)
  举例：prob == 2.0
  则
       ln(prob)*10000.5 = 6931.8183791897330668270298306425
       log1.0001(prob) = ln(prob) / ln(1.0001) = 6931.8183734137953551959678499998

  因此可见 10000.5 约等于 1/ln(1.0001) 因此这里直接用10000.5来模拟1/ln(1.0001)
  通过将结果取整，即得到相应的Intlog值

m) LogPtoBytelog函数

<src>
0  inline Bytelog LogPtoBytelog(LogP prob)
1  {
2      return (int)rint(prob * (M_LN10 * 10000.5 / 1024.0));
3  }
</src>

  功能：将logProb转化为Bytelog

  细解：
  Bytelog介绍：A bytelog is a logarithm to base 1.0001, divided by 1024 and rounded to
  an integer。
  因此第2行通过使用Prob * M_LN10(Prob == log10(prob) == ln(prob)/ln(10); M_LN10 == ln10)
  来获得ln(prob)，并通过ln(prob)*10000.5来模拟log1.0001(prob)
  举例：prob == 2.0
  则
       ln(prob)*10000.5 = 6931.8183791897330668270298306425
       log1.0001(prob) = ln(prob) / ln(1.0001) = 6931.8183734137953551959678499998

  因此可见 10000.5 约等于 1/ln(1.0001) 因此这里直接用10000.5来模拟1/ln(1.0001)
  通过使用1024除上面结果，并将结果取整，即得到相应的Bytelog值

n) LogPtoIntlog函数

<src>
0  inline Intlog LogPtoIntlog(LogP prob)
1  {
2     return (int)rint(prob * (M_LN10 * 10000.5));
3  }
</src>
  功能：将logProb转化为Intlog

  细解：
  Intlog介绍：A Intlog is a logarithm to base 1.0001, and rounded to
  an integer。
  因此第2行通过使用Prob * M_LN10(Prob == log10(prob) == ln(prob)/ln(10); M_LN10 == ln10)
  来获得ln(prob)，并通过ln(prob)*10000.5来模拟log1.0001(prob)
  举例：prob == 2.0
  则
       ln(prob)*10000.5 = 6931.8183791897330668270298306425
       log1.0001(prob) = ln(prob) / ln(1.0001) = 6931.8183734137953551959678499998

  因此可见 10000.5 约等于 1/ln(1.0001) 因此这里直接用10000.5来模拟1/ln(1.0001)
  通过将结果取整，即得到相应的Intlog值

o) IntlogToLogP函数
<src>
0  inline LogP IntlogToLogP(double prob) /* use double argument to avoid loss
1                      * of information when converting from
2                      * floating point values */
3  {
4      return prob/(M_LN10 * 10000.5);
5  }
</src>
  功能：将Intlog转换为logProb

  细解：
  Intlog介绍：A Intlog is a logarithm to base 1.0001, and rounded to
  an integer。
  因此第2行通过使用Prob / M_LN10(Prob == log1.0001(prob) == ln(prob)/ln(1.0001); M_LN10 == ln10)
  来获得log10(prob)/ln(1.0001)，并通过log10(prob)/(ln(1.0001)*10000.5)来模拟log10(prob)
  因为 10000.5 约等于 1/ln(1.0001)

p) BytelogToLogP函数
<src>
0  inline LogP BytelogToLogP(double bytelog) /* use double argument so we can
1                        * scale float values without loss of
2                        * precision */
3  {
4      return bytelog * (1024.0 / 10000.5 / M_LN10);
5  }
</src>
    功能：将Bytelog转换为logProb

    细解：
    Bytlog介绍：
    A bytelog is a logarithm to base 1.0001, divided by 1024 and rounded to
  an integer。
    因此第2行通过使用bytelog*1024.0近似得到log1.0001(prob)，通过使用10000.5/M_LN10
    即ln(1.0001)/ln10和log1.0001(prob)进行乘法操作得到logProb
    因为log1.0001(prob) = lnprob/ln1.0001
    因此log1.0001(prob)*(ln(1.0001)/ln10) = lnprob/ln10 = log10(prob) --> logProb

q) IntlogToBytelog函数
<src>
0  inline Bytelog IntlogToBytelog(Intlog intlog)
1  {
2     int bytelog = ((-intlog) + (1 << (BytelogShift-1))) >> BytelogShift;
3
4     if (bytelog > 255) {
5      bytelog = 255;
6     }
7     return -bytelog;
8  }
</src>
  功能：将Intlog转换为Bytelog函数

  细解：

r) BytelogToIntlog函数
<src>
0  inline Intlog BytelogToIntlog(Bytelog bytelog)
1  {
2      return bytelog << BytelogShift;
3  }
</src>
  功能：将Bytelog转换为Intlog

  细解：由于Bytelog和Intlog只差了1024，所以只需要
  将bytelog乘以1024即可得到Intlog。（也即左移10位）


--------------------------------------
知识点：
--------------------------------------
1、对数和其系数之间的相互转换
       系数转对数
           该操作一般可以通过直接调用库函数log10来实现。
       对数转系数
           log10(a) --> a
           根据以下公式
           log10(a) = ln(a)/ln(10)
           M_LN10   = ln(10)
           可以推导出
           log10(a) * M_LN10 = ln(a)/ln(10) * ln(10) = ln(a)
           又有公式
           exp(ln(a)) = a
           可得
           exp(log10(a) * M_LN10) = exp(ln(a)) = a;

2、对数系数的加减运算
       加运算
       log10(a) log10(b) --> log10(a+b)

       假设 b > a
       由log10的递增特性以及对数的相加和相减运算操作可得
       log10(b) - log10(a) = log10(b/a)
       1 + exp(log10(b/a)*M_LN10) = 1 + b/a = (a+b)/a
       因此
       log10(a) + log10(1 + exp(log10(b/a)*M_LN10)
       = log10(a) + log10((a+b)/a) = log10(a+b)

       减运算
       log10(a) log10(b) --> log10(a-b)

       由log10的递增特性以及对数的相加和相减运算操作可得
       log10(b) - log10(a) = log10(b/a)
       1 - exp(log10(b/a)*M_LN10) = 1 - b/a = (a-b)/a
       因此
       log10(a) + log10(1 - exp(log10(b/a)*M_LN10)
       = log10(a) + log10((a-b)/a) = log10(a-b)

0 0