Random Text Generation: Order-k Markov Chain Text Generation

Source: Internet. Editor: 程序博客网. Time: 2024/05/26 07:29

Here k = 2:

    int k = 2;
    char inputchars[5000000];
    char *word[1000000];
    int nword = 0;
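First, we scan the entire input text and record where each word begins. The array word acts as a kind of suffix array pointing into the characters, except that it only starts at word boundaries. The variable nword holds the number of words. We read the file with the following code: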
    word[0] = inputchars;
    /* each scanf call stores the next word at word[nword], followed by '\0' */
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
Each word in the file is appended to inputchars and terminated by the null character that scanf supplies.
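The reading loop relies on scanf and does no bounds checking. As a minimal sketch of the same memory layout built from an in-memory string, the hypothetical helper load_words below (its name and parameters are assumptions for illustration, not part of the original program) packs words back to back into a buffer and records where each one starts:

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the original program.
   Copies the whitespace-separated words of text into buf, back to back,
   each followed by '\0', and stores the start of each word in word[].
   Stops early if buf or word[] would overflow. Returns the word count. */
int load_words(const char *text, char *buf, size_t bufsize,
               char **word, int maxwords)
{
    int nword = 0;
    size_t used = 0;

    for (;;) {
        while (isspace((unsigned char)*text))   /* skip separators */
            text++;
        if (*text == '\0')
            break;

        size_t len = 0;                         /* length of this word */
        while (text[len] != '\0' && !isspace((unsigned char)text[len]))
            len++;

        if (nword == maxwords || used + len + 1 > bufsize)
            break;                              /* out of room */

        word[nword++] = buf + used;
        memcpy(buf + used, text, len);
        buf[used + len] = '\0';
        used += len + 1;
        text += len;
    }
    return nword;
}
```

Unlike the scanf loop, this sketch does not set word[nword]. A caller that wants to append the k terminating null characters would need to reserve room for them and set that pointer itself.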
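Second, after reading the input, we sort the word array so that all pointers to the same sequence of k words end up next to each other. The sort relies on the following comparison function: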
    int wordncmp(char *p, char *q)
    {
        int n = k;                        /* words still to be matched */
        for ( ; *p == *q; p++, q++)
            if (*p == 0 && --n == 0)
                return 0;                 /* k words matched */
        return *p - *q;
    }
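While the characters are equal, the function scans along both strings. Each time it reaches a null character it decrements the counter n, and it returns 0 (equal) once k identical words have been seen. When it finds a pair of differing characters, it returns their difference (*p - *q).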

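After reading the input, we append k null characters after the last word (so that the comparison function never runs past the end of the string), print the first k words of the document (to start the random output), and call the sort: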
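    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);
    /* sortcmp is not shown in this article; presumably a qsort-style wrapper around wordncmp */
    qsort(word, nword, sizeof(word[0]), sortcmp);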
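The qsort call passes sortcmp, which this article does not define. Here is a minimal sketch, assuming sortcmp is simply a qsort-compatible wrapper around wordncmp. The name sort_words is a hypothetical convenience function, and wordncmp is repeated (with const added) only so that the block is self-contained:

```c
#include <stdlib.h>

static int k = 2;   /* order of the chain, as in the article */

/* Compare the first k null-terminated words at p and q. */
static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* Assumed shape of sortcmp: qsort hands the comparator pointers to the
   array elements, so each argument is really a char ** that has to be
   dereferenced before calling wordncmp. */
static int sortcmp(const void *a, const void *b)
{
    const char *p = *(char *const *)a;
    const char *q = *(char *const *)b;
    return wordncmp(p, q);
}

/* Sort the word pointers so that equal k-word phrases become adjacent. */
void sort_words(char **word, int nword)
{
    qsort(word, (size_t)nword, sizeof word[0], sortcmp);
}
```

Entries whose first k words are identical compare as equal, so their relative order after qsort is unspecified. The generation step does not depend on that order.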
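This space-efficient data structure now holds a great deal of information about the k-grams in the text. If k is 1 and the input text is "of the people, by the people, for the people", the word array looks like this: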
Before sorting:
    word[0]: of the people, by the people, ...
    word[1]: the people, by the people, for ...
    word[2]: people, by the people, for the ...
    word[3]: by the people, for the people
    word[4]: the people, for the people
    word[5]: people, for the people
    word[6]: for the people
    word[7]: the people
    word[8]: people
After sorting:
    word[0]: by the people, for the people
    word[1]: for the people
    word[2]: of the people, by the people, ...
    word[3]: people
    word[4]: people, by the people, ...
    word[5]: people, for the people
    word[6]: the people, by the people, ...
    word[7]: the people
    word[8]: the people, for the people
To find the words that can follow "the", we look it up in the suffix array and find three choices: "people," twice and "people" once.
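The lookup just described can be sketched as a lower-bound binary search followed by a scan over the matching entries. The function names first_match and count_matches are hypothetical, and k is fixed at 1 here only to mirror the "of the people" example:

```c
static int k = 1;   /* order 1, to mirror the "of the people" example */

/* Compare the first k null-terminated words at p and q. */
static int wordncmp(const char *p, const char *q)
{
    int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* Return the index of the first entry of the sorted word[] whose first k
   words are not less than phrase; nword if every entry is smaller. */
int first_match(char **word, int nword, const char *phrase)
{
    int l = -1, u = nword;
    while (l + 1 != u) {
        int m = (l + u) / 2;
        if (wordncmp(word[m], phrase) < 0)
            l = m;
        else
            u = m;
    }
    return u;
}

/* Count the consecutive entries, starting at index u, that match phrase. */
int count_matches(char **word, int nword, int u, const char *phrase)
{
    int i = 0;
    while (u + i < nword && wordncmp(word[u + i], phrase) == 0)
        i++;
    return i;
}
```

On the sorted array from the example, first_match for "the" returns index 6 and count_matches reports three entries, which are the three candidates listed above.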

Now we can generate nonsense text with the following pseudocode:
    phrase = first phrase in input array
    loop
        perform a binary search for phrase in word[0..nword-1] // find the first occurrence of phrase
        for all phrases equal in the first k words // scan every matching phrase
            select one at random, pointed to by p
        phrase = word following p
        if k-th word of phrase is length 0 // the phrase has reached the end of the document
            break
        print k-th word of phrase
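The complete implementation of this loop is:
    int i, wordsleft, l, m, u;
    char *phrase, *p;

    phrase = inputchars;
    for (wordsleft = 10000; wordsleft > 0; wordsleft--) {
        /* binary search for the first entry that matches phrase */
        l = -1;
        u = nword;
        while (l+1 != u) {
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        /* choose one of the matching entries at random */
        for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
            if (rand() % (i+1) == 0)
                p = word[u+i];
        /* skip(p, n) is not shown in this article; it advances p past n words */
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }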
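The loop calls skip, which this article never defines, and it picks among the matching entries with the test rand() % (i+1) == 0. The sketch below shows one plausible definition of skip, inferred from how it is used (skip(p, 1) is the word following p). It also isolates the selection step in a hypothetical helper named pick_one. The i-th candidate (counting from 0) replaces the current choice with probability 1/(i+1), so after all candidates have been seen each one is about equally likely to be the final choice, apart from the small bias of rand() % n.

```c
#include <stdlib.h>

/* Assumed definition of skip: advance p past n null-terminated words
   and return a pointer to the start of the word that follows. */
char *skip(char *p, int n)
{
    for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

/* Hypothetical helper: choose one of count candidates in a single pass.
   Returns NULL when count is 0. */
char *pick_one(char **cand, int count)
{
    char *p = NULL;
    for (int i = 0; i < count; i++)
        if (rand() % (i + 1) == 0)   /* always true for i == 0 */
            p = cand[i];
    return p;
}
```

Selecting this way needs no separate pass to count the matches first, which is why the generation loop can scan and choose at the same time.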