Document Fingerprinting in MATLAB
It’s been a while since I’ve posted here. The semester is in full swing, and things at the newspaper have been keeping me busy. A few weeks ago I started reading about document fingerprinting for plagiarism detection, and I’ve made some progress.I was hoping to create an online lyrics aggregator and software to build a lyrics database. Basically, server software would search google for “Better than Ezra Closer Lyrics”, visit the top pages, and use plagiarism detection algorithms to “find” the lyrics on the pages. The basic assumption was that sections of the HTML containing lyrics would be nearly identical. Taking this area of each page, the data could be further refined by identifying phrases and punctuation that was inconsistent across the results. I hate going to sites that offer incorrect, damaged lyrics. Lookup lyrics for any rap song and you’ll see the problem. Every site is slightly different. Using plagiarism and document fingerprinting, it would be possible to find the “average” of these versions – hopefully coming to a more accurate version.Since the document fingerprinting / plagiarism detection algorithms were central to the idea, I started looking into them first. An hour of Google searching turned up some interesting results. Namely, I found this paper:
Winnowing: a Document Fingerprinting Algorithm
“Among digital data, documents are the easiest to copy and remove any signatures or fingerprints embedded, which make the pirating the hardest to detect. Anyone can just retype a document or copy a part of it. Document fingerprinting is concerned with accurately identifying and copying, including small partial copies, within large sets of documents.” [From the Abstract]
It does a great job of explaining the idea behind document fingerprinting, the different approaches, and their usefulness in different contexts. So be sure to check it out.I wrote several MATLAB functions to test the algorithms in the paper. After taking Image Processing last semester, I’m pretty comfortable implementing stuff like this. So here’s what it boils down to:
function f = generateFingerprints(chars) kgram_size = 13; s = 1; [t char_count] = size(chars) k_count = char_count - kgram_size + 1; kgrams = zeros(k_count,kgram_size); for s = 1:k_count kgrams(s,:) = chars(s:(s+kgram_size-1)); end b = 6; s = 1; khash = 0; for i = 1:kgram_size khash = khash + kgrams(s,i)*b^(kgram_size-i); end for s = 2:k_count khash = (khash - kgrams(s-1,1)*b^(kgram_size-1) )*b + kgrams(s,kgram_size); khashes(s) = khash; end w_size = 5; prev_min_position = -1; f_count = 0; for position = 1:(k_count - (w_size-1)) w = khashes(position:(position+w_size-1)); m = min(w); if (m ~= 0) for i = w_size:-1:1 if (w(i) == m) if (prev_min_position ~= position) prev_min_position = position; f_count = f_count + 1; f(f_count,1) = m; f(f_count,2) = position; end break end end end endend