Simplest plagiarism detector ever

It doesn’t deserve the label „antiplagiator“, so let’s call it „text-matcher“. This app looks for identical passages in two pieces of text (word by word), nothing more.
I wrote it because:

  1. I couldn’t find a similar lightweight web app
  2. Drc needs it (a bit)
  3. I’m going to need this function for Antiplagiator’s backend to find the exact boundaries of plagiarised passages

Here it is: http://blog.orwen.org/textmatcher (be ready to see prototype-like stuff)

How does it work?
The function looks for the longest match and then does the same recursively for the text to the left and to the right of it. Here is the core of the code.

from difflib import SequenceMatcher

def rlong(self, ta, tb, phase):
  # ta, tb: word sequences; phase: offset of ta within the original text

  # let others do the job
  s = SequenceMatcher(None, ta, tb, autojunk=False)

  (i, j, n) = s.find_longest_match(0, len(ta), 0, len(tb))
  if n < 2:
    return None

  self.frags.append((i+phase, j, n, ta[i:i+n])) # found fragment

  if i > self.minim:
    self.rlong(ta[0:i], tb, phase) # look left around
  if len(ta) > i+n+self.minim:
    self.rlong(ta[i+n:len(ta)], tb, i+n+phase)  # look right around

You can set the minimum length of a word sequence to be matched. For large texts it tends to be slow, so I limited the input to 30 000 words. Feel free to report any bugs.
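
If you want to play with the idea outside the web app, here is a minimal standalone sketch of the same recursion. It is not the app's actual code: the names find_matches, match_texts, min_len and max_words are made up for illustration, splitting on whitespace is an assumption about how the text is cut into words, and using min_len both as the match threshold and as the "is it worth looking further" limit is my simplification.

```python
from difflib import SequenceMatcher


def find_matches(words_a, words_b, min_len=2, offset=0, found=None):
    """Collect (pos_a, pos_b, length, words) tuples for shared word runs.

    offset is the position of words_a[0] in the full first text, so the
    reported pos_a always refers to the original word list.
    """
    if found is None:
        found = []
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    i, j, n = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
    if n < min_len:
        return found  # nothing long enough in this slice
    found.append((i + offset, j, n, words_a[i:i + n]))
    # Recurse into what is left of the first text on either side of the match.
    if i >= min_len:
        find_matches(words_a[:i], words_b, min_len, offset, found)
    if len(words_a) - (i + n) >= min_len:
        find_matches(words_a[i + n:], words_b, min_len, i + n + offset, found)
    return found


def match_texts(text_a, text_b, min_len=2, max_words=30000):
    """Split both texts on whitespace and return matches sorted by position."""
    words_a, words_b = text_a.split(), text_b.split()
    if len(words_a) > max_words or len(words_b) > max_words:
        raise ValueError("text too long")  # hypothetical guard, mirrors the word cap
    return sorted(find_matches(words_a, words_b, min_len))


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "a quick brown fox sleeps near the lazy dog today"
    for pos_a, pos_b, length, words in match_texts(a, b):
        print(pos_a, pos_b, length, " ".join(words))
```

Note that only the first text is cut into pieces; every piece is compared against the whole second text again, which is one reason why long inputs get slow.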
