Simplest plagiarism detector ever
It doesn’t deserve the label „antiplagiator“. Let’s call it a „text-matcher“. The app looks for identical passages in two pieces of text (word by word), nothing more.
I wrote it because:
- I couldn’t find a similar lightweight web app
- Drc needs it (a bit)
- I’m going to need this function in Antiplagiator’s backend to find the exact boundaries of plagiarised passages
Here it is: http://blog.orwen.org/textmatcher (be ready to see prototype-like stuff)
How does it work?
The function finds the longest match between the two texts and then recursively does the same for the text to the left and to the right of it. Here is the core of the code.
from difflib import SequenceMatcher

def rlong(ta, tb, phase):
    # let others do the job
    s = SequenceMatcher(None, ta, tb, autojunk=False)
    (i, j, n) = s.find_longest_match(0, len(ta), 0, len(tb))
    if n < 2:
        return None
    self.frags.append((i + phase, j, n, ta[i:i + n]))  # found fragment
    if i > self.minim:
        rlong(ta[0:i], tb, phase)  # look left around
    if len(ta) > i + n + self.minim:
        rlong(ta[i + n:len(ta)], tb, i + n + phase)  # look right around
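The snippet above relies on an enclosing object (self.frags, self.minim), so it cannot be run on its own. Below is a minimal standalone sketch of the same idea, not the app's actual code. The name find_matches and the parameter min_words are my own for illustration, and unlike the original it uses a single threshold both for accepting a match and for deciding whether the leftover text is worth searching.

```python
from difflib import SequenceMatcher


def find_matches(text_a, text_b, min_words=3):
    """Return (pos_a, pos_b, length, words) for word runs shared by both texts.

    Hypothetical helper: positions are word indexes, results are sorted by pos_a.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    frags = []

    def rlong(ta, phase):
        # ta is a slice of words_a that starts at index `phase` of the full list
        s = SequenceMatcher(None, ta, words_b, autojunk=False)
        i, j, n = s.find_longest_match(0, len(ta), 0, len(words_b))
        if n < min_words:
            return  # nothing long enough in this slice
        frags.append((i + phase, j, n, " ".join(ta[i:i + n])))
        if i >= min_words:
            rlong(ta[:i], phase)  # search to the left of the match
        if len(ta) - (i + n) >= min_words:
            rlong(ta[i + n:], phase + i + n)  # search to the right of the match

    rlong(words_a, 0)
    return sorted(frags)
```

Calling find_matches(a, b, min_words=3) on two strings returns the shared runs of at least three words, each with its word offset in both texts. Only the first text is sliced during recursion, so a passage of the second text can be reported more than once if the first text repeats it.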
You can set the minimum length of the word sequence to be matched. It tends to be slow on large texts, so I limited the number of words to 30 000. Feel free to report any bugs.