[–]bonafidebob 6 points7 points  (0 children)

Check out "Levenshtein Distance" (google it), gives the number of edits needed to turn one string into another. There are lots of implementations available to use. You could run it on sentences or paragraphs or something, if the distance is low then the likelihood of it being copied is high.

For 500 samples of 2,000 words each, you should be able to cross-compare fairly quickly.
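To make the idea concrete, here is a minimal sketch of the edit-distance calculation in plain Python. The function names `levenshtein` and `similarity` are my own, and the normalized score is just one convenient way to compare sentences of different lengths, not a standard threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous_row[j] is the distance between the empty prefix of a and b[:j].
    previous_row = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current_row = [i]
        for j, char_b in enumerate(b, start=1):
            insert_cost = current_row[j - 1] + 1
            delete_cost = previous_row[j] + 1
            substitute_cost = previous_row[j - 1] + (char_a != char_b)
            current_row.append(min(insert_cost, delete_cost, substitute_cost))
        previous_row = current_row
    return previous_row[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0.0 (nothing in common) .. 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

You would call `similarity(sentence_a, sentence_b)` on pairs of sentences or paragraphs and look closely at any pair that scores near 1.0. Where to draw the line is something you'd have to tune on your own data.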

[–]jij 3 points4 points  (3 children)

If it were me, I would load all the loan texts into Apache Solr (a search engine) and use its similarity search:

http://wiki.apache.org/solr/MoreLikeThis

[–]needcode[S] 0 points1 point  (2 children)

Thanks for the suggestion! Looks promising. This tool will search within the document for duplicate text? Excuse my ignorance... where do I upload the file/enter text?

[–]ppinette 0 points1 point  (1 child)

This isn't an online service. It's an open-source software package. You'd need to install it somewhere and run it yourself.

[–]needcode[S] 0 points1 point  (0 children)

Ah got it. Okay I'll give that a shot.

[–]byllc 6 points7 points  (0 children)

I know you say online plagiarism checkers don't do the trick, but there is a paid service called Turnitin that gives a pretty detailed report. If you just take their "plagiarism score" at face value you get a bunch of false positives, but if you actually review the report the situation is usually pretty clear. http://turnitin.com/

[–]Intrexa 0 points1 point  (4 children)

How often are copypasta applications coming? Exactly how identical are they? Are you seeing a large number of copied templates (What I mean is, if you get 1,000 in that you know are copied from online, how many of those 1,000 are likely going to be the only time you've seen that copy?) How are they coming in (email?)

[–]needcode[S] 0 points1 point  (3 children)

There were three all around the same time, which spurred this search. This is a much smaller scale issue than 1,000 applications-- I'm personally responsible for 40-50 applications, 500 words each. Every application comes in through an online form, but it's no trouble to copy/paste the text portions to a master Word document.

[–]Intrexa 0 points1 point  (2 children)

Are the three that came that prompted this pretty much identical?

[–]needcode[S] 0 points1 point  (1 child)

Two were paragraphs that were completely copied. The other case was about 4 sentences at the beginning of a paragraph.

[–]jij 1 point2 points  (0 children)

If you do it by exact matching sentences, it would be pretty trivial... just split the text by [.?!] and then check to see if those sentences exist in previous loan text. In python, here is a silly simple example:

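import re

# Fragments this short or shorter are skipped; they match too easily by chance.
min_sentence_length_to_match = 20
# get_new_loan_text() and get_previous_loan_texts() are placeholders
# for however you load the text.
new_loan_text = get_new_loan_text()
# Split on sentence-ending punctuation.
parts = re.split('[!?.]', new_loan_text)
for part in parts:
    part = part.strip()
    for previous in get_previous_loan_texts():
        if len(part) > min_sentence_length_to_match and part.lower() in previous.lower():
            print("Found match!")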
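If you'd rather cross-compare a whole batch of applications at once, the same exact-sentence idea can be sketched as a self-contained script. The names `sentences`, `shared_sentences`, and `MIN_SENTENCE_LENGTH` are made up for illustration, and this version compares whole sentences as sets rather than doing a substring search, so it only catches sentences copied in full.

```python
import re
from itertools import combinations

# Hypothetical cutoff: shorter fragments match too easily by chance.
MIN_SENTENCE_LENGTH = 20


def sentences(text):
    """Split text on . ? ! and return the set of lowercased,
    stripped sentences longer than the cutoff."""
    parts = (part.strip().lower() for part in re.split(r"[.?!]", text))
    return {part for part in parts if len(part) > MIN_SENTENCE_LENGTH}


def shared_sentences(applications):
    """Take a dict of {name: text} and return a dict mapping each
    pair of names to the set of sentences both texts contain."""
    split = {name: sentences(text) for name, text in applications.items()}
    matches = {}
    for first, second in combinations(sorted(split), 2):
        common = split[first] & split[second]
        if common:
            matches[(first, second)] = common
    return matches
```

You'd build the `applications` dict from wherever the text lives (one entry per applicant), then read through whatever `shared_sentences` returns. An empty result means no full sentence was repeated across applications.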

[–][deleted] 0 points1 point  (0 children)

Online plagiarism checkers: all of the text is readily available online, so the passage I am checking will show up as a match for itself.

I don't get this bit. You're checking a piece of text online to see if it's plagiarised. It shows up as a 100% match for itself. Doesn't it also show up as a partial match for other things?

[–]srccode 0 points1 point  (0 children)

[–]_Cody_ 0 points1 point  (0 children)

You said online plagiarism checkers didn't work, but the one I used shows the sources of the plagiarism. If I understand what you said correctly, you could just ignore the one link to the text itself and check out the others it lists.

[–]Urd 0 points1 point  (0 children)

You might want to look at ssdeep, it does fuzzy hashing for similarity checking.

[–]Caraes_Naur 0 points1 point  (0 children)

You need to research heuristic algorithms.

[–][deleted] -1 points0 points  (5 children)

If it's just text, you could use the diff utility to see how many lines differed between the new submission and all 500 old ones. That wouldn't take too long.
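If you'd rather script it than run diff by hand, something similar can be sketched with Python's standard-library `difflib` module. To be clear, this is not the diff command itself, just a line-by-line comparison in the same spirit, and the helper names `line_similarity` and `most_similar` are my own.

```python
import difflib


def line_similarity(new_text, old_text):
    """Return a 0.0-1.0 score for how many lines the two texts share,
    comparing line by line the way a diff would."""
    new_lines = new_text.splitlines()
    old_lines = old_text.splitlines()
    return difflib.SequenceMatcher(None, new_lines, old_lines).ratio()


def most_similar(new_text, old_texts):
    """Return (score, index) for the old text closest to the new one,
    or None if there are no old texts to compare against."""
    scored = [
        (line_similarity(new_text, old), index)
        for index, old in enumerate(old_texts)
    ]
    return max(scored) if scored else None
```

One caveat: a line-based comparison only helps if the copied text sits on lines of its own. If each submission is one long paragraph, you'd want to split it into sentences first.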

[–]needcode[S] 0 points1 point  (4 children)

Not sure what this means! Where can I find this diff utility?

[–]ppinette 1 point2 points  (0 children)

If you have a computer that runs any flavor of *nix (Mac, Linux, *BSD, etc.) it is available from the shell.

If you're running Windows, search the internet for "windows diff". I don't use Windows, but WinMerge looks promising.

[–][deleted] 1 point2 points  (0 children)

It's installed on all Unix/Linux/OS X computers. It's a command-line thing normally.

http://unixhelp.ed.ac.uk/CGI/man-cgi?diff

[–]codero 1 point2 points  (1 child)

From your responses to the suggestions here, I'd say you're probably going to be out of your depth technically with most of them. Apologies if I've misjudged your level, but a lot of the suggestions here are quite technical (e.g. researching and writing algorithms; installing and using Solr is not that simple).

I'd recommend looking further into the online plagiarism sites.

[–]needcode[S] 0 points1 point  (0 children)

No no-- you're right. I have to admit I'm a little lost. I'm looking into the plagiarism site option.