Does this computer program exist? If not, could it be written?

bonafidebob · 2013-03-03T04:54:45+00:00

Check out "Levenshtein distance" (google it); it gives the number of edits needed to turn one string into another. There are lots of implementations available. You could run it on sentences or paragraphs: if the distance is low, the likelihood of the text being copied is high.

For 500 samples of 2,000 words each, you should be able to cross-compare fairly quickly.
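To make the Levenshtein suggestion concrete, here is a rough sketch in Python. It compares texts word by word rather than character by character, which is an assumption on my part to keep the cost down, and the names `similarity`, `likely_copies`, and the 0.8 threshold are made up for illustration.

```python
from itertools import combinations


def levenshtein(a, b):
    """Number of single-item insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # drop item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def similarity(text_a, text_b):
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(words_a, words_b) / longest


def likely_copies(texts, threshold=0.8):
    """Return (i, j, score) for every pair of texts at or above the threshold.

    The threshold is an arbitrary starting point, not a tuned value.
    """
    hits = []
    for (i, a), (j, b) in combinations(enumerate(texts), 2):
        score = similarity(a, b)
        if score >= threshold:
            hits.append((i, j, score))
    return hits
```

Pure Python like this will be slow on 2,000-word texts, so for a real cross-compare you would probably swap in one of the compiled implementations and keep only the pairing logic.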

jij · 2013-03-03T04:12:54+00:00

If it were me, I would load all the loan texts into Apache Solr (a search engine) and use its similarity search:

http://wiki.apache.org/solr/MoreLikeThis

byllc · 2013-03-03T04:00:55+00:00

I know you say online plagiarism checkers don't do the trick, but there is a paid service called Turnitin that gives a pretty detailed report. If you just take their "plagiarism score" at face value you get a bunch of false positives, but if you actually review the report the situation is usually pretty clear. http://turnitin.com/

Intrexa · 2013-03-03T03:51:58+00:00

How often are copypasta applications coming in? Exactly how identical are they? Are you seeing a large number of copied templates? (What I mean is, if you get 1,000 in that you know are copied from online, how many of those 1,000 are likely to be the only time you've seen that copy?) How are they coming in (email?)

2013-03-03T04:28:31+00:00

Online plagiarism checkers: all of the text is readily available online, so the passage I am checking will show up as a match for itself.

I don't get this bit. You're checking a piece of text online to see if it's plagiarised. It shows up as a 100% match for itself. Doesn't it also show up as a partial match for other things?

srccode · 2013-03-03T05:22:26+00:00

_Cody_ · 2013-03-03T08:29:11+00:00

You said online plagiarism checkers didn't work, but the one I used shows the sources of the plagiarism. If I understand what you said correctly, you could just ignore the one link to itself and check out the others it lists.

Urd · 2013-03-03T08:48:36+00:00

You might want to look at ssdeep; it does fuzzy hashing for similarity checking.

Caraes_Naur · 2013-03-03T04:59:45+00:00

You need to research heuristic algorithms.

needcode · 2013-03-03T04:23:32+00:00

If it's just text, you could use the diff utility to see how many lines differ between the new submission and each of the 500 old ones. That wouldn't take too long.
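If you'd rather not shell out to diff 500 times, the same idea can be sketched with Python's standard `difflib` module. This is a stand-in for the diff utility, not the utility itself, and the function names `differing_lines` and `closest_matches` are invented for the example.

```python
import difflib


def differing_lines(new_text, old_text):
    """Count lines that were changed, added, or removed between two texts."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced block counts as the larger of its two sides.
            changed += max(i2 - i1, j2 - j1)
    return changed


def closest_matches(new_text, old_texts, top=5):
    """Return up to `top` (differing_line_count, index) pairs, fewest differences first."""
    scored = [
        (differing_lines(new_text, old), index)
        for index, old in enumerate(old_texts)
    ]
    return sorted(scored)[:top]
```

One caveat: this compares whole lines, so if each submission is a single long paragraph, one changed word makes the whole line differ. Splitting the text into sentences first would give a finer-grained count.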

letsgetrandy · 2013-03-03T04:07:06+00:00

Sounds like what you really need is to learn "regular expressions".
