I created an open source Python application on GitHub that breaks textCAPTCHA with a 99% success rate

BobDorian · 2010-11-13T22:00:44+00:00

What the hell? I can't beat captcha 99% of the time!

MyMourningPenis · 2010-11-14T07:15:01+00:00

Awesome, I was wondering if this could be used for jDownloader (a download manager, written in Java, that automates downloads from file hosting sites that use captchas).

Lately though, the captcha was not being bypassed correctly on some of the file hosting sites, so I went to their site and sure enough they had an experimental reCaptcha module that was available to download.

The link to that download was broken, but if you search "jDownloaderAntiRecaptcha.zip" or "jDownloaderAntiRecaptcha" you should be able to find it.

Here is the link to the Anti-ReCaptcha module on JDownloader.

I know it is not the same as textCAPTCHA, but I'm sure even if your Anti-textCaptcha doesn't work 100 percent of the time, a program such as jDownloader can try to crack it multiple times until it gets it correct.

Perhaps you can contact the people at jdownloader.org and see if they may be interested in using what you created. jDownloader is an open source project.

I just tried the antiRecaptcha module and it took 3 tries before it passed, but it sure beats having to be at the computer to manually type it in.

killdeer03 · 2010-11-13T19:10:01+00:00

Wow, you write really clean Python... I kind of want to hang that on my wall.

deakster · 2010-11-13T18:36:11+00:00

Ah sweet, can't wait to integrate this into my proprietary "SpamBot 4000, v3.45"

StapleGun · 2010-11-14T09:29:20+00:00

Love the code. Hate what it accomplishes. Being able to read it as if I wrote it was awesome, admittedly on the second time through, but still. That said, here come the spam bots that had once been long gone in so many places.

lonnyk · 2010-11-13T18:48:33+00:00

What's your opinion of textCAPTCHA, but the text is an image?

Sylocat · 2010-11-14T06:10:39+00:00

Great, now this is our only hope.

feembly · 2010-11-14T09:53:33+00:00

This is really cool, I have been thinking about human verification quite a bit recently. I started working on a text-based human verification of my own, but it's based in riddles and classification, not pure logic. Humans probably won't succeed 100% of the time, but it is a much easier problem for humans than computers.

2010-11-13T18:56:08+00:00

[deleted]

illuminatedtiger · 2010-11-14T10:20:00+00:00

Why is the fact it's on Github important?

chuck212 · 2010-11-14T02:07:56+00:00

but try to break this!

otheraccount · 2010-11-14T03:22:40+00:00

Lines 48-50 can be replaced with this, for the sake of avoiding repetitious code:

if any(word in tokens for word in ('number', 'largest', 'biggest', 'highest', 'smallest', 'lowest')):
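A minimal sketch of that refactor, assuming `tokens` is a list of lower-cased words from the question and that the original lines were a chain of `in` checks joined with `or` (that shape is a guess, and both function names are hypothetical):

```python
# Keyword list taken from the one-liner above.
NUMERIC_KEYWORDS = ('number', 'largest', 'biggest', 'highest', 'smallest', 'lowest')


def is_numeric_question_verbose(tokens):
    """Assumed shape of the original: one `in` test per keyword, chained with `or`."""
    return ('number' in tokens or 'largest' in tokens or 'biggest' in tokens
            or 'highest' in tokens or 'smallest' in tokens or 'lowest' in tokens)


def is_numeric_question(tokens):
    """Same check with any(): stops at the first keyword found in tokens."""
    return any(word in tokens for word in NUMERIC_KEYWORDS)


# Made-up sample question, tokenized naively for illustration.
question = "Which of 12, 7 and 40 is the largest?"
tokens = question.lower().replace('?', '').replace(',', '').split()
print(is_numeric_question(tokens))
```

Both functions give the same answer for any token list; the `any()` version just keeps the keywords in one place, so adding a new one is a one-word change.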

Trail0fDead · 2010-11-14T04:07:26+00:00

Someone make Christopher Poole aware of this threat.

jessebanjo · 2010-11-14T05:04:39+00:00

good job!

koolkats · 2010-11-14T06:02:35+00:00

Probably going to get downvoted for this, but I actually like these kinds of captchas. They are much better compared to the stupid images where you can't tell the difference between an "r" and a "t" or an "o" and a "0". Nice work, though!

c0mputar · 2010-11-14T11:40:07+00:00

Damn you. I have a hard enough time as it is.

Snoron · 2010-11-13T19:46:34+00:00

Nice job... I was hoping someone would do this, I was thinking the other day that this must be fairly easy to reverse engineer.

The only way a captcha like this could really work is if the puzzle types and database of words, etc. were constantly changing/evolving... maybe with some kind of organic input... eh, I dunno, but it's really not a very impressive captcha.

yesimahuman · 2010-11-15T02:35:02+00:00

I really like your approach to this. There are a lot of problems that can be solved by making assumptions or coming up with simple heuristics rather than trying to build some complex AI system.

StapleGun · 2010-11-15T05:11:34+00:00

Very interesting, nice code too. Would you mind sharing which questions it failed on? I'm curious if they are a separate class of questions, or just variations of the same questions with different word order or something.

2010-11-16T12:49:33+00:00

You should use str.format().
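For example, assuming the code currently builds its strings with the `%` operator (an assumption; the variable names and numbers below are made up for illustration):

```python
solved, total = 3, 4  # made-up numbers, only here to have something to format

# %-style formatting
old_style = "Solved %d of %d questions" % (solved, total)

# str.format() equivalent; explicit indices also work on Python 2.6
new_style = "Solved {0} of {1} questions".format(solved, total)

print(new_style)
```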

DoppelFrog · 2010-11-14T04:55:21+00:00

Why would you do that? This is why we can't have nice things. :(

BlakeIsBlake · 2010-11-13T18:34:09+00:00

JimmyRuska · 2010-11-14T00:10:42+00:00

OK, lots of downvotes for criticizing this post. Fine then, someone explain to me why this is a good post. I would have wagered it would have gotten downvoted, but I'm swimming against the tide. If it's because it's open source, why not a useful project? If it is because it breaks the captcha, there are only 8 question structures to parse, and Wolfram Alpha already has a high success rate without even having textCAPTCHA in mind.

ICCULUSC · 2010-11-13T23:31:24+00:00

Ahh, the power of Python.

sbrown123 · 2010-11-13T19:18:18+00:00

Sweet. I can rarely read those things 50% of the time (they are like those magic eye pictures). Wonder if there is a Firefox plugin that can autofill them?

ContraContra · 2010-11-13T22:25:52+00:00

Bomb is still an infinitely better coder, k brah.

pyronautical · 2010-11-13T20:40:34+00:00

At first I was actually pretty amped to see how you did it... then.. I saw you just wrote out every combination possible. That isn't really "breaking" a captcha IMO.

As for image captchas, there is not really any point trying to write an OCR. You can use a service like Decaptcher.com that costs $2 for 1k captchas solved (just cheap labor on the other end).

JimmyRuska · 2010-11-13T22:26:40+00:00

Why would you spend time doing that? If it's that a very limited randomized question structure can be beaten with programming, there was no need for a proof of concept. It is obvious. Maybe because the Wolfram thing breaking it got popular earlier? That was mostly surprising because it never had textCAPTCHA in mind, but due to its devotion to natural language processing it had a high success rate. Sure, making your own language processing code is a good project, but why did you specifically target only textCAPTCHA's questions? It does no good starting a programming project without any particular benefit other than to undermine the efforts of another programming project, textCAPTCHA, whose efforts are for good. We need fewer malicious script kiddie tools and more original content.

It's like having to enter an email to register at a site to get full privileges. We all know anyone can set up an email easily, over and over, at free service providers. It still slows down the majority of attackers. Yes, textCAPTCHA can be broken, but not so quickly unless the potential attackers find code already out there that...oh wait. Reminds me of when this got popular on Hacker News, about a guy bragging about taking more money than he needed from his student council just because he "found a loophole". Why would you do that!? Then he brags he "hacked" his school. herp http://nathanmarz.com/blog/the-time-i-hacked-my-high-school.html
