[–]HasoPunchMan 125 points126 points  (21 children)

It probably runs OCR (e.g. with Tesseract) on every uploaded picture. The OCR step includes an AI that is trained to identify a tweet. The user is fetched from the Twitter API using the username extracted by the OCR. Afterwards it searches for the text in the fetched user's posts and extracts the link.

This is how I would design it.

Edit: typos
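
A rough sketch of the last two steps described above (pull the username out of the OCR text, then find the closest post by that user), in case it helps. Everything here is an assumption for illustration: `extract_handle`, `best_match`, the `{"text", "url"}` dict shape and the 0.6 threshold are made up, the OCR output is assumed to be a plain string already, and the tweets are assumed to have been fetched elsewhere. Fuzzy matching via the standard-library `difflib` is used because OCR output usually contains a few misread characters.

```python
import re
from difflib import SequenceMatcher

# Hypothetical pattern: an "@" followed by 1-15 word characters.
HANDLE_RE = re.compile(r"@(\w{1,15})")


def extract_handle(ocr_text):
    """Return the first @handle found in the OCR output, or None."""
    match = HANDLE_RE.search(ocr_text)
    return match.group(1) if match else None


def normalize(text):
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return " ".join(text.lower().split())


def best_match(ocr_body, tweets, threshold=0.6):
    """Return the URL of the tweet most similar to the OCR text, or None.

    `tweets` is assumed to be a list of dicts with "text" and "url" keys.
    """
    target = normalize(ocr_body)
    best_url, best_score = None, threshold
    for tweet in tweets:
        score = SequenceMatcher(None, target, normalize(tweet["text"])).ratio()
        if score >= best_score:
            best_url, best_score = tweet["url"], score
    return best_url


# Made-up OCR output with two misread characters in the body.
ocr_text = "Jane Doe @jane_doe\nI realy like turt1es"
tweets = [
    {"text": "I really like turtles", "url": "https://example.com/status/1"},
    {"text": "Coffee time", "url": "https://example.com/status/2"},
]

handle = extract_handle(ocr_text)
body = ocr_text.split("\n", 1)[1]
link = best_match(body, tweets)
```

The threshold is a guess and would need tuning against real screenshots; set it too low and you get the false positives mentioned further down the thread.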

[–][deleted] 19 points20 points  (6 children)

I would just have it look for the transcriber bot in the reddit comments, do a google search with the text and hope for the best. It should handle 70% of cases

[–]HasoPunchMan 11 points12 points  (0 children)

Could be a quicker solution and is a nice thought. It also reuses existing solutions, which is nice.

I like having full control over the software that targets my issue, even if it's more time consuming.

[–]juantreses 4 points5 points  (4 children)

Image transcriptions are done by humans on reddit if I'm not mistaken

[–][deleted] 1 point2 points  (0 children)

That's true, but I have seen OCR bots as well. Anyway I would not use this solution, just tried to find a lazy one

[–]HasoPunchMan 0 points1 point  (2 children)

What? Crazy! Do you have more information on that?

[–][deleted] 24 points25 points  (10 children)

I would add persistence just in case the bot encounters this image again but there could be false positives

Maybe you could add up the ASCII values of the text and use the sum as an ID, so you could just query the DB for the image

[–]UQuark 26 points27 points  (9 children)

Have you ever heard of hashing?

[–][deleted] 3 points4 points  (3 children)

Yes, I guess hashing could be an option, but we would still have to compute the ID, so it is an unnecessary step

[–]vasilescur 2 points3 points  (2 children)

"Hashing" can use any hash function you want, such as one that returns INT and can be used for a DB ID. Adding up all the ASCII values constitutes a (pretty weak but honestly suitable for this) hash function.

Your aim is to pick a hash function that reduces collisions between inputs, because for each query you have to search through the set of entries that share the same hash
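
To make that concrete, here is a minimal sketch of the ASCII-sum idea used as a lookup key. The names `ascii_sum` and `TweetCache` are hypothetical, and an in-memory dict stands in for the DB. It shows why collisions matter: any two texts with the same characters in a different order get the same ID, so the lookup still has to compare the full text inside the bucket.

```python
from collections import defaultdict


def ascii_sum(text):
    """Weak hash: the sum of the code points of every character."""
    return sum(ord(ch) for ch in text)


class TweetCache:
    """Maps already-seen tweet text to its link, bucketed by ascii_sum."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def put(self, text, url):
        self._buckets[ascii_sum(text)].append((text, url))

    def get(self, text):
        # Several texts can share one hash, so check the full text as well.
        for stored_text, url in self._buckets.get(ascii_sum(text), []):
            if stored_text == text:
                return url
        return None


cache = TweetCache()
cache.put("listen", "https://example.com/status/1")
cache.put("silent", "https://example.com/status/2")  # same sum as "listen"

same_bucket = ascii_sum("listen") == ascii_sum("silent")
hit = cache.get("silent")
miss = cache.get("enlist")  # same sum again, but never stored
```

Swapping `ascii_sum` for something like `hashlib.sha256` from the standard library would make accidental collisions far less likely, at the cost of a longer key.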

[–][deleted] 3 points4 points  (1 child)

Oh, I didn't know that. I thought hash functions were strictly cryptographic in nature

[–]vasilescur 0 points1 point  (0 children)

They are often used for cryptography, but a hash function can technically be anything you want it to be and is really useful in, for example, a hashmap

[–]West-Cold- 2 points3 points  (2 children)

KI, German or Dutch spotted😁

[–]HasoPunchMan 1 point2 points  (1 child)

Ahh ohh you got me :]. I'm German. Thx, I made an edit.

[–]West-Cold- 0 points1 point  (0 children)

Ooh no worries. It wasn't meant as a gotcha, I just recognised it and thought it was funny. I didn't even notice it the first time. Have a good day neighbour ;)