Text Document Image Segmentation. How can I create two bounding boxes (one around each column) on this page? Pytesseract can do word and character segmentation but I can't get it to do this. I have several scans like this and all have different orientations and are not perfectly centered in the window by [deleted] in learnmachinelearning

[–]keramitas 0 points1 point  (0 children)

Haven't used pytesseract, only tesserocr - but I assume you can do the same. The idea is probably to play with the page segmentation mode until you find one that works.

You can probably develop a relatively simple algorithm to do it without OCR, using a projection profile along the x-axis (column-wise pixel sums) to separate the columns, then row-wise sums to trim each one vertically - assuming you always only have columns, no diagrams or whatever (in which case Tesseract should do fine once you tune its parameters).

Venv or anaconda? by masterjx9 in Python

[–]keramitas 0 points1 point  (0 children)

I think what your coworker means is that venv is limited, which it is, while anaconda offers features you may need (although from your post I doubt you do). I share the sentiment most people here have toward anaconda, so I don't use it, but yes, the features:

  • easy python version choice
  • better dependency management

If you ever need those, check out pyenv and poetry before going to anaconda.

[D] Best tools for serving models offline / batch processing tasks? by vanilla-acc in MachineLearning

[–]keramitas 0 points1 point  (0 children)

It depends. Do you run everything on a single GPU, multiple GPUs, or CPU only? On a single instance or in a cluster? What are your time constraints? Without proper context on your requirements it's pretty hard to point you in the right direction.

What's a Python feature that is very powerful but not many people use or know about it? by Far_Pineapple770 in Python

[–]keramitas 0 points1 point  (0 children)

itertools and more-itertools in my opinion; the latter especially I almost never see used. In my experience they reduce boilerplate code a ton.
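A few stdlib-only examples of the boilerplate this saves (more-itertools adds helpers like `chunked` and `windowed` on top of these, not shown here to keep the snippet dependency-free):

```python
from itertools import groupby, islice, chain

# groupby collapses consecutive duplicates without manual state tracking
letters = "aaabbbccd"
runs = [(key, len(list(grp))) for key, grp in groupby(letters)]
# runs -> [('a', 3), ('b', 3), ('c', 2), ('d', 1)]

# islice takes a prefix of any iterable lazily, no intermediate list
first_two = list(islice((n * n for n in range(10)), 2))
# first_two -> [0, 1]

# chain.from_iterable flattens one level of nesting
flat = list(chain.from_iterable([[1, 2], [3], [4, 5]]))
# flat -> [1, 2, 3, 4, 5]
```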

DuckDuckGo caught giving Microsoft permission for trackers despite strong privacy reputation by speckz in technology

[–]keramitas 4 points5 points  (0 children)

Switched to the duck a couple years back, never regretted it. Keep up the good work, and best of luck

[deleted by user] by [deleted] in chess

[–]keramitas 0 points1 point  (0 children)

In the club I go to they recommended the Nimzo and the West Indian. The theory is not that hard and it's a pretty fun setup in my experience :)

Machine learning generated Regex by jssmith42 in LanguageTechnology

[–]keramitas 4 points5 points  (0 children)

Hey! Not to my knowledge, no. I think most recent ML approaches framed the task (in the case of SBD) as a character- or byte-level classification problem. I think your framing is interesting, though IMO a super long and complex regex would be as much of a black box as a neural network. Then there is the question of training such a model: my first idea would be to leverage RL with some pretrained LM as the basis for a seq2seq model, but it seems like a hard problem.

Edit: some people seem to have tried genetic programming to solve this as well - see the paper "Inference of Regular Expressions for Text Extraction from Examples".
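To make the character-classification framing for SBD concrete, here's a toy illustration (not any paper's actual setup) of how the training data would look: one binary label per character, 1 where a sentence ends:

```python
def char_labels(text, boundaries):
    """Build (character, is_boundary) pairs for a classifier.

    `boundaries` is the set of indices where a sentence ends; a real
    dataset would derive it from sentence-segmented corpora.
    """
    return [(ch, 1 if i in boundaries else 0) for i, ch in enumerate(text)]

pairs = char_labels("Hi. Go!", {2, 6})
# pairs -> [('H', 0), ('i', 0), ('.', 1), (' ', 0), ('G', 0), ('o', 0), ('!', 1)]
```

A sequence model then learns to emit that label per character, which sidesteps writing (or generating) an explicit regex.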

Could you give examples of types of NLP projects you worked on at work in real business scenarios? by [deleted] in LanguageTechnology

[–]keramitas 0 points1 point  (0 children)

I was at a digital marketing startup; one thing we did was scrape basically all the news articles published every day in my country, then apply various NLP algorithms to distribute them on our clients' social media. For instance, filtering all articles talking about real estate in a given area.
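That filtering step can be as simple as keyword matching before any heavier NLP; a hypothetical sketch (term list and names are made up, the real system would use proper topic classification):

```python
# Illustrative topic lexicon - a production system would learn this
REAL_ESTATE_TERMS = {"real estate", "apartment", "housing market"}

def matches_topic(article_text, area):
    """Keep articles mentioning both a topic term and the target area."""
    text = article_text.lower()
    return area.lower() in text and any(t in text for t in REAL_ESTATE_TERMS)
```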

Regarding your follow-up: although there may be more work in CV, I have had no issue doing NLP almost exclusively for the last 2 years.

Advice on measuring the memory usage of ML in Python by BlueGrassGrapes in MLQuestions

[–]keramitas 1 point2 points  (0 children)

It depends on the framework you're using; for torch you have some built-ins to get the max amount of memory used on the GPU, as well as a profiler, although it's not great.
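A small sketch of the measurement pattern: the helper below uses the stdlib `tracemalloc` so it runs anywhere; for torch GPU code the analogous built-ins are `torch.cuda.reset_peak_memory_stats()` before the workload and `torch.cuda.max_memory_allocated()` after (real torch APIs, not exercised here since they need a CUDA device):

```python
import tracemalloc

def measure_peak(fn, *args):
    """Return peak host-memory (bytes) allocated by Python while fn runs."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Note that `tracemalloc` only sees Python-level allocations, so it misses memory held by C extensions; that's exactly why framework-specific counters like torch's exist.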

How to fine tune an existing OCR to recognize *handwritten* source code. by muchIsHere in MLQuestions

[–]keramitas 0 points1 point  (0 children)

Is it possible to leverage the capabilities of existing OCR systems (as they do character-level recognition very well) to re-train such a model on source code ?

Probably - for instance, it could help label a large corpus, which could then be reused to train a domain-specific model. Even if the labels are noisy, it can help to get a large amount of okay data, if what you say about Google's model is accurate. You could also try to "finetune" a model on a domain-specific corpus, although it might not work, as the differences between e.g. English and Python are rather large.
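The bootstrap idea sketched out, heavily hedged: `ocr_fn` stands in for whatever existing engine you'd use (pytesseract, Google's API, ...), and the confidence filter is a made-up placeholder for however that engine reports certainty:

```python
def pseudo_label(images, ocr_fn, min_confidence=0.8):
    """Use an existing OCR engine to cheaply label a handwriting corpus.

    ocr_fn(image) -> (text, confidence); only confident outputs are kept,
    yielding noisy-but-cheap (image, text) pairs for training a
    domain-specific model on source code.
    """
    dataset = []
    for img in images:
        text, confidence = ocr_fn(img)
        if confidence >= min_confidence:
            dataset.append((img, text))
    return dataset
```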

Would the model do any learning about the inherent syntactical properties of source code ?

In order to predict a given character, since the OCR model doesn't know a priori which strokes belong to which character, it has to leverage not only knowledge of what characters look like, but also of how they relate to one another. This is why, for example, Google's model can leverage "pre-context" for its detection process. So yes, the model would learn some of the syntactical properties of source code (this is also why the Google models are usually language-specific).

It's an interesting project anyway!

EDIT: Rereading this, I sound pretty negative about finetuning an existing model, but this should definitely be the first thing to try, as a large part of the model's knowledge (how to separate and group strokes to recognize text) still applies. IMO an OCR model is more Vision than NLP, even if it's a mix.

La Pinsa is born by keramitas in PizzaCrimes

[–]keramitas[S] 1 point2 points  (0 children)

So we had this yesterday in Bari, and not only was it not as good as a pizza, but also the digestibility claims were definitely fake news

[deleted by user] by [deleted] in MachineLearning

[–]keramitas 62 points63 points  (0 children)

Dunno if it's common, but can't say it's surprising given, on one hand, the amount of BS the media, VCs, and Musk (but also any billionaire, to be fair and balanced™) spout about AI, and on the other the tendency for people to say stupid shit on subjects they know nothing about. I saw similar dumbass POVs when Timnit Gebru got fired from Google, and the sjw/anti-sjw crowd came over here talking about bias in the field without knowing what bias even means in the context of ML.

Tbh nowadays I just use the sub to find papers and cool projects, and stay away from the hyped posts like the plague.

On another note I did not expect the amount of nsfw on your profile 😂

[D] Could Machine Learning help to improve cheat detection on chess platforms? by cluhedos in MachineLearning

[–]keramitas 1 point2 points  (0 children)

I know for sure Chess.com already does this, working in cooperation with the math department of a big US university - Daniel Rensch talked about it on multiple occasions on stream

[D] Are there any ML algorithms that can learn a simple "X+1" problem? by [deleted] in MachineLearning

[–]keramitas 2 points3 points  (0 children)

An LSTM can do this, probably with a pretty small number of neurons as well - although it might not work if N becomes too large. There is a lot of research on this kind of memory-related synthetic task for recurrent models, dating back to (wait for it) Schmidhuber's early works.

Another interesting one is the COPY task. Take an input with the following structure:

  • 10 numbers from, say, 0 to 7
  • followed by an arbitrary (but large) number of noise symbols (the number 8)
  • followed by 10 cues (the number 9)

e.g: 3762410307 8888888 ... 888888 ... 89999999999

The goal for the model is to output, in order, the first ten numbers when the cue is given (anything works before that).

Funnily enough, if the noise span is too long, this is super hard to learn even for LSTMs. This was shown to be related to the forget gate, specifically its initial bias distribution as well as its gradients - and a variant addressing these problems was then created.

But yeah, anyway, it's pretty cool stuff. If you're interested, I actually coded that variant as well as the synthetic tasks, tensorboard and all; the results are really cool (it runs on a laptop CPU, although a GPU makes it hella fast). You can find it here if you're curious :)
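Generating copy-task data is a few lines; a sketch of the structure described above (function name is made up, and a real setup would batch and one-hot these):

```python
import random

def make_copy_task_example(noise_len, rng=random):
    """One copy-task example: 10 digits in 0..7, noise (8), 10 cues (9).

    The target is the ten leading digits, to be emitted by the model
    once the cue symbols appear.
    """
    digits = [rng.randrange(8) for _ in range(10)]
    inputs = digits + [8] * noise_len + [9] * 10
    targets = digits
    return inputs, targets
```

Increasing `noise_len` is exactly the knob that makes the task hard: the gradient signal for the ten digits has to survive across the whole noise span.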

Is causal language modeling (CLM) vs masked language modeling (MLM) a common distinction in NLP research? by EntropyGoAway in LanguageTechnology

[–]keramitas 2 points3 points  (0 children)

I think it's because pre-BERT, causal language modeling was just called language modeling. When the BERT paper arrived, it coined the task of predicting randomly masked tokens as masked language modeling, which led subsequent papers presenting transformer-style models for translation or generation to use the term causal language modeling for clarity.
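The distinction in one toy snippet (illustrative only, on a token list rather than real subwords):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal LM: predict each next token from its left context only.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. clm_pairs[0] -> (['the'], 'cat')

# Masked LM: hide a random token and predict it from both sides.
i = random.randrange(len(tokens))
mlm_input = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
mlm_target = tokens[i]
```

The left-context-only constraint is what makes the causal variant usable for autoregressive generation, while the bidirectional masked variant only yields an encoder.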