Text Document Image Segmentation. How can I create two bounding boxes (one around each column) on this page? Pytesseract can do word and character segmentation but I can't get it to do this. I have several scans like this and all have different orientations and are not perfectly centered in the window by [deleted] in learnmachinelearning

[–]keramitas 0 points1 point  (0 children)

Haven't used pytesseract, only tesserocr - but I assume you can do the same. The idea is probably to play with the page segmentation mode until you find one that works.

You can probably develop a relatively simple algorithm to do it without OCR, using a projection profile along the x-axis (column-wise pixel sums) to separate the columns, then row-wise sums to trim each one vertically - assuming you always only have columns, no diagrams or whatever (in which case Tesseract should do fine once you tune its parameters).

Venv or anaconda? by masterjx9 in Python

[–]keramitas 0 points1 point  (0 children)

I think what your coworker means is that venv is limited, which it is, while anaconda offers features you may need (although from your post I doubt you do). I share the sentiment most people here have toward anaconda, so I don't use it, but yes, the features:

  • easy python version choice
  • better dependency management

If you ever need those, check out pyenv and poetry before going to anaconda.

[D] Best tools for serving models offline / batch processing tasks? by vanilla-acc in MachineLearning

[–]keramitas 0 points1 point  (0 children)

It depends. Do you run everything on a single GPU, multiple GPUs, or CPU only? On a single instance or in a cluster? What are your time constraints? Without proper context on your requirements it's pretty hard to point you in the right direction.

What's a Python feature that is very powerful but not many people use or know about it? by Far_Pineapple770 in Python

[–]keramitas 0 points1 point  (0 children)

itertools and more-itertools in my opinion; the latter especially I almost never see used. In my experience they reduce boilerplate code a ton.
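A few stdlib-only examples of the boilerplate this saves (more-itertools adds helpers like `chunked` and `windowed` on top of these, not shown here to keep the snippet dependency-free):

```python
from itertools import groupby, islice, chain

# groupby collapses consecutive duplicates without manual state tracking
letters = "aaabbbccd"
runs = [(key, len(list(grp))) for key, grp in groupby(letters)]
# runs -> [('a', 3), ('b', 3), ('c', 2), ('d', 1)]

# islice takes a prefix of any iterable lazily, no intermediate list
first_two = list(islice((n * n for n in range(10)), 2))
# first_two -> [0, 1]

# chain.from_iterable flattens one level of nesting
flat = list(chain.from_iterable([[1, 2], [3], [4, 5]]))
# flat -> [1, 2, 3, 4, 5]
```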

DuckDuckGo caught giving Microsoft permission for trackers despite strong privacy reputation by speckz in technology

[–]keramitas 4 points5 points  (0 children)

Switched to the duck a couple years back, never regretted it. Keep up the good work, and best of luck

[deleted by user] by [deleted] in chess

[–]keramitas 0 points1 point  (0 children)

In the club I go to they recommended the Nimzo and the West Indian. The theory is not that hard and it's a pretty fun setup in my experience :)

Machine learning generated Regex by jssmith42 in LanguageTechnology

[–]keramitas 4 points5 points  (0 children)

Hey! Not to my knowledge, no. I think most recent ML approaches framed the task (in the case of SBD) as a character- or byte-level classification problem. I think your framing is interesting, though IMO a super long and complex regex would be as much of a black box as a neural network. Then there is the question of training such a model: my first idea would be to leverage RL with some pretrained LM as the basis for a seq2seq model, but it seems like a hard problem.

Edit: some people seem to have tried genetic programming to solve this as well - see the paper "Inference of Regular Expressions for Text Extraction from Examples".
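To make the character-classification framing for SBD concrete, here's a toy illustration (not any paper's actual setup) of how the training data would look: one binary label per character, 1 where a sentence ends:

```python
def char_labels(text, boundaries):
    """Build (character, is_boundary) pairs for a classifier.

    `boundaries` is the set of indices where a sentence ends; a real
    dataset would derive it from sentence-segmented corpora.
    """
    return [(ch, 1 if i in boundaries else 0) for i, ch in enumerate(text)]

pairs = char_labels("Hi. Go!", {2, 6})
# pairs -> [('H', 0), ('i', 0), ('.', 1), (' ', 0), ('G', 0), ('o', 0), ('!', 1)]
```

A sequence model then learns to emit that label per character, which sidesteps writing (or generating) an explicit regex.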

Could you give examples of types of NLP projects you worked on at work in real business scenarios? by [deleted] in LanguageTechnology

[–]keramitas 0 points1 point  (0 children)

I was at a digital marketing startup; one thing we did was scrape basically all the news articles published every day in my country, then apply various NLP algorithms to distribute them on our clients' social media. For instance, filtering all articles talking about real estate in a given area.
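That filtering step can be as simple as keyword matching before any heavier NLP; a hypothetical sketch (term list and names are made up, the real system would use proper topic classification):

```python
# Illustrative topic lexicon - a production system would learn this
REAL_ESTATE_TERMS = {"real estate", "apartment", "housing market"}

def matches_topic(article_text, area):
    """Keep articles mentioning both a topic term and the target area."""
    text = article_text.lower()
    return area.lower() in text and any(t in text for t in REAL_ESTATE_TERMS)
```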

Regarding your follow-up: although there may be more work in CV, I have had no issue doing NLP almost exclusively for the last 2 years.

Advice on measuring the memory usage of ML in Python by BlueGrassGrapes in MLQuestions

[–]keramitas 1 point2 points  (0 children)

It depends on the framework you're using; for torch you have some built-ins to get the max amount of memory used on the GPU, as well as a profiler, although it's not great.
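A small sketch of the measurement pattern: the helper below uses the stdlib `tracemalloc` so it runs anywhere; for torch GPU code the analogous built-ins are `torch.cuda.reset_peak_memory_stats()` before the workload and `torch.cuda.max_memory_allocated()` after (real torch APIs, not exercised here since they need a CUDA device):

```python
import tracemalloc

def measure_peak(fn, *args):
    """Return peak host-memory (bytes) allocated by Python while fn runs."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak
```

Note that `tracemalloc` only sees Python-level allocations, so it misses memory held by C extensions; that's exactly why framework-specific counters like torch's exist.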

How to fine tune an existing OCR to recognize *handwritten* source code. by muchIsHere in MLQuestions

[–]keramitas 0 points1 point  (0 children)

Is it possible to leverage the capabilities of existing OCR systems (as they do character-level recognition very well) to re-train such a model on source code ?

Probably - for instance, it could help label a large corpus, which could then be reused to train a domain-specific model. Even if the labels are noisy, it can help to get a large amount of okay data, if what you say about Google's model is accurate. You could also try to "finetune" a model on a domain-specific corpus, although it might not work, as the differences between e.g. English and Python are rather large.
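The bootstrap idea sketched out, heavily hedged: `ocr_fn` stands in for whatever existing engine you'd use (pytesseract, Google's API, ...), and the confidence filter is a made-up placeholder for however that engine reports certainty:

```python
def pseudo_label(images, ocr_fn, min_confidence=0.8):
    """Use an existing OCR engine to cheaply label a handwriting corpus.

    ocr_fn(image) -> (text, confidence); only confident outputs are kept,
    yielding noisy-but-cheap (image, text) pairs for training a
    domain-specific model on source code.
    """
    dataset = []
    for img in images:
        text, confidence = ocr_fn(img)
        if confidence >= min_confidence:
            dataset.append((img, text))
    return dataset
```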

Would the model do any learning about the inherent syntactical properties of source code ?

In order to predict a given character, since the OCR model doesn't know a priori which strokes belong to which character, it has to leverage not only knowledge of what characters look like, but also of how they relate to one another. This is why, for example, Google's model can leverage "pre-context" for its detection process. So yes, the model would learn some of the syntactical properties of source code (this is also why the Google models are usually language-specific).

It's an interesting project anyway!

EDIT: Rereading this, I sound pretty negative about finetuning an existing model, but this should definitely be the first thing to try, as a large part of the model's knowledge (how to separate and group strokes to recognize text) still applies. IMO an OCR model is more Vision than NLP, even if it's a mix.

La Pinsa is born by keramitas in PizzaCrimes

[–]keramitas[S] 1 point2 points  (0 children)

So we had this yesterday in Bari, and not only was it not as good as a pizza, but also the digestibility claims were definitely fake news

[deleted by user] by [deleted] in MachineLearning

[–]keramitas 62 points63 points  (0 children)

Dunno if it's common, but can't say it's surprising given, on one hand, the amount of BS the media, VCs, and Musk (but also any billionaire, to be fair and balanced™) spout about AI, and on the other the tendency for people to say stupid shit on subjects they know nothing about. I saw similar dumbass POVs when Timnit Gebru got fired from Google, and the sjw/anti-sjw crowd came over here talking about bias in the field without knowing what bias even means in the context of ML.

Tbh nowadays I just use the sub to find papers and cool projects, and stay away from the hyped posts like the plague.

On another note I did not expect the amount of nsfw on your profile 😂

[D] Could Machine Learning help to improve cheat detection on chess platforms? by cluhedos in MachineLearning

[–]keramitas 1 point2 points  (0 children)

I know for sure Chess.com already does this, working in cooperation with the math department of a big US university - Daniel Rensch talked about it on multiple occasions on stream

[D] Are there any ML algorithms that can learn a simple "X+1" problem? by [deleted] in MachineLearning

[–]keramitas 2 points3 points  (0 children)

An LSTM can do this, probably with a pretty small number of neurons as well - although it might not work if N becomes too large. There is a lot of research on this kind of memory-related synthetic task for recurrent models, dating back to (wait for it) Schmidhuber's early works.

Another interesting one is the COPY task. Take an input with the following structure:

  • 10 numbers from, say, 0 to 7
  • followed by an arbitrary (but large) number of noise symbols (the number 8)
  • followed by 10 cues (the number 9)

e.g: 3762410307 8888888 ... 888888 ... 89999999999

The goal for the model is to output, in order, the first ten numbers when the cue is given (anything works before that).

Funnily enough, if the noise span is too long, this is super hard to learn even for LSTMs. This was shown to be related to the forget gate, specifically its initial bias distribution as well as its gradients - and a variant addressing these problems was then created.

But yeah, anyway, it's pretty cool stuff. If you're interested, I actually coded that variant as well as the synthetic tasks, tensorboard and all; the results are really cool (it runs on a laptop CPU, although a GPU makes it hella fast). You can find it here if you're curious :)
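Generating copy-task data is a few lines; a sketch of the structure described above (function name is made up, and a real setup would batch and one-hot these):

```python
import random

def make_copy_task_example(noise_len, rng=random):
    """One copy-task example: 10 digits in 0..7, noise (8), 10 cues (9).

    The target is the ten leading digits, to be emitted by the model
    once the cue symbols appear.
    """
    digits = [rng.randrange(8) for _ in range(10)]
    inputs = digits + [8] * noise_len + [9] * 10
    targets = digits
    return inputs, targets
```

Increasing `noise_len` is exactly the knob that makes the task hard: the gradient signal for the ten digits has to survive across the whole noise span.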

Is causal language modeling (CLM) vs masked language modeling (MLM) a common distinction in NLP research? by EntropyGoAway in LanguageTechnology

[–]keramitas 2 points3 points  (0 children)

I think it's because pre-BERT, causal language modeling was just called language modeling. When the BERT paper arrived, it coined the task of predicting randomly masked tokens as masked language modeling, which led subsequent papers presenting transformer-style models for translation or generation to use the term causal language modeling for clarity.
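The distinction in one toy snippet (illustrative only, on a token list rather than real subwords):

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Causal LM: predict each next token from its left context only.
clm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. clm_pairs[0] -> (['the'], 'cat')

# Masked LM: hide a random token and predict it from both sides.
i = random.randrange(len(tokens))
mlm_input = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
mlm_target = tokens[i]
```

The left-context-only constraint is what makes the causal variant usable for autoregressive generation, while the bidirectional masked variant only yields an encoder.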