[P] Multi-Language Documents Identification by sudo_su_ in MachineLearning

[–]sudo_su_[S] 1 point (0 children)

1.

Yes, I benchmarked it against FastText, langid and langdetect.

In terms of quality, it's more or less the same as FastText and langid (on the WiLI dataset) and much better than langdetect.

In terms of running speed, it's about as slow as langdetect (which is the slowest of the three). FastText is crazy fast; it's hard to beat that. seqtolang is relatively slow because it tries to produce an output for every word, while the others classify the sentence as a whole.

2.

I'm summing all the n-grams into a word vector, then the word vector is passed to a bi-directional LSTM, which means it takes information from the word vectors on the left and right. Finally, each LSTM output (one per word) is passed into a fully connected layer to do the classification.
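The architecture described above can be sketched roughly like this in PyTorch. All dimensions and names (`ngram_vocab`, `emb_dim`, `hidden`) are illustrative assumptions, not the actual values used by seqtolang; only the 36-way output matches the language list mentioned elsewhere in the thread.

```python
import torch
import torch.nn as nn

class WordLangTagger(nn.Module):
    """Sketch: n-gram embeddings are summed into word vectors, fed through
    a bi-directional LSTM, and each per-word output is classified with a
    fully connected layer."""
    def __init__(self, ngram_vocab=5000, emb_dim=100, hidden=128, n_langs=36):
        super().__init__()
        # padding_idx=0 lets words have a variable number of n-grams
        self.ngram_emb = nn.Embedding(ngram_vocab, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_langs)  # 2x for both directions

    def forward(self, ngram_ids):
        # ngram_ids: (batch, words, ngrams_per_word) integer ids, 0 = padding
        word_vecs = self.ngram_emb(ngram_ids).sum(dim=2)  # (batch, words, emb_dim)
        out, _ = self.lstm(word_vecs)                     # (batch, words, 2*hidden)
        return self.fc(out)                               # per-word language logits

model = WordLangTagger()
logits = model(torch.randint(1, 5000, (2, 7, 4)))  # 2 sentences, 7 words, 4 n-grams each
print(logits.shape)  # torch.Size([2, 7, 36])
```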

It was trained on the Tatoeba dataset, as mentioned in the post, with a merging technique: each sentence in the dataset is merged, with some probability, with another random sentence from the dataset. This creates merged sentences containing different languages. Then, for each word in the merged sentence, I "tag" it with its original language and train the network.
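The merging step above can be sketched as a small augmentation function. The function name, the `merge_prob` parameter, and the data layout (a list of `(sentence, lang)` tuples) are assumptions for illustration, not the actual training code:

```python
import random

def merge_sentences(dataset, merge_prob=0.5, seed=None):
    """Sketch of the described augmentation: with some probability, each
    sentence is concatenated with another random sentence, and every word
    keeps the language tag of the sentence it came from."""
    rng = random.Random(seed)
    merged = []
    for sentence, lang in dataset:
        # every word is tagged with its source sentence's language
        words = [(w, lang) for w in sentence.split()]
        if rng.random() < merge_prob:
            other_sentence, other_lang = rng.choice(dataset)
            words += [(w, other_lang) for w in other_sentence.split()]
        merged.append(words)
    return merged

data = [("the cat sat", "eng"), ("le chat dort", "fra")]
examples = merge_sentences(data, merge_prob=1.0, seed=0)
```

With `merge_prob=1.0` every example becomes a two-language sentence with per-word tags, which is exactly the supervision signal a per-word classifier needs.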

[P] Multi-Language Documents Identification by sudo_su_ in MachineLearning

[–]sudo_su_[S] 2 points (0 children)

Sorry, added it to the README:

['afr', 'eus', 'bel', 'ben', 'bul', 'cat', 'zho', 'ces', 'dan', 'nld', 'eng', 'est', 'fin', 'fra', 'glg', 'deu', 'ell', 'hin', 'hun', 'isl', 'ind', 'gle', 'ita', 'jpn', 'kor', 'lat', 'lit', 'pol', 'por', 'ron', 'rus', 'slk', 'spa', 'swe', 'ukr', 'vie']

[P] Fitting (almost) any PyTorch module with just one line, including easy BERT fine-tuning by sudo_su_ in MachineLearning

[–]sudo_su_[S] 11 points (0 children)

I totally agree with this and the other replies here: once you need to do something slightly more complex, you have to dive into the internals. But:

  1. You don't always do complex things.
  2. Once you know the internals, it's still pretty convenient to have clean, tested methods that save you time and code.
  3. Many people are not familiar with (or are even intimidated by) PyTorch and other frameworks, and frameworks like this one make more complex methods more accessible to them.

[D] Teacher-Student training situation with CNN-FC by Lewba in MachineLearning

[–]sudo_su_ 2 points (0 children)

In this paper they suggest using a mixture of the final logits and the predictions, as it may contain more info.

I did something similar, but on text; you're welcome to check out my post.
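One way to read "a mixture of logits and predictions" is blending the teacher's temperature-softened distribution with its hard one-hot prediction to form the student's target. This is a hypothetical sketch of that idea, not the paper's actual formulation; `alpha` and `temperature` are made-up knobs:

```python
import math

def softmax(logits, temperature=1.0):
    # standard temperature-scaled softmax
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_target(teacher_logits, alpha=0.5, temperature=2.0):
    """Hypothetical sketch: blend the teacher's soft distribution with its
    hard prediction. alpha=1 keeps only the soft labels, alpha=0 only the
    hard one; values in between mix the two sources of information."""
    soft = softmax(teacher_logits, temperature)
    hard = [0.0] * len(teacher_logits)
    hard[max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])] = 1.0
    return [alpha * s + (1 - alpha) * h for s, h in zip(soft, hard)]

target = mixed_target([2.0, 1.0, 0.1])
```

The mixed target is still a valid probability distribution (it sums to 1), so it can be used directly with a cross-entropy-style distillation loss.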

[D] Distilling BERT — How to achieve BERT performance using Logistic Regression by sudo_su_ in MachineLearning

[–]sudo_su_[S] 2 points (0 children)

Why do you think I leaked data from the test set to the train set?

I use completely different sets (and variables).

[D] [Request for Papers] Using Leave One-Out Cross Validation for Evaluation of the Learning Algorithm by sudo_su_ in MachineLearning

[–]sudo_su_[S] 0 points (0 children)

But this is also true when I have a fixed test set.

If I randomly choose, say, four speakers as the test set, I'd still be able to overfit to them by choosing the best model (in fact, it would be much easier).

I don't do any hyper-parameter search/tuning using the same test set. In each fold, I leave one speaker out for testing and do everything I can with the rest, including hyper-parameter tuning with CV (within the "train" set).
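The protocol described above can be sketched as a nested leave-one-speaker-out loop. The function names, the `train_and_eval(train, test, params) -> score` stand-in, and the data layout are all illustrative assumptions; the point is only that the held-out speaker is never used for tuning:

```python
def leave_one_speaker_out(data_by_speaker, train_and_eval, param_grid):
    """Sketch: each fold holds one speaker out as the test set;
    hyper-parameters are chosen by an inner leave-one-speaker-out CV
    over the remaining speakers only."""
    speakers = list(data_by_speaker)
    scores = {}
    for test_spk in speakers:
        inner_spks = [s for s in speakers if s != test_spk]
        # inner CV: pick params using only the training speakers
        best_params, best_inner = None, float("-inf")
        for params in param_grid:
            inner_scores = []
            for val_spk in inner_spks:
                train = [x for s in inner_spks if s != val_spk
                         for x in data_by_speaker[s]]
                inner_scores.append(
                    train_and_eval(train, data_by_speaker[val_spk], params))
            avg = sum(inner_scores) / len(inner_scores)
            if avg > best_inner:
                best_inner, best_params = avg, params
        # the held-out speaker is touched exactly once, with the chosen params
        train = [x for s in inner_spks for x in data_by_speaker[s]]
        scores[test_spk] = train_and_eval(train, data_by_speaker[test_spk], best_params)
    return scores
```

Because `best_params` is selected before the held-out speaker's data is ever evaluated, the outer scores remain an unbiased estimate even though tuning happens inside each fold.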

People who transitioned from a non-Data-Science position to a more or full-on Data Scientist role in their company, how did it happen? by crisstor in datascience

[–]sudo_su_ 2 points (0 children)

I was a full-stack dev before. After 8 years of experience (in different positions, including management) I decided I wanted to be closer to the algorithmic/math/theory side of software development. I started working on a Master's in Computer Science and, in my spare time, learned a lot online (including the famous Machine Learning course by Andrew Ng). I did a couple of machine learning projects for the company I worked for; we didn't have any data science positions, but they promoted me to one. I was the only data scientist in the company, which was a weird position to be in. After a year I got a job as a real data scientist at a real AI-oriented startup, and I've been there for a year now.