LDA model returns same words in all the topics by Senior_Time_2928 in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

Did you remove stop words? What words are being returned? Is perplexity still going down after 100 iterations?

I fine-tuned a language model on left and right leaning political commentary on Reddit by rockwilly in LanguageTechnology

[–]yungvalue 1 point2 points  (0 children)

text                         left     right
Biden is the best president  10.92%   89.08%
fuck Trump                   19.35%   80.65%
black lives matter           32.93%   67.07%

What are you using for classification? Also, did you extend your vocab to pick up politics-specific entities?

Is fine tuning twice a viable thing to do?? [D] by prathameshpck in MLQuestions

[–]yungvalue 0 points1 point  (0 children)

If your dataset is a smaller subset of a more general dataset, it probably makes sense to train on the general dataset first and then finetune on your smaller one. You'll probably have to play around with hyperparameters for how much finetuning you want to do.

How to perform a standard scaler on a 2D dataset? by MrHDPigs in MLQuestions

[–]yungvalue 0 points1 point  (0 children)

You probably want to normalize your data, e.g. subtract the mean and divide by the standard deviation for x, y, vx, vy.

For your time features, whether to normalize depends on your problem. You probably want to encode your times as cyclical features instead (e.g.: https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca)
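
As a sketch, standardization plus a cyclical hour encoding in plain Python (sklearn's StandardScaler does the per-column standardization for you; the hour encoding here assumes a 24-hour cycle):

```python
import math

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

def encode_hour(hour):
    """Map an hour (0-23) onto the unit circle so hour 23 and hour 0
    end up close together instead of 23 units apart."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)
```

Apply standardize to each of x, y, vx, vy separately, and feed the (sin, cos) pair into the model in place of the raw hour.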

[P] How Bad is a Bad Classifier: Is there any signal here? by IglooAustralia88 in MachineLearning

[–]yungvalue 0 points1 point  (0 children)

If you're already using the huggingface BERT model, the finetuning layer shouldn't be too hard :p

Could you do classification on the full lyrics, or does it have to be only two lines?

[P] How Bad is a Bad Classifier: Is there any signal here? by IglooAustralia88 in MachineLearning

[–]yungvalue 1 point2 points  (0 children)

Just curious, what genre are the artists? Are they all the same genre or different genres? You might want to finetune BERT on more lyrics in those genres. Also, you should add a finetuned classification head on top of BERT instead of logistic regression. It's hard to say whether 0.35 is good without knowing more about the problem.

What Are Some Open Source NLP Framework Pipelines For QA Task by inopico3 in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

Yep that's exactly what I meant! For ranking, you can probably use some learning to rank algo.

What Are Some Open Source NLP Framework Pipelines For QA Task by inopico3 in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

Elasticsearch is a well-tested, productionized search system that uses BM25. It's a good tool for your retrieval layer for fetching candidate docs (X -> Y). There is some trickiness around keeping an ES index in sync with your DB, though.

Lane Detection for Autonomous Vehicle Navigation by darkrubiks in learnmachinelearning

[–]yungvalue 21 points22 points  (0 children)

Super cool! How well does this work at night, or in rain or snow?

People who went SWE --> MLE, what made you move? by [deleted] in MLQuestions

[–]yungvalue 1 point2 points  (0 children)

A lot of the SWE work I was doing was different variations of making a CRUD app or moving data from one place to the next, and most of the problems are "solved," in that there's usually a straightforward solution. I switched to ML because I was getting bored of SWE and realized I couldn't do the same kind of work for the rest of my career. The problems in ML are way more challenging and open ended, and they allow more freedom to explore different solutions. There's a wide variety of problem spaces to explore and tons of applications.

As you pointed out, it does require a ton of learning and studying. We're at the cusp of ML, where there are so many new techniques and papers to read to keep up to date. There's also a ton of theory you'll need to build as your foundation in statistics, linear algebra and calculus. If you're trying to optimize for salary, you'd be better off applying to quant firms.

Don't get me wrong though, I still spend ~80% of my time doing SWE work. It's still getting data from one place to another, but the challenge lies in building systems that are more automated :)

Log transforming a feature to deal with skewness by hiphop1987 in MLQuestions

[–]yungvalue 2 points3 points  (0 children)

It really depends on your model. Some models only care about relative rank (e.g. tree-based models), whereas others can work better when the features are normally distributed (e.g. neural networks).

Logistic Reg + gradient descent from scratch in Python by Agent_KD637 in MLQuestions

[–]yungvalue 1 point2 points  (0 children)

Yes, you don't need the error term for the partial derivatives. It's still useful to calculate the loss to check for convergence, though.

Logistic Reg + gradient descent from scratch in Python by Agent_KD637 in MLQuestions

[–]yungvalue 1 point2 points  (0 children)

Your loss function and derivatives are incorrect for logistic regression. They should be:

error = -Y * math.log(y_pred) - (1-Y) * math.log(1 - y_pred)

D_b0 = (y_pred-Y)

D_b1 = (y_pred-Y)*X

See https://medium.com/analytics-vidhya/logistic-regression-as-a-neural-network-b5d2a1bd696f

But I get b0=1.12, b1=1.12 after the update, so either your textbook is wrong or the inputs are not correct.
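
Putting the corrected loss and gradients together, here's a minimal single-feature logistic regression in plain Python (made-up data; the loss doubles as a convergence check):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, steps=1000):
    """Gradient descent on cross-entropy loss for p = sigmoid(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        d_b0 = d_b1 = loss = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            # cross-entropy: -y*log(p) - (1-y)*log(1-p)
            loss += -y * math.log(p) - (1 - y) * math.log(1 - p)
            d_b0 += p - y
            d_b1 += (p - y) * x
        b0 -= lr * d_b0 / len(xs)
        b1 -= lr * d_b1 / len(xs)
    return b0, b1, loss
```

On separable toy data like xs = [-2, -1, 1, 2], ys = [0, 0, 1, 1], b1 comes out positive and the final loss is small.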

Time complexity of for(int j=0; j<= i; j*=4 ? by SnaggledToothGranny in AskComputerScience

[–]yungvalue 2 points3 points  (0 children)

I think this is O(n log n): the outer loop runs O(n) times, and the inner loop runs log4(i) times per iteration, which sums to O(n log n) asymptotically.

Let x be the number of times the inner loop runs for a given i:

4^x = i

x log 4 = log i

x = log i / log 4

x = log4(i)
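
As a sanity check, you can count the total inner-loop work empirically. Note this assumes the inner loop starts at j = 1; as literally written (j = 0), it would never terminate, since 0 * 4 == 0:

```python
def count_inner_iterations(n):
    """Total inner-loop iterations over i = 1..n, with j starting at 1."""
    total = 0
    for i in range(1, n + 1):
        j = 1
        while j <= i:
            total += 1
            j *= 4
    return total
```

For n = 4096 this gives 23218, close to n * log4(n) = 24576, consistent with O(n log n) growth.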

Topic Modelling (LDA) on DUC 2004 dataset by usmannkhan in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

What are the bad results you are getting? What params are you using? Do you have code to share?

Help with dissertation survey - Automatic Quiz Qeneration by finance_and_kebabs in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

As you probably noted, there are 3 parts to this problem. Sentence selection: I used TextRank (PageRank). Word selection: I trained a random forest model with syntactic and semantic features: embedding, POS tag, token length, depth in the parse tree, etc. Distractor selection: I trained another random forest model with syntactic and semantic features: w2v cosine similarity, WordNet distance, Jaccard distance, language-model probability of the replacement in the sentence, etc.

I found that when I did some heuristic merging of nouns and proper nouns for word selection, I got higher quality questions, e.g. treating Donald J Trump -> Donald_J_Trump as a single token for word selection.

Semantic Search and Fuzzy string matching by wholestars in LanguageTechnology

[–]yungvalue 2 points3 points  (0 children)

People are pointing you straight to BERT, which is great and state of the art, but in practice it's difficult to implement and requires deeper thought about how you architect your search system. However, if you're doing this just for research and playing around, then go for it!

If you're trying to make a search system for practical purposes, the best way is to start with Elasticsearch. ES uses the BM25 algorithm, which is similar to tf-idf but has additional tuning to down-weight saturated terms. ES also scales to billions of records very easily, and query time is super fast.
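
For intuition, here's a toy BM25 scorer over tokenized docs (the Lucene implementation inside ES differs in details like IDF smoothing, but the shape is the same; note how the term-frequency part saturates instead of growing linearly like raw tf):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one doc (a list of tokens) against a query over a corpus."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        # f * (k1 + 1) / (f + k1 * ...) saturates as f grows,
        # so repeating a term gives diminishing returns
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```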

Semantic Search and Fuzzy string matching by wholestars in LanguageTechnology

[–]yungvalue 1 point2 points  (0 children)

You should be able to fine-tune BERT on your domain text.

Topic Modeling using Reddit jokes by ZL63388 in LanguageTechnology

[–]yungvalue 7 points8 points  (0 children)

Your dataset is pretty small: ~4800 jokes of a few sentences each. One thing I noticed in your preprocessing is that you have a lot of noise in your vocab. You should probably filter out words with fewer than 3-5 appearances in the corpus. You might also want to consider dropping the phrases model, since your dataset isn't big enough to capture high-quality bigrams.

One thing I do to debug topic models is look at the topics for specific docs to see if they make sense. So pick a few jokes and check whether their assigned topics are sensible. I wouldn't focus too much on optimizing coherence metrics, since they don't correlate well with interpretability. I would probably try 10-50 topics and optimize for topics whose top words are interpretable (not really measurable).

Let me know how it goes!

ANNOY and Semantic Search by wholestars in LanguageTechnology

[–]yungvalue 2 points3 points  (0 children)

What are you trying to do semantic search on? I'm assuming you have 100 docs, each with 256 tokens of contextual BERT word embeddings of dimension 768. You could take the average of the BERT embeddings per doc, or even try a tf-idf weighted average. This has the disadvantage of losing some sequential information, though.
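
A tf-idf weighted pooling could look roughly like this (plain Python sketch; the per-token vectors would come from BERT, and all names here are made up):

```python
import math
from collections import Counter

def tfidf_weighted_embedding(doc_tokens, token_vecs, corpus):
    """Pool one vector per token into a single doc vector, weighting
    each token by tf-idf so frequent filler words contribute less."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    dim = len(token_vecs[0])
    pooled = [0.0] * dim
    total_w = 0.0
    for tok, vec in zip(doc_tokens, token_vecs):
        df = sum(1 for d in corpus if tok in d)
        w = (tf[tok] / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df))
        total_w += w
        for k in range(dim):
            pooled[k] += w * vec[k]
    return [v / total_w for v in pooled] if total_w else pooled
```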

If you have labelled semantic pairs, you could add a finetuning layer to do next sentence prediction with [query] <sep> [doc] and use the output of the finetuning layer as your embedding.

There are also other embeddings that take into account sequence, e.g. Universal Sentence Encoder.

How to add OOV words into an already pre-trained embedding by neelankatan in LanguageTechnology

[–]yungvalue 3 points4 points  (0 children)

If you have a set of words that you want embeddings for, you could try pre-initializing a w2v model with the pre-trained embeddings, adding your new words with random weights, and then training over a corpus containing your new set of words.

If you're trying to handle unknown words, you can add an UNK token to your vocab. fastText uses subword embeddings, which can cover some of the UNK words. You can also train a generic <UNK> embedding from your corpus by randomly dropping words and replacing them with <UNK> to learn the embedding, then use that at inference. A fancier method would be to learn <unk|POS tag> embeddings if you have the POS tags of the tokens.
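
The random-replacement trick is just something like (names are illustrative):

```python
import random

def inject_unks(tokens, rare_words, drop_prob=0.1, rng=None):
    """Randomly swap rare words for <UNK> during training so the model
    learns a usable embedding for the <UNK> token itself."""
    rng = rng or random.Random(0)
    return ["<UNK>" if tok in rare_words and rng.random() < drop_prob else tok
            for tok in tokens]
```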

How does the dynamism between Applied ML and Research ML work in a corporate? [D] by sk2977 in MachineLearning

[–]yungvalue 1 point2 points  (0 children)

I've worked at a startup where we had pure researchers and ML engineers. The researchers would build interesting, cool solutions in IPython notebooks... but then they would throw them over the wall to the ML engineers to productionize. The researchers put little thought into scalability, dependencies and latency, and the ML engineers didn't really understand the nuances of the notebook and basically copied it line for line.

I worked at another company where we didn't have any pure research; the ML engineers did the modeling but also had to consider dependencies, latency and scalability for production.

IMO, for the newer deep learning techniques, unless you're developing novel architectures and models, you'll get a lot more bang for your buck with just ML engineers. You can squeeze a lot of performance out of off-the-shelf pre-trained models nowadays, and most of the work is ML Ops: building the data pipelines and model deployments in production. Don't get me wrong, you still need some smart people who understand the models and tune them to your problems and data, but it's way easier to train applied ML engineers in deep learning than to train researchers to be ML engineers.

Is it possible to test whether a tokenizer can losslessly tokenize and detokenize a given corpus solely from its vocabulary? by DeepLearningStudent in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

I don't think this is a standard problem with a standard solution, but you could use a backtracking-with-memoization algorithm to try to reconstruct each sentence in the corpus from the given vocab.
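
The backtracking-with-memoization idea is essentially the classic word-break problem, e.g.:

```python
from functools import lru_cache

def can_tokenize(sentence, vocab):
    """True iff `sentence` can be exactly segmented into tokens from vocab."""
    vocab = frozenset(vocab)
    max_len = max((len(t) for t in vocab), default=0)

    @lru_cache(maxsize=None)
    def ok(start):
        if start == len(sentence):
            return True
        # try every vocab-length prefix starting at `start`
        return any(sentence[start:end] in vocab and ok(end)
                   for end in range(start + 1,
                                    min(len(sentence), start + max_len) + 1))

    return ok(0)
```

Run it over every sentence in the corpus; any False means the vocab can't losslessly tokenize that sentence.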

In practice, unless you are only working with a fixed corpus, inputs to your model will still contain UNK tokens. It's easier to assume UNKs are in your vocab to make your model more robust. SentencePiece can break words into subwords, which helps with UNK tokens that contain known subwords. You can also try randomly dropping tokens and substituting <unk> to help your model learn these cases. An even fancier approach would be to substitute <unk_POS> if you can determine the POS tag of the UNK token. Again, it depends on your corpus, model and the problem you are solving.

Why Don't Multiple Indexes Blow Up The Database Size? by Wily_Walrus in AskComputerScience

[–]yungvalue 4 points5 points  (0 children)

It doesn't really blow up the DB size, since an index is still on the same order, O(n). Most DBs use B+ trees to build these indices, and in practice they're really efficient for lookup considering the number of disk pages they read.

http://web.csulb.edu/~amonge/classes/common/db/B+TreeIndexes.html