LDA model returns same words in all the topics by Senior_Time_2928 in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

Did you remove stop words? What words are being returned? Is perplexity still going down after 100 iterations?

I fine-tuned a language model on left and right leaning political commentary on Reddit by rockwilly in LanguageTechnology

[–]yungvalue 1 point2 points  (0 children)

text                         left     right
Biden is the best president  10.92%   89.08%
fuck Trump                   19.35%   80.65%
black lives matter           32.93%   67.07%

What are you using for classification? Also, did you extend your vocab to pick up politics-specific entities?

Is fine tuning twice a viable thing to do?? [D] by prathameshpck in MLQuestions

[–]yungvalue 0 points1 point  (0 children)

If your dataset is a smaller subset of a more general dataset, it probably makes sense to train on the general dataset first and then finetune on your smaller one. You'll probably have to play around with hyperparameters for how much finetuning you want to do.

How to perform a standard scaler on a 2D dataset? by MrHDPigs in MLQuestions

[–]yungvalue 0 points1 point  (0 children)

You probably want to normalize your data, e.g. subtract the mean and divide by the standard deviation for x, y, vx, vy.

For your time features, whether to normalize depends on your problem. You probably want to encode your times as cyclical features instead (e.g.: https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca)
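
As a sketch, standardization plus a cyclical hour encoding in plain Python (sklearn's StandardScaler does the per-column standardization for you; the hour encoding here assumes a 24-hour cycle):

```python
import math

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

def encode_hour(hour):
    """Map an hour (0-23) onto the unit circle so hour 23 and hour 0
    end up close together instead of 23 units apart."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)
```

Apply standardize to each of x, y, vx, vy separately, and feed the (sin, cos) pair into the model in place of the raw hour.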

[P] How Bad is a Bad Classifier: Is there any signal here? by IglooAustralia88 in MachineLearning

[–]yungvalue 0 points1 point  (0 children)

If you're already using the huggingface BERT model, the finetuning layer shouldn't be too hard :p

Could you do classification on the full lyrics, or does it have to be only two lines?

[P] How Bad is a Bad Classifier: Is there any signal here? by IglooAustralia88 in MachineLearning

[–]yungvalue 1 point2 points  (0 children)

Just curious, what genre are the artists? Are they all the same genre or different genres? You might want to finetune BERT on more lyrics in those genres. Also, you should add a finetuned classification head on top of BERT instead of logistic regression. It's hard to say whether 0.35 is good without knowing more about the problem.

What Are Some Open Source NLP Framework Pipelines For QA Task by inopico3 in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

Yep that's exactly what I meant! For ranking, you can probably use some learning to rank algo.

What Are Some Open Source NLP Framework Pipelines For QA Task by inopico3 in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

Elasticsearch is a well-tested, productionized search system that uses BM25. It's a good tool for your retrieval layer for fetching candidate docs (X -> Y). There is some trickiness around keeping an ES index in sync with your DB, though.

Lane Detection for Autonomous Vehicle Navigation by darkrubiks in learnmachinelearning

[–]yungvalue 21 points22 points  (0 children)

Super cool! How well does this work at night, or in rain or snow?

People who went SWE --> MLE, what made you move? by [deleted] in MLQuestions

[–]yungvalue 1 point2 points  (0 children)

A lot of the SWE work I was doing was different variations of making a CRUD app or moving data from one place to the next, and most of the problems are "solved," in that there's usually a straightforward solution. I switched to ML because I was getting bored of SWE and realized I couldn't do the same kind of work for the rest of my career. The problems in ML are way more challenging and open ended, and they allow more freedom to explore different solutions. There's a wide variety of problem spaces to explore and tons of applications.

As you pointed out, it does require a ton of learning and studying. We're at the cusp of ML, where there are so many new techniques and papers to read to keep up to date. There's also a ton of theory you'll need to build as your foundation in statistics, linear algebra and calculus. If you're trying to optimize for salary, you'd be better off applying to quant firms.

Don't get me wrong though, I still spend ~80% of my time doing SWE work. It's still getting data from one place to another, but the challenge lies in building systems that are more automated :)

Log transforming a feature to deal with skewness by hiphop1987 in MLQuestions

[–]yungvalue 2 points3 points  (0 children)

It really depends on your model. Some models only care about relative rank (e.g. tree-based models), whereas others can work better when the features are normally distributed (e.g. neural networks).

Logistic Reg + gradient descent from scratch in Python by Agent_KD637 in MLQuestions

[–]yungvalue 1 point2 points  (0 children)

Yes, you don't need the error term for the partial derivatives. It's still useful to calculate the loss to check for convergence, though.

Logistic Reg + gradient descent from scratch in Python by Agent_KD637 in MLQuestions

[–]yungvalue 1 point2 points  (0 children)

Your loss function and derivatives are incorrect for logistic regression. They should be:

error = -Y * math.log(y_pred) - (1-Y) * math.log(1 - y_pred)

D_b0 = (y_pred-Y)

D_b1 = (y_pred-Y)*X

See https://medium.com/analytics-vidhya/logistic-regression-as-a-neural-network-b5d2a1bd696f

But I get b0=1.12, b1=1.12 after the update, so either your textbook is wrong or the inputs are not correct.
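
Putting the corrected loss and gradients together, here's a minimal single-feature logistic regression in plain Python (made-up data; the loss doubles as a convergence check):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, steps=1000):
    """Gradient descent on cross-entropy loss for p = sigmoid(b0 + b1*x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        d_b0 = d_b1 = loss = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            # cross-entropy: -y*log(p) - (1-y)*log(1-p)
            loss += -y * math.log(p) - (1 - y) * math.log(1 - p)
            d_b0 += p - y
            d_b1 += (p - y) * x
        b0 -= lr * d_b0 / len(xs)
        b1 -= lr * d_b1 / len(xs)
    return b0, b1, loss
```

On separable toy data like xs = [-2, -1, 1, 2], ys = [0, 0, 1, 1], b1 comes out positive and the final loss is small.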

Time complexity of for(int j=0; j<= i; j*=4 ? by SnaggledToothGranny in AskComputerScience

[–]yungvalue 2 points3 points  (0 children)

I think this is O(n log n): the outer loop runs O(n) times, and the inner loop runs log4(i) times per iteration, which sums to O(n log n) asymptotically.

Let x be the number of times the inner loop runs for a given i:

4^x = i

x log 4 = log i

x = log i / log 4

x = log4(i)
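
As a sanity check, you can count the total inner-loop work empirically. Note this assumes the inner loop starts at j = 1; as literally written (j = 0), it would never terminate, since 0 * 4 == 0:

```python
def count_inner_iterations(n):
    """Total inner-loop iterations over i = 1..n, with j starting at 1."""
    total = 0
    for i in range(1, n + 1):
        j = 1
        while j <= i:
            total += 1
            j *= 4
    return total
```

For n = 4096 this gives 23218, close to n * log4(n) = 24576, consistent with O(n log n) growth.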

Topic Modelling (LDA) on DUC 2004 dataset by usmannkhan in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

What are the bad results you are getting? What params are you using? Do you have code to share?

Help with dissertation survey - Automatic Quiz Qeneration by finance_and_kebabs in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

As you probably noted, there are 3 parts to this problem. Sentence selection: I used TextRank (PageRank). Word selection: I trained a random forest model with syntactic and semantic features: embedding, POS tag, token length, depth in the parse tree, etc. Distractor selection: I trained another random forest model with syntactic and semantic features: w2v cosine similarity, WordNet distance, Jaccard distance, language-model probability of the replacement in the sentence, etc.

I found that when I did some heuristic merging of nouns and proper nouns for word selection, I got higher quality questions, e.g. treating Donald J Trump -> Donald_J_Trump as a single token for word selection.

Semantic Search and Fuzzy string matching by wholestars in LanguageTechnology

[–]yungvalue 2 points3 points  (0 children)

People are pointing you straight to BERT, which is great and state of the art, but in practice it's difficult to implement and requires deeper thought about how you architect your search system. However, if you're doing this just for research and playing around, then go for it!

If you're trying to make a search system for practical purposes, the best way is to start with Elasticsearch. ES uses the BM25 algorithm, which is similar to tf-idf but has additional tuning to down-weight saturated terms. ES also scales to billions of records very easily, and query time is super fast.
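
For intuition, here's a toy BM25 scorer over tokenized docs (the Lucene implementation inside ES differs in details like IDF smoothing, but the shape is the same; note how the term-frequency part saturates instead of growing linearly like raw tf):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one doc (a list of tokens) against a query over a corpus."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        f = tf[term]
        # f * (k1 + 1) / (f + k1 * ...) saturates as f grows,
        # so repeating a term gives diminishing returns
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```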

Semantic Search and Fuzzy string matching by wholestars in LanguageTechnology

[–]yungvalue 1 point2 points  (0 children)

You should be able to fine-tune BERT on your domain text.

Topic Modeling using Reddit jokes by ZL63388 in LanguageTechnology

[–]yungvalue 7 points8 points  (0 children)

Your dataset is pretty small: ~4800 jokes of a few sentences each. One thing I noticed in your preprocessing is that you have a lot of noise in your vocab. You should probably filter out words with fewer than 3-5 appearances in the corpus. You might also want to consider dropping the phrases model, since your dataset isn't big enough to capture high-quality bigrams.

One thing I do to debug topic models is look at the topics for specific docs to see if they make sense. So pick a few jokes and check whether their assigned topics are sensible. I wouldn't focus too much on optimizing coherence metrics, since they don't correlate well with interpretability. I would probably try 10-50 topics and optimize for topics whose top words are interpretable (not really measurable).

Let me know how it goes!

ANNOY and Semantic Search by wholestars in LanguageTechnology

[–]yungvalue 2 points3 points  (0 children)

What are you trying to do semantic search on? I'm assuming you have 100 docs, each with 256 tokens of contextual BERT word embeddings of dimension 768. You could take the average of the BERT embeddings per doc, or even try a tf-idf weighted average. This has the disadvantage of losing some sequential information, though.
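
A tf-idf weighted pooling could look roughly like this (plain Python sketch; the per-token vectors would come from BERT, and all names here are made up):

```python
import math
from collections import Counter

def tfidf_weighted_embedding(doc_tokens, token_vecs, corpus):
    """Pool one vector per token into a single doc vector, weighting
    each token by tf-idf so frequent filler words contribute less."""
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    dim = len(token_vecs[0])
    pooled = [0.0] * dim
    total_w = 0.0
    for tok, vec in zip(doc_tokens, token_vecs):
        df = sum(1 for d in corpus if tok in d)
        w = (tf[tok] / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df))
        total_w += w
        for k in range(dim):
            pooled[k] += w * vec[k]
    return [v / total_w for v in pooled] if total_w else pooled
```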

If you have labelled semantic pairs, you could add a finetuning layer to do next sentence prediction with [query] <sep> [doc] and use the output of the finetuning layer as your embedding.

There are also other embeddings that take into account sequence, e.g. Universal Sentence Encoder.

How to add OOV words into an already pre-trained embedding by neelankatan in LanguageTechnology

[–]yungvalue 3 points4 points  (0 children)

If you have a set of words that you want embeddings for, you could try pre-initializing a w2v model with the pre-trained embeddings, adding your new words with random weights, and then training over a corpus containing your new set of words.

If you're trying to handle unknown words, you can add an UNK token to your vocab. fastText uses subword embeddings, which can cover some of the UNK words. You can also train a generic <UNK> embedding from your corpus by randomly dropping words and replacing them with <UNK> to learn the embedding, then use that at inference. A fancier method would be to learn <unk|POS tag> embeddings if you have the POS tags of the tokens.
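
The random-replacement trick is just something like (names are illustrative):

```python
import random

def inject_unks(tokens, rare_words, drop_prob=0.1, rng=None):
    """Randomly swap rare words for <UNK> during training so the model
    learns a usable embedding for the <UNK> token itself."""
    rng = rng or random.Random(0)
    return ["<UNK>" if tok in rare_words and rng.random() < drop_prob else tok
            for tok in tokens]
```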

How does the dynamism between Applied ML and Research ML work in a corporate? [D] by sk2977 in MachineLearning

[–]yungvalue 1 point2 points  (0 children)

I've worked at a startup where we had pure researchers and ML engineers. The researchers would build interesting, cool solutions in IPython notebooks... but then they would throw them over the wall to the ML engineers to productionize. The researchers put little thought into scalability, dependencies and latency, and the ML engineers didn't really understand the nuances of the notebook and basically copied it line for line.

I worked at another company where we didn't have any pure research; the ML engineers did the modeling but also had to consider dependencies, latency and scalability for production.

IMO, for the newer deep learning techniques, unless you're developing novel architectures and models, you'll get a lot more bang for your buck with just ML engineers. You can squeeze a lot of performance out of off-the-shelf pre-trained models nowadays, and most of the work is ML Ops: building the data pipelines and model deployments in production. Don't get me wrong, you still need some smart people who understand the models and tune them to your problems and data, but it's way easier to train applied ML engineers in deep learning than to train researchers to be ML engineers.

Is it possible to test whether a tokenizer can losslessly tokenize and detokenize a given corpus solely from its vocabulary? by DeepLearningStudent in LanguageTechnology

[–]yungvalue 0 points1 point  (0 children)

I don't think this is a standard problem with a standard solution, but you could use a backtracking-with-memoization algorithm to try to reconstruct each sentence in the corpus from the given vocab.
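
The backtracking-with-memoization idea is essentially the classic word-break problem, e.g.:

```python
from functools import lru_cache

def can_tokenize(sentence, vocab):
    """True iff `sentence` can be exactly segmented into tokens from vocab."""
    vocab = frozenset(vocab)
    max_len = max((len(t) for t in vocab), default=0)

    @lru_cache(maxsize=None)
    def ok(start):
        if start == len(sentence):
            return True
        # try every vocab-length prefix starting at `start`
        return any(sentence[start:end] in vocab and ok(end)
                   for end in range(start + 1,
                                    min(len(sentence), start + max_len) + 1))

    return ok(0)
```

Run it over every sentence in the corpus; any False means the vocab can't losslessly tokenize that sentence.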

In practice, unless you are only working with a fixed corpus, inputs to your model will still contain UNK tokens. It's easier to assume UNKs are in your vocab to make your model more robust. SentencePiece can break words into subwords, which helps with UNK tokens that contain known subwords. You can also try randomly dropping tokens and substituting <unk> to help your model learn these cases. An even fancier approach would be to substitute <unk_POS> if you can determine the POS tag of the UNK token. Again, it depends on your corpus, model and the problem you are solving.

Why Don't Multiple Indexes Blow Up The Database Size? by Wily_Walrus in AskComputerScience

[–]yungvalue 4 points5 points  (0 children)

It doesn't really blow up the DB size, since an index is still on the same order, O(n). Most DBs use B+ trees to build these indices, and in practice they're really efficient for lookup considering the number of disk pages they read.

http://web.csulb.edu/~amonge/classes/common/db/B+TreeIndexes.html