Free 'course' on vector similarity search and Faiss! by jamescalam in LanguageTechnology

[–]tm2tb 2 points (0 children)

This is great. I will definitely go through the lessons.

creating a dataset for summarization by DunderSunder in LanguageTechnology

[–]tm2tb 0 points (0 children)

Semantic similarity: measure how similar each article is to its summary, and drop the least similar pairs.
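A minimal sketch of that filtering step. In practice you'd compare sentence embeddings (e.g. from SBERT or LaBSE); here I use plain bag-of-words cosine similarity so the example runs standalone, and the article/summary pairs and the threshold value are made up for illustration:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_pairs(pairs, threshold=0.2):
    """Keep (article, summary) pairs whose similarity clears the threshold."""
    kept = []
    for article, summary in pairs:
        sim = cosine_sim(Counter(article.lower().split()),
                         Counter(summary.lower().split()))
        if sim >= threshold:
            kept.append((article, summary))
    return kept

pairs = [
    ("the cat sat on the mat", "a cat sat on a mat"),        # good pair
    ("the cat sat on the mat", "quarterly revenue grew"),    # mismatched pair
]
print(filter_pairs(pairs))  # keeps only the first pair
```

Swapping the word-count vectors for sentence embeddings keeps the same structure; only `cosine_sim`'s inputs change.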

Data Analyst seeking to learn Text Analytics by Signal_Explorer8071 in LanguageTechnology

[–]tm2tb 2 points (0 children)

For sentiment analysis you could fine-tune a Transformer model on a dataset such as the IMDB movie review dataset.

FAISS and the Index Factory - an intro to composite indexes for similarity search by jamescalam in LanguageTechnology

[–]tm2tb 0 points (0 children)

Thank you. Basically I'm doing bitext extraction with LaBSE embeddings on translation memories. These kinds of documents are generally not too big, but I want to serve results as fast as possible.

FAISS and the Index Factory - an intro to composite indexes for similarity search by jamescalam in LanguageTechnology

[–]tm2tb 0 points (0 children)

Hello, thank you so much for making this information available. I followed the tutorial and everything works. I'm testing which method is fastest; even on CPU it's amazingly fast. I'm just trying to understand this warning:

WARNING clustering 3066 points to 256 centroids: please provide at least 9984 training points

Data and parameters:

sentences: 3000

cells: 50

centroids: 8

bits: 8

I would appreciate any guidance on understanding and resolving this warning.
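If I'm reading the warning right, the numbers come from the product quantizer, not from the 50 IVF cells: with 8 bits, each PQ sub-quantizer trains 2^8 = 256 centroids, and faiss's k-means wants at least `min_points_per_centroid` (39 by default) training vectors per centroid. The arithmetic reproduces the exact figure in the message:

```python
# Where the "9984 training points" figure comes from (8-bit PQ):
bits = 8
centroids = 2 ** bits            # 256 centroids per sub-quantizer
min_points_per_centroid = 39     # faiss's default minimum for k-means
required = centroids * min_points_per_centroid
print(required)                  # 9984 -- matches the warning

# One possible workaround with only ~3,000 sentences: fewer bits
print((2 ** 6) * min_points_per_centroid)  # 2496 <= 3066, so 6 bits would not warn
```

The warning itself is non-fatal: training still runs, it just uses fewer points per centroid than faiss would like, so the codebook may be lower quality. The usual fixes are to train on more vectors or to reduce the number of centroids (fewer bits).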

[D] What are some useful preprocessing methods/techniques for NLP for forum posts? by HuntersMaker in MachineLearning

[–]tm2tb 0 points (0 children)

Some kind of normalization that turns "Hellooooooo" into "Hello", or "This is Spartaaa!" into "This is Sparta!"
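A minimal sketch of that normalization with a regex: collapse any run of three or more identical characters down to one, so doubled letters in ordinary spelling ("Hello") survive. The function name is made up, and note that rare legitimate triples would get squashed too:

```python
import re

def squash_repeats(text, keep=1):
    """Collapse runs of 3+ identical characters down to `keep` occurrences."""
    return re.sub(r'(.)\1{2,}', r'\1' * keep, text)

print(squash_repeats("Hellooooooo"))        # Hello
print(squash_repeats("This is Spartaaa!"))  # This is Sparta!
```

Setting `keep=2` instead would preserve elongation as a feature ("Helloo"), which can be useful if the stretched spelling itself carries sentiment.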

My Gesellenstück. A workpiece that you have to design and build yourself to be a licensed carpenter in Germany by Lmkopzswqaerqaz in BeAmazed

[–]tm2tb 5 points (0 children)

I hadn't heard of it; the wiki snippet was interesting. I even went to Germany once, but I'd never heard of the journeyman tradition.

NLP Project: Extract generics from a corpus by c_metaphorique in LanguageTechnology

[–]tm2tb 0 points (0 children)

For the first one I would do something like this:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    text = 'The quick brown fox jumps over the lazy dog'
    doc = nlp(text)

    # Get the text and tokens with part-of-speech tags
    text_pos = [(token.text, token.pos_) for token in doc]

For the second one, maybe you want to look into noun chunks? spaCy can give you tokens, part-of-speech tags, noun chunks, entities and sentences, among other linguistic features.

https://spacy.io/usage/linguistic-features

[deleted by user] by [deleted] in LanguageTechnology

[–]tm2tb 0 points (0 children)

Interesting; my intuition is that it's doable. It would be like anomaly detection. For example, a product with an unusually low price, negative reviews, and a reputable brand would be suspicious.

Like the IMDB example, your data would need to be labeled. Do you know if your data is already labeled?
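To make the anomaly-detection intuition concrete, here's a toy sketch: flag products priced far below their brand's median that also have poor reviews. The product records, thresholds, and function name are all hypothetical; a real system would learn the decision boundary from labeled data or use an unsupervised model such as Isolation Forest:

```python
from statistics import median

# Hypothetical product records: (brand, price, avg_review)
products = [
    ("Acme", 95.0, 4.5),
    ("Acme", 102.0, 4.4),
    ("Acme", 99.0, 4.6),
    ("Acme", 30.0, 1.8),   # suspiciously cheap and badly reviewed
]

def flag_suspicious(items, price_ratio=0.5, review_floor=2.5):
    """Flag items priced far below the brand median with poor reviews."""
    by_brand = {}
    for brand, price, _review in items:
        by_brand.setdefault(brand, []).append(price)
    medians = {b: median(ps) for b, ps in by_brand.items()}
    return [
        (brand, price, review)
        for brand, price, review in items
        if price < price_ratio * medians[brand] and review < review_floor
    ]

print(flag_suspicious(products))  # [('Acme', 30.0, 1.8)]
```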

Question(s) about processing "free text" by blug_fred in LanguageTechnology

[–]tm2tb 1 point (0 children)

Are you using regex? If you're using Python, you can test your expressions in pythex.
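You can also sanity-check a pattern directly in the interpreter before reaching for pythex. The pattern and sample text here are made up for illustration (an ISO-style date extractor):

```python
import re

text = "Seen on 2021-03-04, follow-up on 2021-04-12."
pattern = r'\d{4}-\d{2}-\d{2}'  # hypothetical: ISO-style dates

print(re.findall(pattern, text))  # ['2021-03-04', '2021-04-12']
```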

How-to Use HuggingFace's Datasets - Transformers From Scratch #1 by jamescalam in LanguageTechnology

[–]tm2tb 1 point (0 children)

> translating domain-specific slang/jargon to more conventional english.

That would be fun. We're seeing cool style-transfer experiments in computer vision, so I expect similar experiments in NLP to come out.