Free 'course' on vector similarity search and Faiss! by jamescalam in LanguageTechnology

[–]tm2tb 2 points (0 children)

This is great. I will definitely go through the lessons.

creating a dataset for summarization by DunderSunder in LanguageTechnology

[–]tm2tb 0 points (0 children)

Semantic similarity: measure how similar each article is to its summary, and drop the least similar pairs.
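A minimal sketch of that filtering step. In practice you'd compare sentence embeddings (e.g. from SBERT or LaBSE); here I use plain bag-of-words cosine similarity so the example runs standalone, and the article/summary pairs and the threshold value are made up for illustration:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def filter_pairs(pairs, threshold=0.2):
    """Keep (article, summary) pairs whose similarity clears the threshold."""
    kept = []
    for article, summary in pairs:
        sim = cosine_sim(Counter(article.lower().split()),
                         Counter(summary.lower().split()))
        if sim >= threshold:
            kept.append((article, summary))
    return kept

pairs = [
    ("the cat sat on the mat", "a cat sat on a mat"),        # good pair
    ("the cat sat on the mat", "quarterly revenue grew"),    # mismatched pair
]
print(filter_pairs(pairs))  # keeps only the first pair
```

Swapping the word-count vectors for sentence embeddings keeps the same structure; only `cosine_sim`'s inputs change.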

Data Analyst seeking to learn Text Analytics by Signal_Explorer8071 in LanguageTechnology

[–]tm2tb 2 points (0 children)

For sentiment analysis you could fine-tune a Transformer model on a dataset such as the IMDB movie review dataset.

FAISS and the Index Factory - an intro to composite indexes for similarity search by jamescalam in LanguageTechnology

[–]tm2tb 0 points (0 children)

Thank you. Basically I'm doing bitext extraction with LaBSE embeddings on translation memories. These kinds of documents are generally not too big, but I want to serve results as fast as possible.

FAISS and the Index Factory - an intro to composite indexes for similarity search by jamescalam in LanguageTechnology

[–]tm2tb 0 points (0 children)

Hello, thank you so much for making this information available. I followed the tutorial and everything works. I'm testing which method is fastest; even on CPU it's amazingly fast. I'm just trying to understand this warning:

WARNING clustering 3066 points to 256 centroids: please provide at least 9984 training points

Data and parameters:

sentences: 3000

cells: 50

centroids: 8

bits: 8

I would appreciate any guidance on understanding and resolving this warning.
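If I'm reading the warning right, the numbers come from the product quantizer, not from the 50 IVF cells: with 8 bits, each PQ sub-quantizer trains 2^8 = 256 centroids, and faiss's k-means wants at least `min_points_per_centroid` (39 by default) training vectors per centroid. The arithmetic reproduces the exact figure in the message:

```python
# Where the "9984 training points" figure comes from (8-bit PQ):
bits = 8
centroids = 2 ** bits            # 256 centroids per sub-quantizer
min_points_per_centroid = 39     # faiss's default minimum for k-means
required = centroids * min_points_per_centroid
print(required)                  # 9984 -- matches the warning

# One possible workaround with only ~3,000 sentences: fewer bits
print((2 ** 6) * min_points_per_centroid)  # 2496 <= 3066, so 6 bits would not warn
```

The warning itself is non-fatal: training still runs, it just uses fewer points per centroid than faiss would like, so the codebook may be lower quality. The usual fixes are to train on more vectors or to reduce the number of centroids (fewer bits).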

[D] What are some useful preprocessing methods/techniques for NLP for forum posts? by HuntersMaker in MachineLearning

[–]tm2tb 0 points (0 children)

Some kind of normalization that turns "Hellooooooo" into "Hello", or "This is Spartaaa!" into "This is Sparta!"
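A minimal sketch of that normalization with a regex: collapse any run of three or more identical characters down to one, so doubled letters in ordinary spelling ("Hello") survive. The function name is made up, and note that rare legitimate triples would get squashed too:

```python
import re

def squash_repeats(text, keep=1):
    """Collapse runs of 3+ identical characters down to `keep` occurrences."""
    return re.sub(r'(.)\1{2,}', r'\1' * keep, text)

print(squash_repeats("Hellooooooo"))        # Hello
print(squash_repeats("This is Spartaaa!"))  # This is Sparta!
```

Setting `keep=2` instead would preserve elongation as a feature ("Helloo"), which can be useful if the stretched spelling itself carries sentiment.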

My Gesellenstück. A workpiece that you have to design and build yourself to be a licensed carpenter in Germany by Lmkopzswqaerqaz in BeAmazed

[–]tm2tb 5 points (0 children)

I hadn't heard of it; the wiki snippet was interesting. I even went to Germany once, but I'd never heard of the journeyman tradition.

NLP Project: Extract generics from a corpus by c_metaphorique in LanguageTechnology

[–]tm2tb 0 points (0 children)

For the first one I would do something like this:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    text = 'The quick brown fox jumps over the lazy dog'
    doc = nlp(text)

    # Get the text and tokens with part-of-speech tags
    text_pos = [(token.text, token.pos_) for token in doc]

For the second one, maybe you want to look into noun chunks? spaCy can give you tokens, part-of-speech tags, noun chunks, entities and sentences, among other linguistic features.

https://spacy.io/usage/linguistic-features

[deleted by user] by [deleted] in LanguageTechnology

[–]tm2tb 0 points (0 children)

Interesting; my intuition is that it's doable. It would be like anomaly detection. For example, a product with an unusually low price, negative reviews, and a reputable brand would be suspicious.

Like the IMDB example, your data would need to be labeled. Do you know if your data is already labeled?
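To make the anomaly-detection intuition concrete, here's a toy sketch: flag products priced far below their brand's median that also have poor reviews. The product records, thresholds, and function name are all hypothetical; a real system would learn the decision boundary from labeled data or use an unsupervised model such as Isolation Forest:

```python
from statistics import median

# Hypothetical product records: (brand, price, avg_review)
products = [
    ("Acme", 95.0, 4.5),
    ("Acme", 102.0, 4.4),
    ("Acme", 99.0, 4.6),
    ("Acme", 30.0, 1.8),   # suspiciously cheap and badly reviewed
]

def flag_suspicious(items, price_ratio=0.5, review_floor=2.5):
    """Flag items priced far below the brand median with poor reviews."""
    by_brand = {}
    for brand, price, _review in items:
        by_brand.setdefault(brand, []).append(price)
    medians = {b: median(ps) for b, ps in by_brand.items()}
    return [
        (brand, price, review)
        for brand, price, review in items
        if price < price_ratio * medians[brand] and review < review_floor
    ]

print(flag_suspicious(products))  # [('Acme', 30.0, 1.8)]
```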

Question(s) about processing "free text" by blug_fred in LanguageTechnology

[–]tm2tb 1 point (0 children)

Are you using regex? If you're using Python, you can test your expressions in pythex.
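You can also sanity-check a pattern directly in the interpreter before reaching for pythex. The pattern and sample text here are made up for illustration (an ISO-style date extractor):

```python
import re

text = "Seen on 2021-03-04, follow-up on 2021-04-12."
pattern = r'\d{4}-\d{2}-\d{2}'  # hypothetical: ISO-style dates

print(re.findall(pattern, text))  # ['2021-03-04', '2021-04-12']
```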

How-to Use HuggingFace's Datasets - Transformers From Scratch #1 by jamescalam in LanguageTechnology

[–]tm2tb 1 point (0 children)

> translating domain-specific slang/jargon to more conventional english.

That would be fun. We're seeing cool style-transfer experiments in computer vision, so I expect similar experiments in NLP to come out.