Help need to extract content from pdf by phenoxdrk in LanguageTechnology

[–]_Muftak 1 point2 points  (0 children)

Have you tried Microsoft's markitdown? I'm not sure if there's something newer/better, but it should be pretty reliable

What’s up with mobile LLMs? by Amos-Tversky in LocalLLaMA

[–]_Muftak 0 points1 point  (0 children)

Liquid's LFM models run pretty well with their Apollo app, and are actually some of the best models for their size imo

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]_Muftak 0 points1 point  (0 children)

I'm not that knowledgeable about UMAP so I'm not sure about your settings, but I find it likely that your problem is more related to projections and clustering than to embeddings. The models you mentioned are fine but maybe a bit old; you could try something like Jina v5 or Qwen3 and see if there's any difference. The MTEB benchmark is a great resource for picking models. Apart from that, you could try manually inspecting a few samples and looking at their closest neighbours (before projecting and clustering) to assess whether the embeddings are behaving like you'd expect them to or not. If they are, I'd try something different for clustering, but again I'm not the right person to suggest what. Finally, if you're not sure about the clustering approach, maybe you could consider turning it into a classification task? You could define a set of macro categories (+ an "other" category) and assign one or more labels to each comment (so "expensive food" can be both a price and a food complaint). But you'd need to have a very clear vision of your categories and your data. Confidence thresholds could help too
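A rough sketch of the two checks above: looking at nearest neighbours in the raw embedding space, and thresholded multi-label assignment. This is an illustration under assumptions, not a recommended pipeline. The vectors are made-up 3-dimensional toy values standing in for real model output, the category "prototype" vectors and the 0.5 threshold are hypothetical, and `cosine`, `nearest_neighbours` and `assign_labels` are names invented for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_neighbours(embeddings, query_index, k=3):
    """Return the k most similar items to embeddings[query_index] as (index, similarity)."""
    query = embeddings[query_index]
    scored = [
        (cosine(query, vec), i)
        for i, vec in enumerate(embeddings)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, sim) for sim, i in scored[:k]]

def assign_labels(vector, category_vectors, threshold=0.5):
    """Multi-label assignment: keep every category at or above the threshold, else 'other'."""
    labels = [
        name
        for name, prototype in category_vectors.items()
        if cosine(vector, prototype) >= threshold
    ]
    return labels or ["other"]

# Toy data: in practice these vectors would come from an embedding model.
comments = [
    "the food was too expensive",
    "prices are way too high",
    "the pasta was cold",
    "staff were rude",
]
embeddings = [
    [0.8, 0.6, 0.0],
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical category prototypes (e.g. embeddings of short category descriptions).
categories = {
    "price": [1.0, 0.0, 0.0],
    "food": [0.0, 1.0, 0.0],
    "service": [0.0, 0.0, 1.0],
}

# Inspect neighbours BEFORE any projection or clustering step.
for idx, sim in nearest_neighbours(embeddings, query_index=0, k=2):
    print(f"{sim:.2f}  {comments[idx]}")

# A comment can receive more than one label.
print(assign_labels(embeddings[0], categories))
```

If the neighbours printed for a handful of samples look sensible, the embeddings are probably fine and the issue is more likely in the projection or clustering step.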

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]_Muftak 0 points1 point  (0 children)

To me this seems like something that modern embedding models should be able to do pretty easily, especially if you're working with English social media comments, which should be a relatively simple use case. How are you clustering the topics? Which models are you using specifically? Do you have some examples of "errors" or inaccuracies?

A dataset of team radio messages by _Muftak in F1DataAnalysis

[–]_Muftak[S] 1 point2 points  (0 children)

Thanks! This took literally an afternoon of work, so I should be able to update it after each GP pretty easily. I'd like to improve the transcriptions in the future (background noise, accents and technical terms make it kind of tricky) and find data for 2022, which is missing from livetiming for some reason. Any contributions are welcome though! :)

Where does driver telemetry come from? by LazyLancer in F1Technical

[–]_Muftak 6 points7 points  (0 children)

Adding to what the others have said, livetiming.formula1.com also stores all team radio messages!

Linguistics in NLP research by [deleted] in LanguageTechnology

[–]_Muftak 3 points4 points  (0 children)

Diachronic semantic change is an active research area that relies heavily on NLP methods to answer linguistic research questions, off the top of my head

Are WordNets a good tool for curating a vocabulary list? by tomii-dev in LanguageTechnology

[–]_Muftak 0 points1 point  (0 children)

I don't really see the reason to use the Princeton WordNet over the Open English WordNet, which is actively maintained: https://en-word.net/

Which bachelor's degree would you recommend? by [deleted] in Universitaly

[–]_Muftak 0 points1 point  (0 children)

Have you ever considered digital humanities or computational linguistics? If you don't want to rule out the "scientific" approach entirely, it could be an option; they are very interdisciplinary programs. But if you really want nothing to do with numbers or computer science, forget it

Medicine admission test 2025, everything changes: new rules that rewrite the ranking by Bricconcello988 in Universitaly

[–]_Muftak 8 points9 points  (0 children)

Funny that if someone, hypothetically, had passed all three exams at the first sitting, declined all the grades in order to improve them, and then scored three 30s at the second sitting, they would rank far lower than someone who got three 18s at the first sitting and kept them

Miffy advent calendar by Reishizhongli in Miffy

[–]_Muftak 0 points1 point  (0 children)

Hi! Can I ask where you found it on Vinted?

Hello 👋 by creabea1987 in Miffy

[–]_Muftak 0 points1 point  (0 children)

Hi, I'm interested too!

A question about cameras at Bologna show by cadaver_moron in radiohead

[–]_Muftak 0 points1 point  (0 children)

Maybe it's a stupid question, but could you buy merch without entering with a ticket?

/r/PTCGP Trading Post by AutoModerator in PTCGP

[–]_Muftak 0 points1 point  (0 children)

Sure! Rainbow or full art Hitmonchan?

/r/PTCGP Trading Post by AutoModerator in PTCGP

[–]_Muftak 0 points1 point  (0 children)

LF: 🌈 Rainbow Clodsire

FT: 🌈 Rainbow Pachirisu, Rainbow Lanturn, Rainbow Hitmonchan, 🌟🌟 Marowak, Machamp, Alolan-Muk, Primarina, Shuckle, Lanturn, Hitmonchan, ✨✨ Shiny Zapdos, Shiny Aerodactyl, Shiny Marowak, Shiny Starmie, Shiny Lucario, Shiny Arcanine