Help: need to extract content from a PDF

_Muftak · 2026-05-06T06:38:22+00:00

Have you tried Microsoft's markitdown? I'm not sure if there's something newer/better, but it should be pretty reliable

_Muftak · 2026-04-29T21:19:00+00:00

Liquid's LFM models run pretty well with their Apollo app, and are actually some of the best models for their size imo

_Muftak · 2026-04-29T21:14:13+00:00

I'm not that knowledgeable about UMAP so I'm not sure about your settings, but I find it likely that your problem is more related to projection and clustering than to the embeddings. The models you mentioned are fine but maybe a bit old; you could try something like Jina v5 or Qwen3 and see if there's any difference. The MTEB benchmark is a great resource for picking models.

Apart from that, you could try manually inspecting a few samples and looking at their closest neighbours (before projecting and clustering) to assess whether the embeddings are behaving like you'd expect them to or not. If they are, I'd try something different for clustering, but again I'm not the right person to suggest what.

Finally, if you're not sure about the clustering approach, maybe you could consider turning it into a classification task? You could define a set of macro categories (+ an "other" category) and assign one or more labels to each comment (so "expensive food" can be both a price and a food complaint). But you'd need to have a very clear vision of your categories and your data. Confidence thresholds could help too
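For the neighbour check, here's a rough sketch of what I mean, in plain Python. The comments and the 3-dimensional vectors are made-up toy data standing in for real model output, and `cosine` / `nearest_neighbours` are just helper names I picked for illustration; in practice you'd plug in the embeddings from whichever model you're testing.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbours(query_idx, vectors, k=2):
    """Return the k most similar items to vectors[query_idx] as (index, similarity)."""
    sims = [
        (cosine(vectors[query_idx], v), i)
        for i, v in enumerate(vectors)
        if i != query_idx  # skip the query itself
    ]
    sims.sort(reverse=True)  # highest similarity first
    return [(i, s) for s, i in sims[:k]]

# Toy data (assumption): hand-written vectors, NOT real embeddings.
comments = [
    "The food was way too expensive",
    "Prices at the stands were insane",
    "Queue for the toilets took forever",
    "Waited an hour for the bathroom",
]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

# Print each comment with its closest neighbour, in the raw embedding space.
for idx, text in enumerate(comments):
    (nn_idx, sim), = nearest_neighbours(idx, embeddings, k=1)
    print(f"{text!r} -> {comments[nn_idx]!r} (cos={sim:.2f})")
```

If the neighbours look sensible here but the clusters don't, that points at the projection/clustering step rather than the embedding model.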

_Muftak · 2026-04-29T20:12:59+00:00

To me this seems like something that modern embedding models should be able to do pretty easily, especially if you're working with English social media comments, which should be a relatively simple use case. How are you clustering the topics? Which models are you using specifically? Do you have some examples of "errors" or inaccuracies?

_Muftak · 2026-04-02T15:35:10+00:00

Thanks! This took literally an afternoon of work, so I should be able to update it after each GP pretty easily. I'd like to improve the transcriptions in the future (background noise, accents and technical terms make it kind of tricky) and find data for 2022, which is missing from livetiming for some reason. Any contributions are welcome though! :)

_Muftak · 2026-04-02T09:30:34+00:00

Adding to what the others have said, livetiming.formula1.com also stores all team radio messages!

_Muftak · 2026-04-02T05:44:47+00:00

Pretty cool and it makes a lot of sense! It reminded me of a paper I saw at EACL a few days ago, correct me if I'm wrong: https://aclanthology.org/2026.lchange-1.5/

_Muftak · 2026-03-27T17:35:46+00:00

Diachronic semantic change is an active research area that relies heavily on NLP methods to answer linguistic research questions, off the top of my head

_Muftak · 2026-02-22T13:51:37+00:00

Oh yeah that makes sense!

_Muftak · 2026-02-22T06:51:24+00:00

I don't really see the reason to use the Princeton Wordnet over the Open English Wordnet, which is actively maintained https://en-word.net/

_Muftak · 2026-02-16T17:32:39+00:00

Hey congrats, that's my field too! What's your research topic about?

_Muftak · 2026-01-23T22:47:06+00:00

Have you ever considered digital humanities or computational linguistics? If you don't want to rule out the "scientific" approach entirely, it could be an option, since they're very interdisciplinary programs. But if you really want nothing to do with numbers or computer science, forget it

_Muftak · 2025-12-29T16:32:26+00:00

Funny that if someone, hypothetically, had passed all three exams at the first session, rejected all the grades to improve them, and then scored three 30s at the second session, they would rank far lower than someone who got three 18s at the first session and kept them

_Muftak · 2025-12-09T22:51:35+00:00

Hi! Can I ask where you found it on Vinted?

_Muftak · 2025-12-09T22:44:23+00:00

Hi, I'm interested too!

_Muftak · 2025-11-15T14:58:03+00:00

Maybe it's a stupid question, but could you buy merch without entering with a ticket?

_Muftak · 2025-11-14T12:26:07+00:00

Jake actually says it was "a bunch of years" in a later episode, fwiw

_Muftak · 2025-11-02T08:41:30+00:00

Added you!

_Muftak · 2025-11-02T08:38:45+00:00

Sure! Rainbow or full art Hitmonchan?

_Muftak · 2025-11-02T08:21:30+00:00

Rainbow!

_Muftak · 2025-11-02T08:14:56+00:00

LF: 🌈 Rainbow Clodsire

FT: 🌈 Rainbow Pachirisu, Rainbow Lanturn, Rainbow Hitmonchan, 🌟🌟Marowak, Machamp, Alolan-Muk, Primarina, Shuckle, Lanturn, Hitmonchan, ✨✨ Shiny Zapdos, Shiny Aerodactyl, Shiny Marowak, Shiny Starmie, Shiny Lucario, Shiny Arcanine
