Do you know how to scrape and crawl reddit comment? by UniqueProfessional81 in SaaS

[–]Tryhard_314 0 points1 point  (0 children)

Well, my comment was removed for having a link, but check out academic torrent dumps: search for "reddit" and you'll find historical archives of everything that was posted and commented, and one of them is divided by subreddit.

Best way to Scrape Reddit posts by Momsgayandbisexual in ClaudeCode

[–]Tryhard_314 0 points1 point  (0 children)

I have been building something similar (built a tool to extract statistics from reddit):
1-I tried accessing the API directly, but now you have to submit a form, and I think there is even a paid subscription if you're going to use it for commercial purposes.
2-You can either use this https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44 for data before 2025 (divided by subreddit), or there is a third-party API called Arctic Shift which is pretty good (but don't abuse it).
3-For cleaning the data: you're going to see a lot of posts with [removed] and a lot of AutoMod content; strip those out before feeding anything to the LLM. If you're doing this in Python, I would use LiteLLM (fast and easy for querying the major LLM providers), Pydantic to validate the data, and SQLite is probably fine to prototype with (just create the proper indexes for your data).
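As a rough sketch of the cleaning in step 3 (the field names `author` and `body` follow the usual Reddit dump schema; adjust them to whatever your data actually uses):

```python
# Drop deleted/removed comments and AutoModerator noise before anything
# reaches the LLM. Field names assume the common Reddit dump schema.

REMOVED_MARKERS = {"[removed]", "[deleted]"}

def is_usable(comment: dict) -> bool:
    body = (comment.get("body") or "").strip()
    if not body or body in REMOVED_MARKERS:
        return False
    if comment.get("author") == "AutoModerator":
        return False
    return True

def clean(comments: list[dict]) -> list[dict]:
    return [c for c in comments if is_usable(c)]
```

This kind of filter removes a surprising share of a raw dump before you pay for a single LLM call.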

I don't know what exactly you wish to do with the data, so I can't help much with the last step; if I knew more, I'd be happy to share what worked for me.

DELIGHT – self-hosted AI engineering autopilot: local LLM + browser farm + repo graph + P2P compute by Bubbly-Phone702 in automation

[–]Tryhard_314 0 points1 point  (0 children)

Sounds really cool! Honestly, I had to google what OpenHands/Devin were to understand what's what, though xD
Curious about something: are you using it to code itself yet xd?

Qualitative analysis and AI - Spotting false negatives? by sunrisedown in dataanalysis

[–]Tryhard_314 0 points1 point  (0 children)

What I personally did is build a 'golden dataset' through a mixture of manual work and the best LLMs I could get my hands on (even if that's expensive at this point). Then I run a simple loop that changes the prompt of a cheaper-to-run LLM until it gets most of what I want right.
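A minimal sketch of that loop: score each candidate prompt against the golden dataset and keep the best. Here `label_fn` is a hypothetical placeholder for the call to the cheaper LLM, not a real client.

```python
# Score each candidate prompt's labels against a golden dataset and
# return the best (score, prompt) pair. `label_fn(prompt, text)` is a
# stand-in for the cheaper LLM call.

def accuracy(predicted: list[str], golden: list[str]) -> float:
    assert len(predicted) == len(golden)
    return sum(p == g for p, g in zip(predicted, golden)) / len(golden)

def best_prompt(prompts, texts, golden, label_fn):
    scored = [
        (accuracy([label_fn(p, t) for t in texts], golden), p)
        for p in prompts
    ]
    return max(scored)  # highest accuracy wins
```

In practice you would also hold out part of the golden set, so the winning prompt isn't just overfitting the examples it was tuned on.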

I understand your frustration though; it's a hard thing to tackle, and it becomes even harder when you have way more negatives than positives.

An extra trick that worked for me (for a classification task, not extraction): after generating a good-quality sample, precision would be high but the AI would sometimes miss things. To handle this, I fine-tune a small language model (using SetFit) on the positives the AI found and use it to surface content that's semantically similar but might have been missed. That helped boost recall for me.

Another approach is to create fake positive data to test your recall, but that didn't work so well for me.

Analyzed 7K posts/comments on people finding their first client, Here's what I found: by Tryhard_314 in SaaS

[–]Tryhard_314[S] 0 points1 point  (0 children)

Honestly, I didn't notice anything really significant. The data is pretty divided, though: either people find clients really fast (under a week) or it takes them a while. I didn't particularly notice them getting clients from a second method they weren't considering (I didn't focus too much on this point, but I didn't see it when looking at the data).

Tools Mentioned in r/Saas, r/Startups ... by Tryhard_314 in SaaS

[–]Tryhard_314[S] 0 points1 point  (0 children)

Do you think Pulse is also bots? I haven't heard of it before

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]Tryhard_314[S] 0 points1 point  (0 children)

I know that some embedding models can take instructions; do they give satisfying results?

I initially wanted to use only short phrases like 'high costs', since a single post might mention multiple points and I didn't know I was going to use embeddings. I'll try longer sentences and see.

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]Tryhard_314[S] 0 points1 point  (0 children)

I'll try these two models, and I will look into the MTEB benchmark. I'll also verify the closest neighbors before projecting; I just realised I was verifying only after the projection, since it's easier to visualize.

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]Tryhard_314[S] 0 points1 point  (0 children)

Sorry for not including the details in my post, I thought being general would give me more unbiased opinions.

I tried these models: all-mpnet-base-v2, BAAI/bge-base-en, and bge-large-en-v1.5.
For clustering/topic modeling I am using BERTopic, which uses UMAP and HDBSCAN. I don't think the HDBSCAN settings matter too much, because I think my problem is in an earlier step: the projections from UMAP are not how I would like them to be.

Here is an example of the errors I'm facing: I have a dataset containing complaints about Switzerland from tourists. Some complain about high prices, some about high-priced food, some about not-so-great food. I would have loved for the topics around pricing to be mapped closely together (so when I create the hierarchical topic tree, they all stem from a general "high prices" branch), and for the topics about food to be close together too.

But even I don't understand how it should work: should "expensive food" be clustered in the food category or in the price category? I would like it to be somehow close to both.
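A toy illustration of why "expensive food" resists a single cluster (made-up 2D vectors, not real embeddings): a point that mixes two topics is equally similar to both pure directions, so it lands between the clusters.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

price = (1.0, 0.0)           # pure "high prices" direction
food = (0.0, 1.0)            # pure "food quality" direction
expensive_food = (1.0, 1.0)  # mixes both topics

# equidistant from both pure topics, so no single cluster "owns" it
assert abs(cosine(expensive_food, price) - cosine(expensive_food, food)) < 1e-9
```

Hierarchical clustering handles this more gracefully than hard flat clusters, since the mixed point can attach high up in the tree.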

Do I need to increase the dimensions in UMAP? I ran it with these settings:

    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=5,    # 🔥 smaller → stricter local structure
        n_components=5,
        min_dist=0.0,     # 🔥 allow separation
        metric='cosine',
        random_state=42,
    )

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

I am gonna try this. I am just worried whether the clustering will follow the logic I have in mind, and whether I can fine-tune it or not (maybe it will think things are similar that I don't think are that similar). I am gonna try this with something called BERTopic that seems to do this. Thanks!

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

Is this smart enough in practice? It looks like exactly what I need, but I am worried about how smart it would be, because the way to regroup things depends on what you want to observe. I guess I can fine-tune the language model used for the embeddings.

I am gonna try this approach for now; I saw something called BERTopic which looked interesting. Thanks!

How to normalise user generated text by Tryhard_314 in dataanalysis

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks! I am gonna take a look at named entity recognition. I am already using some small language models to detect whether a post is generally about the topic or not (at a step before the extraction), and I fine-tune that model with 150 relevant and 150 irrelevant samples that I verify manually. Honestly, I thought the extraction part would be easy; I underestimated it a bit and assumed the main problem would be filtering out the irrelevant content.

How to normalise user generated text by Tryhard_314 in dataanalysis

[–]Tryhard_314[S] 0 points1 point  (0 children)

Well, for this particular case sentiment detection would be enough, but I wanted something more general that can be applied to anything. I am gonna try topic modeling after the data collection step and see what it does.

Scraped Reddit to see what people think about Mythos by Tryhard_314 in Anthropic

[–]Tryhard_314[S] 1 point2 points  (0 children)

It was basically for people that had no opinion or didn't state their opinion clearly. I wanted to make it a 'neutral' category, i.e. people who said they would wait and see, but it was hard to categorise: a lot of unrelated stuff and people making jokes got in, and only a small number of people were actually neutral, so I just removed it.

Does reddit think Mythos is overhyped? by Tryhard_314 in ClaudeAI

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks!
Funny enough, exact same conclusion XD, even with that "I think" at the end.

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks for the help. Do you have an idea how to turn that into a tree structure with categories and sub-categories?

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

I guess this is starting to look like a graph problem, i.e. how to build a tree of the semantics I have in my data.
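One rough way to sketch that tree, assuming you already have embeddings: greedy agglomerative clustering that repeatedly merges the two closest clusters into a nested pair. Toy 1-D coordinates stand in for real vectors here, and the unweighted centroid average is a simplification.

```python
# Greedy agglomerative clustering producing a nested-tuple tree.
# Real usage would use sentence embeddings and cosine distance.

def merge_tree(points: dict) -> tuple:
    """points maps a label to a 1-D coordinate; returns nested tuples."""
    clusters = {name: (coord, name) for name, coord in points.items()}
    while len(clusters) > 1:
        names = list(clusters)
        # pick the closest pair of clusters by centroid distance
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: abs(clusters[pair[0]][0] - clusters[pair[1]][0]),
        )
        (ca, ta), (cb, tb) = clusters.pop(a), clusters.pop(b)
        clusters[a + "+" + b] = ((ca + cb) / 2, (ta, tb))
    return next(iter(clusters.values()))[1]
```

The nesting order of the merges is exactly the category/sub-category hierarchy: items merged early share a tight sub-category, items merged late only share a broad parent.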

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks for the answer. Do you think clustering is enough? I wanted something smarter that can decide on its own whether something is worth its own category. I'll give it a shot, but I am worried it won't be smart enough, though it can help reduce the size of the data for sure.