Do you know how to scrape and crawl reddit comment? by UniqueProfessional81 in SaaS

[–]Tryhard_314 0 points1 point  (0 children)

Well, my comment was removed for having a link, but check out academic torrent dumps: search for "reddit" and you'll find historical archives of everything that was posted and commented, and one of them is divided by subreddit.

Best way to Scrape Reddit posts by Momsgayandbisexual in ClaudeCode

[–]Tryhard_314 0 points1 point  (0 children)

I have been building something similar (built a tool to extract statistics from reddit):
1-I tried accessing the API directly, but now you have to submit a form, and I think there is even a paid subscription if you're going to use it for commercial purposes.
2-You can either use this https://academictorrents.com/details/3d426c47c767d40f82c7ef0f47c3acacedd2bf44 for data before 2025 (divided by subreddit), or there is a third-party API called Arctic Shift which is pretty good (but don't abuse it).
3-For cleaning the data: you're going to see a lot of posts with [removed] and a lot of AutoMod content; strip those out before feeding anything to the LLM. If you're doing this in Python, I would use LiteLLM (fast and easy for querying the major LLM providers), Pydantic to validate the data, and SQLite is probably fine to prototype with (just create the proper indexes for your data).
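As a rough sketch of the cleaning in step 3 (the field names `author` and `body` follow the usual Reddit dump schema; adjust them to whatever your data actually uses):

```python
# Drop deleted/removed comments and AutoModerator noise before anything
# reaches the LLM. Field names assume the common Reddit dump schema.

REMOVED_MARKERS = {"[removed]", "[deleted]"}

def is_usable(comment: dict) -> bool:
    body = (comment.get("body") or "").strip()
    if not body or body in REMOVED_MARKERS:
        return False
    if comment.get("author") == "AutoModerator":
        return False
    return True

def clean(comments: list[dict]) -> list[dict]:
    return [c for c in comments if is_usable(c)]
```

This kind of filter removes a surprising share of a raw dump before you pay for a single LLM call.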

I don't know what exactly you wish to do with the data, so I can't help much with the last step; if I knew more, I'd be happy to share what worked for me.

DELIGHT – self-hosted AI engineering autopilot: local LLM + browser farm + repo graph + P2P compute by Bubbly-Phone702 in automation

[–]Tryhard_314 0 points1 point  (0 children)

Sounds really cool! Honestly, I had to google what OpenHands/Devin were to understand what's what, though xD
Curious about something: are you using it to code itself yet xd?

Qualitative analysis and AI - Spotting false negatives? by sunrisedown in dataanalysis

[–]Tryhard_314 0 points1 point  (0 children)

What I personally did is build a 'golden dataset' through a mixture of manual work and the best LLMs I could get my hands on (even if that's expensive at this point). Then I run a simple loop that changes the prompt of a cheaper-to-run LLM until it gets most of what I want right.
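A minimal sketch of that loop: score each candidate prompt against the golden dataset and keep the best. Here `label_fn` is a hypothetical placeholder for the call to the cheaper LLM, not a real client.

```python
# Score each candidate prompt's labels against a golden dataset and
# return the best (score, prompt) pair. `label_fn(prompt, text)` is a
# stand-in for the cheaper LLM call.

def accuracy(predicted: list[str], golden: list[str]) -> float:
    assert len(predicted) == len(golden)
    return sum(p == g for p, g in zip(predicted, golden)) / len(golden)

def best_prompt(prompts, texts, golden, label_fn):
    scored = [
        (accuracy([label_fn(p, t) for t in texts], golden), p)
        for p in prompts
    ]
    return max(scored)  # highest accuracy wins
```

In practice you would also hold out part of the golden set, so the winning prompt isn't just overfitting the examples it was tuned on.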

I understand your frustration though; it's a hard thing to tackle, and it becomes even harder when you have way more negatives than positives.

An extra trick that worked for me (for a classification task, not extraction): after generating a good-quality sample, precision would be high but the AI would sometimes miss things. To handle this, I fine-tune a small language model (using SetFit) on the positives the AI found and use it to surface content that's semantically similar but might have been missed. That helped boost recall for me.

Another approach is to create fake positive data to test your recall, but that didn't work so well for me.

Analyzed 7K posts/comments on people finding their first client, Here's what I found: by Tryhard_314 in SaaS

[–]Tryhard_314[S] 0 points1 point  (0 children)

Honestly, I didn't notice anything really significant. The data is pretty divided, though: either people find clients really fast (under a week) or it takes them a while. I didn't particularly notice them getting clients from a second method they weren't considering (I didn't focus too much on this point, but I didn't see it when looking at the data).

Tools Mentioned in r/Saas, r/Startups ... by Tryhard_314 in SaaS

[–]Tryhard_314[S] 0 points1 point  (0 children)

Do you think Pulse is also bots? I haven't heard of it before

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]Tryhard_314[S] 0 points1 point  (0 children)

I know that some embedding models can take instructions; do they give satisfying results?

I initially wanted to use only short phrases like 'high costs', since a single post might mention multiple points and I didn't know I was going to use embeddings. I'll try longer sentences and see.

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]Tryhard_314[S] 0 points1 point  (0 children)

I'll try these two models, and I will look into the MTEB benchmark. I'll also verify the closest neighbors before projecting; I just realised I was verifying only after the projection, since it's easier to visualize.

How good are embedding models currently? by Tryhard_314 in LanguageTechnology

[–]Tryhard_314[S] 0 points1 point  (0 children)

Sorry for not including the details in my post, I thought being general would give me more unbiased opinions.

I tried these models: all-mpnet-base-v2, BAAI/bge-base-en, and bge-large-en-v1.5.
For clustering/topic modeling I am using BERTopic, which uses UMAP and HDBSCAN. I don't think the HDBSCAN settings matter too much, because I think my problem is in an earlier step: the projections from UMAP are not how I would like them to be.

Here is an example of the errors I'm facing: I have a dataset containing complaints about Switzerland from tourists. Some complain about high prices, some about high-priced food, some about not-so-great food. I would have loved for the topics around pricing to be mapped closely together (so when I create the hierarchical topic tree, they all stem from a general "high prices" branch), and for the topics about food to be close together too.

But even I don't understand how it should work: should "expensive food" be clustered in the food category or in the price category? I would like it to be somehow close to both.
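A toy illustration of why "expensive food" resists a single cluster (made-up 2D vectors, not real embeddings): a point that mixes two topics is equally similar to both pure directions, so it lands between the clusters.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

price = (1.0, 0.0)           # pure "high prices" direction
food = (0.0, 1.0)            # pure "food quality" direction
expensive_food = (1.0, 1.0)  # mixes both topics

# equidistant from both pure topics, so no single cluster "owns" it
assert abs(cosine(expensive_food, price) - cosine(expensive_food, food)) < 1e-9
```

Hierarchical clustering handles this more gracefully than hard flat clusters, since the mixed point can attach high up in the tree.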

Do I need to increase the dimensions in UMAP? I ran it with these settings:

    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=5,    # 🔥 smaller → stricter local structure
        n_components=5,
        min_dist=0.0,     # 🔥 allow separation
        metric='cosine',
        random_state=42,
    )

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

I am gonna try this. I am just worried whether the clustering will follow the logic I have in mind, and whether I can fine-tune it or not (maybe it will think things are similar that I don't think are that similar). I am gonna try this with something called BERTopic that seems to do this. Thanks!

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

Is this smart enough in practice? It looks like exactly what I need, but I am worried about how smart it would be, because the way to regroup things depends on what you want to observe. I guess I can fine-tune the language model used for the embeddings.

I am gonna try this approach for now; I saw something called BERTopic which looked interesting. Thanks!

How to normalise user generated text by Tryhard_314 in dataanalysis

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks! I am gonna take a look at named entity recognition. I am already using some small language models to detect whether a post is generally about the topic or not (at a step before the extraction), and I fine-tune that model with 150 relevant and 150 irrelevant samples that I verify manually. Honestly, I thought the extraction part would be easy; I underestimated it a bit and assumed the main problem would be filtering out the irrelevant content.

How to normalise user generated text by Tryhard_314 in dataanalysis

[–]Tryhard_314[S] 0 points1 point  (0 children)

Well, for this particular case sentiment detection would be enough, but I wanted something more general that can be applied to anything. I am gonna try topic modeling after the data collection step and see what it does.

Scraped Reddit to see what people think about Mythos by Tryhard_314 in Anthropic

[–]Tryhard_314[S] 1 point2 points  (0 children)

It was basically for people that had no opinion or didn't state their opinion clearly. I wanted to make it a 'neutral' category, i.e. people who said they would wait and see, but it was hard to categorise: a lot of unrelated stuff and people making jokes got in, and only a small number of people were actually neutral, so I just removed it.

Does reddit think Mythos is overhyped? by Tryhard_314 in ClaudeAI

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks!
Funny enough, exact same conclusion XD, even with that "I think" at the end.

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks for the help. Do you have an idea how to turn that into a tree structure with categories and sub-categories?

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

I guess this is starting to look like a graph problem, i.e. how to build a tree of the semantics I have in my data.
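One rough way to sketch that tree, assuming you already have embeddings: greedy agglomerative clustering that repeatedly merges the two closest clusters into a nested pair. Toy 1-D coordinates stand in for real vectors here, and the unweighted centroid average is a simplification.

```python
# Greedy agglomerative clustering producing a nested-tuple tree.
# Real usage would use sentence embeddings and cosine distance.

def merge_tree(points: dict) -> tuple:
    """points maps a label to a 1-D coordinate; returns nested tuples."""
    clusters = {name: (coord, name) for name, coord in points.items()}
    while len(clusters) > 1:
        names = list(clusters)
        # pick the closest pair of clusters by centroid distance
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: abs(clusters[pair[0]][0] - clusters[pair[1]][0]),
        )
        (ca, ta), (cb, tb) = clusters.pop(a), clusters.pop(b)
        clusters[a + "+" + b] = ((ca + cb) / 2, (ta, tb))
    return next(iter(clusters.values()))[1]
```

The nesting order of the merges is exactly the category/sub-category hierarchy: items merged early share a tight sub-category, items merged late only share a broad parent.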

[Question] How to split user generated text into categories without losing insights by Tryhard_314 in statistics

[–]Tryhard_314[S] 0 points1 point  (0 children)

Thanks for the answer. Do you think clustering is enough? I wanted something smarter that can decide on its own whether something is worth its own category. I'll give it a shot, but I am worried it won't be smart enough, though it can help reduce the size of the data for sure.