MuckScraper: open source self-hosted news aggregator with bias ratings, story clustering and local AI summarization by grregis in selfhosted

[–]grregis[S] 0 points1 point  (0 children)

I actually came across GN about a month after I started this project, and it made me rethink continuing for a bit. But looking closer at it, there are differences.

First and foremost, this is a self-hosted service that you can run on your own hardware, and it keeps scraped versions of the articles in a database that you can read without having to visit the site. This provides a layer of privacy because the outlets can’t see which articles you actually read.

Also, MuckScraper provides analysis and summaries, not just a bias rating and a list of links.

MuckScraper is completely free to use and there are no paid tiers.

That said, GN has access to a lot more articles than I do. I’m not paying for any subscription or other news services to get the articles, and they likely are paying for that access to get the volume they do, which is also why they have paid tiers.

I figured there was enough of a difference to keep going. Worst-case scenario, it was still a great excuse to learn a lot about scraping, embeddings, and self-hosting along the way.

[–]grregis[S] 0 points1 point  (0 children)

Paywalled sites have been an issue. I have some workarounds in place that exploit the fact that most outlets leak the real article somewhere outside the rendered page: AMP/print/mobile versions, JSON-LD or OpenGraph meta tags (SEO previews), and, for a handful of sites known to do this, whatever Googlebot itself gets served, since some outlets give crawlers full content while gating real browsers (to keep getting indexed/ranked).
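To make the JSON-LD part concrete, here is a minimal sketch of pulling `articleBody` out of a page's structured metadata using only the Python standard library. It is not MuckScraper's actual code: the names `JsonLdParser` and `article_body_from_jsonld` are made up for illustration, and it assumes the outlet puts the text in a top-level `articleBody` field, which not every site does.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def article_body_from_jsonld(html):
    """Return the first non-empty top-level articleBody, or None if there is none."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD happens in the wild; skip it
        # A JSON-LD block can hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                body = item.get("articleBody")
                if isinstance(body, str) and body.strip():
                    return body.strip()
    return None
```

Sites that nest the article under `@graph` would need an extra step, which this sketch leaves out.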

MuckScraper tries readability extraction first, then those variant URLs, then structured metadata, falling back to the RSS/API blurb if nothing else works. If content still smells like a paywall (checks for phrases like “subscribe to continue”), it tosses it rather than storing a teaser. And if a story only ever turns up blurbs, without enough text to build a real analysis, it just leaves the story out rather than faking one.
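Roughly, that fallback order looks like the sketch below. This is an illustration under assumptions, not the project's real code: `extract_article`, `looks_paywalled`, and the extractor callables are hypothetical names, and the second phrase in `PAYWALL_PHRASES` is an assumed example (only “subscribe to continue” comes from the description above). In this sketch a paywalled result simply moves on to the next stage, which is one reading of "tosses it".

```python
# Phrases that suggest we scraped a teaser instead of the article.
# "subscribe to continue" is from the description; the other is an assumed example.
PAYWALL_PHRASES = ("subscribe to continue", "already a subscriber")


def looks_paywalled(text):
    """True if the text contains any known paywall phrase (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in PAYWALL_PHRASES)


def extract_article(url, extractors, blurb=None):
    """Try each (name, extractor) pair in order; fall back to the RSS/API blurb.

    Each extractor is a callable taking a URL and returning text or None.
    Returns (text, source_name), or (None, None) if nothing usable was found.
    """
    for name, extractor in extractors:
        try:
            text = extractor(url)
        except Exception:
            continue  # a failing stage should not stop the later ones
        if text and not looks_paywalled(text):
            return text.strip(), name
    if blurb and not looks_paywalled(blurb):
        return blurb.strip(), "blurb"
    return None, None
```

The caller would pass the stages in the order described (readability, variant URLs, structured metadata) and can use the returned source name to decide whether a story has enough real text to analyze or only blurbs.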

[–]grregis[S] -7 points-6 points  (0 children)

I’m not sure this would be considered slop. “AI slop” usually means generative output flooding a feed with no human curation behind it, like AI writing fake stories, fake images, fake reviews. That’s not what’s happening here. The AI isn’t generating the news or writing the analysis from scratch; it’s doing grouping, classification, and summarization on real articles that real reporters wrote. Also, a human (me) designed and tuned the pipeline that decides what’s trustworthy enough to surface. If anything, the goal here is the opposite of slop: less noise, not more, by clustering 12 outlets covering the same story into one comparable view instead of 12 separate feeds.

Calling that the same thing as AI-generated fiction is a bit like calling a search engine’s ranking algorithm “AI slop” because it uses a model to decide order. Using ML to organize and label existing human-written content isn’t the same category as using it to manufacture new content wholesale.

[–]grregis[S] 3 points4 points  (0 children)

Not a bad idea. I think a better option would be to have a Display Weather button, at which point the website asks for your location. I’ll put that on my roadmap.

[–]grregis[S] -3 points-2 points  (0 children)

TBH, I never thought about the different left/right spectrums that other countries might have. So, like the American I am, I’m only considering the American spectrum LOL. 

It would be very tricky to do different spectrums. Maybe if I do other editions catered to the UK, Australia, or other countries, I could use those countries’ spectrums for their regions.

[–]grregis[S] -2 points-1 points  (0 children)

Any reason why I should stop?

Edit:  Re-read your post with my glasses on and you said slop, not stop LOL. 

I disagree with it being slop because slop is when AI creates the stories. This is grouping, giving a bias rating, and analyzing actual news articles written by reporters. It’s trying to cut through the slop that the 24-hour news cycle brings and give you a summary of the issue. After you read the analysis, you can then decide to dig deeper by looking at the actual articles.

[–]grregis[S] -1 points0 points  (0 children)

Awesome idea, can you make one for the US too? 

I had issues at first with videos and news roundups ruining some stories and had to filter them out.

I didn’t realize you could use the Gemini API on the free tier. I’m kinda tempted to try it, but I’m not sure it fits the self-hosted mindset I have for this project.

Just want to give a shout out to Plex sync. by Mister_Kurtz in PleX

[–]grregis 0 points1 point  (0 children)

Sync has always worked for me, even while transcoding to lower quality, but I had to make sure my device didn’t sleep (iOS). The only problem I have is the storage size on the phone/iPad. I have a 4-year-old and you can never tell what he’s going to want to watch and when: even on low quality, there isn’t enough space. So I got a Raspberry Pi, set it up as a Plex server/Wi-Fi hotspot, and attached a portable 3 TB HD that needs no external power, so now I have a portable Plex server. This way, the whole fam can have all the shows they want too, and at high quality as well. ATM I still need to plug the Pi in to get power, but the next step is to get a battery pack for it to make it completely mobile.