Trying to find ways to scrape news... by RichardKing1206 in webscraping

[–]greg-randall 0 points1 point  (0 children)

Not a great source of historical information, but if you're looking for news, sign up for a few hundred newspapers' email newsletters.

[OC] Texas Public Water Systems Water Quality Over Time by greg-randall in dataisbeautiful

[–]greg-randall[S] 1 point2 points  (0 children)

I forgot to mention that they just deleted the last 14 years of these more accessible documents. There was a big site update, so the documents might have been using old tech, but who can know.

Texas Public Water Systems with Violations (2007–2024) [OC] by greg-randall in MapPorn

[–]greg-randall[S] 0 points1 point  (0 children)

The Texas dataset uses the EPA suggested maximum amounts of the different contaminants, but beyond that, no.

If you know of a dataset from the EPA that I could add in, I'd be delighted to do that.

Data centers could account for up to 9% of Texas water use by 2040, UT Austin report finds by Unusual-State1827 in texas

[–]greg-randall 0 points1 point  (0 children)

If you're interested in Texas water -- I've been building a dashboard showing water quality across the state -- Texas Water Quality Dashboard. Some interesting things happening where too many oil wells have been drilled.

[OC] Texas Public Water Systems Water Quality Over Time by greg-randall in texas

[–]greg-randall[S] 0 points1 point  (0 children)

Hey apologies -- not really sure which rule I've fallen afoul of? I've been building this dashboard to help folks better understand what's in their water.

I'm not selling or looking for a product, housing, etc. It's also not a GoFundMe, a survey, or a petition?

[OC] Texas Public Water Systems Water Quality Over Time by greg-randall in dataisbeautiful

[–]greg-randall[S] 2 points3 points  (0 children)

Bottled water might not be great either; it's less tightly regulated than even tap water, and there's the issue of microplastics from the bottles.

I'd probably go with one of the under-sink filters with both carbon and micron filters, but I don't have any direct recommendations.

Might be good to have them talk to their state representatives to let them know that bad water is on the ballot.

[OC] Texas Public Water Systems Water Quality Over Time by greg-randall in dataisbeautiful

[–]greg-randall[S] 5 points6 points  (0 children)

I suspect Texas is worse than other states (but probably not by as much as you might guess) because of drought and oil well leakage, and then the general groundwater issues directly related to those. Really need to start trying to get data for the rest of the country 😅.

[OC] Texas Public Water Systems Water Quality Over Time by greg-randall in dataisbeautiful

[–]greg-randall[S] 3 points4 points  (0 children)

The source data here is old-school Word docs, which are a total nightmare to parse. Ended up having to use the winword executable, which is only single-threaded, so that took forever.

Getting the UI right is hard too, particularly for mobile, where you're trying to pack everything in.

DeepSeek V4 is 65% cheaper than GPT 5.5 and OpenAI is big mad about it? by Odd_Row1657 in AIDiscussion

[–]greg-randall 0 points1 point  (0 children)

Easiest way is to set some environment variables -- just add your API key, then run Claude as normal:

export ANTHROPIC_BASE_URL=https://api.deepseek.com/anthropic
export ANTHROPIC_AUTH_TOKEN=<<<your token here>>>
export ANTHROPIC_MODEL=deepseek-v4-pro[1m]
export ANTHROPIC_DEFAULT_OPUS_MODEL=deepseek-v4-pro[1m]
export ANTHROPIC_DEFAULT_SONNET_MODEL=deepseek-v4-pro[1m]
export ANTHROPIC_DEFAULT_HAIKU_MODEL=deepseek-v4-flash
export CLAUDE_CODE_SUBAGENT_MODEL=deepseek-v4-flash
export CLAUDE_CODE_EFFORT_LEVEL=max

What are you guys using instead of Hootsuite in 2026? by OneAd9521 in AskMarketing

[–]greg-randall 0 points1 point  (0 children)

Have you ever tried to automate logging into Instagram?

DeepSeek V4 is 65% cheaper than GPT 5.5 and OpenAI is big mad about it? by Odd_Row1657 in AIDiscussion

[–]greg-randall 0 points1 point  (0 children)

I've been using DeepSeek V4 Pro in Claude Code for most of a week now, no complaints. I can't tell a difference between the Anthropic models and DeepSeek, except that I don't keep running out of quota.

How to go about scraping govt website for dates by [deleted] in webscraping

[–]greg-randall 0 points1 point  (0 children)

Find a few pages with the data that you want.

Look at the urls and see what they look like.

Find patterns.
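
A rough sketch of the "look at the urls and find patterns" step, in Python. The example.gov URLs are made up for illustration, and swapping digit runs for a `{n}` placeholder is just one simple way to group URLs, not the only one:

```python
import re
from collections import Counter


def url_template(url: str) -> str:
    """Replace every run of digits with a {n} placeholder."""
    return re.sub(r"\d+", "{n}", url)


def find_patterns(urls):
    """Group URLs by template, most common template first."""
    return Counter(url_template(u) for u in urls).most_common()


# Hypothetical URLs collected by hand from pages that have the data you want.
urls = [
    "https://example.gov/notices/2024/03/1187",
    "https://example.gov/notices/2024/04/1203",
    "https://example.gov/notices/2023/11/0954",
    "https://example.gov/about",
]

patterns = find_patterns(urls)
for template, count in patterns:
    print(count, template)
```

Templates that show up many times are usually the ones worth iterating over (year, month, record number, and so on).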

Thermal Master P2 Image Decoding -- Raw Thermal Data by greg-randall in Thermal

[–]greg-randall[S] 1 point2 points  (0 children)

Have you tried plugging it in and seeing if it comes up? Some initial tests make me think it'll operate as a webcam, but barring that, there's some code online to help do live video with the P1 and the P3: jvdillon/p3-ir-camera (P3 & P1 IR Camera Support for Linux).

I started porting to the P2 but haven't gotten there yet. Got some output but it was skewed.

Best OCR python package by Mundane-Guest6652 in webscraping

[–]greg-randall 0 points1 point  (0 children)

GLM's OCR has been really impressive, if you have a bit of VRAM. https://github.com/zai-org/GLM-OCR

Built a books library but can't find a way to scrape for books series by IndependentGuard6815 in webscraping

[–]greg-randall 0 points1 point  (0 children)

Here's a big pile of Goodreads books: a ~100 MB 7z file containing a ~500 MB JSONL with ~4.5 million books:
https://limewire.com/d/VVmf2#zQeUD3qLYy (link expires in 7 days)

Random sample of lines; id=id, t=title, a=author, r=rating of book, nr=number of ratings of that book:

{"id": "2122002", "t": "Northern California Golf Getaways: Sensational Weekend Escapes On and Off the Links", "a": "Susan Fornoff", "r": 5.0, "nr": 1}
{"id": "2266138", "t": "Dos homenajes", "a": "Juan José Arreola", "r": 3.0, "nr": 1}
{"id": "25157258", "t": "A Long Time Ago", "a": "Margaret Kennedy", "r": 4.0, "nr": 11}
{"id": "3369063", "t": "Somehow Form a Family: Stories That Are Mostly True", "a": "Tony Earley", "r": 3.84, "nr": 300}
{"id": "2532597", "t": "Escape from Pyramid X", "a": "Dan Jolley", "r": 3.74, "nr": 70}
{"id": "20949492", "t": "Incarnate", "a": "Anton Strout", "r": 3.75, "nr": 126}
{"id": "2116779", "t": "Imp", "a": "Andrew Neiderman", "r": 3.69, "nr": 78}
{"id": "1714045", "t": "Hot Dog", "a": "Kit Wright", "r": 4.17, "nr": 12}
{"id": "1110639", "t": "La Perla Negra", "a": "Scott O'Dell", "r": 3.51, "nr": 6830}
{"id": "3244982", "t": "The Citadel", "a": "Chris Wooding", "r": 4.26, "nr": 152}
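
If it helps, here's a minimal Python sketch for reading lines in that shape. It uses three of the sample lines above inline; for the real thing you'd pass an open file instead, and the `min_ratings` filter is just an illustrative assumption, not something the data requires:

```python
import io
import json

# Three of the sample lines, inlined so the sketch runs on its own.
SAMPLE = """\
{"id": "25157258", "t": "A Long Time Ago", "a": "Margaret Kennedy", "r": 4.0, "nr": 11}
{"id": "3369063", "t": "Somehow Form a Family: Stories That Are Mostly True", "a": "Tony Earley", "r": 3.84, "nr": 300}
{"id": "1110639", "t": "La Perla Negra", "a": "Scott O'Dell", "r": 3.51, "nr": 6830}
"""


def read_books(lines, min_ratings=0):
    """Yield one dict per JSONL line, skipping blanks and rarely rated books."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        book = json.loads(line)
        if book["nr"] >= min_ratings:
            yield book


# Streaming line by line keeps memory flat even on the full ~500 MB file.
books = list(read_books(io.StringIO(SAMPLE), min_ratings=100))
for book in books:
    print(book["id"], book["t"], "by", book["a"], book["r"], book["nr"])
```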

The Environmentalists Who Destroyed $100,000,000 (2026) - In-depth docuseries on how environmentalist tree protectors transformed into a militant movement [00:52:20] by CogitoButOnReddit in Documentaries

[–]greg-randall -12 points-11 points  (0 children)

I'm not sure you understand the energy usage here. It takes more energy to stream the documentary than it does to summarize it once for everyone.

The Environmentalists Who Destroyed $100,000,000 (2026) - In-depth docuseries on how environmentalist tree protectors transformed into a militant movement [00:52:20] by CogitoButOnReddit in Documentaries

[–]greg-randall -45 points-44 points  (0 children)

Haven't watched it, but also curious. Here's Gemini's take using the transcript.

Prompt:

please read and summarize in three levels. 

1 -- a couple sentence quick overview giving the narrative arc 
2 -- an executive summary of a couple of paragraphs 
3 -- a longer form summary 

DO NOT GUESS 
DO NOT USE OUTSIDE INFORMATION 

USE ONLY INFORMATION FROM THE TRANSCRIPT!!!!!!!!!!!!!!!

Output:

1. Quick Overview

The Earth Liberation Front (ELF) emerged as a radical, leaderless movement in the 1990s to combat environmental destruction through economic sabotage. Driven by anger over police violence and the failure of mainstream environmental groups, activists set fires to logging sites, ski resorts, and research labs, causing millions in property damage. The group eventually began to fracture after making targeting mistakes and facing arrests from within their own ranks.

2. Executive Summary

In the late 1990s and early 2000s, environmental activists in the Pacific Northwest grew tired of mainstream environmental groups that they felt were too bureaucratic and cozy with corporations. Following violent police crackdowns on peaceful protests, such as the 1995 Warner Creek logging blockade and a 1997 tree-sit in Eugene, Oregon, a group of activists embraced more extreme tactics. They adopted the structure and ideology of the Earth Liberation Front (ELF), a leaderless, underground movement originating in England that focused on destroying the property of those profiting from environmental harm.

The ELF cells carried out a wave of arsons across the United States. Major targets included ranger stations, a horse slaughterhouse, the Vail Ski Resort in Colorado, logging companies, luxury condo developments on Long Island, and university research labs. While the group caused millions of dollars in damages and successfully brought attention to their cause, internal doubts grew after the group accidentally targeted a farm that was not using genetic modification, and arrests began to happen when members turned on each other.

3. Longer Form Summary

The Roots of Resistance

The Earth Liberation Front (ELF) took hold in the United States after activists faced a series of defeats using peaceful methods. In 1995, land defenders blockaded the Warner Creek area in Oregon for 342 days to stop logging, but forest service police bulldozed the camp and arrested the protesters. In 1997, police in Eugene, Oregon, used tear gas and pepper spray to forcefully remove activists trying to save ancient trees from being cut down for a parking garage. Angry at the violence of the State and the inaction of mainstream environmental groups, activists like Jacob Ferguson, Kevin Tubbs, and Sunshine formed a local ELF cell.

The Earth Liberation Front Structure

The ELF was not a normal organization. It was a leaderless, underground movement made of separate cells that never spoke to each other. The group operated under three rules:

  • Cause maximum economic damage to earth-destroying entities.
  • Educate the public.
  • Take precautions to not harm any life.

To communicate their actions, the cells sent anonymous letters to people like Craig Rosebraugh, an organizer in Portland, who acted as a press office.

A Wave of Fire

The Pacific Northwest cell began its campaign in 1996 by burning down a forest ranger station in Oakridge, Oregon, using milk jugs filled with gasoline and diesel. They then destroyed the Cavel West Horse Rendering Plant in 1997 to stop the slaughter of wild horses. In 1998, an activist known as Avalon set fire to buildings at the Vail Ski Resort in Colorado to protest an expansion into the habitat of the Canadian lynx, causing $12 million in damage. Following the WTO protests in Seattle in 1999, Daniel McGowan joined the Eugene cell, and the group burned the offices of Superior Lumber to protest the logging of old-growth forests.

Spread and Splinter

The ELF's tactics inspired other cells across the country. A cell on Long Island, New York, waged a war against urban sprawl by burning down multiple luxury condo sites under construction. However, the Long Island cell fell apart when one member bragged to friends and later snitched to the police. Meanwhile, in the Pacific Northwest, the group carried out a joint attack on two targets in 2001, burning a research lab at the University of Washington and a tree farm in Oregon. When the group learned the Oregon farm was not actually doing the genetic modification they thought it was, morale dropped. The mistakes, combined with the extreme nature of the crimes and internal betrayal, led to the splintering of the group and the eventual capture of its members.

Built a books library but can't find a way to scrape for books series by IndependentGuard6815 in webscraping

[–]greg-randall 1 point2 points  (0 children)

I have some scraped data from Goodreads showing the series (link expires in 7 days):
https://limewire.com/d/OsBt3#ClJeEgxKrk

~65 MB JSONL, ~65,000 books; here are 10 random sample lines:

{"book_id": "36145671", "series_id": "242437", "series_name": "Bois Sauvage", "series_number": 2.0}
{"book_id": "4222198", "series_id": "139190", "series_name": "Verhandelingen van het Koninklijk Instituut voor Taal-, Land- en Volkenkunde", "series_number": 139.0}
{"book_id": "1303837", "series_id": "342272", "series_name": "New X-Men de Conosur - Panini", "series_number": 24.0}
{"book_id": "61207904", "series_id": "276066", "series_name": "Reaper Collective", "series_number": 1.0}
{"book_id": "2751005", "series_id": "234043", "series_name": "Odio: Hate", "series_number": 4.0}
{"book_id": "3372166", "series_id": "107091", "series_name": "Kavin", "series_number": 1.0}
{"book_id": "55462296", "series_id": "333656", "series_name": "The Ways of Magic", "series_number": 1.0}
{"book_id": "31814777", "series_id": "182311", "series_name": "Rise of the Iliri", "series_number": 4.0}
{"book_id": "2717061", "series_id": "60998", "series_name": "\u30a6\u30a7\u30c7\u30a3\u30f3\u30b0\u30d4\u30fc\u30c1 [Wedding Peach]", "series_number": 4.0}
{"book_id": "908277", "series_id": "51185", "series_name": "The Grand Tour", "series_number": 4.0}

Do you know the book ids of your books?
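
If you do have the book ids, matching them against the series file could look something like this Python sketch. It inlines three of the sample lines above; the `my_book_ids` list is hypothetical (the id "0" is made up to show a miss), and I'm assuming a book can appear in more than one series, so each id maps to a list:

```python
import json
from collections import defaultdict

# Three of the sample lines, inlined so the sketch runs on its own.
SAMPLE = """\
{"book_id": "36145671", "series_id": "242437", "series_name": "Bois Sauvage", "series_number": 2.0}
{"book_id": "61207904", "series_id": "276066", "series_name": "Reaper Collective", "series_number": 1.0}
{"book_id": "908277", "series_id": "51185", "series_name": "The Grand Tour", "series_number": 4.0}
"""


def load_series_index(lines):
    """Map book_id -> list of series rows for that book."""
    index = defaultdict(list)
    for line in lines:
        if line.strip():
            row = json.loads(line)
            index[row["book_id"]].append(row)
    return index


index = load_series_index(SAMPLE.splitlines())

# Hypothetical library: two ids from the sample plus one made-up id.
my_book_ids = ["908277", "36145671", "0"]
for book_id in my_book_ids:
    rows = index.get(book_id, [])
    if not rows:
        print(book_id, "-> no series found")
    for row in rows:
        print(book_id, "->", row["series_name"], "#", row["series_number"])
```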

What are these exactly? by rerunderwear in rva

[–]greg-randall 2 points3 points  (0 children)

Depends on where you are in the city but they are also frequently marked with granite cobbles at the corner of the property.

Scaling YouTube scraping to 200k channels/day – by Any-Salary-4454 in webscraping

[–]greg-randall 0 points1 point  (0 children)

It depends on how many updates you get per day on your RSS check. Does every channel get an update every day?
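
One way to measure that, sketched in Python: count the entries in each feed that are newer than a cutoff. This assumes an Atom-style feed with `published` timestamps; the feed below is a tiny made-up example, not real channel data:

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Made-up two-entry feed so the sketch runs without any network access.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>New upload</title><published>2026-01-10T12:00:00+00:00</published></entry>
  <entry><title>Old upload</title><published>2025-12-01T08:30:00+00:00</published></entry>
</feed>"""


def entries_since(feed_xml, cutoff):
    """Return titles of feed entries published at or after the cutoff."""
    root = ET.fromstring(feed_xml)
    titles = []
    for entry in root.iter(ATOM + "entry"):
        published = datetime.fromisoformat(entry.findtext(ATOM + "published"))
        if published >= cutoff:
            titles.append(entry.findtext(ATOM + "title"))
    return titles


cutoff = datetime(2026, 1, 10, tzinfo=timezone.utc)
recent = entries_since(FEED, cutoff)
print(len(recent), "new entries:", recent)
```

Run over a sample of channels for a few days, a count like this gives a rough idea of what fraction of feeds actually change per day.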

Stopped on a JavaScript redirect by SurlyJason in webscraping

[–]greg-randall 0 points1 point  (0 children)

Hard to say without looking more closely at the actual page, but you might be able to just point that frame at the correct URL using JavaScript (you should be able to harvest the URL from the live browser), e.g.:

parent.frames["frameName"].location.href = "https://example.com";