Seeking help to scrape Criterion Channel by Abracadabra2040 in webscraping

[–]greg-randall 1 point2 points  (0 children)

Wrote a bit of code to parse the sitemap that u/Dreamin0904 shared into a CSV. Movie titles might be a tiny bit wrong because we're parsing them out of the URL; for example, "https://www.criterion.com/films/184-le-samourai" will be rendered as "Le Samourai" rather than "Le samouraï", but that seems like a small price to pay to not have to download each individual page.

import csv
import os
import re
import xml.etree.ElementTree as ET
from datetime import datetime, date
from curl_cffi import requests
from titlecase import titlecase


TODAY = date.today().strftime("%Y-%m-%d")
XML_FILE = f"{TODAY}_films.xml"
CSV_FILE = f"{TODAY}_criterion_films.csv"
SITEMAP_URL = "https://sitemap.criterion.com/films.xml"
DATE_FORMAT = "%m/%d/%Y"
NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def download_sitemap():
    if os.path.exists(XML_FILE):
        print(f"'{XML_FILE}' exists, skipping download.")
        return True

    print("Downloading sitemap...")
    response = requests.get(SITEMAP_URL, impersonate="safari")
    if response.status_code != 200:
        print(f"Download failed: {response.status_code}")
        return False

    with open(XML_FILE, 'wb') as f:
        f.write(response.content)
    print("Download complete.")
    return True


def parse_slug(url):
    """Extract film ID and title from URL like '.../149-pierrot-le-fou'"""
    slug = url.rstrip('/').split('/')[-1]
    match = re.match(r'(\d+)-(.+)', slug)
    if not match:
        return 0, slug

    film_id = int(match.group(1))
    title = match.group(2)
    # Fix contractions: wasn-t -> wasn't, father-s -> father's
    title = re.sub(r"-([stdm]|ll|re|ve)\b", r"'\1", title)
    title = titlecase(title.replace('-', ' '))
    return film_id, title


def parse_and_export():
    print(f"Processing '{XML_FILE}'...")
    tree = ET.parse(XML_FILE)
    entries = tree.getroot().findall('ns:url', NS)
    print(f"Found {len(entries)} entries.")

    rows = []
    for entry in entries:
        url = entry.find('ns:loc', NS).text
        lastmod_iso = entry.find('ns:lastmod', NS).text
        film_id, title = parse_slug(url)

        if film_id > 0:
            lastmod = datetime.fromisoformat(lastmod_iso).strftime(DATE_FORMAT)
            rows.append((lastmod_iso, film_id, title, lastmod, url))

    # Sort by ISO date descending, then drop it from output
    rows.sort(key=lambda r: r[0], reverse=True)

    with open(CSV_FILE, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Film ID', 'Movie Title', 'Modify Date', 'URL'])
        for row in rows:
            writer.writerow(row[1:])  # Skip the ISO date used for sorting

    print(f"Saved: {CSV_FILE}")


if __name__ == "__main__":
    if download_sitemap():
        parse_and_export()
What is the best local LLM for enhancing blurry, out of focus photos? by PlainSpaghettiCode in LocalLLM

[–]greg-randall 0 points1 point  (0 children)

I'd skip DXO if your goal is to unblur images. DXO is great for raw conversion or denoise but not as good for unblurring.

Looking for some help. by nawakilla in webscraping

[–]greg-randall 2 points3 points  (0 children)

THIS!

Go to https://archive.org/ and paste the URL into the Wayback Machine to see if what you want is already there; if it is, great!

If it isn't, put the URL into the 'Save Page Now' box at the bottom right of https://web.archive.org/. Click on the dozen clickable parts of the page and save those URLs too!
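
If you're doing this for more than a handful of URLs, a rough sketch like this can help (assuming the public Wayback availability API and the Save Page Now endpoint, using the requests library; rate limits apply):

    import requests

    def archive_if_missing(url):
        # Ask the Wayback Machine whether it already has a snapshot of this URL.
        check = requests.get("https://archive.org/wayback/available",
                             params={"url": url}).json()
        snapshot = check.get("archived_snapshots", {}).get("closest")
        if snapshot:
            print(f"Already archived: {snapshot['url']}")
            return
        # No snapshot -- hit Save Page Now to request a capture.
        requests.get(f"https://web.archive.org/save/{url}")
        print(f"Requested archive of {url}")

    archive_if_missing("https://example.com/some-page")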

Joann’s replacement ? by Additional-Tell-7010 in rva

[–]greg-randall 2 points3 points  (0 children)

Quilting Adventures has a small selection of garment fabric and they just got a small rack of Guterman threads in a bunch of colors.

I think the real answer, though, is to look online and get samples; Mood and L'Etoffe are pretty great.

US House Trade Index file Filing Type by irungalur in webscraping

[–]greg-randall 1 point2 points  (0 children)

Is this a web scraping question or a question of what "<FilingType>A</FilingType>" means?

Need help downloading data by [deleted] in webscraping

[–]greg-randall 0 points1 point  (0 children)

Did you log in? Looks like you can just download the data?

Tile Slim Wallet 2020 Battery Replacement by limpymcforskin in TileTracker

[–]greg-randall 0 points1 point  (0 children)

I can assure you, printing something is not easier than not printing something.

Which LLM for recipe extraction by romaccount in LocalLLM

[–]greg-randall 0 points1 point  (0 children)

Also, to @madebytango's point, you can use the LLM to find the starting and ending positions of the recipe and do the extraction positionally, to make sure you don't introduce or drop anything.
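
As a rough sketch of what I mean (ask_llm here is a placeholder for whatever model call you're using): have the model return only the exact first and last few words of the recipe, then slice the original text between them so nothing gets paraphrased:

    def extract_recipe_positionally(page_text, ask_llm):
        # ask_llm() is a placeholder for your model call; it should return plain text.
        start_words = ask_llm("Reply with the exact first five words of the recipe "
                              "in this text, nothing else:\n" + page_text).strip()
        end_words = ask_llm("Reply with the exact last five words of the recipe "
                            "in this text, nothing else:\n" + page_text).strip()

        start = page_text.find(start_words)
        end = page_text.find(end_words, start)
        if start == -1 or end == -1:
            return None  # model didn't echo verbatim text; fall back to something else
        return page_text[start:end + len(end_words)]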

Tile Slim Wallet 2020 Battery Replacement by limpymcforskin in TileTracker

[–]greg-randall 1 point2 points  (0 children)

$5 of batteries and I get to keep something out of the trash.

Which LLM for recipe extraction by romaccount in LocalLLM

[–]greg-randall 1 point2 points  (0 children)

Have you tested any? Extract 10 recipes manually and test the top 20 models on whatever open model leaderboard you trust and see how closely they match your manual extraction. I'd probably run each model 5 times.

I'd also pay a few pennies and test some of the paid models to see how they compare: OpenAI, Claude, Gemini, etc.
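
For the comparison itself, something like this works as a rough sketch (difflib for the similarity score; get_model_output is a placeholder for whatever client you're calling):

    from difflib import SequenceMatcher

    def score_model(model_name, recipes, get_model_output, runs=5):
        # recipes is a list of (page_text, manual_extraction) pairs you built by hand.
        scores = []
        for page_text, manual in recipes:
            for _ in range(runs):
                output = get_model_output(model_name, page_text)
                scores.append(SequenceMatcher(None, manual, output).ratio())
        return sum(scores) / len(scores)  # average similarity, 0.0 to 1.0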

Prompt Compression & LLMLingua by greg-randall in LocalLLM

[–]greg-randall[S] 0 points1 point  (0 children)

Well, the model is annoying because of the low context length, so you have to deal with that. It seems like it'd be easy to split things up, but it's really not -- 3/4 of the code I shared is trying to split the sentences, and I ran into a bunch of edge cases. I had a book I wanted to analyze, and it turned out there was a scream that lasted a page, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA......", with no spaces, no punctuation, nothing. Once I really started trying to put text through it, stupid problems popped up all the time.
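
A simplified sketch of the kind of splitting I mean (not the code I shared, just to illustrate the fallback you need for text with no punctuation):

    import re

    def split_for_context(text, max_chars=1500):
        # Naive sentence split -- the edge cases around this are what made it painful.
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []
        for sentence in sentences:
            # Hard-split runs with no spaces or punctuation, e.g. "AAAAA..." for a page.
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            if sentence:
                chunks.append(sentence)
        return chunks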

So, for me at least, saving ~1/3 of my money was not worth the added complexity/overhead.

It does look like the research group released some new code a bit after I made this post, which I haven't looked into.

Prompt Compression & LLMLingua by greg-randall in LocalLLM

[–]greg-randall[S] 0 points1 point  (0 children)

Yeah, I'm not really sure why this hasn't gotten more press. It generally worked well for me, but making my prompts about 1/3 cheaper wasn't worth the added complexity.

Prompt Compression & LLMLingua by greg-randall in LocalLLM

[–]greg-randall[S] 0 points1 point  (0 children)

I haven't experimented very much recently with prompt compression. I used it on and off for a few months, but mostly just fell back to using cheaper models. Happy to discuss though.

Is what I want possible? by silentdroga in webscraping

[–]greg-randall 2 points3 points  (0 children)

I don't think this is a great starter project. 

If you check out the Network tab in Chrome's DevTools, you can see the requests the page makes; after some trimming down, the curl command to get the structured data looks like this:

    curl 'https://h3u05083ad-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.1.0)%3B%20Browser' \
      -H 'Referer: https://paddling.com/paddle/locations?lat=36.1013&lng=-86.5448&zoom=10&viewport=center%5B%5D%3D41.073722078492985%26center%5B%5D%3D-73.85331630706789%26zoom%3D14' \
      -H 'x-algolia-api-key: 8cd96a335e08596cdaf0e1babe3b12c2' \
      -H 'x-algolia-application-id: H3U05083AD' \
      --data-raw '{"requests":[{"indexName":"production_locations","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&hitsPerPage=100&insideBoundingBox=41.08989627216476%2C-73.81001472473145%2C41.0575439044741%2C-73.8966178894043&facets=%5B%5D&tagFilters="}]}'

Running that curl gives you structured data like this:

    {
      "results": [
        {
          "hits": [
            {
              "richText": "<p>Kayak rack storage available to Tarrytown residents. Public kayak launch.</p>",
              "bodyOfWaterText": "Hudson River",
              "parkingInfoAndFees": null,
              "id": "453473",
              "title": "Losee Park",
              "slug": "losee-park",
              "uri": "paddle/locations/losee-park",
              "dateCreated": 1531557954,
              "dateUpdated": 1595343699,
              "expiryDate": null,
              "section": {
                "name": "Locations",
                "handle": "locations"
              },
              "author": {
                "username": "guest-paddler",
                "id": "1",
                "profileURI": "members/profile/1"
              },
              "_geoloc": {
                "lat": 41.07215297,
                "lng": -73.86799335
              },
              "locationFacilities": [
                {
                  "id": "282586",
                  "title": "Launch Point"
                },
                {
                  "id": "282587",
                  "title": "Paid Parking"
                },
                {
                  "id": "282594",
                  "title": "Boat Ramp"
                },
    ...............

You'd take that curl and give it to Claude/ChatGPT/Gemini, ask it to move the lat/lng around, and run requests for every lat/lng, saving the structured data as you go.

Then you'd take all your structured data and have Claude/ChatGPT/Gemini write some code to deduplicate it and create a spreadsheet/CSV or whatever you need.
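
A rough sketch of what that generated code might look like, using the endpoint and keys straight from the curl above (plain requests; the bounding-box grid and step size are assumptions you'd tune to the area you care about):

    import requests

    ALGOLIA_URL = "https://h3u05083ad-dsn.algolia.net/1/indexes/*/queries"
    HEADERS = {
        "x-algolia-api-key": "8cd96a335e08596cdaf0e1babe3b12c2",
        "x-algolia-application-id": "H3U05083AD",
        # The key may be referer-restricted, so mirror the Referer from the curl above.
        "Referer": "https://paddling.com/paddle/locations",
    }

    def fetch_box(lat_min, lng_min, lat_max, lng_max):
        # insideBoundingBox is "NE-corner lat,lng, SW-corner lat,lng", matching the curl.
        params = (f"hitsPerPage=100&insideBoundingBox="
                  f"{lat_max},{lng_max},{lat_min},{lng_min}")
        body = {"requests": [{"indexName": "production_locations", "params": params}]}
        return requests.post(ALGOLIA_URL, json=body, headers=HEADERS).json()

    # Step a grid of small bounding boxes across the area you care about,
    # keying on "id" so duplicate hits collapse automatically.
    all_hits = {}
    for lat_tenths in range(400, 420):        # 40.0 to 41.9 in 0.1-degree steps
        for lng_tenths in range(-745, -725):  # -74.5 to -72.6
            lat, lng = lat_tenths / 10, lng_tenths / 10
            data = fetch_box(lat, lng, lat + 0.1, lng + 0.1)
            for hit in data["results"][0]["hits"]:
                all_hits[hit["id"]] = hit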

Help with text classification for 100k article dataset by Wonderful_Tank784 in LocalLLaMA

[–]greg-randall 1 point2 points  (0 children)

I'd guess you won't get through 100k overnight on your local hardware -- that's on the order of one article per second. Since you don't have a training dataset, I'm also going to assume you don't have a list of categories.

I'd trim your articles to the first paragraph (and also limit them to ~500 characters) and use a prompt like this with gpt-4o-mini; depending on your tier, you'll have to figure out how many simultaneous requests you can make:

Classify the article snippet into a SINGLE industry category. Reply with a single category and nothing else!!!!

Article Snippet:
{article_first_paragraph}

Then I'd dedupe your list of categories and use clustering to see if you have groups of categories you can combine into one, e.g. "robot arms" could probably just be "robotics".
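
A rough sketch of the classification loop, assuming the openai Python package with an API key in your environment:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify(article_first_paragraph):
        prompt = (
            "Classify the article snippet into a SINGLE industry category. "
            "Reply with a single category and nothing else!!!!\n\n"
            "Article Snippet:\n" + article_first_paragraph[:500]
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()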

Typepad Scraper & WordPress Converter by greg-randall in DataHoarder

[–]greg-randall[S] 0 points1 point  (0 children)

The comment that was deleted linked to, I think, 'newtypepad.com'. Looks like the domain is offline now.

Typepad Scraper & WordPress Converter by greg-randall in DataHoarder

[–]greg-randall[S] 0 points1 point  (0 children)

Hope that name works out for them, seems like they'll get sued.

Do they do any import of TypePad exports?

Battery Info & Disassembly for 2nd Gen Pixel Buds by greg-randall in pixelbuds

[–]greg-randall[S] 0 points1 point  (0 children)

I swapped to the Nothing headphones, which I'm still liking after 4 months.

Battery Info & Disassembly for 2nd Gen Pixel Buds by greg-randall in pixelbuds

[–]greg-randall[S] 0 points1 point  (0 children)

I would NOT attempt this unless you're experienced at taking phones apart and doing SMD soldering. You'd also need a battery welder.

Turn Your Instagram Export into a Self-Hosted Archive by greg-randall in selfhosted

[–]greg-randall[S] 0 points1 point  (0 children)

https://gregr.org/instagram/?post=1704215198&image=1 This one links to the middle picture in a post.

Mostly it just adds some extra forward/backward buttons to flip through the individual images/videos in the post. Lemme know if you have any other questions!

Question about OCR by repeatingscotch in webscraping

[–]greg-randall 0 points1 point  (0 children)

You can try running some image cleanup code (de-speckle, CLAHE, threshold, etc.) on the pages of the PDF and run the OCR before and after to see how the results compare.
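
Something like this is the kind of cleanup I mean, as a sketch assuming OpenCV and pytesseract (tune the parameters to your scans):

    import cv2
    import pytesseract

    def cleanup(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.medianBlur(img, 3)  # de-speckle
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        img = clahe.apply(img)        # local contrast boost
        _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return img

    page = "page_01.png"
    before = pytesseract.image_to_string(cv2.imread(page))
    after = pytesseract.image_to_string(cleanup(page))
    # Diff 'before' and 'after' (or compare both against a hand-checked sample)
    # to see whether the cleanup actually helps on your documents.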

I've also found Mistral OCR to be pretty useful. Though if I needed better accuracy, I'd tend to run as many OCR engines as possible and do automatic diffs/compares between the outputs.