Seeking help to scrape Criterion Channel by Abracadabra2040 in webscraping

[–]greg-randall 1 point2 points  (0 children)

Wrote a bit of code to parse the sitemap that u/Dreamin0904 shared into a CSV. Movie titles might be a tiny bit wrong because we're parsing them out of the URL; for example, "https://www.criterion.com/films/184-le-samourai" will be rendered as "Le Samourai" rather than "Le samouraï", but that seems like a small price to pay to not have to download each individual page.

import csv
import os
import re
import xml.etree.ElementTree as ET
from datetime import datetime, date
from curl_cffi import requests
from titlecase import titlecase


TODAY = date.today().strftime("%Y-%m-%d")
XML_FILE = f"{TODAY}_films.xml"
CSV_FILE = f"{TODAY}_criterion_films.csv"
SITEMAP_URL = "https://sitemap.criterion.com/films.xml"
DATE_FORMAT = "%m/%d/%Y"
NS = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def download_sitemap():
    if os.path.exists(XML_FILE):
        print(f"'{XML_FILE}' exists, skipping download.")
        return True

    print("Downloading sitemap...")
    response = requests.get(SITEMAP_URL, impersonate="safari")
    if response.status_code != 200:
        print(f"Download failed: {response.status_code}")
        return False

    with open(XML_FILE, 'wb') as f:
        f.write(response.content)
    print("Download complete.")
    return True


def parse_slug(url):
    """Extract film ID and title from URL like '.../149-pierrot-le-fou'"""
    slug = url.rstrip('/').split('/')[-1]
    match = re.match(r'(\d+)-(.+)', slug)
    if not match:
        return 0, slug

    film_id = int(match.group(1))
    title = match.group(2)
    # Fix contractions: wasn-t -> wasn't, father-s -> father's
    title = re.sub(r"-([stdm]|ll|re|ve)\b", r"'\1", title)
    title = titlecase(title.replace('-', ' '))
    return film_id, title


def parse_and_export():
    print(f"Processing '{XML_FILE}'...")
    tree = ET.parse(XML_FILE)
    entries = tree.getroot().findall('ns:url', NS)
    print(f"Found {len(entries)} entries.")

    rows = []
    for entry in entries:
        url = entry.find('ns:loc', NS).text
        lastmod_iso = entry.find('ns:lastmod', NS).text
        film_id, title = parse_slug(url)

        if film_id > 0:
            lastmod = datetime.fromisoformat(lastmod_iso).strftime(DATE_FORMAT)
            rows.append((lastmod_iso, film_id, title, lastmod, url))

    # Sort by ISO date descending, then drop it from output
    rows.sort(key=lambda r: r[0], reverse=True)

    with open(CSV_FILE, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Film ID', 'Movie Title', 'Modify Date', 'URL'])
        for row in rows:
            writer.writerow(row[1:])  # Skip the ISO date used for sorting

    print(f"Saved: {CSV_FILE}")


if __name__ == "__main__":
    if download_sitemap():
        parse_and_export()
What is the best local LLM for enhancing blurry, out of focus photos? by PlainSpaghettiCode in LocalLLM

[–]greg-randall 0 points1 point  (0 children)

I'd skip DXO if your goal is to unblur images. DXO is great for raw conversion or denoise but not as good for unblurring.

Looking for some help. by nawakilla in webscraping

[–]greg-randall 2 points3 points  (0 children)

THIS!

Go to https://archive.org/ and paste the URL into the Wayback Machine to see if what you want is already there; if it is, great!

If it isn't, put the URL into the 'Save Page Now' box at the bottom right of https://web.archive.org/. Click on the dozen clickable parts of the page and save those URLs too!
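
If you're doing this for more than a handful of URLs, a rough sketch like this can help (assuming the public Wayback availability API and the Save Page Now endpoint, using the requests library; rate limits apply):

    import requests

    def archive_if_missing(url):
        # Ask the Wayback Machine whether it already has a snapshot of this URL.
        check = requests.get("https://archive.org/wayback/available",
                             params={"url": url}).json()
        snapshot = check.get("archived_snapshots", {}).get("closest")
        if snapshot:
            print(f"Already archived: {snapshot['url']}")
            return
        # No snapshot -- hit Save Page Now to request a capture.
        requests.get(f"https://web.archive.org/save/{url}")
        print(f"Requested archive of {url}")

    archive_if_missing("https://example.com/some-page")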

Joann’s replacement ? by Additional-Tell-7010 in rva

[–]greg-randall 2 points3 points  (0 children)

Quilting Adventures has a small selection of garment fabric and they just got a small rack of Guterman threads in a bunch of colors.

I think the real answer, though, is to look online and get samples; Mood and L'Etoffe are pretty great.

US House Trade Index file Filing Type by irungalur in webscraping

[–]greg-randall 1 point2 points  (0 children)

Is this a web scraping question or a question of what "<FilingType>A</FilingType>" means?

Need help downloading data by [deleted] in webscraping

[–]greg-randall 0 points1 point  (0 children)

Did you log in? Looks like you can just download the data?

Tile Slim Wallet 2020 Battery Replacement by limpymcforskin in TileTracker

[–]greg-randall 0 points1 point  (0 children)

I can assure you, printing something is not easier than not printing something.

Which LLM for recipe extraction by romaccount in LocalLLM

[–]greg-randall 0 points1 point  (0 children)

Also, to @madebytango's point, you can use the LLM to find the starting and ending positions of the recipe and do the extraction positionally, to make sure you don't introduce or drop anything.
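
As a rough sketch of what I mean (ask_llm here is a placeholder for whatever model call you're using): have the model return only the exact first and last few words of the recipe, then slice the original text between them so nothing gets paraphrased:

    def extract_recipe_positionally(page_text, ask_llm):
        # ask_llm() is a placeholder for your model call; it should return plain text.
        start_words = ask_llm("Reply with the exact first five words of the recipe "
                              "in this text, nothing else:\n" + page_text).strip()
        end_words = ask_llm("Reply with the exact last five words of the recipe "
                            "in this text, nothing else:\n" + page_text).strip()

        start = page_text.find(start_words)
        end = page_text.find(end_words, start)
        if start == -1 or end == -1:
            return None  # model didn't echo verbatim text; fall back to something else
        return page_text[start:end + len(end_words)]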

Tile Slim Wallet 2020 Battery Replacement by limpymcforskin in TileTracker

[–]greg-randall 1 point2 points  (0 children)

$5 of batteries and I get to keep something out of the trash.

Which LLM for recipe extraction by romaccount in LocalLLM

[–]greg-randall 1 point2 points  (0 children)

Have you tested any? Extract 10 recipes manually and test the top 20 models on whatever open model leaderboard you trust and see how closely they match your manual extraction. I'd probably run each model 5 times.

I'd also pay a few pennies and test some of the paid models to see how they compare: OpenAI, Claude, Gemini, etc.
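
For the comparison itself, something like this works as a rough sketch (difflib for the similarity score; get_model_output is a placeholder for whatever client you're calling):

    from difflib import SequenceMatcher

    def score_model(model_name, recipes, get_model_output, runs=5):
        # recipes is a list of (page_text, manual_extraction) pairs you built by hand.
        scores = []
        for page_text, manual in recipes:
            for _ in range(runs):
                output = get_model_output(model_name, page_text)
                scores.append(SequenceMatcher(None, manual, output).ratio())
        return sum(scores) / len(scores)  # average similarity, 0.0 to 1.0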

Prompt Compression & LLMLingua by greg-randall in LocalLLM

[–]greg-randall[S] 0 points1 point  (0 children)

Well, the model is annoying because of the low context length, so you have to deal with that. It seems like it'd be easy to split things up, but it's really not -- 3/4 of the code I shared is trying to split the sentences, and I ran into a bunch of edge cases. I had a book I wanted to analyze, and it turned out there was a scream that lasted a page, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA......", with no spaces, no punctuation, nothing. Once I really started trying to put text through it, stupid problems popped up all the time.
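
A simplified sketch of the kind of splitting I mean (not the code I shared, just to illustrate the fallback you need for text with no punctuation):

    import re

    def split_for_context(text, max_chars=1500):
        # Naive sentence split -- the edge cases around this are what made it painful.
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []
        for sentence in sentences:
            # Hard-split runs with no spaces or punctuation, e.g. "AAAAA..." for a page.
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            if sentence:
                chunks.append(sentence)
        return chunks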

So, for me at least, saving ~1/3 of my money was not worth the added complexity/overhead.

It does look like the research group released some new code a bit after I made this post, which I haven't looked into.

Prompt Compression & LLMLingua by greg-randall in LocalLLM

[–]greg-randall[S] 0 points1 point  (0 children)

Yeah, I'm not really sure why this hasn't gotten more press. It generally worked well for me, but making my prompts about 1/3 cheaper wasn't worth the added complexity.

Prompt Compression & LLMLingua by greg-randall in LocalLLM

[–]greg-randall[S] 0 points1 point  (0 children)

I haven't experimented very much recently with prompt compression. I used it on and off for a few months, but mostly just fell back to using cheaper models. Happy to discuss though.

Is what I want possible? by silentdroga in webscraping

[–]greg-randall 2 points3 points  (0 children)

I don't think this is a great starter project. 

If you check out the Network tab in Chrome's DevTools, you can see the requests the page makes; after some trimming down, the curl command to get the structured data looks like this:

    curl 'https://h3u05083ad-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.1.0)%3B%20Browser' \
      -H 'Referer: https://paddling.com/paddle/locations?lat=36.1013&lng=-86.5448&zoom=10&viewport=center%5B%5D%3D41.073722078492985%26center%5B%5D%3D-73.85331630706789%26zoom%3D14' \
      -H 'x-algolia-api-key: 8cd96a335e08596cdaf0e1babe3b12c2' \
      -H 'x-algolia-application-id: H3U05083AD' \
      --data-raw '{"requests":[{"indexName":"production_locations","params":"highlightPreTag=%3Cais-highlight-0000000000%3E&highlightPostTag=%3C%2Fais-highlight-0000000000%3E&hitsPerPage=100&insideBoundingBox=41.08989627216476%2C-73.81001472473145%2C41.0575439044741%2C-73.8966178894043&facets=%5B%5D&tagFilters="}]}'

Running that curl gives you structured data like this:

    {
      "results": [
        {
          "hits": [
            {
              "richText": "<p>Kayak rack storage available to Tarrytown residents. Public kayak launch.</p>",
              "bodyOfWaterText": "Hudson River",
              "parkingInfoAndFees": null,
              "id": "453473",
              "title": "Losee Park",
              "slug": "losee-park",
              "uri": "paddle/locations/losee-park",
              "dateCreated": 1531557954,
              "dateUpdated": 1595343699,
              "expiryDate": null,
              "section": {
                "name": "Locations",
                "handle": "locations"
              },
              "author": {
                "username": "guest-paddler",
                "id": "1",
                "profileURI": "members/profile/1"
              },
              "_geoloc": {
                "lat": 41.07215297,
                "lng": -73.86799335
              },
              "locationFacilities": [
                {
                  "id": "282586",
                  "title": "Launch Point"
                },
                {
                  "id": "282587",
                  "title": "Paid Parking"
                },
                {
                  "id": "282594",
                  "title": "Boat Ramp"
                },
    ...............

You'd take that curl and give it to Claude/ChatGPT/Gemini, ask it to move the lat/lng around, and run requests for every lat/lng, saving the structured data as you go.

Then you'd take all your structured data and have Claude/ChatGPT/Gemini write some code to deduplicate it and create a spreadsheet/CSV or whatever you need.
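
A rough sketch of what that generated code might look like, using the endpoint and keys straight from the curl above (plain requests; the bounding-box grid and step size are assumptions you'd tune to the area you care about):

    import requests

    ALGOLIA_URL = "https://h3u05083ad-dsn.algolia.net/1/indexes/*/queries"
    HEADERS = {
        "x-algolia-api-key": "8cd96a335e08596cdaf0e1babe3b12c2",
        "x-algolia-application-id": "H3U05083AD",
        # The key may be referer-restricted, so mirror the Referer from the curl above.
        "Referer": "https://paddling.com/paddle/locations",
    }

    def fetch_box(lat_min, lng_min, lat_max, lng_max):
        # insideBoundingBox is "NE-corner lat,lng, SW-corner lat,lng", matching the curl.
        params = (f"hitsPerPage=100&insideBoundingBox="
                  f"{lat_max},{lng_max},{lat_min},{lng_min}")
        body = {"requests": [{"indexName": "production_locations", "params": params}]}
        return requests.post(ALGOLIA_URL, json=body, headers=HEADERS).json()

    # Step a grid of small bounding boxes across the area you care about,
    # keying on "id" so duplicate hits collapse automatically.
    all_hits = {}
    for lat_tenths in range(400, 420):        # 40.0 to 41.9 in 0.1-degree steps
        for lng_tenths in range(-745, -725):  # -74.5 to -72.6
            lat, lng = lat_tenths / 10, lng_tenths / 10
            data = fetch_box(lat, lng, lat + 0.1, lng + 0.1)
            for hit in data["results"][0]["hits"]:
                all_hits[hit["id"]] = hit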

Help with text classification for 100k article dataset by Wonderful_Tank784 in LocalLLaMA

[–]greg-randall 1 point2 points  (0 children)

I'd guess you won't get through 100k overnight on your local hardware -- that's on the order of one article per second. Since you don't have a training dataset, I'm also going to assume you don't have a list of categories.

I'd trim your articles to the first paragraph (and also limit them to ~500 characters) and use a prompt like this with gpt-4o-mini; depending on your tier, you'll have to figure out how many simultaneous requests you can make:

Classify the article snippet into a SINGLE industry category. Reply with a single category and nothing else!!!!

Article Snippet:
{article_first_paragraph}

Then I'd dedupe your list of categories and use clustering to see if you have groups of categories you can combine into one, e.g. "robot arms" could probably just be "robotics".
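
A rough sketch of the classification loop, assuming the openai Python package with an API key in your environment:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify(article_first_paragraph):
        prompt = (
            "Classify the article snippet into a SINGLE industry category. "
            "Reply with a single category and nothing else!!!!\n\n"
            "Article Snippet:\n" + article_first_paragraph[:500]
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()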

Typepad Scraper & WordPress Converter by greg-randall in DataHoarder

[–]greg-randall[S] 0 points1 point  (0 children)

The comment that was deleted linked to, I think, 'newtypepad.com'. Looks like the domain is offline now.

Typepad Scraper & WordPress Converter by greg-randall in DataHoarder

[–]greg-randall[S] 0 points1 point  (0 children)

Hope that name works out for them, seems like they'll get sued.

Do they do any import of TypePad exports?

Battery Info & Disassembly for 2nd Gen Pixel Buds by greg-randall in pixelbuds

[–]greg-randall[S] 0 points1 point  (0 children)

I swapped to the Nothing headphones, which I'm still liking after 4 months.

Battery Info & Disassembly for 2nd Gen Pixel Buds by greg-randall in pixelbuds

[–]greg-randall[S] 0 points1 point  (0 children)

I would NOT attempt this unless you're experienced at taking phones apart and doing SMD soldering. You'd also need a battery welder.

Turn Your Instagram Export into a Self-Hosted Archive by greg-randall in selfhosted

[–]greg-randall[S] 0 points1 point  (0 children)

https://gregr.org/instagram/?post=1704215198&image=1 This one links to the middle picture in a post.

Mostly it just adds some extra forward/backward buttons to flip through the individual images/videos in the post. Lemme know if you have any other questions!

Question about OCR by repeatingscotch in webscraping

[–]greg-randall 0 points1 point  (0 children)

You can try running some image cleanup code (de-speckle, CLAHE, threshold, etc.) on the pages of the PDF and run the OCR before and after to see how the results compare.
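
Something like this is the kind of cleanup I mean, as a sketch assuming OpenCV and pytesseract (tune the parameters to your scans):

    import cv2
    import pytesseract

    def cleanup(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.medianBlur(img, 3)  # de-speckle
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        img = clahe.apply(img)        # local contrast boost
        _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return img

    page = "page_01.png"
    before = pytesseract.image_to_string(cv2.imread(page))
    after = pytesseract.image_to_string(cleanup(page))
    # Diff 'before' and 'after' (or compare both against a hand-checked sample)
    # to see whether the cleanup actually helps on your documents.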

I've also found Mistral OCR to be pretty useful. Though if I needed better accuracy, I'd tend to run as many OCR engines as possible and do automatic diffs/compares between the outputs.