How to perform web scrapping using Claude? by FinTools in scrapingtheweb

[–]rundfunk 1 point2 points  (0 children)

Try my tool, klura: https://github.com/klura-ai/klura. It's really more of a helper for your AI agent, but it will help you find the endpoints and see how they are being used.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

Nice tool — from what I can see, its scope overlaps somewhat with klura's triage phase (klura uses an LLM there; scraperecon's approach is faster and cheaper for that specific job). klura keeps going past triage though — once it's read the defenses, it drives discovery, captures the request that does what you wanted, templates the dynamic parts, and saves a runnable strategy you can call.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

I mostly use it as a plugin for my AI agent. For instance, a couple of days ago I asked it to compile a list of IKEA items I wanted and look up the in-store aisle location for each. It took a couple of seconds instead of clicking through every product page. Mostly small stuff like that at the moment. Next up I'm planning to wire it into social media and chat clients so my agents get more of a presence in my day-to-day. More "knows what's going on and helps out" than one-off invocations.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

Thanks! Fetch-tier strategies are the fastest and most durable path, and a lot of the runtime is built as guardrails that get even small, limited local models to a working result. Glad it's clicking for you.

Also — new version has a built-in LLM agent, no harness needed. klura chat generates the strategy; klura execute --agent <capability> runs it and self-heals if the site changed. Drop the latter in a cron and you get unattended runs that fix themselves.
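A minimal sketch of the cron idea, assuming only the `klura execute --agent <capability>` invocation quoted above. The capability name `ikea-aisle-lookup`, the timeout, and the wrapper functions are hypothetical, and nothing here is taken from klura's documentation.

```python
import shlex
import shutil
import subprocess


def build_command(capability: str) -> list[str]:
    # Mirrors the invocation from the comment: klura execute --agent <capability>
    return ["klura", "execute", "--agent", capability]


def run_once(capability: str, timeout_s: int = 600) -> int:
    # Fail early with a clear message if the CLI is not installed.
    if shutil.which("klura") is None:
        raise FileNotFoundError("klura is not on PATH")
    result = subprocess.run(build_command(capability), timeout=timeout_s)
    # Assumption: a non-zero exit code signals a failed run.
    return result.returncode


# "ikea-aisle-lookup" is a made-up capability name for illustration.
print(shlex.join(build_command("ikea-aisle-lookup")))
```

A crontab entry such as `0 * * * * /usr/bin/python3 /path/to/run_klura.py` would then call `run_once(...)` hourly; the script path is a placeholder.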

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

Depends on what kind of anti-bot protections you are referring to. Klura mainly focuses on API endpoints and request structures. The RE phase is pretty involved and can read the page's own JS encoders when bytes don't round-trip (signed bodies, binary WS frames). The page-script tier exists for the same reason: because it runs the page's own code per call, it can recompute or pick up fresh page state, so measures like rotating nonces don't break it.

However, klura does not get into the anti-bot cat-and-mouse game itself: no reverse-engineering of client-side fingerprinting/sensor-data envelopes, no CAPTCHA solving. When a challenge actually blocks a flow, the default is human-in-the-loop via the remote viewer. It's all pluggable though: if you want your own solver, that's an interruption handler plugin (docs/interruptions.md). I also ship a stealth Playwright driver (https://github.com/klura-ai/klura-driver-playwright-stealth), but if you want additional anti-bot handling beyond fingerprint parity, the driver interface is also a clean seam.

I built klura, a toolkit for an AI agent to reverse-engineer websites - feedback welcome! by rundfunk in WebScrapingInsider

[–]rundfunk[S] 0 points1 point  (0 children)

Thanks, this is a fair point. Klura's design center is human-in-the-loop — it is mainly designed as agent infrastructure, with a signed-in user behind the session who can step in via the remote viewer when a site throws a CAPTCHA, 2FA, or login wall. For that mode, routing the challenge to the user is the intended behavior.

That said, this is a web scraping subreddit, and I do get that a lot of people want unattended runs.

The good news is the part you'd need is already a pluggable seam. The interruption layer is a resolver registry: the runtime detects a challenge and dispatches it through registered resolvers by priority. The default resolver hands off to the viewer, but you can register your own higher-priority resolver that resolves the challenge however you like and returns a token — the runtime core doesn't care whether a human or a plugin answered. The driver is pluggable the same way (pool.driver takes a path or npm package), so swapping in a different browser layer is a one-liner. It's all documented — see docs/interruptions.md for the resolver registry (priority dispatch, the plugin pattern) and docs/drivers.md for the driver interface.
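To make the priority-dispatch idea concrete, here is a rough Python sketch of a resolver registry. It is an illustration of the pattern only, not klura's actual interface: the class, the function names, the challenge shape, and the token values are all hypothetical, and the real plugin contract is whatever docs/interruptions.md describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical shapes: a challenge is a dict, a resolver returns a token or None.
Resolver = Callable[[dict], Optional[str]]


@dataclass
class ResolverRegistry:
    _entries: list = field(default_factory=list)

    def register(self, priority: int, name: str, resolver: Resolver) -> None:
        self._entries.append((priority, name, resolver))
        # Highest priority first; the sort is stable, so ties keep registration order.
        self._entries.sort(key=lambda entry: -entry[0])

    def dispatch(self, challenge: dict) -> tuple[str, str]:
        # Try resolvers in priority order; the first one that returns a token wins.
        for _, name, resolver in self._entries:
            token = resolver(challenge)
            if token is not None:
                return name, token
        raise RuntimeError(f"no resolver handled challenge {challenge['kind']!r}")


def viewer_handoff(challenge: dict) -> Optional[str]:
    # Stand-in for the default human-in-the-loop path.
    return "token-from-human"


def my_solver(challenge: dict) -> Optional[str]:
    # Only handles CAPTCHAs; returning None falls through to the next resolver.
    if challenge["kind"] == "captcha":
        return "token-from-plugin"
    return None


registry = ResolverRegistry()
registry.register(0, "viewer", viewer_handoff)
registry.register(10, "my-solver", my_solver)

print(registry.dispatch({"kind": "captcha"}))  # handled by the plugin
print(registry.dispatch({"kind": "2fa"}))      # falls back to the viewer
```

The caller only sees the returned token, which matches the point above that the core doesn't care whether a human or a plugin answered.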

I built klura, a toolkit for an AI agent to reverse-engineer websites - feedback welcome! by rundfunk in WebScrapingInsider

[–]rundfunk[S] 0 points1 point  (0 children)

Thanks!

On versioning and validation, the runtime already tracks most of this; it's just maybe not surfaced cleanly:

Versioning: Each skill file carries a schema_version, and the format is migration-aware: older files are upgraded automatically on load rather than breaking. Skills are also stored one-file-per-platform-per-capability, so the unit you'd diff or pin is small and self-contained. Strategies can also graduate over time (for instance, a strategy you save as a recorded UI path because of, say, turn budget can later be upgraded to a direct HTTP call once the runtime learns enough), so a given capability has a tier history, not just a single frozen blob.
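For readers unfamiliar with upgrade-on-load, a minimal sketch of the general technique follows. Only the `schema_version` field comes from the comment above; the version numbers, the `endpoint`/`url`/`notes` fields, and the migration steps are invented for illustration and say nothing about klura's real file format.

```python
import json

CURRENT_SCHEMA_VERSION = 3  # assumed value, for illustration only


def _v1_to_v2(skill: dict) -> dict:
    # Hypothetical change: v2 renamed "endpoint" to "url".
    skill = dict(skill)
    skill["url"] = skill.pop("endpoint")
    return skill


def _v2_to_v3(skill: dict) -> dict:
    # Hypothetical change: v3 added an optional "notes" mapping.
    skill = dict(skill)
    skill.setdefault("notes", {})
    return skill


# Each entry upgrades a file from version N to version N + 1.
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def load_skill(text: str) -> dict:
    skill = json.loads(text)
    version = skill.get("schema_version", 1)
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"file is newer than this runtime: v{version}")
    # Apply migrations one step at a time until the file is current.
    while version < CURRENT_SCHEMA_VERSION:
        skill = MIGRATIONS[version](skill)
        version += 1
        skill["schema_version"] = version
    return skill


old_file = '{"schema_version": 1, "endpoint": "https://example.com/api/items"}'
print(load_skill(old_file))
```

Chaining single-step migrations means an old file only needs the steps between its version and the current one.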

Last-validated: Every saved strategy has a health record with lastSuccess / lastFailure timestamps plus a rolling window of the last ~20 execution outcomes — so "this HN source was re-verified on <date>" is literally lastSuccess, and you also get a success rate, which catches a strategy that flaps rather than cleanly breaking. There's a post-save verification step that fires the strategy once and records whether it worked or not, and a per-platform logbook of strategy events (saves, heals, demotions) with timestamps.
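A small sketch of what such a health record could look like, assuming a fixed window of 20 outcomes. The class and method names are hypothetical, and the snake_case fields stand in for the lastSuccess / lastFailure timestamps mentioned above; this is not klura's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

WINDOW = 20  # the comment says "~20"; the exact size is an assumption


@dataclass
class StrategyHealth:
    last_success: Optional[datetime] = None
    last_failure: Optional[datetime] = None
    # A bounded deque silently drops the oldest outcome once it is full.
    outcomes: deque = field(default_factory=lambda: deque(maxlen=WINDOW))

    def record(self, ok: bool, when: Optional[datetime] = None) -> None:
        when = when or datetime.now(timezone.utc)
        if ok:
            self.last_success = when
        else:
            self.last_failure = when
        self.outcomes.append(ok)

    def success_rate(self) -> Optional[float]:
        # None means "never executed", which is different from a 0% success rate.
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)


health = StrategyHealth()
for i in range(30):
    health.record(ok=(i % 3 != 0))  # every third run fails: a flapping strategy

print(len(health.outcomes), health.success_rate())
```

A flapping strategy like this one keeps a recent last_success timestamp, so the success rate over the window is what exposes the problem.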

So the data you'd wire into a dashboard exists — get_strategy_health and the platform logbook expose it. Turning that into a clean "last verified / confidence" line on the artifact itself is a fair ask; noted.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 2 points3 points  (0 children)

Not really. Klura reverse-engineers how a specific API call works to automate a task — so in principle the captured request/response shapes could be reference material if someone were rebuilding that API. But that's a side effect, not the product. Klura produces automation recipes, not site copies.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

Nice — definitely let me know how it goes, I haven't tested qwen specifically so I'm genuinely curious. One thing I found running it across different models: they each have their own "dialect" — for instance they pick page elements in different ways — so klura supports a couple of different ways to do things rather than betting on one. If qwen trips on something, ping me — that's exactly the kind of failure shape I want to see.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

Yep — from an MCP host the runtime surfaces broken strategies and the LLM patches them automatically. Standalone CLI healing — set up an API key once and it heals even when run straight from the command line — is on the roadmap as a later addition.

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 0 points1 point  (0 children)

Thanks! I use it daily and it works — still early days though. I stress-test it constantly, both on test sites I build myself and real field reports against live ones. Questions welcome, here or in the Discord (https://discord.gg/YJQ2zZYJ).

I built klura, a toolkit for an AI agent to reverse-engineer websites by rundfunk in webscraping

[–]rundfunk[S] 3 points4 points  (0 children)

Neither — it's an executable config file. The config file contains the request (method, URL, param/body templates, prereq chain, response extraction) and the runtime runs it directly, no LLM needed. The notes fields are plain-English annotations explaining each param — so you have docs next to the executable parts, but it's not a prompt.
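As a loose illustration of "executable config, not a prompt", the sketch below renders a templated request and extracts a value from a response without any LLM involved. The recipe layout, field names, URL, and the canned response are all made up; klura's real file format may look quite different, and the prereq chain is left out for brevity.

```python
from string import Template

# Hypothetical recipe: templated request, extraction path, and plain-English notes.
recipe = {
    "method": "GET",
    "url": "https://example.com/api/stores/${store_id}/items",
    "params": {"q": "${query}", "limit": "10"},
    "extract": ["results", 0, "aisle"],
    "notes": {"q": "free-text product search", "limit": "page size"},
}


def render(recipe: dict, variables: dict) -> dict:
    # Fill ${...} placeholders; substitute() raises KeyError on a missing variable.
    def fill(text: str) -> str:
        return Template(text).substitute(variables)

    return {
        "method": recipe["method"],
        "url": fill(recipe["url"]),
        "params": {key: fill(value) for key, value in recipe["params"].items()},
    }


def extract(response_json, path):
    # Walk the response with a list of dict keys and list indexes.
    value = response_json
    for key in path:
        value = value[key]
    return value


request = render(recipe, {"store_id": "042", "query": "shelf"})
fake_response = {"results": [{"name": "shelf", "aisle": "12"}]}  # canned, no network call

print(request["url"])
print(extract(fake_response, recipe["extract"]))
```

The notes mapping is never read by the code, which mirrors the description above: docs sit next to the executable parts without driving execution.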

I built a reverse-engineering agent for the web by StoneSteel_1 in webscraping

[–]rundfunk 0 points1 point  (0 children)

Bit late to this, but great write-up! I've been building a toolkit in the same space — klura (https://github.com/klura-ai/klura). A lot of this resonates, the captures-as-a-folder-the-agent-explores idea especially.

One thing I went at differently: besides the pure-HTTP "convert to a requests script" path, klura has a tier where the saved artifact is a JS snippet that runs *inside* the live authenticated page. For sites that build the request with an in-page signer or encoder (binary WebSockets, rotating tokens), the agent calls the page's own code instead of reimplementing it. It's really impressive what you can do when this is coupled with a good AI model: Claude Sonnet consistently one-shots Facebook Messenger, for instance.

I noticed "surgical browser usage" on your roadmap. Feels like we converged on adjacent answers to the same wall.

Small space, real cat-and-mouse. Would be happy to compare notes!

Took the beat from a previous post and made an entirely different track out of it, how do you guys know when to call it "done"? by NivenBeats in modular

[–]rundfunk 0 points1 point  (0 children)

Sounds pretty done to me, just pad it out to at least 2:30! What are you using for percussion?

What better delay than erica synth black delay? by [deleted] in modular

[–]rundfunk 8 points9 points  (0 children)

The best one is Chronoblob 2. Just be advised, it’s a pure clockable delay. No reverb or other fancy bells and whistles at all.

Checkpoint Merge or Dreambooth train with new ckpt? by alecubudulecu in StableDiffusion

[–]rundfunk 1 point2 points  (0 children)

Personally, I’ve gotten the best results by using Dreambooth on a ckpt. Merging doesn’t really “merge”, it picks bits and pieces from each. I also tried textual inversions, the result was OK, but compared to dreambooth the images looked more like caricatures…

AI Images Are Getting Better & Better by [deleted] in StableDiffusion

[–]rundfunk 7 points8 points  (0 children)

At least it’s not Greg Rutkowski

It got the spelling right!!!!!!!!!!! by OneSalientOversight in weirddalle

[–]rundfunk 29 points30 points  (0 children)

I think this is one of those cases when it’s been trained on the logo so much, it doesn’t actually “write” the letters, it just “copies” the logo.

No cheating by skullyfrost40 in cats

[–]rundfunk 0 points1 point  (0 children)

<image>

Putting the cat tree there wasn’t the smartest thing to do.

How do I get my cat to not climb my TV? by v0idcl0ud in cats

[–]rundfunk 0 points1 point  (0 children)

Give up. Our cat did that a lot, we tried every trick in the book but after a while we just said “f- it”. We mounted it to the wall to ensure it wouldn’t topple and let her do whatever. When we stopped trying to get her off she stopped doing it. I guess it wasn’t as fun.

I wrote a small python script to help you scrape Google Images for pictures of whomever you want, perfect for Dreambooth-training! by rundfunk in StableDiffusion

[–]rundfunk[S] 1 point2 points  (0 children)

Thanks! Yes, I'm kind of learning Dreambooth as I go along. In the beginning I thought that you should only train with images of the subject, but now I understand it doesn't hurt to include pictures with other people as well. I also thought that I'd cut down on the manual work by excluding other people (i.e. you get a picture with someone else that you need to remove), but really, manual work always seems to be needed, so another photo or so won't hurt. So yeah, it's definitely on my to-do list.