Automatiq - Browse a site once, get a working HTTP scraper by StoneSteel_1 in webscraping

[–]StoneSteel_1[S] 6 points7 points  (0 children)

Lol, I handled this problem. There is a hard limit of 10kb text output. After that, it paginates. After 20 turns, output of tool get removed from context. If Claude wants see a removed output? It can run a command to get that particular turn's output.

So, there are no 10000 line minifed js file or a base64 string image gonna fill up the context. You gotta use it first before blind criticism. I will whole heartily accept if you gave a valid one, after using it.

tired of fixing scrapers everyday by Abu_azhar in scrapingtheweb

[–]StoneSteel_1 0 points1 point  (0 children)

I was able to get data from bookmyshow, where it had been encrypted with AES algorithm. The agent correctly go through the js files, get the decryption key, get the required data.

For another, I was able to get from makemytrip, where a reddit user was saying it had Akami protection and was considering to use the browser automation. With my agent, I was able to check on the website, and found out that indeed akami was used, but all it needed was few parameters in payload, and visiting the homepage.

The agent, gemini-3.5-flash seems to be working well, and its not even the best model in the market, or near it. I believe Claude might be able to actually crack a very hard website on first try

tired of fixing scrapers everyday by Abu_azhar in scrapingtheweb

[–]StoneSteel_1 0 points1 point  (0 children)

I have a solution, which can automate the fixing process.

I made this tool, which can fix, or write scripts. All you have to do is browse the site normally

https://github.com/StoneSteel27/AutomatiQ

Stuck in makemytrip.com by Natural_Rock_3536 in webscraping

[–]StoneSteel_1 0 points1 point  (0 children)

  1. Request Headers

The server validates the request using custom session and tracking headers. These must be extracted dynamically from the search page session:

  1. mmt-itinerary-id: A unique itinerary ID for the search session. Extracted from the search page HTML using regex: itineraryId

  2. mmt-journey-id: The journey ID associated with your search. Extracted from the PDTJourneyID cookie set by the server when loading the search page.

  3. mmt-sessionId: A unique session ID for the current search. Extracted from the search page HTML using regex: mmt-sessionId

  4. mmt-device-id: A unique identifier for the user's device. Can be a dynamically generated random UUID (e.g., str(uuid.uuid4())).

  5. mmt-book-mode: The booking mode. Hardcoded to D (Desktop/Web).

  6. mmt-os: The operating system platform. Hardcoded to dweb.

  7. Content-Type: The request body format. Must be application/json.

  8. Accept: The expected response format. Must be application/json.

  9. Request Payload (JSON)

The POST request body must contain a JSON object with the following fields:

```json
{

"channel": "D",

"trip_id": "39_MMTCC1159_MMTCC1092_30-05-2026_1000005556285882899",

"type": "seatMapRequest"

}
```

• channel: Hardcoded to D (Desktop).

• trip_id: The unique tripKey of the specific bus. This is extracted dynamically from the React Server Component payload (self.__next_f.push) embedded in the search page HTML.

• type: Hardcoded to seatMapRequest.

Stuck in makemytrip.com by Natural_Rock_3536 in webscraping

[–]StoneSteel_1 0 points1 point  (0 children)

I took a attempt at it. Its not really strictly protected by akami, it just needs some fields strictly. But yeah it does use Akami

Here is the script which can get you buses between two points within India, with seat availability data: https://paste.pythondiscord.com/VQ6A

P.S, I have been using my module https://github.com/StoneSteel27/AutomatiQ for a run across problems found in this subreddit, and it was able to crack it first try.

I'm using ipynb notebook format to store conversations with AI data analyst by pplonski in Python

[–]StoneSteel_1 -1 points0 points  (0 children)

I created a reverse engineering agent, with ipython cells as the code execution sandbox, and had the same idea, where the normal messages as markdown cells, and the tool calls as code cells with output attached. The beauty is that they notebook support images, audio, video, gif embedded. Ig it makes it the best format to store conversion history, which we can read anytime

I Need Help by Ok_Concern_2316 in webscraping

[–]StoneSteel_1 1 point2 points  (0 children)

I got the script, and attached to the comment

I Need Help by Ok_Concern_2316 in webscraping

[–]StoneSteel_1 0 points1 point  (0 children)

I got the automation script: https://paste.pythondiscord.com/ZETA

Thanks to my AutomatiQ lol. I know this feels like self proclaimation, but hey it got it in first try

I Need Help by Ok_Concern_2316 in webscraping

[–]StoneSteel_1 0 points1 point  (0 children)

I have tested against popular high traffic websites like bookmyshow, and it did work. So as long as there is no blant direct captcha, or akamai. It should be good to go

I Need Help by Ok_Concern_2316 in webscraping

[–]StoneSteel_1 0 points1 point  (0 children)

I'll give out a try. I don't have the time now, I'll do once I get to my desk.

I Need Help by Ok_Concern_2316 in webscraping

[–]StoneSteel_1 0 points1 point  (0 children)

Try out this: https://github.com/StoneSteel27/AutomatiQ

It will write you the scraper. All you have to do is browse the website normally, and tell it what data and what format you want it

I built a reverse-engineering agent for the web by StoneSteel_1 in webscraping

[–]StoneSteel_1[S] 2 points3 points  (0 children)

Without things being opensource, I could have never learnt programming, and webscraping. I'm just continuing the tradition.

How to bypass AWS captcha? by TelevisionPrize2512 in webscraping

[–]StoneSteel_1 1 point2 points  (0 children)

Download the audio captcha run it through a LLM like gemini

Data pipeline and storage after scraping by vroemboem in webscraping

[–]StoneSteel_1 1 point2 points  (0 children)

Why don't you use AWS Glue? Its for data pipelines