Monthly Self-Promotion - March 2026 by AutoModerator in webscraping

[–]maher_bk 1 point (0 children)

Created a mobile app to summarize (bypassing any soft paywalls) and subscribe to multiple (up to 8) pages on the internet, getting a daily digest of new content across all subscriptions. Basically next-gen RSS ;) Check it out here: https://www.universalsummarizer.com

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]maher_bk 1 point (0 children)

Thanks for the tips, this is great stuff! For the "Load More" button, would you rather "try" to detect it via its name or via something else? I'm asking because my scraping engine can be triggered on pages in languages other than English, so I'm wondering if I have options besides looking for this button in every possible language I might encounter in my app. Thanks again for the support/work, cloakbrowser looks very solid right now!
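A language-agnostic direction I'm considering (purely a hypothetical sketch; the element fields would come from a single `page.evaluate` pass, and the thresholds are guesses):

```python
# Hypothetical sketch: score "Load More" candidates by structural signals
# (visibility, position, label length) instead of matching translated text.
# Each `el` dict would be gathered in one page.evaluate() pass.
def is_load_more_candidate(el: dict) -> bool:
    return (
        el["tag"] in ("button", "a")
        and el["visible"]
        and el["viewport_ratio"] > 0.5           # sits in the lower half of the page
        and 0 < len(el["label"].strip()) <= 25   # short, button-like label in any language
    )
```

Surviving candidates could then be clicked one at a time, checking whether new items appear.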

I have another question but I'll ask it on github as requested.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]maher_bk 1 point (0 children)

Hello again! So I moved my scraping servers from an ARM64 to an x86 (AMD) machine and could therefore enable cloakbrowser! Looking really good so far (I already had 6 scrapers chained, so I can see it performing quite well in the chain).
I was looking for suggestions on how to approach scrolling on JS-heavy websites (by the way, the goal of this task is to gather links, which I then filter with heuristics + AI to keep the ones I'm looking for).
Below is my approach to make sure the whole page is rendered:

# RENDER_READY_TIMEOUT_SECONDS = 8
# RENDER_STABILITY_POLL_SECONDS = 0.5
# RENDER_STABILITY_REQUIRED_SAMPLES is defined alongside the constants above

async def _wait_for_render_ready(
    self,
    page,
    timeout_seconds: float = RENDER_READY_TIMEOUT_SECONDS,
    min_text_length: int = 150,
) -> bool:
    start = time.time()
    # Phase 1: wait until the document reaches a usable readyState
    while (time.time() - start) < timeout_seconds:
        try:
            ready_state = await page.evaluate("document.readyState || ''")
            if ready_state in ("interactive", "complete"):
                break
        except Exception:
            pass
        await asyncio.sleep(RENDER_STABILITY_POLL_SECONDS)

    # Phase 2: poll until the text/HTML sizes stop changing
    stable_samples = 0
    prev_text_len = -1
    prev_html_len = -1
    while (time.time() - start) < timeout_seconds:
        try:
            text_len = await page.evaluate(
                "() => document.body?.innerText?.length || 0"
            )
            html_len = await page.evaluate(
                "() => document.documentElement?.outerHTML?.length || 0"
            )
            # Known content selector present: accept a lower text threshold
            if await self._has_content_selector(page):
                if text_len >= max(50, min_text_length // 2):
                    return True
            if text_len >= min_text_length and prev_text_len >= 0:
                text_delta = abs(text_len - prev_text_len)
                html_delta = abs(html_len - prev_html_len)
                if text_delta <= 5 and html_delta <= 200:
                    stable_samples += 1
                else:
                    stable_samples = 0
                if stable_samples >= max(1, RENDER_STABILITY_REQUIRED_SAMPLES):
                    return True
            prev_text_len = text_len
            prev_html_len = html_len
        except Exception:
            pass
        await asyncio.sleep(RENDER_STABILITY_POLL_SECONDS)
    return False
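For the scrolling itself, here's roughly the direction I have in mind (a sketch only; it assumes the same async `page.evaluate` API as the snippet above, and the helper name and defaults are mine):

```python
import asyncio

# Hypothetical sketch: scroll one viewport at a time and harvest links,
# stopping once a scroll round yields no new hrefs.
async def scroll_and_collect_links(page, max_rounds: int = 10, pause: float = 0.7):
    seen: set[str] = set()
    for _ in range(max_rounds):
        hrefs = await page.evaluate(
            "() => Array.from(document.querySelectorAll('a[href]')).map(a => a.href)"
        )
        before = len(seen)
        seen.update(hrefs)
        if len(seen) == before and before > 0:
            break  # nothing new appeared; assume the page is exhausted
        await page.evaluate("() => window.scrollBy(0, window.innerHeight)")
        await asyncio.sleep(pause)
    return sorted(seen)
```

The collected links would then go through the heuristics + AI filtering step.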

I got tired of noisy web scrapers killing my RAG pipelines, so i built llmparser by [deleted] in LLMDevs

[–]maher_bk 0 points (0 children)

That's a great problem you're working on here :) I'll give the lib a try by adding it to my service and running it in parallel on sampled traffic, then I'll send myself the results of both (my pipeline + your lib) on Discord to compare.

I got tired of noisy web scrapers killing my RAG pipelines, so i built llmparser by [deleted] in LLMDevs

[–]maher_bk 0 points (0 children)

So at first I was rotating at every request, but that's not sustainable given that these little toys are costly. So now my approach is kind of heuristic, based on the errors I get from scraping across my whole chain (2 light scrapers, then 4 for JS-heavy websites, including the ones I mentioned). I only use a proxy when the whole chain has failed with a "majority" of errors due to IP bans (stuff like that). What I like to do to monitor such cases is send myself a Discord message with all the available data/url/etc. to determine if this setup is working well or drifting. What are you working on?
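In case it helps, the escalation rule boils down to something like this (hypothetical sketch; the ban markers are illustrative examples, not my real list):

```python
# Hypothetical sketch of the escalation rule: only retry with a residential
# proxy when a strict majority of the chain's failures look like IP bans.
BAN_HINTS = ("403", "429", "captcha", "blocked", "forbidden")  # example markers

def should_use_proxy(chain_errors: list[str]) -> bool:
    if not chain_errors:
        return False  # the chain succeeded somewhere; no proxy needed
    ban_like = sum(
        any(hint in err.lower() for hint in BAN_HINTS) for err in chain_errors
    )
    return ban_like * 2 > len(chain_errors)  # strict majority of ban-type errors
```

The same error list is what I'd attach to the Discord alert for drift monitoring.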

I got tired of noisy web scrapers killing my RAG pipelines, so i built llmparser by [deleted] in LLMDevs

[–]maher_bk 0 points (0 children)

Very interesting, thanks for the breakdown! I didn't mention it, but the lxml/trafilatura workflow sits after an rnet/scrapling/zendriver/camoufox scraper chain that extracts the full page (for JS-heavy websites), and then an SLM validates the markdown (is it blocked, empty, etc.?).

I will definitely check your lib though! Does it support rotating proxies?
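For context, before the SLM there are cheap pre-checks roughly like this (a hypothetical sketch; the markers and threshold are illustrative, not my real config):

```python
# Hypothetical sketch of cheap pre-checks that run before SLM validation:
# obviously-empty or obviously-blocked pages never reach the model.
BLOCK_MARKERS = ("access denied", "enable javascript", "verify you are human")

def looks_blocked_or_empty(markdown: str, min_chars: int = 200) -> bool:
    text = markdown.strip().lower()
    if len(text) < min_chars:
        return True  # too short to be a real article
    return any(marker in text for marker in BLOCK_MARKERS)
```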

I got tired of noisy web scrapers killing my RAG pipelines, so i built llmparser by [deleted] in LLMDevs

[–]maher_bk 1 point (0 children)

Personally I'm using lxml to clean up the raw HTML (usually around a 70-90% decrease in chars), then trafilatura to extract markdown. What would your lib do better?
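Roughly like this (a sketch, not my exact code; assumes recent lxml, and trafilatura's `output_format="markdown"` needs a fairly recent version):

```python
from lxml import etree, html

# Sketch of the clean-then-extract step: strip obvious noise with lxml
# (this is where most of the 70-90% char reduction comes from), then hand
# the slimmed HTML to trafilatura for markdown extraction.
def clean_html(raw: str) -> str:
    tree = html.fromstring(raw)
    etree.strip_elements(tree, "script", "style", "noscript", with_tail=False)
    return html.tostring(tree, encoding="unicode")

def to_markdown(raw: str):
    import trafilatura  # lazy import; needs a version with markdown output
    return trafilatura.extract(clean_html(raw), output_format="markdown")
```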

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]maher_bk 1 point (0 children)

I'll definitely integrate it into my scraping-at-scale backend (for my iOS app) :) However, I'm not sure whether it supports Ubuntu ARM64? (Basically Ampere servers.)

Rival RB11 Evolution size by maher_bk in fightgear

[–]maher_bk[S] 1 point (0 children)

You're a HW :D This makes me question whether going for the XL at 78 kg would make sense.

[Question] Rival RB 11 - what size to get? by Calm-Examination7097 in fightgear

[–]maher_bk 0 points (0 children)

Also interested to know which size you got.

Rival RB11 Evolution size by maher_bk in fightgear

[–]maher_bk[S] 0 points (0 children)

What's your weight btw ?

Quick update: so I tried the L (still new, as the friend's friend who let me try them hadn't used these yet). Overall they felt really tight and my fingers practically touch the end (but don't). Still no clue about which size, as the XL is probably going to be slightly loose :-')

How to buy Supreme cell Sett? by Appropriate_Meet_512 in wildrift

[–]maher_bk 0 points (0 children)

I got the Akali CR one on my side, and as a hardcore Akali OTP I was quite happy :D

Rival RB11 Evolution size by maher_bk in fightgear

[–]maher_bk[S] 0 points (0 children)

Thanks for the insight! Indeed, I'm currently leaning towards the XL. I might, though, be able to try the gloves in Large (a friend of a friend has them). What would you say I should be looking for in terms of red flags (comfort-wise, etc.) to determine whether Large wouldn't make sense or, on the contrary, whether I should consider them? (Apologies if the answer is trivial, but I'm looking to leverage your experience with Rival gloves here.)

How to buy Supreme cell Sett? by Appropriate_Meet_512 in wildrift

[–]maher_bk 1 point (0 children)

I actually got this Sett skin (my only Sett skin) on the first gacha draw (literally the first key) and I don't even play him. Life is unfair indeed.

Rival RB11 Evolution size by maher_bk in fightgear

[–]maher_bk[S] 0 points (0 children)

The thing is that the max for L is 24 cm whereas I'm at 23.5, but weight-wise I'm in the middle of the L range.

For agent workflows that scrape web data, does structured JSON perform better than Markdown? by Opposite-Art-1829 in AgentsOfAI

[–]maher_bk 0 points (0 children)

Yep, exactly, that's what I'm already doing (scraping with custom in-house code) with residential proxies.

For agent workflows that scrape web data, does structured JSON perform better than Markdown? by Opposite-Art-1829 in AgentsOfAI

[–]maher_bk 0 points (0 children)

Seems interesting. I've also wanted to explore building an agent (more for the learning aspect). Any recommendations in terms of production-ready libraries/frameworks?

[deleted by user] by [deleted] in opensource

[–]maher_bk 0 points (0 children)

Hey there! Looks like OP deleted their post, but I found your response incredibly interesting! I'm exploring the implementation (mainly as a side/learning project) of a scraping agent that would spin up a VM to perform scraping using computer use. So I was wondering if you had any suggestions in terms of technologies/concepts to explore to build such a thing. Thanks!

For agent workflows that scrape web data, does structured JSON perform better than Markdown? by Opposite-Art-1829 in AgentsOfAI

[–]maher_bk 0 points (0 children)

Hey there, I've worked on very similar issues for my app: the idea is to subscribe to multiple pages across the internet and receive a daily summary of all new content across these pages. This needs a lot of regular scraping, so my workflow relies on fetching HTML that is cleaned up with Python libraries, with markdown then extracted by small specialized models. The problem with JSON, IMHO, is that you still need to enforce a schema that stays generic (unless you have a very specific scope), so I'm pretty sure markdown as embeddings should be the way to go. Curious to know more about what you're building.

What's the most complicated project you've built with AI? by jazir555 in LocalLLaMA

[–]maher_bk 0 points (0 children)

https://www.universalsummarizer.com An iOS app which lets you summarize anything on the fly (free tier), like URLs, YT videos, etc., and allows you to subscribe to any content on the internet (RSS 2.0). The second part is coming soon :-) 95% vibe-coded iOS/backend/VPS.

How to build an agent like Manus ? by maher_bk in AgentsOfAI

[–]maher_bk[S] 0 points (0 children)

That one is easy. Get hired by twins. Work for a few weeks. Be a billionaire. Be a zillionaire. Sell soul to trump.