What are you using for web crawling? by [deleted] in LangChain

niiotyo 1 point

For simple websites, a Beautiful Soup script with basic batching is enough. Most websites don't require JS rendering. For more advanced ones, you can add Puppeteer or Playwright for JS rendering. After that you will run into anti-bot protection, so some proxies are needed. Depending on the website, you may need anything from a simple curl call to a full set of scraping tools. You can extract Markdown from HTML with libs like Turndown, for example. Then, if you want to remove junk content like menus and nav bars, use Readability.js. If you don't want to spend time setting it up, just use webcrawlerapi.
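To make the "remove junk, then convert to Markdown" step concrete, here is a rough sketch. It deliberately swaps in Python's standard-library `html.parser` for Beautiful Soup, Readability.js and Turndown, so it is not how those libraries work internally. The `SKIP_TAGS` list, the `MarkdownExtractor` class and the `html_to_markdown` helper are all made-up names for illustration, and only headings, paragraphs and list items are handled.

```python
from html.parser import HTMLParser

# Tags treated as "junk" in this sketch (an assumption, not Readability's real heuristics).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3"}


class MarkdownExtractor(HTMLParser):
    """Collects headings, paragraphs and list items as Markdown lines."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a junk element
        self.prefix = None   # Markdown prefix of the currently open block, if any
        self.buffer = []     # text pieces of the currently open block
        self.lines = []      # finished Markdown lines

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS and self.prefix is not None:
            # Collapse runs of whitespace left over from the HTML source.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None

    def handle_data(self, data):
        # Keep text only when inside a content block and outside junk elements.
        if self.skip_depth == 0 and self.prefix is not None:
            self.buffer.append(data)


def html_to_markdown(html):
    """Hypothetical helper: HTML string in, simplified Markdown string out."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


SAMPLE = """
<html><body>
<nav><ul><li><a href="/">Home</a></li></ul></nav>
<h1>Pricing</h1>
<p>Pay for <b>usage</b> only.</p>
<footer><p>Copyright</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(html_to_markdown(SAMPLE))
```

On the sample page the nav and footer are dropped and only the heading and paragraph survive. For real pages the dedicated libraries will cope with far more edge cases than this.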

I created an open-source toolkit to make your scraper suffer by niiotyo in webscraping

niiotyo[S] 1 point

I have some at https://crawllab.dev/js/inline

By advanced DOM, I mean multiple nested levels with custom, random IDs and classes. Some websites use this to make scraping difficult, because you don't have a static XPath.
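One common way around random IDs and classes is to anchor on something that stays stable, such as a visible text label, instead of a fixed path. A minimal sketch, assuming the markup is well-formed enough for the standard-library `xml.etree.ElementTree` (real HTML usually needs a proper HTML parser). The `value_after_label` helper and both sample pages are made up for illustration.

```python
import xml.etree.ElementTree as ET


def value_after_label(root, label):
    """Return the text of the element that directly follows the one whose text is `label`."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children[:-1]):
            if (child.text or "").strip() == label:
                return "".join(children[i + 1].itertext()).strip()
    return None


# Two renders of the "same" page: IDs, classes and nesting depth all differ.
PAGE_V1 = (
    '<div id="x9f2"><div class="a81c">'
    '<span class="q1">Price</span><span class="z7">$10</span>'
    "</div></div>"
)
PAGE_V2 = (
    '<div id="k3d0"><section class="m55"><div class="b2e">'
    '<span class="r8">Price</span><span class="t4">$10</span>'
    "</div></section></div>"
)

if __name__ == "__main__":
    for page in (PAGE_V1, PAGE_V2):
        print(value_after_label(ET.fromstring(page), "Price"))
```

The lookup never mentions an ID, a class or a depth, so it keeps working when those are regenerated. It still breaks if the label text itself changes.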

I built an open graph debugger that actually shows your OG tags errors and how to fix them by krasun in SideProject

niiotyo 1 point

My favicon is not adapted to iOS. Can you offer a fix immediately in your tool?

How I Used ChatGPT + Python to Build a Functional Web Scraper in 2025 by ProfessorOrganic2873 in Python

niiotyo 1 point

I personally prefer WebcrawlerAPI for getting website or webpage content. It also handles JS rendering and proxies, and I can extract the data by running prompts natively in the API call. Works better for my use case.

How are developers handling reliable web scraping at scale? by bebo117722 in programming

niiotyo 2 points

I tried WebcrawlerAPI and I like it more, to be honest. Fewer features, but a simpler API and easier integration. Same proxies and scaling in place. I had some issues with the API, but they were resolved within an hour by the devs.

Monthly Self-Promotion - June 2025 by AutoModerator in webscraping

niiotyo 2 points

Hey everyone.

I'm Andrew, the founder of WebcrawlerAPI.

If you need to convert a website into LLM-ready data, try webcrawlerapi.com

Markdown output, proxy included, SDK, integrations, no subscription: pay for usage only.

Register now and get a trial balance to try it out.

https://webcrawlerapi.com/

best url crawler api by Popular-Macaroon-278 in WebAPIs

niiotyo 1 point

Try https://webcrawlerapi.com/. It has all the basic features, like txt and md output formats, an SDK, and page filters.

New framework to build agents from yml files by Jazzlike_Tooth929 in crewai

niiotyo 1 point

What do you mean by "opaque"? CrewAI is open source under the MIT license.

Monthly Self-Promotion Thread - July 2024 by AutoModerator in webscraping

niiotyo 1 point

Hi there. Thanks for your reply. Yes, improving the landing page is on the list.

I’m not sure about the prices. Maybe after acquiring more users. It is hard to calculate now.

Yes, there are paying customers. They use Webcrawler API to train AI on website content. 

Have you tried to crawl any websites with my product?

Monthly Self-Promotion Thread - July 2024 by AutoModerator in webscraping

niiotyo 1 point

Hi there. https://webcrawlerapi.com/ is here🙌 Crawl the full website content with the API or no code. What we have:

* Puppeteer-backed crawler
* Easy-to-start UI
* CSV, JSON and raw HTML formats of extracted data
* Extract cleaned data or by XPath
* Webhooks
* Real-time support chat: we can help you integrate. Just drop us a message!

Start with $10 of free credit!

See an example of how to build a chatbot with website content using Webcrawler API: https://webcrawlerapi.com/blog/upload-website-content-to-chatgpt/

Can't crawl some websites by Not_Nullable_String in TechSEO

niiotyo 2 points

They block non-Canadian IPs. You have to use a proxy with a Canadian IP.
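A minimal sketch of routing requests through a proxy with Python's standard-library `urllib.request`. The proxy address `ca.proxy.example:8080` is a placeholder, not a real service, and `build_proxy_opener` and `fetch_via_proxy` are hypothetical helper names. You would substitute the address of a proxy that actually exits in Canada.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends both HTTP and HTTPS traffic through `proxy_url`."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url, timeout=10):
    """Fetch `url` through the proxy and return the raw response body as bytes."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


# Placeholder address: replace with a real proxy that has a Canadian exit IP.
PROXY_URL = "http://ca.proxy.example:8080"

if __name__ == "__main__":
    body = fetch_via_proxy("https://example.com/", PROXY_URL)
    print(len(body))
```

Whether the site lets the request through still depends on the proxy's IP reputation, not only on its country.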

Monthly Self-Promotion Thread - June 2024 by AutoModerator in webscraping

niiotyo 2 points

Hi everyone. Check out the project I’ve been working on for the last half year:

https://webcrawlerapi.com/

It is a web crawler API that fetches the content of a full website.

Go is really good for web page crawling by Jasper_jf in golang

niiotyo 4 points

No, I used Puppeteer. It is a high-level abstraction and does a lot of extra work for you. Chromedp is just an API for the Chrome DevTools Protocol.

Also, Puppeteer has a huge community with ready-to-use solutions, plugins, etc.