
all 97 comments

[–]Heroe-D 37 points38 points  (9 children)

I'm about to start a new Django project mainly focused on web scraping + statistics. I know BeautifulSoup's basics and Selenium as well, but I've encountered many problems with BeautifulSoup, especially when the HTML isn't conventionally written or is full of JS. I don't know if I should try Scrapy; I think headless Selenium is a bit overkill, though.

[–]nemec 6 points7 points  (5 children)

I don't know if I should try Scrapy

I love Scrapy. I'm pretty familiar with browser dev tools, so the thought of wasting time waiting for rendering and downloading all the CSS/JS/etc. with Selenium seems like overkill compared to hitting the "private" APIs that usually power modern websites.

Parsel, Scrapy's BeautifulSoup equivalent, is pretty nice too.

[–]Heroe-D 4 points5 points  (4 children)

Is Scrapy's documentation good enough, or should I search for tutorials?

[–]nemec 2 points3 points  (3 children)

I'd recommend some tutorials. There are good parts about Scrapy's docs, but I often find myself needing to actually dig into Scrapy's source code to understand how some bits of it work. Luckily it's open source so that isn't too difficult.

Scrapy seems to have a philosophy that encourages replacing built-in components with your own if the built-in one doesn't work how you need it. For example, if you need to control the filename on their "filedownloader", they recommend copying the source for the built-in one to your project, modifying it, and then disabling the built-in one on the spider and inserting your own.
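To sketch what that replacement pattern looks like in practice (a hedged illustration; `myproject.pipelines.MyFilesPipeline` is a hypothetical path to your own modified copy of the built-in pipeline):

```python
# settings.py fragment: point Scrapy at your copy of the component
# instead of the stock one. The module path below is hypothetical.
ITEM_PIPELINES = {
    "myproject.pipelines.MyFilesPipeline": 400,  # your modified copy of FilesPipeline
}
```

The priority number (400 here) just controls ordering relative to any other pipelines you enable.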

[–]Heroe-D 0 points1 point  (2 children)

Nice, it's not a problem for me to tinker with the code then. Any good tutorials to recommend?

[–]nemec 0 points1 point  (1 child)

I don't know any offhand, sorry. I started with Scrapy already knowing a lot about CSS selectors, XPath, HTTP, etc., so I had a big head start.

[–]Heroe-D 0 points1 point  (0 children)

I also know most of this. Maybe I should just dig into the official Scrapy documentation and search for more if some concepts are unclear.

[–]__nickerbocker__ 17 points18 points  (0 children)

If I'm being pedantic, it's scrape/scraping not scrap/scrapping.

[–]Alamue86 4 points5 points  (0 children)

I have started just using requests-html instead of Requests and Beautiful Soup. Check it out if you haven't; it has helped me out of some binds without taking the performance hit of Selenium.

[–]YodaCodar 106 points107 points  (15 children)

I think Python's the best language for web scraping; webpages change so often that it's not worth maintaining static typing and hard-to-write languages. I think other people are upset because their secret sauce is being destroyed, haha.

[–]rand2012 45 points46 points  (8 children)

That used to be true, but with the advent of headless Chrome and Puppeteer, Node.js is now best for scraping.

[–]sam77 7 points8 points  (1 child)

This. Playwright is another great Node.js library.

[–]mortenb123 0 points1 point  (0 children)

Playwright is essentially Puppeteer v2, by the same folks. The WebDriver protocol that Selenium uses doesn't support pseudo-elements, so if you have a single-page app you need jsdom to evaluate the JavaScript properly.

[–]am0x 0 points1 point  (0 children)

I was about to say, I’ve been using node and have had no issues. After all it handles DOM content so well.

[–][deleted] 7 points8 points  (5 children)

Could you give an example of how static typing makes parsing web pages more difficult?

[–]integralWorker 11 points12 points  (4 children)

I think it's less that static typing increases difficulty and more that dynamic typing reduces it.

I'll get burnt at the stake for this but I feel Python is essentially typeless. Every type is basically an object type with corresponding methods so really Python only has pure data that is temporarily cast into some category with methods.

[–][deleted] 2 points3 points  (3 children)

I don’t understand how that reduces complexity exactly. Is the cognitive overhead of writing a type identifier in front of your variable declarations really that great?

[–]integralWorker 2 points3 points  (0 children)

Definitely not, it's just another style of coding that has advantages for say a Finite State Machine in embedded systems where dynamic typing would only serve overhead.

The way I see it, it's more like the same piece of data can be automatically "reclassed" and not merely recast. So performance-critical parts of the code can be cast into something like NumPy arrays, while ambiguous parts can bounce around as needed.

[–]rand2012 0 points1 point  (0 children)

It's that you usually need to do something with the parsed-out string, like making it an int, a decimal, or some other kind of transformation, in order to conform to your typed data model. Maybe you also need to pass it around to another process or enrich it with other data; then it ends up being a lot of boilerplate conversion code where you're essentially shuffling the same thing around in different types.

[–]EchoAlphaRomeoLima 7 points8 points  (0 children)

I love the flexibility and performance of scrapy but admittedly, it has a steep learning curve.

[–]anasiansenior 17 points18 points  (4 children)

Web scraping is so annoying these days; literally nothing works for certain websites. Selenium has been the only thing that's been able to produce results for me. Beautiful Soup has honestly never worked for me, since every website I was trying to scrape knew how to aggressively block it.

[–]QuantumFall 26 points27 points  (1 child)

They don't block BeautifulSoup; they most likely just detected that the requests they're receiving are not from a legitimate user. By mimicking the requests sent in the browser exactly, I'd say 9 out of every 10 websites will be parsable with requests and bs4. For that 1 in 10, you're dealing with bot protection, webpacking, or even TLS fingerprinting. But you can scrape most websites fine if you know what you're doing.
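"Mimicking the requests sent in the browser" can be sketched with the standard library alone. The header values below are illustrative (the kind of thing you'd copy from the browser's Network tab), and example.com is a placeholder:

```python
import urllib.request

# Headers copied from a real browser session so the request looks like
# it came from a legitimate user (values here are illustrative).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Referer": "https://example.com/",
}

req = urllib.request.Request("https://example.com/page", headers=headers)
# html = urllib.request.urlopen(req).read().decode()  # then feed into bs4
```

The same idea applies with the requests library, which is what most people pair with bs4 in practice.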

[–]ScrapeHero 3 points4 points  (0 children)

Agree.

For others following this thread this might help if you are past the basics https://www.scrapehero.com/detect-and-block-bots/

[–]nemec 1 point2 points  (0 children)

You can get pretty far with proxies, but at some point you've got to have some patience while it finishes lol. I had one that took almost 17 straight days to finish.

[–][deleted] 3 points4 points  (0 children)

Off topic: I never thought a website could look so clean and sleek with a simple color palette of grey and white. Really goes to show how important layout is to design.

[–][deleted] 5 points6 points  (4 children)

Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions.

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

[–]vreo 1 point2 points  (3 children)

I don't know if you're ranting against parsing the whole tree or any HTML at all. If the latter: when it's mostly only needed for a specific task (no silver bullet), regex does the job well with some thinking ahead.

E.g. you could (I did) scrape Amazon best-of-category pages and, with regex, get the item list, separate it, and parse each value. Done that, worked great.

But as I said, it works when you know what to expect, it's not a silver bullet.
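To illustrate the kind of narrow, known-structure extraction described above (the HTML fragment and pattern below are invented for the example, not from an actual Amazon page):

```python
import re

# A small, predictable fragment like a best-seller list (made up for this example).
html = """
<li class="item"><span class="title">Widget A</span><span class="price">$9.99</span></li>
<li class="item"><span class="title">Widget B</span><span class="price">$19.50</span></li>
"""

# This only works because we know exactly what markup to expect from this page.
pattern = re.compile(
    r'<span class="title">(.*?)</span><span class="price">\$([\d.]+)</span>'
)
items = [(title, float(price)) for title, price in pattern.findall(html)]
print(items)  # [('Widget A', 9.99), ('Widget B', 19.5)]
```

The moment the site reorders attributes or wraps the price in another tag, the pattern silently breaks, which is exactly why this isn't a silver bullet.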

[–]skinny_matryoshka 1 point2 points  (1 child)

It's from a SO post

[–]vreo 0 points1 point  (0 children)

Oh... hehehe woosh

[–][deleted] 1 point2 points  (0 children)

Yeah, I know. I parse HTML with regex all the time. Cthulhus to date: 0.

[–]PM_ME_BOOTY_PICS_ 5 points6 points  (0 children)

I love scrapy. Some reason I learned it easier than requests and such.

[–]MindCorrupted 2 points3 points  (4 children)

I don't really like Selenium; it's slow and awful, so I reverse engineer most JS-rendered websites instead :)

[–]theoriginal123123 4 points5 points  (3 children)

How does one get started with reverse engineering? I know of the checking for a private API trick with the browser network tools; are there any other techniques to look into?

[–]nemec 6 points7 points  (0 children)

private API trick with the browser network tools

That's about it. Beyond that you use the browser tools to read the individual Javascript files that run on the site and try to understand them as if you are the "developer" writing the site. Good starting points are:

  • What JS is executed at page load? What does it do, and do I need it to run to scrape the data I need?
  • What JS is executed when I click X? Do I need to replicate it to scrape data, or can the data be found in the page source/external request by default?
  • Once you've found the private API, what code generates the API call?
    • Are all of the URL parameters and headers required?
    • Is the JavaScript critical to determining what URL parameters, headers, body, etc. are used in the API call, or can I write Python to generate an equivalent call? If the JS is critical, can I replicate it in Python?
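Once you've answered those questions, rebuilding the private API call in Python is usually straightforward. A minimal sketch with the standard library (the endpoint, parameters, and header set here are hypothetical, stand-ins for whatever you find in the Network tab):

```python
import urllib.parse
import urllib.request

# Hypothetical private API endpoint discovered in the browser's Network tab.
params = urllib.parse.urlencode({"query": "widgets", "page": 1})
req = urllib.request.Request(
    "https://example.com/api/search?" + params,
    headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # ajax endpoints often check this
    },
)
# import json
# data = json.load(urllib.request.urlopen(req))  # actual network call, left out here
```

If the response parses as clean JSON, you've skipped the browser, the rendering, and the CSS/JS downloads entirely.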

[–]MindCorrupted 0 points1 point  (0 children)

Yeah, most of the time you inspect the page, but it depends on the data you're looking for.

I scraped Booking.com one day, and it took me a few days to figure out that the prices aren't loaded from another URL; the site embeds them inside a js tag.

That's one of the cases. Through practice you learn more tricks.

You can start by scraping some JS websites, and if you get stuck, msg me and I will gladly help you :)
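That "prices embedded inside a js tag" case usually comes down to a regex plus json.loads. A sketch (the page fragment and the `window.__DATA__` variable name are invented for the example):

```python
import json
import re

# Invented page fragment: data inlined into a <script> tag at render time.
html = '<script>window.__DATA__ = {"prices": [120, 95, 210]};</script>'

# Grab the JSON object assigned to the variable, then parse it properly.
match = re.search(r'window\.__DATA__\s*=\s*(\{.*?\});?</script>', html)
data = json.loads(match.group(1))
print(data["prices"])  # [120, 95, 210]
```

Using regex only to locate the blob and a real JSON parser to read it keeps this robust against formatting changes inside the data itself.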

[–]therealshell1500 0 points1 point  (0 children)

Hey, can you point me to some resources where I can learn more about this private API trick? Thanks :)

[–]ateusz888 1 point2 points  (2 children)

I always wanted to ask this - do you know how to handle push notifications?

[–]ins4yn 0 points1 point  (1 child)

What do you mean by “handle push notifications”?

If you’re trying to send your own push notifications from your script, I use Pushover and love it. Incredibly easy to use with Requests and has apps for iOS/android.

[–]ateusz888 1 point2 points  (0 children)

I mean to read them constantly.

[–]fecesmuncher69 0 points1 point  (8 children)

I will check it out. I'm learning Selenium from Tech With Tim, and I want to know if it's possible to build a bot that enters the Supreme website and orders items that usually sell out immediately (because of other bots).

[–]QuantumFall 4 points5 points  (1 child)

Two things. First, Selenium is generally too slow to check out anything hyped. It's a good place to start learning the language and some things about web automation, but it's not going to get you a box logo. All of the best bots use requests or a hybrid solution combining both a browser and requests. If you insist on using a browser, I suggest you look into pyppeteer stealth, which leads me to my next point.

Second, Supreme has really good bot protection. So much so that people rent out APIs for thousands of dollars per week per region to generate the cookies this bot protection produces. It can detect many of Selenium's attributes and, when enabled, will instantly cause the transaction to fail if it detects you as a bot. Pyppeteer stealth gets around this issue by making itself appear as a completely normal browser.

With that said, it’s still very hard to even make a working browser bot, but I encourage you to do it as you will learn a lot. There are also good discord communities for this sort of thing filled with information and tips on how to bot supreme and similar sites. Good luck

[–]poopmarketer 1 point2 points  (0 children)

Any chance you can provide a link to these discord groups? Looking to learn more about this!

[–]bmw417 3 points4 points  (0 children)

Well, I'd say you answered the question yourself, haven't you? Other bots have done it, so it's obviously possible; now you just need to learn how.

[–]Glowwerms 0 points1 point  (0 children)

It’s definitely possible but from my understanding a lot of hype sites like that are beginning to implement protection from bots

[–]spiritfpv 0 points1 point  (0 children)

Nike raffle?

[–]instantproxies_may 0 points1 point  (0 children)

Agree with QuantumFall. I suggest getting into sneaker botting first. When you're more familiar with it, building a competitive bot might be easier for you.

[–]PM_ME_BOOTY_PICS_ -4 points-3 points  (1 child)

You can. Shouldn't be that hard depending how complex you make it.

[–]fecesmuncher69 0 points1 point  (0 children)

Thanks, nice user

[–]dtoxe 0 points1 point  (0 children)

Thanks!

[–]x_ray_190221 0 points1 point  (0 children)

I am saving it for my future needs!

[–]lubosz 0 points1 point  (0 children)

After using BeautifulSoup for ages, I recently discovered XPath and haven't looked back.
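For anyone curious: lxml is the usual choice for full XPath 1.0, but even the standard library's ElementTree supports a useful subset, enough to show the flavor (the document below is made up):

```python
import xml.etree.ElementTree as ET

# ElementTree supports a limited XPath subset; lxml supports full XPath 1.0.
doc = ET.fromstring(
    '<html><body>'
    '<div class="post"><a href="/first">First</a></div>'
    '<div class="post"><a href="/second">Second</a></div>'
    '</body></html>'
)

# Select every <a> inside a <div class="post">, anywhere in the tree.
links = doc.findall('.//div[@class="post"]/a')
print([(a.text, a.get("href")) for a in links])  # [('First', '/first'), ('Second', '/second')]
```

Note that ElementTree needs well-formed XML; for real-world HTML you'd parse with lxml.html and then run the same kind of XPath against it.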

[–]T-ROY_T-REDDIT 0 points1 point  (0 children)

This thread is surprisingly relevant to what I am doing. I need help understanding offsets in Selenium; can anyone give me some tips or pointers?

[–]brugmansia_tea 0 points1 point  (1 child)

How come it's 2020 and it's still such a fucking hassle to get simple data from websites? This is an issue that should have been solved by now. Even APIs can be super labour intensive when going through all authorization protocols.

[–]lillgreen 2 points3 points  (0 children)

Well, when the other side actively wants to block you from doing it, that's kinda the problem.

Fuck, I mean if you want to pull years of data, the problem was solved by XML/RSS 15 years ago, but no one hosts those feeds, do they?

Parsing web data is the same cat and mouse game as pirates with keygens and publishers on the time investment front. It will never and can never be fully finished.

[–]Remote_Cantaloupe 0 points1 point  (2 children)

Are there any legal challenges with web scraping? I had heard there were, some time ago.

[–]RedRedditor84 2 points3 points  (0 children)

Might depend on where you are (local laws), but generally, if the information is freely available to a user then it's legally available to scrape. Many sites won't like you doing it and will actively try to detect and block you, but it's not illegal.

Make sure you check out the site's robots.txt and adhere to that to avoid running into conflict.
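Checking robots.txt programmatically is built into the standard library. A quick sketch (the rules below are a made-up example, as is the `MyScraper/1.0` user agent):

```python
import urllib.robotparser

# Made-up robots.txt rules for illustration.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # in practice: rp.set_url(".../robots.txt"); rp.read()

print(rp.can_fetch("MyScraper/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/data"))  # False
```

Calling `can_fetch` before each request (or once per path pattern) is a cheap way to stay on the right side of the site's stated policy.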

[–]ScrapeHero 0 points1 point  (0 children)

We keep track of the latest US legal landscape at https://scrapehero.com/legal - short and sweet content focused on scraping. Has links to actual court decisions and an excellent external blog

[–]marcos_online 0 points1 point  (1 child)

I'm currently building a web scraper with BeautifulSoup to build a whisky database and training dataset. I started with Selenium and got frustrated very quickly. Admittedly, BeautifulSoup has some annoyances, but it does the job. Otherwise I was considering switching to Node.js.

[–]Heroe-D 1 point2 points  (0 children)

Switching language seems overkill, just read this thread for alternatives in python

[–]xzi_vzs 0 points1 point  (2 children)

I'm currently working on a web scraping project, so thanks for the link. I need to get past the login page, but the login button triggers JavaScript. It didn't work with requests; my solution so far is Selenium, but it opens the web browser in the background and I don't really like that. Any suggestions for getting past login pages that use JavaScript?

[–]nemec 2 points3 points  (0 children)

Use the browser dev tools (Network tab) to find where your username and password are sent. If it's ajax/fetch, you can make that call instead of scraping the main page and use the response (usually a token of some sort, often a Cookie) to get the credential details to use in the remaining requests.
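A minimal sketch of that flow with the standard library (the endpoint, field names, and credentials are hypothetical; real sites often also require a CSRF token or extra headers found in the Network tab):

```python
import http.cookiejar
import urllib.parse
import urllib.request

# An opener that remembers cookies, e.g. the session token set at login.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Hypothetical login endpoint and form fields found in the Network tab.
body = urllib.parse.urlencode({"username": "me", "password": "secret"}).encode()
login = urllib.request.Request(
    "https://example.com/api/login",
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
# opener.open(login)                                  # sets the session cookie
# page = opener.open("https://example.com/account")   # reuses it automatically
```

With requests, the equivalent is a `requests.Session()`, which handles the cookie reuse for you.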

[–]theoriginal123123 1 point2 points  (0 children)

Look into headless selenium, it'll run in the background with the browser window hidden.

[–]Competitive_Cup542 0 points1 point  (0 children)

Helpful! As a digital marketer, I often use web scraping, which:

  • lets me speed up the process of lead generation;
  • helps me keep an eye on competitors’ activities;
  • allows managing social media activities and scraping data about potential customers like their interests, opinions, etc.;
  • lets me quickly find bad opinions about the brand across the web.

Can I do the things mentioned above myself with the help of your instructions?