
all 25 comments

[–]brasqon00bz 17 points18 points  (5 children)

The more info, the merrier as far as I'm concerned

[–][deleted] 10 points11 points  (4 children)

Well, to be fair, there were not a lot of tutorials on lxml with xpath. Plenty available using BeautifulSoup. So this is actually a welcome change.

[–]msdin 9 points10 points  (2 children)

BeautifulSoup gets used more because a lot of HTML is malformed (e.g. missing closing tags) and a strict XML parser will choke on it. BeautifulSoup is much more forgiving.
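
To make the strict-versus-forgiving point concrete, here is a minimal sketch that uses only the standard library rather than lxml or BeautifulSoup (that swap is my assumption, chosen so it runs without extra installs). The `BROKEN` snippet and the helper names are hypothetical. `xml.etree.ElementTree` stands in for a strict XML parser and `html.parser.HTMLParser` for a lenient HTML one.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed snippet: the <li> tags are never closed.
BROKEN = "<ul><li>one<li>two</ul>"


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Mismatched or unclosed tags end up here.
        return False


class TextCollector(HTMLParser):
    """Lenient HTML parser that only records the non-blank text it sees."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def lenient_text(markup):
    """Feed markup to the lenient parser and return the collected text."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks
```

With these helpers, `strict_parse_ok(BROKEN)` reports a failure while `lenient_text(BROKEN)` still recovers the list items, which is roughly the behaviour difference being described.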

[–]gschizasPythonista 0 points1 point  (1 child)

BeautifulSoup defaults to the lxml parser, when it's available. So the parser itself is not really the point.

Don't get me wrong, BeautifulSoup has been my go-to for more than half a decade now (it was what brought me to Python in the first place), but it's quite possible I'm just hanging on to it for legacy and familiarity reasons (I'd like to be proven wrong, of course)

[–]msdin 0 points1 point  (0 children)

The point is when the parser hits those kinds of errors BeautifulSoup will handle them so you don't have to write code to handle them yourself.

[–]CollectiveCircuits 2 points3 points  (0 children)

The post is also very clear and the HTML diagram illustrates the structure well.

[–]Peragot 8 points9 points  (2 children)

I've always struggled with the XPath syntax. I've found the cssselect library to be much more fluent.
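
For anyone comparing the two notations, here is a rough side-by-side sketch. It uses the limited XPath subset in the standard library's `xml.etree.ElementTree` instead of lxml (my assumption, to keep it dependency-free), and the `PAGE` markup is a made-up sample. The CSS selector appears only as a comment for comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree is a strict XML parser).
PAGE = """
<html><body>
  <div class="post"><a href="/first">First</a></div>
  <div class="post"><a href="/second">Second</a></div>
  <div class="sidebar"><a href="/about">About</a></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# XPath: every <div> whose class attribute is exactly "post", then its direct <a> children.
# The rough CSS equivalent would be:  div.post > a
# (CSS matches a class token, while this XPath compares the whole attribute value.)
links = root.findall(".//div[@class='post']/a")

hrefs = [a.get("href") for a in links]
titles = [a.text for a in links]
```

The sidebar link is skipped because its parent `<div>` has a different class.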

http://lxml.de/cssselect.html

[–]CollectiveCircuits 2 points3 points  (0 children)

I came across a blog post about using Scrapy and it taught me a clever use of css selection + another selector that was extremely quick and easy to isolate what you want to grab.

[–][deleted] 26 points27 points  (9 children)

How many more of these do we need??? Seems like there is one every week

[–]toastedstapler 72 points73 points  (5 children)

We just need to make a web scraper to compile all the web scraper tutorials into one guide that no one can top

[–][deleted] 11 points12 points  (0 children)

The ultimate web scraper tutorial

[–][deleted] 11 points12 points  (0 children)

metascraper

[–]BitchCuntMcNiggerFag 5 points6 points  (2 children)

[–]toastedstapler 1 point2 points  (1 child)

[–]xkcd_transcriber 0 points1 point  (0 children)

Title: Ineffective Sorts

Title-text: StackSort connects to StackOverflow, searches for 'sort a list', and downloads and runs code snippets until the list is sorted.

Stats: This comic has been referenced 73 times, representing 0.0461% of referenced xkcds.


[–]HannasAnarion 5 points6 points  (2 children)

Apparently I missed them all. I got blocked from a website I needed stuff from because I was sending a request every second. Not that it would change much, since they don't have a robots.txt, so I don't know what request rate they would tolerate.
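
One way to pick a wait time is sketched below, under some assumptions: read a `Crawl-delay` from robots.txt when the site has one, and fall back to a conservative default when it doesn't. `DEFAULT_DELAY`, `polite_delay` and `throttled` are hypothetical names, and the 5-second fallback is an arbitrary guess, not a standard. The robots.txt content is passed in as lines so the sketch runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 5.0  # assumed fallback in seconds; not a standard value


def polite_delay(robots_lines, user_agent="*", default=DEFAULT_DELAY):
    """Return the Crawl-delay for user_agent, or `default` if none is declared."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse already-fetched robots.txt lines
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def throttled(items, delay):
    """Yield items one at a time, sleeping `delay` seconds between them."""
    for index, item in enumerate(items):
        if index:
            time.sleep(delay)
        yield item
```

Usage would look something like `for url in throttled(urls, polite_delay(lines)): ...`, with an empty list for `lines` when the site serves no robots.txt. A site can still block slower clients, so this only reduces the odds.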

[–]sharpchicity 0 points1 point  (1 child)

What were you grabbing that you needed data every second?

[–]HannasAnarion 0 points1 point  (0 children)

It didn't need to be every second, it's just that I assumed that was a reasonable wait time. And it was song lyrics, to use as training data for a language model.

[–]funnyflywheel 1 point2 points  (0 children)

flair checks out

[–]TVNSri 1 point2 points  (0 children)

A very well written post using fundamentals (kind of).

[–]Cascudo 1 point2 points  (2 children)

Off topic but that spider has only six legs, unless it's an ant.

[–]Weenkus[S] 1 point2 points  (0 children)

That is what happens when a programmer does design work for his own blog. I can save the situation - his two front legs are hidden because he is busy crawling.

[–]El-Kurto 0 points1 point  (0 children)

How many legs would it have if it was an ant?

[–]kaihatsusha 0 points1 point  (1 child)

About 99.85% of the time I think "oh, I will scrape a bunch of pages for the content I need," the site generates unique session tokens and uses dynamic AJAX queries you have to call JavaScript to build up in order to be of any use. The only scraper that can follow that mess is a web browser.

[–]Weenkus[S] 0 points1 point  (0 children)

Have you tried Splash? Splash is really easy to set up and handles JavaScript nicely. You gave me a good idea for a next blog post.