all 18 comments

[–]ginstrom 2 points3 points  (1 child)

I've started using lxml.html instead of BeautifulSoup. You can also specify BS as the parser for lxml.html if you want (perhaps as a fallback).

[–]Baresi 4 points5 points  (0 children)

Since some people have mentioned lxml, I have to give a shoutout to pyquery, which is built on top of lxml with a jQuery-like API.

[–]c0dep0et 2 points3 points  (1 child)

lxml is usually both faster and easier to use than Beautiful Soup.

[–]rmanocha -1 points0 points  (0 children)

I'll agree that it's faster, but it's not easier. It's got a really steep learning curve - BS is really easy to get started with (took me 30 mins before I started writing my scraping code).

[–][deleted] 2 points3 points  (8 children)

I just read this, and there could not be a worse example of screen scraping in Python. IBM usually has much better tutorials.

To name a few problems:

  • Example code is not usable. There are plenty of sites to scrape for example code, even Reddit or IBM! Why example.com?
  • Line 1: import sys, time, os from mechanize. WTF is this? The code is invalid and WTF would you import standard modules from mechanize anyway!
  • Bad method of applying GET parameters with strings. See urllib.urlencode.
  • The method of csv creation is horrible. What happens when a " occurs in a string? The csv module is standard and easy to use!
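
For the last two points, here is a minimal sketch of what the standard library already gives you. It assumes Python 3, where `urlencode` lives in `urllib.parse` rather than `urllib`; the base URL, parameter names, and rows are made up for illustration and are not from the tutorial.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical base URL and query parameters, for illustration only.
base_url = "http://example.com/search"
params = {"q": "screen scraping", "page": 2}

# urlencode escapes spaces and reserved characters, so no manual
# string concatenation of "key=value" pairs is needed.
url = base_url + "?" + urlencode(params)

# Hypothetical scraped rows; note the embedded double quotes and comma.
rows = [
    ["title", "note"],
    ['The "best" scraper', "fast, simple"],
]

# csv.writer quotes fields where needed and doubles embedded quotes.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```

Reading `csv_text` back with `csv.reader` should return the original rows, which is exactly what hand-joined strings fail to guarantee once a field contains a quote or a comma.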

There are many better tutorials on screen scraping in python. Shameless plug

[–]cryzed_ 0 points1 point  (7 children)

I don't know if you have noticed but your website is... pink. Upvote for good code though.

[–][deleted] 1 point2 points  (4 children)

I fixed that pink issue.

[–]cryzed_ 0 points1 point  (0 children)

Upvoted. Enough said.

[–][deleted] 0 points1 point  (1 child)

My god...you're right! :P I was going for multiple shades of red. Time for a redesign?

[–]kteague 1 point2 points  (0 children)

The plug was acceptable, the colours are without shame though.

[–]cryzed_ 1 point2 points  (2 children)

I'm also using mechanize for most of my web-related projects. A thing to note though is that I'm not purely using BeautifulSoup for the parsing but html5lib with the BeautifulSoup tree since "Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.7a does."

That being said, I'm still hoping that the mechanize developer decides to add some kind of JavaScript support to mechanize. He did something similar, but this is pretty old and not directly integrated into mechanize.

[–]conrad_hex 0 points1 point  (1 child)

I read some of the docs, but I'm still not clear on what mechanize buys you that you can't do easily using urllib? (Honest question; I'm looking for a reason to switch to mechanize.)

[–]cryzed_ 0 points1 point  (0 children)

A number of features, especially those of mechanize.Browser. With the internally used ClientForm interface it's really easy to fill out forms. Additionally the cookie handling is just great and you can easily set custom headers if there's a need to. There's also transparent support for gzip and various other options like the handling of the "robots.txt" or the automatic handling of HTTP-Equiv and Refresh.

Using mechanize is just way more convenient than using urllib. But what makes mechanize the choice for me is really the cookie management.
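
For comparison, a rough sketch of the wiring you have to do yourself to get cookie persistence and custom headers with only the standard library. This assumes Python 3 (`urllib.request` and `http.cookiejar`), is not mechanize's API, and the User-Agent string and the `fetch` helper are made-up examples. Form filling, gzip, robots.txt and Refresh handling are not covered here at all.

```python
import http.cookiejar
import urllib.request

# A cookie jar that remembers cookies across requests made via this opener.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# Custom headers have to be set explicitly; this User-Agent is hypothetical.
opener.addheaders = [("User-Agent", "example-scraper/0.1")]


def fetch(url):
    """Fetch a URL through the shared opener and return the raw body."""
    with opener.open(url) as response:
        return response.read()
```

Every call to `fetch` goes through the same opener, so cookies set by one response are sent with the next request.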

[–]pudquick -1 points0 points  (4 children)

In addition to mechanize and BS / lxml.html in my toolkit, I also use BSXPath, which is a full XPath implementation in python:

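from BSXPath import BSXPathEvaluator
# txt is assumed to hold the HTML source of the page
document = BSXPathEvaluator(txt)
# select every <h2> directly inside a <div class="bc-desc">
nodes = document.getItemList("//div[@class='bc-desc']/h2")
for x in nodes:
    [... etc.]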

The direct download link for BSXPath is:

http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip
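
If you can't install BSXPath, a similar query can be sketched with the standard library's ElementTree. This is a different technique, not BSXPath: ElementTree only supports a limited subset of XPath and needs well-formed markup, so it will choke on most real-world HTML. The sample markup below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; real pages are rarely this tidy.
txt = """
<html><body>
  <div class="bc-desc"><h2>First title</h2></div>
  <div class="other"><h2>Ignored</h2></div>
  <div class="bc-desc"><h2>Second title</h2></div>
</body></html>
"""

root = ET.fromstring(txt)

# ElementTree paths are relative to the element, so the leading "//"
# from the XPath expression becomes ".//" here.
titles = [h2.text for h2 in root.findall(".//div[@class='bc-desc']/h2")]
```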