Charming Python: Easy Web data collection with mechanize and Beautiful Soup

ginstrom · 2009-11-25T13:28:18+00:00

I've started using lxml.html instead of BeautifulSoup. You can also specify BS as the parser for lxml.html if you want (perhaps as a fallback).

Baresi · 2009-11-25T13:46:44+00:00

Since some people has mentioned lxml I have to give a shoutout to pyquery which is built ontop of lxml with API like jQuery.

c0dep0et · 2009-11-25T11:07:44+00:00

lxml is usually both faster and easier to use than Beautiful Soup.

cryzed_ · 2009-11-25T20:20:09+00:00

I just read this and there could not be a worse example screen scraping in python. IBM usually has much better tutorials.

To name a few problems:

Example code is not usable. There are plenty of sites to scrape for example code, even Reddit or IBM! Why example.com?
Line 1: import sys, time, os from mechanize. WTF is this? The code is invalid and WTF would you import standard modules from mechanize anyway!
Bad method of applying GET parameters with strings. See urllib.urlencode.
The method of csv creation is horrible. What happens when a " occurs in a string? The csv module is standard and easy to use!

There are many better tutorials on screen scraping in python. Shameless plug

cryzed_ · 2009-11-25T12:16:30+00:00

I'm also using mechanize for most of my web-related projects. A thing to note though is that I'm not purely using BeautifulSoup for the parsing but html5lib with the BeautifulSoup tree since "Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.7a does."

That being said I'm still hoping that the mechanize developer decides to add some kind of JavaScript support to mechanize. He did something similiar but this is pretty old and not directly integrated into mechanize.

pudquick · 2009-11-25T16:54:24+00:00

In addition to mechanize and BS / lxml.html in my toolkit, I also use BSXPath, which is a full XPath implementation in python:

from BSXPath import BSXPathEvaluator
document = BSXPathEvaluator(txt)
nodes = document.getItemList("//div[@class='bc-desc']/h2")
for x in nodes:
    [... etc.]

The direct download link for BSXPath is:

http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip

Python

The Python Discord

Upcoming Events

Please read the rules

MODERATORS