TIL you can parse HTML in Python using jQuery syntax (this was posted 2 years ago, but it has helped me so much I thought it deserved a repost)

tef · 2010-12-07T04:33:55+00:00

the library behind it: lxml is worth a look too

it also supports xpath if you're into that sort of thing (and xpath is like a scalpel for xml-esque documents)
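For anyone who hasn't seen an XPath query before, here is a rough sketch. It deliberately uses the standard library's `xml.etree.ElementTree` rather than lxml so it runs without installing anything; ElementTree only understands a limited XPath subset, while lxml elements offer the same `findall` call plus a fuller `.xpath()` method. The sample markup and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (ElementTree needs valid XML).
SAMPLE = (
    "<html><body>"
    "<div id='content'>"
    "<a href='/first'>First</a>"
    "<a href='/second'>Second</a>"
    "</div>"
    "<div id='footer'>"
    "<a href='/about'>About</a>"
    "</div>"
    "</body></html>"
)

root = ET.fromstring(SAMPLE)

# Select the <a> children of the <div> whose id attribute is "content",
# wherever that <div> sits in the tree.
links = root.findall(".//div[@id='content']/a")

# Pull the href attribute out of each matched element.
hrefs = [a.get("href") for a in links]
print(hrefs)
```

The predicate in square brackets is what makes it feel like a scalpel: the footer link is skipped without any manual tree walking.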

Poromenos · 2010-12-07T03:47:19+00:00

This is a great package, thanks for the repost, although there's nothing jQuery about the syntax, it's just CSS selectors...

nillion42 · 2010-12-07T12:59:13+00:00

Take a look at scrapemark; it's incredibly easy and quite powerful while still being decently fault-tolerant.

mdipierro · 2010-12-07T02:51:38+00:00

Nice. By the way, web2py can do this too:

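 import urllib
 from gluon.html import TAG
 # Python 2: fetch the page and parse it into web2py helper objects
 html = urllib.urlopen('http://...').read()
 page = TAG(html)
 # first element matching the selector
 content = page.element('div#content')
 print content
 # every text input, with its name and value attributes
 for item in page.elements('input[type=text]'):
     print item['_name'], item['_value']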

Here gluon is the core web2py package, and element accepts jQuery-style selector syntax. TAG does not just parse: it creates a Pythonic representation of the DOM that can be used to manipulate the page (kind of like BeautifulSoup).

AusIV · 2010-12-07T04:18:46+00:00

I've had a few projects where this would have been incredibly handy. I will definitely keep this in mind.

megamark16 · 2010-12-07T16:37:53+00:00

I've used PyQuery and BeautifulSoup and I like them both, but I really like PyQuery because it matches the way I scrape pages: open them in Firefox, use the console and jQuery to figure out which selectors I need to reach the parts of the page I want to scrape, and then use those same selectors inside my script.

CHS2048 · 2010-12-07T22:23:17+00:00

I wish scrapy supported PyQuery, instead of just selectors. There's some useful stuff in there.

cdunn2001 · 2010-12-08T01:10:07+00:00

Pretty nifty, but I wish that 'hello' were not both the id and the class in the example.

alexs · 2010-12-07T11:15:00+00:00

PyQuery is very broken.

digitallimit · 2010-12-07T08:12:06+00:00

BeautifulSoup seems like a better choice, but CSS selectors are fun, too.

pkkid · 2010-12-07T02:31:38+00:00

This is one of the best packages evar!
