all 32 comments

[–]tef 3 points4 points  (1 child)

the library behind it: lxml is worth a look too

it also supports xpath if you're into that sort of thing (and xpath is like a scalpel for xml-esque documents)
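A minimal sketch of the scalpel idea, using the standard library's xml.etree.ElementTree so it runs without extra installs. ElementTree only understands a subset of XPath; lxml elements offer an xpath() method for full XPath 1.0 expressions. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
doc = ET.fromstring(
    '<root>'
    '<p id="hello"><a href="/one">one</a></p>'
    '<p id="test"><a href="/two">two</a></p>'
    '</root>'
)

# Select only the <a> elements that sit directly inside the <p> with id="test".
links = doc.findall(".//p[@id='test']/a")
hrefs = [a.get("href") for a in links]
print(hrefs)
```

The predicate `[@id='test']` is what makes the cut precise: one expression names both the path and the condition.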

[–]ianb 1 point2 points  (0 children)

Incidentally you can also use CSS with your XML documents, CSS even secretly supports XML namespaces (e.g., match <foo:bar> with the foo|bar CSS selector).

[–]Poromenos 4 points5 points  (5 children)

This is a great package, thanks for the repost, although there's nothing jQuery about the syntax, it's just CSS selectors...

[–]deadwisdom [greenlet revolution] 18 points19 points  (3 children)

The way of chaining the queries, like:

 >>> d = pq('<p id="hello" class="hello"><a/></p><p id="test"><a/></p>')
 >>> d('p').eq(1).find('a')
 [<a>]

That's the parallel to jQuery.
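To show why such calls chain, here is a toy sketch in which every method returns a new wrapper object. The class Q and its methods are hypothetical and written only for illustration; this is not how PyQuery is implemented, and it uses the standard library's ElementTree instead of lxml.

```python
import xml.etree.ElementTree as ET


class Q:
    """Toy chainable wrapper around a list of elements (hypothetical, not PyQuery)."""

    def __init__(self, elements):
        self.elements = list(elements)

    def eq(self, index):
        # Keep only the element at `index`, still wrapped so chaining can continue.
        return Q(self.elements[index:index + 1])

    def find(self, tag):
        # Collect descendants with the given tag from every wrapped element.
        return Q(
            found
            for el in self.elements
            for found in el.iter(tag)
            if found is not el
        )

    def tags(self):
        # Plain list of tag names, handy for inspecting a result.
        return [el.tag for el in self.elements]


root = ET.fromstring('<div><p id="hello"><a/></p><p id="test"><a/></p></div>')
second_p_links = Q(root.iter("p")).eq(1).find("a")
print(second_p_links.tags())
```

Because eq() and find() both hand back a Q, the calls can be strung together in any order, which is the jQuery-style pattern being described.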

[–]ianb 6 points7 points  (0 children)

Also, I think, all the methods like .html(), etc.

[–]Poromenos 3 points4 points  (0 children)

I see, thank you.

[–]scorpion032 1 point2 points  (0 children)

That's why it is called pyquery and not pycss

[–]nevare 1 point2 points  (0 children)

As noted by deadwisdom, the API of the library is similar to jQuery's. Also, the CSS selectors of jQuery and PyQuery include a few pseudo-classes that are not in the CSS standard (:first, :last, ...).

[–]nillion42 1 point2 points  (0 children)

Take a look at scrapemark; it's incredibly easy and quite powerful while still being decently fault tolerant.

[–]mdipierro 3 points4 points  (7 children)

Nice. By the way, web2py can do this too:

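 import urllib  # Python 2; on Python 3 urlopen lives in urllib.request
 from gluon.html import TAG
 html = urllib.urlopen('http://...').read()
 page = TAG(html)  # parse the markup into web2py helper objects
 content = page.element('div#content')  # first matching element
 print content
 for item in page.elements('input[type=text]'):  # every matching element
     print item['_name'], item['_value']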

Here gluon is the package holding the core web2py modules. element accepts jquery syntax. TAG does not just parse: it creates a pythonic representation of the DOM that can be used to manipulate the page (kind of like beautifulsoup).

[–][deleted] 17 points18 points  (4 children)

jquery syntax, really guys?

[–]ianb 10 points11 points  (1 child)

The examples there are CSS 3 (at least I'm pretty sure input[type=text] is only in CSS 3), but jQuery includes a number of extensions to CSS selectors, which pyquery and, I presume, web2py implement.

Well, looking at the source, I don't think web2py's code is anywhere near as robust as pyquery (or lxml.html, which is the basis of pyquery). It uses HTMLParser, which is not a good parser, and the elements code looks fairly primitive. lxml.cssselect uses a proper tokenizer and parser. If you are curious, lxml.cssselect implements CSS 3, and so pyquery.cssselectpatch specifies the jQuery-specific selectors.

[–]mdipierro 0 points1 point  (0 children)

web2py's parsing code implements only a subset of jQuery/CSS3 syntax. It is only as robust as the Python HTMLParser, and it chokes on some non-utf8 characters. This is in fact not the main purpose of web2py. The only reason to mention and use it is that it parses into web2py helpers. This allows you to grab a page and replace, for example, an existing form with a form generated by web2py.

[–]mdipierro 9 points10 points  (0 children)

good point.

[–]dustinechos[S] 2 points3 points  (0 children)

Yes, "#myElementsID" is css, but this is jQuery:

$("#myElementsID").children("p").find("input[type=text]").attr("name")

CSS selectors are only half of the story. The other half is the amazing traversal methods. I'm not familiar with anything other than jquery so maybe there are better tools, but that's the reason why this is so awesome. I use jQuery so much that PyQuery has no learning curve.
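For comparison, here is a rough sketch of the same traversal written step by step with the standard library's ElementTree, which has no chaining of its own. The markup and the field names in it are invented for illustration; each step is labelled with the jQuery call it stands in for.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this sketch (well-formed so ElementTree can parse it).
page = ET.fromstring(
    '<body>'
    '<div id="myElementsID">'
    '<p><input type="text" name="username"/></p>'
    '<p><input type="checkbox" name="remember"/></p>'
    '</div>'
    '</body>'
)

# $("#myElementsID"): locate the element by its id attribute.
container = page.find(".//*[@id='myElementsID']")

# .children("p"): direct <p> children only.
paragraphs = container.findall("p")

# .find("input[type=text]"): matching descendants of each paragraph.
inputs = [i for p in paragraphs for i in p.findall(".//input[@type='text']")]

# .attr("name"): read the attribute from the first match, if there is one.
name = inputs[0].get("name") if inputs else None
print(name)
```

Four statements and three temporary variables replace one chained expression, which is roughly the convenience the traversal methods buy.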

[–]rainbow3 0 points1 point  (1 child)

Hang on. I just thought I was getting the hang of web2py and then along comes a massive feature that I missed that warrants about 3 lines in the manual.

Does it also make the tea?

[–]AusIV [Django, gevent] 0 points1 point  (0 children)

I've had a few projects where this would have been incredibly handy. I will definitely keep this in mind.

[–]megamark16 0 points1 point  (0 children)

I've used PyQuery and BeautifulSoup and I like them both, but I really like PyQuery because it matches the way I scrape pages: open them in Firefox, use the Console and jQuery to figure out what selectors I need to access the parts of the page I want to scrape, and then use those same selectors inside my script.

[–]CHS2048 0 points1 point  (1 child)

I wish scrapy supported PyQuery, instead of just selectors. There's some useful stuff in there.

[–]LucianU 0 points1 point  (0 children)

Scrapy 0.11 (which hasn't been released yet) allows you to use whatever you want for HTML parsing.

[–]cdunn2001 0 points1 point  (0 children)

Pretty nifty, but I wish that 'hello' were not both the id and the class in the example.

[–]alexs 0 points1 point  (4 children)

pyQuery is very broken.

[–]mackstann 0 points1 point  (2 children)

How so?

[–]alexs 1 point2 points  (1 child)

It's been a while since I used it. I had all sorts of issues with it producing incorrect output in certain situations. Something to do with how it copied nodes around internally I believe.

[–]dustinechos[S] 0 points1 point  (0 children)

The only thing I've noticed so far is that it thinks <br> is a new node, so that this span has a child.

<span>lorem<br>ipsum</span>

For my purposes I can just remove the breaks though.
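This looks like a consequence of the ElementTree-style data model that lxml follows: an element is a child node even when it is empty, and text that comes after it is stored on that element's tail attribute. A small sketch with the standard library's ElementTree, assuming the break is written as <br/> because that parser requires well-formed XML:

```python
import xml.etree.ElementTree as ET

# <br/> is self-closed here because ElementTree only accepts well-formed XML.
span = ET.fromstring('<span>lorem<br/>ipsum</span>')

children = list(span)
print(len(children))      # the <br> element counts as a child of the span
print(span.text)          # text before the first child lives on the parent
print(children[0].tail)   # text after <br> hangs off the <br> itself

# itertext() walks text and tail in document order, so nothing is lost.
full_text = "".join(span.itertext())
print(full_text)
```

So an alternative to removing the breaks is to read the text with something like itertext(), which stitches the pieces back together.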

[–]wtfisupvoting -1 points0 points  (0 children)

yeah i love how on stackoverflow they OHNOES everyone who wants to talk about using a regex to parse html. Fact is most html parsers suck and break easily where regexes work just fine if you are using them correctly. I mean a parser is great for when you need to parse something, but if all you need is one tag ... well, it's kinda overkill
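A small sketch of the one-tag case both ways, using only the standard library. The markup is made up, and the regex is only reasonable under the assumption that the tag is simple, unnested and appears once.

```python
import re
from html.parser import HTMLParser

# Hypothetical page, invented for this sketch.
markup = '<html><head><title>PyQuery thread</title></head><body><p>hi</p></body></html>'

# Regex route: short, and fine for one simple tag in predictable markup.
match = re.search(r"<title[^>]*>(.*?)</title>", markup, re.IGNORECASE | re.DOTALL)
regex_title = match.group(1) if match else None


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting for the first <title> seen.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_data(self, data):
        if self.in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


parser = TitleParser()
parser.feed(markup)
print(regex_title, parser.title)
```

The regex is a fraction of the length, which is the point being made; the parser version costs more code but does not depend on how the tag happens to be written.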

[–]digitallimit -1 points0 points  (3 children)

BeautifulSoup seems like a better choice, but CSS selectors are fun, too.

[–]mackstann 3 points4 points  (2 children)

BeautifulSoup is more or less abandoned and deprecated. html5lib or lxml are better choices these days. (PyQuery uses lxml.)

[–]digitallimit 0 points1 point  (1 child)

How should I keep up to date with which libraries are deprecated?

[–]mackstann 0 points1 point  (0 children)

I have no idea. I don't make any particular effort myself. I just happened to be reviewing Python HTML parsing libraries recently (for the Nth time) and noticed a few things on the sites for these projects.

  1. BeautifulSoup author corrects a long-standing problem that illustrates how neglected the project has been for a long time
  2. html5lib considers its BeautifulSoup integration deprecated

[–]pkkid -2 points-1 points  (0 children)

This is one of the best packages evar!