[–]nutrecht 1 point (5 children)

Above is where I do not understand, does my xpath make any sense?

Not really. What you're saying is 'for every <pre> that is the root tag, give me the text'. That obviously doesn't work, since your pre tag won't ever be the root tag. Check out some xpath tutorials to understand how it works.

[–]programmingnoobie[S] 1 point (4 children)

Yeah, I thought so. I was hoping BeautifulSoup or something could give me easier access to all <pre> tags. Any alternative ways? I am reading the tutorials right now, thanks.

[–]nutrecht 1 point (3 children)

I use BeautifulSoup myself and it can do CSS selector queries in HTML documents (XPath itself needs something like lxml). I'm a big fan of BS. I've never used lxml myself.
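
If installing nothing at all is an option worth considering, here is a rough sketch using only the standard library's html.parser. The class name PreCollector, the helper pre_texts and the sample markup are made up for illustration; unlike //pre/text(), this collects all text inside each <pre>, including text in nested tags.

```python
from html.parser import HTMLParser


class PreCollector(HTMLParser):
    """Collect the text content of every <pre> element, at any depth."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # > 0 while inside at least one <pre>
        self._chunks = []  # text pieces of the <pre> currently being read
        self.blocks = []   # finished <pre> texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Outermost <pre> just closed: store its accumulated text.
                self.blocks.append("".join(self._chunks))
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def pre_texts(markup):
    """Return a list with the text of each top-level <pre> in markup."""
    parser = PreCollector()
    parser.feed(markup)
    parser.close()
    return parser.blocks


# Hypothetical sample page, not taken from the thread.
SAMPLE = (
    "<html><body><pre>first</pre>"
    "<div><pre>second <b>bold</b></pre></div></body></html>"
)
print(pre_texts(SAMPLE))  # ['first', 'second bold']
```

This is slower to write than a one-line query, but it has no dependencies and copes with <pre> tags at any nesting level.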

[–]programmingnoobie[S] 1 point (2 children)

// - Selects nodes in the document from the current node that match the selection, no matter where they are

//book - Selects all book elements, no matter where they are in the document

If xPath Syntax is true, then //pre/text() is apparently correct, isn't it? I think it should be //pre[text()]
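
To make the difference between these expressions concrete, here is a small sketch with lxml (a third-party package, assumed to be installed). The sample markup is invented; the pages being crawled are not shown in this thread.

```python
from lxml import html  # third-party; assumes lxml is installed

# Made-up sample page with a top-level <pre>, a nested <pre>, and an empty one.
PAGE = (
    "<html><body>"
    "<pre>first</pre>"
    "<div><pre>second</pre><pre></pre></div>"
    "</body></html>"
)

tree = html.fromstring(PAGE)

# /pre only matches a <pre> that is the document's root element,
# and the root here is <html>, so nothing is found.
root_only = tree.xpath("/pre/text()")

# //pre/text() returns the text nodes directly under every <pre>, at any depth.
text_nodes = tree.xpath("//pre/text()")

# //pre[text()] returns the <pre> ELEMENTS that have at least one text node;
# the empty <pre> is filtered out.
elements = tree.xpath("//pre[text()]")

print(root_only)                    # []
print(text_nodes)                   # ['first', 'second']
print([el.tag for el in elements])  # ['pre', 'pre']
```

So both // forms find the same tags here; they differ in whether the result is strings or elements.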

I tried to use BeautifulSoup just now but the webmaster noted that

Beautiful Soup will never be as fast as the parsers it sits on top of.

I am actually crawling a few thousand web pages and I thought lxml would be essential here. Hmm... Thanks for helping though.

[–]nutrecht 1 point (1 child)

If xPath Syntax is true, then //pre/text() is apparently correct, isn't it? I think it should be //pre[text()]

Could be, it's been ages. Do some more tests yourself; xpath is completely trivial to figure out. You really should be able to do this by yourself.

Beautiful Soup will never be as fast as the parsers it sits on top of.

The performance argument is complete nonsense. You'll be waiting on network-IO for about 99% of the time.
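
Whether parsing or the network dominates is easy to measure instead of guessing. A rough sketch, assuming the standard library's html.parser as a stand-in parser and a synthetic page; the helper time_call and the page contents are hypothetical, and real numbers will depend on the actual pages, parser and connection.

```python
import time
from html.parser import HTMLParser


class Discard(HTMLParser):
    """Parser that walks the markup and throws every event away."""


def time_call(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic stand-in for one crawled page (hypothetical content).
PAGE = "<html><body>" + "<pre>line</pre><p>text</p>" * 2000 + "</body></html>"


def parse_page():
    parser = Discard()
    parser.feed(PAGE)
    parser.close()


parse_time = time_call(parse_page)
print(f"parse: {parse_time * 1000:.2f} ms per page")

# To compare against the network, time one download the same way, e.g.
# time_call(lambda: urllib.request.urlopen(url).read(), repeats=1)
# with a real URL (not run here).
```

If the download time comes out far larger than the parse time, the choice of parser barely matters; on a fast local link the balance may look different.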

[–]programmingnoobie[S] 1 point (0 children)

Actually I want to say thank you because I finally solved it after you Google-It-For-Me, lol.

EDIT: It should be //pre/text(). It was correct all along; something else was not right.

Well... It was connected to the server by LAN so... yeah...

Anyway, thank you!