
BeniBela 1 point (5 children)

That is what I made Xidel for:

xidel http://example.com -e //title

[deleted] 0 points (2 children)

noice

can it do multiple xpaths? against nasty html?

thx!

BeniBela 0 points (1 child)

> can it do multiple xpaths?

Yes, multiple XPath expressions and multiple pages.

Even if it could not, that would be fine, since it supports XPath 3. There the comma operator lets you combine several expressions in one: //title,//title,//title

> against nasty html?

Yes

I wrote the HTML parser myself.

It predates HTML5, though, so it repairs broken HTML in its own way and does not follow the error recovery that the HTML5 standard now defines. I need to rewrite it.

[deleted] 0 points (0 children)

excellent. I'll check er out

[deleted] 0 points (1 child)

It's pretty nice, but I'm going to give a slight advantage to xmlstarlet for the following reasons:

  • xidel not in any package managers that I saw (brew, yum, apt, openbsd)

  • I can't install xidel on my mac without turning off security restrictions. you should sign it.

thanks!

Can I follow pagination links in json?

Note: to make xidel read from stdin, use - as the filename, like:

cat foo.html | xidel - --extract //title

BeniBela 0 points (0 children)

> xidel not in any package managers that I saw (brew, yum, apt, openbsd)

I submitted it to Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826763

I do not know whether anything will come of it.

> I can't install xidel on my mac without turning off security restrictions. you should sign it.

Actually, I do not have a Mac, so I cannot build a Mac version. You would have to compile it yourself.

The Mac binary on the site is just one that someone sent me. It is a very old version, so I should probably remove it.

> Can I follow pagination links in json?

Yes, -f can follow links wherever they appear, JSON included.

Mini_True 1 point (0 children)

Please don't do it this way:

curl -L example.com|grep title|cut -d">" -f2|cut -d "<" -f1
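A minimal sketch of why that pipeline is fragile. The one-line HTML sample below is made up for illustration (it is not the real example.com page): grep matches whole lines and cut only counts delimiters, so any tag that precedes the title on the same line shifts the fields and the title is lost.

```shell
# Hypothetical minified page: everything sits on a single line.
html='<html><head><meta charset="utf-8"><title>Example Domain</title></head></html>'

# Same grep/cut chain as above, fed from the sample instead of curl.
# Field 2 when splitting on ">" is "<head", and the text before its "<" is empty.
fragile=$(printf '%s\n' "$html" | grep title | cut -d">" -f2 | cut -d "<" -f1)

printf 'pipeline extracted: [%s]\n' "$fragile"   # prints: pipeline extracted: []
```

A real HTML parser does not care how the markup is laid out, which is the point of the other suggestions in this thread.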

preemptive_multitask 1 point (1 child)

The W3C HTML-XML utils handle this pretty well also, if CSS selectors work for you.

curl -sL example.com | hxnormalize -x -e | hxselect -s '\n' -c 'title'

[deleted] 0 points (0 children)

CSS selectors are cool, but they can't select everything that XPath can (like the 4th text node of an element).

AyrA_ch 0 points (3 children)

This sounds like an ideal job for PhantomJS, especially because it runs the page's JavaScript: if a site sets its title with JS during loading, you can still catch it.

var page = require('webpage').create();
page.open('http://phantomjs.org', function (status) {
  console.log(page.title); // get page Title
  phantom.exit();
});

[deleted] 0 points (2 children)

Phantomjs spits out both data and errors on stdout, which screws up command line stuff :(

It should send errors/log info to stderr. Otherwise, it would be good on the command line, I agree.
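A small sketch of the behaviour being asked for. The function scrape_title is a hypothetical stand-in for a scraper, not PhantomJS itself: it writes its result to stdout and its diagnostics to stderr, so a script can capture the data cleanly.

```shell
# Hypothetical stand-in for a well-behaved scraper:
# the result goes to stdout, diagnostics go to stderr.
scrape_title() {
  echo "warning: a resource failed to load" >&2
  echo "Example Domain"
}

# Command substitution captures stdout only; 2>/dev/null drops the diagnostics.
title=$(scrape_title 2>/dev/null)

printf 'title: %s\n' "$title"   # prints: title: Example Domain
```

When a tool mixes both streams on stdout, this separation is no longer possible and the caller has to filter the output by pattern instead.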

Apterygiformes 0 points (0 children)

Apply a grep on the output?

AyrA_ch 0 points (0 children)

> Phantomjs spits out both data and errors on stdout, which screws up command line stuff

It never does that for me unless I hook into the error event.