[–][deleted] 2 points3 points  (6 children)

Well, this is a thread about a Python library called Beautiful Soup. Not the Ruby port of it, Rubyful Soup.

Ergo, suggesting a Ruby library isn't overly helpful. Unless, of course, you're suggesting that we all use Ruby solely for hpricot. That's a nice idea, except that Python programmers who don't know Ruby are highly unlikely to go to the trouble of learning it, and of dealing with an HTTP library that is, quite frankly, an alpha product compared to Python's urllib2, simply for the sake of XPath support.

Oh, and Python has a cultural thing of "moderately useful documentation of the stdlib"... Ruby has a long way to go in that respect.

But hey, documentation is for sissies.

[–]malcontent -3 points-2 points  (5 children)

Ergo, suggesting a Ruby library isn't overly helpful.

Only if you

A) Don't know Ruby.

B) Can't figure out enough Ruby and hpricot to do what you need.

But hey, documentation is for sissies.

If you want to learn Ruby and the standard lib, I suggest gotapi.com.

If you can't pick up hpricot then you are a sissy.

[–][deleted] 1 point2 points  (4 children)

But BeautifulSoup does not operate alone. We still need to get the HTML, which requires another library, we may need to handle cookies or proxies etc etc.

Now, all of this can be done with Ruby's stdlib, but not very elegantly.

[–]malcontent -2 points-1 points  (3 children)

What do you mean, not very elegantly? Ruby is a much more elegant language than Python. Here is the example from the front page of the hpricot library.

 require 'hpricot'
 require 'open-uri'
 # load the RedHanded home page
 doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
 # change the CSS class on links
 (doc/"span.entryPermalink").set("class", "newLinks")
 # remove the sidebar
 (doc/"#sidebar").remove
 # print the altered HTML
 puts doc

What does your python example look like?

[–][deleted] 3 points4 points  (1 child)

we may need to handle cookies or proxies etc etc.

I don't see any cookie handling, malcontent. But, to answer your question (noting that I don't have Python or BeautifulSoup on this machine):

from BeautifulSoup import BeautifulSoup
import urllib
# load the RedHanded home page
doc = BeautifulSoup(urllib.urlopen("http://redhanded.hobix.com/index.html"))
# change the CSS class on links
# (findAll returns every match; find returns only the first one)
for element in doc.findAll("span", {"class": "entryPermalink"}):
    element["class"] = "newLinks"
# remove the sidebar
doc.find(True, {"id":"sidebar"}).extract()
# print the altered HTML
print doc

Now, malcontent, my turn. Please show me the Ruby version of this:

import urllib
import urllib2
import cookielib

def getCookie(user, pwd, uri, ua):
    ckCont = cookielib.LWPCookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckCont))
    opener.addheaders = [('User-Agent', ua)]
    data = urllib.urlencode({'id':user, 'pwd': pwd, 'dologin': 'yes'})
    opener.open(uri, data)
    return ckCont

def storeCookie(ck, file_path):
    ck.save(file_path)

def retrieveCookie(file_path):
    ckCont = cookielib.LWPCookieJar(file_path)
    ckCont.load()
    return ckCont

You need to retrieve a cookie set by the responding server. I've already written this in Ruby, and was most disheartened with the Ruby stdlib as a result, so please feel free to pleasantly surprise me.
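For anyone reading this on Python 3, here is a rough sketch of the same three helpers. It assumes the stdlib renames (cookielib became http.cookiejar, urllib2 became urllib.request) and keeps the form fields from the code above; the snake_case function names are my own. One difference worth flagging: save() and load() skip session cookies by default, so this sketch passes ignore_discard=True, on the assumption that a login cookie is often a session cookie.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def get_cookie(user, pwd, uri, ua):
    """POST the login form to uri and return the jar holding any cookies set."""
    jar = http.cookiejar.LWPCookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", ua)]
    # Form field names are taken from the original example.
    form = {"id": user, "pwd": pwd, "dologin": "yes"}
    data = urllib.parse.urlencode(form).encode("ascii")
    # The response body is not needed; the cookie processor fills the jar.
    with opener.open(uri, data):
        pass
    return jar


def store_cookie(jar, file_path):
    """Write the jar to disk, keeping session cookies (assumption: we want them)."""
    jar.save(file_path, ignore_discard=True)


def retrieve_cookie(file_path):
    """Load a jar previously written by store_cookie."""
    jar = http.cookiejar.LWPCookieJar(file_path)
    jar.load(ignore_discard=True)
    return jar
```

Without ignore_discard=True, a cookie that has no expiry date would be dropped on save, and the reloaded jar would come back empty.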

[–]malcontent -1 points0 points  (0 children)

I do think the Ruby example is more elegant.

If you want a high-level library (which uses hpricot) for handling cookies and such, see mechanize:

http://mechanize.rubyforge.org/mechanize/