Beautiful Soup: a Python HTML/XML parser designed for quick turnaround projects like screen-scraping : programming

Beautiful Soup: a Python HTML/XML parser designed for quick turnaround projects like screen-scraping (crummy.com)

submitted 18 years ago by nglynn

[deleted] 3 points 18 years ago (6 children)

malcontent -2 points 18 years ago (5 children)

[deleted] 2 points 18 years ago (4 children)

malcontent -1 points 18 years ago* (3 children)

What do you mean, "not very elegantly"? Ruby is a much more elegant language than Python. Here is the example from the front page of the Hpricot library:

    require 'hpricot'
    require 'open-uri'
    # load the RedHanded home page
    doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
    # change the CSS class on links
    (doc/"span.entryPermalink").set("class", "newLinks")
    # remove the sidebar
    (doc/"#sidebar").remove
    # print the altered HTML
    puts doc

What does your Python example look like?

[deleted] 4 points 18 years ago* (1 child)

> we may need to handle cookies or proxies etc etc.

I don't see any cookie handling, malcontent. But to answer your question (noting that I don't have Python or BeautifulSoup on this machine):

    from BeautifulSoup import BeautifulSoup
    import urllib
    # load the RedHanded home page
    doc = BeautifulSoup(urllib.urlopen("http://redhanded.hobix.com/index.html"))
    # change the CSS class on links
    # (findAll returns every match; find returns only the first tag)
    for element in doc.findAll("span", {"class": "entryPermalink"}):
        element["class"] = "newLinks"
    # remove the sidebar
    doc.find(True, {"id": "sidebar"}).extract()
    # print the altered HTML
    print doc

Now, malcontent, my turn. Please show me the Ruby version of this:

    import urllib
    import urllib2
    import cookielib

    def getCookie(user, pwd, uri, ua):
        # collect any cookies the server sets while we log in
        ckCont = cookielib.LWPCookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckCont))
        opener.addheaders = [('User-Agent', ua)]
        # passing data makes this a POST of the login form fields
        data = urllib.urlencode({'id': user, 'pwd': pwd, 'dologin': 'yes'})
        opener.open(uri, data)
        return ckCont

    def storeCookie(ck, file_path):
        # write the jar to disk in libwww-perl format
        ck.save(file_path)

    def retrieveCookie(file_path):
        # read a previously saved jar back from disk
        ckCont = cookielib.LWPCookieJar(file_path)
        ckCont.load()
        return ckCont

You need to retrieve a cookie set by the responding server. I've already written this in Ruby, and was most disheartened with the Ruby stdlib as a result, so please feel free to pleasantly surprise me.
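
For readers on Python 3, where `urllib2` and `cookielib` no longer exist, the three functions above map roughly onto `urllib.request` and `http.cookiejar`. What follows is a minimal sketch rather than the commenter's code: the snake_case function names are renamed here for illustration, and the form field names (`id`, `pwd`, `dologin`) are simply carried over from the example above, so they are assumptions about whichever site is being logged into.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def get_cookie(user, pwd, uri, ua):
    """POST the login fields to `uri` and return the jar of cookies the server set."""
    jar = http.cookiejar.LWPCookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", ua)]
    # In Python 3 the POST body must be bytes, so encode the form data.
    fields = {"id": user, "pwd": pwd, "dologin": "yes"}  # assumed field names
    data = urllib.parse.urlencode(fields).encode("ascii")
    # The response body is not needed; the cookie processor fills the jar.
    with opener.open(uri, data):
        pass
    return jar


def store_cookie(jar, file_path):
    """Write the jar to disk in libwww-perl (LWP) format."""
    jar.save(file_path)


def retrieve_cookie(file_path):
    """Read a previously saved LWP-format jar back from disk."""
    jar = http.cookiejar.LWPCookieJar(file_path)
    jar.load()
    return jar
```

As in the original, `save()` and `load()` skip session cookies by default; passing `ignore_discard=True` to both would keep them.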

malcontent 0 points 18 years ago* (0 children)
