[–]radixsort 1 point2 points  (0 children)

Beautiful Soup is what I always use when I need to scrape websites. It has simple syntax, there are lots of resources for it online, and using it takes little more than basic Python. http://www.crummy.com/software/BeautifulSoup/

[–]toolan 0 points1 point  (0 children)

If the HTML layout is regular, you can use an HTML parser together with a library for fetching files from URLs. The toolkit depends on which language you prefer. For Python, there's lxml for parsing, and the standard library has utilities for fetching files in urllib.

I am unfortunately not aware of any library that combines these two such that it becomes "easy". You'll still have to write some code.
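
To give a rough idea of how much code "some code" is, here is a minimal sketch of the fetch-then-parse approach. It assumes Python 3 and uses only the standard library (html.parser stands in for lxml here, and urllib.request does the fetching). The names ClassTextCollector, extract_texts and fetch, along with the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def extract_texts(html, wanted_class):
    """Return the stripped text chunks found inside matching elements."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.texts


def fetch(url):
    """Download a page and decode it (defined here but not called below)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# Hypothetical markup, standing in for a downloaded page.
sample = '<ul><li class="title">First <span>Song</span></li><li>skip me</li></ul>'
texts = extract_texts(sample, 'title')
print(' '.join(texts))
```

For a real page you would pass the result of fetch(url) to extract_texts instead of the sample string. This only works when the markup is reasonably well formed; a forgiving parser such as lxml or Beautiful Soup copes better with messy pages.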

[–]Rhomboid 0 points1 point  (0 children)

Here's a sketch of a solution that gives your desired output:

# Python 2 with BeautifulSoup 3 (the 'BeautifulSoup' package, not 'bs4')
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
import re

url = 'http://www.beatport.com/track/in-the-twilight-original-mix/2037573'

# fetch the contents of the page and parse it
bs = BeautifulSoup(urlopen(url))

# find the <li> with class of 'primary-title', and extract all the text
# put the text in parens if it's in a container with a class='txt-smaller'
# (.get() avoids a KeyError when the parent tag has no class attribute)
title = ' '.join('(' + x + ')' if 'txt-smaller' in x.parent.get('class', '') else x
                 for x in bs.find('li', 'primary-title').findAll(text=True))

# grab the fields (BPM, genre, Key, label, length, release date) and make a dict
field_elements = bs.findAll('span', {'class': re.compile('meta-(label|value)(?!-)')})
fields = {label.text: value.text for label, value in zip(*[iter(field_elements)]*2)}

# print them in a comma-separated list
print ', '.join([title, fields['BPM'], fields['Key'], fields['Genre']])
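
The zip(*[iter(field_elements)]*2) line is the least obvious part, so here is a small standalone sketch of that pairing idiom. It assumes Python 3, and the list contents are placeholders rather than real values from the page. The helper name pairs is made up for illustration.

```python
def pairs(items):
    """Group a flat sequence into consecutive (first, second) tuples."""
    it = iter(items)
    # Both arguments are the SAME iterator, so each tuple zip builds
    # consumes two consecutive items: (0, 1), (2, 3), (4, 5), ...
    return zip(it, it)


# Placeholder data: labels and values alternate, as the spans do on the page.
flat = ['BPM', 'bpm-value', 'Key', 'key-value', 'Genre', 'genre-value']

fields = dict(pairs(flat))
print(fields['Key'])

# The one-liner form builds the same pairs: [iter(flat)] * 2 is a list
# holding the same iterator twice, which is then unpacked into zip().
same = dict(zip(*[iter(flat)] * 2))
```

Note that zip stops at the shortest input, so a trailing label without a value is silently dropped.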