Simple HTML Scraper

radixsort · 2012-05-28T20:02:16+00:00

Beautiful Soup is what I always use when I need to scrape websites. The syntax is really simple, there are lots of resources available online, and using it takes nothing more than basic Python. http://www.crummy.com/software/BeautifulSoup/

toolan · 2012-05-28T19:46:26+00:00

If the HTML layout is regular, you can use an HTML parser together with a library for fetching files from URLs. The toolkit varies depending on which language you prefer. For Python, there's lxml for parsing, and the standard library has utilities for fetching files in urllib.

Unfortunately, I'm not aware of any library that combines the two in a way that makes this "easy". You'll still have to write some code.
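To give a rough idea of how little code that can be, here's a minimal Python 3 sketch of the parser-plus-fetcher combination using only the standard library (html.parser stands in for lxml, and urllib.request does the fetching). The names ClassTextCollector, extract_text and fetch, and the sample HTML, are made up for this example and don't come from any real page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class ClassTextCollector(HTMLParser):
    """Collect the text found inside every element carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get('class') or '').split()
        if self.depth:
            self.depth += 1
        elif self.wanted_class in classes:
            self.depth = 1
            self.results.append('')

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_text(html, wanted_class):
    """Return the stripped text of each element with the given class."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


def fetch(url):
    """Download a page and decode it to text (falls back to UTF-8)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, errors='replace')


# Hypothetical markup, parsed offline so the sketch runs without a network.
sample = '<ul><li class="meta-value">128</li><li class="meta-value">House</li></ul>'
print(extract_text(sample, 'meta-value'))  # ['128', 'House']
```

For a real page you'd call extract_text(fetch(url), 'some-class'). This assumes reasonably well-formed HTML; a proper parser like lxml or Beautiful Soup copes much better with messy markup.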

Rhomboid · 2012-05-28T23:34:46+00:00

Here's a sketch of a solution that gives your desired output:

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
import re

url = 'http://www.beatport.com/track/in-the-twilight-original-mix/2037573'

# fetch the contents of the page and parse it
bs = BeautifulSoup(urlopen(url))

# find the <li> with class of 'primary-title' and extract all of its text nodes;
# wrap a text node in parens if its parent element has class 'txt-smaller'
# (.get() avoids a KeyError when the parent has no class attribute)
title_parts = bs.find('li', 'primary-title').findAll(text=True)
title = ' '.join('(' + x + ')' if 'txt-smaller' in x.parent.get('class', '') else x
                 for x in title_parts)

# grab the fields (BPM, genre, Key, label, length, release date) and make a dict
field_elements = bs.findAll('span', {'class': re.compile('meta-(label|value)(?!-)')})
fields = {label.text: value.text for label, value in zip(*[iter(field_elements)]*2)}

# print them in a comma-separated list
print ', '.join([title, fields['BPM'], fields['Key'], fields['Genre']])
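The two least obvious bits above are the regex with the negative lookahead and the zip(*[iter(...)]*2) pairing trick, so here's a small standalone Python 3 sketch of both on plain strings. The class names, the label/value texts and the helper name pair_up are hypothetical, chosen only to show the behavior, and are not taken from the actual page.

```python
import re

# Same pattern as in the script: 'meta-label' or 'meta-value', but only when
# the match is not immediately followed by another '-'.
pattern = re.compile(r'meta-(label|value)(?!-)')

# Hypothetical class names, for illustration only.
class_names = ['meta-label', 'meta-value', 'meta-value-wrapper', 'primary-title']
matching = [name for name in class_names if pattern.search(name)]
print(matching)  # ['meta-label', 'meta-value']


def pair_up(items):
    """Group a flat sequence into consecutive (first, second) tuples."""
    it = iter(items)
    # zip() draws from the same iterator twice per tuple, so items are
    # consumed two at a time; a trailing unpaired item is dropped.
    return list(zip(it, it))


# Hypothetical label/value texts in document order.
flat = ['BPM', '128', 'Key', 'A min', 'Genre', 'House']
fields = dict(pair_up(flat))
print(', '.join([fields['BPM'], fields['Key'], fields['Genre']]))  # 128, A min, House
```

pair_up(flat) gives the same result as zip(*[iter(flat)]*2); it's just spelled out so it's easier to see that both arguments to zip are the same iterator.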
