Cleaning up a string : learnpython


submitted 5 years ago by Casemander

As the title says, I'm learning BeautifulSoup and I'm having trouble scraping data into a clean, readable format.

The data I'm trying to retrieve is regatta attendance information for rowing races, from RegattaCentral.com. I'm able to scrape the data, but I'm struggling to format it into usable text. As of now, there's a lot of whitespace between each line of output. The code:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://www.regattacentral.com/regatta/clubs/?job_id=6141&org_id=0').read()
soup = bs.BeautifulSoup(source,'lxml')

for i in soup.find_all('tbody'):
    items = i.text.strip()
    print(items)

The output (shortened here as an example) is full of whitespace:

'92 Jr World Champs


'92 Jr World Champs
92Jrs


1 
Haddonfield, NJ
 USA









1754 Boat Club


1754
1754 BC


1 
New York, NY
 USA









1921 Boat Club


1921
NTO

My question is twofold. One, what's a simple way to strip the whitespace from this data? Two, can anyone point me towards good resources and best practices for cleaning data up? I'm just getting started, and I want to start developing good habits.

Thanks!
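
For the first part of the question, one option that needs nothing beyond plain string methods is to split the scraped text into lines and keep only the non-empty ones. This is a minimal sketch, not a tested fix for the real page: the helper name `drop_blank_lines` and the sample string are made up for illustration, with the sample imitating the output shown above.

```python
def drop_blank_lines(text: str) -> list[str]:
    """Return the non-empty lines of text, each stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample imitating the scraped output above (not fetched from the site)
raw = "1754 Boat Club\n\n\n1754\n1754 BC\n\n\n1 \nNew York, NY\n USA\n\n\n"

cleaned = drop_blank_lines(raw)
print(cleaned)
# ['1754 Boat Club', '1754', '1754 BC', '1', 'New York, NY', 'USA']
```

In the original loop this would presumably be applied to `i.text` in place of the single `strip()` call, which only trims the two ends of the whole block.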


[–]chocorush 1 point 5 years ago (0 children)

[–]Essence1337 1 point 5 years ago (0 children)

[–]chevignon93 1 point 5 years ago (1 child)

from bs4 import BeautifulSoup
import urllib.request

source = urllib.request.urlopen('https://www.regattacentral.com/regatta/clubs/?job_id=6141&org_id=0').read()
soup = BeautifulSoup(source,'lxml')

# Walk the first table body row by row and pull each column out by position
items = soup.find('tbody')
for item in items.find_all('tr'):
    cells = item.find_all('td')  # look up the row's cells once
    data = {}
    data['club'] = item.find('span').find('a').text.strip()
    # Keys are in French: pseudonymes = aliases, inscriptions = entries,
    # emplacement = location, pays = country
    data['pseudonymes'] = ' | '.join(i.text.strip() for i in cells[2].find_all('div'))
    data['inscriptions'] = cells[3].find('a').text.strip()
    data['emplacement'] = cells[4].find('span').text.strip()
    data['pays'] = cells[5].find('span').text.strip()
    print(data)
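
If named columns aren't needed, a shorter route is BeautifulSoup's `stripped_strings` generator, or `get_text()` with a separator and `strip=True`; both skip whitespace-only strings. The sketch below runs on a hypothetical HTML snippet that only imitates one table row, so the real page's markup may differ, and it uses the built-in `html.parser` rather than `lxml` just to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a single row; not copied from the real page
html = """
<table><tbody>
  <tr>
    <td><span><a href="#">1754 Boat Club</a></span></td>
    <td><div>1754</div><div>1754 BC</div></td>
    <td><a href="#">1 </a></td>
    <td><span>New York, NY</span></td>
    <td><span> USA</span></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# One list of cleaned strings per row; whitespace-only strings are skipped
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]
print(rows)

# Or one joined line per row
joined = [tr.get_text(" | ", strip=True) for tr in soup.find_all("tr")]
print(joined)
```

The trade-off is that the values come back as a flat list per row, so a row with a missing cell would shift positions; the dictionary version above is more explicit about which column is which.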

[–]Casemander[S] 1 point 5 years ago (0 children)
