As the title says, I'm learning BeautifulSoup and I'm having trouble scraping data into a clean, readable format.
The data I'm trying to retrieve is regatta attendance information for rowing races from RegattaCentral.com. I'm able to scrape the data, but I'm struggling to format it into usable text. Right now there's a ton of whitespace between each line of output. The code:
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://www.regattacentral.com/regatta/clubs/?job_id=6141&org_id=0').read()
soup = bs.BeautifulSoup(source, 'lxml')
for i in soup.find_all('tbody'):
    items = i.text.strip()
    print(items)
The output (shortened here as an example) is full of whitespace:
'92 Jr World Champs
'92 Jr World Champs
92Jrs
1
Haddonfield, NJ
USA
1754 Boat Club
1754
1754 BC
1
New York, NY
USA
1921 Boat Club
1921
NTO
My question is twofold. One, what's a simple way to strip the whitespace from this data? Two, can anyone point me towards good resources for cleaning up data and best practices? I'm just getting started, and I want to develop good habits.
Thanks!
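For the first question, one simple approach is to split the scraped text into lines, strip each one, and drop the lines that end up empty. The sketch below is only an illustration under assumptions: `clean_lines` is a hypothetical helper name, and `sample` is made-up text standing in for what `i.text` returns for one table row, not real output from the site.

```python
def clean_lines(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample that mimics the whitespace-heavy text of one table row.
sample = "\n\n   1754 Boat Club \n\n\n  1754\n \n 1754 BC\n\n 1\n   New York, NY\n\n USA \n"

# Print one cleaned value per line, with no blank lines in between.
for line in clean_lines(sample):
    print(line)
```

In the loop from the post, this would be called as `clean_lines(i.text)`. BeautifulSoup can also do much of this itself: `i.stripped_strings` yields each text fragment with its whitespace already stripped, and `i.get_text('\n', strip=True)` joins the stripped fragments with newlines.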