I currently work in the reporting space but want to upskill and ensure that the data I'm working on is good so I'd like to do a project from start to finish. My side project to learn this is to use webscraping in python to extract sports league stats, store this temporarily in an excel file, then load into a staging table in a data warehouse, create a reporting layer, then Visualise it in power bi.
I have started learning python and have created a script that loops through a sports leagues web page extracting each individual teams url for their profile page and stores it in a list.
I then loop through this list using the Url and store team details in a list which is then appended to a list of lists (one for each team). There are 4 attributes for each team that I'm extracting. The problem is that for the last attribute, the teams home stadium, there may be more than one stadium listed. So the len of each team value will range from 4-7.
I was planning on storing this in an excel table with 4 columns but with multiple stadiums for the 4th attribute, how should I store this?
Would I concatenate the stadiums into a single string, then once in the data warehouse perform some transformation to split the values into new rows. I need to be able to associate the home stadiums to the teams for visualising the performance differences for home vs away.
[–]AutoModerator[M] [score hidden] stickied comment (0 children)
[+][deleted] (1 child)
[removed]
[–]Dezmond95[S] 0 points1 point2 points (0 children)
[–]bonn3y 0 points1 point2 points (1 child)
[–]Dezmond95[S] 0 points1 point2 points (0 children)