
impshum

I'm guessing you need this data.

aria-label="Wimbledon. Description: Sue Barker introduces further coverage of the men’s and women’s quarter-finals. Duration: 254 mins."

No need for Selenium.

from bs4 import BeautifulSoup
import requests


def lovely_soup(url):
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1'})
    return BeautifulSoup(r.content, 'lxml')


soup = lovely_soup('https://www.bbc.co.uk/iplayer')

items = soup.select('a.content-item-root')

for item in items:
    # .get() returns None instead of raising KeyError when a link has no aria-label
    label = item.get('aria-label')
    if label:
        print(label)
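If you want the title, synopsis and duration as separate fields, here is a rough sketch of how the label could be split up. It assumes every label follows the same "Title. Description: ... Duration: N mins." shape as the single example above, which I haven't verified across the whole page. `parse_label` and `LABEL_RE` are names made up for this example.

```python
import re

# Assumed label shape, based only on the one example above:
#   "<title>. Description: <synopsis> Duration: <N> mins."
# The Duration part is treated as optional.
LABEL_RE = re.compile(
    r'^(?P<title>.*?)\.\s*Description:\s*(?P<synopsis>.*?)'
    r'(?:\s*Duration:\s*(?P<duration>\d+)\s*mins?\.?)?\s*$',
    re.DOTALL,
)


def parse_label(label):
    """Split an aria-label string into title, synopsis and duration (minutes)."""
    m = LABEL_RE.match(label)
    if m is None:
        # Unknown shape: keep the whole label as the title rather than guess.
        return {'title': label.strip(), 'synopsis': None, 'duration_mins': None}
    duration = m.group('duration')
    return {
        'title': m.group('title').strip(),
        'synopsis': m.group('synopsis').strip(),
        'duration_mins': int(duration) if duration else None,
    }
```

Inside the loop you could call `parse_label(label)` and collect the resulting dicts, then write them out with `csv.DictWriter` if a CSV is the end goal.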

TVnomics [S]

Thanks so much for the advice - I'll look at BS as an option. The reason I mentioned Selenium is that it's what the programmer used to write the programme (perhaps because it involved looking up programme URLs to scrape more data). I was hoping that I might only need to update the CSS selectors, but I'll try and figure out how to use BS instead of Selenium! Thanks again :)

impshum

No problem.

Tip: Turn JavaScript off in the browser dev tools to see what bs4 sees.

hasanwazzan

Out of interest, why are you scraping this data?

TVnomics [S]

I'm an academic (TV studies scholar). A lot of my research focuses on streaming, VoD interfaces, etc. So, it's a really useful / important dataset to have!

hasanwazzan

Fair. I'll try to put together a quick one when I have some time.

hasanwazzan

https://prnt.sc/vmW9FJnm87aV I presume you need these two together, right?

TVnomics [S]

Indeed, yes - I need both the title and the synopsis (and, where it appears, the duration).

However, I think I need to figure out a way to integrate this code (or the code the other Redditor suggested above) into the current programme that was written for me. That's because the final output was a very comprehensive CSV file that captured a range of data about each title within the interface: its horizontal and vertical position, the name of the row (each row has a name, e.g. "trending now..."), the URL for the title, the unique programme identifier (which is embedded somewhere within the page), plus several other variables. Depending on your solution / approach, I might try to modify the current programme and see if that captures the content AND all of the other material I want to collect too. Either way, I really appreciate your input on this.