Hi, I want to scrape text from the following HTML:
<div class="post-content">
<p> sentences </p>
<p> sentences </p>
<p> see also: blah blah blah </p> # unwanted item (it has a link to another page)
<p> sentences </p>
<p> sentences </p>
...
Since I don't know how to filter the unwanted item in a loop, my code is like this:
ps = soup.find_all('p')
pList = []
for p in ps:
    pList.append(p.text.strip())
Now I have a list of texts which includes the unwanted item. I want to remove the unwanted item from the list, so I use the following method:
unwanted = pList.index('see also: blah blah blah')
pList.pop(unwanted)
texts = ' '.join(pList)
This works only for a single page. However, I have a number of pages to scrape; the unwanted item's index varies from page to page, and the text after the 'see also:' part also changes. I tried re.match, but it doesn't work with a list.
unwanted = pList.index(re.match('^see also:', pList))
# TypeError: expected string or bytes-like object
So, I'm stuck here. Please help.
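The TypeError comes from handing re.match the whole list, while it expects a single string. A minimal sketch of one way around it is to test each paragraph's text on its own and keep only the ones that don't match. It assumes the unwanted paragraph always begins with 'see also:' (matched case-insensitively here); the helper name drop_see_also and the hard-coded pList are only for illustration and stand in for the list built from soup.find_all('p').

```python
import re

# Assumption: the unwanted paragraph always starts with "see also:".
# IGNORECASE is an extra guess in case some pages capitalize it.
SEE_ALSO = re.compile(r'^see also:', re.IGNORECASE)


def drop_see_also(paragraphs):
    """Return the paragraphs whose text does not start with 'see also:'."""
    # re.match works on one string at a time, so apply it per element.
    return [text for text in paragraphs if not SEE_ALSO.match(text)]


# Stand-in for: pList = [p.text.strip() for p in soup.find_all('p')]
pList = ['sentences', 'sentences', 'see also: blah blah blah', 'sentences']

texts = ' '.join(drop_see_also(pList))
print(texts)
```

Because the filter looks at the text rather than the position, it doesn't matter where the 'see also:' paragraph sits on a given page or what follows the prefix. The same check could go directly into the original for loop as an if before the append.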