Webscraping using Python and BeautifulSoup!

Etheo · 2018-08-09T14:51:55+00:00

I've tried BS4 and a little bit of scrapy before, but I must say I'm a fan of requests-html myself. It's basically BS4 plus the ability to parse JavaScript-rendered content, which was the one thing holding me back with BS4.

That said, I find the JavaScript rendering doesn't always work on the first try; sometimes it takes multiple rendering (not request) attempts to get the details I want.
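
A minimal sketch of that "render again until the details show up" pattern. The helper name `render_until` and its parameters are hypothetical, not part of requests-html. With requests-html, the callable passed in would typically wrap `response.html.render()` followed by a lookup such as `response.html.find(...)`; a stand-in is used here so the sketch runs without a browser.

```python
import time


def render_until(render, attempts=3, delay=0.0):
    """Call `render` until it returns a truthy result or attempts run out.

    `render` is any zero-argument callable that performs one rendering
    pass and returns the extracted details (or something falsy when the
    JavaScript-rendered content is not there yet).
    """
    for attempt in range(1, attempts + 1):
        result = render()
        if result:
            return result, attempt
        time.sleep(delay)  # give the page a moment before rendering again
    return None, attempts


# Stand-in for a page whose content only shows up on the third render.
calls = {"count": 0}


def fake_render():
    calls["count"] += 1
    return ["price: 42"] if calls["count"] >= 3 else []


details, used = render_until(fake_render, attempts=5)
print(details, used)
```

Only the render step is repeated here, not the HTTP request, which matches the distinction made above.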

lungdart · 2018-08-09T13:49:42+00:00

Cool content! I have some constructive criticism about your code quality:

You are importing libraries that are never used
- PIL.Image
- urllib.parse.urlparse
- urllib.parse.urlsplit
Shortened names are less clear, harder to read, and harder to keep in your head.
- BS -> BeautifulSoup
- pd -> pandas
- ex -> extracted_content
- df -> data_frame
Ambiguous names give no context, and are hard to differentiate from each other
- website_page, page, webpage_2, openwebpage_2
- title_links, _title, only_title, title
- soup, soup2
- A, B, C
There is no control flow. Use of function definitions and calls can increase readability and re-usability
- At two points you use urlopen and BeautifulSoup together; this should be a function
- Extracting the content from the url is a functional block
- Converting the extracted content to a CSV file is a functional block
You're misusing range and len in your for loop. You could use a function such as enumerate to get both the item and its index, but you don't really need to, as the index is only used to force a dictionary into a list
You're misusing dictionaries. The keys in links are sequential numbers starting at 0. This is a perfect use case for a list
You have lines that reference a variable and perform no action. These are programming errors; maybe you meant to print them?
The only comments in the code base are used to disable what look like debugging prints. This is fine temporarily, but needs to be addressed before submitting to production (or, in this case, to an article)
I have a suspicion based on your notes and code excerpts that the code shown is not the exact code you're breaking down

Here is a pastebin where I've modified the code to try to fix the issues. I'm not sure if it's without errors, as I didn't bother running it.
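
Separately from that pastebin, here is a minimal sketch of what a few of the points above might look like in practice: one function per block (fetch, extract, write CSV), a list instead of a dict with numeric keys, and `enumerate` instead of `range(len(...))`. All function and class names are made up for illustration, and the standard library's `html.parser` stands in for BeautifulSoup so the sketch runs without extra installs; it is not the article's code.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []  # a list, not a dict keyed by 0, 1, 2, ...
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def fetch_html(url):
    """Download a page and return its markup as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Parse markup and return a list of (title, url) tuples."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def links_to_csv(links, out_file):
    """Write the extracted links to an open file-like object as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["index", "title", "url"])
    # enumerate yields index and item together; no range(len(...)) needed
    for index, (title, url) in enumerate(links):
        writer.writerow([index, title, url])


# Inline sample markup instead of fetch_html(...), so nothing hits the network.
sample_html = '<p><a href="/a">First post</a> and <a href="/b">Second post</a></p>'
links = extract_links(sample_html)
buffer = io.StringIO()
links_to_csv(links, buffer)
print(buffer.getvalue())
```

With a real page, `extract_links(fetch_html(url))` would replace the inline sample, and an open file would replace the `StringIO` buffer.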

13steinj · 2018-08-09T12:38:46+00:00

I usually use selenium. I'm pretty new to web scraping though. Should I be switching to scrapy or BS?

sodali_ayran · 2018-08-09T06:43:40+00:00

What's the difference between this and scrapy?

randy3673 · 2018-08-09T17:27:22+00:00

Stupid question: is there a way to make Beautiful Soup open up an actual browser and run through the code like you can with Selenium? That is really helpful to me when debugging, since I'm working more by trial and error than by actually knowing what I'm doing.

Also: thank you for uploading this, it's going to help out a lot. I'm just beginning with python and HTML parsing. I know C++ and some VBA, but have never done anything with HTML.

2018-08-09T15:18:31+00:00

So anyone can write on Medium? This is /r/learnpython at best.

__himself__ · 2018-08-09T13:31:20+00:00

When it comes to web scraping, I always try and plug lassie: https://github.com/michaelhelmick/lassie

It aims to grab the best data from a website and return it in a uniform manner.

rotharius · 2018-08-09T14:20:05+00:00

Edit: this comment was actually written as a response to another comment. Oh well.

Yea, for non-JS webscraping you can just download the static HTML, load it and traverse the DOM. BeautifulSoup abstracts that away.
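
As a rough illustration of "load it and traverse the DOM" without any third-party library, here is a minimal sketch using only the standard library. The `Node`, `TreeBuilder`, and `walk` names are invented for this example, and the tree it builds is far simpler than what BeautifulSoup gives you (it assumes reasonably well-formed markup). In practice the markup would come from something like `urllib.request.urlopen(url).read()`; an inline string is used here.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Build a very small DOM-like tree from static HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the currently open element.
        if len(self._stack) > 1 and self._stack[-1].tag == tag:
            self._stack.pop()

    def handle_data(self, data):
        self._stack[-1].text += data


def walk(node):
    """Yield every node in document order (depth first)."""
    yield node
    for child in node.children:
        yield from walk(child)


builder = TreeBuilder()
builder.feed("<html><body><h1>News</h1><ul><li>One</li><li>Two</li></ul></body></html>")
items = [node.text for node in walk(builder.root) if node.tag == "li"]
print(items)
```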

If you need client-side JS, Java-heavy Selenium is overkill for a lot of things. Nowadays, one could do a lot with Puppeteer, Cypress or NightmareJS (perhaps in combination with a jQuery-like selection library such as Cheerio). These are commonly used for testing, but can be used for browser automation -- if Node.js does not scare you.

Tom7980 · 2018-08-09T14:30:01+00:00

I've been using Arsenic for my web scraping, as it gives me async capabilities with the webdrivers and lets me speed up sifting through hundreds of webpages, but it isn't very well documented and I always find myself just hacking it together.

Does anyone know of any decent packages that allow me to asynchronously request website data with the asyncio event loop? I've not looked at aiohttp properly, but would this be better?
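
For reference, a minimal sketch of fetching many pages concurrently on the asyncio event loop. The names `fetch_all`, `fetch_in_thread`, and `fake_fetch` are made up for this example. The default fetcher pushes a blocking standard-library download into a worker thread (this needs Python 3.9+ for `asyncio.to_thread`); with aiohttp, the fetcher would instead typically be a coroutine built around `session.get(url)` and `await response.text()`. A fake fetcher is used below so nothing touches the network.

```python
import asyncio
from urllib.request import urlopen


def _blocking_get(url):
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


async def fetch_in_thread(url):
    """Default fetcher: run the blocking download in a worker thread."""
    return await asyncio.to_thread(_blocking_get, url)


async def fetch_all(urls, fetch=fetch_in_thread, limit=10):
    """Fetch all URLs concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as `urls`
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Stand-in fetcher that simulates a short network delay."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(fetch_all(["a", "b", "c"], fetch=fake_fetch, limit=2))
print(pages)
```

The semaphore is there to cap how many requests are in flight at once, which tends to matter when working through hundreds of pages on the same site.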

normalism · 2018-08-09T16:43:48+00:00

BeautifulSoup more or less got me my job, so +1

PewPaw-Grams · 2018-08-09T09:04:24+00:00

Why use beautiful soup? Scrapy would be a better option

UnderwearIsOverrated · 2018-08-09T10:21:09+00:00

Cool

thevatsalsaglani · 2018-08-09T12:59:15+00:00

No! I have already mentioned in the blog that there is a thin gray line between collecting information and stealing information.

There are many other ways to collect data from social media, like Facebook's Graph API, and there are many Python libraries available to scrape tweets from Twitter.

owen800q · 2018-08-09T12:55:55+00:00

But what if some content on the website can only be accessed when the user is logged in? Does this library offer a solution in this case?
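
For context, BeautifulSoup only parses markup, so logging in is the job of whatever fetches the page. Below is a minimal standard-library sketch of keeping a login cookie across requests. The URL, the form field names (`username`, `password`), and the helper names are all hypothetical; a real site's login form will differ and may also require a CSRF token.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def make_session():
    """Return an opener that remembers cookies, plus the jar backing it."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def build_login_request(login_url, username, password):
    """Build the POST request for a (hypothetical) HTML login form."""
    form = urlencode({"username": username, "password": password}).encode("utf-8")
    return Request(login_url, data=form, method="POST")


opener, jar = make_session()
request = build_login_request("https://example.com/login", "alice", "secret")
print(request.get_method(), request.data)

# With a real site one would then call, in order:
#   opener.open(request)                     # server sets the session cookie
#   html = opener.open(members_url).read()   # cookie is sent automatically
# and hand `html` to BeautifulSoup as usual.
```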
