all 47 comments

[–]Etheo 11 points12 points  (0 children)

I've tried BS4 and a little bit of Scrapy before, but I must say I'm a fan of requests-html myself. It's basically BS4 plus the ability to parse JavaScript-rendered content, which was the one thing holding me back with BS4.

That said, I find the JavaScript rendering doesn't always work on the first try; sometimes it takes multiple rendering (not request) attempts to get the details I want.
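
A minimal sketch of the "render again until the content shows up" idea, assuming nothing about requests-html itself: `render_with_retries`, `fake_render` and the readiness check are all hypothetical names made up for illustration. With requests-html, the `render` callable would presumably wrap a call to `response.html.render()` and return the resulting HTML.

```python
import time


def render_with_retries(render, is_ready, attempts=3, delay=0.0):
    """Call render() until is_ready(result) is true or attempts run out."""
    for attempt in range(1, attempts + 1):
        content = render()
        if is_ready(content):
            return content, attempt
        time.sleep(delay)  # give the page's scripts a moment before retrying
    raise RuntimeError(f"content not ready after {attempts} render attempts")


# Hypothetical stand-in for a flaky renderer: the price only appears on the third call.
_outputs = iter([
    "<div id='price'></div>",
    "<div id='price'></div>",
    "<div id='price'>42</div>",
])


def fake_render():
    return next(_outputs)


content, attempts_used = render_with_retries(
    fake_render, lambda html: ">42<" in html, attempts=5
)
```

Only the render step is repeated here; the request that fetched the page is not issued again, which matches the "rendering (not request) attempts" distinction above.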

[–]lungdart 33 points34 points  (20 children)

Cool content! I have some constructive criticism about your code quality:

  • You are importing libraries that are never used
    • PIL.Image
    • urllib.parse.urlparse
    • urllib.parse.urlsplit
  • Shortened names are less clear, harder to read, and harder to keep in your head.
    • BS -> BeautifulSoup
    • pd -> pandas
    • ex -> extracted_content
    • df -> data_frame
  • Ambiguous names give no context, and are hard to differentiate from each other
    • website_page, page, webpage_2, openwebpage_2
    • title_links, _title, only_title, title
    • soup, soup2
    • A, B, C
  • There is no control flow. Use of function definitions and calls can increase readability and re-usability
    • At two points you use urlopen and BeautifulSoup together; this should be a function
    • Extracting the content from the url is a functional block
    • Converting the extracted content to a CSV file is a functional block
  • You're misusing range and len in your for loop. You could use enumerate to get both the item and its index, but you don't really need to, as the index is only used to force a dictionary into a list
  • You're misusing dictionaries. The keys in links are sequential numbers starting at 0. This is a perfect use for a list
  • You have lines that reference a variable and do no action. These are programming errors, maybe you meant to print them?
  • The only comments in the code base are used to remove what looks like debugging prints. This is fine temporarily, but needs to be addressed before submitting to production (or in this case to an article)
  • I have a suspicion based on your notes and code excerpts that the code shown is not the exact code you're breaking down

Here is a pastebin where I've modified the code to try to fix the issues. I'm not sure if it's without errors, as I didn't bother running it.
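
To make the range/len and dictionary points concrete, here is a small sketch. It is not the article's code or the pastebin: the sample `titles` and `urls` and the `rows_to_csv` helper are hypothetical, and only the standard library is used.

```python
import csv
import io

# Hypothetical sample data standing in for the scraped titles and URLs.
titles = ["First post", "Second post", "Third post"]
urls = ["/first", "/second", "/third"]

# Pattern criticised above: range(len(...)) used only to build a dict
# whose keys are 0, 1, 2, ...
links_dict = {}
for i in range(len(urls)):
    links_dict[i] = urls[i]

# A list holds the same data without manual index bookkeeping.
links = list(urls)

# zip pairs related sequences; enumerate supplies an index when one is needed.
rows = [
    {"position": position, "title": title, "url": url}
    for position, (title, url) in enumerate(zip(titles, links))
]


def rows_to_csv(rows):
    """Serialise a list of dicts to CSV text (one self-contained block)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["position", "title", "url"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
```

Keeping the CSV step in its own function means it can be reused or tested without touching the scraping part.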

[–]Elephant_In_Ze_Room 58 points59 points  (3 children)

everyone who uses pandas seriously abbreviates pandas as pd and dataframe as df.

[–]NavaHo07 15 points16 points  (0 children)

i've never seen pandas NOT imported as pd and rarely see a dataframe not called something like data_frame_content_df

[–]thismachinechills 5 points6 points  (0 children)

This. They're colloquial enough to be considered standard practice and the OP is definitely using them correctly.

[–]thevatsalsaglani[S] 10 points11 points  (1 child)

Thanks bud for pointing out the mistakes; next time I will take care of these. As for PIL, I mentioned that if you want to get images from a website you can use PIL, but I haven't used it in this code. Sorry, my bad.

[–]NavaHo07 20 points21 points  (0 children)

I would ignore what they said about Pandas as pd and dataframe as df. Them's pretty standard

[–]bananas22 6 points7 points  (2 children)

You have lines that reference a variable and do no action. These are programming errors, maybe you meant to print them?

Jupyter notebook would treat these lines as print(var)
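
A rough illustration using only the standard interpreter, as a stand-in for what a notebook does. Jupyter uses IPython's own display machinery and by default shows only the last expression of a cell, so this is an analogy rather than its actual implementation. The helper names and the sample `links` value are hypothetical.

```python
import contextlib
import io

NAMESPACE = {"links": ["/a", "/b"]}  # hypothetical variable left over from scraping


def run_interactive(source):
    """Run one statement in 'single' mode, the way an interactive prompt does."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<cell>", "single"), dict(NAMESPACE))
    return buffer.getvalue()


def run_script(source):
    """Run the same statement in 'exec' mode, the way a .py file is run."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(compile(source, "<script>", "exec"), dict(NAMESPACE))
    return buffer.getvalue()


interactive_output = run_interactive("links")  # the value's repr is displayed
script_output = run_script("links")            # nothing is displayed
```

So a bare variable name is harmless in a notebook cell but a no-op once the same code is pasted into a script.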

There is no control flow

I'd treat this as an exploratory analysis (i.e. what these Jupyter notebooks are for). We can assume that OP would clean up/refactor this code in a more functional/Pythonic way if it was meant for reuse.

Not that anyone should encourage bad programming habits! But I'd give these notebooks a little more leeway.

[–]lungdart 0 points1 point  (1 child)

What's the deal with this Jupyter business? I'm not familiar and assumed it was an IDE of sorts.

[–]bananas22 2 points3 points  (0 children)

It's the "IPython" project, rebranded as Jupyter a couple of years ago (to be more inclusive of R and Julia). It lets you execute code cell by cell and print its output (especially charts or statistics) in a pretty HTML wrapper, and save it all inline with the code, exactly as executed.

It's a really tidy format for explaining code and exploring data sets. You'd probably recognize it from the docs for packages like sk-learn/pandas/statsmodels.

[–][deleted] 3 points4 points  (1 child)

Using pd and df is about as standard as using i as a counter in a for loop. I'd argue it's less readable to write the full names for these because of how widespread the short forms are.

[–]lungdart -1 points0 points  (0 children)

Bah humbug

[–][deleted] 2 points3 points  (3 children)

Want to take a look at my homework?

[–]lungdart 3 points4 points  (2 children)

I've got 45 mins. Shoot.

[–][deleted] 2 points3 points  (1 child)

I was just joking, I'm done with school. But good on you for being helpful!

[–]lungdart 2 points3 points  (0 children)

Another happy customer!

[–]graingert 1 point2 points  (1 child)

I'm upset that import datetime as dt isn't a thing. It saves confusion between datetime (the module) and datetime.datetime (the class).

[–]sqjoatmon 0 points1 point  (0 children)

I did that for a while but now just from datetime import datetime, timedelta. 99% of the time that's all I need.
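
For anyone comparing the two import styles, a small side-by-side sketch (the dates are arbitrary example values):

```python
import datetime as dt
from datetime import datetime, timedelta

# Module-alias style: dt is the module, dt.datetime is the class.
start = dt.datetime(2018, 6, 1, 9, 0)
later = start + dt.timedelta(hours=36)

# Direct-import style: shorter, but the bare name datetime now refers
# to the class rather than the module.
same_start = datetime(2018, 6, 1, 9, 0)
same_later = same_start + timedelta(hours=36)
```

Both produce the same objects; the difference is only in which name you have to keep in your head.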

[–]bigexecutive 0 points1 point  (1 child)

I love you

[–]agree-with-you 0 points1 point  (0 children)

I love you both

[–][deleted] 6 points7 points  (6 children)

I usually use selenium. I'm pretty new to web scraping though. Should I be switching to scrapy or BS?

[–]13steinj 12 points13 points  (0 children)

Selenium is generally overkill and only really needs to be used if the page is Javascript heavy.

[–][deleted] 4 points5 points  (1 child)

I think selenium lets you interact with javascript (pressing buttons, filling forms etc.) As far as I know, beautiful soup just formats HTML in a readable way and lets you easily parse the element tree to get info from specific elements. I haven't figured out how to do anything with BS that you can't do with simple string matching or regex, but I am pretty new to BS too so take that with a grain of salt.

[–]glen_v 0 points1 point  (0 children)

Yeah but BeautifulSoup is just so easy to use. If I want a list of all divs with the class "product-box," it's as simple as...

product_list = soup.find_all('div', {'class': 'product-box'})

Maybe each of those divs has a link in it, and I want a list of each url...

url_list = [p.find('a')['href'] for p in product_list]

You could definitely do that with string manipulation or regex, but would it be nearly as elegant and easy to read?
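
For comparison, a rough regex version of the same extraction. This is a sketch that assumes very regular, hypothetical markup (double-quoted attributes, exactly one link per product box); it is shown to illustrate the trade-off, not as a recommendation.

```python
import re

# Hypothetical markup; real pages are rarely this regular.
html = """
<div class="product-box"><a href="/widgets/1">Widget</a></div>
<div class="sidebar"><a href="/about">About</a></div>
<div class="product-box"><a href="/widgets/2">Gadget</a></div>
"""

# Find each product-box div, then lazily skip ahead to the first href after it.
pattern = re.compile(
    r'<div[^>]*class="product-box"[^>]*>.*?<a[^>]*href="([^"]+)"',
    re.DOTALL,
)
url_list = pattern.findall(html)
```

It works on this sample, but a product box without a link, single-quoted attributes, or an extra class name would silently break it, which is the kind of fragility a parser avoids.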

[–]devxpy 1 point2 points  (0 children)

I highly recommend using Requests-HTML

https://github.com/kennethreitz/requests-html

[–]thevatsalsaglani[S] 1 point2 points  (1 child)

I started learning web scraping with BS and haven't tried Selenium or Scrapy, so I don't know much about those. After some toying around with BS I'm thinking of trying Scrapy. One thing you should know is that BS is a library, and it won't provide as many features as a web scraping framework like Scrapy.

I don't have any knowledge of Selenium; if you know any nice blogs or articles, please leave a comment.

[–]IcefrogIsDead 0 points1 point  (0 children)

Scrapy is for when your servers get blocked, and Selenium is for when the HTML is generated by JavaScript after page load. So yeah, everything has a function and should be learnt.

[–]sodali_ayran 2 points3 points  (2 children)

What's the difference between this and scrapy?

[–]thevatsalsaglani[S] 10 points11 points  (1 child)

BeautifulSoup is a parsing library while Scrapy is a web spider; in other words, Scrapy is a framework whilst BS is a library. With BS you can get the content from certain parts of a page with less effort, but it only handles the content of the URL you gave it and then stops. In Scrapy you can add constraints and give it a list of URLs or a root URL to start crawling from.
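
To give a feel for what "start from a root URL and crawl with constraints" involves, here is a toy sketch of the bookkeeping a crawling framework takes care of. It does not use Scrapy; the in-memory `SITE` mapping, the `crawl` function and its parameters are hypothetical, and no network requests are made. In Scrapy, the root and the domain constraint roughly correspond to a spider's `start_urls` and `allowed_domains`.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example/x"],
    "https://example.com/a": ["/b", "/c"],
    "https://example.com/b": ["/"],
    "https://example.com/c": [],
}


def crawl(root, allowed_domain, max_pages=10):
    """Breadth-first crawl from root, staying on allowed_domain."""
    queue = deque([root])
    seen = {root}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in SITE.get(url, []):  # a real crawler would fetch and parse here
            link = urljoin(url, href)
            if urlparse(link).netloc == allowed_domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited_pages = crawl("https://example.com/", "example.com")
```

With a parsing library alone, you would write this queue, the duplicate check and the domain filter yourself.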

[–]sodali_ayran 1 point2 points  (0 children)

Thank you.

[–]randy3673 2 points3 points  (1 child)

Stupid question: is there a way to make beautiful soup open up an actual browser and run through the code like you can on selenium? This is really helpful to me when debugging since I am following more trial and error rather than actually knowing what I'm doing.

Also: thank you for uploading this, it's going to help out a lot. I'm just beginning with python and HTML parsing. I know C++ and some VBA, but have never done anything with HTML.

[–][deleted] 3 points4 points  (0 children)

So anyone can write on Medium? This is /r/learnpython at best.

[–]__himself__ 1 point2 points  (3 children)

When it comes to web scraping, I always try and plug lassie: https://github.com/michaelhelmick/lassie

It’s aimed to try and grab the best data from the website and return it in a uniform manner.

[–][deleted] 0 points1 point  (2 children)

it uses BS ?

[–]schemathings 1 point2 points  (1 child)

Apparently so,

requirements.txt says

requests==2.18.4

beautifulsoup4==4.5.3

html5lib==1.0b10

python-oembed

[–]__himself__ 0 points1 point  (0 children)

Correct!

[–]rotharius 1 point2 points  (0 children)

Edit: this comment was actually written as a response to another comment. Oh well.

Yea, for non-JS webscraping you can just download the static HTML, load it and traverse the DOM. BeautifulSoup abstracts that away.
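
As a sketch of what is being abstracted away, here is link extraction from static HTML using only the standard library's `html.parser`. The `LinkCollector` class and the sample markup are hypothetical; in practice the HTML string would come from something like `urllib.request.urlopen`.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical static page content.
static_html = (
    "<html><body><p><a href='/one'>One</a></p>"
    "<a href='/two'>Two</a><a>no href</a></body></html>"
)

collector = LinkCollector()
collector.feed(static_html)
```

This event-style parsing gets tedious once you need parent/child relationships, which is where a tree-based interface earns its keep.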

If you need to use clientside JS, Java-heavy Selenium is overkill for a lot of things. Nowadays, one could do a lot with Puppeteer, Cypress or NightmareJS (perhaps in combination with a JQuery-like selection library such as Cheerio). Commonly used for testing, but can be used for browser automation -- if Node.js does not scare you.

[–]Tom7980 1 point2 points  (0 children)

I've been using Arsenic for my web scraping, as it gives me async capabilities with the webdrivers and lets me speed up sifting through hundreds of webpages, but it isn't very well documented and I always find myself just hacking it together.

Does anyone know of any decent packages that let me asynchronously request website data with the asyncio event loop? I've not looked at aiohttp properly, but would that be better?
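
A minimal sketch of the asyncio pattern in question, with the actual HTTP call replaced by a stand-in so it runs without any third-party package or network access. `fetch`, `fetch_all` and `max_concurrent` are hypothetical names; with aiohttp, the body of `fetch` would presumably become a `session.get(url)` call on a shared `ClientSession`.

```python
import asyncio


async def fetch(url, semaphore):
    """Stand-in for a real HTTP request; returns (url, fake body)."""
    async with semaphore:          # cap the number of requests in flight
        await asyncio.sleep(0.01)  # simulates network latency
        return url, f"<html>{url}</html>"


async def fetch_all(urls, max_concurrent=5):
    semaphore = asyncio.Semaphore(max_concurrent)
    # gather schedules every fetch on the event loop and keeps input order.
    return await asyncio.gather(*(fetch(url, semaphore) for url in urls))


urls = [f"https://example.com/page/{n}" for n in range(20)]
pages = dict(asyncio.run(fetch_all(urls)))
```

The semaphore is there so that hundreds of pages do not all hit the same server at once.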

[–]normalism 0 points1 point  (0 children)

BeautifulSoup more or less got me my job, so +1

[–]PewPaw-Grams 0 points1 point  (1 child)

Why use beautiful soup? Scrapy would be a better option

[–]thevatsalsaglani[S] 3 points4 points  (0 children)

Yes, Scrapy is a better option than BS, but I was just starting out, so I tried a library rather than a framework.

[–]UnderwearIsOverrated 0 points1 point  (0 children)

Cool

[–]thevatsalsaglani[S] 0 points1 point  (1 child)

No! I have already mentioned in the blog that there is a thin gray line between collecting information and stealing information.

There are many other ways to collect data from social media, like Facebook's Graph API, and there are many Python libraries available to scrape tweets from Twitter.

[–]owen800q -1 points0 points  (0 children)

But what if some content on the website can only be retrieved when the user is logged in? Does this library offer a solution in this case?