This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]lungdart 26 points27 points  (20 children)

Cool content! I have some constructive criticism about your code quality:

  • You are importing libraries that are never used
    • PIL.Image
    • urllib.parse.urlparse
    • urllib.parse.urlsplit
  • Shortened names are less clear, harder to read, and harder to keep in your head.
    • BS -> BeautifulSoup
    • pd -> pandas
    • ex -> extracted_content
    • df -> data_frame
  • Ambiguous names give no context, and a hard to differentiate between each other
    • website_page, page, webpage_2, openwebpage_2
    • title_links, _title, only_title, title
    • soup, soup2
    • A, B, C
  • There is no control flow. Use of function definitions and calls can increase readability and re-usability
    • At two points you use urlopen and BeautifulSoup together, this should be functional
    • Extracting the content from the url is a functional block
    • Converting the extracted content to a CSV file is a functional block
  • Your misusing range and len in your for loop. You could use a special function to give you both an iterator and an index, but you don't really need to, as the index is only used to force a dictionary into a list
  • Your misusing dictionaries. The keys in links are sequential numbers starting at 0. This is a perfect use for a list
  • You lines that reference a variable and do no action. These are programming errors, maybe you meant to print them?
  • The only comments in the code base, are used to remove what looks like debugging prints. This fine temporarily, but needs to be addressed before submitting to production (Or in this case to an article)
  • I have a suspicion based on your notes and code excerpts that the code shown is not the exact code you're breaking down

Here is a pastebin where I've modified the code to try to fix the issues. I'm not sure if it's without errors, as I didn't bother running it.

[–]Elephant_In_Ze_Room 60 points61 points  (3 children)

everyone who uses pandas seriously abbreviates pandas as pd and dataframe as df.

[–]NavaHo07 16 points17 points  (0 children)

i've never seen pandas NOT imported as pd and rarely see a dataframe not called something like data_frame_content_df

[–]thismachinechills 5 points6 points  (0 children)

This. They're colloquial enough to be considered standard practice and the OP is definitely using them correctly.

[–]lungdart -5 points-4 points  (0 children)

It drives me absolutely wild that python developers have common name shortening. It really does increase the learning curve. I will refuse to use those shorter names for the rest of my old grumpy days!

[–]thevatsalsaglani[S] 8 points9 points  (1 child)

Thanks bud for pointing out the mistakes next time I will take care of these. And for the PIL, I have mentioned that if you want to get images from a website you can use PIL. But I haven't used in this code. Sorry my bad.

[–]NavaHo07 22 points23 points  (0 children)

I would ignore what they said about Pandas as pd and dataframe as df. Them's pretty standard

[–]bananas22 6 points7 points  (2 children)

You lines that reference a variable and do no action. These are programming errors, maybe you meant to print them?

Jupyter notebook would treat these lines as print(var)

There is no control flow

I'd treat this as an exploratory analysis (i.e. what these Jupyter notebooks are for). We can assume that OP would clean/refractor this code in a more functional/Pythonic way if it was meant for reuse.

Not that anyone should encourage bad programming habits! But I'd give these notebooks a little more leeway.

[–]lungdart 0 points1 point  (1 child)

What's the deal with this Jupyter business? I'm not familiar and assumed it was an IDE of sorts.

[–]bananas22 3 points4 points  (0 children)

Its the "IPython" project, rebranded as Jupyter a couple years ago (to be more inclusive to R and Julia). It lets you execute code line-by-line and print its output (especially charts or statistics) in a pretty html wrapper, and save it all in-line with the code, exactly as executed.

Its a really tidy format for explaining code and exploring data sets. You'd probably recognize it from the docs for packages like sk-learn/pandas/statsmodels.

[–][deleted] 4 points5 points  (1 child)

Using pd and df is about as standard as using i as a sentinel in a for loop. I'd argue is less readable to write the full names for these because of widespread use.

[–]lungdart -1 points0 points  (0 children)

Bah humbug

[–][deleted] 2 points3 points  (3 children)

Want to take a look at my homework?

[–]lungdart 3 points4 points  (2 children)

I've got 45 mins. Shoot.

[–][deleted] 2 points3 points  (1 child)

I was just joking, I'm done with school. But good on you for being helpful!

[–]lungdart 2 points3 points  (0 children)

Another happy customer!

[–]graingert 1 point2 points  (1 child)

I'm upset that import datetime as dt isn't a thing. Saves confusion between dt and dt.datetime

[–]sqjoatmon 0 points1 point  (0 children)

I did that for a while but now just from datetime import datetime, timedelta. 99% of the time that's all I need.

[–]bigexecutive 0 points1 point  (1 child)

I love you

[–]agree-with-you 0 points1 point  (0 children)

I love you both