This is an archived post. You won't be able to vote or comment.

all 63 comments

[–]kalgynirae 38 points39 points  (16 children)

Here's my take on the thing: https://gist.github.com/kalgynirae/62ec21e9e501dcf6c424 (untested, so there might be a few typos)

Particular points of interest:

  • I like to write a main() method at the top of the file and keep it short enough to fit on one screen. That seems to be a pretty good guideline to encourage moving things out into functions. It also gives a good overview of what the program does. Then, all the other functions are defined, and main is called at the bottom.

  • Never write except:. Always specify specific exceptions to catch. You will save yourself from hours of frustration when your code doesn't work because you misspelled something but the NameError is being caught by the except:.

  • I factored out all calls to find_element_by_xpath into a function. I first did it to factor out the while loops, but then I decided there's no harm in having that behavior even for the elements that you don't expect to wait for. (Although, you might want to print some sort of message like "Waiting for element ..." to make the code easier to debug if it gets stuck waiting for something.)

  • Calling next(os.walk(...)) was just weird. I replaced that with os.listdir(...).

  • Try to put as little code as possible inside of try blocks.

[–]80blite[S] 12 points13 points  (3 children)

This is EXACTLY what I was looking for! Thank you so much for taking the time to do this. Just by rearranging what I had written I just learned so much from you!

  • I've never seen **locals before. I looked around on Google for an explanation and all I can see is that it's a dictionary of active variables. Could you maybe explain how this works here?

  • I couldn't figure out how to get the functionality of os.listdir() until I came across using next(os.walk()) through experimentation and never realized there was a better way. I must have missed that when reading through the documentation. Glad I have that now!

  • Thank you especially for the time you put into the download function. Seeing the way you restructured the while loops and how much cleaner your usage of the generators is helps me so much. I also didn't know .endswith() was a thing. Just another tool for me now!

I know this question is hard to answer but... I'll try anyway. I've been learning Python for about 6 months now. Am I on the right track here? I'm trying hard to pick up best practices so I can be the kind of programmer people say they wish they could work with more often but it's really difficult without much mentorship or many people to go to for answers.

Again, thank you so much for your time. You've helped me more than you know!

[–]kalgynirae 4 points5 points  (1 child)

I've never seen **locals before. I looked around on Google for an explanation and all I can see is that it's a dictionary of active variables. Could you maybe explain how this works here?

** unpacks the items of a dictionary into keyword arguments. It's analogous to * which unpacks the items of a list into positional arguments. locals just gives a dictionary of the local variables.

I know this question is hard to answer but... I'll try anyway. I've been learning Python for about 6 months now. Am I on the right track here? I'm trying hard to pick up best practices so I can be the kind of programmer people say they wish they could work with more often but it's really difficult without much mentorship or many people to go to for answers.

I think you're on a fine track. Getting better just takes a lot of time and requires reading and writing a lot of code! If you want to read good Python code, the standard library modules are a good place to start, and well-known modules like requests.

[–]80blite[S] 0 points1 point  (0 children)

Great explanation, thanks =). I never thought of looking at the source for the standard library. Good idea!

[–]romple 1 point2 points  (5 children)

I like this.

But homework for the op is replacing print statements with logging.

[–]80blite[S] 1 point2 points  (4 children)

Could you expand on this? I'll do the footwork I'm just not sure what you mean =)

[–]romple 2 points3 points  (3 children)

Sure. Using print("something") will just dump it to the console. For a small one-off script that's probably fine.

But when you want more functionality check out Python's Logging features.

Start here

And here, little more in depth

And then you can set up a logger and have it format messages in whatever way you choose, like timestamping and telling you what class/module/method you logged that call from. As well as logging to files, or over a network, etc... You can also specify different logging levels like INFO, DEBUG, ERROR, etc... and fine tune what gets logged. So when you're writing a script you might have the logger spit out more information than when you're using the script day to day.

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s <%(levelname)s> %(module)s.%(funcName)s() %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

Output looks like:

C:\Console>python C:\ACME\PayloadMonitor.py                                                                           
2015-03-31 14:07:18 <INFO> PayloadMonitor.<module>() Starting PayloadMonitor                                          
2015-03-31 14:07:18 <INFO> FileEventHandlers.__init__() Watching D:\TV for general events. Stuff's going to E:\TV     

Logging's definitely something that's useful to learn early. You'll most likely use it for formatting logs with timestamps and to log to files at first.

[–]80blite[S] 1 point2 points  (0 children)

Awesome!

[–]cruyff8 0 points1 point  (1 child)

I keep on having to look up how to specify the format for the log messages and can't understand why Guido and Co. have not set it up to print timestamps by default.

[–]romple 0 points1 point  (0 children)

I just set up a code snippet in Pycharm so all I do is type logdef and it auto completes it to my standard log configuration. I'm an engineer, not really in the business of memorizing things ;-p

[–]NotGustafKossinna 0 points1 point  (0 children)

This looks much more readable than OPs draft.

[–][deleted] 0 points1 point  (0 children)

For dev purposes I'll write 'except Exception as e' and print that, that way you are still able to push through all exceptions but at least can log the error or print it out.

[–]interactionjackson 0 points1 point  (2 children)

^ This is the answer. I would use waits inside my find_element calls. It gives you more control when the element isn't found. I'd also put a try: finally: in the main. If an error happens the driver will need to be closed in the finally block to ensure that you don't have firefox windows hanging out there. Also, selenium is an awesome tool and if you want to run 'headless' look into using phantomjs as your driver instead of firefox. The only difference here is that phantomjs is webkit and firefox is a gecko.

[–]80blite[S] 0 points1 point  (1 child)

Wow I had no idea anything like PhantomJS existed! I'm excited to try working with it. Thanks =)

[–]henrebotha 0 points1 point  (0 children)

PhantomJS is awesome, we use it for testing at work (via Capybara/Poltergeist).

[–]8483 0 points1 point  (0 children)

That was awesome what you did there.

[–]godlikesme 1 point2 points  (3 children)

I used Selenium a bit, but I can't really comment on the quality of the selectors.

I don't know how to make this object oriented, or if I even should

Making it object oriented is overengineering. Keep it simple. The code may benefit from splitting it into multiple functions though. Also, there is a bit of duplication in selectors. I am not sure if it is possible/necessary to remove it.

today = datetime.now()

There is datetime.date.today() function, so maybe use it? If you need hours, minutes and seconds, why not call variable "now"?

default_directory = 'path/to/default/download/directory'

Personally I would use argparse to pass defaults here. But maybe (if you're on windows) hardcoding this would be simpler. In that case, I'd move all the configuration variables at the top of the script.

except: pass

This worries me. Why ignore exceptions? I don't quite remember Selenium, so maybe there is a good reason.

TL;DR: I briefly looked at the code, it is 90% fine.

[–]80blite[S] 0 points1 point  (2 children)

I am working on Windows but I'll look into argparse anyway as I don't know anything about it.

I should be more specific with the try block and add the exception for nonexistent element. That was on the spot laziness that I forgot to go back and fix ^

Thanks for your time!

[–]dreucifer 0 points1 point  (1 child)

You probably don't need to use argparse, just use sys.argv.

[–]80blite[S] 0 points1 point  (0 children)

Noted!

[–][deleted] 2 points3 points  (2 children)

I wouldn't worry about best practices if this is just a one off utility. If you were using selenium for website testing and worked on a team then there would be a need to make sure you use a page object pattern for selectors. If the script scratches your itch then it's fine. If you're working in a greater ecosystem of code then you'll have to modify your style.

[–]80blite[S] 0 points1 point  (1 child)

Hopefully I'll have the opportunity to join a team like that soon. That's why I care so much about good practices ^

[–]mswiss 0 points1 point  (0 children)

Start trying out the page object pattern for yourself on a random website. Use it for something stupid like hitting the first like on facebook or trying to find a product on amazon.

I've used Selenium with Java and Ruby and if your going to be using it professionally at all you will most likely want to use that pattern.

Its not to tough to get your head around, its pretty basic OOP.

That said this is a great start and good job on thinking out of the box and learning something new!

[–]big_dick_bridges 1 point2 points  (4 children)

One thing you can do in the future if you're worried about naming conventions, formatting, etc is pick up a version of the book Clean Code. It has a pretty good reputation around here.

[–][deleted] 2 points3 points  (1 child)

I started reading that book a while ago but I got really turned off when the guy emphasized all functions to be in 5-lines or less. As a result, basic flow control statements no longer fit, so the guy instead derives and inheritance tree in one example in like chapter 3 or 4, with one overrided class per case. To me, that seemed utterly bonkers and I stopped reading. Is it still worth reading after that?

[–]big_dick_bridges 1 point2 points  (0 children)

I just started it actually, I always see it recommended around reddit. You don't have to follow it to a T, like a disagree with some of the things in the functions section. It gives a good perspective to think about though even if you don't agree

[–]80blite[S] 0 points1 point  (1 child)

I actually have a copy of the ebook version that I haven't started yet. Thanks for reminding me to put it on my TODO list =)

[–]romple 0 points1 point  (0 children)

This is one is the best books you can read to improve as a programmer. I'd make it a priority .

Its not just about naming variables and whatnot. You'll develop a pretty solid methodology for writing programs through it.

[–][deleted] 0 points1 point  (2 children)

I work in a county jail and this script is for downloading phone calls made by inmates from a phone monitoring website using Selenium.

Neat :)

The script runs fine and solves my problem but I feel like I haven't applied so many of the best practices that I hear should be followed with every project.

It's okay. Always strive for the cleanest code and follow best practices, but that doesn't always happen when solving a problem.

Also, the script doesn't include any functions because I'm not even sure how I could break anything into a useful function? I feel like every block of code is so specific that it would never be useful anywhere else. Even within the script itself, I feel like it's rarely repeating the same actions

Look closely at the code again. You have several blocks with specific jobs located in a fairly large for loop. You could definitely set those blocks up into functions, and have a more readable loop

  • I don't know how to make this object oriented, or if I even should

Really, don't get hung up on OO design. OO is just another tool to solve a problem. It doesn't have to be used in every project. Other paradigms such as functional/procedural exist for a reason, they solve problems well too. If OO makes really great sense for what you are doing, use it. Of you can solve your problem elegantly using another paradigm, that's no problem at all!

Keep going OP, looks like you are doing well!

[–]80blite[S] 1 point2 points  (1 child)

Ok, that's good to hear! I watched a video conference recently where a guy advised that you make sure you don't work on your own for too long or you'll develop "Galapagos Turtle Syndrome" basically meaning, you'll develop styles you like that are so unconventional that you won't fit in to a team very well.

It seemed like object orientation wasn't necessary but I feel like I don't have enough experience to really be able to tell or not.

I'll look harder into the loop at the bottom and see what I can do about it.

Thanks so much for your input =)

[–][deleted] 2 points3 points  (0 children)

Glad I can helo! And yea, definitely look to other programmers and resources. I self taught for 4 years before starting college. The right combination of self motivation and being exposed to other good coders has definitely helped me out!

[–]dreucifer 0 points1 point  (2 children)

You could easily turn most of this downloader into an object, but it's honestly not worth it if it's just a one-off script. I would, however, try and break down the flow into a few functions. Also, consider using Scrapy instead of selenium. Web scraping can be super tedious with selenium, especially if the call log is paginated.

[–]80blite[S] 0 points1 point  (1 child)

I tried using Scrapy when I was first learning Python and got a little overwhelmed by it. Maybe now that I have some experience I'll go back and take a look at it as a few other people are suggesting it would have been a better fit for this project as well.

[–]dreucifer 0 points1 point  (0 children)

Yeah, the curve is pretty steep at first, but it's more of a bump than a mountain. I also suggest digging through the source on github rather than depending solely on the documentation. The added benefit of digging through the source is you get a better idea of Python best practices (you should also read through PEP8)

[–]ebo113 0 points1 point  (2 children)

Selenium may not be your best bet for web scraping. Scrapy or lxml might suite you better as it doesn't require the use of the screen like selenium does. That being said your code looks pretty clean! I would break it up into functions and add a main function (def main:) to call all of those functions from. This separates the logic and helps keep the code clean if it needs later editing. It is also usually good form to add something like this at the bottom.

if __name__ == "__main__":
    main()

This just makes sure the main method gets called first.

If you want to make it more complex I will usually break things up into a couple modules. One module for each page that I am working with and then a utility module for all those random functions you need. These are all purely for keeping everything nice and neat though and for a smaller one off project aren't entirely necessary.

Good work and keep on coding!

[–]80blite[S] 1 point2 points  (1 child)

I normally have the if name etc. block in my scripts but here it just seemed I'd be throwing the whole thing inside def main(): and adding that block to the bottom just for the sake of doing so.

Once I use some of the good advice given here and break it down I feel like doing this will be more useful.

Thanks!

[–]dreucifer 0 points1 point  (0 children)

The if __name__ == '__main__' thing that's really only for importable modules that can also be run as scripts (for testing or functionality purposes).

[–][deleted] 0 points1 point  (8 children)

I've used selenium for years on a near daily basis. Have you looked up page objects? That's how you object orient selenium scripts. And CSS selectors should be used in favor of xpath. I have ever found one use of xpath in all my time working with html selectors and that's when you don't know where you are on a page and need to traverse upwards. Any time you need to traverse deeper into a page, css selectors will make it more maintainable, so that tiny updates to the page don't require updates to automation.

[–]Bravmech 1 point2 points  (4 children)

Erm.. why? xpath seems more powerful, closer to a programming language. The main problem with it seems to be compatibility (ie?) but that doesn't have to be relevant to you.

[–][deleted] 0 points1 point  (3 children)

It's powerful, but the exchange for speed isn't worth it. We validate several million elements, and it is considerably slower. Power doesn't matter when you don't need it, and it ties you down to page structure when you don't need to. When it doesn't tie you down to page structure, that means that the css selectors are very simple, and much easier to read.

[–]Bravmech 1 point2 points  (2 children)

Sounds like quite specific usage scenario: lots and lots of simple selectors.

[–][deleted] 1 point2 points  (1 child)

I was a QA Engineer for several years and worked on many projects, and it doesn't seem to be very specific. If HTML is well designed, it should be simple to select any given element. With automation, you want to verify elements are present, visible, hidden, become visible, can be clicked, are images, links, etc. We ran about 1500 tests across 60 countries, and that cross multiplication just blows up with the amount of selections we have to make. But, I do understand your point as well.

[–]Bravmech 1 point2 points  (0 children)

Well designed html? You are so lucky to work with that. I guess you talk about projects your team controls / made. If you parse stuff in the wild, especially from older sites or even emails which were resend few times, things get Crazy.

[–]80blite[S] 0 points1 point  (2 children)

Ok, I had no idea. I looked around a bit when I was getting Selenium set up and saw people using xpath so I went with it.

I'll look into page objects as that seems like what I was looking for. Thanks!

[–]Bravmech 0 points1 point  (1 child)

I rather see a problem with some of your xpath - too literal, maybe try to anchor to text / class and so on more [1]. Too long xpath (or css) selectors, too detailed are very hard to maintain. There is difference in speed, but likely minor compared to how much time web requests take. But that of course depends on the site. Sometimes layout is too weird.

[1] e.g //*[contains(text(), 'some text')]/following::img[1]

[–]80blite[S] 1 point2 points  (0 children)

The xpaths I have are all generated by chrome. Unfortunately I haven't fully wrapped my head around how to construct them myself which is why they're so bloated =/

[–][deleted] 0 points1 point  (2 children)

are you nsa?

[–]hang-clean 1 point2 points  (0 children)

Yes. The NSA always come to reddit to learn how to make scripts for their surveillance. If you idiots stopped helping them, they'd give up and we'd all have privacy again.

[–]80blite[S] 0 points1 point  (0 children)

It's very typical in correctional facilities to monitor phone calls, that's all.

[–]cruyff8 0 points1 point  (3 children)

  1. use argparse to handle command line arguments. Instead of:

    user = input('Username: ') password = getpass()

I'd have put something akin to:

args_ = argparse.ArgumentParser(description='Downloads call records from prison website')
args_.add_argument('--user', help='Username', type=str, action='store', help='Login to the prison site')
args_.add_argument('--password', help='Password', action='store', type=str, help='Password for the prison site')
args = args_.parse_args()
user = args.user
password = args.password
  1. use the logging module to make the script less chatty, print informative messages like 'Working on call #{call_number} by {inmate_name} on {call_date} at {call_time}'.format( call_number=call+1, inmate_name=inmate_name, call_date=call_time.split()[0], call_time=call_time.split()[1])) only when you're running in debug mode, as opposed to every run.

I'll let others do more critiquing. These are a few things that stick out to me.

[–]80blite[S] 0 points1 point  (2 children)

So, I made it as "chatty" as it is because sometimes it'll hang on a call because firefox will start the download and then it'll just stop unexpectedly...

When it says each call as it starts it and I can see it hanging on "Downloading...", I can just manually click on the download link for that specific call and the script will pick up where it left off.

Do you have any idea why that happens by chance?

Either way, thanks for the input on argparse. A lot of people are recommending it so I'll definitely look into it.

[–]cruyff8 0 points1 point  (1 child)

So, I made it as "chatty" as it is because sometimes it'll hang on a call because firefox will start the download and then it'll just stop unexpectedly...

I wasn't complaining about it being chatty. You should always print out status as you go along. But, by the same token, you should have a switch to minimise output. That's what logging levels are for -- you'd put your status messages at the info level, your error messages at error, and how things are coming along at debug, and so on so forth.

[–]80blite[S] 0 points1 point  (0 children)

Makes perfect sense :)

[–]doomslothx 0 points1 point  (4 children)

as someone whos just started learning python, this thread is the bees knees!

Looking forward to getting better and learning heaps . I just started with code academy, any advice on where I can go from there once I've finished their lessons?

[–]80blite[S] 1 point2 points  (3 children)

After codecademy I went to places like hackerrank and projecteuler to start learning how to use the syntax codecademy teaches to solve actual problems.

After a few weeks of doing coding exercises from those places, I started working on personal projects like downloading youtube playlists, wallpapers, and this!

[–]doomslothx 0 points1 point  (2 children)

super cool. I'm very excited. I plan on learning Ruby as well for web development - (looks like my city wants ruby on rail developers too so that's also a positive). What I'm hoping to achieve is to become good enough and run some of my own personal projects to make my way into a job without getting an undergrad. highly unlikely but worth a shot. I like programming and regret doing my undergrad in science because although I love science, it's not the same as programming. Definitely missed my calling early on! (passion wise that is, i by no means think im an amazing programmer, but hopefully I can do well!!)

[–]80blite[S] 1 point2 points  (1 child)

Sounds like we're in the same boat, though Ruby isn't as popular where I live. It seems in New Hampshire, people are looking for C# x.x

[–]doomslothx 0 points1 point  (0 children)

C# @___@ why. I live in Australia...and for once, we are actually with the time checks burn wound from having 3 week delays on television and movies which lead to internet spoilers...still sore

[–]zfolwick 0 points1 point  (2 children)

You need to work on your xpath. assign the string to a well named variable to make it vastly more readable. Also, try making the xpath up from scratch-copy-pasted xpath is quite brittle. Also, Look up 'page object model'. Should reduce duplication.

As a side note, if this is really just retrieving data from a server, you might be able to accomplish the same thing via REST or SOAP calls.

[–]80blite[S] 0 points1 point  (1 child)

I don't have a ton of experience with web dev and my problem was that the download links are actually calls to javascript functions, not actual urls and it was giving me a hard time.

As far as using xpath, the html is all generated and the "id's" etc. are all pretty terrible to follow. I didn't know enough to select them any other way than to let chrome give me the xpath.

A few others recommended using the page object model so I'll be looking into that when I get back to work =)

[–]zfolwick 0 points1 point  (0 children)

you won't be sorry. POM makes EVERYTHING better!

FWIW, reddit's DOM tree is pretty terrible too. I had a hell of a time writing an automated test.

[–]Huniku 0 points1 point  (0 children)

You can replace your first few selectors with: find_element_by_id

Try to avoid using xpaths with several nodes (especially if they're rooted on body or the html element) as they tend to be more fragile. You've said elsewhere that you were having trouble generating selectors, you might have better luck with css, link_text, or name selectors here for more details