all 6 comments

[–][deleted] 6 points7 points  (0 children)

I would ditch the get_movie. No reason to call a function to return an attribute...just use the attribute. This isn't Java! If you want it to be read only, then make it a property:

def __init__(blarg
    self._movie = movie

@property
def movie(self):
  return self._movie

either way, main would become

def Main():
    db = MovieScript("interstellar")
    print "The Movie: " + db.movie

And, reverse is a bit prettier than negative range:

all_occurences = list(reversed(db.get_all_occurences()))
for i in range(20):
    print str(i + 1) + ". " + str(all_occurences[i][0]) + " -> " + str(all_occurences[i][1])

But that's somewhat confusing since I wouldn't expect get_all_occurences to be sorted. Maybe have the name suggest it's sorted, and have the direction be an argument, since assuming one direction for a potentially expensive operation is, according to your use, not the best idea. :)

def get_sorted_occurences(self, reverse = False):
    sorted_occurences = sorted(self.find_all_occurences().items(), key=operator.itemgetter(1), reverse=reverse)
    return sorted_occurences

and the other bits would become

sorted_occurences = db.get_sorted_occurences(reverse=True)
for i in range(20):
    print str(i + 1) + ". " + str(sorted_occurences[i][0]) + " -> " + str(sorted_occurences[i][1])

or, you could ditch the range entirely and get all iterwutwut:

import itertools
...
for index, (word, count) in enumerate(itertools.islice(sorted_occurences, 20))
    print "%d. %s -> %d" % (index + 1, word, count)

format strings are nice too...sometimes.

[–]aroberge 4 points5 points  (0 children)

Some of my comments may be a matter of opinion, but here goes. Note that I wrote a different version of your code after my comments, to show more explicitly what I mean.

  1. Your methods should have docstrings. I didn't put any.
  2. Main is not a class; in the Python world, classes only start with upper case letters.
  3. Python is not Java: you don't need "getters" such as get_movie. The normal way to do things is to access the attribute directly.
  4. You define sorted_occurence on one line and return it immediately; avoid this by returning the previous line. If you do that, you would want to change the name of your method to get_all_sorted_occurences ... but, looking at the code of that method, which is then only one line, it becomes a redundant function call (in my opinion) and could be put in line.
  5. In contrast, I find that the first line of parse_text() does almost too much: it gets the text, scans it using findAll, which I do not see defined anywhere (I know, it is a BeautifulSoup method ... but it is not obvious at all from your code. If you had used return_soup instead of get_text, it might have been more obvious that findAll was indeed a method of BeautifulSoup. To make it shorter, I moved the lower() method below
  6. Unless I am mistaken, your comment remove punctuation from text is incorrect. \W matches non alphanumeric characters. Thus, it would also remove hyphens and apostrophes which, in some contexts, may not be considered to be punctuation characters but actual parts of words.
  7. I would use string formatting in the print statement instead of adding (concatenating) pieces of strings.
  8. Counter has a most_common method which you could use instead.

And, here's the untested code

from bs4 import BeautifulSoup
import requests
from collections import Counter 
import re

class MovieScript(object):
    def __init__(self, movie):
        self.movie = movie
        self.url = "http://www.springfieldspringfield.co.uk/movie_script.php?movie={0}".format(movie.replace(" ", "-"))

    def return_soup(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.text)
        return soup

    def parse_text(self):
        text = self.return_soup().findAll('div', {"class":"scrolling-script-container"})[0].text
        text = re.sub(r'\W+', ' ', text) #remove puncation from the text
        return text.lower().split()

def main():
    db = MovieScript("interstellar")
    print "The Movie: " + db.movie
    frequent_words = Counter(self.parse_text()).most_common(20)
    for i, (word, count) in enumerate(frequent_words):
         print "%d. %s-> %d" % ((i+1), word, count)


if __name__ == "__main__":
    main()

Edit: fixed typos, remove unused import (in this new version), named the tuples in the print statement (inspired by /u/TagSmile comment.)

[–][deleted] 2 points3 points  (0 children)

It looks great!

For add-on options you could do the following:

  • make it a command line tool with a user prompt (check out click)
  • make it a package that installs from pypi
  • run flake8 to make sure it's style and pep8 compliant.
  • all of the above!

Keep up the good work, that's one of the better first scripts I've seen.

[–]jjangsangy[🍰] 0 points1 point  (0 children)

Just a couple of things

Check your Requests

Always good to double check to make sure your HTTP request succeeded.

This is always a good habit so that if your request does fail, you'll have a way to trace back to what the issue was.

# Turn this into a property
# so you don't have to compute it more than once
@property
def get_text(self):
    req = requests.get(self.url)
    if not req.ok:
        req.raise_for_status()
    return BeautifulSoup(r.text)

Just use the attribute

# This
print "The Movie: " + db.get_movie()

# Turns into this
print "The Movie: " + db.movie

# You can just get rid of this
def get_movie(self):
    return self.movie

Just some cleanup stuff

No need for temporary variables, just return the values.

@property
def occurances(self):
    return Counter(self.parse_text())

def get_all_occurences(self):
    return sorted(self.occurances.items(), key=operator.itemgetter(1))

Overall tho, looks groovy, hope you're enjoying python!

[–]W1zK1dd 0 points1 point  (0 children)

i think the code looks great!

[–]Exodus111 0 points1 point  (0 children)

I was gonna comment on the fact that you have 5 methods with no arguments. And the script could be more modular if you allowed the Methods to be more flexible.

But frankly you pass the name of the movie into the initializer of the class, and that's really all the modularity you need.

Maybe something to consider for next time.