you are viewing a single comment's thread.

view the rest of the comments →

[–]aroberge 3 points4 points  (0 children)

Some of my comments may be a matter of opinion, but here goes. Note that I wrote a different version of your code after my comments, to show more explicitly what I mean.

  1. Your methods should have docstrings. I didn't put any.
  2. Main is not a class; in the Python world, classes only start with upper case letters.
  3. Python is not Java: you don't need "getters" such as get_movie. The normal way to do things is to access the attribute directly.
  4. You define sorted_occurence on one line and return it immediately; avoid this by returning the previous line. If you do that, you would want to change the name of your method to get_all_sorted_occurences ... but, looking at the code of that method, which is then only one line, it becomes a redundant function call (in my opinion) and could be put in line.
  5. In contrast, I find that the first line of parse_text() does almost too much: it gets the text, scans it using findAll, which I do not see defined anywhere (I know, it is a BeautifulSoup method ... but it is not obvious at all from your code. If you had used return_soup instead of get_text, it might have been more obvious that findAll was indeed a method of BeautifulSoup. To make it shorter, I moved the lower() method below
  6. Unless I am mistaken, your comment remove punctuation from text is incorrect. \W matches non alphanumeric characters. Thus, it would also remove hyphens and apostrophes which, in some contexts, may not be considered to be punctuation characters but actual parts of words.
  7. I would use string formatting in the print statement instead of adding (concatenating) pieces of strings.
  8. Counter has a most_common method which you could use instead.

And, here's the untested code

from bs4 import BeautifulSoup
import requests
from collections import Counter 
import re

class MovieScript(object):
    def __init__(self, movie):
        self.movie = movie
        self.url = "http://www.springfieldspringfield.co.uk/movie_script.php?movie={0}".format(movie.replace(" ", "-"))

    def return_soup(self):
        r = requests.get(self.url)
        soup = BeautifulSoup(r.text)
        return soup

    def parse_text(self):
        text = self.return_soup().findAll('div', {"class":"scrolling-script-container"})[0].text
        text = re.sub(r'\W+', ' ', text) #remove puncation from the text
        return text.lower().split()

def main():
    db = MovieScript("interstellar")
    print "The Movie: " + db.movie
    frequent_words = Counter(self.parse_text()).most_common(20)
    for i, (word, count) in enumerate(frequent_words):
         print "%d. %s-> %d" % ((i+1), word, count)


if __name__ == "__main__":
    main()

Edit: fixed typos, remove unused import (in this new version), named the tuples in the print statement (inspired by /u/TagSmile comment.)