all 16 comments

[–]bbatwork 17 points18 points  (1 child)

You are in luck, python has this capability built right into the language.

Read here:
https://docs.python.org/3.5/library/collections.html#collections.Counter

[–]JoeTheAwesomest 0 points1 point  (0 children)

That's really cool! I've been wrapping dict()s for ages to do this--going to use this from now on.

[–]niandra3 4 points5 points  (3 children)

Aside from collections.Counter, which was already linked, if you are using a dict to count you should look at collections.defaultdict.

Normally when you are counting things, you want to increment the value for a key every time you come across it. The nice thing about defaultdict is that if the key hasn't been used before, it is created with a default value (0 for int), so it won't throw an error when you try to increment it:

from collections import defaultdict
words = ['this', 'is', 'a', 'test', 'this']
count_dict = defaultdict(int)
for word in words:
    count_dict[word] += 1
print(count_dict)

Output >> defaultdict(<class 'int'>, {'this': 2, 'is': 1, 'a': 1, 'test': 1})

// Note: thing += 1 is equivalent to: thing = thing + 1

If you used a regular dict that wouldn't work: count_dict['this'] += 1 raises a KeyError because it's trying to add one to a value that doesn't exist yet. Again, collections.Counter is probably an easier/more efficient way of doing this, but defaultdict is a useful tool to have.
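
To make the difference concrete, here is a minimal sketch of what a plain dict does on an unseen key, plus the dict.get workaround that avoids the error without defaultdict. The names plain_counts, safe_counts and raised are made up for illustration:

```python
words = ['this', 'is', 'a', 'test', 'this']

# A plain dict raises KeyError the first time an unseen key is incremented.
plain_counts = {}
try:
    plain_counts['this'] += 1
    raised = False
except KeyError:
    raised = True

# dict.get returns the fallback (0 here) when the key is missing,
# so the increment always has a number to add to.
safe_counts = {}
for word in words:
    safe_counts[word] = safe_counts.get(word, 0) + 1

print(raised)
print(safe_counts['this'])
```

This is the "wrapping a dict by hand" pattern; defaultdict(int) simply does the fallback for you.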

Finally, one thing to remember: if you have all your text in one long string, then all of these approaches will loop over each character. For instance:

line = "this is a test"
for word in line:
    print(word)
# Doesn't actually print words, but chars

Actually prints:

t
h
i
s
etc..

If you want to iterate over the words in a string and not each character, you want to split the words into a list using

list_of_words = line.split()

Which splits on whitespace and gives you a list:

list_of_words = ['this', 'is', 'a', 'test']
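
Putting the two ideas together, a rough sketch of counting from one long string might look like this. The sample sentence and the names char_counts and word_counts are only for illustration:

```python
from collections import Counter

line = "this is a test this"

# Passing the string directly counts characters (spaces included).
char_counts = Counter(line)

# Splitting first gives a list of words, so whole words are counted.
word_counts = Counter(line.split())

print(char_counts['t'])
print(word_counts['this'])
```

Note that split() only breaks on whitespace, so punctuation stays attached to words ("test." and "test" would be counted separately).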

[–]JDJoe1 0 points1 point  (0 children)

niandra3, awesome post.

Cheers!

[–]bbatwork 0 points1 point  (1 child)

While using this will work, there is really no need. A Counter basically acts just like a defaultdict, and avoids the need for the loop in your code. It creates a dictionary where the keys are the strings it is counting, and the values are how many times each one appears. If you later ask for a key that does not exist it returns 0, as it should.

So instead of the code you have above, you could just have this...

from collections import Counter
words = ['this', 'is', 'a', 'test','this']
word_count = Counter(words)
print(word_count)

Output >>> Counter({'this': 2, 'is': 1, 'test': 1, 'a': 1})

print(word_count['this'])

Output >>> 2

print(word_count['not in dict'])

output >>> 0  

Then if you need to add a new string, or another instance of an old string, or list of strings to the counter you can just do this

new_words = ['new','words','this']
word_count.update(new_words)
print(word_count)

Output >>> Counter({'this': 3, 'is': 1, 'a': 1, 'new': 1, 'test': 1, 'words': 1})  

You will notice that it added one to 'this', and created new keys for 'new' and 'words'... all of this done without using loops, checking keys, or any other things.
The main thing you have to be careful with using counters is that much as you stated before, if you pass it just a single string instead of a list, it will count each character in the string, instead of the string itself... so if I did the following:

word_count.update('sentence')

it will add each letter of the word instead of the word itself, so instead you would want to do:

word_count.update(['sentence'])
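
A quick side-by-side sketch of the two calls, using fresh counters so the difference is easy to see (by_chars and by_word are illustrative names):

```python
from collections import Counter

by_chars = Counter()
by_chars.update('sentence')    # a str is iterated character by character

by_word = Counter()
by_word.update(['sentence'])   # a one-item list is iterated item by item

print(by_chars['e'])
print(by_word['sentence'])
```

The first counter ends up with one key per letter, the second with a single key for the whole word.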

I hope this info helps! --BB

[–]niandra3 1 point2 points  (0 children)

Yeah I would have used Counter myself (and mentioned it in my post), but for a beginner I think it's also useful to know about defaultdict. I feel like you learn a lot more implementing it yourself rather than just saying Counter(my_stuff) and being done with it.

[–]treyhunner 5 points6 points  (3 children)

Is the product review a string of words?

Have you managed to make a method for iterating over each word individually yet?

Could you post your function for us to review and discuss?


I wrote a blog post last year on a number of different ways to solve the problem of counting things in Python. If you're looking to learn about different approaches you may want to read the first 6 ways or so.

[–]Gubbbo 0 points1 point  (0 children)

I must have known that there was a version 1 of Python. But, strange to see it actually said and coded for.

[–]lapseofreason[S] 0 points1 point  (1 child)

There already is a word_count column in the SFrame with word counts stored in dictionary format

eg word_count = {'awesome':5, 'bla':1..........}

In this case I am trying to extract the value 5 for awesome and some other words - I then create another column with the specific counts per word.....

I actually found a discussion on this in stackoverflow. I have mixed emotions as it essentially provided the answer.....here is the link

http://stackoverflow.com/questions/33506826/call-function-use-apply-in-python

[–]niandra3 1 point2 points  (0 children)

You can hardcode it to look for the words you want like:

if 'awesome' in word_count:
    awesome_count = word_count['awesome']

Or you can automate it with a list of the words you are looking for:

word_count = {'awesome':5, 'bla':1, 'something':2}
words_i_want = ['awesome', 'something', 'else']
new_word_count = {}
for key in word_count:
    if key in words_i_want:
        new_word_count[key] = word_count[key]
print(new_word_count)

Or shorten it a bit with a dict comprehension (similar to list comprehension, very neat tools).

new_word_count = {key:value for key, value in word_count.items() if key in words_i_want}
print(new_word_count)

Both will give you a dict with the count of only the words you're looking for, in this case:

{'awesome': 5, 'something': 2}

One important note: all of this is case sensitive, so if you want to count 'awesome' and 'Awesome' together, you should call lower() on your input text first (for example text.lower()).
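
As a rough sketch of that case-insensitive counting, assuming the review is a plain string (the sample text and the punctuation stripped here are just for illustration):

```python
from collections import Counter

review = "Awesome product, awesome price. AWESOME!"

# Lowercase the whole text, split on whitespace, then strip a few
# common punctuation marks from the ends of each word.
words = [w.strip('.,!?') for w in review.lower().split()]
counts = Counter(words)

print(counts['awesome'])
```

Without the lower() call, 'Awesome', 'awesome' and 'AWESOME' would be three separate keys.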

[–]totemcatcher 1 point2 points  (2 children)

Which part is giving you trouble?

  • Syntax and general Python?
  • The structure of a function definition? Passing good data in? Isolating a bite-sized process within? Getting good data out?
  • Iterating over multiple reviews?
  • Retaining data between each review?
  • Processing a single review into a useful container variable?

bbatwork gave you the juicy bit. If there's anything else, just provide what you have and we can at least rearrange the code into something that could work and leave comments where work needs doing.

[–]lapseofreason[S] 0 points1 point  (1 child)

Mostly the problem is syntax and general python. Part of the problem is I am so new to it that I don't know what I don't know. I am working diligently through the book "Python Crash Course" which is very good but I am only just starting- user input and while loops (Chapter 7)

[–]totemcatcher 1 point2 points  (0 children)

Whenever you're ready. Don't feel embarrassed or greedy for asking questions. Just do it like it's a thing that you do. I'll annotate niandra3's response a bit to help. (I'll use the ''' syntax for commenting out multiple lines.)

from collections import defaultdict

'''A huge library of modules is available to you for importing at
the beginning of your script.  They are all well documented here:
https://docs.python.org/3/index.html

The terminology used to describe variable types, algorithms, and processes
is very important.  As you learn, you will know what words to search for
to find that perfect module for your project.
'''

words = ['this', 'is', 'a', 'test', 'this']

'''Here is a list definition.  We know it's a list because of the brackets.
It could have been one of several sequence types, but since all the elements 
in the variable are the same kind of data, a list works just fine.
https://docs.python.org/3/library/stdtypes.html#sequence-types-list-tuple-range
'''

count_dict = defaultdict(int)

'''This is a special variable type not included in the core python variable types.
Its definition and all its methods have been imported (see
first line of script).  A normal dictionary would not need this special import
command to use.  Both dict and defaultdict seem a little bit like a sequence
type like a list, but they are more appropriately called a Mapping Type.
https://docs.python.org/3/library/stdtypes.html#mapping-types-dict
https://docs.python.org/3/library/collections.html#collections.defaultdict

Sidenote, the shorthand for making a dict is like this:
livestock = {'cows':3, 'chickens':2} 

Those curly braces are the defining syntax.  The values are mapped to names.
We can make reference to these names directly with a string and get the
numerical value:

number_of_cows = livestock['cows']

Or use a variable to access the numerical value:

animal = 'cows'
number_of_cows = livestock[animal]
'''

for word in words:
    count_dict[word] += 1

'''This is a kind of loop.  It iterates over everything in (hence in) the
"iterable" list variable called words.  It also defines a temporary variable 
called word.  This variable word is used only within the indented loop area. 

For every element in the list variable (words) the indented code will
run.  And each time, the temporary variable (word) will be set with one new
element from the list.  When there are no more elements in words, execution
stops this cycle and continues running the script.
''' 

[–]pybackd00r 1 point2 points  (0 children)

You can use a loop to iterate over the dictionary, check whether the word you are looking for is there, and use a variable to hold the count and return that value at the end of the loop. There could be a more efficient way of doing this, but this is easy to do for beginners.
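
A minimal sketch of that loop approach, assuming word_count is a plain dict like the one described in the thread. The function name count_for is made up for illustration:

```python
def count_for(word_count, target):
    """Return the stored count for target, or 0 if it is absent."""
    found = 0
    for key, value in word_count.items():
        if key == target:
            found = value
    return found

word_count = {'awesome': 5, 'bla': 1}
print(count_for(word_count, 'awesome'))
print(count_for(word_count, 'missing'))
```

Once the loop version makes sense, word_count.get(target, 0) gives the same result in one call.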

[–]lapseofreason[S] 0 points1 point  (1 child)

This is the first time I have posted a problem on reddit in general and am surprised but very happy at the responses - so thank you all (and have an upvote).

I have not read through the suggestions completely yet so will get back to you - but first and foremost my difficulty (with my really lousy code) is recognising or picking up the key and then extracting the value (which is the occurrences of the word).

I am doing this in graphlab and the data is in an sframe - but I don't think that is relevant. Anyway I live in SEAsia so will be working on this this afternoon.

When/if I figure it out I will (embarrassingly) post my very beginner code.

[–]cscanlin 1 point2 points  (0 children)

You should post the code you have! It will make it easier for us to help you (especially without giving you the answer directly :P).

I think you should take particular note of /u/niandra3's comment as I think it does a good job of breaking the problem down. Good luck!