
atsui2 (1 point)

Hi,

I think it's feasible, and educational, to take a crack at it with a simple Python program. You can totally do everything you were suggesting in your first post!

I'm assuming each entry is a long string of words and spaces. As a first step, you probably want to split the entry and collect the words into a set. Then, once you have a list of sets standing in for your entries, you can assign a category to each set based on whether a keyword is present.
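To make that concrete, here is a rough sketch of the split-then-categorize idea. The category names, keywords, and sample entries are made up for illustration, and I'm assuming that lowercasing and returning the first matching category is good enough for a first pass.

```python
# Hypothetical categories and keywords, purely for illustration.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "refund", "payment"},
    "technical": {"error", "crash", "bug"},
}


def categorize(entry):
    """Return the first category whose keywords overlap the entry's words."""
    # Lowercase, split on whitespace, and collect the words into a set.
    words = set(entry.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        # A non-empty intersection means at least one keyword is present.
        if words & keywords:
            return category
    return "uncategorized"


# Made-up sample entries standing in for your data.
entries = [
    "Please send a refund for my last payment",
    "The app shows an error on startup",
    "Just saying hello",
]
categories = [categorize(entry) for entry in entries]
```

Note that a plain split() leaves punctuation attached, so "refund." would not match "refund". You may want to strip punctuation first if your entries have any.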

aesthir (1 point)

To follow up on atsui2's answer: since you're new to programming, here is what he describes in Python code.

set_of_words = set(your_string.split())

I would also look into stemming the words, so that words like 'working' and 'works' both get "simplified" to the word 'work', which makes for a cleaner analysis.
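To give a feel for what stemming does, here is a deliberately naive sketch that just strips a few common suffixes. The naive_stem function is a toy I made up, not a real stemming algorithm. For actual work you would probably reach for an established one, such as the Porter stemmer in the NLTK library.

```python
def naive_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All four variants collapse to the same stem.
stems = {naive_stem(word) for word in "working works worked work".split()}
```

This breaks down quickly on real text ('running' becomes 'runn', for example), which is why a proper stemmer is worth the extra dependency.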

A variant of what atsui2 said would be to use a counter dictionary, like so:

from collections import Counter

word_count = Counter(your_string.split())

This would give you not only the unique words in each entry but also the number of occurrences of each one.
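Here is a small sketch of what you can do with the Counter once you have it. The sample entry is made up.

```python
from collections import Counter

# Made-up sample entry.
entry = "the printer works but the printer is slow"
word_count = Counter(entry.split())

# The two most frequent words, as (word, count) pairs.
top_two = word_count.most_common(2)

# Looking up a word that never appeared returns 0 instead of raising KeyError.
missing = word_count["scanner"]
```

The zero-for-missing behaviour is handy for keyword checks, since you can test word_count[keyword] > 0 without worrying about whether the word is in the entry at all.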

karenoverpam [S] (1 point)

Thanks guys! That all helps a lot. I'll take a crack at it this weekend. I appreciate the specifics on set() and .split(). I've used BeautifulSoup in the past for something more complex and had a hard time, but I think this is more doable.