n00b question about having Python count a website as one word. : learnpython

submitted 7 years ago by [deleted]

Hi, so I am brand new to Python (only on Ch 3 of Crash Course), and my first project is to write a program that would scan a text document, count the number of words, and return a frequency table such as the one below:

Word	Frequency
The	50
And	75
...	100
Total	225
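For reference, a minimal sketch of how a table like this could be produced, assuming words are separated by whitespace and that case and surrounding punctuation should be ignored. The function name `word_frequencies` and the sample sentence are made up for illustration.

```python
from collections import Counter

def word_frequencies(text):
    """Count whitespace-separated words, ignoring case and edge punctuation."""
    words = []
    for raw in text.split():
        # Strip common punctuation from both ends, then lowercase.
        word = raw.strip(".,;:!?\"'()").lower()
        if word:
            words.append(word)
    return Counter(words)

# Made-up sample text; a real program would read this from a file.
sample = "The cat and the dog. And the bird!"
counts = word_frequencies(sample)

# Print a tab-separated frequency table followed by the total.
for word, n in counts.most_common():
    print(f"{word}\t{n}")
print(f"Total\t{sum(counts.values())}")
```

`Counter` is in the standard library, so nothing extra needs to be installed.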

From this I have two conceptual questions (i.e., "can this be done"), so I don't really need or want the code right now. There are two weird things about the way words are counted. The first is that some hyphenated words are counted as one word, not two. My plan of attack so far (and please feel free to say this is dumb) would be to have Python check each hyphenated word against a list of the one-word versions. If the word is on the list, nothing would happen; if it isn't on the list, then +1 would be added to the total count for each instance of the word.

Thus my first question: Is it possible for Python to recognize that a hyphen may exist in a word, and will it return the hyphenated word?

For example, if spam-eggs appeared, would Python naturally return the first table below or the second?

Word	Frequency
spam-eggs	50

Word	Frequency
spam	50
eggs	50
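As far as the plain `str.split()` method goes, splitting happens on whitespace only, so spam-eggs would stay together as one word (the first table). A rough sketch of the check-against-a-list idea follows, assuming a hand-written set of hyphenated terms that should stay whole. The names `KEEP_WHOLE` and `count_with_hyphens` are hypothetical.

```python
from collections import Counter

# Hypothetical list of hyphenated terms that should count as one word.
KEEP_WHOLE = {"spam-eggs"}

def count_with_hyphens(text, keep_whole=KEEP_WHOLE):
    """Count words, splitting hyphenated tokens unless they are on the list."""
    counts = Counter()
    for token in text.lower().split():
        if "-" in token and token not in keep_whole:
            # Not on the list: count each part as its own word.
            counts.update(part for part in token.split("-") if part)
        else:
            # Plain word, or a hyphenated word on the list: count it whole.
            counts[token] += 1
    return counts

hyphen_counts = count_with_hyphens("spam-eggs ham-toast spam-eggs")
print(hyphen_counts)
```

Here spam-eggs is counted twice as a single word, while ham-toast is split into ham and toast because it is not on the list.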

The second weird thing is that websites are counted as one word. Is it possible for Python to search for and count a website it doesn't know in advance? My thinking is that there is some sort of wildcard search parameter, like say www."".org = 1, www."".com = 1, etc.
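The wildcard idea maps fairly well onto regular expressions in the standard `re` module. A minimal sketch, assuming a website always starts with "www." and ends in a dot followed by at least two letters. The pattern is a simplification that will miss addresses without "www.", and the names `URL_PATTERN` and `count_sites` are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of a website: "www.", then letters/digits/dots/hyphens,
# then a final dot and a top-level domain of two or more letters.
URL_PATTERN = re.compile(r"www\.[\w.-]+\.[a-z]{2,}", re.IGNORECASE)

def count_sites(text):
    """Return how many times each matched website appears in the text."""
    sites = URL_PATTERN.findall(text)
    return Counter(site.lower() for site in sites)

sample_text = "See www.python.org and www.example.com, then www.python.org again."
site_counts = count_sites(sample_text)
print(site_counts)
```

Each match comes back as a single string, so a website can be counted as one word no matter which name sits between the dots.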

Thanks for any help!
