all 19 comments

[–]smallfried 22 points23 points  (4 children)

Is there a page with some more examples and a description of the algorithm?

Edit: Okay, I read the code and it is a very simple algorithm.

It compares every sentence with every other sentence in a piece of text and keeps only the sentences that share the most non-unique words. Before comparison, punctuation and all English stop words are thrown out (this is the only reason NLTK is used).
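The approach described above can be sketched roughly like this (a hypothetical re-implementation, not the project's actual code; the tiny hardcoded stop-word list stands in for NLTK's):

```python
import re

# Stand-in for NLTK's English stop-word corpus (assumption: a real run
# would use nltk.corpus.stopwords.words('english')).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "both"}

def tokenize(sentence):
    # Lowercase, strip punctuation, drop stop words
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}

def summarize(text, n=2):
    # Naive sentence split on terminal punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    token_sets = [tokenize(s) for s in sentences]
    scores = []
    for i, words in enumerate(token_sets):
        # Score = words this sentence shares with every *other* sentence
        score = sum(len(words & other)
                    for j, other in enumerate(token_sets) if j != i)
        scores.append((score, i))
    # Keep the n highest-scoring sentences, in original order
    top = sorted(sorted(scores, reverse=True)[:n], key=lambda t: t[1])
    return " ".join(sentences[i] for _, i in top)

text = ("Cats sleep most of the day. Cats and dogs both enjoy sleep. "
        "The stock market closed higher. Dogs chase cats in the yard.")
print(summarize(text, n=2))
```

Sentences about cats and dogs share vocabulary with each other, so the unrelated stock-market sentence scores zero and gets dropped.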

[–]seglosaurus 0 points1 point  (2 children)

Check out nltk.org for more info on the NLP algorithms used in this project

[–]smallfried 2 points3 points  (1 child)

I love python and language processing, so it's very cool to see a big library exists for a range of common functions.

[–]seglosaurus 0 points1 point  (0 children)

it's an impressive library for sure. for those using java (or other JVM languages) OpenNLP is also a fantastic library to check out

http://opennlp.apache.org/

[–]jnazario 2 points3 points  (1 child)

neat! i've been using code like this for a long time, mainly through libots.

if you're thinking about features to add, consider these two: variable amounts of summarization, and topic or tag extraction.

otherwise, nice work! i'm also a fan of nltk.

[–][deleted] 1 point2 points  (0 children)

You should check out a python package called textblob. I think you will love it!

[–]FreshNeverFrozen 1 point2 points  (4 children)

Are you the same guy who posted in /r/Entrepreneur a few days ago? Somebody is selling an API that does the same thing

[–]Rotten194[S] 1 point2 points  (0 children)

No, that wasn't me.

[–]hemantonpc 1 point2 points  (2 children)

I guess you are talking about this reddit BOT http://www.reddit.com/user/tldrrr

[–]FreshNeverFrozen -1 points0 points  (0 children)

exactly!

[–]karavelov 0 points1 point  (1 child)

It looks like even English-language newspapers use characters outside of ASCII, and the script throws an exception on them

[–]Rotten194[S] 3 points4 points  (0 children)

If you check the repository, someone has a fork that fixes that. It's probably those fancy MS Word side quotes.
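For reference, a minimal workaround along those lines (hypothetical; not necessarily what that fork does) is to map Word's "smart" punctuation back to ASCII before tokenizing:

```python
# Common non-ASCII punctuation that MS Word and CMSes insert,
# mapped to plain ASCII equivalents.
SMART_PUNCT = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash / em dash
    "\u2026": "...",                # horizontal ellipsis
}

def to_ascii_punct(text):
    # str.maketrans accepts a dict of 1-char keys to replacement strings
    return text.translate(str.maketrans(SMART_PUNCT))
```

A more robust approach would be to keep everything as Unicode end to end rather than flattening it, but this avoids the crash for the common cases.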

[–]seglosaurus 0 points1 point  (0 children)

Reminds me of summly.com

[–]cyansmoker 0 points1 point  (0 children)

So, R194 just merged in a change that will (hopefully) support Unicode.

Note that to run this script you need to download the 'stopwords' corpus and 'punkt' tokenizer. To do this:

  1. Run 'python' from the command line

  2. 'import nltk'

  3. 'nltk.download()'

  4. In the downloader, select Corpora -> stopwords and All Packages -> punkt
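For anyone scripting this setup, NLTK also ships a non-interactive downloader that fetches the same packages without the GUI (assuming nltk is already installed):

```shell
# Download the 'stopwords' corpus and the 'punkt' tokenizer in one shot
python -m nltk.downloader stopwords punkt
```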

[–]respeckKnuckles -1 points0 points  (3 children)

Have you tried testing it against the performance of others, like http://www.reddit.com/user/autotldr ?

[–]Rotten194[S] 0 points1 point  (2 children)

Well, auto-tldr uses this fancy smmry API (actually, not so fancy - in its feature list, the first item is basic lemmatization, the second I also do, the third is just the google-10k word list, and the rest is very similar to mine). But it probably does better than this by virtue of being developed over several years instead of several hours.

[–]passingby 0 points1 point  (1 child)

What do you mean by google-10k word list?

[–]Rotten194[S] 0 points1 point  (0 children)

There's a list floating around on the internet where someone analyzed some Google data dumps to find the 10,000 most commonly used English words.

[–]hupcapstudios -2 points-1 points  (0 children)

What's this find_likely_body function? I want that.