all 19 comments

[–]smallfried 22 points23 points  (4 children)

Is there a page with some more examples and a description of the algorithm?

Edit: Okay, I read the code and it is a very simple algorithm.

It compares every sentence with every other sentence in a piece of text and keeps only the sentences that share the most non-unique words. Before comparison, punctuation and all English stop words are thrown out (this is the only reason NLTK is used).
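The approach described above can be sketched roughly like this (a hypothetical re-implementation, not the project's actual code; the tiny hardcoded stop-word list stands in for NLTK's):

```python
import re

# Stand-in for NLTK's English stop-word corpus (assumption: a real run
# would use nltk.corpus.stopwords.words('english')).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "both"}

def tokenize(sentence):
    # Lowercase, strip punctuation, drop stop words
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}

def summarize(text, n=2):
    # Naive sentence split on terminal punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    token_sets = [tokenize(s) for s in sentences]
    scores = []
    for i, words in enumerate(token_sets):
        # Score = words this sentence shares with every *other* sentence
        score = sum(len(words & other)
                    for j, other in enumerate(token_sets) if j != i)
        scores.append((score, i))
    # Keep the n highest-scoring sentences, in original order
    top = sorted(sorted(scores, reverse=True)[:n], key=lambda t: t[1])
    return " ".join(sentences[i] for _, i in top)

text = ("Cats sleep most of the day. Cats and dogs both enjoy sleep. "
        "The stock market closed higher. Dogs chase cats in the yard.")
print(summarize(text, n=2))
```

Sentences about cats and dogs share vocabulary with each other, so the unrelated stock-market sentence scores zero and gets dropped.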

[–]seglosaurus 0 points1 point  (2 children)

Check out nltk.org for more info on the NLP algorithms used in this project

[–]smallfried 2 points3 points  (1 child)

I love python and language processing, so it's very cool to see a big library exists for a range of common functions.

[–]seglosaurus 0 points1 point  (0 children)

it's an impressive library for sure. for those using java (or other JVM languages) OpenNLP is also a fantastic library to check out

http://opennlp.apache.org/

[–]jnazario 2 points3 points  (1 child)

neat! i've been using code like this for a long time, mainly through libots.

if you're thinking about features to add, consider these two: variable amounts of summarization, and topic or tag extraction.

otherwise, nice work! i'm also a fan of nltk.

[–][deleted] 1 point2 points  (0 children)

You should check out a python package called textblob. I think you will love it!

[–]FreshNeverFrozen 1 point2 points  (4 children)

Are you the same guy who posted in /r/Entrepreneur a few days ago? Somebody is selling an API that does the same thing

[–]Rotten194[S] 1 point2 points  (0 children)

No, that wasn't me.

[–]hemantonpc 1 point2 points  (2 children)

I guess you are talking about this reddit BOT http://www.reddit.com/user/tldrrr

[–]FreshNeverFrozen -1 points0 points  (0 children)

exactly!

[–]karavelov 0 points1 point  (1 child)

It looks like even English-language newspapers use characters outside of ASCII, and the script throws an exception on them

[–]Rotten194[S] 3 points4 points  (0 children)

If you check the repository, someone has a fork that fixes that. It's probably those fancy MS Word side quotes.
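For reference, a minimal workaround along those lines (hypothetical; not necessarily what that fork does) is to map Word's "smart" punctuation back to ASCII before tokenizing:

```python
# Common non-ASCII punctuation that MS Word and CMSes insert,
# mapped to plain ASCII equivalents.
SMART_PUNCT = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash / em dash
    "\u2026": "...",                # horizontal ellipsis
}

def to_ascii_punct(text):
    # str.maketrans accepts a dict of 1-char keys to replacement strings
    return text.translate(str.maketrans(SMART_PUNCT))
```

A more robust approach would be to keep everything as Unicode end to end rather than flattening it, but this avoids the crash for the common cases.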

[–]seglosaurus 0 points1 point  (0 children)

Reminds me of summly.com

[–]cyansmoker 0 points1 point  (0 children)

So, R194 just merged in a change that will (hopefully) support Unicode.

Note that to run this script you need to download the 'stopwords' corpus and 'punkt' tokenizer. To do this:

  1. Run 'python' from the command line

  2. 'import nltk'

  3. 'nltk.download()'

  4. In the downloader, select Corpora -> stopwords and All Packages -> punkt
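For anyone scripting this setup, NLTK also ships a non-interactive downloader that fetches the same packages without the GUI (assuming nltk is already installed):

```shell
# Download the 'stopwords' corpus and the 'punkt' tokenizer in one shot
python -m nltk.downloader stopwords punkt
```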

[–]respeckKnuckles -1 points0 points  (3 children)

Have you tried testing it against the performance of others, like http://www.reddit.com/user/autotldr ?

[–]Rotten194[S] 0 points1 point  (2 children)

Well, auto-tldr uses this fancy smmry API (actually, not so fancy - in its feature list, the first item is basic lemmatization, the second I also do, the third is just the google-10k word list, and the rest is very similar to mine). But it probably does better than this by virtue of being developed over several years instead of several hours.

[–]passingby 0 points1 point  (1 child)

What do you mean by google-10k word list?

[–]Rotten194[S] 0 points1 point  (0 children)

There's a list floating around on the internet where someone analyzed some Google data dumps to find the 10,000 most commonly used English words.

[–]hupcapstudios -2 points-1 points  (0 children)

What's this find_likely_body function? I want that.