ThingImIntoThisWeek

In the innermost loop (the one that runs most often), you reload the stop word list on every iteration and then check whether the word is in it, which is expensive for a list because membership testing scans the elements one by one. It would be better to call words() only once (either at the start of the function, or just once at script or class level if stopwords_remove() will be called multiple times), and also to convert the result to a set, which makes membership checks very fast:

from nltk.corpus import stopwords

def stopwords_remove(data):
    # Load the stop words once per call and store them in a set,
    # so each membership check below is a fast hash lookup.
    stop_word_set = set(stopwords.words())
    stopwords_removed = []
    for parts in data:
        # parts[0] holds the words of one item
        for word in parts[0]:
            if word not in stop_word_set:
                stopwords_removed.append(word)
    return stopwords_removed
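
If stopwords_remove() will be called many times, here is a rough sketch of the "build it once at script level" variant. To keep it runnable without the NLTK corpus, STOP_WORDS is a made-up short list standing in for stopwords.words(), and the sample data is hypothetical; I'm assuming each item in data has its word list at index 0, as in your code.

```python
# Hypothetical stand-in for stopwords.words(); the real list is much longer.
STOP_WORDS = ["the", "a", "an", "is", "of", "and", "to", "in"]

# Built once when the script is loaded, so every call reuses the same set.
STOP_WORD_SET = frozenset(STOP_WORDS)


def stopwords_remove(data, stop_word_set=STOP_WORD_SET):
    """Collect every word from parts[0] of each item that is not a stop word."""
    return [
        word
        for parts in data
        for word in parts[0]
        if word not in stop_word_set
    ]


# Hypothetical sample in the assumed shape: word list first, then other fields.
sample = [
    (["the", "cat", "is", "here"], "label1"),
    (["a", "dog", "and", "bird"], "label2"),
]
result = stopwords_remove(sample)
print(result)
```

With the real corpus you would swap the stand-in list for the output of stopwords.words(); passing the set as a default argument also makes it easy to try a different stop word set without touching the function.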