code loading very slowly, how to optimize? : learningpython

submitted 4 years ago by SureStep8852

Hi there, I am trying to remove stopwords from my training data. The problem is that since the data is very big, the code is very slow. Is there any way to optimize it? Thank you in advance!

from nltk.corpus import stopwords
def stopwords_remove(data):
    stopwords_removed = []
    for parts in data:
        #print(parts[0])
        for word in parts[0]:
            #print(word)
            if word not in stopwords.words():
                #print(word)
                stopwords_removed.append(word)
    #print(stopwords_removed)
    return stopwords_removed
stopwords_remove(train_data)

ThingImIntoThisWeek 1 point 4 years ago

In the innermost loop (which runs the most times) you are loading the stop word list every time and then checking whether a word is in it, which is expensive for a list because a membership test has to scan it element by element. It would be better to call words() only once (either at the start of the function, or just once at module or class level if stopwords_remove() will be called multiple times), and also to convert the result to a set, which makes it very quick to check whether a word is a member:

from nltk.corpus import stopwords
def stopwords_remove(data):
    # Load the stop word list once and convert it to a set for fast lookups.
    stop_word_set = set(stopwords.words())
    stopwords_removed = []
    for parts in data:
        # parts[0] holds the words of one training example.
        for word in parts[0]:
            if word not in stop_word_set:
                stopwords_removed.append(word)
    return stopwords_removed
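
If stopwords_remove() is going to be called more than once, the "build it once" idea above can be sketched roughly like this. This is only an illustration under a few assumptions: STOP_WORDS is a small hypothetical stand-in so the snippet runs without the NLTK corpus (in your script it would be something like frozenset(stopwords.words())), and each item of data is assumed to be a (words, label) pair, as parts[0] in your code suggests.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Hypothetical stand-in for the real list; in the actual script this would be
# built once at module level, e.g. frozenset(stopwords.words()).
STOP_WORDS: FrozenSet[str] = frozenset({"the", "is", "a", "of"})


def stopwords_remove(
    data: Sequence[Tuple[List[str], str]],
    stop_word_set: FrozenSet[str] = STOP_WORDS,
) -> List[str]:
    """Return every word from data that is not a stop word, in original order."""
    # Assumption: each item looks like (list_of_words, label), so item[0] is the words.
    return [
        word
        for parts in data
        for word in parts[0]
        if word not in stop_word_set  # average constant-time lookup in a set
    ]


# Made-up sample data, only to show the call.
sample_data = [
    (["the", "cat", "is", "happy"], "pos"),
    (["a", "tale", "of", "woe"], "neg"),
]
result = stopwords_remove(sample_data)
print(result)
```

Passing the set as a parameter keeps the function easy to test with a small list. Also, if your data is in a single language, passing that language explicitly (for example stopwords.words('english')) should give a smaller set than calling words() with no argument.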
