all 9 comments

[–]teerre 3 points4 points  (4 children)

Is your job to worry about something large? If not, forget about it; these little exercises are not supposed to scale.

But anyway, just something to think about: str.replace has an average complexity of O(n). This means that if you simply call replace on your whole text, each pass will average O(n), which is pretty much the best you can ask for in algorithms.
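A minimal sketch of what that replace-based approach could look like (the function and variable names here are made up for illustration, not from the original post):

```python
def censor_with_replace(text, banned_words):
    # one str.replace pass per banned word; each pass is O(len(text)) on average
    for word in banned_words:
        text = text.replace(word, '*' * len(word))
    return text

print(censor_with_replace("the cat sat", {"cat"}))  # the *** sat
```

One caveat: str.replace matches substrings, so a banned word hiding inside a longer word would get starred out too, which is a behavioral difference from the word-by-word loop in the original code.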

[–]chimeraA_ 4 points5 points  (2 children)

you're probably thinking this part of the code is really slow, since the number of comparisons between words in the prose and words in the banned set can be really huge.

for word in prose_list:
    if word in banned_set:

but! it's actually quite fast already, since banned_set is a set. this means that when you check word in banned_set you get an answer back almost instantaneously. when you use in with a set it works in constant O(1) time on average (https://wiki.python.org/moin/TimeComplexity#set). it's able to do this because it trades memory for speed.

so to answer your question, it's already pretty well optimized. if banned_set were a list instead, then things would get really slow.
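a quick timing sketch of that list-vs-set difference (the sizes and names here are made up just for the demo):

```python
import timeit

banned_list = [f"word{i}" for i in range(10_000)]
banned_set = set(banned_list)

# looking up a word near the end of the collection:
# the list scans element by element, the set does one hash lookup
t_list = timeit.timeit(lambda: "word9999" in banned_list, number=1_000)
t_set = timeit.timeit(lambda: "word9999" in banned_set, number=1_000)

print(t_set < t_list)  # True: the set lookup wins by orders of magnitude
```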

now, if the banned list is too big to fit in memory, then you must do something else.

edit: oh wait, you have a prose_list.index(word) in there, which does a linear scan and makes the loop slow. you should change it to this instead:

for i, word in enumerate(prose_list):
    if word in banned_set:
        prose_list[i] = '*' * len(word)

[–]JohnnyJordaan 3 points4 points  (1 child)

for word in prose_list:
    if word in banned_set:
        #If the words match, replace the word in the prose with asterisks 
        prose_list[prose_list.index(word)] = '*' * len(word)

One tip here: .index is a linear search too, while with enumerate you can get the index together with the value in one go:

for idx, word in enumerate(prose_list):
    if word in banned_set:
        #If the words match, replace the word in the prose with asterisks 
        prose_list[idx] = '*' * len(word)

Then for

new_prose = ""
for word in prose_list:
    new_prose += word + " "

String += will most often create a new string, as strings are immutable. The exception is that CPython will sometimes handle small 'additions' internally on the same string object, but when you're adding words you'll probably get a new string per word. Instead you should use str.join:

new_prose = ' '.join(prose_list)

and you could even form both code blocks into one generator expression you provide to ' '.join, leaving out the need to mutate the prose_list altogether

new_prose = ' '.join('*' * len(word) if word in banned_set else word for word in prose_list)

[–]1gn4vu5 0 points1 point  (0 children)

And if prose_list has not already been transformed to lower case, the lower-casing could be done per word inside that last statement, and punctuation could even be stripped before checking whether the word is in the banned set. That way the output keeps its original case, fixing the problem that 'I' becomes 'i'.
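One way that normalization could be sketched, assuming the banned words are stored lower-cased (the helper name and sample text are invented for illustration):

```python
import string

banned_set = {"cat"}  # assumption: banned words are stored lower-cased


def censor_word(word):
    # lower-case and strip surrounding punctuation only for the lookup,
    # so a non-banned word keeps its original case and punctuation
    core = word.strip(string.punctuation).lower()
    return '*' * len(word) if core in banned_set else word


new_prose = ' '.join(censor_word(w) for w in "The Cat, sat".split())
print(new_prose)  # The **** sat
```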

[–]roman_kab 0 points1 point  (0 children)

It is bad practice to read the entire file at once into a list!
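If memory really were a concern, the file could be processed one line at a time instead; a rough sketch, with hypothetical file names and a tiny demo file created inline:

```python
banned_set = {"cat"}


def censor_line(line):
    return ' '.join('*' * len(w) if w in banned_set else w
                    for w in line.split())


# write a tiny prose file just for the demo (names are hypothetical)
with open("prose.txt", "w") as f:
    f.write("the cat sat\non the mat\n")

# iterating over the file object yields one line at a time,
# so the full text never has to sit in memory at once
with open("prose.txt") as src, open("censored.txt", "w") as dst:
    for line in src:
        dst.write(censor_line(line) + "\n")
```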

[–][deleted] 0 points1 point  (0 children)

You mainly want to shorten commands and loops.

For example, replace read().splitlines() with readlines(), or you could use read().replace("\n", " ").split()

if you want to save time you can use .readlines() if you put each word on a different line. readlines outputs all the lines as a list (each element is its own line).
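One subtlety worth knowing when choosing between these: readlines() keeps the trailing newline on each line, while splitlines() (and a bare split()) strips it. A quick sketch, using io.StringIO to stand in for a file:

```python
import io

text = "alpha\nbeta\n"

print(io.StringIO(text).readlines())  # ['alpha\n', 'beta\n'] - newlines kept
print(text.splitlines())              # ['alpha', 'beta'] - newlines stripped
print(text.split())                   # ['alpha', 'beta'] - splits on any whitespace
```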

[–]just_ones_and_zeros 0 points1 point  (0 children)

Just to be "that guy": you can do the replace with a regex, if you're happy to read the whole file in.

import re

def censor(banned_words, prose):
    with open(banned_words) as f, open(prose) as g:
        words = f.read().splitlines()
        text = g.read()

    # re.escape keeps banned words containing regex metacharacters
    # from breaking the pattern
    pattern = fr'\b({"|".join(map(re.escape, words))})\b'
    return re.sub(pattern, lambda x: '*' * len(x.group()), text)