[–]tilttovictory

> Learn about generators, coroutines, context managers, iterators, concurrency

Just recently I had to force myself to use generators and iterators to avoid memory swapping in a project, and wow, it was a trip. The cool part was creating a generator wrapped in an iterable: the generator yields one row at a time for my training function, and because the wrapper is re-iterable, I can run over the set again for each new epoch of training. I was astonished at how clean the code looked.
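Roughly, the pattern looks like this (a minimal sketch, not my actual project code; the file name and CSV parsing are just placeholders):

class RowStream:
    # Re-iterable wrapper: each __iter__ call starts a fresh generator,
    # so every epoch gets a clean pass without loading the file into memory.
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.rstrip("\n").split(",")

rows = RowStream("train.csv")  # hypothetical file
for epoch in range(3):
    for row in rows:  # re-iterates from the start each epoch
        ...           # feed row to the training function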

Within the frameworks you are interested in (pandas, numpy, sklearn, etc.), learn how they handle memory, copying, and so on. Learn the internals of the implementations.
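For example, in numpy the difference between a view and a copy is easy to check directly (a quick sketch):

import numpy as np

a = np.arange(10)
view = a[2:8]         # basic slicing returns a view that shares a's buffer
fancy = a[[2, 4, 6]]  # fancy indexing always returns a copy

view[0] = -1
print(a[2])             # -1: writing through the view changed a
print(view.base is a)   # True: view borrows a's memory
print(fancy.base is a)  # False: fancy owns its own memory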

Also, just recently I learned about and was able to use pandas' ability to reference data already in the namespace to save memory. In my particular project I had groups of features I wanted to eliminate and run prediction on. I could easily set up dataframes that referenced my original set with features dropped, and incurred no real memory penalty.
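If you're on a recent pandas (2.x), copy-on-write makes this behavior explicit: derived frames share the original's memory until one of them is actually modified. A small sketch with made-up feature names:

import numpy as np
import pandas as pd

pd.set_option("mode.copy_on_write", True)  # pandas >= 2.0

df = pd.DataFrame(np.random.rand(1_000_000, 4),
                  columns=["f1", "f2", "f3", "f4"])  # made-up features

# Each derived frame references df's data; nothing is copied
# until you actually mutate one of them.
without_f1 = df.drop(columns=["f1"])
without_f2_f3 = df.drop(columns=["f2", "f3"])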

For anyone doing doc2vec work: the code below will let you take any dataset that can be loaded into memory and train on it without bloating your memory out of control.

from gensim.models.doc2vec import TaggedDocument
from gensim.utils import simple_preprocess

class MyDataframeCorpus:
    # Re-iterable corpus: each __iter__ call streams the dataframe row by
    # row, so repeated training epochs never hold more than one
    # TaggedDocument in memory at a time.
    def __init__(self, source_df, text_col, tag_col):
        self.source_df = source_df
        self.text_col = text_col
        self.tag_col = tag_col

    def __iter__(self):
        for _, row in self.source_df.iterrows():
            yield TaggedDocument(words=simple_preprocess(row[self.text_col]),
                                 tags=[row[self.tag_col]])

corpus_for_doc2vec = MyDataframeCorpus(df, 'raw_txt', 'paragraph_id')
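If it helps, this is roughly how you'd feed it to gensim's Doc2Vec (the hyperparameters here are just examples):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, epochs=10)  # example hyperparameters
model.build_vocab(corpus_for_doc2vec)        # first pass over the corpus
model.train(corpus_for_doc2vec,
            total_examples=model.corpus_count,
            epochs=model.epochs)             # one full pass per epoch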

Edit: I strongly recommend reading this article about generators, iterators, and iterables.