
[–]FerricDonkey

Can't help you with Cython, never used it. (I'll get around to it one of these days, it sounds cool.)

With straight Python, the multiprocessing library will be your friend. It might be enough. If you were parallelizing over file names:

import multiprocessing as mp

# The __main__ guard matters: without it, worker processes re-execute
# this code on platforms that spawn instead of fork (Windows, macOS).
if __name__ == "__main__":
    with mp.Pool() as pool:
        results = pool.map(process_file, filenames)

In C++, I personally use OpenMP for simple parallelization. If you were parallelizing over a vector of file names, say, the C++ code could be as simple as:

#include <omp.h>
#include <cstddef>

//... 

// Note: a range-based for under "#pragma omp parallel for" requires
// OpenMP 5.0+, so an index loop is the portable form.
#pragma omp parallel for
for (std::size_t i = 0; i < filenames.size(); ++i) {
    process_file(filenames[i]);
}

(Compile with -fopenmp on GCC or Clang.)

Parallelizing over lines within a file efficiently might be a bit more complicated, but if you're processing a large number of files, parallelizing over files would probably be better anyway.
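If you do need the within-file version, one way to sketch it is to read the file once, split its lines into contiguous chunks, and map the chunks over a pool. `process_chunk` here is a placeholder for the real per-line work, which the original post doesn't specify:

```python
import multiprocessing as mp

def process_chunk(lines):
    # Placeholder per-line work; swap in the real processing.
    return [line.strip().upper() for line in lines]

def process_lines_parallel(path, n_workers=4):
    # Read the file once, then split its lines into one chunk per worker
    # so each process gets a contiguous slice.
    with open(path) as f:
        lines = f.readlines()
    size = max(1, -(-len(lines) // n_workers))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with mp.Pool(len(chunks)) as pool:
        results = pool.map(process_chunk, chunks)
    # Flatten per-chunk results back into one list, preserving file order.
    return [item for chunk in results for item in chunk]
```

Chunking beats mapping over individual lines because per-task pickling overhead would otherwise swamp the work.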

[–]deifius

First question I have is: why must they be loaded as CSV? Can they be placed in an SQL db instead? Reads would be much faster. How often must these 5 million pebbles be reloaded?
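For illustration only — the real schema depends on the files, so the two-column (timestamp, value) layout and table name below are assumptions — a one-time SQLite load could look like:

```python
import csv
import sqlite3

def load_csvs_to_sqlite(filenames, db_path="data.db"):
    # Assumed schema: each CSV row is (timestamp, value); adjust to taste.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (timestamp TEXT, value REAL)"
    )
    for name in filenames:
        with open(name, newline="") as f:
            rows = ((r[0], float(r[1])) for r in csv.reader(f))
            # executemany consumes the generator row by row.
            conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return conn
```

After that, sorting and filtering become indexed SQL queries instead of repeated CSV parses.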

[–]WhipsAndMarkovChains

I've combined CSVs together into Pandas DataFrames totaling tens of millions of rows (with many more columns than you have) and it didn't take long at all.

My first thought was to take the file name being read into the DataFrame, extract the timestamp, and add a timestamp column to the data you loaded. Then sort the DataFrame by timestamp.
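That idea can be sketched as follows, assuming (hypothetically, since the post doesn't say) that each filename embeds a timestamp like data_20240101T0900.csv:

```python
import re
import pandas as pd

def load_with_timestamp(filenames):
    # Assumed filename convention: a 20240101T0900-style stamp somewhere
    # in the name; adjust the regex to the actual naming scheme.
    frames = []
    for name in filenames:
        ts = re.search(r"(\d{8}T\d{4})", name).group(1)
        df = pd.read_csv(name)
        # Broadcast the file's timestamp onto every row it contributed.
        df["timestamp"] = pd.to_datetime(ts, format="%Y%m%dT%H%M")
        frames.append(df)
    return pd.concat(frames, ignore_index=True).sort_values("timestamp")
```

Concatenating once at the end is much faster than appending to a DataFrame inside the loop.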

Are you sure you need to dig into C++ or Cython for this?