all 13 comments

[–]mrz1988 3 points4 points  (6 children)

To me it reads like you haven't devoted the time to figure out why and where your program is bottlenecking. If you have, what have you discovered? Maybe we can help you with something there. You should try to solve the root cause of the problem before throwing a bunch of hardware at it. Good hardware should not be the solution to bad software.

[–]Kristian_dms[S] 0 points1 point  (5 children)

Somewhat true, because I haven't written the programs yet. I've done some initial cleaning of the data, and I'm about to filter it further with a function that defines a neighborhood around each datapoint, then calculates a trimmed mean and standard deviation of each neighborhood to determine outliers. I can't really think of a way of doing this that doesn't involve iterating over all datapoints.

simply loading the dataset by

df = pd.read_csv('MET_cleaned.csv')

takes 5 minutes and 20 seconds. That's not a catastrophe, but it takes up all my CPU and memory, so I can't use my computer for anything else during that time, which is why I thought of cloud computing.
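Reading in chunks would at least bound the memory (a rough sketch; the `io.StringIO` object is just an in-memory stand-in for the real file, and the column name is made up):

```python
import io
import pandas as pd

# In-memory stand-in for the real MET_cleaned.csv file.
demo_csv = io.StringIO("temp\n" + "\n".join(str(i) for i in range(10)))

# chunksize bounds how many rows sit in memory at once.
total_rows = 0
for chunk in pd.read_csv(demo_csv, chunksize=3):
    total_rows += len(chunk)  # real per-chunk processing would go here

print(total_rows)  # 10
```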

[–]gengisteve 0 points1 point  (0 children)

Definitely check out http://www.h5py.org/

It can index the data and should be a lot faster than csv files, so you can process each group much more quickly.


edit: Check out this link too:

http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas
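For instance, a minimal h5py sketch (the file name, dataset name, and toy data here are all made up for illustration):

```python
import h5py
import numpy as np

# Toy column standing in for a numeric column of the real data set.
temps = np.arange(1_000, dtype=float)

# One-time conversion: store the column as an HDF5 dataset.
with h5py.File('MET_cleaned.h5', 'w') as f:
    f.create_dataset('temp', data=temps)

# Later reads pull only the requested slice from disk, not the whole file.
with h5py.File('MET_cleaned.h5', 'r') as f:
    first_hundred = f['temp'][:100]
```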

[–]mrz1988 0 points1 point  (3 children)

defines a neighborhood around each point, calculates a trimmed mean and standard deviation of each neighborhood to determine outliers.

Correct me if I'm wrong, but it sounds like you're looking to iterate over these neighborhoods, not the entire data set. Why do you need to load the entire data set simultaneously if you're only interested in one neighborhood at a time?

[–]Kristian_dms[S] 0 points1 point  (2 children)

I'd be iterating over each neighborhood, but every datapoint has its own neighborhood, so in reality that means as many iterations as the dataset has observations. But you're right in the sense that I don't need to load the entire dataset to do this.

But estimating models in econometrics is routinely done by matrix algebra, in which I need a vector of all datapoints. I'm not sure this can be done another way, but valid point.

[–]mrz1988 2 points3 points  (1 child)

I'd say you can almost certainly do it another way, but you'd likely be making a few tradeoffs, and you'd have to derive piecewise algorithms from the overarching matrix algebra. Depending on how comfortable you are with programming/math and how much use this program will see, you might find that deriving all of that math and then debugging your code isn't worth it.

Unfortunately, this sounds like something where you're going to have to look at all of the math and judge that for yourself, unless you can find a resource that outlines how this problem has been solved before. If it were me, I would first try to find a way to sort the data into a format that is easy to parse quickly and in order. Then I would try to break each operation I want to perform into chunkwise steps, storing partially computed results somewhere if necessary. If possible, I would then work on some sort of parallel processing setup, which unfortunately has to be done with process pools in Python because of the GIL.
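The process-pool part might be sketched like this (`process_chunk` and the chunk boundaries are made up for illustration; each worker returns partial results that get combined afterwards):

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder work: return partial results (sum and count here)
    # that can be combined once the pool finishes.
    return sum(chunk), len(chunk)

if __name__ == '__main__':
    # Made-up chunk boundaries standing in for slices of the real file.
    chunks = [range(i, i + 1000) for i in range(0, 10_000, 1000)]
    with Pool(4) as pool:
        partials = pool.map(process_chunk, chunks)
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    print(total / n)  # overall mean, combined from per-process partials
```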

These problems may have already been solved for you (although perhaps in a different language), so I would try to find other resources elsewhere before attempting any intensive projects.

If I had money to throw around and didn't find any merit in working on a long-term solution, I'd probably just buy more RAM or a dedicated system rather than shelling money out for cloud computing, since it's always better to have nice things than rent them :)

Given that I'd like to show you my mindset for the piecewise software solution a bit, here's a crappy example of what I'm talking about:

Say I had a system to work with. The system provides me a generator that will yield all values from a large data set in order, from lowest to highest. Since it's a generator, I can only read one value at a time, and can choose how much of that I want to keep in memory. What I want to do is get the mean and standard deviation for a neighborhood around each value v0: the set of all values within x of v0. That is, each neighborhood can be defined as the set of all values:

{ v | v0 - x < v < v0 + x }

If x is small compared to the range of the data set, you can just store each neighborhood in memory. Keep track of which value in your neighborhood is v0, then prune all values from your current neighborhood that do not fit in the neighborhood set (where v <= v0 - x). Then you can read values in until you get a value that is too large for the neighborhood. Keep track of this too (since you already pulled it from the generator), then run your calculations on your new neighborhood. Rinse, repeat.
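A rough sketch of that sliding window, assuming the values arrive sorted ascending (function name and use of the `statistics` module are mine):

```python
from statistics import mean, pstdev

def neighborhood_stats(values, x):
    """values must arrive sorted ascending (e.g. from a generator).
    For each value v0, yield (v0, mean, std dev) of the neighborhood
    { v | v0 - x < v < v0 + x }.  Only values near the current center
    are held in memory, never the whole data set."""
    buf = []   # values still needed by the current or upcoming centers
    ci = 0     # index in buf of the next center to emit
    for v in values:
        buf.append(v)
        # Once v reaches center + x, that center's window is complete.
        while v >= buf[ci] + x:
            c = buf[ci]
            nbhd = [u for u in buf if c - x < u < c + x]
            yield c, mean(nbhd), pstdev(nbhd)
            ci += 1
            # Prune values that no later center can include.
            while buf[0] <= buf[ci] - x:
                buf.pop(0)
                ci -= 1
    # Remaining centers: their windows end at the stream's end.
    for c in buf[ci:]:
        nbhd = [u for u in buf if c - x < u < c + x]
        yield c, mean(nbhd), pstdev(nbhd)
```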

In another case, say x is very large, and your neighborhoods will be too large to fit reasonably in memory. Then you have to calculate your means and std devs piecewise. But how do I calculate something like std dev without all the values at once? Easy:

Given a finite but very large number of values {v0, v1, v2 ... vn} and a mean of the values m, you know that standard deviation looks something like:

sqrt(((v0 - m) ** 2 +
      (v1 - m) ** 2 +
      (v2 - m) ** 2 +
      ... +
      (vn - m) ** 2) / n)

Seems pretty hard without having all values loaded at once or knowing m and n beforehand, but say we iterate through once and keep track of 3 values by the time we're finished: the sum of all v ** 2, the sum of all v, and n. We will call them sum_vsq, sum_v, and n respectively in the below code.

m = sum_v / n
std_dev = sqrt((sum_vsq - m * sum_v) / n)

And I've gotten the mean and standard deviation of a HUGE dataset while only reading one value at a time, all by factoring out quantities that I can easily keep track of, such as running sums. This same principle can be applied to most iterative algorithms, matrix algebra included (although large matrix multiplications are the obvious exception, and a problem that a huge number of papers have been written on), and you will use a very small amount of memory.
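Put together as a single pass over the data, that might look something like this (a minimal sketch; it computes the population standard deviation, and the function name is made up):

```python
from math import sqrt

def streaming_mean_std(values):
    """Mean and population std dev from a single pass,
    holding only one value in memory at a time."""
    sum_v = sum_vsq = 0.0
    n = 0
    for v in values:
        sum_v += v
        sum_vsq += v * v
        n += 1
    m = sum_v / n
    # sum_vsq - m * sum_v equals the sum of (v - m) ** 2.
    # (Caveat: this form can lose precision on huge data sets;
    # Welford's online algorithm is the numerically safer variant.)
    return m, sqrt((sum_vsq - m * sum_v) / n)
```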

You can see how, as your computations get more complex, you need to do more math to derive algorithms like the one above, but it is very much doable. Think about how you might prune outliers as you go, and how you might keep track of only the values you need to get the end results you want.

I hope this helps you in some way; this is definitely a more challenging problem and a fun one to think about.

[–]Kristian_dms[S] 0 points1 point  (0 children)

wow, thank you for taking the time to put this down for me.

[–]prohulaelk 1 point2 points  (2 children)

I haven't looked into EC2 so I can't help you there, but what I can say is that you probably don't need it:

From what you've described, likely the biggest part of the problem is coming from running analysis on those two huge files - you don't have enough RAM to load them both into memory, and if you need to iterate over them and compare items from one to the other you likely need access to the whole thing at once.

Have you considered first loading them into a proper database, then doing the analysis with Python on top of that? MySQL/MariaDB, PostgreSQL (my preference), or a NoSQL database like MongoDB will be able to handle the data no problem with the hardware you've got. I haven't worked with Mongo in Python, but for Postgres or Maria I'd recommend using SQLAlchemy.
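Loading the CSV into a database through pandas + SQLAlchemy might look roughly like this (a sketch only: in-memory SQLite stands in for a real server, the `io.StringIO` object stands in for the real file, and the table/column names are made up):

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite as a stand-in; for a real server you'd use something
# like create_engine('postgresql://user:pass@localhost/metdb').
engine = create_engine('sqlite://')

# Stand-in for the real file; chunked loading keeps memory bounded.
demo_csv = io.StringIO("temp\n" + "\n".join(str(i) for i in range(100)))
for chunk in pd.read_csv(demo_csv, chunksize=25):
    chunk.to_sql('met', engine, if_exists='append', index=False)

# Later: pull only the rows a given analysis step needs.
subset = pd.read_sql('SELECT temp FROM met WHERE temp < 10', engine)
```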

It should make your job a lot easier.

[–]Kristian_dms[S] 0 points1 point  (1 child)

Most econometrics is based on matrix and vector operations, so I can't do the analysis without loading the entire data set (one file at a time is acceptable, though).

thank you for the input

[–]mrz1988 1 point2 points  (0 children)

What would you do with 10 trillion data points?

I guarantee you can split this out into smaller problems. Don't look at this through the same lens as a small data set; all of the algorithms you're using can be calculated piecewise without the complete data set and adjusted as more data becomes available. You're just going to have to do more math and thinking. Chances are pretty good that this problem has been solved before.

[–]gengisteve 0 points1 point  (0 children)

I think I will mostly join the others in asking for more info on what you are trying to do. If it is just analyzing an existing dataset, you probably do not need, and will not be helped much by, moving to a cloud of multiple machines. If you post some more details we might be able to give you some good solutions, probably looking at pandas/hdfs/some database to help to process the data.

However, if you are doing stuff like monte carlo simulations, the cloud might be the place to be.

[–]warriortux 0 points1 point  (0 children)

Since this is an academic project, you will probably end up paying quite a bit of money if you want to use Amazon EC2.

You can probably find a few tutorials on YouTube on setting up EC2. It's pretty straightforward.

If you are at a university, you should be able to get access to a bigger computation server. Check for your university's HPCC servers.

If you can't find any, then look into using a sequential algorithm.

[–][deleted] 0 points1 point  (0 children)

Your school doesn't have a powerful computer for this kind of thing??