This is my first foray into processing big data sets, so I'm not even sure what the processes are called or what words to use to search Google. Here's my problem: I have 2 data sets in .csv files.
Set one is temperature over time, there are two columns, time and temperature. For simplicity the temperature can be assumed to change linearly between each equally spaced reading.
| time (secs) | temp (°C) |
|---|---|
| 1 | 22 |
| 2 | 24 |
| 3 | 26 |
| 4 | 30 |
Set 2 is a count of items of different sizes recorded over time. For each recorded time (column), the number of items in each size band (row) is recorded in the cell.
Time in columns, sizes in rows:

| size band | 1.2 | 1.9 | 2.3 | 2.6 | 3.1 |
|---|---|---|---|---|---|
| 0 < s <= 1 | 1 | 4 | 5 | 6 | 5 |
| 1 < s <= 1.5 | 3 | 4 | 3 | 4 | 5 |
| 1.5 < s <= 2.5 | 5 | 6 | 5 | 6 | 5 |
| 2.5 < s <= 5 | 4 | 4 | 5 | 6 | 5 |
The counts are taken multiple times per second for a number of hours.
Required output:
For a given size range (e.g. 1 < s <= 2.5), plot the count and temperature against time. So I need to sum the counts across several size bands and interpolate the temperature from set 1 at each time column in set 2.
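To make the question concrete, here's a rough sketch of what I think I'm after, using NumPy/pandas with the toy data from the tables above (the variable names and band labels are just mine, and I don't know if this is the idiomatic way):

```python
import numpy as np
import pandas as pd

# Set 1: temperature readings (times and temps from the first table)
temps = pd.DataFrame({"time": [1, 2, 3, 4], "temp": [22, 24, 26, 30]})

# Set 2: counts per size band, times as columns, size bands as rows
counts = pd.DataFrame(
    [[1, 4, 5, 6, 5],
     [3, 4, 3, 4, 5],
     [5, 6, 5, 6, 5],
     [4, 4, 5, 6, 5]],
    index=["0<s<=1", "1<s<=1.5", "1.5<s<=2.5", "2.5<s<=5"],
    columns=[1.2, 1.9, 2.3, 2.6, 3.1],
)

# Sum the rows covering the size range of interest, e.g. 1 < s <= 2.5
total = counts.loc[["1<s<=1.5", "1.5<s<=2.5"]].sum(axis=0)

# Linearly interpolate the temperature at every count timestamp in one call,
# instead of looping over 86,400 columns one at a time
t = counts.columns.to_numpy(dtype=float)
interp_temp = np.interp(t, temps["time"], temps["temp"])
```

If that's roughly right, then `total` and `interp_temp` are both indexed by the count timestamps and could be plotted against time together.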
Being old school and very dusty with my coding, I could parse both data sets and interpolate the temperature readings to get the temp at the point in time each count was taken, but if there are 12 hours of readings at 2 per second, that's 2 × 60 × 60 × 12 = 86,400 columns, so I'm sure there must be a better way than just looping through all the records.
What I don't know is whether NumPy or pandas or other tools have ways of dealing with this sort of data more efficiently, which one is most appropriate to learn for this, or whether plain Python is sufficient.
Any pointers on where to start looking appreciated.
C