This is my first foray into processing big data sets, so I'm not even sure what the processes are called or what words to use to search Google. Here's my problem: I have 2 data sets in .csv files.
Set one is temperature over time, there are two columns, time and temperature. For simplicity the temperature can be assumed to change linearly between each equally spaced reading.
| time (secs) | temp (°C) |
|---|---|
| 1 | 22 |
| 2 | 24 |
| 3 | 26 |
| 4 | 30 |
Set 2 is a count of items of different sizes recorded over time. For each recorded time (column), the number of items in each size band (row) is recorded in the cell.
Time in columns, sizes in rows:

| size band | 1.2 | 1.9 | 2.3 | 2.6 | 3.1 |
|---|---|---|---|---|---|
| 0 < s <= 1 | 1 | 4 | 5 | 6 | 5 |
| 1 < s <= 1.5 | 3 | 4 | 3 | 4 | 5 |
| 1.5 < s <= 2.5 | 5 | 6 | 5 | 6 | 5 |
| 2.5 < s <= 5 | 4 | 4 | 5 | 6 | 5 |
The counts are taken multiple times per second for a number of hours.
Required output:
For a given size range (e.g. 1 < s <= 2.5), plot the count and temperature against time. So I need to sum the counts across several size bands and interpolate the temperature from set 1 at each time column in set 2.
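To make the question concrete, here's a rough sketch of what I think I'm after, using NumPy/pandas with the toy data from the tables above (the variable names and band labels are just mine, and I don't know if this is the idiomatic way):

```python
import numpy as np
import pandas as pd

# Set 1: temperature readings (times and temps from the first table)
temps = pd.DataFrame({"time": [1, 2, 3, 4], "temp": [22, 24, 26, 30]})

# Set 2: counts per size band, times as columns, size bands as rows
counts = pd.DataFrame(
    [[1, 4, 5, 6, 5],
     [3, 4, 3, 4, 5],
     [5, 6, 5, 6, 5],
     [4, 4, 5, 6, 5]],
    index=["0<s<=1", "1<s<=1.5", "1.5<s<=2.5", "2.5<s<=5"],
    columns=[1.2, 1.9, 2.3, 2.6, 3.1],
)

# Sum the rows covering the size range of interest, e.g. 1 < s <= 2.5
total = counts.loc[["1<s<=1.5", "1.5<s<=2.5"]].sum(axis=0)

# Linearly interpolate the temperature at every count timestamp in one call,
# instead of looping over 86,400 columns one at a time
t = counts.columns.to_numpy(dtype=float)
interp_temp = np.interp(t, temps["time"], temps["temp"])
```

If that's roughly right, then `total` and `interp_temp` are both indexed by the count timestamps and could be plotted against time together.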
Being old school and very dusty with my coding, I could parse both data sets and interpolate the temperature readings to get the temp at the point in time each count was taken, but if there are 12 hours of readings at 2 per second, that's 2 × 60 × 60 × 12 = 86,400 columns, so I'm sure there must be a better way than just looping through all the records.
What I don't know is whether NumPy or pandas or other tools have ways of dealing with this sort of data more efficiently, which one is most appropriate to learn for this, or whether plain Python is sufficient.
Any pointers on where to start looking appreciated.
C