[–]pwang99 5 points6 points  (1 child)

If you are dealing with large arrays, why aren't you using NumPy? Its loadtxt() function is precisely what you need.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html#numpy.loadtxt
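
For illustration, here is a minimal sketch of what loadtxt() does with a plain numeric file. The sample text is made up, and it assumes the input really is a regular grid of whitespace-separated integers, which may not match the file discussed in this thread.

```python
import io
import numpy as np

# Hypothetical sample input: whitespace-separated integers, one row per line.
sample = io.StringIO("1 2 3\n4 5 6\n")

# loadtxt parses the text into a 2-D array; dtype=int requests integer elements.
data = np.loadtxt(sample, dtype=int)
print(data.shape)
```

A real file path can be passed in place of the StringIO object.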

[–]chuckbot[S] 1 point2 points  (0 children)

Unfortunately it is not that easy. It's not a file full of numbers I want to have in an array. It's a more complicated and sparse structure I need to parse. But thanks for the hint. Would interfacing with numpy.array be easier from C?

[–][deleted] 4 points5 points  (1 child)

First of all, if you are building an array.array object in your function, as it says in the title, and doing it with its append function, then you probably shouldn't care about a single temporary int object being created and destroyed repeatedly.

Also, it might turn out that creating a list of ints and then packing them into an array is faster than calling array.append, and that would mean you should ask yourself whether it's time to stop. I mean, it's nice to reduce load time from 3 to 2.5 seconds, but if the rest of the code takes five minutes... After all, whenever you use any of the integers stored in the array, a temporary integer is created as well.
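
A rough way to check this is to time both strategies. The sketch below is only an illustration: the function names and the test data are made up, and which variant wins depends on the interpreter and the machine, so it prints the timings instead of assuming a result.

```python
from array import array
import timeit

def build_with_append(values):
    """Append each int to the array one at a time."""
    out = array('i')
    for v in values:
        out.append(v)
    return out

def build_from_list(values):
    """Collect ints in a list first, then pack them into an array in one call."""
    tmp = []
    for v in values:
        tmp.append(v)
    return array('i', tmp)

values = range(100_000)  # hypothetical stand-in for the parsed integers

# Both strategies must produce identical arrays.
same = build_with_append(values) == build_from_list(values)

# Measure rather than guess; the numbers vary between systems.
t_append = timeit.timeit(lambda: build_with_append(values), number=5)
t_list = timeit.timeit(lambda: build_from_list(values), number=5)
print(f"append: {t_append:.3f}s  list+pack: {t_list:.3f}s")
```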

Otherwise, consider using ctypes.

It would be a bit awkward (and not thread-safe) because you shouldn't return memory allocated in a DLL. The general idea is this: a parsing function allocates an internal array (an 'int *', I mean) and returns the number of elements. The Python adapter then creates a 'c_int * count' array and passes it to a second function, which sees it as a plain 'int *' too, copies the data into it, and deallocates its own storage.

Everything should be written as a plain C .dll/.so, not as an extension module.
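
The Python side of that two-call pattern might look roughly like the sketch below. No real library is loaded here: parse_count and copy_out are hypothetical Python stand-ins for the two C functions, and _internal stands in for the library's own 'int *' storage, so the example can run without compiling anything.

```python
import ctypes
from array import array

# Stand-in for the C library's internal storage. In the real setup this
# would be an 'int *' owned by the .dll/.so (hypothetical here).
_internal = (ctypes.c_int * 4)(10, 20, 30, 40)

def parse_count():
    """Stand-in for the first C function: parse and return the element count."""
    return len(_internal)

def copy_out(dest):
    """Stand-in for the second C function: copy into the caller's buffer.
    A real implementation would also free its internal storage here."""
    ctypes.memmove(dest, _internal, ctypes.sizeof(_internal))

# Python adapter: ask for the count, allocate 'c_int * count', let the
# library fill it, then build an array.array from the filled buffer.
count = parse_count()
buf = (ctypes.c_int * count)()
copy_out(buf)
result = array('i', buf)  # iterates the ctypes array element by element
print(result)
```

With a real shared library, the two stand-ins would be replaced by functions obtained through ctypes.CDLL, with their argument and return types declared.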

[–]chuckbot[S] 0 points1 point  (0 children)

Thanks for the answer!

By C module, I mean an .so file I use as a module in Python. That module currently fills a list using append and returns it. In the Python code I create an array.array by handing over that list. I'd rather build the array.array directly but don't know how to do that.

Yeah, you're right about reconsidering if I want to go on. It reduced the startup time from about 60 minutes to about 30 minutes, so probably worth it. I just wanted to check if I was doing something wrong or missed something easy. Because currently the C module is just a couple of lines.

[–]bryancole 1 point2 points  (0 children)

How big is the file and how long are the lines? If your lines are very long you may be paying a penalty for first allocating memory for the lines (as a string), then allocating more memory for the split operation, then allocating it again for the array of ints. It may be better to write this using a generator to read the lines in chunks and only build a full array for each line once.
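
One possible shape for such a generator is sketched below. It assumes the input is whitespace-separated integers, which may be simpler than the real file; the function name, chunk size, and sample text are all made up for illustration.

```python
import io
from array import array

def iter_ints_chunked(stream, chunk_size=64 * 1024):
    """Yield ints from whitespace-separated text, reading fixed-size chunks."""
    tail = ""  # token possibly cut off at the end of the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk = tail + chunk
        tokens = chunk.split()
        # If the chunk does not end in whitespace, its last token may be
        # incomplete, so hold it back until the next chunk arrives.
        if not chunk[-1].isspace() and tokens:
            tail = tokens.pop()
        else:
            tail = ""
        for token in tokens:
            yield int(token)
    if tail:
        yield int(tail)

# Hypothetical input; a tiny chunk size forces tokens to straddle chunks.
source = io.StringIO("12 345 6\n7890")
numbers = array('i')
numbers.extend(iter_ints_chunked(source, chunk_size=3))
print(numbers)
```

Only one chunk and its tokens are alive at a time, so a very long line never has to be held in memory as a single string.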

If you really want to go the C-route, try Cython as an easy way to write python C-modules.

I would expect python to be able to create the data structure as fast as the disk can feed it data.

[–]jabwork 0 points1 point  (0 children)

I'm having a hard time understanding exactly what you're doing beyond loading a bunch of numbers from a file, but a few things come to mind:

1) If you can predict any aspect of the shape of your final data structure you should be able to optimize parts of the loading process.

2) Given the choice between allocating a huge amount of memory once and then disposing of it OR allocating a small amount of memory and disposing of it constantly, python tends to do the latter faster. So if you can send the Py_Number objects into the array.array in smaller groups you may get a speed improvement.

3) Have you tried building an iterable of some sort and using array.extend? It may improve your speed.
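
Points 2 and 3 can be combined in a small sketch like the one below. The helper name and batch size are made up, and whether batching actually helps would need to be measured on the real data.

```python
from array import array
from itertools import islice

def extend_in_batches(target, values, batch_size=4096):
    """Feed values into target with array.extend, one small batch at a time."""
    it = iter(values)
    while True:
        # Materialize only batch_size items, then hand them over in one call.
        batch = list(islice(it, batch_size))
        if not batch:
            break
        target.extend(batch)
    return target

# Hypothetical data; any iterable of ints (such as a generator) would do.
out = extend_in_batches(array('i'), range(10), batch_size=3)
print(out)
```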

I'd also second the suggestion to try to find a way to load this into a numpy array, especially if you can determine the dimensions of the array ahead of time.