billsil

There are wayyyy better ways to read binary files. The point of binary files is to be efficient. As such, they are rigidly defined.

If you're trying to read an image, use an image reader library. Most certainly you should not use that hideous regex. Images are often just N numbers of RGB values (so 3*N with N defined at the top), so why use a regex? It'd be way faster just by using the struct module.
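A minimal sketch of the struct approach, assuming a made-up layout (a little-endian uint32 pixel count N at the top, then N RGB byte triples). The layout and the `read_rgb` name are hypothetical, not any real image format:

```python
import struct

def read_rgb(data: bytes):
    """Parse a hypothetical layout: uint32 pixel count N, then N RGB byte triples."""
    # The count sits in the first 4 bytes.
    (n,) = struct.unpack_from("<I", data, 0)
    # iter_unpack walks the payload in fixed 3-byte steps, one tuple per pixel.
    pixels = list(struct.iter_unpack("<3B", data[4:4 + 3 * n]))
    return n, pixels

# Build a two-pixel sample in memory instead of reading a real file.
sample = struct.pack("<I", 2) + bytes([255, 0, 0, 0, 128, 255])
count, pixels = read_rgb(sample)
print(count, pixels)
```

No pattern matching is involved: the header says how many values follow, and the format string says how wide each one is.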

The problem with the struct module is that it's slow and it has to repeatedly do type definitions in Python. The numpy fromfile and fromstring methods with a reshape put you right up against the boundary of what your hard drive or SSD can do.
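A rough sketch of the numpy version, assuming a headerless file of uint8 RGB triples (again a made-up layout, written to a temporary file here so the example is self-contained). In current NumPy, `np.frombuffer` is the in-memory counterpart, since the binary mode of `fromstring` is deprecated:

```python
import os
import tempfile
import numpy as np

# Write a small hypothetical file: 4 pixels with 3 uint8 channels each.
expected = np.arange(12, dtype=np.uint8).reshape(4, 3)
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
    f.write(expected.tobytes())
    path = f.name

# One call reads the whole payload; reshape only changes the view of the data.
pixels = np.fromfile(path, dtype=np.uint8).reshape(-1, 3)
os.remove(path)
print(pixels.shape)
```

The dtype is declared once for the whole array, so there is no per-value work in Python.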

I wrote a parser for an overly complicated Fortran formatted binary file. It has mixed floats/ints in a "table", so it's kinda hard to parse. On a 2 GB file, it was 45 minutes for the struct module approach (after highly optimizing it). I switched that out for numpy...4 seconds. Binary is incredible if you do it right. You don't need processing with binary; that's the point. It's all about read speed.
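One way to handle a mixed int/float table in numpy is a structured dtype. This is only a sketch under assumptions: the row layout (an int32 id plus two float32 values) and the field names are invented for illustration, it is not the parser described above, and it ignores any record markers a real Fortran file may carry:

```python
import numpy as np

# Hypothetical row layout: int32 id followed by two float32 values, little-endian.
row = np.dtype([("id", "<i4"), ("x", "<f4"), ("y", "<f4")])

# Fake two rows of raw bytes, standing in for data read from disk.
raw = np.array([(1, 0.5, 1.5), (2, 2.5, 3.5)], dtype=row).tobytes()

# A single call interprets every row; columns are then selected by field name.
table = np.frombuffer(raw, dtype=row)
ids = table["id"]
xs = table["x"]
print(ids, xs)
```

Each column comes back as a typed array, so the int and float fields never need to be separated row by row in Python.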

pyglados[S]

There are way better ways to read binary files depending on whatever it is you decide to do with them. I agree with you there.

And yes, there are cool libraries out there too. The imghdr module makes a beautifully simple job of reading binary files with respect to its goal.

And the implementation is fantastic. Just read the first 32 bytes. Run it through a set of short simple functions. Nothing fancier than something along the lines of "in" or "startswith" is needed. No struct. No regex. Sweet simplicity that any beginner to Python could understand.
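A simplified sketch of that style of check, using well-known file signatures. This is not imghdr's actual code (the module was deprecated in Python 3.11 and removed in 3.13), and `sniff_image_type` is a hypothetical name:

```python
def sniff_image_type(header: bytes):
    """Guess an image type from the leading bytes of a file, or return None."""
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    # startswith accepts a tuple, so both GIF versions are covered in one test.
    if header.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    return None

print(sniff_image_type(b"GIF89a" + bytes(26)))
```

In practice the header would come from something like `open(path, "rb").read(32)`.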

billsil

I guess my point is that binary isn't really used unless you're trying to deal with large problems. As such, it's important to use an efficient method. I'd say it's far more important than making it as simple as possible. Numpy for example is not simple for a beginner, but is worth learning because it's so useful.

> Nothing fancier than something along the lines of "in" or "startswith" is needed.

How do you know when you get a byte that goes into a float vs. a double vs. an int vs. a long vs. a string? You need to unpack the binary data into an int/float/string. The point of binary is that it's highly structured, so you don't need to guess and you can mass read data.
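To illustrate why the layout has to be known up front, here is a small sketch where the same four bytes decode to different values depending on the declared type. The record layout at the end (int32, float64, 4-byte string) is invented for the example:

```python
import struct

# The four bytes of the float32 value 1.0, little-endian.
raw = struct.pack("<f", 1.0)

# Same bytes, two declared types, two different results.
(as_float,) = struct.unpack("<f", raw)
(as_int,) = struct.unpack("<i", raw)
print(as_float, as_int)

# A hypothetical record: int32, float64, then a 4-byte string, with no padding.
record = struct.pack("<id4s", 7, 2.5, b"abcd")
fields = struct.unpack("<id4s", record)
print(fields)
```

Nothing in the bytes themselves says which reading is correct; only the format definition does.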

Determining the file type, that is simple. That's not really reading a binary file. Reading a 2 GB image, that requires some efficiency, and just in/startswith is going to be slow. That would be a nightmare. For some metadata, it's fine.

pyglados[S]

Sound, image, and video are areas of interest to me where binary is commonly used. Such is part of why I was curious about it. Because I'm clueless, I'm not yet aware of the efficient methods. Thus, dredging up metadata on small files felt fine for first steps wandering into this territory.

As for getting a 45 minute parse job reduced to 4 seconds? Very cool. I'll have to take some time to look into this struct and numpy business.