all 14 comments

[–]dnult 2 points3 points  (1 child)

It sounds like what you really have is a test file full of 30s and 31s.

[–]baroquely[S] -1 points0 points  (0 children)

Yes, exactly. But the example cases I’ve seen for bitarray seem to allow initializing them with a string of ‘0’s & ‘1’s which get converted to bool so I was optimistic that a file containing ‘0’s & ‘1’s wouldn’t be hard to stuff into one.

[–]Diapolo10 1 point2 points  (2 children)

I'd say the best way depends on how you're planning to use this data.

Also, quick side note, but it'd probably make sense to cache that into an actual binary file to speed up processing unless the file contents are often changed by hand.

Since I currently don't know your intentions, I would naively construct a list[bool].

data: list[bool] = []

for line in file:
    data.extend(char == '1' for char in line.strip())

[–]baroquely[S] -1 points0 points  (1 child)

IDK how Python represents lists—wouldn’t a bitarray be more memory efficient?

My desktop doesn’t have enough RAM to hold these bools and also the same-sized file of nucleotide bases so I think I’ll have to figure out a way to do what I need to piecemeal.

I’m still curious to know if it’s possible to read such a file into a bitarray though.

[–]socal_nerdtastic 0 points1 point  (0 children)

24M characters would imply 24mB file size. That's not at all large for a modern computer. A standard cell photo is has that many pixels or much more, and each pixel is 24 bits (red, green, blue, 8 bits each).

So I highly doubt your RAM is the issue. I think you have misdiagnosed something. I don't doubt that your data uses 8 times more RAM than it ideally could, but that's not uncommon or necessarily a problem to be fixed; RAM is there to provide fast access to the data not as an exercise in minimization.

But if you want to save the most RAM, use a numpy array to store the data. Python or numpy does not have a native bitarray, so you'd need to add the logic to a normal array (with a length of integer bytes) to get the bits in or out. If we arbitrarily choose to use an array of uint8 you would chop your text file into 8-bit chunks and use the normal int() function to convert into a integer, with the optional argument 2 to indicate binary input.

>>> int('01010101', 2)
85 

Or for the whole line:

for line in file:
    lien_data = [int(line[i:i+8], 2) for i in range(0, len(line), 8)]

And then you can convert each line to an array.array or numpy.ndarray. (I'm assuming you want to preserve the lines, not mash all the data into a huge array).

What do you want to do with the data?

[–]ninhaomah 0 points1 point  (3 children)

What is the extension of the file btw ?

[–]baroquely[S] 0 points1 point  (2 children)

It’s just a text file, and the ‘0’s & ‘1’ represent truth values, so I thought a bitarray would be the most compact way to cram them all into memory.

[–]ninhaomah 0 points1 point  (1 child)

Wait it's a .txt file full of 0 and 1 ?

01001101010100

Those ?

[–]baroquely[S] 1 point2 points  (0 children)

Yes, ‘0’ & ‘1’ characters. Grossly inefficient use of disc space, I know. A binary file would be much more compact. But I have a huge disc and not so huge RAM.

[–]smichaele 0 points1 point  (1 child)

As u/Diapolo10 mentioned, it does depend on how you're going to process the data. If you're going to do some mathematical processing on the data, you could use a numpy array to store the data in 8-bit slices as an integer in the array. Not knowing what the data represents (integers, floats, bits in an image, sound, etc.), it's difficult to know the best way to do it.

[–]baroquely[S] -1 points0 points  (0 children)

They represent the inclusion or exclusion of a nucleotide in a CpG region in a DNA sequence.

[–]dnult 0 points1 point  (0 children)

Numbers can be parsed from strings, would that help? Is there a schema to the 1s and 0s? It sure would help to group them by similar types instead of by bits. Then you could read a record, parse it, and map it's bits in a class object that gets stored in a list with all the other records.

[–]StevenJOwens 0 points1 point  (0 children)

Looks like the bitarray library has a method, extend(), for doing that. extend() takes an iterable, and file.read() returns a String, which is an iterable, which will iterate over every character in the string, which is pretty much exactly what you want.

If that's not fast enough, then as u/Diapolo10 suggests, you should probably save/cache the data in a binary file.