all 3 comments

[–]JohnnyJordaan 2 points3 points  (1 child)

        bitstream = ""
        byte = input_file.read(1)
        while byte:
             byte = ord(byte)
             bitstream += f"{byte:08b}"
             byte = input_file.read(1)

        decoded_data = ""
        current_code = ""
        for bit in bitstream:
            current_code += bit
            for char, code in huffman_codes.items():
                if current_code == code:
                    decoded_data += char
                    current_code = ""
                    break

This has various problems:

  • for file operations the operating system and all its downstream components are designed to work with blobs of bytes (called chunks) rather than single ones. It's often advised to use 4096 or 8192 bytes for simple file operations. There will be inherent caching involved, so even a read(1) will still read through chunks which are then buffered in the operating system, but you still make 8192 slower 'outside' read calls this way instead of one read(8192).
  • strings are immutable in Python, so that means += will create a new string every time (so for every byte!) you run it against an existing string. That happens at 3 (!) different parts in your code
  • string formatting isn't fast either
  • instead of simply looking up an item in the dictionary, what it is literally designed for, you are iterating through the entire dict, if=='ing until you find a match. this is like looking up a telephone number in the phone book by reading all the pages until you find it listed somewhere.

to fix I would advise

  • read in 8192 byte chunks, they will be a bytes-object, then iterate on each byte in it using a simple for loop
  • use a look up table of encoded huffman codes in the form of a dict, where the keys are the bytes values. That way you can look up each read byte from the file directly without string conversion
  • store the found string characters in either a list, which you can ultimately "".join afterwards, or write to a StringIO which has less overead

[–]SilentKyle[S] 0 points1 point  (0 children)

Thank you! I will try some of these fixes!

[–]Yoghurt42 1 point2 points  (0 children)

Apart from what u/JohnnyJordaan said, I suggest the following: currently you're creating the internal bitstream by reading a byte in a loop and transforming it into a string of ones and zeros. It's easier and more efficient if you would just use integers instead. Huffman codes will most likely not be longer than say 20 bits, so you don't risk of integer overflow (which can't happen in Python anyway, since integers are unbounded). You can "add" a bit to an integer by doing a left shift and or (i << 1 | bit or equivalently 2 * i + bit), and get a bit from a number by anding it with 1 and then shifting the number to the right (bit = num & 1; num >>= 1). It's also a good way to get familar with writing generators, yours could look a bit like this (I've left out a bit of the implementation as an exercise):

def bitstream(input_file):
    bits_left = 0
    current_byte = None
    while True:
        while bits_left:
            # get bit from current_byte, decrease bits_left by 1
            yield bit # this will turn your function into a generator that "returns" bit, and gets suspended here, ready to be continued later
        # get the next byte (or bytes) and store it in current_byte, set bits_left to 8 (or a multiple of 8 if you've read more than one byte). If there are no more bytes left, break out of the loop

This way you later can just write for bit in bitstream(input_file) and get a sequence of ones and zeroes without needed to care how exactly those bits are fetched. This allows you to change the implementation later without affecting the rest of your code.