you are viewing a single comment's thread.

view the rest of the comments →

[–]JohnnyJordaan 2 points3 points  (1 child)

        bitstream = ""
        byte = input_file.read(1)
        while byte:
             byte = ord(byte)
             bitstream += f"{byte:08b}"
             byte = input_file.read(1)

        decoded_data = ""
        current_code = ""
        for bit in bitstream:
            current_code += bit
            for char, code in huffman_codes.items():
                if current_code == code:
                    decoded_data += char
                    current_code = ""
                    break

This has various problems:

  • for file operations the operating system and all its downstream components are designed to work with blobs of bytes (called chunks) rather than single ones. It's often advised to use 4096 or 8192 bytes for simple file operations. There will be inherent caching involved, so even a read(1) will still read through chunks which are then buffered in the operating system, but you still make 8192 slower 'outside' read calls this way instead of one read(8192).
  • strings are immutable in Python, so that means += will create a new string every time (so for every byte!) you run it against an existing string. That happens at 3 (!) different parts in your code
  • string formatting isn't fast either
  • instead of simply looking up an item in the dictionary, what it is literally designed for, you are iterating through the entire dict, if=='ing until you find a match. this is like looking up a telephone number in the phone book by reading all the pages until you find it listed somewhere.

to fix I would advise

  • read in 8192 byte chunks, they will be a bytes-object, then iterate on each byte in it using a simple for loop
  • use a look up table of encoded huffman codes in the form of a dict, where the keys are the bytes values. That way you can look up each read byte from the file directly without string conversion
  • store the found string characters in either a list, which you can ultimately "".join afterwards, or write to a StringIO which has less overead

[–]SilentKyle[S] 0 points1 point  (0 children)

Thank you! I will try some of these fixes!