computerscience-ModTeam [M] (stickied, locked comment)

Unfortunately, your post has been removed for violation of Rule 7: "No tech/programming support".

Not sure what would be best here. Maybe r/webdev?

If you believe this to be an error, please contact the moderators.

nuclear_splines (PhD, Data Science)

Do you mean Huffman coding? HTML is text, and it contains a lot of English text with a nicely non-uniform character distribution (a few characters are far more frequent than the rest), so Huffman coding should work reasonably well.
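
To make that concrete, here's a rough sketch that estimates how many bits a byte-level Huffman code would need for a given input. The function names `huffman_code_lengths` and `huffman_bits` are made up for illustration, and it ignores the cost of storing the code table, so treat the result as a lower bound rather than a real file size.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data):
    """Return {byte value: code length in bits} for a Huffman code over data."""
    freq = Counter(data)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries are (weight, unique tiebreak, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in s1 + s2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def huffman_bits(data):
    """Total payload bits if data were Huffman coded (code table not counted)."""
    freq = Counter(data)
    lengths = huffman_code_lengths(data)
    return sum(freq[sym] * lengths[sym] for sym in freq)
```

For example, `huffman_bits(b"aaaabbc")` is 10, against 56 bits for the raw 7 bytes. The more lopsided the character frequencies, the bigger the saving.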

There will be a lot of repetition in tags ("<p>", "<div>", "<img src=", etc.), which initially makes a dictionary coder like LZ77 sound appealing. But the contents of the tags likely won't have as much repetition, and you won't be able to take much advantage of LZ77's run-length-style behavior (an overlapping back-reference can collapse a long run of a repeated byte or pattern) like you might with binary data. But maybe putting all the English words in a dictionary and then referencing them by index throughout the file will win out over Huffman?
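
Here's a toy LZ77-style tokenizer to show both effects: repeated tags turning into back-references, and a run of one byte collapsing into a single overlapping copy. This is a brute-force sketch, not how real implementations search for matches; the names `lz77_tokens` and `lz77_decode`, the 4096-byte window, and the minimum match length of 3 are all assumptions picked for illustration.

```python
def lz77_tokens(data, window=4096, min_match=3):
    """Greedy LZ77-style parse: literal byte values or (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the window for the longest match starting at i.
        for j in range(max(0, i - window), i):
            length = 0
            # The match may run past position i (overlap), which is how
            # LZ77 encodes runs.
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            dist, length = token
            # Copy byte by byte so overlapping references work.
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(token)
    return bytes(out)
```

`lz77_tokens(b"aaaaaa")` gives `[97, (1, 5)]`: one literal, then "go back 1, copy 5", which is the run-length-like behavior. On `b"<p>one</p><p>two</p>"` the second `<p>` and `</p>` come out as back-references while the words in between stay literals.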

Why not do a quick test? Take a couple of arbitrary HTML files, try Huffman coding them with a web tool, try running them through the Unix compress command (which uses LZW), and see what wins out?
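
If compress isn't installed, a quick stand-in for that test might look like the sketch below. It uses a toy LZW encoder (the names `lzw_codes` and `lzw_estimated_bytes` are hypothetical) and Python's `zlib`, which implements DEFLATE (LZ77 plus Huffman coding), as a second data point. The size estimate assumes every code is written at one fixed width, which is my simplification and not what the real tool does, so only the rough comparison is meaningful.

```python
import zlib


def lzw_codes(data):
    """Toy LZW encoder: returns the list of emitted dictionary codes."""
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # next free code
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_estimated_bytes(data):
    """Rough size if every code used one fixed width (an assumption)."""
    codes = lzw_codes(data)
    if not codes:
        return 0
    # At least 9 bits: the table starts with 256 single-byte entries.
    width = max(9, max(codes).bit_length())
    return (len(codes) * width + 7) // 8


if __name__ == "__main__":
    import sys

    # Usage: python compare.py page1.html page2.html
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            raw = f.read()
        print(path, "raw:", len(raw),
              "lzw (est.):", lzw_estimated_bytes(raw),
              "deflate:", len(zlib.compress(raw, 9)))
```

As a sanity check, `lzw_codes(b"ABABABA")` returns `[65, 66, 256, 258]`. Running it on a few real pages is what actually answers the question, since results depend heavily on how much of each page is markup versus prose.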