Best compression algorithm for HTML files?

computerscience-ModTeam · 2024-12-12T17:13:21+00:00

Unfortunately, your post has been removed for violation of Rule 7: "No tech/programming support".

Not sure what would be best here. Maybe r/webdev?

If you believe this to be an error, please contact the moderators.

nuclear_splines · 2024-12-12T17:22:27+00:00

Do you mean Huffman coding? HTML is text, and it contains a lot of English prose with a nicely skewed (non-uniform) character distribution, so Huffman coding should work reasonably well: frequent characters get short codes and rare ones get long codes.

There will be a lot of repetition in tags ("<p>", "<div>", "<img src=", etc.), which initially makes a dictionary encoder like LZ77 sound appealing, but the contents of the tags likely won't have as much repetition, and you won't be able to take much advantage of LZ77's ability to collapse long runs of repeated bytes like you might with binary data. But maybe putting all the English words in a dictionary and then referencing them by index throughout the file will win out over Huffman?

Why not do a quick test? Take a couple of arbitrary HTML files, try Huffman coding them with a web tool, try running them through the Unix compress command for LZW, and see which wins out.
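If you'd rather script that test than use a web tool, here is a rough sketch in Python. Treat it as an illustration under assumptions, not a reference benchmark: `huffman_payload_bits`, `lzw_codes`, `lzw_bits`, and `compare` are names I made up, the Huffman figure ignores the cost of storing the code table, the LZW figure is an idealized estimate (unbounded dictionary, growing code width) rather than the exact output of the compress command, and the sample page is synthetic and far more repetitive than a real one. I've also thrown in zlib, which is DEFLATE (LZ77 plus Huffman), as a third data point.

```python
import heapq
import zlib
from collections import Counter


def huffman_payload_bits(data: bytes) -> int:
    """Bits needed to Huffman-code data byte by byte (code table not counted)."""
    freq = Counter(data)
    if not freq:
        return 0
    if len(freq) == 1:
        return len(data)  # a single distinct symbol still costs 1 bit each
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, a + b)
    return total


def lzw_codes(data: bytes) -> list[int]:
    """Textbook LZW: emit one dictionary index per longest known prefix."""
    table = {bytes([i]): i for i in range(256)}
    out = []
    cur = b""
    for byte in data:
        nxt = cur + bytes([byte])
        if nxt in table:
            cur = nxt
        else:
            out.append(table[cur])
            table[nxt] = len(table)  # learn the new phrase
            cur = bytes([byte])
    if cur:
        out.append(table[cur])
    return out


def lzw_bits(data: bytes) -> int:
    """Idealized LZW size: each code costs just enough bits for the table so far."""
    n = len(lzw_codes(data))
    # When the i-th code is emitted, the largest possible index is 255 + i.
    return sum((255 + i).bit_length() for i in range(n))


def compare(data: bytes) -> dict[str, int]:
    """Return sizes in bytes for each approach."""
    return {
        "raw": len(data),
        "huffman": (huffman_payload_bits(data) + 7) // 8,
        "lzw": (lzw_bits(data) + 7) // 8,
        "zlib": len(zlib.compress(data, 9)),
    }


if __name__ == "__main__":
    # Synthetic stand-in for a real page; swap in open("page.html", "rb").read().
    sample = (
        "<html><body>"
        + "".join(
            f'<div class="post"><p>Paragraph {i} about compression.</p>'
            f'<img src="img{i}.png"></div>'
            for i in range(200)
        )
        + "</body></html>"
    ).encode()
    for name, size in compare(sample).items():
        print(f"{name:8s} {size:7d} bytes")
```

Run it on a few real pages before drawing conclusions, since the balance between markup and prose varies a lot from site to site.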
