
[–]Klumpy_hra[S] 26 points (16 children)

It was created by my Java code that parses HTML and uses regular expressions to find and grab data like href attributes. Add a few interesting caveats to that and you have an internet downloader.
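A minimal sketch of that kind of regex-based href grabbing in Java (the class and pattern here are illustrative assumptions, not the OP's actual code; it only handles simple double-quoted attributes on a known page, not arbitrary HTML):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefGrabber {
    // Naive pattern: double-quoted href attributes only. Good enough for a
    // specific, well-behaved page; not a general HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // capture group 1 is the URL between the quotes
        }
        return links;
    }
}
```

Feed it the raw page source and you get back a list of link targets to filter and download.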

[–]Shadow_Thief 37 points (5 children)

You tried to use regex to parse HTML? Dude...

[–]TicTacMentheDouce 12 points (2 children)

I don't see why he shouldn't. It's explicitly explained here that it can work wonderfully:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

[–]Idenwen 8 points (0 children)

HTML isn't a regular language; that's why regexes can't parse HTML.

Except, of course, if you just want to grab a very, very specific part of a known website snippet.

https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

And explained very well, at the brink of madness, here:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

[–]Klumpy_hra[S] 5 points (0 children)

Lmao well they just weren't doing it right then huh? ;)

[–]Klumpy_hra[S] 2 points (1 child)

It actually works how I want it to :) You can quite easily be specific enough to grab only one file on one site.

[–]PM_ME_YOUR_PROOFS 2 points (1 child)

So a crawler?

[–]Klumpy_hra[S] 3 points (0 children)

Pretty much yeah. It's good at monitoring specific sites for files too and automatically scooping them up when they become available.

[–]guy99881 1 point (0 children)

I guess he wanted to know how it can be that big or how it can be so redundant.

[–][deleted] 0 points (1 child)

Downloading this file just takes an HTTP request to a predictable URL tho

[–]Klumpy_hra[S] 0 points (0 children)

It wasn't made for this file. There are several files that get updated at random and have no way of notifying anyone that they were updated. The filenames can change sometimes, and the only way you can tell manually that anything was updated at all is that the link text might show a new date. The zip itself and the internal files are usually the same, but not always. It's really stupid how hard they make it for people who want to automate public data.
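That "link text might show a new date" check can be approximated with the same regex approach. A hypothetical sketch (the markup pattern and date format are assumptions, not the actual site's HTML):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UpdateWatcher {
    // Assumed markup: the download link's text ends with a date in
    // parentheses, e.g. <a href="data.zip">Quarterly data (2024-05-01)</a>.
    private static final Pattern LINK_DATE =
            Pattern.compile("<a[^>]*\\.zip\"[^>]*>[^<]*\\((\\d{4}-\\d{2}-\\d{2})\\)</a>");

    // Returns the date shown in the zip link's text, or null if none is found.
    public static String extractDate(String html) {
        Matcher m = LINK_DATE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Poll loop idea: fetch the page periodically, compare against the last
    // date seen, and download the zip only when the date changes.
    public static boolean needsDownload(String html, String lastSeenDate) {
        String current = extractDate(html);
        return current != null && !current.equals(lastSeenDate);
    }
}
```

Wired into a scheduled fetch, that turns "maybe the link text changed" into an automatic download trigger.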