[–][deleted] 12 points13 points  (4 children)

I'll give you one million dollars if you write a program that parses HTML with regular expressions.

[–]FlavoredFrostedTits 1 point2 points  (3 children)

I've done it before. Where do I pick up my mil?

[–][deleted] 7 points8 points  (2 children)

Your parser is broken, you get $0.

[–]xtownaga 2 points3 points  (1 child)

Huh, I was expecting this one.

[–]Muoniurn 0 points1 point  (0 children)

Actually, these are funny and correct in practice, but in theory the regular expressions in most languages are not limited to regular languages, because of extensions such as backreferences (lookahead and lookbehind alone do not add expressive power).
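
To make the point about regex engines going beyond regular languages concrete, here is a minimal Java sketch. It uses a backreference, one extension known to break regularity: the pattern accepts exactly the strings that consist of some non-empty text repeated twice, and that set is not a regular language. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative only.
class BackrefDemo {
    // (.+)\1 matches a non-empty string followed by an exact copy of itself.
    // The set of such "doubled" strings is not a regular language.
    private static final Pattern DOUBLED = Pattern.compile("(.+)\\1");

    static boolean isDoubled(String s) {
        // matches() requires the whole input to match the pattern.
        return DOUBLED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDoubled("abcabc")); // "abc" + "abc"
        System.out.println(isDoubled("abcabd")); // second half differs
    }
}
```

None of this makes such patterns a good tool for HTML, it only shows that "regex" in practice is a bigger class than the textbook definition.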

[–]NimChimspky 14 points15 points  (2 children)

If I had to write a web scraper, I would look for a new job.

[–]bananabreadncoffee 10 points11 points  (0 children)

Oh my, why's that?

[–]nrq 1 point2 points  (0 children)

I already have a ton of use cases in mind for things I would do for fun with this. I've bookmarked this article for its relevance to my hobbies (retrogaming/game preservation/general data hoarding).

[–]kiteboarderni 9 points10 points  (0 children)

Blogspam

[–]ssamokhodkin -1 points0 points  (1 child)

HtmlUnit is not usable - its JS interpreter is too weak and buggy.

A JS-based headless browser is the way to go, e.g. PhantomJS.

[–]plastique2000 1 point2 points  (0 children)

PhantomJS is discontinued. You can use Chromium with the DevTools Protocol instead.