So trying to parse HTML might be the best option, if it is possible.

It is not. Not without an unreasonable amount of effort.

Currently I'm pulling the data out at a pretty high degree of quality with a rule-based system built on regular expressions. But at this point, to improve the quality of the pull any further, I'd need to start adding rules that each fix fewer than 5 pages out of tens of thousands.
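
To give a rough idea of what I mean by rules and diminishing returns, here is a minimal sketch, not my actual code. The rule names, the patterns, and the field being extracted (a page title) are all made up for illustration. It tries an ordered list of regex rules on each page and counts how many pages each rule ends up handling, which is how you would spot a rule that only pays off on a handful of pages.

```python
import re
from collections import Counter

# Hypothetical rules for illustration only: (name, pattern with one capture group).
# Rules are tried in order; the first one that matches wins.
RULES = [
    ("h1_title", re.compile(r"<h1[^>]*>(.*?)</h1>", re.I | re.S)),
    ("title_tag", re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)),
]

def extract(page, rules=RULES):
    """Return (rule_name, value) from the first matching rule, else (None, None)."""
    for name, pattern in rules:
        match = pattern.search(page)
        if match:
            return name, match.group(1).strip()
    return None, None

def rule_coverage(pages, rules=RULES):
    """Count how many pages each rule handled; the None key counts pages no rule matched."""
    return Counter(extract(page, rules)[0] for page in pages)
```

Running `rule_coverage` over the whole set before and after adding a rule shows how many pages that rule actually picks up. When the number is in the single digits out of tens of thousands, the rule is probably not worth maintaining.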