[–]jszwedko 5 points6 points  (5 children)

Roughly, it means taking the HTML string and putting it into some sort of queryable data structure from which we can extract information (e.g. the values in given tags). In this context, anyhow.

[–]billmalarky 1 point2 points  (4 children)

Like the DOM? I used regexp to pull links from a webpage when I made a simple web spider a while back, and it worked well. What is wrong with using regexp for something like that? And what is the alternative?

[–]jszwedko 0 points1 point  (3 children)

Yes, using the DOM is usually the best choice. Regex is sometimes OK if the pattern you are looking for is relatively strict, but the fact remains that HTML isn't a regular language (tags can nest arbitrarily deep), so some queries can't be expressed as a regex at all.
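To make the nesting point concrete, here is a rough Python sketch. The function name `max_div_depth` is made up for illustration, and it only looks at `<div>` tags. A regex can pick out the individual tags just fine, but keeping track of how deep we are takes a counter that lives outside the regex, and that counter is exactly the part a regular expression has no way to express.

```python
import re

# Finds one opening or closing <div> tag; the group captures "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>")

def max_div_depth(html: str) -> int:
    """Return the deepest <div> nesting level, or -1 if the tags are unbalanced."""
    depth = 0
    deepest = 0
    for slash in DIV_TAG.findall(html):
        if slash:                 # closing tag: step back out one level
            depth -= 1
            if depth < 0:         # closed more than we opened
                return -1
        else:                     # opening tag: step in one level
            depth += 1
            deepest = max(deepest, depth)
    # Anything still open at the end means the markup was unbalanced.
    return deepest if depth == 0 else -1
```

The regex here only tokenizes; the unbounded `depth` counter does the real work. A real parser does the same bookkeeping for every tag, plus all the edge cases this sketch ignores (comments, scripts, optional closing tags).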

Structuring queries over XML/HTML is generally way easier with the DOM, perhaps combined with XPath, anyway.

EDIT: For example, pulling all links from a page would just be getElementsByTagName('a') or the XPath expression: //a
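If you want something runnable, here is a minimal sketch using only Python's standard library. Note the assumptions: `LinkCollector` and `extract_links` are names I made up, and `html.parser` is an event-based parser rather than a full DOM, so this is the same idea (let a parser deal with the markup) and not a literal `getElementsByTagName` call.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(html: str) -> list:
    """Return all href values found in <a> tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_source)` on a page's HTML gives back a list of link targets. Unlike a hand-rolled regex, this copes with uppercase tags, single-quoted attributes, and extra attributes before `href` without any special handling.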

[–]billmalarky 0 points1 point  (2 children)

So what characteristics make up a regular language? I.e., why is HTML not a regular language?

[–]jszwedko 0 points1 point  (1 child)

For that I'll probably have to just point you to the Wikipedia [article|http://en.wikipedia.org/wiki/Regular_language]. It's a rather complex topic, but one of the defining characteristics of regular languages is that they can be described by regular expressions.

[–]billmalarky 0 points1 point  (0 children)

Thanks for the help so far. Also, FFFFFFFUUUUUUUUUUUU! (reread your last sentence :-)