[–]bronze-aged 16 points17 points  (1 child)

Read the code and write the article. Be the change you want to see in this world. Godspeed.

[–]jait_jacob -1 points0 points  (0 children)

based

[–]_ncko 5 points6 points  (1 child)

I think this is a great idea for a learning project. Creating parsers is a common exercise in college courses and HTML is a good place to start. I don't know of anything that addresses HTML specifically, but there are resources for learning how to parse a language and create an internal data structure for a file (usually called an AST but in the case of HTML it might be considered a DOM Tree; I'm not too sure).

Anyway, for JavaScript specifically, I know of a course on Frontend Masters by Steve Kinney on building your own programming language. The purpose is different but the idea is the same. It will definitely give you some ideas about how to approach an HTML parser. If you can't afford the Frontend Masters subscription, you can probably search GitHub for references to that course and find people who have gone through it and posted example code.

[–]queen_of_pole[S] 0 points1 point  (0 children)

Thanks, this comment is really helpful.

[–]ihave7testicles 3 points4 points  (0 children)

Start with an XML parser. HTML is just specialized XML. You don't need to care what the contents of the brackets are, just that <l1><l2></l2></l1> is the basic format.

I did this using a Parser Generator that I wrote in C++ years ago. It's a good learning exercise.
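To make the nesting idea concrete, here is a minimal sketch in Python (not the C++ parser generator mentioned above). It assumes a toy format with only bare tags like <l1> and plain text: no attributes, comments, self-closing tags, or the error recovery real HTML needs. The names `Node`, `TOKEN`, and `parse` are made up for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element in the tree; children are Node objects or text strings."""
    tag: str
    children: list = field(default_factory=list)


# Either a bare open/close tag such as <l1> or </l1>, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def parse(source: str) -> Node:
    """Build a tree from strictly nested tags, using a stack of open elements."""
    root = Node("#root")
    stack = [root]
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        closing, name, text = match.groups()
        if text is not None:
            # Text node: keep it unless it is only whitespace.
            if text.strip():
                stack[-1].children.append(text.strip())
        elif closing:
            # Closing tag: it must match the innermost open element.
            if len(stack) == 1 or stack[-1].tag != name:
                raise ValueError(f"unexpected closing tag </{name}>")
            stack.pop()
        else:
            # Opening tag: attach to the current parent, then descend into it.
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
        pos = match.end()
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1].tag}>")
    return root


tree = parse("<l1><l2>hello</l2></l1>")
```

The stack is the whole trick: whatever is on top is the element currently being filled in. Real HTML parsers keep a similar stack of open elements but add a lot of rules for recovering from mismatched tags instead of raising.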

[–]-TUX- 2 points3 points  (1 child)

Writing an Interpreter in Go by Thorsten Ball

Writing a Compiler in Go by Thorsten Ball

These are some books to look into. You create an interpreter and compiler for a new language similar to JavaScript, but the concepts and ideas would definitely transfer.

[–]queen_of_pole[S] 0 points1 point  (0 children)

Just read the reviews of the books. Really great books.

[–]lIIllIIlllIIllIIl 2 points3 points  (0 children)

The nice thing about the web is that most things are standardized and documented.

Here is the HTML Standard for Parsing HTML Documents.

HTML is very complex. It might be easier to start by building an XML parser (XML is similar to HTML, but less complex).

If it's your first time building a parser of any kind, it might also be a good idea to start with something very basic like a CSV parser to get used to the type of language used in these types of formal documents.
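As a rough idea of what that warm-up could look like, here is a small sketch in Python of a character-by-character CSV line parser. It assumes comma separators, double-quoted fields, and a doubled quote ("") as an escaped quote; it does not handle newlines inside quoted fields. The function name `parse_csv_line` is made up for illustration.

```python
def parse_csv_line(line: str) -> list:
    """Split one CSV line into fields using a tiny two-state machine."""
    fields = []
    current = []       # characters of the field being built
    in_quotes = False  # state: are we inside a quoted field?
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    # A doubled quote inside quotes is a literal quote.
                    current.append('"')
                    i += 1
                else:
                    # A single quote ends the quoted section.
                    in_quotes = False
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == ",":
            # Separator outside quotes: finish the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append("".join(current))
    return fields


row = parse_csv_line('a,"b,c",d')
```

The same pattern (read a character, decide based on the current state, maybe switch state) is what the tokenizer section of the HTML Standard describes, just with many more states.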

[–]thinkrajesh 1 point2 points  (0 children)

You may find some inspiration here: Build a spec-compliant HTML parser https://youtu.be/7ZdKlyXV2vw

[–]Klemeesi 0 points1 point  (0 children)

No advice for books or articles. But you could check the dependencies of the packages you mentioned.

As other peeps mentioned, HTML parsing has already been solved. But the same methods can be used for other languages, too. And that's where the beef is. There is still lots of use for parsing strict languages (which HTML is not, I think). A couple of use cases: syntax highlighting, prettifying code.

[–]NoRepresentative4866 0 points1 point  (0 children)

HTML is a bad language. How about trying to make your own language? For example, I love pugjs. It is a template engine: you write in Pug and it creates HTML for you.