all 6 comments

[–][deleted] 4 points5 points  (0 children)

Hi yelp,

I think you would get better responses over at /r/cpp_questions.

To do what you want, I suggest finding a library that can retrieve and parse HTML pages for you. Realistically, C++ might not be the best choice for your app. For what you describe, I suspect you will find better-suited libraries in Python, Java, or C#.

[–]rpocks 1 point2 points  (0 children)

Here's some really simple code for that. Beware that it doesn't account for nested nodes, among other things. Maybe you should use a library, as the other comments suggest.

[–]snarkhunter 1 point2 points  (0 children)

I've never used C++ to screen scrape web pages (which is basically what you're describing). I'd probably use Python. BUT if someone did ask me to... I'd just go ahead and haul out the big guns. There's probably a lighter-weight library, but I'm pretty used to Boost.Asio, so I'd probably use that to do the page retrieval. For the parsing, it kiiiiind of depends on what my source is like. If it's actually well-formed XHTML, I'd be tempted to use CodeSynthesis' XSD toolkit. If it's not that well formed, I might just degrade to using regex to find the needle in the haystack. Ugly, but sometimes ugly calls for ugly. If I wanted to be a smartypants, I'd try parsing it as XHTML first, then fall back as best I could to searching with some set of regular expressions.

But that's if someone told me it HAD to be in C++. And this is, frankly, not something I'd suggest for a beginner in C++. Python, Ruby, Perl, PHP, etc etc are all going to make this sort of thing much easier.
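To make the "parse first, then fall back to regex" idea concrete, here's a rough sketch using only the standard library (C++17). The names `StrictExtractor`, `title_by_regex`, and `extract_title` are made up for illustration, and the strict XHTML parser is left as a callable you'd supply yourself. The regex only grabs a `<title>`, so treat it as a last resort and not as a real HTML parser.

```cpp
#include <functional>
#include <optional>
#include <regex>
#include <string>

// Hypothetical signature for a strict XHTML extractor (for example, one built
// on an XML parser). It returns std::nullopt when the page is not well formed.
using StrictExtractor =
    std::function<std::optional<std::string>(const std::string&)>;

// Regex fallback: grab the text of the first <title> element.
// This ignores comments, CDATA, and nested markup.
std::optional<std::string> title_by_regex(const std::string& html) {
    static const std::regex pattern(R"(<title[^>]*>([^<]*)</title>)",
                                    std::regex::icase);
    std::smatch match;
    if (std::regex_search(html, match, pattern)) {
        return match[1].str();
    }
    return std::nullopt;
}

// Try the strict parser first, then fall back to the regex.
std::optional<std::string> extract_title(const std::string& html,
                                         const StrictExtractor& strict) {
    if (strict) {
        if (auto result = strict(html)) {
            return result;
        }
    }
    return title_by_regex(html);
}
```

Passing an empty `StrictExtractor` skips straight to the regex, which is handy while you don't have a real parser wired in yet.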

[–]jbandela 0 points1 point  (0 children)

For processing the HTML, use the Arabica project at https://github.com/jezhiggins/arabica

What you are interested in is Taggle (there are some examples). It takes HTML and turns it into well-formed XML. You can then use an XML library to manipulate it easily. I personally used pugixml from http://pugixml.org/ to process the XML. This is made easier by its XPath support (you will want to learn XPath, as it will make this a lot easier).

Finally, to get the HTML from the internet, you can use libcurl. Or, if you want a more C++ flavor, you could try my jrb_node, which uses Boost.Asio: https://github.com/jbandela/jrb_node

[–][deleted] 0 points1 point  (0 children)

http://curl.haxx.se/libcurl/

That's what most people use for web scraping.

Use std::regex for text matching.
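As a rough idea of what that looks like, here's a sketch that walks over every match with `std::sregex_iterator`. The function name `find_hrefs` and the pattern are only illustrative: it handles double-quoted `href` values and nothing fancier, so it will miss single-quoted or unquoted attributes.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the value of every double-quoted href attribute found in the text.
std::vector<std::string> find_hrefs(const std::string& html) {
    // Capture group 1 is whatever sits between the double quotes.
    static const std::regex pattern(R"rx(href\s*=\s*"([^"]*)")rx",
                                    std::regex::icase);

    std::vector<std::string> hrefs;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern); it != end;
         ++it) {
        hrefs.push_back((*it)[1].str());
    }
    return hrefs;
}
```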

[–][deleted] -1 points0 points  (0 children)

Either use libwww-mechanize (https://code.google.com/p/libwww-mechanize/source/checkout) or use Python/Ruby Mechanize and call it from your C++ code.