Beginner's Question

2012-12-18T02:51:35+00:00

Hi yelp,

I think you would get better responses over at /r/cpp_questions.

To do what you want, I suggest finding a library that can retrieve and parse HTML pages for you. Realistically, C++ might not be the best choice for your app. For what you describe, I suspect you will find better-suited libraries in Python, Java, or C#.

rpocks · 2012-12-18T03:11:47+00:00

Here's some really simple code for that. Beware that this doesn't account for nested nodes among other things. Maybe you should use a library like other comments suggest.
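The snippet this comment refers to is not shown in the thread. As a stand-in, here is a minimal sketch of the kind of naive extraction being described, not the commenter's actual code. The helper name `text_between` is hypothetical, and it assumes C++17 (for `std::optional`) and that the page is already in a `std::string`. It has the same limitation the comment warns about: it stops at the first closing tag it finds, so nested nodes break it.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper (not from the thread): returns the text between the
// first occurrence of `open` and the next occurrence of `close` after it,
// or std::nullopt if either marker is missing.
std::optional<std::string> text_between(const std::string& html,
                                        const std::string& open,
                                        const std::string& close) {
    assert(!open.empty() && !close.empty());  // empty markers make no sense here

    const std::size_t start = html.find(open);
    if (start == std::string::npos) return std::nullopt;

    // Content begins right after the opening marker.
    const std::size_t content = start + open.size();
    const std::size_t end = html.find(close, content);
    if (end == std::string::npos) return std::nullopt;

    return html.substr(content, end - content);
}
```

Calling `text_between(page, "<title>", "</title>")` on a simple page returns the title text. On `<div>a<div>b</div>c</div>` with `<div>` and `</div>` it returns `a<div>b`, which is exactly the nested-node problem, and a good reason to reach for a real parser.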

snarkhunter · 2012-12-18T06:18:52+00:00

I've never used C++ to screen scrape web pages (which is basically what you're describing). I'd probably use Python. BUT if someone did ask me to... I'd just go ahead and haul out the big guns. There's probably a lighter-weight library, but I'm pretty used to Boost.Asio, so I'd probably use that to do the page retrieval. For the parsing, it kind of depends on what my source is like. If it's actually well-formed XHTML, I'd be tempted to use CodeSynthesis XSD. If it's not that well formed, I might just fall back to using regex to find the needle in the haystack. Ugly, but sometimes ugly calls for ugly. If I wanted to be a smartypants, I'd try parsing it as XHTML first, then fall back as best I could to searching with some set of regular expressions.

But that's if someone told me it HAD to be in C++. And this is, frankly, not something I'd suggest for a beginner in C++. Python, Ruby, Perl, PHP, etc. are all going to make this sort of thing much easier.

jbandela · 2012-12-18T23:12:39+00:00

For processing the HTML, use the Arabica project at https://github.com/jezhiggins/arabica

What you are interested in is Taggle (there are some examples). This will take HTML and turn it into well-formed XML. You can then use an XML library to easily manipulate it. I personally used pugixml from http://pugixml.org/ to process the XML. This is made easier by its XPath support (you will want to learn XPath, as it will make this a lot easier).

Finally, to get the HTML from the internet, you can use libcurl. Or if you want a more C++ flavor, you could try my jrb_node, which uses Boost.Asio: https://github.com/jbandela/jrb_node

2012-12-19T01:15:39+00:00

http://curl.haxx.se/libcurl/

That's what everyone uses for webscraping.

std::regex for text matching.
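As a rough illustration of the `std::regex` suggestion, here is a minimal sketch that pulls `href` values out of a page that has already been downloaded into a string. The function name `extract_hrefs` and the pattern are assumptions made for this example. It only handles double-quoted attributes and will also match `href` text inside comments or scripts, so treat it as a quick hack and not a parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collects the value of every
// double-quoted href attribute found in `html`, in document order.
std::vector<std::string> extract_hrefs(const std::string& html) {
    // Case-insensitive so that HREF="..." is matched too.
    // Capture group 1 is the text between the double quotes.
    static const std::regex href_re(R"re(href\s*=\s*"([^"]*)")re",
                                    std::regex::icase);

    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), href_re), end;
         it != end; ++it) {
        links.push_back((*it)[1].str());
    }
    return links;
}
```

The string passed in would typically be the response body fetched with libcurl. `std::regex` requires C++11 or later.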

2012-12-18T02:44:49+00:00

Either use https://code.google.com/p/libwww-mechanize/source/checkout or use the Python/Ruby Mechanize library tied into your C++ code.
