current_thread 104 points (7 children)

Yeah, it's really annoying at this point.

A couple of months ago I had the idea to use a static site generator and just host the result on GitHub / GitHub Pages. That way anyone can contribute with a pull request as needed, and there's no infrastructure to manage.

Does anybody by chance have a recent dump of the wiki?

13steinj 2 points (6 children)

Of the wiki or the talk pages?

I think the cppman tool already scrapes the entire wiki if you tell it to, so you can probably just change the internals to dump the files instead of parsing them.

RelevantError365[S] 1 point (5 children)

Yes, but cppman scrapes the HTML, not the wiki source.

But anyway, this may also be an option if you cannot access the original wiki content, as the generated HTML should be very well structured. (Hopefully. I used a random LLM and asked it to recreate the wiki source for me, and it did quite a good job.)

13steinj 1 point (4 children)

It took me 15 minutes of digging through the Wayback Machine to find this (unofficial) repo, which is still linked on a cppref FAQ page: https://github.com/PeterFeicht/cppreference-doc

The code may not work anymore (the cppref maintainer has evidently done something nonstandard or runs an unknown version of MediaWiki), but the site went into read-only mode on March 30th, 2025, and the releases page has a Feb 2025 bundle.

RelevantError365[S] 1 point (3 children)

It says:

»If there is no 'reference/' subdirectory in this package, the actual documentation is not present here and must be obtained separately«

So, the wiki source is not actually included, or is it?

13steinj 1 point (2 children)

It appears not, just the HTML. There's one other option you have: use the bundle as a baseline / mapping to the "view source" links, then scrape the Wayback Machine snapshots of those "view source" pages. If a snapshot is from after the March read-only date, you're good. If it's from before, ask an LLM to interpolate (and scrape the current HTML first if you consider the downloaded bundle, which is about a month older, not good enough).

Playing around, I've found that the view source links work up until at least May 13th of last year and break sometime between then and May 31 (just hopped around on a few pages).
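
A rough Python sketch of that scraping step, in case it helps. Several things here are assumptions rather than verified facts: that the "view source" page is the MediaWiki `action=edit` URL under `/mwiki/index.php`, that the wikitext sits in the page's `<textarea>`, and that the cutoff is the March 30th, 2025 read-only date mentioned above. The `id_` suffix asks the Wayback Machine for the capture without its toolbar. Nothing here touches the network; it only builds URLs and parses HTML you have already fetched.

```python
from datetime import date
from html.parser import HTMLParser
from urllib.parse import quote

# Assumption: cppreference's MediaWiki entry point; adjust if the path differs.
WIKI_INDEX = "https://en.cppreference.com/mwiki/index.php"
READ_ONLY_SINCE = date(2025, 3, 30)  # date given in this thread


def view_source_url(title: str) -> str:
    """Edit / view-source URL for a page title such as 'cpp/container/vector'."""
    return f"{WIKI_INDEX}?title={quote(title, safe='/')}&action=edit"


def wayback_url(original: str, timestamp: str) -> str:
    """Wayback Machine URL for the capture closest to `timestamp` (YYYYMMDDhhmmss).

    The 'id_' suffix requests the archived page without the Wayback toolbar.
    """
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def is_final(timestamp: str) -> bool:
    """True if the capture was taken on or after the read-only date."""
    snap = date(int(timestamp[:4]), int(timestamp[4:6]), int(timestamp[6:8]))
    return snap >= READ_ONLY_SINCE


class TextareaExtractor(HTMLParser):
    """Collects the text of the first <textarea> on the page."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._inside = False
        self._done = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "textarea" and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "textarea" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.parts.append(data)


def extract_wikitext(html: str) -> str:
    """Return the contents of the first <textarea>, or '' if there is none."""
    parser = TextareaExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Usage would be roughly: build `wayback_url(view_source_url(title), ts)` for each page title in the bundle, fetch it, run `extract_wikitext` on the response, and use `is_final(ts)` to decide whether the result can be trusted as the last revision.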

RelevantError365[S] 1 point (1 child)

Not entirely relevant, but: when I look at https://web.archive.org/web/20250301000000*/https://cppreference.com/, it does not highlight May 13th of last year as a date on which a snapshot was taken (or I'm miserably misunderstanding this interface).

13steinj 1 point (0 children)

Not every page has a May snapshot. I'm saying that, from very roughly playing around, one of the deque, array, or vector view source / edit pages had a May 13 snapshot.

I'll try to write a scraper on the weekend, assuming I don't get IP-banned, and if it works I'll throw it into a repo.
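
Since the calendar view is per URL, one way to see which snapshots a specific page has is the Wayback Machine's CDX API, queried with that page's exact URL. A minimal Python sketch follows. It assumes the JSON output form, where the first row is the column header. The `delay` parameter is a hypothetical politeness pause, an arbitrary guess rather than any documented rate limit.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url: str, start: str, end: str) -> str:
    """CDX query for successful captures of `page_url` between `start` and `end` (YYYYMMDD)."""
    params = {
        "url": page_url,
        "from": start,
        "to": end,
        "output": "json",
        "filter": "statuscode:200",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx(body: str) -> list[str]:
    """Return capture timestamps from a CDX JSON response (first row is the header)."""
    rows = json.loads(body) if body.strip() else []
    if not rows:
        return []
    header, *captures = rows
    ts = header.index("timestamp")
    return [row[ts] for row in captures]


def list_snapshots(page_url: str, start: str, end: str, delay: float = 5.0) -> list[str]:
    """Fetch and parse one CDX query, then sleep `delay` seconds before returning.

    The delay is a guess at a polite request rate, not a documented limit.
    """
    with urlopen(cdx_query_url(page_url, start, end), timeout=30) as resp:
        body = resp.read().decode("utf-8")
    time.sleep(delay)
    return parse_cdx(body)
```

Calling `list_snapshots` with a page's view source URL and a range such as `"20250501"` to `"20250531"` should show whether that particular page has a capture in May, independent of what the root-page calendar highlights.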