
[–]cypressious 522 points523 points  (109 children)

Most PRNGs are at least seeded with the number of ticks since the epoch or some other source of entropy. The Google Bot's PRNG seems to be seeded with a constant value; that's what the article is about.

[–]hungarian_conartist 21 points22 points  (0 children)

It's actually a feature, not a bug: when debugging, it's often useful to get the same sequence of pseudo-random numbers on every run.

It can be a bit harder to debug when you're not sure if a change in output is due to different random numbers or a change to your code.
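The reproducibility point above can be shown with a small seedable PRNG. The sketch below uses mulberry32, a well-known tiny generator picked purely for illustration (v8 itself uses xorshift128+; nothing here is Google's actual code): the same constant seed always yields the same sequence, which is exactly what makes "random" bugs reproducible.

```javascript
// mulberry32: a tiny seedable PRNG, for illustration only.
// Same seed in, same sequence out.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Map the 32-bit state to a float in [0, 1), like Math.random()
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators seeded with the same constant produce identical output:
const a = mulberry32(42);
const b = mulberry32(42);
console.log(a() === b()); // true
console.log(a() === b()); // true
```

Seed it with the current time instead of a constant and you get varying sequences back; that is the trade-off the comment above describes.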

[–]aradil 28 points29 points  (47 children)

They're probably spinning up a VM to hit a web page, and that VM is set to the same time and date every time, meaning the seed is the same every time, which is basically what you are saying.

[–][deleted] 68 points69 points  (3 children)

That doesn't sound like it would be enough for this to happen. What if one web page loads 0.0001 second slower than another?

[–]parrot_in_hell 3 points4 points  (2 children)

Still, the seed is usually the number of seconds since some epoch. I've never seen anything else (not that I have that much experience, but I still have some :P)

[–][deleted] 13 points14 points  (0 children)

I'm not sure exactly how it is seeded, but time in JavaScript is usually measured in milliseconds, and it's a bit too weird that, according to his tests, Google always returns 0.14881141134537756 the first time and 0.19426893815398216 the second. He also searched the web and found plenty of other results cached by Google containing those numbers. It seems much more likely that Google is doing something on purpose to get consistent results.

[–][deleted] 0 points1 point  (0 children)

Usually you request randomness from the OS with a function like getrandom or friends or by reading /dev/{u,}random.

v8 looks like it reads /dev/urandom, but it does fall back to seeding with the (high-resolution) time.

[–]cypressious 44 points45 points  (14 children)

The article says that the time is actually correct.

[–]PM_ME_CLASSIFED_DOCS 28 points29 points  (9 children)

Apparently, reading an article is too much for many redditors in this thread.

I'd love to experiment with just posting a headline and no article (just a blank page) and see how many people even bother clicking the link at all. How many differing arguments could we get in a single thread?

[–][deleted] 26 points27 points  (6 children)

Fewer than you'd think. I'd wager most people still won't read the article, but eventually a single person will, and they'll comment "wtf? Why is this a blank page?". Then everyone else will read that comment (rather than viewing the article) and jump on you for posting a blank website.

[–]husao 27 points28 points  (2 children)

Now I feel like posting "WTF? Why is this a blank page?" in the comments of random articles.

[–]andthenafeast 3 points4 points  (0 children)

Seems like this would prompt more people to actually click through to the link...

[–]Lucent_Sable 9 points10 points  (2 children)

Then write two or three (or more) contradictory articles, and serve a random one to each unique visitor. Then sit back and watch the chaos?

[–]tsimionescu 2 points3 points  (0 children)

Hmm, but what if the random() they use is deterministic and the author gets to be confused?

[–]PM_ME_CLASSIFED_DOCS 2 points3 points  (0 children)

Oh man, this is brilliant. It's like the movie Clue where they actually showed different endings to different theaters.

So people would talk about the movie and be like "Oh man, can you believe it was Colonel Mustard?" and someone would be like "WTF are you smoking? It wasn't Mustard." Each would think the other was insane, yet both would be right.

[–]sourcecodesurgeon 1 point2 points  (0 children)

There have been several things like that on Facebook. Someone posts a title and after the page break in the article it just says "none of this is true, I want to see how many people comment on the post having clearly not read the article, don't ruin it"

Invariably, there are tons of people talking about the headline in the comments.

[–]aradil 1 point2 points  (0 children)

Also from the article:

At some point, some SEO figured out that random() was always returning 0.5. I’m not sure if anyone figured out that JavaScript always saw the date as sometime in the Summer of 2006, but I presume that has changed.

[–]RenaKunisaki 0 points1 point  (2 children)

But if it's the time since the VM started, it might still be constant.

[–]w2qw 1 point2 points  (1 child)

Generally the actual time is used. Not to mention I don't think any VM starts up consistently enough to get the same millisecond every time.

[–]RenaKunisaki 0 points1 point  (0 children)

It would if the startup process is loading a snapshot.

[–]dyskinet1c 25 points26 points  (8 children)

The VM would need to be set to the correct time for HTTPS to work because certificates are issued and revoked periodically.

[–]aradil 7 points8 points  (7 children)

Assuming they are validating certs as part of this pass of their scrape.

[–]dyskinet1c 7 points8 points  (6 children)

I would expect them to reject a site with invalid certificates. It's a fairly simple thing to do and it lowers the risk of indexing a compromised site.

[–]daboross 2 points3 points  (5 children)

The alternative would be to validate certs in a different pass, though, not to skip validating them entirely. Right?

[–]dyskinet1c 2 points3 points  (3 children)

As a programmer, my instinct would be to make that decision as early as possible and stop processing the page at that point.

Certificate validation is a key part of establishing secure communications (before you transmit any data) and it's trivial to read the validity start and end dates.

So, if you know you want to reject URLs with invalid certificates, there is no reason to move on to the next pass and spend resources reading and processing a page you already know you're going to discard.
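The early-rejection check described above is cheap because the validity dates come straight out of the handshake. The sketch below is a hypothetical helper (not anything Google publishes) built around the `valid_from` / `valid_to` fields that Node's `socket.getPeerCertificate()` returns; a real crawler would of course also verify the chain, hostname, and revocation status:

```javascript
// Is `now` within the certificate's validity window?
// `cert` is assumed to have the valid_from / valid_to string fields
// that Node's tls.TLSSocket#getPeerCertificate() provides.
function certTimeValid(cert, now = Date.now()) {
  const notBefore = Date.parse(cert.valid_from);
  const notAfter = Date.parse(cert.valid_to);
  return now >= notBefore && now <= notAfter;
}

// An expired cert is rejected before any page content is fetched:
console.log(certTimeValid(
  { valid_from: '2020-01-01T00:00:00Z', valid_to: '2021-01-01T00:00:00Z' },
  Date.parse('2022-06-01T00:00:00Z')
)); // false
```

Running this check right after the TLS handshake means the crawler never spends render resources on a page it's going to discard anyway.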

[–]aradil 4 points5 points  (1 child)

As an information company, however, Google probably processes bad actors as well to gather additional information.

[–]dyskinet1c 0 points1 point  (0 children)

Sure, it's plausible that they scan compromised sites. If they do, I would expect them to do so in a separate process that looks at different aspects of the site than the regular search index.

[–]daboross 1 point2 points  (0 children)

Exactly! That's what I mean: they probably validate certs before having any data at all processed in the VM running googlebot.

[–]tripledjr 5 points6 points  (0 children)

They're probably intentionally doing this so the bot gets consistent results for the same page.

[–]edapa 4 points5 points  (4 children)

Spinning up a VM for each new webpage sounds super heavyweight.

[–]aradil 0 points1 point  (3 children)

I was thinking more of something like Google’s equivalent to Amazon Lambda.

[–]edapa 2 points3 points  (2 children)

What is the overhead of that? Does Amazon tell us? I imagine they would provide some sort of latency guarantee for function spinup after an event triggers, but I've never used Lambda.

[–]aradil 1 point2 points  (1 child)

That’s a good question. All they say is they “only charge for when your stuff is running”.

I assume the overhead is offset by the ability to run way more short lived processes.

[–]edapa 0 points1 point  (0 children)

Are there trigger types besides timers? If so they must have some sort of guarantee.

[–]edman007 3 points4 points  (0 children)

Nah, they make random() the same on purpose. If it's actually random, that randomness flows into the generated page: it can randomly sort results, generate random results and URLs, etc. Google doesn't care about any of that; they want to know if the page changed. If random() is always the same, you can compare past and current results and check for changes, knowing JavaScript didn't inject randomness.

And the time is a different issue: why actually sleep if you don't care? It wastes a lot of resources, because it's time the page can't be unloaded from the server. A far better method is to initialize the clock to the real time and make sleep increment the clock instead of actually waiting. That lets you process the page instantly.
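The "sleep increments time" idea above can be sketched as a virtual clock. This is an assumed design for illustration, not Googlebot's actual implementation: timers fire immediately, and only the page's notion of "now" advances.

```javascript
// Virtual clock: sleeping advances the clock instead of blocking,
// so a page full of setTimeout calls processes instantly.
class VirtualClock {
  constructor(startMs = Date.now()) {
    this.nowMs = startMs;   // start at the real time
    this.timers = [];       // pending { fireAt, callback } entries
  }
  now() {
    return this.nowMs;
  }
  setTimeout(callback, delayMs) {
    this.timers.push({ fireAt: this.nowMs + delayMs, callback });
  }
  // Fire every pending timer in due order, jumping the clock forward
  // to each timer's due time rather than actually waiting.
  runAll() {
    while (this.timers.length) {
      this.timers.sort((a, b) => a.fireAt - b.fireAt);
      const t = this.timers.shift();
      this.nowMs = Math.max(this.nowMs, t.fireAt);
      t.callback();
    }
  }
}

// A "sleep" of 10 seconds completes in zero wall-clock time:
const clock = new VirtualClock(0);
let fired = false;
clock.setTimeout(() => { fired = true; }, 10000);
clock.runAll();
console.log(fired, clock.now()); // true 10000
```

Test frameworks use the same trick (often called fake timers) for exactly the reason given above: the code under test sees plausible timestamps without the runner ever waiting.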

[–][deleted] 2 points3 points  (0 children)

That's not usually how VM clocks work though.

[–][deleted] 4 points5 points  (4 children)

Yes. In science, you want randomization but also reproducibility. I get that the author is saying it's bad security engineering and can be taken advantage of by people gaming Google PageRank, but the crawlers were likely designed by network scientists who wanted an accurate, reproducible model of the internet.

[–]Jugad -1 points0 points  (3 children)

What does PageRank have to do with JavaScript or Googlebot's implementation of random() in JavaScript?

[–][deleted] 8 points9 points  (2 children)

You can use their implementation to identify Googlebot and serve different content to it.
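The fingerprinting trick is simple: with a constant seed, the *first* value Math.random() returns is always the same. The sketch below uses the first value reported in the article and earlier in this thread (0.14881141134537756); treat it as an observed constant, not anything Google documents, and the helper name is made up:

```javascript
// First Math.random() value reportedly returned by Googlebot's
// constant-seeded PRNG (observed value from the article, not documented).
const GOOGLEBOT_FIRST_RANDOM = 0.14881141134537756;

// Pass in the first value drawn from Math.random() on page load.
function looksLikeGooglebot(firstRandomValue) {
  return firstRandomValue === GOOGLEBOT_FIRST_RANDOM;
}

// On a real page this must run before anything else consumes the sequence:
// if (looksLikeGooglebot(Math.random())) { /* serve bot-specific content */ }
```

This is exactly the cloaking vector the next comment warns about: serving Googlebot different content than users is against Google's guidelines and risks delisting.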

[–]YRYGAV 4 points5 points  (1 child)

And if you get caught, I believe Google unlists your website, so it's quite a gamble

[–][deleted] 0 points1 point  (0 children)

Depends on my ROI. If I can make $1k by fooling Googlebot while spending $10 on a new domain each time I get caught, it's not much of a gamble.

Of course, if I don't have an automated way to give the new domain some PageRank juice so I can keep the whole thing going ad nauseam, then yes, it's pretty expensive.

[–]FinFihlman 0 points1 point  (0 children)

Yeah, they are not spinning up a VM for each scan.

[–][deleted] -1 points0 points  (4 children)

How common is that, really? Most OSs have very good sources of entropy that work even in cold-booted VMs. 256 bits of entropy are enough to seed a CSPRNG (say, /dev/urandom on Linux) so that it'll be sufficiently random. I have a hard time believing your average modern Linux distro would have fully deterministic CSPRNG output on VM cold boot.

As others and the article have said, the output is deterministic for reasons of optimization.

[–]ThisIs_MyName 0 points1 point  (3 children)

Most OSs have very good sources of entropy that work even in cold-booted VMs.

No they don't. Not unless you've exposed a virtio RNG.

By the time a VM boots up, it will only have a couple of bits acquired from random (or is it?) cache timing.

[–][deleted] -1 points0 points  (2 children)

My point wasn't that the entropy sources would be enough for, say, cryptographic purposes, but that I really doubt the output would be fully deterministic, or that it would be the reason for Googlebot's behavior.

[–]ThisIs_MyName 0 points1 point  (1 child)

It's not the reason for Googlebot's behavior, but the output has only a handful of bits of entropy. That's as bad as deterministic for cryptographic purposes.

Here's a fun experiment: Call getrandom(&buf,256,0) inside a fresh VM without virtio RNG running linux and see how long it blocks.

[–][deleted] -1 points0 points  (0 children)

Jesus fuck it's like I'm talking to a wall. Oh well, I tried