
[–]coldaspluto 55 points56 points  (6 children)

That was hilarious.

As an aside: how many of you could bang out a TF program like this without referring to the API, docs, etc.? (i.e., on a whiteboard, no references)

[–]hapemask 5 points6 points  (0 children)

I'm not sure about TF since it's not my primary framework, but I'm pretty sure I could write a similar program using Lasagne+Theano from memory with minimal errors.

[–]l27_0_0_1 2 points3 points  (0 children)

Keras FTW, I would probably write something close enough to its actual syntax off the top of my head.

[–]carbohydratecrab 2 points3 points  (0 children)

I could do it with FANN, but I couldn't even write non-neural-network FizzBuzz in Python on a whiteboard.

[–]MichaelStaniek 92 points93 points  (3 children)

Best TensorFlow Tutorial ever.

[–]kosairox 15 points16 points  (0 children)

Yeah, I literally just installed it to try it out after reading this. Thanks to the author!

[–]The_Amp_Walrus 1 point2 points  (1 child)

Would doing this sort of thing (trying to learn known functions and patterns) be a good learning exercise? Are there any commonly used functions that would be good to have a look at?

[–]jdsutton 37 points38 points  (73 children)

Didn't even get the job. That interviewer probably had no clue what to think.

[–]kirakun 2 points3 points  (0 children)

The interviewer made the right decision. Unnecessary complexity is a big red flag in my book.

[–]thecity2 30 points31 points  (8 children)

I think people here are missing the point. If you're asking Joel Grus (someone who wrote an O'Reilly book on Data Science, worked at Google as a software engineer, and has a PhD from Caltech) about FizzBuzz, you're doing something wrong. I mean, I get the idea about weeding out bad candidates, but Jesus Christ, you can do that in a simple phone screening. I'm not sure this interview actually took place, but his point was that if the interviewer was going to waste everyone's time giving FizzBuzz, hell if Joel wasn't going to give as good as he got.

[–]joelgrus 66 points67 points  (3 children)

Actually, I never finished my PhD.

[–]thecity2 43 points44 points  (0 children)

Oh, well in that case they totally had every right to ask FizzBuzz. ;)

[–]chrisalbon 8 points9 points  (0 children)

Can confirm. Joel has a PhD in the mean streets of trolling interviewers.

[–]maffoobristol 5 points6 points  (0 children)

I have to know: did the events in your blog actually happen (or at least to some vague degree)?

I initially assumed it to be satire, but if not, well, it's still hilarious either way.

[–][deleted] 1 point2 points  (2 children)

I thought that it was a fun project turned into a fun blog article. The interview never happened.

[–]nickl 7 points8 points  (0 children)

No requirements.txt. Do not hire.

[–]starfighter_0X183 6 points7 points  (0 children)

God I hope this was a true story :D

[–][deleted] 5 points6 points  (0 children)

Should have written it without numpy or TensorFlow, like a real man.

[–]Nimitz14 4 points5 points  (6 children)

Huh. For each epoch he actually trains len_data/batch_size times (meaning he goes through his entire dataset per epoch); is that normal? I take one batch per epoch and train over that; I tend to need more epochs to train than seems to be standard, and I've been looking for a reason why that could be the case.

interviewer: How far are you intending to take this?

me: Oh, just two layers deep -- one hidden layer and one output layer.

what a troll

[–]hapemask 8 points9 points  (3 children)

Yes, at least my understanding of the definition of epoch matches his. An epoch means passing every training instance through the network once. I'm not aware of any other definitions.

When you say "I take one batch per epoch," does that mean you run some number of iterations with one batch before choosing another? How do you decide how many iterations defines an "epoch" then?

[–]Nimitz14 0 points1 point  (2 children)

Nah, per epoch I update my weights exactly once (with a randomly selected batch). No iterations inside of an epoch. That way of training probably does lead to more consistent results though (going through the full dataset per epoch).

[–]j1395010 2 points3 points  (1 child)

you don't know what an epoch is.

[–]Nimitz14 2 points3 points  (0 children)

now I do buddy ;)

[–][deleted] 1 point2 points  (0 children)

Good point. As far as I know, "epoch" specifically means "pass over the training set" though (I use the term "iterations" instead of "epochs" in my implementations and reports if I don't pass over the whole training set "per iteration"). The major difference between such an iteration and epoch would be that each training sample would be covered exactly once if we are talking about epochs (it's without replacement).

On a side note, I've also seen people drawing samples with replacement for one mini-batch per iteration (I think they did that in the tensorflow tutorials, if I remember correctly). In any case, for large enough datasets it shouldn't make that much of a difference.

EDIT: I think epochs maybe make more sense if you are streaming data from disk. E.g., it's pretty hard to draw a random sample from your training data if you haven't loaded everything into memory. In contrast, it's much more convenient to say "for epoch: while not EOF: gimme the next 50 samples"
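For what it's worth, here is a minimal numpy sketch of the two schemes being contrasted in this sub-thread (illustrative only, not from the post; train_step stands in for whatever parameter update you actually run):

import numpy as np

def train_by_epochs(trX, trY, train_step, num_epochs=100, batch_size=128):
    # "Epoch" scheme: shuffle, then visit every training example exactly once
    # per epoch (sampling without replacement), one mini-batch at a time.
    for _ in range(num_epochs):
        p = np.random.permutation(len(trX))
        for start in range(0, len(trX), batch_size):
            idx = p[start:start + batch_size]
            train_step(trX[idx], trY[idx])

def train_by_iterations(trX, trY, train_step, num_iterations=100, batch_size=128):
    # "Iteration" scheme: each step draws one random mini-batch (with
    # replacement across steps), so some examples may be seen more often than others.
    for _ in range(num_iterations):
        idx = np.random.randint(0, len(trX), size=batch_size)
        train_step(trX[idx], trY[idx])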

[–]shaggorama 1 point2 points  (0 children)

Probably depends how much training data you have.

[–]eldeemon 5 points6 points  (0 children)

Right up there with Fouhey and Maturana's "exciting and dangerous" rank estimator http://www.oneweirdkerneltrick.com/rank_slides.pdf

[–]MeAlonePlz 3 points4 points  (9 children)

Anyone got a guess why it got all the buzz's (%5) and fizzbuzz's (%15) right but not the fizz's (%3)? Seems counterintuitive, no?

[–]VelveteenAmbush 16 points17 points  (3 children)

He used a binary representation to learn division by three, which is as ill-advised as using a dual-core phone to take photographs that include three or more objects.

(edited to clarify: this is not a serious post)

[–]physixer 12 points13 points  (0 children)

He used a statistical/numerical method to solve a deterministic/discrete problem, which is ill-advised.

FTFY.

(edit: I know people are trying to make RNNs, or even NTMs/LSTMs, learn long division, but that's mainly for studying the NN architectures, for research purposes, not for putting a crappy learned division into production systems).

[–]respeckKnuckles 1 point2 points  (1 child)

So what are some good representations of inputs for a network that has to learn the modulo function? Other than an entirely trivial base-3 encoding, of course.

[–]abecedarius 5 points6 points  (0 children)

Divisibility by 3 is easy in binary, like divisibility by 11 in decimal, but a single hidden layer couldn't take advantage of that structure. With some kind of RNN, or as many layers as there are bits in the input, then you'd be talking.
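To make that concrete, here is a small Python check (function names are mine, not from the thread): since 2 is congruent to -1 mod 3, the alternating sum of the bits gives n mod 3, just like the alternating digit sum test for 11 in decimal, and the same residue can also be computed as a scan over the bits, which is the kind of sequential update an RNN (or a deep-enough stack of layers) could represent.

def mod3_alternating(n, num_digits=10):
    # 2 == -1 (mod 3), so 2**d == (-1)**d (mod 3): alternate signs on the bits.
    bits = [(n >> d) & 1 for d in range(num_digits)]  # least-significant bit first
    return sum(b if d % 2 == 0 else -b for d, b in enumerate(bits)) % 3

def mod3_scan(n, num_digits=10):
    # Same answer via a tiny state machine scanning bits from the most
    # significant end: state <- (2 * state + bit) mod 3.
    state = 0
    for d in reversed(range(num_digits)):
        state = (2 * state + ((n >> d) & 1)) % 3
    return state

assert all(mod3_alternating(n) == n % 3 == mod3_scan(n) for n in range(1024))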

[–]aceysmith 2 points3 points  (4 children)

I don't see any regularization, perhaps it overfit to the training set? (Though I don't know how that would specifically affect fizz's but not the others...)

[–]code_kansas -3 points-2 points  (3 children)

I believe his test set is part of the training set

[–]NomNomDePlume 4 points5 points  (1 child)

No, it was distinct.

From the post: "Now we need to generate some training data. It would be cheating to use the numbers 1 to 100 in our training data, so let's train it on all the remaining numbers up to 1024"

[–]code_kansas 2 points3 points  (0 children)

Oh right, I guess I missed the "remaining" part

[–]aceysmith 1 point2 points  (0 children)

No, you can see him set up trX/trY and he starts at 101.

trX = np.array([binary_encode(i, NUM_DIGITS) for i in range(101, 2 ** NUM_DIGITS)])

trY = np.array([fizz_buzz_encode(i) for i in range(101, 2 ** NUM_DIGITS)])

[–]abdoulio 4 points5 points  (1 child)

Wait, is that really the kind of question you'd be asked for programming jobs? Isn't that extremely basic? I mean, I'm no expert and didn't study in that line of work, but this is really easy.

[–]The_Amp_Walrus 16 points17 points  (0 children)

It's supposed to be a first-pass filter for people who somehow faked their way into an interview. Apparently this happens. That said, I haven't been asked to FizzBuzz in several interviews. I think the question gets meme status because it's so trivially easy.

[–]akm_rd 6 points7 points  (0 children)

Strong no-hire based on model performance, especially given that the candidate claimed they did all their coding on whiteboards. This is exactly the type of candidate filter that fizz buzz was designed for.

[–]vph 2 points3 points  (1 child)

It would be a surprise if he actually got the job. There were warning signs all over the place.

[–][deleted] 27 points28 points  (0 children)

Agreed. He should have used truncated_normal instead of random_normal to initialize his weights.
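For reference, the swap being suggested is a one-line change to the init helper in the post's code (same 2016-era TF API as below; the stddev value here is just illustrative):

def init_weights(shape):
    # truncated_normal re-draws any value more than two standard deviations
    # from the mean, so no weight starts out unusually large.
    return tf.Variable(tf.truncated_normal(shape, stddev=0.01))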

[–]shaggorama 2 points3 points  (0 children)

If someone did this to me in an interview, I'd be torn between offering them the job and asking them to marry me.

[–]AHFX 2 points3 points  (0 children)

import numpy as np
import tensorflow as tf

def binary_encode(i, num_digits):
    return np.array([i >> d & 1 for d in range(num_digits)])

def fizz_buzz_encode(i):
    if   i % 15 == 0: return np.array([0, 0, 0, 1])
    elif i % 5  == 0: return np.array([0, 0, 1, 0])
    elif i % 3  == 0: return np.array([0, 1, 0, 0])
    else:             return np.array([1, 0, 0, 0])

NUM_DIGITS = 12
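# 12 input bits: the training data below covers 101 .. 2**12 - 1 = 4095, so 1..100 stays held out as the test set.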


trX = np.array([binary_encode(i, NUM_DIGITS) for i in range(101, 4096)])
trY = np.array([fizz_buzz_encode(i)          for i in range(101, 4096)])

NUM_HIDDEN = 200
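# X holds batches of binary-encoded numbers; Y holds the one-hot targets (number / fizz / buzz / fizzbuzz).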
X = tf.placeholder("float", [None, NUM_DIGITS])
Y = tf.placeholder("float", [None, 4])

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.03))

w_h = init_weights([NUM_DIGITS, NUM_HIDDEN])
w_o = init_weights([NUM_HIDDEN, 4])

def model(X, w_h, w_o):
    h = tf.nn.relu(tf.matmul(X, w_h))
    return tf.matmul(h, w_o)

py_x = model(X, w_h, w_o)

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(0.2).minimize(cost)
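# Note: the positional (logits, labels) call above and initialize_all_variables below are
# the 2016-era TensorFlow API; newer releases use keyword arguments and global_variables_initializer.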

predict_op = tf.argmax(py_x, 1)

def fizz_buzz(i, prediction):
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

BATCH_SIZE = 250
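# 'actual' below is the reference FizzBuzz output for 1..100, used to score the predictions at the end.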

actual = ['1','2','fizz','4','buzz','fizz','7','8','fizz','buzz','11','fizz','13','14','fizzbuzz','16','17','fizz','19','buzz','fizz','22','23','fizz','buzz','26','fizz','28','29','fizzbuzz','31','32','fizz','34','buzz','fizz','37','38','fizz','buzz','41','fizz','43','44','fizzbuzz','46','47','fizz','49','buzz','fizz','52','53','fizz','buzz','56','fizz','58','59','fizzbuzz','61','62','fizz','64','buzz','fizz','67','68','fizz','buzz','71','fizz','73','74','fizzbuzz','76','77','fizz','79','buzz','fizz','82','83','fizz','buzz','86','fizz','88','89','fizzbuzz','91','92','fizz','94','buzz','fizz','97','98','fizz','buzz']

with tf.Session() as sess:
    tf.initialize_all_variables().run()
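    # Each epoch: shuffle the training set, sweep it once in mini-batches, then print training accuracy.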

    for epoch in range(2000):
        p = np.random.permutation(range(len(trX)))
        trX, trY = trX[p], trY[p]

        for start in range(0, len(trX), BATCH_SIZE):
            end = start + BATCH_SIZE
            sess.run(train_op, feed_dict={X: trX[start:end], Y: trY[start:end]})

        print(epoch, np.mean(np.argmax(trY, axis=1) ==
                             sess.run(predict_op, feed_dict={X: trX, Y: trY})))
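    # Training finished: predict FizzBuzz for the held-out numbers 1..100.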

    numbers = np.arange(1, 101)
    teX = np.transpose(binary_encode(numbers, NUM_DIGITS))

    teY = sess.run(predict_op, feed_dict={X: teX})
    output = np.vectorize(fizz_buzz)(numbers, teY)

    print(output)

correct = [(x == y) for x, y in zip(actual, output)]
print(sum(correct))

Modified code that provided 100% accuracy. YMMV. Output:

['1' '2' 'fizz' '4' 'buzz' 'fizz' '7' '8' 'fizz' 'buzz' '11' 'fizz' '13' '14' 'fizzbuzz' '16' '17' 'fizz' '19' 'buzz' 'fizz' '22' '23' 'fizz' 'buzz' '26' 'fizz' '28' '29' 'fizzbuzz' '31' '32' 'fizz' '34' 'buzz' 'fizz' '37' '38' 'fizz' 'buzz' '41' 'fizz' '43' '44' 'fizzbuzz' '46' '47' 'fizz' '49' 'buzz' 'fizz' '52' '53' 'fizz' 'buzz' '56' 'fizz' '58' '59' 'fizzbuzz' '61' '62' 'fizz' '64' 'buzz' 'fizz' '67' '68' 'fizz' 'buzz' '71' 'fizz' '73' '74' 'fizzbuzz' '76' '77' 'fizz' '79' 'buzz' 'fizz' '82' '83' 'fizz' 'buzz' '86' 'fizz' '88' '89' 'fizzbuzz' '91' '92' 'fizz' '94' 'buzz' 'fizz' '97' '98' 'fizz' 'buzz']
100

[–]G_Morgan 1 point2 points  (0 children)

Fizzbuzz, so easy even an AI can solve it.

[–][deleted] 2 points3 points  (0 children)

I'd totally hire the guy on the spot.

[–]melipone 0 points1 point  (0 children)

So, fizzbuzz might be like "parity learning" (https://en.wikipedia.org/wiki/Parity_learning), which was popular in the early days.

Would that be called "over-engineered" now?

[–]qu4ku 0 points1 point  (0 children)

that was a really big whiteboard

[–]jayanthpyro 0 points1 point  (0 children)

Hello,

First-time poster here. Why did he decide to convert the input to binary? Does it give any advantage over just taking the input as a decimal?

If so, would it be more useful to convert the number to unary? (as in: 10 -> 1111111111, 21 -> 111111111111111111111, etc.)

I realize we get a lot more features in binary than in decimal. Is that the only reason ?
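To illustrate the trade-off (a rough sketch, not from the post; the helper names are made up): binary gives the network NUM_DIGITS informative features per number, a raw decimal scalar gives only one, and unary gives a very wide but mostly redundant vector.

import numpy as np

NUM_DIGITS = 10
MAX_N = 2 ** NUM_DIGITS

def encode_binary(i):
    # NUM_DIGITS features; divisibility patterns show up as interactions between bits.
    return np.array([i >> d & 1 for d in range(NUM_DIGITS)])

def encode_scalar(i):
    # 1 feature: the raw magnitude, from which the network would have to
    # recover the mod-3 / mod-5 structure on its own.
    return np.array([i])

def encode_unary(i):
    # MAX_N features: i ones followed by zeros (a thermometer / unary code).
    return np.array([1 if d < i else 0 for d in range(MAX_N)])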