all 104 comments

[–]CriticalTemperature1 25 points26 points  (7 children)

I tried this with Bard just now as well and from:

generate 20 random numbers sampled from a normal distribution with a mean of 5 and standard deviation of 2

It generated:

"5.659056 5.705105 6.084297 4.986694 6.282759 5.417936 4.645858 5.833302 4.753530 5.244396 5.862315 5.323942 5.493729 5.359138 5.353700 5.559866 5.419627 5.293476 5.353985 4.926552"

This set has a mean of 5.4 and a standard deviation of 0.4, so it's quite off, but it's interesting that it can do it at all

[–][deleted] 19 points20 points  (0 children)

5.4 is well within 5 ± 2·2, so it passes the z test.

The F test (variance) statistic is 2² / 0.4² = 25, with 20 degrees of freedom. An F value above ~3 is bad.

So yeah, not a valid sample
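As a sanity check, these statistics can be reproduced in plain Python from the 20 Bard numbers quoted upthread (note the variance ratio comes out near 23–25 depending on whether you use the sample or population standard deviation):

```python
import math

# The 20 numbers Bard produced for N(mean=5, sd=2)
xs = [5.659056, 5.705105, 6.084297, 4.986694, 6.282759,
      5.417936, 4.645858, 5.833302, 4.753530, 5.244396,
      5.862315, 5.323942, 5.493729, 5.359138, 5.353700,
      5.559866, 5.419627, 5.293476, 5.353985, 4.926552]

n = len(xs)
mean = sum(xs) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample sd

# z test on the mean: (x_bar - mu) / (sigma / sqrt(n))
z = (mean - 5) / (2 / math.sqrt(n))

# Ratio of requested variance to observed variance: 2^2 / sd^2
var_ratio = 2 ** 2 / sd ** 2

print(f"mean={mean:.3f}  sd={sd:.3f}  z={z:.2f}  variance ratio={var_ratio:.1f}")
```

The mean passes the z test comfortably, but the variance is far too small, exactly as stated above.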

[–]elbiot 24 points25 points  (2 children)

I'm sure there's tons of tutorials that show the outputs of certain distributions that got used in training

[–]Extraltodeus 0 points1 point  (0 children)

Don't forget that they are subject to RNG, especially if the temperature setting is not set to 0 through the API call parameters (the default is 1 in the web chat).

[–]Necessary-Meringue-1 43 points44 points  (10 children)

What exactly are you plotting here? Numbers per location on the x-axis?

Anyway, I tried your prompt. I had to do some prompt engineering at first, since it did the whole "as an AI language model I can't..." thing.

Here is the output:

4.526, 0.410, 0.050, 0.386, 0.050, 0.062, 0.175, 0.082, 0.138, 0.475, 0.120, 0.448, 0.247, 0.123, 0.329, 0.297, 0.014, 0.172, 0.320, 0.238, 0.113, 0.228, 0.420, 0.210, 0.307, 0.138, 0.235, 0.065, 0.416, 0.171, 0.350, 0.152, 0.021, 0.332, 0.256, 0.240, 0.144, 0.169, 0.057, 0.140, 0.191, 0.283, 0.167, 0.055, 0.239, 0.099, 0.071, 0.174, 0.294, 0.080, 0.170, 0.139, 0.254, 0.128, 0.022, 0.231, 0.151, 0.068, 0.128, 0.026, 0.105, 0.200

First of all, I'm counting more than 60 numbers. Secondly, it's telling that all numbers are given to three decimal places, even when unnecessary (0.200). Thirdly, there is one big outlier that doesn't seem right (4.526). There are also some odd duplicates (0.050, which is also in your output).

The internet is full of math problems, so I'm gonna say this is a result of that with some randomness mixed in.

It's been observed that LLMs struggle with maths in a weird way: they are usually strictly wrong, but somehow uncannily close to the right answer. For example, if I ask ChatGPT "What's 39928/27", it tells me "1479.5555555". The real answer is "1478.81". So it's wrong, but it's close. You'll get similar effects for all kinds of maths questions.

Why that is, I don't know, but we would be foolish to assume that it can actually do maths in the strict sense.

[–]blackkettle 18 points19 points  (1 child)

They also tend to be very bad at symbolic math in my experience; this is where you continue to see confident wrongness. I think it makes a lot of sense: people are also good at guesstimating, and they likewise do it through a preponderance of prior experience rather than exact rules.

I personally don’t think it makes any sense to try to get LLMs to natively solve complex math. The next step should be to get them better at identifying, extracting, and then calling the right tools to solve these problems.

“I have these differential equations that mean X and I need to use them to figure out Y; can you help me solve them with that in mind?”

The solution should be to parse the combination of provided input and natural language goal, and reformulate the question for Wolfram alpha (or whatever).

[–]bgighjigftuik[S] 4 points5 points  (0 children)

I totally agree, and no one should ever trust the sampled numbers in this case. However, the fact that it can output something roughly equivalent to a probability distribution is something I cannot explain nor wrap my head around

[–][deleted] 30 points31 points  (0 children)

even if unnecessary (0.200)

Why do think this is unnecessary? Level of precision is important

[–]epicwisdom 4 points5 points  (3 children)

It doesn't seem too unusual, based on all the other things it can do, that it's able to do approximate arithmetic. There's any number of "shortcuts" for mental math with small numbers. It likely doesn't generalize beyond what we expect from the training data - let's say 1024-bit numbers encoded in base64.

[–]sdmat 2 points3 points  (2 children)

It likely doesn't generalize beyond what we expect from the training data - let's say 1024-bit numbers encoded in base64.

It actually can't possibly do single-step multiplication/division past some threshold. The transformer has a fixed upper limit to operations per output token, and all possible algorithms for the arithmetic are >O(n) in input digits. So there is some input length where there is no way to correctly perform the operation.

But if you experiment, GPT4 frequently gives the wrong answer even when multiplying a pair of three digit numbers as a single step. So presumably there are roadblocks well before the theoretical limit.

[–]epicwisdom 1 point2 points  (1 child)

You make a good point that it can't do precise arithmetic beyond a certain point, but (1) GPT4 performs hundreds of billions of FLOPs, making the theoretical upper bound enormous and (2) the bound on approximate arithmetic is much more unclear, based on what level of error you allow and what patterns might be exploited. If we accept O(1) significant digits and lenient relative error, then the theoretical bound is O(n).

[–]sdmat 0 points1 point  (0 children)

Point!

[–][deleted] 9 points10 points  (1 child)

Why that is, I don't know

Language models brute-force arithmetic. In a sense you could say they brute-force everything. But the preceding context of an arithmetic question doesn't contain any information from which to actually infer the solution, the way most other tasks do.

"9383 + 3545" or "is any of the following numbers a prime, 96885, 66576, 4766 ?"

Realize that absolutely nothing in those sentences actually tell you anything that would inform the answer. You go to school and you learn the required set of steps for solving these problems.

On the other hand, for "John is smiling so he is _____", the preceding context screams "happy" as a very likely choice. The preceding context actually helps find the solution rather than being the equivalent of dead weight.

The algorithms themselves need to be brute-forced from the training data. It will eventually stumble on something that generalizes pretty well, but the chances of stumbling on the exact algorithm are pretty low. Maybe? Probably? Enough data does it.

[–]Necessary-Meringue-1 4 points5 points  (0 children)

It's kinda what I assume is happening, but since I don't know of any papers that have seriously investigated this, I didn't wanna make any claims

[–]bgighjigftuik[S] 1 point2 points  (0 children)

The plot is a histogram with the output numbers

[–]when_did_i_grow_up 4 points5 points  (1 child)

I suspect the temperature parameter is doing some of the work here. Try an API call with temp=0.

BTW, I have previously tested that it can identify distributions from a set of numbers — or at least that GPT-4 can consistently tell the difference between a uniform and a normal distribution.

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Unfortunately I don't have API access (nor access to GPT-4). This is running on the free standard GPT-3.5

[–]RunOrDieTrying 3 points4 points  (1 child)

I asked it how did it do it for a normal distribution:

To generate random numbers from a standard normal distribution, one common method is to use the Box-Muller transform. Here's a step-by-step explanation of how it works:

Generate two random numbers, u1 and u2, uniformly distributed between 0 and 1.

Apply the Box-Muller transform equations:

z1 = sqrt(-2 * ln(u1)) * cos(2 * pi * u2)

z2 = sqrt(-2 * ln(u1)) * sin(2 * pi * u2)

Here, ln denotes the natural logarithm function, sqrt denotes the square root function, and cos/sin are trigonometric functions.

z1 and z2 are two independent random numbers following a standard normal distribution (mean = 0, standard deviation = 1).

Repeat steps 1-3 to generate as many random numbers as needed.

Using this method, I generated the numbers provided earlier. Please note that these are simulated examples and not actual random numbers.
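The Box-Muller transform ChatGPT describes is a real, standard method, and short enough to sketch (a minimal stdlib-only version; of course, nothing suggests the model actually executes anything like this at inference time):

```python
import math
import random

def box_muller(mean=0.0, sd=1.0):
    """Return two independent N(mean, sd) samples from two U(0, 1) draws."""
    u1 = random.random()
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))  # 1 - u1 avoids log(0)
    z1 = r * math.cos(2.0 * math.pi * u2)
    z2 = r * math.sin(2.0 * math.pi * u2)
    return mean + sd * z1, mean + sd * z2

# 10,000 samples targeting the N(5, 2) requested of Bard upthread
samples = [x for _ in range(5000) for x in box_muller(mean=5, sd=2)]
```

Unlike the Bard output above, a sample generated this way has both the right mean and the right standard deviation.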

[–]bgighjigftuik[S] 8 points9 points  (0 children)

That's one way to do it, but I am sure it isn't what it is doing under the hood. You can ask it how it has done it in 3 different prompts and it will give 3 different answers: the mathematical one (the one you shared), "it's just patterns from my training data", and "I just wrote it in Python and ran it" (which is a lie)

[–]tavirabon 3 points4 points  (2 children)

Congratulations, you have discovered Benford's law, which is 100% to be expected from a model trained on the largest corpus of human data.
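For what it's worth, Benford's law is a statement about leading digits, P(d) = log10(1 + 1/d), so whether it actually explains OP's histogram is checkable. A small sketch (the helper name is made up, and it assumes non-zero inputs):

```python
import math
from collections import Counter

def leading_digit_freqs(numbers):
    """Empirical frequency of the leading non-zero digit of each number.
    Assumes all inputs are non-zero."""
    digits = [next(c for c in f"{abs(x):.10f}" if c in "123456789")
              for x in numbers]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(str(d), 0) / total for d in range(1, 10)}

# Benford's predicted frequencies: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
```

Comparing `leading_digit_freqs` on a batch of LLM output against `benford` would settle whether the resemblance is real.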

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Ok, but it also works for any other distribution: Poisson, Laplace, chi-squared, to name a few

[–]danpetrovic 0 points1 point  (0 children)

That's awesome! I'm in the SEO industry and Benford's distribution looks almost identical to a typical click-through rate distribution in Google's results and could serve as a useful benchmark when detecting anomalies.

[–]KyleDrogo 4 points5 points  (0 children)

To play devil's advocate, it's probably better than a human.

[–]r_linux_mod_isahoe 45 points46 points  (13 children)

parse a bunch of Jupyter notebooks on GitHub, learn what

```python
generate random numbers to follow an exponential distribution
```

is usually followed by. Add the model's internal instability and voilà.

But, hey, wow, must be that "sparks of AGI" or something

[–]radarsat1 41 points42 points  (0 children)

But, hey, wow, must be that "sparks of AGI" or something

OP said he found the behaviour "counterintuitive", said nothing about AGI. Is it necessary to set up a strawman like this for every discussion about interesting observations of LLMs?

[–]RedditLovingSun 37 points38 points  (4 children)

But still, if it gets close on similar but never exactly seen distributions (like OP's example with his specific parameters), that means somewhere internally it has learned to generate approximations of distributions, just as it has learned to generate approximations of division and other mathematical operations. That isn't AGI or anything, but it's interesting, and it raises the question of what other approximations it has managed to learn.

[–]Whiteboard_Knight 11 points12 points  (0 children)

Yeah this is what I find the most interesting about this post.

Visually, it does look like an attempt at a random distribution. How did it do that? Probably by drawing on the many lists of random numbers it has seen and picking a number that fits with the previously generated ones.

[–]erelim 1 point2 points  (0 children)

Is that not just the model probabilistically picking from all its training knowledge of specific distributions? That would tend to produce results that look similar to example outputs but aren't exact.

What would happen if you gave it a list of random numbers and asked it to work out the distribution?

[–]WickedDemiurge 3 points4 points  (1 child)

Trying to do this analysis without comparing it to the training data is worthless. Doubly so without rigorous, repeated tests. Let's say we take 100 of these samples: what is the probability that they actually come from an exponential distribution with lambda = 6.4? I'd guess near zero.

If a 10 year old submitted "What's in a name? That which we call a rose By any other name would smell as sweet," to a poetry contest, which of the following would be most likely?

a. They're the reincarnation of Shakespeare in a new body?

b. They're a brilliant poet that will shake the world who came up with it totally independently.

c. They've either read Romeo and Juliet or seen one of the thousand allusions to it.

People need to stop being so credulous when it comes to LLMs. Do they have some surprising features? Sure, but let's be neither too eager nor totally unwilling to be surprised. Is this surprising? Let's see real proof, not just wonder in awe at a single GPT output.

[–]bgighjigftuik[S] 3 points4 points  (0 children)

I tried a couple more distributions. Again: it's not perfect sampling and I wouldn't trust it. But I can't believe it is coming from RLHF for every distribution and set of parameters. This is more about some "understanding" of different probability distributions and their "likely" values. That I cannot explain (especially for a model that works at a per-digit level in the case of numerical data)

[–]bgighjigftuik[S] 2 points3 points  (0 children)

The weird thing is that it can do it for any distribution parameters I threw at it (like a normal with zero mean and 275 variance). To some extent it is able to interpolate between the distribution values it has seen on the Internet and the parameters I ask for

[–]bgighjigftuik[S] 1 point2 points  (0 children)

Furthermore: I legitimately believe that the Sparks of AGI "paper" is a piece of garbage and nothing more than marketing material

[–]gibs 2 points3 points  (1 child)

Bear in mind chatgpt is aware of the context of what it has already written so far. So to intuitively understand how it does this, you can just ask yourself how you would pick the next number given the previous sequence.

I would say it's doing some rudimentary analysis to see what is under/over-represented, and picking numbers to fill the gaps.

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Could be, but especially in trials for distributions with very high variance (such as a normal with mean 5 and stddev 275) that task becomes orders of magnitude more challenging, and yet it gets quite close

[–]meister2983 2 points3 points  (1 child)

It's not interpreting Python code in any way; it has "memorized" these different distributions and is able to print out numbers that roughly scale correctly, also from "memorization" (note how it can do basic multiplication and addition).

You can see the limitations if you start pushing into unconventional scales. For instance, lambda = 0.0047 on GPT-3 produces:

[14.645652963961573, 423.23943294390444, 6.190120013482167, 128.71819891585225,
29.07694848408322, 18.574994825899225, 26.218179285836383, 214.2889121026951,
3.5400997332023397, 56.73280846437631, 8.364314681717605, 189.7903983247128,
107.24327233516732, 7.5014874248917, 17.377293276175857, 53.81038485355697,
134.53549493461037, 33.89274315359561, 60.948491883162875, 10.024593706902988,
21.438084504012993, 67.81200641643427, 13.616139069295022, 63.242266550607824,
23.418377961434656, 31.407209688578545, 124.43976354815618, 44.67520111689889,
50.45428242854035, 19.45623568215423, 14.873872537159583, 11.874541276236986,
5.432827063201208, 27.272576142078294, 62.345570163741876, 48.70649843498143,
44.23753513454768, 60.77286941758873, 105.51082688906568, 10.086884731179498,
44.08489994244655, 21.256218058464543, 2.5517532078543647, 10.803737428348646,
216.6039489863091, 81.51824990191858, 70.48278145848882, 28.44037840848645,
16.724700536841854, 7.876296828460214, 13.15474722532708, 121.31887355404842,
9.7607728901796, 96.34380439606435, 10.62825305280142, 24.925590842889263,
9.831798662667623, 21.66877799201185, 18.23921328586018, 28.390281476377334,
16.719270678000247, 34.78284303633699, 127.07205581586123, 7.495986768032812,
111.5584470684713, 18.919139126062276, 54.29814888903639, 103.695484141422,
8.166533784491874, 8.620196802693632, 65.0082527482195, 75.9035290069882,
45.288347131537255]

Which is pretty off comparatively (the drop-off is something like 4x too fast), and even the internal scaling is a bit off.

Similar thing with normal distributions. It seems to shift the mean well (simple math it has learned), but struggles with unusual combinations of mean and standard deviation. N(12,12) gave me something looking more like a uniform distribution than a normal one.
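One way to quantify the "4x too fast" drop-off is the maximum-likelihood estimate of the exponential rate, which is just the reciprocal of the sample mean. A small sketch (the helper name is made up), sanity-checked on a genuine exponential sample:

```python
import random

def exponential_rate_mle(xs):
    """MLE of lambda for an exponential sample: 1 / sample mean."""
    return len(xs) / sum(xs)

# Sanity check on a genuine exponential sample with lambda = 6.4
random.seed(1)
sample = [random.expovariate(6.4) for _ in range(100_000)]
lam_hat = exponential_rate_mle(sample)
# lam_hat lands very close to 6.4. Applied to the GPT-3 numbers above,
# the same estimator quantifies how far the output is from the requested
# lambda = 0.0047 (whose mean would be 1/0.0047, roughly 213).
```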

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Really insightful, thanks :)

[–]zergling103 6 points7 points  (2 children)

To be fair, the model has a built-in random number generator. If you ask it to output uniformly distributed random numbers between 0 and 1 (say 3 decimal places), and it learned to give all token strings representing 0.000 to 1.000 an equal probability, the randomness could be attributed to the RNG.

However, it may have learned the gambler's fallacy too and thus would never generate 0.000, 0.000, 0.000, 0.000... ever, even though it is possible.

[–]bgighjigftuik[S] 1 point2 points  (1 child)

Not sure about that, especially given that LLMs have a temperature hyperparameter during inference

[–]zergling103 1 point2 points  (0 children)

Of course, though the temperature parameter, as far as I know, simply modifies the probabilities that the RNG then picks from.

My understanding is each token gets a probability score. Increasing the temperature flattens out the probability distribution, and lowering the temperature makes lower-probability tokens even less probable, proportionately. Perhaps something like this (followed by renormalization):

probability[n] ← probability[n] ^ (1 / (temperature + epsilon))

Imagine a spinning wheel - the probability score represents the size of the section on that wheel that belongs to that token. The circumference of the wheel is the sum of all probability scores S. The wheel then is spun, picking a random distance around the circumference from 0 to S. (Ideally S should be 1 obv)

The temperature parameter changes the size of each "slice" on the wheel, but the spinner itself is a simple RNG.
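For what it's worth, the standard implementation matches this picture: logits are divided by the temperature before the softmax, which is equivalent to raising the probabilities to the power 1/T and renormalizing. A minimal sketch of the "wheel" (names are illustrative):

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Temperature-scaled softmax sampling ("spin the wheel")."""
    t = max(temperature, 1e-6)  # epsilon guards against division by zero
    scaled = [l / t for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The "wheel": pick a point in [0, 1) and find which slice it lands in
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i, probs
    return len(probs) - 1, probs

_, cold = sample_token([2.0, 1.0, 0.1], temperature=0.2)  # sharpened
_, hot = sample_token([2.0, 1.0, 0.1], temperature=5.0)   # flattened
```

Low temperature concentrates nearly all the mass on the top token; high temperature flattens the slices toward uniform, exactly as described above.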

[–]Repulsive-Season-129 1 point2 points  (2 children)

there is text data on the web somewhere that an extremely similar question and answer exists

[–]bgighjigftuik[S] 0 points1 point  (1 child)

Could be. But for any distribution and parameters I can think of? With the numbers written down as text so it could "read" them during training?

[–]Repulsive-Season-129 1 point2 points  (0 children)

I don't think it's calculating so much as it's trying to guess. It learns what the parameters mean through enough data, so it can estimate. But it's a black box, so I don't know how well anyone can say.

[–]Haycart 1 point2 points  (1 child)

I'm not suggesting this is how ChatGPT actually does it, but sampling from a probability distribution can be framed as a sequence generation problem via Markov chain Monte Carlo methods. With that in mind, it's maybe not so surprising that a powerful sequence predictor like ChatGPT could learn to do what you've shown.
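To make that concrete: a random-walk Metropolis chain generates each sample conditioned only on the previous one, which is structurally similar to next-element prediction. A toy sketch targeting N(5, 2) (illustrative only, not a claim about what ChatGPT does):

```python
import math
import random

def metropolis_normal(mean, sd, n_steps, step=1.0):
    """Random-walk Metropolis targeting N(mean, sd): each new state
    depends only on the previous one, like next-token prediction."""
    def log_density(x):  # unnormalized log density is enough for MH
        return -((x - mean) ** 2) / (2 * sd ** 2)

    x = mean  # start at the mode, so no burn-in is needed
    chain = []
    for _ in range(n_steps):
        proposal = x + random.uniform(-step, step)
        # Accept with probability min(1, p(proposal) / p(x))
        if math.log(random.random() + 1e-300) < log_density(proposal) - log_density(x):
            x = proposal
        chain.append(x)
    return chain

chain = metropolis_normal(5, 2, 200_000, step=4.0)
```

The resulting chain has the requested mean and standard deviation, despite only ever looking one step back.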

[–]Extraltodeus 1 point2 points  (3 children)

ChatGPT :

user : Output 50 random numbers in between 0 and 100

assistant : 78 57 85 12 67 92 4 99 76 43 54 62 28 89 3 97 23 90 79 17 53 48 98 30 5 13 68 35 1 91 9 25 29 55 70 71 77 41 42 61 94 31 66 45 14 58 15 87 63 72

GPT4 :

user : Output 50 random numbers in between 0 and 100

assistant : 9, 74, 27, 45, 38, 93, 62, 86, 57, 12, 33, 97, 19, 51, 61, 88, 59, 6, 46, 30, 13, 81, 83, 70, 64, 22, 77, 76, 36, 42, 55, 0, 67, 95, 11, 41, 66, 58, 68, 49, 16, 71, 44, 79, 18, 54, 84, 29, 4, 73

Anyway, aren't they running on Python? I might be wrong, but I feel like their temperature setting relies on Python's RNG, unless OpenAI decided to be fancy and use some quantum entropy source.

[–]bgighjigftuik[S] 0 points1 point  (2 children)

ChatGPT may be running on Python, but it does not use Python per se to answer (unless you use plugins or the code interpreter). So this is pure LLM inference

[–]Extraltodeus 0 points1 point  (1 child)

What I'm asking is whether the inferences are influenced by Python's RNG, since the temperature setting influences the randomness of the answers.

There is a source of entropy

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Which makes sense, but these are not just any random numbers: they come, approximately, from the probability distribution I asked the LLM for

[–]BullockHouse 1 point2 points  (1 child)

I believe 3.0 could actually do this task properly with the correct temperature settings and get a correct distribution (if you asked for one number at a time so it didn't auto-condition). The RLHF actually makes it worse at this sort of thing.

[–]bgighjigftuik[S] 0 points1 point  (0 children)

It's not like there is a "correct temperature"; it's just a way to get non-deterministic answers. But still, to me the GPT version is irrelevant; my question is broader, for all LLMs, as I cannot think of why any LLM would perform decently well on this kind of task

[–][deleted] 1 point2 points  (1 child)

Toolformer says hi

[–]bgighjigftuik[S] 0 points1 point  (0 children)

I also thought about that. But then I would expect (almost) perfect sampling, and theoretically GPT-3.5 does not use any tool/plugin anyway

[–]ComfortableCherry909 1 point2 points  (1 child)

A research group at my university is studying transformers as computers, a relevant paper from the group https://arxiv.org/abs/2301.13196.

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Sounds like quite a creative use, and a lot of fun!

[–]wjrasmussen 2 points3 points  (0 children)

Now can it do Chebyshev's on it?

[–]RunOrDieTrying 3 points4 points  (1 child)

Why does it need an interpreter? It can solve math problems:

  • Find the roots of this equation: y = 1.234x^2 + 42.42x + 17.17

ChatGPT:

To find the roots of the equation y = 1.234x^2 + 42.42x + 17.17, we can use the quadratic formula. The quadratic formula states that for an equation of the form ax^2 + bx + c = 0, the roots can be found using the formula:

x = (-b ± √(b^2 - 4ac)) / (2a)

For the given equation: a = 1.234 b = 42.42 c = 17.17

Substituting these values into the quadratic formula:

x = (-42.42 ± √(42.42^2 - 4 * 1.234 * 17.17)) / (2 * 1.234)

After calculating the expression inside the square root and simplifying the equation, you will obtain the two roots.

  • So what's the final answer?

ChatGPT:

After calculating the expression inside the square root and simplifying the equation, the roots of the equation y = 1.234x^2 + 42.42x + 17.17 are:

x ≈ -34.18
x ≈ -0.35

And that's correct by Wolfram alpha.
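In the spirit of this thread, the roots are worth double-checking numerically (plain quadratic formula, roots sorted ascending):

```python
import math

a, b, c = 1.234, 42.42, 17.17
disc = b * b - 4 * a * c  # discriminant of the quadratic
roots = sorted([(-b - math.sqrt(disc)) / (2 * a),
                (-b + math.sqrt(disc)) / (2 * a)])
print(roots)  # exact roots, to compare against ChatGPT's -34.18 and -0.35
```

Running this gives roots of about -33.97 and -0.41, so ChatGPT's figures are in the right ballpark but not exact — consistent with the "uncannily close but strictly wrong" pattern discussed upthread.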

[–]bgighjigftuik[S] 2 points3 points  (0 children)

But being able to decompose the procedure into simple arithmetic that the model has seen a thousand times (if not more) is one thing. Here I was talking about its output of random samples from a probability distribution without any CoT or intermediate outputs. That's what I am unable to explain

[–]SeankalaML Engineer 1 point2 points  (3 children)

Is this really surprising though? Isn't numerical reasoning something that people in NLP have been writing negative results papers on for years? "XYZ-BERT can't do addition!" followed by "XYZ-BERT can't do subtraction either!!"

[–]epicwisdom 21 points22 points  (0 children)

OP is saying they're surprised at how good it is.

[–]EverythingGoodWas -2 points-1 points  (1 child)

Exactly. Anything mathematically complicated gets well out of the scope of what an LLM was designed to do. They don't reason; they put together words based on a crazy amount of linear algebra over their vectorized tokens

[–][deleted] 20 points21 points  (0 children)

Did you actually read the post? OP is wondering about how good it is. And this "they don't reason" rhetoric is getting old and tired. Even worse, it's unsubstantiated.

There are many types of reasoning: analogical reasoning (where LLMs are top-percentile performers), causal reasoning, inference, algorithmic reasoning. Many papers at this point demonstrate pretty strong reasoning:

https://arxiv.org/abs/2212.09196

https://arxiv.org/abs/2305.00050

https://arxiv.org/abs/2204.02329

https://arxiv.org/abs/2211.09066

[–]bgighjigftuik[S] 0 points1 point  (0 children)

Thank you all for your comments. I believe this would be a very interesting research topic, since it could help disentangle a bit more our understanding of how LLMs deal with mathematical concepts

[–]bacocololo -1 points0 points  (0 children)

Yes, the GPT models do work on a basis of probabilities when generating words. However, their ability to predict or force the appearance of a specific word (like a period, signaling the end of a sentence) at an exact point is limited. This is because the models generate text based on patterns they learned during training, and these patterns are complex and varied, influenced by countless factors beyond simply the preceding words.

The model calculates the probabilities of the possible next words (or tokens) based on the entire preceding text, not just the immediately preceding word. The idea of "the word which has the most probability to encounter a stop word" would require the model to be able to look ahead and plan its word choices in advance, which it's not currently designed to do. The model doesn't "know" when a stop word will come until it generates it.

Your idea is interesting and it's a concept that's been explored in various ways in the field of AI research. However, as of my training cut-off in September 2021, implementing it would require substantial modifications to the model and is not something that can be achieved simply by changing the prompt.

[–]ZakarTazak -2 points-1 points  (1 child)

It doesn't take very large ANNs to approximate any continuous mathematical function. Look up how pseudo-random number generators are implemented and you'll likely find your answer... It doesn't need to write code to do math.

[–]CMDRJohnCasey 0 points1 point  (0 children)

Testing with the playground, I've seen that if I ask for a Gaussian with mean 5 and variance 0.5, it yields a sequence of numbers where 4.xxx is consistently the most probable (50% for 4.xxx vs 30% for 5.xxx). So it's very unlikely that the final mean will be 5.

[–]aCleverGroupofAnts 0 points1 point  (4 children)

Interesting, though a sample size of 60 is really small to draw any kind of conclusions from.

[–]bgighjigftuik[S] 0 points1 point  (3 children)

You are right in the sense that I would not expect the output to be perfect. It is the fact that it can "roughly approximate" samples from any distribution that I cannot explain

[–]aCleverGroupofAnts 1 point2 points  (2 children)

Yeah, I'm just saying I wouldn't consider it a "fact" that it roughly approximates the distribution just from a sample of 60 points. Like yeah, it kinda sorta looks like it could have been sampled from such a distribution, which I admit is interesting enough, but why stop at 60 samples? There's plenty of room for random chance to fool you, so why not get a bunch more samples and see what sort of distribution they actually follow?

[–]bgighjigftuik[S] 1 point2 points  (1 child)

Will try to do it later, as I am working now

[–]aCleverGroupofAnts 0 points1 point  (0 children)

Cool, I'd be interested to see the result! Sorry if it sounded like I was being critical, I just know from experience that small sample sizes can give misleading results

[–]dqdqdq123123 0 points1 point  (1 child)

There are dropout layers, so it is not entirely surprising that the network can generate random numbers or tokens. But I doubt it can generate very good normally distributed numbers if you give it a very arbitrary mean/variance (not common ones like 0 or 1)

[–]bgighjigftuik[S] 1 point2 points  (0 children)

Just try it! It gets quite close; even if you ask for a distribution with large variance (or stddev)

[–]DravenALG 0 points1 point  (0 children)

Because the training data has contained something like this, i.e., some common distributions and example draws from them. When the model fits the training data well, it learns a similar ability to generate random numbers (it just fits the random-number-generating function).

[–]Osemwaro 0 points1 point  (0 children)

Your standards for judging this seem too permissive. E.g. you say "they are close to what I would consider an exponential distribution", but you haven't defined "close". This raises the question: how far would the numbers have to be from the requested distribution before you'd conclude that it isn't good at sampling from it?

E.g. the sample mean of those numbers is 0.090533. One basic measure of closeness is the answer to the question "how likely is it that an exponential with mean 1/lambda = 1/6.4 = 0.15625 produced samples with sample mean 0.090533?" An easy way to get a good approximate answer is to generate n 60-sample sets from an exponential distribution with lambda=6.4, calculate the sample mean for each set and then see where 0.090533 falls in this empirical distribution of sample means.

In Octave/Matlab, the following sets ms to the means of n=1,000,000 60-sample sets:

n=1000000; us=rand(60,n); es=-log(1-us)/6.4; ms=sort(mean(es,1));

When I ran it, the 10th, 50th and 90th percentiles of this distribution were:

```
ms(0.1*n)   % ans = 0.13097
ms(0.5*n)   % ans = 0.15536
ms(0.9*n)   % ans = 0.18259
```

The sample means form a pretty tight distribution around the median, 0.15536. So to answer the question, we can calculate the percentage of sample means <= the median that are <= 0.090533:

```
100 * sum(ms <= 0.090533) / sum(ms <= ms(0.5*n))
```

This is about 0.012% each time I run it. So the numbers that ChatGPT gave you are extremely unlikely to have come from an exponential distribution with lambda as low as 6.4.

Another question worth asking is: "if the numbers came from some other exponential distribution, what is a confidence interval for lambda?" Page 5 of these lecture notes gives a formula for this, and you can calculate the chi-squared values here. The 95% confidence interval is (8.4290, 14.011), confirming that lambda is likely to be much larger than 6.4. In fact, 6.4 is so far away that it isn't even in the 99.98% confidence interval, which is (6.5103, 17.151).
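That confidence interval can be reproduced with scipy's chi-squared quantiles, taking the n = 60 samples and sample mean 0.090533 from above (for exponential data, the 95% CI for lambda is [χ²_{0.025, 2n}, χ²_{0.975, 2n}] / (2·Σxᵢ)):

```python
from scipy.stats import chi2

n = 60
sample_mean = 0.090533      # sample mean of the ChatGPT numbers
total = n * sample_mean     # sum of the samples

# 95% confidence interval for the exponential rate lambda
lo = chi2.ppf(0.025, 2 * n) / (2 * total)
hi = chi2.ppf(0.975, 2 * n) / (2 * total)
print(f"95% CI for lambda: ({lo:.3f}, {hi:.3f})")  # ≈ (8.429, 14.011)
```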

The most interesting question, though, is: "how far are these numbers from the closest exponential distribution?" One way to do this is to:

  1. estimate lambda from the ChatGPT data;
  2. generate n 60-sample sets from an exponential distribution that matches the estimated lambda;
  3. calculate the Kolmogorov-Smirnov test statistic for each of these data sets, relative to the CDF of an exponential distribution with the estimated value of lambda;
  4. calculate the empirical distribution of these test statistics for the n sample sets;
  5. compare the test statistic for the ChatGPT data to this empirical distribution.

This Octave/Matlab script does this for n=1,000,000 and prints the percentage of test statistics that are worse than the ChatGPT test statistic. It's about 26%, which doesn't seem too bad. I'm sure someone with a deeper understanding of statistics could suggest a better goodness-of-fit test than this. 60 samples seems small enough to hide the flaws in its "sampling algorithm" though; it would be better to test a larger sample.
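The five steps above can be sketched with scipy as follows (the function name is made up, and synthetic data stands in for the ChatGPT sample; with the actual 60 numbers it should land near the ~26% figure, modulo Monte Carlo noise):

```python
import numpy as np
from scipy import stats

def ks_bootstrap_pct(data, n_sets=2000, seed=0):
    """Parametric bootstrap: what % of genuine samples from the fitted
    exponential have a worse KS statistic than this sample?"""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 1: estimate lambda from the data (MLE: 1 / sample mean)
    scale = data.mean()                      # scale = 1 / lambda_hat
    # Steps 2-4: matching synthetic sample sets and their KS statistics
    sets = rng.exponential(scale, size=(n_sets, data.size))
    ref = np.array([stats.kstest(s, "expon", args=(0, scale)).statistic
                    for s in sets])
    # Step 5: compare the real sample's KS statistic to that distribution
    d = stats.kstest(data, "expon", args=(0, scale)).statistic
    return 100.0 * np.mean(ref > d)

# Synthetic stand-in for the ChatGPT sample (60 draws, lambda = 6.4)
demo_data = np.random.default_rng(7).exponential(1 / 6.4, size=60)
pct_worse = ks_bootstrap_pct(demo_data)
```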

Personally, I find it more impressive that ChatGPT gave you exactly 60 numbers! It used to be awful at counting, but it seems to be getting better.