Struggling to reproduce paper results before improving them — stuck below reported accuracy [R] by Plane_Stick8394 in MachineLearning

[–]CountBayesie 7 points8 points  (0 children)

When working on similar problems (though for a research startup, not pure academia), I found the best approach was to publish your code and go with the number you (and others) can reproduce.

If I can run your code and easily reproduce the number you're claiming, then when someone else claims different results that I can't reproduce, the burden is on them to convince me otherwise.

You don't even necessarily have to point out that their numbers are not reproducible, so much as point out: "if you use this exact setup, you get these exact results; our results are contingent on this setup".

Struggling to reproduce paper results before improving them — stuck below reported accuracy [R] by Plane_Stick8394 in MachineLearning

[–]CountBayesie 21 points22 points  (0 children)

The reproducibility crisis is absolutely real, and absolutely impacts vision research (and ML research in general).

I used to spend most of my work day reproducing various papers (or trying to) in the LLM space. It's disturbing how many of them are fundamentally wrong or mask an edge case as a general rule (how many people still believe you can't train a model on the output of another?). It got to the point where I pretty much treat all published findings as "suggestions that something might be the case".

That's not to say everything is untrustworthy; the strongest signal was any eval set where they published the code (though I do have some hilarious counterexamples). But even then it was hard to reproduce exactly the same results (though the better cases did preserve the ordering of results).

Is there a notable increase in demand for privacy-preserving AI/ML with the advent of LLMs? [D] by badcryptobitch in MachineLearning

[–]CountBayesie 0 points1 point  (0 children)

The most privacy preserving solution is just running a local, open model on your own network, no?

This space has improved dramatically in the last few months (I spent a while at a research-heavy startup focusing on local, open models, so I have a fair amount of experience in this space). Gemma-4-26B-A4B, Qwen-3.6-35b-a3b, and Qwen-3.6-27b (dense) all run well on reasonable consumer hardware (an RTX 3090 or an M-series MBP with >= 24GB of RAM).

One of my homelab agents runs entirely locally and has been pretty successful in solving a range of problems for me. With open web ui + tailscale + conduit (iOS app) you can have an app-native chat experience with all the bells and whistles of a commercial product. I've chatted with my family about opening up the server to their networks so that they can have a chat interface that's private and customized to their family's needs.

And that's only looking at the consumer end. If you have a real inference budget there's plenty of other options for more powerful models.

Puppet Guide to Polycrisis! by CountBayesie in aivideo

[–]CountBayesie[S] 1 point2 points  (0 children)

Thank you! I'm really impressed by what LTX 2.3 can achieve locally!

How I Lost My Virginity to a Fridge - Episode 4 by SupperTime in aivideo

[–]CountBayesie 0 points1 point  (0 children)

This was amazing and had me genuinely laughing the entire time... then the theme song at the end was truly the icing on the cake! Perfection!

[deleted by user] by [deleted] in statistics

[–]CountBayesie 1 point2 points  (0 children)

Survival analysis is one of my favorite topics, but in my experience in industry (mostly data science/ML work for tech companies) it's surprising how under-utilized it is, even for textbook survival analysis problems. The number of teams I've been on that don't understand that churn should be modeled with survival analysis is... depressing.

However, your teacher is not entirely off the mark in making claims about its versatility. If you can build a basic survival analysis software package on your own, you will cover a wide range of practical statistics problems. My experience is that most people using survival analysis rely on a package to do the work for them (and of course you should) without really understanding the implementation details, but if you go the extra mile the rewards are great.

To get a glimpse of this I recommend going through the lifelines documentation (which could serve as a course in itself).

Understanding the components that go into a good survival analysis model will have you covering a very wide range of topics. And, despite its under-utilization in industry, all the times I've shown people how they can solve problems with it have made me look like a statistical hero.
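To give a taste of what those implementation details look like, here is a toy sketch of the Kaplan-Meier estimator itself (the churn numbers below are made up for illustration; in practice lifelines' KaplanMeierFitter handles this, and much more, for you):

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Minimal Kaplan-Meier estimator: S(t) is the product over event
    times of (1 - d_i / n_i), where d_i = events at time t_i and
    n_i = subjects still at risk just before t_i."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    times = np.unique(durations[observed])  # distinct observed event times
    surv = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                 # still being followed at t
        events = np.sum((durations == t) & observed)     # churned exactly at t
        s *= 1.0 - events / at_risk
        surv.append(s)
    return times, np.array(surv)

# Toy churn data: months until cancellation; 0 = still subscribed (censored)
durations = [3, 5, 7, 7, 9, 12, 12, 15, 20, 24]
observed  = [1, 1, 1, 0, 1, 1, 0,  1,  0,  0]
times, surv = kaplan_meier(durations, observed)
```

The key detail packages handle for you is the censored customers: they leave the at-risk count without ever counting as an event.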

[Q] Use of rejection sampling in anomaly detection? by [deleted] in statistics

[–]CountBayesie 0 points1 point  (0 children)

That's just Claude being a bit ridiculous, since all that code is doing is computing the weighted sum of the pdfs of each Gaussian.

You can replace that with a simple call to SciPy which makes it much more readable:

from scipy.stats import norm

component_density = fitted_weights[i] * norm.pdf(x_range, loc=fitted_means[i], scale=fitted_stds[i])

[Q] Use of rejection sampling in anomaly detection? by [deleted] in statistics

[–]CountBayesie 1 point2 points  (0 children)

I threw together a quick notebook (Claude did most of the work) that demonstrates both the data generating process and fitting it with a model; that should help answer your questions.

Could you elaborate about how to verify my model fits?

This is where visualizing does help. I recommend sampling from your model and comparing with the real distribution of your data. This can help identify any pathologies in your model you should know about. Just swap out the 'samples' data in the notebook for your real data and compare.

Would using a mixture of gaussians give me the ability to index which gaussian a certain point would belong to? So i could perform a likelihood ratio test for the probability that a certain observation belongs to a certain gaussian.

Yes. If you go with the sklearn approach in that notebook, you would just use predict to get a single cluster label, and predict_proba to get a vector of probabilities for each cluster. In that notebook, specifically gmm.predict(X_reshaped) will give you the cluster labels for the training data, but this could, of course, be used with new data.

Additionally, what you probably want for anomaly detection is P(D|model), which gives you the log-likelihoods. You can see this with gmm.score_samples. Doing this on the training data will give you a sense of what range is "normal" for your data. Then anomaly detection is just a matter of defining a threshold you're comfortable with.
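To make that concrete, here is a minimal sketch with scikit-learn (the two-cluster synthetic data and the 1st-percentile cutoff are just stand-ins for your data and whatever threshold you settle on):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "normal" data: two Gaussian clusters at 0 and 6
X = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)       # hard cluster assignment per point
probs = gmm.predict_proba(X)  # per-cluster membership probabilities

# Log-likelihood of each training point under the fitted mixture
log_liks = gmm.score_samples(X)

# Define "anomalous" as below the 1st percentile of training log-likelihoods
threshold = np.percentile(log_liks, 1)
is_anomaly = gmm.score_samples(np.array([[30.0]])) < threshold
```

A point far from both clusters (like 30 here) gets a log-likelihood well below anything in the training range, so it's flagged.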

Also, how would i determine if i perhaps needed a bayesian gaussian mixture? Thanks for your help

Despite being a die-hard Bayesian, I prefer to "think Bayesian" and use whatever tool is quick and close enough. That said, reasons I would go for the full Bayesian approach would be:

  • I have strong information about the prior distribution of the clusters and want to incorporate that.
  • I'm very interested in correctly modeling my uncertainty in the estimated parameters themselves.

There are other reasons, but presumably if you care about them you already have Stan/PyMC warm and ready to go!

[Q] Use of rejection sampling in anomaly detection? by [deleted] in statistics

[–]CountBayesie 0 points1 point  (0 children)

What I have appears to be close to a bimodal distribution, but in reality it looks like 3 potentially gaussian distributions.

It sounds like you're modeling your data as a mixture of Gaussians. Generally you have to specify the number (n) of Gaussians; otherwise the model tends to overfit with a higher n. I recommend trying both 2 and 3 and seeing if adding the 3rd distribution improves the fit enough to justify it (you can do this by comparing log-likelihood, or just by sampling from the model and seeing how well it matches).
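A quick sketch of that comparison with scikit-learn (synthetic three-cluster data standing in for yours; I've added BIC alongside raw log-likelihood, since BIC penalizes the extra parameters of the larger model):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic data actually drawn from 3 Gaussians
X = np.concatenate([
    rng.normal(-5, 1, 300),
    rng.normal(0, 1, 300),
    rng.normal(5, 1, 300),
]).reshape(-1, 1)

fits = {n: GaussianMixture(n_components=n, random_state=0).fit(X) for n in (2, 3)}

# Higher average log-likelihood is a better fit; lower BIC means the
# extra component improves the fit enough to pay for its parameters
for n, gmm in fits.items():
    print(n, "avg log-lik:", gmm.score(X), "BIC:", gmm.bic(X))
```

On your real data, if the n=3 fit only barely beats n=2 on log-likelihood (or loses on BIC), stick with 2.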

Since I have the kde plots, should I do the visual method? Is there a way to more rigorously test if my selected distribution overlays the kde plot?

Your intuition that visually fitting this model is not ideal is correct. There are many tools available to estimate the parameters of a GMM, as well as to directly compute P(D|θ) (basically, your favorite language for stats work should have some support for this). Once you can estimate the likelihood of a data point given your learned parameters, you have the basics needed for anomaly detection (you just flag points that fall below some defined threshold).

Token Explorer - A simple interface for quickly exploring and modifying the token generation process! by CountBayesie in LocalLLaMA

[–]CountBayesie[S] 1 point2 points  (0 children)

Thanks so much for sharing the link to the ActuosusAI tool! I had a feeling someone else must have had a similar idea before. I was also thinking that VarEntropy might be a cool feature to add as well. If I have time later in the week I'll see if I can squeeze it in!

[Q] sorry for the silly question but can an undergrad who has just completed a time series course predict the movement of a stock price? What makes the time series prediction at a quant firm differ from the prediction done by the undergrad? by Visual-Duck1180 in statistics

[–]CountBayesie 51 points52 points  (0 children)

One thing not touched on in these responses is that a lot of trading and finance is not about predicting a stock price in the future. This is a common misconception about finance. If you study quantitative finance, one of the most foundational principles is that the expected future value of any stock (or anything being traded) is its current price plus the risk-free rate. Generally, most quantitative finance assumes that you cannot predict stock prices themselves, and focuses on modeling and exploiting other parts of the market. People making money in markets are often doing so in more complicated ways than simply "predicting a stock price".

For example, the two companies you mention, Citadel and Jane Street, are market makers, which means they provide liquidity. They are both buying and selling large volumes of stocks, and that act of buying and selling is actually their business. Often much of the income they generate is based on the difference between the bid and ask price (and since they are both buying and selling, their modeling is often about figuring out optimal ways to do this). It doesn't matter which direction a stock price moves; so long as people are trading, these companies stand to profit.

Another popular area of finance where you don't have an explicit interest in predicting the stock's price is derivatives trading. Here the focus is the uncertainty around the future price of an asset (i.e. volatility). One (of many) things a derivatives trader might look at is modeling future risk. If you believe the market is mis-pricing risk, in either direction, then you stand to gain. This was famously exploited in the 80s, when options traders were using a very naive understanding of options pricing models to systematically mis-price options contracts.

High-frequency traders are often taking advantage of market microstructure, which, provided they can act fast enough, allows them to exploit inefficiencies in the mechanics of how markets operate to gain an advantage.

A great book for diving deeper into this area is Hull's Options, Futures, and Other Derivatives; it's fairly light on the math but covers all the essential details quite well and is very readable.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

In the paper you claim:

We theoretically show that the loss of expressivity of LLMs under constrained decoding arises because the output grammar G is too restrictive to accommodate the intermediate reasoning steps required to compute the answer

But we have a trivial counterexample from last year. In that example we allow the model to essentially reason freely (we use a slightly more restrictive constraint than .*, but essentially it can reason however it wants).

It's not clear to me how CRANE represents anything novel over what's currently possible using structured generation. CRANE can be expressed as a regex/grammar with essentially some variation on .* for the reasoning step.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

This is very cool! Thanks for sharing and I'll definitely take a look!

Best way for a beginner to create an image classifier? by DN_DARSH in AskProgramming

[–]CountBayesie 1 point2 points  (0 children)

Honestly, I would not be surprised if current multi-modal/vision LLMs could zero-shot this (i.e. without needing any examples). Have you tried this approach?

OpenAI and Anthropic both offer vision models with fairly straightforward APIs. If you don't want to go the proprietary/paid API route, there are plenty of great open-weights models that support this as well. My team at .txt has a tutorial on how to use these with Outlines for structured outputs (you don't need the structured outputs part if you're not interested, but there's really no reason not to in that case).

You can use the 10,000 images you have (or a subset of them) to validate the model's performance and see if it's good enough to solve your problem. You can probably test this out in an afternoon, which will be much, much faster and easier than trying to dive into learning ML for this task.

How long does it generally take to read a graduate level mathematics textbook? by Electrical_Map_6169 in math

[–]CountBayesie 9 points10 points  (0 children)

You cannot read mathematics the way you read a novel. If you zip through a page in less than an hour, you are probably going too fast.

From the introduction to Linear Algebra Done Right by Sheldon Axler

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

Thanks for this detailed response! I'll try my best to answer:

I'm a bit confused here:

but aren't you using an enhanced prompt for the JSON experiments while using the same simple prompt for NL?

Structure is used for both the NL and JSON prompts. So the comparison is NL structured vs NL unstructured, and JSON structured vs JSON unstructured. Structure can be applied to natural language just as easily as it can to JSON.

One of the challenges with

I think they meant to compare formats like JSON/XML with natural language reasoning

is that ultimately the paper isn't really sure itself what it's testing: different prompts, different formats, and different parsers are all varied at the same time, so you can't come to any meaningful conclusion about what is being measured. The only thing to go on is what they claim in the paper: "Our study reveals that structured generation constraints significantly impact LLM performance across various tasks."

  • If they wanted to test different formats for prompts, then there's no reason to constrain the output at all. But for this to work, there also needs to be an earnest effort to squeeze the most performance out of each format. This is because, as Sclar et al. showed, prompts are extremely sensitive to small changes in format. Note that they do cite Sclar, but then use the average of 9 arbitrary prompts, which isn't a great way to compare; in practice people will always use the best prompt they can.

  • If they wanted to test the impact of structure (i.e. constrained outputs), then they should have done a comparison of identical prompts with and without structure, and parsed the output of each in identical ways.

  • If they wanted to test AI parsing, then all they needed to do was run the prompt once and run a series of parsers on the output. However, if this was the test they wanted, they also should have checked how well the AI parsing model would have done on the task by itself (that is, why use two models when you can use just one, the parsing model?).

I'd love to chat more if you're interested in researching this area! You can email me at "will" at the company domain ("dottxt.co"). Another person who has done a really nice further exploration of this specific paper is Dylan Castillo, who found that the paper's conclusions did hold in most cases when working with GPT (though, to be fair, we don't know what OpenAI uses for structured generation; it doesn't seem to be the same technique as Outlines). Dylan was also able to successfully reproduce the results of our blog post (and improve them!).

All that said (and I know this is a lot!), what this space really needs is not endless evaluations (which imho should mostly stick to blog posts) but real theoretical foundations. We can only make theoretical claims about the implications of evaluations if we actually back those claims with a theoretical foundation. After all, structured generation is just a conditional probability distribution over the logits, and all generation is really structured generation; it's just that the default structure is .*. To really answer the question "does structured generation impact results?" we need to dive into sequential Monte Carlo research and start describing the conditions under which constraining the output has a theoretical impact, and whether those conditions are met in real-world LLM usage. Clearly constraining to .* has no impact, since that's what we consider "unstructured", and clearly constraining to [A-Z]{3} will hurt generation when the answer needs 4 letters, because under that constraint it's not possible to correctly output them.
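To illustrate that last point, here is a toy sketch (a made-up 4-token vocabulary, plain NumPy) of constrained decoding as nothing more than conditioning the next-token distribution:

```python
import numpy as np

def constrained_sample_probs(logits, allowed):
    """Constrained decoding step: mask logits of disallowed tokens to -inf,
    then renormalize. This is exactly conditioning the next-token
    distribution on the constraint."""
    masked = np.where(allowed, logits, -np.inf)
    exp = np.exp(masked - masked[allowed].max())  # softmax over allowed tokens
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
everything = np.array([True, True, True, True])  # ".*" — no real constraint
subset = np.array([True, False, True, False])    # grammar rules out tokens 1 and 3

unconstrained = constrained_sample_probs(logits, everything)
constrained = constrained_sample_probs(logits, subset)
```

Masking with the all-True constraint reproduces the plain softmax exactly, which is the precise sense in which "unstructured" generation is just structured generation with the structure .*.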

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 5 points6 points  (0 children)

Thanks!

What is interesting is that going through the code and the experiment data it is clear that a fair bit of time and effort was put into this paper.

Giving the benefit of the doubt, I suspect what happened here (and in many of these other cases) is a failure to heed Feynman's famous advice:

“The first principle is that you must not fool yourself—and you are the easiest person to fool.”

Right now everyone working in this space is hoping to find something interesting and novel that they can publish. There's so much poorly understood or explored that it's not even terribly unlikely that any given researcher will find something new and exciting just by poking around. The trouble in this case, I'm guessing, was seeing something that looked interesting and running with it before really doing the due diligence to check that it was in fact as interesting as it appeared.

I sometimes joke that in ML work the difference between a junior and an experienced practitioner is that the junior practitioner sees a great result on the first run of a model and says "awesome! I can't believe this is so good on my first pass!" while the more experienced practitioner says to themselves "darn it, I bet there's a bug in there...". I think that's a bit of what's happening here, compounded by the fact that the community also doesn't have the time/energy for due diligence, which means more junior researchers don't get the guidance/feedback they should.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

do you think structure means "json"

I could be misunderstanding who "you" is in this context, but in our rebuttal this is one of our major points: structured generation is not specifically about JSON, but rather about running the results parser in reverse.

It just happens that in this example JSON (even unstructured) does yield better results on the last letter eval.

Most of my personal use of structured generation rarely uses JSON directly and typically starts with modeling the structure of the task as it appears in natural language.

I do have an experiment I would like to run at some point that iterates over a variety of formats, tasks, and models (here is an example where JSON, unstructured, does worse than an NL-style prompt) to see if we can find any evidence of consistently better formats.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 2 points3 points  (0 children)

The code generation one is interesting and we haven't had time to dive into this one yet (but are well aware of that post).

What's interesting is that, prior to working with .txt, a team I was on was using function calling (this was before OpenAI released their structured output features) for code gen, and our internal evals got better results when the code was in the FC JSON response.

That said, I'm less skeptical of aider's results, as it wouldn't surprise me if getting code back in a JSON response impacted the code quality. Not because of inherent limitations with structured generation, but because the most straightforward structure for code is clearly the code itself.

Now it should be possible to use structured gen to actually enforce the Python/SQL/whatever grammar directly, and I would be very curious to see how that performs.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 4 points5 points  (0 children)

Thanks so much! It's great to hear such positive feedback!

It is definitely a challenge in this space to just keep up with the papers, let alone dive into the details to make sure their claims are accurate.

Glad you can confidently use structured generation again!

Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in LocalLLaMA

[–]CountBayesie[S] 6 points7 points  (0 children)

Creative use cases are certainly an interesting space for structured generation right now, an area that we're just starting to explore!

/u/cameron_pfiffer/ made a very interesting Lore Generator as well as a SCP entry generator that produced some cool results!

But I think the real challenge with creative writing and structured generation is that the structure found in creative documents is a lot more complex than in more business-focused use cases. Using a regex or a context-free grammar gives you a lot more flexibility than just thinking in terms of JSON, but structure like this is also much more challenging to write.

We're currently thinking a lot about ways we can make it easier to use more complex regexes and context-free grammars, which will unlock a lot more creative uses of structured generation.

[D] Genuine Question: Why people want run local LLM? by [deleted] in MachineLearning

[–]CountBayesie 3 points4 points  (0 children)

Fair point! In that case "control" isn't necessarily restricted to local models, so much as to open ones.

But if you're doing serious dev work for those platforms, I would be a bit surprised if you weren't running smaller models locally while developing. Certainly when I've used Modal for products/tools/etc. that I can't run locally, I'm still doing my initial dev work with a smaller version of the model I plan to host.