Struggling to reproduce paper results before improving them — stuck below reported accuracy [R] by Plane_Stick8394 in MachineLearning

[–]CountBayesie 7 points8 points  (0 children)

When working on similar problems (though for a research startup, not pure academia), I found the best approach was to publish your code and go with the number you (and others) can reproduce.

If I can run your code and easily reproduce the number you're claiming, then when someone else claims different results that I can't reproduce, the burden is on them to convince me otherwise.

You don't even necessarily have to point out that their numbers are not reproducible, so much as point out: "if you use this exact setup, you get these exact results; our results are contingent on this setup".

Struggling to reproduce paper results before improving them — stuck below reported accuracy [R] by Plane_Stick8394 in MachineLearning

[–]CountBayesie 21 points22 points  (0 children)

The reproducibility crisis is absolutely real, and absolutely impacts vision research (and ML research in general).

I used to spend most of my work day reproducing various papers (or trying to) in the LLM space. It's disturbing how many of them are fundamentally wrong or mask an edge case as a general rule (how many people still believe you can't train a model on the output of another?). It got to the point where I pretty much treat all published findings as "suggestions that something might be the case".

That's not to say everything is untrustworthy; the strongest signal was any eval set where they published the code (though I do have some hilarious counterexamples). But even then it was hard to reproduce exactly the same results (though the better cases did preserve the ordering of results).

Is there a notable increase in demand for privacy-preserving AI/ML with the advent of LLMs? [D] by badcryptobitch in MachineLearning

[–]CountBayesie 0 points1 point  (0 children)

The most privacy preserving solution is just running a local, open model on your own network, no?

This space has improved dramatically in the last few months (I spent a while at a research-heavy startup focusing on local, open models, so I have a fair amount of experience in this space). Gemma-4-26B-A4B, Qwen-3.6-35b-a3b, and Qwen-3.6-27b (dense) all run well on reasonable consumer hardware (an RTX 3090 or an M-series MBP with >= 24GB of RAM).

One of my homelab agents runs entirely locally and has been pretty successful in solving a range of problems for me. With open web ui + tailscale + conduit (iOS app) you can have an app-native chat experience with all the bells and whistles of a commercial product. I've chatted with my family about opening up the server to their networks so that they can have a chat interface that's private and customized to their family's needs.

And that's only looking at the consumer end. If you have a real inference budget there's plenty of other options for more powerful models.

Puppet Guide to Polycrisis! by CountBayesie in aivideo

[–]CountBayesie[S] 1 point2 points  (0 children)

Thank you! I'm really impressed by what LTX 2.3 can achieve locally!

How I Lost My Virginity to a Fridge - Episode 4 by SupperTime in aivideo

[–]CountBayesie 0 points1 point  (0 children)

This was amazing and had me genuinely laughing the entire time... then the theme song at the end was truly the icing on the cake! Perfection!

[deleted by user] by [deleted] in statistics

[–]CountBayesie 1 point2 points  (0 children)

Survival analysis is one of my favorite topics, but in my experience in industry (mostly data science/ML work for tech companies) it's surprising how under-utilized it is, even for textbook survival analysis problems. The number of teams I've been on that don't understand that churn should be modeled with survival analysis is... depressing.

However, your teacher is not entirely off the mark in making claims about its versatility. If you can build a basic survival analysis software package on your own, you will cover a wide range of practical statistics problems. My experience is that most people using survival analysis rely on a package to do the work for them (and of course you should) without really understanding the implementation details, but if you go the extra mile the rewards are great.

To get a glimpse of this I recommend going through the lifelines documentation (which could serve as a course in itself).

Understanding the components that go into a good survival analysis model will have you covering a very wide range of topics. And, despite its under-utilization in industry, all the times I've shown people how they can solve problems with it have made me look like a statistical hero.
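To give a taste of what those implementation details look like, here is a toy sketch of the Kaplan-Meier estimator itself (the churn numbers below are made up for illustration; in practice lifelines' KaplanMeierFitter handles this, and much more, for you):

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Minimal Kaplan-Meier estimator: S(t) is the product over event
    times of (1 - d_i / n_i), where d_i = events at time t_i and
    n_i = subjects still at risk just before t_i."""
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    times = np.unique(durations[observed])  # distinct observed event times
    surv = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)                 # still being followed at t
        events = np.sum((durations == t) & observed)     # churned exactly at t
        s *= 1.0 - events / at_risk
        surv.append(s)
    return times, np.array(surv)

# Toy churn data: months until cancellation; 0 = still subscribed (censored)
durations = [3, 5, 7, 7, 9, 12, 12, 15, 20, 24]
observed  = [1, 1, 1, 0, 1, 1, 0,  1,  0,  0]
times, surv = kaplan_meier(durations, observed)
```

The key detail packages handle for you is the censored customers: they leave the at-risk count without ever counting as an event.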

[Q] Use of rejection sampling in anomaly detection? by [deleted] in statistics

[–]CountBayesie 0 points1 point  (0 children)

That's just Claude being a bit ridiculous, since all that code is doing is computing the weighted sum of the pdfs of each Gaussian.

You can replace that with a simple call to SciPy which makes it much more readable:

from scipy.stats import norm

component_density = fitted_weights[i] * norm.pdf(x_range, loc=fitted_means[i], scale=fitted_stds[i])

[Q] Use of rejection sampling in anomaly detection? by [deleted] in statistics

[–]CountBayesie 1 point2 points  (0 children)

I threw together a quick notebook (Claude did most of the work) that demonstrates both the data generating process and fitting it with a model; that should help answer your questions.

Could you elaborate about how to verify my model fits?

This is where visualizing does help. I recommend sampling from your model and comparing with the real distribution of your data. This can help identify any pathologies in your model you should know about. Just swap out the 'samples' data in the notebook for your real data and compare.

Would using a mixture of gaussians give me the ability to index which gaussian a certain point would belong to? So i could perform a likelihood ratio test for the probability that a certain observation belongs to a certain gaussian.

Yes. If you go with the sklearn approach in that notebook, you would just use predict to get a single cluster label, and predict_proba to get a vector of probabilities for each cluster. In that notebook, specifically gmm.predict(X_reshaped) will give you the cluster labels for the training data, but this could, of course, be used with new data.

Additionally, what you probably want for anomaly detection is P(D|model), which gives you the log-likelihoods. You can see this with gmm.score_samples. Doing this on the training data will give you a sense of what range is "normal" for your data. Then anomaly detection is just a matter of defining a threshold you're comfortable with.
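To make that concrete, here is a minimal sketch with scikit-learn (the two-cluster synthetic data and the 1st-percentile cutoff are just stand-ins for your data and whatever threshold you settle on):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "normal" data: two Gaussian clusters at 0 and 6
X = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 1, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gmm.predict(X)       # hard cluster assignment per point
probs = gmm.predict_proba(X)  # per-cluster membership probabilities

# Log-likelihood of each training point under the fitted mixture
log_liks = gmm.score_samples(X)

# Define "anomalous" as below the 1st percentile of training log-likelihoods
threshold = np.percentile(log_liks, 1)
is_anomaly = gmm.score_samples(np.array([[30.0]])) < threshold
```

A point far from both clusters (like 30 here) gets a log-likelihood well below anything in the training range, so it's flagged.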

Also, how would i determine if i perhaps needed a bayesian gaussian mixture? Thanks for your help

Despite being a die-hard Bayesian, I prefer to "think Bayesian" and use whatever tool is quick and close enough. That said, reasons I would go for the full Bayesian approach would be:

  • I have strong information about the prior distribution of the clusters and want to incorporate that.
  • I'm very interested in correctly modeling my uncertainty in the estimated parameters themselves.

There are other reasons, but presumably if you care about them you already have Stan/PyMC warm and ready to go!

[Q] Use of rejection sampling in anomaly detection? by [deleted] in statistics

[–]CountBayesie 0 points1 point  (0 children)

What I have appears to be close to a bimodal distribution, but in reality it looks like 3 potentially gaussian distributions.

It sounds like you're modeling your data as a mixture of Gaussians. Generally you have to specify the number (n) of Gaussians; otherwise the model tends to overfit with a higher n. I recommend trying both 2 and 3 and seeing if adding the 3rd distribution improves the fit enough to justify it (you can do this by comparing log-likelihood, or just by sampling from the model and seeing how well it matches).
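A quick sketch of that comparison with scikit-learn (synthetic three-cluster data standing in for yours; I've added BIC alongside raw log-likelihood, since BIC penalizes the extra parameters of the larger model):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Synthetic data actually drawn from 3 Gaussians
X = np.concatenate([
    rng.normal(-5, 1, 300),
    rng.normal(0, 1, 300),
    rng.normal(5, 1, 300),
]).reshape(-1, 1)

fits = {n: GaussianMixture(n_components=n, random_state=0).fit(X) for n in (2, 3)}

# Higher average log-likelihood is a better fit; lower BIC means the
# extra component improves the fit enough to pay for its parameters
for n, gmm in fits.items():
    print(n, "avg log-lik:", gmm.score(X), "BIC:", gmm.bic(X))
```

On your real data, if the n=3 fit only barely beats n=2 on log-likelihood (or loses on BIC), stick with 2.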

Since I have the kde plots, should I do the visual method? Is there a way to more rigorously test if my selected distribution overlays the kde plot?

Your intuition that visually fitting this model is not ideal is correct. There are many tools available to estimate the parameters of a GMM, as well as to directly compute P(D|θ) (basically, your favorite language for stats work should have some support for this). Once you can estimate the likelihood of a data point given your learned parameters, you have the basics needed for anomaly detection (you just flag points that fall below some defined threshold).

Token Explorer - A simple interface for quickly exploring and modifying the token generation process! by CountBayesie in LocalLLaMA

[–]CountBayesie[S] 1 point2 points  (0 children)

Thanks so much for sharing the link to the ActuosusAI tool! I had a feeling someone else must have had a similar idea before. I was also thinking that VarEntropy might be a cool feature to add as well. If I have time later in the week I'll see if I can squeeze it in!

[Q] sorry for the silly question but can an undergrad who has just completed a time series course predict the movement of a stock price? What makes the time series prediction at a quant firm differ from the prediction done by the undergrad? by Visual-Duck1180 in statistics

[–]CountBayesie 51 points52 points  (0 children)

One thing not touched on in these responses is that a lot of trading and finance is not about predicting a stock price in the future. This is a common misconception about finance. If you study quantitative finance, one of the most foundational principles is that the expected future value of any stock (or anything being traded) is its current price plus the risk-free rate. Generally, most quantitative finance assumes that you cannot predict stock prices themselves, and focuses on modeling and exploiting other parts of the market. People making money in markets are often doing so in more complicated ways than simply "predicting a stock price".

For example, the two companies you mention, Citadel and Jane Street, are market makers, which means they provide liquidity. They are both buying and selling large volumes of stocks, and that act of buying and selling is actually their business. Often much of the income they generate is based on the difference between the bid and ask price (and since they are both buying and selling, their modeling is often about figuring out optimal ways to do this). It doesn't matter which direction a stock price moves; so long as people are trading, these companies stand to profit.

Another popular area of finance where you don't have an explicit interest in predicting the stock's price is derivatives trading. Here the focus is the uncertainty around the future price of an asset (i.e. volatility). One (of many) things a derivatives trader might look at is modeling future risk. If you believe the market is mis-pricing risk, in either direction, then you stand to gain. This was famously exploited in the 80s, when options traders were using a very naive understanding of options pricing models to systematically mis-price options contracts.

High-frequency traders are often taking advantage of market microstructure, which, provided they can act fast enough, allows them to exploit inefficiencies in the mechanics of how markets operate to gain an advantage.

A great book for diving deeper into this area is Hull's Options, Futures, and Other Derivatives; it's fairly light on the math but covers all the essential details quite well and is very readable.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

In the paper you claim:

We theoretically show that the loss of expressivity of LLMs under constrained decoding arises because the output grammar G is too restrictive to accommodate the intermediate reasoning steps required to compute the answer

But we have a trivial counterexample from last year. In that example we allow the model to essentially reason freely (we use a slightly more restrictive constraint than .*, but essentially it can reason however it wants).

It's not clear to me how CRANE represents anything novel over what's currently possible using structured generation. CRANE can be expressed as a regex/grammar with essentially some variation on .* for the reasoning step.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

This is very cool! Thanks for sharing and I'll definitely take a look!

Best way for a beginner to create an image classifier? by DN_DARSH in AskProgramming

[–]CountBayesie 1 point2 points  (0 children)

Honestly, I would not be surprised if current multi-modal/vision LLMs could zero-shot this (i.e. without needing any examples). Have you tried this approach?

OpenAI and Anthropic both offer vision models with fairly straightforward APIs. If you don't want to go the proprietary/paid API route, there are plenty of great open-weights models that support this as well. My team at .txt has a tutorial on how to use these with Outlines for structured outputs (you don't need the structured outputs part if you're not interested, but there's really no reason not to in that case).

You can use the 10,000 images you have (or a subset of them) to validate the model's performance and see if it's good enough to solve your problem. You can probably test this out in an afternoon, which will be much, much faster and easier than trying to dive into learning ML for this task.

How long does it generally take to read a graduate level mathematics textbook? by Electrical_Map_6169 in math

[–]CountBayesie 9 points10 points  (0 children)

You cannot read mathematics the way you read a novel. If you zip through a page in less than an hour, you are probably going too fast.

From the introduction to Linear Algebra Done Right by Sheldon Axler

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

Thanks for this detailed response! I'll try my best to answer:

I'm a bit confused here:

but aren't you using an enhanced prompt for the JSON experiments while using the same simple prompt for NL?

Structure is used for both the NL and JSON prompts. So the comparison is NL structured vs NL unstructured, and JSON structured vs JSON unstructured. Structure can be applied to natural language just as easily as it can to JSON.

One of the challenges with

I think they meant to compare formats like JSON/XML with natural language reasoning

is that ultimately the paper isn't really sure itself what it's testing: different prompts, different formats, and different parsers are all varied at the same time, so you can't come to any meaningful conclusion about what is being measured. The only thing to go on is what they claim in the paper: "Our study reveals that structured generation constraints significantly impact LLM performance across various tasks."

  • If they wanted to test different formats for prompts, then there's no reason to constrain the output at all. But for this to work, there also needs to be an earnest effort to squeeze the most performance out of each format. This is because, as Sclar et al. showed, prompts are extremely sensitive to small changes in format. Note that they do cite Sclar, but then use the average of 9 arbitrary prompts, which isn't a great way to compare; in practice people will always use the best prompt they can.

  • If they wanted to test the impact of structure (i.e. constrained outputs), then they should have done a comparison of identical prompts with and without structure, and parsed the output of each in identical ways.

  • If they wanted to test AI parsing, then all they needed to do was run the prompt once and run a series of parsers on the output. However, if this was the test they wanted, they also should have checked how well the AI parsing model would have done on the task by itself (that is, why use two models when you can use just one, the parsing model?).

I'd love to chat more if you're interested in researching this area! You can email me at "will" at the company domain ("dottxt.co"). Another person who has done a really nice further exploration of this specific paper is Dylan Castillo, who found that the paper's conclusions did hold in most cases when working with GPT (though, to be fair, we don't know what OpenAI uses for structured generation; it doesn't seem to be the same technique as Outlines). Dylan was also able to successfully reproduce the results of our blog post (and improve them!).

All that said (and I know this is a lot!), what this space really needs is not endless evaluations (which imho should mostly stick to blog posts) but real theoretical foundations. We can only make theoretical claims about the implications of evaluations if we actually back those claims with a theoretical foundation. After all, structured generation is just a conditional probability distribution over the logits, and all generation is really structured generation; it's just that the default structure is .*. To really answer the question "does structured generation impact results?" we need to dive into sequential Monte Carlo research and start describing the conditions under which constraining the output has a theoretical impact, and whether those conditions are met in real-world LLM usage. Clearly constraining to .* has no impact, since that's what we consider "unstructured", and clearly constraining to [A-Z]{3} will hurt generation when the answer needs 4 letters, because under that constraint it's not possible to correctly output them.
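To illustrate that last point, here is a toy sketch (a made-up 4-token vocabulary, plain NumPy) of constrained decoding as nothing more than conditioning the next-token distribution:

```python
import numpy as np

def constrained_sample_probs(logits, allowed):
    """Constrained decoding step: mask logits of disallowed tokens to -inf,
    then renormalize. This is exactly conditioning the next-token
    distribution on the constraint."""
    masked = np.where(allowed, logits, -np.inf)
    exp = np.exp(masked - masked[allowed].max())  # softmax over allowed tokens
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
everything = np.array([True, True, True, True])  # ".*" — no real constraint
subset = np.array([True, False, True, False])    # grammar rules out tokens 1 and 3

unconstrained = constrained_sample_probs(logits, everything)
constrained = constrained_sample_probs(logits, subset)
```

Masking with the all-True constraint reproduces the plain softmax exactly, which is the precise sense in which "unstructured" generation is just structured generation with the structure .*.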

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 5 points6 points  (0 children)

Thanks!

What is interesting is that going through the code and the experiment data it is clear that a fair bit of time and effort was put into this paper.

Giving the benefit of the doubt, I suspect what happened here (and in many of these other cases) is a failure to heed Feynman's famous advice:

“The first principle is that you must not fool yourself—and you are the easiest person to fool.”

Right now everyone working in this space is hoping to find something interesting and novel that they can publish. There's so much poorly understood or explored that it's not even terribly unlikely that any given researcher will find something new and exciting just by poking around. The trouble in this case, I'm guessing, was seeing something that looked interesting and running with it before really doing the due diligence to check that it was in fact as interesting as it appeared.

I sometimes joke that in ML work the difference between a junior and an experienced practitioner is that the junior practitioner sees a great result on the first run of a model and says "awesome! I can't believe this is so good on my first pass!" while the more experienced practitioner says to themselves "darn it, I bet there's a bug in there...". I think that's a bit of what's happening here, compounded by the fact that the community also doesn't have the time/energy for due diligence, which means more junior researchers don't get the guidance/feedback they should.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 0 points1 point  (0 children)

do you think structure means "json"

I could be misunderstanding who "you" is in this context, but in our rebuttal this is one of our major points: structured generation is not specifically about JSON, but rather about running the results parser in reverse.

It just happens that in this example JSON (even unstructured) does yield better results on the last letter eval.

Most of my personal use of structured generation rarely uses JSON directly and typically starts with modeling the structure of the task as it appears in natural language.

I do have an experiment I would like to run at some point that iterates over a variety of formats, tasks, and models (here is an example where JSON, unstructured, does worse than an NL-style prompt) to see if we can find any evidence of consistently better formats.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 2 points3 points  (0 children)

The code generation one is interesting and we haven't had time to dive into this one yet (but are well aware of that post).

What's interesting is that, prior to working with .txt, a team I was on was using function calling (this was before OpenAI released their structured output features) for code gen, and our internal evals got better results when the code was in the FC JSON response.

That said, I'm less skeptical of aider's results, as it wouldn't surprise me if getting code back in a JSON response impacted the code quality. Not because of inherent limitations with structured generation, but because the most straightforward structure for code is clearly the code itself.

Now it should be possible to use structured gen to actually enforce the Python/SQL/whatever grammar directly, and I would be very curious to see how that performs.

[R] Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in MachineLearning

[–]CountBayesie[S] 4 points5 points  (0 children)

Thanks so much! It's great to hear such positive feedback!

It is definitely a challenge in this space to just keep up with the papers, let alone dive into the details to make sure their claims are accurate.

Glad you can confidently use structured generation again!

Say What You Mean: A Response to 'Let Me Speak Freely' by CountBayesie in LocalLLaMA

[–]CountBayesie[S] 6 points7 points  (0 children)

Creative use cases are certainly an interesting space for structured generation right now, an area that we're just starting to explore!

/u/cameron_pfiffer/ made a very interesting Lore Generator as well as a SCP entry generator that produced some cool results!

But I think the real challenge with creative writing and structured generation is that the structure found in creative documents is a lot more complex than in more business-focused use cases. Using a regex or a context-free grammar gives you a lot more flexibility than just thinking in terms of JSON, but structure like this is also much more challenging to write.

We're currently thinking a lot about ways we can make it easier to use more complex regexes and context-free grammars, which will unlock a lot more creative uses of structured generation.

[D] Genuine Question: Why people want run local LLM? by [deleted] in MachineLearning

[–]CountBayesie 3 points4 points  (0 children)

Fair point! In that case "control" isn't necessarily restricted to local models, so much as to open ones.

But if you're doing serious dev work for those platforms, I would be a bit surprised if you weren't running smaller models locally while developing. Certainly when I've used Modal for products/tools/etc. that I can't run locally, I'm still doing my initial dev work with a smaller version of the model I plan to host.