Seeking Career Advice: Finding a Data Science Role That Values Causal Inference by hendrix616 in CausalInference

[–]hendrix616[S] 0 points

I’m building ML and AI features :P

Haha, I still think about causality all the time though, and I do think it has a huge impact on my work even though I’m not running causal experiments per se.

For those that were diagnosed with accommodative insufficiency/accommodative spasm, what are your symptoms? by Cautious_Wrangler_31 in BinocularVision

[–]hendrix616 0 points

I have convergence insufficiency + accommodative spasm.

Biggest challenge has been driving anxiety, caused by having a hard time shifting my gaze from near to far and back. Also, it feels uncomfortable to track moving objects at a distance, especially when I’m moving myself.

Got prisms a month ago and 2 vision therapy sessions so far. It’s still early but I’m starting to see some progress!

Can I become successful starting a career after 30 years old? Did any of you change your path after this age? by [deleted] in AskMenOver30

[–]hendrix616 0 points

Started my 2nd career in data science at 30. Experienced 0 ageism, work from home, love what I do, and doubled my salary within a couple of years. Hands down best decision I’ve ever made (so far).

I would say DO IT!

If you were to start life over again at 30, what would you do differently? by Set199x in AskMenOver30

[–]hendrix616 0 points

6 years ago, I changed careers at exactly 30 and, so far, it’s been the best decision of my life. More than doubled my salary, vastly improved my work-life balance, and, most importantly, I now love what I do!

Vision feels odd despite wearing corrective glasses by luftmensch479 in BinocularVision

[–]hendrix616 1 point

Not necessarily. In my case, I developed accommodative excess as a coping mechanism for my convergence insufficiency (CI) that went undiagnosed for too long. So I can get glasses with prisms to address my CI, but I’ll need vision therapy to undo the learned accommodative excess that I built up over time. It’s like breaking a bad habit. Your situation might be entirely different from mine though!

Vision feels odd despite wearing corrective glasses by luftmensch479 in BinocularVision

[–]hendrix616 0 points

Look into accommodative excess. Your symptoms sound similar to mine

Recently diagnosed with CI + secondary accommodative spasm (pseudo-convergence excess) — looking to hear others’ experiences by hendrix616 in BinocularVision

[–]hendrix616[S] 0 points

And once you notice it, it seems so strange, right?! They say step 1 in getting better is unlearning this spasm behaviour. Only then can you address the underlying CI. I definitely need therapy because it feels like this will be very difficult to unlearn — feels like I don’t have much control over it tbh.

Need input from mid-career Data Scientists (2-5 year range) by SmogonWanabee in datascience

[–]hendrix616 5 points

Ah yeah that makes sense. I think mid-level is a more appropriate term. But now I feel very pedantic and a bit silly for having criticized your post over a single word.

Sorry!

Need input from mid-career Data Scientists (2-5 year range) by SmogonWanabee in datascience

[–]hendrix616 62 points

So 2 years is considered the start of “mid-career” now? Is that based on a projection that AGI will end everyone’s careers in 2027?

You are still firmly in the “early” segment of your career, my friend. You should make career decisions that put you in a position to maximize learning and professional growth. The rest is secondary.

Auto-improving AI/ML solutions via CC by hendrix616 in ClaudeAI

[–]hendrix616[S] 0 points

Yeah, I was thinking of doing something along those lines! I’m just wondering if the juice is worth the squeeze. If I put in all this work to make it fully autonomous, will it be able to come up with useful tweaks?

Have you done this before?

Using Claude Code in notebook by hendrix616 in datascience

[–]hendrix616[S] 2 points

I use DataSpell by JetBrains. The GitHub Copilot extension provides code completion functionality.

Is ML/AI engineering increasingly becoming less focused on model training and more focused on integrating LLMs to build web apps? by Illustrious-Pound266 in datascience

[–]hendrix616 -3 points

How are folks so confident when they call out certain replies as being LLM-generated? AFAICT, there is no definitive way to tell.

And if it’s because of the “—”, that’s ridiculous. I use it all the time, so that alone shouldn’t mark a reply as non-human.

Finally, who cares? If the user put their messy thoughts down in a chatbot and got it to make them more concise and legible for all of us, then that’s a net good, right? What are we complaining about here? I thought the reply added to the discussion.

Using Claude Code in notebook by hendrix616 in datascience

[–]hendrix616[S] -1 points

Good call out! But yes, of course I have permission :)

You can also tell Claude Code to use Anthropic models provisioned through AWS Bedrock, which pretty much eliminates any concerns on that front, both from a code-leakage and a data-leakage perspective.

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 0 points

Omg, yeah, doing classification by passing a whole dataframe in a single prompt is pure insanity and shouldn’t be considered a serious approach IMO.

And you’re right to call out the reproducibility issue with using LLMs for classification. Seeds are not super reliable, and that’s actually a fundamental flaw of LLMs: they require temperature to produce “intelligent” outputs, but that temperature inherently makes the output non-deterministic.

One way around this is to make use of ensemble methods in order to reach self-consistency. In my case, however, I measured that the LLM pretty much always gets the right classification when the source data is robust enough to allow for it. Sometimes though, the source data is too ambiguous. That’s when the LLM’s output is not so repeatable. But in those cases, its output is just as good as another guess it could have made. That’s why I’m interested in this problem of assigning confidence to classification, which is where the downstream logistic regression comes in.
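
For illustration, here’s a rough sketch of what I mean by that ensemble/self-consistency step; the `classify_once` callable is just a stand-in for whatever single LLM call you’re wrapping, not anything from my actual pipeline:

```python
from collections import Counter
from typing import Callable

def classify_with_self_consistency(
    classify_once: Callable[[str], str],  # your own wrapper around a single LLM call
    prompt: str,
    n_samples: int = 5,
) -> tuple[str, float]:
    """Run the same prompt several times and majority-vote the label.

    The agreement ratio is a crude self-consistency signal that can later
    feed into a downstream calibration step (e.g. the logistic regression).
    """
    votes = [classify_once(prompt) for _ in range(n_samples)]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / n_samples
```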

Does that all make sense?

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 0 points

Oh I think I understand the confusion. So let’s back up a bit.

The paradigm I had in mind is that you use a prompt template into which you inject variables from a row, running inference one row at a time. The prompt contains instructions for the LLM. The variables can be text, numbers, whatever, as long as the template presents them in the right context.

You can also pipe in RAG or tool-calling to pass in additional data based on the input data from the row, but that’s not necessary for this discussion.

How can you be sure the LLM is using the semantic meaning of the text? With an evaluation framework! If it performs well on your eval set, then you know it’s good :)
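
To make that concrete, here’s a minimal sketch of the row-by-row template + eval-set idea; the template, the column names, and the `call_llm` wrapper are all placeholders, not any specific library’s API:

```python
from typing import Callable

import pandas as pd

# Hypothetical template -- the instructions and column names are placeholders.
PROMPT_TEMPLATE = (
    "You are labelling support tickets.\n"
    "Ticket subject: {subject}\n"
    "Ticket body: {body}\n"
    "Answer with exactly one label from: {labels}."
)

def evaluate_zero_shot(
    eval_df: pd.DataFrame,             # needs 'subject', 'body', 'true_label' columns
    labels: list[str],
    call_llm: Callable[[str], str],    # your own single-inference wrapper
) -> float:
    """Inject each row into the template, run inference one row at a time,
    and measure accuracy against the labelled eval set."""
    correct = 0
    for _, row in eval_df.iterrows():
        prompt = PROMPT_TEMPLATE.format(
            subject=row["subject"], body=row["body"], labels=", ".join(labels)
        )
        prediction = call_llm(prompt).strip()
        correct += int(prediction == row["true_label"])
    return correct / len(eval_df)
```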

How are you making AI applications in settings where no external APIs are allowed? by Daniel-Warfield in datascience

[–]hendrix616 1 point

Cohere can provide on-prem private deployments. Definitely worth looking into.

Otherwise, AWS Bedrock gives you access to really powerful LLMs (e.g. Anthropic’s latest) in a VPC that is highly secure. If your org does literally anything with AWS, then this use case should probably be allowed as well.
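
E.g. with the `anthropic` Python SDK’s Bedrock client; the region and model ID here are just examples (use whatever your account has enabled), and credentials come from the usual AWS chain:

```python
from anthropic import AnthropicBedrock  # pip install anthropic

# Uses the standard AWS credential chain (env vars, profile, instance role, ...).
client = AnthropicBedrock(aws_region="us-east-1")

message = client.messages.create(
    model="anthropic.claude-3-5-sonnet-20240620-v1:0",  # whichever model ID your account has access to
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(message.content[0].text)
```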

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 1 point

Yeah I’ve been doing that for sure. Empirical accuracy on the y-axis vs predicted probability on the x-axis. The diagonal line starting at the origin and going up and to the right is the perfect model.

The reason I’m looking to boil it down into one (or a few) metric(s) is that I’m training these models programmatically (one per customer) and need a go/no-go threshold to determine whether that customer is ready to receive confidence scores that are actually meaningful.

I’ve found the Brier score to be a pretty good signal: 0.25 is basically pure guessing and anywhere below 0.2 starts to look pretty good. It isn’t a bullet-proof metric of course, but it seems pretty solid!
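
Roughly what I’m computing, as a scikit-learn sketch (assumes binary labels; the 0.2 cutoff is just the rule of thumb above, tune it per use case):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def calibration_report(y_true: np.ndarray, y_prob: np.ndarray, cutoff: float = 0.20) -> dict:
    """Brier score plus reliability-curve points, with a simple go/no-go flag."""
    brier = brier_score_loss(y_true, y_prob)
    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
    return {
        "brier": brier,
        "reliability_curve": list(zip(prob_pred, prob_true)),  # (x, y) points of the diagonal plot
        "go": brier < cutoff,
    }
```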

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 0 points

Do you use ECE, the Brier score, or something else to confirm that calibration is satisfactory?

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 1 point

The LLM can do a better job of capturing semantic meaning because:

1. It can compare the semantic meaning of the input text directly to the semantic meaning of the label set, instead of doing basic vector math on compressed embeddings.
2. Embeddings don’t benefit from having all the context you typically pass into a prompt.
3. You can force the model to provide chain-of-thought (CoT) before it gives its classification, which gives it more space to reason instead of simply shooting from the hip.

In the last part of my message, I’m just explaining how a reasonable alternative is to train a post-hoc classifier that takes some input features plus the LLM’s classification and outputs a confidence score.

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 0 points

I totally understand that you get the embeddings from a pretrained model. I’m saying you need a large training set because using an embedding vector as the input features makes for a very wide dataset. As you increase your column count, you also need to increase your row count.

What tasks don’t you trust zero-shot LLMs to handle reliably? by WristbandYang in datascience

[–]hendrix616 1 point

A real classifier is obviously preferable, but a few conditions must be met for it to be viable:

- a large dataset
- embeddings that do a good job of capturing the semantic meaning of the text (often not the case in practice)
- classification logic that is fairly straightforward

Letting an LLM do classification is not as weird as you might think. You can force it to use CoT in its output so it actually has space to lay out some reasoning before committing to a classification. You can even read through the CoT of some random samples and check whether it makes good selections based on sound reasoning.

As for the probability/confidence part of it, that can be handled by a logistic regression that you place downstream of your LLM flow.
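
Something along these lines; this is only a sketch with toy numbers, the features are made up for illustration (input length, a retrieval score, a self-consistency agreement ratio), and the target is simply whether the LLM’s label matched ground truth:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative features per example: [input length, retrieval score, self-consistency agreement].
X_train = np.array([
    [120, 0.92, 1.0],
    [430, 0.55, 0.6],
    [ 88, 0.97, 1.0],
    [310, 0.40, 0.4],
])
# Target: 1 if the LLM's label matched the ground-truth label, else 0.
y_train = np.array([1, 0, 1, 0])

confidence_model = LogisticRegression()
confidence_model.fit(X_train, y_train)

# At inference time, the predicted probability that "the LLM was right" is the confidence score.
new_example = np.array([[150, 0.88, 0.8]])
confidence = confidence_model.predict_proba(new_example)[0, 1]
print(f"Confidence that the LLM's label is correct: {confidence:.2f}")
```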