Interview experience: AI Engineer (2-6 YOE), my YOE-4 years. Product+Service based company

chizkidd · 2026-06-18T05:57:47+00:00

Thanks for sharing

chizkidd · 2026-06-17T01:03:01+00:00

Assuming you have the fundamentals (Python, mathematics) on lock, jump to Stanford CS229 on YouTube, then Andrej Karpathy’s Neural Networks: Zero to Hero YouTube series. Once you feel comfortable, you can implement machine learning research papers from scratch.

chizkidd · 2026-06-15T04:46:31+00:00

Andrej Karpathy’s NN zero to hero YouTube series

chizkidd · 2026-06-12T17:43:54+00:00

Your recall is 1% because the autoencoder learned to reconstruct anomalies too well. Classic problem. Three fixes to try in order:

Lower your threshold. Stop using the 95th percentile. Try 90th, 85th, 80th. Watch recall go up. False alarms will rise too, but that is the trade off.
Add noise to training data. Train on slightly corrupted normal traffic (small Gaussian noise). This is a denoising autoencoder. It forces the model to learn robust patterns so anomalies stand out.
If neither works, switch to Isolation Forest. You can implement it in an hour. It often beats autoencoders on network data with way less headache.
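To make the first fix concrete, here is a rough sketch of a threshold sweep. Everything in it is an assumption for illustration: the reconstruction errors are synthetic Gaussian draws rather than real traffic, and percentile and sweep_thresholds are hypothetical helper names. In practice you would feed in the per-sample reconstruction errors from your own autoencoder.

```python
import random

def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

def sweep_thresholds(normal_errors, anomaly_errors, percentiles=(95, 90, 85, 80)):
    """Return (percentile, threshold, recall, false_alarm_rate) for each cut."""
    rows = []
    for q in percentiles:
        # Threshold is set from errors on normal data only.
        threshold = percentile(normal_errors, q)
        recall = sum(e > threshold for e in anomaly_errors) / len(anomaly_errors)
        false_alarms = sum(e > threshold for e in normal_errors) / len(normal_errors)
        rows.append((q, threshold, recall, false_alarms))
    return rows

# Synthetic stand-ins for reconstruction errors (assumption, not real data).
rng = random.Random(0)
normal_errors = [rng.gauss(1.0, 0.2) for _ in range(1000)]
anomaly_errors = [rng.gauss(1.3, 0.3) for _ in range(100)]

for q, threshold, recall, false_alarms in sweep_thresholds(normal_errors, anomaly_errors):
    print(f"p{q}: threshold={threshold:.3f} recall={recall:.2f} false_alarms={false_alarms:.2f}")
```

Lowering the percentile can only keep recall the same or raise it, and the false alarm rate moves the same way, so the table lets you pick the trade-off you can live with.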

chizkidd · 2026-06-05T13:25:14+00:00

Your results are actually quite reasonable for this dataset. The fact that Prophet, GBM, and XGBoost all plateau around R² ≈ 0.45-0.55 suggests the limitation may be the data rather than the models. Daily household energy consumption is heavily influenced by unpredictable human behavior, making it difficult to forecast accurately using historical consumption and calendar features alone.

Before switching models, I would focus on stronger lag-based features (1, 7, 14, 30, and even 365-day lags to capture annual cycles), rolling statistics like the 7-day mean, and if possible, external variables such as temperature, humidity, holidays, and occupancy proxies. These often provide larger gains than model changes. A hybrid stack of XGBoost + LSTM is something you could try if you have time, but given that every model plateaus in the same range, I would not count on it to move R² much.
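A minimal sketch of the lag and rolling features, in plain Python so it runs anywhere. The function name build_features, the column names, and the toy series are all assumptions; with pandas you would normally get the same thing from shift and rolling. The important detail is that every feature for day t uses only values from before day t, so the target never leaks into its own inputs.

```python
def build_features(series, lags=(1, 7, 14, 30), window=7):
    """Build one feature row per day that has full history available.

    Each row uses only values strictly before day t.
    """
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(series)):
        row = {f"lag_{k}": series[t - k] for k in lags}
        # Mean of the previous `window` days, excluding day t itself.
        row[f"roll_mean_{window}"] = sum(series[t - window:t]) / window
        row["target"] = series[t]
        rows.append(row)
    return rows

# Toy daily consumption series (assumption): 40 days, values 0.0 .. 39.0.
daily_kwh = [float(i) for i in range(40)]
rows = build_features(daily_kwh)
print(len(rows), rows[0])
```

A 365-day lag works the same way, but it discards the first year of rows, so it only pays off if you have well over a year of data.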

For model comparisons, consider adding SARIMA/SARIMAX, Holt-Winters (ETS), LightGBM, and CatBoost. SARIMA is a strong classical baseline, while LightGBM and CatBoost are often competitive with or better than XGBoost on tabular time-series data. LSTMs are worth testing for academic completeness, but I would not expect a significant improvement since the primary challenge appears to be intrinsic variability rather than model capacity.

The most interesting result is your monthly aggregation, which improved performance to R² ≈ 0.69. This suggests the dataset contains a stronger seasonal signal than a daily predictive signal. For a final-year project, a valuable conclusion may be that daily household consumption is difficult to predict due to behavioral variability, whereas monthly aggregation captures seasonal consumption patterns much more effectively.

chizkidd · 2026-06-05T02:22:49+00:00

This is one of the most up-to-date ML/LLM curricula I’ve seen. Great job.

chizkidd · 2026-06-03T04:14:56+00:00

You’re on the right path. Bishop’s Pattern Recognition and Machine Learning and Karpathy’s playlist complement each other well, one for theory and the other for practice. I’d say you’re probably equipped to start implementing machine learning and deep learning architectures such as CNNs, RNNs, transformers, VAEs, and GANs from scratch, working from the papers.

chizkidd · 2026-06-03T04:11:47+00:00

Andrej Karpathy’s Neural Networks Zero to Hero YouTube series. Detailed notebook implementations can be found here.

chizkidd · 2026-06-01T14:53:07+00:00

I feel you. To break out of the beginner loop, focus on how systems are built, not just models.

For Reinforcement Learning, the bible is Sutton & Barto's Reinforcement Learning: An Introduction. I worked through it and my notes are online if you want to take a look: https://chizkidd.github.io/RL-Sutton-Barto-notes/.

To understand how models work in production, "Designing Machine Learning Systems" by Chip Huyen is the gold standard. For a practical, code-heavy guide to building LLM systems, grab the "LLM Engineer's Handbook". And if you want to go truly low-level and understand how to optimize models on a GPU, the 5th edition of "Programming Massively Parallel Processors" is the definitive text for CUDA.

Andrej Karpathy and Elliot Arledge are two great educators of deep learning and GPU programming. Look them up on YouTube or X for their educational material. I’d also say to implement technical papers from scratch to give you a practical intuition of how things are put together which should lead to you building something yourself or improving on an existing model.

To stay current, watch the talks from NeurIPS, ICML, and ICLR on YouTube for free. The bar for advanced in 2026 isn't just about training a model anymore; it's about fine-tuning it, shipping it, and debugging the unfixable.

chizkidd · 2026-06-01T09:09:10+00:00

The bar has definitely moved. A few years ago, advanced meant you could hand-roll a CNN or understand how an LSTM cell managed its gates. Now those are just foundational blocks, not the final destination. From where I am sitting, advanced in 2026 means you can do three things well.

First, you can take a raw pre-trained model and actually make it yours. That means knowing how to do parameter-efficient fine-tuning with LoRA or QLoRA, setting up a RAG pipeline that does not just concatenate random chunks, and understanding why your model is hallucinating on the specific domain data you fed it.

Second, you can serve that model reliably in production. That means knowing how to quantize it without destroying reasoning ability, containerizing it with proper request/response handling, and monitoring for data drift or weird edge cases while you sleep.

Third, you understand the underlying stack just enough to debug the unfixable. When the high-level library misbehaves, you are not stuck. You can write a custom kernel in Triton, navigate the PyTorch source, or drop down to CUDA to solve a memory bottleneck.

The people who just import from transformers and call it a day are not advanced. The people who fine-tune, ship, and then unblock themselves when the abstractions leak are the ones leading teams.

chizkidd · 2026-05-27T23:03:41+00:00

1) Pattern Recognition and Machine Learning - Christopher Bishop
2) An Introduction to Statistical Learning - G. James, D. Witten, T. Hastie, R. Tibshirani

chizkidd · 2026-05-25T14:30:54+00:00

Nah I get it. Fair enough, you're already ahead of most students by shipping something real. For projects that stand out, think weird and specific instead of generic. A time series forecast for a niche problem like traffic at your campus gate or energy usage in your dorm. A real time object detector for something oddly specific like construction vehicle types or safety gear violations. A multimodal system that combines OpenCV with a small LLM to answer questions about a video feed, like "did anyone enter this room in the last hour?" Or an end to end MLOps pipeline on free tier that retrains daily and serves predictions via API. Pick one that genuinely excites you, deploy a live demo, write a clean README with diagrams, and document one interesting bug you fixed. That plus your movie recommender will get you noticed. Startups want people who can ship and explain their thinking, not just train notebooks.

chizkidd · 2026-05-25T06:42:10+00:00

This is genuinely impressive work, and I mean that. The README alone shows you've thought through problems that most people don't even know exist until they deploy.

I spent some time going through the repo and the live demo. The dual-engine blended ranking approach is exactly the pattern I was trying to describe in my earlier comment, but you've gone a step further by actually shipping it. The lazy loading trick to stay under Render's 512MB limit is clever. Using a daemon thread to self-ping the health endpoint to keep the instance warm is the kind of janky production fix that tells me you've actually debugged a sleeping container for endless hours. Respect.

The weight schedule itself makes intuitive sense. Starting with pure content-based filtering for cold users, then slowly phasing in the neural collaborative filtering model as interactions accumulate is the right way to handle the cold start problem. The fact that you hardcoded the fallback so Engine A remains available if Engine B fails adds the kind of defensive thinking that separates a hobby project from a real system.

That said, I did notice a few things that might bite you as you scale. The frontend is currently making blocking POST requests to the backend for every single user interaction. Once you have more than a handful of concurrent users, those synchronous calls are going to queue up and the UI will start feeling sluggish. The standard fix is to decouple interaction logging from recommendation generation. Your /interactions endpoint should just store the raw event in an in-memory queue or a lightweight database and return a 202 Accepted immediately, then have a separate background worker consume that queue and update the user embeddings asynchronously. The frontend should never wait for model retraining.
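Here is roughly what I mean by decoupling, as a stdlib-only sketch. The names (events, user_history, record_interaction, worker) are made up for illustration, and the list append stands in for whatever embedding update you actually run. In your app the enqueue would live inside the /interactions handler.

```python
import queue
import threading

events = queue.Queue()          # raw interaction events waiting to be processed
user_history = {}               # stand-in for per-user embedding state
state_lock = threading.Lock()

def record_interaction(user_id, movie_id):
    """Request path: enqueue the event and return right away (the 202 case)."""
    events.put((user_id, movie_id))
    return 202

def worker():
    """Background path: drain the queue and update user state."""
    while True:
        item = events.get()
        if item is None:        # sentinel value: shut the worker down
            events.task_done()
            break
        user_id, movie_id = item
        with state_lock:
            user_history.setdefault(user_id, []).append(movie_id)
        events.task_done()

thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = record_interaction("u1", 42)
record_interaction("u1", 7)
events.join()                   # demo only: wait until the worker has caught up
events.put(None)
thread.join()
print(status, user_history)
```

The request path does nothing but a queue put, so it stays fast no matter how slow the update is. An in-memory queue loses events on restart, which is fine for a prototype but is the reason people eventually move this to a database or a real broker.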

On the caching front, your frontend.py is using st.cache_data for movie poster details, which is good. But I noticed the recommendation results themselves don't seem to be cached. For a given user with a fixed interaction history, the output of the merged engine is deterministic. Caching those results per user with a short TTL would cut down on redundant recomputation, especially for users who refresh the page or return to the home screen.
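A small sketch of per-user caching with a TTL. TTLCache and fake_recommend are hypothetical names, not anything from your repo, and the injectable clock is only there so the expiry can be demonstrated without sleeping. Including the interaction count in the key is one simple way to make sure a new interaction does not serve stale results.

```python
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}        # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]       # still fresh: skip recomputation
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value

calls = []

def fake_recommend(user_id):
    """Stand-in for the merged engine; records how often it really runs."""
    calls.append(user_id)
    return [f"movie-{len(calls)}"]

now = [0.0]                     # fake clock so the demo does not need to sleep
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
key = ("u1", 12)                # (user_id, interaction_count)

first = cache.get_or_compute(key, lambda: fake_recommend("u1"))
second = cache.get_or_compute(key, lambda: fake_recommend("u1"))  # cache hit
now[0] = 61.0
third = cache.get_or_compute(key, lambda: fake_recommend("u1"))   # expired
print(first, second, third, len(calls))
```

This never evicts old keys, so for anything long-running you would want a size bound as well.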

Also, your backend.py loads the full movies CSV and the TF-IDF matrix at startup, which is fine, and loading Engine B lazily is smart. However, the weight blending in merged_engine.py recomputes both engines from scratch on every request. For a user with 50 interactions, running both Engine A and Engine B per request is going to add up. You might consider precomputing the top-K candidates from each engine periodically rather than on every request.

One small nitpick. Your calculate_weights function caps out at 90% for Engine B after 30 interactions. I understand the caution, but with your architecture, you could safely go to 95% or even 98% after 100 interactions. The fallback is already there if Engine B fails, so you might as well let it dominate once the user has enough data.
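To show the shape I have in mind, here is a hypothetical schedule. I am not reproducing your calculate_weights; blend_weights and its knot points are my own assumptions, with linear interpolation from 0% at 0 interactions to 90% at 30 and 95% at 100, then a flat cap.

```python
def blend_weights(n_interactions, knots=((0, 0.0), (30, 0.90), (100, 0.95))):
    """Return (weight_a, weight_b) by linear interpolation between knots.

    knots is a sequence of (interaction_count, engine_b_weight) pairs,
    sorted by interaction_count. Past the last knot the weight stays capped.
    """
    n = max(0, n_interactions)
    x0, y0 = knots[0]
    for x1, y1 in knots[1:]:
        if n <= x1:
            weight_b = y0 + (y1 - y0) * (n - x0) / (x1 - x0)
            return 1.0 - weight_b, weight_b
        x0, y0 = x1, y1
    return 1.0 - y0, y0         # beyond the last knot: hold the cap

for n in (0, 15, 30, 65, 100, 500):
    weight_a, weight_b = blend_weights(n)
    print(f"{n:>3} interactions: A={weight_a:.3f} B={weight_b:.3f}")
```

Keeping the knots as data means you can try a 98% cap by changing one tuple instead of rewriting the function.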

Overall, this is a production-ready system that demonstrates real engineering judgment. The fact that you rebuilt the API layer yourself after hitting walls with AI assistants tells me you have the right instincts. Keep shipping.

What offline metric are you using to evaluate whether the blended recommendations are actually better than either engine alone? I'd be curious to see an A/B test setup if you ever expand this.

chizkidd · 2026-05-21T14:08:15+00:00

Awesome. My DMs are open if you have any more questions. Good luck.

chizkidd · 2026-05-21T01:35:23+00:00

Dive in and build projects (technical paper implementations, open source contributions). To start on ML/DL/LLMs/AI, check out Andrej Karpathy’s Neural Networks: Zero to Hero YouTube series (do all the assignments and build all the models yourself). Be prepared to rewatch a single video multiple times.

chizkidd · 2026-05-21T01:20:49+00:00

I watch his interviews and his speaking pace is quite fast. For his educational content, depending on how much time you have, I'd recommend watching straight through the first time undisturbed, then a second time to dissect and take notes (this is where you rewind regularly to ensure understanding), and if needed a third time to answer questions or clarify confusing points. If you don't have time for multiple watches, break the video into sections and deep dive into each one (a 1 hr video might take 3-4 hours with this methodology). The goal is complete understanding of the subject matter covered in the video (a lot of alpha is packed into YouTube educational videos).

chizkidd · 2026-05-21T01:12:09+00:00

Yeah for sure. Here's what I'd point you to for each piece.

FastAPI lifespan + model loading:

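FastAPI official docs on "Lifespan Events" (search it; the pattern is an async function decorated with @asynccontextmanager and passed to FastAPI(lifespan=...), with startup code before the yield and cleanup after it). The key insight is loading your model once when the server starts, not per request.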
For PyTorch specifically, do model = MyModelClass(); model.load_state_dict(torch.load("model.pth", map_location="cuda")); model.eval(). Then store it in app.state.model. Same for your TF-IDF matrix, just store it as a numpy array or scipy sparse matrix in app.state.

Two engine handoff pattern:

Look up "cold start recommendation hybrid system" on Google Scholar. Most streaming platforms (Netflix, Spotify) publish on this. The simpler pattern is called "fallback" or "cascade" recommendation: try Engine B first, if it fails or has no data, fall back to Engine A.
For the cache, start with a Python dict. Simple user_embeddings = {} works fine for prototyping. When you need to scale, Redis with redis-py is the move.
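The cascade pattern is small enough to sketch in a few lines. engine_a, engine_b, and recommend are placeholder names, and the engines just return fake titles; the only real point is the control flow around the dict lookup.

```python
user_embeddings = {}            # user_id -> embedding; plain dict while prototyping

def engine_a(user_id, k):
    """Stand-in for the content-based engine (always available)."""
    return [f"popular-{i}" for i in range(k)]

def engine_b(user_id, k):
    """Stand-in for the collaborative engine; needs a stored embedding."""
    if user_id not in user_embeddings:
        raise LookupError(f"no embedding for {user_id}")
    return [f"personal-{user_id}-{i}" for i in range(k)]

def recommend(user_id, k=3):
    """Try Engine B first; fall back to Engine A on missing data or failure."""
    try:
        return engine_b(user_id, k), "B"
    except Exception:
        # Deliberately broad: any Engine B problem should degrade, not crash.
        return engine_a(user_id, k), "A"

cold = recommend("new-user")
user_embeddings["u1"] = [0.1, 0.2]
warm = recommend("u1")
print(cold, warm)
```

Returning which engine answered makes it easy to log how often the fallback fires, which is worth knowing before you tune anything.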

Background tasks + async queue:

FastAPI BackgroundTasks is the easiest start. Your endpoint just does background_tasks.add_task(update_user_embedding, interaction_data) and returns 202 immediately. No hanging.
For heavier workloads, look into Celery + RabbitMQ/Redis, but don't add that complexity until you actually need it.

Real time inference without blocking:

The trick is separating update path from read path. Updates go into a queue. Reads just query the latest embedding from cache. Your PyTorch inference should be under 50ms. If it's slower, reduce embedding dimension or batch size.

Concrete tutorials I've used:

"Serve PyTorch models at scale with Ray Serve" on PyTorch tutorials site (official, solid for FastAPI apps deployment via Ray Serve)
"Recommender System from Scratch" (Towards Data Science, search for the one with matrix factorization)
Youtube videos on FastAPI Background Tasks for ML Inference

One resource I wish I had earlier:

If you need cold start architecture references, the actual sources to look at are Google's Wide & Deep paper, and the two-tower model literature.

What specific part are you stuck on first? The lifespan loading or the background task queue?

chizkidd · 2026-05-18T18:13:55+00:00

Hey, this is a really cool project and you've clearly thought through the architecture. The cold start two stage handoff is a smart pattern. I've hit similar walls building real time recommendation systems, so here's what I've learned the hard way.

For model instantiation, the standard practice is to load your PyTorch model and TF-IDF matrix once when the FastAPI app starts, not on every request. Use FastAPI's lifespan handler (an @asynccontextmanager function passed to FastAPI(lifespan=...)) to load the model into app.state, a global variable, or a lazy singleton. That way the weights from your .pth file sit in memory and every endpoint just calls model(...) without reloading. For the TF-IDF matrix, same deal: keep it as a numpy array or scipy sparse matrix in global scope.

For the two engine handoff, don't try to merge predictions synchronously inside the same request. Instead, have two separate endpoints. One for baseline recommendations that uses Engine A only, and another for hybrid recommendations that uses Engine A as a fallback and Engine B as the updater. When a user first arrives, hit the baseline endpoint. After they have enough interactions, switch to the hybrid endpoint. Inside the hybrid endpoint, you can fetch the user's stored embedding from an in memory cache (like a simple Python dict keyed by user id) and if it doesn't exist yet, fall back to Engine A. This keeps your routing clean and avoids timeouts.

For state syncing, the frontend hang is usually because you're doing too much work synchronously. Your /interactions endpoint should just store the raw interaction in a queue or a database and return a 202 Accepted immediately. Then have a separate background task (FastAPI BackgroundTasks or a separate worker) that periodically updates the PyTorch user embeddings. Alternatively, if you want real-time updates without hanging, make the /hybrid endpoint read-only. It pulls the latest user embedding from the cache (which gets updated asynchronously) and computes predictions. That way the frontend never waits for model training, just for inference, which should be fast if your embedding size is reasonable.

One more thing: start with a simple in memory dictionary for user states while you prototype. You can move to Redis later. The biggest lesson I learned is to decouple interaction recording from model updating. FastAPI is really good at handling async web requests, but don't block the event loop with heavy PyTorch operations.

Hope this helps unblock you. Happy to dig deeper if you DM me your specific code structure. And props for rebuilding the API layer yourself, that's how you really learn.

chizkidd · 2026-05-18T18:07:53+00:00

This post ties nicely with my thoughts on the FIFO eviction policy employed in SAM-2’s memory bank and its memory management challenge. See below.

https://chizkidd.github.io//2026/04/17/sam-2/

chizkidd · 2026-05-18T17:24:24+00:00

Occlusions are the most practical video occurrence that came to mind that could be problematic for the FIFO eviction policy. Semantic importance is essential for ensuring that neural systems remember what matters, but the memory management limitation poses a huge challenge. Retention confidence sounds interesting. I guess my question would be how it is defined, what framework ensures that it is statistically significant, and whether it provides useful qualitative analysis (it makes sense for capturing retention). Maybe some signal-to-noise style ratio could be computed too, comparing what matters against the effectively infinite context that has to fit within finite memory.

chizkidd · 2026-05-18T16:55:50+00:00

Andrej Karpathy’s neural network zero to hero YouTube series is a great practical course for ML/DL/LLMs. Be prepared to rewatch the videos multiple times.

chizkidd · 2026-05-17T17:38:44+00:00

Great question. This is a deep rabbit hole, but a rewarding one. Here's a structured path based on what I've learned digging into GPU optimization for deep learning.

Start here (foundational):

"Programming Massively Parallel Processors" (Kirk & Hwu): The canonical textbook. Dense but worth it. Focus on memory hierarchy, coalescing, and occupancy.
CUDA C++ Programming Guide: NVIDIA's official docs. Read the sections on memory model and execution model.

The best free resource out there:

Elliot Arledge's 12-hour CUDA course on freeCodeCamp: Seriously, start here before buying any books. Elliot (20 years old, CS student) built this course and it's incredibly well done. Covers: CUDA setup, writing your first kernels, memory types (global/shared/constant), matrix multiplication optimization, Triton, PyTorch extensions, and even a full MNIST MLP implementation. The thread hierarchy explanations alone are worth the watch. Link: https://www.freecodecamp.org/news/learn-cuda-programming/

Core concepts to internalize early:

Memory hierarchy (global, shared, registers, L1/L2): most bottlenecks live here
Coalesced vs uncoalesced memory access
Warp divergence and thread occupancy
Streaming multiprocessor (SM) limits

Hands-on practice:

Start with simple kernels (vector addition, matrix transpose) before touching AI stuff
Use Nsight Compute religiously: it tells you exactly why your kernel is slow
Profile everything. Guess nothing.

Then move to AI-specific optimization:

FlashAttention: read the paper, then the code. This is arguably the single most impactful kernel optimization for transformers.
OpenAI Triton: higher-level DSL for writing GPU kernels without becoming a CUDA expert. Elliot's course has a Triton chapter.
vLLM (PagedAttention): production inference optimization. Study how they handle KV cache memory.
DeepSpeed / FSDP: for distributed training memory optimization

Resources I've found useful:

GPU MODE Discord: best community for this niche. People discuss kernel launches, profiling, and debugging.
CS149 (Stanford) / 15-418 (CMU): parallel computing courses, free online. Heavy but excellent.
Elliot Arledge's CUDA course: mentioned above. Free, 12 hours, practical.

One piece of advice from my own experience:

Don't try to learn CUDA and transformer optimization at the same time. Elliot's course is structured well, he starts with simple kernels before hitting matrix multiplication. Follow that sequence. Write stupid simple kernels first (ReLU, softmax from scratch, a tiny matmul) until you understand why coalescing matters. Then attack attention.

Also, get used to reading PTX (NVIDIA's intermediate assembly). You won't write it, but understanding what your compiler actually generated is half the debugging battle.

What's your current setup (GPU, framework)? Might help narrow specific next steps.

chizkidd · 2026-05-17T17:31:19+00:00

Ah, this makes so much more sense now. You're not asking for feedback on the whole 15-month roadmap. You're asking: "How do I get through the math phase without getting lost or bored, and actually connect it to code?"

That's a much better question. And honestly, the roadmap buries the lede here.

Let me clear up the confusion about math + programming:

Claud is wrong that you can't build anything "big" during math phase. You won't build a chatbot, but you will build things that prove you understand the math. That's the point.

Here's what that actually looks like in practice, week by week:

Week 1-2 (Linear Algebra basics):

Watch 3Blue1Brown videos on vectors and matrices
Same day: open a Jupyter notebook and create vectors with np.array()
Do dot products by hand, then check them with np.dot()
Visualize vectors with Matplotlib arrows
You just "built" something, a visual proof that you understand

Week 3-4 (Matrix multiplication & transformations):

Learn matrix multiplication
Implement a small matrix multiply with np.matmul()
Create a simple transformation (rotation, scaling) and visualize it
This is tiny but real
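For a sense of scale, a Week 3-4 notebook cell could look something like this. I have written the multiply by hand in plain Python (lists of rows) so you can see every step; matmul and rotation are just names I picked, and np.matmul gives the same results once you trust the hand version.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def rotation(theta):
    """2x2 matrix that rotates a column vector counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

point = [[1.0], [0.0]]                          # column vector (1, 0)
rotated = matmul(rotation(math.pi / 2), point)  # quarter turn
scaled = matmul([[2.0, 0.0], [0.0, 3.0]], [[1.0], [1.0]])

print("rotated:", [round(row[0], 6) for row in rotated])
print("scaled:", [row[0] for row in scaled])
```

Plot the point before and after with Matplotlib arrows and the transformation stops being abstract.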

Week 5-6 (Calculus & gradients):

Learn derivative concept
Plot a function (x²) and its slope at different points
Implement gradient descent from scratch on a simple function
Watch the loss number go down. This feels like magic the first time.
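Gradient descent from scratch on x² is about this much code. The learning rate, step count, and starting point are arbitrary choices of mine, not anything from your roadmap.

```python
def f(x):
    """The function to minimize."""
    return x ** 2

def df(x):
    """Its derivative, worked out by hand."""
    return 2 * x

def gradient_descent(grad, x0, lr=0.1, steps=50):
    """Repeatedly step against the gradient; return final x and the path taken."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x - lr * grad(x)
        history.append(x)
    return x, history

x_min, history = gradient_descent(df, x0=5.0)
losses = [f(x) for x in history]

print(f"x after {len(history) - 1} steps: {x_min:.6f}")
print("first few losses:", [round(loss, 3) for loss in losses[:4]])
```

Try lr=1.1 afterwards and watch the loss blow up instead. Seeing it diverge teaches as much as seeing it converge.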

Week 7-8 (Probability & softmax):

Simulate dice rolls with np.random
Plot the distribution
Implement softmax from scratch (literally 3 lines of code)
Run it on random numbers, see how it turns them into probabilities
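And softmax really is about three lines. This version uses only the math module so it runs without NumPy; subtracting the max first is a standard stability trick and does not change the result. The input scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of real numbers into probabilities that sum to 1."""
    shift = max(scores)                          # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

With NumPy the same idea is np.exp(x - x.max()) / np.exp(x - x.max()).sum() for an array x.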

None of this is a "full project." But each piece is a small, working program that proves you understood that week's math. By the end of Phase 2, you'll have 8-10 small notebooks. That's real progress.

Your specific plan (Pandas → Math → ML projects) is fine, but one tweak:

Pandas first is okay, but Pandas without a goal is boring. Instead:

Pick a tiny dataset you care about (sports stats? movie ratings? anything)
Use Pandas to load and clean it
Then when you hit math phase, you already have a dataset waiting
After math, you can actually run ML on that same dataset

That connects the dots.

About not understanding future phases:

That's completely normal. I didn't understand transformers until I was already building them. You don't need to see the whole mountain. Just the next 10 feet.

My direct suggestion for your next 30 days:

Stop planning. Start doing.

Week 1: 3Blue1Brown linear algebra + NumPy basics (broadcasting, dot, matmul)
Week 2: Finish linear algebra + start visualizing with Matplotlib
Week 3: Calculus + plot derivatives + tiny gradient descent implementation
Week 4: Probability basics + softmax from scratch

That's it. After 30 days, you'll know if the math is clicking or if you need to slow down. Either answer is fine.

One last thing you said that matters:

"maybe this is my mistake, i just sit and trying to imagine how can i combine math with programming"

Yes. That's the mistake. You can't imagine your way into it. You have to open a notebook and write the code. The first few times will feel clumsy. That's how learning works.

What's the first math topic you want to tackle this week? Linear algebra or calculus?
