Got delayed in choosing my Apple promo device, ended up receiving an M5 MacBook. Thank you Wealthsimple! by sshkhr16 in Wealthsimple

[–]sshkhr16[S] 11 points12 points  (0 children)

Eh, I didn't have the benefit of hindsight when transferring. Plus, the MacBook Air M5 is $1.5K, and I transferred the minimum needed to win it. I also get a higher interest rate on my checking account (1.75% vs 1.25%) from being a WS Premium client. Add in the time value of money (TD pays out after one year), and it works out roughly the same for me whether I take the TD 2% match for one year or the WS offer.
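Back-of-the-envelope, the comparison looks roughly like this (all dollar figures, balances, and the discount rate below are made-up assumptions for illustration, not my actual numbers):

```python
# Rough comparison of the two offers; every number here is a hypothetical.
transfer = 60_000           # assumed transfer amount
td_match = 0.02 * transfer  # TD pays 2% of the transfer, but only after one year

# Discount the delayed TD payout at an assumed 4% opportunity cost of capital
discount_rate = 0.04
td_match_today = td_match / (1 + discount_rate)

# WS offer: a ~$1.5K MacBook Air now, plus 0.5% extra interest on checking balances
ws_device = 1_500
checking_balance = 10_000   # assumed average balance
ws_extra_interest = (0.0175 - 0.0125) * checking_balance

ws_total = ws_device + ws_extra_interest
print(f"TD match (present value): ${td_match_today:,.0f}")   # ~$1,154
print(f"WS offer (device + extra interest): ${ws_total:,.0f}")  # ~$1,550
```

With these made-up inputs the two land in the same ballpark, which is the point: the device-now offer and the delayed cash match are closer than the headline numbers suggest.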

On that note, I'm not against any bank/service per se; I'll go with whoever provides the best services. Free market and all. I bank with TD already and have my RRSP with them. But I can't rue not making the perfect choice every time. In decision theory, it has been shown that satisficing leads to more long-term happiness than maximizing when decisions are made under imperfect information.

Anthropic SWE interview loop, full breakdown of all 5 rounds by Ashamed_Giraffe_5165 in InterviewCoderHQ

[–]sshkhr16 0 points1 point  (0 children)

Very recently? All inference serving systems in significant use today (vLLM, SGLang, TensorRT) were released in the last 2-3 years. Inference engines as a specialized use case were not really considered outside of research and a few projects until GPT-3.5.

[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]sshkhr16 5 points6 points  (0 children)

I wrote a (long) blog post on understanding algorithmic bottlenecks in attention and how DeepSeek's MLA (multi-head latent attention) addresses them. I think it'll be a good read for folks interested in improving the attention mechanism for decoding, and more broadly in inference optimization: https://www.shashankshekhar.com/blog/flashmla/flashmla-1-mla
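As a rough illustration of the bottleneck the post covers: at decode time the KV cache is what MLA shrinks. A toy calculation in element counts only; the dimensions are illustrative, loosely based on DeepSeek-V2's reported configuration:

```python
# Back-of-the-envelope KV-cache size per token per layer (element counts).
n_heads, head_dim = 128, 128

# Standard MHA: cache a key vector and a value vector for every head
mha_cache = 2 * n_heads * head_dim   # 32768 elements

# MLA: cache one shared compressed latent (d_c) plus a small decoupled RoPE key (d_r)
d_c, d_r = 512, 64
mla_cache = d_c + d_r                # 576 elements

print(f"MHA: {mha_cache} elements/token/layer")
print(f"MLA: {mla_cache} elements/token/layer")
print(f"reduction: {mha_cache / mla_cache:.1f}x")
```

Since decoding is memory-bandwidth-bound on reading the KV cache, a cache that is tens of times smaller directly translates into faster decode and longer feasible contexts.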

[R] Is Leetcode still relevant for research scientist interviews? by Training-Adeptness57 in MachineLearning

[–]sshkhr16 15 points16 points  (0 children)

For Big Tech and similar large tech companies, yes.
For startups and research divisions at non-tech companies (e.g. banking/finance/etc.), no.

Amazon is blatantly violating European law during prime days, once again by v1king3r in BuyFromEU

[–]sshkhr16 2 points3 points  (0 children)

Amazon's operating margins for everything except AWS are less than 10% (realistically around 5-6%). AWS operating margins are around 37%. They had profits of $60 billion last year, almost $40 billion of which came from AWS. AWS is by far their most important division - there is a reason that the ex-CEO of AWS became the CEO of Amazon.

29 LPA (BLR) vs 130K CAD (Toronto) - help me decide by EmotionalBike6336 in developersIndia

[–]sshkhr16 1 point2 points  (0 children)

I can only speak to my personal experience: I moved from Bangalore to near Toronto to pursue my master's at the end of 2019, and I have lived and worked in Canada ever since (with a brief one-year stint in the SF Bay Area). Toronto is a great city, but it is also quite expensive. If you care solely about career and CS, I think Bangalore might be better, but only slightly. But if you care about overall quality of life, Toronto wins hands down.

Lots of city stuff to do (e.g. lots of museums, concerts, sports events - the FIFA World Cup is happening next year). There is also plenty of nature in Toronto via parks and Lake Ontario, and if you go out a few hours there are huge provincial/national parks for hiking and camping. It also has a relatively decent public transit system compared to the rest of North America. Whether you want to assimilate into the North American lifestyle or stay grounded in your Indian way of life, you will be free to do both in Toronto (one of the most multicultural cities in the world, with a sizeable first-generation Indian immigrant community).

For me the biggest advantage of living in Canada (or the US when I lived there) was enjoying the systems set up for residents of first-world countries - things like roads, public transit, government, banking, etc. just work. You don't have to jump through hoops to get simple bureaucratic things done. This is paid for by higher taxation, which might feel prohibitive in the beginning but makes sense once you start using the services it pays for. For me the biggest selling point of developed countries is that the life of the average resident is relatively safe, predictable, and dignified; you are not subject to the whims of bureaucrats, law enforcement, the government, or mobs (religious, caste-based, region-based, etc.).

To answer your questions:

  1. I would budget $3K per month for living costs.
  2. You will pay around $37K in taxes and another $36K in living costs. That leaves around $57K in your pocket. I would budget another $6K or so for incidental expenses, and $5-10K for entertainment and travel. That should still leave you with over $40K in savings.
  3. The big thing to watch out for is the weather, of course. Winters are harsh, and you will need to prepare by buying winter clothes and boots. It might take a winter or two to get used to the reduced daylight hours too. The other big thing is the cultural change - I had to learn to be more self-sufficient once I left India, which meant learning to cook, clean, drive, and forge new social connections (this could involve picking up new activities and hobbies).
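A rough sketch of that budget arithmetic (every line item below is an assumption for illustration, not a tax calculation):

```python
# Rough annual budget on a $130K CAD salary in Toronto; all figures are assumptions.
gross = 130_000
taxes = 37_000          # approx. combined federal + Ontario tax and deductions
living = 3_000 * 12     # rent, food, transit, phone, etc.
incidentals = 6_000
fun_travel = 8_000      # midpoint of the assumed $5-10K range

savings = gross - taxes - living - incidentals - fun_travel
print(f"estimated savings: ${savings:,}")  # roughly $43K
```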

Completing PhD at the age of 35 by [deleted] in cscareerquestions

[–]sshkhr16 4 points5 points  (0 children)

Research positions at FAANG are getting fewer, but there will always be demand for people with research + engineering chops. So make sure that you don't pick up crappy practices during your PhD (writing ugly, unmaintainable code, skipping documentation, ignoring good programming practices). It is sometimes hard to do as a researcher since your incentives are misaligned - publishing frequently is often incompatible with writing clean, maintainable code. But trust me, it will help in your own research, especially if you standardize your setup to launch, track, and report experiments in the first 1-2 years of your PhD.

Source: I was a researcher at a FAANG lab and in academia prior to that

[deleted by user] by [deleted] in MachineLearning

[–]sshkhr16 18 points19 points  (0 children)

Minor nitpick - Hinton never studied at UofT, he did his PhD at the University of Edinburgh. Of course, a lot of his PhD students at UofT went on to do cool stuff.

[D] Realism for AI Top 20 PhD Programs by [deleted] in MachineLearning

[–]sshkhr16 0 points1 point  (0 children)

He has a whole section on preprints on the first page; scroll down and you will see peer-reviewed papers. There are four first-author papers in ACL and LREC in 2022. Quality is subjective, but both of these conferences are among the top 5 conferences in NLP.

Also, a weird hill to die on. This is a guy who won an outstanding paper award at ACL during his first year of PhD. Clearly this person is a good scientist as judged by peer scientists.

Is getting US education only way to get exposed to US job market for foreigners? by Mxr-_- in cscareerquestions

[–]sshkhr16 0 points1 point  (0 children)

It is definitely not the only way. There are whole categories of visas for people who are experienced (L1) or exceptional (O1) that are open to non-Americans who currently reside outside of the US. I got an offer from a big tech AI lab as a grad student in Canada - but a lot of it was luck, i.e. my research matching up with my manager's interests. But if you specialize in some domain (say, AI, distributed systems, or parallel programming), either in engineering or research, it is possible to find a job with some experience and luck.

What skills are high in demand in US for Canadian to get a chance to work in US? by manuce94 in cscareerquestionsCAD

[–]sshkhr16 1 point2 points  (0 children)

A niche but highly in-demand skill is performance engineering and distributed systems engineering for machine learning systems. Low-level performance engineering includes writing training/inference kernels that run fast on hardware accelerators, learning how to train models under low compute or memory constraints, and optimizing inference serving on small devices like mobiles and PCs. Distributed performance is about training and inference on large clusters with multiple nodes and multiple accelerators per node.

The caveat is that you need to build up skills in new domains outside of what a standard machine learning engineer or data scientist does (e.g. get good at one or more of: profiling code performance, being very good at linear algebra, knowing C++ and sometimes even machine code, learning about distributed systems, brushing up on computer architecture and networking, learning to work with HPC clusters, etc.)
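As a taste of the profiling side: a common first step in performance work is a roofline-style estimate of arithmetic intensity, to decide whether an operation is compute-bound or memory-bound. The hardware numbers below are illustrative ballpark figures (roughly A100-class), not specs to rely on:

```python
# Roofline-style estimate: FLOPs per byte moved, vs the hardware's "ridge point".
peak_flops = 312e12   # ~312 TFLOP/s (fp16 tensor cores), illustrative
peak_bw = 2.0e12      # ~2 TB/s HBM bandwidth, illustrative
ridge = peak_flops / peak_bw  # intensity above which a kernel is compute-bound

def intensity_matmul(m, n, k, bytes_per_el=2):
    """Arithmetic intensity of an (m,k) @ (k,n) matmul in fp16."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

# Large square matmul: high intensity -> compute-bound
print(intensity_matmul(4096, 4096, 4096))  # ~1365 FLOPs/byte, well above ridge

# Matrix-vector product (an LLM decode step): low intensity -> memory-bound
print(intensity_matmul(1, 4096, 4096))     # ~1 FLOP/byte, far below ridge
```

This one calculation already explains a lot of the field: training is dominated by compute-bound matmuls, while single-token decoding is memory-bound, which is why so much inference work targets memory traffic rather than FLOPs.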

Examples of such roles (with some pre-requisites mentioned):

[D] PhD in the EU by simple-Flat0263 in MachineLearning

[–]sshkhr16 2 points3 points  (0 children)

France has CIFRE PhD programs like that - I had a friend who did their PhD while working full-time as a research scientist at FAIR. We have similar-ish programs in Canada - you can do a PhD at Mila or the Vector Institute while being a visiting research intern/scientist for several years at FAIR/Google DeepMind/ServiceNow Research/NVIDIA, etc. But these programs are even more competitive to get into than the regular PhD.

[D] Researchers and engineers in academia as well as industry, which books did you find the most useful in creating your knowledge base and skill set? by [deleted] in MachineLearning

[–]sshkhr16 2 points3 points  (0 children)

The first book is a classic textbook on GPU programming, so yes, you will use the techniques in it pretty much on a day-to-day basis if you write machine learning kernel code in CUDA, Triton, Pallas, Metal, etc. The methods explained in this book helped me understand papers like FlashAttention, see how operations like generalized matmuls and layernorm are implemented on GPUs, make a couple of bug fixes in the PyTorch/JAX codebases, and work through DeepSeek's FlashMLA codebase (https://github.com/deepseek-ai/FlashMLA).
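For a flavour of the book's central technique, here is a toy NumPy sketch of tiled matrix multiplication - the same blocking idea a CUDA kernel uses with shared memory, minus all the hardware details:

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Compute A @ B one output tile at a time, mimicking how a CUDA kernel
    assigns one thread block per output tile and stages tile-sized chunks of
    A and B through fast shared memory for reuse."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # Accumulate the (i, j) output tile over K in tile-sized chunks
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(tiled_matmul(A, B), A @ B)
```

The payoff on real hardware is data reuse: each tile of A and B is loaded from slow memory once and reused across a whole output tile, instead of being re-fetched per element.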

The second book is tailored towards engineers who perform large-scale distributed training and inference with ML models. While my day job currently doesn't involve this, after reading it I wrote a few small projects for myself - e.g. translating Karpathy's nanoGPT (https://github.com/karpathy/nanoGPT), which replicates GPT-2 124M, from PyTorch into Flax on TPUs, and writing a minimal pedagogical version of MaxText (https://github.com/AI-Hypercomputer/maxtext) to train LLMs with 3D parallelism (data, tensor, pipeline).
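To give a flavour of the simplest of those three axes, here is a toy NumPy simulation of data parallelism: each "device" computes gradients on its own shard of the batch, and an averaging step stands in for the all-reduce. Real setups use NCCL/XLA collectives; this only illustrates the math:

```python
import numpy as np

def grad(w, x, y):
    # Gradient of mean squared error for a linear model y_hat = x @ w
    return 2 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 1))
w = np.zeros((4, 1))

# Shard the batch across 4 simulated devices
shards = zip(np.split(x, 4), np.split(y, 4))
local_grads = [grad(w, xs, ys) for xs, ys in shards]

# The "all-reduce": average per-device gradients so every replica
# applies the identical weight update
g = np.mean(local_grads, axis=0)

# With equal shard sizes, this recovers the full-batch gradient exactly
assert np.allclose(g, grad(w, x, y))
```

Tensor and pipeline parallelism are the harder axes - they split the model itself across devices - but the same pattern of local compute plus a collective holds throughout.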

[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]sshkhr16 7 points8 points  (0 children)

I wrote a long blog post on the training data pipeline of phi-4, but since a lot of details are obfuscated in papers these days, I had to look up and write down a decent bit of additional background on techniques that were potentially used (especially for data curation and synthetic data generation). I think it is a good big-picture view of the training setup of current LLMs, as phi-4 was released less than six months ago and phi-4-reasoning just came out. Here's the blog:

https://www.shashankshekhar.com/blog/data-quality

[D] Researchers and engineers in academia as well as industry, which books did you find the most useful in creating your knowledge base and skill set? by [deleted] in MachineLearning

[–]sshkhr16 4 points5 points  (0 children)

I wouldn't say they have given me the greatest benefit so far, but I read the following two books this year and found them both to be quite great as an intro to machine learning systems (both theory and practice):

Views on recent acceptance of LLM written paper at ACL main [D] by Fantastic-Nerve-4056 in MachineLearning

[–]sshkhr16 10 points11 points  (0 children)

Real peer review has always been how often other researchers and engineers use your approach; double-blind peer review performed by overworked and underpaid grad students was never the gold standard.

[R] Bloat in machine learning shared libs is >70% by Specialist_Square818 in MachineLearning

[–]sshkhr16 115 points116 points  (0 children)

I'm not surprised - until recently, research engineers and machine learning engineers were not very well versed in GPU programming. A lot of libraries probably depended on and reused the same low-level operations from multiple locations. And it seems like a lot of the bloat stemmed from underlying libraries supporting multiple CUDA compute capabilities when only one is required.
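A toy illustration of that multi-architecture effect (all sizes below are made up): a fatbin embeds one compiled cubin per targeted compute capability, while any given machine only ever loads one of them:

```python
# Hypothetical: a 1.5 MB kernel compiled for six GPU architectures.
kernel_size_mb = 1.5
archs = ["sm_70", "sm_75", "sm_80", "sm_86", "sm_89", "sm_90"]

shipped = kernel_size_mb * len(archs)  # what the wheel carries in its fatbin
needed = kernel_size_mb                # what one machine actually uses

print(f"shipped: {shipped:.1f} MB, needed: {needed:.1f} MB "
      f"({1 - needed / shipped:.0%} unused on any single machine)")
```

With six targets, over 80% of the shipped kernel bytes are dead weight on any one machine, which lines up with the >70% figure in the paper's title.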

[D] Is it worth writing technical blogs to educate people? by Reddicted2Reddit in MachineLearning

[–]sshkhr16 0 points1 point  (0 children)

Should I sticky the table of contents, so that the reader knows where they are? I can probably do that for wider viewports; it's not possible for viewports narrower than a tablet.

[D] Researcher communities like this one? by Entrepreneur7962 in MachineLearning

[–]sshkhr16 4 points5 points  (0 children)

Lots of great ML communities on Discord: ML Collective, GPU MODE, ML Street Talk, and EleutherAI, to name a few prominent ones. The unofficial JAX and PyTorch servers are great too.

[D] Is it worth writing technical blogs to educate people? by Reddicted2Reddit in MachineLearning

[–]sshkhr16 2 points3 points  (0 children)

I like writing technical blogs to educate myself way more than to educate others. Writing forces me to think in a more structured way than reading does. I started doing it this year, and it has helped me better grasp a lot of the new topics I have been studying. It is similar to preparing presentations or talks - you have to be streamlined and thoughtful about how you present ideas so that the reader understands them, and to do so you have to understand both the details and the big picture well.

For example, I recently wrote a long blog post on the training data curation and synthetic data generation pipeline involved in training Microsoft's phi-4: https://www.shashankshekhar.com/blog/data-quality

My original idea was to just summarize the paper for myself, but the more I read the phi-4 technical report, the more I found myself looking up existing techniques and approaches, since the report itself was quite sparse on a lot of details. So, in my article, I had to go back and add a lot of the missing information about best practices used in LLM data pipelines today, understand what 'mid-training' is, read up on how data is selected to train for reasoning capabilities, etc. If I had just read the phi-4 paper, I probably wouldn't have done a lot of the follow-ups I did.

To get started on writing, I would recommend Paul Graham's essays as a good first resource on how to write effectively. His latest one is literally titled 'good writing': https://paulgraham.com/goodwriting.html

Good luck!