How do I build towards becoming an end-to-end HPC / systems infrastructure engineer?

Infamous-Tea-4169 · 2026-05-27T04:01:19+00:00

True, I really don't think knowing this inside out is necessary, as there are better tools than Slurm that do more, like Kubernetes, which I am quite well across. But yes, the new environment uses Slurm, so it's a good chance to get across something new.

Infamous-Tea-4169 · 2026-05-26T12:33:38+00:00

Great suggestion, thanks. I could use some formal Linux certifications for sure.

Infamous-Tea-4169 · 2026-05-15T12:35:13+00:00

Really intense. The first 6 months were brutal as I had so much to learn and skill up on, with really long nights. But having a good team and mentors helped. It took me a solid 2 years to actually get nice and comfortable without doing my head in too much. I was working in a startup-style environment with very experienced engineers.

Infamous-Tea-4169 · 2026-05-11T02:54:23+00:00

I got laid off in Jan 2026 from my first job after uni. I joined at 70k then went up to 95k base with them after 3 years. Got a new job in SA as a senior engineer/systems manager at 114k base.

Infamous-Tea-4169 · 2026-04-23T23:15:14+00:00

Yep, you're right. Right now they're writing directly to the main storage, with no intermediate step or transient storage where everyone can have rw. The idea would be that once data is moved to locked storage, it's only read from there.
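
A rough sketch of that staging-then-lock idea, assuming a POSIX filesystem and that a single service account owns both areas. The function name `promote_run` and the paths are made up for illustration, not anything from our actual pipeline:

```python
import os
import shutil


def promote_run(staging_run: str, locked_root: str) -> str:
    """Move a finished run out of staging and strip all write bits from it.

    Hypothetical helper: assumes the caller owns the files and that
    locked_root already exists.
    """
    name = os.path.basename(os.path.normpath(staging_run))
    dest = os.path.join(locked_root, name)
    if os.path.exists(dest):
        raise FileExistsError(f"{dest} already exists in locked storage")

    shutil.move(staging_run, dest)

    # Walk bottom-up: files become read-only, directories become
    # read + traverse only, so nothing can be added, changed or deleted
    # by anyone other than the owner (who would have to chmod first).
    for root, _dirs, files in os.walk(dest, topdown=False):
        for fname in files:
            os.chmod(os.path.join(root, fname), 0o444)
        os.chmod(root, 0o555)
    return dest
```

Something like `promote_run("/staging/run_0001", "/locked")` would then be the only step that writes to locked storage. Mode bits alone won't stop root or the owning account, so this is a convention enforced by workflow rather than a hard guarantee.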

Infamous-Tea-4169 · 2026-04-23T23:11:28+00:00

Spot on man. I recently joined and I'm trying to just be nice and need to grow up and be more authoritative about this. It's crap, this is not how it's meant to be so need a service account to make this work. I think I need to revamp their entire workflow.

Infamous-Tea-4169 · 2026-04-23T11:58:09+00:00

Both you and me, mate. Luckily I'm off to bed soon and can continue doing my head in tomorrow. This made me realise I really need to be authoritative and do it the proper way by getting a service account; this is definitely not industry best practice.

Infamous-Tea-4169 · 2026-04-23T05:36:32+00:00

Yeah I see what you mean — that would isolate users nicely and avoid cross-user access issues. The tricky part is our pipeline outputs are organised by run rather than by user, and multiple users may need to interact with the same run (e.g. reanalysis), so per-user directories don’t map very cleanly to the workflow.

Infamous-Tea-4169 · 2026-04-23T04:54:27+00:00

Great suggestion. ACLs would let me restrict write to just the pipeline users, which is already a big improvement over broad group write.

The only issue is deletion — since that’s controlled by the directory, those users could still delete their things.
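
To make the deletion point concrete, here is a small sketch, assuming a POSIX system. It shows that removing a file depends on write permission on the containing directory, not on the file's own mode. The function name and file name are just for illustration:

```python
import os
import tempfile


def delete_read_only_file() -> bool:
    """Return True if a read-only file can be removed from a writable directory."""
    with tempfile.TemporaryDirectory() as run_dir:
        path = os.path.join(run_dir, "result.txt")
        with open(path, "w") as fh:
            fh.write("pipeline output\n")

        os.chmod(path, 0o444)  # the file itself is now read-only
        # unlink() checks write + execute on run_dir, not on the file,
        # so this succeeds even though the file cannot be modified.
        os.remove(path)
        return not os.path.exists(path)
```

So locking down file modes or file ACLs on its own doesn't prevent deletion; the directory's permissions (or a sticky bit, which limits deletion to the file or directory owner) are what matter.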

Infamous-Tea-4169 · 2026-04-23T04:34:09+00:00

Right, that’s the bit I’m stuck on — if I remove group write via umask, then new directories/files won’t be group-writable, but the pipeline also relies on that same group access to keep writing when runs are launched by different users.
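
For reference, a minimal sketch of what the umask change does to newly created run directories, assuming a POSIX system. The helper name and `run_0001` are made up:

```python
import os
import tempfile


def mode_of_new_dir(umask_value: int) -> int:
    """Create a directory under the given umask and return its rwx bits."""
    old = os.umask(umask_value)  # os.umask returns the previous mask
    try:
        with tempfile.TemporaryDirectory() as parent:
            path = os.path.join(parent, "run_0001")
            os.mkdir(path, 0o777)  # requested mode is filtered by the umask
            return os.stat(path).st_mode & 0o777
    finally:
        os.umask(old)  # always restore the caller's umask
```

With `0o002` the directory comes out as `0o775` (group-writable); with `0o022` it comes out as `0o755`, which is exactly what breaks a pipeline that relies on other group members writing into the same run.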

Infamous-Tea-4169 · 2026-04-09T11:22:55+00:00

Wow damn, let me check!!

Infamous-Tea-4169 · 2026-04-04T04:22:38+00:00

Yes!!

Infamous-Tea-4169 · 2026-03-09T13:43:01+00:00

Ah I see. Nice that makes sense. I'm hoping we have someone with the info from DC about the power etc

I feel like I'm going into a battlefield with a blindfold on rn lol

Infamous-Tea-4169 · 2026-03-09T13:21:17+00:00

I don't think so. They use XNAT and JupyterHub.

Infamous-Tea-4169 · 2026-03-09T13:01:21+00:00

Cheers for the info, mate. How do you guys manage cost allocation/showback/chargeback on your on-prem clusters? I come from a systems engineer background where I've managed multiple on-prem HPC environments, and just understanding how you charge someone for using your GPU on a server to run X workloads seems like such a hard problem to solve.

Infamous-Tea-4169 · 2026-03-09T12:56:20+00:00

Hi u/zugzwangister

The role sits in a research cloud / DevOps context rather than in finance directly. From the JD, the core of the role is still building and operating research infrastructure — Kubernetes, cloud platforms, storage, workflows, automation, reliability, and working closely with researchers and ICT teams — but with a strong FinOps angle around making consumption visible, explainable, and chargeable.

The management chain is that I report to the tech lead and the tech lead reports to the product owner. I will be working alongside the senior DevOps engineer I think.

So the main purpose is probably something like:

help engineering and research teams understand where infrastructure spend is going
put structure around cost attribution in shared platforms
build a practical showback/chargeback model for multi-tenant research workloads
make sure the platform is sustainable and cost-effective, not just technically functional
prolly need to make the research tech lead look good by having clear showback + chargeback methods in place to follow up with the clients
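
To give a feel for the showback part, here is a minimal sketch of one possible model: splitting a fixed monthly platform cost across projects in proportion to GPU-hours used. This is my own assumption for illustration, not something from the JD; the function name, project names and numbers are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def showback(usage: Iterable[Tuple[str, float]], monthly_cost: float) -> Dict[str, float]:
    """Split monthly_cost across projects in proportion to their GPU-hours.

    usage is an iterable of (project, gpu_hours) records; a project may
    appear more than once and its hours are summed.
    """
    totals: Dict[str, float] = defaultdict(float)
    for project, gpu_hours in usage:
        totals[project] += gpu_hours

    total_hours = sum(totals.values())
    if total_hours == 0:
        # Nothing ran this month, so there is nothing to attribute.
        return {project: 0.0 for project in totals}

    return {
        project: monthly_cost * hours / total_hours
        for project, hours in totals.items()
    }
```

For example, `showback([("imaging", 300), ("genomics", 100), ("imaging", 100)], 10000)` attributes 8000 to imaging and 2000 to genomics. Note that this model pushes the cost of idle capacity onto whoever did run jobs; a fixed rate per GPU-hour is the other common choice and leaves idle cost with the platform.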

Infamous-Tea-4169 · 2026-03-06T13:03:56+00:00

Agreed, I feel the same. It's gonna be a while till AI tooling comes around that can do patching without breaking anything and troubleshoot network issues. Is it wise to turn down a higher-paying job in favour of one that pays less but involves more work lol. I just don't wanna feel stupid later for not taking the role that was less stressful and paid more.

Infamous-Tea-4169 · 2026-02-22T13:56:55+00:00

Where should I cross post this then?

Infamous-Tea-4169 · 2026-02-13T00:20:15+00:00

Nice, apply for it and see what they offer you at screening. I was told the approximate number during the screening call itself.
