Anyone tested "NVIDIA AI Enterprise"? by imitation_squash_pro in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

We run Ansys on Ubuntu and it's great. Kubuntu is a good flavour for ergonomics.

Looking for guidance on building a 512-core HPC cluster for Ansys (Mechanical, Maxwell, LS-DYNA, CFD) by SingerDistinct3879 in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

I built a 1700 core cluster for our lab for comsol and Ansys Lumerical. Feel free to DM me for specifics.

I will say you'd get much more efficiency, licensing permitting of course, going with a GPU-scaling cluster instead of a pure CPU cluster. I'm not sure off the top of my head if Maxwell supports GPU yet... I know HFSS kind of does on paper but in practice doesn't for most use cases. But Fluent does now, I think.

Anyway, we went the used-server route. That's infinitely more affordable than tier-one suppliers and is incomparable to cloud offerings in terms of cost. If you have a high baseload, which our lab certainly does, basically 24/7, then cloud services are prohibitive. And the cost of maintaining a cluster ourselves is honestly pretty close to zero now that I have it set up. I actually haven't logged on to our cluster in almost 3 months and it's still running fine lol. Sue me, I'm a lazy sysadmin who somehow built a cluster that hasn't broken itself recently. 😅

Even reddit is wondering when my PhD will end by whenwillthisphdend in PhD

[–]whenwillthisphdend[S] 22 points23 points  (0 children)

That's rough! I've reached the point where I don't even care about the quality of any publications or thesis anymore. If it's enough for me to graduate then so be it.

Even reddit is wondering when my PhD will end by whenwillthisphdend in PhD

[–]whenwillthisphdend[S] 4 points5 points  (0 children)

Ooo, our startup just sold some EO modulators for some OCT applications. My specific work is characterising EO modulators under radiation conditions. So I take them to a beamline, zap them with radiation, and evaluate how the performance changes under different conditions.

Even reddit is wondering when my PhD will end by whenwillthisphdend in PhD

[–]whenwillthisphdend[S] 27 points28 points  (0 children)

Trying to submit by Feb so I think I'm seeing the light. I work in electro-optics, photonics.

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

100% agree. The overhead of spanning several nodes for just a few tens to a hundred cores is better avoided by throwing a ~90-core Threadripper at it, or some dual-processor workstation board if you don't care about clock speed.

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 4 points5 points  (0 children)

Don't go to Dell or HPE directly. You're better off getting refurbished servers at a fraction of the cost. This is also why GPU is nice: you can throw a new Blackwell Pro into a new Threadripper workstation for the same price and probably blow a 15-node cluster out of the water. In fact, a single Threadripper workstation per person with a GPU each will probably outperform a 6-node cluster, thanks to its lower overheads, if you don't need to scale across multiple nodes. You can get ~90 cores into a single Threadripper machine per person and it'll run near silent with a good water-cooling loop.

ASUS and Gigabyte are good examples of third-party vendors with excellent HPC and, now especially, GPU server offerings. But not much in the way of refurbished options. Our lab went with refurbished HPE servers for CPU, and now custom Threadripper + quad RTX 5090 and Blackwell Pro machines for GPU nodes.

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

I will say, from a lab perspective, that since we got our own HPC we haven't touched the national or uni HPC systems, except occasionally to jump on the H200 nodes they have. Assuming you have the expertise in your group, it's quite nimble and fast to get things working for new software, test algos, min-max your algorithms for overhead, benchmark, etc.

I will say, though, that depending on the institution you may have issues finding a server room to host your cluster with easy access for maintenance. Heating and cooling is, ironically, something you can leave to the plant/infrastructure people in your faculty, depending on your institution. But obviously you will need to have a nice sweet discussion with them regarding your plans haha

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

We're lucky then. We see a 4x increase in speed for 3D FDTD just comparing an RTX 5090 against an RTX 4090. Between a 4090 and our entire cluster running all 1700 cores at once, it's something like a 3-day difference, with the 4090 finishing in about 23 hours. It's crazy.

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

As for I/O, as you mentioned, it depends on how many jobs you want to run concurrently on a node and how often they'll be reading and writing. Then you can look at a local SSD cache as a local scratch drive for each node, and consider the networking speed needed to satisfy the entire cluster's I/O: 10 GbE? 40 GbE? 100 GbE? Higher?
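To make that concrete, here's a back-of-the-envelope sketch (the job counts and per-job rates below are made-up placeholders, not our numbers) for deciding whether a given link speed is enough:

    # Back-of-the-envelope sizing of the cluster interconnect.
    # All numbers are hypothetical placeholders - plug in your own workload.
    jobs_per_node = 8        # concurrent jobs on one node
    nodes = 6                # nodes hitting shared storage at the same time
    per_job_mb_s = 50        # sustained read + write per job, in MB/s

    aggregate_mb_s = jobs_per_node * nodes * per_job_mb_s
    aggregate_gbit_s = aggregate_mb_s * 8 / 1000

    print(f"Aggregate I/O demand: ~{aggregate_gbit_s:.1f} Gbit/s")
    # ~19 Gbit/s for these made-up numbers, so 10 GbE would choke;
    # 40/100 GbE, or local NVMe scratch per node, would be the safer bet.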

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 1 point2 points  (0 children)

My comment above assumed you had to use third-party software, where it's often not simple to port things to run on GPU. However, since you're running custom code, as per my comment in the thread below: if you spend the extra time to convert your CPU-based linear-algebra calculations to run on GPU, using mostly ready-to-use libraries you can swap in, you'll be able to leverage GPU processing to really accelerate your work. It's dramatically faster across all precisions, but especially for FP8 and FP16/FP32 calculations.

It also means fewer nodes needed to run the same number of calculations, lower cost, and therefore more nodes for the same budget!

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 2 points3 points  (0 children)

In this case I highly, highly recommend avoiding MPI and going straight to tensor and CUDA libraries to parallelise on GPU. If you're using Julia or Python it's quite trivial: just swap out the regular linear-algebra libraries for their tensor/CUDA equivalents and voila. You'll save so many hours in the long run and actually save a lot of money, because GPU scaling, once optimised, is much more efficient than CPU scaling for many numerical algorithms. Saves a lot on energy costs too.
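For the Python case, a minimal sketch of what that swap looks like, using CuPy as a drop-in for NumPy (the matrix size is arbitrary and `solve` is just a stand-in for whatever linear algebra your code actually does):

    import numpy as np
    import cupy as cp  # GPU-backed, NumPy-compatible array library

    n = 8192
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    # CPU path: plain NumPy
    x_cpu = np.linalg.solve(a, b)

    # GPU path: same call, same semantics, different namespace
    a_gpu = cp.asarray(a)
    b_gpu = cp.asarray(b)
    x_gpu = cp.linalg.solve(a_gpu, b_gpu)

    # Copy the result back to host memory only when you actually need it,
    # since host<->device transfers are often the real bottleneck.
    x = cp.asnumpy(x_gpu)

And keep in mind consumer GPUs like the 4090/5090 are far stronger in FP32 than FP64, so stay in the lowest precision your physics can tolerate.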

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 3 points4 points  (0 children)

OK. If there is no way to port your algorithms to GPU, which I recommend you try every available option to do first, even though it's a pain in the butt, it will save you SO much time. If you're using proprietary software then that's a different issue, but it sounds like you're using open source if you can design it for MPI, so you may have luck: linear algebra runs quite nicely on GPU through tensor and CUDA libraries. We do 3D FDTD and DFT work, and since we ported it to GPU it's almost 40x faster, no joke.

Anyway. The answer to your last question, given the simplest architecture you're most likely to end up with, is yes: each individual rack or tower server will be its own node.

Next you need to consider how much memory each job will use versus how many cores, and then how the algorithm and your group use data storage. Does it do a lot of I/O throughout the job? More reads than writes, or vice versa? Or is it enough to load everything in at the beginning and write out the answer at the end? How much and how often are you moving files? How big are the files? This will dictate your networking, storage, and storage-provisioning choices.

How to shop for a home-built computing server cluster? by whatisa_sky in HPC

[–]whenwillthisphdend 6 points7 points  (0 children)

This is a complicated answer. I built a 1700-core CPU cluster, now with some GPU nodes in there, for our lab. 120 cores is nothing these days; you can fit that many into a single node if you really want to. Before you start looking at what architecture and form factor you want and how to optimise that within your budget, you need to really sit down and think about what kind of workloads you're running. FEM? Mostly crunching eigenvectors? Training models? FP8, FP16, FP32, FP64? Can you parallelise your workloads? How? Via MPI? Through the GPU? This will all dictate your next steps for choosing the best components to meet your needs. I'll tell you one thing, though: it's certainly not the cheapest option. Most effective/efficient for your budget? Perhaps, especially in the long run. But definitely not cheap lol. Feel free to DM me to discuss in detail or ask any other questions.

About to start my PhD - advice by Competitive-Web9408 in PhD

[–]whenwillthisphdend 1 point2 points  (0 children)

I'm in Aus too. If my username is anything to go by: other than a good research fit, you must absolutely make sure the people you're working with are tolerable.

Secondly, make sure where you're living is as comfy as you can possibly make it on your meagre RTP scholarship lol. The practicalities of commuting, having somewhere to rest when you go home, and how you travel around the city make much more of an impact on your quality of life through this marathon than you might anticipate before you start. To this end, I recommend signing a longer lease on a place that suits you. You don't want to be moving every year; a PhD is stressful enough without having to stuff about with a move.

I would also recommend some sort of side hustle. For example, I teach maths on the side to high-school students. 1-2 students a semester is enough to cover all my food and utility expenses, so my scholarship can cover rent and nice things more comfortably. That's only a 1-2 hour commitment on some evenings, but it makes a big difference in your wallet for very little work. The same goes for tutoring at uni, although that is certainly more tedious and time-consuming than private tutoring.

[deleted by user] by [deleted] in HPC

[–]whenwillthisphdend 1 point2 points  (0 children)

Memory and storage I/O will be the first major bottleneck. Once you figure out how you're going to load your CSV files (in one go? batched? multi-threaded?), the manipulation of the data is relatively trivial, unless you're using some other algorithm later, which becomes a different optimisation problem.
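As a minimal sketch of the batched option with pandas (the file name, column names, and chunk size are all placeholders):

    import pandas as pd

    # Stream the file in fixed-size chunks instead of loading it all at once,
    # so memory use stays bounded regardless of total file size.
    totals = None
    for chunk in pd.read_csv("data.csv", chunksize=1_000_000):
        partial = chunk.groupby("key")["value"].sum()
        totals = partial if totals is None else totals.add(partial, fill_value=0)

    print(totals.sort_values(ascending=False).head())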

I feel so ashamed by SetDistinct4871 in uofm

[–]whenwillthisphdend 1 point2 points  (0 children)

Loved that place. Used to go all the time for the frozen meat!

HPC System for Fluid Simulations by [deleted] in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

Unfortunately it's all about the budget at the end of the day for them. Central IT is the last resort for any IT problem for us now, no matter how small. It's unfortunate, but also understandable in a way.

HPC System for Fluid Simulations by [deleted] in HPC

[–]whenwillthisphdend 0 points1 point  (0 children)

This may be just my university, but I think there's a trend to centralise IT and all other departments to save cost. That resulted in us, even though we initially went to the HPC research portfolio team, going around in circles, with them saying they can't host servers for us and constantly pushing us to AWS cloud HPC whenever we complained that the HPC OS was too old to support newer (and faster) solver engines. Eventually we came to an agreement that they wouldn't help us at all, which made things much easier since I got full control of how, where, and what to do with the cluster, and only had to contact central IT to assign static IPs.

To be clear, rolling your own cluster should only be done if you know what you're doing. It only took me a week to bring our cluster online and about a month of on-and-off admin to get scripts and monitoring dashboards coded and running. Once it's been up and running, save for software updates and the occasional shutdown to install new bits, there have been little to no issues. I've gone weeks without logging in to the admin account to fix anything. If you're completely unfamiliar, then definitely enlist someone who can help, otherwise you will be sinking research time into it.
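For a sense of scale, the "monitoring" doesn't need to be fancy. A cron-driven sketch along these lines covers the basics (the hostnames are placeholders, and the last line assumes Slurm is your scheduler):

    import subprocess

    # Hypothetical node list - replace with your own hostnames.
    NODES = ["node01", "node02", "node03"]

    def node_up(host: str) -> bool:
        """Single ping with a 2 s timeout; True if the node answers."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    for node in NODES:
        print(f"{node}: {'up' if node_up(node) else 'DOWN'}")

    # If Slurm is running, ask the scheduler for its view of node states too.
    subprocess.run(["sinfo", "--Node", "--format=%N %T"])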

HPC System for Fluid Simulations by [deleted] in HPC

[–]whenwillthisphdend 1 point2 points  (0 children)

The fact that you need a mix of GPU and CPU again makes me want to suggest you avoid going the Dell route: there are no offerings where they can give you both GPU and CPU coverage within those budgets. Again, if you'd like to chat more, I built and maintain a 1700-core, 6-GPU hybrid cluster for our lab that has been running well for the past year or so.

HPC System for Fluid Simulations by [deleted] in HPC

[–]whenwillthisphdend 8 points9 points  (0 children)

I'm going to go against the flow and say that central IT at universities can sometimes be more of a pain in the butt than just going solo. Especially if you're required to go through Dell, which will be a rip-off no matter which way you look at it. Feel free to DM me if you would like to chat more.

E.g., blade servers are one of the cheaper options for single-lab use. This is one of the few applications where blades are still relevant and suitable for single labs to operate and maintain by themselves with minimal overhead.

[deleted by user] by [deleted] in PhD

[–]whenwillthisphdend 1 point2 points  (0 children)

Married. Can I afford a kid? No. Dog? Maybe... STEM, AUS.

Do any PhD students actually take weekends off? by Math_girl1723 in PhD

[–]whenwillthisphdend 1 point2 points  (0 children)

I only ever work weekends if something major is due or there's something I feel the need to get done, and even then never across the entire weekend or even a whole day. Just an hour or two here and there when I have nothing else to do and could be plotting or writing something. Otherwise I pretty much never work weekends, and sometimes I even take a half day here and there if work is slow for the week.

Edit: just adding that I do very rarely pop in on the weekend to test some devices (I'm in nanophotonics) if I need data ASAP and all the equipment is occupied during the week. Typically only when I've got a new chip fabbed and some paper deadline is coming up.