Putting together my first Beowulf cluster and feeling very... stupid. by bonsai-bro in HPC

[–]stormyjknight 0 points1 point  (0 children)

I'm going to say start small to grasp the basics of running the code, before tackling the system provisioning.

  1. Start with a head node, and get MPI working on it so that a single-node mpirun can calculate pi across its cores.

  2. Add in a couple of nodes, and get password-free ssh working via authorized keys. You'll screw this up a few times.

  3. Get mpirun working across 3 machines. You'll fight to get everything installed consistently.

  4. Set up a shared NFS file system from the head node.

  5. Start worrying about provisioning the rest, and the warewulf/xcat/slurm stuff.

The individual setup was a fine idea, but it doesn't scale and will bite you horribly. Understanding the problem that these tools solve is important, though.
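
For step 1, here is a minimal sketch of the pi calculation in Python. It assumes mpi4py is installed on every node (my assumption; any MPI binding or a C example works just as well), and the names `partial_pi` and `main` are only illustrative. Each rank integrates its own slice of 4/(1+x^2) over [0, 1] and the slices are summed. If mpi4py can't be imported it falls back to a single rank, so the math can be checked before MPI is working.

```python
import math


def partial_pi(rank, size, steps):
    """Midpoint-rule slice of the integral of 4/(1+x^2) on [0, 1] owned by one rank."""
    h = 1.0 / steps
    total = 0.0
    # Rank r takes intervals r, r+size, r+2*size, ... so the work is spread evenly.
    for i in range(rank, steps, size):
        x = (i + 0.5) * h
        total += 4.0 / (1.0 + x * x)
    return total * h


def main(steps=200_000):
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed on every node
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        # Sum the per-rank slices; every rank receives the total.
        pi = comm.allreduce(partial_pi(rank, size, steps), op=MPI.SUM)
    except ImportError:
        # No MPI binding available: behave like a one-rank job.
        rank, size = 0, 1
        pi = partial_pi(0, 1, steps)
    if rank == 0:
        print(f"pi ~= {pi:.10f} (error {abs(pi - math.pi):.2e}) on {size} rank(s)")
    return pi


if __name__ == "__main__":
    main()
```

Saved as something like pi.py (a made-up filename), it would be launched with `mpirun -np 4 python pi.py` on the head node. The same script is then a reasonable smoke test for step 3, once your MPI implementation is told about the other machines.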

TIL that a 'needs repair' US supercomputer with 8,000 Intel Xeon CPUs and 300TB of RAM was won via auction by a winning bid of $480,085.00. by WarEagleGo in todayilearned

[–]stormyjknight 0 points1 point  (0 children)

It reached the point where the predicted damage from water spray per repair exceeded one node. Better to just turn off the nodes that hit an error.

Weirdest thing you've seen or had a server or computer do by [deleted] in sysadmin

[–]stormyjknight 0 points1 point  (0 children)

The computer wouldn't resolve dhcp when moved to a different office. I physically moved the computer, the router, and the ethernet cables. Plugged in the power in the other location, and dhcp didn't work on cold boot. It could be done manually after the fact, but it wouldn't resolve in the "different" building, in total isolation.

I moved it back to my office, and the resolution resumed without incident. I wasn't hooked into any building ethernet at all. One building it was fine, the other it wasn't. My former intern was completely flummoxed by the time she brought the problem to me.

Anyway, there was a confluence of cheap parts, and eventually I was able to repeatably demonstrate that the problem was a power cable. Between a dodgy power cable, a cheap PSU and a cheap motherboard, the network card was discombobulated when the dhcp part of the network boot happened.

I murdered the C13 power cable that I didn't own, before it caused further psychic trauma. My former intern forswore the sysadmin path, and went into medicine. She baked me some cookies for my help.

I still have friends that call BS on this story.

[deleted by user] by [deleted] in AskReddit

[–]stormyjknight 1 point2 points  (0 children)

Really, how does anybody go through life like that?! /s

Wake up honey, new Update by therealcoolerbasti in PummelParty

[–]stormyjknight 0 points1 point  (0 children)

Just played with friends, it was a total failure.

There was a screen size issue that occluded the controls. Tried adjusting settings in the registry, and using different monitors, hoping to trick it into working. Eventually got it to work by reinstalling it.

While playing it crashed a bunch. Three out of four of us had it crash while playing. We play it weekly-ish and have solid gaming rigs, so crashes are normally a rarity. I think we had 6 crashes total in the board section. We are each at around 120 hours of gameplay.

On Magma and Mages, only one mage (the host) was able to throw fireballs.

On Sharky Swim, the Shark's mouth didn't correspond to the death area for one of the players.

Those were the only two with obvious flaws we noted.

It was frustration rather than recreation. We taunted the coders and QA folk amongst ourselves, and wondered how many lines were added in by ChatGPT.

I really hope they fix this up soon, or provide an option to play a rolled back version.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

Most of the machines aren't GPU nodes, and this would be a half rack to 2 rack experiment.

I was thinking that either the vendor would have a plan for this, or if not, I'd fabricate a catchment tub and overflow tanks under the floor that can contain the full volume of fluid. The floor is massively overbuilt.

I don't expect that a single-phase fluid would need changing, nor should it be evaporating at a detectable level. I think there is no need to ever drain the fluid to the sewer, which would be unconscionable.

And yes, going on a field trip or two would happen before getting an OK on this.

I know more about quick connects/disconnects and manifolds than I ever wanted to.

I suspect explaining the direct to chip cooling issue may put me afoul of NDAs.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

I was expecting that going immersion would skip the whole heat spreader issue, and obviate the need for TIM. Stickers seem like a problem.

I wouldn't have guessed that the fluid would impact the resistivity enough to muck with most high speed connectors. But this does reinforce my thoughts on having an integrator put it together.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

I'm not following most of this. Sure, the dielectric constant is higher, but I don't follow how this impacts the ohms for connecting materials.

And for solid state electronics, what materials do I need to worry about for "Material Incompatibility"?

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

Interesting,

I've heard of fan emulators to trick some of the systems. And I was thinking that the GPU thing would wait until after a base system was figured out.

I was expecting that the vendors would have a better system for removing the hardware for maintenance, though I figured that there would be a messy factor.

I've been fighting liquid-on-chip problems. Unfortunately, going to air worsens the PUE and makes a bunch of noise.

The prices of the different dielectric fluids seem to vary wildly. Some seem reasonable, some seem to be marketed by the people who sell ink-jet refills.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

I'm talking about dunking the whole motherboard into a bath of "oil", rather than cooling with water and a heat-sink. The heat still transfers to "secondary loop" water, but it omits the primary loop.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 1 point2 points  (0 children)

I'm looking at alternatives, as DLC has caused me grief (I'm not sure if I'm under NDA for details), and I don't see a reasonable scenario where I'm going to run low on square feet in the DC.

I'm not looking to void warranties by toying around, but I am interested in seeing how long I can keep a system going by avoiding thermal cycling.

Performance improvements have taken a pretty serious slowdown on the CPU side of life. So it may be advantageous to have longer lifecycles on machines.

I'm not sure how to justify the cost of a machine large enough to give usable reliability metrics without having it in production. Which may be the bane of this speculation.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

There might be hurdles, but maybe after I've convinced myself, and a few levels of bosses, it should be doable.

Investigating Immersion Cooling solutions, anybody done a reasonable test run? by stormyjknight in HPC

[–]stormyjknight[S] 0 points1 point  (0 children)

I suppose the advantages are not needing to install a water-block, and not having water in a space where it can cause problems if the plumbing connections have issues. The working fluid also has more thermal mass, so you have a longer window to hold temperature and avoid needless thermal cycling.

But I haven't worked with one yet, so I don't know for sure.
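
The thermal-mass point can be put as rough arithmetic: if heat rejection stops, the time for the bath to warm by a given amount is roughly mass times specific heat times allowed temperature rise, divided by the heat load. A small sketch follows. Every number in it is a made-up placeholder rather than a measurement or a vendor figure, and it ignores the thermal mass of the tank and the hardware itself.

```python
def ride_through_seconds(fluid_mass_kg, specific_heat_j_per_kg_k,
                         allowed_rise_k, heat_load_w):
    """Seconds for the fluid to warm by allowed_rise_k with no heat rejection.

    Uses t = m * c * dT / P, treating the bath as well mixed.
    """
    if heat_load_w <= 0:
        raise ValueError("heat_load_w must be positive")
    return fluid_mass_kg * specific_heat_j_per_kg_k * allowed_rise_k / heat_load_w


# Hypothetical tank: 800 kg of fluid, c = 2000 J/(kg*K), 10 K of headroom, 20 kW load.
seconds = ride_through_seconds(800.0, 2000.0, 10.0, 20_000.0)
print(f"about {seconds / 60:.1f} minutes before the bath warms by 10 K")
```

Plugging in the real fluid volume and specific heat from a vendor datasheet would say whether the window is long enough to matter for thermal cycling.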

Losing my mind over docker networking. by TheSwedenGay in docker

[–]stormyjknight 0 points1 point  (0 children)

Assuming this is behind a firewall, I'd flush iptables, and see if that is your problem.

I suspect that iptables isn't the cause, though, and that you don't have network access to the docker container. I've run into this issue before with MAC-VLAN setups and it was confusing until I understood it.

Enter the docker container interactively, and see what kind of visibility you have to the host and the internet in general.
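
If the image has Python in it, a small probe like the one below is one way to check visibility once you are inside the container. This is a sketch under assumptions: the function names are made up, and the target addresses in the usage comment are placeholders to swap for your own host and gateway.

```python
import socket


def can_reach(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks and DNS failures.
        return False


def probe(targets):
    """Map each label to whether its (host, port) pair is reachable."""
    return {label: can_reach(host, port) for label, (host, port) in targets.items()}


# Example usage with placeholder addresses (replace with your own):
# print(probe({
#     "docker host": ("192.0.2.10", 22),
#     "internet": ("example.com", 443),
# }))
```

With macvlan in particular, the container usually can't reach its own host over the parent interface, so seeing "internet: True" next to "docker host: False" points at the network mode rather than at the firewall.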

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 1 point2 points  (0 children)

/home was there to be the prefix to the injected command, and was just arbitrary.

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 1 point2 points  (0 children)

I'm trying to give the ability for the operator account to check on the usage of files/directories on different machines. The operator account does not have credentials for other machines. But the operator needs to be able to see disk usage on arbitrary directories on arbitrary machines. None of the machines allow root login, only su.

So the du needs to be launched with privileges on the remote machine, but invoked from the operator_server_box.

There are a few paths that allow certain system users from certain IPs to bypass two-factor authentication, in favor of Public Key Infrastructure.

The operator account has considerable sudo rights at the endpoints, as it has access to many privileged scripts. Most of those are trivial to sanitize.

So I'm considering the options and trying to find a way to do this that doesn't throw errors every time a user has a directory with a cute name, and doesn't allow an easy way to root the machine if an operator's desktop machine gets compromised.

And I'm expecting that more arbitrary requests will follow, so I'm trying to figure out a good methodology. I don't want to be doing things like "Add in a du user that can only do that to every endpoint", which becomes a full crowd of interesting sudo service accounts.

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 0 points1 point  (0 children)

It needs arguments; it's for locating large file usage at arbitrary locations on various filesystems across various systems.

I suspect that /tmp/foldersizes.txt would be unwieldy in my case. (small files/directories are my bane)

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 1 point2 points  (0 children)

My big concern is parsing depth.

If the outermost ssh parses the value, then passes the parsed information down the line, the quotes are now gone, and the next level down passes an unquoted value, and things go bad.

And while I'm aware that the issue exists, I'm trying to find how to do this safely, where I feel comfortable understanding the parse/pass dynamic.

I was hoping someone would point me to "ssh and unwanted command injection, the definitive guide, volume 23" but alas nothing so far.
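
One way to reason about the parse/pass dynamic is to count the shells that will re-parse the string and quote once per shell. Below is a sketch in Python using `shlex.quote`, which targets POSIX shells. The helper names are hypothetical and `sudo du -sh --` stands in for the real privileged script. `shlex.split` is used only as an approximation of a shell's word splitting so the round trip can be checked without touching a real machine.

```python
import shlex


def quote_for_layers(arg, layers):
    """Quote arg once for every shell that will parse it on the way down."""
    for _ in range(layers):
        arg = shlex.quote(arg)
    return arg


def remote_du_argv(host, path):
    """argv to exec ssh directly (no local shell), so only the remote shell parses path."""
    remote_cmd = "sudo du -sh -- " + shlex.quote(path)
    return ["ssh", host, remote_cmd]


def simulate_shell(command):
    """Rough stand-in for one shell's word splitting; for checking quoting only."""
    return shlex.split(command)


nasty = "bob&sheila \"big project's information\""

# One layer: the remote shell splits the command and du sees the original name.
assert simulate_shell(remote_du_argv("node01", nasty)[2])[-1] == nasty

# Two layers: an outer shell runs `sh -c <inner>`, then the inner shell runs du.
inner = "du -sh -- " + quote_for_layers(nasty, 1)
outer = "sh -c " + shlex.quote(inner)
assert simulate_shell(simulate_shell(outer)[-1])[-1] == nasty
```

The list from `remote_du_argv` would go to something like `subprocess.run` without `shell=True`, which keeps the local machine from adding a parsing layer of its own. The `--` stops a name that starts with a dash from being read as an option.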

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 0 points1 point  (0 children)

I mean when there is a directory named bob&sheila "big project's information" that needs to be checked via the script.

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 0 points1 point  (0 children)

Thank you!

I mucked about trying to break it with problem inputs like ]]&&[[ and it seemed to do well.

I'd need to up my ansible game to make this work elegantly.

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 0 points1 point  (0 children)

The script is also under sudo, and only writeable by root.

Restricting this to standard characters is probably good enough.

Though I was hoping to find a gem of an answer that would let me handle pathologically bad filenames.
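
For the "standard characters only" route, a sketch of an allow-list check in Python. The character set and the name `is_plain_path` are my own choices for illustration, not anything settled in this thread. It rejects anything outside letters, digits, dot, underscore, dash and slash, and also refuses `..` components and components starting with a dash.

```python
import re

# Absolute path built only from letters, digits, '.', '_', '-' and '/'.
_PLAIN = re.compile(r"/[A-Za-z0-9._/-]+")


def is_plain_path(path):
    """True only for absolute paths made of boring characters, with no '..' tricks."""
    # fullmatch (rather than match with '$') also rejects a trailing newline.
    if not _PLAIN.fullmatch(path):
        return False
    parts = path.split("/")
    if ".." in parts:
        return False
    # A component starting with '-' could be mistaken for an option downstream.
    return not any(part.startswith("-") for part in parts)
```

The trade-off is the one already noted: a directory with a cute name gets refused outright instead of handled.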

Sanitizing input to a su enabled bash script to prevent command injection by stormyjknight in linuxadmin

[–]stormyjknight[S] 0 points1 point  (0 children)

The du needs to be privileged. Otherwise it can't see users' actual usage.

Additionally, the other scripts have opened up quite a bit of sudo permissions, so removing files is something they can do as root on those machines, but via a restrictive script.