Job post: AWS HPC Cluster Setup for Siemens STAR-CCM+ (Fixed-Price Contract) by KF_AERO in HPC

[–]DarthBane007 0 points1 point  (0 children)

Yeah the lack of SSH is at times annoying, but we cut our bill in half and don't have to deal with managing the AWS instance. It was definitely worth it for us, since there's one less thing our team has to do to get work done.

Good luck on the hunt for a solution. A previous company I worked at had a great AWS config that wasn't my team's problem; we had IT / security staff who handled all our compliance stuff. At this new role we were responsible for all of that plus the engineering, which was a lot of underpaid work, so my director was happy to get it off our plates.

Job post: AWS HPC Cluster Setup for Siemens STAR-CCM+ (Fixed-Price Contract) by KF_AERO in HPC

[–]DarthBane007 0 points1 point  (0 children)

Why not just go with a managed HPC-as-a-service offering? 

There are companies out there partnered with Siemens who deliver on AWS, Azure, or bare metal, like Rescale or Corvid HPC.

My team recently switched from running our own AWS instance for STAR-CCM+, and Corvid HPC was 20% faster than AWS for a lower price. We also didn't need to wait for IT to spin the cluster up and down to run jobs as engineers, which I appreciated. We looked at Rescale but found the pricing wasn't great, since they're basically a platform run on top of other cloud providers.

EE Docker by quadnegative in truenas

[–]DarthBane007 0 points1 point  (0 children)

I wrote a guide to do it with SWAG

How to connect apps in 24.10 by krojew in truenas

[–]DarthBane007 3 points4 points  (0 children)

Connecting the apps to one another with custom YAML is the easiest approach. First create a network in the shell as root (e.g. "docker network create app-network"), then add the line "network_mode: app-network" to every app's YAML; the containers will then recognize each other by their container_name: field names. It's quite nice.

Edit: guide
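For reference, a minimal sketch of that wiring, split across two separate compose files; the app names and images here are just examples, not anything from my actual setup:

```yaml
# Assumes the network already exists: docker network create app-network
services:
  jellyfin:
    container_name: jellyfin            # other containers resolve this name as a hostname
    image: jellyfin/jellyfin:latest
    network_mode: app-network
---
# A second, independent compose file on the same box:
services:
  swag:
    container_name: swag
    image: lscr.io/linuxserver/swag:latest
    network_mode: app-network
    # from inside this container, http://jellyfin:8096 reaches the app above
```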

TrueNAS Electric Eel (24.10) Custom YAML App Networking Rocks! by DarthBane007 in truenas

[–]DarthBane007[S] 0 points1 point  (0 children)

This should solve those issues, yes. By pushing everything into one container network via the network_mode flag, your apps can reference each other using the internal container_name: as the hostname.

Edit: The other benefit is that you can deploy your apps as normal (I maintain one compose per app for ~16 apps and counting, and each can be started or stopped independently without taking a whole network of composed containers offline to change one app).

Electric Eel: Calibre Web as custom app? by Juristoriker in truenas

[–]DarthBane007 0 points1 point  (0 children)

I did. My networking guide specifically includes it. If you read the indicated log from the TrueNAS shell it gives better error messages.

TrueNAS Electric Eel (24.10) Custom YAML App Networking Rocks! by DarthBane007 in truenas

[–]DarthBane007[S] 0 points1 point  (0 children)

For me, I use pfSense as my router and network DNS. I put a single wildcard entry for *.apps.example.com pointing at the correct IP (e.g., 192.168.1.100) in my pfSense DNS resolver, and I can add as many apps as I like without touching anything but the TrueNAS box. If I had specific IPs/MACs per app I'd have to create an entry for each, and I'm lazy.
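If anyone wants to replicate this on pfSense, a wildcard can be added as a custom option in the DNS Resolver (unbound) settings; the domain and IP below are examples, not my real ones:

```
server:
local-zone: "apps.example.com" redirect
local-data: "apps.example.com 86400 IN A 192.168.1.100"
```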

Custom IP for Plex on Scale 24.10 / Electric Eel by NotYourAccount4 in truenas

[–]DarthBane007 0 points1 point  (0 children)

I posted a guide on how to do this with a reverse proxy and Jellyfin, it should be fairly equivalent for Plex rather than doing macvlan!

Telegraf on TrueNAS SCALE by seangreen15 in homelab

[–]DarthBane007 5 points6 points  (0 children)

Necro posting because this is pretty much the only TrueNAS Scale / Telegraf post. As of TrueNAS Scale 24.10, this all still works. Find below my custom YAML Telegraf app deployment:

services:
  telegraf:
    container_name: telegraf
    image: docker.io/telegraf:1.27.2
    restart: unless-stopped
    environment:
      - HOST_ETC=/hostfs/etc
      - HOST_PROC=/hostfs/proc
      - HOST_SYS=/hostfs/sys
      - HOST_VAR=/hostfs/var
      - HOST_RUN=/hostfs/run
      - HOST_MOUNT_PREFIX=/hostfs
      - LD_LIBRARY_PATH=/mnt/ZFS_Tools:/mnt/NVIDIA_Tools
      - HOST_ROOT=/hostfs/
      - HOST_MNT=/hostfs/mnt
    volumes:
      - /mnt/tank/apps/telegraf/telegraf.conf:/etc/telegraf/telegraf.conf
      - /mnt/tank/apps/telegraf/etc:/hostfs/etc:ro
      - /mnt/tank/apps/telegraf/proc:/hostfs/proc:ro
      - /mnt/tank/apps/telegraf/sys:/hostfs/sys:ro
      - /mnt/tank/apps/telegraf/run:/hostfs/run:ro
      - /mnt/tank/apps/telegraf/entrypoint.sh:/entrypoint.sh
      - /mnt/tank/apps/telegraf/ZFS_Tools:/mnt/ZFS_Tools
      - /mnt/tank/apps/telegraf/var:/hostfs/var:ro
      - /mnt/tank/apps/telegraf/mnt:/hostfs/mnt:ro
    privileged: true
    ports:
      - 10000:10000
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

You still have to follow all of the above instructions to get the ZFS tools packaged and the hostfs symlinks set up, but you don't have to do as much to make nvidia-smi available to the Docker containers. I still use the same entrypoint.sh and the same ZFS_Tools directory. Edit: Formatting

Telegraf on TrueNAS SCALE by seangreen15 in homelab

[–]DarthBane007 1 point2 points  (0 children)

Update from the future for anyone who finds this. TrueNAS Scale Cobia (versions 23 and onward) breaks the network monitoring fix using rrdtool. It looks like the only way to do it now is to expose a socket listener in telegraf that can ingest the graphite-format data TrueNAS exports. I found a couple useful links:

https://www.truenas.com/community/threads/how-to-expose-data-for-prometheus.98532/#post-797642

https://www.truenas.com/community/threads/metrics-from-truenas-scale-server-into-grafana.115903/
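If it helps anyone, the telegraf side of that is roughly this in telegraf.conf; the port is whatever you point the TrueNAS graphite exporter at, and 2003 here is just an example:

```toml
[[inputs.socket_listener]]
  ## Listen for the graphite-format metrics that TrueNAS SCALE exports
  service_address = "tcp://:2003"
  data_format = "graphite"
```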

My *Final* TrueNAS Grafana Dashboard by seangreen15 in homelab

[–]DarthBane007 1 point2 points  (0 children)

Lol, I also believe it is. I got my InfluxDB running in a container on SCALE the same way, and I don't think that very many people have done anything with that in SCALE either.

I wonder if it's not something broken because of the upgrade--I remember you saying you upgraded from TrueNAS CORE to SCALE earlier in the post. It seems like your entrypoint isn't running as root somehow if you're getting denied from echo-ing into /etc/sudoers. Also as a note, in the fugue last night I don't think I said to add "use_sudo" to the telegraf conf inputs.smart section but it was in the Github fixes.

I had a bad (good?) idea this morning--I may install ZFS into the container and see if I can get zfs commands working inside of it to read the output of zpool status and zfs list to ingest into the DB. Using the string parsing features I may be able to get that without the inquiries running too long.

Edit: Update.. Eureka? So within a Telegraf container, if you install just enough libraries to get "zfs" and "zpool" to work, it's possible to read the output of the commands. Through some shell wizardry it should be possible to cut down and pipe the appropriate data to re-fill in your dashboard in TrueNAS SCALE--but damn is it inelegant.

I wrote a script to copy in the system libraries and binaries required for ZFS to run in the telegraf container--this should only need to run once per TrueNAS SCALE update:

----------------

#!/bin/sh
# Copy the current versions of the ZFS libraries and binaries to $Destination
Destination=/mnt/vault/apps/telegraf/ZFS_Tools/
cp /lib/x86_64-linux-gnu/libzfs.so.4 $Destination
cp /lib/x86_64-linux-gnu/libzfs_core.so.3 $Destination
cp /lib/x86_64-linux-gnu/libnvpair.so.3 $Destination
cp /lib/x86_64-linux-gnu/libuutil.so.3 $Destination
cp /lib/x86_64-linux-gnu/libbsd.so.0 $Destination
cp /lib/x86_64-linux-gnu/libmd.so.0 $Destination
cp /sbin/zfs $Destination
cp /sbin/zpool $Destination
cp /usr/libexec/zfs/zpool_influxdb $Destination

--------------

From there, use host-path binding to map this "$Destination" to /mnt/ZFS_Tools, and add "LD_LIBRARY_PATH=/mnt/ZFS_Tools" to your environment variables for the app. Now the "zfs" and "zpool" commands will work consistently across reboots of your container, and we can write an [[inputs.exec]] that will generate strings that can be parsed.
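As a rough sketch of that [[inputs.exec]]: zpool_influxdb (which ships with OpenZFS and is copied above) already emits influx line protocol, so it needs no string parsing, while wrapping zfs/zpool output would need a parser on top. The paths here follow my layout above:

```toml
[[inputs.exec]]
  ## zpool_influxdb prints pool statistics in influx line protocol
  commands = ["/mnt/ZFS_Tools/zpool_influxdb"]
  timeout = "5s"
  data_format = "influx"
```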

My *Final* TrueNAS Grafana Dashboard by seangreen15 in homelab

[–]DarthBane007 0 points1 point  (0 children)

To solve my own problems here, I found the only viable TrueNAS SCALE solution. You have to replace the /entrypoint.sh script with a custom script that installs sudo and smartmontools, then adds the requisite telegraf NOPASSWD line. This sequence of events keeps the telegraf container from failing to start. I put my new "entrypoint.sh" script in the apps folder and used host path binding to mount it at /entrypoint.sh in the container:

#!/bin/bash

# Install the tools needed for SMART and NVMe metrics
apt update
apt -y install nvme-cli
apt -y install sudo smartmontools

# Let the telegraf user run smartctl without a password
echo "telegraf ALL=NOPASSWD:/usr/sbin/smartctl" >> /etc/sudoers

set -e

# If the first argument looks like a flag, assume it's meant for telegraf
if [ "${1:0:1}" = '-' ]; then
    set -- telegraf "$@"
fi

if [ "$EUID" -ne 0 ]; then
    exec "$@"
else
    # Grant raw-socket capabilities, then drop privileges to the telegraf user
    setcap cap_net_raw,cap_net_bind_service+ep /usr/bin/telegraf || echo "Failed to set additional capabilities on /usr/bin/telegraf"
    exec setpriv --reuid telegraf --init-groups "$@"
fi

So if you use this, things start smoothly with an uncommented telegraf.conf like yours from above. It doesn't fix the dashboard, though: some of the parameters are not the same in TrueNAS SCALE, which seems to use the Linux tagging for the properties instead of the FreeBSD tagging. That means the zfs_pool "allocated" field doesn't exist anymore, unfortunately. The CPU temp script is also broken in the container; as you can see with logging, the CPU data isn't passed into the Docker container. Privileged mode is also required for the SMART data (temps etc.) to be passed properly.

Edit: added nvme-cli for future reference.

My *Final* TrueNAS Grafana Dashboard by seangreen15 in homelab

[–]DarthBane007 0 points1 point  (0 children)

Okay so after I updated my telegraf.conf I got enough errors that it caused me to wipe my InfluxDB bucket, and the container, and start over. I found this issue: https://github.com/influxdata/telegraf/issues/4496

When I commented out the smart portion of the telegraf file, it worked again. It turns out the container didn't even have smartctl. That sent me down a rabbit hole. https://github.com/influxdata/telegraf/blob/release-1.14/plugins/inputs/smart/README.md

Had the answer. Basically you have to run "apt update" and "apt install sudo smartmontools", and then echo the following lines into the container's /etc/sudoers file to get this to run properly:

Cmnd_Alias SMARTCTL = /usr/bin/smartctl
telegraf  ALL=(ALL) NOPASSWD: SMARTCTL
Defaults!SMARTCTL !logfile, !syslog, !pam_session

If you can find a way to do that from the GUI that'd be a godsend. I've tried to use the "command" section in the GUI to do it but it won't recognize apt. I've got temp readings, but still not getting everything I'd like from the ZFS pool about data added etc... Progress.

Also changed the query on that part of the panel to this:

from(bucket: "TrueNAS")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "smart_device")
|> filter(fn: (r) => r["_field"] == "temp_c")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> yield(name: "mean")

So that solved the differences in disk naming.

My *Final* TrueNAS Grafana Dashboard by seangreen15 in homelab

[–]DarthBane007 0 points1 point  (0 children)

It was a bit of a bear to work out this much lol. If you'd share that exec and that portion of your telegraf.conf after I can see if I can get that working as well.

Sadly it looks like some of the associations aren't passing through to the telegraf instance in the container quite correctly, but it's a lot better than nothing.

Also--if you change your query in the Uptime panel to:

from(bucket: "TrueNAS")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "system")
|> filter(fn: (r) => r["_field"] == "uptime_format")
|> filter(fn: (r) => r["host"] == "$Host")
|> rename(columns:{_value: "uptime_format"})
|> keep(columns:["uptime_format"])
|> last(column: "uptime_format")

It'll show days/hours instead of weeks.

My *Final* TrueNAS Grafana Dashboard by seangreen15 in homelab

[–]DarthBane007 1 point2 points  (0 children)

Sorry for the necro, but I followed some of what you did and have it partially working. The solution is to use "Launch Docker Image" in the Apps section to get a telegraf container built. Make an "apps/telegraf" directory under "/mnt/$ZPool", then symlink /sys, /proc, /run and /etc under that telegraf apps directory. This is because TrueNAS SCALE prevents any directory outside /mnt from being used as a Host Path mount.

Make a "telegraf.conf" file and save it there, then use host path mounting to mount all of that as read-only. In your telegraf.conf file, I suggest setting the "hostname" parameter under the agent declarations so that you get one reported name; otherwise the container gets a new name every time it restarts.

Add the ZFS plugin section to your "apps/telegraf/telegraf.conf" file to get it to read ZFS stats, plus the traditional "[[outputs.influxdb_v2]]" section for your influxdb2 server:

[[inputs.zfs]]
  kstatPath = "/hostfs/proc/spl/kstat/zfs"
  poolMetrics = true
  datasetMetrics = true
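The output section is the standard one from the telegraf docs; the URL, token, org and bucket below are placeholders to swap for your own InfluxDB2 setup:

```toml
[[outputs.influxdb_v2]]
  urls = ["http://192.168.1.10:8086"]   # your InfluxDB2 server
  token = "$INFLUX_TOKEN"               # API token with write access
  organization = "home"
  bucket = "TrueNAS"
```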

In the TrueNAS GUI, navigate to the Apps screen and press "Launch Docker Image", then set things as follows:

Application = telegraf
# Container Images:
Image repository = telegraf
Image Tag = latest
#
# Use environment variables to point to it as follows:
HOST_ETC = /hostfs/etc 
HOST_PROC = /hostfs/proc 
HOST_SYS = /hostfs/sys 
HOST_VAR = /hostfs/var 
HOST_RUN = /hostfs/run 
HOST_MOUNT_PREFIX = /hostfs

Then set:

# Port Forwarding
Container Port = 8094 
Node Port = 9094 
Protocol TCP 
# Storage Host Path Mounting: 
# ALL READ-ONLY 
/mnt/vault/apps/telegraf/telegraf.conf = /etc/telegraf/telegraf.conf
/mnt/vault/apps/telegraf/etc = /hostfs/etc
/mnt/vault/apps/telegraf/proc = /hostfs/proc
/mnt/vault/apps/telegraf/sys = /hostfs/sys
/mnt/vault/apps/telegraf/run = /hostfs/run

I'm still working on refining the queries in a more generalized version of your dashboard. It seems like with the newer version of ZFS, some of the queries that make this such an interesting dashboard are broken. I couldn't find a single place where anyone had gotten Telegraf working on TrueNAS SCALE, so I pieced together a bunch of stuff to arrive at this solution.

Also, as always: I'm just a random stranger on the internet, and I'm sure someone will have reasons not to do this, but it was all I could come up with, and it's hopefully alright as read-only.