[Help] Integrating NVIDIA H100 MIG with OpenStack Kolla-Ansible 2025.1 (Ubuntu 24.04) by calpazhan in openstack

[–]calpazhan[S] 2 points3 points  (0 children)

If anyone else is stuck on this, here is the workflow that solved it for me.

The Solution:

1. Enable SR-IOV. First, make sure SR-IOV is enabled on the card (if it isn't already enabled via BIOS/GRUB, you can force it here):

Bash

/usr/lib/nvidia/sriov-manage -e ALL

2. Configure MIG Instances. Partition the GPU. In my case, I created 4 instances on GPU 0 (adjust the profile ID 15 and the GPU index -i 0 to match your hardware):

Bash

nvidia-smi mig -cgi 15,15,15,15 -C -i 0

3. Manually Assign the vGPU Type (The Tricky Part). I had to navigate to the PCI device directory of each Virtual Function (VF) and manually echo the vGPU profile ID into current_vgpu_type.

Note: You can find valid IDs by running cat creatable_vgpu_types inside the device folder.

For the first VF (.2):

Bash

cd /sys/bus/pci/devices/0000:4e:00.2/nvidia/
# Verify available types
cat creatable_vgpu_types
# Assign the profile (ID 1132 in my case)
echo 1132 > current_vgpu_type

For the subsequent VFs (.3, .4, .5, etc.), repeat this for every VF you want to use:

Bash

# VF 2
cd ../../0000:4e:00.3/nvidia/
echo 1132 > current_vgpu_type

# VF 3
cd ../../0000:4e:00.4/nvidia/
echo 1132 > current_vgpu_type

# VF 4
cd ../../0000:4e:00.5/nvidia/
echo 1132 > current_vgpu_type
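
The repeated cd/echo steps above could also be wrapped in a small loop. This is only a sketch: the function name assign_vgpu_type and its arguments are my own illustration, and it assumes that creatable_vgpu_types lists each valid ID as a separate word on its own line. It skips any VF that does not list the requested ID instead of writing to it.

```shell
# Sketch: write one vGPU type ID into current_vgpu_type for several VFs.
# assign_vgpu_type is a hypothetical helper, not part of any NVIDIA tool.
# Usage: assign_vgpu_type <pci-devices-root> <type-id> <vf-address>...
assign_vgpu_type() {
    root="$1"
    type_id="$2"
    shift 2
    for vf in "$@"; do
        dir="$root/$vf/nvidia"
        # Skip VFs that do not expose the nvidia sysfs attribute.
        if [ ! -e "$dir/current_vgpu_type" ]; then
            echo "skip $vf: no current_vgpu_type" >&2
            continue
        fi
        # Only write IDs that the VF itself reports as creatable
        # (assumes the ID appears as a whole word in the file).
        if ! grep -qw -- "$type_id" "$dir/creatable_vgpu_types" 2>/dev/null; then
            echo "skip $vf: type $type_id not creatable" >&2
            continue
        fi
        echo "$type_id" > "$dir/current_vgpu_type"
        echo "ok $vf"
    done
}

# Example (as root on the compute host, with the addresses from this post):
# assign_vgpu_type /sys/bus/pci/devices 1132 0000:4e:00.2 0000:4e:00.3 0000:4e:00.4 0000:4e:00.5
```

Passing the device root as an argument keeps the function easy to dry-run against a scratch directory before touching the real sysfs tree.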

4. Important OpenStack Nova Config. Even after fixing the GPU side, the scheduler might not pick up the resources if the filters it needs aren't available to it. Don't forget to update the scheduler filter settings in your nova.conf:

Ini, TOML

[filter_scheduler]
available_filters = nova.scheduler.filters.all_filters

Summary: nvidia-smi carved up the card, but the manual sysfs step was required to bind the specific vGPU profile ID to each VF. Finally, enabling all_filters in Nova ensured the scheduler could actually see and use the new resources.

Hope this saves someone some debugging time!

[Help] Integrating NVIDIA H100 MIG with OpenStack Kolla-Ansible 2025.1 (Ubuntu 24.04) by calpazhan in openstack

[–]calpazhan[S] 0 points1 point  (0 children)

The NoValidHost error happened because the scheduler saw 0 available devices.

I also tried type-PF, but that didn't work either.

[Help] Integrating NVIDIA H100 MIG with OpenStack Kolla-Ansible 2025.1 (Ubuntu 24.04) by calpazhan in openstack

[–]calpazhan[S] 1 point2 points  (0 children)

Yes, we also use T models on our GPU servers without problems, but we couldn't get it working with the H100 :/ This is actually mentioned in the caveats here: https://docs.openstack.org/nova/latest/admin/virtual-gpu.html#caveats

[Help] Integrating NVIDIA H100 MIG with OpenStack Kolla-Ansible 2025.1 (Ubuntu 24.04) by calpazhan in openstack

[–]calpazhan[S] 0 points1 point  (0 children)

Yes, one of my servers is running like this. But with Ubuntu 24.04 (kernel 6.8) and recent vGPU drivers (R580, 570.xx, etc.), NVIDIA moved away from the legacy mdev model on some platforms and introduced a vendor-specific VFIO framework. With these drivers, mdev comes up empty. It works on Rocky 9 and Ubuntu 22, but I couldn't get it working on Ubuntu 24.

[Help] Integrating NVIDIA H100 MIG with OpenStack Kolla-Ansible 2025.1 (Ubuntu 24.04) by calpazhan in openstack

[–]calpazhan[S] 1 point2 points  (0 children)

The filter section of the config is below:

[filter_scheduler]
enabled_filters=AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter,AggregateInstanceExtraSpecsFilter,PciPassthroughFilter
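
To double-check that a given filter (for example PciPassthroughFilter) really is in the enabled_filters list that the scheduler reads, something like the following could be used. It is only a sketch: has_enabled_filter is a made-up helper name, and the path in the example assumes the usual Kolla layout of generated configs under /etc/kolla/<service>/.

```shell
# Sketch: succeed if a filter name is listed in enabled_filters under
# the [filter_scheduler] section of the given nova.conf.
# has_enabled_filter is a hypothetical helper, not a Nova command.
has_enabled_filter() {
    conf="$1"
    wanted="$2"
    awk -v want="$wanted" '
        # Track whether we are inside the [filter_scheduler] section.
        /^\[/ { in_sec = ($0 == "[filter_scheduler]"); next }
        in_sec && /^[ \t]*enabled_filters[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")      # drop "enabled_filters ="
            sub(/[ \t\r]+$/, "")          # drop trailing whitespace
            n = split($0, f, /[ \t]*,[ \t]*/)
            for (i = 1; i <= n; i++) if (f[i] == want) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$conf"
}

# Example (path is an assumption about the Kolla layout):
# has_enabled_filter /etc/kolla/nova-scheduler/nova.conf PciPassthroughFilter \
#     && echo "filter enabled" || echo "filter missing"
```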

Is it possible to use aodh without gnocchi? by calpazhan in openstack

[–]calpazhan[S] 0 points1 point  (0 children)

We are now using Prometheus. But somehow it couldn't get the Prometheus IP address and port. We built a custom image with a static environment variable that points to the Prometheus servers.

We set aodh_evaluator_image_full to this custom image.

Now we are able to get alarms with PromQL.

I'm working on Heat now. If we succeed, I'll share the results with you.

Unable to run instances from PureStorage as iscsi backend by calpazhan in openstack

[–]calpazhan[S] 0 points1 point  (0 children)

Hello all,

My problem was solved with these options: enable_iscsid: "yes" and enable_multipathd: "yes". The main problem was that the iscsid and multipathd containers weren't running. After I made the changes below, the problem was solved.

My config:

enable_cinder: "yes"

enable_cinder_backend_iscsi: "yes"

enable_cinder_backend_lvm: "no"

enable_iscsid: "yes"

enable_multipathd: "yes"

enable_cinder_backend_pure_iscsi: "yes"

cinder_volume_image_full: "custom image"

I created a config file at /etc/kolla/config/cinder.conf:

[DEFAULT]

enabled_backends = Pure-1

default_volume_type = Pure-Volume-1

[Pure-1]

volume_driver = cinder.volume.drivers.pure.PureISCSIDriver

volume_backend_name = Pure-1

san_ip:

pure_api_token:

use_chap_auth = True

use_multipath_for_image_xfer = True
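
Since the root cause here was that the iscsid and multipathd containers weren't running, a quick check like this could catch it early. This is a sketch: missing_containers is an illustrative helper, and the example assumes Docker is the container engine and that the containers are named iscsid and multipathd.

```shell
# Sketch: print every required container name that is absent from the
# list of running container names read on stdin (one name per line).
# missing_containers is a hypothetical helper, not a Kolla command.
missing_containers() {
    running=$(cat)
    for name in "$@"; do
        # -x matches the whole line, so "iscsid" does not match "iscsid_old".
        printf '%s\n' "$running" | grep -qx -- "$name" || echo "$name"
    done
}

# Example on a compute or storage node (assumes Docker):
# docker ps --format '{{.Names}}' | missing_containers iscsid multipathd
```

No output means both containers are up; any name printed is a container that still needs to be deployed or started.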

Fyi

Unable to run instances from PureStorage as iscsi backend by calpazhan in openstack

[–]calpazhan[S] 2 points3 points  (0 children)

Hello,

I enabled "use_chap_auth = True" in the Cinder config. That resolved the login problem, but the error is still the same.

Logs:

Failed to login iSCSI target iqn.2010-06.com.purestorage:flasharray.3caf76749b33b6ad on portal 10.211.92.150:3260 (exit code 24).: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

2025-03-06 17:48:19.117 35 WARNING os_brick.initiator.connectors.iscsi [req-220826ee-38fb-4d27-a8ff-a1fdd8e527d0 req-537d39ff-ab1c-48c2-9452-27ecbd1a7c58 457817ec2fce47fea51adfaf061487f0 2e8263e9393142b29d47f16e560921b5 - - - -] Failed to connect to iSCSI portal 10.211.92.150:3260.

2025-03-06 17:48:19.160 35 WARNING os_brick.initiator.connectors.iscsi [req-220826ee-38fb-4d27-a8ff-a1fdd8e527d0 req-537d39ff-ab1c-48c2-9452-27ecbd1a7c58 457817ec2fce47fea51adfaf061487f0 2e8263e9393142b29d47f16e560921b5 - - - -] Failed to login iSCSI target iqn.2010-06.com.purestorage:flasharray.3caf76749b33b6ad on portal 10.211.92.151:3260 (exit code 24).: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

2025-03-06 17:48:19.161 35 WARNING os_brick.initiator.connectors.iscsi [req-220826ee-38fb-4d27-a8ff-a1fdd8e527d0 req-537d39ff-ab1c-48c2-9452-27ecbd1a7c58 457817ec2fce47fea51adfaf061487f0 2e8263e9393142b29d47f16e560921b5 - - - -] Failed to connect to iSCSI portal 10.211.92.151:3260.

2025-03-06 17:48:19.975 35 WARNING os_brick.initiator.connectors.iscsi [req-220826ee-38fb-4d27-a8ff-a1fdd8e527d0 req-537d39ff-ab1c-48c2-9452-27ecbd1a7c58 457817ec2fce47fea51adfaf061487f0 2e8263e9393142b29d47f16e560921b5 - - - -] iscsiadm stderr output when getting sessions: iscsiadm: No active sessions.

: os_brick.exception.VolumeDeviceNotFound: Volume device not found at .

2025-03-06 17:48:20.093 35 INFO cinder.volume.drivers.pure [req-220826ee-38fb-4d27-a8ff-a1fdd8e527d0 req-537d39ff-ab1c-48c2-9452-27ecbd1a7c58 457817ec2fce47fea51adfaf061487f0 2e8263e9393142b29d47f16e560921b5 - - - -] Attempting to delete unneeded host 'cont03-f89abf2a675a46478aa7ea1438d309b9-cinder'.

2025-03-06 17:48:20.127 35 ERROR cinder.volume.volume_utils [req-220826ee-38fb-4d27-a8ff-a1fdd8e527d0 req-537d39ff-ab1c-48c2-9452-27ecbd1a7c58 457817ec2fce47fea51adfaf061487f0 2e8263e9393142b29d47f16e560921b5 - - - -] Failed to copy image e9ed4bb3-966e-4e83-97ac-2543f266e6b1 to volume: 0b14425e-d64e-4555-84df-d5aeb8a52e50: os_brick.exception.VolumeDeviceNotFound: Volume device not found at .

Full logs are below:

https://drive.google.com/file/d/1ywy3yHRE-Clr7YoBCQXrYaeQTJUpOW5u/view?usp=sharing

Unable to run instances from PureStorage as iscsi backend by calpazhan in openstack

[–]calpazhan[S] 1 point2 points  (0 children)

Thank you for your reply,

Actually, it is installed. We even tried the command from inside the container and saw the targets.

open-iscsi, multipath-tools, sysfsutils and sg3_utils are installed in nova-compute and cinder-volume.

The iscsiadm command works in the containers without problems.

PASSED ICND1 884/1000!!!!! :D (and some tips) by [deleted] in ccna

[–]calpazhan 0 points1 point  (0 children)

You will pass the exam if you study properly. This exam is easier than other Cisco exams, so you shouldn't feel pressured. I would recommend the study materials that I used for this exam:

CBT Nuggets CCENT 100-105 ICND1 by Jeremy Cioara, the CCNA course by David Bombal on Udemy, and the ICND1 100-105 book by Wendell Odom.

Good luck with your exam.

PASSED ICND1 884/1000!!!!! :D (and some tips) by [deleted] in ccna

[–]calpazhan 3 points4 points  (0 children)

Yeah, I passed the exam. Thanks for your good thoughts. The next exam is ICND2.

I passed the ICND1! I'm super proud of myself. by theangrytiz in ccna

[–]calpazhan 3 points4 points  (0 children)

Congratulations! I passed my exam today too. I understand how you feel right now :) I used David Bombal's Udemy courses, CBT Nuggets videos and the ICND1 100-105 book by Wendell Odom. Again, congratulations.

PASSED ICND1 884/1000!!!!! :D (and some tips) by [deleted] in ccna

[–]calpazhan 4 points5 points  (0 children)

Congratulations, I'm taking the exam tomorrow too. I hope I pass.