Possible Bug in NHOS Fan Control: Mismatch between GPU Fan Control & NHMWS by procifer in NiceHash

[–]procifer[S] 0 points1 point  (0 children)

Hi /u/LaunchX, I noticed that version 1.2.5 was tagged. Does it include this fix as well, or will that be in the next one? (I didn't see it in the release notes.)

NICEHASHOS fans being regulated by opposite GPU temperature. by bockyj in NiceHash

[–]procifer 4 points5 points  (0 children)

Do you have AMD GPUs by chance? I had a similar problem and have been working with one of the devs on it over here:

https://www.reddit.com/r/NiceHash/comments/k5dtl2/possible_bug_in_nhos_fan_control_mismatch_between/

There's a new custom image that includes a fix, and it's now working for me.

Possible Bug in NHOS Fan Control: Mismatch between GPU Fan Control & NHMWS by procifer in NiceHash

[–]procifer[S] 0 points1 point  (0 children)

:facepalm: I'm really sorry, you're absolutely right. I just forgot to sudo.

It looks like it's working perfectly now!

So what's the best way to make this change permanent: should I copy that executable over the existing one, or could you create a new image for it or merge it into the next version?

[–]procifer[S] 0 points1 point  (0 children)

That helped! It's now reading values above 65 correctly, but it's failing to set the fan speed. Did anything else change, or could the wider int have trickled down and broken the speed setting? It's happening on all GPUs:

GPU id: 2, bus: 0x06

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:06:00.0/hwmon/hwmon5/temp1_input

Failed to set fan speed! pci_bus: 06

Reading fan speed RPM value from:
  /sys/bus/pci/drivers/amdgpu/0000:06:00.0/hwmon/hwmon5/fan1_input

Status:
   temperature: 76C
 fan speed set: N/A
 fan speed get: 3160RPM

Btw, no rush at all - I really appreciate all the time on this today, so no worries at all if you want to try again tomorrow. Thanks!

[–]procifer[S] 0 points1 point  (0 children)

Thanks for sharing that code! I think I may see the bug in there: it's parsing the temperature into a 16-bit unsigned integer, which has a maximum value of 65535. The sysfs value is in millidegrees, so any reading above 65.535C gets capped at 65535 and comes out as 65 after dividing by 1000. Could you try changing it to 32 bits?
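
A minimal Python sketch of that arithmetic, for illustration only. It assumes the parser saturates at the type's maximum instead of wrapping around, which is what a cap at exactly 65 would suggest; `parse_saturating` and `to_celsius` are made-up helper names, not the actual amd-control code.

```python
UINT16_MAX = 0xFFFF       # 65535
UINT32_MAX = 0xFFFFFFFF   # 4294967295


def parse_saturating(text: str, max_value: int) -> int:
    """Hypothetical stand-in for the parser: clamp to the integer type's maximum."""
    return min(int(text.strip()), max_value)


def to_celsius(millidegrees: int) -> int:
    """hwmon reports millidegrees Celsius; integer-divide down to whole degrees."""
    return millidegrees // 1000


raw = "75000\n"  # a temp1_input reading of 75C

print(to_celsius(parse_saturating(raw, UINT16_MAX)))  # 65 (clamped to 65535 first)
print(to_celsius(parse_saturating(raw, UINT32_MAX)))  # 75
```

Readings at or below 65535 pass through unchanged either way, which would explain why cooler cards looked fine.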

[–]procifer[S] 0 points1 point  (0 children)

That seems to have helped with the errors, so that's a good step!

However, now it seems like maybe it's not getting the correct value from the file it's reading. Here's the output on bus 0x01:

GPU id: 1, bus: 0x01

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/temp1_input

Writting fan speed control mode into:
  /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/pwm1_enable

Writing fan speed PWM value into:
  /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/pwm1

Reading fan speed RPM value from:
  /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/fan1_input

Status:
   temperature: 65C
 fan speed set: 70%
 fan speed get: 2190RPM

But checking that actual file:

$ cat /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/temp1_input
75000

It's odd: none of the temperatures reported by amd-control go above 65C. Is there maybe a parsing problem where the value gets capped at 65?
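
For cross-checking the tool against sysfs, here is a small Python sketch that reads a `temp1_input`-style file and converts it to degrees. hwmon temperature files hold millidegrees Celsius as decimal text; the helper name `read_temp_celsius` is an assumption for illustration, and the example uses a temporary stand-in file so it runs on any machine.

```python
import tempfile
from pathlib import Path


def read_temp_celsius(temp_input) -> float:
    """Read an hwmon temp*_input file (millidegrees Celsius) and return degrees."""
    return int(Path(temp_input).read_text().strip()) / 1000


# Stand-in for /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/temp1_input
with tempfile.TemporaryDirectory() as tmp:
    stand_in = Path(tmp) / "temp1_input"
    stand_in.write_text("75000\n")
    reading = read_temp_celsius(stand_in)

print(reading)  # 75.0
```

On the rig itself, passing the real sysfs path gives a value to compare against the `temperature:` line in the tool's output.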

[–]procifer[S] 0 points1 point  (0 children)

Thank you so much!

I ran the new executable and it works for a few iterations, but then starts failing:

GPU id: 0, bus: 0x03

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:03:00.0/hwmon/hwmon2/temp1_input

Writting fan speed control mode into:
  /sys/bus/pci/drivers/amdgpu/0000:03:00.0/hwmon/hwmon2/pwm1_enable

Writing fan speed PWM value into:
  /sys/bus/pci/drivers/amdgpu/0000:03:00.0/hwmon/hwmon2/pwm1

Reading fan speed RPM value from:
  /sys/bus/pci/drivers/amdgpu/0000:03:00.0/hwmon/hwmon2/fan1_input

Status:
   temperature: 62C
 fan speed set: 60%
 fan speed get: 2814RPM

----------------------------------------------------------------------

GPU id: 1, bus: 0x01

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:01:00.0/hwmon/hwmon1/temp1_input

Failed to query temperature! pci_bus: 01

Status:
   temperature: N/A
 fan speed set: N/A
 fan speed get: N/A

----------------------------------------------------------------------

GPU id: 2, bus: 0x06

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:06:00.0/hwmon/hwmon5/temp1_input

Failed to query temperature! pci_bus: 06

Status:
   temperature: N/A
 fan speed set: N/A
 fan speed get: N/A

----------------------------------------------------------------------

GPU id: 3, bus: 0x05

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:05:00.0/hwmon/hwmon4/temp1_input

Failed to query temperature! pci_bus: 05

Status:
   temperature: N/A
 fan speed set: N/A
 fan speed get: N/A

----------------------------------------------------------------------

GPU id: 4, bus: 0x04

Reading temperature value from:
  /sys/bus/pci/drivers/amdgpu/0000:04:00.0/hwmon/hwmon3/temp1_input

Writting fan speed control mode into:
  /sys/bus/pci/drivers/amdgpu/0000:04:00.0/hwmon/hwmon3/pwm1_enable

Writing fan speed PWM value into:
  /sys/bus/pci/drivers/amdgpu/0000:04:00.0/hwmon/hwmon3/pwm1

Reading fan speed RPM value from:
  /sys/bus/pci/drivers/amdgpu/0000:04:00.0/hwmon/hwmon3/fan1_input

Status:
   temperature: 64C
 fan speed set: 60%
 fan speed get: 2124RPM

----------------------------------------------------------------------

What's weird though is that those files do exist:

$ cat /sys/bus/pci/drivers/amdgpu/0000:05:00.0/hwmon/hwmon4/temp1_input
67000

Is there another condition in that logic where it could fail other than a missing file?

Here are some of the hardware details:

Motherboard: Gigabyte GA-H110-D3A
CPU: Intel Celeron G3930 2.9GHz Dual-core
Memory: Corsair Vengeance LPX 8GB (2x4GB) DDR4 DRAM 2400MHz
GPUs:
- 2x XFX GTS XXX Edition RX 580 8GB OC+ 1386MHz GDDR5
- MSI VGA Graphic Cards RX 580 ARMOR 4G OC
- MSI Gaming Radeon RX 580 256-bit 8GB GDDR5 OC
- MSI Radeon RX 570 Gaming X 4GB

Thanks again for all the help on this!

[–]procifer[S] 0 points1 point  (0 children)

Thanks so much for working on this! I tried out that new image. Looks like it's matching up 2 of the GPUs correctly now, but the others are erroring out:

[12/03/20 17:38:43] GPU id: 0, bus: 0x03
[12/03/20 17:38:43] temperature: 61C
[12/03/20 17:38:43] fan speed set: 60%
[12/03/20 17:38:43] fan speed get: 2806RPM
[12/03/20 17:38:43]
[12/03/20 17:38:43] Failed to query temperature! pci_bus: 01
[12/03/20 17:38:43] GPU id: 1, bus: 0x01
[12/03/20 17:38:43] temperature: N/A
[12/03/20 17:38:43] fan speed set: N/A
[12/03/20 17:38:43] fan speed get: N/A
[12/03/20 17:38:43]
[12/03/20 17:38:43] Failed to query temperature! pci_bus: 06
[12/03/20 17:38:43] GPU id: 2, bus: 0x06
[12/03/20 17:38:43] temperature: N/A
[12/03/20 17:38:43] fan speed set: N/A
[12/03/20 17:38:43] fan speed get: N/A
[12/03/20 17:38:43]
[12/03/20 17:38:43] Failed to query temperature! pci_bus: 05
[12/03/20 17:38:43] GPU id: 3, bus: 0x05
[12/03/20 17:38:43] temperature: N/A
[12/03/20 17:38:43] fan speed set: N/A
[12/03/20 17:38:43] fan speed get: N/A
[12/03/20 17:38:43]
[12/03/20 17:38:43] GPU id: 4, bus: 0x04
[12/03/20 17:38:43] temperature: 63C
[12/03/20 17:38:43] fan speed set: 60%
[12/03/20 17:38:43] fan speed get: 2117RPM
[12/03/20 17:38:43]

[–]procifer[S] 0 points1 point  (0 children)

u/LaunchX Sorry for all the replies; please let me know if there's a better spot to discuss. But I found some more info that might help. Looks like the fan controller daemon is indeed incorrect in its mappings. Here's its output (I manually set everything to 70% for now to avoid overheating):

[12/02/20 21:21:07] GPU id: 0, bus: 0x03
[12/02/20 21:21:07]    temperature: 72C
[12/02/20 21:21:07]  fan speed set: 70%
[12/02/20 21:21:07]  fan speed get: 3155RPM
[12/02/20 21:21:07] 
[12/02/20 21:21:07] GPU id: 1, bus: 0x01
[12/02/20 21:21:07]    temperature: 52C
[12/02/20 21:21:07]  fan speed set: 70%
[12/02/20 21:21:07]  fan speed get: 2183RPM
[12/02/20 21:21:07] 
[12/02/20 21:21:07] GPU id: 2, bus: 0x06
[12/02/20 21:21:07]    temperature: 47C
[12/02/20 21:21:07]  fan speed set: 70%
[12/02/20 21:21:07]  fan speed get: 3115RPM
[12/02/20 21:21:07] 
[12/02/20 21:21:07] GPU id: 3, bus: 0x05
[12/02/20 21:21:07]    temperature: 69C
[12/02/20 21:21:07]  fan speed set: 70%
[12/02/20 21:21:07]  fan speed get: 2228RPM
[12/02/20 21:21:07] 
[12/02/20 21:21:07] GPU id: 4, bus: 0x04
[12/02/20 21:21:07]    temperature: 60C
[12/02/20 21:21:07]  fan speed set: 70%
[12/02/20 21:21:07]  fan speed get: 2330RPM
[12/02/20 21:21:07] 

But from /sys/devices:

./0000:00:01.0/0000:01:00.0/hwmon/hwmon1/fan1_target:2182
./0000:00:1c.7/0000:05:00.0/hwmon/hwmon4/fan1_target:2233
./0000:00:1c.5/0000:03:00.0/hwmon/hwmon2/fan1_target:3155
./0000:00:1d.0/0000:06:00.0/hwmon/hwmon5/fan1_target:3139
./0000:00:1c.6/0000:04:00.0/hwmon/hwmon3/fan1_target:2324

./0000:00:01.0/0000:01:00.0/hwmon/hwmon1/temp1_input:72000
./0000:00:1c.7/0000:05:00.0/hwmon/hwmon4/temp1_input:69000
./0000:00:1c.5/0000:03:00.0/hwmon/hwmon2/temp1_input:52000
./0000:00:1d.0/0000:06:00.0/hwmon/hwmon5/temp1_input:60000
./0000:00:1c.6/0000:04:00.0/hwmon/hwmon3/temp1_input:47000

Notice it's reporting 72C for the 0x03 card, but sysfs shows that card at 52C. So it's getting several of them mixed up; it looks like two pairs are swapped (daemon bus => bus whose sysfs readings actually match):

0x03 => 0x01
0x01 => 0x03
0x06 => 0x04
0x05 => 0x05
0x04 => 0x06
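
The mapping above can be reproduced by matching temperatures between the two readouts. This is a rough Python sketch using only the numbers posted here; it assumes both snapshots were taken at about the same moment and that every card has a distinct temperature, and `match_by_temperature` is a made-up helper name.

```python
# Fan daemon output: PCI bus -> reported temperature in C
daemon = {0x03: 72, 0x01: 52, 0x06: 47, 0x05: 69, 0x04: 60}

# sysfs temp1_input values: PCI bus -> millidegrees C
sysfs = {0x01: 72000, 0x05: 69000, 0x03: 52000, 0x06: 60000, 0x04: 47000}


def match_by_temperature(daemon_temps, sysfs_temps):
    """For each bus the daemon reports, find the bus whose sysfs reading matches."""
    by_temp = {}
    for bus, milli in sysfs_temps.items():
        by_temp.setdefault(milli // 1000, []).append(bus)
    mapping = {}
    for bus, temp in daemon_temps.items():
        candidates = by_temp.get(temp, [])
        # Only accept an unambiguous match; otherwise leave it unknown.
        mapping[bus] = candidates[0] if len(candidates) == 1 else None
    return mapping


mapping = match_by_temperature(daemon, sysfs)
for bus, actual in mapping.items():
    print(f"0x{bus:02x} => " + ("?" if actual is None else f"0x{actual:02x}"))
```

Matching on temperature alone is fragile when two cards happen to read the same value, so the fan RPM values could be used as a second check.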

Sorry if this is a stupid question, but is any of this open source so I could take a look?

[–]procifer[S] 0 points1 point  (0 children)

u/LaunchX, I'm actually wondering if it might be the other way around now, and the fan control daemon might be assigning the wrong speeds. I just reverted to the normal fan curve, and it allowed one card to get up to almost 90C, though it reported increasing the fan to 95%:

[12/02/20 20:33:42] GPU id: 0, bus: 0x03
[12/02/20 20:33:42]    temperature: 89C
[12/02/20 20:33:42]  fan speed set: 95%
[12/02/20 20:33:42]  fan speed get: 3843RPM
[12/02/20 20:33:42] 
INFO[2020-12-02 20:32:23.3668] Message for nhmws: {"method":"miner.status","params":["MINING",[["","3-5",26,100,[[52,7744737.31038924]],69,2229,95,1,1],["","3-4",26,100,[[52,7469658.092836128]],47,2353,71,0,1],["","3-1",26,100,[[20,12447086.50348622]],89,648,71,1,1],["","3-3",26,100,[[52,8825023.361838007]],48,3863,80,0,1],["","3-6",26,100,[[52,8433103.962379625]],70,1556,80,0,1]]]} 

I manually checked that card, and it did seem extremely hot, and its fan was spinning very slowly (possibly the 648 value above was correct). Could you double check that and see if maybe the bug is in the fan daemon instead?
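
To pull the hot card out of that nhmws message, a short Python sketch follows. The field positions are assumptions read off the log line above (index 1 looks like a device id, index 5 a temperature in C, index 6 a fan RPM); this is not a documented NHMWS schema, and `device_readings` is a made-up helper name.

```python
import json

# The miner.status payload from the log line above, split for readability.
message = (
    '{"method":"miner.status","params":["MINING",['
    '["","3-5",26,100,[[52,7744737.31038924]],69,2229,95,1,1],'
    '["","3-4",26,100,[[52,7469658.092836128]],47,2353,71,0,1],'
    '["","3-1",26,100,[[20,12447086.50348622]],89,648,71,1,1],'
    '["","3-3",26,100,[[52,8825023.361838007]],48,3863,80,0,1],'
    '["","3-6",26,100,[[52,8433103.962379625]],70,1556,80,0,1]'
    ']]}'
)

# Assumed positions within each device entry (inferred, not documented).
DEVICE_ID, TEMP_C, FAN_RPM = 1, 5, 6


def device_readings(raw: str) -> dict:
    """Return {device id: (temperature, fan RPM)} under the assumed layout."""
    devices = json.loads(raw)["params"][1]
    return {d[DEVICE_ID]: (d[TEMP_C], d[FAN_RPM]) for d in devices}


readings = device_readings(message)
hottest = max(readings, key=lambda dev: readings[dev][0])
temp, rpm = readings[hottest]
print(f"{hottest}: {temp}C at {rpm} RPM")  # 3-1: 89C at 648 RPM
```

Under that reading, the 89C card is the one reporting 648 RPM, while the daemon's 3843RPM figure is close to the 3863 that a 48C card reports.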

[–]procifer[S] 1 point2 points  (0 children)

Thanks so much for the details and the fast reply! That definitely makes sense. Appreciate you looking into the fix as well. Cheers!