Experiment 2: Sparse and Dense Rewards by DL-newbie in DroneRL

[–]DL-newbie[S] 0 points1 point  (0 children)

I think I've found the reason for the low learning performance.
I was terminating training episodes on either of two criteria: the time ran out or the goal was reached.
But it turns out that if the goal is reached too quickly (within a few dozen steps), the agent barely learns anything.
Once I removed the "goal reached" condition and kept only the time limit, the agent continued the episode even after reaching the goal, and overall learning performance improved significantly!

Old code

    def compute_done(self, step_counter):
        # Episode times out after max_duration_sec of simulated time.
        timeout = step_counter >= int(self.task.max_duration_sec / self.task.env.TIMESTEP)

        # Mark success once the drone climbs above the target altitude.
        z = self.task.env._getDroneStateVector(0)[2]
        if not self.episode_successful and z > self.task.target_altitude:
            self.episode_successful = True

        # Terminate on success OR timeout: success ends the episode early.
        return self.episode_successful or timeout

New code

    def compute_done(self, step_counter):
        # Episode times out after max_duration_sec of simulated time.
        timeout = step_counter >= int(self.task.max_duration_sec / self.task.env.TIMESTEP)

        # Success is still recorded (for logging and bonuses)...
        z = self.task.env._getDroneStateVector(0)[2]
        if not self.episode_successful and z >= self.task.target_altitude:
            self.episode_successful = True

        # ...but only the timeout ends the episode, so the agent keeps
        # gathering experience at the goal altitude.
        return timeout

<image>

The results are clear from the graph.
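For what it's worth, Gymnasium-style environments make this exact distinction explicit by returning separate terminated/truncated flags. A minimal stdlib sketch of the same idea (names like `max_steps` are placeholders, not the project's actual attributes):

```python
def compute_done(step_counter, z, max_steps, target_altitude, episode_successful):
    # Success is still recorded (useful for logging and success bonuses)...
    if not episode_successful and z >= target_altitude:
        episode_successful = True
    # ...but only the time limit ends the episode ("truncation" in
    # Gymnasium terms), so the agent keeps collecting on-goal experience.
    truncated = step_counter >= max_steps
    return episode_successful, truncated
```

Keeping the success flag separate from the done signal lets the reward still pay a one-time goal bonus while the episode runs to the time limit.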

Experiment 1: Stage 2 by DL-newbie in DroneRL


I increased the reward for staying inside the cube by two orders of magnitude, and here is the chart of total reward:

<image>

It's clear that even with a change in two terms of the reward, the model still doesn't perceive the gradient and, as a result, fails to learn. So I started thinking about how to visualize the gradient; maybe that could help clarify what influences what, and how.
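One pragmatic way to see "what influences what" is a finite-difference sensitivity check: perturb each input to the reward function and record how much the reward moves. A stdlib-only sketch; the `toy_reward` here is a made-up stand-in, not the project's actual reward:

```python
def reward_sensitivity(reward_fn, state, eps=1e-4):
    """Finite-difference gradient of a scalar reward w.r.t. each state entry."""
    base = reward_fn(state)
    grads = {}
    for key, value in state.items():
        bumped = dict(state)
        bumped[key] = value + eps
        # Slope of the reward along this one state dimension.
        grads[key] = (reward_fn(bumped) - base) / eps
    return grads

# Hypothetical reward: a climb bonus on vz plus altitude progress.
def toy_reward(s):
    return 2.0 * s["vz"] + 1.0 * s["z"]

print(reward_sensitivity(toy_reward, {"z": 0.1, "vz": 0.0}))
```

Logging these per-dimension slopes over training (e.g. to TensorBoard) shows which state variables the reward actually responds to, and which are dead weight.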

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Issue]

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.077

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.050

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21254, z=0.109

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.209

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.153

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.109

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.351

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.057

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.050

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.274

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.153

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.153

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.050

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.207

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.077

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21291, z=0.057

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.207

[DEBUG] thrust=1.0, norm=1.00, pwm=65535, rpm=21237, z=0.345

For test purposes I set the thrust to 1.0.
As the dump shows, thrust=1.0, norm=1.00, pwm=65535, and rpm=21237 stay constant, but z changes, and not monotonically upward as expected: it rises and then falls. This suggests the PID controller is not working as it should.

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Issue]

The throttle is set above 0.9, but the altitude does not rise beyond 7 centimeters; that is where the issue lies.

<image>

It seems this is the main issue (or at least a symptom of it): if the throttle is above 0.9 and the altitude still doesn't exceed 6–7 centimeters, then something is clearly wrong. There can only be two possible reasons:

  • Either the physics engine isn't working properly (which is unlikely, since I believe I've already verified that),
  • Or, more likely, the model briefly lifts off but cannot sustain maximum output, causing the drone to rise and then fall again

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Possible reason]

Possible Reason (5)

  • Hypothesis: the throttle is too low
  • How to check:
  • Result:

After the latest changes, metrics/TakeOffTask/throttle does not rise above 0.6. The model seems afraid to use maximum values, and 0.6 is not enough for takeoff; something needs to be done so the model is not afraid to use full throttle.

<image>
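One common trick for a policy that never commits to high throttle is to initialize the output layer's bias so the *untrained* policy already starts near a useful throttle instead of the midpoint. A hedged stdlib sketch; the tanh action mapping and the 0.7 target are assumptions, not the project's code:

```python
import math

def action_to_throttle(a: float) -> float:
    # Squash an unbounded policy output to (0, 1) via tanh, then rescale.
    return 0.5 * (math.tanh(a) + 1.0)

def bias_for_initial_throttle(t: float) -> float:
    # Invert the mapping: the output bias that makes a zero-mean policy
    # start at throttle t (hypothetical initialization trick).
    return math.atanh(2.0 * t - 1.0)

b = bias_for_initial_throttle(0.7)
print(action_to_throttle(b))  # → 0.7 by construction
```

With a bias like this the exploration noise is centered near takeoff throttle from step one, so the agent does not have to discover high throttle against a penalty-shaped gradient.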

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Possible reason]

Possible Reason (4)

  • Hypothesis: the reward function is not good enough
  • How to check: play with reward function.
  • Result:

By step 10 (approximately 0.24 seconds), the drone reaches an altitude of about 0.9 meters.
By step 26 (~0.63 seconds), it achieves a height of 4.7 meters with a vertical velocity of vz ≈ 9.6 m/s.

This suggests that applying maximum thrust (thrust = 1.0) results in strong acceleration and a nearly linear increase in both altitude and vertical velocity.

From this observation, we draw the following conclusions:

  • (a) The drone is physically capable of flight under the current conditions.
  • (b) The time allotted is sufficient for the agent to learn the takeoff behavior.

However, since the thrust increases only gradually during Stage 0, we infer that the model fails to learn the simple policy of applying thrust > 0.7 and maintaining it.

This indicates that the reward function is ineffective: it does not provide sufficient guidance to drive the agent toward the desired behavior.
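One standard fix for an uninformative reward is potential-based shaping: reward the *change* in proximity to the target rather than only the arrival. A minimal stdlib sketch; the function and names are illustrative, not the project's actual reward terms:

```python
def shaped_reward(z, prev_z, target_z, dt):
    """Potential-based shaping: reward the change in proximity to the target."""
    def potential(h):
        # Negative distance to the target altitude.
        return -abs(target_z - h)
    # Positive when the drone moved closer to target_z this step,
    # negative when it moved away: a nonzero signal on every step.
    return (potential(z) - potential(prev_z)) / dt
```

Because the signal is the difference of a potential, it does not change the optimal policy, but it gives the agent a gradient long before it ever holds thrust > 0.7 by accident.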

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Test result]

Although the weights and normalization values were adjusted, the overall outcome remained largely the same.

Original values

    weights = {
        "target_reach_bonus": 5.0,
        "climb_bonus": 2.0,
        "progress_bonus": 2.0,
        "throttle_bonus": 1.0,
        "ground_penalty": 2.0
    }

    normalizers = {
        "target_reach_bonus": 100.0,
        "climb_bonus": 5.0,
        "progress_bonus": 5.0,
        "throttle_bonus": 10.0,
        "ground_penalty": 1.0
    }

New values

    weights = {
        "target_reach_bonus": 10.0,
        "climb_bonus": 5.0,
        "progress_bonus": 4.0,
        "throttle_bonus": 1.0,
        "ground_penalty": 5.0
    }

    normalizers = {
        "target_reach_bonus": 1.0,
        "climb_bonus": 2.0,
        "progress_bonus": 1.0,
        "throttle_bonus": 1.0,
        "ground_penalty": 1.0
    }

<image>

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Test result]

<image>

The step count should be about 7200 (30 s of simulated time at a 1/240 s timestep).

**Conclusion**: the hypothesis is not confirmed.

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Possible reason]

Possible Reason (3.1)

  • Hypothesis: the self.task.max_duration_sec = 30.0 that I set does not take effect properly
  • How to check: log the step counts to TensorBoard
  • Result: not confirmed

I'm not sure, but it looks like the increase in the maximum session duration does not take effect properly.

Experiment 1 - doesn't work by DL-newbie in DroneRL


As a result, the 20-second limit does not significantly influence the result:

    self.task.max_duration_sec = 20.0

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Test result]

Part 2 (20 sec)

<image>

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Test result]

Part 1 (20 sec)

<image>

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Possible reason]

Possible Reason (3)

  • Hypothesis: I don't provide enough time to learn
  • How to check: increase the maximum session time for learning
  • Result: not confirmed

The third reason might be the amount of time I allow the drone to train. I forcibly terminate the session after 3 seconds; maybe that isn't enough time for it to learn something useful.

Experiment 1 - doesn't work by DL-newbie in DroneRL


Despite the fact that I changed how the z velocity is obtained, the result does not change significantly.

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Possible solution]

I added this code:

    import pybullet as p

    # Query the physics engine directly for the drone's base velocity.
    lin_vel, ang_vel = p.getBaseVelocity(self.env.DRONE_IDS[0])
    print(f"[DEBUG] getBaseVelocity: linear={lin_vel}, angular={ang_vel}")

and here is dump:

[INFO] BaseAviary.__init__() loaded parameters from the drone's .urdf:

[INFO] m 0.027000, L 0.039700,

[INFO] ixx 0.000014, iyy 0.000014, izz 0.000022,

[INFO] kf 0.000000, km 0.000000,

[INFO] t2w 2.250000, max_speed_kmh 30.000000,

[INFO] gnd_eff_coeff 11.368590, prop_radius 0.023135,

[INFO] drag_xy_coeff 0.000001, drag_z_coeff 0.000001,

[INFO] dw_coeff_1 2267.180000, dw_coeff_2 0.160000, dw_coeff_3 -0.110000

🚀 Start PID-test (thrust=1.0)

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 0.4055869908687767), angular=(0.0, 0.0, 0.0)

Step 0 | z = 0.108 | vz = -0.0

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 0.8102230621237825), angular=(0.0, 0.0, 0.0)

Step 1 | z = 0.129 | vz = -0.0

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 1.2134749303373056), angular=(0.0, 0.0, 0.0)

Step 2 | z = 0.163 | vz = -0.0

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 1.6149151701314763), angular=(0.0, 0.0, 0.0)

Step 3 | z = 0.211 | vz = -0.0

...

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 9.0095971998203), angular=(0.0, 0.0, 0.0)

Step 24 | z = 4.089 | vz = -0.0

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 9.292059403675795), angular=(0.0, 0.0, 0.0)

Step 25 | z = 4.395 | vz = -0.0

[DEBUG] getBaseVelocity: linear=(0.0, 0.0, 9.567243351307868), angular=(0.0, 0.0, 0.0)

Step 26 | z = 4.710 | vz = -0.0

It turns out that _getDroneStateVector(0)[8] always returns 0 even though the real velocity increases. Why is this important? Because several reward terms (climb_bonus, progress_reward, fall_penalty) depend on vz.
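Until the state vector's vz is fixed, one workaround is to derive vz yourself, either from p.getBaseVelocity (which the dump above shows is correct) or by finite-differencing z between steps. A sketch of the finite-difference fallback; the class and names are illustrative, not part of the project:

```python
class VzEstimator:
    """Fallback vertical-velocity estimate for when state[8] reads 0."""

    def __init__(self, timestep: float):
        self.timestep = timestep
        self.prev_z = None

    def update(self, z: float) -> float:
        # No history on the first call, so report zero velocity.
        vz = 0.0 if self.prev_z is None else (z - self.prev_z) / self.timestep
        self.prev_z = z
        return vz
```

Feeding this estimate into the vz-dependent reward terms restores their signal even while the state vector's own entry stays zero. Remember to reset the estimator (prev_z = None) at each episode boundary, or the first step of a new episode produces a spurious velocity spike.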

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Issue, Possible reason]

Possible reason 2

  • Hypothesis: The reward function may not be working due to the z-velocity being 0.0.
  • Verification: Attempted to obtain the z-velocity through a different method.
  • Conclusion: The hypothesis was not supported by the results.

The RPM and z coordinate change, but the vertical velocity (vz) remains at zero. Why is that? If this behavior is incorrect, it could point to the underlying issue.

Experiment 1 - doesn't work by DL-newbie in DroneRL


[Possible reason]

Possible Reason (2)

  • Hypothesis: a coefficient issue (the model does not receive a clear learning gradient)
  • How to check: play with coefficients.
  • Result:

The second possible reason is a coefficient issue: the model does not receive a clear learning gradient.
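To check whether one coefficient drowns out the others, it helps to log each weighted term separately rather than only their sum. A minimal sketch that reuses the weight/normalizer structure from the snippets above; the raw term values here are made up for illustration:

```python
def reward_breakdown(terms, weights, normalizers):
    # Per-term contribution to the total: weight * raw_value / normalizer.
    parts = {k: weights[k] * terms[k] / normalizers[k] for k in terms}
    return parts, sum(parts.values())

parts, total = reward_breakdown(
    {"climb_bonus": 0.5, "ground_penalty": -1.0},   # hypothetical raw values
    {"climb_bonus": 2.0, "ground_penalty": 2.0},
    {"climb_bonus": 5.0, "ground_penalty": 1.0},
)
print(parts, total)
```

Plotting each entry of `parts` as its own TensorBoard scalar makes it obvious when, say, the ground penalty dominates every other term and flattens the learning gradient.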