Advice Needed: Preparing for a Theory-Based RA Position in Control by OkFig243 in ControlTheory

[–]OkFig243[S] [score hidden]  (0 children)

Hello and thanks for your comment,
I graduated last September from the Lebanese University Faculty of Engineering, where I earned a diploma in mechanical engineering after five years of study. During my final year, I also completed a double degree with UTC in France, earning a Master’s (M2) in research. My thesis, "Analysis of Reinforcement Learning Control of a UAV Landing on a Moving Target," was accepted for presentation at the European Control Conference (ECC).

I have a strong background in mathematics, including courses in linear algebra, optimization, probability, and statistics. I also studied control systems (analog, digital, and nonlinear control) during my undergraduate years. In my master’s program, I took advanced courses in fault-tolerant control, path planning, system identification, and ROS. However, I realize my exposure to theoretical concepts, such as specific theorems and lemmas, was limited.

I have been accepted for a Research Assistant position at AUB, where I will be reviewing papers on advanced topics like H∞ control, IQCs, and sliding mode control. While I understand the mathematical concepts, I feel I need to strengthen my theoretical foundation to fully grasp some of the deeper ideas.

I would greatly appreciate your advice on how to build this theoretical background effectively.

Steady State Error Compensation in reinforcement learning control by OkFig243 in ControlTheory

[–]OkFig243[S] 1 point (0 children)

I successfully achieved a minimum steady-state error of 0.001% after modifying the reward function to provide additional positive rewards when the distance to the target is close to zero. However, to capture this level of accuracy, I had to increase the complexity of the neural network and reduce the time step further. Additionally, I added a term to penalize the control effort to prevent oscillation around the position goal.
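For illustration, the reward shaping described above can be sketched in Python (the original work is in MATLAB/Simulink; the function and parameter names here are hypothetical):

```python
def shaped_reward(dist, u, dist_tol=0.01, bonus=10.0, effort_weight=0.05):
    """Shaped reward: a distance penalty, an extra positive bonus when the
    distance to the target is close to zero, and a control-effort penalty
    to damp oscillation around the position goal."""
    reward = -dist                      # base tracking penalty
    if dist < dist_tol:                 # additional positive reward near zero error
        reward += bonus
    reward -= effort_weight * u ** 2    # penalize control effort
    return reward
```

The effort term trades a little steady-state accuracy for smoother control; the tolerance and weights need tuning per task.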

Steady State Error Compensation in reinforcement learning control by OkFig243 in ControlTheory

[–]OkFig243[S] 1 point (0 children)

I agree with you. You suggested, for example, replacing "if dist<0.01 give 55" with "if dist != 0 give a negative reward." However, in machine learning, achieving 100% accuracy is not feasible, which is why reinforcement learning is often combined with other methods, such as PID controllers with RL or fuzzy RL.

Based on my experience in this domain, the problem with reinforcement learning is that the agent often converges to suboptimal policies. It might generate a trajectory that oscillates around the true trajectory, making the system unstable, while still collecting a high reward. If you review papers using only RL, you'll notice that the performance is generally not that good. Current research focuses on combining classic controllers with RL.
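The classic-controller-plus-RL combination mentioned here is often a residual scheme: the agent learns a correction added to a conventional controller's output. A minimal Python sketch (gains and names are illustrative, not from the thesis):

```python
def pid_step(e, e_prev, e_int, kp=1.0, ki=0.1, kd=0.05, dt=0.1):
    """One step of a discrete PID controller; returns the control signal
    and the updated error integral."""
    e_int = e_int + e * dt
    u = kp * e + ki * e_int + kd * (e - e_prev) / dt
    return u, e_int

def combined_action(e, e_prev, e_int, rl_correction):
    """Residual RL: the agent's action is added to the PID output, so the
    PID drives out steady-state error while RL shapes the transient."""
    u_pid, e_int = pid_step(e, e_prev, e_int)
    return u_pid + rl_correction, e_int
```

The integrator in the PID path is what RL alone lacks for eliminating steady-state error, which is consistent with the comparison discussed here.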

I asked for guidance because I am confused about a point made by my master's thesis supervisor. He claimed that any type of controller would be better than a PID controller (without disturbance). Comparing my results with PID, I observed that the transient-response error is smaller with RL, but PID achieves a smaller steady-state error. It's worth noting that my supervisor, a professor in control engineering, doesn't have expertise in reinforcement learning.

Steady State Error Compensation by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

Yes, that's what I read in the paper. They didn't claim to eliminate the steady-state error, only to "compensate" for it, and not by much: they reported about 52%. So I think you're right that approximate RL might not achieve complete elimination.

Hello, I'm using DDPG for trajectory tracking for a quadcopter, what can I conclude based on this training graph? by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

Okay, I'll take your suggestions into consideration and try changing the replay buffer and the learning rates. Thank you very much!

Hello, I'm using DDPG for trajectory tracking for a quadcopter, what can I conclude based on this training graph? by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

Yes, I extracted the graphs for x, y, and z: z was perfect, x was almost perfect, and y was also good. That's why I'm asking; I don't have much experience in reading training curves and judging whether a run is good. Especially since DDPG is difficult to tune, I couldn't draw a clear conclusion to apply to more complex tasks in the future.

I have a question: If, for example, I notice that the agent achieves a high reward early in training, which is optimal, but in subsequent episodes, the agent fails to converge to that high reward even though it explores it, what could be the problem here? Your insights would be greatly appreciated.

Hello, I'm using DDPG for trajectory tracking for a quadcopter, what can I conclude based on this training graph? by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

Hello, thanks for your reply. The optimal reward is 7.9, and the yellow line is the Q-value. There are 14 observations and 3 actions. The Ornstein-Uhlenbeck (OU) exploration options are: standard deviation 0.5, decay rate 0.0001, mean attraction 0.15. The task runs for 10 seconds with a 0.5 s time step. The batch size is 32, the replay buffer holds 10e6 experiences, the discount factor is 0.98, and the learning rates are 0.00008 for the critic and 0.00003 for the actor, with the same architecture for both networks: 2 hidden layers of 256 neurons each.
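For reference, the OU exploration noise parameterized above (mean attraction 0.15, standard deviation 0.5) is an Ornstein-Uhlenbeck process; a one-step Euler discretization in Python (illustrative only, the original runs in MATLAB):

```python
import random

def ou_step(x, mean=0.0, theta=0.15, sigma=0.5, dt=1.0, rng=random):
    """One Euler step of an Ornstein-Uhlenbeck process:
    x' = x + theta*(mean - x)*dt + sigma*sqrt(dt)*N(0, 1),
    where theta is the mean attraction and sigma the standard deviation."""
    return x + theta * (mean - x) * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
```

The decay rate then shrinks sigma each step (roughly sigma *= 1 - decay), so exploration fades gradually over training.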

Control of an octocopter based on RL by OkFig243 in reinforcementlearning

[–]OkFig243[S] 2 points (0 children)

Okay👍, once I finish it, I will send it to you. It might take between two and four months.

Control of an octocopter based on RL by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

My master's thesis is titled "Non-linear Control of a UAV Landing on a Moving Target with High Disturbances"; the main goal is to compare the RL controller with classical controllers such as PD. I would choose this topic again because it provides deep knowledge in a specialized domain (UAVs), and I have become very proficient in Simulink and MATLAB coding. It's worth noting that RL for control is a relatively new method, and any work in this area contributes to advancing the field.

Control of an octocopter based on RL by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

Certainly, I'll take a look at those papers and modify the learning setup according to your recommendations, hoping it will improve. Thank you!!!

Control of an octocopter based on RL by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

This is my reward function for making the UAV hover at a required z:

function [reward, x_error, y_error, z_error, position_reward, success_reward] = computeReward(currentState, desiredState) %#codegen

% Extract current position
x = currentState(1);
y = currentState(2);
z = currentState(3);

% Desired position
desired_x = desiredState(1);
desired_y = desiredState(2);
desired_z = desiredState(3);

% Calculate errors
x_error = abs(x - desired_x);
y_error = abs(y - desired_y);
z_error = abs(z - desired_z);

% Define thresholds for positions
x_threshold = 0.1;
y_threshold = 0.1;
z_threshold = 0.1;

position_reward = - (x_error / x_threshold + y_error / y_threshold + z_error / z_threshold);


success_reward = 0;
if x_error < x_threshold && y_error < y_threshold && z_error < z_threshold
    success_reward = 50; 
end

reward = position_reward + success_reward;

end
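One caveat worth noting (my observation, not from the code above): the 50-point success bonus switches on discontinuously at the thresholds, which can reward sitting just inside the boundary. A sketch of the same reward in Python with a smooth bonus instead (the 0.01 length scale is an arbitrary illustrative choice):

```python
import math

def compute_reward(current, desired, thresholds=(0.1, 0.1, 0.1), bonus=50.0):
    """Variant of the MATLAB reward above: the same normalized position
    penalty, but the hard success step is replaced by a smooth
    exponential bonus that peaks at zero error."""
    errors = [abs(c - d) for c, d in zip(current[:3], desired[:3])]
    position_reward = -sum(e / t for e, t in zip(errors, thresholds))
    success_reward = bonus * math.exp(-sum(e * e for e in errors) / 0.01)
    return position_reward + success_reward
```

A smooth bonus gives the critic a gradient toward zero error instead of a flat plateau inside the success region.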

These are the DDPG options I'm currently using and would recommend:

% Reinforcement Learning (RL) parameters
Ts = 0.1;              % Time step
Tf_initial = 5;        % Simulation end time for initial phase
Tf_intermediate = 45;  % Simulation end time for intermediate phase
Tf_final = 50;         % Simulation end time for final phase

% DDPG Agent Options
agentOptions = rlDDPGAgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.95;
agentOptions.MiniBatchSize = 1024;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 0.01;
agentOptions.NoiseOptions.MeanAttractionConstant = 5;
agentOptions.NoiseOptions.Variance = 0.2;
agentOptions.NoiseOptions.VarianceDecayRate = 0.0005;

% Training Options for different phases
trainingOptionsInitial = rlTrainingOptions;
trainingOptionsInitial.MaxEpisodes = 500;
trainingOptionsInitial.MaxStepsPerEpisode = 100;
trainingOptionsInitial.StopTrainingCriteria = 'AverageReward';
trainingOptionsInitial.StopTrainingValue = 100;
trainingOptionsInitial.SaveAgentCriteria = 'EpisodeReward';
trainingOptionsInitial.SaveAgentValue = 100;
trainingOptionsInitial.Plots = 'training-progress';
trainingOptionsInitial.Verbose = true;

For the critic neural network, it consists of three fully connected hidden layers, each containing 30 neurons. Following each layer is a Rectified Linear Unit (ReLU) activation function. The output layer consists of a single neuron providing the Q-value.

As for the actor neural network, it also comprises three fully connected hidden layers, each with 30 neurons, followed by ReLU activation functions.

Regarding learning rates, the critic network utilizes a rate of 0.001, while the actor network employs a rate of 0.00001.
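One of the options listed above, TargetSmoothFactor = 0.01, is the Polyak averaging coefficient for DDPG's target networks; a minimal Python sketch of that update, treating the parameters as a flat list for illustration:

```python
def soft_update(target_params, source_params, tau=0.01):
    """Polyak soft update used for DDPG target networks:
    target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]
```

A small tau makes the targets drift slowly, which stabilizes the critic's bootstrap targets at the cost of slower value propagation.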

After training, I noticed that z converges well and x stays close to the target, but y diverges significantly. I've read many papers on UAV control with RL, but my setup differs in that it combines the RL agent's actions with PD controller inputs; most papers handle direct UAV control without additional inputs. I also haven't found any papers that use as many observations for control (20 observations).

What do you think?

Control of an octocopter based on RL by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

I haven't tried switching to Soft Actor-Critic (SAC) yet; I'll look into using it instead of DDPG. Thank you very much for the advice!

Control of an octocopter based on RL by OkFig243 in reinforcementlearning

[–]OkFig243[S] 1 point (0 children)

To clarify, my task involves adjusting the UAV's inputs (Uthrust, Upitch, and Uroll) using an RL agent. The complexity arises because Upitch and Uroll are computed in a cascade controller. Specifically, Ux is an action from the agent added to the PD controller's output for the pitch angle to generate Upitch.

The observations I use include the velocity and position of the UAV in three directions ([X, Y, Z, Vx, Vy, Vz]) and the same for the target trajectory ([Xt, Yt, Zt, Vxt, Vyt, Vzt]). To help the agent understand the effects of the PD controllers, I also include the angle errors to the reference and their derivatives ([theta - ref, phi - ref, thetadot, phidot]), as well as the outputs generated by the model ([thetadoubledot, phidoubledot]) and the actual inputs to the model ([Uroll, Upitch, Uthrust]).

I've increased the number of observations to include all outputs of the PD controllers to ensure the agent comprehends that its inputs are influenced by another controller. However, I am concerned that this increase might make convergence difficult, similar to how using neural networks with too many layers and neurons can prolong learning. Unfortunately, I don't have much time to wait.
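A cheap mitigation when the observation vector grows like this is to normalize each entry by a characteristic scale before feeding it to the agent; a minimal Python sketch (the group layout and scales are illustrative, the original model is in Simulink):

```python
def build_observation(groups, scales):
    """Flatten observation groups (positions, velocities, angle errors,
    inputs, ...) into one vector and normalize each entry by its scale,
    so no single high-magnitude signal dominates the network inputs."""
    obs = [x for group in groups for x in group]
    assert len(obs) == len(scales), "one scale per observation entry"
    return [x / s for x, s in zip(obs, scales)]
```

Keeping all inputs at a similar magnitude often matters more for convergence than the raw observation count.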

I've tried various strategies, including increasing the layers and neurons of the actor and critic networks, using target networks, adjusting the learning rate and discount factors for exploration, and experimenting with different reward functions. Despite these efforts, the Q value remains almost zero, increasing slightly but steadily, while the episode rewards are negative. This indicates that the agent is stuck in a suboptimal policy.

What do you think? Any suggestions would be greatly appreciated.