EVALUATION OF THE CONTROL ALGORITHM


Evaluating control algorithms for a balancing robot is essential to determine their effectiveness, robustness, and adaptability in various scenarios. This section analyzes the performance of the implemented control methods using key evaluation criteria and testing conditions.

4.1. Displacement of the robot over 3000 episodes

Displacement refers to the agent's change in position or state relative to a reference point or goal after taking actions during an episode.

Tracking displacement helps evaluate the agent's performance regarding movement, stability, or reaching a target.
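For clarity, the short sketch below shows one simple way this per-episode metric could be computed; the function name and the choice of the episode's starting position as the reference point are illustrative assumptions, not the exact implementation used in this work.

import numpy as np

def episode_displacement(positions):
    # Net change in position over one episode, measured against the
    # starting point, which serves as the reference.
    positions = np.asarray(positions, dtype=float)
    return float(positions[-1] - positions[0])

# Example: a robot that drifts forward by roughly 3 units during an episode
print(episode_displacement([0.0, 0.8, 1.9, 3.1]))  # -> 3.1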

Using DQN

In the initial Exploration Phase (Episodes 0–1000), the agent exhibits low and inconsistent displacement. This phase reflects random exploration as the agent learns the basic dynamics of the environment. See Figure 4.1 for more detail.

In the Improvement Phase (Episodes 1000–2500), a clear upward trend in displacement is observed, with increasing variability. This indicates the agent is starting to identify more effective strategies but has not yet fully optimized its behavior.

In the Convergence Phase (Episodes 2500–3000), the displacement stabilizes at a consistently high level, peaking at around 30 units. The algorithm has likely converged, meaning the agent has learned an optimal or near-optimal policy for the task.


Figure 4.1: Displacement using DQN

Using SAC

Displacement values oscillate between -3 and +4, indicating variability in the agent's movements. The agent may be alternating between successful and less effective policies during training. See Figure 4.2.

While the values fluctuate, the overall range remains stable without significant increases or decreases, suggesting that the agent is consistently exploring its action space. The absence of a clear upward or downward trend also suggests that the agent's performance in terms of displacement is neither improving nor degrading significantly.

Figure 4.2: Displacement using SAC

4.2. Epsilon over 3000 episodes

In epsilon-greedy exploration, epsilon defines the probability of the agent taking a random action instead of the action suggested by its current policy.

As training progresses, epsilon is gradually reduced to shift the agent's behavior from exploration to exploitation.
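As a rough illustration of this schedule (not the exact code used in this thesis), the sketch below pairs epsilon-greedy action selection with a multiplicative per-episode decay of epsilon; the start, end, and decay constants are assumed values.

import random

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.999  # assumed schedule constants

def select_action(q_values, epsilon):
    # With probability epsilon take a random action (exploration),
    # otherwise take the action with the highest predicted Q-value (exploitation).
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

epsilon = EPS_START
for episode in range(3000):
    # ... run one episode, calling select_action(q_values, epsilon) at every step ...
    epsilon = max(EPS_END, epsilon * EPS_DECAY)  # decay toward near-zero exploration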

Using DQN

In the Initial Phase (Episodes 0–1000), shown in Figure 4.3, epsilon starts near 1, encouraging high exploration. The agent tries various actions to understand the environment and collect diverse experiences.

In the Decay Phase (Episodes 1000–3000), epsilon decreases steadily as training progresses. This encourages the agent to rely more on exploiting its learned policy, focusing on refining optimal actions.

In the End Phase (around Episode 3000), epsilon approaches a value near 0, indicating minimal exploration.

Figure 4.3: Epsilon using DQN

Using SAC


Figure 4.4 shows the epsilon curve decreasing steadily from around 0.9 at the start to near 0 at the end. This indicates the agent starts with high exploration, prioritizing random actions to learn about the environment.

By episode 3000, epsilon approaches zero, meaning the agent is almost entirely relying on the learned policy rather than exploring. A high epsilon at the beginning ensures the agent gathers diverse experiences, which is crucial for learning an effective policy. As epsilon decreases, the agent focuses on leveraging its learned knowledge to achieve better performance.

Figure 4.4: Epsilon using SAC

4.3. Loss per timestep over 3000 episodes

Loss is typically associated with the difference between the predicted Q-values (from the neural network) and the target Q-values (calculated using the Bellman equation). A lower loss indicates better alignment between predictions and targets.

The goal is to minimize this loss over time, enabling the agent to make accurate predictions and decisions.
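For DQN, this loss corresponds to the standard temporal-difference objective with a target network; the textbook form (not a verbatim reproduction of the implementation) is:

L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]

where θ are the online network parameters, θ⁻ the target-network parameters, γ the discount factor, and D the replay buffer.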

Using DQN


In Figure 4.5, during the Stable Phase (Episodes 0–2000), the loss remains low and consistent, indicating incremental learning. The agent is gradually improving its policy with relatively small updates to the model.

Increase in Loss (Episodes 2000–3000): a sudden spike in loss occurs, peaking near episode 3000. This phase likely corresponds to the agent exploring challenging states or a shift in the environment dynamics.

Post-Peak Stability: the loss begins to decline after the peak, suggesting the agent is adapting to the new information and stabilizing its learning.

Significance: sudden spikes could indicate moments of high exploration, instability in the value function, or changes in the training dynamics (e.g., due to epsilon decay or encountering sparse rewards). The decrease in loss following the spike demonstrates the algorithm's ability to recover and refine its performance.

Figure 4.5: Loss using DQN

Using SAC

In Figure 4.6, during the Initial Phase (Episodes 0–1000), the loss starts near zero and gradually becomes more negative. This is expected as the agent learns to approximate the Q-values during its exploratory phase.


Mid-Training Phase (Episodes 1000–2000): the loss stabilizes somewhat but remains significantly negative. This indicates that the agent is encountering challenging transitions as it continues to learn.

Later Phase (Episodes 2000–3000): the loss becomes more erratic and reaches large negative values. This could be due to increased complexity in approximating Q-values for certain states as the agent shifts from exploration to exploitation, or to potential instability in the training process (e.g., replay-buffer effects or learning-rate issues).

Figure 4.6: Loss using SAC

4.4. Transient reward over 3000 episodes

Transient Reward: the reward obtained by the robot within a single episode. It represents how well the robot performed in that particular attempt to balance.
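For reference, the transient reward of an episode is simply the sum of the per-step rewards collected before the episode terminates; a minimal sketch with a placeholder environment and agent (Gym-style step interface assumed) is shown below.

def run_episode(env, agent):
    # Transient reward: total reward accumulated within a single episode.
    obs = env.reset()
    done, transient_reward = False, 0.0
    while not done:
        action = agent.act(obs)                  # placeholder agent interface
        obs, reward, done, _ = env.step(action)  # Gym-style step signature assumed
        transient_reward += reward
    return transient_reward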

Using DQN

In Figure 4.7, Initial Struggles: the initial part of the transient reward graph shows low and fluctuating values. This indicates that the robot was initially performing poorly, likely due to random actions and a lack of learned control.

Learning Progress: As training progresses, the transient reward starts to increase. This suggests that the robot is learning to perform better and achieve higher rewards, which likely means it is becoming more stable and balanced.


Plateauing: After around 2500 episodes, the transient reward seems to reach a plateau or level off. This indicates that the robot's performance has stabilized.

Figure 4.7: Transient reward using DQN

Using SAC

In Figure 4.8, during the Early Episodes (0 to 1000), the rewards fluctuate greatly, indicating the agent is heavily exploring different actions to learn the environment. This phase is crucial for the agent to gather information and understand the impact of various actions on the rewards.

Mid Episodes (1000 to 2000): the rewards continue to fluctuate but show some stabilization. The agent might be starting to exploit known actions that yield higher rewards, though exploration still plays a significant role.

Late Episodes (2000 to 3000): fluctuations persist, indicating ongoing exploration. The lack of a clear trend suggests that the environment may be highly stochastic, or that the agent's learning rate and exploration rate may need adjustment for better convergence.


Figure 4.8: Transient reward using SAC

4.5. Number of seconds alive over 3000 episodes

Seconds Alive: represents the duration in seconds that the robot remained upright and balanced within a single episode.
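Because the simulator advances in fixed timesteps, this metric can be recovered by multiplying the number of steps survived by the control period; the 50 Hz period below is purely illustrative and not taken from the thesis.

DT = 0.02  # assumed simulation/control period in seconds (50 Hz), for illustration only

def seconds_alive(steps_survived, dt=DT):
    # Duration the robot stayed upright, given the number of timesteps it survived.
    return steps_survived * dt

print(seconds_alive(250))  # 250 steps at 50 Hz -> 5.0 seconds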

Using DQN

See Figure 4.9. Initial Struggles: the initial part of the graph shows low and fluctuating values for seconds alive. This indicates that the robot was initially unstable and could only maintain balance for a short duration.

Learning Progress: As training progresses, the number of seconds alive generally increases. This suggests that the robot is learning to control its motors more effectively to maintain balance for longer periods.

Plateauing: After around 2500 episodes, the number of seconds alive seems to reach a plateau or level off.


Figure 4.9: Number of seconds alive using DQN

Using SAC

In Figure 4.10, the graph shows significant fluctuations in the number of seconds alive across the 3000 episodes. The line frequently peaks and dips, indicating variability in the agent's performance.

There is no clear upward or downward trend over the episodes, suggesting that the agent's performance remains highly variable and does not show consistent improvement or deterioration.

While there are peaks and troughs, the overall range of values (between 0 and 7 seconds) remains fairly consistent throughout the episodes. This implies that the agent's survival time fluctuates within a specific range but does not drastically change over time.


Figure 4.10: Number of seconds alive using SAC

4.6. Goal reached percentage during training

Goal Reached Percentage: represents the percentage of episodes within a particular interval where the robot achieved the desired balancing goal.
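One straightforward way to compute this statistic is to split the episodes into fixed-size intervals and take the fraction of successful episodes in each interval; the 200-episode interval below is an assumed value for illustration.

def goal_reached_percentage(goal_flags, interval=200):
    # goal_flags: one boolean per episode, True if the balancing goal was reached.
    # Returns the success percentage for each consecutive interval of episodes.
    percentages = []
    for start in range(0, len(goal_flags), interval):
        chunk = goal_flags[start:start + interval]
        percentages.append(100.0 * sum(chunk) / len(chunk))
    return percentages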

Using DQN

In Figure 4.11, Initial Struggles: in the initial intervals, the goal reached percentage is very low. This indicates that the robot was initially struggling to balance for the target duration.

Learning Progress: As training progresses, the goal reached percentage gradually increases. This suggests that the robot is learning to balance for longer durations and achieving the goal more frequently.

Significant Improvement: There is a sharp increase in the goal reached percentage around the 2500–2800 episode range. This indicates a significant improvement in the robot's balancing performance.

High Success Rate: In the final interval (2800-3000 episodes), the goal reached percentage is very high, close to 80%. This suggests that the robot has achieved a high success rate in balancing for the target duration.


Overall, the RL algorithm effectively trained the robot to balance for the desired duration.

Figure 4.11: Goal reached percentage using DQN

Using SAC

When using SAC, the robot cannot reach the goal (see Figure 4.12).

Figure 4.12: Goal reached percentage using SAC

4.7. Angle of the robot

According to the data in Figure 4.13, this is a simulation of the robot's tilt angle after training, together with the robot's distance travelled during the test. The tilt angle ranges from 89.2 to 90.8 degrees, showing that the robot is not entirely stable. However, this tilt range is still sufficient for the robot to move and reach the set target.

Figure 4.13: Angle of Robot

