
CHAPTER 3: DEEP REINFORCEMENT LEARNING FOR BALANCING

3.1. Defining the problem of DRL for balancing robot

Deep Reinforcement Learning (DRL) has become a transformational approach in robotics, allowing robots to learn complex tasks and adapt to changing environments through trial and error.

Learning-based Control

In learning-based control, robots learn control policies directly from sensor data, removing the need for manually designed controllers. This approach enables robots to adapt and optimize their behavior based on real-world inputs, making it more flexible and scalable than traditional control methods.

End-to-End Learning

DRL offers end-to-end learning, where perception, decision-making, and control are integrated into a single framework. This capability simplifies the system design, allowing for more efficient and unified solutions. It is particularly useful for balancing and obstacle avoidance tasks, where the robot must seamlessly process sensory information, make decisions, and execute control actions without needing separate, specialized components.

Environment

Figure 3.1 shows the 3D space in which the robot moves: boundaries, obstacles, and goals.


The environment in which the robot operates is bounded by -5 cm < x < 45 cm and -15 cm < y < 15 cm. If the robot leaves these boundaries, it is considered to have reached a terminal state.

Figure 3.1: Environment

State

In robotics, the state of a robot refers to the set of variables that describe its current configuration and interaction with the environment, such as position, velocity, and sensor readings. The state provides the necessary information for the robot to make informed decisions and take appropriate actions, enabling it to navigate and perform tasks effectively in dynamic environments. Here, I divided the robot into two parts: a base frame and two wheels (see Figure 3.2).

Figure 3.2: Structure of the robot [28]

Base frame

The state of the base frame consists of the following parameters:

Center of Gravity (COG) position: (𝑥, 𝑦, 𝑧)

Orientation of the COG in Euler angles (𝑥, 𝑦, 𝑧): represents how the robot tilts or orients about each axis (roll, pitch, yaw) (see Figure 3.3).

Figure 3.3: Center of Gravity position

Linear velocity (𝑣𝑥, 𝑣𝑦, 𝑣𝑧): the speed of the robot along each axis (𝑥, 𝑦, 𝑧) in 3D space (see Figure 3.4).

Figure 3.4: Linear velocity of COG

Angular velocity (𝜔𝑥, 𝜔𝑦, 𝜔𝑧): the speed at which the robot rotates around each axis (see Figure 3.5).


Figure 3.5: Angular velocity of COG

For the COG of the robot, the position (𝑥, 𝑦, 𝑧), the orientation in Euler angles (𝑥, 𝑦, 𝑧), the linear velocity (𝑣𝑥, 𝑣𝑦, 𝑣𝑧), and the angular velocity (𝜔𝑥, 𝜔𝑦, 𝜔𝑧) are recorded, forming a 12-dimensional vector:

3 values for position: position along the x, y, and z axes

3 values for orientation (Euler angles): rotation around the x, y, and z axes

3 values for linear velocity: speed along the x, y, and z axes

3 values for angular velocity: rotation speed around the x, y, and z axes

Total: 12 states

Wheel state

Figure 3.6: Two wheels [29]

Definition

Joint angular velocity (𝜔): the actual rotation speed of the wheel joint.

Target joint angular velocity (𝜔𝑡): the angular velocity that the control system commands to the joint in order to achieve the desired movement or posture.

State

The wheels of the robot are also defined in the 3D coordinate system shown in Figure 3.6.

Local position (𝑥, 𝑦, 𝑧): Position of each wheel relative to the robot frame.

Local orientation in Euler angles (𝑥, 𝑦, 𝑧): Describes the wheel's orientation relative to the robot frame (how each wheel is tilted).

Global linear velocity (𝑣𝑥, 𝑣𝑦, 𝑣𝑧): speed of each wheel (tracks forward, backward, or sideways drift).

Global angular velocity (𝜔𝑥, 𝜔𝑦, 𝜔𝑧): Speed of the wheel rotating around the x, y, and z axes.

Joint angular velocity (𝜔)

Target joint angular velocity (𝜔𝑡)

Both the left and right wheel states can each be represented by a vector of size 14 (3 + 3 + 3 + 3 + 1 + 1), giving 28 values for the two wheels.

In conclusion, the state of the robot is a vector with 40 dimensions: 12 for the base frame and 14 for each of the two wheels.
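As an illustration, the sketch below assembles such a 40-dimensional vector in Python; the dictionary keys and the getter structure are assumptions for illustration, since the concrete simulator interface is not specified here.

import numpy as np

# Illustrative sketch: build the 40-dimensional state vector from base-frame
# and wheel quantities. The dictionaries are hypothetical placeholders for
# whatever the simulator actually returns.
def build_state(base, left_wheel, right_wheel):
    base_part = np.concatenate([
        base["position"],           # (x, y, z)          -> 3
        base["euler"],              # (roll, pitch, yaw) -> 3
        base["linear_velocity"],    # (vx, vy, vz)       -> 3
        base["angular_velocity"],   # (wx, wy, wz)       -> 3
    ])
    wheel_parts = []
    for wheel in (left_wheel, right_wheel):
        wheel_parts.append(np.concatenate([
            wheel["local_position"],            # 3
            wheel["local_euler"],               # 3
            wheel["linear_velocity"],           # 3
            wheel["angular_velocity"],          # 3
            [wheel["joint_velocity"]],          # 1
            [wheel["target_joint_velocity"]],   # 1
        ]))
    state = np.concatenate([base_part] + wheel_parts)
    assert state.shape == (40,)   # 12 + 14 + 14
    return state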

Action

The action space is the set of possible actions that the robot can perform to adjust its balance and movement.

It is initially defined as two continuous values representing the changes in the angular velocity of the left and right wheels.

The angular velocity of each wheel can only change in specific steps: [5dv, 3dv, 1dv, 0.1dv, 0, -0.1dv, -1dv, -3dv, -5dv], where 𝑑𝑣 = 0.5 (delta velocity, a manually tuned change in velocity) is the default step size.

There are 9 discrete actions for each wheel, representing different magnitudes of changes in angular velocity.
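A minimal sketch of this discretization is given below, assuming dv = 0.5 as stated above; the way the 9 options per wheel are indexed is an illustrative choice, not part of the original design.

# Sketch of the discrete action set: 9 velocity increments per wheel.
DV = 0.5
STEPS = [5, 3, 1, 0.1, 0, -0.1, -1, -3, -5]     # multiples of dv
WHEEL_DELTAS = [s * DV for s in STEPS]          # 9 options per wheel

def apply_action(left_index, right_index, left_velocity, right_velocity):
    """Update the two wheel angular velocities by the selected discrete steps."""
    new_left = left_velocity + WHEEL_DELTAS[left_index]
    new_right = right_velocity + WHEEL_DELTAS[right_index]
    return new_left, new_right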

Reward

Terminal-state rewards (received by the agent when the episode ends):

Fall: -1

Cross the boundary (the robot moves beyond the defined spatial range): -1

Reach the goal: +1. The robot crosses the predetermined target line at a distance of x = 30 cm: +0.7. The robot maintains balance without falling for 400 time steps (20 seconds): +0.2.

Reach the maximum number of steps (a neutral terminal state): 0

Step rewards (received at every time step):

The robot's position gets closer to the target: +0.1

The robot's position remains the same or moves away from the target: -0.1
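This scheme can be expressed as a small reward function. The sketch below is one possible reading of it: the boolean flags are assumed to be computed elsewhere (tilt/contact detection, boundary check, target-line check, step counter), and the goal-related bonuses are treated as separate cases in the order listed above, since the text does not state how they combine.

# Sketch of the reward scheme described above; all inputs are assumed to be
# provided by the environment's termination and progress checks.
def compute_reward(fallen, out_of_bounds, reached_goal, crossed_30cm,
                   balanced_400_steps, timed_out, prev_distance, distance):
    # Terminal-state rewards
    if fallen or out_of_bounds:
        return -1.0
    if reached_goal:
        return 1.0
    if crossed_30cm:
        return 0.7
    if balanced_400_steps:
        return 0.2
    if timed_out:
        return 0.0
    # Step rewards: +0.1 when the robot gets closer to the target, -0.1 otherwise
    return 0.1 if distance < prev_distance else -0.1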

Policy

The policy determines how the agent chooses actions.

Greedy: always chooses the action with the largest Q value.

Epsilon-Greedy: sometimes performs random actions to explore the environment (exploration).
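A minimal sketch of the epsilon-greedy rule is shown below; the source of the Q-values (e.g. a value network such as the DQN used for training) and the epsilon value are assumptions for illustration.

import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the largest Q-value (greedy)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))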

End

One episode ends when:

Reach the target: the target line is a predetermined position along the x-axis of the environment, and the robot must cross this line to reach the target; the y-coordinate has no effect. When the robot's x-coordinate crosses this line, it is considered to have completed the task and the agent receives a reward of +1.

Fall: the tilt of the body exceeds ±45° from the initial vertical position, or any part of the body contacts the ground, slope, or any other obstacle. In both cases of falling, the robot receives a penalty of -1.

Timeout: the robot exceeds the maximum number of time steps allowed: 50,000 steps (at a 20 Hz control frequency, 2,500 seconds, approximately 42 minutes). Reward = 0.
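Combining these conditions, the termination check can be sketched as follows; the tilt angle, contact flag, and step counter are assumed to be provided by the simulator, and the x/y limits repeat the environment bounds given earlier.

# Sketch of the episode-termination logic. tilt_deg is the body angle from
# vertical in degrees, step is the current time step, positions are in cm.
TARGET_X = 30.0        # cm, predetermined target line
MAX_STEPS = 50_000     # timeout (20 Hz -> 2500 s)
FALL_ANGLE = 45.0      # degrees from the initial vertical position

def episode_done(x, y, tilt_deg, body_contact, step):
    """Return (done, reason) for the current state."""
    if x >= TARGET_X:
        return True, "target"                         # reward +1
    if abs(tilt_deg) > FALL_ANGLE or body_contact:
        return True, "fall"                           # reward -1
    if not (-5.0 < x < 45.0 and -15.0 < y < 15.0):
        return True, "out_of_bounds"                  # reward -1
    if step >= MAX_STEPS:
        return True, "timeout"                        # reward 0
    return False, ""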

Agent


An agent is an entity that is capable of observing the environment, performing actions, and receiving rewards from the environment based on those actions.

DQN and SAC are used to train the agent and to decide the next action. The interaction between the agent and the environment proceeds in the following steps (a minimal sketch of this loop is given after the list):

Initialization: The agent is initialized, where the robot is placed in a specific position and orientation in the environment.

Observation: The agent observes the current state of the environment.

Decision Making: Based on the current state, the agent decides the next action through its policy.

Take Action: The agent acts, adjusting the speed of the wheels.

Receive Reward: After acting, the agent receives a reward from the environment.

Update Policy: The agent uses the reward and experience gained to update its policy, to optimize future actions.
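These steps correspond to the generic interaction loop sketched below; env and agent are hypothetical wrappers around the simulator and the DQN/SAC learner, and the method names are assumptions for illustration rather than the actual implementation.

# Generic agent-environment interaction loop following the steps above.
def run_episode(env, agent, max_steps=50_000):
    state = env.reset()                                 # Initialization / first Observation
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.select_action(state)             # Decision Making via the policy
        next_state, reward, done = env.step(action)     # Take Action, Receive Reward
        agent.update(state, action, reward, next_state, done)  # Update Policy
        state = next_state
        total_reward += reward
        if done:
            break
    return total_reward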
