
2.2. Introduction to Deep Reinforcement Learning

2.2.3. What is Deep Reinforcement Learning?

2.2.3.1. Introduction

Deep reinforcement learning (DRL) is the combination of RL and DL. The goal of DRL is to learn optimal actions that maximize the reward over all states the environment can be in. [23]

DRL incorporates DL into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. DRL algorithms can take in very large inputs and decide what actions to perform to optimize an objective. [24]

2.2.3.2. Core components of DRL

The building blocks of DRL comprise all the elements that drive learning and enable agents to make sound decisions in their environment. The cooperative interaction of these components produces an effective learning framework. [25]

Agent: The decision-maker or learner that interacts with the environment. The agent acts according to its policy and gains experience over time to improve its decision-making abilities.

Environment: The system outside the agent with which it interacts. The environment provides feedback to the agent based on its actions, delivering rewards or penalties in response.

State: A representation of the current condition or situation of the environment at a specific moment. The agent bases its actions and decisions on this state.

Action: A choice made by the agent that leads to a change in the state of the environment. The agent's policy guides the selection of these actions.

Reward: A scalar feedback signal from the environment indicating whether the agent's behavior in a specific state is favorable. Rewards help the agent learn positive behaviors.

Policy: A plan that directs the agent's decision-making by mapping states to actions. The goal is to find an ideal policy that maximizes cumulative rewards.

Value Function: This function estimates the anticipated cumulative reward an agent can expect to receive from a specific state while following a particular policy. It is useful for evaluating and comparing states and policies.

Model: A representation of the environment's dynamics that allows the agent to simulate potential outcomes of actions and states. Models are helpful for planning and forecasting.


Exploration-Exploitation Strategy: A decision-making approach that balances exploring new actions to gather more information and exploiting known actions to achieve immediate benefits.

Learning Algorithm: The process by which the agent updates its value function or policy based on experiences from interacting with the environment. Learning in DRL is driven by various algorithms, including Q-learning, policy gradients, and actor-critic methods.

Deep Neural Networks: In DRL, deep neural networks serve as function approximators that can handle high-dimensional state and action spaces. They learn complex input-to-output mappings.

Experience Replay: A technique that randomly samples from stored past experiences during training. This improves learning stability and reduces the correlation between consecutive experiences.
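As a minimal illustration, an experience replay buffer can be sketched in Python as follows (the class and parameter names are illustrative and are not taken from the thesis implementation):

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores past transitions and samples them at random."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        # Store one transition (s, a, r, s', done) observed while interacting with the environment.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive experiences.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)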

2.2.3.3. How does DRL work?

In DRL, an agent interacts with an environment to learn how to make optimal decisions.

Initialization: Construct the agent and set up the problem.

Interaction: The agent acts in its environment, producing new states and rewards.

Learning: The agent records its experiences and updates its decision-making strategy.

Policy Update: The learning algorithm adjusts the agent's policy based on the collected data.

Exploration-Exploitation: The agent strikes a balance between using well-known actions and trying out new ones.

Reward Maximization: The agent learns to select actions that yield the greatest possible total reward.

Convergence: The agent's policy improves and eventually stabilizes over time.


Generalization: Trained agents can apply what they have learned to new situations.

Evaluation: The agent's performance is assessed in previously unseen environments.

Deployment: The trained agent is used in practical applications. [25]
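These steps can be summarized by a minimal interaction loop. The sketch below assumes the gymnasium package and a standard benchmark environment, and uses a random policy as a placeholder for the learned one:

import gymnasium as gym

env = gym.make("CartPole-v1")            # Initialization: set up the problem

for episode in range(5):
    state, info = env.reset()            # start a new episode
    total_reward, done = 0.0, False

    while not done:
        action = env.action_space.sample()                              # exploration: a random action
        state, reward, terminated, truncated, info = env.step(action)   # interaction: new state and reward
        total_reward += reward                                          # reward accumulation
        done = terminated or truncated
        # In a full DRL agent, learning and policy updates would happen here.

    print(f"episode {episode}: total reward = {total_reward}")

env.close()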

2.2.3.4. Algorithms of DRL

As in RL, DRL algorithms fall into two classes: model-based and model-free. Figure 2.14 illustrates the difference between them.

Model-based algorithms

Model-based algorithms utilize transition and reward functions to determine the optimal policy in environments where the agent has complete knowledge. By having access to a model that outlines actions, probabilities, and rewards, the agent can plan its actions effectively. This approach works best in static or fixed environments where the dynamics do not change over time.

Model-free algorithms

In contrast, model-free algorithms operate with limited knowledge of the environment's dynamics. These algorithms do not rely on transition or reward functions but instead derive the optimal policy through direct experience. This makes them ideal for scenarios where the agent has incomplete or uncertain information about the environment. Model-free approaches are particularly useful in dynamic or unpredictable environments.

Figure 2.14: Model-based vs. Model-free [26]


Real-world environments, like those encountered by self-driving cars, are dynamic and frequently changing. In these cases, model-free algorithms typically perform better.

Markov Decision Process (MDP)

MDP is a framework in Reinforcement Learning that formalizes sequential decision-making. It involves an agent that interacts with its environment over time.

At each time step, the agent observes the current state and selects an action. This action leads to a transition to a new state, and the agent receives a reward based on its choice. This sequence of states, actions, and rewards is known as a trajectory. Figure 2.15 shows the DRL cycle.

The agent's primary objective is to maximize the total rewards from its actions, focusing on both immediate and cumulative rewards throughout the entire process.

Figure 2.15: DRL cycle [27]
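For reference, the MDP described above can be written formally in standard textbook notation (this formulation is stated here for completeness and is not drawn from the thesis):

An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ the set of actions, $P(s' \mid s, a)$ the transition probability, $R(s, a)$ the reward function, and $\gamma \in [0, 1)$ the discount factor. A trajectory is the sequence $\tau = (S_0, A_0, R_1, S_1, A_1, R_2, \dots)$, and the return the agent seeks to maximize is

\[
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}.
\]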

Bellman Equations

State: A numerical representation of what the agent observes at a specific point in the environment.

Action: The input the agent gives to the environment based on its policy.

Reward: Feedback from the environment that reflects the agent's performance in achieving its goals.


The Bellman Equations address two main questions: What long-term reward can the agent expect if it takes the best actions from state ‘s’? What is the value of the current state the agent is in?

These equations are used throughout DRL, particularly in deterministic environments. The value of a state s is determined by the best action available in it: the agent chooses the action a that maximizes the sum of the immediate reward and the discounted value of the next state, where the discount factor γ reduces the weight of future rewards. This approach simplifies the calculation of the value function, allowing complex problems to be broken down into smaller, recursive subproblems.
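In a deterministic environment, this reasoning is captured by the Bellman optimality equation, stated here in its standard form for completeness:

\[
V(s) = \max_{a} \left[ R(s, a) + \gamma\, V(s') \right],
\]

where $s'$ is the state reached from $s$ by taking action $a$.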

Dynamic Programming (DP)

In the context of Bellman Optimality Equations, solving the system explicitly is often impractical for large state spaces. Thus, we use DP, which breaks problems into simpler sub-problems and creates a lookup table to estimate state values. There are two main classes of DP: Value Iteration and Policy Iteration.

Value Iteration is a method that determines the optimal policy by selecting the action that maximizes the state-value function for each state. The state-value function, V(s), is initialized with random values and is iteratively updated until convergence. This process is guaranteed to converge to the optimal state values, as it refines the value of each state step by step.
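A minimal Python sketch of Value Iteration for a small tabular MDP is given below; the array layout (a transition tensor P and a reward matrix R) is an assumption made for illustration, not the thesis implementation:

import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Value Iteration on a tabular MDP.

    P[a, s, s'] : transition probabilities, shape (n_actions, n_states, n_states)
    R[s, a]     : expected immediate reward, shape (n_states, n_actions)
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)                               # initial state values (random values also work)

    while True:
        # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * sum_s' P(s'|s, a) * V(s')
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)                            # value of the best action in each state
        if np.max(np.abs(V_new - V)) < tol:              # stop once the values have converged
            V = V_new
            break
        V = V_new

    policy = Q.argmax(axis=1)                            # greedy policy with respect to the final values
    return V, policy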

On the other hand, Policy Iteration involves two phases: Policy Evaluation and Policy Improvement. During the Policy Evaluation phase, the state values are calculated under the current policy, and in the Policy Improvement phase, the policy is improved using these evaluated state values. The agent starts with a random policy $\pi_0$; Policy Evaluation computes its state values, and Policy Improvement updates the policy to $\pi_1$. This iterative back-and-forth process continues until the optimal policy is found.


In Policy Evaluation, the process is iterative: starting from the current policy (stored as a table of state-action pairs), the state values are updated repeatedly with the Bellman expectation equation until they stabilize:

\[
v_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s \right]
\]

In the Policy Improvement phase, the agent selects, for each state, the action that maximizes this value, which forms the policy for the next iteration. This continual improvement of the policy ensures convergence towards the optimal solution.
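A matching Policy Iteration sketch, using the same illustrative MDP arrays as the Value Iteration example above, could look as follows:

import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_tol=1e-6):
    """Policy Iteration on the same tabular MDP layout as the Value Iteration sketch."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)               # arbitrary initial policy

    while True:
        # Policy Evaluation: iterate v(s) = E[R + gamma * v(S') | policy] until the values stabilize.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < eval_tol:
                V = V_new
                break
            V = V_new

        # Policy Improvement: act greedily with respect to the evaluated values.
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        new_policy = Q.argmax(axis=1)

        if np.array_equal(new_policy, policy):           # converged: the policy no longer changes
            return V, policy
        policy = new_policy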

Q-learning

Q-Learning integrates policy and value functions to evaluate the usefulness of actions for future rewards. The quality of a state-action pair is represented as Q(s, a), reflecting the expected future value based on the current state and the agent's best policy. Once the agent learns the Q-Function, it identifies the action that yields the highest quality for a state.

The optimal Q-function, $q_*$, helps determine the best policy by selecting the actions that maximize value in each state. Essentially, $q_*$ gives the maximum expected return achievable by any policy $\pi$:

\[
q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)
\]

In basic Q-Learning, a lookup table (the Q-table) records state-action pairs and their corresponding values. Deep Q-Learning (DQN) instead employs a neural network to predict the Q-values for a given state.
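A minimal sketch of the tabular Q-Learning update and an epsilon-greedy action selection is shown below; in DQN the table Q would be replaced by a neural network that outputs Q-values for a given state. The function names are illustrative:

import random
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular step: Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the best known action."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])
    return int(np.argmax(Q[s]))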

2.2.3.5. Application

DRL can learn to solve large and complex decision problems – problems whose solution is not yet known, but for which an approximating trial-error mechanism exists that can learn a solution out of repeated interactions with the problem. [23]

Industrial Manufacturing

Deep Reinforcement Learning (DRL) is widely utilized in robotics, where agents learn to navigate dynamic environments, making it highly applicable in industrial manufacturing. This technology reduces labor costs, minimizes product defects, and decreases downtime, all while improving production speed, leading to more efficient and cost-effective operations.

Self-Driving Cars

Machine learning, particularly deep learning techniques, is at the heart of self-driving cars. These vehicles use visual data and neural networks to recognize pedestrians, roads, and traffic signs. They are trained to make safe decisions in complex driving scenarios, with a focus on minimizing human loss and optimizing routes for efficiency and safety.

Trading and Finance

While supervised learning techniques are often employed to predict stock market trends, Reinforcement Learning (RL) plays a critical role in decision-making processes such as whether to hold, buy, or sell shares. RL models are frequently evaluated against market benchmarks to ensure they achieve optimal performance in a constantly changing financial environment.

Natural Language Processing

RL is also making significant strides in natural language processing tasks, such as question-answering and chatbot development. Through reward-based methods, bots are trained to engage in coherent, informative conversations, constantly refining their responses to provide better user interactions.

Healthcare

In healthcare, RL is being leveraged to train bots for tasks like precision surgeries and accurate disease diagnosis. These systems can analyze biological data to predict medical conditions, offering the potential for earlier diagnosis and improved treatment plans, ultimately enhancing patient care and outcomes.

2.2.3.6. Example

Design the algorithm to play the Snake Game using Reinforcement learning (see Figure 2.16).


Figure 2.16: Snake Game

The Snake game is a familiar classic. The task is to design a DRL algorithm that controls the snake so that it eats as much food as possible.

Within this game, when the snake hits a wall or its own body it dies, the game ends, and the training episode ends with it. The reward can be specified as: eating food: +10; moving closer to or farther from the food: +1 or -1; dying: -10.

The snake interacts with the environment, the actions that yield good rewards are reinforced, and the snake's choice of direction gradually improves with training.

Key points:

Input: the state the snake observes (food location, wall location, and the direction the snake is moving).

Output: the direction in which the snake can get the most rewards.


Training: from the input states, the model is trained on the snake's interactions with the environment. It receives rewards, and the model that accumulates the most reward is saved as the starting point for the next round of training.

The model is trained continuously until the game ends. The best direction is selected based on the rewards the snake receives.
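A minimal sketch of how this state vector and reward scheme could be encoded is given below; the exact feature layout is an illustrative assumption rather than the thesis implementation:

import numpy as np

def snake_reward(ate_food, died, dist_before, dist_after):
    """Reward scheme above: +10 for food, -10 for dying, +/-1 for moving toward/away from the food."""
    if died:
        return -10
    if ate_food:
        return 10
    return 1 if dist_after < dist_before else -1

def snake_state(head, food, direction, danger_flags):
    """Illustrative state vector: danger around the head, current direction, and food direction.

    head, food   : (x, y) grid positions
    direction    : one-hot tuple (up, down, left, right)
    danger_flags : booleans (straight, left, right) for a wall or the body next to the head
    """
    food_dir = [food[0] < head[0], food[0] > head[0],    # food is to the left / right of the head
                food[1] < head[1], food[1] > head[1]]    # food is above / below the head
    return np.array(list(danger_flags) + list(direction) + food_dir, dtype=np.float32)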

The result of the game can be seen in Figure 2.17. In the beginning, the reward and score are very low, but they increase as training progresses.

Figure 2.17: Result of Snake Game

