Using the Q-Learning Algorithm to Implement an Automatic Maze-Solving Robot



Project description:

In this project, you will use a reinforcement learning algorithm to implement an automatic maze-solving robot.

  • As shown above, the robot is displayed in the upper right corner. Our maze contains two kinds of special cells: traps (red bombs) and the end point (the blue target). The robot should try to avoid the traps and reach the end point as quickly as possible.
  • The actions the robot can perform are: move up (u), right (r), down (d), and left (l).
  • After performing an action, the robot receives a reward that depends on the outcome. Specifically:

    • Hitting a wall: -10
    • Reaching the end point: 50
    • Stepping into a trap: -30
    • Any other move: -0.1
  • We need to modify robot.py to implement a Q-Learning robot that achieves the goals above.
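The reward scheme above can be captured in a small lookup table. This is only a sketch; the key names below are my own labels, not identifiers from robot.py:

```python
# Reward values from the project description; the key names are illustrative.
REWARDS = {
    'hit_wall': -10.0,     # the robot bumps into a wall
    'destination': 50.0,   # the robot reaches the end point
    'trap': -30.0,         # the robot steps into a trap
    'default': -0.1,       # any other move
}
```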

Section 1 Algorithm Understanding

1.1 Overview of Reinforcement Learning

Reinforcement learning, as one family of machine learning algorithms, is also a paradigm in which an agent learns from "experience" during "training" to accomplish a given task. However, unlike supervised and unsupervised learning, in the reinforcement learning framework we focus on learning through the interaction between the agent and the environment. In supervised and unsupervised learning tasks, the agent typically learns from a given training set, supplemented by a given training objective (such as minimizing a loss function) and a given learning algorithm. In reinforcement learning, by contrast, the agent learns from the rewards it receives by interacting with the environment. This environment can be either virtual (e.g., a virtual maze) or real (a self-driving car collecting data on real roads).

There are five core components in reinforcement learning: Environment, Agent, State, Action, and Reward. At a certain time step t:

  • The agent perceives its state from the environment
  • The agent selects an action according to certain criteria
  • The environment feeds back a reward to the agent according to the selected action

Through a reasonable learning algorithm, the agent will, under such a problem setting, successfully learn a policy for selecting an action given a state.
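The interaction loop described above can be sketched in a few lines. This is only a schematic with a toy environment of my own invention, not the project's actual Maze/Robot API:

```python
# Schematic of the agent-environment loop (hypothetical toy environment,
# not the project's actual Maze/Robot API).
class DummyEnv:
    def __init__(self):
        self.pos = 0

    def sense(self):
        return self.pos                       # the agent perceives its state

    def step(self, action):
        self.pos += 1 if action == 'r' else 0
        return -0.1 if self.pos < 3 else 50   # the environment feeds back a reward

env = DummyEnv()
total_reward = 0.0
for t in range(5):
    state = env.sense()         # 1. perceive the current state
    action = 'r'                # 2. select an action (here: always move right)
    reward = env.step(action)   # 3. receive the reward for that action
    total_reward += reward
```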

1.2 Calculating the Q Value

In our project, we will implement a reinforcement learning algorithm based on Q-Learning. Q-Learning is a Value Iteration algorithm. Unlike Policy Iteration algorithms, a Value Iteration algorithm computes the value (or utility) of each "state" or "state-action" pair and then tries to maximize this value when selecting actions. Accurate estimation of each value is therefore the core of the value iteration approach. We usually consider maximizing the long-term reward of an action, that is, not only the immediate reward the action brings, but also the rewards that follow from it.

In the Q-Learning algorithm, we record this long-term reward as a Q value and consider the Q value of each "state-action" pair. Specifically, it is computed as:

Q(s_t, a) = R_{t+1} + γ · max_{a'} Q(s_{t+1}, a')

That is, for the current state-action pair (s_t, a), we consider the reward R_{t+1} that the environment returns after the action is performed, plus the maximum Q value obtainable by performing any action a' after arriving at the next state s_{t+1}; γ is the discount factor.

However, in general we use a more conservative method of updating the Q table: we introduce a relaxation factor alpha and update according to the following formula, so that the iterative change of the Q table is smoother:

Q(s_t, a) = (1 - α) · Q(s_t, a) + α · (R_{t+1} + γ · max_{a'} Q(s_{t+1}, a'))
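As a quick numerical illustration of this smoothed update (the values below are chosen arbitrarily, purely for illustration):

```python
alpha, gamma = 0.5, 0.9

q_old = 10.0      # current Q(s, a)
r = -0.1          # reward observed after the action
max_next = 20.0   # max over a' of Q(s', a')

# (1 - alpha) * old estimate + alpha * (reward + discounted best next value)
q_new = (1 - alpha) * q_old + alpha * (r + gamma * max_next)
# (1 - 0.5) * 10 + 0.5 * (-0.1 + 0.9 * 20) = 5 + 0.5 * 17.9 = 13.95
```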

Calculate the Q value according to the known conditions.

Known: As shown in the figure above, the robot is located at s1 and performs action u; the rewards for actions follow the default settings of the project. The Q values of the actions executable at s2 are: u: -24, r: -13, d: -0.29, l: +40, and γ is 0.9.
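Under the assumption that moving from s1 to s2 is an ordinary step (default reward -0.1; the figure is not reproduced here, so this is an assumption), the exercise works out as:

```python
gamma = 0.9
r = -0.1   # assumed: an ordinary move, so the default reward applies

# Q values of the actions executable at s2, from the exercise statement
q_s2 = {'u': -24, 'r': -13, 'd': -0.29, 'l': 40}

# Q(s1, u) = r + gamma * max_a Q(s2, a)
q_s1_u = r + gamma * max(q_s2.values())
# -0.1 + 0.9 * 40 = 35.9
```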

1.3 How to Select Actions

In reinforcement learning, the "exploration-exploitation" trade-off is a very important issue. Specifically, according to the definition above, we try to let the robot choose the best decision every time so as to maximize the long-term reward. But this has the following disadvantages:

  • Early in training, our Q values are inaccurate. If we choose actions according to these Q values, we will make errors.
  • After learning for a while, the robot's route becomes relatively fixed, so it cannot effectively explore the environment.

Therefore, we need a way to solve the above problems and increase the robot's exploration. We consider the epsilon-greedy algorithm: when the robot selects an action, with some probability it selects a random action, and with the remaining probability it selects the action with the best Q value. At the same time, the probability of selecting a random action should gradually decrease as training progresses.

In the following code block, implement the logic of epsilon-greedy algorithm and run the test code.

import random  
import operator  

actions = ['u', 'r', 'd', 'l']  
qline = {'u': 1.2, 'r': -2.1, 'd': -24.5, 'l': 27}  
epsilon = 0.3  # choose a random action with probability 0.3  

def choose_action(epsilon):  
    if random.uniform(0, 1.0) <= epsilon:  # with probability epsilon,  
        return random.choice(actions)      # choose a random action  
    # otherwise choose the action with the largest Q value  
    return max(qline.items(), key=operator.itemgetter(1))[0]  

res = ''  
for i in range(100):  
    res += choose_action(epsilon)  
print(res)  
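The gradual decrease of the random-exploration probability can be sketched, for example, with a simple exponential schedule. The decay factor 0.95 here is an arbitrary choice for illustration, not a value prescribed by the project:

```python
epsilon0 = 0.5

def epsilon_at(t, decay=0.95):
    """Exponential decay: the exploration probability shrinks
    by a constant factor at every training step t."""
    return epsilon0 * decay ** t

# Early in training the robot explores often; much later, almost never.
early, late = epsilon_at(0), epsilon_at(100)
```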


Section 2 Code Implementation

2.1 Understanding the Maze Class

We first introduce the maze class Maze. It is very powerful: it can randomly create a maze according to your requirements, or read maze map information from a specified file.

  • Use Maze("file_name") to create a maze from the specified file, or use Maze(maze_size=(height, width)) to create a random maze.
  • Use the trap_number parameter to set the number of traps when creating a maze.
  • Type the name of the maze variable directly and press Enter to display the maze image (e.g., after g = Maze("xx.txt"), simply enter g).
  • It is suggested that the generated maze be 6~12 cells high and 10~12 cells wide.

In the following code block, create your maze and show it.

from Maze import Maze  
%matplotlib inline  
%config InlineBackend.figure_format = 'retina'  
## to-do: create the maze and display it  
g = Maze(maze_size=(6,8), trap_number=1)  
g  

Maze of size (12, 12)

You may have noticed that we have placed a robot in the maze by default. In fact, we have configured corresponding APIs for the maze to help the robot move and sense. The two APIs you will use later are maze.sense_robot() and maze.move_robot().

  • maze.sense_robot() is a function with no parameters that returns the robot's current position in the maze.
  • maze.move_robot(direction) takes a movement direction as input, moves the robot, and returns the reward value of the corresponding action.

Move the robot randomly, record the rewards obtained, and show the robot's final position.

rewards = []  
## loop: move the robot randomly 10 times, recording the rewards  
for i in range(10):  
    rewards.append(g.move_robot(random.choice(actions)))  
## output the robot's final position  
print(g.sense_robot())  
## print the maze to observe the robot's position  
g  


2.2 Robot Class Implementation

The Robot class is the part we need to focus on. In this class, we need to implement many functions to successfully build a reinforcement learning agent. Previously we moved the robot through the environment by hand; now, by implementing the Robot class, the robot will move by itself. By implementing the learning functions, the Robot class will learn how to choose the best action and update the corresponding reinforcement learning parameters.

First of all, Robot takes several inputs: alpha=0.5, gamma=0.9, epsilon0=0.5 are the default values of the reinforcement learning parameters you learned about earlier, and maze should be the maze object in which the robot is located.

Next, observe the Robot.update function, which describes the procedure the Robot executes each time it performs an action. From these steps, the purpose of each helper function also becomes clear.

Run the following code to check the effect (remember to change the maze variable to the name of the variable of the maze you created).

import random  
import operator  

class Robot(object):  

    def __init__(self, maze, alpha=0.5, gamma=0.9, epsilon0=0.5):  

        self.maze = maze  
        self.valid_actions = self.maze.valid_actions  

        self.state = None  
        self.action = None  

        # Set parameters of the learning robot  
        self.alpha = alpha  
        self.gamma = gamma  

        self.epsilon0 = epsilon0  
        self.epsilon = epsilon0  
        self.t = 0  

        self.Qtable = {}  
        self.reset()  

    def reset(self):  
        """Reset the robot."""  
        self.state = self.sense_state()  

    def set_status(self, learning=False, testing=False):  
        """Determine whether the robot is learning its Q table,  
        or executing the testing procedure."""  
        self.learning = learning  
        self.testing = testing  

    def update_parameter(self):  
        """Some of the parameters of the Q-learning robot can be altered;  
        update these parameters when necessary."""  
        if self.testing:  
            # TODO 1. No random choice when testing  
            self.epsilon = 0  
        else:  
            # TODO 2. Update parameters when learning  
            self.epsilon *= 0.95  

        return self.epsilon  

    def sense_state(self):  
        """Get the current state of the robot."""  
        # TODO 3. Return robot's current state  
        return self.maze.sense_robot()  

    def create_Qtable_line(self, state):  
        """Create the Q table line for the current state."""  
        # TODO 4. Create qtable with current state.  
        # Our qtable should be a two-level dict:  
        # Qtable[state] = {'u': xx, 'd': xx, ...}  
        # If Qtable[state] already exists, do not change it.  
        self.Qtable.setdefault(state, {a: 0.0 for a in self.valid_actions})  

    def choose_action(self):  
        """Return an action according to the given rules."""  
        def is_random_exploration():  
            # TODO 5. Return whether to make a random choice.  
            # hint: generate a random number and compare it with epsilon  
            return random.uniform(0, 1.0) <= self.epsilon  

        if self.learning:  
            if is_random_exploration():  
                # TODO 6. Return a randomly chosen action  
                return random.choice(self.valid_actions)  
            else:  
                # TODO 7. Return the action with the highest Q value  
                return max(self.Qtable[self.state].items(), key=operator.itemgetter(1))[0]  
        elif self.testing:  
            # TODO 7. Choose the action with the highest Q value  
            return max(self.Qtable[self.state].items(), key=operator.itemgetter(1))[0]  
        else:  
            # TODO 6. Return a randomly chosen action  
            return random.choice(self.valid_actions)  

    def update_Qtable(self, r, action, next_state):  
        """Update the Q table according to the given rule."""  
        if self.learning:  
            # TODO 8. When learning, update the Q table according to the given rules  
            self.Qtable[self.state][action] = (1 - self.alpha) * self.Qtable[self.state][action] + \
                self.alpha * (r + self.gamma * max(self.Qtable[next_state].values()))  

    def update(self):  
        """Describe the procedure to execute when updating the robot.  
        Called once per step in every epoch of training or testing.  
        Returns the current action and reward."""  
        self.state = self.sense_state()        # Get the current state  
        self.create_Qtable_line(self.state)    # For the state, create a Q table line  
        action = self.choose_action()          # Choose an action for this state  
        reward = self.maze.move_robot(action)  # Move the robot for the given action  
        next_state = self.sense_state()        # Get the next state  
        self.create_Qtable_line(next_state)    # Create a Q table line for the next state  
        if self.learning and not self.testing:  
            self.update_Qtable(reward, action, next_state)  # Update the Q table  
            self.update_parameter()            # Update parameters  

        return action, reward  

# from Robot import Robot  
# g = Maze(maze_size=(6,12), trap_number=2)  
robot = Robot(g)  # remember to change the maze variable to the name of the maze you created  
robot.set_status(learning=True)  
robot.update()  

('d', -0.1)
Maze of size (12, 12)

2.3 Using the Runner Class to Train the Robot

After completing the above, we can begin to train our Robot. We have prepared another class, Runner, which carries out the whole training process and its visualization. Using the following code, you can successfully train the robot, and a video file named filename will be generated recording the whole training process. By watching this video, you can spot problems in the training process and optimize your code and parameters.

Try to use the following code to train the robot and adjust the parameters. Optional parameters include:

  • Training parameters

    • Number of training epochs: epoch
  • Robot parameters:

    • epsilon0 (the initial value of epsilon)
    • epsilon decay (can be linear or exponential, with adjustable decay speed); you need to adjust this in Robot.update_parameter()
    • alpha
    • gamma
  • Maze parameters:

    • Maze size
    • Number of Traps in Maze
  • An example set of parameters:

epoch = 20  
epsilon0 = 0.5  
alpha = 0.5  
gamma = 0.9  
maze_size = (6,8)  
trap_number = 2  

from Runner import Runner  
g = Maze(maze_size=maze_size, trap_number=trap_number)  
r = Robot(g, alpha=alpha, epsilon0=epsilon0, gamma=gamma)  
runner = Runner(r, g)  
runner.run_training(epoch, display_direction=True)  
# runner.generate_movie(filename="final1.mp4")  # You can comment out this line to speed up execution, but then you won't be able to watch the video.  

Use the runner.plot_results() function to print some statistics about the robot during training.

  • Success Times represents the cumulative number of times the robot succeeded during training; this should be a cumulative curve.
  • Accumulated Rewards represents the total reward the robot obtains in each training epoch; this should be a gradually increasing curve.
  • Running Times per Epoch represents the number of steps the robot takes in each training epoch (an epoch stops when the robot reaches the end point, moving on to the next epoch); this should be a gradually decreasing curve.

    Use runner.plot_results() to output the training results.


Author: Yang Fei

Source: Yixin Institute of Technology