Strategic Route Learning Robot Behaviour Computer Science Essay


Abstract - This paper presents a novel Fuzzy Sarsa(λ) Learning (FSλL) approach applied to a strategic route learning task of a mobile robot. FSλL is a hybrid architecture that combines Reinforcement Learning and Fuzzy Logic control. The Sarsa(λ) learning algorithm is used to tune the rule-base of a Fuzzy Logic controller, which has been tested in a route learning task. The robot explores its environment using the fixed experience provided by a discretized Fuzzy Logic controller, and then learns optimal policies that achieve goals in less time and with fewer errors.

Index Terms - Reinforcement Learning, Fuzzy Q-Learning, Fuzzy Logic Controllers, Sarsa(λ), Robot Autonomy.

I. Introduction

Reinforcement Learning (RL) is widely used for understanding and automating goal-directed learning as well as decision-making [1]. According to R.S. Sutton et al., the interaction between a learning agent and its environment, described in terms of states, actions, rewards, and value functions, aims to achieve efficient search in the space of policies. Kaelbling and Littman presented the RL problem faced by agents that learn behaviours through trial and error within dynamic environments [2]. They discuss central issues of reinforcement learning, including the trade-off between exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, and constructing empirical models to accelerate learning. A related approach, Sarsa(λ), is considered one of the most powerful methods in RL. Agents with built-in Sarsa(λ) algorithms initially explore the state space by obtaining instant experience. The efficiency of learning with an on-policy method like Sarsa is quite promising, since it can be more effective than off-policy methods like Q-learning.

The use of an Adaptive Heuristic Critic (AHC) RL engine to refine a FLC engine has been presented in [3][7]. The actor part of the AHC used in this project is a conventional FLC in which the parameters of the input membership functions are learnt from an immediate internal reinforcement signal. This internal reinforcement signal is derived from a prediction of the evaluation value of a policy and the external reinforcement signal. The evaluation value of a policy is learnt by temporal difference (TD) learning in the critic part, which is also represented by a FLC. Furthermore, Glorennec has attempted to fuse two model-free algorithms, Q-Learning and Fuzzy Q-Learning, for a broom balancing task [8].

Inspired by [2][3][7][8][9], this paper combines Reinforcement Sarsa(λ) Learning (RSL) with FLCs to form the FSλL hybrid architecture, in order to achieve faster learning performance. The RSL algorithm was chosen for the route learning task because it helps robot agents, operating under unsupervised control, to passively observe the state-action space and the immediate rewards encountered, checking the updated experience held in the fuzzy rule-bases and integrating this information so as to successfully achieve predefined goals. Once the learnt policy, along with the acquired experience, is "good" enough to control the robot agent, the learning error and the time steps required to achieve goals are reduced.

The rest of the paper is organized as follows. Section II presents the route learning task and the proposed hybrid learning architecture. In section III, the FLC architecture is used to show how the learning process is accelerated by fetching artificial experience. In section IV the FSλL controller is presented along with the core FSλL algorithm. The experimental results and analysis are presented in section V to demonstrate the learning performance of the proposed hybrid approach. Finally, conclusions and future improvements are briefly given in section VI.

II. Hybrid Learning Architecture

A. Strategic Route Learning Task

The FSλL approach presented in this paper is mainly used to accelerate the learning process of a robot agent. It has been tested on a route learning task in which the robotic agent has to learn how to find an optimal route toward a goal point, as shown in Fig. 1. While searching the environment, the only information given to the robot agent is the coordinates of the goal. Immediate rewards are collected to define the agent's experience, which is learnt by trial. According to this artificial experience, built from immediate rewards and pre-given fuzzy experience of how to perform discretized actions in certain situations, the robot agent constructs learning strategies which help it to confront the task's nature in general and the route learning task in particular.

Fig. 1 The route learning task performed by a strategic optimal policy.

B. Fuzzy Sarsa(λ) Learning Architecture

Inspired by [2][3][7][8][9], the FSλL model is described by a learning architecture in Fig. 2, in which the FSλL is connected with a static environment that feeds the robot agent with state information. According to the perceived states the agent applies actions to the environment that responds with rewards. This is the state-action-reward cycle which takes place in all RL algorithms as well.

The FSλL architecture is based on the conventional Sarsa(λ) algorithm and FLCs. Initially the Sarsa(λ) algorithm prepares the quintuple (s, a, r, s', a') and passes all this information, through the δt error, to the defuzzifier engine.

The defuzzifier engine undertakes the updating of the Q-function. The updated Q-function is propagated back to the Sarsa(λ) algorithm and then, according to the updated Q(s, a) values, the Sarsa(λ) algorithm selects an action with probability 1-ε (ε-greedy policy) and the agent performs that action in the environment. Immediately after that, a reward r evaluates the action performed and the algorithm receives the next state s'. The next action a' is not selected again by the ε-greedy policy; a FLC generates the next action according to the next state s' perceived from the environment. This last step ends the FSλL algorithm's cycle. In subsequent algorithmic cycles the ε-greedy policy selects among fuzzy actions by tuning the rule-base.
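The ε-greedy selection described above can be sketched as follows; the function name and the example Q-values are illustrative, not part of the original controller:

```python
import random

# Minimal sketch of epsilon-greedy action selection: the action with the
# highest Q-value is chosen with probability 1 - epsilon, otherwise a
# uniformly random action is taken (exploration).
def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# With epsilon = 0 the choice is purely greedy:
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # -> 1
```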

C. Reinforcement Sarsa(λ) Learning

The learning controller for the route learning task is based on Sarsa(λ) learning, which can explore the state space and exploit it at the same time by obtaining instant experience. Thus it does not have to wait until the goal area is reached in order to propagate back the collected rewards and build a "delayed" experience. The idea in Sarsa(λ) is to apply the TD(λ) prediction method to state-action pairs rather than to states only. Obviously, a trace of eligibility is needed for each state-action pair, et(s, a), and the Q-function is updated by:

Qt+1(s, a) = Qt(s, a) + α δt et(s, a), for all s, a    (1)

Fig. 2 Fuzzy Sarsa(λ) Learning architecture.

where α is the learning gain (0 < α ≤ 1). If α decreases sufficiently slowly towards zero, the value function converges to the optimal policy [8]. δt is the TD error, which is described by:

δt = rt+1 + γ Qt(st+1, at+1) − Qt(st, at)    (2)

whereas for the replacing eligibility traces we have:

et(s, a) = 1 if (s, a) = (st, at); et(s, a) = γλ et−1(s, a) otherwise    (3)
Sarsa(λ) is an on-policy algorithm, which approximates Qπ(s, a) [3], i.e. the action values for the current policy π. In our case the ε-greedy method is used for action selection: the action with the maximum Q(s, a) value is taken with probability 1-ε, otherwise the selection mechanism switches to choosing a random action [4]. The use of eligibility traces can improve the efficiency of control algorithms such as Sarsa [1]. One of the most significant characteristics of the Sarsa algorithm is that it credits each action taken during the agent's locomotion, so that on-line learning can be achieved. Each eligibility trace constitutes a short-term memory of the frequency of state-action pair visits, or of triggered rules in the case of Reinforcement Fuzzy Learning [2]. Each trace decays exponentially until the next update, which can occur through a new visit to the corresponding state [8].
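As a concrete illustration, one tabular Sarsa(λ) backup with replacing eligibility traces might look like the following sketch; the table sizes and parameter values are assumptions for illustration only:

```python
import numpy as np

# Hypothetical sizes and learning parameters for the sketch.
N_STATES, N_ACTIONS = 16, 5
alpha, gamma, lam = 0.1, 0.9, 0.8   # learning gain, discount, trace decay

Q = np.zeros((N_STATES, N_ACTIONS))
e = np.zeros((N_STATES, N_ACTIONS))

def sarsa_lambda_step(s, a, r, s_next, a_next):
    """Apply one Sarsa(lambda) backup to the whole Q-table."""
    global Q, e
    # TD error: difference between successive state-action value estimates
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    # Replacing traces: decay all traces, then reset the visited pair to 1
    e *= gamma * lam
    e[s, a] = 1.0
    # Credit every previously visited pair in proportion to its trace
    Q += alpha * delta * e

sarsa_lambda_step(s=0, a=2, r=-1.0, s_next=1, a_next=3)
```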

III. Fuzzy Logic Controller

Two different FLC systems are used in this project. The first FLC classifier is used as a mechanism providing fixed experience based on discretized input and output spaces, which helps the robot agent to achieve better action performance. The second FLC classifier is used to refine the updated Q-function (1), so that learning is accelerated.

The input state space of the discretized FLC algorithm uses five discrete laser vectors, LLV, LFLV, CLV, RFLV and RLV, which are assigned the five fuzzy variables Left, LFrt, Front, RFrt and Right respectively. Each variable is divided into three areas: Ner (near), Med (medium) and Far (far). These areas represent the distance from the agent's body to the detected object or obstacle. The output action space of the FLC is divided into five discrete angle actions allocated at the central point of each laser vector; Act0 to Act4 correspond to the five action variables Lws, LForw, Forw, RForw and Rws. Fig. 3 presents the five laser vectors, each divided into the three range areas Far, Med and Ner. Each active laser vector generates one of the state weights; if more than one laser vector detects an object, the generated state weights are summed.

LLV: Left Laser Vector (weight = 1)

LFLV: Left Front Laser Vector (weight = 2)

CLV: Central Laser Vector (weight = 4)

RFLV: Right Front Laser Vector (weight = 6)

RLV: Right Laser Vector (weight = 8)

Fig. 3 Agent's laser vectors in 5 different areas.

The FLC's Fuzzy Inference Engine (FIE) selects each discrete action according to the fuzzy antecedents of the input space by consulting a fuzzy rule-base, which processes the input laser data to generate an output fuzzy consequent data set.
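The fixed state-weight scheme above can be sketched as follows; the detection threshold is a hypothetical value, since the sensing range is not stated here:

```python
# Each laser vector that detects an object contributes its fixed weight,
# and simultaneous detections are summed, as described above.
WEIGHTS = {"LLV": 1, "LFLV": 2, "CLV": 4, "RFLV": 6, "RLV": 8}
DETECT_RANGE = 2.0  # metres; assumed sensing threshold (hypothetical)

def state_weight(readings):
    """readings: dict mapping laser-vector name -> measured distance."""
    return sum(w for name, w in WEIGHTS.items()
               if readings.get(name, float("inf")) < DETECT_RANGE)

# Obstacle seen by the central and right-front vectors: 4 + 6 = 10
print(state_weight({"CLV": 1.2, "RFLV": 1.8, "RLV": 5.0}))  # -> 10
```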

A. Membership Functions

Here, two types of membership functions are used: the trapezoid μfs(x) for the fuzzy input space and the singleton μfs(x) for the fuzzy output space. The maximum value of membership functions for both trapezoid and singleton μfs is defined by the height (4), which equals 1:

hgt(μfs) = supx μfs(x) = 1    (4)
Fuzzy sets are used to describe elements of uncertainty, where N is the number of rules and n is the dimension of the input space [5].

1) Trapezoid μf(x): The support area of a trapezoid μf is defined by the section whose values are greater than zero:

supp(μf) = {x | μf(x) > 0} = (a, d)

The core area and the geometric representation of a trapezoid μf are defined by the section whose values have the maximum degree of membership to the fuzzy set, described by the quantities a ≤ b ≤ c ≤ d:

core(μf) = {x | μf(x) = 1} = [b, c]

2) Singleton μf(x): The singleton μf has no support area; it is characterized only by its core point a:

μf(x) = 1 if x = a, and μf(x) = 0 otherwise
B. Fuzzification and Fuzzy Logical Operators

The total number of fuzzy rules, N, is the product of the numbers of fuzzy sets of all input state variables [5][7]:

N = ∏ⱼ |Fⱼ|, j = 1, …, n

For the five laser inputs with three fuzzy sets each, this gives N = 3⁵ = 243.
The truth value of the rule activated by the input vector x is calculated by Mamdani's minimum fuzzy implication:

αi(x) = minⱼ μFⱼⁱ(xⱼ)

For this implication, one type of T-norm operator is adopted, the probabilistic AND (product):

αi(x) = ∏ⱼ μFⱼⁱ(xⱼ)
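A minimal sketch of these pieces, assuming hypothetical Ner/Med sets on a single laser distance, is:

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: support (a, d), core [b, c], height 1."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Hypothetical 'Ner'/'Med' sets over one laser distance (metres)
ner = lambda x: trapezoid(x, -0.1, 0.0, 0.5, 1.0)
med = lambda x: trapezoid(x, 0.5, 1.0, 1.5, 2.0)

# Firing strength of a two-antecedent rule under the two conjunctions
# mentioned above: Mamdani minimum and the probabilistic product.
mu1, mu2 = ner(0.75), med(1.2)
strength_min = min(mu1, mu2)
strength_prod = mu1 * mu2
```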
C. Fuzzy Inference System

The Fuzzy Inference Engine (FIE), according to the fuzzy rule-base, uses the membership functions to map fuzzy sets onto other fuzzy sets [6]. A sample fuzzy rule is given below [5]:

Ri: IF x1 is F1ⁱ AND … AND xn is Fnⁱ THEN y is ci

The above fuzzy rule is implemented by a fuzzy implication Ri as follows:

μRi(x, y) = min(αi(x), μci(y))
D. Height Singleton Defuzzification

The Singleton Height Defuzzification is used here because it reduces the computation cost and is useful for symmetric consequents, which makes the defuzzification process much faster. However, with the height singleton method the actions performed by the robot agent correspond to five very distinct angle values, so the action space is limited; on the other hand, the output action responses are faster. The height defuzzifier evaluates μi(x), takes the apexes of each μf and computes the output of the FLC in the following general form:

y = Σi wi · apexi / Σi wi

where apexi denotes the output recommended by rule i, taken as the centre of gravity of the fuzzy set associated with the activation of the rule [5], while wi denotes the weight of rule i.
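The general form above can be sketched directly; the rule strengths and apex angles below are hypothetical values:

```python
# Height (singleton) defuzzification: the crisp output is the
# firing-strength-weighted average of the rule apexes.
def height_defuzzify(strengths, apexes):
    total = sum(strengths)
    if total == 0:
        return 0.0  # no rule fired; this fallback is an assumption
    return sum(w * a for w, a in zip(strengths, apexes)) / total

# Two active rules recommending steering angles of -30 and 0 degrees:
print(height_defuzzify([0.25, 0.75], [-30.0, 0.0]))  # -> -7.5
```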

IV. Fuzzy Sarsa(¬) Learning (FSλL) Controller

A. Fuzzy Rule-Bases in Sarsa(¬) Learning

Similarly to Q-learning, a typical RFL system represents its rule-base by connecting the action selection and the associated q-values to form a new set of fuzzy rules, i.e. the rules have the form [2][7][8][9]:

FLC: IF s is Fi THEN a = ci

RFL: IF s is Fi THEN a[i, 1] with q[i, 1] or … or a[i, K] with q[i, K]

The FLC form represents a typical fuzzy rule: for a given state s in a fuzzy set F, the output a becomes c. The RFL form represents a fuzzy rule in which the learning agent can choose an action α among the action set A for each rule i, denoted a[i, k[i]], while q[i, k[i]] denotes its corresponding q-value [10]. The learning process determines the best set of rules by optimising the future reinforcements. The initial rule-base has N rules of the above form. For every rule i, let k[i] be the subscript of the action chosen by an EEP [8][9].

B. Fuzzy Value Function Approach

The greedy policy is the one used by a FIS to select, for a state vector, the value that can obtain the best results. Let q[i, maxk[i]] be the maximum q-value for rule i; the value function then becomes:

V(s) = Σi φi(s) · q[i, maxk[i]]

where φi(s) denotes the normalised firing strength of rule i.
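A small sketch of this fuzzy value function, assuming normalised firing strengths and a hypothetical two-rule q-table, is:

```python
# Each rule contributes its best q-value, weighted by its (normalised)
# firing strength, as in the value function described above.
def fuzzy_value(phis, q_table):
    """phis: firing strengths per rule; q_table: per-rule q-value lists."""
    total = sum(phis)
    return sum(p * max(q) for p, q in zip(phis, q_table)) / total

q_table = [[0.0, 0.4, 0.1],   # q-values of rule 1's candidate actions
           [0.2, 0.0, 0.8]]   # q-values of rule 2's candidate actions
print(fuzzy_value([0.5, 0.5], q_table))  # -> 0.6
```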
C. Updating the q-Values

Recall that the TD error δt in Q-learning describes the difference between the Q-value of the next state-action pair and the Q-value of the current state-action pair. Thus, the error signal used to update the action q-values is given by:

δt = rt+1 + γ Vt(st+1) − Q(st, at)
An effective method for appropriate action selection is to perform gradient descent on the expected reward and to integrate the rule-base with back-propagation, so that the immediate rewards are maximized. This class is called a REINFORCE algorithm, including linear reward-inaction as a special case [2]. The gradient algorithm is used to update the q-values incrementally [8] by the following equation:

Δq[i, k[i]] = α · δt · φi(st)

where α is the learning rate.

D. Updating the e-Values

RFL algorithms obtain more effective and faster learning when they are combined with eligibility traces (λ) [8]. For the replacing eligibility traces used in our analysis, we have:

et[i, k] = φi(st) if k = k[i]; et[i, k] = γλ et−1[i, k] otherwise

Thus, the updating formula for the q-values becomes:

q[i, k] ← q[i, k] + α · δt · et[i, k]
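Putting the trace decay and the gradient-style q-update together, one step might be sketched as below; the rule count, action count and parameter values are hypothetical:

```python
import numpy as np

alpha, gamma, lam = 0.1, 0.9, 0.7   # assumed learning parameters
q = np.zeros((3, 2))   # 3 rules, 2 candidate actions per rule
e = np.zeros((3, 2))   # per-rule, per-action eligibility traces

def update_q(phis, chosen, delta):
    """phis: firing strengths; chosen[i]: action index picked in rule i."""
    global q, e
    e *= gamma * lam                      # decay all traces
    for i, k in enumerate(chosen):
        e[i, k] = phis[i]                 # replace traces of triggered rules
    q += alpha * delta * e                # back-propagate the TD error

update_q(phis=[0.6, 0.4, 0.0], chosen=[1, 0, 0], delta=0.5)
```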
E. Q-Function and FIS Representation

A FIS is used to represent the set of actions A(st) which is associated with the Q-function.

In a RFL system the evaluation of an action is obtained by defuzzifying the input state vector s (where s = x). The Takagi-Sugeno defuzzification is used to calculate the action selection (the α(s) function) as well as to update the Q-function (the Q(s, α) function), as shown below [8][9]:

α(st) = Σi φi(st) · a[i, k[i]]    (21)

Q(st, αt) = Σi φi(st) · q[i, k[i]]    (22)

Recall that both the α(s) and Q(st, αt) functions use singleton-height defuzzification to calculate the selection of each action.

F. FS¬L Algorithm

Table I shows the Fuzzy Sarsa(λ) Learning algorithm, which is influenced by [2][3][7][8][9]. According to Fig. 2, the q-values are zeroed to avoid distorting the training, since they are not essential for the first stage of the learning process, as in Q-learning [8][10]. Exploration is achieved by performing a set of arbitrary actions at each time step t. These actions are generated through reinforcement signals, and the resulting experience can be exploited by the agent in future steps. An Exploration/Exploitation policy (EEP) is then used to select actions.

After the Q-function and the eligibility traces (the e-function) are initialized to zero, the episode counter starts the learning procedure by executing a number of functions. A subsumption architecture which supervises the status of the FSλL algorithm decides whether to finalize the currently running episode under the following circumstances: (i) the number of steps exceeds a predefined threshold, (ii) the goal is achieved, (iii) an unexpected event comes into the foreground.

TABLE I FS¬L Algorithm _

Initialize Q(s, a) to zero, for all s, a

Initialize e(s, a) to zero, for all s, a

Repeat for each episode:


Init a static and a fuzzy state a

Repeat for each step of episode:


Select a static action a using e-greedy policy

Take the static action a

Receive a reward r

Get the next state s'

Compute the next action a' derived from FLC

Update Q-Function


dt error

Δq gradient

e traces

defuzzify Q(st, at)


Restore the pairs: (s, a) ← (s', a')


}Until s is terminal (goal area) ____________________________________________________________________________________

The episode's body executes the state capturing by taking two state snapshots: (1) the static state, denoting the current sensory discretized value of the environment at the agent's position, evaluated with fixed weights, and (2) the fuzzy state, denoting the fuzzy evaluation of the weights taken by the static consequent. Both states are captured using the laser scanner. Thereafter, the ε-greedy policy is used to select the action with the maximum Q-value according to the current static state. The selected action is executed by the TakeAction( ) engine, which is a low-level discretized controller function.

The performance of the action taken is evaluated by a reward (whether the goal is achieved or not), and according to that the next state is computed. Immediately after, the next action is calculated by the action-defuzzifier engine, which chooses an action according to the fuzzified states and the rule-base. At the end of this basic procedure, which prepares the Q-function, all the data derived from the states and actions are evaluated by the Q-function. At the same time, the Q-function evaluates itself through the Q-defuzzifier engine, which updates the Q-content. Before the value is defuzzified, the Q-function calculates the TD error, an evaluation of the difference between the current and the next state-action pair, together with the q-gradient. This is a back-propagating REINFORCE-style update [8], and the eligibility traces are snapshots of each state visited. Finally, the current states and actions are replaced by the next states and actions.

V. Experimental Results and Analysis

The FSλL route learning task has been tested in a simulated environment, arena A1 shown in Fig. 1, without any training courses; its I/O parameters are shown in Table II. In this arena a triangular obstacle is located in the middle and there is only one path out to the goal area. In testing the route learning task, the agent's general behaviour was observed to be stable, with the learning rate improving by around 90%: from ~120 time steps in the 1st episode to time steps fluctuating around ~20 by the 24th episode. The agent had adopted an optimal policy from the 3rd episode, where it was able to find the goal area in ~20 time steps (plays).

The same experiment was repeated with three training episodes. The first three episodes were given by a human tutor showing the agent the way to the goal area. From the 4th episode the agent's behaviour was observed to be improving and stable, with precisely the same learning performance as in the previous experiment.

The next experiment took place in the simulated arena A2, in which a square obstacle was placed in the middle of the arena and two paths to the goal area were accessible. The agent's learning performance observed in this experiment was almost the same as in the two previous experiments.

Finally, the conventional Reinforcement Sarsa(λ) Learning algorithm was tested in arena A1 without training courses. The agent's behaviour was observed to be unstable but with a learning rate improving from the 1st to the 23rd episode, fluctuating around ~50%, while an optimal policy was adopted from the 15th episode with plays fluctuating around ~45.

A. Learnt Trajectories

The robot agent's performance in the route learning task is presented in Fig. 4, including the 1st episode and three more randomly selected episodic tasks. In this testing, the robot agent performed 24 episodes without external training by an expert tutor (user) using the FSλL controller. Table III presents the learning parameters of the FSλL controller.

Looking at the 1st episode in Fig. 4, it can be seen that the robot's behaviour is unstable: the robot agent initially explores the arena until it finds a path to the goal area. In the 6th episode the robot spends fewer time steps reaching the goal area because of the experience obtained in the past five episodes. As can be seen on the left side of the arena, the learning performance has improved and the uncertainty which produced incorrect actions in previous episodes has been reduced dramatically. In the 11th episode it is clear that the robot agent has formed a path, showing that learning has clearly improved; the uncertain actions have been reduced and the agent has obtained precise knowledge of where the goal area is located.

TABLE II Rule table of the route learning task (I/O variable correspondence and state weights).
TABLE III Learning parameters for the route learning testing (learning rate, percentage of exploration, discount factor, lambda parameter, environment discrete factor).
Fig. 4 FS¬L algorithm performance (24 episodes in arena A1).

B. Collected Rewards

Fig. 5 shows the collected rewards in a 2D space, taken over the agent's performance across 24 episodes in arena A1. In this reward representation, rewards are allocated according to the arena's formation (the locations of the walls and obstacles). This can be verified by comparing the trajectories of the 11th and 22nd episodes in Fig. 4. The rewards reveal the most visited states and the exact locations the agent interacted with; they form a curved trajectory which starts from the left side of the arena and ends at the right.

C. Learning Performances

As shown in the play performance graph, Fig. 6(a), the number of plays was reduced dramatically during the first three episodes. Over the next 17 episodic tasks the plays remained stable, fluctuating around 20 per episode, while in the 19th episode there was a sharp drop to fewer than 10.

Fig. 6(b) shows the reward performance (learning error) graph. The rewards collected in the 1st episode were close to 50, meaning that bumps into walls and obstacles were frequent. The learnt optimal policy can now drive the robot agent to the goal area without hitting walls or obstacles. If a new dynamic obstacle were placed on the agent's learnt path, the FSλL algorithm would reconfigure its parameters, adapting itself to the new condition. Gradually the agent would learn about the new obstacle and consequently find the goal area, a fact which demonstrates the system's adaptability.

Finally, the average performance of the system over the 24 episodes fluctuates around 20 plays, while the collected rewards average approximately 12.

Fig. 5 Collected rewards in the 23rd episode, arena A1.

(a) Play performance

(b) Reward performance

Fig. 6 Agent's learning performance.

VI. Conclusion and Future Work

The Fuzzy Sarsa(λ) Learning algorithm presented in this paper improves the learning performance by about 40% over the conventional Sarsa(λ) algorithm on the route learning task, which has been tested in different arenas, both with and without training courses. Furthermore, the FSλL approach can also be applied to other robot tasks such as visual tracking, emotional behaviours and motion control. It has been found that an FSλL controller can drive a robot agent to achieve goals by accelerating the learning rate and by producing optimal control policies, since it learns control policies more quickly than moderately experienced programmers can hand-code them.

However, there are still many open questions, e.g. how an FSλL system would behave with a more effective policy method than ε-greedy. Instead of the ε-greedy policy for action selection, the core FSλL algorithm could adopt more effective methods such as softmax, Genetic Algorithms (GA) or Artificial Neural Networks (ANN), in order to improve the quality of the actions selected in particular state cases. Further improvements can be made, especially to the low-level discretized controller, in order to maintain more stable learning behaviours.