Packet Routing Network


In this Project a self-learning algorithm for packet routing is presented. A reinforcement-learning algorithm is embedded into the normal shortest path algorithm. The algorithm learns using local communication. Results of a number of experiments on a simulated dynamic network show that this approach leads to a better packet delivery times when compared to traditional shortest path algorithms.

Chapter 1


In Communication Networks, information is transferred from one Router to another Router in small data chunks, which are called Packets. The process of sending a packet from source node to destination node is called packet routing. Each router in the network adapts to routing policies that can sustain high network loads and low average packet delivery time. These decision makers (Routers) learn based on the information they get back from their neighboring nodes as they send packets to them, which is called forward exploration. The simplest such policy is shortest-path algorithm, which always routes packets through the path with minimum number of hops. This policy is not always good because some intermediate nodes along a popular route might have a large queue to be processed. In such a case it would be better to send the packet through another route that may be longer in terms of number of hops but results in shorter delivery time. Hence as the traffic builds up at some popular routes, alternative routes must be chosen to keep the average packet delivery time low.

Lady using a tablet
Lady using a tablet


Essay Writers

Lady Using Tablet

Get your grade
or your money back

using our Essay Writing Service!

Essay Writing Service

This project explores the performance of a dynamic network which learns through exploration and exploitation based on a policy that balances minimizing the number of hops and time taken with possibility of congestion along popular routes.

Chapter 2

Literature Review and Theory

2.1 Routing in Communication Networks[7]

Routing is a technique by which data (packets) finds its way from one host computer to another. In the Internet context there are three major aspects of routing

  • Physical Address Determination
  • Selection of inter-network gateways
  • Symbolic and Numeric Addresses

Physical Address Determination is necessary when an IP datagram is to be transmitted from a computer. It is necessary to encapsulate the IP datagram within whatever frame format is in use on the local network or networks to which the computer is attached. This encapsulation clearly requires the inclusion of a local network address or physical address within the frame.

The Selection of inter-network gateways is necessary because the Internet consists of a number of local networks interconnected by one or more gateways. Such gateways, generally known as routers, sometimes have physical connections or ports onto many networks. The determination of the appropriate gateway and port for a particular IP datagram is called routing and also involves gateways interchanging information in standard ways.

The Symbolic and Numeric Addresses aspect which involves address translation from a reasonably human friendly form to numeric IP addresses is performed by a system known as the Domain Name Server or DNS for short.

2.2 Study of Existing Routing Algorithms

Routing protocols use some metrics to evaluate the best route for a packet for it travel to a specific destination. These metrics include factors like path bandwidth and router queues etc., determined by the routing algorithms. In this process, routing tables are maintained by routers, which are initialized and constantly updated. Routing tables are updated by analyzing routing updates from neighboring routers; a router can build a detailed picture of network topology.

Routing algorithms fill routing tables with a variety of information like destination, next hop etc. which enables it to determine optimal route for a particular destination. Sometimes routing tables also contain desirability of a path depending on the router queues etc. So Routing algorithms play an important role in routing in the Internet. Some of the very popular routing algorithms are:

Dijkstra's Algorithm:

  • Goal: Find the least cost paths from a given node to all other nodes in the network
  • Notation:

dij = Link cost from i to j if i and j are connected

Lady using a tablet
Lady using a tablet


Writing Services

Lady Using Tablet

Always on Time

Marked to Standard

Order Now

Dn = Total path cost from s to n, where s is the source node and n is the next node in the path

M = Set of nodes so far for which the least cost path is known

  • Method:

Initialize: M = {s}, Dn - dsn

Find node w Ï of M, Whose Dn is minimum

Update Dn

Bellman-Ford Algorithm:

  • Notation:

H = Number of hops being considered

D (h) n = Cost of h-hop path from s to n

  • Method:

Find all nodes 1 hop away

Find all nodes 2 hop away

Find all nodes 3 hop away

Initialize: D (h) n = µ for all n ¹ s; D (h) n = 0 for all h

Find jth node for which h+1 hops cost is minimum

D (h+1) n = min j [D (h) j + djn ]

ARPAnet Routing (1969 - 78)

  • Features: Cost = Queue Length,
  • Each node sends a vector of costs (to all nodes) to neighbors. Distance vector
  • Each node compares new cost vectors based on the new info using Bellman-Ford algorithm

ARPAnet Routing (1979 - 86)

  • Problems with earlier algorithm: Thrashing (packets went to areas of low queue length rather than the destination), Speed not considered.
  • Solution: Cost = Measured delay over 10 seconds
  • Each node floods a vector of cost of neighbors. Link-state. Converges faster after topology changes.
  • Each node computes new cost vectors based on the new info using Dijkstra's algorithm.

2.3 Overheads in Routing Algorithms

The above Routing Algorithms have problems such as

  • Correlation between delays reported and those experienced later (high in light loads, low during heavy loads)
  • Oscillations under heavy loads
  • Unused capacity at some links
  • Over-utilization of others
  • More variance in delay, more frequent updates causing more overhead

Chapter 3

3.1 Problem Statement

The Problem can be efficiently demonstrated with reinforcement learning approach using SARSA (l) method.

The state of a router (x) can be represented with two parameters - destination of the packet (d) and neighborhood router (y) to which the packet could be delivered. The packet routing policy is to decide the immediate router, out of all neighborhood routers, to which the packet should be delivered in order to reach the destination in minimum time.

3.2 Development Methodology

The state is represented with the router currently having the packet. The state consists of three parameters, present router identifier, destination router identifier and the neighboring router identifier to which the packet could be sent. The policy at each router answers the question: -“to which adjacent router the packet should be sent so as it reaches the destination as quickly as possible?” The action at the router is to send the packet according to the selected policy. Since the policy's performance is measured by the total time taken to deliver a packet, there is no “Gaining signal” for directly evaluating or improving the policy until a packet finally reaches its destination. However, using reinforcement learning, the policy can be updated more quickly by using only local information. The project is demonstrated with reinforcement learning approach using SARSA method.

Let Qx (d, y) be the time that a router (x) estimates to deliver a packet (P) bound for router (d) by way of x's neighbor router y, including any time that P would have to spend in router x's queue. Upon sending packet (P) to neighborhood router (y), x immediately gets back y's estimate for the time remaining in the trip, named as

t = min (z Î neighborhood of y) Qy (d, z) ---- Equation I

Lady using a tablet
Lady using a tablet

This Essay is

a Student's Work

Lady Using Tablet

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Examples of our work

If the packet spent q units of time in x's queue and s units of time in transmission between nodes x and y, then x can revise its estimate as follows:

Qx (d, y) = a (i + s + t - Qx (d, y))---- Equation II

Where (i+s+t) is the new estimate, Qx (d, y) is the old estimate and a is the learning factor. So the resulting algorithm can be characterized as a version of Bellman Ford shortest path algorithm. The resulting algorithm is called “SARSA-routing” and represents the Q function Qx (d, y) with a large table.

One advantage of the learning algorithm over static routing algorithm is the potential for adapting to changes in crucial system parameters during network operation. The two factors that describes the dynamic environment of the network are-

Topology: The topology of the network system changes constantly. It means the new links come and disappear arbitrarily. The learning algorithm reacts quickly to such changes and able to manage routing traffic efficiently.

Load in the Network: The traffic or load in the network changes dynamically and the learning algorithm adapts to the changes quickly and again when network traffic is lowered, routers adapt to the new situation quickly.

3.3 Implementation

The learning problem was divided into discrete tasks, which are called episodes and apply SARSA learning algorithm.

Each router is represented as a state. The source router is called Start State and the destination router is called as final state. The episode starts with Start State (Source Router) and ends at final state (Destination Router). At each intermediate step in the episode, the reward is given as negative when unsuccessful and is given as positive when the packet is delivered successfully to the destination.

Each router maintains a two dimensional routing table. The entry at ith row and jth column in the routing table represents the total reward to send a packet to the ith destination router through jth neighboring router. So, when a packet arrives at a router to reach the destination router (d), the policy at the router checks the total reward for all the elements in the dth row and picks the neighboring router (corresponding column element) that has the maximum total reward. After packet is sent to the neighboring node, the routing table of the present router is updated using the Equation-I (as described above). If the neighboring router is the destination, the immediate reward is positive else that is negative.

To implement the dynamic nature of the network system, a thread was implemented to run in parallel to the main thread (which is synchronized with the main thread), which keeps adding and removing the links in the network system randomly.

Algorithm Implemented[1]

Initialize routing table of every state in the network system arbitrarily.

Do for each episode,

Select a random start and final state.

Select an action according to the policy (Î-greedy).

Do for each intermediate step in the episode

Calculate t = min (z Î neighborhood of y) Qy(d, z).

Update Qx (d, y) = a (i + s + t - Qx (d, y)).

Repeat until the next state is the Destination State.



  • When the immediate reward on failure was small positive, the system didn't learn as quickly as it would when the immediate reward on failure was negative. In the latter case the system learns rapidly.
  • The routers make their policy decisions based on only local information. So, the estimates are used to adjust the Qx (d, y) values for each neighbor y. When any shorter paths to a destination appear, or if there are inefficiencies in the policy, this information propagates very quickly through the network and the policy adjusts accordingly.

3.4 Why Not Q-Routing?

A close look at the algorithm reveals that Routing cannot fine-tune a policy to discover shorter paths, since only the best neighbor's estimate is ever updated. For instance, if a node learns an overestimate of the delivery time for an optimal route, then it will select a sub optimal route as long as that route's delivery time is less than the erroneous estimate of the optimal route's delivery time. This drawback of greedy Q­learning is widely recognized in the reinforcement learning community, and several exploration techniques have been suggested to overcome it.

A common one is to have the algorithm select actions with some amount of randomness during the initial learning period. But this approach has two serious drawbacks in the context of distributed routing:

(1) The network is continuously changing, thus the initial period of exploration never ends; and more significantly,

(2) Random traffic has an extremely negative effect on congestion. Packets sent in a sub optimal direction tend to add to queue delays, slowing down all the packets passing through those queues, which adds further to queue delays.

3.5 Why Not Monte-Carlo?

According to the Monte-Carlo algorithm, the estimates and policies are changed only upon the completion of an episode. These methods are thus incremental in an episode-by-episode sense, but not in a step-by-step sense. To update the values after an episode, a trailing message should be sent from the destination to the starting state, which creates extra traffic in the system. Our goal is to reduce the traffic in the network system and delivery time of a packet. But with this method, extra traffic is added to the network, which increases the delivery time of the packet. So, Monte-Carlo methods are not preferable to solve this problem.

Chapter 4

Graphical User Interface and Graphs

The discussion about the graphs and graph theory is beyond the scope of this document. This chapter gives a brief definition and explanations of the terms, which are related to this project.

4.1 Graphs and Graph Terminology

A graph G consists of a nonempty set V called the set of nodes (points, vertices) of the graph, a set E which is the set of edges of the graph, and mapping from set of edges E to a set of pairs of elements of V[9].

Any nodes, which are connected by an edge in the graph, are called adjacent nodes.

In a graph G = (V, E), an edge which is directed from one node to another is called a directed edge. A graph in which all the edges are directed is called a directed graph. A graph in which every edge is undirected is called an undirected graph. In this project all the edges are undirected edges and the graph is undirected graph[9].

In a graph G = (V, E) let x Î E such that x joins nodes u, v. Node u is called the originating or initiating node and v is known as terminating or ending node.

An edge of a graph, which joins itself, is called a loop (sling)[9].

A graph in which weights are assigned to every node is called a weighted graph. This project deals with weighted graph only. Each edge is weighed which represents the cost (time) for a packet to be transported via that node. The weights of the edges are fixed but due to the dynamic nature of the graph in this project the edges are added and deleted on fly, which causes the weights to vary widely.

4.2 Graph Drawing and Representation

In this project a random graph is drawn, simulating a dynamic network in which the edges are added and deleted dynamically. User input is taken for the number of nodes in the network. This can be easily extended to large networks.

It is easy to store and represent graphs in Matrix form because it is easy to store and manipulate matrices. Certain well-known operations of matrix algebra can be used to obtain paths, cycles, and other characteristics of a graph.

Let G = (V, E) be a simple graph in which V = (v1, v2, ………, vn) and the nodes are assumed to be ordered in from v1 to vn. This graph can be represented as a n ´ n matrix A whose elements aij are given by

aij = 1 if (vi, vj) Î E

0 otherwise

is called the adjacency matrix of graph G.

In our case it's a weighted graph so the graph is not a adjacency (boolean) matrix. In our case it is slightly different. It is a n ´ n matrix where

aij = weight of the edge if (vi, vj) Î E

0 otherwise

4.3 GUI Development User Interface and Graphs

The entire GUI in this project is drawn using Java Applets. GUI contains client area where the dynamic simulated network is drawn after user enters the number of nodes of the network. It also contains a text area where the routing taking place and the nodes involved are shown. User also needs to input the originating node and the destination node. Snapshot of the GUI is as shown below.

Chapter 5

Results and Discussions

5.1 Behavior of Shortest Path Algorithm

Shortest path routing algorithms have been widely used in today's computer networks. In such algorithms, each node attempts to route packets to their destination over the path of minimum distance and updates the distance periodically to adapt to topological and traffic changes. There are two main classes of algorithms: Distance vector algorithm and link state algorithm. In the distance vector algorithm, each node maintains a routing table containing the distance of the shortest path to every destination in the network. A node only informs its immediate neighbors of any distance changes to any particular destinations. In link state algorithm, each node keeps track of entire network topology and computes the distance based on the link distance information broadcast by every node in the network.

Shortest path algorithms have served well in the networks where traffic is low and the network conditions change slowly. However as the network speed increases and the new applications proliferate, the network becomes much more dynamic and the traffic patterns are less predictable. The range of traffic rates the network has to deal is much wider and the distribution of the traffic can be very uneven.

In the networks where the traffic approaches the capacity of the paths and changes dynamically, the shortest path algorithms exhibit oscillatory behavior and cause performance degradation.

5.2 Behavior of Algorithm with SARSA

The algorithm was tested for wide variety of dynamic networks. The results showed that, this resultant network system performs far better than non-adaptive system. The algorithm presented explores and exploits the shortest path and the results are propagated all the way back as the routing tables at each node are updated based on the result of the exploration. If the result is negative (packet not delivered to its destination), then a negative reward propagates all the way back and a new route is explored. The level of exploration can be easily set depending on how dynamically the network changes.

5.3 Graphs and Snapshots

Graphs were drawn to relate average message time versus number of episodes divided by 10. The graphs are described as follows.

Explanation of Graph-I

The Graph-I describes the relation between average message time and episode/10. This graph was a result of 50 node network run over 10000 episodes. The results were very positive. Initially till the routing tables at nodes are empty so there is a larger and message times. After about 30 episodes the routing tables have information in them and most of the routes are explored which results in a more stable message delivery time. Because the simulated network was dynamic in nature we can see those peaks in the graphs after certain intervals. That is because of the few edges disappearing and few of them being added.

Explanation of Graph-II

Graph II shows the comparison between Non adaptive shortest path algorithm and adaptive algorithm. This graph shows the clear difference between adaptive and non-adaptive Networking system. Non-adaptive networking system could not avoid looping with out any prior knowledge about the network. So, the average message time it takes to send a packet to their destination is more compared to the adaptive networking system.

Explanation of Graph-III

Graph III is a comparison between 100, 1000 and 10000 episode runs on a 50N dynamic network. As we can see from the graph all the three runs take some time initially to stabilize and after that they seem pretty stable except for when a new node is added or an existing node is removed.

Explanation of Graph-IV

Graph IV shows the standard deviation of 10 runs of 100 episodes each on a 50 node network. The variance shown by this graph is between 30 and 70.

Explanation of Graph-V

Graph V shows the standard deviation of 10 runs of 1000 episodes each on a 50 node network. The variance shown by this graph is between 20 and 130. But there is only one reading of 130, for the most part the readings are between 20 and 45.

Explanation of Graph-VI

Graph VI shows the standard deviation of 10 runs of 10000 episodes each on a 50

node network. The variance shown by this graph is between 10 and 210. The variance in this graph seems to be more because it's recorded for 10000 episodes.

5.4 Simulation Results and Interpretation

The algorithm was tested for wide variety of dynamic networks. The results showed that, this resultant network system performs far better than non-adaptive system. The results are as follows.

Initially the time taken for a message to reach its destination is in multiples of 100 in a network system of size 50 routers. But, later it was drastically reduced to less than 10. The reason is described as follows.

Initially, the router doesn't have any prior knowledge about the network system. It arbitrarily picks one neighboring router out of all and sends the packet to that router. But, according to the reinforcement algorithm, if the packet is not delivered to the destination, the immediate reward is negative. So, if the packet unknowingly sent to the router that forms a loop, it will be avoided because of the negative immediate reward. From next time onwards, the router behaves intelligently and will not deliver packet to that previous unsuccessful neighbor. In this way, the very difficult problem faced by most of the network engineers, “counting to infinity problem”, is solved. After few episodes, the table contains almost exact estimates and network system learns completely. That's why, after few episodes, the average message time taken to deliver a packet to its destination is minimized. This was clearly shown in the Graph-III.

The environment is simulated to be dynamic. So, the nodes (routers) and links between the nodes appear and disappear automatically. In the Graph-I, when the links between the nodes disappear, the average message time increases and when the new link between the nodes appear, the average message time decreases. We can clearly observe these changes at the position of peaks and troughs. After few episodes, the network system is so adaptive that it is able to recover from any changes rapidly.

Chapter 6

Conclusion and Future Work

6.1 Conclusions

The SARSA­routing algorithm, without having to know in advance the network topology and traffic patterns, and without the need for any centralized routing control system, is able to discover efficient routing policies in a dynamically changing network. Algorithms based on SARSA­routing but specifically tailored to the packet routing domain will likely perform even better.

6.2 Recommendations for Further Study

One of the most interesting directions for future work is to replace the table-based representation of the routing policy with a function approximation. This could allow the algorithm to integrate more system variables into each routing decision and to generalize over network destinations. Potentially, much less routing information would need to be stored at each node, thereby extending the scale at which the algorithm is useful.


[1] Reinforcement Learning: An introduction by Richard S. Sutton and Andrew G. Barto, MIT Press, Cambridge, MA, 1998, A Bradford book

[2] An Adaptive Ant-Based Routing Algorithm Used Routing History in Dynamic Networks, Kazuyuki Fujita, Akira Saito, Toshihiro Matsui, Hiroshi Matsuo

[3] L. R. Ford, Jr. Flows in Networks. Princeton University Press, 1962

[4] M. Littman and J. Boyan. A distributed reinforcement learning scheme for network routing. Technical Report CMU­CS­93­165, School of Computer Science, Carnegie Mellon University, 1993

[5] H. Rudin. On routing and delta routing: A taxonomy and performance comparison of techniques for packet­switched networks. IEEE Transactions on Communications

[6] Packet Routing: A Reinforcement Learning Approach Justin A. Boyan and Michael L. Littman, Carnegie Mellon University

[7] A. Tanenbaum. Computer Networks. Prentice Hall, second edition

[8] C. Watkins, Learning from Delayed Rewards, PhD Thesis, King's College,

Cambridge, 1989

[9] Introduction to Data Structures and Algorithms, Trembly and Sorenson