# Bayes Theorem For Machine Acceptor Computer Science Essay


Bayes' theorem deals with the role of new information in revising probability estimates. The theorem assumes that the probability of a hypothesis (the posterior probability) is a function of new evidence (the likelihood) and previous knowledge (the prior probability). The theorem is named after Thomas Bayes (1702-1761), a nonconformist minister who had an interest in mathematics. The basis of the theorem is contained in an essay published in the Philosophical Transactions of the Royal Society of London in 1763.

Bayes' theorem is a logical consequence of the product rule of probability: the probability (P) of two events A and B both happening, P(A,B), is equal to the conditional probability of one event occurring given that the other has already occurred, P(A|B), multiplied by the probability of the other event happening, P(B). The derivation of the theorem is as follows: P(A,B) = P(A|B) × P(B) = P(B|A) × P(A)

Thus: P(A|B) = P(B|A) × P(A) / P(B).

Bayes' theorem has been frequently used in the areas of diagnostic testing and in the determination of genetic predisposition. For example, suppose one wants to know the probability that a person with a particular genetic profile (B) will develop a particular tumour type (A), that is, P(A|B). Previous knowledge leads to the assumption that the probability that any individual will develop the specific tumour, P(A), is 0.1 and the probability that an individual has the particular genetic profile, P(B), is 0.2. New evidence establishes that the probability P(B|A) that an individual with the tumour has the genetic profile of interest is 0.5.

Thus: P(A|B) = 0.5 × 0.1 / 0.2 = 0.25
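Since this formula recurs throughout the essay, here is a minimal sketch of the calculation in Python (the function name is illustrative, not from any library):

```python
def posterior(prior_a, likelihood_b_given_a, prob_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood_b_given_a * prior_a / prob_b

# Tumour example from the text: P(A) = 0.1, P(B) = 0.2, P(B|A) = 0.5
print(posterior(0.1, 0.5, 0.2))  # 0.25
```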

The adoption of Bayes' theorem has led to the development of Bayesian methods for data analysis. The Bayesian approach to data analysis allows consideration of all possible sources of evidence in the determination of the posterior probability of an event. It is argued that this approach has more relevance to decision making than classical statistical inference, as it focuses on the transformation from initial knowledge to final opinion rather than on providing the "correct" inference. In addition to its practical use in probability analysis, Bayes' theorem can be used as a normative model to assess how well people use empirical information to update the probability that a hypothesis is true.

## SIMPLE STATEMENT OF THEOREM:

Bayes gave a special case involving continuous prior and posterior probability distributions and discrete probability distributions of data, but in its simplest setting involving only discrete distributions, Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a non-vanishing probability:

P(A|B) = P(B|A) P(A) / P(B)

Each term in Bayes' theorem has a conventional name:

P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into account any information about B.

P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from or depends upon the specified value of B.

P(B|A) is the conditional probability of B given A. It is also called the likelihood.

P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Bayes' theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.

## SIMPLE EXAMPLE:

Suppose there is a school with 60% boys and 40% girls as students. The female students wear trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a (random) student from a distance; all the observer can see is that this student is wearing trousers. What is the probability this student is a girl? The correct answer can be computed using Bayes' theorem.

The event A is that the student observed is a girl, and the event B is that the student observed is wearing trousers. To compute P(A|B), we first need to know:

P(A), or the probability that the student is a girl regardless of any other information. Since the observer sees a random student, meaning that all students have the same probability of being observed, and the fraction of girls among the students is 40%, this probability equals 0.4.

P(B|A), or the probability of the student wearing trousers given that the student is a girl. As girls are as likely to wear skirts as trousers, this is 0.5.

P(B), or the probability of a (randomly selected) student wearing trousers regardless of any other information. Since half of the girls and all of the boys are wearing trousers, this is 0.5 × 0.4 + 1 × 0.6 = 0.8.

Given all this information, the probability of the observer having spotted a girl given that the observed student is wearing trousers can be computed by substituting these values in the formula:

P(A|B) = P(B|A) P(A) / P(B) = 0.5 × 0.4 / 0.8 = 0.25

## DERIVATION:

To derive the theorem, we start from the definition of conditional probability. The probability of event A given event B is

P(A|B) = P(A,B) / P(B)

Equivalently, the probability of event B given event A is

P(B|A) = P(A,B) / P(A)

Rearranging and combining these two equations, we find

P(A|B) × P(B) = P(A,B) = P(B|A) × P(A)

This lemma is sometimes called the product rule for probabilities. Discarding the middle term and dividing both sides by P(B), provided that it is non-zero, we obtain Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)

The lemma is symmetric in A and B and dividing by P(A), provided that it is non-zero, gives a statement of Bayes' theorem where the two symbols have changed places.

## What is Machine Acceptor?

Acceptors and recognizers (also sequence detectors) produce a binary output, saying either yes or no to answer whether the input is accepted by the machine or not. All states of the FSM are said to be either accepting or not accepting. When all input has been processed, if the current state is an accepting state, the input is accepted; otherwise it is rejected. As a rule the inputs are symbols (characters); actions are not used. The example in figure 2 shows a finite state machine which accepts the word "nice". In this FSM the only accepting state is number 7.

The machine can also be described as defining a language, which would contain every word accepted by the machine but none of the rejected ones; we say then that the language is accepted by the machine. By definition, the languages accepted by FSMs are the regular languages; that is, a language is regular if there is some FSM that accepts it.

Acceptor FSM: parsing the word "nice"

## Start state:

The start state is usually drawn with an arrow "pointing at it from anywhere".

## ACCEPT STATE:

Finite state machine that determines if a binary number has an odd or even number of 0s where S1 is an accepting state.

An accept state (sometimes referred to as an accepting state) is a state at which the machine has successfully performed its procedure. It is usually represented by a double circle.

An example of an accepting state appears on the right in this diagram of a deterministic finite automaton (DFA) which determines if the binary input contains an even number of 0s.

S1 (which is also the start state) indicates the state at which an even number of 0s has been input and is therefore defined as an accepting state. This machine will give a correct end state if the binary number contains an even number of zeros including a string with no zeros. Examples of strings accepted by this DFA are epsilon (the empty string), 1, 11, 11..., 00, 010, 1010, 10110 and so on.
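The DFA described above is straightforward to simulate in software; the sketch below uses the state names from the diagram:

```python
def accepts_even_zeros(s: str) -> bool:
    """Two-state DFA: S1 = even number of 0s seen (start and accept state),
    S2 = odd number of 0s seen.  Reading a 0 toggles the state; a 1 does not."""
    state = "S1"
    for ch in s:
        if ch == "0":
            state = "S2" if state == "S1" else "S1"
    return state == "S1"

# Strings the text lists as accepted: the empty string, 1, 11, 00, 010, 1010, 10110
for w in ["", "1", "11", "00", "010", "1010", "10110"]:
    print(repr(w), accepts_even_zeros(w))  # all True
```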

## INTRODUCTION TO STATISTICAL TRANSLATION:

How do computers translate texts from one language to another? Human translators use a great deal of detailed knowledge about how the world works to correctly translate all the different meanings the same word or phrase can have in different contexts. This makes automated translation seem like it might be a hard problem to solve, the kind of problem that may require genuine artificial intelligence. Yet although translators like Google Translate and Yahoo!'s Babel Fish are far from perfect, they do a surprisingly good job. How is that possible?

I describe the basic ideas behind the most successful approach to automated machine translation, an approach known as statistical machine translation. Statistical machine translation starts with a very large data set of good translations, that is, a corpus of texts (e.g., United Nations documents) which have already been translated into multiple languages, and then uses those texts to automatically infer a statistical model of translation. That statistical model is then applied to new texts to make a guess as to a reasonable translation.

## Formulating the problem:

Imagine you're given a French text f, and you'd like to find a good English translation e. There are many possible translations of f into English, of course, and different translators will have different opinions about what the best translation e is. We can model these differences of opinion with a probability distribution pr(e|f) over possible translations e, given that the French text was f. A reasonable way of choosing the "best" translation is to choose the e which maximizes the conditional probability pr(e|f).

The problem with this strategy is that we don't know the conditional probability pr(e|f). To solve this problem, suppose we're in possession of an initial corpus of documents that are in both French and English, e.g., United Nations documents, or the Canadian parliamentary proceedings. We'll use that corpus to infer a model estimating the conditional probabilities pr(e|f). The model we'll construct is far from perfect, but with a large and high-quality initial corpus it yields pretty good translations. To simplify the discussion we assume e and f are single sentences, and we'll ignore punctuation; obviously, the translator can be applied serially to a text containing many sentences.

Now, how do we start from the corpus and infer a model for pr(e|f)? The standard approach is to use Bayes' theorem to first rewrite pr(e|f) as

pr(e|f) = pr(f|e) pr(e) / pr(f)

Because f is fixed, the maximization over e is thus equivalent to maximizing

pr(f|e) pr(e)

What we're going to do is to use our data set to infer models of pr(e) and pr(f|e), and then use those models to search for the e maximizing pr(f|e) pr(e).

This breaks machine translation up into three problems:

Build a language model which allows us to estimate pr(e)

Build a translation model which allows us to estimate pr(f|e)

Search for the e maximizing the product pr(f|e) pr(e).
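This noisy-channel decision rule can be sketched with toy probability tables; all the numbers below are invented for illustration, and a real system would estimate them from a corpus:

```python
# Toy noisy-channel decoder.  The probability tables are made up for
# illustration only; they are not estimated from any real corpus.
language_model = {"John loves Mary": 0.6, "Mary loves John": 0.3}   # p(e)
translation_model = {                                               # p(f|e)
    ("Jean aime Marie", "John loves Mary"): 0.5,
    ("Jean aime Marie", "Mary loves John"): 0.2,
}

def decode(f, candidates):
    # Choose the e maximizing p(f|e) * p(e), equivalent to maximizing p(e|f).
    return max(candidates,
               key=lambda e: translation_model.get((f, e), 0.0) * language_model[e])

print(decode("Jean aime Marie", list(language_model)))  # John loves Mary
```

A real decoder searches over an astronomically larger candidate set, which is why the third problem above (search) is hard in its own right.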

Each of these problems is itself a rich problem which can be solved in many different ways.

## The Translation Model:

We now need a simple translation model allowing us to compute pr(f|e). Intuitively, when we translate a sentence, words in the source text generate (possibly in a context-dependent way) words in the target language. In the sentence pair (Jean aime Marie | John loves Mary) we intuitively feel that John corresponds to Jean, loves to aime, and Mary to Marie. Of course, there is no need for the word correspondence to be one-to-one, nor for the ordering of words to be preserved. Sometimes, a word in English may generate two or more words in French; sometimes it may generate no word at all.

Despite these complications, the notion of a correspondence between words in the source language and in the target language is so useful that we'll formalize it through what is called an alignment. Rather than give a precise definition, let me explain alignments through an example, the sentence pair (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)). In this example alignment, the numbers in parentheses tell us that John corresponds to the 6th word in the French sentence, i.e., Jean. The word does has no trailing parentheses, and so doesn't correspond to any of the words in the French sentence. However, beat corresponds to not one, but two separate words in the French sentence, the 3rd and 4th words, est and battu. And so on.

Two notions derived from alignments are particularly useful in building up our translation model. The first is fertility, defined as the number of French words generated by a given English word. So, in the example above, does has fertility 0, since it doesn't generate anything in the French sentence. On the other hand, beat has fertility 2, since it generates two separate words in the French sentence.

The second notion is distortion. In many sentences, the English word and its corresponding French word or words appear in the same part of the sentence - near the beginning, perhaps, or near the end. We say that such words are translated roughly undistorted, while words which move a great deal have high distortion. We'll encode this notion more formally shortly.

We'll build up our translation model using some simple parameters related to fertility and distortion:

The fertility probability pr(n|e), the probability that the English word e has fertility n.

The distortion probability pr(t|s,l), which is the probability that an English word at position s corresponds to a French word at position t in a French sentence of length l.

The translation probability pr(f|e), one for each French word f and English word e. This should not be confused with the case when f and e are sentences!

We defined the translation model as a way of computing pr(f|e), where e is an English sentence and f is a French sentence. In fact, we'll modify that definition a little, defining the translation model as the probability pr(f,a|e) that the French sentence f is the correct translation of e with a particular alignment, which we'll denote by a. I'll return to the question of how this change in definition affects translation in the next section. Rather than specify the probability for our translation model formally, let me show how it works for the example alignment (Le chien est battu par Jean | John (6) does beat (3,4) the (1) dog (2)):

pr(1|John) × pr(Jean|John) × pr(6|1,6) ×

pr(0|does) ×

pr(2|beat) × pr(est|beat) × pr(3|3,6) × pr(battu|beat) × pr(4|3,6) ×

pr(1|the) × pr(Le|the) × pr(1|5,6) ×

pr(1|dog) × pr(chien|dog) × pr(2|6,6) ×

pr(1|&lt;null&gt;) × pr(par|&lt;null&gt;)

This should be self-explanatory except for the final line, for the French word par. This word has no corresponding word in the English sentence, which we model using a special word &lt;null&gt;.
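Once the fertility, translation, and distortion tables are given, this product can be evaluated mechanically. The sketch below uses invented parameter values (a real model would estimate them from a corpus); only the structure of the product matters:

```python
from math import prod

# One entry per term of the example product.  The numeric values are
# hypothetical illustrations, not estimates from any corpus.
factors = {
    "pr(1|John)": 0.8, "pr(Jean|John)": 0.9, "pr(6|1,6)": 0.2,
    "pr(0|does)": 0.6,
    "pr(2|beat)": 0.3, "pr(est|beat)": 0.4, "pr(3|3,6)": 0.5,
    "pr(battu|beat)": 0.6, "pr(4|3,6)": 0.4,
    "pr(1|the)": 0.9, "pr(Le|the)": 0.7, "pr(1|5,6)": 0.3,
    "pr(1|dog)": 0.9, "pr(chien|dog)": 0.8, "pr(2|6,6)": 0.3,
    "pr(1|<null>)": 0.5, "pr(par|<null>)": 0.2,
}

# pr(f, a | e) is simply the product of all the factors.
alignment_probability = prod(factors.values())
print(f"pr(f, a | e) = {alignment_probability:.2e}")
```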

What remains to be done is to estimate the parameters used in constructing the translation model: the fertility, distortion and translation probabilities. Estimation starts with a simple guess of the parameters. E.g., we might guess that the distortion probabilities pr(t|s,l) = 1/l are uniform across the sentence. Similar guesses could be made for the other parameters. For each pair (e,f) of sentences in our corpus we can use this guess to compute the probability of all possible alignments between the sentences. We then estimate the "true" alignment to be whichever alignment has highest probability. Applying this procedure to the entire corpus gives us many estimated alignments. We can then use those estimated alignments to compute new estimates for all the parameters in our model. E.g., if we find that 1/10th of the alignments of sentences of length 25 have the first word mapped to the first word, then our new estimate is pr(1|1,25) = 1/10. This gives us a procedure for iteratively updating the parameters of our model, which can be repeated many times. Empirically we find (and it can be proved) that the parameters gradually converge to a fixed point.

## Bayesian Machine Acceptor

Bayesian statistics provides a framework for building intelligent learning systems. Bayes Rule states that

P(M|D) = P(D|M)P(M)/P(D)

We can read this in the following way: "the probability of the model given the data (P(M|D)) is the probability of the data given the model (P(D|M)) times the prior probability of the model (P(M)) divided by the probability of the data (P(D))".

Bayesian statistics, more precisely, the Cox theorems, tells us that we should use Bayes rule to represent and manipulate our degree of belief in some model or hypothesis. In other words, we should treat degrees of beliefs in exactly the same way as we treat probabilities. Thus, the prior P(M) above represents numerically how much we believe model M to be the true model of the data before we actually observe the data, and the posterior P(M|D) represents how much we believe model M after observing the data. We can think of machine learning as learning models of data. The Bayesian framework for machine learning states that you start out by enumerating all reasonable models of the data and assigning your prior belief P(M) to each of these models. Then, upon observing the data D, you evaluate how probable the data was under each of these models to compute P(D|M). Multiplying this likelihood by the prior and renormalizing results in the posterior probability over models P(M|D) which encapsulates everything that you have learned from the data regarding the possible models under consideration. Thus, to compare two models M and M', we need to compute their relative probability given the data: P(M)P(D|M) / P(M')P(D|M').

Incidentally, if our beliefs are not coherent, in other words, if they violate the rules of probability which include Bayes rule, then the Dutch Book theorem says that if we are willing to accept bets with odds based on the strength of our beliefs, there always exists a set of bets (called a "Dutch book") which we will accept but which is guaranteed to lose us money no matter what the outcome. The only way to avoid being swindled by a Dutch book is to be Bayesian. This has important implications for Machine Learning. If our goal is to design an ideally rational agent, then this agent must represent and manipulate its beliefs using the rules of probability.

In practice, for real world problem domains, applying Bayes rule exactly is usually impractical because it involves summing or integrating over too large a space of models. These computationally intractable sums or integrals can be avoided by using approximate Bayesian methods. There is a very large body of current research on ways of doing approximate Bayesian machine learning. Some examples of approximate Bayesian methods include Laplace's approximation, variational approximations, expectation propagation, and Markov chain Monte Carlo methods.

Bayesian decision theory deals with the problem of making optimal decisions -- that is, decisions or actions that minimize our expected loss. Let's say we have a choice of taking one of k possible actions A1 ... Ak and we are considering m possible hypotheses for what the true model of the data is: M1 ... Mm. Assume that if the true model of the data is Mi and we take action Aj we incur a loss of Lij dollars. Then the optimal action A* given the data is the one that minimizes the expected loss: in other words, A* is the action Aj which has the smallest value of Σi Lij P(Mi|D).
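The expected-loss minimization is a few lines of code; the loss matrix and posterior below are invented for illustration:

```python
def best_action(loss, posterior):
    """Return the index j of the action A_j minimizing the expected loss
    sum_i L[i][j] * P(M_i | D).  loss[i][j] is the loss incurred when
    model M_i is true and action A_j is taken."""
    m, k = len(loss), len(loss[0])
    expected = [sum(loss[i][j] * posterior[i] for i in range(m)) for j in range(k)]
    return min(range(k), key=lambda j: expected[j])

# Two models, two actions; losses and posterior are illustrative.
loss = [[0, 10],   # if M1 is true: A1 costs 0, A2 costs 10
        [5, 1]]    # if M2 is true: A1 costs 5, A2 costs 1
print(best_action(loss, [0.7, 0.3]))  # expected losses 1.5 vs 7.3 -> action 0
```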

## Bayes Rule Applied to Machine Learning:

P(D|M): likelihood of M

P(M): prior probability of M

P(M|D): posterior of M given D

## Prediction:

P(x|D) = Σ_M P(x|M, D) P(M|D) (averaging over many models)

Suppose we want to translate French sentences to English. Here, the hidden configurations are English sentences and the observed signal they generate are French sentences. Bayes' theorem gives p(e|f)p(f) = p(e, f) = p(f|e)p(e) and reduces to the fundamental equation of machine translation: maximize p(f|e)p(e) over the appropriate e (note that p(f) is independent of e, and so drops out when we maximize p(e|f) over e). This reduces the problem to three main calculations:

p(e) for any given e, using the N-gram method and dynamic programming

p(f|e) for any given e and f, using alignments and an expectation-maximization (EM) algorithm

e that maximizes the product of 1 and 2, again, using dynamic programming

The analysis seems to be symmetric with respect to the two languages, and if we think we can calculate p(f|e), why not turn the analysis around and calculate p(e|f) directly? The reason is that during the calculation of p(f|e) the asymmetric assumption is made that the source sentence is well formed, and we cannot make any such assumption about the target translation because we do not know what it will translate into.

We now focus on p(f|e) in the three-part decomposition above. The other two parts, p(e) and maximizing over e, use techniques similar to the N-gram model. Given a French-English translation from a large training data set (such data sets exist from the Canadian parliament),

NULL And the program has been implemented

Le programme a été mis en application

the sentence pair can be encoded as an alignment (2, 3, 4, 5, 6, 6, 6) that reads as follows: the first word in French comes from the second English word, the second word in French comes from the 3rd English word, and so forth. Although an alignment is a straightforward encoding of the translation, a more computationally convenient approach to an alignment is to break it down into four parameters:

Fertility: the number of words in the French string that will be connected to a given English word. E.g. n(3 | implemented) = probability that "implemented" translates into three words, the word's fertility.

Spuriousness: we introduce the artifact NULL as a word to represent the probability of tossing in a spurious French word. E.g. p1, and its complement will be p0 = 1 − p1.

Translation: the translated version of each word. E.g. t(a | has ) = translation probability that the English "has" translates into the French "a".

Distortion: the actual positions in the French string that these words will occupy. E.g. d(5 | 2, 4, 6) = the distortion probability that the English word in the second position maps to the French word in the fifth position, for a four-word English sentence and a six-word French sentence. We encode the alignments this way to easily represent and extract priors from our training data, and the new formula becomes

p(f|e) = Σ over all possible alignments a of p(a, f|e)

= n(v_0 | Σ_{j=1..l} v_j) · Π_{j=1..l} n(v_j | e_j) v_j! · Π_{j=1..m} t(f_j | e_{a_j}) · Π_{j: a_j ≠ 0} d(j | a_j, l, m)

For the sake of simplicity in demonstrating an EM algorithm, we shall go through a simple calculation involving only translation probabilities t(), but needless to say the method applies to all parameters in their full glory. Consider the simplified case (1) without the NULL word, (2) where every word has fertility 1, and (3) with no distortion probabilities. Our training data corpus will contain two sentence pairs: bc → xy and b → y. The translation of a two-word English sentence "b c" into the French sentence "x y" has two possible alignments, and including the one-word sentences, the alignments are:

    b c    b c    b
    | |     X     |
    x y    x y    y

called Parallel, Crossed, and Singleton respectively.

To illustrate an EM algorithm, first set the desired parameter uniformly, that is

t(x | b) = t(y | b) = t(x | c) = t(y | c) = 1/2

Then EM iterates as follows (figure: iterations of an EM algorithm).

The alignment probability for the "crossing alignment" (where b connects to y) got a boost from the second sentence pair b/y. That further solidified t(y | b), but as a side effect also boosted t(x | c), because x connects to c in that same "crossing alignment." The effect of boosting t(x | c) necessarily means downgrading t(y | c) because they sum to one. So, even though y and c co-occur, analysis reveals that they are not translations of each other. With real data, EM also is subject to the usual local extremum traps.
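The behaviour described above can be reproduced with a minimal IBM-Model-1-style E-step/M-step over the toy corpus. This is a sketch of the simplified case only, not the full fertility/distortion model:

```python
from collections import defaultdict

# Toy corpus from the text: "b c" -> "x y" and "b" -> "y".
pairs = [(["b", "c"], ["x", "y"]), (["b"], ["y"])]

# Uniform initialization, as in the text: t(x|b) = t(y|b) = t(x|c) = t(y|c) = 1/2.
t = {("x", "b"): 0.5, ("y", "b"): 0.5, ("x", "c"): 0.5, ("y", "c"): 0.5}

for _ in range(50):
    count = defaultdict(float)   # fractional counts of (french, english) links
    total = defaultdict(float)   # normalizer per English word
    for e_words, f_words in pairs:
        for f in f_words:        # E-step: distribute each French word's count
            norm = sum(t[(f, e)] for e in e_words)
            for e in e_words:
                count[(f, e)] += t[(f, e)] / norm
                total[e] += t[(f, e)] / norm
    for f, e in t:               # M-step: re-estimate translation probabilities
        t[(f, e)] = count[(f, e)] / total[e]

# t(y|b) and t(x|c) are driven toward 1, as the analysis above predicts,
# while t(y|c) is driven toward 0 even though y and c co-occur.
print(round(t[("y", "b")], 3), round(t[("x", "c")], 3))
```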

## Problems in Natural language:

- Data sparseness

- Spelling variants/errors ('airplane', 'aeroplane' or 'foetus', 'fetus')

- Ambiguity ('saw' - a tool or the past tense of the verb 'see')

- Pronoun resolution

## Techniques using machine learning:

- State machines

- Neural networks

- Genetic algorithms etc.

Nowadays, the dominant approach:

- Bayesian theorem (network)

## Reasons for using Bayesian Networks:

- Extension of probabilistic models

- Explicitly represent the conditional dependencies

- Provide an intuitive graphical visualization of the knowledge

- Representation of conditional independence assumptions

- Representation of the joint probability distribution of the model

- Fewer probabilities than the full probabilistic model

- Reduced computational complexity of the inferences

## Basic example of Rain and Traffic Jam:

S = Snow, CL = Clouds, R = Rain, F = Flood, A = Car Accident in a street, T = Traffic Jam, D = Delay, C = Causality

## Term similarity between Traffic Jam (T) and Rain (R):

term-sum(T,R) = P(T|R) + P(R|T)

= P(T|R) + P(T|R)P(R)/P(T)

= P(T|A)P(A|R)(1 + P(R)/P(T))

## Bayesian Networks and Natural Language Understanding

- Part-of-Speech (POS) Tagging

- Word Sense Disambiguation

- Machine Translation

- Information Retrieval

## Part-of-Speech Tagging:

- The process of marking up words based on their definitions as well as their context.

- Ex: The sailor dogs the hatch.

## Feature set:

Capitalization

Hyphenation

Numeric

Prefix

Suffix

The probability of a complete sequence of POS tags T1…Tn is modeled as a product of conditional probabilities of each tag given its context and features.

## Community

â€¢ People living in a particular area

â€¢ An association of people with similar interests

â€¢ Common ownership

â€¢ The body of people in a learned occupation

## Town

â€¢ An urban area with a fixed boundary that is smaller than a city

â€¢ The people living in a municipality smaller than a city

â€¢ An administrative division of a county

â€¢ The one node per sense approach

â€¢ The one node per word approach

## Machine Translation:

- The task of translating text from one natural language to another.

- Static Bayesian networks, dynamic Bayesian networks

- Filali has introduced a new generalization of DBN, known as multi-dynamic Bayesian networks (MDBN)

- MDBN has multiple streams of variables that can get unrolled, but where each stream may be unrolled for a differing amount.

- MDBN is a variant of DBN.

- DBN consists of a directed acyclic graph G = (V, E) = (V1 ∪ V2, E1 ∪ E2 ∪ E2→)

- Multi-Dynamic Bayesian Network (MDBN)

## Dynamic Branch Prediction using Bayes' Theorem:

During Computer Organization, we were discussing how a CPU, given a condition, decides which portions of code to execute. Having that lecture in the back of my mind, I soon began wondering if one could apply Bayes' Theorem to the problem of Dynamic Branch Prediction. As many people know, Bayes' Theorem is a useful rule for calculating conditional probabilities. In the Networks lectures, we used it to figure out the probability of having a disease given the result of a test or what color marbles are likely to be in an urn given a few samples. Bayes' theorem really begins to shine, however, when one looks beyond simple textbook cases. Before we can talk about its application, allow me to provide some context.

At the heart of every computer is a Central Processing Unit, or CPU for short. When a computer is turned on, the CPU begins to iterate through programs stored in its memory, performing computations as dictated by the lines of code. This is not the end of the story. Even the simplest programming tasks require the power afforded by programming language constructs such as conditional statements, loops, and pointers. An if-statement can cause the CPU to skip lines based on some evaluated expression and a loop causes the CPU to return to a certain line until a specific condition is met. In general, these actions require the CPU to skip around within the program. This behavior is called branching.

In some computer architectures, such asÂ MIPS, portions of the CPU hardware are split into different stages. Together these stages are referred to as theÂ pipeline. Each stage prepares part of an instruction (roughly corresponding to a line of code) and passes the result to the next stage, much like in an assembly line. This is possible because not all of the hardware is used at the same time; so portions of the CPU are partitioned to allow for the computation of multiple instructions at once. Needless to say, with more sophisticated hardware more complicated problems follow. When a CPU encounters a branching instruction, it must first compute the result to decide whether or not to branch. Unfortunately because of pipelining, the branch decision will not progress far enough down the assembly line for the decision to be calculated before the next instruction is loaded. One solution to this conundrum is to simply guess what the branch outcome will be.

Dynamic Branch Prediction is an active topic in the field of computer architecture. Modern computer programs can be thousands of lines long and may require the computer to branch very often during its runtime. Because of this, we want our branch prediction to be accurate and quick. The conventional branch prediction scheme uses a hash table to store a history of past branch outcomes to be consulted when a similar branch decision must be made. If the table leads to an incorrect prediction, the entry must be updated. Additional hardware keeps track of how many times the prediction is wrong. In a basic Two-Level Adaptive Branch Prediction, it would take 2 wrong predictions to change an entry in the table. The number of levels corresponds to the number of past branch outcomes the predictor can "remember". This will allow the predictor to follow trends while ignoring small fluctuations in branch outcomes. The accuracy of the predictor increases with the number of levels while yielding diminishing returns after a certain point. The trade-off here is flexibility. It takes exponentially more hardware to implement an increase in the number of past outcomes a predictor can remember.

The 2-Level Adaptive Branch Predictor State Machine

How does Bayes' Theorem fit into this? We can think of this in terms of the information cascade experiment talked about in lecture. Instead of majority-red or majority-blue, we want to calculate the probability of branching or not-branching. Our previous signals, red and blue, now become taken and not-taken. We want to take the branch if the probability of a branch outcome is higher than the probability of a non-branch outcome. Our main problem is to calculate the following:

Pr[branch | X1, X2, X3, …] = Pr[branch] * Pr[X1, X2, X3, … | branch] / Pr[X1, X2, X3, …]

where X1, X2, X3, … are collected past outcomes. X1 is the oldest branch outcome in the history and Xn is the most recent. Note that we are assuming that all past branch outcomes are independent of each other, a perfectly reasonable assumption to make for a typical program. Each of these variables can either be 1 for taken or 0 for not-taken. We can approximate the probability of a branch by leaving out the denominator on the right-hand side, since it is a scaling factor independent of the branch variable and is not needed to compare the relative likelihoods of each outcome. This greatly simplifies the calculation, since multiplication and division are very expensive in terms of hardware. The formula then becomes:

Pr[branch | X1, X2, X3, …] = Pr[branch] * Pr[X1, X2, X3, … | branch] = Pr[branch] * Pr[X1 | branch] * Pr[X2 | branch] * … * Pr[Xn | branch]

Another challenge that we must face is that of actually implementing Bayes' theorem in hardware. How do we represent Pr[Xi | branch] using hardware, in binary? First of all, we know that both Xi (a past outcome) and branch (the current outcome) are boolean in the respect that they can only have the values 0 or 1. This simplifies to 4 cases (2 bits in binary):

| Xi, branch (2 bits) | Xi | branch | Pr[Xi \| branch] |
| --- | --- | --- | --- |
| 00 | 0 | 0 | 0 |
| 01 | 0 | 1 | 1/3 |
| 10 | 1 | 0 | 2/3 |
| 11 | 1 | 1 | 1 |

The higher order bit in the output is enough to indicate a probability higher than 1/2. We can take advantage of this fact and use the number of higher order 1's in the table to determine the probability. Since the calculation for Pr[no-branch | X1, X2, X3, …] is done in the same manner, we need only compare the number of higher order 1's in either expression to determine which outcome is more likely. This is also relatively simple to do using logic gates and MSI components as opposed to the hardware needed for addition or multiplication.

So how does this compare to the conventional Adaptive Branch Predictor? The conventional predictor has an asymptotic space complexity of O(2^n), while our Bayes' Theorem Branch Predictor is O(n). This means that the conventional predictor requires about 2^n spaces of storage in the hash map for n remembered outcomes, while the Bayes' Theorem predictor requires only about n spaces for the same number of outcomes (see Singer, Brown, and Watson below). With less hardware to create, this implementation seems to be more convenient to use with longer history tables. It is possible to improve upon the basic design to increase the accuracy of the Bayes' Theorem branch predictor by varying the caching methods as well as other parameters. Amazingly, the application of Bayes' theorem allows us to replace a "brute force" state machine with a bit of calculation that results in an overall simpler and more flexible branch predictor.
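A software sketch of such a predictor is given below. The class and parameter names are invented, and add-one smoothing stands in for the initialized hardware counters so that an unseen combination never zeroes the product; a hardware version would of course use tables of counters rather than Python dictionaries:

```python
from collections import Counter

class NaiveBayesBranchPredictor:
    """Predicts taken (1) / not-taken (0) from the last few branch outcomes,
    assuming history bits are independent given the current outcome."""

    def __init__(self, history_len=4):
        self.history_len = history_len
        self.history = []                # X1..Xn, most recent last
        self.outcome_counts = Counter()  # how often each outcome occurred
        self.joint_counts = Counter()    # (history slot, Xi, outcome) counts

    def predict(self):
        n = sum(self.outcome_counts.values())
        best, best_score = 1, -1.0
        for branch in (0, 1):
            # Pr[branch] * prod_i Pr[Xi | branch], with add-one smoothing.
            score = (self.outcome_counts[branch] + 1) / (n + 2)
            for slot, x in enumerate(self.history):
                score *= (self.joint_counts[(slot, x, branch)] + 1) / \
                         (self.outcome_counts[branch] + 2)
            if score > best_score:
                best, best_score = branch, score
        return best

    def update(self, outcome):
        for slot, x in enumerate(self.history):
            self.joint_counts[(slot, x, outcome)] += 1
        self.outcome_counts[outcome] += 1
        self.history = (self.history + [outcome])[-self.history_len:]

# A strongly biased branch is learned quickly.
predictor = NaiveBayesBranchPredictor()
for _ in range(100):
    predictor.update(1)
print(predictor.predict())  # 1
```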

## Bayes decision rule:

In statistical machine translation, we are given a source language sentence f_1^J = f_1 … f_j … f_J, which is to be translated into a target language sentence e_1^I = e_1 … e_i … e_I. Among all possible target language sentences, we will choose the sentence with the highest probability.

ê_1^I = argmax_{e_1^I} Pr(e_1^I | f_1^J)   (1)

= argmax_{e_1^I} Pr(e_1^I) · Pr(f_1^J | e_1^I)   (2)

The decomposition into two knowledge sources in Equation 2 is known as the source-channel approach to statistical machine translation [5]. It allows an independent modeling of the target language model Pr(e_1^I) and the translation model Pr(f_1^J | e_1^I). In our system, the translation model is trained on a bilingual corpus using GIZA++ [6], and the language model is trained with the SRILM toolkit [7].

## Weighted finite-state transducer-based translation

We use the weighted finite-state tool by [8]. A weighted finite-state transducer (Q, Σ ∪ {ε}, Γ ∪ {ε}, K, E, i, F, λ, ρ) is a structure with a set of states Q, an alphabet of input symbols Σ, an alphabet of output symbols Γ, a weight semiring K, a set of arcs E, a single initial state i with weight λ, and a set of final states F weighted by the function ρ : F → K. A weighted finite-state acceptor is a weighted finite-state transducer without the output alphabet. A composition algorithm is defined as: let T1 : Σ* × Γ* → K and T2 : Γ* × Δ* → K be two transducers defined over the same semiring K. Their composition T1 ∘ T2 realizes the function T : Σ* × Δ* → K. By using the structure of weighted finite-state transducers, the translation model is simply estimated as the language model on a bilanguage of source phrase/target phrase tuples; see [9].

## Phrase-based translation

The phrase-based translation model is described in [10]. A phrase is a contiguous sequence of words. Pairs of source and target phrases are extracted from the training corpus and used in the translation. The phrase translation probability Pr(e_1^I | f_1^J) is modeled directly using a weighted log-linear combination of a trigram language model and various translation models: a phrase translation model and a word-based lexicon model. These translation models are used for both directions: p(f|e) and p(e|f). Additionally, we use a word penalty and a phrase penalty. The model scaling factors are optimized with respect to some evaluation criterion [11].