Microsatellite Analysis Project Preview

Microsatellites are short tandem repeats (STRs) of sequences 1-6 bases in length, which are common to all eukaryotic genomes. These short sequences, commonly called motifs, can repeat dozens of times, and microsatellites hundreds of repeats long have been observed. They have been found to make up around 3% of the human genome [Lander et al, 2001].

Microsatellites exhibit high levels of polymorphism; in fact, rates of microsatellite polymorphism dwarf those found for SNPs, the most commonly analyzed variable loci, and microsatellites have been classed as the genomic feature most variable between individuals [Payseur et al, 2011]. The mutation rate which fuels such polymorphism is between 10^-3 and 10^-5 for microsatellites [Weber & Wong, 1993], but generally between 10^-8 and 10^-9 for SNPs [Nachman & Crowell, 2000]. This high mutation rate produces rapid change in microsatellite sequence over relatively short spans of evolutionary time, which makes evolutionary analysis of microsatellites particularly daunting.

Four types of microsatellite are generally observed: perfect microsatellites, in which one specific motif repeats without any mutations or additional sequences altering the pattern, and with unrelated flanking sequences; imperfect microsatellites, in which the repeating motif has suffered some degree of mutation such that the repeating base pattern has been altered; interrupted microsatellites, in which the repeating motif is split at one or several points by an intervening sequence unrelated to the motif; and compound microsatellites, in which one repeating motif is followed directly by another. All four common types of microsatellite are illustrated in figure 1, below [Oliveira et al, 2006; Varela & Amos, 2009; Li et al, 2002].

Figure 1: The four main types of microsatellite

[Created with reference to: Oliveira et al, 2006; Varela & Amos, 2009]
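
The definition of a perfect microsatellite can be made concrete with a short Python sketch. This is only an illustration and not the method used in any of the cited studies: the function name, the 12 bp minimum length and the test sequence are all assumptions chosen for the example. The scan is a simple left-to-right regular-expression search, so it may report a rotated form of a motif (for example ATT rather than TTA) and it will not detect imperfect, interrupted or compound repeats as single loci.

```python
import re


def find_perfect_microsatellites(sequence, min_motif=1, max_motif=6, min_length=12):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    min_length is an illustrative threshold in base pairs, not a value
    taken from the literature cited in the essay.
    """
    hits = []
    # The shortest possible motif (lazy match) followed by one or more
    # exact copies of itself (back-reference).
    pattern = re.compile(r"([ACGT]{%d,%d}?)\1+" % (min_motif, max_motif))
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        total_length = len(match.group(0))
        if total_length >= min_length:
            hits.append((match.start(), motif, total_length // len(motif)))
    return hits


# Hypothetical sequence: eight AC repeats with unrelated flanking bases.
example = "TTG" + "AC" * 8 + "GGT"
print(find_perfect_microsatellites(example))
```

For the hypothetical sequence above, the function reports one locus starting at position 3 with the motif AC repeated eight times; the two-base runs in the flanks fall below the length threshold and are ignored.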

As described by Kelkar et al (2011), microsatellites follow a pattern of expansion and decay over evolutionary time which has been referred to as the microsatellite life cycle. Through various mutation events, areas susceptible to becoming microsatellites (known as proto-microsatellites) can eventually develop into new microsatellite sequences. Proto-microsatellites may be sequences with an interrupted repeating motif, or sequences with a repeating motif whose number of repeats is below the threshold required for the sequence to behave as a typical microsatellite. Different kinds of proto-microsatellite, and the mutation events required to transform them into true microsatellites, are detailed in figure 2.

Figure 2: Proto-microsatellite examples and possible mutation events required for microsatellite birth/death

[Created with reference to: Kelkar et al, 2011; Oliveira et al, 2006]

Note: blue highlighted sections represent changes from the left hand sequence to cause a microsatellite birth event, and vice versa for red highlighted sections. Dots to either side of sequences are flanking regions.

In addition to these factors, transposable elements (TEs), most commonly Alus and L1 elements, have been shown to play a large role in microsatellite genesis: microsatellite sequences can arise directly when an element transposes to a new part of the genome, or can form within the TE after transposition via one of the mechanisms illustrated in figure 2 [Kelkar et al, 2011].

Alu sequences are relatively short nucleotide sequences which are collectively the most commonly found TEs in the human genome, being found in the UTR or intronic regions of three quarters of all known human genes and making up more than 10% of the human genome [Kim et al, 2004]. Alu sequences act as hotbeds for microsatellite birth, both at the time of their insertion and later, after their integration into the genome sequence. This is largely due to the long stretch of adenine bases, the poly(A) tail, found at the 3' end of the Alu sequence. This mononucleotide repeat, as well as being a microsatellite in its own right, later acts as a breeding ground for the growth and decay of new microsatellite sequences, particularly [AT]n repeats, which occur relatively frequently in older adenine stretches [Kelkar et al, 2011].

L1s are a type of human LINE (Long INterspersed Element): long, self-replicating sequences which can be found in both whole and (most commonly) partial forms across the genome, and which make up around 17% of the entire human genome. L1 elements behave in a similar manner to Alu repeats, as they also feature a significant adenine stretch at insertion, which may in time give rise to further microsatellites. However, in addition to being promoted in this 3' region, microsatellites also commonly develop along the length of the L1 element. This is due both to the density of adenine and thymine nucleotides within the L1 sequence, which increases the chance of [AT]n repeats developing, and to the inherently high number of sequences within a typical L1 element which act as proto-microsatellites [Kelkar et al, 2011].

The inherent features of these transposable elements make them rich in microsatellite sequences: Kelkar et al (2011) explain that microsatellites, especially non-perfect, interrupted microsatellites, are common in Alus and L1s, and that while these transposable elements make up around a quarter of the human genome, they were shown to house around 41% of the total number of interrupted microsatellites identified (126,297 of 293,972 total).

After the birth of a microsatellite, characterised by the sequence in question reaching the threshold number of repeats needed to behave as a microsatellite, the microsatellite most often enters a period of rapid change, its adulthood, consisting of a series of repeat number expansions and contractions. By far the most important mechanism by which microsatellites elongate and shorten in this manner is strand slippage, detailed in figure 3. This process is hypothesised to be the major reason why microsatellite sequences have such a high level of polymorphism, and why their length can easily fluctuate both upwards and downwards during the middle "adulthood" phase of the microsatellite's existence [Kelkar et al, 2008; Kelkar et al, 2011; Payseur et al, 2011].

Microsatellite death is essentially the reverse of microsatellite birth. Long microsatellite sequences may lose enough repeats (via slippage) to fall below the threshold value and hence cease to be microsatellites. Alternatively, substitutions or indels may occur within a microsatellite, breaking up the sequence to such an extent that it is no longer considered to be a microsatellite (even an imperfect one). The mechanisms behind these events are illustrated in figure 2 [Kelkar et al, 2011].

Figure 3: The strand slippage mechanism, and its role in the evolution of microsatellites

[Created with reference to: Oliveira et al, 2006; Kelkar et al, 2011; Pray, 2008 - Figure 3]

Microsatellite distribution in the genome is not random: the existence of STRs, especially those which are more prone to fluctuations in length, is strongly influenced by their genomic surroundings. For example, polymorphic microsatellites are very rarely found in exonic regions of the genome, since coding regions are highly conserved and variation within an exon could have a significant effect on the protein it encodes. Aside from selection against polymorphism and birth in exons, microsatellites were also found to avoid the untranslated regions (UTRs) near coding sequences, while intergenic and intronic sequences were found to house more mutable microsatellites and to be more open to the inception of new microsatellite sequences via the introduction of transposable elements [Payseur et al, 2011].

Structural aspects of a microsatellite can also alter its mutability. For example, a higher number of repeats in a microsatellite increases its risk of mutation. If slippage is considered the dominant form of microsatellite mutation, this makes sense: the longer the sequence is, the more likely it is that strand misalignment, and subsequent slippage, will occur. In contrast, longer, more complicated motif sequences carry a smaller risk of mutation. By the same logic, slippage is less likely to occur when the chances of incorrect alignment are decreased (due to less repetition within the microsatellite) [Payseur et al, 2011].

In addition to these general structural effects on mutability, the nucleotide composition of the microsatellite's repeating motif also seems to affect the chance of mutation. Kelkar et al (2008) examine these effects and propose which mono-, di-, tri- and tetranucleotide motif microsatellites are most at risk of mutation.

For mononucleotide motifs, the adenine repeating motif, [A]n, often called a poly(A) tail, is most prone to mutation, while no significant difference in mutation risk is observed between the other mononucleotide microsatellites [Kelkar et al, 2008].

For dinucleotide motifs, repeating stretches of [AT]n were observed to be the most mutable. It is hypothesised that this is due to the [AT]n motif possessing fewer hydrogen bonds in its structure than the other dinucleotide motifs. Since slippage requires the breakage and subsequent reformation of hydrogen bonds between two strands, weaker hydrogen bonding would make this process less energetically demanding, and therefore more likely to occur. In contrast, [CG]n repeats, also known as CpG repeats, were the least mutable dinucleotide motifs. Rather than reflecting a property of the motif itself, this lower mutability is more likely a consequence of where CpG repeats tend to be located. CpG repeats are a relative rarity in most of the genome, mainly due to the tendency of a cytosine followed by a guanine to undergo methylation, which leaves the cytosine prone to deamination, converting it into thymine. Without selective pressure, the cytosine bases of CpG repeats are at significant risk of being converted into thymine. CpG-rich sequences, such as a [CG]n motif microsatellite, are therefore only preserved in areas under strong selective pressure, such as exonic sequences or, to a lesser extent, UTRs. This leads to these microsatellites, in turn, having a lower rate of mutation than other dinucleotide motif microsatellites [Kelkar et al, 2011; Kelkar et al, 2008].

For trinucleotide motif microsatellites, [AAG]n was found experimentally to have the highest rate of mutation, whereas for tetranucleotide motifs (the longest motifs analyzed by the study) [AAAG]n and [AAGG]n are the most mutable. For these particular motifs, the prevalence of A-T base pairs, which are held together by two hydrogen bonds rather than the three found in G-C pairs, is once again suggested as a key reason for their higher rates of mutation [Kelkar et al, 2011; Kelkar et al, 2008].
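
The hydrogen bonding argument can be illustrated with a few lines of Python. The sketch below simply counts Watson-Crick hydrogen bonds (two per A-T pair, three per G-C pair) for the motifs named above; the function names are hypothetical, and the count is a crude proxy that ignores base stacking and sequence context, so it should not be read as a model of mutability.

```python
# Watson-Crick pairing: an A-T pair forms two hydrogen bonds, a G-C pair three.
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def hydrogen_bonds_per_motif(motif):
    """Total hydrogen bonds holding one copy of the motif to its complement."""
    return sum(HYDROGEN_BONDS[base] for base in motif.upper())


def mean_bonds_per_base_pair(motif):
    """Average hydrogen bonds per base pair, for comparing motifs of unequal length."""
    return hydrogen_bonds_per_motif(motif) / len(motif)


for motif in ("AT", "CG", "AAG", "AAAG", "AAGG"):
    print(motif, hydrogen_bonds_per_motif(motif), mean_bonds_per_base_pair(motif))
```

On this measure [AT]n sits at the minimum of two bonds per base pair and [CG]n at the maximum of three, which matches the ordering of the two dinucleotide extremes described above.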

With high rates of polymorphism, and relevance to population studies, familial inheritance and kinship analysis, microsatellites seem well suited to wider use in scientific analysis than they currently receive. One reason that microsatellite analysis has not become more widespread is the difficulty of sequencing microsatellites using current sequencing apparatus. Next generation sequencing methods tend to use the shotgun sequencing technique, whereby DNA is fragmented into small sections which are then sequenced in parallel. These short sequences, or reads, are then aligned in an overlapping fashion against a reference sequence or library, so that an entire genome can be reconstructed from them. Because the fragments overlap, each fragment's position within the genome is verified by the flanking sequence it shares with adjacent fragments. The process of fragmentation, sequencing and alignment with a reference is generally repeated multiple times to increase accuracy, with each repeat increasing the "depth" of the sequencing [Koboldt et al, 2010].

While this process has been highly successful for the majority of the genome, it is not suited to long, uninterrupted repetitive sequences such as microsatellites. The short fragments used in shotgun sequencing are often so small that they contain only part of a microsatellite, or that the read does not possess enough flanking DNA to be aligned correctly with the reference sequence. Even if two reads were obtained which together covered the entire microsatellite, it would be very hard to align them correctly, as it would be unclear how much overlap existed between the two reads. These scenarios are illustrated in figure 4, below [Harismendy et al, 2009; Teer & Mullikin, 2010].
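
The scenarios described above can be expressed as a small classification function. This is a minimal sketch under stated assumptions: coordinates are half-open positions on a reference, the 10-base minimum flank is an arbitrary illustrative value, and the function and category names are hypothetical rather than taken from any alignment tool.

```python
def classify_read(read_start, read_end, repeat_start, repeat_end, min_flank=10):
    """Classify a read relative to a microsatellite.

    Coordinates are half-open [start, end) positions on the reference.
    min_flank is an assumed number of unique flanking bases needed on
    each side to anchor the read.
    """
    left_flank = repeat_start - read_start   # unique bases before the repeat
    right_flank = read_end - repeat_end      # unique bases after the repeat
    if left_flank >= min_flank and right_flank >= min_flank:
        return "spanning"      # whole repeat plus enough flank on both sides
    if read_end <= repeat_start or read_start >= repeat_end:
        return "outside"       # read does not touch the repeat at all
    if left_flank <= 0 and right_flank <= 0:
        return "repeat_only"   # read lies entirely inside the repeat
    return "partial"           # covers part of the repeat or too little flank


# A hypothetical 60 bp microsatellite at positions 100-160.
print(classify_read(80, 180, 100, 160))   # spanning
print(classify_read(120, 150, 100, 160))  # repeat_only
print(classify_read(70, 130, 100, 160))   # partial
```

Only reads in the first category allow the repeat number to be counted directly; the other categories correspond to the ambiguous cases in which the read cannot be placed or the allele length cannot be determined.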

Figure 4: The challenge of reconstructing microsatellite regions using short fragment sequences in shotgun genome sequencing

[Created with reference to: Harismendy et al, 2009; Koboldt et al, 2010; Teer & Mullikin, 2010]

Furthermore, mapping a polymorphic microsatellite sequence to a reference sequence is particularly difficult, mainly because of the idiosyncrasies of the software currently used to align sequenced reads with a reference. Such software is typically designed to match sequences which are identical, or which differ only slightly, and to ignore sequences with significant differences. This ensures that a read with slight differences from the reference (i.e. a few SNPs) will still be aligned to the correct position in the genome. However, microsatellite alleles, being both highly polymorphic and variable in length, can often be judged significantly different from their true position on the reference genome, leading to these reads being discarded, or aligned to another, incorrect point on the reference genome [Koboldt et al, 2010].

Microsatellite evolution is frequently analyzed, via sequence analysis and parsimony, within the great apes (chimpanzees, orang-utans, gorillas and humans), generally with other, evolutionarily distinct primates used as a reference group; examples include the marmoset, a New World monkey, and the macaque, a widespread Old World monkey [Kelkar et al, 2011]. One method that is often used to try to ascertain relatedness between individuals of different species is the stepwise mutation model. Under this model, microsatellite alleles change by one repeat at a time, so if the allele in one individual differs from that in another by a single repeat, the two individuals can be predicted to be closely related. Several things seriously affect the reliability of such analysis, however. One is homoplasy, or convergent evolution, where alleles are the same in two individuals but arose independently. Another is re-evolution, whereby a particular allele (in this case a microsatellite allele) may have changed and then changed back to its original form during a span of evolutionary time. These processes may lead to the divergence time between two species being misrepresented. Alternatively, if an allele were to change between two species and then change back in a third, the three species could be ordered incorrectly in terms of when they emerged (if parsimony is followed). Critically, the high mutation rate of microsatellites, at least compared to that of SNPs, significantly reduces the usefulness of parsimony in the analysis of species relatedness. However, the stepwise mutation model can still be used for individuals of the same species, for example for analysis in humans [Oliveira et al, 2006; Vowles & Amos, 2004; Kelkar et al, 2008].
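
The stepwise mutation model and the problem of re-evolution can be explored with a toy simulation. The sketch below is an assumption-laden illustration rather than a model from the cited papers: expansions and contractions are treated as equally likely, the mutation rate is constant regardless of repeat number, and the function names and parameter values are hypothetical.

```python
import random


def simulate_stepwise(start_repeats, generations, mutation_rate, rng, min_repeats=1):
    """Simulate one lineage under a simple stepwise mutation model.

    In each generation the allele gains or loses exactly one repeat with
    probability mutation_rate (assumed constant and symmetric here).
    Returns the repeat count after every generation, including the start.
    """
    repeats = start_repeats
    history = [repeats]
    for _ in range(generations):
        if rng.random() < mutation_rate:
            step = rng.choice((-1, 1))
            repeats = max(min_repeats, repeats + step)
        history.append(repeats)
    return history


def count_returns_to_start(history):
    """Count how often a lineage returns to its original allele after leaving it."""
    start = history[0]
    returns = 0
    for previous, current in zip(history, history[1:]):
        if previous != start and current == start:
            returns += 1
    return returns


rng = random.Random(42)  # fixed seed so the run is repeatable
lineage = simulate_stepwise(start_repeats=15, generations=10_000,
                            mutation_rate=1e-3, rng=rng)
print(lineage[-1], count_returns_to_start(lineage))
```

Every return to the starting allele is a case in which a comparison of the first and last states alone would suggest that no mutation had occurred, which is the difficulty that re-evolution poses for parsimony.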

One point of contention regarding microsatellites and their evolution is whether microsatellite polymorphism over evolutionary time has an effect on flanking sequences. Varela & Amos (2009) argue that the levels of co-evolution between microsatellites and flanking sequences up to 50 bases either side of the microsatellite are compelling evidence that the increased mutation risk associated with microsatellites is shared, to a degree, with neighbouring sequences; and that overall, sequences adjoining a particular microsatellite will be more similar to one another than to random sequences or to sequences flanking different microsatellites, a key indicator of convergent evolution. On the other hand, Webster & Hagberg (2007) argue that the evidence presented is not sufficient to support the argument that microsatellites can affect these flanking sequences, and that such a claim, if true, would mean that a very large portion of the genome (more than 30%) would be under the influence of a previously unconsidered mechanism. Webster & Hagberg instead suggest a different hypothesis: that these flanking sequences are more often than not sequences which were once part of the microsatellite they flank, or part of a compound microsatellite structure, which has since degraded through substitution and indel mutations.


The aim of this project is to analyse the microsatellite sequences obtained in a study performed by Professor Monckton and his team, which set out to sequence the microsatellite alleles of 20 individuals using an approach different from that of contemporary next generation sequencing techniques, such as the Genome Analyzer IIx by Illumina.

While the mechanisms of the majority of next generation sequencing techniques are well suited to sequencing most of the genome, these systems are in no way suited to microsatellite sequences, especially longer ones. In response to this, the project carried out utilised a different system, based on one generally used for exome sequencing (which is detailed in figure 5).

For the project carried out, a system similar to the one detailed in figure 5 was used, except with probe sequences specific to certain microsatellite sequences in the human genome. Probe sequences were researched using the Tandem Repeat Finder database (TRFd). Only microsatellites with mono-, di- and trinucleotide repeating motifs were analyzed. This was done in order to limit the potential number of microsatellites that would be identified across the 20 samples; in addition, shorter motif sequences would be easier to capture using probes. Furthermore, a minimum microsatellite length of 15bp was required, mainly to ensure that only genuine microsatellites would be captured by the probes. Finally, only perfect (i.e. uninterrupted) microsatellite sequences were accepted; this decreases the chance of incorrect microsatellite sequences being captured by the probes and also helps to decrease the overall sample size obtained. In the end, 4,437 mononucleotide, 19,340 dinucleotide and 1,284 trinucleotide motif sequences were obtained, approximately 24,000 microsatellite sequences in total across all 20 individuals whose DNA was analyzed.
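
The three selection criteria described above (motif length of at most three bases, total length of at least 15bp, perfect repeats only) can be written as a simple filter. The sketch below is illustrative only: the RepeatLocus record and its field names are hypothetical and do not correspond to the actual output format of the database used in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatLocus:
    """Hypothetical record for one candidate microsatellite."""
    motif: str
    repeat_count: int
    is_perfect: bool

    @property
    def length_bp(self):
        return len(self.motif) * self.repeat_count


def passes_selection(locus, max_motif_length=3, min_length_bp=15):
    """Apply the criteria described in the text: short motif, minimum length, perfect."""
    return (len(locus.motif) <= max_motif_length
            and locus.length_bp >= min_length_bp
            and locus.is_perfect)


candidates = [
    RepeatLocus("A", 15, True),      # kept: 15bp mononucleotide
    RepeatLocus("AC", 7, True),      # rejected: only 14bp
    RepeatLocus("AAAG", 10, True),   # rejected: tetranucleotide motif
    RepeatLocus("CAG", 5, False),    # rejected: not a perfect repeat
]
selected = [locus for locus in candidates if passes_selection(locus)]
print(selected)
```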

Figure 5: The mechanism of exome sequencing/targeted exon capture

[Created with reference to: Teer & Mullikin, 2010; Koboldt et al, 2010]

The mechanism for exome sequencing is as follows. Firstly, the genome of the individual to be examined is fragmented into many differently sized sections (the fragments that will later be sequenced as reads) [STEP 1].

After this step, a selection process is carried out using probes specific to the exonic sequences of the genome. These probes are attached to magnetic beads coated with streptavidin, a small bacterial protein used as an interface to allow the probe sequences to bind to the beads [STEP 2].

The exonic sequences, now attached to beads, are separated from the rest of the DNA in solution via magnetic attraction [STEP 3].

Of the 20 different individuals analyzed, most were single persons unrelated to any of the other subjects. However, two different familial units were also analyzed: one a father/mother/child trio, the other a mother/son duo. These familial sets are invaluable for analysing the inheritance pattern of microsatellites, and also for confirming the relative accuracy of the results (confirming that the microsatellite alleles recorded for a child are consistent with the sets of alleles present in its parents goes some way to confirming that the sequencing process was successful).
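
The consistency check described above can be sketched in a few lines. This is a simplified illustration with hypothetical function names and allele values: genotypes are pairs of repeat counts at a single autosomal locus, and any inconsistency is simply flagged. Given the high microsatellite mutation rate, a flagged locus could reflect a genuine new mutation rather than a sequencing error, so the check indicates accuracy without proving it.

```python
def consistent_with_parents(child, mother, father):
    """Check a trio at one autosomal locus.

    Each genotype is a pair of repeat counts. The child's genotype is
    consistent if one allele could come from the mother and the other
    from the father.
    """
    first, second = child
    return ((first in mother and second in father)
            or (second in mother and first in father))


def consistent_with_mother(child, mother):
    """Check a mother/child duo: the child must share at least one maternal allele."""
    return any(allele in mother for allele in child)


# Hypothetical repeat counts at one locus.
print(consistent_with_parents(child=(12, 15), mother=(12, 13), father=(15, 15)))  # True
print(consistent_with_parents(child=(12, 14), mother=(12, 13), father=(15, 16)))  # False
print(consistent_with_mother(child=(12, 14), mother=(12, 13)))                    # True
```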

The project carried out seemed to identify and sequence microsatellites within these 20 individuals at a very high depth. However, in order to confirm that these findings are indeed correct, it is necessary to re-analyze this data using another method, which will be carried out in several steps over a period of 10 weeks. Initially, the microsatellite sequences obtained in the previous project will be analyzed, and several sequences will then be designed as PCR primers. These PCR primers will be used to amplify the original individuals' microsatellite DNA, with this process being calibrated using several different temperatures for the annealing stage of PCR. Proper optimisation of the annealing temperature will both increase the yield of intended PCR products and decrease the yield of unwanted, non-specific products [Rychlik et al, 1990]. Once the microsatellite sequences have been amplified, they can be analyzed using the traditional method of Sanger sequencing. The results of this project, if successful, should help to corroborate the findings of the original microsatellite analysis study.
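
As a rough illustration of how a set of trial annealing temperatures might be chosen, the sketch below uses the Wallace rule, a common approximation for short primers in which each A or T contributes 2 degrees C and each G or C contributes 4 degrees C to the melting temperature. This is not the optimisation method of Rychlik et al (1990); the 5 degree offset below the melting temperature is a rule of thumb, and the primer sequences, function names and gradient settings are all hypothetical.

```python
def wallace_tm(primer):
    """Approximate melting temperature (degrees C) using the Wallace rule."""
    primer = primer.upper()
    at_count = primer.count("A") + primer.count("T")
    gc_count = primer.count("G") + primer.count("C")
    return 2 * at_count + 4 * gc_count


def annealing_gradient(forward, reverse, offset=5, span=4, step=2):
    """Trial annealing temperatures centred a few degrees below the lower primer Tm.

    offset, span and step are illustrative defaults, not recommended values.
    """
    centre = min(wallace_tm(forward), wallace_tm(reverse)) - offset
    return [centre + delta for delta in range(-span, span + 1, step)]


# Hypothetical 20-base primers, not designed against any real locus.
forward_primer = "ATATATATATGCGCGCGCGC"
reverse_primer = "ATATATATATATGCGCGCGC"
print(wallace_tm(forward_primer), wallace_tm(reverse_primer))
print(annealing_gradient(forward_primer, reverse_primer))
```

For these placeholder primers the estimated melting temperatures are 60 and 56 degrees C, giving trial annealing temperatures from 47 to 55 degrees C in 2 degree steps; in practice the best temperature would be chosen empirically from the products observed at each step.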