Metagenomics refers to the study of the collective set of genomes of mixed microbial communities. With the advent of next-generation sequencing techniques, this area has received renewed interest, as researchers seek to understand the interaction between humans and their microbiota. This case study describes tools and techniques used to analyse metagenomic data and mine for genes of interest. We test out in-silico approaches for the discovery of lantibiotic genes within the tongue metagenome of 9 individuals. This yielded several lantibiotics which can now be cultured in the laboratory for identification and confirmation.
Metagenomics refers to culture-independent studies of the collective set of genomes of mixed microbial communities. The development of next-generation DNA sequencing techniques has greatly enhanced our ability to study microbiota at high resolution. In recent years, interest in the human microbiome has grown, as it is becoming increasingly clear that interactions between microbiota and humans play a large role in human health.
The human microbiome is the entire population of microbes that colonize the human body, including the gastrointestinal tract, the genitourinary tract, the oral cavity, the nasopharynx, the respiratory tract and the skin. Microbes that live on and inside us outnumber human cells by a factor of ten to one, and include bacteria, fungi and viruses. Characterising the human microbiota is important, as they provide a range of metabolic functions that we lack, performing different functions in health and in disease. The National Institutes of Health has started the Human Microbiome Project with the aims of determining whether individuals share a core human microbiome and understanding whether changes in the microbiome can be correlated with changes in health.
Prokaryotic genomes are typically sequenced by Sanger shotgun sequencing, which involves shearing the DNA content of the genomic clone into random fragments then cloning into plasmid vectors grown in monoclonal libraries. The DNA is then sequenced by dye-termination methods and sequence fragments are assembled by software. There are several disadvantages of this method, for example, some genes cannot be incorporated into the library vector due to toxicity. Furthermore, in metagenomics, the raw genomic material does not come from a single organism. The DNA from shotgun sequencing may only provide a partial genomic picture, and the more abundant species would dominate the sample.
Recent technological advances in sequencing have enabled metagenomic profiling to be performed with greater speed and at lower cost. Sanger sequencing currently produces longer reads of up to 800 bases, which are very useful for inferring gene functions in metagenomics. However, pyrosequencing eliminates the laborious step of preparing clone libraries, and is therefore faster and cheaper. The large number of short reads enables rRNA-based community analysis to be carried out with reasonable accuracy. For example, 200-base reads, accounting for 12% of the data in the 16S rRNA gene, yield community clustering results as accurate as those obtained using 70% of the original number of full-length sequences, provided that the region of the 16S rRNA gene is chosen carefully, e.g. the V2 or V4 region. However, in cases where the sequences obtained are highly divergent from related sequences, obtaining the entire sequence length is crucial.
The number of sequences required to characterize a sample depends on the goal of the study, the diversity of species in the sample and the read length. If the goal is to estimate the major bacterial phyla in each sample, relatively few sequences per sample are required. However, if complete characterization of all sequences is desired, larger numbers of sequences would be needed, especially if many species are rare.
1.3 Methods for data analysis
Analysis of diversity can take several directions. The focus can be qualitative (examining only the presence of species) or quantitative (also taking into account abundance). It can include alpha diversity (how many lineages there are in one sample) or beta diversity (how lineages are shared among samples). An analysis can either be phylogenetic (using a tree to relate sequences) or taxon based (treating all taxa at a given rank as phylogenetically equal). Many sequences arise from uncultured microbes that have not been formally described, hence taxa are defined by similarity in sequences. Each approach has its advantages. Phylogenetic methods tend to reveal more information when samples are diverse and when there are few sequences per sample. Taxon-based methods, however, are helpful for building networks that relate species to one another or for comparing which operational taxonomic units are shared among subsets of species.
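The distinctions above can be illustrated with a minimal taxon-based sketch. The sample names, OTU labels and counts below are invented for illustration, and the choice of indices (richness, Shannon, Jaccard, Bray-Curtis) is an assumption rather than the method used in this study.

```python
import math

def richness(counts):
    """Qualitative alpha diversity: number of taxa observed at least once."""
    return sum(1 for n in counts.values() if n > 0)

def shannon(counts):
    """Quantitative alpha diversity: Shannon index H = -sum(p * ln p)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)

def jaccard_distance(a, b):
    """Qualitative beta diversity: 1 - shared taxa / all taxa."""
    taxa_a = {t for t, n in a.items() if n > 0}
    taxa_b = {t for t, n in b.items() if n > 0}
    union = taxa_a | taxa_b
    if not union:
        return 0.0
    return 1 - len(taxa_a & taxa_b) / len(union)

def bray_curtis(a, b):
    """Quantitative beta diversity: weights shared taxa by abundance."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))

# Hypothetical OTU counts for two samples
sample_a = {"OTU1": 50, "OTU2": 30, "OTU3": 20}
sample_b = {"OTU2": 10, "OTU3": 10, "OTU4": 80}

print(richness(sample_a), shannon(sample_a))
print(jaccard_distance(sample_a, sample_b), bray_curtis(sample_a, sample_b))
```

The two samples share half of their taxa, yet the shared taxa are rare in the second sample, so the abundance-weighted distance is larger than the presence-only distance.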
The identification of genes in metagenomic data is extremely challenging, as many reads may remain as singletons, especially in species-rich environments. Most traditional gene finding tools search for whole open reading frames (ORFs), taking into account information from large genomic stretches, which are unavailable in metagenomic data. Using the Basic Local Alignment Search Tool (BLAST) against known databases is a common approach, but it only works for known homologs. It is unable to find new families or genes that have no homologs in known databases. Ab initio gene prediction tools are required for this task; they rely on pattern recognition algorithms, and may employ both supervised and unsupervised learning techniques. Many of these algorithms incorporate Hidden Markov Models (HMMs); however, this has the disadvantage of poor specificity in identifying partial ORFs that may be part of true genes.
1.4 Functional annotation
Functional annotation is particularly challenging in metagenomic data, as many ORFs are incomplete and many have no known homologs in databases. One alternative may be to skip the gene calling step and to use six-frame translations of the reads. These putative partial ORFs can be searched for motifs and HMM profiles. This approach has a low probability of calling a false ORF that also includes a known sequence signature. Motif Extraction is an unsupervised motif creation method that uses this technique to search for enzymes in metagenomic data, by first identifying enzymes by unsupervised learning, then associating them with functions by supervised learning. This allows new motifs to be identified within ORFs even if their function is unknown. BLASTing unassembled single reads may also be used to find functional information, but this may have a lower sensitivity than the previous methods. There are several online open source tools for the analysis of metagenomic sequences. These are:
MG-RAST - This is implemented in Perl and requires raw sequence data in FASTA format. Further description is provided in Section 3.
RAMM-CAP - This tool calls open reading frames from six-frame translations of the reads. Functional annotation is then performed against Pfam and TIGRFAM using HMMER.
IMG/M - The data held within the server can be searched with a keyword-based genome browser. It also provides an estimate of the phylogenetic composition of a metagenome based on the distribution of the best BLAST hits of the protein coding genes.
MEGAN - Sequence comparison of all reads against databases is performed with a BLAST search. A taxonomical analysis of the sample is obtained by assigning the reads to different nodes in the NCBI taxonomy using an algorithm that assigns each read to the lowest common ancestor.
SHOTGUNFUNCTIONALIZER - This is an R package that contains tools for importing, annotating and visualizing metagenomic data produced by shotgun high throughput sequencing. It utilizes statistical techniques for assessing functional differences between samples.
CARMA - This focuses on a phylogenetic approach to metagenomic analysis, and is especially suitable for short fragment DNA, using Pfam domains and protein families as phylogenetic markers to identify the source organisms of DNA fragments.
SIMILARITIES AND DIFFERENCES
MG-RAST provides comparative functional sequence-based analysis for uploaded samples, while IMG/M provides similar analysis for metagenomes in the IMG/M database. RAMM-CAP also provides similar comparative analysis. While most of the tools perform well on longer sequence fragments, CARMA specialises in short fragment DNA. MEGAN carries out taxonomic analysis by reading a BLAST output file and then assigning each read to the lowest common ancestor on the phylogenetic tree. CARMA is similar to MEGAN but uses Pfam as its source for taxonomic classification. CARMA can run its own BLAST, while MEGAN requires previously generated BLAST output.
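The lowest common ancestor idea described for MEGAN can be sketched as follows. This is a simplified illustration only: the `PARENT` table is a hypothetical toy taxonomy with most ranks omitted, and the real tool also applies score thresholds and minimum support rules that are not modelled here.

```python
# Hypothetical toy taxonomy (child -> parent); intermediate ranks are omitted.
PARENT = {
    "Streptococcus salivarius": "Streptococcus",
    "Streptococcus macedonicus": "Streptococcus",
    "Streptococcus": "Streptococcaceae",
    "Lactococcus lactis": "Lactococcus",
    "Lactococcus": "Streptococcaceae",
    "Streptococcaceae": "Bacteria",
    "Bacteria": "root",
}

def lineage(taxon):
    """Return the path from the root down to the given taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]

def lowest_common_ancestor(taxa):
    """Deepest node shared by the lineages of all taxa hit by one read."""
    lineages = [lineage(t) for t in taxa]
    lca = "root"
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca

# A read whose hits all fall in one genus is assigned to that genus;
# a read with hits in two genera moves up to their shared family.
print(lowest_common_ancestor(["Streptococcus salivarius",
                              "Streptococcus macedonicus"]))
print(lowest_common_ancestor(["Streptococcus salivarius",
                              "Lactococcus lactis"]))
```

Reads whose hits are spread across distant taxa are therefore placed conservatively, high up in the tree, rather than being assigned to a single species.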
2. BACTERIOCINS AND LANTIBIOTICS
Bacteriocins are proteinaceous toxins produced by bacteria to inhibit the growth of other bacterial strains. Class I bacteriocins are small peptide inhibitors, and are mostly lantibiotics. Class II bacteriocins are small heat-stable proteins; they have a wide range of effects on membrane permeability and cell wall formation. Class III bacteriocins are large, heat-labile proteins.
Lantibiotics are small peptide antibiotics containing internal bridges resulting from the formation of (β-methyl)lanthionine residues. They belong to the bacteriocins, a class of peptide antibiotics produced by bacteria.
The structural gene of a lantibiotic encodes a ribosomally synthesised precursor prepeptide named LanA, which contains a leader sequence at the N-terminus and a propeptide at the C-terminus. Many of the serine and threonine residues in the propeptide are dehydrated to form dehydroalanine (Dha) and dehydrobutyrine (Dhb) respectively. When these modified residues react with an intrapeptide cysteine, a thioether bond is formed, resulting in the formation of lanthionine (Lan, from Dha) or β-methyllanthionine (MeLan, from Dhb). The positions of the dehydrated amino acid and its target cysteine determine the size and position of the resulting ring.
Lantibiotics can be divided into 4 groups according to the nature of the enzymes that catalyse (Me)Lan formation. For type 1 lantibiotics, 2 enzymes are involved: LanB, the lanthionine dehydratase that catalyses the dehydration of amino acids, and LanC, the lanthionine synthetase that catalyses thioether formation. Type 2 lantibiotics are modified by a single LanM enzyme which performs both functions. Types 3 and 4 are lantipeptides whose modification is catalysed by distinct enzymes, such as the RamC-like and LanL enzymes. Lantibiotics can also be grouped according to their primary sequence structure.
Figure 1 shows lantibiotic peptides representative of the different structural groupings. Based on structure, lantibiotics can be separated into two groups, type A and type B. In the classical view, type A lantibiotics are elongated, flexible, positively charged molecules that act by depolarising the cytoplasmic membrane, leading to the formation of pores and the leakage of essential cell constituents. The prototype type A lantibiotic is nisin. Type B lantibiotics are globular in structure, with a negative or zero net charge. They interfere with enzyme reactions within the bacterium.
2.2 Lantibiotic gene discovery
Lantibiotics have a diverse range of applications. Of the type 1 lantibiotics, nisin, mutacin and planosporicin have been shown to be active against multi-drug resistant Gram-positive pathogens. Pep5 and epidermin inhibit Staphylococcus epidermidis adhesion to catheters. Galvin et al. showed that methicillin-resistant Staphylococcus aureus and vancomycin-resistant Enterococcus are sensitive to lacticin 3147. Furthermore, both nisin and lacticin 3147 have been investigated for the treatment and prevention of mastitis in cattle. In addition, epidermin and gallidermin are active against Propionibacterium acnes. Lantibiotics have also been used as food preservatives and gastrointestinal probiotics. Nisin has been used in cheese, milk, dressings, canned food, crumpets, liquid egg and dairy desserts.
In the past, culture-based strategies have been responsible for the identification of most lantibiotics, and these have yielded results from the oral cavity, intestine, soil and milk. However, with the improvement of genomic sequencing technologies, in silico screening for lantibiotic genes is becoming an effective tool for the discovery of novel compounds. For example, a BLAST search on NCBI for LanC homologues using the NisC sequence as a driver identified 56 homologues, among which there were 49 potential lantibiotic-encoding gene clusters. In another study, with the lacticin 3147 modification enzyme LtnM1 as the driver sequence, 89 LanM homologues were found, of which 61 were in strains not known to be lantibiotic producers. One of these strains, B. licheniformis, was selected for functional testing, and a novel two-peptide lantibiotic was discovered which exhibited antimicrobial activity against Listeria monocytogenes, methicillin-resistant Staphylococcus aureus and vancomycin-resistant Enterococcus.
This project aims to:
Mine for lantibiotics within the tongue metagenome of 9 healthy individuals using BLAST and HMMER
Compare the results obtained from BLAST and HMMER
Examine the structure of the hits obtained in relation to protein superfamilies
3.2 Methods and Materials
The flowchart of methods is shown in Figure 2.
Tongue scrape samples were obtained from nine healthy individuals (volunteers, aged 24-51, from the UCL Research Department of Structural and Molecular Biology in compliance with the UCL Research Ethics Committee), none of whom had taken antibiotics in the previous six month period. DNA was isolated from the material scraped from the tongue and kept as 9 separate DNA samples. Equimolar amounts of DNA were then taken from each sample and approximately 8 µg were sequenced on the Roche 454 Titanium DNA sequencing platform.
The ultimate aim of this metagenomic analysis is to provide a comprehensive functional annotation of the tongue metagenome. To do this, several methods must be combined so that as many sequences as possible are annotated. One method used to functionally annotate the human tongue metagenome data set follows a domain-based approach. The 454-sequenced metagenome DNA fragments had previously been translated in all six reading frames using the tool 'transeq' from EMBOSS-6.3.1. These protein sequences were then scanned against CATH HMMs using HMMER 3.0 and DomainFinder3 software. This scanning method predicted the presence of protein domains that have been described by the CATH resource. The results were stored in a local database.
MG-RAST is another method used for functional annotation of the metagenome; however, this pipeline uses subsystem classification rather than domain classification. MG-RAST screens the sequences for potential protein-encoding genes via a BLASTX search against databases within the International Nucleotide Sequence Database Collaboration, and a phylogenetic reconstruction is computed. In parallel with the BLASTX searches, the sequence data are compared to all accessory databases using the appropriate algorithms and significance selection criteria. These include several rDNA databases, such as GREENGENES, RDP-II and the European 16S RNA database.
Using an E-value of 1e-10, 85 metagenome sequence matches were made to bacteriocin-like peptides. These 85 sequence matches were useful but had no other functional information attached, so we searched the functional annotations assigned by CATH to see whether any further information could confirm this functional prediction. As each metagenome sequence has its own unique identifier, we were easily able to link back into the database and pull out the CATH code and a text description, where available. These descriptions were very broad and did not necessarily link to lantibiotic function, so the 85 sequences were used in an NCBI BLASTN search.
The rapid increase in genomic knowledge has prompted the development of online bacteriocin-specific screening tools and repositories such as BAGEL and BACTIBASE.
BAGEL is a web-based software tool that identifies bacteriocins and related biosynthetic clusters by taking advantage of the fact that accessory genes encoding proteins needed for modification and processing of the bacteriocins are commonly located near the putative structural gene. Open reading frame detection is provided, hence it is independent of GenBank annotations. To increase sensitivity, the ORF searching procedure focuses on short ORFs. BAGEL2 includes the extended use of HMMs, as well as manually curated databases of known bacteriocins and context genes (encoding proteins for modification, immunity/transport and two-component systems). Based on the updated classification scheme, the authors of BAGEL2 have also added an advanced classification algorithm that can predict subclasses more accurately.
BACTIBASE is a similar server that mines for bacteriocins based on peptide sequences collected from the UniProt database and from the scientific literature using PubMed, since not all known bacteriocin sequences are present in ExPASy or NCBI. In BACTIBASE, the BLAST programme is used for sequence homology search, while ClustalW is used for sequence alignment. Each entry is checked against the Protein Data Bank as well as the UniProt database. The database also contains general data such as peptide class, producer organism, taxonomy and target bacterial organisms. Physicochemical properties, e.g. mass, isoelectric point, net charge, pH, hydrophobicity, aliphatic index, secondary/tertiary structure and half-life in mammalian cells, are also included if the information is available.
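The six-frame translation step can be sketched in a few lines. This is a minimal illustration using the standard genetic code, not a reimplementation of 'transeq'; the frame labels (+1 to +3, -1 to -3) are this sketch's own convention and the example read is invented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; ambiguous codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna):
    """Return the three forward and three reverse translations of a read."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

# Hypothetical short read
for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Each read therefore produces six candidate peptides, all of which are scanned against the HMM library, since the true reading frame of a short metagenomic fragment is not known in advance.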
3.2.3 Search Tools
HMMER calculates a score term for a probability model of non-homology (the null model). The profile/sequence bit scores are turned into a final log-odds bit score using this score correction. The Multiple Segment Viterbi (MSV) algorithm is then used to search for high-scoring ungapped alignments, passing the sequence to the next step if the MSV score passes a threshold. False positive MSV hits are corrected using the bias filter, which also uses an HMM approach. After that, the Viterbi filter calculates an optimal gapped alignment score, and the sequence is passed to the next step if the score exceeds a threshold. The forward and backward filter/parser then calculates the posterior probabilities of domain locations. From these, sub-sequences which contain a lot of probability mass for a profile match are identified. For each identified domain, an ad hoc 'null2' hypothesis is constructed from the domain's composition and used to calculate a biased-composition score correction. A maximum expected accuracy alignment is then calculated.
We used an HMM constructed on lacticin 481 for the HMMER search. Lacticin 481 is a type A lantibiotic which has a thioether bridge that spans half the length of the peptide, resulting in a compact molecule with a bicyclic ring structure towards the C-terminal end of the molecule and a linear conformation at the N-terminus. The largest group of lantibiotics, the lacticin 481 group, is named after this lantibiotic and is based on its structure (Table 1). We also used an HMM constructed on nisin for the HMMER search. Both HMMs were downloaded from the BAGEL2 database.
BLAST is a local alignment tool that is heuristic in nature. The first step involves making a look-up table of all the short subsequences and neighbouring subsequences in the query sequence. The database is then scanned for similarities. When a match is identified, it is used to initiate gap-free and gapped extensions of the subsequence. After the algorithm has looked up all possible subsequences from the query sequence and extended them maximally, it assembles the best alignment for each sequence-query pair and converts this information to a SeqAlign data structure. The BLAST Formatter can use the information in the SeqAlign to retrieve and display the similar sequences found. BLAST uses statistical theory to produce a bit score and an E-value for each alignment pair. The bit score gives an indication of how good the alignment is, while the E-value represents the statistical significance of the alignment and therefore depends on the size of the database searched.
We used LtnM1 and NisC as our driver sequences, as previous BLAST searches with these proteins have yielded good results in in silico screens [11,21]. LtnM1 is an enzyme that catalyzes the dehydration of serine and threonine residues into didehydroalanine and didehydrobutyrine respectively, and their reaction with cysteine residues to form the thioether-containing residues lanthionine and methyllanthionine. NisC is a nisin modification enzyme that catalyzes the coupling of the double bond in dehydro-amino acids to the thiol groups of cysteines, after NisB has dehydrated the serines and threonines in the propeptide part.
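The log-odds idea behind these scores can be illustrated with a heavily simplified sketch. The two-column `profile`, the uniform background and the `pseudo` probability below are all hypothetical, and the model is ungapped with no insert or delete states, so this shows only how a bit score compares a profile against a null model, not how HMMER itself is implemented.

```python
import math

BACKGROUND = 1 / 20  # assumed uniform null model over the 20 amino acids

def log_odds_bits(sequence, profile, pseudo=0.01):
    """Sum of log2(P(residue | profile column) / P(residue | null))."""
    score = 0.0
    for residue, column in zip(sequence, profile):
        p = column.get(residue, pseudo)  # unseen residues get a small probability
        score += math.log2(p / BACKGROUND)
    return score

# Hypothetical two-column profile (emission probabilities per position)
profile = [{"C": 0.8, "S": 0.2}, {"T": 0.4, "S": 0.6}]

print(log_odds_bits("CT", profile))  # residues favoured by the profile
print(log_odds_bits("AA", profile))  # residues the profile rarely emits
```

A positive score means the sequence is better explained by the profile than by the null model, which is the quantity each filter stage compares against its threshold.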
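The dependence of the E-value on database size follows from the standard relation between the two statistics, E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below evaluates this relation directly; the lengths are invented, and BLAST itself uses edge-corrected effective lengths, which are ignored here.

```python
def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical search: a 100-residue query against two database sizes
small = expected_hits(40, 100, 10**6)
large = expected_hits(40, 100, 2 * 10**6)
print(small, large)
```

The same alignment, with the same bit score, is half as significant in a database twice the size, whereas each additional bit of score halves the E-value.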
The HMMER search using the lacticin 481 HMM yielded 4 hits (Table 2), while the HMMER search using the nisin HMM yielded no hits. Each of the 4 lacticin 481 hits was blasted against the NCBI database. The sequence with the lowest HMMER E-value yielded no similarity matches with Mega BLAST (highly similar sequences). However, when using discontiguous Mega BLAST, several hits were obtained. These, together with hits from the other 3 sequences, are shown in Table 3.
Discontiguous Mega BLAST was designed for the comparison of diverged sequences whose alignments have a low degree of identity, where the original Mega BLAST is not effective. The original Mega BLAST looks for exact matches as the starting point, so it is less productive when less conserved sequences are compared. It may miss significant alignments or find too many short random alignments. In discontiguous Mega BLAST, the 'discontiguous word' approach is used for identifying initial offset pairs, after which gapped extension is performed, so it achieves higher sensitivity (but lower specificity) than the original Mega BLAST. The alignment between the first sequence hit and Streptococcus macedonicus is shown in Figure 3.
S. macedonicus was initially isolated from naturally fermented Greek Kasseri cheese. The lantibiotic macedocin biosynthetic gene cluster is contained within a 15 171 base pair region of the S. macedonicus ACA-DC 198 chromosome, which consists of 10 ORFs. ORF1 is a relaxase gene; relaxases are conjugative plasmid-encoded proteins essential for the horizontal transfer of the genetic information contained on plasmids during bacterial conjugation.
The next closest hit is streptococcin A-FF22. Compared with the streptococcin A-FF22 gene cluster, the macedocin gene cluster contains an additional structural gene and an insertion sequence between the regulatory and the biosynthetic operons.
Streptococcus salivarius plasmid pSsal-K12 appeared as the top BLAST hit, with the lowest E-value, for the remaining 3 sequences. Lantibiotic-producing strains of S. salivarius contain large plasmids. Each plasmid encodes one (salivaricin A, A2, A4, or B), two (salivaricin A, B or A2) or three lantibiotics (salivaricin A3, streptococcin A-FF22, and streptin). The plasmid encoding salivaricins A2 and B is transmissible from S. salivarius K12 to a plasmid-free derivative of the same strain. This suggests that S. salivarius may act as a repository for the dissemination of bacteriocin loci in the oral microbiota.
MG-RAST functionally classified 85 sequence fragments as bacteriocins (Table 4). The sequence with the lowest E-value was blasted against NCBI, which yielded the hits in Table 5. Of the 85 sequence fragments, 42 in total gave a hit with NCBI BLAST.
Blasting NisC against the tongue metagenome yielded only 1 hit with a p value below 0.05. This hit was blasted against the NCBI database. Mega BLAST revealed no similar sequences, but discontiguous Mega BLAST revealed sequence similarity to several genes, of which the gene with the lowest E-value is nisin Q. The alignment is shown in Figure 4, and full NCBI BLAST results are shown in Table 6.
Macedocin, streptococcin and salivaricin are all part of the lacticin 481 group. This group consists of 16 lantibiotics with a linear N-terminal end and a globular, cross-bridged C-terminus. The molecular masses of lacticin 481 group lantibiotics range from 2315 Da (salivaricin A) to 3245 Da (mutacin II). Lantibiotics within the lacticin 481 group are active over a wide range of temperatures (up to 100°C) and pH (between pH 4 and 10). In contrast, nisin is only stable over pH 2-6. The stability of lantibiotics in the lacticin 481 group has been attributed to thioether bridges locking the molecules into biologically active conformations. The stability against proteases, on the other hand, has been credited to (Me)Lan residues restricting the conformational freedom of potential cleavage sites.
S. macedonicus may possess properties that can be used in the food industry. These include exopolysaccharide production, peptidase activity and the ability to inhibit food spoilage bacteria such as Clostridium tyrobutyricum and Brochothrix sp. In fact, S. macedonicus has been employed as a cheese adjunct starter culture and a cheese protective culture. S. macedonicus may also have medicinal use, as macedocin inhibits pathogenic streptococci and Clostridium perfringens. Georgalaki et al. found that the macedocin molecule is identical to SA-FF22 and SA-M49 produced by the pathogenic S. pyogenes, while Maragkoudakis et al. have shown via PCR and Southern hybridization that S. macedonicus ACA-DC 198 does not have genes which are homologous to S. pyogenes virulence determinants. Since only non-pathogenic micro-organisms can be used in food, macedocin can be used as a preservative, while this is not possible with streptococcin A-FF22.
S. salivarius K12 has been made into a probiotic treatment called 'Blis K12 throat guard', which involves colonizing the mouth with this strain by sucking on lozenges. This has been demonstrated to reduce the growth of S. pyogenes, thereby improving throat health. The product is also useful in the treatment of halitosis, as increasing the levels of S. salivarius on the tongue helps to exclude some odour-causing bacteria. Furthermore, certain salivaricin A-producing strains have also been suggested to prevent otitis media.
The large majority of applications under study for lantibiotics of the lacticin 481 group are related to lacticin 481 itself. It could be used to accelerate cheese ripening by lysing starter cells, enhancing the release of intracellular aminopeptidases. Although previous studies demonstrated that the lacticin 481 spectrum of action does not cover pathogenic strains, it was recently shown to affect the survival of Listeria monocytogenes, Staphylococcus aureus and Escherichia coli O157:H7 in raw milk cheese.
The BLAST search using NisC as a driver sequence yielded nisin Q as a hit. There are 4 natural nisin variants: nisin A, nisin Z, nisin Q and nisin U. The first 3 are produced by Lactococcus lactis, while the last is produced by Streptococcus uberis. Nisin Q has four amino acid substitutions in the mature peptide and two in the leader peptide compared with nisin A. Nisin A has been widely used in the food industry as a preservative due to its selective toxicity and high stability. While nisins Q and A have similar biochemical features, nisin Q is more stable under oxidative conditions because the methionine at position 21 of nisin A is substituted by leucine in nisin Q.
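The difference between contiguous and discontiguous seeding can be shown with a small sketch. The template strings and sequences below are hypothetical and much shorter than the word sizes BLAST actually uses; a '1' marks a position that must match and a '0' a position that is ignored.

```python
def seed_hits(query, subject, template):
    """Return (query_pos, subject_pos) pairs where the sequences agree
    at every '1' position of the template."""
    width = len(template)
    care = [i for i, flag in enumerate(template) if flag == "1"]
    # Index every subject window by its bases at the positions that matter
    index = {}
    for j in range(len(subject) - width + 1):
        key = "".join(subject[j + i] for i in care)
        index.setdefault(key, []).append(j)
    hits = []
    for q in range(len(query) - width + 1):
        key = "".join(query[q + i] for i in care)
        hits.extend((q, j) for j in index.get(key, []))
    return hits

# Hypothetical sequences that differ at every third base
query = "ATGGCTAAAGGT"
subject = "ATAGCCAAGGGA"

print(seed_hits(query, subject, "111111"))  # contiguous word
print(seed_hits(query, subject, "110110"))  # discontiguous word
```

With the contiguous word no seed is found, because every window of six bases contains a mismatch, whereas the discontiguous word skips the variable third positions and finds seeds from which a gapped extension could start.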
3.5 Further Studies
It is likely that the sequences identified encode known lantibiotics, i.e. macedocin, salivaricin and streptococcin A-FF22. However, since the alignments are not perfect, these sequences could represent novel lantibiotics with similar structures. For further study, these sequences could be cloned into bacteria and the inhibition spectrum measured with well diffusion assays. The lantibiotic preparation can then be generated by inoculating into Luria-Bertani broth, harvesting the cells by centrifugation and examining the preparation again in well diffusion assays. Lactococcus lactis can be used as the indicator strain. High performance liquid chromatography can be used for purification, and mass spectrometry can be performed to determine the properties of the lantibiotic.
Studying the tongue metagenome requires careful attention to assembling sequences, performing functional annotation and subsequent taxonomic analysis. Mining for genes of interest in the metagenome can be performed with a BLAST or HMMER search. In this case study, an in silico approach to mining for lantibiotics yielded significant results with a HMMER search. These hits can be tested in vitro by cloning the sequences into bacteria. This approach may yield novel lantibiotics with properties that can be used in the food or medicinal industry.