A detailed study of Bioinformatics



Question 1

A) Explain the difference between the terms homology, similarity and identity with reference to protein sequences.

Protein homology is when proteins are derived from a common ancestor, i.e. two or more structures are said to be homologous if they are alike because of shared ancestry. Homology of protein sequences may also indicate common function.

Homology can be inferred among proteins on the basis of sequence similarity. For example, two or more proteins are likely to be homologous if they have highly similar sequences. However, sequence similarity does not always result from common ancestry. Short sequences may be similar by chance, and sequences may also be similar because both were selected to bind to a particular protein, for example a transcription factor. Families of similar sequences contain information about sequence evolution, and they can be the building blocks on which to perform more sensitive homology searches.

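In the comparison of protein sequences, the extent to which two sequences have the same residue (i.e. the same amino acid) at equivalent positions is usually expressed as the percentage identity. Similarity is a broader measure that also counts positions where the amino acids differ but have similar chemical properties.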
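The distinction between identity and similarity can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the two sequences are made up and already aligned, the groups of "similar" amino acids are a simplified grouping chosen for this example (real tools use a substitution matrix), and the percentages are taken over columns without gaps, which is one of several conventions in use.

```python
# Simplified groups of chemically similar amino acids. This grouping is an
# assumption made for illustration; it is not a standard scheme.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def are_similar(a, b):
    """Return True if two residues are identical or fall in the same group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity_and_similarity(aligned1, aligned2):
    """Compute percent identity and percent similarity of two aligned sequences.

    Gaps are written as '-'. Columns containing a gap are ignored, so the
    denominator is the number of ungapped columns (an assumed convention).
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned1, aligned2) if a != "-" and b != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(are_similar(a, b) for a, b in columns)
    return 100.0 * identical / len(columns), 100.0 * similar / len(columns)


# Hypothetical aligned protein fragments.
identity, similarity = percent_identity_and_similarity("MKVLDE-R", "MRILDDAR")
print(f"identity = {identity:.1f}%, similarity = {similarity:.1f}%")
```

Here four of the seven ungapped columns are identical, but all seven are either identical or conservative substitutions, so similarity is higher than identity. Neither number on its own proves homology, which is a statement about ancestry.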

B) What is the BLOSUM scoring matrix? Explain why it is necessary and how it was derived?

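BLOSUM stands for BLOcks SUbstitution Matrix and is used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences and are based on local alignments. A scoring matrix is necessary because some amino acid substitutions are more likely than others, so an alignment score needs to reward likely substitutions and penalise unlikely ones. BLOSUM was first introduced by Henikoff and Henikoff in 1992 and used a different approach to previous models: the scores were derived from the substitution frequencies observed in blocks of conserved, ungapped local alignments of related proteins (the BLOCKS database). It led to a marked improvement in protein sequence alignment.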

Several BLOSUM matrices exist, each named with a number. The number is the percentage identity threshold used to cluster sequences when the matrix was built; in BLOSUM62, for example, sequences that are 62% or more identical were clustered together. Matrices with low numbers are designed for comparing distantly related sequences, while those with high numbers are designed for comparing closely related sequences.
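The scores in a BLOSUM-style matrix are log-odds values: the observed frequency of an amino acid pair in the aligned blocks is compared with the frequency expected by chance, and the result is expressed in half-bit units and rounded. The sketch below shows that calculation on a toy two-letter alphabet. The pair counts are invented for illustration and the function name is hypothetical; the sequence clustering step that distinguishes, say, BLOSUM62 from BLOSUM80 is not reproduced here.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts):
    """Turn counts of aligned residue pairs into rounded half-bit log-odds scores.

    pair_counts maps an unordered pair, written as a sorted tuple such as
    ("I", "L"), to the number of times it was observed in aligned columns.
    All counts are assumed to be non-zero.
    """
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = defaultdict(float)
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # An unordered mixed pair can arise in two ways, hence the factor 2.
        expected = background[a] * background[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Invented counts for a two-residue alphabet (illustration only).
toy_counts = {("L", "L"): 60, ("I", "L"): 20, ("I", "I"): 20}
toy_scores = log_odds_scores(toy_counts)
print(toy_scores)
```

A positive score means the pair is seen more often than expected by chance, and a negative score means it is seen less often.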

C) The diagram below shows a guide tree that is constructed as part of the ClustalW alignment process. Describe the order in which alignments are carried out.

The ClustalW program (in its simplest usage) takes a set of homologous sequences (all DNA/RNA or all protein) and produces a single multiple alignment. First, all the sequences are compared to each other in a pairwise fashion, and a guide tree is then created from the pairwise sequence distances. Each step in the final multiple alignment consists of aligning two sequences or two existing alignments. This is done progressively, following the branching order of the guide tree, so the most closely related sequences are aligned first and more distant sequences or groups are added later.
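The order of the alignment steps can be read off a guide tree by visiting it from the tips towards the root. The sketch below does this for a made-up tree written as nested pairs; the sequence names and the tree shape are assumptions for illustration and do not correspond to the diagram in the question. It only lists which groups are joined at each step and does not perform any alignment.

```python
def alignment_order(tree):
    """List the progressive alignment steps implied by a binary guide tree.

    A leaf is a sequence name (a string); an internal node is a pair
    (left, right). Each step is a pair of lists naming the sequences in the
    two groups that are aligned to each other at that step.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_members = visit(left)    # finish everything below the left branch
        right_members = visit(right)  # then everything below the right branch
        steps.append((left_members, right_members))
        return left_members + right_members

    visit(tree)
    return steps


# Hypothetical guide tree: (((seqA, seqB), seqC), (seqD, seqE))
guide_tree = ((("seqA", "seqB"), "seqC"), ("seqD", "seqE"))
steps = alignment_order(guide_tree)
for number, (left, right) in enumerate(steps, start=1):
    print(f"Step {number}: align {left} with {right}")
```

For this tree the steps are: seqA with seqB, then that pair with seqC, then seqD with seqE, and finally the two resulting alignments with each other. The relative order of steps in independent branches is a choice made by this sketch.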

Question 3

A) Explain what is meant by a metabolic network and describe how they may be represented visually and computationally.

A metabolic network is a set of interconnected metabolic pathways, for example the Krebs cycle together with the pathways that feed into it. It is the complete set of metabolic and physical processes that determine the biochemical and physiological properties of a cell.

The networks can be shown as diagrams built from experimental data such as mass spectrometry (see diagram below), or through computer packages that capture the hierarchical relationships in networks and use the interactions to show both local details and global relationships of a large network simultaneously. Graph drawing algorithms can be used to visualise the networks. Computationally, the networks can be represented as graphs and reconstructed using databases such as BioCyc, EcoCyc and MetaCyc. For example, BioCyc is a collection of approximately 1,000 pathway/genome databases, with each database dedicated to one organism.

Ab initio prediction of metabolic networks using Fourier Transform Mass Spectrometry data

source - http://www.aisee.com/graph_of_the_month/metabolic.htm
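As an illustration of a computational representation, a metabolic network can be stored as a directed graph in which metabolites are nodes and each reaction is an edge labelled with its enzyme. The sketch below encodes a simplified Krebs cycle in this way and finds a route between two metabolites. Cofactors and the acetyl-CoA input are left out, and the function names are hypothetical, so this is a minimal sketch and not a full pathway model.

```python
from collections import deque

# (substrate, product, enzyme) for a simplified Krebs cycle.
REACTIONS = [
    ("citrate", "isocitrate", "aconitase"),
    ("isocitrate", "alpha-ketoglutarate", "isocitrate dehydrogenase"),
    ("alpha-ketoglutarate", "succinyl-CoA", "alpha-ketoglutarate dehydrogenase"),
    ("succinyl-CoA", "succinate", "succinyl-CoA synthetase"),
    ("succinate", "fumarate", "succinate dehydrogenase"),
    ("fumarate", "malate", "fumarase"),
    ("malate", "oxaloacetate", "malate dehydrogenase"),
    ("oxaloacetate", "citrate", "citrate synthase"),
]


def build_network(reactions):
    """Build an adjacency list: metabolite -> list of (product, enzyme)."""
    network = {}
    for substrate, product, enzyme in reactions:
        network.setdefault(substrate, []).append((product, enzyme))
        network.setdefault(product, [])
    return network


def shortest_path(network, start, goal):
    """Breadth-first search for the shortest chain of metabolites from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for product, _enzyme in network.get(path[-1], []):
            if product not in seen:
                seen.add(product)
                queue.append(path + [product])
    return None  # goal cannot be reached from start


network = build_network(REACTIONS)
print(" -> ".join(shortest_path(network, "citrate", "succinate")))
```

The same adjacency structure can be passed to a graph drawing algorithm for visualisation or queried directly, for example to list every enzyme that acts on a given metabolite.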

B) You have been provided with the complete genome sequence of a bacterium. Explain a computational approach that you could use to identify genes that encode enzymes.

A comparative genomics approach based on sequence homology would be used. The most common approach is to identify genes encoding a specific metabolic enzyme by establishing sequence homology to functionally characterised enzymes in other species: the predicted genes in the genome are compared by computer against known enzyme-encoding genes. A sequence search tool can be used for this. Such a tool takes a genetic sequence, gene name or keyword as input and searches one or more databases for information related to it.
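Once the homology search has been run, its results still have to be turned into annotations. The sketch below shows one simple way this might be done: each gene is assigned the function of its best hit among characterised enzymes, provided the hit passes an E-value and identity cut-off. The gene names, hit table and thresholds are all invented for illustration, and the function name is hypothetical; suitable cut-offs depend on the search tool and database used.

```python
def annotate_by_best_hit(hits, known_enzymes, max_evalue=1e-10, min_identity=30.0):
    """Assign an enzyme function to each gene from its best qualifying hit.

    hits is a list of (gene, subject, percent_identity, evalue) tuples, as
    might be parsed from a homology search. known_enzymes maps subject IDs of
    functionally characterised enzymes to a function name. The default
    thresholds are illustrative assumptions only.
    """
    best = {}
    for gene, subject, identity, evalue in hits:
        if subject not in known_enzymes:
            continue  # hit is not a characterised enzyme
        if evalue > max_evalue or identity < min_identity:
            continue  # hit is too weak to support an annotation
        if gene not in best or evalue < best[gene][1]:
            best[gene] = (subject, evalue)
    return {gene: known_enzymes[subject] for gene, (subject, _) in best.items()}


# Hypothetical search results and reference set.
hits = [
    ("gene_001", "P_hexokinase", 62.0, 1e-80),
    ("gene_001", "P_glucokinase", 41.0, 1e-35),
    ("gene_002", "P_unknown", 55.0, 1e-50),
    ("gene_003", "P_aconitase", 25.0, 1e-3),
]
known_enzymes = {
    "P_hexokinase": "hexokinase",
    "P_glucokinase": "glucokinase",
    "P_aconitase": "aconitase",
}
annotations = annotate_by_best_hit(hits, known_enzymes)
print(annotations)
```

In this example only gene_001 receives an annotation: gene_002 matches a protein with no characterised function, and the hit for gene_003 is too weak.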

Question one

a) Define the terms Semantics, Controlled Vocabulary, Taxonomy and Ontology.

Semantics is the study of meaning, that is, of the relationships between signs (such as words and symbols) and what they represent.

A controlled vocabulary is an agreed list of terms that provides a way to organise knowledge for subsequent retrieval. Controlled vocabularies are used in taxonomies, subject headings and subject indexing schemes.

Taxonomy is the practice and science of classification, and a taxonomic scheme is a particular classification arranged in a hierarchical structure.

Ontologies are structural frameworks for organising and categorising information. Ontology deals with questions concerning what exists or can be said to exist, and how such entities can be grouped and related within a hierarchy and subdivided according to similarities and differences.

b) Describe how ontologies are useful in bioinformatics, giving at least one example to illustrate your answer.

Ontologies are computer-readable, precise formulations of the concepts in a given field. They are a valuable framework for coping with the rapid growth of biological data generated by high-throughput technologies. An example is the Gene Ontology database, which is part of the Gene Ontology project, the aim of which is to standardise the representation of gene and gene product attributes across databases and species.
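One practical benefit of a computer-readable ontology is that a query for a general term can automatically find genes annotated with more specific terms. The sketch below illustrates this with a small "is a" hierarchy. The term names follow Gene Ontology naming, but the hierarchy is deliberately simplified (intermediate terms are omitted) and the gene annotations are invented, so it should be read as a toy model and not as the actual structure of the Gene Ontology.

```python
# Simplified "is a" links: term -> list of parent terms (illustrative only).
IS_A = {
    "kinase activity": ["transferase activity"],
    "transferase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is-a links upwards."""
    found = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def genes_annotated_with(term, gene_annotations, is_a):
    """Return genes annotated with `term` itself or with any more specific term."""
    return sorted(
        gene
        for gene, terms in gene_annotations.items()
        if any(t == term or term in ancestors(t, is_a) for t in terms)
    )


# Hypothetical annotations.
gene_annotations = {
    "geneA": ["kinase activity"],
    "geneB": ["transferase activity"],
    "geneC": ["molecular_function"],
}
print(genes_annotated_with("transferase activity", gene_annotations, IS_A))
```

The query for "transferase activity" returns geneA as well as geneB, because a kinase activity is a kind of transferase activity. A flat controlled vocabulary without relationships could not support this kind of inference.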

c) You have been asked to design a data standard to capture a minimal description of a microarray experiment. Describe your approach. Include in your answer a brief description of the benefits of your standard to the biological community and outline the types of information your standard would require.

The recorded information about each experiment should be detailed enough to interpret the experiment, to enable comparisons with similar experiments and to permit the experiment to be replicated. The information should be structured in a way that enables useful querying as well as automated data analysis. As a minimum the following should be recorded:

1. Experimental design - the set of hybridisation experiments as a whole

2. Array design - each array used and each element on the array

3. Samples - samples used, extract preparation and labelling

4. Hybridisation - procedures and parameters

5. Measurement - images, quantification and specifications

6. Normalisation controls - types, values and specifications

The benefit of having a set standard for the minimum information needed is that microarray data can be easily interpreted and that results derived from its analysis can be independently checked.
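A standard of this kind is most useful when submissions can be checked automatically. The sketch below shows one way the six required sections listed above might be enforced. The field names and the example record are hypothetical and chosen only for this illustration; they are not taken from any published standard or file format.

```python
# Required sections of the minimal description (names are assumptions).
REQUIRED_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridisation",
    "measurement",
    "normalisation_controls",
)


def missing_sections(record):
    """Return the required sections that are absent or empty in a submission."""
    return [section for section in REQUIRED_SECTIONS if not record.get(section)]


# Hypothetical, incomplete submission.
submission = {
    "experimental_design": {"type": "time course", "replicates": 3},
    "array_design": {"platform": "example two-colour array"},
    "samples": [{"name": "control"}, {"name": "treated"}],
    "hybridisation": {},  # present but empty, so it counts as missing
    "measurement": {"images": ["scan_01.tif"], "quantification": "spot intensities"},
}

problems = missing_sections(submission)
if problems:
    print("Submission incomplete, missing:", ", ".join(problems))
else:
    print("Submission meets the minimum information requirement")
```

Keeping the record structured in this way also supports the querying and automated analysis mentioned above, since every experiment exposes the same named sections.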

Question 2

a) Define the term network with respect to biology. Give three examples of biological networks that can be constructed from large scale databases. In each case name a suitable type of dataset that could be used to construct the network.

Biological networks are representations of the multiple interactions within a cell, a global view intended to help understand how relationships between molecules dictate cellular behaviour. They are usually depicted as nodes connected by edges. The nodes are the genes, proteins or enzymatic substrates. The edges are often shared functional properties, direct molecular interactions or regulatory interactions.

Metabolic networks are one example of a biological network that can be constructed from large scale databases. The Krebs cycle, for instance, can be constructed by searching for the correlation between the genome and metabolism in the Genned database. Other examples are transcriptional regulatory networks, which can be constructed using the SCPD database, and signal transduction networks, which can be constructed using the CADLIVE database.

b) A biologist is interested in a gene of unknown function. Describe how biological networks and integrated functional networks in particular can be used to provide evidence about its putative function.

One way is to mutate the gene and observe the effect on the network, since the network will no longer function properly without it. Integration means combining many data sources, such as microarray expression data and physical and genetic interactions, into a single network. Integrated networks include tissue-specific, biological process-specific and developmental stage-specific networks, each predicting relationships specific to an individual biological context. These integrated biological networks enable rapid investigation of uncharacterised genes in specific tissues and developmental stages of interest.

c) A bioinformatician wishes to construct an integrated functional network from the networks you describe in (a). Outline a process that could be used to facilitate this integration. Comment on the requirement for a gold standard network and suggest a suitable gold standard network for this integration task.

One gold standard assay for investigating protein-protein interactions, and hence integrated networks, is co-immunoprecipitation (Co-IP). An antibody is selected that targets a known protein that is a member of a larger complex of proteins. By targeting this member with an antibody it may become possible to pull the entire protein complex out of solution and identify unknown members of the complex.

This works when the proteins in the complex bind to each other, making it possible to pull several members of the complex out of solution by latching with an antibody onto one member.

An ideal gold standard test has a sensitivity of 100% and a specificity of 100% with respect to detection. In practice there is sometimes no true "gold standard" test, and the best available test is regarded as definitive.
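Sensitivity and specificity can be computed by comparing the interactions predicted by a network with a gold standard set of known interactions and a set of pairs believed not to interact. The sketch below works through the arithmetic on invented protein pairs; the protein names, the gold standard sets and the function name are all assumptions made for illustration.

```python
def sensitivity_specificity(predicted, gold_positive, gold_negative):
    """Score predicted interactions against a gold standard.

    Each argument is a set of frozensets, one frozenset per unordered protein
    pair. Predicted pairs that appear in neither gold set are ignored.
    """
    true_pos = len(predicted & gold_positive)
    false_neg = len(gold_positive - predicted)
    false_pos = len(predicted & gold_negative)
    true_neg = len(gold_negative - predicted)
    if not gold_positive or not gold_negative:
        raise ValueError("both gold standard sets must be non-empty")
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


def pair(a, b):
    """Helper: an unordered protein pair."""
    return frozenset((a, b))


# Hypothetical gold standard and predictions.
gold_positive = {pair("P1", "P2"), pair("P2", "P3"), pair("P3", "P4"), pair("P4", "P5")}
gold_negative = {pair("P1", "P5"), pair("P1", "P4"), pair("P2", "P5"), pair("P2", "P4")}
predicted = {pair("P1", "P2"), pair("P2", "P3"), pair("P3", "P4"), pair("P1", "P5")}

sens, spec = sensitivity_specificity(predicted, gold_positive, gold_negative)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Here three of the four true interactions are recovered and one of the four non-interacting pairs is wrongly predicted, so both values are 0.75. When several data sources are integrated, scores of this kind can be used to judge how much weight each source deserves.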

Question three

a) Compare and contrast the operation of dynamic programming based algorithms for sequence alignment with those based on heuristic techniques.

The technique of dynamic programming can be applied to produce both global alignments and local alignments.

Dynamic programming can be used in aligning nucleotide to protein sequences, a task made more complicated by the need to take into account insertions or deletions.

The dynamic programming method is guaranteed to find an optimal alignment for a given scoring function. However, dynamic programming can be prohibitively slow for large numbers of sequences or for extremely long sequences.
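The way dynamic programming reaches an optimal result can be seen in a short global alignment example in the style of the Needleman-Wunsch algorithm. The sketch below fills a table in which each cell holds the best score for aligning two prefixes, so every possible alignment is accounted for without being listed explicitly. The scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for this illustration, and only the score is returned, not the alignment itself.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of s and t.

    Uses a linear gap penalty. The default scores are illustrative only.
    """
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = score[i - 1][j] + gap    # gap in t
            left = score[i][j - 1] + gap  # gap in s
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "ACT"))
```

The table has (len(s) + 1) x (len(t) + 1) cells, so time and memory grow with the product of the sequence lengths. This is why the method becomes slow for very long sequences or for searches against a whole database.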

A heuristic is any way in which an algorithm can be directed towards solving a problem through the use of domain-specific information. The heuristic doesn't always help solve the problem, but it may help the algorithm solve the problem faster. The main purpose is to reduce the search space by reducing the need to explore irrelevant paths. A heuristic is independent of any particular algorithm. In contrast to dynamic programming, heuristic alignment methods such as BLAST and FASTA are much faster but are not guaranteed to find the optimal alignment.

b) The text below is part of the output of the BLASTP program used to search the non-redundant database with a newly acquired protein sequence. Comment on the BLASTP output and what it tells you about the query protein. Describe any additional computational analyses that could be done to find out more about this sequence.

The BLASTP program is a Basic Local Alignment Search Tool (BLAST) program that compares a protein query against a protein database. The output shows that the sequences of the two bacteria are similar.

Other computational analyses that can be used are Position-Specific Iterated BLAST (PSI-BLAST) or Pattern-Hit Initiated BLAST (PHI-BLAST). There is also GeneWise, which compares a protein sequence to a DNA sequence at the level of its conceptual translation.