Bioinformatics Analysis Of Nuclear Envelope Proteins Biology Essay




The nucleus is the central defining feature of eukaryotic cells and contains most of the cell's genetic information. It plays an important role in the functioning and maintenance of the cell by storing the hereditary information and by providing the site for DNA transcription, for ribosome production in the nucleolus, and for exchange of genetic material [1]. It is one of the most complex structures of the cell, and its exact functioning is still not clearly understood.

The nucleus is separated from the cytoplasm by the nuclear envelope (NE). The NE is a double membrane system that is continuous with the endoplasmic reticulum (ER) and consists of three membrane domains: the outer nuclear membrane (ONM), the inner nuclear membrane (INM) and the pore membrane (PoM), which connects the outer and inner membranes. The INM and the ONM each contain their own unique sets of proteins, in addition to the ER proteins and ribosomes that are found in both the ER and the ONM. Transport of soluble proteins and RNA between the nucleus and the cytoplasm is regulated by the nuclear pore complexes (NPCs), which are inserted at the PoM and provide the only breaks in the membrane sheet that covers the nucleus [2].

NPCs are enormous proteinaceous assemblies (>60 MDa) that exceed in size the 50 nm span between the ONM and INM [2]. They show a broad range of structural and compositional conservation among all eukaryotes. The complex can be minimally divided into three substructures: the cytoplasmic fibrils, a central core and the nuclear basket. The central core consists of eight spokes that are sandwiched between the cytoplasmic fibrils and the nuclear basket. It is through this spoke structure that the active transport of molecules occurs [3]. Each NPC contains at least 456 protein molecules and is composed of approximately 30 distinct proteins, called nucleoporins (Nups). Several Nups have been identified and characterized, and many other proteins, such as transport factors and chaperones, have known functions at the NPC [4]. It is notable that the mass of NPCs measured by cryo-electron microscopy is much greater than can be accounted for by the 30 proteins identified in purified biochemical fractions by mass spectrometry [3]. It is therefore possible that purification for mass spectrometry removed some material and that additional protein components have yet to be identified.


Since the discovery of several inherited diseases caused by mutations in nuclear envelope proteins, an extensive amount of work has been carried out to understand the molecular pathways underlying the proper functioning of the nuclear envelope. In the process, many nuclear envelope transmembrane proteins (NETs) have been identified. The first NETs identified were the INM proteins that bind to the lamins. Different groups have used several strategies to identify NETs in different organisms, and two studies set out to determine the NE proteome. The first study, from the Otto laboratory, produced three separate NE fractions: a chaotrope-insoluble fraction, a non-ionic detergent-insoluble fraction and a salt-insoluble fraction. The fractions were separated on 2D gels and the protein spots were analysed by MALDI mass spectrometry. Because chaotrope-insoluble fractions contain both ER and NE transmembrane proteins, the proteins that were found in both the detergent-insoluble and salt-insoluble fractions were considered to be NETs. This is the "comparative" approach. The second study, from the Gerace and Yates laboratories, used a "subtractive" approach. Here a microsomal membrane (MM) fraction was used to identify the ER proteins. The MM fraction was analyzed separately from the NE fractions, and all proteins appearing in both were subtracted from the NE fraction. In theory, since the membranes that contaminate the NE fractions should also occur in the MM fraction, the proteins left after subtraction can be considered true NETs. Each technique has its own advantages and limitations, and the results of both were validated by the identification of all expected, previously characterized NETs in the NE fraction [2].
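The logic of the two approaches can be illustrated with simple set operations. The sketch below is only an illustration of the filtering step, not the pipeline used in either study; the function names and the protein identifiers are hypothetical.

```python
def comparative_nets(detergent_insoluble, salt_insoluble):
    """Comparative approach: keep proteins found in both insoluble fractions."""
    return set(detergent_insoluble) & set(salt_insoluble)


def subtractive_nets(ne_fraction, mm_fraction):
    """Subtractive approach: drop every NE protein also seen in the MM fraction."""
    return set(ne_fraction) - set(mm_fraction)


# Hypothetical identifiers, for illustration only.
ne = {"NET_A", "NET_B", "ER_X", "ER_Y"}
mm = {"ER_X", "ER_Y", "ER_Z"}
print(sorted(subtractive_nets(ne, mm)))

detergent = {"NET_A", "NET_B", "SOLUBLE_1"}
salt = {"NET_A", "NET_B", "SOLUBLE_2"}
print(sorted(comparative_nets(detergent, salt)))
```

In practice the comparison would be made on matched protein accessions from the mass spectrometry search results rather than on hand-typed names.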

Soluble proteins and RNA are transported through the central channel of the NPC, and the mechanisms are well characterized. In contrast, the mechanisms for transport of transmembrane proteins to the INM are uncertain. They likely involve lateral diffusion in the membrane around the outer face of the NPCs, through peripheral channels that, according to some data, require a specific signal-mediated gating mechanism. Recent proteomics studies have identified a large number of INM proteins, from which it may finally be possible to identify such INM signal sequences.

A separate body of work has addressed the question of how transmembrane proteins translocate to the inner nuclear membrane after their synthesis in the ER. One study suggested that proteins access the INM by free diffusion in the PoM between the NPCs and the membrane. Another suggested that translocation requires ATP, while yet another indicated that the Ran GTPase (which normally functions in transport of soluble molecules through the central channel of the NPC) is involved. Because no study tested a wide range of INM proteins or compared these mechanisms, it was unclear whether they were all parts of one complex, multifaceted mechanism or represented multiple distinct mechanisms. More recently, 16 nuclear membrane proteins were directly compared by FRAP, which measured a 30-fold range of velocities for translocation between the ER and INM and suggests that multiple mechanisms are involved. In this study, two NETs required ATP but not Ran, one NET required Ran but not ATP, and the other NETs tested required neither Ran nor ATP and may diffuse freely in the membrane or use mechanisms that are not yet identified. One such mechanism may involve phenylalanine-glycine (FG) motifs on the NETs, because addition of FG motifs to NETs also facilitated translocation. Thus, there are multiple translocation mechanisms involved [5].


The nucleus is an important compartment in the cell where several biological processes take place. Many of those processes are regulated at the INM, where many transmembrane proteins directly bind to and regulate chromatin. Several specific functions have been attributed to the nuclear lamin polymer and its associated NETs, such as chromatin organization, nuclear assembly, DNA replication and transcription. Mutations in lamins and lamin-associated INM proteins cause a number of genetic diseases in humans, such as muscular dystrophies, lipodystrophy, neuropathy and premature aging. Understanding the functions of these nuclear envelope proteins helps in understanding the molecular mechanisms behind these diseases, which may help in finding better treatments. The import of soluble proteins into the nucleus is mediated by short binding sites in the protein sequence, called nuclear localization signals (NLSs). An NLS is an amino acid sequence that acts as a "tag" on a protein to be transported into the nucleus through the NPCs in the nuclear envelope. A series of mobile proteins associated with the NPCs, called transport receptors, recognize this tag and effectively "carry" their cargos through the NPC, mostly by means of FG-binding sites on the surface of the transport receptors interacting with FG repeats on nucleoporins. There are a few well-characterized NLS sequence motifs, though many more specific motifs that function as NLSs have also been identified. Membrane translocation of INM proteins may likewise involve specific sequence motifs, and the purpose of this proposal is to use the large proteomic datasets of novel INM proteins to search for such motifs.
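As an illustration of how a short sequence "tag" can be searched for, the sketch below scans a protein sequence with a regular expression. The pattern K-(K/R)-X-(K/R) is used here as an assumed consensus for a classical monopartite NLS; it is one commonly quoted form and not a definitive definition, and the function name is hypothetical. The test sequence contains the SV40 large T antigen NLS (PKKKRKV).

```python
import re

# Assumed consensus for a classical monopartite NLS:
# K, then K or R, then any residue, then K or R.
# The lookahead lets overlapping hits be reported.
CLASSICAL_NLS = re.compile(r"(?=(K[KR].[KR]))")


def find_nls_candidates(sequence):
    """Return (start, matched_text) for every, possibly overlapping, consensus hit."""
    return [(m.start(), m.group(1)) for m in CLASSICAL_NLS.finditer(sequence)]


# Short test sequence containing the SV40 large T antigen NLS (PKKKRKV).
print(find_nls_candidates("MAPKKKRKVED"))
```

A hit from such a pattern is only a candidate: any highly basic stretch will match, so a pattern search of this kind is best treated as a first filter before experimental validation.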


* To identify sequence motifs of the nuclear transmembrane proteins

* To understand the mechanism of translocation and the role played by these motifs in translocation

* To study the similarities or differences existing in the identified sequence motifs.

* To standardize nuclear localization signal search methods for a certain class of nuclear envelope proteins.


From the background studies we have extensive lists of proteins identified in a nuclear envelope fraction, and we know that there are different transport mechanisms by which these proteins cross the nuclear envelope barrier and gain entry into the organelle. The proteins whose entry mechanism is known can be classified into subgroups based on that mechanism, and it may be possible to identify sequence motifs for each group. Different algorithms will be used to search for the sequence characteristics shared within each subgroup. The size limitation of the peripheral channels is another parameter that may be applied to the datasets during the search, and the known lamin-binding regions on several of the proteins can also be used to identify shared sequence characteristics.

Several bioinformatics-based approaches can be used to discover or identify sequence motifs:

MEME (Multiple EM for Motif Elicitation): It is one of the most widely used tools for searching for novel signals in sets of biological sequences. It works by searching for repeated sequence patterns in the dataset supplied by the user. It also allows the discovered motifs to be compared with databases of known motifs, which gives an idea of their function. The input sequences are uploaded in FASTA format. MEME chooses the width and the number of occurrences of each motif automatically in order to minimize the E-value, an estimate of the expected number of equally well-conserved patterns that would be found in random sequences. The MEME output is an HTML file that shows the motifs as subsets of the input sequences. It also gives a "block diagram" that shows the relative positions of the motifs in the input sequences. Buttons in the MEME HTML output forward one or all of the motifs to the MAST web server, where various databases can be searched for similar patterns or to check whether a pattern is seen in any other genome. The MEME algorithm uses an expectation maximization statistical approach along with a greedy search for multiple motifs. Expectation maximization is a method for finding maximum likelihood estimates of parameters [6].
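To make the expectation maximization idea concrete, the following is a minimal sketch of EM motif discovery under the simplifying assumption that every sequence contains exactly one occurrence of a motif of fixed width. It is not MEME's implementation: the function names, the seeding rule, the pseudocount and the toy sequences are all assumptions made for illustration, and every sequence is assumed to be at least as long as the motif width.

```python
import math


def em_motif(seqs, width, seed, iters=20, pseudo=0.1):
    """One EM run for a one-occurrence-per-sequence model, started from a seed word."""
    alphabet = sorted(set("".join(seqs)))
    total = sum(len(s) for s in seqs)
    # Background model: overall letter frequencies in the input.
    bg = {a: sum(s.count(a) for s in seqs) / total for a in alphabet}
    # Initial motif model: the seed letter gets half the probability in each column.
    rest = 0.5 / (len(alphabet) - 1)
    pwm = [{a: (0.5 if a == seed[k] else rest) for a in alphabet} for k in range(width)]
    for _ in range(iters):
        # E-step: posterior probability that the motif starts at each position.
        post = []
        for s in seqs:
            w = [math.prod(pwm[k][s[j + k]] / bg[s[j + k]] for k in range(width))
                 for j in range(len(s) - width + 1)]
            z = sum(w)
            post.append([x / z for x in w])
        # M-step: re-estimate the column frequencies from the expected counts.
        counts = [{a: pseudo for a in alphabet} for _ in range(width)]
        for s, p in zip(seqs, post):
            for j, pj in enumerate(p):
                for k in range(width):
                    counts[k][s[j + k]] += pj
        pwm = [{a: c[a] / sum(c.values()) for a in alphabet} for c in counts]
    # Report the most probable start position in each sequence.
    starts = [max(range(len(p)), key=p.__getitem__) for p in post]
    return pwm, starts


def best_motif(seqs, width):
    """Try every word of the first sequence as a seed and keep the best-scoring run."""
    best = None
    for j in range(len(seqs[0]) - width + 1):
        pwm, starts = em_motif(seqs, width, seqs[0][j:j + width])
        # Score: log-probability of the chosen sites under the fitted motif model.
        score = sum(math.log(pwm[k][s[st + k]])
                    for s, st in zip(seqs, starts) for k in range(width))
        if best is None or score > best[0]:
            best = (score, pwm, starts)
    return best[1], best[2]


# Toy sequences (hypothetical) that share the planted word WKRPF.
toy = ["MASDWKRPFLTE", "GGWKRPFAQSVN", "TLNEQDWKRPFH", "WKRPFMMAGSEL"]
pwm, starts = best_motif(toy, 5)
print(starts, "".join(max(col, key=col.get) for col in pwm))
```

The E-step assigns each candidate position a soft weight rather than a hard choice, and the M-step rebuilds the motif model from those weights; repeating the two steps increases the likelihood of the data under the model. Trying several seeds reduces the risk of settling on a poor local optimum.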

Gibbs motif sampler: This sampler finds motifs, or conserved regions, in both DNA and protein sequences. It was one of the first successful motif-finding algorithms and runs very fast compared with other algorithms. The search technique is based on Gibbs random sampling. Its limitation is that it may sometimes fail to predict true motifs. To overcome this problem, iGibbs (improved Gibbs motif sampler) was developed, in which the motif search is guided by clustering of sequences. When motifs occur in different subsets of the input sequences, this algorithm automatically clusters similar sequences and identifies the motifs from the clusters [7].
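The sampling idea can be sketched as follows. This is a simplified site sampler written for illustration, assuming one motif occurrence per sequence and a uniform background; it is not the published Gibbs motif sampler or iGibbs, and the function name, parameters and toy sequences are hypothetical.

```python
import random


def gibbs_motif(seqs, width, sweeps=200, pseudo=0.1, seed=0):
    """Simplified Gibbs site sampler: one motif start per sequence."""
    rng = random.Random(seed)
    alphabet = sorted(set("".join(seqs)))
    # Start from random motif positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    norm = len(seqs) - 1 + pseudo * len(alphabet)
    for _ in range(sweeps):
        for i, s in enumerate(seqs):
            # Build column counts from every sequence except the one being resampled.
            counts = [{a: pseudo for a in alphabet} for _ in range(width)]
            for h, (t, st) in enumerate(zip(seqs, starts)):
                if h != i:
                    for k in range(width):
                        counts[k][t[st + k]] += 1
            # Weight each candidate start in sequence i by its probability under the profile.
            weights = []
            for j in range(len(s) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= counts[k][s[j + k]] / norm
                weights.append(w)
            # Sample the new start in proportion to those weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy sequences (hypothetical) that share the planted word WKRPF.
toy = ["MASDWKRPFLTE", "GGWKRPFAQSVN", "TLNEQDWKRPFH", "WKRPFMMAGSEL"]
print(gibbs_motif(toy, 5))
```

Because the positions are sampled rather than chosen greedily, the result depends on the random seed and on the number of sweeps; running the sampler several times and comparing the alignments is a common precaution.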

PROMI (Profile Analysis based On Mutual Information): It is a tool that enables comparative analysis of user-classified protein sequences. It is a web server implemented in Perl and R. This tool is useful for our purposes because we can classify our protein sequences into categories based on their translocation mechanism and then determine the sequence motifs common to each group of proteins [8].
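The quantity underlying this kind of analysis, the mutual information between the residue at an aligned position and the class label of the sequence, can be computed as in the sketch below. This illustrates the statistic only and is not PROMI's code; the function name and the class labels are hypothetical.

```python
import math
from collections import Counter


def mutual_information(column, labels):
    """Mutual information (bits) between residues in one aligned column and class labels."""
    n = len(column)
    joint = Counter(zip(column, labels))
    residues = Counter(column)
    classes = Counter(labels)
    return sum((c / n) * math.log2((c / n) / ((residues[r] / n) * (classes[l] / n)))
               for (r, l), c in joint.items())


# Hypothetical classes based on translocation requirement.
labels = ["ATP", "ATP", "Ran", "Ran"]
print(mutual_information("KKRR", labels))  # residue fully determined by class
print(mutual_information("KRKR", labels))  # residue independent of class
```

A column whose residues track the class labels scores high, while a column that varies independently of the classes scores zero, so ranking columns by this value highlights positions that distinguish the groups.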

NucPred: It is a tool for predicting the nuclear localization of proteins. It is based on regular expression matching and multiple program classifiers induced by genetic programming. A likelihood score is derived from the programs for each input sequence and each residue position. Different forms of visualization are provided to assist the detection of nuclear localization signals (NLSs). The NucPred server also provides access to additional sources of biological information (real and predicted) for better validation and interpretation of results. The NucPred core is an ensemble of 100 sequence predictors, each of which makes a Boolean (yes or no) decision on whether an individual protein sequence has a role in the nucleus [9].

WoLF PSORT: It is an update of the PSORT II program. The method exploits the observation that amino acid content correlates strongly with localization site. It uses a weighted version of the k-nearest neighbors (kNN) algorithm. Dually localized proteins can also be predicted, with about 50% prediction confidence. When WoLF PSORT is combined with BLAST, the predictions are reported to reach 83% accuracy. It should be noted, however, that the most common "classical" NLS sequence is a stretch of basic amino acids, so many highly basic proteins can easily be mispredicted to have NLSs [10].
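A nearest-neighbour prediction from amino acid content can be sketched as follows. This is not WoLF PSORT's implementation: it uses amino acid composition as the only feature and distance-weighted voting as one possible reading of "weighted kNN", both of which are assumptions. The function names and the toy training sequences are hypothetical.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def predict_localization(query, training, k=3):
    """Distance-weighted k-nearest-neighbour vote over amino acid composition."""
    q = composition(query)
    nearest = sorted((math.dist(q, composition(seq)), label)
                     for seq, label in training)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + 1e-9)  # closer neighbours count for more
    return max(votes, key=votes.get)


# Toy training set (hypothetical sequences and labels).
training = [
    ("KKRKRKKRPKKR", "nuclear"),
    ("RKKRKRSKKRKK", "nuclear"),
    ("LLIVAVLLGFIL", "membrane"),
    ("AVLLIIGLFVAL", "membrane"),
]
print(predict_localization("PKKKRKVKRRKK", training))
```

The toy example also shows the caveat noted above: any query rich in lysine and arginine lands next to the "nuclear" training sequences, whether or not it carries a functional NLS.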

In addition, there are several other techniques, such as hidden Markov model (HMM) predictions of motifs, GLAM2 and GLAM2SCAN. The InterPro and ELM databases also offer several tools and features that can be used to identify, recognize and annotate protein sequence motifs.


The candidate membrane protein NLS motifs identified by the above methods will need to be validated, i.e. we need to be sure that the motifs are associated with the different groups of nuclear envelope proteins and that they do not derive from contaminant proteins, such as ER proteins. To this end, we can compare the results obtained for the same dataset with different methods. By doing this we can also attempt to standardize a motif searching method for a particular class of proteins. Once the results are consistent across different methods and a set of potential targeting sequences has been identified, these sequences can be prepared in the lab as fusions to proteins not normally resident at the nuclear envelope, to determine whether they are sufficient to direct targeting to the nucleus.
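The cross-method comparison could be as simple as keeping the motifs that several tools report independently. The sketch below assumes that each tool's output has already been reduced to a set of motif strings; the function name, the tool keys and the motifs are hypothetical.

```python
from collections import Counter


def consensus_motifs(results, min_methods=2):
    """Return the motifs reported by at least `min_methods` of the tools."""
    counts = Counter(m for motifs in results.values() for m in set(motifs))
    return {m for m, c in counts.items() if c >= min_methods}


# Hypothetical motif calls from three tools run on the same dataset.
results = {
    "tool_a": {"KRKR", "FGFG", "LLAV"},
    "tool_b": {"KRKR", "FGFG"},
    "tool_c": {"KRKR", "PPGY"},
}
print(sorted(consensus_motifs(results)))
print(sorted(consensus_motifs(results, min_methods=3)))
```

Raising the threshold trades sensitivity for confidence, which is one way to decide which candidates are worth testing as fusion constructs first.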


The expected outcome is the identification of sequence motifs unique to nuclear envelope proteins. Because different proteins use different transport mechanisms, the motifs may differ between groups of proteins, which may lead to an understanding of the transport mechanisms and hence a clearer picture of their pathways. A further outcome is a standardized NLS motif searching technique for a set of proteins.


Familiarizing data handling

Motif search methods

Literature review

Analysis of results from different methods

Validating results

Drawing useful conclusions

Report writing


1. Review: Dynamic interactions of nuclear lamina proteins with chromatin and transcriptional machinery. A. Mattout-Drubezki and Y. Gruenbaum. Department of Genetics, The Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 91904.

2. Comparative proteomic analyses of the nuclear envelope and pore complex suggests a wide range of heretofore unexpected functions. Dzmitry G. Batrakou, Alastair R. W. Kerr and Eric C. Schirmer. Wellcome Trust Centre for Cell Biology, University of Edinburgh, Edinburgh, UK.

3. Peering through the Pore: Nuclear Pore Complex Structure, Assembly, and Function. Mythili Suntharalingam and Susan R. Wente. Department of Cell and Developmental Biology, Vanderbilt University Medical Center, Tennessee.

4. The molecular architecture of the nuclear pore complex. Frank Alber, Svetlana Dokudovskaya, Liesbeth M. Veenhoff et al. doi:10.1038/nature06405

5. A Systems Approach to Determine the Mechanism of Translocation of Transmembrane Proteins to the Inner Nuclear Membrane. Nikolaj Zuleger, David A. Kelly, A. Christine Richardson, Alastair R. W. Kerr, Martin W. Goldberg, Andrew Goryachev and Eric C. Schirmer.

6. MEME: discovering and analyzing DNA and protein sequence motifs. Timothy L. Bailey, Nadya Williams, Chris Misleh and Wilfred W. Li. Institute of Molecular Bioscience, The University of Queensland, St Lucia, QLD 4072, Australia and SDSC, UCSD, La Jolla, CA, USA.

7. iGibbs: improving Gibbs motif sampler for proteins by sequence clustering and iterative pattern sampling. Kim S, Wang Z, Dalkilic M. School of Informatics, Indiana University, Indiana 47408, USA.

8. Species-specific analysis of protein sequence motifs using mutual information. Jan Hummel, Nima Keshvari, Wolfram Weckwerth and Joachim Selbig. Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, D-14424 Potsdam, Germany and University of Potsdam, Institutes of Biochemistry/Biology and Computer Science, c/o Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, D-14424 Potsdam, Germany.

9. NucPred: Predicting nuclear localization of proteins. Markus Brameier, Andrea Krings and Robert M. MacCallum. Bioinformatics Research Center (BiRC), University of Aarhus, 8000 Aarhus C, Denmark; Stockholm Bioinformatics Center (SBC), Stockholm University, 106 91 Stockholm, Sweden; and Division of Cell & Molecular Biology, Imperial College London, South Kensington Campus, London, UK.

10. Protein Subcellular Localization Prediction with WoLF PSORT. Paul Horton. Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan.