Assigning Sub Cellular Localization Biology Essay


Proteins make up the majority of cellular structures and perform most of the crucial functions in the cell, such as catalyzing biochemical reactions, transporting nutrients, and recognizing and transmitting signals. These roles are specified by information encoded in genes. To cooperate toward a common physiological function, proteins must be localized in the same subcellular compartment. The subcellular localization of a protein is therefore regarded as one of its key functional characteristics \cite{Bork98} and has been the target of intensive research by computational biologists. Determining the subcellular location of proteins globally is not only a step towards elucidating a protein's interaction partners, function, and potential role(s) in the cellular machinery, but is also beneficial to drug discovery. Knowledge of a protein's subcellular localization can guide the design of experimental strategies for its functional characterization. Recent advances in large-scale genome sequencing have produced an avalanche of new protein sequences whose functions are unknown. Predicting which proteins are located in a given subcellular compartment helps to select, among the growing number of known sequences, the proteins worth investigating.

Despite many parallel and complementary efforts, subcellular localization has not yet been assigned for any complete mammalian proteome. The first efforts in this direction were experimental.

Experimental methods for determining subcellular localization range from tagging proteins with green fluorescent protein (GFP) or isotopes[1] to immunolocalization. Despite recent technological advances, these methods remain time-consuming and labor-intensive, and they have other limitations as well. The growing volume of experimental data calls for an automated, systematic way to characterize the enormous number of new protein sequences. Computational methods for assigning localization on a proteome-wide scale offer an attractive complement and have become a hot topic in bioinformatics. A number of large-scale subcellular location predictors for newly identified proteins have been developed. These tools can be categorized by the type of data they exploit or by the way they construct prediction rules. Here we categorize them by the data they require.

Methods based on protein sorting signals- This group of methods classifies proteins based on the presence of targeting sequences. The underlying reasoning is that, because sorting signals usually determine protein localization, it is reasonable to recognize the sorting signals and predict localization sites from them. Table \ref{table1} and Figure \ref{fig1} illustrate the common eukaryotic sorting signals.

Table 1- Common eukaryotic sorting signals. Illustration taken from \cite{Emanuelsson02}.

Figure 1- Schematic view of sorting signals, the corresponding final compartments, and reported sequence features. Arrowhead, cleavage site; SP, signal peptide; cTP, chloroplast transit peptide; mTP, mitochondrial targeting peptide; IMS, intermembrane space (in mitochondria); MIP, mitochondrial intermediate peptidase; PTS, peroxisomal targeting signal; aa, amino acids. A = Alanine; x = any amino acid; R = Arginine; M = Methionine; V = Valine; S = Serine; K = Lysine; L = Leucine; H = Histidine. Illustration taken from \cite{Emanuelsson02}.

iPSORT \cite{Bannai02} is a subcellular localization site predictor that uses biologically interpretable rules for N-terminal sorting signals. It predicts the presence of a Signal Peptide (SP), Mitochondrial Targeting Peptide (mTP), or Chloroplast Transit Peptide (cTP).

TargetP \cite{Emanuelsson00} is a neural network-based subcellular localization predictor that uses N-terminal sequence information only. Similar to iPSORT, it discriminates between proteins located in the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations, with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. TargetP outperforms iPSORT in most predictions.

Composition-based methods- The amino acid composition representation of a sequence contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in the entire sequence. Based on the observation that proteins located in the same subcellular compartment have a similar amino acid composition \cite{Nishikawa82}, several algorithms were proposed to predict the subcellular location of a query protein from the amino acid composition of its entire sequence, including Neural Networks (NNs), Hidden Markov Models (HMMs), Support Vector Machines (SVMs), and the covariant discriminant algorithm. \cite{Reinhardt98} constructed a prediction tool for prokaryotic and eukaryotic sequences using supervised neural networks. \cite{Yuan99} proposed an HMM-based method, and \cite{Chou99} used a covariant discriminant algorithm to predict the subcellular localization of prokaryotic sequences.
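The 20-component composition vector described above can be sketched in a few lines of Python. This is only an illustration: the function name `amino_acid_composition`, the alphabetical ordering of the components, and the choice to ignore non-standard residue codes are assumptions, not part of any of the cited methods.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the occurrence frequency of each of the 20 amino acids.

    Residues outside the 20-letter alphabet (e.g. X, B, Z) are ignored,
    which is an illustrative choice rather than a documented convention.
    """
    sequence = sequence.upper()
    counts = Counter(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, not a real protein.
composition = amino_acid_composition("MKKLLSKL")
```

Every sequence, whatever its length, is mapped to a point in the same 20-dimensional space, which is what allows classifiers such as NNs or SVMs to be trained on it directly.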

The performance of all of the aforementioned models has been reported. \cite{Reinhardt98} achieved a total accuracy of 81% for three subcellular locations in prokaryotic sequences and 66% for four locations in eukaryotic sequences. \cite{Yuan99} obtained 89% accuracy for prokaryotic sequences and 73% for eukaryotic sequences. \cite{Chou99} obtained a total accuracy of 87% in a jackknife test on prokaryotic sequences.

The main drawback of representing proteins by their overall amino acid composition is that the sequence-order information is lost. To overcome this shortcoming, the pseudo-amino acid composition (PseAA) was proposed by \cite{Lin09}. The PseAA composition consists of more than 20 components, where the first 20 represent the conventional amino acid composition and the additional components incorporate some sequence-order information via various modes.

Functional-domain-based methods- Just as the cellular localization of a protein is an indicator of its function, the functional description of a protein can be a reasonable indicator of its localization. Methods in this category use this observation and classify proteins by considering the correlation between the function of a protein and its subcellular location. The main difference from the composition-based and sorting-signal methods is the starting point. The composition-based and sorting-signal methods start from the amino acid sequence of the protein, whereas the functional-domain-based methods start from a description of the function of the protein in addition to its (pseudo) amino acid composition. In this context, a protein is represented as a point in a high-dimensional space in which each basis vector is defined by one of the functional domains obtained from the functional domain database, the gene ontology database, or their combination \cite{Chou02}.

Homology-based methods- This category is based on the hypothesis that homologous sequences are likely to share the same subcellular localization \cite{Bork98}. This notion was first studied by \cite{Nair02}. Subsequently, a number of methods tried to determine the subcellular localization of proteins by assessing their homology to proteins of experimentally known localization, including Proteome Analyst (PA) \cite{Szafron04}, which uses the presence or absence of tokens from certain fields of the homologous sequences in the SWISSPROT database to compute features for classification. LOChom \cite{Nair04} is another tool that infers the subcellular localization of proteins through sequence homology. It uses PSI-BLAST \cite{Schaffer01}, \cite{Altschul97} to align a sequence against a localization-annotated database of proteins. If a homologue of the sequence is found, the subcellular localization is transferred from the homologue to the sequence.
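The annotation-transfer idea can be sketched as follows. This is not the LOChom implementation: the function name `transfer_localization`, the input format (a list of `(subject_id, evalue)` pairs such as one might parse from an alignment report), and the E-value cutoff of `1e-3` are all hypothetical choices made for illustration.

```python
def transfer_localization(hits, known_localization, max_evalue=1e-3):
    """Transfer the localization of the best-scoring annotated homologue.

    hits: list of (subject_id, evalue) pairs for one query sequence
          (hypothetical format, e.g. parsed from an alignment search).
    known_localization: dict mapping subject_id -> localization label.
    max_evalue: illustrative significance threshold, not a published value.
    Returns the transferred label, or None if no annotated hit passes.
    """
    candidates = [
        (evalue, subject_id)
        for subject_id, evalue in hits
        if evalue <= max_evalue and subject_id in known_localization
    ]
    if not candidates:
        return None
    _, best_id = min(candidates)  # smallest E-value wins
    return known_localization[best_id]


# Toy example with made-up identifiers.
hits = [("P1", 1e-30), ("P2", 1e-5), ("P3", 0.5)]
annotations = {"P2": "peroxisome", "P3": "cytoplasm"}
label = transfer_localization(hits, annotations)
```

The result depends directly on the chosen threshold: a query with no annotated hit below the cutoff receives no prediction at all.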

Fusion-based models- The methods in this category use several data sources and try to integrate them to improve the performance of the classifier. Our method falls into this category.

\cite{Calvo06} proposed an integrative method to predict mitochondrial localization based on eight genome scale data sets.

Table- Eight individual methods and an integrated approach (named Maestro) were used to predict the mitochondrial localization of all 33,860 Ensembl human proteins. The genome-wide false discovery rate was estimated from large gold standard training data. The false discovery rate for the individual methods is high. Illustration taken from the article by \cite{Calvo06}.

\cite{Emanuelsson03} proposed a method named PeroxiP to classify peroxisomal proteins using amino acid composition, the peroxisomal targeting signal type 1 (PTS1), the nine residues next to the C-terminal tripeptide, and sequence motifs. It consists of a preprocessing module, a motif identification module, and a pattern recognition module. Since the peroxisomal targeting signal is a weak indicator of peroxisomal proteins, the preprocessing module runs the TargetP \cite{Emanuelsson00}, \cite{Nielson97} and TMHMM \cite{Sonnhammer98} predictors to exclude as many potential false positives as possible. The sequences that pass the preprocessing module are classified as peroxisomal or nonperoxisomal based on the presence of sequence motifs. The output of this module can be taken directly as the peroxisomal localization prediction, or it can be passed on to the pattern recognition module. Two simple classifiers were constructed in the motif identification module, "permissive" and "restrictive". The permissive classifier labels sequences whose C-terminal tripeptide matches [ACHKNPST][HKNQRS][AFILMV] as peroxisomal. The restrictive classifier checks for the presence of 32 motifs: AHL, AKA, AKF, AHI, AKL, AKM, AKV, ANL, ARF, ARL, ARM, CKL, HRL, HRM, KKL, NKL, PHL, PKL, PRL, SHL, SKF, SKI, SKL, SKM, SKV, SNL, SQL, SRL, SRM, THL, TKL, TKV. If any of these motifs is present as the C-terminal tripeptide of the sequence, the sequence is labeled as peroxisomal. To improve the performance of the method, a further module was proposed, and the result of the first two stages is fed to this pattern recognition module.
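The two motif classifiers can be written down directly from the pattern and the motif list given above. The following is a minimal sketch, not the PeroxiP code itself; the function names are hypothetical, and the preprocessing step (TargetP and TMHMM filtering) is deliberately left out.

```python
import re

# Permissive PTS1 pattern, copied from the text.
PERMISSIVE_PTS1 = re.compile(r"[ACHKNPST][HKNQRS][AFILMV]")

# The 32 restrictive PTS1 tripeptides, copied from the text.
RESTRICTIVE_PTS1 = frozenset(
    "AHL AKA AKF AHI AKL AKM AKV ANL ARF ARL ARM CKL HRL HRM KKL NKL "
    "PHL PKL PRL SHL SKF SKI SKL SKM SKV SNL SQL SRL SRM THL TKL TKV".split()
)


def c_terminal_tripeptide(sequence):
    """Return the last three residues of a protein sequence, upper-cased."""
    return sequence.upper()[-3:]


def is_permissive(sequence):
    """True if the C-terminal tripeptide matches the permissive pattern."""
    return PERMISSIVE_PTS1.fullmatch(c_terminal_tripeptide(sequence)) is not None


def is_restrictive(sequence):
    """True if the C-terminal tripeptide is one of the 32 listed motifs."""
    return c_terminal_tripeptide(sequence) in RESTRICTIVE_PTS1
```

The permissive pattern admits 8 x 6 x 6 = 288 tripeptides, and every one of the 32 restrictive motifs is among them, so the restrictive classifier accepts a strict subset of what the permissive classifier accepts.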

The machine learning module combines an SVM (Support Vector Machine) and an NN (Neural Network). The nine residues next to the C-terminal tripeptide and the amino acid composition of the entire sequence are the features that represent proteins in the feature space. A standard feed-forward NN with sigmoid neurons and an SVM were trained independently and used to predict the subcellular localization. A sequence is predicted as peroxisomal by this module if either the NN or the SVM predicts so. Figure \ref{PeroxiP} shows the PeroxiP architecture. The performance of the PeroxiP predictor was estimated on the set of all human SWISS-PROT proteins with an annotated subcellular location (SWISS-PROT release 40.17); it reached a sensitivity of 0.50, a specificity of 0.64, and a Matthews correlation coefficient of 0.5. This performance is better than that of general subcellular localization predictors such as PSORT \cite{Nakai92}, \cite{Nakaie97}.

\caption{PeroxiP prediction scheme. The preprocessing module excludes transmembrane and secreted proteins. The motif identification module checks for the restrictive and permissive motifs. Four methods were proposed: M1 and M3 use the pattern recognition module in combination with the restrictive and permissive PTS1 motif, respectively. For M2 and M4 the pattern recognition module is bypassed.}
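Two pieces of the description above lend themselves to a short sketch: the "either NN or SVM" decision rule and the Matthews correlation coefficient used to report performance. The function names are hypothetical, the two classifier outputs are assumed to be already available as booleans, and the confusion-matrix counts in the example are made up rather than taken from the PeroxiP evaluation.

```python
import math


def combine_predictions(nn_says_peroxisomal, svm_says_peroxisomal):
    """Union rule: peroxisomal if either classifier predicts peroxisomal."""
    return nn_says_peroxisomal or svm_says_peroxisomal


def sensitivity(tp, fn):
    """Fraction of true peroxisomal proteins that are recovered."""
    return tp / (tp + fn)


def matthews_correlation(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal count is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts, for illustration only.
example_mcc = matthews_correlation(tp=10, tn=10, fp=0, fn=0)
```

The union rule trades precision for recall: a sequence is rejected only when both classifiers reject it.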

Since our study addresses the same subcellular localization as PeroxiP, we investigated this method further and tried to replicate the study. To this end, we had to reconstruct the data set used to classify proteins by peroxisomal localization. The initial PeroxiP data set contains 152 peroxisomal proteins with a true peroxisomal targeting signal type 1 (PTS1) as well as 308 nonperoxisomal proteins with a PTS1-like C-terminal tripeptide. The data were extracted from Swiss-Prot release 39.27 and are available on the PeroxiP website. This data set could not be used directly to replicate the PeroxiP model, because the manual motif reduction and the redundancy reduction had not been performed on it. To minimize the effect of potential sequencing and annotation errors, three of the 35 PTS1 motifs (see Table 1 of \cite{Emanuelsson02}) were excluded from the accepted set of motifs: proteins with a C-terminal tripeptide in which one of the three positions contained an amino acid found only once at that position in the entire set of 152 peroxisomal proteins were removed. \cite{Emanuelsson02} concluded: "In total, this resulted in the exclusion of three motifs, -YRM, -ASL, and -ARY". Motif -AKA contains the only A in the final position and, according to this constraint, should also be excluded, but it is not removed by \cite{Emanuelsson03}. To keep our data set and prediction rules close to the PeroxiP model, we did not remove this motif from our list of motifs either. The motif reduction shrinks the set of known proteins with an accepted PTS1 from 152 to 149 proteins, and the set of nonperoxisomal proteins with a PTS1-like C-terminal tripeptide from 308 to 271.

Subsequent redundancy reduction brought the number of peroxisomal sequences down to 91 and the number of nonperoxisomal sequences to 156. The PeroxiP data set includes 90 peroxisomal and 151 nonperoxisomal sequences, which differs only slightly from our data set.

Another subcellular localization predictor that specializes in predicting peroxisomal targeting is PTS1Prowler \cite{Hawkins07}. Its prediction scheme is similar to that of PeroxiP and consists of three stages. First, it filters out sequences whose C-terminal tripeptide does not occur among peroxisomal proteins in SWISS-PROT R45. Next, an SVM classifier is constructed based on the 12 C-terminal residues and the amino acid composition of the entire sequence. In the last stage, secreted proteins are detected with the PProwler localization predictor and filtered out of the final set of candidate peroxisomal proteins.

Efficiency and success depend not only on the prediction method but also on the input data. The PTS1Prowler data set was extracted from SWISS-PROT R25, following a process similar to that of PeroxiP.

Methods Limitations:

The obvious disadvantage of the amino acid composition method is the loss of sequence-order information when an amino acid sequence is transformed into the 20-dimensional amino acid composition space. Some order information can be included by applying the pseudo-amino acid composition method. Looking at existing subcellular localization predictors, however, one can conclude that (pseudo) amino acid composition alone is not enough to construct an efficient predictor. Another category contains methods that make use of sorting signals. Subcellular localization prediction methods that depend on sorting signals will be inaccurate when the signals are missing or only partially included. Even when a sorting signal is present, it can be a weak indicator of protein localization. For example, for the peroxisomal targeting signal type 1 (PTS1), SWISS-PROT R40 contains approximately twice as many proteins with a PTS1-like signal at their C-terminus as truly peroxisome-located proteins with a PTS1 signal \cite{Emanuelsson03}.

The next category comprises the functional-domain-based methods. A notable advantage of the functional domain composition representation is that it uses the functional domain database to incorporate information not only about some sequence-order effects but also about structural and functional types. Using functional domain databases is not risk-free, however. For example, since one SWISS-PROT entry might describe various versions of a given protein, an annotation-based automatic assignment of subcellular localization might assign a protein to several cellular compartments. There are single sequences with multiple annotations \cite{Eisenhaber99}. A major drawback of the functional-domain-based methods is the lack of a complete functional domain database and, as a consequence, the lack of a training set. This approach has room to improve, and as functional domain databases progress it is likely to be extended and used more frequently. Another category is the homology-based methods. Their disadvantage is that their accuracy depends on the thresholds used for annotation transfer, and establishing an accurate threshold requires an exhaustive study of the sequence conservation of subcellular localization. Even with a well-chosen threshold, homology-based methods are limited: more than half of new genes have no significant homology with any gene of known function, so predicting their subcellular localization based only on the existence of homologous sequences is impossible.

The final category is the fusion-based methods. Their use has been limited by their demand for high-quality genome-scale data sets and training data.


Predicting subcellular localization is more complicated for peroxisomal proteins than for some other classes, such as mitochondrial, transmembrane or secreted proteins, due to the scarce data and the (lack of) complexity in the sorting signal.

In particular, the PTS1 signal is reported to be a weak signal. Many proteins that are not located in the peroxisome still contain a signal-like motif: SWISS-PROT (release 40) contains approximately twice as many nonperoxisomal proteins with a PTS1-like signal at their C-terminus as peroxisomal proteins with a true PTS1.

PTS2, the peroxisomal sorting signal type 2, is a complex signal, and there is not enough experimental data available to identify it. State-of-the-art methods for predicting the peroxisomal localization of proteins, such as PeroxiP \cite{Emanuelsson03} and PTS1Prowler \cite{Hawkins07}, exclude it from their prediction schemes.

The goal of this project is to integrate various genomic data sources, such as gene expression data, sequence data, targeting signals, phylogenetic profiles, and protein domains. The integrative approach combines the weak and complementary information related to peroxisomal localization in each of these data sets into a strong classification method for identifying novel peroxisomal proteins.

Identification of peroxisomal proteins with supervised learning is further complicated by the fact that only a small set of positive examples is available: for human and mouse, about 80 proteins are known to be peroxisomal. In this project, semi-supervised techniques will be investigated in order to take advantage of the large amount of unlabeled data available.


In this chapter, we introduce the different data types that might contain information about the peroxisome. In the following sections we explain the process of gathering relevant data that provide complementary clues about peroxisomal proteins, such as microarray data, sequence data, protein domains (PFAM and/or InterPro), and mass spectrometry.

I document how these data are gathered and processed for use by the semi-supervised learner.

\subsection{Sequence data}

Sequence data is the classic molecular biology data type. Proteins can be represented as variable-length sequences over the alphabet of 20 amino acids. A sequence is typically 10 to 1000 amino acids long and is known as the primary structure of the protein. Information about various protein sequences, and about the functional roles of the respective proteins, can be found in UniProtKB. UniProtKB is a protein database that consists of two parts:

1-Swiss-Prot, which is manually annotated and reviewed.

2-TrEMBL, which is automatically annotated and is not reviewed.

Nowadays, sequencing entire genomes has become almost routine. We study Mus musculus, whose peroxisomal proteome was studied by \cite{Wiese07}. I obtained the list of peroxisomal proteins for this organism from the available resources. For Mus musculus there are two reliable resources: the UniProt website and the study by Wiese et al. \cite{Wiese07}.

\subsection{Microarray data}

Although a full set of identical genes is present in every cell, only a fraction of these genes is active, or expressed. The kinds and amounts of genes expressed in a cell at a particular time depend on its function and condition at that point in time. In a tissue sample, the expression of a gene can be measured by the amount of transcribed RNA encoded by that gene. Microarray technology offers an efficient tool for this measurement; it enables scientists to examine the expression level of thousands of genes simultaneously. In the following, we describe the principles of microarray technology and its applications. The DNA molecules or oligonucleotides corresponding to the genes whose expression is to be analysed are called probes. They are attached in an ordered fashion to a solid surface, which can be a nylon membrane, a quartz wafer or a glass slide. The available techniques for placing probes on the microarray slide make it possible to produce arrays with several thousand genes (i.e. a substantial part of the genome) represented on a few square centimeters. These techniques differ from one manufacturer to another, but the two main techniques are: 1- miniaturisation and automation of array production with robotic spotters; 2- in situ synthesis of oligonucleotides.

The measurement of the abundance of the corresponding transcripts begins by reverse transcribing the mRNAs of a cell sample to cDNA. The cDNAs are first labeled with a fluorescent or radioactive marker and then hybridized with the array. The intensity of the hybridization signal is proportional to the respective mRNA concentration in the cell. After the array is washed, the concentration can be determined by measuring the intensity of the signal emitted by the molecular labels.

Microarray data contain much noise due to unstable experimental conditions, such as the hybridization procedure, the use of different labeling dyes, chip plate effects and scanning factors. Therefore, a number of preprocessing steps, mainly image analysis and normalization, are performed on the microarray data. Image analysis is applied to the raw microarray data to extract the intensity value of each probe in the microarray.

Various microarray image analysis methods are available, but they generally consist of similar steps. They start by identifying the probe spots on the microarray scans and then extract the foreground and background intensities for each channel. The background intensities are used to correct the foreground intensities in order to produce correct probe value estimates. After that, normalization techniques are applied to the data to reduce systematic errors. This step is necessary to ensure that the conclusions drawn from the analysis are based on underlying biological differences between the experimental samples and not on technical variation. Normalization methods are applied between the different microarrays as well as within each microarray.

The expression profile, or transcriptome, refers to the complete collection of mRNAs present. Comparing the hybridization signals for diverse mRNA samples thus allows changes in mRNA levels to be determined, under the conditions tested, for all the genes represented on the arrays. The purpose of array experiments and transcriptome characterisation is to address biological questions, and this can be done at various levels of complexity: at the gene level, to examine the behavior of individual genes, or at the pathway level.

We integrate the outcome of microarray experiments with other data sources to identify new peroxisomal proteins. Current methods for analysing microarray experiments are based on the hypothesis that genes sharing a function or subcellular localization show similar expression profiles across a set of conditions.

From a number of microarray experiments, a set of experiments can be constructed that allows the user to follow the relative amount of mRNA under various experimental conditions. Microarray data consist of the files of the scanned microarrays and additional information about the probes, sample identifiers, hybridization details and manufacturing. Data are often transformed to a logarithmic scale, so that overexpressed genes are assigned positive values and underexpressed genes negative values. Microarray data are usually presented in an n x m expression matrix, with n being the number of genes in the microarray and m the number of samples (Figure \ref{fig:microarray}).

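\begin{tabular}{l|cccc}
 & Experiment 1 & Experiment 2 & $\cdots$ & Experiment $m$ \\
\hline
Gene 1 & $\log_2(\mathrm{Ratio}_{1,1})$ & $\log_2(\mathrm{Ratio}_{1,2})$ & $\cdots$ & $\log_2(\mathrm{Ratio}_{1,m})$ \\
Gene 2 & $\log_2(\mathrm{Ratio}_{2,1})$ & $\log_2(\mathrm{Ratio}_{2,2})$ & $\cdots$ & $\log_2(\mathrm{Ratio}_{2,m})$ \\
$\vdots$ & $\vdots$ & $\vdots$ & $\ddots$ & $\vdots$ \\
Gene $n$ & $\log_2(\mathrm{Ratio}_{n,1})$ & $\log_2(\mathrm{Ratio}_{n,2})$ & $\cdots$ & $\log_2(\mathrm{Ratio}_{n,m})$ \\
\end{tabular}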

Figure 3: Microarray gene expression matrix. The rows correspond to the genes in the microarray and the columns to the samples. The expression profile of Gene 1 is the first row, and the expression profile of Experiment (sample) 1 is the first column.

The entry $x_{ij}$ in the expression matrix represents the expression of gene i in sample j. A single row of the expression matrix represents the expression profile of that gene across all samples, while a single column represents the expression profile of all genes for the corresponding sample. The expression matrix has high dimensionality; it usually contains tens of thousands of genes and only a few dozen samples. The high cost of microarray experiments and the difficulty of acquiring samples are the main reasons for the small number of samples. Further analysis of the microarray data is performed on this matrix.
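As a small illustration of the matrix layout just described, the sketch below builds a log2-ratio expression matrix from two intensity tables and extracts a gene profile (row) and a sample profile (column). The function names, the nested-list representation, and the toy intensities are assumptions made for the example; they are not the data format used in the study.

```python
import math


def log2_ratio_matrix(sample_intensity, reference_intensity):
    """Build an n x m matrix of log2(sample / reference) values.

    Both inputs are n x m nested lists of positive intensities
    (rows = genes, columns = experiments); this layout is an assumption.
    """
    return [
        [math.log2(s / r) for s, r in zip(sample_row, reference_row)]
        for sample_row, reference_row in zip(sample_intensity, reference_intensity)
    ]


def gene_profile(matrix, i):
    """Expression profile of gene i across all samples (row i)."""
    return matrix[i]


def sample_profile(matrix, j):
    """Expression profile of all genes in sample j (column j)."""
    return [row[j] for row in matrix]


# Toy intensities for 2 genes in 2 experiments.
sample = [[200.0, 50.0], [100.0, 100.0]]
reference = [[100.0, 100.0], [100.0, 100.0]]
x = log2_ratio_matrix(sample, reference)
```

On the log2 scale a two-fold increase and a two-fold decrease are symmetric around zero, which is why overexpressed genes receive positive values and underexpressed genes negative values.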

The microarray data are the result of a joint European study on peroxisomes. The experiments were performed by different partners, among them the Bioinformatics Laboratory of the Academic Medical Centre of the University of Amsterdam. Failure in the biogenesis of peroxisomes, or deficiency in the function of a single peroxisomal protein, leads to serious human diseases such as Refsum disease, RCDP, hypotonia, Zellweger syndrome and many others. Several experiments were initiated based on the available clinical studies of these diseases. Some of this evidence is mentioned here for Refsum disease, RCDP and Zellweger syndrome. Mutations in two genes have been identified in Refsum disease: PHYH, the gene that encodes phytanoyl-CoA hydroxylase, is mutated in more than 90% of individuals, and PEX7, the gene that encodes the PTS2 receptor, is mutated in fewer than 10% of individuals. Molecular genetic testing of the PHYH and PEX7 genes detects mutations in more than 95% of affected individuals and is available on a clinical basis. Recent studies have shown that type I RCDP is caused by mutations in PEX7. The most severe peroxisomal biogenesis disorder is Zellweger syndrome, which is characterized by a reduction or absence of peroxisomes in the cells of the liver, kidneys, and brain. It has been shown that following certain diets has therapeutic benefits for patients with one of these diseases. Thus tissue, diet, and one or more genes all play a role in peroxisomal disorders or in their treatment. In this study the kinds and amounts of mRNA produced by a cell were measured, which in turn provides insight into how the cell responds to its changing needs or to environmental stimuli.

Genome-wide expression was measured in 162 different conditions and stored on a log2 scale. Each experiment differs in one or more conditions: genotype, diet, age, and tissue. The genotype can be knockout or wild type. The mice were sacrificed after 2 or 12 days or after 3, 5 or 7 months, depending on when the specific disease is expected to manifest; Zellweger syndrome, for example, manifests itself in early infancy, and therefore the mice used for studying this disease were sacrificed after 12 days. After the mice were sacrificed, testis, kidney, cortex, medulla, cerebellum, heart and liver were removed immediately to prepare tissue homogenates. The experiments often compare knockout (KO) mice with normal (WT = wild type) mice; the knockout gene can be one of the following: Pex5, Pex7, Phyh, Amacr, Decr1, Mfp1, Mfp2 or Thiolase B. The mice were fed according to different patterns and diets: phytol, normal (chow), or high fat, or they were fasted for 24 hours before being sacrificed \cite{Ilkka09}. Figure \ref{Microarray_experiments} summarizes the experiments.





\caption{Microarray experiments overview}\label{Microarray_experiments}




Figure \ref{fig:part of metaboleme} shows the knockout genes in their metabolic pathways in the cell. Pex7 is the PTS2 receptor and Pex5 is the PTS1 receptor. These two genes are involved in the biogenesis and maintenance of peroxisomes, whereas the other knockout genes encode matrix proteins that are mostly responsible for metabolic activities such as alpha-oxidation (AMACR, PHYH), beta-oxidation (LBPDBP, Thiolase B), and 2,4-dienoyl-Coenzyme A reductase activity (DECR1). LBPDBP represents Mfp1, Mfp2 and Mfp1/Mfp2 and is a synonym of EHHADH. DECR1 encodes a protein from the family called PDCR. Thiolase B is represented as ACAA1 in Figure \ref{fig:knock out gene}.

\caption{Part of the complete schematic view of the Mus musculus metabolic pathways. The knockout genes are tagged in the picture. Some of the genes are represented by the name of their protein family (DECR1 and Thiolase B) or by a name other than the one used in the experiments (LBPDBP is represented as EHHADH). Illustration taken from the PeroxisomeDB website \cite{Schluter10}.}\label{fig:knock out gene}

The majority of peroxisomal matrix proteins contain a C-terminal PTS1, and a minority an N-terminal PTS2. The PTS1- or PTS2-containing matrix proteins are recognized in the cytosol by soluble receptors (PTS1 by Pex5p, PTS2 by Pex7p and its coreceptors), which guide them to a docking site at the peroxisomal membrane. Thus, a lack of Pex5 or Pex7 causes the peroxisomal matrix proteins to remain in the cytosol, where they cannot function or are degraded. In this situation, one can expect the expression profiles of the peroxisomal proteins to be remarkably alike.

For each experiment we calculated the Pearson correlation between every pair of genes within the peroxisomal data set, as well as between every gene in the peroxisomal data set and every gene in the nonperoxisomal data set. These correlations were then normalized using Fisher's Z-transform, which maps a correlation r to a Z-score such that the collection of pairwise Z-scores within a data set is approximately normally distributed.
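For reference, the standard form of Fisher's transform is given below. The symbols $r$ (a Pearson correlation) and $z$ (its transformed value) follow the sentence above; the formula is the textbook definition and is not copied from the original manuscript, whose own equation did not survive.

```latex
\[
  z \;=\; \frac{1}{2}\,\ln\!\left(\frac{1+r}{1-r}\right) \;=\; \mathrm{artanh}(r),
  \qquad -1 < r < 1 .
\]
```

The transform is undefined at $r = \pm 1$, so perfectly correlated pairs have to be clipped or excluded before it is applied.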

We further transform the Z-scores to N(0,1) by subtracting the data set mean and dividing by the data set standard deviation, which makes cross-data-set analyses more robust.
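The three steps described here (Pearson correlation, Fisher's transform, standardization) can be sketched with the Python standard library alone. The function names, the clipping constant `eps`, and the toy expression profiles are assumptions for illustration; the original analysis may have handled edge cases differently.

```python
import math
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r, eps=1e-12):
    """Fisher's transform z = artanh(r); r is clipped away from +/-1
    (the clipping is an illustrative safeguard, not from the source)."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Toy profiles (made-up numbers) for three genes over four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
z_scores = [
    fisher_z(pearson(profiles[g], profiles[h]))
    for g, h in combinations(profiles, 2)
]
normalized = standardize(z_scores)
```

After standardization the scores from different data sets share a common scale, so their means can be compared across experiments.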

Table \ref{table:Microarray} shows the means of the resulting transformed values. The third column shows the difference between the first and second columns. As expected, Pex7 and Pex5 show the largest differences.