Bioinformatic Prediction Of Regulatory Networks Biology Essay


Marine sponges are well known for their highly complex associations with bacteria. They are also known to produce a variety of secondary metabolites, and it is thought that sponge-associated bacteria play a role in the production of these compounds. To investigate this, various bacterial specimens were isolated from the marine sponge Haliclona simulans in a bio-discovery program, among them Streptomyces strains SM2 and SM8. These strains were found to have potent antibacterial and antifungal activity, and hence the genome sequence of each was obtained and annotated using the automatic gene calling software IMG (Markowitz et al., 2009).

The focus of this project is the strain Streptomyces SM8, and more specifically the secondary metabolite clusters within SM8, which were identified previously through comparison against known secondary metabolite clusters in other Streptomyces species. That earlier study led to the discovery of 19 potential secondary metabolism clusters in the SM8 strain. These clusters, also called "cryptic clusters", are believed to encode many potential antibiotic compounds that are not produced under standard laboratory conditions. The aim is, through bioinformatic analysis, to develop hypotheses to guide experimental approaches for the production or overproduction of these "cryptic" metabolites.

The capacity to produce these metabolites has useful implications in medicine, especially given the increased need for novel antibiotics due to the rapid spread of bacterial resistance and the emergence of multi-resistant pathogenic strains, which pose severe clinical problems in the treatment of infectious diseases (Dickschat, 2011).

Chapter 1


The focus of this project was on:

Identifying regulatory genes in Streptomyces SM8 and associated secondary metabolism clusters.

Identifying global DNA motifs in Streptomyces SM8 within secondary metabolism clusters and individual motifs within each secondary metabolism cluster.

Defining regulatory pathways within secondary metabolism clusters and methods of regulation of products using well studied Streptomyces species as a guide.

Suggesting experimental hypotheses for the production/overproduction of individual secondary metabolites of interest.

Chapter 2


Streptomyces are Gram-positive Actinobacteria, with over 500 species described to date. They are commonly found in soil but also occur in marine environments, and they contain a linear chromosome with a high GC content, generally over 73% (Seghezzi et al., 2011). They are extremely well studied because of their complex secondary metabolism: they produce over two-thirds of the clinical antibiotics of natural origin, as well as parasiticides, herbicides, and pharmacologically active substances, including antitumor agents and immunosuppressants (Bentley et al., 2002; Ohnishi et al., 2008).

Although the function of many of the secondary metabolites produced by Streptomyces is not known, it is believed that many important metabolites that could benefit human health and disease control are not produced under normal laboratory conditions.

Importance of secondary metabolites

Though secondary metabolites are not directly necessary to a cell's survival, they often confer competitive advantages upon the bacteria. Genome sequencing of bacteria such as Streptomyces species has shown that the potential for secondary metabolite production is much greater than currently recognised. Many secondary metabolites that have been predicted bioinformatically remain silent under normal laboratory conditions (Brakhage and Schroeckh, 2011).

Primary metabolic pathways lead to few end products that are only of use in the general function of the bacterium but are vital to the survival of the cell. These primary metabolites rarely have significant uses outside of the cell.

On the other hand, secondary metabolism synthesises compounds that are not essential to the survival of the cell but are produced when the cell is not operating under optimum conditions. Relatively few microbial types produce the majority of secondary metabolites, with Streptomyces contributing a large number of the medically significant ones. For example, Streptomyces griseus alone produces more than 50 antibiotics. One of the first of these to be identified and put to use was the antituberculosis agent streptomycin, the first aminoglycoside antibiotic, discovered more than 60 years ago.

Figure Structure of Streptomycin

Secondary metabolites, though previously believed to be few in number, have been found to be much more abundant since the genomes of many Streptomyces species were sequenced. The S. coelicolor genome contains 25 secondary metabolism clusters (Bentley et al., 2002), S. avermitilis contains 30 such clusters (Ikeda et al., 2003) and S. griseus contains the highest number, with 34 secondary metabolite clusters identified. These clusters account for 4-6% of the whole genome, which underlines the importance of secondary metabolism for these organisms.

Since the discovery of streptomycin, many more antibiotics have been found to be associated with Streptomyces species. Even though tubercle bacilli soon became resistant to streptomycin (it has since been replaced by para-amino-salicylic acid), streptomycetes remain a very important bacterial source of antibiotics and cytostatics. Due to the emerging resistance of bacteria to common antibiotics, new technologies such as combinatorial biosynthesis are being used to produce novel metabolites in streptomycetes. This technology combines genes from different biosynthetic pathways to produce modified metabolites.

The advent of bioinformatics approaches has led to the discovery of further potentially medically significant molecules, associated with secondary metabolism, which to date have not been expressed in normal laboratory conditions (Ohnishi et al., 2008).

Regulation of secondary metabolites in other Streptomyces species

The synthesis of secondary metabolites in laboratory conditions generally occurs in the growth phase but can also be strongly influenced by the environmental conditions that would normally trigger their production in nature, be it stress on the organism or another change in conditions. The expression of these secondary metabolite gene clusters is controlled by a number of different families of regulatory genes, some of which are found only in actinomycetes (Bibb and Hesketh, 2009; Bibb, 2005).

The production of secondary metabolites has been seen to coincide with or precede the development of aerial hyphae in surface-grown cultures, or stationary phase in liquid cultures. The clusters involved contain pathway-specific regulatory genes that must be active for their associated secondary metabolites to be produced, allowing stringent control of production.

Figure The A-factor regulatory cascade of Streptomyces griseus (Bibb, 2005)

As the diagrammatic representation of the A-factor cascade shows, the regulation of streptomycin involves a number of regulators that repress or activate other regulators, ultimately leading to the production of streptomycin.

The complexity of these pathways can be enormous and in many cases is still not fully understood. Understanding them fully requires both bioinformatic analysis and wet-lab testing.

Isolation and sequencing of Streptomyces SM8 and identification of secondary metabolism clusters

Streptomyces SM8 was originally isolated from the marine sponge Haliclona simulans. The strain was found to have antifungal and antibacterial activity against E. coli, C. glabrata and B. subtilis, and chemical analyses indicated the presence of previously unknown compounds, as shown in Table 1 (Kennedy et al., 2009).

Test Strain
E. coli NCIMB 112210
C. glabrata CBS138
B. subtilis 1A40
L. mo F2365
S. pneu Tigr
VISA 35403
hVISA 22900
P. aer PAO1

Table 1 Activity of SM8 using deferred antagonism assays

Sequencing of the SM8 genome was carried out prior to this study using Roche 454 Titanium pyrosequencing, giving a total of 534 contigs. These were then annotated using the automated gene calling software IMG/ER. The antimicrobial activity shown in the assays carried out by Kennedy et al. (Table 1) suggested the possibility of cryptic clusters. To investigate this, the genome was compared against previously studied Streptomyces species to identify putative secondary metabolism clusters.

This comparison resulted in the identification of 19 putative secondary metabolism clusters, the majority of which were identified from Streptomyces S4, which had previously been studied for the presence of secondary metabolism clusters (Brakhage and Schroeckh, 2011). The clusters identified are shown in Table 2.

| Predicted secondary metabolite | Genome coordinates - Scaffold Numbers | Closest homologue |
| | | Streptomyces S4, Scaffold 8, 3878554-3911349 |
| | | Streptomyces S4, Scaffold 6, 410147-419826 |
| Hopene/squalene synthase | | Streptomyces S4, Scaffold 8, 588141-598581 |
| | | Streptomyces S4, Scaffold 8 |
| | | Streptomyces S4, Scaffold 8 |
| | | Streptomyces S4, Scaffold 5 |
| | | Streptomyces S4, Scaffold 6, 81953-106578 |
| PKS/NRPS hybrid | | Streptomyces S4, Scaffold 6, 7264-45109 |
| | | Streptomyces S4, Scaffold 8, 4240081-4309220 |
| | | Streptomyces S4, Scaffold 8, 1719586-1721878 |
| | | Streptomyces S4, Scaffold 6, 295706-300701 |
| PKS/NRPS hybrid | | Streptomyces S4, Scaffold 8, 503893-520001 |
| | | Streptomyces S4, Scaffold 8, 276268-301035 |
| | | Streptomyces S4, Scaffold 8, 3930113-3950474 |
| Unknown small NRPS-like protein | | Streptomyces S4, Scaffold 5 |
| | | Streptomyces S4, Scaffold 6, 65083-81878 |

Chapter 3

Materials and Methods

3.1 Files used

The contig files used for this analysis were obtained through the automated annotation pipeline IMG/ER (Markowitz et al., 2009). The sequences were automatically annotated using the "Prodigal" annotation pipeline with IMG/ER (Hyatt et al., 2010). Databases were manually created for promoter regions within each of the secondary metabolite clusters, regulators across the entire genome of SM8 and an individual database for regulators within the secondary metabolism clusters.

To build the database of promoter regions, each contig was visually inspected within IMG/ER, and genes or gene clusters believed to have promoter regions were entered into the database. The start position of each gene or gene cluster was taken, and Excel was used to extrapolate a promoter region running from 500bp upstream to 100bp downstream of the start of the gene, to allow for the discovery of possible DNA regulatory motifs within promoter sequences.
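The window arithmetic described above was done in Excel; the same calculation can be sketched in Python. This is a minimal illustration, not the original workflow. It assumes 1-based inclusive contig coordinates and assumes that for a gene on the negative strand the start codon sits at the gene's highest coordinate. The function name `promoter_window` and its parameters are hypothetical.

```python
def promoter_window(gene_start, gene_end, strand, contig_length,
                    upstream=500, downstream=100):
    """Return (start, end) of the promoter window, 1-based inclusive.

    gene_start < gene_end are contig coordinates regardless of strand.
    """
    if strand == "+":
        # Start codon sits at gene_start; upstream means lower coordinates.
        start = gene_start - upstream
        end = gene_start + downstream - 1
    elif strand == "-":
        # Start codon sits at gene_end; upstream means higher coordinates.
        start = gene_end - downstream + 1
        end = gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip to the contig so genes near a contig edge stay in range.
    return max(1, start), min(contig_length, end)


print(promoter_window(1000, 1900, "+", 10000))
print(promoter_window(1000, 2000, "-", 10000))
```

Away from contig edges each window is 600bp long. Genes close to the end of a contig yield a shorter window, which matters for a draft assembly split over many contigs.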

Figure IMG representation of genes within an individual contig

For the database of regulatory genes the IMG annotation for the SM8 genome was scanned for regulatory genes and these were extracted to an Excel database which contained Pfam and COG information and the start and end position of each gene along with the strand on which the gene was located.

3.2 Python scripting

To allow for analysis of the databases, Python scripts were created that read each database in the form of a CSV file. A separate script was written for each database to allow for the correct analysis of each dataset.

The Python scripts were used to build FASTA files from each dataset, taking the start and end position of the gene in question along with the contig on which the gene is located. Each script extracted the corresponding portion of DNA from the contig and, depending on whether the gene was located on the positive or negative strand, either kept the sequence as is or took its reverse complement. The sequence was then appended to the FASTA file with the appropriate annotation.

The scripts also wrote the extracted sequence back to the CSV file, so that the sequence of each gene or promoter region was stored in the database, making any further analysis easier.

| Scaffold Number | Promoter Start | Promoter End | Gene Start Codon | Gene No | Secondary Metabolism Cluster |

Table 3 CSV file format for Python
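The FASTA-building step described in this section can be sketched as follows. This is an illustrative reconstruction, not the scripts actually used. The column names follow the CSV format table where possible, while the "Strand" column, the function names and the in-memory demonstration data are assumptions made for the example.

```python
import csv
import io

# Complement table for DNA, including N for gaps in the assembly.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(contig_seq, start, end, strand):
    """Cut a 1-based inclusive region from a contig, oriented by strand."""
    region = contig_seq[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


def csv_to_fasta(csv_handle, contigs, fasta_handle, width=60):
    """Write one FASTA record per CSV row.

    `contigs` maps scaffold name -> sequence string.
    The "Strand" column is an assumed addition to the CSV layout.
    """
    for row in csv.DictReader(csv_handle):
        seq = extract_region(contigs[row["Scaffold Number"]],
                             int(row["Promoter Start"]),
                             int(row["Promoter End"]),
                             row["Strand"])
        fasta_handle.write(f">{row['Gene No']} {row['Scaffold Number']}:"
                           f"{row['Promoter Start']}-{row['Promoter End']}"
                           f"({row['Strand']})\n")
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        for i in range(0, len(seq), width):
            fasta_handle.write(seq[i:i + width] + "\n")


# Tiny in-memory demonstration with made-up data.
demo_contigs = {"contig1": "AAACCCGGGTTT"}
demo_csv = io.StringIO(
    "Scaffold Number,Promoter Start,Promoter End,Strand,Gene No\n"
    "contig1,1,6,+,g1\n"
    "contig1,1,4,-,g2\n"
)
demo_out = io.StringIO()
csv_to_fasta(demo_csv, demo_contigs, demo_out)
print(demo_out.getvalue())
```

Keeping the reverse-complement step in one small function makes it easy to check by hand on a short sequence before running it over several hundred contigs.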

3.3 BLAST analysis of secondary metabolism regulatory genes

The gene annotation returned by IMG/ER can in some cases be flawed because of limitations of the system, such as the large databases it has to search to identify genes within the genome. As a result, the annotation of some genes can be incomplete, with the gene calling software identifying a gene as encoding a hypothetical protein, or can assign a product that is far from the actual product when checked against closely related bacteria.

To test for and correct these errors, each regulatory gene putatively involved in secondary metabolism was run through BLAST individually against a more constrained subset, restricted to Streptomyces species. This allowed for a more accurate identification of the genes and allowed the regulatory mechanisms to be more easily identified and correlated with known regulatory pathways in well-studied strains such as Streptomyces coelicolor or Streptomyces griseus.

BLAST (Basic Local Alignment Search Tool) searches for regions of local similarity between a query sequence and a database of sequences, in this case the Streptomyces species. It then calculates the statistical significance of the matches, allowing the best possible match within the search to be selected (Altschul et al., 1990).
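As a rough illustration of how the statistical significance of a match is expressed, the sketch below computes a bit score and an E-value from a raw alignment score using the standard Karlin-Altschul relations, E = K·m·n·exp(-λS) and S' = (λS - ln K)/ln 2. The values of λ and K used here are illustrative assumptions for a protein scoring system, and the query and database lengths are invented. BLAST itself also applies effective-length corrections that are omitted here.

```python
import math

# Illustrative Karlin-Altschul parameters (assumed, not taken from the essay).
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Convert a raw alignment score to a normalised bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical example: a 300-residue query against 10 million residues.
for raw in (50, 100, 150):
    print(raw, round(bit_score(raw), 1), e_value(raw, 300, 10_000_000))
```

Because the E-value scales with database size, restricting the search to Streptomyces sequences lowers the E-value attached to any given alignment score compared with a search against a full database.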

3.4 MEME analysis for the identification of putative DNA Motifs

In the past, binding sites were typically determined through DNase footprinting and gel-shift or reporter construct assays. With the advent of computational techniques, however, the discovery of putative DNA motifs has become much easier. The MEME (Multiple EM for Motif Elicitation) Suite is a software toolkit with a unified web server interface that enables users to perform four types of motif analysis: motif discovery, motif-motif database searching, motif-sequence database searching and assignment of function. The MEME algorithm is widely used for the identification of both DNA and protein sequence motifs (Bailey and Gribskov, 1998).

Figure Overview of MEME suite tools

The MEME suite of web-based tools was used for the identification of DNA motifs present within the promoter regions of genes involved in secondary metabolism. Initially, testing was carried out to identify the optimal consensus sequence length. This was done by running the analysis with varying threshold limits to allow the most significant sequence within each training set to be identified. From this analysis a consensus sequence length of 15bp was found to be optimal.

The MEME suite takes files in the FASTA format, which had previously been built using the Python scripts. Initially the aim was to scan all secondary metabolite promoter regions together using MEME, but because of restrictions on the size of files that can be submitted (60,000 characters) this was not possible without using MEME-ChIP. This algorithm allows larger sets of data to be analysed; however, it cuts each sequence down to its central 100bp (Machanick and Bailey, 2011). This was not sufficient, as the 600bp promoter region selected for each gene within the secondary metabolism clusters may contain significant DNA motifs on either side of the central 100bp and will most likely contain motifs in the 100bp upstream of the start of the gene.

To avoid this problem each promoter sequence was cut into 6 different sections, 100bp downstream, 100bp upstream, 100-200bp upstream, 200-300bp upstream, 300-400bp upstream and 400-500bp upstream. FASTA files were then again built for each of these sections covering all genes involved in secondary metabolism and run through the MEME algorithm for the identification of significant DNA motifs. This allowed for the analysis of global DNA motifs that are involved within all of the secondary metabolism clusters.
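The sectioning step above can be sketched as a small Python function. It is a sketch under stated assumptions: the promoter sequence is assumed to read 5' to 3' on the coding strand, starting 500bp upstream of the gene start and ending 100bp downstream, and the upstream length is assumed to be a multiple of the window size. The function name and the section labels are hypothetical.

```python
def split_promoter(seq, upstream=500, downstream=100, window=100):
    """Split an upstream+downstream promoter sequence into labelled windows.

    `seq` is assumed to begin `upstream` bp before the gene start and to
    end `downstream` bp after it, with `upstream` a multiple of `window`.
    """
    if len(seq) != upstream + downstream:
        raise ValueError("unexpected promoter length")
    sections = {}
    for offset in range(0, len(seq), window):
        if offset < upstream:
            # Distance from the gene start shrinks as we move along seq.
            far = upstream - offset
            near = far - window
            label = f"{near}-{far}bp_upstream"
        else:
            near = offset - upstream
            label = f"{near}-{near + window}bp_downstream"
        sections[label] = seq[offset:offset + window]
    return sections


# Made-up 600bp sequence: 500 A's upstream, 100 C's downstream.
demo = split_promoter("A" * 500 + "C" * 100)
print(list(demo))
```

Each section can then be written to its own FASTA file, so that every file stays under the submission size limit while the full 600bp region is still covered.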

As well as global motifs, it was of interest to identify any DNA motifs that may be conserved within each individual secondary metabolism cluster. For this, FASTA files were built containing the entire promoter region (600bp) of the genes located within each contig believed to be involved in secondary metabolism. This resulted in one FASTA file per secondary metabolite, each of which was then run through the MEME algorithm, allowing significant motifs to be identified.

The motifs were also analysed using FIMO (Find Individual Motif Occurrences), which was used to search other genomes for occurrences of each motif in order to assess whether the motif is likely to be biologically significant. FIMO computes a log-likelihood ratio score for each position in a given sequence database, uses dynamic programming to convert this score to a P-value, and then applies false discovery rate analysis to estimate a q-value for each position. A P-value threshold of 0.0001 was chosen so as to obtain only occurrences that are statistically unlikely to arise by chance. The genomes chosen for comparison within Streptomyces were those of three closely related bacteria: S. coelicolor, S. griseus and S. avermitilis. For comparison against another bacterium with high GC content, Mycobacterium avium paratuberculosis was used. As a baseline for unrelated genomes, Escherichia coli O157 and Listeria monocytogenes were used, as they are well characterised.
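To make the scoring step concrete, the sketch below scores every position of a sequence against a motif using log-odds (log-likelihood ratio) values. It illustrates the general technique only and is not FIMO's implementation: the conversion of scores to P-values and q-values is omitted, and the counts, background frequencies and function names are hypothetical.

```python
import math


def log_odds_matrix(counts, background, pseudocount=1.0):
    """Turn per-position base counts into log2 odds scores.

    `counts` is a list of dicts (one per motif position) mapping
    A/C/G/T to observed counts; `background` maps bases to frequencies.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total)
                         / background[b])
            for b in "ACGT"
        })
    return matrix


def scan(sequence, matrix):
    """Yield (position, score) for every full-length window, 0-based.

    `sequence` is expected in upper case.
    """
    width = len(matrix)
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        yield i, sum(matrix[j][base] for j, base in enumerate(window))


# Hypothetical 3bp motif built from 10 aligned sites, scored against a
# GC-rich background loosely resembling a Streptomyces genome.
background = {"A": 0.14, "C": 0.36, "G": 0.36, "T": 0.14}
matrix = log_odds_matrix([{"C": 10}, {"G": 10}, {"T": 10}], background)
for position, score in scan("AACGTAA", matrix):
    print(position, round(score, 2))
```

Scoring against a GC-rich background matters here: a G or C match earns less than an A or T match, which partly corrects for the base composition of high-GC genomes.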

Chapter 4


4.1 Initial observations

IMG/ER analysis provided some basic statistics about SM8, outlined in Table 4 below. GC content was measured as total G+C bases divided by total A/G/C/T bases. The predicted gaps are labelled "N" in IMG/ER and are not counted towards total bases, potentially leading to an overestimation of GC content. The total number of genes identified was 6722, with 98.88% of these being identified as coding genes. Of these, 538 (8%) were identified as regulatory genes across the entire genome. Even though only 70 of these regulatory genes were found within the secondary metabolism clusters, there may also be global regulators outside the clusters that have a regulatory function in secondary metabolism.


% of total
Total number of bases
GC content
Total called genes
Protein coding genes
Regulatory genes
Secondary metabolite regulatory genes

Table 4 Streptomyces SM8 statistics from IMG/ER
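The GC content calculation described in this section, G+C over A+C+G+T with gap characters excluded, can be written as a short function. This is a minimal sketch; the function name is hypothetical and the example sequence is invented.

```python
def gc_content(sequence):
    """Return the GC fraction over unambiguous bases (N and others ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("no unambiguous bases in sequence")
    return gc / (gc + at)


# The four N characters do not enter the denominator.
print(gc_content("GGCCNNNNAT"))
```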

4.2 Difficulties identifying regulatory networks and DNA motifs within secondary metabolite clusters

Though the secondary metabolite clusters have been identified within the SM8 genome, this identification was based purely on comparing individual contigs against clusters in other Streptomyces species. This can lead to regions of the genome being included in the analysis that are not actually part of the cluster, and consequently to extra regulatory genes being identified that have no effect on the regulation of the clusters.

Conversely, this method of identifying clusters may omit significant regulators that are intrinsic to the regulation of a cluster because they lie on a contig that was not sufficiently similar to be included. For this reason the analysis of regulatory networks within these clusters cannot be taken as concrete evidence of the regulatory network, and experimental approaches such as transcriptome or microarray analysis are required for definitive identification of regulatory networks.

DNA motif analysis is also difficult, as many possible DNA motifs may be identified but few prove biologically significant when tested under laboratory conditions (Studholme et al., 2004). Prior experimental analysis has been shown to be advantageous: when regulons, sets of genes known to be co-regulated, are identified by microarray analysis, the search space for motif analysis is greatly reduced. This gives more confidence in the motifs that are then found within these regulons.

The selection of appropriate limits on the length of DNA motifs is also extremely important when trying to identify significant motifs. Shorter motifs are more likely to occur by chance, yet short motifs can be genuine: the TATA box, considered to be the core promoter sequence and present in ~24% of human genes, is only 6bp long, consisting of 5'-TATAAA-3', generally followed by three or more adenine bases (Lifton et al., 1978).
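The effect of motif length and base composition on chance occurrences can be estimated with a simple background model. The sketch below assumes independent bases, equal G and C frequencies and equal A and T frequencies; the GC fraction of 0.72 and the 8 Mb genome length are illustrative assumptions, and the 15bp word is made up.

```python
def chance_probability(motif, gc=0.72):
    """Probability that a fixed word matches at one position by chance,
    under an independent-base background with the given GC fraction."""
    freq = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    p = 1.0
    for base in motif.upper():
        p *= freq[base]
    return p


def expected_hits(motif, genome_length, gc=0.72, both_strands=True):
    """Expected number of exact chance matches in a genome."""
    positions = genome_length - len(motif) + 1
    strands = 2 if both_strands else 1
    return positions * strands * chance_probability(motif, gc)


# Hypothetical 8 Mb high-GC genome.
for word in ("TATAAA", "CGTCGT", "CGTCGTCGTACGACG"):
    print(word, expected_hits(word, 8_000_000))
```

Under these assumptions a GC-rich 6bp word is expected thousands of times by chance in a high-GC genome, whereas an exact 15bp word is expected far less than once, which is why longer motifs carry more weight when they recur.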

4.3 Important secondary metabolism clusters

4.3.1 Candicidin Cluster

Candicidin is an aromatic polyene macrolide and was named for its strong activity against species of Candida. It was originally used as a treatment for vaginal candidiasis and prostatic hyperplasia; however, use of this antibiotic has declined in recent years. It is produced by some Streptomyces species, notably Streptomyces griseus, and was identified in SM8 as a potential secondary metabolite.

Figure Structure of Candicidin-D molecule (Zielinski et al. 1979)

IMG identified two regulatory genes within the Candicidin cluster, annotating them as a luxR family regulator and a response regulator containing a CheY-like receiver domain. However, BLAST analysis of these regulators to confirm the predicted function matched them to partial ORFs from Streptomyces griseus with 63% and 100% identity respectively, as shown in Table 5.

| Predicted Function Using BLAST | Gene Product Name IMG |
| Streptomyces griseus partial ORF 1, canA, canC, canF, canT, canRA, canRB | Bacterial regulatory proteins luxR family. |
| Streptomyces griseus partial ORF 1, canA, canC, canF, canT, canRA, canRB | Response regulator containing a CheY-like receiver domain and an HTH DNA-binding domain |

Table 5 Regulators within Candicidin cluster

From the literature it is clear that these ORFs are intrinsic to the production of Candicidin and are controlled by the regulator pabAB, which was not found in the initial search for regulatory factors within the Candicidin cluster (Gil and Campelo-Diez, 2003). However, pabAB is present within the cluster on contig 349, as discovered through KEGG orthology.

Figure Candicidin biosynthesis pathway (Caspi et al., 2008)

The biosynthesis of Candicidin involves the 4-aminobenzoate (PABA) molecule, activated to PABA-CoA, which acts as the starter for the head-to-tail condensation of four propionate and 14 acetate units to produce a polyketide molecule. The deoxysugar mycosamine is then attached to this polyketide, giving the final product Candicidin D.

DNA motif analysis also found a regulatory motif 14bp long within the candicidin cluster. The consensus sequence for this motif is shown in figure 8 below. This consensus sequence was found in all of the promoter regions of the candicidin secondary metabolism cluster.

Figure DNA motif for candicidin cluster

FIMO analysis of the motif within the selected genomes showed a high occurrence rate among the closely related Streptomyces species. However, there is a significant drop in occurrences when the motif is searched for within Mycobacterium avium paratuberculosis, which has a GC content equivalent to that of the Streptomyces species.

The baseline comparisons against Escherichia coli O157 and Listeria monocytogenes showed very little occurrence of the motif, which suggests that its occurrence within the Streptomyces species is not just by chance. The motif therefore putatively has biological significance within Streptomyces and a regulatory role within the candicidin cluster, given that it is present in the promoter region of each of the genes in the cluster.

Figure FIMO analysis of candicidin cluster DNA motif

4.3.2 Lantibiotic cluster

Lantibiotics are ribosomally synthesized, posttranslationally modified peptide antibiotics produced by Gram-positive bacteria. They are characterized by lanthionine and methyl lanthionine bridges that give lantibiotics their characteristic conformations and stability.

Lantibiotics are encoded by a structural gene (generically named lanA) that encodes a prepropeptide with an N-terminal leader peptide followed by the region that will become the mature lantibiotic (the propeptide). The prepropeptide undergoes modification and processing before the mature product is exported from the cell, coincident with, or followed by, leader peptide removal. The N-terminal leader sequence may have roles in directing export, retaining the prelantibiotic in an inactive state until export and recruiting modifying enzymes.

Formation of lanthionine bridges occurs via the dehydration of serine and threonine residues, followed by cyclization with cysteine residues. In type AI lantibiotics (e.g., nisin), dehydration is carried out by a LanB enzyme and cyclization by LanC. In type AII (e.g., lacticin 481) and B (e.g., cinnamycin) lantibiotics, a bifunctional LanM carries out both of these reactions. Formation of C-terminal S-[(Z)-2-aminovinyl]-D-cysteine requires a LanD enzyme. Additional genes, such as those involved in pathway-specific regulation, lantibiotic export, and producer cell immunity, are also found in lantibiotic gene clusters.

The lantibiotic cluster contained the second largest number of regulators of all the secondary metabolism clusters. The regulators are shown below in Table 6. The regulation of this cluster involves three TetR family transcriptional regulators, which are known to be negative regulators of the expression of secondary metabolites in Streptomyces species (Novakova et al., 2010). Members of this family function mainly as repressors and regulate genes encoding biosynthetic enzymes for antibiotics, drug-efflux pumps and other proteins. The structure of the TetR family proteins consists of two domains: a DNA-binding domain which contains a helix-turn-helix motif, and a regulatory domain that recognizes signals via ligand binding. TetR-family proteins show high sequence similarity in their DNA-binding domains and almost no significant similarity in the regulatory domains (Ramos et al., 2005).

Though this cluster contains only one SARP family transcriptional regulator, SARPs are a well-characterised family of regulators of antibiotic production in Streptomyces, characterized by an N-terminal OmpR-type winged helix-turn-helix. It has previously been shown that in Streptomyces aureofaciens a TetR family regulator, of the kind already seen in this cluster, regulates the antibiotic auricin by repressing the activation of a SARP family transcriptional activator which directly regulates the production of auricin (Novakova et al., 2011).

| Predicted Function Using BLAST | Gene Product Name IMG |
| SARP family transcriptional regulator [Streptomyces sp. S4] | DNA-binding transcriptional activator of the SARP family |
| LuxR family transcriptional regulator [Streptomyces sp. S4] | Response regulator containing a CheY-like receiver domain |
| TetR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional attenuator LytR family |
| Transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator HxlR family |
| TetR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulator TetR family |
| TetR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator TetR family |
| GntR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulators |
| UbiC transcription regulator-associated domain-containing protein [Streptomyces albus J1074] | Transcriptional regulators |
| Transcriptional-repair coupling factor [Streptomyces albus J1074] | Transcription-repair coupling factor (mfd) |

Table 6 Regulators within lantibiotic cluster

Along with regulatory factors, a putative DNA motif was identified within the lantibiotic cluster, appearing in all of the promoter regions submitted to the MEME algorithm. This suggests a strong likelihood that the motif is related to the operation and regulation of this cluster.

Figure DNA motif for lantibiotic cluster

This motif was again run through FIMO to test for its presence in other bacteria. Occurrences of this motif are low in relation to occurrences of other motifs within related clusters. This may be because the lantibiotic cluster is not present in these species, or possibly because the motif has no significant biological function outside of the lantibiotic cluster in Streptomyces SM8.

Figure FIMO analysis of lantibiotic cluster DNA motif

4.3.3 NRPS cluster

The biosynthesis of many large complex substances has been found to require specialised enzymatic machinery, particularly polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS), and much effort has been made to identify and categorise genes which code for them.

An NRPS operates in a chemically different but mechanically similar fashion to a PKS. The A domain selects an amino acid and loads it onto a PCP domain. The C domain binds a new amino acid to the current one with an amide bond, elongating the chain and moving it to the next PCP domain, where, if there is another module waiting, the cycle repeats until a TE domain terminates the chain and releases it. As with a PKS, further domains may be present to modify the substrate during the process, such as E, Cy, M, NM and R.

Figure Example of typical NRPS operation (Donadio et al., 2007)

The highest number of regulators was identified in the NRPS cluster, so defining a regulatory pathway without significant experimental evidence would be intrinsically difficult. However, it is clear that the cluster contains a number of negative regulators, such as TetR-family regulators, and positive regulators, such as SARP-family regulatory proteins. MarR regulatory proteins are also present within the cluster. These have been shown to regulate genes that are critical in the response to changing environments and can act as either activators or repressors (Wilkinson and Grove, 2006).

HspR has also been shown to be a negative regulator of heat shock gene expression, leading to the hypothesis that heat shock may be the environmental change that acts as an activator for the MarR-family transcriptional regulator (Schmid et al., 2005).

| Predicted Function Using BLAST | Gene Product Name |
| TetR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator TetR family (IMGterm) |
| TetR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator TetR family (IMGterm) |
| DNA-binding protein [Streptomyces sp. S4] | Predicted transcriptional regulator |
| Helix-turn-helix type 11 domain-containing protein [Streptomyces sp. S4] | Predicted transcriptional regulator |
| HspR [Streptomyces albus J1074] | Predicted transcriptional regulators |
| Pit accessory protein [Streptomyces albus J1074] | Phosphate transport regulator (distant homolog of PhoU) |
| Two-component system response regulator [Streptomyces albus J1074] | Two component transcriptional regulator LuxR family (IMGterm) |
| Putative PAS/PAC sensor protein [Streptomyces sp. S4] | Serine phosphatase RsbU regulator of sigma subunit |
| SARP family pathway specific regulatory protein [Streptomyces sp. S4] | Transcriptional regulatory protein C terminal. |
| MarR-family regulatory protein [Streptomyces albus J1074] | Transcriptional regulator MarR family (IMGterm) |
| MarR-family regulatory protein [Streptomyces albus J1074] | Transcriptional regulators |
| PadR family transcriptional regulator [Streptomyces sp. W007] | Predicted transcriptional regulators |
| ArsR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator ArsR family (IMGterm) |
| GntR-family transcriptional regulator [Streptomyces albus J1074] | Predicted transcriptional regulators |


MEME identified a 15 bp motif within this NRPS cluster that was found in all 99 promoter sequences identified within the cluster. The motif begins with the repeating sequence CGT-CGT-CGT, which makes the logo approximately palindromic and provides two very similar recognition sites.

Figure DNA motif for NRPS cluster

FIMO analysis (Figure 13) showed a high occurrence of this motif within closely related Streptomyces species, but also in Mycobacterium avium subsp. paratuberculosis, which was used as a reference for other bacterial species with a high GC content. Because the motif occurs frequently in all of the high-GC species, its apparent significance could be explained by its large number of G and C bases rather than by strong biological significance.
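The GC-bias argument can be made concrete with a simple background model. The sketch below assumes that bases are independent, that G and C are equally frequent, and that A and T are equally frequent. The motif string, the genome length and the 50% GC comparison value are hypothetical placeholders: only the leading CGT-CGT-CGT repeat and the ~73% GC content of SM8 come from this study.

```python
def background_probability(motif, gc_content):
    """Probability of seeing `motif` at one position under an independent-base model."""
    p_gc = gc_content / 2        # probability of G (and of C)
    p_at = (1 - gc_content) / 2  # probability of A (and of T)
    probability = 1.0
    for base in motif.upper():
        if base in "GC":
            probability *= p_gc
        elif base in "AT":
            probability *= p_at
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return probability


def expected_chance_hits(motif, gc_content, genome_length):
    """Expected number of exact matches on both strands arising by chance alone."""
    positions = genome_length - len(motif) + 1
    # The reverse complement has the same GC count, so both strands share one probability.
    return 2 * positions * background_probability(motif, gc_content)


# Hypothetical 15 bp GC-rich motif: only the leading CGT repeat is taken from the text.
motif = "CGTCGTCGTCGGCGG"
genome_length = 7_000_000  # hypothetical genome size in bp

high_gc = expected_chance_hits(motif, 0.73, genome_length)  # SM8-like GC content
mid_gc = expected_chance_hits(motif, 0.50, genome_length)   # illustrative lower-GC genome
print(f"expected chance hits at 73% GC: {high_gc:.3g}")
print(f"expected chance hits at 50% GC: {mid_gc:.3g}")
```

Under these assumptions a GC-rich motif is expected more often by chance in a high-GC genome than in a balanced one, which is why frequent hits in every high-GC species are weak evidence of function on their own. FIMO uses its own background model and scoring, so this calculation only illustrates the reasoning.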

Figure FIMO analysis for the NRPS cluster DNA motif

4.3.4 PKS/NRPS hybrid cluster

The genes coding for PKS or NRPS enzymes have been found to be highly conserved. The core regions of a typical PKS consist of ketosynthase (KS), acyltransferase (AT), and acyl carrier protein (ACP) domains, while a typical NRPS contains adenylation (A), condensation (C), and peptidyl carrier protein (PCP) domains. Multiple PKS and NRPS genes may be involved in the biosynthesis of one large end-product, and the genes tend to be conserved in clusters, arranged in the same order as the assembly line in which their enzymes operate (Donadio et al., 2007).

For a PKS, the AT domain is responsible for selecting an acyl-CoA as the substrate for the biosynthesis process performed by that gene, and loading it onto the starting ACP.

Figure Example of typical PKS operation (Donadio et al., 2007)

The chain is passed onto a KS domain, where it is held while another acyl-CoA is loaded and bound to the first in a Claisen condensation reaction, leaving the KS domain free and the newly elongated chain bound to the ACP domain. If there is more than one module, the chain is moved from the ACP domain to the next KS domain, and the cycle of elongation repeats until a TE domain terminates and releases the chain. Further domains may also be present to modify the substrate during the process, such as KR, ER or DH.

Regulation within this cluster appears to be carried out by a small number of regulators. However, some regulators may have been missed because the secondary metabolism clusters were detected by contig analysis, and global regulators may also play an important role within this cluster. The two TetR-family regulators are known repressors of secondary metabolism, while the LysR family of transcriptional regulators are known to be negative autoregulators. A LysR-family regulator has been shown to be active in the regulation of PKS in Streptomyces coelicolor, although in that case it was regulated along with a SARP-family regulator, and no SARP-family regulator was found within this contig (Colombo et al., 2001).

Predicted Function Using BLAST | Gene Product Name IMG
LysR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulator
TetR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulator TetR family
TetR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulator TetR family

Table Regulators within PKS/NRPS hybrid cluster

The motif found through MEME within this cluster (Figure 15) was present in only 7 of the 12 promoter regions, and given the small data set it may not be of any biological significance. However, the presence of adenine bases within the motif might suggest otherwise, given the high GC content of SM8 (~73%), so further investigation into the validity of this DNA motif may be required.
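The argument that adenines are unexpected in a ~73% GC genome can be put in rough numbers. The sketch below assumes independent positions and equal frequencies of A and T, so that the chance of an adenine at any one position is (1 - 0.73) / 2. The window length and the adenine count are hypothetical values chosen for illustration, not measurements from the motif in Figure 15.

```python
from math import comb


def prob_at_least(k, n, p):
    """Binomial tail: probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


gc_content = 0.73                 # approximate GC content of SM8, from the text
p_adenine = (1 - gc_content) / 2  # assumes A and T are equally frequent

window = 15        # hypothetical motif width in bp
min_adenines = 5   # hypothetical number of adenine positions in the motif

tail = prob_at_least(min_adenines, window, p_adenine)
print(f"P(adenine at one position) = {p_adenine:.3f}")
print(f"P(at least {min_adenines} adenines in {window} bp) = {tail:.4f}")
```

A small tail probability would support the view that an adenine-rich motif is unlikely to arise from base composition alone, although with only 12 promoter regions any such estimate remains tentative.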

Figure DNA motif for NRPS/PKS hybrid cluster

The frequency of this motif within other Streptomyces species further supports the suggestion that it may be a significant DNA motif: it occurs frequently in these species even though they have a high GC content, while it does not appear as frequently as would be expected in E. coli or Listeria, which have a much lower GC content than the other species analysed.

Figure FIMO analysis for NRPS/PKS hybrid cluster motif

4.3.5 Potential Gramicidin cluster

Linear gramicidin is a membrane channel forming pentadecapeptide that is produced via the nonribosomal pathway. It consists of 15 hydrophobic amino acids with alternating L- and D-configuration forming a β-helix-like structure. It has an N-formylated valine and a C-terminal ethanolamine. The primary structure of gramicidin A was determined as formyl-Val-Gly-Ala-D-Leu-Ala-D-Val-Val-D-Val-Trp-D-Leu-Trp-D-Leu-Trp-D-Leu-Trp-ethanolamine. The other naturally occurring isoforms, gramicidin B and C, have either phenylalanine or tyrosine replacing tryptophan at position 11, respectively. Gramicidin D refers to the naturally produced mixture of gramicidins A, B, and C of ∼80% A, 5% B, and 15% C (Kessler et al., 2004).
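The primary structure quoted above can be checked programmatically. The following sketch parses the sequence as written in the text (without the formyl and ethanolamine end groups), confirms the residue count and the positions of the D-residues, and derives the B and C isoforms by substituting position 11. The function names are illustrative only.

```python
# Gramicidin A residues as given in the text, without the terminal modifications.
GRAMICIDIN_A = (
    "Val-Gly-Ala-D-Leu-Ala-D-Val-Val-D-Val-"
    "Trp-D-Leu-Trp-D-Leu-Trp-D-Leu-Trp"
)


def parse_residues(sequence):
    """Turn a hyphen-separated peptide string into (configuration, residue) pairs."""
    tokens = sequence.split("-")
    residues = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "D":              # "D-Leu" splits into "D" and "Leu"
            residues.append(("D", tokens[i + 1]))
            i += 2
        elif tokens[i] == "Gly":          # glycine has no chiral centre
            residues.append(("achiral", "Gly"))
            i += 1
        else:                             # unmarked residues are the L-form
            residues.append(("L", tokens[i]))
            i += 1
    return residues


def substitute_position(residues, position, new_residue):
    """Return a copy with the residue at a 1-based position replaced."""
    variant = list(residues)
    configuration, _ = variant[position - 1]
    variant[position - 1] = (configuration, new_residue)
    return variant


residues = parse_residues(GRAMICIDIN_A)
d_positions = [i + 1 for i, (config, _) in enumerate(residues) if config == "D"]
gramicidin_b = substitute_position(residues, 11, "Phe")
gramicidin_c = substitute_position(residues, 11, "Tyr")

print(len(residues))     # 15
print(d_positions)       # [4, 6, 8, 10, 12, 14]
print(residues[10])      # ('L', 'Trp')
print(gramicidin_b[10])  # ('L', 'Phe')
```

The D-residues fall on even positions from 4 to 14, with the achiral glycine at position 2, which is consistent with the alternating configuration described in the text.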

Little is detailed in the literature regarding the regulation of gramicidin. Among the regulators identified within the potential gramicidin cluster are two two-component system response regulators. Two-component systems serve as a basic mechanism by which bacteria respond to environmental changes. They generally involve a membrane-bound histidine kinase that senses a specific environmental stimulus and a cognate response regulator that elicits a cellular response through the regulation of target genes (Stock et al., 2000).

MarR-family regulators are also known to act in response to environmental change. ArsR-family regulators are metalloregulatory transcriptional repressors which have been shown to allow cells to respond to heavy metal toxicity. They generally repress transcription in the absence of metal ions; binding of the metal ion lowers their affinity for the operator DNA and relieves the repression (Busenlehner et al., 2003).

Predicted Function Using BLAST | Gene Product Name
ArsR family transcriptional regulator [Streptomyces sp. S4] | Predicted transcriptional regulators
Two-component system response regulator [Streptomyces albus J1074] | Response regulators consisting of a CheY-like receiver domain
TetR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator TetR family
MarR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulators
[Streptomyces albus J1074] | Two component transcriptional regulator LuxR family
Two-component system response regulator [Streptomyces albus J1074] | Two component transcriptional regulator LuxR family

Table Regulators within Gramicidin cluster

MEME analysis identified a 15 bp motif (Figure 17) which occurred in all 64 promoter sequences within the gramicidin cluster. However, the motif has a high GC content, and given the high GC content of the bacterium it is likely that it has occurred by chance and is not a transcription factor binding site within the promoter region. Further investigation using experimental techniques such as DNA footprinting would be required to confirm its biological significance.

Figure DNA motif for gramicidin cluster

FIMO analysis showed the motif to occur frequently in the high-GC bacteria, consistent with its own high GC content and supporting the suggestion that this motif occurred by chance rather than being biologically significant.

Figure FIMO analysis for gramicidin cluster motif

4.3.6 Siderophore cluster

Siderophores are relatively low molecular weight, ferric ion specific chelating agents utilised by bacteria and fungi growing under low iron stress. The role of these compounds is to scavenge iron from the environment and to make the mineral, which is almost always essential, available to the microbial cell (Neilands, 1995).

Regulation of siderophores is generally carried out by the Fur (ferric uptake regulation) protein, which represses siderophore genes when iron is plentiful by polymerizing around the operator, a mechanism supported by electron microscopy observations (Le Cam et al., 1994). However, no gene encoding the Fur protein was found within the cluster. To ensure that none of the regulators identified had been misannotated, they were aligned against the fur gene using MEGA (Tamura et al., 2011), but no similarity was found.

Figure UPGMA tree for siderophore cluster regulators

Without the presence of the fur protein

Predicted Function Using BLAST | Gene Product Name
XRE family transcriptional regulator [Streptomyces sp. S4] | Predicted transcriptional regulator
[Streptomyces albus J1074] | Predicted transcriptional regulators
Two-component system response regulator [Streptomyces albus J1074] | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain
AraC-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator AraC family with amidase-like domain
LacI-family transcriptional regulatory protein [Streptomyces albus J1074] | Transcriptional regulator LacI family
TetR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator TetR family
TetR family transcriptional regulator [Streptomyces sp. S4] | Transcriptional regulator TetR family
TetR-family transcriptional regulator [Streptomyces albus J1074] | Transcriptional regulator TetR family

Table Regulators within siderophore cluster

DNA motif analysis found a 15 bp motif within the siderophore cluster which is similar to the "iron box", a regulatory motif involved in the regulation of siderophores in many species of bacteria. The "iron box" or "fur box" consensus sequence in the operator is GATAATGATAATCATTATC, an array which occurs with some variation in the regulatory DNA of iron-affected systems in many microbial species. Although no fur gene was identified in SM8, this motif may be the binding site for a homologue of the Fur protein that was not identified by the automated gene calling software IMG/ER.
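One simple way to look for iron-box-like sites is to slide the consensus along a sequence and count mismatches on both strands. The sketch below is not the MEME/FIMO procedure used in this study. The mismatch threshold and the test sequence are hypothetical, and only the consensus sequence itself is taken from the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
IRON_BOX = "GATAATGATAATCATTATC"  # consensus quoted in the text


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def scan(sequence, consensus=IRON_BOX, max_mismatches=4):
    """List (start, strand, mismatches) for every window close to the consensus."""
    sequence = sequence.upper()
    width = len(consensus)
    targets = (("+", consensus), ("-", reverse_complement(consensus)))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, target in targets:
            mismatches = hamming(window, target)
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return hits


# The consensus differs from its own reverse complement only at the central base.
print(hamming(IRON_BOX, reverse_complement(IRON_BOX)))  # 1

# Hypothetical test sequence: the consensus embedded in GC-rich flanks.
toy_sequence = "GCGCGC" + IRON_BOX + "GCGCGC"
hits = scan(toy_sequence, max_mismatches=1)
print(hits)
```

The consensus is AT-rich, so in a genome with ~73% GC content a close match is much less likely to arise by chance than a GC-rich motif would be. The same function could be applied to the promoter regions of the cluster, with the threshold chosen according to how much variation is to be tolerated.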

Figure DNA motif for siderophore cluster

4.3.7 Additional clusters

Of the 19 clusters identified, the five chosen were those whose secondary metabolites showed the most potential for practical use. Other clusters that may have been of biological significance, such as those for fredericamycin and antimycin, were found to have no regulatory genes within them. This may be because the contigs on which the regulatory genes are located are absent from the cluster.

Other clusters, such as the geosmin cluster, are of little interest, as geosmin's only known contribution is to give soil its scent when the bacteria are present within it. MEME analysis was nevertheless carried out on these clusters to find DNA motifs that could be of importance elsewhere in the cell and not only within the cluster itself.





Hopene-squalene synthase



Unknown small NRPS like protein