Cdx2 Chip Seq In Ts Cells Biology Essay

The strategy outlined below is aimed at identifying high-confidence binding sites of Cdx2 in TS cells by utilizing position and fold-enrichment data calculated from the ChIP-seq data set along with expression dynamics in TS cells versus differentiated TS cells and TS cells versus ES cells. The present status quo for ChIP-seq analysis is to identify all significant binding sites above background. In my view this normally generates a mind-numbingly large list of purported target genes, most of which are functionally irrelevant.

Wishva ran the MACS peak calling algorithm {Zhang, 2008 #1776} on the Cdx2 ChIP-seq data including the input control sample. This identified 16,735 Cdx2 binding sites (see: 'Cdx2 MACS peaks.xlsx'). These binding sites represent biologically real protein-DNA interactions in TS cells; however, associating biological function (i.e. transcriptional control of associated genes) with these binding sites is more challenging. Recent ChIP-chip studies in the Drosophila blastoderm have indicated a clear association between the strength of binding (i.e. fold-enrichment) and functionality {Li, 2008 #1950}. In mouse ES cells a key binding location of Oct4 is in its own promoter, creating a positive feedback loop {Chew, 2005 #196}. In Oct4 ChIP-seq data from mouse ES cells {Chen, 2008 #1509} this binding site is the 8th most fold-enriched site out of a total of 6,541 peaks identified by MACS (see: 'Oct4 MACS peaks.xlsx'), implying that functionally important binding sites are highly fold-enriched. Thus it is noteworthy that the Cdx2 binding site with the greatest fold-enrichment is nearest to the TSS of Id2, a gene we know is directly regulated by Cdx2.

Id2 is 10-fold down-regulated after 6 days of TS cell differentiation (-Fgf4). From our single cell work {Guo, 2010 #1965}, we know Id2 is an early specific marker of outer cells of the morula and that this expression follows the zygotic transcriptional activation of Cdx2. Profiling of Cdx2 null blastocysts confirms Cdx2 is required for expression of Id2 in the blastocyst (manuscript currently in preparation). Not only does this show that the level of enrichment points towards functionality of a binding site, but also that at least some Cdx2 binding sites identified in TS cells correlate with regulation in the trophectoderm. Another highly enriched binding site, ranked 3rd, is associated with the receptor for activin, Acvr1b. Acvr1b is required for TS cell maintenance {Erlebacher, 2004 #315}{Natale, 2009 #1951}. These two examples (Id2 and Acvr1b) provide some substance to the notion that functionally important targets of Cdx2 bind Cdx2 strongly (i.e. high fold-enrichment).
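The fold-enrichment ranks quoted above (1st for Id2, 3rd for Acvr1b, 8th for the Oct4 promoter site) come from sorting a peak list by fold-enrichment. A minimal sketch of that ranking step follows. The peak names and values are hypothetical toy data, and the `fold_enrichment` key is an assumed field name, not the actual column header in our spreadsheets.

```python
def rank_by_fold_enrichment(peaks):
    """Return a copy of the peaks sorted by descending fold-enrichment,
    with a 1-based 'rank' field added to each peak."""
    ordered = sorted(peaks, key=lambda p: p["fold_enrichment"], reverse=True)
    return [dict(p, rank=i) for i, p in enumerate(ordered, start=1)]


# Hypothetical toy peaks (not values from the Cdx2 data set).
peaks = [
    {"name": "peak_a", "fold_enrichment": 12.5},
    {"name": "peak_b", "fold_enrichment": 48.0},
    {"name": "peak_c", "fold_enrichment": 7.1},
]

ranked = rank_by_fold_enrichment(peaks)
for p in ranked:
    print(p["rank"], p["name"], p["fold_enrichment"])
```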

Using Id2 as a "poster child" for positive direct regulation by Cdx2, a close examination of Cdx2 peaks around Id2 gives insight into the binding pattern of Cdx2 around genes it regulates. Besides the above-mentioned top-ranked peak there are other prominent Cdx2 binding sites in close proximity to Id2; for instance, the Id2 TSS is also the closest TSS to the 38th, 373rd and 1009th most fold-enriched binding sites. The most prominent peak is ~50 kb from the Id2 TSS, while the other three peaks are closer but none is within 10 kb. Another presumed direct target of Cdx2 is Eomes. There are a few Cdx2 peaks associated with Eomes, with the two most prominent peaks 80 and 50 kb upstream of the TSS. Thus the Id2 and Eomes examples strongly suggest that Cdx2 can regulate transcription from extended distances (not surprising) and that multiple binding sites are employed in a single gene's activation.

Using the nearest gene to a binding site as the target gene of that binding site is a rather simplistic view of gene regulation but is nonetheless the standard in the field. An example of the challenges in associating binding sites with individual genes can be found with the 2nd most enriched binding site in our data, which is associated with Ddc. This binding site is within an intron and ~58 kb away from the Ddc TSS; the TSS of another gene (FIGNL1) is actually closer, with the binding site ~16 kb upstream of it. Both these genes are expressed at relatively high levels in TS cells and down-regulated upon differentiation (Kidder data set). A prime example of the challenges in associating cis elements with the genes they regulate is found in the control of Shh, where one particular (and essential) regulatory element lies 1 Mb away, in the intron of another gene {Sagai, 2005 #1068}. As our ChIP-seq data is linear and not 3-dimensional as in the ChIA-PET technology {Fullwood, 2009 #2009}, at present I am content to use just the nearest TSS rule.

As the main interest of our current study is in identifying genes regulated by Cdx2 (as opposed to the molecular structure of Cdx2-DNA interactions), we want to develop a scoring system for the likelihood that a gene is a functional target of Cdx2. This would need to take into account fold-enrichment and multiple binding sites (not normally done in previous ChIP-seq studies that I am aware of). Previous methods have typically used the nearest gene within a certain window, for example 100 kb (ref?).

Creating a TF association score tailored to our data

A recent PNAS paper attempts an association score algorithm {Ouyang, 2009 #1949} that takes into account binding strength (actually tag counts irrespective of input control) and distance to the TSS. The distance weighting comes as an exponential decay factor exp(-dk/d0), where dk is the distance from the peak to the TSS and d0 is an arbitrarily chosen constant which they have set to 5000 bp. From Mikael's plotting of this, a binding site at a distance of 19 kb contributes very little to the association score. In fact, a peak with 72 reads at 19 kb from the TSS is roughly equivalent to a peak with only 4-5 reads at 5000 bp from the TSS. This raises the issue of distance with respect to level of transcriptional control, as intuitively one would consider a 72 read binding site much more biologically significant than a 5 read binding site, whether it is 5 kb or 20 kb away. In addition, there is no biological reason why an enhancer at 10 kb should be any more powerful than an enhancer at 50 kb if one buys into the definition of an enhancer being orientation and position independent.
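The 72-read versus 4-5-read comparison can be checked with a few lines of arithmetic. The sketch below treats a single peak's contribution as reads multiplied by exp(-dk/d0). That is my reading of the description above, not a reimplementation of the published algorithm, and the function names are made up for illustration.

```python
import math

D0_BP = 5000.0  # decay constant d0 in bp, the value quoted in the text


def distance_weight(distance_bp, d0=D0_BP):
    """Exponential decay factor exp(-dk/d0) for a peak dk bp from the TSS."""
    return math.exp(-distance_bp / d0)


def weighted_contribution(reads, distance_bp, d0=D0_BP):
    """Assumed per-peak contribution: tag count scaled by the distance factor."""
    return reads * distance_weight(distance_bp, d0)


far_strong = weighted_contribution(72, 19_000)   # 72 reads at 19 kb
near_weak = weighted_contribution(4.5, 5_000)    # 4-5 reads at 5 kb

print(f"72 reads at 19 kb:  {far_strong:.2f}")
print(f"4.5 reads at 5 kb:  {near_weak:.2f}")
```

Both contributions come out at roughly 1.6, so under this weighting a strong distal peak counts for no more than a barely detectable proximal one.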

Nonetheless, Mikael did apply this PNAS algorithm to our data and the subsequent association score list generated was somewhat informative. Some very interesting/relevant genes rose to the top, for instance Pou5f1 (the top binding site associated with this gene in the CCAT list was ranked 447th) ranked 10th and Elf5 (the top binding site associated with this gene in the CCAT list was ranked 938th) ranked 12th on this new list. The association score ranking for these two genes was a result of multiple binding sites relatively close to the TSS, which emphasizes the need to take multiple binding sites into account. However, the Ouyang association score did very poorly with Id2 (2380th) and Eomes (3088th), two genes we know are regulated by Cdx2. The low ranking for these two genes is a result of their binding sites (very strong and multiple ones at that) lying relatively far from the TSS.

Rules for Creating Our Cdx2 association scores:

Measure of strength of binding

We have used fold-enrichment over input control as a measure of strength of binding, which is a marked improvement over tag counts as it takes into account background effects. Notable in the list generated using the Ouyang algorithm was the fact that the genes with the 1st and 4th strongest association scores (Sfi1 & Eif4enif1) were ranked this way because of a mapping artifact, which was obvious when looking at the binding profiles in the UCSC browser. In our new association score list these are ranked 3,933rd and 3,449th, respectively, because the identical problematic mapping occurs in the input control and using the input control therefore cancels it out.
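To illustrate why an input control removes this kind of artifact, here is a deliberately simplified ratio. It is not how MACS computes fold-enrichment, and the tag counts, `scale` factor and pseudocount are all assumptions chosen for the example.

```python
def fold_enrichment(chip_tags, input_tags, scale=1.0, pseudocount=1.0):
    """Simplified ChIP-over-input ratio.

    scale: assumed factor normalizing input depth to ChIP depth.
    pseudocount: avoids division by zero where the input has no tags.
    """
    return (chip_tags + pseudocount) / (input_tags * scale + pseudocount)


# Hypothetical mapping artifact: many tags pile up in ChIP AND in input.
artifact = fold_enrichment(chip_tags=500, input_tags=480)

# Hypothetical genuine site: tags in ChIP, almost none in input.
genuine = fold_enrichment(chip_tags=72, input_tags=3)

print(f"artifact region: {artifact:.2f}-fold")
print(f"genuine site:    {genuine:.2f}-fold")
```

By raw tag count the artifact region would outrank the genuine site, whereas the ratio puts it close to 1-fold.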

Measure of Distance

Part of the definition of an enhancer is that it is position independent; thus binding 10 kb versus 80 kb away from the TSS should not really differ in its level of influence on activation at a specific promoter. The problem with greater distance is the presumed greater chance that the binding site is not regulating the gene in question but is actually regulating another gene. However, if we restrict our analysis to scoring binding sites only against the nearest TSS, then I think we can reasonably argue that it is fair to give the same score to a binding site that is 10 kb away as to one that is 80 kb away, with the caveat that we will miss more complex associations where the nearest TSS to a specific cis-regulatory element is not that of the gene it regulates.

Distance will be measured from the binding site to the nearest TSS and not to the nearest gene. That is to say, if a binding site lies in an intron of gene 1 but the distance to the TSS of a neighbouring gene (gene 2) is shorter than the distance to the TSS of gene 1, then the binding site will be associated with gene 2 and not with gene 1 at all.
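The nearest-TSS rule can be sketched as a lookup in a sorted list of TSS positions for one chromosome. The coordinates and gene names below are hypothetical, and the helper is an illustration rather than the code actually used to build our worksheets.

```python
from bisect import bisect_left


def nearest_tss(peak_pos, tss_list):
    """Return (gene, distance_bp) for the TSS closest to peak_pos.

    tss_list: non-empty list of (position, gene) tuples on one chromosome,
    sorted by position. On an exact tie the lower-coordinate TSS wins.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_pos)
    # Only the TSSs immediately left and right of the peak can be nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_pos))
    return gene, abs(pos - peak_pos)


# Hypothetical layout: the peak sits in an intron of gene1 (TSS at 100 kb),
# but the TSS of the neighbouring gene2 (170 kb) is closer.
tss = [(100_000, "gene1"), (170_000, "gene2"), (400_000, "gene3")]
result = nearest_tss(160_000, tss)
print(result)
```

Here the peak is 60 kb from the gene1 TSS and 10 kb from the gene2 TSS, so it is assigned to gene2 even though it lies within gene1.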

I manually looked (in the UCSC browser view) at the top 80 genes in the list of genes down-regulated upon TS cell differentiation (Kidder data) to see if these rules would work. Many of these genes do have very good Cdx2 binding sites associated with them, including ones that are 200 to 300 kb away (and still the closest TSS). There were only one or two instances where a nice binding site was slightly closer to another gene; the vast majority looked like they would rank highly in this new scoring system.

Rules & scores

Only score binding sites with respect to the nearest TSS and no other TSS.

Scoring gradations:

From 1 kb 5' to 0.5 kb 3' of the TSS = 10 (proximal promoter, allowing a bit of variability)

1-2 kb 5' or 0.5-2 kb 3' = 7

2-10 kb 5' or 3' = 5

10 kb and upwards 5' or 3' = 3

Note that even with this distance scoring system the strength of binding (i.e. fold-enrichment) will be the major overall contributor to the final score rather than position.
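The gradations above can be sketched as a small function. Two things here are assumptions on my part and are not stated in the rules: how the exact boundary values (1, 2 and 10 kb) are binned, and how position and fold-enrichment are combined. The sketch multiplies the two per peak and sums over all peaks assigned to a TSS, which is one plausible reading only. The example values are hypothetical.

```python
def position_score(offset_bp):
    """Position score for a peak relative to its nearest TSS.

    offset_bp: signed distance in bp; negative = 5' (upstream),
    positive = 3' (downstream). Boundary handling is an assumption.
    """
    upstream = -offset_bp if offset_bp < 0 else 0
    downstream = offset_bp if offset_bp > 0 else 0
    if upstream <= 1_000 and downstream <= 500:
        return 10  # proximal promoter
    if upstream <= 2_000 and downstream <= 2_000:
        return 7
    if upstream <= 10_000 and downstream <= 10_000:
        return 5
    return 3       # 10 kb and upwards, either side


def association_score(peaks):
    """ASSUMED combination: sum of fold-enrichment x position score
    over all peaks (fold_enrichment, offset_bp) assigned to one TSS."""
    return sum(fold * position_score(offset) for fold, offset in peaks)


# Hypothetical genes: one strong distal peak versus one weak proximal peak.
distal_strong = association_score([(60.0, -50_000)])
proximal_weak = association_score([(5.0, -300)])
print(distal_strong, proximal_weak)
```

With these toy numbers the strong distal peak scores 180 against 50 for the weak proximal one, which is the behaviour the note above describes: fold-enrichment dominates and position only modulates.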

The list generated looks great (see: worksheet "Cdx2 association score final" in "Cdx2 association scoring w MACS.xlsx"). Binding sites were initially associated with the RefSeq gene track data downloaded from the UCSC browser on 2 June 2010 (date listed as this track is updated daily). This gene track includes microRNAs and other non-coding RNAs. A total of 23,505 transcription start sites were identified here, which includes some redundancies for genes with multiple TSSs. The Cdx2 peaks mapping closest to these TSSs can be seen in worksheet "Cdx2 peaks to TSS scoring" in "Cdx2 association scoring w MACS.xlsx". Feeding these data into our scoring algorithm generated the values in the "Cdx2 association score final" worksheet. Many genes of interest lie in the top 5% of scores (the 1,175 most highly associated genes). These include Cdx2 (ranked 833), Pou5f1 (1113), Gata3 (118), Ets2 (247), Fgfr1 (448), Fgfr2 (152), Fgfr4 (294), Elf5 (10th, after combining scores for 2 Elf5 TSSs), Krt8 (754), Nanog (900), Id2 (9), Eomes (306), Tead4 (543), Tead1 (22 and 68), Hand1 (20), Msx2 (36), Vgll1 (1106), Dusp6 (33), and Sox2 (483). Though somewhat arbitrary, the 5% cut-off has some utility.
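As a sketch of the two bookkeeping steps mentioned here, the code below combines scores for genes with more than one TSS and then takes the top 5% of the ranked list. Summing the per-TSS scores is an assumption, since the text does not say how the two Elf5 TSSs were combined, and the gene names and scores are hypothetical toy data.

```python
from collections import defaultdict


def combine_by_gene(tss_scores):
    """Sum per-TSS scores for each gene (summing is an ASSUMED rule)."""
    totals = defaultdict(float)
    for gene, score in tss_scores:
        totals[gene] += score
    return dict(totals)


def top_fraction(gene_scores, fraction=0.05):
    """Return the highest-scoring fraction of genes as (gene, score) pairs."""
    ranked = sorted(gene_scores.items(), key=lambda kv: kv[1], reverse=True)
    n_keep = int(len(ranked) * fraction)
    return ranked[:n_keep]


# Hypothetical data: GeneA has two TSSs; 40 distinct genes in total.
tss_scores = [("GeneA", 30.0), ("GeneA", 25.0), ("GeneB", 50.0)]
tss_scores += [(f"Gene{i}", float(i)) for i in range(38)]

combined = combine_by_gene(tss_scores)
top = top_fraction(combined)
print(top)

# The cut-off quoted in the text: 5% of 23,505 TSSs.
print(int(23_505 * 0.05))
```

In the toy data GeneA only reaches the top of the list once its two TSS scores are combined, which mirrors the Elf5 case.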

With respect to non-coding RNAs, there are 8 microRNAs in the top 5% of strongest association scores (Mir190, Mir145, Mir29b-1, Mir148a, Mir205, Mir802, Mir615, and Mir365-2). Prominent long non-coding (lnc) RNAs include Airn (175) and Kcnq1ot1 (672); both of these lncRNAs are involved in imprinting.