Analysis of ChIP-seq data
- Mapping and Post-processing
- Assessment of ChIP quality
- Generating bigWig datasets for display
- Calling peaks
- Inspecting peaks
- Galaxy history
This tutorial is slightly modified from the Galaxy ChIP-seq tutorial written by Anton Nekrutenko (https://galaxyproject.org/tutorials/chip/). That tutorial was inspired by efforts of Mo Heydarian and Mallory Freeberg. Tools highlighted here have been wrapped by Björn Grüning, Marius van den Beek and other IUC members. Dave Bouvier and Martin Cech helped fine tuning and deploying tools to Galaxy’s public server.
In this tutorial we will:
- pre-process sequencing reads
- map reads
- post-process mapped data
- assess quality and strength of ChIP-signal
- display coverage plots in a genome browser
- call ChIP peaks with
- inspect obtained calls
- look for sequence motifs within called peaks
- look at distribution of enriched regions across genes.
Datasets for this tutorial were generated in the lab of Frank Pugh.
For this analysis, we will be using ChIP-exo datasets. For this experiment, immunoprecipitation was performed with antibodies against Reb1. Reb1 recognizes a specific sequence (
TTACCCG), is involved in many aspects of transcriptional regulation by all three yeast RNA polymerases, and promotes formation of nucleosome-free regions (NFRs) (Hartley & Madhani:2009; Raisner:2005).
There are four datasets:
|Reb1_R1||ChIP experiment, Replicate 1|
|Input_R1||Input DNA, Replicate 1|
|Reb1_R2||ChIP experiment, Replicate 2|
|Input_R2||Input DNA, Replicate 2|
|Galaxy data library containing the reads. Here you can see two replicates (
After uploading datasets into your Galaxy history, we will combine all datasets into a single dataset collection. This will simplify downstream processing of the data. The process for creating a collection for this tutorial is is shown here.
Mapping and Post-processing
In this particular case, the data is of very high quality and do not need to be trimmed or postprocessed in any way before mapping. We will proceed by mapping all the data against the yeast genome
sacCer3 with BWA:
|Mapping all data at once. Note that Select input type is set to
For post-processing we will remove all non-uniquely mapped reads. This can be done by simply filtering out all reads with mapping quality less than
20 using NGS: SAMtools → Filter SAM or BAM:
|Filtering multi-mapped reads by restricting the data to reads with mapping quality above 20. Note that by selecting the folder () button, you can select an entire collection of BAM datasets to filter at once.|
Assessment of ChIP quality
After we map and filter the reads, it is time to make some inferences about how good the underlying data is.
Correlation among samples
In our experiment there are two replicates, each containing treatment and input (control) datasets. The first thing we can check is whether the samples are correlated (in other words if treatment and control samples across the two replicates contain this same kind of signal). To do this, we first generate a read count matrix using NGS: DeepTools → multiBamSummary.
|Running multiBAMsummary on a collection of BAM datasets (as before you can select a collection by pressing the folder () button).|
This tool breaks the genome into bins of fixed size (5,000 bp in our example) and computes the number of reads falling within each bin. You should use larger bins (e.g. 10,000bp) with a larger genome, but the yeast genome is only 12Mbp and regulatory signals are more compact. Here is a fragment of its output:
#'chr' 'start' 'end' 'Reb1_R1' 'Input_R1' 'Input_R1' 'Reb1_R2' chrVI 0 1000 19.0 41.0 3.0 6.0 chrVI 1000 2000 29.0 30.0 13.0 5.0 chrVI 2000 3000 0.0 0.0 0.0 0.0 chrVI 3000 4000 0.0 2.0 0.0 0.0 chrVI 4000 5000 7447.0 139.0 7.0 2645.0
we can then feed this matrix into NGS: DeepTools → plotCorrelation to generate heat map like this:
|B. Heatmap of four samples: Treatments (Reb1) and controls (Input) display relatively higher correlation among themselves.|
Here we can see that there are better correlations between replicates (between Reb1_R1 and Reb1_R2, and between input_R1 and input_R2), while correlations between treatments (Reb1) and controls (input) are weak. This is a good sign implying that there is some signal on our data.
Assessing signal strength
How do we tell if we have signal coming from ChIP enrichment? One way of doing this is the Signal Extraction Scaling (SES) approach proposed by Diaz:2012. SES works as follows. Suppose we have two datasets: ChIP and Input DNA. We divide the genome into N non-overlapping windows (N = 10 in the example below) and for each window compute the number of reads. This way we end up with two lists: one listing read counts for ChIP (ChIP list) and the other for Input (Input list):
Window ChIP-count Input-count ------------------------------- 1 3 3 2 4 3 3 2 1 4 1 3 5 3 3 6 27 2 7 18 3 8 2 2 9 45 3 10 8 3
We then sort the ChIP list in ascending order and move elements in the Input list to match this order:
Window ChIP-count Input-count ------------------------------- 4 1 3 3 2 1 8 2 2 1 3 3 5 3 3 2 4 3 10 8 3 7 18 3 6 27 2 9 45 3
Now let’s add another two columns to this dataset. These columns will show the cumulative percentages of reads up to each row for the ChIP and Input data. For example, 0.044 on row 3 is (1 + 2 + 2)/113 = 0.044.
1 2 3 4 5 ------------------------------- 4 1 3 0.008 0.115 3 2 1 0.026 0.153 8 2 2 0.044 0.230 1 3 3 0.070 0.346 5 3 3 0.097 0.230 2 4 3 0.132 0.576 10 8 3 0.203 0.692 7 18 3 0.362 0.807 6 27 2 0.601 0.884 9 45 3 1.000 1.000 ------------------------ 113 26 Where: 1 = Window, 2 = read count in ChIP, 3 = read count in Input 4 = % or read to this point in ChIP 5 = % of read to this point
In the matrix above a large portion of ChIP reads (column 4) is concentrated in the few bins close to the bottom. This is not the case for the input reads (column 5). If we plot two last columns of this matrix we will get a curve like this:
|SES plot for our toy example. Most “reads” in the ChIP experiment are concentrated in the last three bins.|
DeepTools provide a nice explanation of how the success of a ChIP experiment can be judged based on SES (also called fingerprint) plots:
|DeepTools explanation of SES plots.|
So let’s apply this to our own data using NGS: DeepTools → plotFingerprint:
|B. SES fingerprint of four samples: Treatments (Reb1) show a characteristic shape indicating ChIP signal. Approximately 30% of reads are contained in a small percentage of the genome.|
Generating bigWig datasets for display
In this section we will convert BAM files generated with
bwa into bigWig format that will allow us to view read coverage distribution across the genome. We will also “pre-warm” a genome browser for displaying peaks we will be calling in the next section.
Generating bigWig datasets
We will use NGS: DeepTools → bamCoverage:
|Running bamCoverage on a collection of filtered BAM datasets (as before you can select collection by pressing folder () button). Here we set Bin size to
Note that the default behavior of bamCoverage will not perform any pseudo-extensions to the read lengths. In ChIP-seq data, we often apply a pseudo-extension up to the expected ChIP fragment size length, which has the effect of smoothing the data plots. This is not typically appropriate for ChIP-exo data, as the smoothing removes the high-resolution nature of the experimental data. In fact, we often only examine the 5’ coordinate of mapped ChIP-exo reads. However, the default behavior of the bamCoverage tool (i.e. counting the full read length in terms of coverage) is fine for this example.
Displaying coverage tracks in a browser
Now we can display bigWig datasets generated in the previous section in a genome browser. There are a variety of available browsers. In this tutorial we will use the UCSC Browser.
|Collection of bigWigs produced by
Clicking this link in all four datasets (you will need to expand each dataset by clicking on it. This will expose the UCSC links) and focusing the browser on the RPA14 gene will produce the following image:
|Coverage display of one Reb1 replicate at the RPA14 gene.|
While the peaks shown in the browser screenshot above are pretty clear and consistent across the two replicates, looking at the entire genome in the browser is hardly a sustainable way to identify all peaks. There are several ways for identifying binding events genome-wide. They are summarized in the figure below:
|Outline of three ChIP-seq binding event detection methods. Peak-finding methods typically either shift the ChIP-seq tag locations in a 3′ direction by half the expected fragment length, or extend the length of the tag in a 3′ direction to be equal to the expected fragment length. Tags from opposite strands are merged to construct an unstranded tag density landscapes, and binding event locations are predicted from the locations with maximum tag coverage within each region that contains a significant enrichment of ChIP-seq tags (i.e. the peak summit). Peak-pairing methods [e.g. GeneTrack] build similar tag density landscapes, but retain strandedness information and typically do not shift or extend the tag locations. Peak locations are determined on each strand separately, and nearby peaks in the correct stranded orientation within a given distance are paired together. Binding event locations are predicted from the peak-pair midpoint locations. Probabilistic binding detection methods aim to estimate the locations of binding events that could have given rise to the observed ChIP-seq tag locations. These methods begin training with initial guesses of binding event locations and a model of how tags are expected to be distributed around real ChIP-seq binding events. During each training step, every ChIP-seq tag is probabilistically associated with nearby binding events, depending on the distance between the tag and the event location. Given these probabilistic tag assignments, binding event locations are updated to achieve a better fit with their associated tags, and the model of how tags are distributed around binding events is updated to reflect the accumulation of tags around all current binding events. During the training process, binding events with few assigned tags are weeded out of the model, and the process eventually converges to a set of final binding locations. (Figure and legend from Mahony and Pugh:2015).|
In this tutorials we will use the MACS2 peak caller.
How does MACS work?
MACS (or its current version
MACS2) performs several steps for calling peaks from paired treatment/control datasets:
|Steps of the MACS workflow (From Feng:2012).|
Here is a concise description of these steps:
- Removing redundancy - MACS retains uniquely mapped reads and removes reads that are repeatedly mapped to the same location. This reduces effects of PCR amplification biases during library preparation.
- Build model and estimate fragment size - one of the MACS inputs is the fragment size or bandwidth, which is approximate size of DNA fragments generated during fragmentation step of library preparation. MACS first slides a window sized at twice the bandwidth across the genome and finds instances where read counts enriched by between 10 and 30 fold relative to the genome background. It then randomly samples 1,000 of such regions and builds the model. To build the model, it separates reads mapping to each of the strands and builds two distributions (two modes). The midpoint between the two modes is the middle of the binding size and the distance between the modes is the average fragment size
d(see Figure below).
|Peaks mapped to two strands are treated separately to build two coverage density profiles - the two modes. The distance between the modes is the fragment size
Generate peaks - now that d has been defined, MACS slides a window of size 2d across the genome to identify regions significantly enriched in the ChIP sample. MACS assumes that background reads obey Poisson distribution. Thus, given the number of reads in a given interval within the control sample, we can calculate the probability of having the observed number of reads in the ChIP sample (e.g., see flood example here). This procedure is performed for several intervals around the examined location (2d, 1kb, 5kb, 10kb, and the whole genome) and the maximum value is chosen. One problem with this approach is that it only works if both samples (ChIP and control) are sequenced to the same depth, which is not usually happening in practice. To correct this, MACS scales down the larger sample.
Compute False Discovery Rate (FDR) - Feng:2012 explains computing FDR in MACS as follows: “When a control sample is available (and you should really always use it - AN), MACS can also estimate an empirical FDR for every peak by exchanging the ChIP-seq and control samples and identifying peaks in the control sample using the same set of parameters used for the ChIP-seq sample. Because the control sample should not exhibit read enrichment, any such peaks found by MACS can be regarded as false positives. For a particular P value threshold, the empirical FDR is then calculated as the number of control peaks passing the threshold divided by the number of ChIP-seq peaks passing the same threshold.”
In our case we have two replicates, each containing ChIP and input DNA samples. We will first run
MACS2 on pooled data (combining two ChIP samples and two inputs, respectively). We will then run
MACS2 on each replicate individually. Finally, we will pick a robust set of peaks present in all three callsets.
Splitting data into individual samples
One complication with the way we processed all data is that we have combined everything in a single dataset collection. MACS however will need for us to separate ChIP samples and controls. Fortunately for us we have set readgroups when we were mapping reads to the yeast genome. This will come handy for us right now because we will:
- merge the entire collection of mapped and filtered BAMs into a singe BAM dataset
- split this dataset into four separate BAM files using readgroups
- run MACS on resulting files.
First, to merge a collection of mapped, filtered BAM files into a single dataset we will use NGS: Picard → MergeSamFiles:
|Merging a collection with
Next, we will use NGS: SAMtools → Split to separate merged file into individual BAM files. Each resulting BAM file will contained aligned reads corresponding to original four datasets:
|Splitting BAM dataset on readgroups. This will produce four BAM datasets.|
|Resulting datasets. Each contains aligned reads from the four original conditions.|
Now it is time to run MACS2. First we will use the NGS: Peak calling → MACS2 predictd tool from the
MACS2 package. This tool will help us to find optimal parameters for running peak calling function of
This procedure will help us estimate the d parameter by performing the [cross-correlation] analysis between reads mapping to + and - strands. Let’s look at these results:
|Replicate 1||Replicate 2|
|Peak model and lag between strands.|
In the case of these data, peaks are very sharp and have narrow gap between them:
33 bp for replicate 1 and 2, respectively. We will use an average of these values,
--extsize parameter for calling peaks using NGS: Peak calling → MACS2 callpeak:
|Calling peaks with
|In this lower part of
If you set parameters as was shown above
MACS2 will produce two outputs (if it produced more just find the ones called
narrow peaks and
summits). Let’s click on the pencil icon() adjacent to
narrow peak datasets and rename then as shown below:
Next, we will run
MACS2 on BAM datasets for Replicate 1 only:
|Calling peaks with
- rename resulting datasets as `R1 summits` and `R1 peaks`
- run `MACS2` run on Replicate 2
- rename resulting `summits` and `narrow peak` datasets as `R2 summits` and `R2 peaks`.
In the end, you should have something like this:
What is in the output?
Looking at MACS2 data we have gotten the following numbers of peaks:
|Pooled||Replicate 1||Replicate 2|
Peaks data is generated in the following format:
1 2 3 4 5 6 7 8 9 10 -------------------------------------------------------------------- chrI 35 491 MACS2_peak_1 176 . 10.35332 21.51081 17.68957 101 chrI 87135 87212 MACS2_peak_2 127 . 7.71763 15.89278 12.78060 26 chrI 92612 92793 MACS2_peak_3 153 . 9.22373 18.72748 15.31966 49 chrI 119739 119782 MACS2_peak_4 78 . 6.08885 10.52482 7.82302 25
where columns are:
- Iterative id given by
- Integer score for display
- Strand (irrelevant in this case)
- Fold-change (fold enrichment for this peak summit against random Poisson distribution with local lambda)
- -log10P-value (e.g., 17.68 is 1 x 10-17)
- -log10Q-value from Benjamini–Hochberg–Yekutieli procedure
- Relative summit position to peak start
How many peaks are common between replicates?
To see how many peaks are common between the pooled datasets and the two replicates we will use Operate on Genomic Intervals → Join tool twice.
First we will join
Peaks pooled with
|Joining Pooled and R1 results with
Next we will join the result of the previous operation with
|Joining Pooled/R1 with R2 results with
This results in 722 regions are shared among polled, R1, and R2 peaks. Let’s call these the High confidence peaks. Before we can use them, however, let’s cut out only relevant columns. Since we have produced this dataset by joining three other datasets, it is three times wider (30 columns). To cut this first three columns we can use Text Manipulation → Cut columns tool:
|Cutting columns from
Next, we need to make sure that the output of the
Cut columns tool has the type
BED. Also specify the genome build as
sacCer3. To do this we will edit its metadata as show below:
|Setting metadata to datatype
Let’s look at everything in the browser
Let’s visualize Merged peaks as well as Narrow peaks and Summits produced by
MACS2 in the UCSC genome browser by clicking on
display at UCSC links adjacent to
Peaks pooled and
High confidence peaks datasets:
|An overview in UCSC. Here you can see original bigWig datasets along with predicted peaks.|
What sequence motifs are found within peaks
In this experiment, antibodies against Reb1 protein have been used for immunoprecipitaion. The recognition site for Reb1 is
TTACCCG (Badis:2008 and Harbison:2004). To find out which sequence motifs are found within our peaks, we first need to convert coordinates into underlying sequences. This is done using the BED tools → GetFastaBed tool:
|Extracting genomic DNA corresponding to ChIP-seq peaks. Here we use
Next, we need to make sure that all sequences are sufficiently long for finding patterns. MEME, the tools we will use to find motifs, requires sequences to be at least 8 nucleotides long. Besides, very short peak sequences are unlikely to be exactly overlapping the bound motif, so we will enforce minimum sequence length of 20bp using FASTA manipulation → Filter sequences by length tool:
|Filtering FASTA by length. Here we are removing all sequences shorted than 20 nucleotides.|
Now we can run Motif Tools → MEME:
|Running MEME on length-filtered FASTA sequences from the previous step. Note that Options configuration is set to
MEME generates a number of outputs. The most interesting is the HTML Report. It shows that MEME discovers the expected
TTACCCG motif at high frequency in the sequences:
|MEME Motif found from 619 sequences in the common peak regions.|
MEME can take a while to run, even on a relatively small set of sequences like this. Alternatively, you could also try running Motif Tools → MEME-ChIP, which performs a more comprehensive range of motif analyses and is suitable for analyzing large collections of sequences:
|Running MEME on length-filtered FASTA sequences.|
Unfortunately, the HTML file generated by MEME-ChIP is not viewable within Galaxy. However, you can download the results to your computer, unzip the file, and view the HTML file in a browser. Doing so should provide the following result:
|MEME-ChIP Results summarizing analysis of the common peak regions.|
Summarizing ChIP signal enrichment across all genes
How many genes contain upstream regions enriched in ChIP tags? This is often represented as a heatmap:
|Heatmap example from DeepTools documentation.|
To generate the heatmap, we must first produce normalized datasets for the two replicates we have. This is done using NGS: DeepTools → bamCompare tool:
Because we want to plot enrichment around genes we need to download gene annotation. We will use Get Data → UCSC Main for this:
|Getting data from UCSC. Here make sure you select assembly called
|Here just click Send query to Galaxy.|
Next, to prepare data necessary for drawing the heatmap we will use NGS: DeepTools → computeMatrix utility:
|Computing matrix - the data from which heatmap will be built. Here both normalized datasets are select within Score file box, yeast genes we have just downloaded from UCSC are chosen as Regions to plot. ‘reference-point is set as computeMatrix main option and, finally, upstream and downstream distances are set to 2,000 bp. Obviously you are welcome to play with these parameters.|
Finally, we can visualize the heatmap by using NGS: DeepTools → plotHeatmap tool:
|Drawing heatmap with
The resulting image shows that a significant fraction of the 6,692 genes present in the annotation data we have used contain Reb1 binding sites within their upstream regions:
|Heatmap showing distribution of Reb1 binding sites across upstream regions of 6,692 yeast genes.|
This entire analysis is available as a Galaxy history here. Import it and play with it. If things do not work - complain to the course instructors!