docs/Advanced_Step-by-step_Peak_Calling.md

   1 # Advanced Step-by-step peak calling using MACS3 commands
   2
   3 Over the years, I have got many emails from users asking if they can
   4 analyze their X-Seq (not ChIP-Seq) data using MACS, or if they can
   5 turn on or off some features in `callpeak` for their special needs. In
   6 most of cases, I would simply reply that they may have to find more
   7 dedicated tool for the type of your data, because the `callpeak`
   8 module is specifically designed and tuned for ChIP-Seq data. However,
   9 MACS3 in fact contains a suite of subcommands and if you can design a
  10 pipeline to combine them, you can control every single step and
  11 analyze your data in a highly customized way. In this tutorial, I show
  12 how the MACS3 main function `callpeak` can be decomposed into a
  13 pipeline containing MACS3 subcommands,
  14 including `filterdup`, `predictd`, `pileup`, `bdgcmp`, `bdgopt`,
  15 and `bdgpeakcall` (or `bdgbroadcall` in case of broad mark).  To
  16 analyze your special data in a special way, you may need to skip some
  17 of the steps or tweak some of the parameters of certain steps. Now
  18 let\'s suppose we are dealing with the two testing
  19 files `CTCF_ChIP_200K.bed.gz` and `CTCF_Control_200K.bed.gz`, that you
  20 can find in MACS3 github repository.
  21
  22 *Note, currently this tutorial is for single-end datasets. Please
  23 modify the command line for paired-end data by yourself.*
  24
  25 ## Step 1: Filter duplicates
  26
  27 In the first step of ChIP-Seq analysis by `callpeak`, ChIP and control
  28 data need to be read and the redundant reads at each genomic loci have
  29 to be removed. I won\'t go over the rationale, but just tell you how
  30 this can be done by `filterdup` subcommand. By default, the maximum
  31 number of allowed duplicated reads is 1, or `--keep-dup=1` for
  32 `callpeak`. To simulate this behavior, do the following:
  33
  34 `$ macs3 filterdup -i CTCF_ChIP_200K.bed.gz --keep-dup=1 -o CTCF_ChIP_200K_filterdup.bed`
  35 `$ macs3 filterdup -i CTCF_Control_200K.bed.gz --keep-dup=1 -o CTCF_Control_200K_filterdup.bed`
  36
  37 You can set different number for `--keep-dup` or let MACS3
  38 automatically decide the maximum allowed duplicated reads for each
  39 genomic loci for ChIP and control separately. Check `macs3 filterdup
  40 -h` for detail, and remember if you go with auto way, the genome size
  41 should be set accordingly. *Note*, in the output, MACS3 will give
  42 you the final number of reads kept after filtering, you\'d better
  43 write the numbers down since we need them when we have to scale the
  44 ChIP and control signals to the same depth. In this case, the number
  45 is 199583 for ChIP and 199867 for control, and the ratio between them
  46 is 199583/199867=.99858
  47
  48 ## Step 2: Decide the fragment length `d`
  49
  50 This is an important step for MACS3 to analyze ChIP-Seq and also for
  51 other types of data since the location of sequenced read may only tell
  52 you the end of a DNA fragment that you are interested in (such as TFBS
  53 or DNA hypersensitive regions), and you have to estimate how long this
  54 DNA fragment is in order to recover the actual enrichment. You can also
  55 regard this as a data smoothing technic. You know that `macs3 callpeak`
  56 will output something like, it can identify certain number of pairs of
  57 peaks and it can predict the fragment length, or `d` in MACS3
  58 terminology, using cross-correlation. In fact, you can also do this
  59 using `predictd` module. Normally, we only need to do this for ChIP
  60 data:
  61
  62 `
  63 $ macs3 predictd -i CTCF_ChIP_200K_filterdup.bed -g hs -m 5 50
  64 `
  65
  66 Here the `-g` (the genome size) need to be set according to your sample,
  67 and the mfold parameters have to be set reasonably. To simulate the
  68 default behavior of `macs3 callpeak`, set `-m 5 50`. Of course, you can
  69 tweak it. The output from `predictd` will tell you the fragment length
  70 `d`, and in this example, it is *254*. Write the number down on your
  71 notebook since we will need it in the next step. Of course, if you do
  72 not want to extend the reads or you have a better estimation on fragment
  73 length, you can simply skip this step.
  74
  75 ## Step 3: Extend ChIP sample to get ChIP coverage track
  76
  77 Now you have estimated the fragment length, next, we can use MACS3
  78 `pileup` subcommand to generate a pileup track in BEDGRAPH format for
  79 ChIP sample. Since we are dealing with ChIP-Seq data in this tutorial,
  80 we need to extend reads in 5\' to 3\' direction which is the default
  81 behavior of `pileup` function. If you are dealing with some DNAse-Seq
  82 data or you think the cutting site, that is detected by short read
  83 sequencing, is just in the `middle` of the fragment you are interested
  84 in, you need to use `-B` option to extend the read in both direction.
  85 Here is the command to simulate `callpeak` behavior:
  86
  87 `
  88 $ macs3 pileup -i CTCF_ChIP_200K_filterdup.bed -o CTCF_ChIP_200K_filterdup.pileup.bdg --extsize 254
  89 `
  90
  91 The file `CTCF_ChIP_200K_filterdup.pileup.bdg` now contains the
  92 fragment pileup signals for ChIP sample.
  93
  94 ## Step 4: Build local bias track from control
  95
  96 By default, MACS3 `callpeak` function computes the local bias by taking
  97 the maximum bias from surrounding 1kb (set by `--slocal`), 10kb (set by
  98 `--llocal`), the size of fragment length `d` (predicted as what you got
  99 from `predictd`), and the whole genome background. Here I show you how
 100 each of the bias is calculated and how they can be combined using the
 101 subcommands.
 102
 103 ### The `d` background
 104
 105 Basically, to create the background noise track, you need to extend the
 106 control read to both sides (-B option) using `pileup` function. The idea
 107 is that the cutting site from control sample contains the noise
 108 representing a region surrounding it. To do this, take half of `d` you
 109 got from `predictd`, 127 (1/2 of 254) for our example, then:
 110
 111 `
 112 $ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 127 -o d_bg.bdg
 113 `
 114
 115 The file `d_bg.bdg` contains the `d` background from control.
 116
 117 ### The slocal background
 118
 119 Next, you can create a background noise track of slocal local window, or
 120 1kb window by default. Simply imagine that each cutting site (sequenced
 121 read) represent a 1kb (default, you can tweak it) surrounding noise. So:
 122
 123 `
 124 $ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 500 -o 1k_bg.bdg
 125 `
 126
 127 Note, here 500 is the 1/2 of 1k. Because the ChIP signal track was built
 128 by extending reads into `d` size fragments, we have to normalize the 1kb
 129 noise by multiplying the values by `d/slocal`, which is 254/1000=0.254
 130 in our example. To do so, use the `bdgopt` subcommand:
 131
 132 `
 133 $ macs3 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg
 134 `
 135
 136 The file`1k_bg_norm.bdg` contains the slocal background from control.
 137 Note, we don\'t have to do this for `d` background because the
 138 multiplier is simply 1.
 139
 140 ### The llocal background
 141
 142 The background noise from larger region can be generated in the same way
 143 as slocal backgound. The only difference is the size for extension.
 144 MACS3 `callpeak` by default asks for 10kb (you can change this value)
 145 surrounding window, so:
 146
 147 `
 148 $ macs3 pileup -i CTCF_Control_200K_filterdup.bed -B --extsize 5000 -o 10k_bg.bdg
 149 `
 150
 151 The extsize has to be set as 1/2 of llocal. Then, the multiplier now is
 152 `d/llocal`, or 0.0254 in our example.
 153
 154 `
 155 $ macs3 bdgopt -i 10k_bg.bdg -m multiply -p 0.0254 -o 10k_bg_norm.bdg
 156 `
 157
 158 The file `10k_bg_norm.bdg` now contains the slocal background from
 159 control.
 160
 161 ### The genome background
 162
 163 The whole genome background can be calculated as
 164 `the_number_of_control_reads\fragment_length/genome_size`, and in our
 165 example, it is $199867*254/2700000000 ~= .0188023$. You don\'t need to
 166 run subcommands to build a genome background track since it\'s just a
 167 single value.
 168
 169 ### Combine and generate the maximum background noise
 170
 171 Now all the above background noises have to be combined and the maximum
 172 bias for each genomic location need be computed. This is the default
 173 behavior of MACS3 `callpeak`, but you can have your own pipeline to
 174 include some of them or even make more noise (such as 5k or 50k
 175 background) then include more tracks. Here is the way to combine and get
 176 the maximum bias.
 177
 178 Take the maximum between slocal (1k) and llocal (10k) background:
 179
 180 `
 181 macs3 bdgcmp -m max -t 1k_bg_norm.bdg -c 10k_bg_norm.bdg -o 1k_10k_bg_norm.bdg
 182 `
 183
 184 Then, take the maximum then by comparing with `d` background:
 185
 186 `
 187 macs3 bdgcmp -m max -t 1k_10k_bg_norm.bdg -c d_bg.bdg -o d_1k_10k_bg_norm.bdg
 188 `
 189
 190 Finally, combine with the genome wide background using `bdgopt`
 191 subcommand
 192
 193 `
 194 macs3 bdgopt -i d_1k_10k_bg_norm.bdg -m max -p .0188023 -o local_bias_raw.bdg
 195 `
 196
 197 Now the file `local_bias_raw.bdg` is a BEDGRAPH file containing the
 198 raw local bias from control data.
 199
 200 ## Step 5: Scale the ChIP and control to the same sequencing depth
 201
 202 In order to compare ChIP and control signals, the ChIP pileup and
 203 control lambda have to be scaled to the same sequencing depth. The
 204 default behavior in `callpeak` module is to scale down the larger sample
 205 to the smaller one. This will make sure the noise won\'t be amplified
 206 through scaling and improve the specificity of the final results. In our
 207 example, the final number of reads for ChIP and control are 199583 and
 208 199867 after filtering duplicates, so the control bias have to be scaled
 209 down by multiplying with the ratio between ChIP and control which is
 210 199583/199867=.99858. To do so:
 211
 212 `
 213 $ macs3 bdgopt -i local_bias_raw.bdg -m multiply -p .99858 -o local_lambda.bdg
 214 `
 215
 216 Now, I name the output file as `local_lambda.bdg` since the values in
 217 the file can be regarded as the lambda (or expected value) and can be
 218 compared with ChIP signals using the local Poisson test.
 219
 220 ## Step 6: Compare ChIP and local lambda to get the scores in pvalue or qvalue
 221
 222 Next, to find enriched regions and predict the so-called \'peaks\',
 223 the ChIP signals and local lambda stored in BEDGRAPH file have to be
 224 compared using certain statistic model. To do so, you need to use
 225 `bdgcmp` module, which will output score for each basepair in the
 226 genome. You may wonder it may need a huge file to save score for each
 227 basepair in the genome, however the format BEDGRAPH can deal with the
 228 problem by merging nearby regions with the same score. So
 229 theoratically, the size of the output file for scores depends on the
 230 complexity of your data, and the maximum number of data points, if
 231 `d`, `slocal`, and `llocal` background are all used, is the minimum
 232 value of the genome size and approximately
 233 `(#read_ChIP+#reads_control\*3)\*2`, in our case about 1.6 million.
 234 The command to generate score tracks is
 235
 236 `
 237 $ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_lambda.bdg -m qpois -o CTCF_ChIP_200K_qvalue.bdg
 238 `
 239 or
 240
 241 `
 242 $ macs3 bdgcmp -t CTCF_ChIP_200K_filterdup.pileup.bdg -c local_bias.bdg -m ppois -o CTCF_ChIP_200K_pvalue.bdg
 243 `
 244
 245 The `CTCF_ChIP_200K_pvalue.bdg` or `CTCF_ChIP_200K_qvalue.bdg` file
 246 contains the -log10(p-value)s or -log10(q-value)s for each basepair
 247 through local Poisson test, which means the ChIP signal at each basepair
 248 will be tested against the corresponding local lambda derived from
 249 control with Poisson model. *Note*, if you follow this tutorial, then
 250 you won\'t get any `0` in the local lambda track because the smallest
 251 value is the whole genome background; however if you do not include
 252 genome background, you will see many `0`s in local lambda which will
 253 crash the Poisson test. In this case, you need to set the
 254 `pseudocount` for `bdgcmp` through `-p` option. The pseudocount will
 255 be added to both ChIP and local lambda values before Poisson test. The
 256 choice of pseudocount is mainly arbitrary and you can search on the web
 257 to see some discussion. But in general, higher the pseudocount, higher
 258 the specificity and lower the sensitivity.
 259
 260 ## Step 7: Call peaks on score track using a cutoff
 261
 262 The final task of peak calling is to just take the scores and call those
 263 regions higher than certain cutoff. We can use the `bdgpeakcall`
 264 function for narrow peak calling and `bdgrroadcall` for broad peak
 265 calling, and I will only cover the usage of `bdgpeakcall` in this
 266 tutorial. It looks simple however there are extra two parameters for the
 267 task. First, if two nearby regions are both above cutoff but the region
 268 in-between is lower, and if the region in-between is small enough, we
 269 should merge the two nearby regions together into a bigger one and
 270 tolerate the fluctuation. This value is set as the read length in MACS3
 271 `callpeak` function since the read length represent the resolution of
 272 the dataset. As for `bdgpeakcall`, you need to get the read length
 273 information from Step 1 or by checking the raw fastq file and set the
 274 `-g` option. Secondly, we don\'t want to call too many small `peaks`
 275 so we need to have a minimum length for the peak. MACS3 `callpeak`
 276 function automatically uses the fragment size `d` as the parameter of
 277 minimum length of peak, and as for `bdgpeakcall`, you need to set the
 278 `-l` option as the `d` you got from Step 2. Last, you have to set the
 279 cutoff value. Remember the scores in the output from `bdgcmp` are in
 280 -log10 form, so if you need the cutoff as 0.05, the -log10 value is
 281 about 1.3. The final command is like:
 282
 283 `
 284 $ macs3 bdgpeakcall -i CTCF_ChIP_200K_qvalue.bdg -c 1.301 -l 245 -g 100 -o CTCF_ChIP_200K_peaks.bed
 285 `
 286
 287 The output is in fact a narrowPeak format file (a type of BED file)
 288 which contains locations of peaks and the summit location in the last
 289 column.
 290
 291