docs/callpeak.md

   1 # callpeak
   2
   3 ## Overview
   4 This is the main function in MACS3. It will take alignment files in
   5 various format (please check the detail below) and call the
   6 significantly enriched regions in the genome as 'peaks'.  It can be
   7 invoked by `macs3 callpeak` . If you type this command with `-h`, you
   8 will see a full description of command-line options. Here we only list
   9 the essentials.
  10
  11 ## Essential Commandline Options
  12
  13 ### Input and Output
  14
  15 - `-t`/`--treatment`
  16
  17   This is the only REQUIRED parameter for MACS3. The file can be in
  18   any supported format -- see detail in the `--format` option. If you
  19   have more than one alignment file, you can specify them as `-t A B
  20   C`.  MACS3 will pool up all these files together.
  21
  22 - `-c`/`--control`
  23
  24   The control, genomic input or mock IP data file. Please follow the
  25   same direction as for `-t`/`--treatment`.
  26
  27 - `-n`/`--name`
  28
  29   The name string of the experiment. MACS3 will use this string NAME
  30   to create output files like `NAME_peaks.xls`,
  31   `NAME_negative_peaks.xls`, `NAME_peaks.bed` , `NAME_summits.bed`,
  32   `NAME_model.r` and so on. So please avoid any confliction between
  33   these filenames and your existing files.
  34
  35 - `-f`/`--format FORMAT`
  36
  37   Format of tag file can be `ELAND`, `BED`, `ELANDMULTI`,
  38   `ELANDEXPORT`, `SAM`, `BAM`, `BOWTIE`, `BAMPE`, or `BEDPE`. Default
  39   is `AUTO` which will allow MACS3 to decide the format
  40   automatically. `AUTO` is also useful when you combine different
  41   formats of files. Note that MACS3 can't detect `BAMPE` or `BEDPE`
  42   format with `AUTO`, and you have to implicitly specify the format
  43   for `BAMPE` and `BEDPE`.
  44
  45   Nowadays, the most common formats are `BED` or `BAM` (including
  46   `BEDPE` and `BAMPE`). Our recommendation is to convert your data to
  47   `BED` or `BAM` first.
  48
  49   Also, MACS3 can detect and read gzipped file. For example, `.bed.gz`
  50   file can be directly used without being uncompressed with `--format
  51   BED`.
  52
  53   Here are detailed explanation of the recommended formats:
  54
  55   - `BED`
  56
  57     The BED format can be found at [UCSC genome browser
  58     website](http://genome.ucsc.edu/FAQ/FAQformat#format1).
  59
  60     The essential columns in BED format input are the 1st column
  61     `chromosome name`, the 2nd `start position`, the 3rd `end
  62     position`, and the 6th, `strand`.
  63
  64     Note that, for `BED` format, the 6th column of strand information
  65     is required by MACS3. And please pay attention that the
  66     coordinates in BED format are zero-based and half-open. See more
  67     detail at [UCSC
  68     site](http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).
  69
  70   - `BAM`/`SAM`
  71
  72     If the format is `BAM`/`SAM`, please check the definition in
  73     [samtools](https://samtools.github.io/hts-specs/SAMv1.pdf).  If
  74     the `BAM` file is generated for paired-end data, MACS3 will only
  75     keep the left mate(5' end) tag. However, when format `BAMPE` is
  76     specified, MACS3 will use the real fragments inferred from
  77     alignment results for reads pileup.
  78
  79   - `BEDPE` or `BAMPE`
  80
  81     A special mode will be triggered while the format is specified as
  82     `BAMPE` or `BEDPE`. In this way, MACS3 will process the `BAM` or
  83     `BED` files as paired-end data. Instead of building a bimodal
  84     distribution of plus and minus strand reads to predict fragment
  85     size, MACS3 will use actual insert sizes of pairs of reads to
  86     build fragment pileup.
  87
  88     The `BAMPE` format is just a `BAM` format containing paired-end
  89     alignment information, such as those from `BWA` or `BOWTIE`.
  90
  91     The `BEDPE` format is a simplified and more flexible `BED` format,
  92     which only contains the first three columns defining the
  93     chromosome name, left and right position of the fragment from
  94     Paired-end sequencing. Please note, this is NOT the same format
  95     used by `bedtools`, and the `bedtools` version of `BEDPE` is
  96     actually not in a standard `BED` format. You can use MACS3
  97     subcommand [`randsample`](./randsample.md) or
  98     [`filterdup`](./filterdup.md) to convert a `BAMPE` file containing
  99     paired-end information to a `BEDPE` format file:
 100
 101     ```
 102     macs3 randsample -i the_BAMPE_file.bam -f BAMPE -p 100 -o the_BEDPE_file.bed
 103     ```
 104         or
 105
 106         ```
 107     macs3 filterdup -i the_BAMPE_file.bam -f BAMPE --keep-dup all -o the_BEDPE_file.bed
 108     ```
 109
 110
 111 - `--outdir`
 112
 113   MACS3 will save all output files into the specified folder for this
 114   option. A new folder will be created if necessary.
 115
 116 - `-B`/`--bdg`
 117
 118   If this flag is on, MACS3 will store the fragment pileup, control
 119   lambda in bedGraph files. The bedGraph files will be stored in the
 120   current directory named `NAME_treat_pileup.bdg` for treatment data,
 121   `NAME_control_lambda.bdg` for local lambda values from control.
 122
 123 ### Options controling peak calling behaviors
 124
 125 - `-g`/`--gsize`
 126
 127   It's the mappable genome size or effective genome size which is
 128   defined as the genome size which can be sequenced. Because of the
 129   repetitive features on the chromosomes, the actual mappable genome
 130   size will be smaller than the original size, about 90% or 70% of the
 131   genome size. The default *hs* ~2.9e9 is recommended for human
 132   genome. Here are all precompiled parameters for effective genome
 133   size from
 134   [deeptools](https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html):
 135
 136   * hs: 2,913,022,398 for GRCh38
 137   * mm: 2,652,783,500 for GRCm38
 138   * ce: 100,286,401 for WBcel235
 139   * dm: 142,573,017 for dm6
 140
 141   Please check deeptools webpage to find the appropriate effective
 142   genome size if you want a more accurate estimation regarding
 143   specific assembly and read length.
 144
 145   Users may want to use k-mer tools to simulate mapping of Xbps long
 146   reads to target genome, and to find the ideal effective genome
 147   size. However, usually by taking away the simple repeats and Ns from
 148   the total genome, one can get an approximate number of effective
 149   genome size. A slight difference in the number won't cause a big
 150   difference of peak calls, because this number is used to estimate a
 151   genome-wide noise level which is usually the least significant one
 152   compared with the *local biases* modeled by MACS3.
 153
 154 - `-s`/`--tsize`
 155
 156   The size of sequencing tags. If you don't specify it, MACS3 will try
 157   to use the first 10 sequences from your input treatment file to
 158   determine the tag size. Specifying it will override the
 159   automatically determined tag size.
 160
 161 - `-q`/`--qvalue`
 162
 163   The q-value (minimum FDR) cutoff to call significant
 164   regions. Default is 0.05. For broad marks, you can try 0.01 as the
 165   cutoff. The q-values are calculated from p-values using the
 166   [Benjamini-Hochberg
 167   procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure).
 168
 169 - `-p`/`--pvalue`
 170
 171   The p-value cutoff. If `-p` is specified, MACS3 will use p-value
 172   instead of q-value.
 173
 174 - `--min-length`, `--max-gap`
 175
 176   These two options can be used to fine-tune the peak calling behavior
 177   by specifying the minimum length of a called peak and the maximum
 178   allowed a gap between two nearby regions to be merged. In other
 179   words, a called peak has to be longer than `min-length`, and if the
 180   distance between two nearby peaks is smaller than `max-gap` then
 181   they will be merged as one. If they are not set, MACS3 will set the
 182   DEFAULT value for `min-length` as the predicted fragment size `d`,
 183   and the DEFAULT value for `max-gap` as the detected read
 184   length. Note, if you set a `min-length` value smaller than the
 185   fragment size, it may have NO effect on the result. For broad peak
 186   calling with `--broad` option set, the DEFAULT `max-gap` for merging
 187   nearby stronger peaks will be the same as narrow peak calling, and 4
 188   times of the `max-gap` will be used to merge nearby weaker (broad)
 189   peaks. You can also use `--cutoff-analysis` option with the default
 190   setting, and check the column `avelpeak` under different cutoff
 191   values to decide a reasonable `min-length` value.
 192
 193 - `--nolambda`
 194
 195   With this flag on, MACS3 will use the background lambda as local
 196   lambda. This means MACS3 will not consider the local bias at peak
 197   candidate regions. It is particularly recommended while calling
 198   peaks without control sample.
 199
 200 - `--slocal`, `--llocal`
 201
 202   These two parameters control which two levels of regions will be
 203   checked around the peak regions to calculate the maximum lambda as
 204   local lambda. By default, MACS3 considers 1000bp for small local
 205   region(`--slocal`), and 10000bps for large local region(`--llocal`)
 206   which captures the bias from a long-range effect like an open
 207   chromatin domain. You can tweak these according to your
 208   project. Remember that if the region is set too small, a sharp spike
 209   in the input data may kill a significant peak.
 210
 211 - `--nomodel`
 212
 213   While on, MACS3 will bypass building the shifting model. Please
 214   combine the usage of `--extsize` and `--shift` to achieve the effect
 215   you expect.
 216
 217 - `--extsize`
 218
 219   While `--nomodel` is set, MACS3 uses this parameter to extend reads
 220   in 5'->3' direction to fix-sized fragments. For example, if the size
 221   of the binding region for your transcription factor is 200 bp, and
 222   you want to bypass the model building by MACS3, this parameter can
 223   be set as 200. This option is only valid when `--nomodel` is set or
 224   when MACS3 fails to build model and `--fix-bimodal` is on.
 225
 226 - `--shift`
 227
 228   Note, this is NOT the legacy `--shiftsize` option which is replaced
 229   by `--extsize` from MACS version 2! You can set an arbitrary shift
 230   in bp here to adjust the alignment positions of reads in the whole
 231   library. Please use discretion while setting it other than the
 232   default value (0). When `--nomodel` is set, MACS3 will use this
 233   value to move cutting ends (5') then apply `--extsize` from 5' to 3'
 234   direction to extend them to fragments. When this value is negative,
 235   the cutting ends (5') will be moved toward 3'->5' direction,
 236   otherwise 5'->3' direction. Recommended to keep it as default 0 for
 237   ChIP-Seq datasets, or -1 * half of *EXTSIZE* together with
 238   `--extsize` option for detecting enriched cutting loci such as
 239   certain DNAseI-Seq datasets. Note, you can't set values other than 0
 240   if the format is BAMPE or BEDPE for paired-end data. The default is
 241   0.
 242
 243   Here are some examples for combining `--shift` and `--extsize`:
 244
 245   1. To find enriched cutting sites such as some DNAse-Seq
 246     datasets. In this case, all 5' ends of sequenced reads should be
 247     extended in both directions to smooth the pileup signals. If the
 248     wanted smoothing window is 200bps, then use `--nomodel --shift
 249     -100 --extsize 200`.
 250
 251   2. For certain nucleosome-seq data, we need to pile up the centers
 252     of nucleosomes using a half-nucleosome size for wavelet analysis
 253     (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about
 254     147bps, this option can be used: `--nomodel --shift 37 --extsize
 255     73`.
 256
 257 - `--keep-dup`
 258
 259   It controls the MACS3 behavior towards duplicate tags at the exact
 260   same location -- the same coordination and the same strand. You can
 261   set this as `auto`, `all`, or an integer value. The `auto` option
 262   makes MACS3 calculate the maximum tags at the exact same location
 263   based on binomial distribution using 1e-5 as p-value cutoff; and the
 264   `all` option keeps every tag.  If an integer is given, at most this
 265   number of tags will be kept at the same location. The default is to
 266   keep one tag at the same location. Default: 1
 267
 268 - `--broad`
 269
 270   This option, along with the `bdgbroadcall` command, facilitates
 271   broad peak calling, producing results in the UCSC gappedPeak format
 272   which encapsulates a nested structure of peaks. To conceptualize
 273   'nested' peaks, picture a gene structure housing regions analogous
 274   to exons (strong peaks) and introns coupled with UTRs (weak
 275   peaks). The broad peak calling process utilizes two distinct cutoffs
 276   to discern broader, weaker peaks (`--broad-cutoff`) and narrower,
 277   stronger peaks (`-p` or `-q`), which are subsequently nested to
 278   provide a detailed peak landscape. Please note that, the `max-gap`
 279   value for merging nearby weaker/broad peaks is 4 times of `max-gap`
 280   for merging nearby stronger peaks. The later one can be controlled
 281   by `--max-gap` option, and by default it is the average
 282   fragment/insertion length in the PE data. DEFAULT: False
 283
 284   Please note that, if you only want to call 'broader' peak and not
 285   interested in the nested peak structure, please simply use `-p` or
 286   `-q` with weaker cutoff instead of using `--broad` option.
 287
 288 - `--broad-cutoff`
 289
 290   Cutoff for the broad region. This option is not available unless
 291   `--broad` is set. Please note that if `-p` is set, this is a p-value
 292   cutoff, otherwise, it's a q-value cutoff.  DEFAULT: 0.1
 293
 294 - `--scale-to <large|small>`
 295
 296   When set to `large`, linearly scale the smaller dataset to the same
 297   depth as the larger dataset. By default or being set as `small`, the
 298   larger dataset will be scaled towards the smaller dataset. Beware,
 299   to scale up small data would cause more false positives. So the
 300   default behavior `small` is recommended.
 301
 302 - `--call-summits`
 303
 304   MACS3 will now reanalyze the shape of signal profile (p or q-score
 305   depending on the cutoff setting) to deconvolve subpeaks within each
 306   peak called from the general procedure. It's highly recommended to
 307   detect adjacent binding events. While used, the output subpeaks of a
 308   big peak region will have the same peak boundaries, and different
 309   scores and peak summit positions.
 310
 311 ### Other options
 312
 313 - `--buffer-size`
 314
 315   MACS3 uses a buffer size for incrementally increasing internal array
 316   size to store reads alignment information for each chromosome or
 317   contig. To increase the buffer size, MACS3 can run faster but will
 318   waste more memory if certain chromosome/contig only has very few
 319   reads. In most cases, the default value 100000 works fine. However,
 320   if there are a large number of chromosomes/contigs in your alignment
 321   and reads per chromosome/contigs are few, it's recommended to
 322   specify a smaller buffer size in order to decrease memory usage (but
 323   it will take longer time to read alignment files). Minimum memory
 324   requested for reading an alignment file is about # of CHROMOSOME *
 325   BUFFER_SIZE * 8 Bytes. DEFAULT: 100000
 326
 327 - `--cutoff-analysis`
 328
 329   While set, MACS3 will analyze the number or total length of peaks
 330   that can be called by different cutoff then output a summary table
 331   to help the user decide a better cutoff. Note, minlen and maxgap may
 332   affect the results. DEFAULT: False
 333
 334   Different with the option in `bdgpeakcall`, `callpeak` will perform
 335   both tasks to call peaks and to generate a report for cutoff
 336   analysis. Please check the section *Cutoff Analysis* for more
 337   detail.
 338
 339 ## Output files
 340
 341 1. `NAME_peaks.xls` is a tabular file which contains information about
 342    called peaks. You can open it in excel and sort/filter using excel
 343    functions. Information include:
 344
 345    - chromosome name
 346    - start position of peak
 347    - end position of peak
 348    - length of peak region
 349    - absolute peak summit position
 350    - pileup height at peak summit
 351    - -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then
 352       this value should be 10)
 353    - fold enrichment for this peak summit against random Poisson
 354       distribution with local lambda,
 355    - -log10(qvalue) at peak summit
 356
 357    Coordinates in XLS is 1-based which is different from BED
 358    format. When `--broad` is enabled for broad peak calling, the
 359    pileup, p-value, q-value, and fold change in the XLS file will be
 360    the mean value across the entire peak region, since peak summit
 361    won't be called in broad peak calling mode.
 362
 363 2. `NAME_peaks.narrowPeak` is BED6+4 format file which contains the
 364    peak locations together with peak summit, p-value, and q-value. You
 365    can load it to the UCSC genome browser. Definition of some specific
 366    columns are:
 367
 368    - 5th: integer score for display. It's calculated as
 369      `int(-10*log10pvalue)` or `int(-10*log10qvalue)` depending on
 370      whether `-p` (pvalue) or `-q` (qvalue) is used as score
 371      cutoff. Please note that currently this value might be out of the
 372      [0-1000] range defined in [UCSC ENCODE narrowPeak
 373      format](https://genome.ucsc.edu/FAQ/FAQformat.html#format12). You
 374      can let the value saturated at 1000 (i.e. p/q-value = 10^-100) by
 375      using the following 1-liner awk: `awk -v OFS="\t"
 376      '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeak`
 377    - 7th: fold-change at peak summit
 378    - 8th: -log10pvalue at peak summit
 379    - 9th: -log10qvalue at peak summit
 380    - 10th: relative summit position to peak start
 381
 382    The file can be loaded directly to the UCSC genome browser. Remove
 383    the beginning track line if you want to analyze it by other tools.
 384
 385 3. `NAME_summits.bed` is in BED format, which contains the peak
 386    summits locations for every peak. The 5th column in this file is
 387    the same as what is in the `narrowPeak` file. If you want to find
 388    the motifs at the binding sites, this file is recommended. The file
 389    can be loaded directly to the UCSC genome browser. Remove the
 390    beginning track line if you want to analyze it by other tools.
 391
 392 4. `NAME_peaks.broadPeak` is in BED6+3 format which is similar to
 393    `narrowPeak` file, except for missing the 10th column for
 394    annotating peak summits. This file and the `gappedPeak` file will
 395    only be available when `--broad` is enabled. Since in the broad
 396    peak calling mode, the peak summit won't be called, the values in
 397    the 5th, and 7-9th columns are the mean value across all positions
 398    in the peak region. Refer to `narrowPeak` if you want to fix the
 399    value issue in the 5th column.
 400
 401 5. `NAME_peaks.gappedPeak` is in BED12+3 format which contains both
 402    the broad region and narrow peaks. The 5th column is the score for
 403    showing grey levels on the UCSC browser as in `narrowPeak`. The 7th
 404    is the start of the first narrow peak in the region, and the 8th
 405    column is the end. The 9th column should be RGB color key, however,
 406    we keep 0 here to use the default color, so change it if you
 407    want. The 10th column tells how many blocks including the starting
 408    1bp and ending 1bp of broad regions. The 11th column shows the
 409    length of each block and 12th for the start of each block. 13th:
 410    fold-change, 14th: *-log10pvalue*, 15th: *-log10qvalue*. The file can
 411    be loaded directly to the UCSC genome browser. Refer to
 412    `narrowPeak` if you want to fix the value issue in the 5th column.
 413
 414 6. `NAME_model.r` is an R script which you can use to produce a PDF
 415    image of the model based on your data. Load it to R by:
 416
 417    `$ Rscript NAME_model.r`
 418
 419    Then a pdf file `NAME_model.pdf` will be generated in your current
 420    directory. Note, R is required to draw this figure.
 421
 422 7. The `NAME_treat_pileup.bdg` and `NAME_control_lambda.bdg` files are
 423    in bedGraph format which can be imported to the UCSC genome browser
 424    or be converted into even smaller bigWig files. The
 425    `NAME_treat_pielup.bdg` contains the pileup signals (normalized
 426    according to `--scale-to` option) from ChIP/treatment sample. The
 427    `NAME_control_lambda.bdg` contains local biases estimated for each
 428    genomic location from the control sample, or from treatment sample
 429    when the control sample is absent. The subcommand `bdgcmp` can be
 430    used to compare these two files and make a bedGraph file of scores
 431    such as p-value, q-value, log-likelihood, and log fold changes.
 432
 433 ## Cutoff Analysis
 434
 435 Since cutoff can be an arbitrary value during peak calling, there are
 436 many approaches proposed in the community to guide the cutoff
 437 selection such as the [IDR
 438 approach](https://doi.org/doi:10.1214%2F11-AOAS466). In MACS3, we
 439 provide a simple way to do the cutoff analysis. The cutoff analysis
 440 function is provided by `--cutoff-analysis` option in `callpeak`,
 441 `bdgpeakcall`, and `hmmratac`. Among them, the function in
 442 `bdgpeakcall` is more flexible and can be applied on any scoring
 443 scheme. We will sperate this function into a dedicated subcommand in
 444 the future.
 445
 446 Please note that if this `--cutoff-anlaysis` option is on, the report
 447 will be written into a file named `NAME_cutoff_analysis.txt`.
 448
 449 When the option is on, we will generate a list of possible pvalue
 450 cutoffs to check from pscore cutoff from 0.3 to 10, with a step of
 451 0.3. When -log10(pvalue) is 0.3, it represents an extremely loose
 452 cutoff pvalue 0.5; and when it's 10, it represents an extremely
 453 strigent cutoff pvalue 1e-10. Please note that the is different with
 454 `bdgpeakcall` where users can control how the cutoff should be
 455 calculated.
 456
 457 Then for each cutoff we plan to investigate, we will check the number
 458 of peaks that can be called, their average peak length, and their
 459 total length.
 460
 461 The report consists of four columns:
 462
 463 1. score: the possible fold change cutoff value.
 464 2. npeaks: the number of peaks under this cutoff.
 465 3. lpeaks: the total length of all peaks.
 466 4. avelpeak: the average length of peaks.
 467
 468 While there's no universal rule to suggest the best cutoff, here are a
 469 few suggestions:
 470
 471 - You can use elbow analysis to find the cutoff that dramatically
 472    change the trend of npeaks, lpeaks, or avelpeak. But you need to
 473    think about how to define 'dramatical change'.
 474 - You can use some common expectation to decide the cutoff. For
 475    example, the number of peaks should be thousands/ or the avelpeak
 476    should be around 500bps. Of course, it's arbitrary but the table
 477    will give you some insight.