docs/hmmratac.md

   1 # hmmratac
   2
   3 ## Description
   4
   5 HMMRATAC (`macs3 hmmratac`) is a dedicated peak calling algorithm
   6 based on Hidden Markov Model for ATAC-seq data. The basic idea behind
   7 HMMRATAC is to digest ATAC-seq data according to the fragment length
   8 of read pairs into four signal tracks: short fragments,
   9 mono-nucleosomal fragments, di-nucleosomal fragments and
  10 tri-nucleosomal fragments. Then integrate the four tracks using Hidden
  11 Markov Model to consider three hidden states: open region, nucleosomal
  12 region, and background region. The [orginal
  13 paper](https://academic.oup.com/nar/article/47/16/e91/5519166) was
  14 published in 2019, and the original software was written in JAVA, by
  15 the then PhD student Evan Tarbell, a mohawk bioinformatician. In MACS3
  16 project, we implemented HMMRATAC idea in Python/Cython and optimize
  17 the whole process using existing MACS functions and hmmlearn.
  18
  19 Here's an example of how to run the `hmmratac` command:
  20
  21 ```
  22 $ macs3 hmmratac -i yeast.bam -n yeast
  23 ```
  24
  25 or with the BEDPE format
  26
  27 ```
  28 $ macs3 hmmratac -i yeast.bedpe.gz -f BEDPE -n yeast
  29 ```
  30
  31 Note: you can convert BAMPE to BEDPE by using
  32
  33 ```
  34 $ macs3 filterdup --keep-dup all -f BAMPE -i yeast.bam -o yeast.bedpe
  35 ```
  36
  37 Please use `macs3 hmmratac --help` to see all the options. Here we
  38 list the essential ones.
  39
  40 ## Essential Options
  41
  42 ### `-i INPUT_FILE [INPUT_FILE ...]` / `--input INPUT_FILE [INPUT_FILE ...]`
  43
  44 This is the only REQUIRED parameter for `hmmratac`. Input files
  45 containing the aligment results for ATAC-seq paired end reads. If
  46 multiple files are given as '-t A B C', then they will all be read and
  47 pooled together. The file should be in BAMPE or BEDPE format (aligned
  48 in paired end mode). Files can be gzipped. Note: all files should be
  49 in the same format. REQUIRED.
  50
  51 ### `-f {BAMPE,BEDPE}` / `--format {BAMPE,BEDPE}`
  52
  53 Format of input files, "BAMPE" or "BEDPE". If there are multiple
  54 files, they should be in the same format -- either BAMPE or
  55 BEDPE. Please note that the BEDPE only contains three columns --
  56 chromosome, left position of the whole pair, right position of the
  57 whole pair-- and is NOT the same BEDPE format used by BEDTOOLS. To
  58 convert BAMPE to BEDPE,  you can use this command `macs3 filterdup
  59 --keep-dup all -f BAMPE -i input.bam -o output.bedpe`. DEFAULT:
  60 "BAMPE".
  61
  62 ### `--outdir OUTDIR`
  63
  64 If specified all output files will be written to that
  65 directory. Default: the current working directory
  66
  67 ### `-n NAME`/ `--name NAME`
  68 Name for this experiment, which will be used as a prefix to generate
  69 output file names. DEFAULT: "NA"
  70
  71 ### `--modelonly`
  72  This option will only generate the HMM model as a JSON file and
  73  quit. This model can then be applied using the `--model`
  74  option. Default: False
  75
  76 ### `--model`
  77  If provided, HMM training will be skipped and a JSON file generated
  78  from a previous HMMRATAC run will be used instead of creating new
  79  one. Default: NA
  80
  81 ### `-t HMM_TRAINING_REGIONS` / `--training HMM_TRAINING_REGIONS`
  82  Customized training regions can be provided through this option. `-t`
  83  takes the filename of training regions (previously was BED_file) to
  84  use for training HMM, instead of using foldchange settings to
  85  select. Default: NA
  86
  87 ### `--min-frag-p MIN_FRAG_P`
  88  We will exclude the abnormal fragments that can't be assigned to any
  89  of the four signal tracks. After we use EM to find the means and
  90  stddevs of the four distributions, we will calculate the likelihood
  91  that a given fragment length fit any of the four using normal
  92  distribution. The criteria we will use is that if a fragment length
  93  has less than MIN_FRAG_P probability to be like either of short,
  94  mono, di, or tri-nuc fragment, we will exclude it while generating
  95  the four signal tracks for later HMM training and prediction. The
  96  value should be between 0 and 1. Larger the value, more abnormal
  97  fragments will be allowed. So if you want to include more 'ideal'
  98  fragments, make this value smaller. Default = 0.001
  99
 100 ### `--cutoff-analysis-only`
 101
 102  Only run the cutoff analysis and output a report. After generating
 103  the report, the process will stop. The report will help user decide
 104  the three crucial parameters for `-l`, `-u`, and `-c`. So it's highly
 105  recommanded to run this first! Please read the report and
 106  instructions in [Choices of cutoff values](#choices-of-cutoff-values)
 107  on how to decide the three crucial parameters.
 108
 109 ### `--cutoff-analysis-steps`
 110
 111 Steps for performing cutoff analysis. It will be used to decide which
 112 cutoff value should be included in the final report. Larger the value,
 113 higher resolution the cutoff analysis can be. The cutoff analysis
 114 function will first find the smallest and the largest foldchange score
 115 in the data, then break the range of foldchange score into
 116 `CUTOFF_ANALYSIS_STEPS` intervals. It will then use each foldchange
 117 score as cutoff to call peaks and calculate the total number of
 118 candidate peaks, the total basepairs of peaks, and the average length
 119 of peak in basepair. Please note that the final report ideally should
 120 include `CUTOFF_ANALYSIS_STEPS` rows, but in practice, if the
 121 foldchange cutoff yield zero peak, the row for that foldchange value
 122 won't be included.  DEFAULT: 100
 123
 124 ### `--hmm-type`
 125
 126 We provide two types of emissions for the Hidden Markov Model -- the
 127 Gaussian model and the Poisson model. By default, the Gaussian
 128 emission will be used (as `--hmm-type gaussian`). To choose Poisson
 129 emission, use `--hmm-type poisson`. The Gaussian emission can be
 130 described by mean and variance for each state, while the simpler
 131 Poisson only needs the lambda value. The difference can be found in
 132 the saved json file for HMM.
 133
 134 ### `-u HMM_UPPER` / `--upper HMM_UPPER`
 135
 136 Upper limit on fold change range for choosing training sites. This is
 137 an important parameter for training so please read. The purpose of
 138 this parameter is to EXCLUDE those unusually highly enriched chromatin
 139 regions so we can get training samples in 'ordinary' regions
 140 instead. It's highly recommended to run the `--cutoff-analysis-only`
 141 first to decide the lower cutoff `-l`, the upper cutoff `-u`, and the
 142 pre-scanning cutoff `-c`. The upper cutoff should be the cutoff in the
 143 cutoff analysis result that can capture some (typically hundreds of)
 144 extremely high enrichment and unusually wide peaks. Default: 20
 145
 146 ### `-l HMM_LOWER` / `--lower HMM_LOWER`
 147 Lower limit on fold change range for choosing training sites. This is
 148 an important parameter for training so please read. The purpose of
 149 this parameter is to ONLY INCLUDE those chromatin regions having
 150 ordinary enrichment so we can get training samples to learn the common
 151 features through HMM. It's highly recommended to run the
 152 `--cutoff-analysis-only` first to decide the lower cutoff `-l`, the
 153 upper cutoff `-u`, and the pre-scanning cutoff `-c`. The lower cutoff
 154 should be the cutoff in the cutoff analysis result that can capture
 155 moderate number ( about 10k ) of peaks with normal width ( average
 156 length 500-1000bps long). Default: 10
 157
 158 ### `-c PRESCAN_CUTOFF` / `--prescan-cutoff PRESCAN_CUTOFF`
 159
 160 The fold change cutoff for prescanning candidate regions in the whole
 161 dataset. Then we will use HMM to predict/decode states on these
 162 candidate regions. The higher the prescan cutoff, the fewer regions
 163 will be considered. Must be > 1. This is an important parameter for
 164 decoding so please read. The purpose of this parameter is to EXCLUDE
 165 those chromatin regions having noises/random enrichment so we can have
 166 a large number of possible regions to predict the HMM states. It's
 167 highly recommended to run the `--cutoff-analysis-only` first to decide
 168 the lower cutoff `-l`, the upper cutoff `-u`, and the pre-scanning
 169 cutoff `-c`. The pre-scanning cutoff should be the cutoff close to the
 170 BOTTOM of the cutoff analysis result that can capture a large number
 171 of possible peaks with normal length (average length 500-1000bps). In
 172 most cases, please do not pick a cutoff too low that captures almost
 173 all the background noises from the data. Default: 1.2
 174
 175
 176 ## Choices of cutoff values
 177
 178 Before you proceed, it's highly recommended to run with
 179 `--cutoff-analysis-only` for the initial attempt. When this option is
 180 activated, `hmmratac` will use EM to estimate the best parameters for
 181 fragment sizes of short fragments, mono-, di-, and tri-nucleosomes,
 182 pileup fragments, convert the pileup values into fold-change, and
 183 analyze each possible cutoff. This analysis includes the number of
 184 peaks that can be called, their average peak length, and their total
 185 length. After the report is generated, you can review its contents and
 186 decide on the optimal `-l`, `-u`, and `-c`.
 187
 188 The report consists of four columns:
 189
 190 1. Score: the possible fold change cutoff value.
 191 2. npeaks: the number of peaks.
 192 3. lpeaks: the total length of all peaks.
 193 4. avelpeak: the average length of peaks.
 194
 195 While there's no universal rule, here are a few suggestions:
 196
 197 - The lower cutoff should be the cutoff in the report that captures a
 198   moderate number (about 10k) of peaks with a normal width (average
 199   length 500-1000bps long).
 200 - The upper cutoff should capture some (typically hundreds of)
 201   extremely high enrichment and unusually wide peaks in the
 202   report. The aim here is to exclude abnormal enrichment caused by
 203   artifacts such as repetitive regions.
 204 - The pre-scanning cutoff should be the cutoff close to the BOTTOM of
 205   the report that can capture a large number of potential peaks with a
 206   normal length (average length 500-1000bps). However, it's
 207   recommended not to use the lowest cutoff value in the report as this
 208   may include too much noise from the genome.
 209
 210 ## Tune the HMM model
 211
 212 It's highly recommended to check the runtime message of the HMM model
 213 after training. An example is like this:
 214
 215 ```
 216 #4 Train Hidden Markov Model with Multivariate Gaussian Emission
 217 #  Extract signals in training regions with bin size of 10
 218 #  Use Baum-Welch algorithm to train the HMM
 219 #   HMM converged: True
 220 #  Write HMM parameters into JSON: test_model.json
 221 #  The Hidden Markov Model for signals of binsize of 10 basepairs:
 222 #   open state index: state2
 223 #   nucleosomal state index: state1
 224 #   background state index: state0
 225 #   Starting probabilities of states:
 226 #                            bg        nuc       open
 227 #                        0.7994     0.1312    0.06942
 228 #   HMM Transition probabilities:
 229 #                            bg        nuc       open
 230 #               bg->     0.9842    0.01202   0.003759
 231 #              nuc->    0.03093     0.9562    0.01287
 232 #             open->   0.007891    0.01038     0.9817
 233 #   HMM Emissions (mean):
 234 #                         short       mono         di        tri
 235 #               bg:      0.2551      1.526     0.4646    0.07071
 236 #              nuc:       6.538      17.94      3.422    0.05819
 237 #             open:       5.016      17.47      6.897      2.121
 238 ```
 239
 240 We will 'guess' which hidden state is for the open region, which is
 241 the nucleosomal region, and which is the background. We compute from
 242 the HMM Emission matrix to pick the state with the highest sum of mean
 243 signals as the open state, the lowest as the backgound state, then the
 244 rest is the nucleosomal state. However it may not work in every
 245 case. In the above example, it may be tricky to call the second row as
 246 'nuc' and the third as 'open'. If the users want to exchange the state
 247 assignments of the 'nuc' and 'open', they can modify the state
 248 assignment in the HMM model file (e.g. test_model.json). For the above
 249 example, the model.json looks like this (we skipped some detail):
 250
 251 ```
 252 {"startprob": [...], "transmat": [...], "means": [...], "covars": [...],
 253 "covariance_type": "full", "n_features": 4,
 254 "i_open_region": 2, "i_background_region": 0, "i_nucleosomal_region": 1,
 255 "hmm_binsize": 10}
 256 ```
 257
 258 We can modify the assignment of: `"i_open_region": 2,
 259 "i_background_region": 0, "i_nucleosomal_region": 1,` by assigning `1`
 260 to open, and `2` to nucleosomal as: `"i_open_region": 1,
 261 "i_background_region": 0, "i_nucleosomal_region": 2,` Then save the
 262 HMM in a new model file such as `new_model.json`.
 263
 264 Then next, we can re-run `macs3 hmmratac` with the same parameters
 265 plus an extra option for the HMM model file like `macs3 hmmratac
 266 --model new_model.json`
 267