Accurate identification of meiotic crossing-over sites (COs) is essential for correct genotyping of recombining samples. RTIGER is a method for predicting genome-wide COs using allele-counts at pre-defined SNP marker positions. RTIGER trains a rigid Hidden Markov Model (rHMM1) where each genomic state (homozygous parent 1, homozygous parent 2 or heterozygous) corresponds to respectively one hidden state and the allele-counts as the observed variable. COs are identified as transitions in the HMM state.
To account for sparse sampling of individual marker positions in the
sequencing data, RTIGER uses Viterbi Path Algorithm and the
rigidity
parameter. This parameter defines the minimum
number of SNP markers required to support a state-transition. This
filters out low-confidence state-transitions, improving COs
identification performance.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
::install(version = "3.14")
BiocManager
::install(c("GenomicRanges", "GenomeInfoDb", "TailRank", "IRanges", "Gviz")) BiocManager
RTIGER can be installed directly from CRAN using:
install.packages("RTIGER")
Additionally, the development version (with the most recent features and bug-fixes) can be downloaded from GitHub using the following:
install.packages("devtools")
library(devtools)
install_github("rfael0cm/RTIGER", ref = 'main')
RTIGER uses the allele-count information at the SNP marker positions. The SNP markers correspond to differences between the two genotypes (i.e. parent_1 vs parent_2). RTIGER requires as input one allele-count file for each sample. The allele-count file should be in tab-separated value format, where each row corresponds to a SNP marker. For example:
Chr1 | 37388 | C | 0 | T | 2 |
Chr1 | 71348 | T | 1 | G | 3 |
Chr2 | 18057 | A | 0 | C | 1 |
Chr2 | 38554 | G | 0 | A | 2 |
Chr2 | 75348 | A | 1 | T | 0 |
Chr3 | 32210 | T | 2 | G | 0 |
Here the columns are:
The SNPs can be identified using any generic SNP identification pipeline3.
SNPs in repetitive regions should be filtered out. Further, as crossing-over usually takes place in syntenic regions between the two genomes, for best results, only SNPs in syntenic regions should be selected as markers. If whole genome assemblies are present for both genomes, then this can be easily achieved using methods like SyRI4.
RTIGER uses Julia to perform computationally intensive model training. All Julia packages that are used by RTIGER can be installed using using:
library(RTIGER)
setupJulia()
This step is necessary when using RTIGER for the first time, but can be skipped for later analysis as all required Julia packages would already be installed.
The Julia functions need to be loaded in the R environment using:
sourceJulia()
This step is required every time when using RTIGER.
The primary input for RTIGER is a data-frame termed
expDesign
. The first column of expDesign
should have paths to allele-count files for all samples and the second
column should have unique samples IDs5.
# Get paths to example allele count files originating from a
# cross between Col-0 and Ler accession of the A.thaliana
= list.files(system.file("extdata", package = "RTIGER"), full.names = TRUE)
file_paths
# Get sample names
<- basename(file_paths)
sampleIDs
# Create the expDesign object
= data.frame(files=file_paths, name=sampleIDs)
expDesign
print(expDesign)
#> files name
#> 1 /tmp/RtmpOQjtyL/Rinst79bd6801ed99/RTIGER/extdata/sampleAA.txt sampleAA.txt
#> 2 /tmp/RtmpOQjtyL/Rinst79bd6801ed99/RTIGER/extdata/sampleAC.txt sampleAC.txt
#> 3 /tmp/RtmpOQjtyL/Rinst79bd6801ed99/RTIGER/extdata/sampleB.txt sampleB.txt
RTIGER also requires chromosome lengths for the parent 1. These need to be provided as a named vector where the values are chromosome lengths and the names are chromosome ids.
# Get chromosome lengths for the example data included in the package
<- RTIGER::ATseqlengths
chr_len names(chr_len) <- c('Chr1' , 'Chr2', 'Chr3', 'Chr4', 'Chr5')
print(chr_len)
#> Chr1 Chr2 Chr3 Chr4 Chr5
#> 34964571 22037565 25499034 20862711 31270811
RTIGER does model training, CO identification, and creates summary
plots using the RTIGER
function.
= RTIGER(expDesign = expDesign,
myres outputdir = "/PATH/TO/OUTPUT/DIRECTORY",
seqlengths = chr_len,
rigidity = 200)
The rigidity
parameter defines the required minimum
number of continuous markers that together support a state change of the
HMM model (Figure 1). Smaller rigidity
values increase the
sensitivity in detecting COs that are close to each other, but may
result in false-positive CO identification because of variation in
sequencing coverage. Larger rigidity
values improve
precision but COs that are close to each other might not be identified.
Users are supposed to test and adjust rigidity
based on their specific experimental setup.
RTIGER identifies COs for each sample level and provides summary plots and statistics for each sample as well as for the entire population.
RTIGER creates a folder for each sample in the
outputdir
. This folder contains:
GenotypePlot.pdf
: Graphical representation of the
allele-counts, allele-count ratio, and genotypesCompleteBlock.bed
: BED file providing genomic regions
corresponding to different genotypesP1/P2/Het.bed
: BED files containing the markers present
in genomic regions having genotype: homozygous parent 1, homozygous
parent 2, or heterozygous, respectivelyP1/P2.bw
: BigWig file containing the number of reads
per marker position supporting parent 1 and parent 2, respectivelyCountRatio.bw
: BigWig file containing the ratio of
number of reads supporting parent 1 to number of reads supporting number
2 at the marker positionsIn addition to per sample output, RTIGER also generates three summary plots for the samples.
COs-per-Chromosome.pdf
: Distribution of number of
cross-overs per chromosomeCO-count-perSample.pdf
: Number of cross-overs in each
sampleGoodness-Of-fit.pdf
: The goodness of fit file contains
the histograms of the count ratio distribution for each state. In other
words, the count ratio (counts of reference allele/total number of
allele count) for all the positions decoded as state i will be used to
plot the histogram of the state i. For example, in the case of three
states (paternal, heterozygous, and maternal), the first histogram is
the paternal state for which the distribution of count ratio is close to
one (since the reference allele is the allele inherited from the
paternal line). On the other hand, the maternal state histogram has a
distribution close to zero. Reasonably, the heterozygous state has a
distribution close to 0.5 since the count ratio for the observations
heterozygous regions should be half and half. Real-life experiments are
more complex and might deviate from the theoretical values, i.e the
heterozygous count ratio is skewed to 0 or 1 indicating preference in
the annotation by one of the parents, contamination in some samples, the
model did not converge adequately, or whether the number of states is
correct. These plots help the user to decide if the model fitted for the
data they have is adequate or needs to change the parameters.Backcrossed populations are formed by crossing a hybrid organism with
one of its parent. These populations are different from the populations
based on outcrossing as only two genomic states are possible (homozygous
for the backrossed parent and heterozygous for both parents). To
identify COs in such population, set nstates=2
in the
RTIGER command.
= RTIGER(expDesign = expDesign,
myres outputdir = "PATH/TO/OUTPUT/DIR",
seqlengths = chr_len,
rigidity = 200,
nstates=2)
Campos-Martin R, et al., Reliable genotyping of recombinant genomes using a robust hidden Markov Model, 2022
#> R version 4.2.2 Patched (2022-11-10 r83330)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] knitr_1.42
#>
#> loaded via a namespace (and not attached):
#> [1] qpdf_1.3.1 tidyselect_1.2.0 xfun_0.37
#> [4] bslib_0.4.2 reshape2_1.4.4 generics_0.1.3
#> [7] colorspace_2.1-0 vctrs_0.6.0 oompaData_3.1.3
#> [10] htmltools_0.5.4 stats4_4.2.2 yaml_2.3.7
#> [13] utf8_1.2.3 rlang_1.1.0 e1071_1.7-13
#> [16] jquerylib_0.1.4 pillar_1.8.1 TailRank_3.2.2
#> [19] glue_1.6.2 BiocGenerics_0.42.0 GenomeInfoDbData_1.2.8
#> [22] lifecycle_1.0.3 plyr_1.8.8 stringr_1.5.0
#> [25] zlibbioc_1.42.0 munsell_0.5.0 gtable_0.3.1
#> [28] evaluate_0.20 Biobase_2.56.0 IRanges_2.30.1
#> [31] fastmap_1.1.1 GenomeInfoDb_1.32.4 class_7.3-21
#> [34] fansi_1.0.4 highr_0.10 Rcpp_1.0.10
#> [37] scales_1.2.1 RTIGER_2.1.0 cachem_1.0.7
#> [40] S4Vectors_0.34.0 jsonlite_1.8.4 XVector_0.36.0
#> [43] askpass_1.1 ggplot2_3.4.1 digest_0.6.31
#> [46] stringi_1.7.12 dplyr_1.1.0 oompaBase_3.2.9
#> [49] GenomicRanges_1.48.0 grid_4.2.2 cli_3.6.0
#> [52] tools_4.2.2 bitops_1.0-7 magrittr_2.0.3
#> [55] sass_0.4.5 RCurl_1.98-1.10 proxy_0.4-27
#> [58] tibble_3.2.0 cluster_2.1.4 pkgconfig_2.0.3
#> [61] rmarkdown_2.20 extraDistr_1.9.1 rstudioapi_0.14
#> [64] JuliaCall_0.17.5 R6_2.5.1 compiler_4.2.2
Longer description: https://kups.ub.uni-koeln.de/37286/↩︎
https://www.geeksforgeeks.org/how-to-setup-julia-path-to-environment-variable/?ref=lbp↩︎
For example: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2014/140217_AgriOmics/dan_bolser_snp_calling.pdf↩︎
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1911-0↩︎
Avoid use of special characters or spaces in file-names as that may result in errors.↩︎