This vignette documents how to import or enter genotypic data for the hierfstat
package. Originally this package was written to estimate and test hierarchical F-statistics, but was then further developed and now include almost all features of the Fstat program (no longer maintained), as well as several others.
The data types that hierfstat
can analyse are haploid or diploid, unphased, multilocus genotypes. Note that each data set must be made of only one ploidy level.
The basic data structure required by most Hierfstat
function is a data frame with the first column containing a population identifier (preferably a number), and the next \(nl\) columns the genotype at each of \(nl\) loci.
In hierfstat
, alleles are encoded as 1, 2 or 3 digits numbers, and genotypes are encoded as numbers with the two alleles collated (as in pasted together). Other type of data can be imported (see below) but for the time being we focus on the primary data type. Thus imagine that you have an individual genotyped at a microsatellite locus with allele length 120
and 124
, the way to encode it for hierfstat
is either 120124
or 124120
. If the data are SNPs, each allele at a locus could be encoded as 1
and 2
, or you may decide to keep the correspondence between nucleotides and alleles (e.g. 1, 2, 3, 4
for A, C, G, T
). Thus, if the two alleles at a SNP locus are A
and T
and an individual is heterozygote, it could be encoded as 14
or 41
.
Example data sets are included in hierfstat
. For instance:
library(hierfstat) #load the library
data(diploid) # info about this data set with ?diploid
head(diploid)
## Pop loc-1 loc-2 loc-3 loc-4 loc-5
## 1 1 44 43 43 33 44
## 2 1 44 44 43 33 44
## 3 1 44 44 43 43 44
## 4 1 44 44 NA 33 44
## 5 1 44 44 24 34 44
## 6 1 44 44 NA 43 44
The first individual (first row of the diploid data frame) belongs to population 1. Its genotype at loc-1
is 44
, thus homozygote for allele 4
. It is heterozygote for alleles 3
and 4
at both loc-2
and loc-3
, and homozygote for allele 3
at loc-4
and finaly homozygote for allele 4
at loc-5
. In fact, loc-1
and loc-4
are monomorphic, meaning that only one allele is present in all individuals from all populations.
If a genotype is missing, it is encoded as NA
. For instance, the fourth individual has not been typed at loc-3
, nor did the 6th individual for the same locus.
The first column of this dataframe contains the identifier of the population to which the individual belongs. We can find how many individuals were typed in each population by using the command table:
table(diploid[,1])
##
## 1 2 3 4 5 6
## 8 8 5 7 9 7
As another example, we look at dataset cont.isl99
, a data frame where alleles are encoded as 2 digits numbers:
data(cont.isl99)
head(cont.isl99)
## Pop loc.1 loc.2 loc.3 loc.4 loc.5
## 1 1 7474 1955 9168 4051 9251
## 2 1 7474 3175 9168 2410 2327
## 3 1 808 3194 9536 9751 9223
## 4 1 874 5294 1876 1310 1292
## 5 1 7484 3875 1010 5107 7712
## 6 1 874 3175 1010 5135 9292
the first individual is homozygous for allele 74
at the first locus (loc.1
) and heterozygous fore alleles 19
and 55
at the second locus. The genotype could have been written 5519
instead of 1955
, it does not matter. Note the genotype of the 3rd and fourth individual at the first locus. They both carry allele 8
, which is in fact encoded as 08
. When it comes first, the leading 0 disappears, but it must be present in second position. Hence genotype 874
, 0874
and 7408
are the same, but different from genotype 748
who would be understood by hierfstat as an individual heterozygous for alleles 07
and 48
.
Last point: alleles for all loci to be analysed simultaneously must be encoded with the same number of digits.
Often the data to be imported are in a text file. If this is the case, the easiest way to import the file into R
is via one of the workhorse of R, the read.table
function.
If the data are in the FSTAT
format, they can be readily imported using the function read.fstat
:
<-hierfstat::read.fstat(system.file("extdata","diploid.dat",package="hierfstat"))
diphead(dip)
## Pop loc-1 loc-2 loc-3 loc-4 loc-5
## 1 1 44 43 43 33 44
## 2 1 44 44 43 33 44
## 3 1 44 44 43 43 44
## 4 1 44 44 NA 33 44
## 5 1 44 44 24 34 44
## 6 1 44 44 NA 43 44
adegenet
is another population genetics analysis package, with the ability to import from several data format. hierfstat
can import and work directly with genind
objects generated by adegenet
:
library(adegenet,quietly=TRUE)
data(nancycats)
head(genind2hierfstat(nancycats)[,1:10]) # only the first 10 loci
## pop fca8 fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37
## N215 P01 NA 136146 139139 116120 156156 142148 199199 113113 208208
## N216 P01 NA 146146 139145 120126 156156 142148 185199 113113 208208
## N217 P01 135143 136146 141141 116116 152156 142142 197197 113113 210210
## N218 P01 133135 138138 139141 116126 150150 142148 199199 91105 208208
## N219 P01 133135 140146 141145 126126 152152 142148 193199 113113 208208
## N220 P01 135143 136146 145149 120126 150156 148148 193195 91113 208208
data(H3N2)
head(genind2hierfstat(H3N2,pop=rep(1,dim(H3N2@tab)[1]))[,1:10]) # only the first 10 positions
## pop X6 X17 X39 X42 X45 X51 X60 X72 X73
## AB434107 1 1 1 3 2 3 2 3 3 2
## AB434108 1 1 1 3 2 3 2 3 3 2
## AB438242 1 NA NA NA NA NA NA 3 3 2
## AB438243 1 NA NA NA NA NA NA 3 3 2
## AB438244 1 NA NA NA NA NA NA 3 3 2
## AB438245 1 NA NA NA NA NA NA 3 3 2
data(eHGDP)
head(genind2hierfstat(eHGDP))[,1:11]
## pop loc.1 loc.2 loc.3 loc.4 loc.5 loc.6 loc.7 loc.8 loc.9 loc.10
## 1 1 129155 264292 142146 156156 157166 171179 205205 183187 196196 137140
## 2 1 145150 288292 138142 168172 157166 171175 205210 195203 196196 128134
## 3 1 155155 292300 138142 156156 157169 167171 205205 183199 184196 137140
## 4 1 150155 264292 142146 156176 157163 171175 205205 187187 196196 140140
## 5 1 150155 292300 138146 156160 157166 171171 205205 183207 187190 128128
## 6 1 155155 296296 146146 152176 157157 167171 205210 179183 196196 128140
The example below shows how to estimate gene diversities (expected heterozygosities), observed heterozygosities, or estimate genetic distances directly from a genind
object:
#basic.stats(nancycats)
::Hs(nancycats) #mean populations gene diversities hierfstat
## P01 P02 P03 P04 P05 P06 P07 P08
## 0.6531333 0.7043000 0.7239111 0.7525222 0.6431222 0.7510000 0.6719333 0.7598556
## P09 P10 P11 P12 P13 P14 P15 P16
## 0.6913667 0.7020111 0.7867667 0.6718333 0.6883889 0.7928556 0.7217111 0.7016222
## P17
## 0.6046375
::Ho(nancycats) # mean populations observed heterozygosities hierfstat
## P01 P02 P03 P04 P05 P06 P07 P08
## 0.5722222 0.5707111 0.6111111 0.6280222 0.5629667 0.6262667 0.5458556 0.6111111
## P09 P10 P11 P12 P13 P14 P15 P16
## 0.7407556 0.6262667 0.6627444 0.5570000 0.6752222 0.6836667 0.7474889 0.6759222
## P17
## 0.5980875
#genet.dist(nancycats)
Genomic datasets are composed of (very) large numbers of bi-allelic loci, called Simgle Nucleotide Polymorhisms, or SNP. A convenient and efficient format for storing SNP data is allelic dosage, the number of alternative alleles and individual carries at a locus. For a diploid locus, this number can be 0 if the individual carries two reference alleles, 1 if he is heterozygote or 2 if he carries 2 alternate alleles. The allelic dosage format is suitable for analyses with several of the hierfstat
functions, as described in the hierfstat
vignette.
To import genlight objects from adegenet
, just wrap the object’s name in an as.matrix
:
<-read.snp(system.file("files/exampleSnpDat.snp",package="adegenet"), chunk=2,quiet=TRUE)
obj
as.matrix(obj)
## 1.a/t 8.g/c 11.a/c 43.t/a
## foo 1 0 2 0
## bar 0 0 1 2
## toto 1 0 NA 0
## Nyarlathotep 0 1 2 0
## an even longer label but OK since on a single line 1 1 0 0
Variant Call Format (VCF) files have become a standard for genomic data. This is the storage format used for the 1000 genomes for instance.
hierfstat
using the function read.VCF
, built from the gaston
package function read.vcf
:library(gaston,quietly=TRUE)
## Warning: package 'Rcpp' was built under R version 4.1.3
<-system.file("extdata", "LCT.vcf.gz", package="gaston")
filepath <- read.VCF( filepath ) x1
## ped stats and snps stats have been set.
## 'p' has been set.
## 'mu' and 'sigma' have been set.
x1
## A bed.matrix with 503 individuals and 607 markers.
## snps stats are set
## ped stats are set
dim(x1)
## [1] 503 607
with(x1@snps,table(A1,A2))
## A2
## A1 A C G T
## A 0 25 94 22
## C 18 0 22 111
## G 126 24 0 26
## T 27 90 22 0
by default, read.VCF
keeps only the bi-allelic SNPs (whereas gaston::read.vcf
keeps all variants), but you can choose to keep all variant by setting the argument BiAllelic
to FALSE.
as.matrix(x1)
will then provide allelic dosages.
SNPRelate
package, and its function snpgdsVCF2GDS
(As the function writes to a file, the two next code chunks will not be evaluated)library(SNPRelate)
snpgdsVCF2GDS(filepath, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")
<-snpgdsOpen("test1.gds") f
snpgdsVCF2GDS
stores the number of reference alleles, we want the the number of alternate alleles:
<-2-snpgdsGetGeno(f)
x2
#check that allele frequencies are the same with the two methods
all.equal(colMeans(x2)/2,colMeans(as.matrix(x1)/2),check.names=FALSE)
bed
file using gaston
, and use an external program, plink, to convert the VCF file into a BED file.These are only 3 possibilities that I find convenient and efficient, but many other exist, thus feel free to import dosage data into hierfstat
by your preferred route.
While hierfstat
contains several built-in function to generate genetic data (e.g. sim.genot
), functions exist to import data from external, more sophisticated genetic data simulators, namely QuantiNemo and ms
QuantiNemo is a genetic simulation program for markers and traits. Data generated by Quantinemo
can be imported using the function read.fstat
if the save_genotype
setting in quantinemo
is set to \(1\). If the QuantiNemo
output is set to \(2\) (extended), 6 extra columns are outputed and these can be read with hierfstat
using the function qn2.read.fstat
. The component $dat
of the object return by this function contains the genotypes of the individuals simulated, while the component $sex
contains its sex, the component $ped
the individuals’ pedigree and the component $w
their fitness. For more details about the extended FSTAT
format of QuantiNemo
, see its manual.
<-qn2.read.fstat(system.file("extdata","qn2_sex.dat",package="hierfstat"))
datnames(dat)
## [1] "dat" "sex" "ped" "W"
head(dat$sex)
## [1] "F" "F" "F" "F" "F" "F"
head(dat$dat[,1:10])
## Pop trait-1_locus-1 trait-1_locus-2 trait-1_locus-3 trait-1_locus-4
## 1 1 606 1515 101 404
## 2 1 606 1515 101 404
## 3 1 606 1515 101 404
## 4 1 606 1515 101 404
## 5 1 606 1515 101 404
## 6 1 606 1515 101 404
## trait-1_locus-5 trait-1_locus-6 trait-1_locus-7 trait-1_locus-8
## 1 707 404 303 101
## 2 707 415 303 101
## 3 707 404 303 101
## 4 707 415 303 101
## 5 707 415 303 101
## 6 707 1504 303 101
## trait-1_locus-9
## 1 808
## 2 808
## 3 808
## 4 808
## 5 808
## 6 808
#sexbias.test(dat[[1]],sex=dat[[2]])
The program ms of Hudson is commonly used to generate genomic data.
I briefly discussed the ms
software. its output looks like this:
ms 200 100 -t 20 -I 2 100 100 40 -n 2 0.01
29161
//
segsites: 23
positions: 0.0689 0.2534 0.3219 0.3350 0.3547 0.3768 0.4339 0.4359 0.4388 0.4694 0.5003 0.5422 0.6575 0.6985 0.7059 0.7147 0.7453 0.7709 0.7891 0.8439 0.8779 0.8857 0.9380
00100001100000000000000
00001001000000000001000
The first line is the ms
command line, and it instructed the program to simulate 2 populations, with \(\theta=2N_0\mu=20\). The 2 populations differ in size and the smallest (the second) is a 100th of the first. The two populations exchange \(4Nm=40\) migrants per generation. 100 chromosomes are sampled from each population, and this is repeated a 100 times.
The genetic data itself comes as a serie of 0 and 1, collated one to the other. These are the SNP sites, with 0 being the ancestral state and 1 the derived state.
the function ms2bed
will convert ms output to a bed object:
<-ms2bed(system.file("extdata","2pops_asspop.txt",package="hierfstat"))
bedhead(as.matrix(bed[,1:10])) #first 10 columns of bed matrix
## M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10
## 1 0 0 1 0 1 0 0 2 1 0
## 2 1 0 0 0 1 0 0 1 0 0
## 3 0 0 0 0 0 0 0 2 0 0
## 4 0 0 0 0 2 0 0 2 0 0
## 5 0 0 0 0 0 0 0 2 0 0
## 6 0 0 0 0 1 0 0 2 0 0
Take some time to explore the structure of the bed
s4 object (str(bed)
), it is very useful. the @ped
slot contains info relevant to the individuals (their names, family id etc…), while the @snps
slot contains info relevant to the SNPs (their name, position, chromosome, etc…). The dosage matrix is extracted with as.matrix(bed)
.This can be used as argument to all functions containing dosage
in their names, such as beta.dosage, pi.dosage, theta.Watt.dosage, TajimaD.dosage
. For instance, fs.dosage
will produce estimates of populations specific \(F_{IS}\) and \(F_{ST}\).
fs.dosage(as.matrix(bed),pop=rep(1:2,each=50)) # population specific FSTs
## 1 2 All
## Fis -0.0046 0.0342 0.0148
## Fst -0.1616 0.6179 0.2282
As a more interesting example, we can reuse the eHGDP
dataset we’ve encountered previously (see ?eHGDP
), and after having converted it to dosage data with the fstat2dos
function (there is a total of \(8170\) alleles in the data set), we can look at inbreeding within populations and population structure using fs.dosage
:
<-genind2hierfstat(eHGDP)
dat.HGDP<-fstat2dos(dat.HGDP[,-1])
dos.HGDP<-fs.dosage(dos.HGDP,pop=dat.HGDP[,1]) fs.HGDP
We may be interested in finding which populations have either a large \(F_{ST}^P>0.15\) or a low one \(F_{ST}^P <0\):
@other$popInfo[which(fs.HGDP$Fs[2,]>0.15),] eHGDP
## Population Region Label Latitude Longitude
## 23 Colombian AMERICA 23 3.0 -68.0
## 24 Karitiana AMERICA 24 -10.0 -63.0
## 26 Pima AMERICA 26 29.0 -108.0
## 54 Surui AMERICA 54 -11.0 -62.0
## 63 Guaymi AMERICA 63 8.5 -82.0
## 64 Cabecar AMERICA 64 9.5 -84.0
## 68 Ache AMERICA 68 -24.0 -56.0
## 69 Kaingang AMERICA 69 -24.0 -52.5
## 71 Kogi AMERICA 71 11.0 -74.0
## 75 TicunaArara AMERICA 75 -4.0 -70.0
## 76 TicunaTarapaca AMERICA 76 -4.0 -70.0
## 79 Arhuaco AMERICA 79 11.0 -73.8
@other$popInfo[which(fs.HGDP$Fs[2,]<0.0),] eHGDP
## Population Region Label Latitude Longitude
## 27 BantuKenya AFRICA 27 -3.0 37.0
## 28 BantuSouthWest AFRICA 28 -21.0 18.7
## 29 BantuSouthEast AFRICA 29 -28.4 27.6
## 30 Mandenka AFRICA 30 12.0 -12.0
## 31 Yoruba AFRICA 31 8.0 5.0
## 32 BiakaPygmy AFRICA 32 4.0 17.0
## 33 MbutiPygmy AFRICA 33 1.0 29.0
## 34 San AFRICA 34 -21.0 20.0
and you’ll see that samples from AFRICA have low Population specific \(F_{ST}\), while samples from AMERICA have large population specific \(F_{ST}\)