This vignette illustrates the use of the aSPU package with GWAS Summary Statistics.
We will consider the analysis of a coronary artery disease (CAD) data from the CARDIoGRAM and C4D consortium. The data set contains P value data for coronary artery disease (CAD). This study comprised 63,746 CAD cases and 130,681 controls. We mapped these SNPs to the 9th KEGG pathway (KEGG_STEROID_BIOSYNTHESIS) . Let’s load this subset.
library(aSPU)
## Loading required package: gee
## Loading required package: MASS
## Loading required package: mvtnorm
## Loading required package: fields
## Loading required package: spam
## Loading required package: dotCall64
## Loading required package: grid
## Spam version 2.7-0 (2021-06-25) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
##
## Attaching package: 'spam'
## The following object is masked from 'package:mvtnorm':
##
## rmvnorm
## The following objects are masked from 'package:base':
##
## backsolve, forwardsolve
## Loading required package: viridis
## Loading required package: viridisLite
## See https://github.com/NCAR/Fields for
## an extensive vignette, other supplements and source code
## Loading required package: matrixStats
data(kegg9)
The 9th Kegg pathway contains 16 genes.
$gene.info kegg9
## V4 V1 V2 V3
## 1 SOAT1 1 179261017 179326453
## 2 LSS 21 47606360 47650738
## 3 SQLE 8 126008720 126036525
## 5 CYP51A1 7 91739463 91766059
## 6 DHCR7 11 71143457 71161477
## 7 CYP27B1 12 58154117 58162976
## 8 DHCR24 1 55313300 55354921
## 9 HSD17B7 1 162758496 162784608
## 10 MSMO1 4 166246818 166266225
## 11 FDFT1 8 11658190 11698818
## 12 LIPA 10 90971326 91013660
## 13 CEL 9 135935365 135949248
## 14 TM7SF2 11 64877341 64885707
## 16 SOAT2 12 53495274 53520323
The PPs
is a list object contains the SNP information for each genes.
## SNPs mapped on 3rd and 4th genes of 9th Kegg pathway
$PPs[3:4] kegg9
## $CYP51A1
## V1 V2 V3 V4
## rs7796370 rs7796370 7 91739652 0.9035991
## rs12056285 rs12056285 7 91743745 0.7552544
## rs10953067 rs10953067 7 91746159 0.9138308
## rs6965936 rs6965936 7 91752154 0.2282090
## rs10953068 rs10953068 7 91752687 0.7363687
## rs10488513 rs10488513 7 91752911 0.9015296
## rs10953069 rs10953069 7 91760477 0.9718199
## rs721015 rs721015 7 91763192 0.3465701
##
## $CYP27B1
## V1 V2 V3 V4
## rs7306830 rs7306830 12 58154246 0.8323711
## rs10783975 rs10783975 12 58157977 0.7783928
## rs276054 rs276054 12 58161746 0.9735500
The Ps
object contains p-value information for mapped SNPs. Total 394SNPs are mapped on 9th kegg pathway.
length(kegg9$Ps)
## [1] 394
$Ps[1:10] kegg9
## rs2331902 rs10159067 rs3806284 rs3806283 rs6692913 rs1411479 rs12752307
## 0.2115124 0.9898204 0.9488433 0.4944396 0.5838916 0.4224329 0.3675058
## rs12036052 rs12134601 rs10159443
## 0.1982464 0.9293256 0.9375789
Using plink, we mapped our SNPs to the reference population ( Hapmap CEU phase 2 ). We dropped the SNPs of minor allele frequency (MAF) less then 5 percent. Among 394 SNPs mapped on 9th Kegg pathway, 330 SNPs are mapped on the reference data. P-values of these SNPs and correlation matirx of SNPs using the reference population, saved on nP
and ldmatrix
object.
$nP[1:10] kegg9
## rs487230 rs593398 rs6694813 rs550336 rs17111584 rs683880 rs6588545
## 0.9209627 0.8499274 0.9741802 0.6074471 0.5866373 0.9607142 0.3860578
## rs679804 rs555687 rs497159
## 0.8845177 0.9036468 0.8872636
$ldmatrix[1:10,1:10] kegg9
## rs487230 rs593398 rs6694813 rs550336 rs17111584 rs683880
## rs487230 1.0000000 -0.19188474 0.72579887 0.8910864 0.45580139 0.9742395
## rs593398 -0.1918847 1.00000000 -0.17488093 -0.1880254 -0.06052275 -0.1777817
## rs6694813 0.7257989 -0.17488093 1.00000000 0.8350461 -0.08326255 0.7438007
## rs550336 0.8910864 -0.18802536 0.83504607 1.0000000 0.47874154 0.8600249
## rs17111584 0.4558014 -0.06052275 -0.08326255 0.4787415 1.00000000 0.3708139
## rs683880 0.9742395 -0.17778169 0.74380068 0.8600249 0.37081386 1.0000000
## rs6588545 0.4128228 -0.02037425 -0.07921255 0.4209425 1.00000000 0.3588603
## rs679804 0.7257989 -0.17488093 1.00000000 0.8350461 -0.08326255 0.7438007
## rs555687 1.0000000 -0.18763049 0.73226324 0.8901041 0.45918526 0.9740010
## rs497159 0.7257989 -0.17488093 1.00000000 0.8350461 -0.08326255 0.7438007
## rs6588545 rs679804 rs555687 rs497159
## rs487230 0.41282283 0.72579887 1.0000000 0.72579887
## rs593398 -0.02037425 -0.17488093 -0.1876305 -0.17488093
## rs6694813 -0.07921255 1.00000000 0.7322632 1.00000000
## rs550336 0.42094247 0.83504607 0.8901041 0.83504607
## rs17111584 1.00000000 -0.08326255 0.4591853 -0.08326255
## rs683880 0.35886033 0.74380068 0.9740010 0.74380068
## rs6588545 1.00000000 -0.07921255 0.4170467 -0.07921255
## rs679804 -0.07921255 1.00000000 0.7322632 1.00000000
## rs555687 0.41704671 0.73226324 1.0000000 0.73226324
## rs497159 -0.07921255 1.00000000 0.7322632 1.00000000
The snp.info
data object have snp information for each mapped SNPs. The 1st column is rsID, 2nd column is Chr, 3rd column is location and 4th column is p-value.
$snp.info[1:10,] kegg9
## rsID Chr Loc Pval
## rs487230 rs487230 1 55313762 0.9209627
## rs593398 rs593398 1 55314484 0.8499274
## rs6694813 rs6694813 1 55315191 0.9741802
## rs550336 rs550336 1 55318103 0.6074471
## rs17111584 rs17111584 1 55319284 0.5866373
## rs683880 rs683880 1 55319483 0.9607142
## rs6588545 rs6588545 1 55321962 0.3860578
## rs679804 rs679804 1 55325157 0.8845177
## rs555687 rs555687 1 55325692 0.9036468
## rs497159 rs497159 1 55326142 0.8872636
Using the following code we can use aSPUs()
and GATES2()
function to get gene-wise aSPU and GATES p-value for each gene in 9th Kegg pathway.
NULL;
Gps<- NULL;
gl <-for( g in kegg9$gene.info[,1]) {
snps <- which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) &
(kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])&
(kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4]))
kegg9$nP[snps] ;
newP <- kegg9$ldmatrix[snps, snps];
ldsub <-
if( length(snps) > 1) {
aSPUs(newP, corSNP=ldsub , pow=c(1,2,4,8, Inf),
out <-n.perm=10000, Ps=TRUE)
order(newP)
o.pvec = ldsub[o.pvec, o.pvec]
ldmat <- GATES2(ldmat, sort(newP))[1]
gatesp <- rbind(Gps, c(length(newP),out$pvs, gatesp))
Gps <- c(gl, g)
gl <-else if (length(snps) == 1) {
} newP
out <- newP
gatesp <- rbind(Gps, c(length(newP),rep(out,6), gatesp) )
Gps <- c(gl, g)
gl <-
}
}colnames(Gps)[1] <- "nSNP"
rownames(Gps) <- gl
Gps
## nSNP SPUs1 SPUs2 SPUs4 SPUs8 SPUsInf aSPUs
## SOAT1 63 0.7870000 0.7887000 0.7981000 0.7938000 0.7547000 0.85731427
## SQLE 49 0.6266000 0.6426000 0.6433000 0.6209000 0.5467000 0.64643536
## CYP51A1 8 0.7840000 0.6925000 0.6097000 0.5631000 0.5047000 0.59784022
## CYP27B1 3 0.9662000 0.9578000 0.9506000 0.9474000 0.9461000 0.95620438
## DHCR24 26 0.9265000 0.8724000 0.8204000 0.7816000 0.7381000 0.80471953
## HSD17B7 43 0.6403000 0.7167000 0.7431000 0.6388000 0.5188000 0.65173483
## MSMO1 7 0.0099000 0.0101000 0.0141000 0.0217000 0.0407000 0.01709829
## FDFT1 54 0.3151000 0.2938000 0.2777000 0.2207000 0.1299000 0.18468153
## LIPA 57 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00009999
## CEL 5 0.1739000 0.1939000 0.2356000 0.2742000 0.3049000 0.23557644
## TM7SF2 1 0.7816163 0.7816163 0.7816163 0.7816163 0.7816163 0.78161630
## SOAT2 14 0.2227000 0.2547000 0.3044000 0.3556000 0.3886000 0.27347265
## Pg
## SOAT1 0.5686306049
## SQLE 0.8102291397
## CYP51A1 0.6798247959
## CYP27B1 0.9735500000
## DHCR24 0.9777507000
## HSD17B7 0.6360037948
## MSMO1 0.0228054242
## FDFT1 0.1746149004
## LIPA 0.0001761169
## CEL 0.2958316751
## TM7SF2 0.7816163000
## SOAT2 0.3239918282
The row of Gps
means each gene, 1st column indicate the number of SNPs for each gene. 2nd to 6th column indicate SPUs p-values for each power ( 1,2,4,8 and Inf), 7th column indicate aSPUs p-value and 8th column indicate p-value of GATES method. We can see that a gene LIPA is very significant.
"LIPA"
g =
snps <- which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) &
(kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])&
(kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4]))
kegg9$nP[snps] ;
newP <- newP
## rs10509568 rs10509569 rs12780342 rs7922269 rs6586175 rs12358054 rs1556478
## 0.00031650 0.09359480 0.04754370 0.00000756 0.00216860 0.13601460 0.01681000
## rs12415827 rs2297475 rs2297473 rs2297472 rs2254747 rs2254670 rs2254636
## 0.14171810 0.07292150 0.01016540 0.00001680 0.41696630 0.00027910 0.46203620
## rs2071510 rs2071509 rs11203041 rs11203042 rs1041390 rs7896502 rs1041389
## 0.00078460 0.00234250 0.00013330 0.00007830 0.00009400 0.35422370 0.00165870
## rs885561 rs951647 rs12257915 rs1576818 rs1576817 rs1029074 rs17117789
## 0.00016280 0.00140590 0.00023720 0.00173150 0.00102450 0.93826590 0.32191190
## rs2902563 rs7074555 rs11203047 rs3892343 rs12267584 rs1320496 rs1412444
## 0.07394800 0.40199450 0.16127510 0.68171880 0.05303330 0.00019150 0.00091350
## rs12240489 rs2246949 rs2246942 rs2246941 rs10887934 rs3740044 rs2246833
## 0.36096050 0.00007240 0.00008090 0.00008000 0.74540170 0.02227500 0.00004490
## rs2246828 rs928415 rs6586179 rs1051338 rs2250781 rs2250644 rs2902561
## 0.00018290 0.05361800 0.04649850 0.00019460 0.00025170 0.00004480 0.57388670
## rs2243548 rs2243547 rs1332327 rs2183933 rs1332325 rs10887936 rs7094601
## 0.00011620 0.00002260 0.00015360 0.72022840 0.01776480 0.01108720 0.08830380
## rs7922193
## 0.08558400
LIPA have 57 SNPs mapped and we can see that there are many significant SNPs. It makes sense that SPUs(1) have less p-value then SPUs(inf) since there are multiple significant SNPs. GATES have more power when there are small number of significant SNPs in the Gene (similar to minP test), so aSPUs have less p-value than GATES.
We can get p-value for pathways as follows. Let’s perform pathway based analysis using aSPUsPath, Gates-Simes and HYST.
GatesSimes(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix,
out.g <-snp.info=kegg9$snp.info, gene.info = kegg9$gene.info)
Hyst(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix,
out.h <-snp.info=kegg9$snp.info, gene.info = kegg9$gene.info)
aSPUsPath(kegg9$nP, corSNP = kegg9$ldmatrix, pow=c(1,2,4,8, Inf),
out.a <-pow2 = c(1,2,4,8),
snp.info=kegg9$snp.info, gene.info = kegg9$gene.info,
n.perm=1000, Ps = TRUE)
out.g; out.h; out.a
## [1] 0.002113403
## [1] 0.1011857
## SPUsPath1,1 SPUsPath2,1 SPUsPath4,1 SPUsPath8,1 SPUsPathInf,1
## 0.824 0.911 0.976 0.991 0.997
## SPUsPath1,2 SPUsPath2,2 SPUsPath4,2 SPUsPath8,2 SPUsPathInf,2
## 0.380 0.557 0.790 0.910 0.964
## SPUsPath1,4 SPUsPath2,4 SPUsPath4,4 SPUsPath8,4 SPUsPathInf,4
## 0.069 0.085 0.204 0.495 0.714
## SPUsPath1,8 SPUsPath2,8 SPUsPath4,8 SPUsPath8,8 SPUsPathInf,8
## 0.037 0.034 0.050 0.164 0.369
## aSPUsPath
## 0.108
As we can see from the gene-wise analysis, there is only one very significant gene LIPA
. In this situation Gates-Simes works well and Gate-Simes p-value is 0.0021134. On the other hand, HYST works well when there are many significant genes with similar effects.
The aSPUsPath adaptively consider all SPUsPath(i,j). We can see that the p-value decrease as our 2nd parameter , pow2
, increases. It makes sense because larger pow2
is more effective if there are fewer associated genes.