Adaptive Gene- and Pathway- Trait Association testing with GWAS Summary Statistics (aSPUs() and aSPUsPath())

2021-06-28

This vignette illustrates the use of the aSPU package with GWAS Summary Statistics.

Data

We will consider the analysis of a coronary artery disease (CAD) data from the CARDIoGRAM and C4D consortium. The data set contains P value data for coronary artery disease (CAD). This study comprised 63,746 CAD cases and 130,681 controls. We mapped these SNPs to the 9th KEGG pathway (KEGG_STEROID_BIOSYNTHESIS) . Let’s load this subset.

library(aSPU)
## Loading required package: gee
## Loading required package: MASS
## Loading required package: mvtnorm
## Loading required package: fields
## Loading required package: spam
## Loading required package: dotCall64
## Loading required package: grid
## Spam version 2.7-0 (2021-06-25) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction 
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
## 
## Attaching package: 'spam'
## The following object is masked from 'package:mvtnorm':
## 
##     rmvnorm
## The following objects are masked from 'package:base':
## 
##     backsolve, forwardsolve
## Loading required package: viridis
## Loading required package: viridisLite
## See https://github.com/NCAR/Fields for
##  an extensive vignette, other supplements and source code
## Loading required package: matrixStats
data(kegg9)

The 9th Kegg pathway contains 16 genes.

kegg9$gene.info
##         V4 V1        V2        V3
## 1    SOAT1  1 179261017 179326453
## 2      LSS 21  47606360  47650738
## 3     SQLE  8 126008720 126036525
## 5  CYP51A1  7  91739463  91766059
## 6    DHCR7 11  71143457  71161477
## 7  CYP27B1 12  58154117  58162976
## 8   DHCR24  1  55313300  55354921
## 9  HSD17B7  1 162758496 162784608
## 10   MSMO1  4 166246818 166266225
## 11   FDFT1  8  11658190  11698818
## 12    LIPA 10  90971326  91013660
## 13     CEL  9 135935365 135949248
## 14  TM7SF2 11  64877341  64885707
## 16   SOAT2 12  53495274  53520323

The PPs is a list object contains the SNP information for each genes.

## SNPs mapped on 3rd and 4th genes of 9th Kegg pathway
kegg9$PPs[3:4]
## $CYP51A1
##                    V1 V2       V3        V4
## rs7796370   rs7796370  7 91739652 0.9035991
## rs12056285 rs12056285  7 91743745 0.7552544
## rs10953067 rs10953067  7 91746159 0.9138308
## rs6965936   rs6965936  7 91752154 0.2282090
## rs10953068 rs10953068  7 91752687 0.7363687
## rs10488513 rs10488513  7 91752911 0.9015296
## rs10953069 rs10953069  7 91760477 0.9718199
## rs721015     rs721015  7 91763192 0.3465701
## 
## $CYP27B1
##                    V1 V2       V3        V4
## rs7306830   rs7306830 12 58154246 0.8323711
## rs10783975 rs10783975 12 58157977 0.7783928
## rs276054     rs276054 12 58161746 0.9735500

The Ps object contains p-value information for mapped SNPs. Total 394SNPs are mapped on 9th kegg pathway.

length(kegg9$Ps)
## [1] 394
kegg9$Ps[1:10]
##  rs2331902 rs10159067  rs3806284  rs3806283  rs6692913  rs1411479 rs12752307 
##  0.2115124  0.9898204  0.9488433  0.4944396  0.5838916  0.4224329  0.3675058 
## rs12036052 rs12134601 rs10159443 
##  0.1982464  0.9293256  0.9375789

Using plink, we mapped our SNPs to the reference population ( Hapmap CEU phase 2 ). We dropped the SNPs of minor allele frequency (MAF) less then 5 percent. Among 394 SNPs mapped on 9th Kegg pathway, 330 SNPs are mapped on the reference data. P-values of these SNPs and correlation matirx of SNPs using the reference population, saved on nP and ldmatrix object.

kegg9$nP[1:10]
##   rs487230   rs593398  rs6694813   rs550336 rs17111584   rs683880  rs6588545 
##  0.9209627  0.8499274  0.9741802  0.6074471  0.5866373  0.9607142  0.3860578 
##   rs679804   rs555687   rs497159 
##  0.8845177  0.9036468  0.8872636
kegg9$ldmatrix[1:10,1:10]
##              rs487230    rs593398   rs6694813   rs550336  rs17111584   rs683880
## rs487230    1.0000000 -0.19188474  0.72579887  0.8910864  0.45580139  0.9742395
## rs593398   -0.1918847  1.00000000 -0.17488093 -0.1880254 -0.06052275 -0.1777817
## rs6694813   0.7257989 -0.17488093  1.00000000  0.8350461 -0.08326255  0.7438007
## rs550336    0.8910864 -0.18802536  0.83504607  1.0000000  0.47874154  0.8600249
## rs17111584  0.4558014 -0.06052275 -0.08326255  0.4787415  1.00000000  0.3708139
## rs683880    0.9742395 -0.17778169  0.74380068  0.8600249  0.37081386  1.0000000
## rs6588545   0.4128228 -0.02037425 -0.07921255  0.4209425  1.00000000  0.3588603
## rs679804    0.7257989 -0.17488093  1.00000000  0.8350461 -0.08326255  0.7438007
## rs555687    1.0000000 -0.18763049  0.73226324  0.8901041  0.45918526  0.9740010
## rs497159    0.7257989 -0.17488093  1.00000000  0.8350461 -0.08326255  0.7438007
##              rs6588545    rs679804   rs555687    rs497159
## rs487230    0.41282283  0.72579887  1.0000000  0.72579887
## rs593398   -0.02037425 -0.17488093 -0.1876305 -0.17488093
## rs6694813  -0.07921255  1.00000000  0.7322632  1.00000000
## rs550336    0.42094247  0.83504607  0.8901041  0.83504607
## rs17111584  1.00000000 -0.08326255  0.4591853 -0.08326255
## rs683880    0.35886033  0.74380068  0.9740010  0.74380068
## rs6588545   1.00000000 -0.07921255  0.4170467 -0.07921255
## rs679804   -0.07921255  1.00000000  0.7322632  1.00000000
## rs555687    0.41704671  0.73226324  1.0000000  0.73226324
## rs497159   -0.07921255  1.00000000  0.7322632  1.00000000

The snp.info data object have snp information for each mapped SNPs. The 1st column is rsID, 2nd column is Chr, 3rd column is location and 4th column is p-value.

kegg9$snp.info[1:10,]
##                  rsID Chr      Loc      Pval
## rs487230     rs487230   1 55313762 0.9209627
## rs593398     rs593398   1 55314484 0.8499274
## rs6694813   rs6694813   1 55315191 0.9741802
## rs550336     rs550336   1 55318103 0.6074471
## rs17111584 rs17111584   1 55319284 0.5866373
## rs683880     rs683880   1 55319483 0.9607142
## rs6588545   rs6588545   1 55321962 0.3860578
## rs679804     rs679804   1 55325157 0.8845177
## rs555687     rs555687   1 55325692 0.9036468
## rs497159     rs497159   1 55326142 0.8872636

Use of aSPUs function

Using the following code we can use aSPUs() and GATES2() function to get gene-wise aSPU and GATES p-value for each gene in 9th Kegg pathway.

Gps<-NULL;
gl <- NULL;
for( g in kegg9$gene.info[,1]) {
    snps <-
    which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) &
           (kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])&
           (kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4]))

    newP <- kegg9$nP[snps] ;
    ldsub <- kegg9$ldmatrix[snps, snps];

    if( length(snps) > 1) {
        out <- aSPUs(newP, corSNP=ldsub , pow=c(1,2,4,8, Inf),
                     n.perm=10000, Ps=TRUE)

        o.pvec = order(newP)
        ldmat <- ldsub[o.pvec, o.pvec]
        gatesp <- GATES2(ldmat, sort(newP))[1]
        Gps <- rbind(Gps, c(length(newP),out$pvs, gatesp))
        gl <- c(gl, g)
    } else if (length(snps) == 1) {
        out <- newP
        gatesp <- newP
        Gps <- rbind(Gps, c(length(newP),rep(out,6), gatesp) )
        gl <- c(gl, g)
    }
}
colnames(Gps)[1] <- "nSNP"
rownames(Gps) <- gl
Gps
##         nSNP     SPUs1     SPUs2     SPUs4     SPUs8   SPUsInf      aSPUs
## SOAT1     63 0.7870000 0.7887000 0.7981000 0.7938000 0.7547000 0.85731427
## SQLE      49 0.6266000 0.6426000 0.6433000 0.6209000 0.5467000 0.64643536
## CYP51A1    8 0.7840000 0.6925000 0.6097000 0.5631000 0.5047000 0.59784022
## CYP27B1    3 0.9662000 0.9578000 0.9506000 0.9474000 0.9461000 0.95620438
## DHCR24    26 0.9265000 0.8724000 0.8204000 0.7816000 0.7381000 0.80471953
## HSD17B7   43 0.6403000 0.7167000 0.7431000 0.6388000 0.5188000 0.65173483
## MSMO1      7 0.0099000 0.0101000 0.0141000 0.0217000 0.0407000 0.01709829
## FDFT1     54 0.3151000 0.2938000 0.2777000 0.2207000 0.1299000 0.18468153
## LIPA      57 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00009999
## CEL        5 0.1739000 0.1939000 0.2356000 0.2742000 0.3049000 0.23557644
## TM7SF2     1 0.7816163 0.7816163 0.7816163 0.7816163 0.7816163 0.78161630
## SOAT2     14 0.2227000 0.2547000 0.3044000 0.3556000 0.3886000 0.27347265
##                   Pg
## SOAT1   0.5686306049
## SQLE    0.8102291397
## CYP51A1 0.6798247959
## CYP27B1 0.9735500000
## DHCR24  0.9777507000
## HSD17B7 0.6360037948
## MSMO1   0.0228054242
## FDFT1   0.1746149004
## LIPA    0.0001761169
## CEL     0.2958316751
## TM7SF2  0.7816163000
## SOAT2   0.3239918282

The row of Gps means each gene, 1st column indicate the number of SNPs for each gene. 2nd to 6th column indicate SPUs p-values for each power ( 1,2,4,8 and Inf), 7th column indicate aSPUs p-value and 8th column indicate p-value of GATES method. We can see that a gene LIPA is very significant.

g = "LIPA"
    snps <-
    which( ( kegg9$snp.info[,2] == kegg9$gene.info[kegg9$gene.info[,1]==g,2]) &
           (kegg9$snp.info[,3] > kegg9$gene.info[kegg9$gene.info[,1] == g, 3])&
           (kegg9$snp.info[,3] < kegg9$gene.info[kegg9$gene.info[,1] == g, 4]))

newP <- kegg9$nP[snps] ;
newP
## rs10509568 rs10509569 rs12780342  rs7922269  rs6586175 rs12358054  rs1556478 
## 0.00031650 0.09359480 0.04754370 0.00000756 0.00216860 0.13601460 0.01681000 
## rs12415827  rs2297475  rs2297473  rs2297472  rs2254747  rs2254670  rs2254636 
## 0.14171810 0.07292150 0.01016540 0.00001680 0.41696630 0.00027910 0.46203620 
##  rs2071510  rs2071509 rs11203041 rs11203042  rs1041390  rs7896502  rs1041389 
## 0.00078460 0.00234250 0.00013330 0.00007830 0.00009400 0.35422370 0.00165870 
##   rs885561   rs951647 rs12257915  rs1576818  rs1576817  rs1029074 rs17117789 
## 0.00016280 0.00140590 0.00023720 0.00173150 0.00102450 0.93826590 0.32191190 
##  rs2902563  rs7074555 rs11203047  rs3892343 rs12267584  rs1320496  rs1412444 
## 0.07394800 0.40199450 0.16127510 0.68171880 0.05303330 0.00019150 0.00091350 
## rs12240489  rs2246949  rs2246942  rs2246941 rs10887934  rs3740044  rs2246833 
## 0.36096050 0.00007240 0.00008090 0.00008000 0.74540170 0.02227500 0.00004490 
##  rs2246828   rs928415  rs6586179  rs1051338  rs2250781  rs2250644  rs2902561 
## 0.00018290 0.05361800 0.04649850 0.00019460 0.00025170 0.00004480 0.57388670 
##  rs2243548  rs2243547  rs1332327  rs2183933  rs1332325 rs10887936  rs7094601 
## 0.00011620 0.00002260 0.00015360 0.72022840 0.01776480 0.01108720 0.08830380 
##  rs7922193 
## 0.08558400

LIPA have 57 SNPs mapped and we can see that there are many significant SNPs. It makes sense that SPUs(1) have less p-value then SPUs(inf) since there are multiple significant SNPs. GATES have more power when there are small number of significant SNPs in the Gene (similar to minP test), so aSPUs have less p-value than GATES.

Use of aSPUsPath function

We can get p-value for pathways as follows. Let’s perform pathway based analysis using aSPUsPath, Gates-Simes and HYST.

out.g <- GatesSimes(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix,
                    snp.info=kegg9$snp.info, gene.info = kegg9$gene.info)

out.h <- Hyst(pvec = kegg9$nP, ldmatrix = kegg9$ldmatrix,
              snp.info=kegg9$snp.info, gene.info = kegg9$gene.info)

out.a <- aSPUsPath(kegg9$nP, corSNP = kegg9$ldmatrix, pow=c(1,2,4,8, Inf),
                   pow2 = c(1,2,4,8),
                   snp.info=kegg9$snp.info, gene.info = kegg9$gene.info,
                   n.perm=1000, Ps = TRUE)

out.g; out.h; out.a
## [1] 0.002113403
## [1] 0.1011857
##   SPUsPath1,1   SPUsPath2,1   SPUsPath4,1   SPUsPath8,1 SPUsPathInf,1 
##         0.824         0.911         0.976         0.991         0.997 
##   SPUsPath1,2   SPUsPath2,2   SPUsPath4,2   SPUsPath8,2 SPUsPathInf,2 
##         0.380         0.557         0.790         0.910         0.964 
##   SPUsPath1,4   SPUsPath2,4   SPUsPath4,4   SPUsPath8,4 SPUsPathInf,4 
##         0.069         0.085         0.204         0.495         0.714 
##   SPUsPath1,8   SPUsPath2,8   SPUsPath4,8   SPUsPath8,8 SPUsPathInf,8 
##         0.037         0.034         0.050         0.164         0.369 
##     aSPUsPath 
##         0.108

As we can see from the gene-wise analysis, there is only one very significant gene LIPA. In this situation Gates-Simes works well and Gate-Simes p-value is 0.0021134. On the other hand, HYST works well when there are many significant genes with similar effects.

The aSPUsPath adaptively consider all SPUsPath(i,j). We can see that the p-value decrease as our 2nd parameter , pow2, increases. It makes sense because larger pow2 is more effective if there are fewer associated genes.