query-phenomexcan

Last updated: 2020-04-17

Checks: 6 1

Knit directory: bigquery-covid19/

This reproducible R Markdown analysis was created with workflowr (version 1.6.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has staged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200415)

The command set.seed(20200415) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 2199019

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 2199019. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Staged changes:
    Modified:   analysis/query-phenomexcan.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/query-phenomexcan.Rmd) and HTML (docs/query-phenomexcan.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	2199019	Hae Kyung Im	2020-04-17	built
html	2199019	Hae Kyung Im	2020-04-17	built
Rmd	19c2011	Hae Kyung Im	2020-04-17	rmd querying bigquery and postgres
Rmd	7fb3256	Hae Kyung Im	2020-04-16	bigqueries
Rmd	3bb3017	Hae Kyung Im	2020-04-16	edits
html	3bb3017	Hae Kyung Im	2020-04-16	edits
Rmd	b96fa1c	Hae Kyung Im	2020-04-16	query tables

Import libraries and source scripts

## install.packages("bigrquery")
## suppressPackageStartupMessages(library(bigrquery))

library(tidyverse)

── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

✓ ggplot2 3.3.0     ✓ purrr   0.3.3
✓ tibble  3.0.0     ✓ dplyr   0.8.5
✓ tidyr   1.0.2     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0

── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Connect to dbs

## connect tp phenomexcan postgres database, define querying functions
suppressPackageStartupMessages(source("code/helpers.R", chdir = TRUE))
phenomexcan_con <- get_db()

dbListTables(phenomexcan_con)

[1] "clinvar_phenotypes" "genes"              "phenotypes"        
[4] "smultixcan"         "spredixcan"         "tissues"           
[7] "ukb_clinvar"

## connect to dapg dataset
dapg_dataset = "GTEx_V8_DAPG"
dapg_con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = "gtex-awg-im",
  dataset = dapg_dataset,
  billing = "gtex-awg-im")
##
dbListTables(dapg_con)

Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
https://gargle.r-lib.org/articles/non-interactive-auth.html
The bigrquery package is using a cached token for imlab2015@gmail.com.

Auto-refreshing stale OAuth token.

 [1] "clusters_eqtl"           "clusters_eqtl_eur"      
 [3] "clusters_sqtl"           "model_summary_eqtl_eur" 
 [5] "model_variants_eqtl_eur" "models_eqtl"            
 [7] "models_eqtl_eur"         "variants_pip_eqtl"      
 [9] "variants_pip_eqtl_eur"   "variants_pip_sqtl"

## connect to gtex-awg-im:annotations.gencode_v26_all
gencode_con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = "gtex-awg-im",
  dataset = "annotations",
  billing = "gtex-awg-im")
gencode_tbl <- tbl(gencode_con, "gencode_v26_all")
gencode_df <- gencode_tbl %>% collect()


## connect to prediction weights gtex-awg-im:GTEx_V8_PF_MASHR_EUR_v1.weights_eqtl
mashrpred_con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = "gtex-awg-im",
  dataset = "GTEx_V8_PF_MASHR_EUR_v1",
  billing = "gtex-awg-im")
##
dbListTables(mashrpred_con)

 [1] "extra_eqtl"               "extra_sqtl"              
 [3] "smultixcan_eqtl"          "smultixcan_eqtl_count"   
 [5] "smultixcan_sqtl"          "smultixcan_sqtl_count"   
 [7] "spredixcan_eqtl"          "spredixcan_eqtl_count"   
 [9] "spredixcan_eqtl_hn"       "spredixcan_eqtl_hn_count"
[11] "spredixcan_eqtl_hq"       "spredixcan_eqtl_hq_count"
[13] "spredixcan_sqtl"          "spredixcan_sqtl_count"   
[15] "weights_eqtl"             "weights_sqtl"

weights_tbl = tbl(mashrpred_con,"weights_eqtl")

ensid to gene name

enslist =  c("ENSG00000184012", "ENSG00000132842","ENSG00000130234")

## using the table from phenomexcan (sex chr missing)
genes_tbl <- tbl(phenomexcan_con,"genes")
colnames(genes_tbl)

[1] "id"              "gene_id"         "gene_id_version" "gene_name"      
[5] "gene_type"       "band"

genes_tbl %>% filter(gene_id %in% enslist) %>% collect() %>% .[["gene_name"]]

[1] "AP3B1"   "TMPRSS2"

## using gencode table
colnames(gencode_tbl)

[1] "chromosome"     "start_location" "end_location"   "feature_type"  
[5] "strand"         "gene_id"        "gene_name"      "gene_type"

gencode_tbl %>% filter(substr(gene_id,1,15) %in% enslist) %>% collect() %>% .[["gene_name"]]

[1] "AP3B1"   "TMPRSS2" "ACE2"

Query multixcan association with top phenotypes for list of genes

input = list()
input$gene_name = c("TMPRSS2", "AP3B1")
input$limit = 100

tempo <- get_results_from_data_db(input)

results:
SELECT
  p.unique_description as phenotype, p.source as phenotype_source, g.gene_name, g.band as gene_band,
  sm.pvalue, sm.rcp,
  (case when sm.dir_effect_most_signif = 1 then '+'
        when sm.dir_effect_most_signif = -1 then '-'
        else '' end) as best_sign,
  (case when sm.dir_effect_consensus = 1 then '+'
        when sm.dir_effect_consensus = -1 then '-'
        else '' end) as consensus_sign,
  sm.n as n_tissues, sm.n_indep
FROM smultixcan sm inner join phenotypes p on (sm.pheno_id = p.id) inner join genes g on (sm.gene_num_id = g.id)
WHERE 1=1
AND g.gene_name IN (values ('TMPRSS2'), ('AP3B1'))
ORDER BY sm.pvalue
LIMIT 100

Query completed

head(tempo)

                                  phenotype phenotype_source gene_name
1                            Sitting height       UK Biobank     AP3B1
2                  Leg fat-free mass (left)       UK Biobank     AP3B1
3                           Standing height       UK Biobank     AP3B1
4                 Leg predicted mass (left)       UK Biobank     AP3B1
5 Forced vital capacity (FVC), Best measure       UK Biobank     AP3B1
6               Forced vital capacity (FVC)       UK Biobank     AP3B1
  gene_band     pvalue    rcp best_sign consensus_sign n_tissues n_indep
1    5q14.1 5.3877e-37 0.0891         -              -        19       5
2    5q14.1 8.0496e-26 0.1424         -              -        19       5
3    5q14.1 9.8699e-26 0.2496         -              -        19       5
4    5q14.1 1.0095e-25 0.1397         -              -        19       5
5    5q14.1 1.9335e-25 0.3197         -              -        19       5
6    5q14.1 3.2199e-25 0.3423         -              -        19       5

query predixcan all tissues association with top phenotypes for list of genes (this is slow)

input = list()
input$sp_gene_name = "TMPRSS2"
input$sp_limit = 100

if(T) tempop <- get_sp_results_from_data_db(input)

results:
SELECT
  p.unique_description as phenotype, p.source as phenotype_source, t.name as tissue_name, g.gene_name, g.band as gene_band,
  sp.pvalue, sp.zscore, sp.effect_size
FROM spredixcan sp inner join phenotypes p on (sp.pheno_id = p.id)
  inner join tissues t on (sp.tissue_id = t.id)
  inner join genes g on (sp.gene_num_id = g.id)
WHERE 1=1
AND g.gene_name IN (values ('TMPRSS2'))
ORDER BY sp.pvalue
LIMIT 100

Query completed

head(tempop)

                                                   phenotype phenotype_source
1                                             C_MALE_GENITAL       UK Biobank
2                  malignant neoplasm of male genital organs       UK Biobank
3                             Malignant neoplasm of prostate       UK Biobank
4                      Diagnoses - main ICD10: J43 Emphysema       UK Biobank
5 Diagnoses - main ICD10: C61 Malignant neoplasm of prostate       UK Biobank
6 Diagnoses - main ICD10: C61 Malignant neoplasm of prostate       UK Biobank
       tissue_name gene_name gene_band     pvalue  zscore effect_size
1         Prostate   TMPRSS2   21q22.3 8.6656e-08  5.3527   0.0138880
2         Prostate   TMPRSS2   21q22.3 8.6656e-08  5.3527   0.0138880
3         Prostate   TMPRSS2   21q22.3 1.3051e-07  5.2781   0.0132060
4             Lung   TMPRSS2   21q22.3 5.6411e-07  5.0031   0.0022190
5         Prostate   TMPRSS2   21q22.3 4.0875e-05  4.1025   0.0088149
6 Esophagus_Mucosa   TMPRSS2   21q22.3 5.0152e-05 -4.0549  -0.3784200

count total number of rows (this can be expensive depending on the table)

tbl(phenomexcan_con,"genes") %>% tally()

# Source:   lazy query [?? x 1]
# Database: postgres [metaxcan_ro@35.202.243.187:5432/phenomexcan]
  n      
  <int64>
1 22515

get fine-mapped variants

## TODO write function genename2ensid and ensid2genename
genelist =  c("TMPRSS2", "AP3B1","ACE2")
ensverlist = gencode_df %>% filter(gene_name %in% genelist) %>% collect() %>% .[["gene_id"]]

dbListTables(dapg_con)

 [1] "clusters_eqtl"           "clusters_eqtl_eur"      
 [3] "clusters_sqtl"           "model_summary_eqtl_eur" 
 [5] "model_variants_eqtl_eur" "models_eqtl"            
 [7] "models_eqtl_eur"         "variants_pip_eqtl"      
 [9] "variants_pip_eqtl_eur"   "variants_pip_sqtl"

dapg_pip_tbl <- tbl(dapg_con,"variants_pip_eqtl")
tempof <- dapg_pip_tbl %>% filter(gene %in% ensverlist) %>% collect()
tempof <- tempof   %>% left_join(gencode_df %>% select(gene_id,gene_name),by=c("gene"="gene_id"))


head(tempof)

# A tibble: 6 x 8
  tissue gene        rank variant_id         pip log10_abvf cluster_id gene_name
  <chr>  <chr>      <int> <chr>            <dbl>      <dbl>      <int> <chr>    
1 Lung   ENSG00000…    95 chr5_78141447… 2.04e-3       3.23          1 AP3B1    
2 Lung   ENSG00000…     2 chr5_78089857… 3.83e-2       4.50          1 AP3B1    
3 Lung   ENSG00000…    79 chr5_78050844… 3.01e-3       3.40          1 AP3B1    
4 Lung   ENSG00000…    39 chr5_78242771… 6.25e-3       2.94          1 AP3B1    
5 Lung   ENSG00000…   107 chr5_78053912… 1.94e-3       3.22          1 AP3B1    
6 Lung   ENSG00000…   155 chr5_78192860… 7.68e-4       2.81          1 AP3B1

get SNPs in prediction models

genelist =  c("TMPRSS2", "AP3B1")
ensidverlist = gencode_df %>% filter(gene_name %in% genelist) %>% collect() %>% .[["gene_id"]]

tempow <- weights_tbl %>% filter(gene %in% ensverlist) %>% collect() 
tempow <- tempow %>% left_join(gencode_df %>% select(gene_id,gene_name),by=c("gene"="gene_id"))
head(tempow)

# A tibble: 6 x 8
  gene      rsid    varID      ref_allele eff_allele   weight tissue   gene_name
  <chr>     <chr>   <chr>      <chr>      <chr>         <dbl> <chr>    <chr>    
1 ENSG0000… rs3507… chr21_414… AC         A           0.115   Lung     TMPRSS2  
2 ENSG0000… rs7881… chr5_7828… A          C          -0.0278  Brain_C… AP3B1    
3 ENSG0000… rs7881… chr5_7828… A          C          -0.0631  Muscle_… AP3B1    
4 ENSG0000… rs9293… chr5_7820… A          C           0.116   Whole_B… AP3B1    
5 ENSG0000… rs7390… chr21_415… A          G          -0.00317 Esophag… TMPRSS2  
6 ENSG0000… rs7390… chr21_415… A          G           0.00957 Spleen   TMPRSS2

# server <- function(input, output, session) {
#     updateSelectizeInput(session = session, inputId = 'gene_name', choices = c(genes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'sp_gene_name', choices = c(genes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'pheno', choices = c(phenotypes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'sp_pheno', choices = c(phenotypes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'sp_tissue', choices = c(tissues_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'uc_ukb_trait', choices = c(phenotypes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'uc_clinvar_trait', choices = c(clinvar_phenotypes_), server = TRUE)
#

sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RPostgres_1.2.0 forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
 [5] purrr_0.3.3     readr_1.3.1     tidyr_1.0.2     tibble_3.0.0   
 [9] ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     lubridate_1.7.8  lattice_0.20-41  utf8_1.1.4      
 [5] assertthat_0.2.1 rprojroot_1.3-2  digest_0.6.25    R6_2.4.1        
 [9] cellranger_1.1.0 backports_1.1.6  reprex_0.3.0     evaluate_0.14   
[13] httr_1.4.1       pillar_1.4.3     rlang_0.4.5      curl_4.3        
[17] readxl_1.3.1     rstudioapi_0.11  whisker_0.4      blob_1.2.1      
[21] rmarkdown_2.1    bit_1.1-15.2     munsell_0.5.0    broom_0.5.5     
[25] compiler_3.6.3   httpuv_1.5.2     modelr_0.1.6     xfun_0.13       
[29] askpass_1.1      pkgconfig_2.0.3  htmltools_0.4.0  openssl_1.4.1   
[33] tidyselect_1.0.0 workflowr_1.6.1  fansi_0.4.1      crayon_1.3.4    
[37] dbplyr_1.4.2     withr_2.1.2      later_1.0.0      grid_3.6.3      
[41] bigrquery_1.2.0  nlme_3.1-147     jsonlite_1.6.1   gtable_0.3.0    
[45] lifecycle_0.2.0  DBI_1.1.0        git2r_0.26.1     magrittr_1.5    
[49] scales_1.1.0     cli_2.0.2        stringi_1.4.6    fs_1.4.1        
[53] promises_1.1.0   xml2_1.3.1       ellipsis_0.3.0   generics_0.0.2  
[57] vctrs_0.2.4      tools_3.6.3      bit64_0.9-7      glue_1.4.0      
[61] hms_0.5.3        yaml_2.2.1       gargle_0.4.0     colorspace_1.4-1
[65] rvest_0.3.5      knitr_1.28       haven_2.2.0