Last updated: 2020-04-17

Checks: 6 1

Knit directory: bigquery-covid19/

This reproducible R Markdown analysis was created with workflowr (version 1.6.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has staged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200415) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 2199019. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Staged changes:
    Modified:   analysis/query-phenomexcan.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/query-phenomexcan.Rmd) and HTML (docs/query-phenomexcan.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 2199019 Hae Kyung Im 2020-04-17 built
html 2199019 Hae Kyung Im 2020-04-17 built
Rmd 19c2011 Hae Kyung Im 2020-04-17 rmd querying bigquery and postgres
Rmd 7fb3256 Hae Kyung Im 2020-04-16 bigqueries
Rmd 3bb3017 Hae Kyung Im 2020-04-16 edits
html 3bb3017 Hae Kyung Im 2020-04-16 edits
Rmd b96fa1c Hae Kyung Im 2020-04-16 query tables

Import libraries and source scripts

## install.packages("bigrquery")
## suppressPackageStartupMessages(library(bigrquery))

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.0     ✓ purrr   0.3.3
✓ tibble  3.0.0     ✓ dplyr   0.8.5
✓ tidyr   1.0.2     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Connect to dbs

## connect tp phenomexcan postgres database, define querying functions
suppressPackageStartupMessages(source("code/helpers.R", chdir = TRUE))
phenomexcan_con <- get_db()

dbListTables(phenomexcan_con)
[1] "clinvar_phenotypes" "genes"              "phenotypes"        
[4] "smultixcan"         "spredixcan"         "tissues"           
[7] "ukb_clinvar"       
## connect to dapg dataset
dapg_dataset = "GTEx_V8_DAPG"
dapg_con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = "gtex-awg-im",
  dataset = dapg_dataset,
  billing = "gtex-awg-im")
##
dbListTables(dapg_con)
Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
https://gargle.r-lib.org/articles/non-interactive-auth.html
The bigrquery package is using a cached token for imlab2015@gmail.com.
Auto-refreshing stale OAuth token.
 [1] "clusters_eqtl"           "clusters_eqtl_eur"      
 [3] "clusters_sqtl"           "model_summary_eqtl_eur" 
 [5] "model_variants_eqtl_eur" "models_eqtl"            
 [7] "models_eqtl_eur"         "variants_pip_eqtl"      
 [9] "variants_pip_eqtl_eur"   "variants_pip_sqtl"      
## connect to gtex-awg-im:annotations.gencode_v26_all
gencode_con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = "gtex-awg-im",
  dataset = "annotations",
  billing = "gtex-awg-im")
gencode_tbl <- tbl(gencode_con, "gencode_v26_all")
gencode_df <- gencode_tbl %>% collect()


## connect to prediction weights gtex-awg-im:GTEx_V8_PF_MASHR_EUR_v1.weights_eqtl
mashrpred_con <- DBI::dbConnect(
  bigrquery::bigquery(), 
  project = "gtex-awg-im",
  dataset = "GTEx_V8_PF_MASHR_EUR_v1",
  billing = "gtex-awg-im")
##
dbListTables(mashrpred_con)
 [1] "extra_eqtl"               "extra_sqtl"              
 [3] "smultixcan_eqtl"          "smultixcan_eqtl_count"   
 [5] "smultixcan_sqtl"          "smultixcan_sqtl_count"   
 [7] "spredixcan_eqtl"          "spredixcan_eqtl_count"   
 [9] "spredixcan_eqtl_hn"       "spredixcan_eqtl_hn_count"
[11] "spredixcan_eqtl_hq"       "spredixcan_eqtl_hq_count"
[13] "spredixcan_sqtl"          "spredixcan_sqtl_count"   
[15] "weights_eqtl"             "weights_sqtl"            
weights_tbl = tbl(mashrpred_con,"weights_eqtl")

ensid to gene name

enslist =  c("ENSG00000184012", "ENSG00000132842","ENSG00000130234")

## using the table from phenomexcan (sex chr missing)
genes_tbl <- tbl(phenomexcan_con,"genes")
colnames(genes_tbl)
[1] "id"              "gene_id"         "gene_id_version" "gene_name"      
[5] "gene_type"       "band"           
genes_tbl %>% filter(gene_id %in% enslist) %>% collect() %>% .[["gene_name"]]
[1] "AP3B1"   "TMPRSS2"
## using gencode table
colnames(gencode_tbl)
[1] "chromosome"     "start_location" "end_location"   "feature_type"  
[5] "strand"         "gene_id"        "gene_name"      "gene_type"     
gencode_tbl %>% filter(substr(gene_id,1,15) %in% enslist) %>% collect() %>% .[["gene_name"]]
[1] "AP3B1"   "TMPRSS2" "ACE2"   

Query multixcan association with top phenotypes for list of genes

input = list()
input$gene_name = c("TMPRSS2", "AP3B1")
input$limit = 100

tempo <- get_results_from_data_db(input)
results:
SELECT
  p.unique_description as phenotype, p.source as phenotype_source, g.gene_name, g.band as gene_band,
  sm.pvalue, sm.rcp,
  (case when sm.dir_effect_most_signif = 1 then '+'
        when sm.dir_effect_most_signif = -1 then '-'
        else '' end) as best_sign,
  (case when sm.dir_effect_consensus = 1 then '+'
        when sm.dir_effect_consensus = -1 then '-'
        else '' end) as consensus_sign,
  sm.n as n_tissues, sm.n_indep
FROM smultixcan sm inner join phenotypes p on (sm.pheno_id = p.id) inner join genes g on (sm.gene_num_id = g.id)
WHERE 1=1
AND g.gene_name IN (values ('TMPRSS2'), ('AP3B1'))
ORDER BY sm.pvalue
LIMIT 100
Query completed
head(tempo)
                                  phenotype phenotype_source gene_name
1                            Sitting height       UK Biobank     AP3B1
2                  Leg fat-free mass (left)       UK Biobank     AP3B1
3                           Standing height       UK Biobank     AP3B1
4                 Leg predicted mass (left)       UK Biobank     AP3B1
5 Forced vital capacity (FVC), Best measure       UK Biobank     AP3B1
6               Forced vital capacity (FVC)       UK Biobank     AP3B1
  gene_band     pvalue    rcp best_sign consensus_sign n_tissues n_indep
1    5q14.1 5.3877e-37 0.0891         -              -        19       5
2    5q14.1 8.0496e-26 0.1424         -              -        19       5
3    5q14.1 9.8699e-26 0.2496         -              -        19       5
4    5q14.1 1.0095e-25 0.1397         -              -        19       5
5    5q14.1 1.9335e-25 0.3197         -              -        19       5
6    5q14.1 3.2199e-25 0.3423         -              -        19       5

query predixcan all tissues association with top phenotypes for list of genes (this is slow)

input = list()
input$sp_gene_name = "TMPRSS2"
input$sp_limit = 100

if(T) tempop <- get_sp_results_from_data_db(input)
results:
SELECT
  p.unique_description as phenotype, p.source as phenotype_source, t.name as tissue_name, g.gene_name, g.band as gene_band,
  sp.pvalue, sp.zscore, sp.effect_size
FROM spredixcan sp inner join phenotypes p on (sp.pheno_id = p.id)
  inner join tissues t on (sp.tissue_id = t.id)
  inner join genes g on (sp.gene_num_id = g.id)
WHERE 1=1
AND g.gene_name IN (values ('TMPRSS2'))
ORDER BY sp.pvalue
LIMIT 100
Query completed
head(tempop)
                                                   phenotype phenotype_source
1                                             C_MALE_GENITAL       UK Biobank
2                  malignant neoplasm of male genital organs       UK Biobank
3                             Malignant neoplasm of prostate       UK Biobank
4                      Diagnoses - main ICD10: J43 Emphysema       UK Biobank
5 Diagnoses - main ICD10: C61 Malignant neoplasm of prostate       UK Biobank
6 Diagnoses - main ICD10: C61 Malignant neoplasm of prostate       UK Biobank
       tissue_name gene_name gene_band     pvalue  zscore effect_size
1         Prostate   TMPRSS2   21q22.3 8.6656e-08  5.3527   0.0138880
2         Prostate   TMPRSS2   21q22.3 8.6656e-08  5.3527   0.0138880
3         Prostate   TMPRSS2   21q22.3 1.3051e-07  5.2781   0.0132060
4             Lung   TMPRSS2   21q22.3 5.6411e-07  5.0031   0.0022190
5         Prostate   TMPRSS2   21q22.3 4.0875e-05  4.1025   0.0088149
6 Esophagus_Mucosa   TMPRSS2   21q22.3 5.0152e-05 -4.0549  -0.3784200

count total number of rows (this can be expensive depending on the table)

tbl(phenomexcan_con,"genes") %>% tally()
# Source:   lazy query [?? x 1]
# Database: postgres [metaxcan_ro@35.202.243.187:5432/phenomexcan]
  n      
  <int64>
1 22515  

get fine-mapped variants

## TODO write function genename2ensid and ensid2genename
genelist =  c("TMPRSS2", "AP3B1","ACE2")
ensverlist = gencode_df %>% filter(gene_name %in% genelist) %>% collect() %>% .[["gene_id"]]

dbListTables(dapg_con)
 [1] "clusters_eqtl"           "clusters_eqtl_eur"      
 [3] "clusters_sqtl"           "model_summary_eqtl_eur" 
 [5] "model_variants_eqtl_eur" "models_eqtl"            
 [7] "models_eqtl_eur"         "variants_pip_eqtl"      
 [9] "variants_pip_eqtl_eur"   "variants_pip_sqtl"      
dapg_pip_tbl <- tbl(dapg_con,"variants_pip_eqtl")
tempof <- dapg_pip_tbl %>% filter(gene %in% ensverlist) %>% collect()
tempof <- tempof   %>% left_join(gencode_df %>% select(gene_id,gene_name),by=c("gene"="gene_id"))


head(tempof)
# A tibble: 6 x 8
  tissue gene        rank variant_id         pip log10_abvf cluster_id gene_name
  <chr>  <chr>      <int> <chr>            <dbl>      <dbl>      <int> <chr>    
1 Lung   ENSG00000…    95 chr5_78141447… 2.04e-3       3.23          1 AP3B1    
2 Lung   ENSG00000…     2 chr5_78089857… 3.83e-2       4.50          1 AP3B1    
3 Lung   ENSG00000…    79 chr5_78050844… 3.01e-3       3.40          1 AP3B1    
4 Lung   ENSG00000…    39 chr5_78242771… 6.25e-3       2.94          1 AP3B1    
5 Lung   ENSG00000…   107 chr5_78053912… 1.94e-3       3.22          1 AP3B1    
6 Lung   ENSG00000…   155 chr5_78192860… 7.68e-4       2.81          1 AP3B1    

get SNPs in prediction models

genelist =  c("TMPRSS2", "AP3B1")
ensidverlist = gencode_df %>% filter(gene_name %in% genelist) %>% collect() %>% .[["gene_id"]]

tempow <- weights_tbl %>% filter(gene %in% ensverlist) %>% collect() 
tempow <- tempow %>% left_join(gencode_df %>% select(gene_id,gene_name),by=c("gene"="gene_id"))
head(tempow)
# A tibble: 6 x 8
  gene      rsid    varID      ref_allele eff_allele   weight tissue   gene_name
  <chr>     <chr>   <chr>      <chr>      <chr>         <dbl> <chr>    <chr>    
1 ENSG0000… rs3507… chr21_414… AC         A           0.115   Lung     TMPRSS2  
2 ENSG0000… rs7881… chr5_7828… A          C          -0.0278  Brain_C… AP3B1    
3 ENSG0000… rs7881… chr5_7828… A          C          -0.0631  Muscle_… AP3B1    
4 ENSG0000… rs9293… chr5_7820… A          C           0.116   Whole_B… AP3B1    
5 ENSG0000… rs7390… chr21_415… A          G          -0.00317 Esophag… TMPRSS2  
6 ENSG0000… rs7390… chr21_415… A          G           0.00957 Spleen   TMPRSS2  
# server <- function(input, output, session) {
#     updateSelectizeInput(session = session, inputId = 'gene_name', choices = c(genes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'sp_gene_name', choices = c(genes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'pheno', choices = c(phenotypes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'sp_pheno', choices = c(phenotypes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'sp_tissue', choices = c(tissues_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'uc_ukb_trait', choices = c(phenotypes_), server = TRUE)
#     updateSelectizeInput(session = session, inputId = 'uc_clinvar_trait', choices = c(clinvar_phenotypes_), server = TRUE)
#     

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RPostgres_1.2.0 forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
 [5] purrr_0.3.3     readr_1.3.1     tidyr_1.0.2     tibble_3.0.0   
 [9] ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     lubridate_1.7.8  lattice_0.20-41  utf8_1.1.4      
 [5] assertthat_0.2.1 rprojroot_1.3-2  digest_0.6.25    R6_2.4.1        
 [9] cellranger_1.1.0 backports_1.1.6  reprex_0.3.0     evaluate_0.14   
[13] httr_1.4.1       pillar_1.4.3     rlang_0.4.5      curl_4.3        
[17] readxl_1.3.1     rstudioapi_0.11  whisker_0.4      blob_1.2.1      
[21] rmarkdown_2.1    bit_1.1-15.2     munsell_0.5.0    broom_0.5.5     
[25] compiler_3.6.3   httpuv_1.5.2     modelr_0.1.6     xfun_0.13       
[29] askpass_1.1      pkgconfig_2.0.3  htmltools_0.4.0  openssl_1.4.1   
[33] tidyselect_1.0.0 workflowr_1.6.1  fansi_0.4.1      crayon_1.3.4    
[37] dbplyr_1.4.2     withr_2.1.2      later_1.0.0      grid_3.6.3      
[41] bigrquery_1.2.0  nlme_3.1-147     jsonlite_1.6.1   gtable_0.3.0    
[45] lifecycle_0.2.0  DBI_1.1.0        git2r_0.26.1     magrittr_1.5    
[49] scales_1.1.0     cli_2.0.2        stringi_1.4.6    fs_1.4.1        
[53] promises_1.1.0   xml2_1.3.1       ellipsis_0.3.0   generics_0.0.2  
[57] vctrs_0.2.4      tools_3.6.3      bit64_0.9-7      glue_1.4.0      
[61] hms_0.5.3        yaml_2.2.1       gargle_0.4.0     colorspace_1.4-1
[65] rvest_0.3.5      knitr_1.28       haven_2.2.0