Last updated: 2020-03-12

Checks: 6 passed, 1 warning

Knit directory: hgen471/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20200105) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Unstaged changes:
    Modified:   analysis/ldsc_tutorial.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd ecff05c Yanyu Liang 2020-02-16 added ldsc mini tutorial
html ecff05c Yanyu Liang 2020-02-16 added ldsc mini tutorial

\[ \newcommand{\E}{\text{E}} \]

Overview

This is a minimal tutorial on using ldsc for LD score (LDSC) regression. Before the hands-on part, let’s take a quick recap of what LDSC regression does and what input files a minimal LDSC regression analysis requires.

Under a polygenic model, B. K. Bulik-Sullivan et al. (2015) proposed

\[\begin{aligned} \E[\chi_j^2 | l_j] = N\frac{h^2}{M} l_j + Na + 1 \end{aligned}\] where \(\chi_j^2\) is the GWAS summary statistic (\(\chi^2\) statistic) of SNP \(j\), \(N\) is the GWAS sample size, \(h^2\) is the heritability, \(M\) is the number of common variants considered to contribute to heritability, and \(a\) measures the contribution of confounding biases such as population stratification. Importantly, this relationship holds for each SNP \(j\). The quantity \(l_j\) is called the LD score of SNP \(j\), defined as \(l_j := \sum_k r_{jk}^2\), where \(r_{jk}^2\) is the squared correlation (LD) between SNPs \(j\) and \(k\). Notice that \(l_j\) is population-specific and does not depend on the trait.
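
To get a sense of the scale of this relationship, here is a purely hypothetical plug-in example (the numbers are illustrative and not taken from any real GWAS): with \(N = 100{,}000\), \(h^2 = 0.5\), \(M = 1{,}000{,}000\), no confounding (\(a = 0\)), and an LD score of \(l_j = 100\),

\[\begin{aligned} \E[\chi_j^2 | l_j] = 100{,}000 \times \frac{0.5}{1{,}000{,}000} \times 100 + 0 + 1 = 6 \end{aligned}\]

So SNPs tagging more of the genome (larger \(l_j\)) are expected to show larger association statistics under polygenicity, whereas pure confounding would inflate the intercept rather than the slope.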

OK, essentially, to regress \(\chi_j^2\) against \(l_j\), we need to know:

  • \(\chi_j^2\) which is available from GWAS summary statistics
  • \(N\) GWAS sample size
  • \(l_j\) which is typically provided by the LDSC developers; you can also compute it on your own if you are working with a population not covered by the pre-computed files

From here we can lay out a general workflow for LDSC regression:

  1. Download or prepare LDSC files
  2. Format your GWAS so that it uses the same SNP ID system as LDSC files
  3. Run LDSC regression software

Since this is a minimal tutorial, we won’t touch on how to prepare LDSC files yourself. If you are interested, you can take a look at link for this task.
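
For orientation, the commands in the rest of this tutorial assume a working directory organized roughly as sketched below; [your-working-dir] is a placeholder for wherever you choose to work, and each subdirectory is created again in the corresponding section.

cd [your-working-dir]                 # placeholder for your chosen working directory
mkdir -p software data/gwas output
# software/  -> ldsc repository (cloned in the next section)
# data/      -> pre-computed LD scores (eur_w_ld_chr/) and raw GWAS files (gwas/)
# output/    -> formatted summary statistics and LDSC results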

Install the LDSC regression software

If you don’t have conda installed on your machine, please install it from here before proceeding. Then, go to the ldsc GitHub repository and follow the instructions.

# assuming you've installed conda
cd [your-working-dir]
mkdir software 
cd software 
git clone https://github.com/bulik/ldsc.git
cd ldsc
conda env create --file environment.yml
source activate ldsc
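
To check that the installation worked, a common sanity check is to ask the two main scripts for their help messages; if the environment is set up correctly, each should print its usage and exit.

# still inside software/ldsc, with the ldsc environment activated
python ldsc.py -h
python munge_sumstats.py -h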

Download LDSC files

The LDSC developers have pre-computed LD scores for Europeans and East Asians, which are available here. For this tutorial we will work with a European GWAS, so we can simply use the files suggested here. These LD scores were computed from 1000 Genomes Europeans restricted to HapMap3 SNPs.

So, we can do

cd [your-working-dir]
mkdir data
cd data
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
tar -jxvf eur_w_ld_chr.tar.bz2
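
As a quick sanity check, you can peek at what was extracted; the per-chromosome file naming mentioned in the comments ({chr}.l2.ldscore.gz) is the usual ldsc convention, so adjust if your download is organized differently.

cd [your-working-dir]/data
ls eur_w_ld_chr | head                               # expect per-chromosome LD score files (e.g. 1.l2.ldscore.gz) plus w_hm3.snplist
gzip -dc eur_w_ld_chr/1.l2.ldscore.gz | head -n 3    # each row is a SNP with its LD score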

Prepare GWAS files

For this tutorial, we select GWAS summary statistics from link. For illustration, I downloaded UKB_20022_Birth_weight.txt.gz and UKB_21001_Body_mass_index_BMI.txt.gz. Let’s move the downloaded files to [your-working-dir]/data/gwas/.
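
Before formatting, it is worth confirming the column names in the downloaded file, since they are passed explicitly to the formatting script below (the command in the next section assumes columns named variant_id, sample_size, zscore, pvalue, effect_allele, and non_effect_allele; adjust the flags if your file differs).

cd [your-working-dir]
gzip -dc data/gwas/UKB_20022_Birth_weight.txt.gz | head -n 2   # inspect the header and the first data row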

ldsc requires the GWAS summary statistics file to follow a specific format, which is described in detail here.

In short, it requires your file to have the following columns (see the example after the list):

  1. SNP – SNP identifier (e.g., rs number or other SNP ID matching the LDSC files)
  2. N – sample size (which may vary from SNP to SNP).
  3. Z – z-score. Sign with respect to A1 (warning, possible gotcha)
  4. A1 – first allele (effect allele)
  5. A2 – second allele (other allele)
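
For orientation, a correctly formatted file is essentially a plain-text (optionally gzipped) table with those columns; the rows below are made-up placeholders, not real data.

SNP        A1  A2  N       Z
rs0000001  A   G   300000  -1.23
rs0000002  T   C   300000   0.45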

The recommended practice is to use the munge_sumstats.py script shipped with the software for this formatting step. For our dataset, we can run

cd [your-working-dir]
mkdir output
python software/ldsc/munge_sumstats.py \
  --sumstats data/gwas/UKB_20022_Birth_weight.txt.gz \
  --N-col sample_size \
  --snp variant_id \
  --a1 effect_allele \
  --a2 non_effect_allele \
  --signed-sumstats zscore,0 \
  --p pvalue \
  --merge-alleles data/eur_w_ld_chr/w_hm3.snplist \
  --out output/UKB_20022_Birth_weight

And run a similar command for UKB_21001_Body_mass_index_BMI.txt.gz, shown below.
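
Concretely, assuming the BMI file uses the same column names as the birth weight file, the analogous command would be:

cd [your-working-dir]
python software/ldsc/munge_sumstats.py \
  --sumstats data/gwas/UKB_21001_Body_mass_index_BMI.txt.gz \
  --N-col sample_size \
  --snp variant_id \
  --a1 effect_allele \
  --a2 non_effect_allele \
  --signed-sumstats zscore,0 \
  --p pvalue \
  --merge-alleles data/eur_w_ld_chr/w_hm3.snplist \
  --out output/UKB_21001_Body_mass_index_BMI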

You may notice that the formatting script also performs some quality control and sanity checks. So, when possible, we prefer to do the formatting with this script rather than writing our own.

Side note: we find that the default settings of munge_sumstats.py may consume a large amount of memory, which can significantly slow down the run when RAM is limited. To account for that, we suggest adjusting the default by changing --chunksize, for instance setting --chunksize 500000 on an RCC login node (see the sketch below).
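
For example, the birth weight command above would stay the same except for one extra flag:

# same formatting command as before, with a smaller chunk size to reduce peak memory use
python software/ldsc/munge_sumstats.py \
  --sumstats data/gwas/UKB_20022_Birth_weight.txt.gz \
  --chunksize 500000 \
  --N-col sample_size \
  --snp variant_id \
  --a1 effect_allele \
  --a2 non_effect_allele \
  --signed-sumstats zscore,0 \
  --p pvalue \
  --merge-alleles data/eur_w_ld_chr/w_hm3.snplist \
  --out output/UKB_20022_Birth_weight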

Run LDSC regression

OK, it’s time to run LDSC regression.

cd [your-working-dir]
python software/ldsc/ldsc.py \
  --h2 output/UKB_20022_Birth_weight.sumstats.gz \
  --ref-ld-chr data/eur_w_ld_chr/ \
  --w-ld-chr data/eur_w_ld_chr/ \
  --out output/ldsc_UKB_20022_Birth_weight

--ref-ld-chr specifies the LD scores used as the regressor, and --w-ld-chr specifies the LD scores used for the regression weights (a more technical detail); here both point to the per-chromosome LD score files we downloaded.
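
Once the run finishes, the heritability estimate, its standard error, and the intercept are reported and end up in a log file named after --out, which you can simply print (the exact wording of the log depends on the ldsc version you installed).

cat output/ldsc_UKB_20022_Birth_weight.log   # look for the observed-scale h2 estimate and the intercept near the end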

And we can estimate genetic correlation under a similar framework (B. Bulik-Sullivan et al. 2015).

cd [your-working-dir]
python software/ldsc/ldsc.py \
  --rg output/UKB_20022_Birth_weight.sumstats.gz,output/UKB_21001_Body_mass_index_BMI.sumstats.gz \
  --ref-ld-chr data/eur_w_ld_chr/ \
  --w-ld-chr data/eur_w_ld_chr/ \
  --out output/rg_UKB_20022_Birth_weight_x_UKB_21001_Body_mass_index_BMI
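
As with the heritability run, the genetic correlation estimate and its standard error are written to the log file named after --out; the end of the log should contain a small summary of the genetic correlation results.

cat output/rg_UKB_20022_Birth_weight_x_UKB_21001_Body_mass_index_BMI.log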

References

Bulik-Sullivan, Brendan, Hilary K Finucane, Verneri Anttila, Alexander Gusev, Felix R Day, Po-Ru Loh, Laramie Duncan, et al. 2015. “An Atlas of Genetic Correlations Across Human Diseases and Traits.” Nature Genetics 47 (11). Nature Publishing Group: 1236.

Bulik-Sullivan, Brendan K, Po-Ru Loh, Hilary K Finucane, Stephan Ripke, Jian Yang, Nick Patterson, Mark J Daly, et al. 2015. “LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies.” Nature Genetics 47 (3). Nature Publishing Group: 291.


sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3      rprojroot_1.3-2 digest_0.6.23   later_1.0.0    
 [5] R6_2.4.1        backports_1.1.5 git2r_0.26.1    magrittr_1.5   
 [9] evaluate_0.14   stringi_1.4.5   rlang_0.4.3     fs_1.3.1       
[13] promises_1.1.0  whisker_0.4     rmarkdown_2.1   tools_3.6.2    
[17] stringr_1.4.0   glue_1.3.1      httpuv_1.5.2    xfun_0.12      
[21] yaml_2.2.0      compiler_3.6.2  htmltools_0.4.0 knitr_1.27