1 About

In this post, we will look into the BrainXcan database and learn about how to load the following data into R:

Prediction models
Genotype covariance
IDP GWASs

These data loading rely on python packages pandas and scipy.sparse which has been installed as part of brainxcan conda environment. To use them in R, we can use reticulate.

The naming convention of IDP is IDP-[UKB_Field_ID] and we call it IDP ID. The information of the IDPs can be found at here.

# load reticulate since we want to use python package
library(reticulate)

## Warning: package 'reticulate' was built under R version 3.6.2

# use brainxcan conda env
use_condaenv('brainxcan')
# load python packages
pd = import('pandas')
sp = import('scipy.sparse')

2 Prediction models

We share prediction models of original IDPs and residual IDPs trained from ridge and elastic net. The residual IDPs are obtained by regressing out the first principal component of IDP matrix (by subtype) from the original IDP.

For each set of models, strutural IDPs and diffusion IDPs are saved separately and there are two types of files:

*.parquet
*.perf.tsv.gz

The parquet file contains the prediction model weights where SNPs are in rows and IDPs are in columns The prediction performance (obtained from training pipeline) of the models is in the tsv.gz file. If setting the number of rows in tsv.gz file as n, then the number of columns in the parquet is n + 4 with the four additional columns indicating: rsID (snpid), non effect allele (a0), effect allele (a1), and chromosome (chr).

The naming rule is:

path-to-brainxcan_data/idp_weights/[model_type]/[idp_type].[modality].*
- model_type: ridge or elastic_net
- idp_type: residual or original
- modality: dmri or t1

Let’s take a look at them.

# set up the file path
model_type = 'ridge'
idp_type = 'residual'
modality = 'dmri'
parquet_file = paste0('idp_weights/', model_type, '/', idp_type, '.', modality, '.parquet')
perf_file = paste0('idp_weights/', model_type, '/', idp_type, '.', modality, '.perf.tsv.gz')

model_type = ridge

idp_type = residual

modality = dmri

Reading parquet file: path-to-brainxcan_data/idp_weights/ridge/residual.dmri.parquet

Reading performance file: path-to-brainxcan_data/idp_weights/ridge/residual.dmri.perf.tsv.gz

# load the parquet file
df_weights = pd$read_parquet(paste0(datadir, '/', parquet_file))
# load the performance file
df_perf = read.table(paste0(datadir, '/', perf_file), stringsAsFactors = F, header = T) 

# look at the first 6 columns and rows
head(df_weights[, 1: 6]) %>% pander

snpid	a0	a1	chr	IDP-25650	IDP-25651
rs3131972	A	G	1	-3.915e-05	6.14e-05
rs3131969	A	G	1	-7.533e-05	4.303e-05
rs3131967	T	C	1	-7.558e-05	4.256e-05
rs1048488	C	T	1	-3.991e-05	5.655e-05
rs12562034	A	G	1	-7.253e-07	-7.05e-05
rs12124819	G	A	1	-3.426e-06	1.991e-05

head(df_perf) %>% pander

R2	Pearson	Spearman	phenotype
0.01224	0.1107	0.1094	IDP-25650
0.01266	0.1125	0.1102	IDP-25651
0.02177	0.1476	0.1409	IDP-25652
0.01806	0.1346	0.1313	IDP-25653
0.01723	0.1325	0.1285	IDP-25654
0.02	0.143	0.1363	IDP-25655

Number of rows in weights table = 1071343

Number of columns in weights table = 312

Number of rows in performance table = 308

The parquet file support reading by column. So, if we only want to load the weights of a specific IDP, we can do the following:

idp_id = 'IDP-25652'
df_weights_sub = pd$read_parquet(
  paste0(datadir, '/', parquet_file), 
  columns = c('snpid', 'a0', 'a1', 'chr', idp_id)
) 
df_weights_sub %>% head %>% pander

snpid	a0	a1	chr	IDP-25652
rs3131972	A	G	1	2.21e-06
rs3131969	A	G	1	1.576e-05
rs3131967	T	C	1	1.627e-05
rs1048488	C	T	1	4.518e-06
rs12562034	A	G	1	-0.0001214
rs12124819	G	A	1	-4.941e-05

3 Genotype covariance

Genotype covariance was computed using the UKB IDP cohort (the same as prediction model training). The genotype covariance is saved by chromosome. It is a banded matrix with the band width = 200 and only the upper triangular values are saved since it is symmetric. These genotype covariance files are at brainxcan_data/geno_covar/ and for each chromosome, there are two files:

chr*.banded.npz: The banded genotype covariance in scipy sparse matrix format.
chr*.banded.snp_meta.parquet: The meta data annotating the rows/columns of the genotype covariance file in parquet format. The number of rows of the meta data matches the genotype covariance file and the SNPs are in the same order.

# load banded genotype covariance 
mat = sp$load_npz(paste0(datadir, '/geno_covar/chr22.banded.npz'))

# take the first 500 SNPs and visualize
mat_sub = as.matrix(mat[ 1 : 500, 1 : 500])
var_sub = diag(mat_sub)
cor_sub = sweep(mat_sub, 1, FUN = '/', sqrt(var_sub))
cor_sub = sweep(cor_sub, 2, FUN = '/', sqrt(var_sub))
par(mfrow = c(1, 2))
image(mat_sub, main = 'Covariance between SNPs')
image(cor_sub, main = 'Correlation between SNPs')

class(mat) = dgTMatrix

To annotate the genotype covariance with the actual SNP information, we should load the meta data.

df_snp = pd$read_parquet(paste0(datadir, '/geno_covar/chr22.banded.snp_meta.parquet'))
df_snp %>% head %>% pander

snpid	effect_allele	non_effect_allele	chr
rs7287144	G	A	22
rs5748662	A	G	22
rs5747620	C	T	22
rs9605903	C	T	22
rs5746647	G	T	22
rs16980739	T	C	22

Number of rows in meta data = 15747

Number of rows/columns in genotype covariance = 15747, 15747

4 IDP GWAS

We share the IDP GWAS results for all the IDPs being analyzed and the study cohort is the same as IDP model training (n = 24,490).

The GWAS results are by chromosome and by IDP and the naming rule is as follow:

path-to-brainxcan_data/idp_gwas/[idp_type].[modality].chr[chr_num]/[idp_id]].parquet
- idp_type: residual or original
- modality: dmri or t1
- chr_num: 1 to 22
- idp_id: IDP ID

The SNPs are in rows and columns are:

variant_id: rsID of the SNP
phenotype_id: IDP ID
pval: P-value of the association
b and b_se: Estimated effect size and standard error
maf: Minor allele frequency of the SNP in the GWAS cohort

Further annotation of the SNPs can be found at path-to-brainxcan_data/idp_gwas/snp_bim/chr[chr_num].bim. This is essentially PLINK BIM file and the columns are: 1. Chromosome, 2. rsID, 3. placeholder, 4. Genomic position, 5. Effect allele, 6. Non-effect allele. The GWAS direction is the same as the one in BIM file.

For instance, let’s load one GWAS file.

# set up the file path
idp_type = 'residual'
modality = 't1'
chr_num = 22
idp_id = 'IDP-25002' 
gwas_file = paste0('idp_gwas/', idp_type, '.', modality, '.chr', chr_num, '/', idp_id, '.parquet')
bim_file = paste0('idp_gwas/', 'snp_bim/', 'chr', chr_num, '.bim')

idp_type = residual

modality = t1

chr_num = 22

idp_id = IDP-25002

Reading parquet file: path-to-brainxcan_data/idp_gwas/residual.t1.chr22/IDP-25002.parquet

Reading BIM file: path-to-brainxcan_data/idp_gwas/snp_bim/chr22.bim

df_gwas = pd$read_parquet(paste0(datadir, '/', gwas_file))
df_bim = read.table(
  paste0(datadir, '/', bim_file), 
  col.names = c('chromosome', 'variant_id', 'ph', 'position', 'effect_allele', 'non_effect_allele'), 
  stringsAsFactors = F
) %>% select(-ph)  # remove placeholder column

# annotate GWAS results with additional SNP information
df_gwas = right_join(df_bim, df_gwas, by = 'variant_id')

df_gwas %>% mutate(position = as.character(position)) %>% head %>% pander

chromosome	variant_id	position	effect_allele	non_effect_allele	phenotype_id	pval	b	b_se	maf
22	rs4911642	16504399	C	T	IDP-25002	0.3064	-0.02878	0.02814	0.08711
22	rs7287144	16886873	G	A	IDP-25002	0.7055	0.003934	0.01041	0.2708
22	rs5748662	16892858	A	G	IDP-25002	0.6634	0.00454	0.01043	0.2699
22	rs2027653	16918335	T	C	IDP-25002	0.573	-0.01169	0.02074	0.425
22	rs9680776	17000277	T	C	IDP-25002	0.8555	0.003334	0.0183	0.1795
22	rs5747620	17032698	C	T	IDP-25002	0.2179	0.01388	0.01127	0.4

Looking into BrainXcan database

1 About

2 Prediction models

3 Genotype covariance

4 IDP GWAS