In this post, we run BrainXcan with an example.
First of all, please make sure you’ve cloned Brainxcan code repository and installed all the BrainXcan software dependencies.
To perform BrainXcan analysis, we need to prepare the following files:
The BrainXcan database and meta files can be downloaded from brainxcan database zenodo link. The directory structure is
brainxcan_data/
|-- bxcan_vis
|-- geno_covar
|-- idp_gwas
|-- idp_weights
`-- mr
There is no need to change the structure and it is designed to be used asis.
See some detailed notes on the BrainXcan database at “Looking into BrainXcan database.”
In this example, we provide a pre-formatted GWAS file at here: preview link and direct download link, which is adapted from a schizophrenia GWAS by Ripke et al. (2020) (medRxiv link).
First, let’s download and take a look at it.
# download
wget https://uchicago.box.com/shared/static/s2yg1gvxm5vv1q3yijf5h389o3hkadwv.gz \
-O formatted.SCZ_PGC_2020.tsv.gz
# look at the first few rows
zcat < formatted.SCZ_PGC_2020.tsv.gz | head
# returns
variant_id chromosome non_effect_allele effect_allele effect_size standard_error zscore sample_size frequency
rs3131972 chr1 G A -0.0020020026706730793 0.0111 -0.18439957 71533 0.1652
rs3131969 chr1 G A -0.006601743634562364 0.0118 -0.56070304 71533 0.1339
rs3131967 chr1 C T -0.005796768847177342 0.0119 -0.48651794 71533 0.1171
rs1048488 chr1 T C 0.002696361547742533 0.0117 0.2303755 71533 0.1667
rs12562034 chr1 G A -0.003902375817241093 0.0156 -0.24843223 66899 0.1027
rs12124819 chr1 A G -0.011800104115750618 0.0181 -0.64967835 58941 0.2243
rs4040617 chr1 A G -0.0010993954433010642 0.0127 -0.09036144 71843 0.1295
rs4970383 chr1 C A 0.02810116511090993 0.0111 2.5433347 64548 0.1964
rs4475691 chr1 C T 0.012102946067745635 0.0121 0.9998151 65154 0.1429
TLDR: The configuration file is in YAML format and the one for example run can be downloaded from here. And there are a few fields that you need to fill in.
If you are not in a hurry, we recommend going through the following sections which explain the configuration file in details.
The configuration file contains three main parts:
We will go over them one by one.
datadir
: Path to the BrainXcan database folder.outdir
: Output directory name (if it does not exist, it will be generated).prefix
: Prefix of the output files (to distinguish between results in the same output folder).brainxcan_path
: Path to the BrainXcan code repository.In the configuration YAML, we have:
datadir: 'path-to-braixcan-database/brainxcan_data'
outdir: 'brainxcan_test'
prefix: 'SCZ_PGC_2020'
brainxcan_path: 'path-to-brainxcan-repository/brainxcan'
Here we want to tell the software how to understand the GWAS file.
gwas
: Path to GWAS file (formats: csv(.gz)
, tsv(.gz)
, parquet
).snpid
: Column name of GWAS SNP rsIDeffect_allele
: Column name of GWAS effect allelenon_effect_allele
: Column name of GWAS non-effect allelechr
: Column name of chromosomeposition
: Column name of positionAlong with Group 1 or 2 (see details at here):
effect_size
: Column name of GWAS effect sizeeffect_size_se
: Column name of GWAS effect size standard errorWhen both groups are specified, group 1 has high priority. For the example run, we go with group 1.
In the configuration YAML, we have:
gwas: 'formatted.SCZ_PGC_2020.tsv.gz'
snpid: 'variant_id'
effect_allele: 'effect_allele'
non_effect_allele: 'non_effect_allele'
chr: 'chromosome'
effect_size: 'effect_size'
effect_size_se: 'standard_error'
position: 'position'
There are several parameters underlying a BrainXcan run. They are associated with three aspects:
model_type
: IDP prediction model type: ridge
or elastic_net
(Default = ridge
)idp_type
: IDP sets to use: original
or residual
(after PC adjustment) (Default = residual
)spearman_cutoff
: CV Spearman cutoff on models (only models passing this criteria will be shown) (Default = 0.1
)bxcan_ldblock_perm
: If want to run this adjustment, set the path to the BED file (TAB-delimited base0 with header chr, start, stop)bxcan_ldblock_perm_seed
: Random seedbxcan_ldblock_perm_nrepeat
: Number of permutation across all IDPscorrection_factor_perm
: Correcting the standard error of the simulated z-scores by multiplying with this factor (default = 1.1)bxcan_empirical_z
: Whether to run this adjustment or notbxcan_empirical_z_seed
: Random seedbxcan_empirical_z_nrepeat
: Number of simulated IDPscorrection_factor_emp
: Correcting the standard error of the simulated z-scores by multiplying with this factorsignif_pval
: P-value cutoff to call an IDP significant (Default = 1e-5
)signif_max_idps
: Maximun number of IDPs to run MR (discard by p-value if too many IDPs passing p-value filter) (Default = 10
)ld_clump_yaml
: Parameters of LD clumping for defining the instrument SNPs in MR (Default = {datadir}/mr/ld_clump.yaml
)\(\dagger\)rscript_exe
: Path to the Rscript executable (Default = Rscript
)python_exe
: Path to the Python executable (Default = python
)plink_exe
: Path to the plink v1.9 executable (Default = plink
)\(\dagger\) {datadir}
will be interpreted as path-to-braixcan-database/brainxcan_data
internally since we set datadir: path-to-braixcan-database/brainxcan_data
.
In the configuration YAML, we have:
model_type: 'ridge'
idp_type: 'residual'
spearman_cutoff: 0.1
signif_pval: 0.01
signif_max_idps: 20
gwas_pop: 'EUR'
ld_clump_yaml: '{datadir}/mr/ld_clump.yaml'
rscript_exe: 'Rscript'
python_exe: 'python'
plink_exe: 'plink'
The full list of parameters available in a BrainXcan configuration YAML is explained (with default values) at here
Now we are ready to run BrainXcan pipeline. The general command is
# conda activate brainxcan
REPO=path-to-brainxcan-repository
CONFIG=path-to-config-file
snakemake \
-s $REPO/Snakefile \
--configfile $CONFIG \
[rule-name]
[rule-name]
can take the following four options:
rule.name | Association | Visualization | Mendelian.randomization | Automated.report |
---|---|---|---|---|
SBrainXcanOnly | x | |||
SBrainXcanAndVis | x | x | ||
SBrainXcanAndMR | x | x | ||
SBrainXcan | x | x | x | x |
Let’s say we want to run through everything, i.e. replacing [rule-name]
with SBrainXcan
. So, we can do:
snakemake \
-s $REPO/Snakefile \
--configfile $CONFIG \
-n -p \
SBrainXcan
We use -n
options in snakemake
to perform a dry-run to see what will be run (-p
is to print more information about each step). The MR runs won’t be printed since we’ve not known which IDPs will be significant but these will be updated automatically during the computation.
If everything looks good, we could start running the command (-j1
tells snakemake to use one core).
snakemake \
-s $REPO/Snakefile \
--configfile $CONFIG \
-j1 -p \
SBrainXcan
It takes about 9 minutes with 4 Gb memory. The output files are put in folder brainxcan_test/
with prefix SCZ_PGC_2020
.
The output files of the above run can be downloaded from here. Here is an overview of the files.
brainxcan_test
├── SCZ_PGC_2020.report.html
├── SCZ_PGC_2020.sbrainxcan.csv
├── SCZ_PGC_2020.vis.T1.pdf
├── SCZ_PGC_2020.vis.dMRI.pdf
└── SCZ_PGC_2020_files
├── MR_results
│ ├── dmri_IDP-25687.MR.rds
│ ├── dmri_IDP-25687.MR_sumstats.tsv
│ ├── dmri_IDP-25687.MR_vis.pdf
│ ├── t1_IDP-25020.MR.rds
│ ├── t1_IDP-25020.MR_sumstats.tsv
│ ├── t1_IDP-25020.MR_vis.pdf
| └── [more MR results]
└── logs
├── dmri_IDP-25687.mr.log
├── dmri_IDP-25687.mr_vis.log
├── sbrainxcan_dmri.log
├── sbrainxcan_t1.log
├── sbxcan_merge.log
├── sbxcan_report.log
├── sbxcan_signif.log
├── sbxcan_vis.log
├── t1_IDP-25020.mr.log
├── t1_IDP-25020.mr_vis.log
└── [more MR log files]
[prefix].sbrainxcan.csv
: BrainXcan association results in CSV format.[prefix].vis.T1.pdf
: Visualization of associations of structural IDPs.[prefix].vis.dMRI.pdf
: Visualization of associations of diffusion IDPs (only TBSS IDPs).[prefix]_files/MR_results
: Mendelian randomization for each significant IDPs
*.MR_vis.pdf
: Plotting effect sizes of instrument SNPs in MR.*.MR_sumstats.tsv
: Results of MR tests summarized in a table (TSV format).*.MR.rds
: Raw data used for MR test.[prefix].report.html
: The automated HTML report summarizing the association and MR results (for diffusion IDPs, only TBSS ones are shown).[prefix]_files/logs/
: Log files of each analysis module.Let’s take a quick look at brainxcan_test/SCZ_PGC_2020.sbrainxcan.csv
(first two rows). The descriptions of each columns are at the bottom of the table.
IDP | bhat | pval | z_brainxcan | nsnp_used | nsnp_total | CV_R2 |
---|---|---|---|---|---|---|
IDP-25020 | 0.395 | 5.79e-17 | 8.37 | 1069786 | 1071343 | 0.0129 |
IDP-25687 | 0.353 | 2.1e-14 | 7.64 | 1069786 | 1071343 | 0.0128 |
IDP ID code (UKB field ID) | Estimated effect size of the association test | P-value of the association test (without adjustment) | Z-score of the association test (without adjustment) | Number of SNPs used in the calculation | Total number of SNPs in the prediction model | R2 of the prediction model (cross-validated) |
CV_Pearson | CV_Spearman | modality | z_adj_gc | pval_adj_gc | z_adj_perm_null |
---|---|---|---|---|---|
0.114 | 0.109 | T1 | 3.13 | 0.00176 | 4.45 |
0.113 | 0.11 | dMRI | 2.86 | 0.00427 | 4.07 |
Pearson correlation of the prediction model (cross-validated) | Spearman correlation of the prediction model (cross-validated) | T1 (structural) or dMRI (diffusion) | Z-score adjusted by GC | P-value adjust by GC | Z-score adjusted by permutation |
pval_adj_perm_null | subtype | left_or_right | region | notes |
---|---|---|---|---|
8.53e-06 | Subcortical | right | hippocampus | Volume of right hippocampus (from T1 brain image) |
4.79e-05 | w-OD | NA | forceps major | Weighted-mean OD (orientation dispersion index) in tract forceps major (from dMRI data) |
P-value adjusted by permutation | IDP subtype | Left/right side of the brain | Anatomical region | Notes of the IDP from UKB |
Let’s take a quick look at brainxcan_test/SCZ_PGC_2020_files/MR_results/dmri_IDP-25687.MR_sumstats.tsv
(first two rows). The descriptions of each columns are at the bottom of the table.
direction | method | nsnp | b | pval | Q | Q_df | Q_pval |
---|---|---|---|---|---|---|---|
IDP -> Phenotype | MR Egger | 33 | -0.591 | 0.0595 | 170 | 31 | 3.98e-21 |
IDP -> Phenotype | Weighted median | 33 | 0.0394 | 0.476 | NA | NA | NA |
MR direction | MR method | Number of SNPs used in MR | Slope in MR | P-value in MR | Heterogenity test Q | Heterogenity test degree of freedom | Heterogenity test p-value |
We provide interactive HTMLs showing regions in the visualization:
These HTMLs can be generated along the BrainXcan visualization by setting bxcan_region_vis=True
in the configuration (default: False). The output files will be put inside the folder called [prefix].vis.region_vis/
.
snakemake
The BrainXcan software is essentially a data pipeline containing multiple steps and intermediate files. It is implemented using snakemake
Mölder et al. (2021) and the features provided by snakemake
are directly applicable here.
For instance, the parameters in the configuration file can be replaced by --config
, e.g. if we want to change the prefix of the output, we can do it without changing the config file:
snakemake \
-s $REPO/Snakefile \
--configfile $CONFIG \
-j1 -p \
SBrainXcan \
--config prefix=another_test
More over, if the intermediate files are already present, snakemake will not re-generate them. In other word, if you run SBrainXcanOnly
first and then run SBrainXcanAndVis
, it won’t run BrainXcan association again. If you want to enforce re-runing, use the flag --forceall
.
BrainXcan software is written with snakemake v6. For more cool features of snakemake, please refer to snakemake documentation.