In order to perform MR we need to combine our exposure and outcome datasets. Here we will use the TwoSampleMR procedure to harmonize GWAS summary stats.
The exposure GWAS SumStats have previously been standardized to common format, however, they need to be converted to the format required by TwoSampleMR.
Import exposure GWAS SumStats
exposure_path ="resources/Willer2013tc.chrall.CPRA_b37.tsv.gz"exposure_ss <-read_tsv(exposure_path, comment ="##", col_types = coltypes, col_select =c(DBSNP_ID, CHROM, POS, REF, ALT, AF, BETA, SE, Z, P, N, TRAIT))# Format data to TwoSampleMR formatexposure <- exposure_ss %>%format_data(.,type ="exposure",snps =NULL,header =TRUE,phenotype_col ="TRAIT",snp_col ="DBSNP_ID",beta_col ="BETA",se_col ="SE",eaf_col ="AF",effect_allele_col ="ALT",other_allele_col ="REF",pval_col ="P",samplesize_col ="N",z_col ="Z",chr_col ="CHROM",pos_col ="POS",log_pval =FALSE) %>%as_tibble()
Two-Sample MR methods usually utilize only independent genome-wide significant SNPs. The process of extracting these SNPs from the full genome-wide association study (GWAS) summary statistics involves linkage disequilibrium (LD) clumping. TwoSampleMR provides a clumping function, TwoSampleMR::clump_data(), which performs a stringent clumping procedure with a default window size of 10Mb and an r2 threshold of 0.001. Setting clump_p1 = 1 and clump_p2 = 1 will perform clumping on all available SNPs. The GWAS summary statistics used in this analysis are from a population of European ancestry, and therefore the EUR reference panel is specified.
LD clump exposure
# Perform LD clumping on SNP data, filter SNPs to make it run fasterexposure_clump <- exposure %>%filter(pval.exposure <0.01) %>%clump_data(.,clump_kb =10000,clump_r2 =0.001,clump_p1 =1,clump_p2 =1,pop ="EUR")# Should filter here at GWS, I dont here to demonstrate proxy snps# exposure_dat <- filter(exposure_clump, pval.exposure < 5e-8)exposure_dat <- exposure_clump
Note
Typically, in a Two-Sample MR analysis, one would only consider genome-wide significant SNPs (SNPs with pval.exposure < 5e-8). However, in this demonstration, all independent SNPs with pval.exposure < 0.01 are retained to showcase how to identify proxy variants, as all of the genome-wide significant SNPs are also present in the outcome GWAS.
Outcome datasets
Similarly, the outcome dataset needs to be converted to the TwoSampleMR format.
Proxy variants are genetic markers that are in strong linkage disequilibrium (LD) with the variants of interest, and can be used as a proxy to estimate the effect of the variants of interest on the outcome of interest. In some cases, variants in the exposure dataset may not be present in the outcome dataset due to different GWAS using different genotyping arrays, impuation methods, and QC practices. To overcome this, proxy variants in the outcome dataset can be substituted for the missing variant.
Here we first extract our exposure SNPs from the outcome GWAS and determine which SNPs are not presenting in the outcome dataset. We then query the LDLink web server to identify proxy variants that are in strong LD (r2) with the missing variant. The resulting proxy SNPs are then extracted from the outcome dataset and appended to the initial SNPs extracted from the outcome GWAS using munge_proxies() function.
Identify Proxy Variants
# extract exposure SNPs present in outcomeoutcome_clump <-semi_join( outcome, exposure_dat, by ="SNP")# Exposure SNPs not present in outomceexp_snps_wo <-anti_join( exposure_dat, outcome, by ="SNP")# Use LDLinkR to identify proxy snpsLDproxy_batch(exp_snps_wo$SNP, pop ="CEU", # Match population ancestriesr2d ="r2", token ='a6deee62cc4a', append =TRUE, # We appended the results of each LDlink query to a single filegenome_build ="grch37") # Select genome build based on summary statssystem("mv combined_query_snp_list_grch37.txt data/exposure_outcome_proxy_snps.txt")# Munge proxy snp fileoutcome_dat <-munge_proxies("data/exposure_outcome_proxy_snps.txt", outcome, outcome_clump)
Harmonize Exposure - Outcome Datasets
Finally, the effect of a SNP on an outcome and exposure must be harmonised to be relative to the same allele. We also specify to infer the positive strand alleles, using allele frequencies for palindromes rather then removing palindromic variants.
The variable mr_keep is used to determine the SNPs to be included in the MR analysis. It is recommended to exclude variants with pval.outcome < 5e-8 as they violate the exclusion restriction assumption. For studies involving Alzheimer’s disease, variants in the APOE region (GRCh37 - CHR: 19, BP: 44912079-45912079) should also be removed due to its large effect size and pleiotropic nature. However, in this tutorial, this step is not performed to demonstrate how causal estimates can be influenced by horizontal pleiotropy.
And now we have a final harmonized dataset that can be used in our downstream MR analyses