Summary level BLUP (SBLUP)

The SBLUP model estimates marker effects using summary data from a GWAS or meta-analysis together with an LD matrix derived from a reference panel with individual-level data (Robinson et al. 2017). The summary data should be prepared in COJO format, as described here. HIBLUP can run the SBLUP model in two ways, described in the sections below.
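As a quick sanity check before running the analysis, the summary file can be parsed and validated. The sketch below assumes the usual eight-column COJO layout from GCTA (SNP, effect allele, other allele, frequency of the effect allele, effect size, standard error, p-value, sample size). The function name `read_cojo` and the numbers in the example are made up for illustration and are not part of HIBLUP.

```python
import io

# Column order assumed from the GCTA-COJO ".ma" layout.
COJO_COLUMNS = ["SNP", "A1", "A2", "freq", "b", "se", "p", "N"]

def read_cojo(handle):
    """Parse a whitespace-delimited COJO file into a list of dicts."""
    header = handle.readline().split()
    if len(header) != len(COJO_COLUMNS):
        raise ValueError(f"expected {len(COJO_COLUMNS)} columns, got {len(header)}")
    records = []
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COJO_COLUMNS):
            raise ValueError(f"line {line_no}: expected {len(COJO_COLUMNS)} fields")
        freq, b, se, p, n = (float(x) for x in fields[3:])
        if not 0.0 <= freq <= 1.0:
            raise ValueError(f"line {line_no}: allele frequency out of range")
        records.append({"SNP": fields[0], "A1": fields[1], "A2": fields[2],
                        "freq": freq, "b": b, "se": se, "p": p, "N": n})
    return records

# Made-up example content, for illustration only.
example = io.StringIO(
    "SNP A1 A2 freq b se p N\n"
    "M1 A G 0.1285 0.0106 0.0071 0.1355 1000\n"
    "M2 T C 0.1285 -0.0044 0.0038 0.2470 1000\n"
)
records = read_cojo(example)
print(len(records), "SNPs parsed")
```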

1. Run SBLUP using genotype data

The first way to run SBLUP model is to use the genotype data provided directly:

# --sumstat: the summary data
./hiblup --sblup \
         --sumstat demo.ma \
         --bfile demo \
         --window-bp 1e6 \
         --h2 0.3234 \
         --threads 16 \
         --out demo

The option --h2 specifies the heritability of the trait under analysis, which can be estimated by REML if individual-level data are available, or by LD score regression using the summary data.
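To see how the heritability enters the model, it can help to look at the shrinkage parameter used in the SBLUP formulation documented for GCTA, lambda = m * (1 / h2 - 1), where m is the number of SNPs. Whether HIBLUP derives its internal parameter from --h2 in exactly this way is an assumption here, and `sblup_lambda` is a hypothetical helper, not a HIBLUP function.

```python
def sblup_lambda(num_snps, h2):
    """Shrinkage parameter lambda = m * (1 / h2 - 1) (GCTA-style convention).

    Assumption: HIBLUP may compute this differently from --h2.
    """
    if not 0.0 < h2 < 1.0:
        raise ValueError("h2 must lie strictly between 0 and 1")
    return num_snps * (1.0 / h2 - 1.0)

# A lower heritability gives a larger lambda, i.e. stronger shrinkage
# of the estimated SNP effects toward zero.
lam_demo = sblup_lambda(num_snps=10_000, h2=0.3234)
lam_low = sblup_lambda(num_snps=10_000, h2=0.1)
print(lam_demo < lam_low)
```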

There are several options to set the window size:

  • --window-bp: specifies the size of the non-overlapping windows in base pairs (default 1 Mb, i.e., --window-bp 1000000); the number of SNPs per window is not fixed;
  • --window-num: specifies a fixed number of SNPs per window (e.g., --window-num 500); the physical size of the windows is not constant in this case;
  • --window-geno: treats all SNPs across the entire genome as a single window; note that this takes a long time and a large amount of memory when the number of SNPs is large;
  • --window-file: specifies a text file in which the windows are pre-defined by the user; see the file format here.
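The difference between the first two window definitions can be sketched as follows. This is only an illustration of the idea: the function names are hypothetical, and the assumption that base-pair windows start at position 0 of a chromosome may not match how HIBLUP places its window boundaries.

```python
def windows_by_bp(positions, window_bp=1_000_000):
    """Group SNP indices on one chromosome into non-overlapping windows
    of fixed physical length; the SNP count per window varies."""
    windows = {}
    for idx, pos in enumerate(positions):
        windows.setdefault(pos // window_bp, []).append(idx)
    return [windows[key] for key in sorted(windows)]

def windows_by_num(num_snps, window_num=500):
    """Group SNP indices into windows with a fixed SNP count;
    the physical length per window varies."""
    return [list(range(start, min(start + window_num, num_snps)))
            for start in range(0, num_snps, window_num)]

# Made-up base-pair positions of five SNPs on one chromosome.
positions = [100_000, 900_000, 1_200_000, 1_300_000, 3_500_000]
bp_windows = windows_by_bp(positions)        # [[0, 1], [2, 3], [4]]
num_windows = windows_by_num(len(positions), window_num=2)  # [[0, 1], [2, 3], [4]]
print(bp_windows, num_windows)
```

With a fixed base-pair size, empty stretches of the genome simply produce no window, while a fixed SNP count can join distant SNPs into the same window.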

If the number of SNPs in a window is large (e.g., over 10k), adding the flag --pcg is recommended to speed up the computation of SNP effects.
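Assuming --pcg refers to the preconditioned conjugate gradient method, the gain comes from solving the per-window linear system iteratively instead of inverting a large matrix. The sketch below solves a toy system of the form (R + lambda * I) x = rhs with a Jacobi (diagonal) preconditioner. The matrix, the right-hand side, and the form of the system are assumptions for illustration, not HIBLUP's actual implementation.

```python
def pcg_solve(matvec, rhs, diag, tol=1e-10, max_iter=1000):
    """Solve A x = rhs for a symmetric positive-definite A using
    conjugate gradient with a Jacobi (diagonal) preconditioner."""
    x = [0.0] * len(rhs)
    r = list(rhs)                                   # residual rhs - A x, with x = 0
    z = [ri / di for ri, di in zip(r, diag)]        # preconditioned residual
    p = list(z)                                     # search direction
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        if sum(ri * ri for ri in r) ** 0.5 < tol:   # converged
            break
        ap = matvec(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Toy LD correlation matrix and shrinkage value (made-up numbers).
R = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
lam = 2.0
A = [[R[i][j] + (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
rhs = [0.1, -0.2, 0.05]

def matvec(v):
    """Multiply A by the vector v."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

solution = pcg_solve(matvec, rhs, diag=[A[i][i] for i in range(3)])
residual = max(abs(a - b) for a, b in zip(matvec(solution), rhs))
print(residual < 1e-8)
```

Only matrix-vector products are needed, so the cost per iteration grows with the square of the window size rather than the cube required by a direct inverse.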

2. Run SBLUP using a pre-computed LD correlation matrix

Instead of genotype data, a pre-computed LD correlation matrix can be used to fit the SBLUP model, which is more straightforward. Although this strategy is more computationally efficient and uses less memory than the first one, all SNPs should satisfy Hardy-Weinberg equilibrium. Otherwise, the estimated SNP effects will be biased, resulting in poor prediction performance.

# --sumstat: the summary data
# --ldm: the pre-computed LD correlation matrix
./hiblup --sblup \
         --sumstat demo.ma \
         --ldm demo_ldm \
         --h2 0.3234 \
         --threads 10 \
         --out demo

The option --ldm specifies the LD correlation matrix, which can be computed by HIBLUP from individual-level genotype data (see more details here). As above, if the number of SNPs in a window is large (e.g., over 10k), adding the flag --pcg is recommended to speed up the computation of SNP effects.
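Because the LD-matrix route requires SNPs to be in Hardy-Weinberg equilibrium, it may be worth screening the reference genotypes beforehand. The following is a minimal sketch of a one-degree-of-freedom chi-square test computed from genotype counts. The function name is hypothetical, and dedicated tools typically offer exact tests that behave better for rare alleles.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium,
    given the counts of the three genotypes at one biallelic SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)       # frequency of allele A
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        return 1.0                        # monomorphic SNP: nothing to test
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

p_in_hwe = hwe_chi2_pvalue(25, 50, 25)    # counts match HWE proportions
p_no_het = hwe_chi2_pvalue(50, 0, 50)     # no heterozygotes at all
print(p_in_hwe, p_no_het < 1e-6)
```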

After a successful run, a file named “demo.snpeff” is generated in the working directory, with content like the following:

id a1 a2 freq_a1 demo
M1 A G 0.1285 -0.000963937
M2 T C 0.1285 -0.00108931
M3 A G 0.1062 0.00588629
M4 G A 0.1285 -0.00164344
M5 A C 0.2459 -0.00100206

As shown above, the estimated SNP effects are listed in the last column. To obtain the predicted genomic estimated breeding values (GEBV) or genomic polygenic risk scores (GPRS) of individuals, we recommend using HIBLUP (see here), since in our tests it was several times faster than the --score function in PLINK.
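For readers who want to see what such a score amounts to, the sketch below reads the SNP effects listed above and sums, per individual, the count of the a1 allele multiplied by the estimated effect. It assumes the effects are expressed per copy of a1 and ignores any centering or scaling that HIBLUP or PLINK may apply. The individual's allele counts are made up, and the function names are hypothetical.

```python
import io

def read_snp_effects(handle):
    """Read a .snpeff-style table: id, a1, a2, freq_a1, effect (last column)."""
    handle.readline()                     # skip the header line
    effects = {}
    for line in handle:
        fields = line.split()
        if fields:
            effects[fields[0]] = (fields[1], float(fields[-1]))
    return effects

def polygenic_score(dosages, effects):
    """Sum of (copies of a1) * (effect of a1) over SNPs present in both inputs."""
    return sum(dosages[snp] * effect
               for snp, (_a1, effect) in effects.items()
               if snp in dosages)

snpeff = io.StringIO(
    "id a1 a2 freq_a1 demo\n"
    "M1 A G 0.1285 -0.000963937\n"
    "M2 T C 0.1285 -0.00108931\n"
    "M3 A G 0.1062 0.00588629\n"
    "M4 G A 0.1285 -0.00164344\n"
    "M5 A C 0.2459 -0.00100206\n"
)
effects = read_snp_effects(snpeff)

# Made-up a1 allele counts (0, 1 or 2) for one hypothetical individual.
dosages = {"M1": 1, "M2": 0, "M3": 2, "M4": 1, "M5": 0}
score = polygenic_score(dosages, effects)
print(score)
```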