Genome-wide association studies in R

maxresdefault
’10 High Cholesterol Foods You Must Avoid’ [source]

This time I elaborate on a much more specific subject that will mostly concern biologists and geneticists. I will try my best to outline the approach as to ensure non-experts will still have a basic understanding. This tutorial illustrates the power of genome-wide association (GWA) studies by mapping the genetic determinants of cholesterol levels using three Southeast Asian populations.

Historical background

Early since Charles Darwin formulated the theories of natural and sexual selection in the late 1800s, the underlying role of genes, each represented by different alleles (i.e. variants) in different individuals was yet to be elucidated. His younger contemporary fellow Gregor Mendel scratched the surface after identifying the mechanism of trait inheritance, genetic segregation. In the years that followed, the genetic basis of phenotypes (i.e. observable traits) was gradually unraveled by classic geneticists pioneered by Ronald Fisher, who introduced key concepts such as genetic variance. The then emerging concept of genotype (i.e. genetic make-up) required the development of polymorphic genetic markers. In the early days, genotyping was based on determining the allelic composition of loci (loose definition of chromosomal regions), and later of copy number variations (CNVs), short-tandem repeats (STRs) and single nucleotide polymorphisms (SNPs). Humans and many other organisms including plants are diploid (i.e. carry two copies of each chromosome), which implies bi-allelic markers with alleles e.g. A and a distinguish individuals into AA (homozygous dominant), Aa (heterozygous) and aa (homozygous recessive).

The early discovery of the genetic basis for disorders such as sickle cell anemia and haemophilia owes much to their relatively simple genetic architectures – few mutations with high penetrance, thus more easily identified. Understandably then, more complex polygenic diseases that have been around for a long time, such as the neurodegenerative Alzheimer and Parkinson, are still currently under investigation. To characterize such traits, products of the interplay of many genes with small effects, it takes large genetic resources as well as flexible statistical methods, which is no problem in the current era of omics data and high-performance computing.

Genome-wide association studies

Genome-wide association (GWA) studies scan an entire species genome for association between up to millions of SNPs and a given trait of interest. Notably, the trait of interest can be virtually any sort of phenotype ascribed to the population, be it qualitative (e.g. disease status) or quantitative (e.g. height). Essentially, given p SNPs and n samples or individuals, a GWA analysis will fit p independent univariate linear models, each based on n samples, using the genotype of each SNP as predictor of the trait of interest. The significance of association (P-value) in each of the p tests is determined from the coefficient estimate \beta of the corresponding SNP (technically speaking, the significance of association is P(\beta|H_0: \beta=0) ). Note that because these tests are independent and quite numerous, there is a great computational advantage in setting up a parallelized GWA analysis (as we will do shortly). Quite reasonably, it is necessary to adjust the resulting P-values using multiple hypothesis testing methods such as Bonferroni, Benjamini-Hochberg or false discovery rate (FDR). GWA studies are now commonplace in genetics of many different species.

Association mapping vs. linkage mapping

Too often, people cannot tell the difference between association and linkage mapping, or quantitative trait loci (QTL) mapping. Albeit conceptually similar, their are actually opposite in their workings. One of the key differences between the two is that association mapping relies on high-density SNP genotyping of unrelated individuals, whereas linkage mapping relies on the segregation of substantially fewer markers in controlled breeding experiments – unsurprisingly QTL mapping is seldom conducted in humans. Importantly, association mapping gives you point associations in the genome, whereas linkage mapping gives you QTL, chromosomal regions.

The present tutorial covers fundamental aspects to consider when conducting GWA analysis, from the pre-processing of genotype and phenotype data to the interpretation of results. We will use a mixed population of 316 Chinese, Indian and Malay that was recently characterized using high-throughput SNP-chip sequencing, transcriptomics and lipidomics (Saw et al., 2017). More specifically, we will search for associations between the >2.5 million SNP markers and cholesterol levels. Finally, we will explore the vicinity of candidate SNPs using the USCS Genome Browser in order to gain functional insights. The methodology shown here is largely based on the tutorial outlined in Reed et al., 2015. The R scripts and all data can be found in my repository, but you can also download the omics data from here. Please follow the instructions in the repo.

Let’s get started with R

Read data

Let’s first import the PLINK-converted .bed, .fam and .bim Illumina files from each of the three ethnic groups. We will use the function read.plink from the package snpStats and work on the resulting objects throughout the rest of the tutorial. This function reads .bed, .fam and .bim and creates a list of three elements – $genotypes, $fam and $map. The first contains all SNPs determined from all samples, the second contains information about pedigree and sex, and the third contains the genomic coordinates of the SNPs. At this point we have a total of 323 individuals (110 Chinese, 105 Indian and 108 Malay) and 2,527,458 SNPs. Next, we will change the Illumina SNP IDs stored in $genotype to the more conventional rs IDs, which will allow us to zoom in into the genomic regions that surround the candidate SNPs in the USCS Genome Browser. I prepared a table that establishes the correspondence between the two, so that we can easily switch the IDs.

Next we import and merge the three lipid data sets (stored as .txt) and determine which samples are present in both SNP and lipid data sets. In the following analyses we will use the subset of samples profiled in both platforms, a total of 319. Finally, we create a list genData that stores the merged SNP data ($SNP), one of the three $map since they are all identical ($MAP) and the merged lipid data ($LIP). Finally, let’s save RAM for the subsequent steps and remove all files after saving genData into a .RData file.

Pre-processing

Let’s reload genData and clean it up. In a nutshell, the pre-processing of the data consists in

  • discarding SNPs with call rate < 1  or  MAF < 0.1
  • discarding samples with call rate < 100%, IBD kinship coefficient > 0.1 or inbreeding coefficient |F| > 0.1

Call rate is the proportion of SNPs (or samples) that were genotyped. For example, a call rate of 0.95 for a particular SNP (sample) means 5% of the values are missing. Because we have so many SNPs, we can afford to have absolutely no missing values in the $SNP matrix by imposing a call rate threshold of 1 for both SNPs and samples. If you want to relax the threshold and tolerate missing values, bear in mind you need to run a whole procedure for imputing those. Reed et al., 2015 describe a PCA-based imputation method that utilizes the 1,000 Genome Project as a proxy, in case you are interested.

Minor-allele frequency (MAF) denotes the proportion of the least common allele for each SNP. Of course, it is harder to detect associations with rare variants and this is why we select against low MAF values. Most GWA studies I have read typically report MAF thresholds of 0.05. Here, I opt for a more stringent 0.1 because again, we have plenty of data and since all this is for illustrative purposes we want to conduct a fast GWA analysis.

Next, we need to consider samples that exhibit excessive heterozygosity – technically speaking, deviations from the Hardy-Weinberg equilibrium (HWE),

p^2 + 2pq + q^2 = 1

where p  and q  are the frequencies of two alleles A and a that sum up to one, and 2pq the frequency of heterozygous individuals, assuming bi-allelic SNPs. As such, we determine the inbreeding coefficient |F|,

|F| = 1 - \frac{H}{2pq}

where and 2pq are the observed and expected heterozygosity (i.e. proportion of heterozygous SNPs per sample), respectively. You will note that in practice, rather than using the proportions we will use counts which still gives us the same results. Overall, large and small |F| might indicate poor sample quality or inbreeding, respectively. We will exclude samples with |F|> 0.1 as described in Reed et al., 2015.

Finally, we will investigate relatedness among samples using the kinship coefficient based on identity by descent (IBD). Please note that these functions from the package SNPRelate require GDS files. For this reason we first need to aggregate the .bed, .fam and .bim files from the three populations into convertGDS. The function snpgdsBED2GDS2 creates the GDS necessary for this part of the analysis. To determine the kinship coefficient between pairs of samples we will use a subset of uncorrelated SNPs in order to have unbiased estimates. For this purpose, we will use linkage disequilibrium (LD) as a measure of correlation between markers. LD ranges from 0 to 1, the higher its value the more likely two SNPs co-segregate and therefore correlate. Here, we will utilize the subset of SNPs with LD < 0.2 (p ~ 12,000) to determine the IBD kinship coefficient. It took me about 2 hours to calculate the LD with the function snpgdsLDpruning, so be patient. Next, following Reed et al., 2015, we will exclude all samples with kinship coefficients > 0.1.

After pre-processing, we are left with 316 samples (110 Chinese, 100 Indian and 106 Malay) characterised by 795,668 SNP markers and 282 lipid species. Note that your sample size might differ slightly as the LD pruning procedure is stochastic.

Analysis

Principal Component Analysis

Now that we are done with the pre-processing, it might be a good idea to examine the largest sources of variation in the genotype data and look out for outliers or clustering patterns, using Principal Component Analysis (PCA). Because we are working with S4 objects, we will be using the PCA function from SNPRelatesnpgdsPCA. Let’s plot the first two principal components (PCs).

PCApopulation

As expected, the 795,668 SNP markers clearly delineate the three populations. The results also suggest that Chinese and Malay are closer to each other than to Indian (this observation would be much better addressed with e.g. hierarchical clustering). Also, no obvious outliers are identified with the first two PCs. All good to proceed to GWA.

Genome-Wide Association

Finally, the pièce de résistance. From the set of 282 lipids species I chose cholesterol, one of the most familiar, as the trait of interest. You are completely free to select a different lipid species and proceed. We are going to use the GWA function provided in Reed et al., 2015 with some minor modifications. I recommend you to open the script GWASfunction.R  and skim through. This is an excellent, well documented parallelized implementation. Note that the glm function is used to determine the significance of association between each SNP and the trait of interest. In case you are wondering, glm is much more versatile than lm since it conducts Gaussian, Poisson, binomial and multinomial regression / classification, depending on how your trait of interest is distributed (all lipids in the phenotype file are Gaussian). This GWA function will not create a variable, but rather write a .txt summary table listing the coefficient estimate \beta , t and the corresponding P-value for each SNP, alongside with the corresponding genomic coordinates. Running the GWA function took me approximately 1.5 hours with my MacBook Pro mid-2012 (8Gb, 2.9 GHz Intel Core i7).

Once finished, we can visualize the results using so-called Manhattan plots. All we need is to load the .txt summary table written in the previous step, add a column with -\log_{10} (P) and plot these significance estimates against the genomic coordinates of all SNPs.

In addition, a full (resp. dashed) line indicate the levels of Bonferroni-adjusted \alpha = 0.05 for 1,000,000 (resp. 10,000) tests.Cholesterol

We see that a total of four SNPs pass the ‘relaxed’ Bonferroni threshold (none passes the ‘hard’ threshold). These are SNPs rs7527051 (Chr. 1), rs12140539 (co-localized with the first, Chr. 1), rs9509213 (Chr. 13) and rs2250402 (Chr. 15).

Before proceeding with these four hits, it is helpful to constrast the distribution of the resulting P-values against that expected by chance, as to ensure there is no confounding systemic bias. This is easily addressed with a quantile-quantile (Q-Q) plot. As the name suggests, it compares two distributions based on their quantiles. In the present case, we want to contrast the distribution of our t statistics against that obtained by chance. If two distributions are identical in shape, the Q-Q plot will display a x = y line. Therefore, the Q-Q plot from a reliable GWA analysis will display a x = y line with only few deviating values that are suggestive of association. Otherwise, if the line is shifted up or down the GWA analysis might be impaired by confounding factors. We will draw a Q-Q plot using the function estlambda from the package GenABEL.

Cholesterol_QQplot

The resulting Q-Q plot clearly depicts a trend line (\lambda = 0.99 , red) overlapping with x = y (black) and a slight deviation in the right tail, so we can be confident about our results.

Functional insights into candidate markers

Finally, we will try to find the functional relevance of these four candidate SNPs by searching for genes in their vicinity, using the USCS Genome Browser (enter Genome Browser, insert the SNP ID in the text box, enter and zoom out). I found that rs9509213 (Chr. 13) lands right on CRYL1 (crystalline lambda 1, intron sequence), rendering it an interesting candidate for follow-up studies.

Screen Shot 2017-10-08 at 17.58.52

The SNP rs9509213 is shown as a black text box in the bottom of the figure, and its coordinates are highlighted by a vertical yellow line. The CRYL1 gene model is shown on top, below the chromosome model (topmost).

Interestingly, there is a recent publication that found ‘POE, HP, and CRYL1 have all been associated with Alzheimer’s Disease, the pathology of which involves lipid and cholesterol pathways.’.

Finally, I would like to remark on the need of experimental validation of gene candidates to validate the resulting SNP-trait associations. In this particular case, one could test the effect of overexpressing or knocking-out the ortholog of CRYL1 in cholesterol levels of mice.

Wrap-up

To sum up, we have covered some of the most fundamental aspects of GWA analysis:

  • Pre-processing of genotype data based on call rates, MAF, kinship and heterozygosity
  • Investigation of population structure with PCA of the genotype data
  • GWA and visualisation with Manhattan plots
  • Q-Q plots
  • Functional insights into the resulting candidate SNPs

I did not cover many other aspects of GWA, such as fine-mapping using LD plots. Importantly, we did not consider sample size calculation in relation to the expected detectable effect sizes. Also noteworthy, the state-of-the-art of GWA is now almost-fully based on mixed-effect linear models (MLM) that consider kinship and cryptic relatedness as random factors, hence much more powerful that glm. My next post might cover MLM.

I encourage you to choose a different lipid species from the phenotype data, run the analysis and interpret the results.

I would like to thank Saw et al., 2017 for providing unrestricted access to their omics data. If it wasn’t for such conscientious scientists and their transparent research endeavours this whole tutorial would simply not exist. I also reiterate that this tutorial is largely based on the guide outlined in Reed et al., 2015, an excellent reference along with Sikorska et al., 2013.

Have fun and keep your cholesterol in check. Any feedback or suggestions are welcome!

37 thoughts on “Genome-wide association studies in R

  1. Wow! Talk about timely! I have an honors student in my undergraduate biostatistics course who is just beginning to write an introductory “GAA for Dummies” tutorial for genome-wide association analysis. This article is a gold mine of information and references. Thanks.

    Like

    1. Hi Mateusz! Which OS are you using? If Mac/Linux, open a terminal, use the ‘cd’ command to find the corresponding directory and type ‘tar xf iOmics_data.tar.gz’. If you are using Windows please use a software other than WinZip and let me know if your issue was solved. Cheers!

      Like

  2. While I was running the scripts about “Genome-wide association studies in R” posted at https://www.r-bloggers.com/genome-wide-association-studies-in-r/ , I got the following error message.

    > snpgdsBED2GDS(bed.fn = “convertGDS.bed”, fam.fn = “convertGDS.fam”, bim.fn = “convertGDS.bim”, out.gdsfn = “myGDS”, cvt.chr = “char”)
    Start snpgdsBED2GDS …
    Error in file(filename, “rb”) : cannot open the connection
    In addition: Warning message:
    In file(filename, “rb”) :
    cannot open file ‘convertGDS.bed’: No such file or directory

    Can you please tell me what I am doing wrongly.
    The code was posted by Francisco Lima, but I could not find his email address.

    Thank you!

    Like

      1. Many thanks Francisco.
        Everything went smoothly.
        Great!

        Wanted to know whether you plan also to implement mixed models in GWAS.
        Thanks again.

        Like

  3. Hi again Ephrem! I am glad it worked. The next post will be out in one or two weeks, it will definitely deal with mixed models using a simple experiment. I can nevertheless add a paragraph or two about the applications in GWAS. Cheers

    Like

  4. is it possible to run the snpgdsLDpruning function with more than 1 cpu? there is an argument to set threads but no matter how many I set, the system only utilizes one.

    Like

    1. Hi Bo! Is this the error message – “The current version of ‘snpgdsLDpruning()’ does not support multi-threading”? It seems like the current version (at least in the dev-repo) does not support multi-threading (https://github.com/zhengxwen/SNPRelate/blob/master/R/LD.R). Otherwise, should your version be fully operational, examine the code chunk that sets up the parallelisation in that particular function – I have found there are implementations that either work on UNIX/OS or Windows. Hope this helps! Cheers

      Like

  5. Hi Francisco, I have the same problem cannot install “GenABEL” package even I tried the methods in the link.
    I even tried to use old R versions, it’s still not working.
    Regards.

    Like

  6. Hi Francisco! I have a question what kind of new R package would make your life easier when you were doing this analyzation? An embedded genome browser that is using the Genome Browser’s database API? or writing instead of a .txt file, directly reading and drawing a plot?

    Like

    1. Hi Doga! I think an interface bridging GWA results and a Genome Browser (more technically gene coordinates that co-localize with selected markers) would be very exciting. To my knowledge this currently still is an open venue, but it suffers from a old standing issue – you have to custom-tailor such interfaces to individual species. Search for tools of that sort based on the human genome (R based or not), if none exist you can take the lead. As for plotting from GWA results (.txt or otherwise), I think there are so many options out there I would not even bother… Hope this helps!

      Like

  7. Hi Francisco! Amazing post! I’m working with GWAS data, I have the location (chromosome and genome) of each SNP. I already sucess plotting the SNPs just using the chromosome locations, but dont how to plot then using the genome as well (1A).
    Hope I’m clear:) Thanks in advance!

    Liked by 1 person

  8. Thanks Ale! If I understood correctly, for each SNP you have the coordinates that map into the corresponding chromosome (that is, you have both position and chromosome). In that case, you can produce a Manhattan plot by ordering all SNPs AND chromosomes. In terms of code, you could try encoding new coordinates that extend to all the ordered chromosomes. For example, the first position in Chr. 4 would be the last position in Chr. 3 plus 1bp. Overall, these new coordinates would be the x-axis of your figure. Does this help?

    Like

  9. Hi Francisco,

    I really liked your post.
    I tried to perform the very same analysis, but I don’t know where to find the conversionTable.RData image. Nor do I know how to obtain the PLINK-converted .bed, .fam and .bim Illumina files from each of the three ethnic groups you mention.

    Sincerely

    Hervé

    Like

  10. Hi Francisco,
    Yeah you are right I really should have paid more attention.
    I now have dowloaded the files and I am trying to follow all the described steps.

    Hope I will be able to run the GenABEL part, as I run the snpStats part with R 5.1, which appears to be non compatible with my previous GenABEL library (compiled for R 3.3.3).
    I will let you know if I run into difficulties I cannot solve despite all the feedback.

    Sincerely,

    Hervé

    Like

  11. UPDATE / ERRATUM
    Fix for Linux/macOS installation of GenABEL
    install.packages(“GenABEL.data”, repos=”http://R-Forge.R-project.org”)
    packageurl <- "https://cran.r-project.org/src/contrib/Archive/GenABEL/GenABEL_1.8-0.tar.gz&quot;
    install.packages(packageurl, repos=NULL)

    Fix for dplyr `arrange` and `count` in `while` loop
    while (nrow(ibdcoef) > 0) {
    # count the number of occurrences of each and take the top one
    sample.counts <- sort(table(c(ibdcoef$ID1, ibdcoef$ID2)), decreasing = T)
    rm.sample <- names(sample.counts)[1]
    cat("Removing sample", rm.sample, "too closely related to", sample.counts[1], "other samples.\n")

    # remove from ibdcoef and add to list
    ibdcoef <- ibdcoef[ibdcoef$ID1 != rm.sample & ibdcoef$ID2 != rm.sample,]
    related.samples <- c(as.character(rm.sample), related.samples)
    }

    Like

    1. Hi Francisco,
      I thought the Package ‘GenABEL’ was removed from the CRAN repository due to some non-compliances.
      have a nice weekend.

      Like

  12. Hi Francisco,
    Your post is amazing. I have a question regarding using my own database as it comes from axiom Arachis microarray I haven’t got the PLINK file. I only have genotypes for my samples and the chromosome and position for each SNP. Is it possible to run this analysis following your tutorial? Thank you very much in advance

    Like

    1. Hi frandeblas, thanks for the kind words. There certainly are tons of tools you can use to convert files over a wide range of formats including PLINK. What format are your files?
      Best,
      Francisco

      Like

  13. Dear Francisco,
    thank you for your post.
    Unfortunatelly, the GWAA function doesn’t work to me…
    Please, see the error:

    Error in { : task 1 failed – “task 1 failed – “indice fuori limite””
    4.
    stop(simpleError(msg, call = expr))
    3.
    e$fun(obj, substitute(ex), parent.frame(), e$data)
    2.
    foreach(part = 1:nSplits) %do% {
    genoNum <- as(genodata[, snp.start[part]:snp.stop[part]],
    "numeric")
    if (isTRUE(flip)) … at C:\Users\Gabry\Desktop\SNP UniMi Gabriella_2019\QTL_L22A\GWAS_RStudio\GWAA.R#78
    1.
    GWAA(genodata = genData$SNP, phenodata = phenodata,
    filename = paste(target, ".txt", sep = ""))

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.