Batch effect analysis

Batch Effects in RNA-Seq data

This page will explain how to see if your data contains batch effects.

Before you begin, read this paper.

Batch effect correction is the procedure of removing variability from your data that is not due to your variable of interest (e.g. cancer type). Batch effects are due to technical differences between your samples, such as the type of sequencing machine or even the technician that ran the sample. Removing this variability means changing the data for individual samples. For expression data, this means that you should never take individual samples from a batch corrected set for a separate analysis such as differential expression.

In order to see if you have a batch effect, run BatchQC. This R module provides interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts your data. In the next section we will discuss how to run BatchQC and how to interpret its output.

Running BatchQC

You need two files:

A gene by sample matrix with gene IDs in the first column and sample IDs as column headers. The cells contain quantile normalized expression values.

A metadata file with sample IDs in the first column and information about the samples in the remainder It should include the suspected batch variables, such as Sequencing Platform, Data, Biopsy Site, etc., as well as your classifier (e.g. tumor type).

Make sure both tables are squeaky clean: No empty cells, no NA values, no spaces, quotes or other weird symbols in your column headers. Especially check your metadata for misspellings or unwanted variations ("mixed" and "mixed cytology" are probably the same thing). You might want to replace any spaces in the metadata cells with underscores. R is a picky, picky program.

Assuming your data is in tab separated text format, here's how to run the program:

source("https://bioconductor.org/biocLite.R")

biocLite(c("BatchQC"))

library(BatchQC)

 

metadf <- read.tsv("./metadata.tsv“)

exprdata <- read.tsv("./expressiondata.tsv")

batch = metadf$Sequencer

condition = metadf$CancerType

 

batchQC(dat=exprdata, batch=batch, condition=condition,

report_file="batchqc_report.html", report_dir=".",

report_option_binary="111111111",

view_report=TRUE, interactive=TRUE, batchqc_output=TRUE)

After running calculations for a while, the program will open an interactive URL that shows you plots of your data. You can color most of these plots by condition or by batch, so you can see if there is a batch effect.

Notes:

  • Not all batch effects show up equally in all plots. This is why different kinds of statistics are presented. Check the Circular Dendrogram first, it's the simplest plot to understand.

  • Larger datasets will take longer to run and will be harder to interact with, since the program re-creates plots based on your selections. If you find that your computer gets sluggish, downsample your gene set so you're working with 20-50% of your data.

  • You can run batch correction from the interactive interface by going to the ComBat or SVA tabs and selecting Run. This allows you to go back through your plots and select a 'post correction' version. You will still have to run ComBat or SVA on your full dataset to get the results table.

For a detailed description of all BatchQC outputs, click on the links below: