Batch Correction Code Snippets | Systems Biology Group

The R code snippets below show common things you may need to do for batch correction.

We have two input files:

metadata.tsv has sample ids in the first column and contains our batch data (Sequencer) and our Condition of interest (CancerType).
exprdata.tsv is a gene by sample matrix with quantile normalized data. The sample ids in this table match those in the metadata.

metadf <- read.tsv("./metadata.tsv")
exprdata <- read.tsv("./expressiondata.tsv")
batch = metadf$Sequencer
condition = metadf$CancerType

Remove Genes with low standard deviation

If you set the cutoff to zero, you remove all genes with no variation. Effectively these are the genes with no expression in any of the samples. In the example below the cutoff is set to 0.8.

m <- as.matrix(exprdata, var = "gene_id")
gene.sd <- unlist(apply(m,1,sd))
rows.for.removal <- names(which(gene.sd <= 0.8))
exprdata <- m[!(rownames(m) %in% rows.for.removal),]

Make sure none of your batches contain genes with only zeroes

ComBat (and BatchQC) will fail if the individual batches contain genes with no expression. This means that every time you want to run batch effect correction, you must filter out these genes for the variable of interest (here Sequencer). In the example below we first get the sample IDs for each batch from the metadata table, then select those samples from the expression table and sum the expression values per gene. If the sum is zero, none of the samples in that batch expresses the gene; we remove the gene from the whole expression set.

print(dim(exprdata))
for (b in levels(metadf$Sequencer)){
        bset <- subset(metadf, Sequencer==b) # get the samples in this batch
        submatrix <- exprdata[,(colnames(exprdata) %in% rownames(bset))] # use those to filter
        exprdata <- exprdata[(rowSums(submatrix) !=0),] # select all that are not zero
}
print(dim(exprdata))

Do not add an arbitrary number to your whole dataframe to avoid doing this filtering: After batch correction it will be impossible to see which genes are not expressed.

Run BatchQC on a subset of your data

R's sample function allows you to select rows randomly. In the example below we select 9000 genes from our set.

source("https://bioconductor.org/biocLite.R")
biocLite(c("BatchQC"))
library(BatchQC)

sampledf <-  exprdata[sample(nrow(exprdata), 9000),]
batch = metadf$Sequencer
condition = metadf$CancerType

batchQC(dat=sampledf, batch=batch, condition=condition,
report_file="batchqc_report.html", report_dir=".",
report_option_binary="111111111",
view_report=TRUE, interactive=TRUE, batchqc_output=TRUE)

report_option_binary="111111111" means 'create all plots'. Also note that this command creates a html file in your current working directory - it is a non interactive version of the report. If you run ComBat, a second file is created named combat_batchqc_report.html.

Run ComBat using R...

source("https://bioconductor.org/biocLite.R")
biocLite(c("sva"))
library(sva)

ebat <- ComBat(dat=exprdata, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=FALSE)

... Or Using Python

When you have convinced yourself that you need to do batch correction, you can run ComBat in Python. This is especially useful when you have a large dataset: the Python implementation is a lot faster.

Note that you should run without telling ComBat about your condition of interest (here called Covariate; elsewhere it shows up as Confounder). In the combat.py example code you should call the function like so:

ebat = combat(dat, metadf['Sequencer'], None)