The R code snippets below show common things you may need to do for batch correction.
We have two input files:
metadata.tsv has sample ids in the first column and contains our batch data (Sequencer) and our Condition of interest (CancerType).
metadf <- read.tsv("./metadata.tsv")
exprdata <- read.tsv("./expressiondata.tsv")
batch = metadf$Sequencer
condition = metadf$CancerType
If you set the cutoff to zero, you remove all genes with no variation. Effectively these are the genes with no expression in any of the samples. In the example below the cutoff is set to 0.8.
m <- as.matrix(exprdata, var = "gene_id")
gene.sd <- unlist(apply(m,1,sd))
rows.for.removal <- names(which(gene.sd <= 0.8))
exprdata <- m[!(rownames(m) %in% rows.for.removal),]
ComBat (and BatchQC) will fail if the individual batches contain genes with no expression. This means that every time you want to run batch effect correction, you must filter out these genes for the variable of interest (here Sequencer). In the example below we first get the sample IDs for each batch from the metadata table, then select those samples from the expression table and sum the expression values per gene. If the sum is zero, none of the samples in that batch expresses the gene; we remove the gene from the whole expression set.
print(dim(exprdata))
for (b in levels(metadf$Sequencer)){
bset <- subset(metadf, Sequencer==b) # get the samples in this batch
submatrix <- exprdata[,(colnames(exprdata) %in% rownames(bset))] # use those to filter
exprdata <- exprdata[(rowSums(submatrix) !=0),] # select all that are not zero
}
print(dim(exprdata))
Do not add an arbitrary number to your whole dataframe to avoid doing this filtering: After batch correction it will be impossible to see which genes are not expressed.
R's sample function allows you to select rows randomly. In the example below we select 9000 genes from our set.
source("https://bioconductor.org/biocLite.R")
biocLite(c("BatchQC"))
library(BatchQC)
sampledf <- exprdata[sample(nrow(exprdata), 9000),]
batch = metadf$Sequencer
condition = metadf$CancerType
batchQC(dat=sampledf, batch=batch, condition=condition,
report_file="batchqc_report.html", report_dir=".",
report_option_binary="111111111",
view_report=TRUE, interactive=TRUE, batchqc_output=TRUE)
report_option_binary="111111111" means 'create all plots'. Also note that this command creates a html file in your current working directory - it is a non interactive version of the report. If you run ComBat, a second file is created named combat_batchqc_report.html.
source("https://bioconductor.org/biocLite.R")
biocLite(c("sva"))
library(sva)
ebat <- ComBat(dat=exprdata, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=FALSE)
When you have convinced yourself that you need to do batch correction, you can run ComBat in Python. This is especially useful when you have a large dataset: the Python implementation is a lot faster.
Note that you should run without telling ComBat about your condition of interest (here called Covariate; elsewhere it shows up as Confounder). In the combat.py example code you should call the function like so:
ebat = combat(dat, metadf['Sequencer'], None)