Following on from my despair with normalizing gene expression data earlier in the week, my most recent challenge has been to take a Bioconductor ExpressionSet of gene expression data measured on an Affymetrix GeneChip® Human Transcriptome Array 2.0 and, instead of labeling each row with its probe ID, label it with the corresponding gene symbol.
I have seen a lot of code samples that suggest variations on a theme: either using the biomaRt package or querying a SQL database of annotation data directly. With the former, I gave up trying; with the latter, I ran away to hide, having recently interacted with SQL databases only through Java’s JPA abstraction layer.
It turns out to be very easy to do this using the affycoretools package by James W MacDonald, which contains ‘various wrapper functions that have been written to streamline the more common analyses that a core Biostatistician might see.’
As the code below shows, you can extract a vector of gene symbols, one per probe ID, and assign it as the row names of your gene expression matrix (exprs() returns a matrix rather than a data.frame).
I hope this will save you the trouble of finding this gem of a package.
# install the required Bioconductor packages
# (biocLite() has been retired; BiocManager is its replacement)
if (!requireNamespace('BiocManager', quietly = TRUE))
  install.packages('BiocManager')
BiocManager::install(c('GEOquery', 'affycoretools', 'pd.hta.2.0'))

library('GEOquery')
library('affycoretools')
library('pd.hta.2.0')

# retrieve ExpressionSet using GEOquery
gse76250 <- getGEO('GSE76250')[[1]]

# populate the fData slot in the ExpressionSet
# with gene symbols
annot.gse76250 <- annotateEset(gse76250, pd.hta.2.0)

# extract the expression matrix
gse76250.expr <- exprs(annot.gse76250)

# change rownames of the expression matrix
# to gene symbols instead of probe IDs
gene.symbols <- fData(annot.gse76250)$SYMBOL
rownames(gse76250.expr) <- gene.symbols
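One thing to watch for after relabeling: some probes may have no gene symbol (NA), and several probes may map to the same symbol, so the new row names are not guaranteed to be unique. The sketch below shows one possible way to tidy this up in base R, using a small made-up matrix and symbol vector as stand-ins for gse76250.expr and gene.symbols. Dropping unannotated probes and averaging duplicates is an assumption on my part, not something affycoretools does for you, and other strategies (for example keeping the most variable probe per gene) may suit your analysis better.

```r
# Toy stand-in for gse76250.expr: 5 probes x 3 samples (made-up values)
expr <- matrix(
  c(1, 2, 3,
    4, 5, 6,
    7, 8, 9,
    2, 4, 6,
    5, 5, 5),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0('probe', 1:5), paste0('sample', 1:3))
)

# Toy stand-in for gene.symbols: one NA and one duplicated symbol
symbols <- c('TP53', NA, 'BRCA1', 'TP53', 'EGFR')

# drop probes that have no gene symbol
keep <- !is.na(symbols)
expr.kept <- expr[keep, , drop = FALSE]
symbols.kept <- symbols[keep]

# sum the rows that share a symbol, then divide by the number
# of probes per symbol to get the per-gene mean
sums <- rowsum(expr.kept, group = symbols.kept)
counts <- as.vector(table(symbols.kept)[rownames(sums)])
expr.by.gene <- sums / counts

print(expr.by.gene)
```

The result has one row per gene symbol, so it can be indexed by name (for example expr.by.gene['TP53', ]) without ambiguity.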