Posted by Meical on March 25, 2017
For whatever reason, following on from my despair with normalizing gene expression data earlier in the week, my most recent challenge has been to take a Bioconductor ExpressionSet of gene expression data measured on an Affymetrix GeneChip® Human Transcriptome Array 2.0 and label each row with its gene symbol instead of its probe ID.
I have seen a lot of code samples that suggest variations on two themes: using the biomaRt package, or querying a SQL database of annotation data directly. With the former I gave up trying; with the latter I ran away to hide, having recently only interacted with SQL databases through Java’s JPA abstraction layer.
It turns out to be very easy to do this using the affycoretools package by James W MacDonald which contains ‘various wrapper functions that have been written to streamline the more common analyses that a core Biostatistician might see.’
As you can see below, you can very easily extract a vector of gene symbols, one for each of your probe IDs, and assign it as the row names of your gene expression matrix (exprs() returns a matrix rather than a data.frame).
I hope this will save you the trouble of finding this gem of a package.
source('http://www.bioconductor.org/biocLite.R')
biocLite('GEOquery')
require('GEOquery')
biocLite('affycoretools')
require('affycoretools')
biocLite('pd.hta.2.0')
require('pd.hta.2.0')

# retrieve ExpressionSet using GEOquery
# getGEO() returns a list of ExpressionSets; take the first one
gse76250 <- getGEO('GSE76250')[[1]]

# populate the fData slot in the ExpressionSet
# with gene symbols
annot.gse76250 <- annotateEset(gse76250, pd.hta.2.0)

# extract the expression matrix
gse76250.expr <- exprs(annot.gse76250)

# change rownames of expression matrix
# to gene symbols instead of probe ids
gene.symbols <- fData(annot.gse76250)$SYMBOL
rownames(gse76250.expr) <- gene.symbols
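If you want to try the final renaming step on its own, without downloading anything, here is a minimal base R sketch on toy data. The probe IDs, sample names, values and symbols are all made up for illustration, and the step that drops unannotated probes rests on my assumption that probes without a gene symbol come back as NA; it is not part of the recipe above.

```r
# Toy stand-ins for exprs(annot.gse76250) and fData(annot.gse76250)$SYMBOL.
# All IDs, values and symbols here are invented for illustration.
toy.expr <- matrix(
  c(5.1, 6.3, 7.2, 4.8, 5.0, 6.1, 7.5, 4.9),
  nrow = 4,
  dimnames = list(c("probe_a", "probe_b", "probe_c", "probe_d"),
                  c("sample_1", "sample_2"))
)
toy.symbols <- c("TP53", NA, "BRCA1", "TP53")

# Optional: drop probes that have no gene symbol
# (assumption: unannotated probes are returned as NA)
has.symbol <- !is.na(toy.symbols)
toy.expr <- toy.expr[has.symbol, , drop = FALSE]
toy.symbols <- toy.symbols[has.symbol]

# A matrix accepts duplicated row names, so this works even though
# TP53 appears twice; a data.frame would reject the duplicates
rownames(toy.expr) <- toy.symbols

print(toy.expr)
```

Several probes can map to the same gene, so duplicated row names are to be expected. How you deal with them (keep them all, average them, pick one probe per gene) depends on what you want to do next.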