Recently I have been familiarising myself with analysing microarray data in R. Statistics and Analysis for Microarrays Using R and Bioconductor by Sorin Draghici is proving to be indispensible in guiding me through retrieving microarray data from the Gene Expression Omnibus (GEO), performing typical quality control on samples and normalizing expression data over multiple samples.
As an example, I wanted to examine the gene expression profiles from GSE76250 which is a comprehensive set of 165 Triple-Negative Breast Cancer samples. In order to perform the quality control on this dataset as detailed by the book, I needed to download the Affymetrix .CEL files and then load them into R as an AffyBatch object:
library(GEOquery) library(affy) getGEOSuppFiles('GSE76250') cel.files <- list.celfiles('GSE76250/') raw.data <- ReadAffy(filenames=CEL.files)
The raw.data AffyBatch object representing these data when loaded into R takes over 4 gigabytes of memory. When you then perform normalization on this data using rma(raw.data) (Robust Multi-Array Average), this creates an ExpressionSet that effectively doubles that.
This is where I come a bit unstuck. My laptop is an Asus Zenbook 13-inch UX303A which comes with (what I thought to be) a whopping 11 gigabytes of RAM. This meant that after loading and transforming all the data onto my laptop, I had effectively maxed out my RAM. The obvious answer would be to upgrade my RAM. Unfortunately, due to the small form factor of my laptop, I only have one accessible RAM slot meaning my options are limited.
So, I have concluded that I have three other options to resolve this issue.
- Firstly, I could buy a machine that has significantly more memory capacity at the expense of portability. Ideally, I don’t want to do this because it is the most expensive approach to tackling the problem.
- Another option would be to rent a Virtual Private Server (VPS) with lots of RAM and to install RStudio Webserver on it. I’m quite tempted by the idea of this but I don’t like the idea of my programming environment being exposed to the internet. Having said this, the data I am analysing is not sensitive data and, any code that I write could be safely committed to a private Bitbucket or Github repository.
- Or, I could invest the time in approaching the above problem in a less naive way! This would mean reading the documentation for the various R and Bioconductor packages to uncover a more memory restricted method or, it could mean scoping my data tactically so that, for instance, the AffyBatch project will be garbage collected, thereby freeing up memory once I no longer need it.
In any case, I have learned to be reluctant to follow the final path unless it is absolutely necessary. I don’t particularly want to risk obfuscating my code by adding extra layers of indirection while, at the same time, leaving myself open to making more mistakes by making my code more convoluted.
The moral of the story is not to buy a laptop for its form factor if your plan is to do some real work on it. Go for the clunkier option that has more than one RAM slot.
Either that or I could Download More Ram.