
Reading .xlsx files into R, quickly

The gdata package is fantastic for reading small .xls or .xlsx files into R, but it can be very slow for large files. Today I had to read data from the second sheet of a 64 MB .xlsx file. It should have gone something like

my.data <- read.xls("myfile.xlsx", sheet=2, header=TRUE)

but R just hung there, unresponsive, for minutes.

Instead, I followed advice on converting .xls to .csv on the Linux command line and used Gnumeric. I installed it via my package manager

sudo apt-get install gnumeric

then, back in R, did

convert.xls <- function(f, out.dir = dirname(f)) {
  # build the output name by swapping the extension for .csv
  f2 <- file.path(out.dir, gsub("\\.xlsx$|\\.xls$", ".csv", basename(f)))
  # -S splits the workbook: one CSV per sheet, suffixed .0, .1, ...
  system(paste("ssconvert -S --export-type=Gnumeric_stf:stf_csv", f, f2))
  # return the full paths of the per-sheet CSV files
  out.files <- list.files(dirname(f2), pattern = basename(f2), full.names = TRUE)
  out.files
}

which reported

infile: myfile.xlsx 
outfile(s): myfile.csv.0

so that I could read my file with

# read sheet 2, indexing is 0-based
mydata <- read.table("myfile.csv.1", header = TRUE, as.is=TRUE, sep=",") 

It was quick and it worked. I love ssconvert! The persistence of journals in distributing large supplementary data files in Excel format remains annoying, but perhaps ever so slightly less so.

In the function, I assumed by default that the converted files should be placed in the same directory as the input file. If that's not the case, just supply the intended directory via the out.dir argument.
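Putting it together, the function returns the full paths of the per-sheet CSV files, so a sheet can then be read by position. A minimal sketch, assuming convert.xls returns that file list as above (the file name and /tmp/csv output directory are made up for illustration):

```r
# convert and collect the per-sheet CSVs; /tmp/csv is a hypothetical directory
out.files <- convert.xls("myfile.xlsx", out.dir = "/tmp/csv")

# list.files() sorts the names, so out.files is myfile.csv.0, myfile.csv.1, ...
# ssconvert numbers sheets from 0, so the second sheet is the ".csv.1" file
my.data <- read.table(out.files[2], header = TRUE, as.is = TRUE, sep = ",")
```

Because the returned paths are already sorted by sheet suffix, indexing with out.files[n] picks up sheet n-1 in ssconvert's 0-based numbering.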
