blog

2015-06-23

Improved R Performance with OpenBLAS and vecLib Libraries

Dramatic performance improvements are available by using an alternative to the BLAS ("Basic Linear Algebra Subprograms") library that is shipped with R. Current alternatives to the R "Reference" (i.e., default) library include:

There are others.

Older posts elsewhere may mention GotoBLAS, which is no longer being developed. OpenBLAS was initially derived from GotoBLAS.

OS X

Switching from the Reference R BLAS library to Apple's vecLib library is quite easy, although the official R FAQ on the subject is slightly misleading. The symbolic link given on the R FAQ page refers to an older version of R and is no longer correct. The current procedure, as of OS X 10.10.3 and R 3.2.1 is:

# for vecLib use
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
     
# for R reference BLAS use
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf libRblas.0.dylib libRblas.dylib

The following results are from the frequently-used R benchmark run on a mid-2010 Mac Pro (Quad-core 2.8 GHz) with R 3.2.1 on OS X 10.10.3. Many of the tests are not affected by the choice of library. The tests with the biggest improvements are listed here.

Reference library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  12.912 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  6.09233333333334 
Eigenvalues of a 640x640 random matrix______________ (sec):  1.111 
Determinant of a 2500x2500 random matrix____________ (sec):  5.59966666666666 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  4.68366666666667 
Inverse of a 1600x1600 random matrix________________ (sec):  4.74333333333334 

Accelerate vecLib library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.651666666666666 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.379 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.529333333333331 
Determinant of a 2500x2500 random matrix____________ (sec):  0.499333333333335 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.408000000000001 
Inverse of a 1600x1600 random matrix________________ (sec):  2.04233333333333 

The speedup factors for these tests range from a high of about 20x (2800x2800 cross-product matrix) to a low of 2x (Eigenvalues of a 640x640 random matrix).

Test Performance
factor
2800x2800 cross-product matrix 19.8
Linear regr. over a 3000x3000 matrix 16.1
Eigenvalues of a 640x640 random matrix 2.1
Determinant of a 2500x2500 random matrix 11.2
Cholesky decomposition of a 3000x3000 matrix 11.5
Inverse of a 1600x1600 random matrix 2.3

Linux/CentOS 6.6

Build tools

The development setup described previously provides gcc version 4.4.7. OpenBLAS requires gcc version 4.6 or above. Follow these steps (from advice here) to set up devtoolset, which allows switching to an environment with gcc 4.8:

cd /etc/yum.repos.d
sudo wget http://people.centos.org/tru/devtools-2/devtools-2.repo

sudo yum -y install devtoolset-2-gcc
sudo yum -y install devtoolset-2-binutils
sudo yum -y install devtoolset-2-gcc-gfortran
sudo yum -y install devtoolset-2-gcc-c++

To switch into the environment:

scl enable devtoolset-2 bash

To leave the environment:

exit

Default build

Download and build OpenBLAS with default options:

cd ~
mkdir OpenBLAS
cd OpenBLAS/
wget http://github.com/xianyi/OpenBLAS/zipball/v0.2.13
unzip v0.2.13 
cd xianyi-OpenBLAS-aceee4e/
scl enable devtoolset-2 bash
make
sudo mkdir /usr/lib64/openblas
sudo make PREFIX=/usr/lib64/openblas install
exit

Switching

To prepare to use different versions of BLAS, rename the reference library and set up a symbolic link:

sudo mv /usr/lib64/R/lib/libRblas.so /usr/lib64/R/lib/libRblas.ref.so
sudo ln -sf /usr/lib64/R/lib/libRblas.ref.so /usr/lib64/R/lib/libRblas.so

We can now switch back and forth between the reference BLAS library and OpenBLAS:

# for OpenBLAS use
cd /usr/lib64/R/lib
sudo ln -sf /usr/lib64/openblas/lib/libopenblas.so libRblas.so
     
# for R reference BLAS use
cd /usr/lib64/R/lib
sudo ln -sf libRblas.ref.so libRblas.so

Benchmarks

All Linux results are with R 3.2.0 using the hardware described here and here: 4.0 GHz Intel i7-4790K with 32 GB RAM.

Reference library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  8.34433333333334 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  4.0833333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.623999999999986 
Determinant of a 2500x2500 random matrix____________ (sec):  2.86333333333331 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.44600000000003 
Inverse of a 1600x1600 random matrix________________ (sec):  2.35433333333333 

OpenBLAS library (default build options):

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.151666666666667 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.102333333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.506 
Determinant of a 2500x2500 random matrix____________ (sec):  0.109333333333334 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.121 
Inverse of a 1600x1600 random matrix________________ (sec):  0.114666666666667 

The speedup factors for these tests range from a high of 55x (2800x2800 cross-product matrix) to a low of 1.2x (Eigenvalues of a 640x640 random matrix).

Test Performance
factor
2800x2800 cross-product matrix 55.0
Linear regr. over a 3000x3000 matrix 39.9
Eigenvalues of a 640x640 random matrix 1.2
Determinant of a 2500x2500 random matrix 26.2
Cholesky decomposition of a 3000x3000 matrix 28.5
Inverse of a 1600x1600 random matrix 20.5

Threading Issues

The computation code within R itself is single-threaded, although various popular and useful packages (e.g., doMC in combination with foreach) provide for multi-threaded computation.

The default OpenBLAS library is multi-threaded. The OpenBLAS FAQ warns that this can

conflict with multi-threading used by R packages:

How can I use OpenBLAS in multi-threaded applications?

If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following.
If the application is parallelized by OpenMP, please build OpenBLAS with USE_OPENMP=1

Nevertheless, the OpenBLAS library (as currently compiled on Linux) appeared to coexist nicely with parallel R computation. I believe this is because the Linux makefile now uses NO_AFFINITY = 1 as a default.

As Zhang Xianyi (main OpenBLAS author) explained in a post, thread affinity is a feature that can interfere with the use of multi-threading elsewhere in a application that uses OpenBLAS:

By default, OpenBLAS is compiled with the thread affinity. If the application uses the single thread, it will improve the performance. However, if the application already uses the multithreading, it will bind these threads of the application to one CPU core.

Thus, the solution is compiling OpenBLAS with NO_AFFINITY=1 or USE_OPENMP=1

I was still unsure, since different people on the web have made different statements about multi-threading compatibility. It seemed appropriate to do some experimentation.

Single-threaded build

To make a single-threaded version, edit Makefile.rule after making a copy of the original:

cd ~/OpenBLAS/xianyi-OpenBLAS-aceee4e/
cp Makefile.rule Makefile.rule.orig
vi Makefile.rule

Uncomment this line:

USE_THREAD = 0

And then make:

scl enable devtoolset-2 bash
make clean
make
sudo mkdir /usr/lib64/openblas.st
sudo make PREFIX=/usr/lib64/openblas.st install
exit

To switch into the single-threaded-only OpenBLAS:

# for single-threaded OpenBLAS use
cd /usr/lib64/R/lib
sudo ln -sf /usr/lib64/openblas.st/lib/libopenblas.so libRblas.so

Thread-affinity build

To make a version with thread-affinity enabled, restore the copy of the original Makefile.rule, then edit:

cd ~/OpenBLAS/xianyi-OpenBLAS-aceee4e/
cp Makefile.rule.orig Makefile.rule
vi Makefile.rule

Comment-out this line:

# NO_AFFINITY = 1

And then make:

scl enable devtoolset-2 bash
make clean
make
sudo mkdir /usr/lib64/openblas.ta
sudo make PREFIX=/usr/lib64/openblas.ta install
exit

To switch into the thread-affinity version of OpenBLAS:

# for thread-affinity OpenBLAS use
cd /usr/lib64/R/lib
sudo ln -sf /usr/lib64/openblas.ta/lib/libopenblas.so libRblas.so

Processor statistics

To monitor CPU usage under different conditions, install the sysstat utilities:

sudo yum -y install sysstat
and then call the mpstat tool to print CPU usage continuously:
mpstat -P ALL 1

Test code

I used this R snippet to test OpenBLAS CPU usage:

# Crossprod snippet
# runs about 15 sec on i7-4790K

a <- rnorm(2800*2800);
dim(a) <- c(2800, 2800)
for (i in 1:100) {
  b <- crossprod(a)  # equivalent to: b <- t(a) %*% a
}

I used this R snippet to test multi-threaded R usage:

# Foreach snippet
# requires installation of the foreach and doMC packages
library(foreach)
library(doMC)
registerDoMC(8)   # 8 threads

# runs about 30 sec on i7-4790K

phi <- 1.6180339887498949
a <- NULL
b <- NULL
foreach(k=1:8) %dopar% {
  for (i in 1:30) {
    a <- floor(runif(3500000)*1000)
    b <- (phi^a - (-phi)^(-a))/sqrt(5)
  }
}

Changing OpenBLAS threading at runtime

The OpenBLAS library includes a C function to change the number of threads being used at any time. The post here shows how to use the inline R package to define a R-callable function to access this:

library(inline)
openblas.set.num.threads <- cfunction( signature(ipt="integer"),
      body = 'openblas_set_num_threads(*ipt);',
      otherdefs = c ('extern void openblas_set_num_threads(int);'),
      libargs = c ('-L/usr/lib64/openblas/lib -lopenblas'),
      language = "C",
      convention = ".C"
      )

Note that the path in line 5 must be changed to match the location of the OpenBLAS library.

After this R code is executed, the number of OpenBLAS threads can be changed by invoking the R function:

# set single-threaded OpenBLAS:
openblas.set.num.threads(1)

# set multi-threaded OpenBLAS:
openblas.set.num.threads(8)

Comparisons

These tests show typical thread usage for the combination of R code and OpenBLAS library indicated. The crossprod code exercises the BLAS library and the foreach code exercises multi-threaded processing within R packages.

Crossprod snippet with default OpenBLAS library:

mpstat -P ALL 1
06:08:10 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:08:11 PM  all   77.28    0.00   22.72    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    0   78.00    0.00   22.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    1   77.78    0.00   22.22    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    2   86.87    0.00   13.13    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    3   76.00    0.00   24.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    4   74.00    0.00   26.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    5   73.27    0.00   26.73    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    6   77.00    0.00   23.00    0.00    0.00    0.00    0.00    0.00    0.00
06:08:11 PM    7   77.00    0.00   23.00    0.00    0.00    0.00    0.00    0.00    0.00

Crossprod snippet with thread-affinity-enabled OpenBLAS library:

mpstat -P ALL 1
02:38:01 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:38:02 PM  all   39.00    0.00   11.00    0.00    0.00    0.00    0.00    0.00   50.00
02:38:02 PM    0   79.00    0.00   21.00    0.00    0.00    0.00    0.00    0.00    0.00
02:38:02 PM    1   70.00    0.00   30.00    0.00    0.00    0.00    0.00    0.00    0.00
02:38:02 PM    2   77.78    0.00   22.22    0.00    0.00    0.00    0.00    0.00    0.00
02:38:02 PM    3   85.00    0.00   15.00    0.00    0.00    0.00    0.00    0.00    0.00
02:38:02 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:38:02 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:38:02 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:38:02 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Crossprod snippet with single-threaded-only OpenBLAS library:

mpstat -P ALL 1
06:09:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:09:49 PM  all   12.41    0.00    0.00    0.00    0.00    0.00    0.00    0.00   87.59
06:09:49 PM    0    0.00    0.00    0.00    0.99    0.00    0.00    0.00    0.00   99.01
06:09:49 PM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:09:49 PM    2    0.00    0.00    0.99    0.00    0.00    0.00    0.00    0.00   99.01
06:09:49 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:09:49 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:09:49 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:09:49 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
06:09:49 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Crossprod snippet with default OpenBLAS library and openblas.set.num.threads(1):

mpstat -P ALL 1
02:44:11 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:44:12 PM  all   12.48    0.00    0.12    0.00    0.00    0.00    0.00    0.00   87.39
02:44:12 PM    0   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
02:44:12 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:44:12 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:44:12 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:44:12 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:44:12 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:44:12 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:44:12 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Above results are as expected. Note that when thread-affinity is enabled, OpenBLAS uses a number threads equal to the number of physical cores (4) rather than the number of virtual cores (8). The last test above demonstrates that the openblas.set.num.threads() function works.

Foreach snippet with default OpenBLAS library:

mpstat -P ALL 1
05:58:50 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:58:51 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:58:51 PM    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Foreach snippet with thread-affinity-enabled OpenBLAS library:

mpstat -P ALL 1
05:59:51 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:59:52 PM  all   12.52    0.00    0.00    0.00    0.00    0.00    0.00    0.00   87.48
05:59:52 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
05:59:52 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:59:52 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:59:52 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:59:52 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:59:52 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:59:52 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:59:52 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

Foreach snippet with single-threaded-only OpenBLAS library:

mpstat -P ALL 1
06:02:40 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:02:41 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    2  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    3  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    6  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
06:02:41 PM    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

The three results above show that multi-threading is not harmed by use of the default OpenBLAS library (which has thread-affinity turned off). When thread-affinity is turned on, we see the dysfunctional results that others have observed and complained about.

The final set of tests shows performance results from the benchmark for various BLAS libraries:

Reference BLAS library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  8.34433333333334 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  4.0833333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.623999999999986 
Determinant of a 2500x2500 random matrix____________ (sec):  2.86333333333331 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.44600000000003 
Inverse of a 1600x1600 random matrix________________ (sec):  2.35433333333333 

Default OpenBLAS library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.151666666666667 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.102333333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.506 
Determinant of a 2500x2500 random matrix____________ (sec):  0.109333333333334 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.121 
Inverse of a 1600x1600 random matrix________________ (sec):  0.114666666666667 

Thread-affinity OpenBLAS library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.159666666666667 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.108333333333333 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.336333333333335 
Determinant of a 2500x2500 random matrix____________ (sec):  0.108999999999999 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.124333333333333 
Inverse of a 1600x1600 random matrix________________ (sec):  0.124000000000001 

Single-threaded-only OpenBLAS library:

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.483666666666667 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.279333333333332 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.281333333333333 
Determinant of a 2500x2500 random matrix____________ (sec):  0.278999999999999 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.271333333333335 
Inverse of a 1600x1600 random matrix________________ (sec):  0.265666666666666 

Default OpenBLAS library with openblas.set.num.threads(1):

2800x2800 cross-product matrix (b = a' * a)_________ (sec):  0.482333333333315 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.27733333333335 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.27066666666669 
Determinant of a 2500x2500 random matrix____________ (sec):  0.279666666666647 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.271666666666666 
Inverse of a 1600x1600 random matrix________________ (sec):  0.264666666666661 

Observations:

Conclusions: