Improved R Performance with OpenBLAS and vecLib Libraries
Dramatic performance improvements are available by using an alternative to the BLAS ("Basic Linear Algebra Subprograms") library that is shipped with R. Current alternatives to the R "Reference" (i.e., default) library include:
- OpenBLAS. Open-source library.
- ATLAS. Open-source library.
- Apple Accelerate vecLib. Optimized library for Macintosh. Proprietary, but included automatically with OS X.
- Intel MKL (Math Kernel Library). Proprietary and expensive, although included without charge with Revolution Open R.
There are others.
OS X
Switching from the Reference R BLAS library to Apple's vecLib library is quite easy, although the official R FAQ on the subject is slightly misleading. The symbolic link given on the R FAQ page refers to an older version of R and is no longer correct. The current procedure, as of OS X 10.10.3 and R 3.2.1 is:
# for vecLib use
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
# for R reference BLAS use
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf libRblas.0.dylib libRblas.dylib
The following results are from the frequently-used R benchmark run on a mid-2010 Mac Pro (Quad-core 2.8 GHz) with R 3.2.1 on OS X 10.10.3. Many of the tests are not affected by the choice of library. The tests with the biggest improvements are listed here.
Reference library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 12.912
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 6.09233333333334
Eigenvalues of a 640x640 random matrix______________ (sec): 1.111
Determinant of a 2500x2500 random matrix____________ (sec): 5.59966666666666
Cholesky decomposition of a 3000x3000 matrix________ (sec): 4.68366666666667
Inverse of a 1600x1600 random matrix________________ (sec): 4.74333333333334
Accelerate vecLib library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.651666666666666
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.379
Eigenvalues of a 640x640 random matrix______________ (sec): 0.529333333333331
Determinant of a 2500x2500 random matrix____________ (sec): 0.499333333333335
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.408000000000001
Inverse of a 1600x1600 random matrix________________ (sec): 2.04233333333333
The speedup factors for these tests range from a high of about 20x (2800x2800 cross-product matrix) to a low of 2x (Eigenvalues of a 640x640 random matrix).
Test | Performance factor |
---|---|
2800x2800 cross-product matrix | 19.8 |
Linear regr. over a 3000x3000 matrix | 16.1 |
Eigenvalues of a 640x640 random matrix | 2.1 |
Determinant of a 2500x2500 random matrix | 11.2 |
Cholesky decomposition of a 3000x3000 matrix | 11.5 |
Inverse of a 1600x1600 random matrix | 2.3 |
Linux/CentOS 6.6
Build tools
The development setup described previously provides gcc version 4.4.7. OpenBLAS requires gcc version 4.6 or above. Follow these steps (from advice here) to set up devtoolset, which allows switching to an environment with gcc 4.8:
cd /etc/yum.repos.d
sudo wget http://people.centos.org/tru/devtools-2/devtools-2.repo
sudo yum -y install devtoolset-2-gcc
sudo yum -y install devtoolset-2-binutils
sudo yum -y install devtoolset-2-gcc-gfortran
sudo yum -y install devtoolset-2-gcc-c++
To switch into the environment:
scl enable devtoolset-2 bash
To leave the environment:
exit
Default build
Download and build OpenBLAS with default options:
cd ~
mkdir OpenBLAS
cd OpenBLAS/
wget http://github.com/xianyi/OpenBLAS/zipball/v0.2.13
unzip v0.2.13
cd xianyi-OpenBLAS-aceee4e/
scl enable devtoolset-2 bash
make
sudo mkdir /usr/lib64/openblas
sudo make PREFIX=/usr/lib64/openblas install
exit
Switching
To prepare to use different versions of BLAS, rename the reference library and set up a symbolic link:
sudo mv /usr/lib64/R/lib/libRblas.so /usr/lib64/R/lib/libRblas.ref.so
sudo ln -sf /usr/lib64/R/lib/libRblas.ref.so /usr/lib64/R/lib/libRblas.so
We can now switch back and forth between the reference BLAS library and OpenBLAS:
# for OpenBLAS use
cd /usr/lib64/R/lib
sudo ln -sf /usr/lib64/openblas/lib/libopenblas.so libRblas.so
# for R reference BLAS use
cd /usr/lib64/R/lib
sudo ln -sf libRblas.ref.so libRblas.so
Benchmarks
All Linux results are with R 3.2.0 using the hardware described here and here: 4.0 GHz Intel i7-4790K with 32 GB RAM.
Reference library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 8.34433333333334
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 4.0833333333333
Eigenvalues of a 640x640 random matrix______________ (sec): 0.623999999999986
Determinant of a 2500x2500 random matrix____________ (sec): 2.86333333333331
Cholesky decomposition of a 3000x3000 matrix________ (sec): 3.44600000000003
Inverse of a 1600x1600 random matrix________________ (sec): 2.35433333333333
OpenBLAS library (default build options):
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.151666666666667
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.102333333333333
Eigenvalues of a 640x640 random matrix______________ (sec): 0.506
Determinant of a 2500x2500 random matrix____________ (sec): 0.109333333333334
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.121
Inverse of a 1600x1600 random matrix________________ (sec): 0.114666666666667
The speedup factors for these tests range from a high of 55x (2800x2800 cross-product matrix) to a low of 1.2x (Eigenvalues of a 640x640 random matrix).
Test | Performance factor |
---|---|
2800x2800 cross-product matrix | 55.0 |
Linear regr. over a 3000x3000 matrix | 39.9 |
Eigenvalues of a 640x640 random matrix | 1.2 |
Determinant of a 2500x2500 random matrix | 26.2 |
Cholesky decomposition of a 3000x3000 matrix | 28.5 |
Inverse of a 1600x1600 random matrix | 20.5 |
Threading Issues
The computation code within R itself is single-threaded, although various popular and useful packages (e.g., doMC in combination with foreach) provide for multi-threaded computation.
The default OpenBLAS library is multi-threaded. The OpenBLAS FAQ warns that this canconflict with multi-threading used by R packages:
How can I use OpenBLAS in multi-threaded applications?
If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following.
- export OPENBLAS_NUM_THREADS=1 in the environment variables. Or
- Call openblas_set_num_threads(1) in the application on runtime. Or
- Build OpenBLAS single thread version, e.g. make USE_THREAD=0
If the application is parallelized by OpenMP, please build OpenBLAS with USE_OPENMP=1
Nevertheless, the OpenBLAS library (as currently compiled on Linux) appeared to coexist nicely with parallel R computation. I believe this is because the Linux makefile now uses NO_AFFINITY = 1
as a default.
As Zhang Xianyi (main OpenBLAS author) explained in a post, thread affinity is a feature that can interfere with the use of multi-threading elsewhere in a application that uses OpenBLAS:
By default, OpenBLAS is compiled with the thread affinity. If the application uses the single thread, it will improve the performance. However, if the application already uses the multithreading, it will bind these threads of the application to one CPU core.
Thus, the solution is compiling OpenBLAS withNO_AFFINITY=1
orUSE_OPENMP=1
I was still unsure, since different people on the web have made different statements about multi-threading compatibility. It seemed appropriate to do some experimentation.
Single-threaded build
To make a single-threaded version, edit Makefile.rule
after making a copy of the original:
cd ~/OpenBLAS/xianyi-OpenBLAS-aceee4e/
cp Makefile.rule Makefile.rule.orig
vi Makefile.rule
Uncomment this line:
USE_THREAD = 0
And then make:
scl enable devtoolset-2 bash
make clean
make
sudo mkdir /usr/lib64/openblas.st
sudo make PREFIX=/usr/lib64/openblas.st install
exit
To switch into the single-threaded-only OpenBLAS:
# for single-threaded OpenBLAS use
cd /usr/lib64/R/lib
sudo ln -sf /usr/lib64/openblas.st/lib/libopenblas.so libRblas.so
Thread-affinity build
To make a version with thread-affinity enabled, restore the copy of the original Makefile.rule
, then edit:
cd ~/OpenBLAS/xianyi-OpenBLAS-aceee4e/
cp Makefile.rule.orig Makefile.rule
vi Makefile.rule
Comment-out this line:
# NO_AFFINITY = 1
And then make:
scl enable devtoolset-2 bash
make clean
make
sudo mkdir /usr/lib64/openblas.ta
sudo make PREFIX=/usr/lib64/openblas.ta install
exit
To switch into the thread-affinity version of OpenBLAS:
# for thread-affinity OpenBLAS use
cd /usr/lib64/R/lib
sudo ln -sf /usr/lib64/openblas.ta/lib/libopenblas.so libRblas.so
Processor statistics
To monitor CPU usage under different conditions, install the sysstat utilities:
sudo yum -y install sysstat
mpstat -P ALL 1
Test code
I used this R snippet to test OpenBLAS CPU usage:
# Crossprod snippet
# runs about 15 sec on i7-4790K
a <- rnorm(2800*2800);
dim(a) <- c(2800, 2800)
for (i in 1:100) {
b <- crossprod(a) # equivalent to: b <- t(a) %*% a
}
I used this R snippet to test multi-threaded R usage:
# Foreach snippet
# requires installation of the foreach and doMC packages
library(foreach)
library(doMC)
registerDoMC(8) # 8 threads
# runs about 30 sec on i7-4790K
phi <- 1.6180339887498949
a <- NULL
b <- NULL
foreach(k=1:8) %dopar% {
for (i in 1:30) {
a <- floor(runif(3500000)*1000)
b <- (phi^a - (-phi)^(-a))/sqrt(5)
}
}
Changing OpenBLAS threading at runtime
The OpenBLAS library includes a C function to change the number of threads being used at any time. The post here shows how to use the inline R package to define a R-callable function to access this:
library(inline)
openblas.set.num.threads <- cfunction( signature(ipt="integer"),
body = 'openblas_set_num_threads(*ipt);',
otherdefs = c ('extern void openblas_set_num_threads(int);'),
libargs = c ('-L/usr/lib64/openblas/lib -lopenblas'),
language = "C",
convention = ".C"
)
Note that the path in line 5 must be changed to match the location of the OpenBLAS library.
After this R code is executed, the number of OpenBLAS threads can be changed by invoking the R function:
# set single-threaded OpenBLAS:
openblas.set.num.threads(1)
# set multi-threaded OpenBLAS:
openblas.set.num.threads(8)
Comparisons
These tests show typical thread usage for the combination of R code and OpenBLAS library indicated. The crossprod
code exercises the BLAS library and the foreach
code exercises multi-threaded processing within R packages.
Crossprod snippet with default OpenBLAS library:
mpstat -P ALL 1
06:08:10 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
06:08:11 PM all 77.28 0.00 22.72 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 0 78.00 0.00 22.00 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 1 77.78 0.00 22.22 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 2 86.87 0.00 13.13 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 3 76.00 0.00 24.00 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 4 74.00 0.00 26.00 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 5 73.27 0.00 26.73 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 6 77.00 0.00 23.00 0.00 0.00 0.00 0.00 0.00 0.00
06:08:11 PM 7 77.00 0.00 23.00 0.00 0.00 0.00 0.00 0.00 0.00
Crossprod snippet with thread-affinity-enabled OpenBLAS library:
mpstat -P ALL 1
02:38:01 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:38:02 PM all 39.00 0.00 11.00 0.00 0.00 0.00 0.00 0.00 50.00
02:38:02 PM 0 79.00 0.00 21.00 0.00 0.00 0.00 0.00 0.00 0.00
02:38:02 PM 1 70.00 0.00 30.00 0.00 0.00 0.00 0.00 0.00 0.00
02:38:02 PM 2 77.78 0.00 22.22 0.00 0.00 0.00 0.00 0.00 0.00
02:38:02 PM 3 85.00 0.00 15.00 0.00 0.00 0.00 0.00 0.00 0.00
02:38:02 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:38:02 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:38:02 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:38:02 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Crossprod snippet with single-threaded-only OpenBLAS library:
mpstat -P ALL 1
06:09:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
06:09:49 PM all 12.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 87.59
06:09:49 PM 0 0.00 0.00 0.00 0.99 0.00 0.00 0.00 0.00 99.01
06:09:49 PM 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:09:49 PM 2 0.00 0.00 0.99 0.00 0.00 0.00 0.00 0.00 99.01
06:09:49 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
06:09:49 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
06:09:49 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
06:09:49 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
06:09:49 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Crossprod snippet with default OpenBLAS library and openblas.set.num.threads(1):
mpstat -P ALL 1
02:44:11 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:44:12 PM all 12.48 0.00 0.12 0.00 0.00 0.00 0.00 0.00 87.39
02:44:12 PM 0 99.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00
02:44:12 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:44:12 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:44:12 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:44:12 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:44:12 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:44:12 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:44:12 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Above results are as expected. Note that when thread-affinity is enabled, OpenBLAS uses a number threads equal to the number of physical cores (4) rather than the number of virtual cores (8). The last test above demonstrates that the openblas.set.num.threads()
function works.
Foreach snippet with default OpenBLAS library:
mpstat -P ALL 1
05:58:50 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
05:58:51 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 2 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 3 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 4 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 5 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 6 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:58:51 PM 7 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Foreach snippet with thread-affinity-enabled OpenBLAS library:
mpstat -P ALL 1
05:59:51 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
05:59:52 PM all 12.52 0.00 0.00 0.00 0.00 0.00 0.00 0.00 87.48
05:59:52 PM 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
05:59:52 PM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:59:52 PM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:59:52 PM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:59:52 PM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:59:52 PM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:59:52 PM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:59:52 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Foreach snippet with single-threaded-only OpenBLAS library:
mpstat -P ALL 1
06:02:40 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
06:02:41 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 0 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 2 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 3 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 4 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 5 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 6 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
06:02:41 PM 7 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The three results above show that multi-threading is not harmed by use of the default OpenBLAS library (which has thread-affinity turned off). When thread-affinity is turned on, we see the dysfunctional results that others have observed and complained about.
The final set of tests shows performance results from the benchmark for various BLAS libraries:
Reference BLAS library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 8.34433333333334
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 4.0833333333333
Eigenvalues of a 640x640 random matrix______________ (sec): 0.623999999999986
Determinant of a 2500x2500 random matrix____________ (sec): 2.86333333333331
Cholesky decomposition of a 3000x3000 matrix________ (sec): 3.44600000000003
Inverse of a 1600x1600 random matrix________________ (sec): 2.35433333333333
Default OpenBLAS library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.151666666666667
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.102333333333333
Eigenvalues of a 640x640 random matrix______________ (sec): 0.506
Determinant of a 2500x2500 random matrix____________ (sec): 0.109333333333334
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.121
Inverse of a 1600x1600 random matrix________________ (sec): 0.114666666666667
Thread-affinity OpenBLAS library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.159666666666667
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.108333333333333
Eigenvalues of a 640x640 random matrix______________ (sec): 0.336333333333335
Determinant of a 2500x2500 random matrix____________ (sec): 0.108999999999999
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.124333333333333
Inverse of a 1600x1600 random matrix________________ (sec): 0.124000000000001
Single-threaded-only OpenBLAS library:
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.483666666666667
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.279333333333332
Eigenvalues of a 640x640 random matrix______________ (sec): 0.281333333333333
Determinant of a 2500x2500 random matrix____________ (sec): 0.278999999999999
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.271333333333335
Inverse of a 1600x1600 random matrix________________ (sec): 0.265666666666666
Default OpenBLAS library with openblas.set.num.threads(1):
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 0.482333333333315
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 0.27733333333335
Eigenvalues of a 640x640 random matrix______________ (sec): 0.27066666666669
Determinant of a 2500x2500 random matrix____________ (sec): 0.279666666666647
Cholesky decomposition of a 3000x3000 matrix________ (sec): 0.271666666666666
Inverse of a 1600x1600 random matrix________________ (sec): 0.264666666666661
Observations:
- Multi-threading improves every test, except for Eigenvalues of a 640x640 random matrix. The eigenvalue test is fastest when single-threaded.
- In these tests on a computer with 4 physical cores and 8 virtual cores, multi-threaded OpenBLAS is 2x to 3x faster than single-threaded OpenBLAS (except for the eigenvalues test).
- For the particular hardware and software environment used here, there is no advantage to enabling thread-affinity.
- Performance of the library when compiled for single-threading-only is essentially the same as that of the default (multi-threading) library when the number of threads is set to one at runtime.
Conclusions:
- Use the default OpenBLAS library.
- For performance-critical code, test with varying number of threads by using the
openblas.set.num.threads()
function.