Personal Linux R Server (Part III: Leveling Up)

2015-03-12

After completing Part II: Software Setup, I had a fully functional personal Linux R server up and running, with a 2.8 GHz Intel Celeron G1840 CPU. I was ready to upgrade the CPU, as planned from the beginning. But first, redecorating...

(See also Part IV: Final Upgrades for selection and installation of the graphics card.)

Bling

The Corsair 250D case has a transparent window in the top panel, about 9" x 5.5" (22 x 14 cm). It is clearly intended for peering through to admire the wonderfulness within, but all I had going was one tiny LED on the motherboard, to indicate power on. It looked so ... plain.

I was basing my server on gaming and enthusiast components. If nothing else, Respect for the Traditions of the Culture demanded that I make at least a token effort towards case modding.

I consulted the catalogs. Illuminated case fans looked good. The Corsair 250D case comes with one 120mm fan on the side, but it has room for an additional side fan of that size. I could swap out the existing one (next to the window) and then have the option of installing the original 120mm fan in the other space later. Might as well get another Corsair fan. Hoping for a dignified color, I ordered one with purple LEDs.

Corsair AF120 120mm case fan with purple LED

As a rationalist manqué, I was hoping I could use the cover story, "The aftermarket fan is a Quiet Edition. As such, it is much quieter than the stock fan, thus allowing me to concentrate far better on Hard and Serious Problems." No such luck. Here are the specs for the aftermarket fan:

AF120 Quiet Edition LED
Operating Voltage: 7V – 12V

Performance at 12V:

Airflow	52.19 CFM
Static Pressure	.75 mmH2O
Sound Level	25.2 dBA
Speed	1,500 RPM
Power Usage	0.40 A

It took a little work to find information about the stock case fan, which is listed simply as "AF120L" on the 250D parts list, but helpful "CorsairGeorge" reported on a forum:

Also, here's the specs for the AF120L and AF140L

AF120L
Current Draw: .11A Max
RPM: 1300
CFM: 57.13
Static Pressure: .916 mmH2O

AF140L
Current Draw: .12A Max
RPM: 1000
CFM: 67.43
Static Pressure: .764 mmH2O

The noise level testing is still being finalized but it's pretty competitive with aftermarket fans around the same RPM.

With very few exceptions, two fans at the same RPM will perform at similar noise levels. There are very few ways to make this significantly better as it's been perfected for a long time. At lower RPMs motor and bearing noise is more apparent so that's focused on heavily.

and also:

The AF120L and AF140L use the same shroud and impeller design that the AF120 and AF140 retail do. They do differ in the following ways:

- Sleeve bearing
- Lower RPM (stock case buyers are very concerned about noise levels)
- No rubber corners
- No color swappable rings
- Impeller is grey instead of black
- Longer cables

I would have liked the effect of a glowing light filling the entire case, but after I installed it I found that the fan LEDs just illuminate the fan itself. (I think you need to use CCFLs for the full-case glowing effect.) The Celeron doesn't need much cooling, so the fan runs slowly and the LEDs are dim when the computer is idling. The fan glows nicely, however, when the CPU is stressed and heats up and causes the fan to spin faster. Maybe I'll use, "It's a convenient way to visually monitor fan speed" as my cover story.

Processor and Cooler Choices

Here are the upgrade items:

Core i7-4790K Intel processor
Noctua NH-L12 low-profile CPU cooler
ArctiClean thermal paste and cleaning kit

The i7-4790K CPU was the fastest available for this configuration at the time, so that was an easy choice. It has a 4.0 GHz stock speed. Although the "K" suffix means it can be overclocked, that was not my primary intention. The i7-4790 locked (non-overclockable) version has a slower speed of 3.6 GHz, however, so that was reason enough for me to choose the 4790K.

There were multiple choices for an improved cooler. The biggest coolers would not have fit in the Corsair 250D case, but the case was designed to be able to accept the high-end dual-fan:

Corsair H100i liquid CPU cooler

I considered it, for maximum geek-cred, but some users complained about its excessively loud fans. Also, it was not clear how much the inability to run the Windows-only software that accompanies it would be a problem.

The Noctua fans and coolers are known for being especially quiet. I was fairly confident that the dimensions of the NH-L12 would permit its installation in the case. It was not clear, however, if it would still fit if a CD/DVD drive was installed in the designated tray. I decided to install the cooler first, and make the decision about installing an optical drive later.

The Noctua NH-L12 came with its own little tube of thermal paste, so I did not immediately need the ArctiClean paste that came in the 3-piece kit. The two bottles of cleaning solutions proved useful, however, both for cleaning off the paste from the previous Celeron CPU and stock cooler, and for when I had to remove and reinstall the Noctua cooler from the new CPU.

Installation

Once again, a NewEgg TV video proved useful:

CPU Heatsink/Fan Replacement Guide

I was initially wary of the suggestion that a coffee filter be used to clean the surface of the CPU, but the recommendation is pretty universal, including in the detailed Arctic Silver manual:

Arctic Silver Thermal Compound Application Method

Even Good Housekeeping says they are lint-free and good for at least 11 different things besides filtering coffee.

I found it straightforward to follow the Noctua installation instructions. The process required access to the bottom of the motherboard, so I unhooked everything and removed the board from the case. As the instructions suggested might be the case, it was not possible to use the standard "X" bracket on the back of the motherboard due to the location of components on the Mini-ITX board. They supply alternate mounting posts to be used in securing the mounting brackets, and it all worked as instructed.

At this point I made a minor tactical error. While the board was conveniently out of the case, I attached the entire cooler to the board, and then attempted to reinstall the whole thing. This had worked just fine when I was using the stock cooler, but the Noctua cooler was much bigger. I realized the upper fan was blocking access to the mounting screws. I detached the upper fan. One of the mounting screws was still mostly covered by the cooler, so I dug up a right-angle screwdriver to try to tighten it a bit. Then I found it was almost impossible to reattach all the cables with the cooler covering the board. Mistake. I removed the whole motherboard and took off the cooler, leaving only the mounting brackets in place. I got to practice with the coffee filter and cleaning solutions again.

The bright side was that I got to inspect the thermal paste after it had spread out over the CPU by being squished by the cooler's heatsink. I saw that I had used the correct amount (about the size of a small pea, in the center of the CPU); the paste had spread out to the edges of the CPU but no farther.

I reinstalled the board, reattached all the cables, re-dabbed the CPU with thermal paste, and carefully lowered the Noctua cooler (without the upper fan) and tightened it into place. (I had previously removed the side fan, for easier access, but left its cable attached to the motherboard.) All was well this time.

The system passed the smoke test and booted successfully. I rebooted and hit F2 during startup to get into the BIOS control screen, selected the Monitor page, and re-ran Qfan Tuning. With a new CPU fan, this seemed like an important step. The calibration indeed changed:

CPU Fan Min. Duty Cycle (%): 52 -> 15
Chassis Fan 1 Min. Duty Cycle (%): 28 -> 31
Chassis Fan 2 Min. Duty Cycle (%): 31 -> 40

The CPU minimum duty cycle dropped substantially, which is what I would expect from a higher-quality fan. From this point on, no more CPU Fan Errors were observed.

Benchmarks and Testing

i7-4790K Temperatures

The Intel Temperature Guide on tomshardware.com gives the following parameters for the 4790K:

4th Generation 22 Nanometer: 4790K (TDP 88W / Idle 2W)

Standard Ambient = 22C
Tcase (CPU temp) = 74C
CPU / Core offset + 5C
Tjunction (Core temp) = 79C
Tj Max (Throttle temp) = 100C

The Core temperatures above show that mid 70's are safe.

Prime95 / mprime

The mprime tool did not run at first. The problem was some leftover configuration files it had written when a completely different (2-core Celeron) CPU was installed.

Deleting these files caused it to be unconfused:

cd ~/mprime
rm local.txt
rm prime.txt

Immediately after the upgrade, I had not yet learned that mprime versions after 26.6 were not recommended for stress-testing these CPUs, so my initial test was with version 28.5.

Starting the mprime 28.5 Torture Test #1 (Small FFT) caused the highest core temperature to rise to 82°C after 2 minutes. After 5 minutes the highest core temperature was 86°C and I terminated the test. Two minutes after the end of the test, all core temperatures were below 35°C.

One of the cores reported temperatures noticeably lower than the others. Here is a typical sensors output during the previous 5-minute test:

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +84.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +84.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +83.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:        +84.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:        +77.0°C  (high = +80.0°C, crit = +100.0°C)
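
For logging during a stress test, readings like these can be pulled out of the text with a short script. What follows is only a minimal Python sketch, assuming the `sensors` line format shown above. The names `core_temps` and `CORE_LINE` are made up for illustration, and the sample text is pasted in rather than read from a live `sensors` call.

```python
import re

# Sample text copied from the sensors output shown above (not a live reading).
SENSORS_OUTPUT = """\
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +84.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:        +84.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +83.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:        +84.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:        +77.0°C  (high = +80.0°C, crit = +100.0°C)
"""

# Matches lines such as "Core 3:        +77.0°C"; the package line
# ("Physical id 0") is deliberately not matched.
CORE_LINE = re.compile(r"^Core (\d+):\s+([+-]?\d+(?:\.\d+)?)°C", re.MULTILINE)


def core_temps(text):
    """Return {core number: temperature in °C} for each 'Core N:' line."""
    return {int(core): float(temp) for core, temp in CORE_LINE.findall(text)}


temps = core_temps(SENSORS_OUTPUT)
hottest = max(temps.values())
coolest_core = min(temps, key=temps.get)
gap = hottest - temps[coolest_core]
print(f"hottest core: {hottest:.1f}°C")
print(f"coolest core: Core {coolest_core}, {gap:.1f}°C below the hottest")
```

On this sample, the script picks out Core 3 as the one running cooler than the rest.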

Several days later, I repeated the #1 Small FFT test with mprime 26.6. At bootup, the core temperatures were all 30°C or below. After 5 minutes, the peak temperature on any core was a more moderate 75°C. This suggested to me that the new cooling system was doing its job acceptably with the new CPU.

Right after that, I tried a quick check of the unrecommended mprime 28.5 (Small FFT test) again for comparison. After 60 seconds, the peak core temperature was 85°C: essentially the same response as on the first day.

R Benchmark

Results for the upgraded system:

   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  0.63833333333334 
2400x2400 normal distributed random matrix ^1000____ (sec):  0.444666666666658 
Sorting of 7,000,000 random values__________________ (sec):  0.518666666666652 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  7.75766666666667 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  3.79366666666667 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.07894260658742 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.265666666666675 
Eigenvalues of a 640x640 random matrix______________ (sec):  0.584333333333338 
Determinant of a 2500x2500 random matrix____________ (sec):  2.64200000000002 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  3.13666666666667 
Inverse of a 1600x1600 random matrix________________ (sec):  2.19800000000001 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  1.50270496352845 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.437666666666682 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.161333333333346 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  0.66700000000003 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.241666666666674 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.310999999999979 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.320410596702259 


Total time for all 15 tests_________________________ (sec):  23.7983333333334 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.803883247606418 
                      --- End of test ---

By total time and overall mean, the upgraded system is about 80-90% faster than the 2010 Mac Pro. The relative speedup of individual tests varies from 27% to 226%. Maximum core temperature during the benchmark was about 60°C.
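
The summary lines in the benchmark output can be reproduced from the individual timings. The Python sketch below is my own reconstruction from the printed numbers, not code taken from the benchmark script, and the helper names are made up. It assumes that "2 extremes eliminated" means dropping the fastest and slowest of the five tests in a section. One thing it shows: although the last line is labeled as a sum divided by 3, the printed value (0.80) matches the geometric mean of the three trimmed means, not their arithmetic mean (which would be about 0.97).

```python
import math

# Mean timings in seconds, copied from the output above and rounded to 6 decimals.
sections = {
    "I": [0.638333, 0.444667, 0.518667, 7.757667, 3.793667],
    "II": [0.265667, 0.584333, 2.642000, 3.136667, 2.198000],
    "III": [0.437667, 0.161333, 0.667000, 0.241667, 0.311000],
}


def geometric_mean(values):
    """Geometric mean, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def trimmed_geometric_mean(values):
    """Drop the smallest and largest timing, then take the geometric mean."""
    return geometric_mean(sorted(values)[1:-1])


trimmed = {name: trimmed_geometric_mean(v) for name, v in sections.items()}
overall = geometric_mean(list(trimmed.values()))
arithmetic = sum(trimmed.values()) / len(trimmed)
total = sum(sum(v) for v in sections.values())

for name, value in trimmed.items():
    print(f"Section {name}: trimmed geometric mean {value:.4f}")
print(f"Overall (geometric mean of the three): {overall:.4f}")
print(f"Arithmetic mean of the three, for comparison: {arithmetic:.4f}")
print(f"Total time: {total:.4f}")
```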

The motherboard BIOS setup has an option called "EZ System Tuning" which can do "factory custom" overclocking. I switched it from "Normal" to "ASUS Optimal" i.e. fastest. The R benchmark ran about 5% faster:

Total time for all 15 tests_________________________ (sec):  22.6583333333333 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.768417622383206

Peak core temperature went from 60°C to 70°C with overclocking. It didn't seem like a good enough tradeoff, so I switched it back to "Normal."
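
For the record, here is the arithmetic behind "about 5% faster", as a small Python sketch using the two pairs of summary values printed above. The variable names are only for illustration.

```python
# Summary values copied from the two benchmark runs above (seconds).
normal = {"total": 23.7983333333334, "overall": 0.803883247606418}
tuned = {"total": 22.6583333333333, "overall": 0.768417622383206}

# Speedup expressed as how much more work per second the tuned run does.
speedups = {key: (normal[key] / tuned[key] - 1) * 100 for key in normal}

for key, pct in speedups.items():
    print(f"{key}: {pct:.1f}% faster with ASUS Optimal")
```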

Linpack

I tried linpack by running:

./runme_xeon64

Core temperatures oscillated between 60°C and an unhealthy 88°C over the 6 minutes I let it run. Results here:

Wed Mar 11 21:54:41 PDT 2015
Intel(R) Optimized LINPACK Benchmark data

Current date/time: Wed Mar 11 21:54:41 2015

CPU frequency:    4.396 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 8

Parameters are set to:

Number of tests: 15
Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1    

Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.013      52.8952  6.294965e-13 2.146745e-02   pass
1000   1000   4      0.008      87.6100  6.294965e-13 2.146745e-02   pass
1000   1000   4      0.007      94.7303  6.294965e-13 2.146745e-02   pass
1000   1000   4      0.007      94.5843  6.294965e-13 2.146745e-02   pass
2000   2000   4      0.085      62.4816  2.861267e-12 2.488952e-02   pass
2000   2000   4      0.085      62.6455  2.861267e-12 2.488952e-02   pass
5000   5008   4      0.604      137.9428 1.648793e-11 2.299109e-02   pass
5000   5008   4      0.579      144.1230 1.648793e-11 2.299109e-02   pass
10000  10000  4      3.417      195.1857 6.318562e-11 2.227989e-02   pass
10000  10000  4      3.458      192.8683 6.318562e-11 2.227989e-02   pass
15000  15000  4      10.896     206.5316 1.436672e-10 2.262782e-02   pass
15000  15000  4      10.916     206.1640 1.436672e-10 2.262782e-02   pass
18000  18008  4      18.424     211.0646 2.209714e-10 2.419908e-02   pass
18000  18008  4      18.448     210.7844 2.209714e-10 2.419908e-02   pass
20000  20016  4      24.965     213.6650 2.873690e-10 2.543845e-02   pass
20000  20016  4      24.900     214.2264 2.873690e-10 2.543845e-02   pass

Wow, over 200 GFLOPS at the top run.
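
The GFlops column follows from the problem size and the elapsed time. A minimal Python sketch, assuming the standard LINPACK operation count of 2/3 n^3 + 2 n^2 floating-point operations for solving an n x n system; the function name is hypothetical. Small differences from the reported figures are expected, since the times in the table are rounded to milliseconds.

```python
def linpack_gflops(n, seconds):
    """GFLOPS for solving an n x n dense system in the given time,
    using the standard LINPACK count of 2/3*n^3 + 2*n^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9


# (size, time in seconds, GFlops as reported) taken from the results above.
rows = [
    (10000, 3.417, 195.1857),
    (15000, 10.896, 206.5316),
    (20000, 24.900, 214.2264),
]

computed = [linpack_gflops(n, t) for n, t, _ in rows]
for (n, t, reported), value in zip(rows, computed):
    print(f"n={n}: computed {value:.1f} GFLOPS, reported {reported:.1f}")
```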

I ran linpack on the 2010 Mac Pro as well:

./runme64

Here are those results:

Intel(R) Optimized LINPACK Benchmark data

Current date/time: Thu Mar 12 08:29:10 2015

CPU frequency:    3.058 GHz
Number of CPUs: 1
Number of cores: 4
Number of threads: 8

Parameters are set to:

Number of tests: 9
Number of equations to solve (problem size) : 15000 14000 13000 12000 11000 10000 8000  6000  1000 
Leading dimension of array                  : 15000 14008 13000 12008 11000 10008 8008  6008  1000 
Number of trials to run                     : 1     2     2     2     2     2     2     3     4    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4    

Maximum memory requested that can be used=1800304096, at the size=15000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
15000  15000  4      63.857     35.2419  2.298756e-10 3.620579e-02   pass
14000  14008  4      52.667     34.7413  1.953428e-10 3.527683e-02   pass
14000  14008  4      51.453     35.5609  1.953428e-10 3.527683e-02   pass
13000  13000  4      42.386     34.5633  1.701041e-10 3.559301e-02   pass
13000  13000  4      42.477     34.4890  1.701041e-10 3.559301e-02   pass
12000  12008  4      32.230     35.7517  1.230560e-10 3.020391e-02   pass
12000  12008  4      33.609     34.2851  1.230560e-10 3.020391e-02   pass
11000  11000  4      25.862     34.3202  1.031222e-10 3.008505e-02   pass
11000  11000  4      25.430     34.9020  1.031222e-10 3.008505e-02   pass
10000  10008  4      21.929     30.4108  9.915883e-11 3.496441e-02   pass
10000  10008  4      19.512     34.1780  9.915883e-11 3.496441e-02   pass

The fastest run on the older Mac was about 35 GFLOPS. I used Hardware Monitor to watch the CPU temperatures on the Mac. It hit a peak of 86°C before I stopped the test.

I was a bit lazy in running the above tests. I just used the default demo parameters for the OS X and Linux packages. More careful reading of the documentation indicates that best performance is obtained by setting the number of threads equal to the number of cores. There are also some suggestions for picking optimal values of LDA.

I did try forcing the number of threads to 4 (one per physical core), which is done by setting this environment variable before running the benchmark:
export OMP_NUM_THREADS=4

That didn't appear to make a significant difference for the maximum GFLOPS, although it did help some for tests at smaller sizes.

So that's a factor of six speedup for the new system over the Mac Pro. Does it make sense?

I think so. An article from 2012 (Intel’s Haswell is an unprecedented threat to Nvidia, AMD) helped me to understand. The theoretical-maximum GFLOPS for the architectures used by the 2010 Mac Pro and the new 4790K are calculated as follows:

Architecture	Calculation	Value
Nehalem	(3.058 GHz) x (4 cores) x (4 DP FLOPS / cycle)	49 GFLOPS
Haswell	(4.396 GHz) x (4 cores) x (16 DP FLOPS / cycle)	281 GFLOPS

In both cases, the best Linpack results are roughly three-quarters of the theoretical maxima. The big boost that Haswell gets is due to the AVX2 & FMA instruction sets.
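
The table and the "three-quarters" estimate can be checked with a few lines of Python. This is just a sketch of the arithmetic: the clock speeds are the ones Linpack reported, the per-cycle figures are the ones from the table, and the best measured GFlops are taken from the two result listings above. The function and variable names are made up.

```python
def peak_gflops(ghz, cores, dp_flops_per_cycle):
    """Theoretical double-precision peak: clock x cores x FLOPS per cycle."""
    return ghz * cores * dp_flops_per_cycle


# "best" is the highest GFlops value in each Linpack listing above.
systems = {
    "Nehalem": {"ghz": 3.058, "flops_per_cycle": 4, "best": 35.7517},
    "Haswell": {"ghz": 4.396, "flops_per_cycle": 16, "best": 214.2264},
}

summary = {}
for name, s in systems.items():
    peak = peak_gflops(s["ghz"], 4, s["flops_per_cycle"])
    summary[name] = {"peak": peak, "efficiency": s["best"] / peak}
    print(f"{name}: peak {peak:.0f} GFLOPS, measured {s['best'] / peak:.0%} of peak")

speedup = systems["Haswell"]["best"] / systems["Nehalem"]["best"]
print(f"measured speedup: {speedup:.1f}x")
```

Of the roughly sixfold measured speedup, a factor of four comes from the per-cycle figure (16 versus 4) and the rest mostly from the higher clock.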

Linpack itself is a highly-optimized bit of code whose performance is not expected to be typical of most real-world problems. It's encouraging, however, to think about the potential power that's available if the right problem should happen to come along.

What's Next

A few additional components, including a graphics card, are to be discussed in Part IV: Final Upgrades.
