PostgreSQL Server Benchmarks: Part Three — Memory

03 Mar 2011

Good day, interfriends! Welcome to part three of my series on benchmarking Estately’s new database server. In this post I’ll discuss the tools, methodology, and results of the memory benchmarks. If you’re just joining our program already in progress, you may want to go back and skim Part One, which describes the background of the project.


As you may recall from Part One, the server has 96GB of RAM, made up of 12x 8GB sticks running at 1066MHz. The goal of this exercise is to find the BIOS settings that yield optimal RAM performance in synthetic benchmarks. For memory testing, I used memtest86+ v4.20 running from a USB stick.

memtest is single-threaded, so it only tests one CPU’s ability to access RAM. In order to test a situation closer to what we would see in production, I turned to a tool called STREAM which was originally designed for testing high-performance computing applications on very large machines.

Greg Smith (who you may remember from such parts as Part One) wrote a tool called stream-scaling that wraps STREAM to automate testing your system. Here’s a quote from the stream-scaling documentation:

stream-scaling automates running the STREAM memory bandwidth test on Linux systems. It detects the number of CPUs and how large each of their caches are. The program then downloads STREAM, compiles it, and runs it with an array size large enough to not fit into cache. The number of threads is varied from 1 to the total number of cores in the server, so that you can see how memory speed scales as cores involved increase.

The plan was to boot into memtest, let it run for 10 minutes, and record the results. Then I’d boot into the OS and run stream-scaling.

A Nasty Surprise

How did that actually go? Like so:

Here’s a little snippet from memtest’s output:

      Memtest86+ v4.20
Core i7 (32nm) 2394 MHz
L1 Cache:   32K  79801 MB/s
L2 Cache:  256K  31500 MB/s
L3 Cache: 12288  21963 MB/s
Memory  :   64G   5853 MB/s

Can you spot the problem? 64GB of RAM? No. Pretty sure I’ve got 96GB in there. I counted it myself. I went into the BIOS, opened Memory Settings, and saw that it was set to “Advanced ECC Mode”.

As I understand it (i.e., not that well), Advanced ECC Mode is kind of like RAID for your RAM. It basically bonds channels on the memory controllers together, which yields a wider, 128-bit bus. I don’t know why you’d want this; presumably there is a good reason. The end result, though, is that you end up using only two of the three memory controllers on the CPU, which means that four slots on the motherboard don’t do anything.

Recovering From Said Surprise, A Plan Emerges

Long story short, an alternate setting called “Optimizer Mode” causes the memory controllers to operate independently, thus giving us access to all 12 RAM slots and generally yielding faster performance for normal use. I made this change, re-ran memtest, and was back to 96GB of RAM.
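The arithmetic behind the missing 32GB is easy to check. Here’s a quick sketch using only the numbers from this post (12 slots, 8GB sticks, two of three memory controllers active in Advanced ECC Mode); the constant and function names are just illustrative:

```ruby
# Numbers come from the post; names are illustrative only.
SLOTS_TOTAL = 12               # DIMM slots on the motherboard
STICK_GB    = 8                # size of each stick, in GB
ECC_SHARE   = Rational(2, 3)   # Advanced ECC Mode: two of three controllers in use

# Usable RAM given the share of slots the memory controllers can actually reach.
def usable_gb(slots, stick_gb, share)
  (slots * share * stick_gb).to_i
end

# Slots left idle for a given share.
def idle_slots(slots, share)
  (slots - slots * share).to_i
end

puts "Advanced ECC Mode: #{usable_gb(SLOTS_TOTAL, STICK_GB, ECC_SHARE)}GB"
puts "Optimizer Mode:    #{usable_gb(SLOTS_TOTAL, STICK_GB, 1)}GB"
puts "Idle slots in Advanced ECC Mode: #{idle_slots(SLOTS_TOTAL, ECC_SHARE)}"
```

That lines up with what memtest reported: 64GB visible in Advanced ECC Mode, and the full 96GB in Optimizer Mode.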

There was another setting in the BIOS that looked interesting. Dell calls it “Node Interleaving”. Allow me to quote from the Dell™ PowerEdge™ R610 Systems Hardware Owner’s Manual:

If this field is Enabled, memory interleaving is supported if a symmetric memory configuration is installed. If Disabled, the system supports Non-Uniform Memory architecture (NUMA) (asymmetric) memory configurations.

I’m just going to go ahead and admit that I don’t understand NUMA at all. If you do, email me an explanation (address in the footer) and I’ll add it here. Some basic googling indicates that one might see higher performance with interleaving enabled, and a brief foray into the PostgreSQL mailing lists turns up more people complaining about NUMA than praising it.
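For what it’s worth, on Linux you can at least see how many NUMA nodes the kernel thinks it has, which is a handy way to confirm that a BIOS change took effect. The sketch below assumes the standard sysfs layout under /sys/devices/system/node; the function name is made up for this example. My understanding (an assumption, not something I verified on this server) is that with Node Interleaving disabled you would typically see one node per CPU socket, and with it enabled the memory is presented as a single node.

```ruby
# Count the NUMA nodes that Linux exposes through sysfs.
# The default path is the usual Linux location; it is a parameter so the
# function can be pointed at another directory (for example, in a test).
def numa_node_count(base = "/sys/devices/system/node")
  Dir.glob(File.join(base, "node[0-9]*")).count { |path| File.directory?(path) }
end

count = numa_node_count
if count.zero?
  puts "No NUMA information found (not Linux, or no NUMA support in the kernel)"
else
  puts "NUMA nodes visible to the OS: #{count}"
end
```

Running it once with each BIOS setting should show whether the OS sees the difference.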

Since Optimizer Mode is the only way to access all of the RAM, that left testing with interleaving both enabled and disabled. I went back to my original plan to get results from memtest and then boot into the OS and run stream-scaling like so:

$ rm -f stream && ./stream-scaling | tee results

stream-scaling automatically builds the stream binary with correct settings for the system as currently configured, so the first step is to remove the binary if one exists. I followed this procedure with interleaving disabled, then enabled interleaving and followed it again.

Initial Results

Here’s what memtest said:

I was pretty surprised by these results. I wasn’t expecting to see a 1.5GB/s jump just by enabling interleaving. Of course, as I mentioned above, memtest only gives one piece of the puzzle…

stream-scaling runs a transfer rate benchmark starting with a single core and ramping up to the total number of cores in your system… in my case, 16. Let’s see what that looks like:

This is where things start to get interesting. For one, the speeds are much, much faster than memtest. Not necessarily surprising as it’s a different tool with different internals. What’s more surprising is that while memtest showed a considerable speed boost with interleaving enabled, stream-scaling doesn’t agree.

A Better Test

I was also a bit concerned about the fluctuations in the data, particularly around 5 and 8 cores. I figured that this was likely due to other processes on the system getting in the way of getting a “clean” result. The simplest way I could think to address that was to run the benchmarks multiple times and average out the results. I decided to run each benchmark twenty times, so I wrote a script called multi-stream-scaling to automate this process:

#!/usr/bin/env ruby
# Runs stream-scaling repeatedly and saves each run's output to its own file.
$stdout.sync = true

count, title = ARGV[0,2]
unless count and title
  abort "Usage: #{$0} [run count] [test title]"
end

puts "Preparing..."
system "rm -f stream" # force stream-scaling to rebuild the binary

puts "Running warmup..."
system "./stream-scaling > /dev/null"

count.to_i.times do |run|
  filename = "#{title}_run_#{run + 1}"

  print "Starting run #{run + 1} of #{count}..."
  system "./stream-scaling > #{filename}"
  puts "completed. Results written to #{filename}"
end

This script does a few things. First, it deletes the stream binary, which forces stream-scaling to rebuild it. Next, it runs stream-scaling once to warm up the system. Then it runs stream-scaling count times in succession, storing the output of each run in a file named “<title>_run_<n>”, where n is the run number. To run it, just give a number of runs and a title to use for the reports:

$ ./multi-stream-scaling 20 interleaving_disabled

In order to get consistent results, I rebooted before running multi-stream-scaling the first time. Once the runs were done, I used the results-parsing script that ships with stream-scaling to process the output. It’s intended to output data suitable for plotting with gnuplot, but it was useful for my purposes as well.

I wrote another script called cruncher.rb that ran each results file through that parser and averaged the results, outputting CSV with the number of cores, average transfer rate, and standard deviation.

#!/usr/bin/env ruby
# Averages the per-core results across every run of a report and prints CSV.

module Enumerable
  def sum
    return self.inject(0) {|a,e| a + e.to_f }
  end

  def mean
    return self.sum / self.size
  end

  # Population standard deviation: sqrt of the mean squared deviation
  def std_dev
    avg = self.mean
    return Math.sqrt( self.map {|n| (n - avg) ** 2 }.mean )
  end
end

title = ARGV[0]
unless title
  abort "Usage: #{$0} [report_name]"
end

results = {}

Dir[ "#{title}*" ].each do |file|
  # Pipe the run's output through stream-scaling's results parser
  lines = IO.popen( "cat #{file} | ./" ).readlines
  lines.shift # remove header comment
  lines.map {|l| l.split }.each do |cores, result|
    results[ cores.to_i ] ||= []
    results[ cores.to_i ] << result.to_f
  end
end

puts "cores,avg,stddev"
results.sort_by {|k,v| k }.each do |cores, values|
  puts [ cores, values.mean, values.std_dev ].join(",")
end

To run this, just provide the name of the report you used when you ran multi-stream-scaling:

$ ./cruncher.rb interleaving_disabled
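One detail worth spelling out: the standard deviation computed here is the population form (the square root of the mean squared deviation), not the sample form that divides by n - 1. Here’s a small standalone sketch of the same math using toy numbers picked to make the arithmetic easy to follow. They are not real transfer rates, and the standalone function names are just for illustration.

```ruby
# Mean of a list of numbers.
def mean(values)
  values.inject(0.0) { |acc, v| acc + v } / values.size
end

# Population standard deviation: sqrt of the mean of squared deviations.
def std_dev(values)
  avg = mean(values)
  Math.sqrt(mean(values.map { |v| (v - avg) ** 2 }))
end

# Toy values, NOT benchmark results: the mean is 5 and the squared
# deviations sum to 32, so the variance is 32 / 8 = 4.
toy_rates = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

puts "cores,avg,stddev"
puts [1, mean(toy_rates), std_dev(toy_rates)].join(",")
```

With twenty runs per core count the difference between the two forms is small, but it’s good to know which one you’re looking at when comparing against other tools.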

Results Revisited

Let’s overlay the results from multi-stream-scaling on top of the results from before:

This tells me a couple of things:

Overall, it’s an encouraging result. Our single-run results are nearly identical to our twenty-run results. It also confirms that there was no funny business going on before; interleaving really is a bit slower.
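If you’d rather put a number on “a bit slower”, the two CSVs from cruncher.rb can be compared directly. The sketch below is one way to do that, assuming the “cores,avg,stddev” format shown earlier. The function names are made up for this example, and the inline numbers are placeholders for illustration, not measurements from this server.

```ruby
# Parse cruncher.rb-style CSV text ("cores,avg,stddev") into { cores => avg }.
def averages(csv_text)
  csv_text.lines.drop(1).each_with_object({}) do |line, acc|
    next if line.strip.empty?
    cores, avg, _stddev = line.strip.split(",")
    acc[cores.to_i] = avg.to_f
  end
end

# Percent change of the "enabled" averages relative to "disabled",
# for every core count present in both reports.
def percent_change(disabled, enabled)
  disabled.keys.sort.each_with_object({}) do |cores, acc|
    next unless enabled.key?(cores)
    delta = enabled[cores] - disabled[cores]
    acc[cores] = (delta / disabled[cores] * 100).round(2)
  end
end

# Placeholder data for illustration only; substitute the real CSV output.
disabled_csv = "cores,avg,stddev\n1,10000.0,50.0\n2,20000.0,80.0\n"
enabled_csv  = "cores,avg,stddev\n1,9500.0,60.0\n2,19000.0,90.0\n"

percent_change(averages(disabled_csv), averages(enabled_csv)).each do |cores, pct|
  puts "#{cores} core(s): #{pct}% with interleaving enabled"
end
```

A negative percentage means interleaving was slower at that core count.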

Curious, I did some googling and quickly came across a white paper published by Dell called Optimal BIOS Settings for High Performance Computing with PowerEdge 11G Servers that describes some testing they performed and the conclusions they reached. It’s worth noting that they’re talking about HPC as formally defined so it’s not directly relevant to my use case, but I figured it couldn’t hurt to look it over.

Here’s a choice quote:

… node interleaving helped performance in three out of nine benchmarks by 2-4% and hurt performance on four of nine benchmarks by 4-13%. In summary, node interleave rarely helped, and helped very little when it did therefore, node interleaving should be disabled for typical HPC workloads.

The whitepaper finds that interleaving is not helpful for “typical HPC workloads” and advises that it be left off unless application-specific benchmarking shows that it’s helpful. Between that recommendation, the results shown above, and the fact that disabled is the default, I decided to proceed with my testing with interleaving disabled.

Summary, and Next Time on…

The tl;dr version is that you need to check two things if you just picked up a big fat Dell server:

If you’re interested, you can look at the raw output of the tests in three gists:

In Part Four, I’ll be working through a similar process to find the optimal CPU settings.
