Benchmark Load Balancing

The Problem

I am always looking for ways to turn boring school assignments into fun projects. A recent one, for my Compiler Optimisation course, involves running a handful of SPEC benchmarks over a random set of gcc flag combinations. This presents a slight challenge, as the number of flag combinations times the repetition count for each benchmark adds up to a rather large total execution time. Hence, parallelizing benchmark execution seemed both reasonable and a fun-enough challenge.

My initial approach was to load balance the benchmarks on the school’s 16-core compute machine (2 Intel X5550 quad cores with HyperThreading). This boiled down to generating the necessary commands, making sure each one is pinned to a specific core with taskset, and piping them to xargs -L 1 -P $(fgrep -c name /proc/cpuinfo). However, this approach produced very large deviations, partly because the machine was already under heavy load from other users; distributing the tasks to individual CPUs instead of cores did not give significantly smaller deviations either.
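
Roughly, the generated pipeline looked like the sketch below; benchmarks.txt and the ./benchmark invocation are placeholders for the actual command list:

#!/bin/bash
# Rough sketch of the single-machine version: pin each job to its own
# logical CPU and keep one xargs worker per CPU busy.
# (benchmarks.txt lists one benchmark per line and is only illustrative.)
NCPUS=$(fgrep -c name /proc/cpuinfo)
core=0
while read -r src; do
    echo "$(( core++ % NCPUS )) $src"
done < benchmarks.txt \
  | xargs -L 1 -P $NCPUS sh -c 'taskset -c "$1" ./benchmark "$2"' --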

My next thought was to schedule the benchmarks on the regular login machines, most of which sit idle during the night, or even most of the day. Running on diverse hardware is not an issue, as long as a single benchmark is executed on the same machine for all its flag combinations, since I don’t care about relative performance across benchmarks. One problem with this approach, however, was that it involved finding properly functioning machines that are unlikely to be used, and noting down their host names.

The Solution

So I kept putting it off, until I became aware of a cool undergraduate project. The project lists all login machines and their availability and could not have come at a more convenient time.

I quickly wrote a script to scrape the available page and distribute each benchmark to a separate machine.

The distribute script finds available machines and submits a job to each via ssh. For some reason I could not use screen or tmux, since they would get killed as soon as the ssh session terminated, so I had to resort to good old nohup. Furthermore, I opted for zsh for this one, since I was fed up with the annoying idiosyncrasies of bash when it comes to word splitting; I couldn’t get the HOSTS array to split properly on newlines.

distribute
#!/bin/zsh
# Distribute computation across different machines.

# Parse lines in the following format to retrieve hosts:
# <span class='label label-success'>bazzini.inf.ed.ac.uk</span>
HOSTS=($(wget -O - http://project.shearn89.com/available \
       | sed -rn '/label-success/ s/.*>([^<]+)<.*/\1/p'))
i=1
for src in $@; do
    host=${HOSTS[$(( i++ ))]}
    ssh -n $host "nohup ./runjob $src" &
done
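
For what it is worth, bash should be able to do the same newline-only splitting with mapfile, something along these lines:

# Read one host per line into an array without any word splitting (bash 4+).
mapfile -t HOSTS < <(wget -q -O - http://project.shearn89.com/available \
                     | sed -rn '/label-success/ s/.*>([^<]+)<.*/\1/p')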

The runjob script simply changes into my project directory and executes the benchmark, while logging some information like which benchmark is matched to which machine.

runjob
#!/bin/bash

src=$(basename $1)
cd msc/copt1
joblog="info.$src.$(hostname).log"
date +'%F %T' >> $joblog
cat /proc/cpuinfo  >> $joblog
taskset -c 1 ./benchmark $@
echo $? $@ >> "$joblog"

The benchmark script is responsible for reading available flags, compiling a benchmark and then executing it for a certain number of iterations. After execution is complete, average runtime and standard deviation are calculated with a simple awk script, stats.awk.

benchmark
#!/bin/bash

SRC="$1"
DST="${2:-$(pwd)/out}"
FLAGS="${3:-200-flags}"
TIMES="${4:-12}"

die() { echo $@; exit 1; }

[[ -z "$SRC" ]] && die "usage: $0 src [dst=$DST] [flags=$FLAGS] [times=$TIMES]"

name="$(basename $SRC)"
dst="$DST/$name"
mkdir -p "$dst" || die "Failed to create output directory $dst"

log="$dst/run.log"
buildlog="$dst/build.log"

f=0
while read flags; do
    (( f++ ))
    file="$(printf $dst/%03d.times $f)"
    run=$(printf "$(basename $SRC) %03d" $f)
    echo "$(date +'%F %T') $run $flags" | tee -a "$log" "$buildlog"

    make -s -C "$SRC/src" CFLAGS="$flags" 1>/dev/null 2>>$buildlog \
      || die "Failed to build $SRC"

    pushd "$SRC"
    for i in  $(seq $TIMES); do
        /usr/bin/time --output="$file" --append \
                      --format='r %e k %S u %U csi %c csv %w' \
                      ./run.sh 1>/dev/null 2>>$log
    done
    popd

    echo -e "$name\n$flags\n$(./stats.awk $file)" \
       | tee -a "$dst/results.txt" | tail -n 1
done < $FLAGS

stats.awk
#!/usr/bin/awk -f

/^r/ {
    # Sum kernel/user CPU time and convert to milliseconds.
    cpu = 1000 * ($4 + $6)
    sum += cpu
    ssq += cpu * cpu
    printf("%d ", cpu)
}

END {
    # Print a line with average runtime and standard deviation.
    avg = sum / NR
    var = ssq / NR - avg * avg
    printf("\n%.2f %.2f\n", avg, sqrt(var))
}
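
As an example of what this produces, three runs with CPU times of 100, 110, and 120 ms would yield the line "100 110 120" followed by "110.00 8.16": a 110 ms average and a standard deviation of about 8 ms (the population deviation, since the variance is computed as the mean of squares minus the square of the mean).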

There is no need to move files around, since I am taking advantage of AFS for both the benchmark sources and the output files. Adding the appropriate commands to set up a proper environment on local storage should be trivial, however.

Future Work

The distribution scripts are a bit rough, and assigning jobs to machines arbitrarily is not the best approach. For example, some machines are quad-core i3s at 3.0GHz, while others are dated Core 2 Duos at 1.8GHz. It should be relatively straightforward to retrieve the specs of each machine and assign benchmarks to machines with adequate performance and no load, though ideally such information would be provided in the original listing. For example, the following script generates such a list:

machines
#!/bin/bash

stathosts() {
    # For each host name on stdin, print its CPU model lines and current load.
    while read host; do
        [[ "$host" = "Available" ]] && continue # skip the listing's header entry
        echo $host
        ssh -nT $host 'fgrep name /proc/cpuinfo; uptime; exit'
        echo
    done
}

wget -O - http://project.shearn89.com/available |\
sed -rn '/label-success/ s/.*>([^<]+)<.*/\1/p'  |\
stathosts > ${1:-host.stats}

It is then just a matter of turning this information into a usable heuristic. The benchmarks could also be ranked from slowest to fastest with a script like the following:

rank
#!/bin/bash

for src in $@; do
    make -C $src/src &>/dev/null
    pushd $src &>/dev/null
    # Tag each measured runtime with the benchmark name via process substitution.
    /usr/bin/time --format='%e' --output=>(read t; echo $t $(basename $src)) \
                  taskset -c 0 ./run.sh &>/dev/null
    popd &>/dev/null
done | sort -rgk 1 | cut -d' ' -f 2 # slowest first, names only

Finally, the distribution script assumes there will always be at least as many machines as benchmarks, which might not always be the case.
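
One simple fix would be to wrap around the host list with a modulo, at the cost of placing several benchmarks on the same machine; a rough sketch:

#!/bin/bash
# Sketch: reuse hosts round-robin when benchmarks outnumber machines.
HOSTS=($(wget -O - http://project.shearn89.com/available \
       | sed -rn '/label-success/ s/.*>([^<]+)<.*/\1/p'))
i=0
for src in "$@"; do
    host=${HOSTS[$(( i++ % ${#HOSTS[@]} ))]}
    ssh -n $host "nohup ./runjob $src" &
done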

Update (2012-02-24)

I came up with a heuristic; it is a bit rough, but it does the job. It takes into account the frequency of the CPU, a user-supplied weight based on its type, and the system load. In the end, I decided to stick to a single CPU type, so that my results would be directly comparable across benchmarks. To do that I just set all non-i3 multipliers to 0 in the following rankhost script.

rankhost
#!/bin/bash

# Set the multipliers depending on the processor model.
FACTOR_I3=1.8 # Core i3
FACTOR_CD=1.0 # Core 2 Duo
FACTOR_C2=0.8 # Core 2

cpu_core() {
    fgrep -c name /proc/cpuinfo
}

cpu_freq() {
    name=$(fgrep name /proc/cpuinfo | sed 1q)
    freq=$(echo $name | sed -r 's/.*@[ \t]+([0-9.]+)GHz$/\1/')
    case $name in
        *i3-*)
            factor=$FACTOR_I3 ;;
        *Duo*)
            factor=$FACTOR_CD ;;
        *)
            factor=$FACTOR_C2 ;;
    esac
    echo "( $freq * $factor )"
}

sys_load() {
    cores=$(cpu_core)
    loads=$(uptime | sed 's/.*load average://; s/,/ +/g')
    echo "( ($loads) / (3 * $cores) )"
}

echo "scale=4; $(cpu_freq) /  (10 * (0.1 + $(sys_load)))" | bc

I placed some of the ranking code into a separate file, so as to easily run the functions from the shell, and modified the distribution script accordingly.

functions
#!/bin/bash

rank_spec() {
  for src in $@; do
    make -C $src/src &>/dev/null
    pushd $src &>/dev/null
    /usr/bin/time --format='%e' --output=>(read t; echo $t $(basename $src)) \
                  taskset -c 0 ./run.sh &>/dev/null
    popd &>/dev/null
  done | sort -rgk 1 | cut -d' ' -f 2
}

list_hosts() {
  # Scrape the host names, skipping the listing's "Available" header line.
  wget -O - http://project.shearn89.com/available \
  | sed -rn '/Available/n; /label-success/ s/.*>([^<]+)<.*/\1/p'
}

rank_hosts() {
  # Score each host with ~/rankhost; drop failures and scores below 1
  # (bc prints those without a leading digit), then sort best first.
  list_hosts | while read host; do
    echo $(ssh -nT $host '~/rankhost; exit') $host
  done | sed '/^[^1-9]/d' | sort -rgk 1 | cut -d' ' -f 2
}

cached_hosts() {
  # Rank the hosts once and reuse the cached result on subsequent runs.
  local h=hosts.cache
  [[ -e $h ]] && cat $h || { rank_hosts | tee $h; }
}

distribute-new
#!/bin/bash

source $(dirname $0)/functions

HOSTS=($(cached_hosts))
for src in $@; do
  host=${HOSTS[$(( i++ ))]}
  ssh -n $host "nohup ~/runjob $src 200-flags; exit" &
done
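
Kicking everything off is then just a matter of something like ./distribute-new benchmarks/*, where each argument is a benchmark source directory (the path here is only illustrative).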

To make sure the benchmark results were not affected by AFS or disk I/O latency, I further modified the runjob script to execute the benchmark out of a RAM filesystem, under /dev/shm/.

runjob-new
#!/bin/bash

# Stage everything in a RAM filesystem to keep I/O out of the measurements.
dst=/dev/shm/mike
results=$dst/results
mkdir -p $results

src=$1
flags=$2

cd ~/msc/copt1
# Copy the benchmark sources and the flags file into the RAM filesystem.
cp -r $src $flags $dst
src=$(basename $src)
flags=$dst/$(basename $flags)
# Log host information to the AFS results directory.
log="results/info.$src.$(hostname)"
{ date +'%F %T';
  cat /proc/cpuinfo;
  free -m; } > $log
taskset -c 1 ./src/benchmark $dst/$src $results $flags 15
echo $? $@ >> $log
# Copy the results back to AFS and clean up the RAM filesystem.
cp -r $results/* results && rm -rf $dst || echo "CLEAN UP FAILED"
