alpaastero

Libav and Free Software development


Leave a comment

Microbenchmarking: a null result

My first patch for undefined behavior eliminates left shifts of negative numbers, replacing a << b (where a can be negative) with a * (1 << b). This change fixes bug686, at least for fate-idct8x8 and libavcodec/dct-test -i (compiled with ubsan and fno-sanitize-recover). Due to Libav policy, the next step is to benchmark the change. I was also asked to write a simple benchmarking HowTo for the Libav wiki.

First, I installed perf: sudo aptitude install linux-tools-generic
I made two build directories, and built the code with defined behavior in one, and the code with undefined behavior in the other (with ../configure && make -j8 && make fate). Then, in each directory, I ran:

perf stat --repeat 150 ./libavcodec/dct-test -i > /dev/null

The results were somewhat more stable than with –repeat 30, but it still looks much more like noise than a meaningful result. I ran the command with –repeat 30 for both before the recorded 150 run, so both would start on equal footing. With defined behavior, the results were “0.121670022 seconds time elapsed ( +-  0.11% )”; with undefined behavior, “0.123038640 seconds time elapsed ( +-  0.15% )”. The best of a further three runs had the opposite result, shown below:

% cat undef.150.best

perf stat –repeat 150 ./libavcodec/dct-test -i > /dev/null

Performance counter stats for ‘./libavcodec/dct-test -i’ (150 runs):

120.427535 task-clock (msec) # 0.997 CPUs utilized ( +- 0.11% )
21 context-switches # 0.178 K/sec ( +- 1.88% )
0 cpu-migrations # 0.000 K/sec ( +-100.00% )
226 page-faults # 0.002 M/sec ( +- 0.01% )
455’393’772 cycles # 3.781 GHz ( +- 0.05% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1’306’169’698 instructions # 2.87 insns per cycle ( +- 0.00% )
89’674’090 branches # 744.631 M/sec ( +- 0.00% )
1’144’351 branch-misses # 1.28% of all branches ( +- 0.18% )

0.120741498 seconds time elapse

% cat def.150.best

Performance counter stats for ‘./libavcodec/dct-test -i’ (150 runs):

120.838976 task-clock (msec) # 0.997 CPUs utilized ( +- 0.11% )
21 context-switches # 0.172 K/sec ( +- 1.98% )
0 cpu-migrations # 0.000 K/sec
226 page-faults # 0.002 M/sec ( +- 0.01% )
457’077’626 cycles # 3.783 GHz ( +- 0.08% )
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
1’306’321’521 instructions # 2.86 insns per cycle ( +- 0.00% )
89’673’780 branches # 742.093 M/sec ( +- 0.00% )
1’148’393 branch-misses # 1.28% of all branches ( +- 0.11% )

0.121162660 seconds time elapsed ( +- 0.11% )

I also compared the disassembled code from jrevdct.o, before and after the changes to have defined behavior (using gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2 on x86_64).

In the build directory for the code with defined behavior:
objdump -d libavcodec/jrevdct.o > def.dis
sed -e 's/^.*://' def.dis > noline.def.dis

In the build directory for the code with undefined behavior:
objdump -d libavcodec/jrevdct.o > undef.dis
sed -e 's/^.*://' undef.dis > noline.undef.dis

Leaving aside difference in jump locations (despite the fact that they can impact performance), there are two differences:

diff -u build_benchmark_undef/noline.undef.dis build_benchmark_def/noline.def.dis

–       0f bf 50 f0             movswl -0x10(%rax),%edx
+       0f b7 58 f0             movzwl -0x10(%rax),%ebxi

It’s switched to using a zero-extension rather than sign-extension in one place.

–       74 1c                   je     40 <ff_j_rev_dct+0x40>
–       c1 e2 02                shl    $0x2,%edx
–       0f bf d2                movswl %dx,%edx
–       89 d1                   mov    %edx,%ecx
–       0f b7 d2                movzwl %dx,%edx
–       c1 e1 10                shl    $0x10,%ecx
–       09 d1                   or     %edx,%ecx
–       89 48 f0                mov    %ecx,-0x10(%rax)
–       89 48 f4                mov    %ecx,-0xc(%rax)
–       89 48 f8                mov    %ecx,-0x8(%rax)
–       89 48 fc                mov    %ecx,-0x4(%rax)
+       74 19                   je     3d <ff_j_rev_dct+0x3d>
+       c1 e3 02                shl    $0x2,%ebx
+       89 da                   mov    %ebx,%edx
+       0f b7 db                movzwl %bx,%ebx
+       c1 e2 10                shl    $0x10,%edx
+       09 da                   or     %ebx,%edx
+       89 50 f0                mov    %edx,-0x10(%rax)
+       89 50 f4                mov    %edx,-0xc(%rax)
+       89 50 f8                mov    %edx,-0x8(%rax)
+       89 50 fc                mov    %edx,-0x4(%rax)

Leaving aside differences in register use:

–       0f bf d2                movswl %dx,%edx
There is one extra movswl instruction in the version with undefined behavior, at least with the particular version of the particular compiler for the particular architecture checked.

This is an example of a null result while benchmarking; neither version performs better, although any given benchmark has one or the other come out ahead, generally by less than the variance within the run. If this were a suggested performance change, it would not make sense to apply it. However, the point of this change was correctness; a performance increase is not expected, and the lack of a performance penalty is a bonus.