SLP Vectorizer: Part 2 - Performance of LLVM SLP Vectorizer

Page content

在上一篇文章 SLP Vectorizer: Part 1 - Implementation in LLVM 中我们学习 LLVM 的 SLP Vectorizer 实现,本文基于 SPEC CPU 2017 来评估 SLP Vectorizer 的优化效果。

Introduction

在上一篇文章 SLP Vectorizer: Part 1 - Implementation in LLVM 的最后,我提到在 2013 LLVM Developers’ Meeting: “Vectorization in LLVM” 的最后有人问道:

Q: Do you happen to know how the speed ups from the two vectorizers (Loop Vectorizer and SLP Vectorizer) split out? One of them is more important or is it kind of equal?

A: The loop vectorizer is a lot more important.

Loop Vectorizer 于 LLVM 3.2 引入,于 LLVM 3.3 默认开启;SLP Vectorizer 于 LLVM 3.3 引入,于 LLVM 3.4 默认开启。现在距离 2013 年引入 SLP Vectorizer 已经有 10 年了,SLP Vectorizer 是否还是像 Nadav Rotem 在 2013 LLVM DevMtg 上说的那样(与 Loop Vectorizer 相比)只能带来很有限性能提升呢 。本着“实践是检验真理的唯一标准”的原则,本文通过实验来分析 SLP Vectorizer 能带来多少性能提升。

本文的测试基于如下环境:

  • LLVM trunk: b399c84

  • Processor: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz (96 Cores)

  • Memory: 376 GB

Benchmark

LLVM test-suite 提供了 benchmark 程序可以测试运行时间、编译时间和代码大小等指标。例如,在 MicroBenchmarks/ 目录下就提供了基于 google-benchmark 编写的用于测试 SLP Vectorization, Loop Vectorization 的程序。LLVM test-suite 同样支持运行 SPEC 这样的 external suites

本文就基于 SPEC CPU 2017 测试 SLP Vectorizer。

  1. 安装 SPEC CPU 2017,执行 $SPEC_CPU2017_PATH/install.sh 会输出如下内容:

    SPEC CPU2017 Installation
    
    Top of the CPU2017 tree is '$SPEC_CPU2017_PATH'
    
    Installing FROM $SPEC_CPU2017_PATH
    Installing TO $SPEC_CPU2017_PATH
    
    Is this correct? (Please enter 'yes' or 'no')
    
  2. 配置 llvm-test-suite 使用 SPEC CPU 2017 测试:

    % git clone git@github.com:llvm/llvm-test-suite.git
    % mkdir llvm-test-suite-build
    % cd llvm-test-suite-build
    % cmake -GNinja -DCMAKE_C_COMPILER=$LLVM_BUILD_PATH/bin/clang \
            -C../llvm-test-suite/cmake/caches/O3.cmake \
            -DTEST_SUITE_BENCHMARKING_ONLY=ON \
            -DTEST_SUITE_SUBDIRS=External \
            -DTEST_SUITE_SPEC2017_ROOT=$SPEC_CPU2017_PATH/ \
            -DTEST_SUITE_RUN_TYPE=test \
            -DTEST_SUITE_COLLECT_STATS=ON \
            -DTEST_SUITE_USE_PERF=ON \
            -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
            ../llvm-test-suite
    
  3. 编译 SPEC CPU 2017 benchmark & 运行:

    % ninja
    
    % llvm-lit -v -j 1 -o results.json .
    
  4. 执行脚本 llvm-test-suite/utils/compare.py 来展示和比较运行结果:

    # Make sure pandas and scipy are installed. Prepend `sudo` if necessary.
    % pip install pandas scipy
    # Show a single result file:
    % llvm-test-suite/utils/compare.py results.json
    # Compare two result files:
    % llvm-test-suite/utils/compare.py results_a.json results_b.json
    

Clang 在 O3 优化下,Loop Vectorizer 和 SLP Vectorizer 都会默认开启,并且是先执行 Loop Vectorizer 然后再执行 SLP Vectorizer,见 PassBuilderPipelines.cpp。Clang 提供了 -fno-vectorize 选项来关闭 Loop Vectorizer,-fno-slp-vectorize 选项关闭 SLP Vectorizer。

因此为了测试 SLP Vectorizer 的效果,进行了如下测试:

  • 一同关闭 Loop Vectorizer 和 SLP Vectorizer 运行 SPEC 2017 即 O3 -fno-vectorize -fno-slp-vectorize

  • 开启 Loop Vectorizer 且关闭 SLP Vectorizer 运行 SPEC 2017 即 O3 -fno-vectorize

  • 一同开启 Loop Vectorizer 和 SLP Vectorizer 运行 SPEC 2017 即 O3

O3 -fno-vectorize -fno-slp-vectorize

关闭 Loop Vectorizer 和 SLP Vectorizer 运行 SPEC 2017 的结果如下:

Program                                                                        exec_time
                                                                               O3-no-LV-SLP.results
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  44.94
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  44.70
             test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test  18.72
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  12.14
              test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test  11.19
               test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  11.18
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  10.54
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  10.50
            test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test   9.78
             test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test   9.75
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   7.21
         test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   5.27
        test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   5.27
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test   5.14
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   5.05
     test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   2.64
    test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   2.61
              test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   2.27
              test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   1.57
             test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   1.56
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test   0.32
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test   0.17
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test   0.06
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test   0.06
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test   0.01
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test   0.01
  test-suite :: External/SPEC/CFP2017rate/997.specrand_fr/997.specrand_fr.test   0.01
 test-suite :: External/SPEC/CINT2017rate/999.specrand_ir/999.specrand_ir.test   0.01
 test-suite :: External/SPEC/CFP2017speed/996.specrand_fs/996.specrand_fs.test   0.01
test-suite :: External/SPEC/CINT2017speed/998.specrand_is/998.specrand_is.test   0.01
             test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   0.01
            test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   0.0

O3 -fno-slp-vectorize

开启 Loop Vectorizer 且关闭 SLP Vectorizer 运行 SPEC 2017 的结果如下:

Program                                                                        exec_time
                                                                               O3-no-SLP.results
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  38.35
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  38.31
             test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test  18.44
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  12.01
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  11.33
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  11.29
               test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  11.14
              test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test  11.12
             test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test  10.66
            test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test  10.59
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   7.21
         test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   5.29
        test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   5.29
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test   5.01
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   4.99
    test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   2.66
     test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   2.65
              test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   2.25
             test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   1.58
              test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   1.58
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test   0.32
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test   0.17
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test   0.06
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test   0.06
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test   0.01
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test   0.01
 test-suite :: External/SPEC/CFP2017speed/996.specrand_fs/996.specrand_fs.test   0.01
test-suite :: External/SPEC/CINT2017speed/998.specrand_is/998.specrand_is.test   0.01
  test-suite :: External/SPEC/CFP2017rate/997.specrand_fr/997.specrand_fr.test   0.01
 test-suite :: External/SPEC/CINT2017rate/999.specrand_ir/999.specrand_ir.test   0.01
            test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   0.01
             test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   0.01

O3

一同开启 Loop Vectorizer 和 SLP Vectorizer 运行 SPEC 2017 的结果如下:

Program                                                                        exec_time
                                                                               O3.results
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  24.54
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  24.52
             test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test  18.90
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  12.11
               test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  11.24
              test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test  11.22
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  10.68
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  10.57
             test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test  10.52
            test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test  10.51
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   7.21
         test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   5.30
        test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   5.29
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test   5.09
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   4.25
     test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   2.63
    test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   2.61
              test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   2.28
              test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   1.55
             test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   1.52
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test   0.33
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test   0.17
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test   0.06
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test   0.06
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test   0.01
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test   0.01
 test-suite :: External/SPEC/CFP2017speed/996.specrand_fs/996.specrand_fs.test   0.01
  test-suite :: External/SPEC/CFP2017rate/997.specrand_fr/997.specrand_fr.test   0.01
 test-suite :: External/SPEC/CINT2017rate/999.specrand_ir/999.specrand_ir.test   0.01
test-suite :: External/SPEC/CINT2017speed/998.specrand_is/998.specrand_is.test   0.01
             test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   0.01
            test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   0.01

Performance: Loop Vectorizer v.s. SLP Vectorizer

通过比较 O3 -fno-vectorize -fno-slp-vectorize, O3 -fno-vectorizeO3 这 3 次运行,对比 Loop Vectorizer 和 SLP Vectorizer 的优化效果。

  • Loop Vectorizer

    % python3 ~/llvm-test-suite/utils/compare.py \
      O3-no-LV-SLP.results.json O3-no-SLP.results.json \
      --no-abs-sort --absolute-diff --filter-short=1
    
    Tests: 32
    Short Running: 12 (filtered out)
    Remaining: 20
    Metric: exec_time
    
    Program         exec_time
                    O3-no-LV-SLP.results O3-no-SLP.results diff
         525.x264_r  44.94                38.31             -6.64
         625.x264_s  44.70                38.35             -6.35
          505.mcf_r   9.75                10.66              0.91
          605.mcf_s   9.78                10.59              0.82
    600.perlbench_s  10.50                11.29              0.79
    500.perlbench_r  10.54                11.33              0.79
          619.lbm_s  18.72                18.44             -0.28
       510.parest_r   5.14                 5.01             -0.13
    631.deepsjeng_s  12.14                12.01             -0.13
           657.xz_s  11.19                11.12             -0.07
         508.namd_r   5.05                 4.99             -0.06
      620.omnetpp_s   2.61                 2.66              0.05
           557.xz_r  11.18                11.14             -0.05
          519.lbm_r   2.27                 2.25             -0.02
          644.nab_s   1.56                 1.58              0.02
    
  • SLP Vectorizer

    % python3 ~/llvm-test-suite/utils/compare.py  \
      O3-no-SLP.results.json O3.results.json \
      --minimal-names --no-abs-sort --absolute-diff --filter-short=1
    
    Tests: 32
    Short Running: 12 (filtered out)
    Remaining: 20
    Metric: exec_time
    
    Program         exec_time
                    O3-no-SLP.results O3.results diff
         625.x264_s  38.35             24.54     -13.81
         525.x264_r  38.31             24.52     -13.79
         508.namd_r   4.99              4.25      -0.73
    600.perlbench_s  11.29             10.57      -0.72
    500.perlbench_r  11.33             10.68      -0.65
          619.lbm_s  18.44             18.90       0.47
          505.mcf_r  10.66             10.52      -0.14
           557.xz_r  11.14             11.24       0.11
           657.xz_s  11.12             11.22       0.10
    631.deepsjeng_s  12.01             12.11       0.10
       510.parest_r   5.01              5.09       0.09
          605.mcf_s  10.59             10.51      -0.08
          644.nab_s   1.58              1.52      -0.06
      620.omnetpp_s   2.66              2.61      -0.04
          544.nab_r   1.58              1.55      -0.03
    

根据上述结果,Loop Vectorizer 和 SLP Vectorizer 的优化效果在 x264 上的效果最明显:

  • 开启 Loop Vectorizer,将运行时间从 44.94s(44.70s) 优化到 38.31s(38.35s),优化了 6.64s(6.35s),有相对 14.8%(14.2%) 的收益

  • 在开启 Loop Vectorizer 的基础上再开启 SLP Vectorizer,将运行时间从 38.31s(38.35s) 优化到 24.52s(24.54s),优化了 13.79s(13.81s),有相对 36.0% 的收益

至少在 x264 上 SLP Vectorizer 带来的性能提升是远大于 Loop Vectorizer 的

Case study: x264

本节尝试分析 Loop Vectorizer 和 SLP Vectorizer 为 x264 带来的性能提升的来源。

查看 llvm-test-suite-build/test.log 找到运行 x264 执行的命令为:625.x264_s --dumpyuv 50 --frames 156 -o BuckBunny_New.264 BuckBunny.yuv 1280x720

通过 perf 对比不同编译选项编译出来的 x264 的运行时 profile 信息,分析 Loop Vectorizer 和 SLP Vectorizer 为 x264 带来的性能提升的来源。

% cd llvm-test-suite-build/External/SPEC/CINT2017speed/625.x264_s/run_test

% perf record --call-graph=dwarf ../625.x264_s --dumpyuv 50 --frames 156 -o BuckBunny_New.264 BuckBunny.yuv 1280x720

% perf report --no-children
  • perf report: x264 built with O3 -fno-vectorize -fno-slp-vectorize

    Samples: 183K of event 'cycles', Event count (approx.): 139996467543
      Overhead  Command     Shared Object     Symbol
    +   18.81%  625.x264_s  625.x264_s        [.] x264_pixel_sad_16x16
    +   15.28%  625.x264_s  625.x264_s        [.] get_ref
    +   14.75%  625.x264_s  625.x264_s        [.] x264_pixel_satd_8x4
    +    5.57%  625.x264_s  625.x264_s        [.] x264_pixel_sad_x4_8x8
    +    5.07%  625.x264_s  625.x264_s        [.] quant_4x4
    +    4.59%  625.x264_s  625.x264_s        [.] mc_chroma
    +    2.97%  625.x264_s  625.x264_s        [.] x264_pixel_sad_x3_8x8
    +    2.38%  625.x264_s  625.x264_s        [.] pixel_hadamard_ac
    +    1.89%  625.x264_s  625.x264_s        [.] quant_trellis_cabac
    +    1.79%  625.x264_s  625.x264_s        [.] x264_pixel_sad_8x8
    +    1.71%  625.x264_s  625.x264_s        [.] refine_subpel
    +    1.52%  625.x264_s  625.x264_s        [.] x264_me_search_ref
    +    1.45%  625.x264_s  625.x264_s        [.] hpel_filter
    +    1.45%  625.x264_s  625.x264_s        [.] x264_pixel_sad_16x8
    +    1.16%  625.x264_s  625.x264_s        [.] quant_8x8
    +    1.09%  625.x264_s  625.x264_s        [.] sub4x4_dct
    
  • perf report: x264 built with O3 -fno-slp-vectorize

    Samples: 157K of event 'cycles', Event count (approx.): 120606957278
      Overhead  Command     Shared Object     Symbol
    +   22.12%  625.x264_s  625.x264_s        [.] x264_pixel_sad_16x16
    +   16.96%  625.x264_s  625.x264_s        [.] x264_pixel_satd_8x4
    +    9.11%  625.x264_s  625.x264_s        [.] get_ref
    +    6.40%  625.x264_s  625.x264_s        [.] x264_pixel_sad_x4_8x8
    +    5.26%  625.x264_s  625.x264_s        [.] mc_chroma
    +    3.46%  625.x264_s  625.x264_s        [.] x264_pixel_sad_x3_8x8
    +    2.39%  625.x264_s  625.x264_s        [.] pixel_hadamard_ac
    +    2.15%  625.x264_s  625.x264_s        [.] quant_trellis_cabac
    +    2.13%  625.x264_s  625.x264_s        [.] x264_pixel_sad_8x8
    +    1.95%  625.x264_s  625.x264_s        [.] refine_subpel
    +    1.81%  625.x264_s  625.x264_s        [.] quant_4x4
    +    1.75%  625.x264_s  625.x264_s        [.] x264_me_search_ref
    +    1.68%  625.x264_s  625.x264_s        [.] x264_pixel_sad_16x8
    +    1.19%  625.x264_s  625.x264_s        [.] sub4x4_dct
    
  • perf report: x264 built with O3

    Samples: 100K of event 'cycles', Event count (approx.): 77140113982
      Overhead  Command     Shared Object     Symbol
    +   26.50%  625.x264_s  625.x264_s        [.] x264_pixel_satd_8x4
    +   13.90%  625.x264_s  625.x264_s        [.] get_ref
    +    8.13%  625.x264_s  625.x264_s        [.] mc_chroma
    +    3.81%  625.x264_s  625.x264_s        [.] pixel_hadamard_ac
    +    3.41%  625.x264_s  625.x264_s        [.] quant_trellis_cabac
    +    3.41%  625.x264_s  625.x264_s        [.] x264_pixel_sad_16x16
    +    2.98%  625.x264_s  625.x264_s        [.] refine_subpel
    +    2.84%  625.x264_s  625.x264_s        [.] quant_4x4
    +    2.32%  625.x264_s  625.x264_s        [.] x264_me_search_ref
    +    1.86%  625.x264_s  625.x264_s        [.] sub4x4_dct
    +    1.48%  625.x264_s  625.x264_s        [.] pixel_avg_weight_wxh
    +    1.41%  625.x264_s  625.x264_s        [.] hpel_filter
    +    1.06%  625.x264_s  625.x264_s        [.] x264_pixel_satd_4x4
    +    0.93%  625.x264_s  625.x264_s        [.] x264_pixel_sad_x4_8x8
    

Loop Vectorizer

通过 perf 信息,可以看到开启 Loop Vectorizer 后 x264 的性能提升主要来自于对 get_ref() 函数的优化。

函数 get_ref() 的定义位于 x264_src/common/mc.c 中,源代码如下:

static uint8_t *get_ref( uint8_t *dst,   int *i_dst_stride,
                         uint8_t *src[4], int i_src_stride,
                         int mvx, int mvy,
                         int i_width, int i_height, const x264_weight_t *weight )
{
    int qpel_idx = ((mvy&3)<<2) + (mvx&3);
    int offset = (mvy>>2)*i_src_stride + (mvx>>2);
    uint8_t *src1 = src[hpel_ref0[qpel_idx]] + offset + ((mvy&3) == 3) * i_src_stride;

    if( qpel_idx & 5 ) /* qpel interpolation needed */
    {
        uint8_t *src2 = src[hpel_ref1[qpel_idx]] + offset + ((mvx&3) == 3);
        pixel_avg( dst, *i_dst_stride, src1, i_src_stride,
                   src2, i_src_stride, i_width, i_height );
        if( weight->weightfn )
            mc_weight( dst, *i_dst_stride, dst, *i_dst_stride, weight, i_width, i_height );
        return dst;
    }
    else if( weight->weightfn )
    {
        mc_weight( dst, *i_dst_stride, src1, i_src_stride, weight, i_width, i_height );
        return dst;
    }
    else
    {
        *i_dst_stride = i_src_stride;
        return src1;
    }
}

表面上看函数 get_ref() 的定义,应该是没有 Loop Vectorizer 的向量化机会的,那为什么开启 Loop Vectorizer 后对 get_ref() 函数优化有性能提升呢?答案是编译优化先将 get_ref() 函数中对 pixel_avg() 函数和 mc_weight() 函数的调用 inline 了,然后 inline 后的get_ref() 函数被 Loop Vectorizer 向量化了。

pixel_avg() 函数和 mc_weight() 函数的定义如下:

static inline void pixel_avg( uint8_t *dst,  int i_dst_stride,
                              uint8_t *src1, int i_src1_stride,
                              uint8_t *src2, int i_src2_stride,
                              int i_width, int i_height )
{
    for( int y = 0; y < i_height; y++ )
    {
        for( int x = 0; x < i_width; x++ )
            dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
        dst  += i_dst_stride;
        src1 += i_src1_stride;
        src2 += i_src2_stride;
    }
}

#define opscale(x) dst[x] = x264_clip_uint8( ((src[x] * weight->i_scale + (1<<(weight->i_denom - 1))) >> weight->i_denom) + weight->i_offset )
#define opscale_noden(x) dst[x] = x264_clip_uint8( src[x] * weight->i_scale + weight->i_offset )
static inline void mc_weight( uint8_t *dst, int i_dst_stride, uint8_t *src, int i_src_stride, const x264_weight_t *weight, int i_width, int i_height )
{
    if( weight->i_denom >= 1 )
    {
        for( int y = 0; y < i_height; y++, dst += i_dst_stride, src += i_src_stride )
            for( int x = 0; x < i_width; x++ )
                opscale( x );
    }
    else
    {
        for( int y = 0; y < i_height; y++, dst += i_dst_stride, src += i_src_stride )
            for( int x = 0; x < i_width; x++ )
                opscale_noden( x );
    }
}

如何验证 get_ref() 函数中对 pixel_avg() 函数和 mc_weight() 函数的调用被 inline 了呢?

  • 一方面,可以通过 objdump 查看函数 get_ref() 的反汇编来验证。

    % objdump -d ../625.x264_s | awk -v RS= '/^[[:xdigit:]]+ <get_ref>/'
    
  • 另一方面,可以通过查看编译 x264_src/common/mc.c 时生成 OptimizationRemark 来确认。

    因为在编译 SPEC 2017 时为 cmake 添加了 -DCMAKE_EXPORT_COMPILE_COMMANDS=ON 选项,所以可以在 llvm-test-suite-build/compile_commands.json 中找到编译 x264_src/common/mc.c 使用的命令:

    % $LLVM_BUILD_PATH/bin/clang \
          -DNDEBUG -DSPEC -DSPEC_AUTO_BYTEORDER=0x12345678 -DSPEC_CPU -DSPEC_LINUX \
          -DSPEC_LINUX_X64 -DSPEC_LP64 -DSPEC_SUPPRESS_OPENMP \
          -I$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src \
          -I$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/extras \
          -I$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common \
          -O3 -w -Werror=date-time -save-stats=obj -save-stats=obj \
          -c $SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c
    

    为上述编译命令添加 -Rpass="inline|vectorize" 选项,手动编译 x264_src/common/mc.c,通过查看 OptimizationRemark 可以看到get_ref() 函数中对 pixel_avg() 函数和 mc_weight() 函数的调用确实被 inline 了。

    $SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c:244:9: remark: ‘pixel_avg’ inlined into ‘get_ref’ with (cost=25, threshold=325) at callsite get_ref:12:9; [-Rpass=inline]

    $SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c:247:13: remark: ‘mc_weight’ inlined into ‘get_ref’ with (cost=190, threshold=325) at callsite get_ref:15:13; [-Rpass=inline]

    $SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c:252:9: remark: ‘mc_weight’ inlined into ‘get_ref’ with (cost=190, threshold=325) at callsite get_ref:20:9; [-Rpass=inline]


继续分析 Loop Vectorizer 对函数get_ref() 的向量化优化。

在编译 x264_src/common/mc.c 时生成 OptimizationRemark 中同样可以看到,Loop Vectorizer 将 inline 到函数 get_ref() 中的 pixel_avg() 函数和 mc_weight() 函数向量化了:

$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c:44:9: remark: vectorized loop (vectorization width: 16, interleaved count: 2) [-Rpass=loop-vectorize]

44 | for( int x = 0; x < i_width; x++ )

$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c:127:13: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]

127 | for( int x = 0; x < i_width; x++ )

$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/mc.c:133:13: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]

133 | for( int x = 0; x < i_width; x++ )

详细对比函数 get_ref() 在 Loop Vectorizer 前后 LLVM IR 的具体变化,可以在 x264_src/common/mc.c 的编译命令中分别添加 -mllvm -print-before=loop-vectorize-mllvm -print-after=loop-vectorize 选项,手动编译两次 x264_src/common/mc.c。

我将函数 get_ref() 在 Loop Vectorizer 前后 LLVM IR 上传到了 github gist:

SLP Vectorizer

通过 perf 信息,可以看到开启 SLP Vectorizer 后 x264 的性能提升主要来自于对 x264_pixel_sad_16x16() 函数的优化。

函数 x264_pixel_sad_16x16() 的定义位于 x264_src/common/pixel.c 中,源代码如下:

/****************************************************************************
 * pixel_sad_WxH
 ****************************************************************************/
#define PIXEL_SAD_C( name, lx, ly ) \
static int name( uint8_t *pix1, int i_stride_pix1,  \
                 uint8_t *pix2, int i_stride_pix2 ) \
{                                                   \
    int i_sum = 0;                                  \
    for( int y = 0; y < ly; y++ )                   \
    {                                               \
        for( int x = 0; x < lx; x++ )               \
        {                                           \
            i_sum += abs( pix1[x] - pix2[x] );      \
        }                                           \
        pix1 += i_stride_pix1;                      \
        pix2 += i_stride_pix2;                      \
    }                                               \
    return i_sum;                                   \
}

PIXEL_SAD_C( x264_pixel_sad_16x16, 16, 16 )
PIXEL_SAD_C( x264_pixel_sad_16x8,  16,  8 )
PIXEL_SAD_C( x264_pixel_sad_8x16,   8, 16 )
PIXEL_SAD_C( x264_pixel_sad_8x8,    8,  8 )
PIXEL_SAD_C( x264_pixel_sad_8x4,    8,  4 )
PIXEL_SAD_C( x264_pixel_sad_4x8,    4,  8 )
PIXEL_SAD_C( x264_pixel_sad_4x4,    4,  4 )

在 llvm-test-suite-build/compile_commands.json 中找到编译 x264_src/common/pixel.c 使用的命令:

% $LLVM_BUILD_PATH/bin/clang \
      -DNDEBUG -DSPEC -DSPEC_AUTO_BYTEORDER=0x12345678 -DSPEC_CPU -DSPEC_LINUX \
      -DSPEC_LINUX_X64 -DSPEC_LP64 -DSPEC_SUPPRESS_OPENMP \
      -I$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src \
      -I$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/extras \
      -I$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common \
      -O3 -w -Werror=date-time \
      -c $SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/pixel.c

首先为上述编译命令添加 -Rpass="vectorize" 选项,手动编译 x264_src/common/pixel.c,通过查看 OptimizationRemark 可以看到 x264_src/common/pixel.c:61:1 即函数 x264_pixel_sad_16x16() 确实被 SLP Vectorizer 向量化了。

$SPEC_CPU2017_PATH/benchspec/CPU/525.x264_r/src/x264_src/common/pixel.c:61:1: remark: Vectorized horizontal reduction with cost -47 and with tree size 6 [-Rpass=slp-vectorizer]

61 | PIXEL_SAD_C( x264_pixel_sad_16x16, 16, 16 )

然后在上述编译命令中分别添加 -mllvm -print-before=slp-vectorizer-mllvm -print-after=slp-vectorizer 选项,手动编译两次 x264_src/common/pixel.c,对比 SLP Vectorizer 前后 LLVM IR 的具体变化:


这里详细分析下 SLP Vectorizer 对函数 x264_pixel_sad_16x16() 的向量化优化:

在上一篇文章 SLP Vectorizer: Part 1 - Implementation in LLVM 中提到:SLP Vectorizer 会遍历每个函数的所有基本块,对于每个基本块,依次调用 vectorizeStoreChains()vectorizeChainsInBlock()vectorizeGEPIndices() 尝试向量化。对于函数 x264_pixel_sad_16x16(),是 vectorizeChainsInBlock() 找到了 horizontal reduction 的向量化机会,成功向量化。

  • horizontal reduction 如下图所示:

  • 以 horizontal reduction 为根结点构建的 SLP Tree 如下图所示:

  • 函数 x264_pixel_sad_16x16() 经 SLP Vectorizer 向量化后的 LLVM IR 非常简洁:

    ; *** IR Dump After SLPVectorizerPass on x264_pixel_sad_16x16 ***
    ; Function Attrs: nofree norecurse nosync nounwind memory(read, inaccessiblemem: none) uwtable
    define internal i32 @x264_pixel_sad_16x16(
      ptr nocapture noundef readonly %pix1, i32 noundef %i_stride_pix1,
      ptr nocapture noundef readonly %pix2, i32 noundef %i_stride_pix2) {
    entry:
      %idx.ext = sext i32 %i_stride_pix1 to i64
      %idx.ext8 = sext i32 %i_stride_pix2 to i64
      br label %for.cond1.preheader
    
    for.cond1.preheader:                              ; preds = %entry, %for.cond1.preheader
      %y.025 = phi i32 [ 0, %entry ], [ %inc11, %for.cond1.preheader ]
      %i_sum.024 = phi i32 [ 0, %entry ], [ %op.rdx, %for.cond1.preheader ]
      %pix1.addr.023 = phi ptr [ %pix1, %entry ], [ %add.ptr, %for.cond1.preheader ]
      %pix2.addr.022 = phi ptr [ %pix2, %entry ], [ %add.ptr9, %for.cond1.preheader ]
      %0 = load <16 x i8>, ptr %pix1.addr.023, align 1, !tbaa !14
      %1 = zext <16 x i8> %0 to <16 x i32>
      %2 = load <16 x i8>, ptr %pix2.addr.022, align 1, !tbaa !14
      %3 = zext <16 x i8> %2 to <16 x i32>
      %4 = sub nsw <16 x i32> %1, %3
      %5 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %4, i1 true)
      %6 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %5)
      %op.rdx = add i32 %6, %i_sum.024
      %add.ptr = getelementptr inbounds i8, ptr %pix1.addr.023, i64 %idx.ext
      %add.ptr9 = getelementptr inbounds i8, ptr %pix2.addr.022, i64 %idx.ext8
      %inc11 = add nuw nsw i32 %y.025, 1
      %exitcond.not = icmp eq i32 %inc11, 16
      br i1 %exitcond.not, label %for.cond.cleanup, label %for.cond1.preheader, !llvm.loop !30
    
    for.cond.cleanup:                                 ; preds = %for.cond1.preheader
      ret i32 %op.rdx
    }
    

Summary

在上一篇文章 SLP Vectorizer: Part 1 - Implementation in LLVM 中我们学习 LLVM 的 SLP Vectorizer 实现,本文基于 SPEC CPU 2017 来评估 SLP Vectorizer 的优化效果。

  1. 在 SPEC CPU 2017 上测试, SLP Vectorizer 在 x264 上带来的性能提升远大于 Loop Vectorizer

  2. 在 x264 上,主要是 SLP Vectorizer 的 vectorizeChainsInBlock() 找到了 horizontal reduction 的向量化机会带来的性能提升

P.S.

  1. 如果不开启 Loop Vectorizer 只开启 SLP Vectorizer,SLP Vectorizer 在 x264 上能带来多大的性能提升呢?不开启 Loop Vectorizer 只开启 SLP Vectorizer 运行 SPEC 2017 的结果如下:

    • 只开启 Loop Vectorizer,将 x264 运行时间从 44.94s(44.70s) 优化到 38.31s(38.35s),优化了 6.64s(6.35s),有相对 14.8%(14.2%) 的收益。

    • 只开启 SLP Vectorizer,将 x264 运行时间从 44.94s(44.70s) 优化到 31.28s(31.31s),优化了 13.67s(13.39s),有相对 30.4%(30.0%) 的收益。因此,在 x264 上只开启 SLP Vectorizer 带来的性能提升也是大于只开启 Loop Vectorizer 带来的性能提升的。

    Program                                                                        exec_time
                                                                                   O3-no-LV.results
              test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  31.31
               test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  31.28
                 test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test  18.55
    test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  12.18
                   test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  11.21
                  test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test  11.17
     test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  10.53
    test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  10.51
                test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test   9.85
                 test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test   9.84
     test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   7.22
            test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   5.26
             test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   5.26
            test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test   5.16
                test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   4.36
         test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   2.63
        test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   2.63
                  test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test   2.29
                  test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   2.01
                 test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   1.98
            test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test   0.33
          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test   0.17
     test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test   0.06
    test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test   0.06
         test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test   0.01
          test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test   0.01
      test-suite :: External/SPEC/CFP2017rate/997.specrand_fr/997.specrand_fr.test   0.01
     test-suite :: External/SPEC/CINT2017rate/999.specrand_ir/999.specrand_ir.test   0.01
    test-suite :: External/SPEC/CINT2017speed/998.specrand_is/998.specrand_is.test   0.01
     test-suite :: External/SPEC/CFP2017speed/996.specrand_fs/996.specrand_fs.test   0.01
                 test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   0.01
                test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   0.01
    
  2. 查看 x264 在 O3 优化等级下函数 x264_pixel_sad_16x16 的反汇编:

    % objdump -d 625.x264_s | awk -v RS= '/^[[:xdigit:]]+ <x264_pixel_sad_16x16>/'
    000000000000f2e0 <x264_pixel_sad_16x16>:
        f2e0:        48 63 f6                     movslq %esi,%rsi
        f2e3:        48 63 c9                     movslq %ecx,%rcx
        f2e6:        31 c0                        xor    %eax,%eax
        f2e8:        41 b8 10 00 00 00            mov    $0x10,%r8d
        f2ee:        66 90                        xchg   %ax,%ax
        f2f0:        f3 0f 6f 07                  movdqu (%rdi),%xmm0
        f2f4:        f3 0f 6f 0a                  movdqu (%rdx),%xmm1
        f2f8:        66 0f f6 c8                  psadbw %xmm0,%xmm1
        f2fc:        66 0f 70 c1 ee               pshufd $0xee,%xmm1,%xmm0
        f301:        66 0f d4 c1                  paddq  %xmm1,%xmm0
        f305:        66 41 0f 7e c1               movd   %xmm0,%r9d
        f30a:        44 01 c8                     add    %r9d,%eax
        f30d:        48 01 f7                     add    %rsi,%rdi
        f310:        48 01 ca                     add    %rcx,%rdx
        f313:        41 ff c8                     dec    %r8d
        f316:        75 d8                        jne    f2f0 <x264_pixel_sad_16x16+0x10>
        f318:        c3                           retq
        f319:        0f 1f 80 00 00 00 00         nopl   0x0(%rax)
    

    LLVM 为函数 x264_pixel_sad_16x16 由 LLVM IR 生成汇编时也做了一定的“优化”,生成了 psadbw 指令,PSADBW — Compute Sum of Absolute Differences

    TEMP0 := ABS(DEST[7:0] - SRC[7:0])
    (* Repeat operation for bytes 2 through 14 *)
    TEMP15 := ABS(DEST[127:120] - SRC[127:120])
    DEST[15:0] := SUM(TEMP0:TEMP7)
    DEST[63:16] := 000000000000H
    DEST[79:64] := SUM(TEMP8:TEMP15)
    DEST[127:80] := 00000000000
    DEST[MAXVL-1:128] (Unmodified)
    
  3. -O3 v.s. -O3 -march=native -mtune=native

    开启 -march=native -mtune=native 选项会多暴露一些优化机会,可以通过 clang -E - -march=native -mtune=native -### 查看具体 pass 了哪些 -target-cpu, -tune-cpu, -target-feature 给 cc1。

    开启 -march=native -mtune=native 选项并不影响本文的结论,具体的开启 -march=native -mtune=native 选项后运行 SPEC 2017 的结果如下:

    • -O3 -march=native -mtune=native -fno-vectorize -fno-slp-vectorize

      Program                                       exec_time
                                                    O3-no-LV-SLP.results
      INT2017rate/525.x264_r/525.x264_r              43.80
      INT2017speed/625.x264_s/625.x264_s             43.72
      FP2017speed/619.lbm_s/619.lbm_s                16.94
      INT2017spe...31.deepsjeng_s/631.deepsjeng_s    11.86
      INT2017speed/657.xz_s/657.xz_s                 11.03
      INT2017rate/557.xz_r/557.xz_r                  10.98
      INT2017spe...00.perlbench_s/600.perlbench_s    10.85
      INT2017rat...00.perlbench_r/500.perlbench_r    10.83
      INT2017rate/505.mcf_r/505.mcf_r                 9.79
      INT2017speed/605.mcf_s/605.mcf_s                9.73
      INT2017rat...31.deepsjeng_r/531.deepsjeng_r     6.92
      INT2017speed/641.leela_s/641.leela_s            5.31
      INT2017rate/541.leela_r/541.leela_r             5.31
      FP2017rate/510.parest_r/510.parest_r            5.08
      FP2017rate/508.namd_r/508.namd_r                3.99
      INT2017rate/520.omnetpp_r/520.omnetpp_r         2.67
      INT2017spe...ed/620.omnetpp_s/620.omnetpp_s     2.63
      FP2017rate/519.lbm_r/519.lbm_r                  2.02
      FP2017speed/644.nab_s/644.nab_s                 1.43
      FP2017rate/544.nab_r/544.nab_r                  1.40
      FP2017rate/511.povray_r/511.povray_r            0.30
      FP2017rate/526.blender_r/526.blender_r          0.16
      INT2017rat...23.xalancbmk_r/523.xalancbmk_r     0.06
      INT2017spe...23.xalancbmk_s/623.xalancbmk_s     0.06
      FP2017rate/538.imagick_r/538.imagick_r          0.01
      FP2017speed/638.imagick_s/638.imagick_s         0.01
      INT2017rate/502.gcc_r/502.gcc_r                 0.01
      INT2017spe...98.specrand_is/998.specrand_is     0.01
      FP2017spee...96.specrand_fs/996.specrand_fs     0.01
      INT2017rat...99.specrand_ir/999.specrand_ir     0.01
      FP2017rate...97.specrand_fr/997.specrand_fr     0.01
      INT2017speed/602.gcc_s/602.gcc_s                0.01
      
    • -O3 -march=native -mtune=native -fno-slp-vectorize

      Program                                       exec_time
                                                    O3-no-SLP.results
      INT2017rate/525.x264_r/525.x264_r              38.76
      INT2017speed/625.x264_s/625.x264_s             38.76
      FP2017speed/619.lbm_s/619.lbm_s                17.10
      INT2017spe...31.deepsjeng_s/631.deepsjeng_s    11.77
      INT2017speed/657.xz_s/657.xz_s                 11.16
      INT2017rate/557.xz_r/557.xz_r                  11.15
      INT2017speed/605.mcf_s/605.mcf_s               10.55
      INT2017spe...00.perlbench_s/600.perlbench_s    10.55
      INT2017rate/505.mcf_r/505.mcf_r                10.54
      INT2017rat...00.perlbench_r/500.perlbench_r    10.53
      INT2017rat...31.deepsjeng_r/531.deepsjeng_r     6.88
      FP2017rate/510.parest_r/510.parest_r            5.42
      INT2017rate/541.leela_r/541.leela_r             5.30
      INT2017speed/641.leela_s/641.leela_s            5.29
      FP2017rate/508.namd_r/508.namd_r                4.18
      INT2017spe...ed/620.omnetpp_s/620.omnetpp_s     2.64
      INT2017rate/520.omnetpp_r/520.omnetpp_r         2.60
      FP2017rate/519.lbm_r/519.lbm_r                  2.02
      FP2017speed/644.nab_s/644.nab_s                 1.50
      FP2017rate/544.nab_r/544.nab_r                  1.47
      FP2017rate/511.povray_r/511.povray_r            0.31
      FP2017rate/526.blender_r/526.blender_r          0.16
      INT2017spe...23.xalancbmk_s/623.xalancbmk_s     0.06
      INT2017rat...23.xalancbmk_r/523.xalancbmk_r     0.06
      FP2017speed/638.imagick_s/638.imagick_s         0.01
      FP2017rate/538.imagick_r/538.imagick_r          0.01
      INT2017rate/502.gcc_r/502.gcc_r                 0.01
      INT2017rat...99.specrand_ir/999.specrand_ir     0.01
      FP2017spee...96.specrand_fs/996.specrand_fs     0.01
      FP2017rate...97.specrand_fr/997.specrand_fr     0.01
      INT2017spe...98.specrand_is/998.specrand_is     0.01
      INT2017speed/602.gcc_s/602.gcc_s                0.01
      
    • -O3 -march=native -mtune=native

      Program                                       exec_time
                                                    O3.results
      INT2017rate/525.x264_r/525.x264_r              21.40
      INT2017speed/625.x264_s/625.x264_s             21.35
      FP2017speed/619.lbm_s/619.lbm_s                17.16
      INT2017spe...31.deepsjeng_s/631.deepsjeng_s    11.81
      INT2017speed/657.xz_s/657.xz_s                 11.20
      INT2017rate/557.xz_r/557.xz_r                  11.19
      INT2017spe...00.perlbench_s/600.perlbench_s    10.86
      INT2017rat...00.perlbench_r/500.perlbench_r    10.83
      INT2017speed/605.mcf_s/605.mcf_s               10.68
      INT2017rate/505.mcf_r/505.mcf_r                10.65
      INT2017rat...31.deepsjeng_r/531.deepsjeng_r     6.85
      FP2017rate/510.parest_r/510.parest_r            5.47
      INT2017rate/541.leela_r/541.leela_r             5.26
      INT2017speed/641.leela_s/641.leela_s            5.26
      FP2017rate/508.namd_r/508.namd_r                4.18
      INT2017spe...ed/620.omnetpp_s/620.omnetpp_s     2.62
      INT2017rate/520.omnetpp_r/520.omnetpp_r         2.61
      FP2017rate/519.lbm_r/519.lbm_r                  2.00
      FP2017rate/544.nab_r/544.nab_r                  1.39
      FP2017speed/644.nab_s/644.nab_s                 1.39
      FP2017rate/511.povray_r/511.povray_r            0.31
      FP2017rate/526.blender_r/526.blender_r          0.16
      INT2017spe...23.xalancbmk_s/623.xalancbmk_s     0.06
      INT2017rat...23.xalancbmk_r/523.xalancbmk_r     0.06
      FP2017speed/638.imagick_s/638.imagick_s         0.01
      FP2017rate/538.imagick_r/538.imagick_r          0.01
      INT2017rate/502.gcc_r/502.gcc_r                 0.01
      INT2017rat...99.specrand_ir/999.specrand_ir     0.01
      FP2017spee...96.specrand_fs/996.specrand_fs     0.01
      INT2017spe...98.specrand_is/998.specrand_is     0.01
      FP2017rate...97.specrand_fr/997.specrand_fr     0.01
      INT2017speed/602.gcc_s/602.gcc_s                0.01
      

References

  1. https://www.phoronix.com/review/llvm_33_slpvectorizer

  2. https://groups.google.com/g/llvm-dev/c/eHiVo9RgZR4

  3. https://wiki.mozilla.org/SIMD/Uses/SAD

  4. https://maskray.me/blog/2022-08-28-march-mcpu-mtune