Enabling GCC Graphite and LTO on Gentoo

Jul 08 2013 Published by under Linux

In this article, we will enable the GCC options that are marked in the GCC documentation with “To use this code transformation, GCC has to be configured with --with-ppl and --with-cloog to enable the Graphite loop transformation infrastructure.” These are referred to here as graphite. Link Time Optimization (LTO) refers to the options -flto and -fuse-linker-plugin. -flto defers optimization until the final link step so that the optimizer can work across the whole program rather than on individual object files, and -fuse-linker-plugin tells GCC to use a linker plugin during LTO, which requires a linker with plugin support such as gold. OpenMP is for parallelization; I am saving it for another article since I do not have benchmarks to show that it improves performance.
To enable graphite, LTO, and OpenMP, it is recommended to first install the latest version of GCC. More recent versions of GCC contain bug fixes for the features we are about to enable. The GCC upgrade process is detailed in the guide http://www.gentoo.org/doc/en/gcc-upgrading.xml.

After upgrading GCC, we can start enabling the newer features. The steps roughly follow the guidelines on the Gentoo wiki. However, as explained in the official GCC upgrade guide, it is unnecessary to rebuild everything after upgrading GCC. libtool allows packages to be built against shared libraries without rebuilding every time the toolchain is upgraded.
Before adding the graphite USE flag, it is essential to build cloog:

emerge dev-libs/cloog dev-libs/cloog-ppl

Then add the graphite flags to CFLAGS and enable the graphite and lto USE flags:

GRAPHITE="-floop-interchange -ftree-loop-distribution -floop-strip-mine -floop-block"
CFLAGS="-flto=8 ${GRAPHITE} -ftree-vectorize"
CXXFLAGS="${CFLAGS}"
LDFLAGS="${CFLAGS} -fuse-linker-plugin"
# GCC >= 4.8.2 requires the lto flag
USE="graphite lto"

These flags can be appended to the ones you’re already using, although certain flags may conflict with the graphite optimizations. Since graphite flags such as -ftree-loop-distribution are intended to enable further vectorization, I also enabled -ftree-vectorize. -fuse-linker-plugin takes effect at link time, while -flto=n turns on the standard link-time optimizer and lets it run up to n jobs in parallel; I set n to 8 to use 8 threads while linking. Another factor that could affect performance is -flto-partition, which sets the partitioning algorithm.
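
A minimal sketch for picking n on a different machine, assuming you want one LTO job per online CPU. The helper name lto_flag is hypothetical. As far as I know, make.conf is not run through a full shell, so the printed value would be pasted into CFLAGS by hand rather than computed there.

```shell
# lto_flag (hypothetical helper): print -flto=N with N = number of online CPUs.
# Tries nproc first, then getconf, and falls back to 1 if neither is available.
lto_flag() {
    n=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    printf '%s%s\n' '-flto=' "$n"
}

lto_flag
```

On the 8-thread machine described above this would print -flto=8.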
To use the linker plugin with LTO, set the linker to Gold:

binutils-config --linker ld.gold

In addition, before compiling the system it is necessary to build GCC with graphite:

emerge gcc
emerge -e world
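
Since a world rebuild takes hours, it may be worth checking that the rebuilt compiler accepts the graphite flags before starting it. The following is a minimal sketch, not part of the original procedure: gcc_accepts is a hypothetical helper that compiles a trivial program with the given flags and reports whether the compiler exited successfully. It assumes the compiler is called gcc unless CC is set.

```shell
# gcc_accepts FLAGS... (hypothetical helper)
# Succeeds if the compiler (CC, default gcc) can compile a trivial C file
# with the given flags. A GCC built without graphite support is expected to
# fail here with "sorry, unimplemented".
gcc_accepts() {
    cc=${CC:-gcc}
    tmp=$(mktemp -d) || return 1
    printf 'int main(void) { return 0; }\n' > "$tmp/t.c"
    if "$cc" -O2 "$@" -c "$tmp/t.c" -o "$tmp/t.o" 2>/dev/null; then
        status=0
    else
        status=1
    fi
    rm -rf "$tmp"
    return "$status"
}

if gcc_accepts -floop-interchange -floop-strip-mine -floop-block; then
    echo "graphite flags accepted"
else
    echo "graphite flags rejected: rebuild GCC with the graphite USE flag first"
fi
```

The temporary directory is used instead of writing to /dev/null because the check is likely to be run as root.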

Dealing with breakages

Several packages failed to compile with LTO and graphite enabled. These flags can be disabled on a per-package basis. First create two files in /etc/portage/env/ called no-lto.conf and no-graphite.conf. In no-graphite.conf, disable your graphite flags:

CFLAGS="${CFLAGS} -fno-loop-interchange -fno-tree-loop-distribution -fno-loop-strip-mine -fno-loop-block"
CXXFLAGS="${CXXFLAGS} -fno-loop-interchange -fno-tree-loop-distribution -fno-loop-strip-mine -fno-loop-block"
LDFLAGS="${LDFLAGS} -fno-loop-interchange -fno-tree-loop-distribution -fno-loop-strip-mine -fno-loop-block"

Similarly, disable your LTO flags in no-lto.conf:

CFLAGS="${CFLAGS} -fno-lto -fno-use-linker-plugin"
CXXFLAGS="${CXXFLAGS} -fno-lto -fno-use-linker-plugin"
LDFLAGS="${LDFLAGS} -fno-lto -fno-use-linker-plugin"

Every time a package breaks, add a line to /etc/portage/package.env that disables either graphite or LTO for it. For example, I made these adjustments:

net-misc/curl no-graphite.conf
media-libs/mesa no-graphite.conf
sys-apps/findutils no-lto.conf
sys-apps/gawk no-lto.conf

After adding a line you can resume the system rebuild with emerge --resume.
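
Adding these lines by hand gets repetitive during a long rebuild. Here is a small sketch of a helper that appends an override only if it is not already present. The function name add_env_override is hypothetical, and the sketch assumes /etc/portage/package.env is a single file rather than a directory.

```shell
# add_env_override PACKAGE ENV_FILE [PACKAGE_ENV] (hypothetical helper)
# Appends "PACKAGE ENV_FILE" to package.env unless that exact line exists.
# PACKAGE_ENV defaults to /etc/portage/package.env (assumed to be a file).
add_env_override() {
    pkg=$1
    conf=$2
    file=${3:-/etc/portage/package.env}
    if ! grep -qxF "$pkg $conf" "$file" 2>/dev/null; then
        printf '%s %s\n' "$pkg" "$conf" >> "$file"
    fi
}

# Example, run as root after a failed build:
#   add_env_override sys-apps/gawk no-lto.conf && emerge --resume
```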

Notes

Before recompiling your system, moving /var/tmp to RAM speeds things up. The Sabayon wiki has an article on performance; just make sure you reboot after changing /etc/fstab.
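
After the reboot, it is easy to confirm that /var/tmp really is in RAM. The sketch below assumes an fstab entry like the one in the comment (the size is a placeholder, not a recommendation) and a Linux system that exposes /proc/mounts. The helper name is_tmpfs is hypothetical.

```shell
# Assumed /etc/fstab entry (the size is a placeholder; pick one that fits your RAM):
#   tmpfs  /var/tmp  tmpfs  size=8G,mode=1777  0 0

# is_tmpfs MOUNTPOINT [MOUNTS_FILE] (hypothetical helper)
# Succeeds if MOUNTPOINT is listed with filesystem type tmpfs.
# MOUNTS_FILE defaults to /proc/mounts.
is_tmpfs() {
    awk -v mp="$1" '$2 == mp && $3 == "tmpfs" { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

if is_tmpfs /var/tmp; then
    echo "/var/tmp is on tmpfs"
else
    echo "/var/tmp is not on tmpfs"
fi
```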

After building the current version of PPL, this message is displayed for upgrades:

* After an upgrade of PPL it is important that you rebuild
* dev-libs/cloog-ppl.
*
* If you use gcc-config to switch to an older compiler version than
* the one PPL was built with, PPL must be rebuilt with that version.
*
* In both cases failure to do this will get you this error when
* graphite flags are used:
*
* sorry, unimplemented: Graphite loop optimizations cannot be used

Several Thousand Visitors Second Day Launch

Aug 30 2011 Published by under Python Fiddle,Singularitarian

A recent site I launched received a lot of attention. It may have been the only way to get the momentum going, since Google wouldn’t index a site with no content.

Fortunately, I spent a week before launch optimizing page loading, serving static files from Amazon, and fixing usability bugs. The result was a very smooth launch, even when serving many visitors per second. Some users who experienced slow loading may have been waiting for the browser to download a 1.3 or 2.0 MB file, which could have caused a traffic jam on a static file server. The technique used here was to serve files that are already compressed with lzma and gzip, respectively. Because htaccess configuration is not available on Amazon, I decided to serve these from another server.

The most surprising effect was that Google seemed to have picked up the link as soon as it appeared, along with other sites that mirror content.

Javascript LZMA Decompression

Aug 13 2011 Published by under Linux,Programming

In modern browsers, gzip compression is a standard feature. A plain text file typically compresses to about 30% of its original size, reducing the download time of web content by 70% and making it load 2-3 times faster. Despite the speed-up, gzip is an old algorithm based on LZ77. Since then, newer algorithms have been invented, with LZMA being the standard. On Linux, LZMA typically produces files half the size of their gzip equivalents. This tutorial will show you how to use an LZMA-compressed file, produced by the standard lzma command on Unix machines, directly in a client-side web application. The rest of the post assumes you have the JavaScript libraries for LZMA and binary AJAX set up.
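
The arithmetic behind those percentages can be sketched as follows, assuming load time is dominated by transfer time (in practice latency and rendering eat into the gain, which is why the observed figure is lower than the ideal one). The helper name transfer_stats is hypothetical; its argument is the compressed size divided by the original size.

```shell
# transfer_stats RATIO (hypothetical helper)
# RATIO = compressed size / original size, e.g. 0.30 for a file that
# compresses to 30% of its original size. Prints the share of bytes saved
# and the ideal speed-up if transfer time scales with size.
transfer_stats() {
    awk -v r="$1" 'BEGIN { printf "saved=%.0f%% speedup=%.2fx\n", (1 - r) * 100, 1 / r }'
}

transfer_stats 0.30   # prints: saved=70% speedup=3.33x
```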

First, Make a Compressed File

echo "Hello, world." | lzma -3 > hello.lzma

Next, Read Binary Data

<script src="../src/jquery-1.4.4-binary-ajax.js"></script>
<script src="../src/jdataview.js"></script>
<script>
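function unzip(data) {
    // Make a view on the binary data
    var view = new jDataView(data);
    var int_arr = [];

    // getUint8() reads one byte and advances the view's pointer past it,
    // so tell() already points at the next byte on each iteration.
    while (view.tell() < view.length) {
        int_arr.push(view.getUint8(view.tell()));
    }

    console.log(int_arr.length);
    console.log(int_arr);
}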

// Download the file
$.get('hello.lzma', unzip, 'binary');
</script>

This is a pretty simple step, except the while loop counter may be unintuitive. getUint8 advances the file pointer, though this wasn’t documented in the API specification. I spent an hour or so comparing the output in hex. One of the problems was that

5d 00 00 08 00 0d 00 00 00 00 00 00 00 00

is the same as

5d 00 00 08 00 0d ff ff ff ff ff ff ff ff

in little-endian. You can try it in the decompressor: just replace the bytes in the hello world lzma file compressed at level 3. However, I figured out the problem as soon as I compared view.length and int_arr.length: one was twice the other! A factor of 2 always has significance in computing; in this case it meant I was reading every other byte. After correcting the while loop, I moved on to decoding the binary.
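
For this kind of comparison it helps to have the reference bytes from the file on disk. A .lzma file starts with a 13-byte header: one properties byte, a four-byte dictionary size, and an eight-byte little-endian uncompressed size. The sketch below dumps those bytes with od. The helper name lzma_header is hypothetical.

```shell
# lzma_header FILE (hypothetical helper)
# Print the 13 header bytes of a .lzma file as space-separated hex:
# 1 properties byte, 4 dictionary-size bytes, 8 uncompressed-size bytes.
lzma_header() {
    od -An -tx1 -N13 "$1" | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

if [ -f hello.lzma ]; then
    lzma_header hello.lzma
fi
```

Comparing this output with the first 13 entries of int_arr shows straight away whether bytes are being skipped.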

Third, Enjoy the Decoding

Yes, this is a rather boring thing to do while waiting, but do enjoy it.

    lzma.decompress(int_arr, function(result) {
        $('body').append($('<textarea></textarea>').val(result));
    });

Benefits

Using LZMA compression rather than gzip, I was able to reduce a gzipped file to 2/3 of its size, reducing the download time by 33%. The LZMA decompression algorithm could be improved by storing results in an array and joining them at the end, rather than appending to a string. This method is not recommended unless you have large files: the libraries themselves take up about 50 KB with gzip. Furthermore, it is unsuitable for downloads where files are sent directly to the user, without being used by the application, since the user would have the decompression utilities.
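
To see whether the saving is worth it for your own files, the two compressed sizes can be compared directly. This is a minimal sketch: compressed_size is a hypothetical helper, page.js is a placeholder file name, and the compression levels in the example are arbitrary choices.

```shell
# compressed_size FILE COMMAND [ARGS...] (hypothetical helper)
# Feed FILE to COMMAND on stdin and print how many bytes COMMAND writes.
compressed_size() {
    file=$1
    shift
    "$@" < "$file" | wc -c | tr -d ' '
}

# Example (page.js is a placeholder; needs gzip and lzma installed):
#   gz=$(compressed_size page.js gzip -9 -c)
#   lz=$(compressed_size page.js lzma -9 -c)
#   echo "lzma output is $((100 * lz / gz))% of the gzip size"
```

If the percentage is not well below 100, the extra 50 KB of JavaScript is unlikely to pay for itself.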
