Update on libm performance

If you’ve been following glibc development (I don’t think many do; the kernel is sexier apparently) you’ll see that a new NEWS item is in for 2.18:

* Improved worst case performance of libm functions with double inputs and
  output.

If you’re wondering if this has anything to do with my previous few posts on libm, then you’re right, it does. Thanks to a number of very interesting changes, the speed of the pow function on x86_64 is now roughly 5 times faster in the worst case than in 2.17. I have considered the pow function throughout my work because it is probably the most ill-reputed function implementation. I plan to write up a detailed description of various improvements I made to the code (other than formatting it and fixing the value of TWO) in a separate post or series of posts. To summarize, I have saved time by:

  • Avoiding wasting time multiplying zeroes in the multiplication function
  • Written a fast squaring method that is a special case of generic multiplication
  • Faster polynomial evaluation for multiple precision exp function
  • Configurable mantissa type for multiple precision numbers, to allow integral mantissa for x86 and retain the floating point mantissa for powerpc
  • Tweak to the multiplication algorithm to reduce multiplcations
  • Dozens of minor tweaks to eek out performance

But the worst case is a few thousand times slower than the average case; 5x is nothing!

Yes and no. 5x is indeed not a good enough improvement if one is looking to compare with the average case, but for an implementation that tries to guarantee 0.5 ulp correctness, it is quite impressive. The comparison point for that is multiple precision implementations like mpfr and we’re not far off that target. Anyone who has done some work on math library implementations will tell you how it is currently not possible to predict worst case precision required to guarantee accuracy in bivariate functions like pow, as a result of which one has to descend into large precisions whenever necessary.

I don't care about exactness, give me something fast and reasonably accurate

I got this a lot from people while working on this and honestly, it’s more a question of project goals than anything else. Currently we’re keeping things as they are, i.e. we’re going to try and maintain our halfulp correctness and try to speed things up. Maybe in future we could think of having different variants of implementations that have different performance and accuracy characteristics, but there’s nothing like that on the table right now.

Is this it? Will there be more?

There’s still a couple of interesting changes pending, the most important of them being the limitation of worst case precision for exp and log functions, based on the results of the paper Worst Cases for Correct Rounding of the Elementary Functions in Double Precision. I still have to prove that those results apply to the glibc multiple precision bits.

After that there is still a fair bit of scope for improvement, but before that I plan to get the performance benchmarking bits working for at least the major functions in glibc. That will give a standard way to measure performance across architectures and also track it across releases or milestones.

And now for the Call for Contributions!

And now on to what we need help with. Glibc exports a lot of functions and it is nearly impossible for me to write benchmark tests for all these functions in the 2.18 timeframe. I guess we’ll be happy to go with whatever we get, but if you’re looking for some interesting work, adding a function to the benchmark could be it. benchtests/Makefile has instructions on how one could add new benchmarks for functionms to the testsuite and I’d be more than happy to help with any queries anyone may have while doing this - all you have to do is post your query on the libc-help mailing list (libc-help at sourceware dot org).

The benchmark framework itself could use some love. The current implementation is simply based on clock_gettime, which is at best a vDSO function and at worst a system call. It would be really cool to have architecture-specific overrides that do measurements with little or no overhead so that the measurements are as accurate as they possibly can be.


The glibc manual needs volunteers

The GNU C Library (glibc) needs more contributors to help improve the library. This is even more true for a very key portion of the library, which is the documentation. We have a glibc manual that gets updated on every release, which is a bit incomplete and has a fair number of bugs filed against it. We would welcome volunteers to take a crack at those bug reports and send in patches. Here’s some information to get you started.

The glibc manual code is maintained within the glibc source tree, in the manual directory. It is in texinfo format, so if you’re familiar with tex, you’re halfway through. The various chapters in the manual are in *.texi files, with a Makefile to build the manual in either info or html format.

To build the manual, create a directory within the source tree (or anywhere outside it is also fine), which I will refer to as the build directory from now on. This keeps the generated content separate from the code. Enter that directory and type the command:

$SRCDIR/configure --prefix=/usr

where $SRCDIR is the path to the sources. Don’t worry about the /usr prefix since we’re not going to install anything. If you’re especially worried about it, then use some other prefix like $HOME/sandbox or similar. Once configure succeeds, build the info format documentation using:

make info

or the html format using:

make html

The documentation gets built in the manual directory within the build directory. The html documentation is built in the libc directory within the manual directory. You can open index.html in a browser to browse through and verify your changes.

Contributing to glibc usually requires a copyright assignment to FSF. If you don’t mind doing this, the procedure is fairly easy, albeit time consuming. All you have to do is post your patch on the libc-alpha (libc-alpha AT sourceware dot org) mailing list and if copyright assignment is necessary for the change, the reviewer will let you know what to do. For help when writing a patch, the libc-help (libc-help AT sourceware dot org) mailing list is the place to go.


Multiprecision arithmetic in libm

Before my two week break, I was working on a bunch of ideas to try and get the multiprecision arithmetic performance in libm to not suck as badly as it currently does. There was a lot of stuff going on, so I’ll try to summarize them here. The primary reason for this post is to get myself back in groove (although I’ve tried to provide as muchh background as possible), so I apologize to readers if some of the content is not coherent. Feel free to point them out in the comments and I’ll try to clarify.

The multiprecision bits in libm are essentially all files starting with mp in $srcdir/sysdeps/iee754/dbl-64/. The structure that stores a multiprecision number is called mp_no and is declared in mpa.h as:

typedef struct
{
  int e;
  double d[40];
} mp_no;

where e is the exponent of the number and the mantissa digits are in d. The radix of the number is 224, so each digit in d is always a non-negative integral value less than 224.

The other all-important module is mpa.c, which defines basic operations on mp_no (construct from double, deconstruct to double, add, subtract, multiply, divide). It was relatively easy to see that these basic operations were the hotspots (specifically multiplication), but not so easy to see that Power code has its own copy of these functions. And there begins the real difficulty.

The Power architecture is unique in that it has all of 4 floating point units. The power specific code takes advantage of that fact and is tweaked in a manner that the execution units are used in parallel. In contrast, the intel x86 architecture is quite weak on floating point units and this is where the major conflict is. The mantissa of mp_no, which is currently a double (but does not need to be since it can only have integral values), is perfect for Power, but not good enough for intel, which has much faster fixed point computations. Conversion between doubles and ints is too slow on both platforms and is hence not a good enough alternative.

A possible approach is using a mantissa_t typedef that is then overridden by Power, but I need to do some consolidation in the rest of the code to ensure that the internal structure of mp_no is not exposed anywhere. So that’s a TODO.

Apart from exploiting architecture-specific traits, the other approach I considered was to tweak the algorithm for multiplication to make it as fast as possible in a platform-independent manner. A significant number of multiplication inputs are numbers that do not utilize the full precision of mp_no, i.e. a large number of mantissa digits are zeroes. The current algorithm blindly loops around the entire precision multiplying these zeroes, which is a waste. A better idea is to find the precision of the numbers and then run the multiply-add-carry loop only to the extent of that precision. The result of this was an improvement in performance to an extent of 36% in the pow function, i.e. a bit less than twice the speed of the earlier algorithm!

Next steps are to evaluate the impact of storing the actual precision of numbers so that it does not have to be computed within the multiplication function. That’s another TODO for me.

Finally there is a lot of cleanup in order in pretty much all of the mathlib code (i.e. the stuff in sysdeps/iee754/dbl-64), which I’ve been chugging away at and will continue to do so. I’m sure the glibc community will love to review any patch submissions that even simply reformat these files in the GNU format, so there’s a good way to get started with contributing code to glibc.


Hotplugging monitors the old-fashioned way

Kushal saw me run the update-monitors script that I had written to autodetect and start monitors using xrandr and wanted it, so here it is. Thanks to this request, I now also have a repo to put some of the more interesting scripts that I use regularly. I’ll eventually slap a Free license on them. There are only two scripts now, the other one being something I use to format my glibc patches and are only of use to glibc contributors.


New look for siddhesh.in

I noticed recently that the comments section of my journal had a very tiny font. I quickly hacked up the stylesheet to increase the font size, but was dissatisfied in general with the look and decided to do something about it this weekend. If you’re reading this, you’ve already seen the result. I used the Titan theme from the Theme Foundry as the base for my theme code. I wrote the stylesheet completely from scratch. Here are a few things I did that were new to me.

border-radius - Rounded corners everywhere!

The curved underline in the top navigation menu, on the sidebar and the blockquotes (one of my old posts has this) are not images. They’re the result of a new css property I found out about called border-radius. I liked it and I have hence used it all over the place! The interesting thing about this property is what happens when you use percentages to denote the radius instead of the usual px. The curves at the end take a slightly different shape, which I liked. However I didn’t keep it because I didn’t know if they were standard or unintended side-effects.

Spam-free Wordpress

I realized that the plugin didn’t really work or worked badly since I was still getting spam comments to moderate. So I replaced it with a text based captcha. I hate the image captchas since I cannot decipher most of them most of the times.

ARIA and the Wordpress doctype

Wordpress uses the aria accessibility attributes for input tags (specifically, aria-required) and sets the doctype as XHTML 1.1. The result of this is that W3.org validator complains of aria-required being an invalid attribute. I looked around online and saw suggestions to either ignore the error or remove the attribute. Both are wrong. The correct resolution is to set your doctype to this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+ARIA 1.0//EN"
    "http://www.w3.org/WAI/ARIA/schemata/xhtml-aria-1.dtd">

Observations on glibc libm

One of my recent (ongoing) projects has been to figure out a way to make performance characteristics of some transcendental functions (exp, log, sin, cos, etc.) in libm a bit more uniform. All of the functions have been implemented with two computation phases. The first phase uses a table of precomputed values and a polynomial approximation of the function. If the result is not accurate enough, the second phase is triggered, which descends into multiprecision computation. These multiprecision computations are dog-slow and that’s what I’m trying to avoid.

As expected, this has turned out to be more of a mathematics project than programming. I haven’t done any serious mathematics (beyond filing tax returns) in recent years, so it has been quite an amazing ride so far with the real fun still to come. I’ve started a page on the glibc wiki to document my observations as I go along. Currently it has notes on pow and exp functions since that’s where I started. Read up and enjoy!


Making mutt send attached patches inline by default

So I moved back to mutt after I got my new laptop because I wanted to rip out as much of the desktop cruft as I possibly could. There was one feature from claws that I missed sorely though, which is the ability to send plain text attachments inline by default. If you add an attachment using the ‘a’ key in the compose window, it is set to be sent as an attachment by default. It is possible to do a ^D to change the disposition to inline, but I would keep forgetting to do that now and then, leading to pain for patch reviewers.

So after forgetting to press ^D before sending for the nth time, I have now ‘configured’ mutt to send text attachments inline with the following one-liner patch:

diff -pruN mutt-1.5.21/sendlib.c mutt-1.5.21.patched/sendlib.c
--- mutt-1.5.21/sendlib.c       2012-12-07 14:55:30.066959607 +0530
+++ mutt-1.5.21.patched/sendlib.c       2012-12-07 15:11:53.443348151 +0530
@@ -1392,6 +1392,9 @@ BODY *mutt_make_file_attach (const char
        */
       att->type = TYPETEXT;
       att->subtype = safe_strdup ("plain");
+
+      /* HACK! make text attachments inline by default.  */
+      att->disposition = 0;
     }
     else
     {

As the comment clarifies, it is a hack and is not suitable for mutt upstream. The right way to do this would be to add a muttrc clause that toggles this behaviour. Doesn’t matter to me though - I have what I need.


Malloc per-thread arenas in glibc

The malloc subsystem in glibc has had the feature of per-thread arenas for quite some time now and based on my experience, it seems to be the source of a lot of confusion. This is especially for enterprise users who move from RHEL-5 to RHEL-6 to see their apps taking up a lot more ‘memory’ and you’ll see a fair amount of content in this regard in the Red Hat Customer Portal if you’re a RHEL customer. This post is an attempt to get more perspective on this from the internal design perspective. I have not written any of the code in this implementation (barring a tiny improvement), so all of the talk about ‘original intention’ of any design is speculation on my behalf.

Background

The glibc malloc function allocates blocks of address space to callers by requesting memory from the kernel. I have written about the two syscalls glibc uses to do this, so I won’t repeat that here beyond mentioning that they two ‘places’ from where the address space is obtained are ‘arenas’ and ‘anonymous memory maps’ (referred to as anon maps in the rest of the post). The concept of interest here is the arena, which is nothing but a contiguous block of memory obtained from the kernel. The difference from the anon maps is that one anon map fulfills only one malloc request while an arena is a scratchpad that glibc maintains to return smaller blocks to the requestor. As you may have guessed, the data area (or ‘heap’) created by extending the process break (using the brk syscall) is also an arena - it is referred to as the main arena. The ‘heap’ keyword has a different meaning in relation to the glibc malloc implementation as we’ll find out later.

The Arenas

In addition to the main arena, glibc malloc allocates additional arenas. The reason for creation of arenas always seems to have been to improve performance of multithreaded processes. A malloc call needs to lock an arena to get a block from it and contention for this lock among threads is quite a performance hit. So the multiple arenas implementation did the following to reduce this contention:

  1. Firstly, the malloc call always tries to always stick to the arena it accessed the last time
  2. If the earlier arena is not available immediately (as tested by a pthread_mutex_trylock), try the next arena
  3. If we don't have uncontended access on any of the current arenas, then create a new arena and use it.

We obviously start with just the main arena. The main arena is extended using the brk() syscall and the other arenas are extended by mapping new ‘heaps’ (using mmap) and linking them. So an arena can generally be seen as a linked list of heaps.

An Arena for a Thread

The arena implementation was faster than the earlier model of using just a single arena with an on-demand model for reduction of contention in the general case. The process of detecting contention however was sufficiently slow and hence a better idea was needed. This is where the idea of having a thread stick to one arena comes in.

With the arena per-thread model, if malloc is called for the first time within a thread, a new arena is created without looking at whether the earlier locks would be contended or not. As a result, for a sane number of threads*, one can expect zero contention among threads when locking arenas since they’re all working on their own arenas.

There is a limit to the number of arenas that are created in this manner and that limit is determined based on the number of cores the system has. 32-bit systems get twice the number of cores and 64-bit systems get 8 times the number of cores. This can also be controlled using the MALLOC_ARENA_MAX environment variable.

  • A sane number of threads is usually not more than twice the number of cores. Anything more and you’ve have a different set of performance problems to deal with.

The Problem (or not)

The whole issue around the concept of arenas is the amount of ‘memory’ it tends to waste. The common complaint is that the virtual memory footprint of programs increase due to use of arenas and definitely more so due to using a different arena for each thread. The complaint is mostly bogus because of a couple of very important features that have been there for years - loads and loads of address space and demand paging.

The linux virtual memory model can allow a process to access most of its virtual memory address space (provided it is requested and mapped in). A 64-bit address space is massive and is hence not a resource that is likely to run out for any reasonable process. Add to it the fact that the kernel has mechanisms to use physical memory only for pages that are needed by the process at that time and you can rest assured that even if you map in terabytes of memory in your process, only the pages that are used will ever be accounted against you.

Both arena models rely on these facts. A mapped in arena is explicitly requested without any permissions (PROT_NONE) so that the kernel knows that it does not need any physical memory to back it. Block requests are then fulfilled by giving read+write permissions in parts. Freed blocks near the end of heaps in an arena are given back to the system either by using madvise(MADV_DONTNEED) or by explicitly unmapping.

So in the light of all of this, as an administrator or programmer, one needs to shift focus from the virtual memory column in your top output to the RSS column, since that is where the real resource usage is. Address space usage is not the same thing as memory usage. It is of course a problem if RSS is also high and in such cases one should look at the possibility of of a memory leak within to process and if that is not the problem, then fragmentation within the arenas. There are tuning options available to reduce fragmentation. Setting resource limits on address space usage is passe and it is time that better alternatives such as cgroups are given serious consideration.


Higher order functions in C?

The Structure and Interpretation of Computer Programs course has a class on higher order functions, which is the ability of a function to accept a function and/or returns another function that uses the input function. The C programming language has very limit capability to do this and it is limited to being able to accept function pointers or return function pointers. I wanted to explore this limitation further to figure out exactly how far I can stretch this, so I wrote a small C program that tries to emulate this concept. This is just random tinkering, so if the reader is looking for a specific lesson, then let me clarify that there is none.

I wrote a program to get square roots using the Newton-Raphson method – the same one that was used in the course with an attempt to keep the structure as close as possible to the scheme example in the class video. The Lisp-based listing for the course can be found on the internet and here is what I wrote:

#include <stdio.h>
#include <stdlib.h>
typedef double (*generic_func_t) (double);
#define dx 0.000001
#define absolute(n) ((n) < 0 ? -(n) : (n))
generic_func_t deriv(generic_func_t f)
{
        double deriv_func(double x)
        {
                return (f(x + dx) - f(x)) / dx;
        }
        return deriv_func;
}
double fixed_point(generic_func_t f, double x)
{
        while (absolute(f(x) - x) > dx)
                x = f(x);       
        return f(x);                            
}
double newton(generic_func_t f, double guess)
{
        double newton_func(double y)
        {
                generic_func_t df = deriv(f);
                return (y - f(y) / df (y));
        }
        return fixed_point(newton_func, guess);
}
double sqrt(double x)
{
        double sqrt_func(double y)
        {
                return (x - y*y);
        }
        return newton(sqrt_func, 1);
}
int main(int argc, char **argv)
{
        if (argc < 2) {
                printf(“Usage: %s <number>\n”, argv[0]);
                exit (1);
        }
        printf (“%lf\n”, sqrt(strtod(argv[1], NULL)));
        return 0;
}
The first thing that is evident in the above is that the syntax is non-standard. I have declared some functions within another function and that is not allowed by the C standard. GCC however implements this concept of nested functions as an extension, so this compiled cleanly for me. Running this however is a completely different thing.

Executing the above program for any argument results in segfaults and analysis of the program under gdb shows that the generated code is quite ridiculous. This is expected however, since we have pushed the gcc nested function support too far. The nested functions within gcc are OK as long as they remain within the following limits:

  1. They are only used within the function that nests them, either directly or indirectly via some function calls. Essentially, control from the nested function should return before it’s nesting function returns.
  2. If they’re used outside the nesting function and called after the nesting function returns, then they don’t use any local variables or arguments of the nesting function
Of course, if (2) is true then the function need not have been nested in the first place. Looking at the listing above, the functions sqrt_func, newton_func and deriv_func are the nested functions. Of these however, only deriv_func is actually returned as a function to be used from within newton_func. This is wrong because it breaks (2). deriv_func uses the input argument f(x) which is essentially sqrt_func and is actually called after the deriv() function returns. Other than that however, all of our ‘functional-type’ code looks sane and within what is supported by gcc.

So I modified the deriv function and came up with the listing below:

#include <stdio.h>
#include <stdlib.h>
typedef double (*generic_func_t) (double);
#define dx 0.000001
#define absolute(n) ((n) < 0 ? -(n) : (n))
double deriv(double x, generic_func_t f)
{
        return (f(x + dx) - f(x)) / dx;
}
double fixed_point(generic_func_t f, double x)
{
        while (absolute(f(x) - x) > dx)
                x = f(x);
        return f(x);
}
double newton(generic_func_t f, double guess)
{
        double newton_func(double y)
        {
                double dy = deriv(y, f);
                return (y - f(y) / dy);
        }
        return fixed_point(newton_func, guess);
}
double sqrt(double x)
{
        double sqrt_func(double y)
        {
                return (x - y*y);
        }
        return newton(sqrt_func, 1);
}
int main(int argc, char **argv)
{
        if (argc < 2) {
                printf(“Usage: %s <number>\n”, argv[0]);
                exit (1);
        }
        printf (“%lf\n”, sqrt(strtod(argv[1], NULL)));
        return 0;
}
The only change above is that instead of accepting a function and returning another in deriv, I now accept a function and return the evaluation of the derived function for the specified value. This is highlighted in bold above. This program now builds with gcc and also runs correctly. It should similarly be possible to eliminate the other nested functions to get a much more compact and standards-compliant program.


Ayttm and libyahoo2: So what's going on with them?

It’s been a while since I even looked at the code for ayttm and libyahoo2, as is evident from the fact that there have been no releases for a couple of years now. My current interests are centred around more low level OS stuff and I haven’t found the motivation to get back to these projects. I wrote my first real C code in these projects and a lot of my current knowledge will find its roots in these projects, so I’m really grateful to have gotten a chance to contribute.

So this is pretty much a (very late) confession that I am no longer writing code for these projects and they are likely dead once again. I am sure new contributors/maintainers will be welcome if they show their interest in the form of patches. Thanks Philip for giving me the chance to write some real and amazing code!


Comments are closed.