
Before we start. ⎇
Last week has been quite eventful to me. I taught my first university lecture on information theory. I had to fight my E16 a bit to let me actually finish my slide deck and wrote a short blog post at 5:00 a.m. out of frustration. Then I went to sleep, and by the time I woke up I found out that I am in the news...
If you're here looking for my work, keep searching. The E16 bug is definitely not as serious as the news outlets like to claim. But actual ground-breaking scientific work rarely breaks into mainstream, I'm genuinely as surprised as you are. I have done nothing to end up being famous for a window manager. It would be great if news sites reported as vividly on, for example, that I work on genomics and produced the best (Pareto-optimal) available digital genome storage solution or that I did some cool stuff when i was 16 or wrote a SoTA data compressor that is seemingly used by CERN and FitGirl Repacks (oh irony!) during my Abitur exams. But none of these stories sell.
I am also eternally grateful to many men who have been kind enough to spend their time bitching about me online for something I had no control over, to the weirdos who complain that I don't spend my time getting pregnant or that I look ugly. Doing God's job, really. I don't need to be told that "one day I will get it" because the media reported on my 20 page bug fix. I probably deserve some punishment for the sins that I have committed in my life, but having to read the opinions of Polish people online is a disproportionate one. If you're so smart, look at my things that actually can't reasonably break into the mainstream. And if you're also (regrettably) living in Poland and wondering why Poles abroad want - on average - nothing to do with you, imagine how my e-mail inbox looks like.
Anyway...
Introduction ⎇
Andrzej proposes a curious operator, Exp-Minus-Log, defined as
Much ink has been spilled about it. You can probably read about this sensation online. But the core claim is that you can use EML to compute most elementary functions, so it could be a two-variable basis function for some type of a calculator.
This looks deceptively simple to compute and I can't wait for people to tell me to pull libm and try subtracting the exps and logs. Not so simple. The C standard library typically gives you both to 1 ULP each, but the trouble is the word difference. When , the subtraction cancels most of the significant bits, and the relative error in the result can explode to thousands of ULPs. This is catastrophic cancellation, and it is the central enemy of this post.
I'll walk you through some progressively cleverer implementations that I've looked into, benchmark them against __float128 reference values over 20mln random inputs, and explain the IEEE 754 and algebraic tricks that make the final version work.
All code in this article requires a C compiler with hardware FMA support (-march=native or -mfma), IEEE 754 binary64 arithmetic (essentially any modern CPU), and linking against -lm.
The Problem ⎇
exp and log are each faithfully rounded by glibc/musl on x86-64: the returned double differs from the mathematical value by at most 1 ULP. So each operand carries at most relative error (mantissa bits).
Subtraction typically does not introduce new rounding error (Sterbenz's lemma). However, it amplifies existing error. If and
agree in their top
bits, then
, and the 1 ULP error in each operand becomes
ULP in the result.
For , that is half a million ULP we get a result with only about 33 correct bits out of 53. Not very useful. Some would anyway write:
double eml(double x, double y) {
return exp(x) - log(y);
}This is the baseline. It works perfectly when there is little cancellation, and catastrophically when there is. Over random
pairs with
,
, measured against
__float128:
| Metric | Value |
|---|---|
| ns/call | 12.4 |
| worst ULP | 851,150 |
| 0 ULP | 69.913% |
| below 1 ULP | 97.659% |
| below 2 ULP | 98.638% |
| above 1024 ULP | 673 (0.0034%) |
The vast majority of inputs are fine. But 1.4% of cases exceed 2 ULP, and the worst case has fewer than 33 correct bits.
We can start tackling the problem by reusing the fairly popular trick of double-double arithmetic. Such a number is a pair of IEEE 754 doubles satisfying
, representing the unevaluated sum
. This gives roughly 106 bits of significand, so enough headroom to survive 20+ bits of cancellation with precision to spare.
We have some key building blocks, all of which compile to a handful of instructions with FMA: Knuth's 2Sum and Dekker's 2Prod. Given , we produce
such that
exactly, or
such that
exactly.
static inline void two_sum(double a, double b, double * s, double * t) {
*s = a + b;
double v = *s - a;
*t = (a - (*s - v)) + (b - v);
}
static inline void two_prod(double a, double b, double * p, double * e) {
*p = a * b;
*e = fma(a, b, -*p);
}So now the basic approach would be to compute and
each as a double-double, subtract in extended precision and round once.
exp/log in double-double domain ⎇
For , the arguments are reduced via
, where
. The reduction is carried out in double-double using FMA product residuals:
double p = kf * L_HI;
double p_err = fma(kf, L_HI, -p); /* residual of kf * L_HI */
double r_hi, r_lo;
two_sum(x, -p, &r_hi, &r_lo); /* exact subtraction */
r_lo -= p_err + kf * L_LO;Then is approximated by an 8-term Estrin polynomial (serial depth 3 FMAs instead of 8 for Horner), with the
tail kept as a double-double via
two_prod:
double A = fma(r, 1.0/6.0, 0.5);
double B = fma(r, 1.0/120.0, 1.0/24.0);
double C = fma(r, 1.0/5040.0, 1.0/720.0);
double D = fma(r, 1.0/362880.0, 1.0/40320.0);
double AB = fma(r2, B, A);
double CD = fma(r2, D, C);
double c = fma(r4, CD, AB);The result is scaled by a 64-entry precomputed table (stored as double-double,
bits each) and then by
via IEEE exponent injection:
double S = u2d((uint64_t)(n + 1023) << 52); /* 2^n, no multiply */
*hi = s * S;
*lo = e * S;For , the input is decomposed as
by extracting IEEE exponent bits, and
is reduced via a 64-entry table
. The kernel is an atanh series on
where
.
One critical subtlety is that the denominator must be kept as a double-double. Without this, its
rounding error propagates through
and dominates the result:
double d_hi, d_lo;
two_sum(m, F, &d_hi, &d_lo);
double s_hi = f / d_hi;
double err = fma(-s_hi, d_hi, f); /* exact: f - s_hi * d_hi */
err -= s_hi * d_lo; /* correct for d_lo */
double s_lo = err / d_hi;The final subtraction is a double-double subtract followed by a single rounding:
two_sum(eh, -lh, &sh, &sl);
sl += el - ll;
return sh + sl;| Metric | Value |
|---|---|
| ns/call | 27.1 |
| worst ULP | 7 |
| 0 ULP | 99.999% |
| below 1 ULP | 100.000% |
| above 1024 ULP | 0 |
This is accurate but expensive: on my machine (Ryzen 7 PRO 7840U), about 2.2 times the naive (glibc) cost. It also costs us two 64-entry tables (2 KB total). The 7 ULP worst case comes from inputs with ~19 bits of cancellation, right at the limit of our ~74-bit intermediates.
The elegant solution ⎇
The unspoken key insight to evaluating eml, from my experiments, seems to be the following. When , the identity
follows directly from , and therefore
.
This is actually a really good way to tackle the problem, because expm1 by design handles the cancellation that exp(x) - 1 would otherwise suffer near zero. The problematic subtraction is absorbed into a dedicated function that already knows how to deal with it.
The identity reduces the problem to computing the argument accurately. When cancellation is severe, then of course
, and both
and
are
, so even a small error in
gets amplified by
.
This is why the slow path still needs double-double; we call log_dd(y) to get , then
log_dd(t_h) to get , and reduce:
double dh, dl;
two_sum(x, -uh, &dh, &dl); /* exact by Sterbenz (x approxeq. uh) */
dl -= ul;Then a Taylor-corrected expm1 on the double-double :
The remainder is
, negligible. The final assembly uses FMA for an exact product residual:
double em = expm1(dh);
double corr = (1.0 + em) * dl; /* = exp(dh) * dl */
double ph = th * em;
double pl = fma(th, em, -ph); /* exact residual */
return ph + (pl + th * corr - tl);Contrariwise, for of inputs, there is no significant cancellation. In this regime, the naive
exp(x) - log(y) is already ULP. The guard condition is thus a single comparison after the subtract:
double e = exp(x);
double diff = e - t;
double mag = e > t ? e : t;
if (fabs(diff) * 2.0 >= mag) /* < 1 bit cancelled */
return diff;
return eml_slow(x, y); /* dd path, cold */Hence, the full code is as follows:
#include <math.h>
#include <stdint.h>
#include <string.h>
static inline uint64_t d2u(double d) {
uint64_t u; memcpy(&u, &d, 8); return u;
}
static inline double u2d(uint64_t u) {
double d; memcpy(&d, &u, 8); return d;
}
static inline void two_sum(double a, double b,
double * s, double * t) {
*s = a + b;
double v = *s - a;
*t = (a - (*s - v)) + (b - v);
}
/*
* 64-entry table: LT[j] = ln(1 + j/64) as (hi, lo).
*/
static const double LT_hi[64] = { 0x0.0p+0, 0x1.fc0a8b0fc03e4p-7, 0x1.f829b0e783300p-6, 0x1.77458f632dcfcp-5, 0x1.f0a30c01162a6p-5, 0x1.341d7961bd1d1p-4, 0x1.6f0d28ae56b4cp-4, 0x1.a926d3a4ad563p-4, 0x1.e27076e2af2e6p-4, 0x1.0d77e7cd08e59p-3, 0x1.29552f81ff523p-3, 0x1.44d2b6ccb7d1ep-3, 0x1.5ff3070a793d4p-3, 0x1.7ab890210d909p-3, 0x1.9525a9cf456b4p-3, 0x1.af3c94e80bff3p-3, 0x1.c8ff7c79a9a22p-3, 0x1.e27076e2af2e6p-3, 0x1.fb9186d5e3e2bp-3, 0x1.0a324e27390e3p-2, 0x1.1675cababa60ep-2, 0x1.22941fbcf7966p-2, 0x1.2e8e2bae11d31p-2, 0x1.3a64c556945eap-2, 0x1.4618bc21c5ec2p-2, 0x1.51aad872df82dp-2, 0x1.5d1bdbf5809cap-2, 0x1.686c81e9b14afp-2, 0x1.739d7f6bbd007p-2, 0x1.7eaf83b82afc3p-2, 0x1.89a3386c1425bp-2, 0x1.947941c2116fbp-2, 0x1.9f323ecbf984cp-2, 0x1.a9cec9a9a084ap-2, 0x1.b44f77bcc8f63p-2, 0x1.beb4d9da71b7cp-2, 0x1.c8ff7c79a9a22p-2, 0x1.d32fe7e00ebd5p-2, 0x1.dd46a04c1c4a1p-2, 0x1.e744261d68788p-2, 0x1.f128f5faf06edp-2, 0x1.faf588f78f31fp-2, 0x1.02552a5a5d0ffp-1, 0x1.0723e5c1cdf40p-1, 0x1.0be72e4252a83p-1, 0x1.109f39e2d4c97p-1, 0x1.154c3d2f4d5eap-1, 0x1.19ee6b467c96fp-1, 0x1.1e85f5e7040d0p-1, 0x1.23130d7bebf43p-1, 0x1.2795e1289b11bp-1, 0x1.2c0e9ed448e8cp-1, 0x1.307d7334f10bep-1, 0x1.34e289d9ce1d3p-1, 0x1.393e0d3562a1ap-1, 0x1.3d9026a7156fbp-1, 0x1.41d8fe84672aep-1, 0x1.4618bc21c5ec2p-1, 0x1.4a4f85db03ebbp-1, 0x1.4e7d811b75bb1p-1, 0x1.52a2d265bc5abp-1, 0x1.56bf9d5b3f399p-1, 0x1.5ad404c359f2dp-1, 0x1.5ee02a9241675p-1};
static const double LT_lo[64] = { 0x0.0p+0, -0x1.83092c59642a1p-62, 0x1.33e3f04f1ef23p-60, 0x1.18d3ca87b9296p-59, 0x1.85f325c5bbacdp-59, -0x1.b599f227becbbp-58, -0x1.906d99184b992p-58, 0x1.942f48aa70ea9p-58, -0x1.61578001e0162p-60, 0x1.9a5dc5e9030acp-57, 0x1.301771c407dbfp-57, 0x1.9f4f6543e1f88p-57, -0x1.bc60efafc6f6ep-58, 0x1.be36b2d6a0608p-59, 0x1.d904c1d4e2e26p-57, -0x1.398cff3641985p-58, -0x1.4f689f8434012p-57, -0x1.61578001e0162p-59, -0x1.caaae64f21acbp-57, 0x1.7dcfde8061c03p-56, 0x1.ce63eab883717p-61, -0x1.76f5eb09628afp-56, -0x1.8f4cdb95ebdf9p-56, -0x1.c68651945f97cp-57, 0x1.f42decdeccf1dp-56, 0x1.3927ac19f55e3p-59, 0x1.4236383dc7fe1p-56, -0x1.ddea0f7f58e3dp-57, -0x1.8c76ceb014b04p-56, 0x1.92ce979ed2950p-56, -0x1.29639dfbbf0fbp-56, -0x1.16cc8bae0bbe4p-56, -0x1.a92e513217f5cp-59, -0x1.cadec02b436afp-56, -0x1.cd04495459c78p-56, -0x1.0f3c590a887cap-59, -0x1.4f689f8434012p-56, 0x1.877b232fafa37p-56, -0x1.0467656d8b892p-56, -0x1.c825c90c344b9p-58, -0x1.328df13bb38c3p-56, -0x1.328260d8abca0p-57, -0x1.cb1cb51408c00p-56, 0x1.395e58e2445bbp-55, -0x1.259da11330801p-55, -0x1.0e09b27a4373ap-60, -0x1.59c33171a6876p-55, -0x1.9d1a11443f10cp-56, 0x1.ef62cd2f9f1e3p-56, -0x1.f48725e374d6ep-55, -0x1.487c0c246978ep-57, -0x1.1a158f3917586p-55, 0x1.fb590a1f566dap-57, 0x1.6eb92d885ce4fp-57, -0x1.58eef67f2483ap-55, -0x1.6fef670bd4b62p-55, 0x1.9192f30bd1806p-55, 0x1.f42decdeccf1dp-55, 0x1.13dfa3d3761b6p-60, -0x1.8d3d9ea6e9ea9p-55, -0x1.1883750ea4d0ap-57, 0x1.0471885cd8ff3p-55, -0x1.35955683f7196p-59, 0x1.c358257f49082p-55};
static void log_dd(double y, double * hi, double * lo) {
const double LN2_HI = 0x1.62e42fefa39efp-1;
const double LN2_LO = 0x1.abc9e3b39803fp-56;
uint64_t bits = d2u(y);
int biased = (int)((bits >> 52) & 0x7FF);
int subnorm = (biased == 0);
if (__builtin_expect(subnorm, 0)) {
y *= 0x1.0p52;
bits = d2u(y);
biased = (int)((bits >> 52) & 0x7FF);
}
int e = biased - 1023 - (subnorm ? 52 : 0);
uint64_t m_bits = (bits & 0x000FFFFFFFFFFFFFull)
| 0x3FF0000000000000ull;
double m = u2d(m_bits);
int j = (int)((m_bits >> 46) & 0x3F);
double F = 1.0 + (double)j * 0x1.0p-6;
double f = m - F;
/* dd denominator: avoids 2^-51 contamination */
double d_hi, d_lo;
two_sum(m, F, &d_hi, &d_lo);
double s_hi = f / d_hi;
double err = fma(-s_hi, d_hi, f) - s_hi * d_lo;
double s_lo = err / d_hi;
double s2 = s_hi * s_hi, s4 = s2 * s2;
double Aq = fma(s2, 1.0/5.0, 1.0/3.0);
double Bq = fma(s2, 1.0/9.0, 1.0/7.0);
double q = fma(s4, Bq, Aq);
double log_m_hi = 2.0 * s_hi;
double log_m_tail = 2.0 * fma(s_hi, s2 * q, s_lo);
double a, b;
two_sum(LT_hi[j], log_m_hi, &a, &b);
b += LT_lo[j] + log_m_tail;
double ef = (double)e;
double eln2_hi = ef * LN2_HI;
double eln2_lo = fma(ef, LN2_HI, -eln2_hi)
+ ef * LN2_LO;
double c, d;
two_sum(eln2_hi, a, &c, &d);
d += eln2_lo + b;
two_sum(c, d, hi, lo);
}
static __attribute__((noinline, cold))
double eml_slow(double x, double y) {
double th, tl;
log_dd(y, &th, &tl);
double uh, ul;
log_dd(th, &uh, &ul);
double dh, dl;
two_sum(x, -uh, &dh, &dl);
dl -= ul;
double em = expm1(dh);
double corr = (1.0 + em) * dl;
double ph = th * em;
double pl = fma(th, em, -ph);
return ph + (pl + th * corr - tl);
}
double eml(double x, double y) {
// IEEE754 nonsense.
if (__builtin_expect(x != x || y != y, 0))
return x + y;
if (__builtin_expect(y < 0.0, 0)) return NAN;
if (__builtin_expect(y == 0.0, 0)) return INFINITY;
if (__builtin_expect(y == INFINITY, 0))
return -INFINITY;
if (__builtin_expect(x >= 0x1.62e42fefa39efp+9, 0))
return INFINITY;
// Hot path.
double t = log(y);
if (t <= 0.0)
return exp(x) - t;
double e = exp(x);
double diff = e - t;
double mag = e > t ? e : t;
if (__builtin_expect(fabs(diff) * 2.0 >= mag, 1))
return diff;
return eml_slow(x, y);
}Finally, the results are again very good:
| Metric | Value |
|---|---|
| ns/call | 21.4 |
| worst ULP | 2 |
| 0 ULP | 73.016% |
| below 1 ULP | 99.958% |
| below 2 ULP | 100.000% |
| above 1024 ULP | 0 |
A comparison of the methods tested:
| Naive | DD brute-force | new identity | |
|---|---|---|---|
| ns/call | 12.4 | 27.1 | 21.4 |
| worst ULP | 851,150 | 7 | 2 |
| tables | 0 | 2 KB (exp/log) | 1 KB (log only) |
| code | 1 line | ~200 lines | ~130 lines |
All benchmarks ran on x86-64 with glibc, GCC -O2 -march=native, measured against libquadmath's 128-bit reference (authoritative source) over uniformly random
pairs with
and
.
