Mailing List Archive

NWCLARK TPF grant report #33
[Hours] [Activity]
2012/04/16 Monday
0.25 RT #112478
0.25 pp_pow
7.00 undefined behaviour caught by gcc -ftrapv

2012/04/17 Tuesday
0.25 HP-UX cc bug
1.25 RT #112494
1.25 pp_pow
2.50 reading/responding to list mail
0.75 sfio
0.25 undefined behaviour caught by gcc -ftrapv

2012/04/18 Wednesday
0.50 RT #40652
4.50 reading/responding to list mail
0.75 review davem/re_eval
2.50 sfio
1.25 the todo list

2012/04/19 Thursday
1.50 Solaris -xO3 failure
2.75 reading/responding to list mail
2.75 sfio

2012/04/20 Friday
2.00 RT #112504
2.25 RT #24250
1.25 reading/responding to list mail

Which I calculate is 36.75 hours

The most significant activity this week was investigating what gcc's -ftrapv
flag reveals about the prevalence of signed integer overflow in the core C
code. Unsigned integer overflow in C is well defined - it wraps. *Signed*
integer overflow, however, is undefined behaviour (not just implementation
defined or unspecified), so really isn't something we should be doing. Not
having to care about undefined behaviour gives C compilers the loopholes
they need to optimise conformant code. Whilst right now compilers usually
end up exposing the behaviour of the underlying CPU (which is nearly always
2s complement these days), the core's code is making that assumption, and
should not.

The -ftrapv flag for gcc changes all operations that might cause signed
overflow to trap on the undefined behaviour and abort(). This, of course, is
*still* a conformant compiler, because undefined behaviour is, well,
undefined. (Be glad that it didn't format your hard disk. That's fair game
too, although one hopes that the operating system stops that.)

It turns out that there are quite a lot of bad assumptions in the code about
signed integer overflow. These seem to fall into three groups:

0) The code used to implement the arithmetic operators under use integer;

1) The assumption that the signed bit pattern for the most negative signed
value, IV_MIN, is identical to the bit pattern for the unsigned negation
of that value. ie on a 32 bit system, the IV -2147483648 and the
UV 2147483648 have the same bit representation

2) A lot of code in the regular expression engine uses the type I32

The regular expression engine code is complex. It only causes aborts under
-ftrapv on a 32 bit system, probably because on 64 bit systems type
promotion widens the problematic expressions, such that the values don't
overflow.
However, it's clear that in places the engine is relying on sentinel values
such as -1, and it's not clear whether these are tightly coupled with
pointer arithmetic, so a brute force approach of trying to "fix" the problem
by converting I32 to SSize_t is likely to cause far more bugs than it fixes.

Sadly the use integer code is doing exactly what it's documented to do -
"native integer arithmetic (as provided by your C compiler) is used". And
as your C compiler's integer arithmetic is undefined on signed overflow, so
will your Perl code be. So we have to accept that this is going to raise the
ire of -ftrapv. However, this causes problems when testing, as code to test
it will abort, and it's not possible to make exceptions from -ftrapv on a
function by function basis. So the best solution seems to be to break out
the integer ops into their own file, and set the C compiler flags specially
for that file (which we already have the infrastructure for).

The middle set of code turns out to be relatively easy to fix. In part it
can be done by changing conversions between IV_MIN and its negation from
obvious-but-overflowing terse expressions to slightly longer expressions
that sidestep the overflow. Most of the rest can be fixed by bounds checking
- negating values in range, and explicitly using IV_MIN when negating the
value 1-(IV_MIN+1). One surprise was the code used to avoid infinite looping
when the end point of a .. range is IV_MAX, which needs a special case, but
the special case in turn was written assuming signed integer overflow. That
code is fixed in smoke-me/iter-IV_MAX, the rest in smoke-me/ftrapv, and some
related cleanup of pp_pow in smoke-me/pp_pow. All will be merged to blead
after 5.16.0 is released.
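
The bounds-checking idiom can be sketched as follows. negate_uv is a
hypothetical helper, not a function from the core, and IV/UV are modelled
with C99 fixed-width types; note that IV_MIN+1 is -IV_MAX, so 1-(IV_MIN+1)
is IV_MAX+1, i.e. the magnitude of IV_MIN, computed without overflowing:

```c
/* Negate a UV into an IV without ever evaluating -IV_MIN.
 * Hypothetical helper illustrating the bounds-checking approach. */
#include <stdint.h>

typedef int64_t  IV;   /* stand-ins for a 64 bit build's types */
typedef uint64_t UV;
#define MY_IV_MIN INT64_MIN
#define MY_IV_MAX INT64_MAX

/* Return -uv as an IV; on overflow set *ok to 0 and return 0. */
static IV negate_uv(UV uv, int *ok)
{
    *ok = 1;
    if (uv <= (UV)MY_IV_MAX)
        return -(IV)uv;           /* in range: plain negation is safe */
    if (uv == (UV)MY_IV_MAX + 1)  /* the magnitude of IV_MIN */
        return MY_IV_MIN;         /* the one special case */
    *ok = 0;
    return 0;
}
```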

To continue this, I investigated using CompCert to compile Perl.
CompCert is a "verified compiler, a high-assurance compiler for almost all
of the ISO C90 / ANSI C language, generating efficient code for the PowerPC,
ARM and x86 processors."

However, its description of "almost all" is accurate - it doesn't include
variable argument lists, which is kind of a show stopper for us. However, as
an experiment I tried running Configure with it to see what happened.
Result! We generate an invalid config.h file. So that's now RT #112494, with
a fix ready to go in post-5.16.0.

I also dug further into the cause of the mysterious 8192 byte bug with sfio.
I wanted to be sure it wasn't the only known symptom of some mysterious core
buffering bug. Based on some historical anomalous CPAN smoker results, I'm
suspicious that we may well have one. It turns out, fortunately, that in
this case it's not our bug. The cause was far more interesting - something
I'd never even considered. In C, a pointer to the element immediately after
an array is valid for pointer arithmetic. (But, obviously, not the value it
points to. So don't dereference it.) The usual use of this is to store a
buffer end pointer - increment the active pointer, and if it's equal to the
buffer end, do something.

However, what's not obvious from that is that the memory address immediately
after the array might *also* be the start of some other array that the
program holds a pointer to. The bug was that sfio's "do something" (ie empty
the write buffer) was deferred until the next call that wrote to the
stream. However, ahead of that check was a check related to sfreserve(),
which permits user code to write directly into the buffer. This is
implemented by handing out the internal "next write" pointer's value,
and the user code signals that it is done by calling sfwrite() with this
(internal) pointer. The special case was intended to express "is the passed
in write-from pointer identical to the internal next write pointer?"
However, that check breaks if it just happens that the internal buffer is
full ("next write" points one beyond the buffer), *and* the user's buffer
just happens to be in the very next byte of memory. Which turns out to be
possible on FreeBSD, where malloc() is able to allocate contiguous blocks of
memory. So, fortunately, this isn't our bug. I've reported it to Phong Vo,
who confirms that it's a bug in sfio.

Nicholas Clark