Mailing List Archive

Re: SMP performance degradation with sysbench
Anton Blanchard wrote:
>
> Hi Nick,
>
>
>>Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>>at this, send me a mail (eg. especially with the sched_setscheduler issue,
>>you might be able to do something better).
>
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library; replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Hi Anton,

Very cool. Yeah I had come to the conclusion that it wasn't a kernel
issue, and basically was afraid to look into userspace ;)

That bogus setscheduler thing must surely have never worked, though.
I wonder if FreeBSD avoids the scalability issue because it is using
SCHED_RR there, or because it has a decent threaded malloc implementation.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
Anton Blanchard wrote:
>
> Hi Nick,
>
>> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>> at this, send me a mail (eg. especially with the sched_setscheduler issue,
>> you might be able to do something better).
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library; replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Hi Anton, thanks for the report.
glibc certainly has many scalability problems.

One known problem is its (ab)use of mmap() to allocate one (yes: one!)
page every time you fopen() a file, and then a munmap() at fclose() time.


mmap()/munmap() should be avoided like the plague in multithreaded programs.
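
A quick way to check this claim on a given glibc is to loop fopen()/fclose()
under strace; a minimal sketch (any readable file works, and the mmap counts
can include unrelated mappings):

/* fopen_loop.c: give strace something to count.
 * Build: gcc -O2 fopen_loop.c -o fopen_loop
 * Run:   strace -c -e trace=mmap,munmap ./fopen_loop
 * If the stream buffer really is mmap()ed, the mmap/munmap counts
 * track the number of fopen()/fclose() calls. */
#include <stdio.h>

int main(void)
{
    int i;

    for (i = 0; i < 1000; i++) {
        FILE *f = fopen("/etc/hostname", "r");
        if (f) {
            fgetc(f);   /* force the stream buffer to be allocated */
            fclose(f);
        }
    }
    return 0;
}
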
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
> Hi Anton,
>
> Very cool. Yeah I had come to the conclusion that it wasn't a kernel
> issue, and basically was afraid to look into userspace ;)

btw, regardless of what glibc is doing, the cpu still shouldn't go
idle IMHO. Even if we're overscheduling and thrashing on the mmap_sem
with threads (no idea if other OSes schedule the task away when they
find the other cpu in the mmap critical section), or if we're
overscheduling with futex locking, the cpu usage should remain 100%
system time in the worst case. The only legitimate explanation for going
idle would be on HT cpus, where HT may hurt more than help, but
on real multicore it shouldn't happen.
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
>
>>Hi Anton,
>>
>>Very cool. Yeah I had come to the conclusion that it wasn't a kernel
>>issue, and basically was afraid to look into userspace ;)
>
>
> btw, regardless of what glibc is doing, the cpu still shouldn't go
> idle IMHO. Even if we're overscheduling and thrashing on the mmap_sem
> with threads (no idea if other OSes schedule the task away when they
> find the other cpu in the mmap critical section), or if we're
> overscheduling with futex locking, the cpu usage should remain 100%
> system time in the worst case. The only legitimate explanation for going
> idle would be on HT cpus, where HT may hurt more than help, but
> on real multicore it shouldn't happen.
>

Well ignoring the HT issue, I was seeing lots of idle time simply
because userspace could not keep up enough load to the scheduler.
There simply were fewer runnable tasks than CPU cores.

But it wasn't a case of all CPUs going idle, just most of them ;)

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
> Well ignoring the HT issue, I was seeing lots of idle time simply
> because userspace could not keep up enough load to the scheduler.
> There simply were fewer runnable tasks than CPU cores.

When you said idle I thought idle, and not waiting for I/O. Waiting for
I/O would hardly be a kernel issue ;). If they're not waiting for I/O
and they're not sleeping in userland with nanosleep/pause, the cpu
shouldn't go idle. Even if they're calling sched_yield in a loop, the
cpu should account for zero idle time as far as I can tell.
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
>
>>Well ignoring the HT issue, I was seeing lots of idle time simply
>>because userspace could not keep up enough load to the scheduler.
>>There simply were fewer runnable tasks than CPU cores.
>
>
> When you said idle I thought idle, and not waiting for I/O. Waiting for
> I/O would hardly be a kernel issue ;). If they're not waiting for I/O
> and they're not sleeping in userland with nanosleep/pause, the cpu
> shouldn't go idle. Even if they're calling sched_yield in a loop, the
> cpu should account for zero idle time as far as I can tell.

Well it wasn't iowait time. From Anton's analysis, I would probably
say it was time waiting for either the glibc malloc mutex or MySQL
heap mutex.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
> Well it wasn't iowait time. From Anton's analysis, I would probably
> say it was time waiting for either the glibc malloc mutex or MySQL
> heap mutex.

So it again makes little sense to me that this is idle time, unless
some userland mutex has a usleep in the slow path, which would be very
wrong; in the worst case they should yield() (yield can still waste
lots of cpu if two tasks in the slow path call it while the holder
is not scheduled, but at least it wouldn't be idle time).

Idle time points to either a kernel issue in the scheduler or some
userland inefficiency (the latter sounds more likely).
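
The accounting difference at issue can be sketched with two toy slow paths
(hypothetical code, not MySQL's or glibc's): a waiter that blocks ends up in
futex_wait and shows up as idle time, while a waiter that spins on
sched_yield() stays runnable and is accounted as busy:

/* Two ways to wait for a contended lock; toy sketch only. */
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Blocks in the kernel (futex_wait): the cpu can go idle. */
static void wait_blocking(void)
{
    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
}

/* Stays runnable, burning cpu in the scheduler: zero idle time. */
static void wait_yielding(void)
{
    while (pthread_mutex_trylock(&lock) != 0)
        sched_yield();
    pthread_mutex_unlock(&lock);
}
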
Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
>
>>Well it wasn't iowait time. From Anton's analysis, I would probably
>>say it was time waiting for either the glibc malloc mutex or MySQL
>>heap mutex.
>
>
> So it again makes little sense to me that this is idle time, unless
> some userland mutex has a usleep in the slow path, which would be very
> wrong; in the worst case they should yield() (yield can still waste
> lots of cpu if two tasks in the slow path call it while the holder
> is not scheduled, but at least it wouldn't be idle time).

They'll be sleeping in futex_wait in the kernel, I think. One thread
will hold the critical mutex, some will be off doing their own thing,
but importantly there will be many sleeping for the mutex to become
available.

> Idle time points to either a kernel issue in the scheduler or some
> userland inefficiency (the latter sounds more likely).

That is what I first suspected, because the dropoff appeared to happen
exactly after we saturated the CPU count: it seems like a scheduler
artifact.

However, I tested with a bigger system and actually the idle time
comes before we saturate all CPUs. Also, increasing the aggressiveness
of the load balancer did not drop idle time at all, so it is not a case
of some runqueues idle while others have many threads on them.


I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
glibc allocator. But I wonder if there are other improvements that glibc
can do here?

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

I cooked up a patch some time ago to speed up threaded apps and got no feedback.

http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32-core cpus before thinking about cache line
bouncing...

Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
> They'll be sleeping in futex_wait in the kernel, I think. One thread
> will hold the critical mutex, some will be off doing their own thing,
> but importantly there will be many sleeping for the mutex to become
> available.

The initial assumption was that there was zero idle time with threads
= cpus, and the idle time showed up only when the number of threads
increased to double the number of cpus. If the idle time didn't
increase with the number of threads, nothing would be suspect.

> However, I tested with a bigger system and actually the idle time
> comes before we saturate all CPUs. Also, increasing the aggressiveness
> of the load balancer did not drop idle time at all, so it is not a case
> of some runqueues idle while others have many threads on them.

It'd be interesting to see the sysrq+t after the idle time
increased.

> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

My wild guess is that they're allocating memory after taking
futexes. If they do, something like this will happen:

taskA                 taskB                 taskC
user lock
                      mmap_sem lock
mmap_sem -> schedule
                                            user lock -> schedule

If taskB weren't there triggering more random thrashing on the
mmap_sem, the lock holder wouldn't wait, and taskC wouldn't wait either.

I suspect the real fix is not to allocate memory or to run other
expensive syscalls that can block inside the futex critical sections...
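
The suspected pattern, as a hypothetical sketch (heap_mutex and the
bookkeeping comments are invented stand-ins for whatever MySQL protects):

/* Anti-pattern: malloc() inside a user-level critical section.
 * If malloc() grows or trims a heap it can block on mmap_sem in the
 * kernel while still holding heap_mutex, so every thread waiting on
 * heap_mutex transitively waits on mmap_sem too. */
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t heap_mutex = PTHREAD_MUTEX_INITIALIZER; /* invented */

void *alloc_inside_lock(size_t size)
{
    void *p;

    pthread_mutex_lock(&heap_mutex);
    p = malloc(size);               /* may mmap/mprotect -> mmap_sem */
    /* ... update shared bookkeeping ... */
    pthread_mutex_unlock(&heap_mutex);
    return p;
}

/* The fix being suggested: allocate first, hold the lock only for
 * the shared-state update. */
void *alloc_outside_lock(size_t size)
{
    void *p = malloc(size);         /* mmap_sem taken without user lock */

    pthread_mutex_lock(&heap_mutex);
    /* ... update shared bookkeeping ... */
    pthread_mutex_unlock(&heap_mutex);
    return p;
}
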
Re: SMP performance degradation with sysbench
Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>
>>I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
>>glibc allocator. But I wonder if there are other improvements that glibc
>>can do here?
>
>
> I cooked up a patch some time ago to speed up threaded apps and got no feedback.

Well that doesn't help in this case. I tested and the mmap_sem contention
is not an issue.

> http://lkml.org/lkml/2006/8/9/26
>
> Maybe we have to wait for 32-core cpus before thinking about cache line
> bouncing...

The idea is a good one, and I was half way through implementing similar
myself at one point (some java apps hit this badly).

It is just horribly sad that futexes are supposed to implement a
_scalable_ thread synchronisation mechanism, whilst fundamentally
relying on an mm-wide lock to operate.

I don't like your interface, but then again, the futex interface isn't
exactly pretty anyway.

You should resubmit the patch, and get the glibc guys to use it.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:

> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
> taskA                 taskB                 taskC
> user lock
>                       mmap_sem lock
> mmap_sem -> schedule
>                                             user lock -> schedule
>
> If taskB weren't there triggering more random thrashing on the
> mmap_sem, the lock holder wouldn't wait, and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...

glibc malloc uses arenas, and trylock() only. It should not block, because if
an arena is already locked, the thread automatically chooses another arena, and
might create a new one if necessary.

But yes, mmap_sem contention is a big problem, because it's also taken by
the futex code (unfortunately)
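
A simplified, self-contained sketch of the arena selection Eric describes
(the real logic is arena_get()/arena_get2() in glibc's malloc/arena.c; the
names below are paraphrased):

/* Paraphrased arena selection: trylock the cached arena, walk the
 * circular arena list on contention, create a new arena as a last
 * resort. Not the actual glibc code. */
#include <pthread.h>

struct arena {
    pthread_mutex_t mutex;
    struct arena *next;                      /* circular list of arenas */
};

static __thread struct arena *thread_arena;  /* per-thread cached arena */

static struct arena *create_new_arena(void); /* sketch: mmap a fresh heap */

struct arena *arena_get_sketch(void)
{
    struct arena *start = thread_arena;
    struct arena *a;

    if (pthread_mutex_trylock(&start->mutex) == 0)
        return start;                        /* fast path: never blocks */

    for (a = start->next; a != start; a = a->next)
        if (pthread_mutex_trylock(&a->mutex) == 0) {
            thread_arena = a;                /* remember it for next time */
            return a;
        }

    return create_new_arena();               /* everything busy: grow */
}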

Re: SMP performance degradation with sysbench
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
>
>>They'll be sleeping in futex_wait in the kernel, I think. One thread
>>will hold the critical mutex, some will be off doing their own thing,
>>but importantly there will be many sleeping for the mutex to become
>>available.
>
>
> The initial assumption was that there was zero idle time with threads
> = cpus, and the idle time showed up only when the number of threads
> increased to double the number of cpus. If the idle time didn't
> increase with the number of threads, nothing would be suspect.

Well I think more threads ~= more probability that this guy is going to
be preempted while holding the mutex?

This might be why FreeBSD works much better, because it looks like MySQL
actually will set RT scheduling for those processes that take critical
resources.
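
For reference, boosting a lock holder to realtime priority looks roughly
like this (a sketch of the technique, not MySQL source; on Linux it fails
with EPERM without root/CAP_SYS_NICE, which may be part of why the
setscheduler call "never worked"):

/* Sketch: raise the calling thread to SCHED_RR around a critical
 * resource, then drop back to the normal policy. */
#include <pthread.h>
#include <sched.h>

static int boost_to_rr(void)
{
    struct sched_param sp = { .sched_priority = 1 };
    return pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
}

static int drop_to_normal(void)
{
    struct sched_param sp = { .sched_priority = 0 };
    return pthread_setschedparam(pthread_self(), SCHED_OTHER, &sp);
}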

>>However, I tested with a bigger system and actually the idle time
>>comes before we saturate all CPUs. Also, increasing the aggressiveness
>>of the load balancer did not drop idle time at all, so it is not a case
>>of some runqueues idle while others have many threads on them.
>
>
> It'd be interesting to see the sysrq+t after the idle time
> increased.
>
>
>>I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
>>glibc allocator. But I wonder if there are other improvements that glibc
>>can do here?
>
>
> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
> taskA                 taskB                 taskC
> user lock
>                       mmap_sem lock
> mmap_sem -> schedule
>                                             user lock -> schedule
>
> If taskB weren't there triggering more random thrashing on the
> mmap_sem, the lock holder wouldn't wait, and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...


I would agree that it points to MySQL scalability issues, however the
fact that such large gains come from tcmalloc is still interesting.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 01:02:44PM +0100, Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:
>
> > My wild guess is that they're allocating memory after taking
> > futexes. If they do, something like this will happen:
> >
> > taskA                 taskB                 taskC
> > user lock
> >                       mmap_sem lock
> > mmap_sem -> schedule
> >                                             user lock -> schedule
> >
> > If taskB weren't there triggering more random thrashing on the
> > mmap_sem, the lock holder wouldn't wait, and taskC wouldn't wait either.
> >
> > I suspect the real fix is not to allocate memory or to run other
> > expensive syscalls that can block inside the futex critical sections...
>
> glibc malloc uses arenas, and trylock() only. It should not block, because if
> an arena is already locked, the thread automatically chooses another arena, and
> might create a new one if necessary.

Well, only allocation uses trylock; free uses a normal lock.
glibc malloc will by default use the same arena for all threads; only when
it sees contention during allocation does it give different threads different
arenas. So, e.g., if mysql did all allocations while holding some global
heap lock (thus glibc wouldn't see any contention on allocation), but
freeing were done outside of the application's critical section, you would
see contention on the main arena's lock in the free path.
Calling malloc_stats() from e.g. an atexit handler could give interesting
details, especially if you recompile glibc malloc with -DTHREAD_STATS=1.
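
Concretely, Jakub's suggestion amounts to something like this (malloc_stats()
is declared in glibc's <malloc.h> and prints to stderr):

/* Dump per-arena allocation statistics at process exit. With glibc
 * malloc rebuilt with -DTHREAD_STATS=1 this also reports per-arena
 * lock statistics. */
#include <malloc.h>
#include <stdlib.h>

static void dump_malloc_stats(void)
{
    malloc_stats();
}

int main(void)
{
    atexit(dump_malloc_stats);
    /* ... run the real workload here ... */
    return 0;
}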

Jakub
Re: SMP performance degradation with sysbench
On 3/12/07, Anton Blanchard <anton@samba.org> wrote:
>
> Hi Nick,
>
> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
> > you might be able to do something better).
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library; replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Quick datapoint, still collecting data and trying to verify it's
always the case: on my 8-way Xeon, I'm actually seeing *much* worse
performance with libtcmalloc.so compared to mainline. Am generating
graphs and such still, but maybe someone else with x86_64 hardware
could try the google PRELOAD and see if it helps/hurts (to rule out
tester stupidity)?
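
For anyone comparing, the two runs should differ only in the preload; roughly
(sysbench 0.4-era oltp options assumed; the mysql user is a placeholder):

# baseline mysqld
/usr/sbin/mysqld &
sysbench --test=oltp --mysql-user=root --num-threads=16 prepare
sysbench --test=oltp --mysql-user=root --num-threads=16 run

# same benchmark with only the allocator swapped
killall mysqld
LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld &
sysbench --test=oltp --mysql-user=root --num-threads=16 run
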

Thanks,
Nish
Re: SMP performance degradation with sysbench
Nish Aravamudan wrote:
> On 3/12/07, Anton Blanchard <anton@samba.org> wrote:
>>
>> Hi Nick,
>>
>> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
>> > you might be able to do something better).
>>
>> I took a look at this today and figured I'd document it:
>>
>> http://ozlabs.org/~anton/linux/sysbench/
>>
>> Bottom line: it looks like issues in the glibc malloc library; replacing
>> it with the google malloc library fixes the negative scaling:
>>
>> # apt-get install libgoogle-perftools0
>> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
>
> Quick datapoint, still collecting data and trying to verify it's
> always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> performance with libtcmalloc.so compared to mainline. Am generating
> graphs and such still, but maybe someone else with x86_64 hardware
> could try the google PRELOAD and see if it helps/hurts (to rule out
> tester stupidity)?

I wish I had an 8-way test platform :)

Anyway, could you post some oprofile results?

Re: SMP performance degradation with sysbench
On 3/13/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Nish Aravamudan wrote:
> > On 3/12/07, Anton Blanchard <anton@samba.org> wrote:
> >>
> >> Hi Nick,
> >>
> >> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> >> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
> >> > you might be able to do something better).
> >>
> >> I took a look at this today and figured I'd document it:
> >>
> >> http://ozlabs.org/~anton/linux/sysbench/
> >>
> >> Bottom line: it looks like issues in the glibc malloc library; replacing
> >> it with the google malloc library fixes the negative scaling:
> >>
> >> # apt-get install libgoogle-perftools0
> >> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
> >
> > Quick datapoint, still collecting data and trying to verify it's
> > always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> > performance with libtcmalloc.so compared to mainline. Am generating
> > graphs and such still, but maybe someone else with x86_64 hardware
> > could try the google PRELOAD and see if it helps/hurts (to rule out
> > tester stupidity)?
>
> I wish I had an 8-way test platform :)
>
> Anyway, could you post some oprofile results?

Hopefully soon -- want to still make sure I'm not doing something
dumb. Am also hoping to get some of the gdb backtraces like Anton had.

Thanks,
Nish
Re: SMP performance degradation with sysbench
On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> I would agree that it points to MySQL scalability issues, however the
> fact that such large gains come from tcmalloc is still interesting.

What glibc version are you, Anton, and the others using?

Does that version have this fix included?

Dynamically size mmap threshold if the program frees mmapped blocks.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc

thanks,
suresh
Re: SMP performance degradation with sysbench
On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > I would agree that it points to MySQL scalability issues, however the
> > fact that such large gains come from tcmalloc is still interesting.
>
> What glibc version are you, Anton, and the others using?
>
> Does that version have this fix included?
>
> Dynamically size mmap threshold if the program frees mmapped blocks.
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
>
Last week, I reproduced it on RHEL4U3 with glibc 2.3.4-2.19. Today, I
installed RHEL5 GA and reproduced it again. RHEL5 GA uses glibc 2.5-12, which
already includes the dynamic mmap threshold patch, so that patch doesn't
resolve the issue.

The problem is really related to glibc's multi-threaded malloc/free.

My Paxville machine has 16 logical cpus (dual-core + HT). I disabled HT by
hot-removing the last 8 logical processors.

I captured the scheduling statistics. When sysbench thread=8 (best performance),
about 3.4% of context switches are caused by __down_read/__down_write_nested.
When sysbench thread=10 (where performance drops), the percentage becomes 11.83%.

I captured the thread status with gdb. When sysbench thread=10, usually 2 threads
are calling mprotect/mmap. When sysbench thread=8, no threads are calling
mprotect/mmap. Such captures are random snapshots, but I tried many times.

I think the increased percentage of context switches related to
__down_read/__down_write_nested is caused by mprotect/mmap. mprotect/mmap
take mm->mmap_sem for writing, so there is contention on the semaphore,
which drags performance down.
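
The serialization is easy to demonstrate in isolation with a synthetic toy
(assumes 4096-byte pages): run it with 1 thread and then with 8, and wall
time barely improves, because every mprotect takes the same mm->mmap_sem
for writing even though the pages are disjoint:

/* mprotect_scaling.c: N threads each mprotect their own private page.
 * Build: gcc -O2 -pthread mprotect_scaling.c -o mprotect_scaling
 * Run:   time ./mprotect_scaling 1   vs   time ./mprotect_scaling 8 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>

#define ITERS 200000

static void *worker(void *arg)
{
    char *page = arg;
    int i;

    for (i = 0; i < ITERS; i++) {
        mprotect(page, 4096, PROT_READ);
        mprotect(page, 4096, PROT_READ | PROT_WRITE);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t t[64];
    int n = argc > 1 ? atoi(argv[1]) : 4;
    int i;

    if (n > 64)
        n = 64;
    for (i = 0; i < n; i++) {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_create(&t[i], NULL, worker, p);
    }
    for (i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    return 0;
}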

strace shows mysqld often calls mprotect/mmap with the same data length,
61440. That's further evidence. Gdb showed such mprotect calls come from
init_io_malloc=>my_malloc=>malloc=>_int_malloc=>mprotect, and the mmap calls
from _int_free=>mmap. I checked the glibc source code and found the real call
chains are malloc=>_int_malloc=>grow_heap=>mprotect and _int_free=>heap_trim=>mmap.
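
The churn can be approximated outside mysqld with a synthetic toy (61440
mirrors the strace observation above; this is an illustration, not the
mysqld code path):

/* malloc_churn.c: threads malloc/free 61440-byte blocks, mimicking
 * per-connection setup/teardown, so the non-main-arena grow/trim
 * traffic becomes visible.
 * Build: gcc -O2 -pthread malloc_churn.c -o malloc_churn
 * Run:   strace -f -c -e trace=mmap,mprotect ./malloc_churn */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 16
#define ITERS    10000

static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < ITERS; i++) {
        void *p = malloc(61440);
        if (p) {
            memset(p, 0, 61440);  /* touch it so the heap really grows */
            free(p);              /* free is where heap_trim() runs */
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}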

I guess the transaction processing of mysql/sysbench goes like this: mysql accepts
a connection and allocates a block for the connection. After processing a couple of
transactions, sysbench closes the connection. Then the procedure restarts.

So why are there so many mprotect/mmap?

Glibc uses arenas to speed up malloc/free in multi-threaded environments.
mp_.trim_threshold only controls the main arena. In function _int_free,
FASTBIN_CONSOLIDATION_THRESHOLD might be helpful, but it's a fixed value.

The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.

To verify my idea, I created a small patch. When freeing a block, always
check mp_.trim_threshold, even though the block might not be in the main arena.
The patch is just to verify my idea, not the final solution.

--- glibc-2.5-20061008T1257_bak/malloc/malloc.c 2006-09-08 00:06:02.000000000 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c 2007-03-20 07:41:03.000000000 +0800
@@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
     } else {
       /* Always try heap_trim(), even if the top chunk is not
          large, because the corresponding heap might go away. */
+      if ((unsigned long)(chunksize(av->top)) >=
+          (unsigned long)(mp_.trim_threshold)) {
       heap_info *heap = heap_for_ptr(top(av));

       assert(heap->ar_ptr == av);
       heap_trim(heap, mp_.top_pad);
+      }
     }
   }


With the patch, I recompiled glibc and reran sysbench/mysql. The result is good.
When the thread number is larger than 8, the tps and average response time stay
smooth and don't drop severely.

Is anyone able to test it on an AMD machine?

Yanmin
Re: SMP performance degradation with sysbench
On Tue, 2007-03-20 at 10:29 +0800, Zhang, Yanmin wrote:
> On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> > On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > > I would agree that it points to MySQL scalability issues, however the
> > > fact that such large gains come from tcmalloc is still interesting.
> >
> > What glibc version are you, Anton, and the others using?
> >
> > Does that version have this fix included?
> >
> > Dynamically size mmap threshold if the program frees mmapped blocks.
> >
> > http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc

> The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.
>
> To verify my idea, I created a small patch. When freeing a block, always
> check mp_.trim_threshold, even though the block might not be in the main arena.
> The patch is just to verify my idea, not the final solution.
>
> --- glibc-2.5-20061008T1257_bak/malloc/malloc.c 2006-09-08 00:06:02.000000000 +0800
> +++ glibc-2.5-20061008T1257/malloc/malloc.c 2007-03-20 07:41:03.000000000 +0800
> @@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
>      } else {
>        /* Always try heap_trim(), even if the top chunk is not
>           large, because the corresponding heap might go away. */
> +      if ((unsigned long)(chunksize(av->top)) >=
> +          (unsigned long)(mp_.trim_threshold)) {
>        heap_info *heap = heap_for_ptr(top(av));
>
>        assert(heap->ar_ptr == av);
>        heap_trim(heap, mp_.top_pad);
> +      }
>      }
>    }
>
>
I sent a new patch to the glibc maintainers but didn't get a response, so I'm resending it here.

Glibc arenas are meant to decrease malloc/free contention among threads. But arenas
shrink aggressively, and therefore also grow aggressively. When heaps grow, mprotect
is called; when heaps shrink, mmap is called. In the kernel, both mmap and mprotect
need to hold the write lock of mm->mmap_sem, which introduces new contention. That
contention effectively reduces the benefit of the arenas to zero.

Here is a new patch to address this issue.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>

---

--- glibc-2.5-20061008T1257_bak/malloc/malloc.c 2006-09-08 00:06:02.000000000 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c 2007-03-30 09:01:18.000000000 +0800
@@ -4605,12 +4605,13 @@ _int_free(mstate av, Void_t* mem)
       sYSTRIm(mp_.top_pad, av);
 #endif
     } else {
-      /* Always try heap_trim(), even if the top chunk is not
-         large, because the corresponding heap might go away. */
-      heap_info *heap = heap_for_ptr(top(av));
-
-      assert(heap->ar_ptr == av);
-      heap_trim(heap, mp_.top_pad);
+      if ((unsigned long)(chunksize(av->top)) >=
+          (unsigned long)(mp_.trim_threshold)) {
+        heap_info *heap = heap_for_ptr(top(av));
+
+        assert(heap->ar_ptr == av);
+        heap_trim(heap, mp_.top_pad);
+      }
     }
   }
