Mailing List Archive

SMP performance degradation with sysbench
Hi lkml,

According to the sysbench test below, Linux seems to have scalability
problems beyond 8 client threads:
http://jeffr-tech.livejournal.com/6268.html#cutid1
http://jeffr-tech.livejournal.com/5705.html
Hardware is an 8-core amd64 system and jeffr seems willing to try more
Linux versions on that machine.
Anyway, is there anyone who can reproduce this?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: SMP performance degradation with sysbench
Lorenzo Allegrucci wrote:
> Hi lkml,
>
> according to the test below (sysbench) Linux seems to have scalability
> problems beyond 8 client threads:
> http://jeffr-tech.livejournal.com/6268.html#cutid1
> http://jeffr-tech.livejournal.com/5705.html
> Hardware is an 8-core amd64 system and jeffr seems willing to try more
> Linux versions on that machine.
> Anyway, is there anyone who can reproduce this?

I have reproduced it on a quad core test system.

With 4 threads (on 4 cores) I get a high throughput, with
approximately 58% user time and 42% system time.

With 8 threads (on 4 cores) I get way lower throughput,
with 37% user time, 29% system time and 35% idle time!

The maximum time taken per query also increases from
0.0096s to 0.5273s. Ouch!

I don't know if this is MySQL, glibc or Linux kernel,
but something strange is going on...
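
For readers who want to watch the same user/system/idle split while a benchmark runs, here is a minimal sketch. It assumes the Linux /proc/stat layout, where the aggregate "cpu" line starts with the user, nice, system and idle tick counters. The names cpu_ticks, parse_cpu_line and idle_percent are made up for this illustration, and the later fields (iowait, irq, ...) are ignored for brevity.

```c
#include <stdio.h>
#include <string.h>

/* First four tick counters from the aggregate "cpu" line of /proc/stat. */
struct cpu_ticks {
    unsigned long long user, nice, system, idle;
};

/* Parse a line such as "cpu  100 0 50 850 ...".
 * Per-CPU lines ("cpu0 ...") are rejected. Returns 0 on success, -1 otherwise. */
int parse_cpu_line(const char *line, struct cpu_ticks *out)
{
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    int n = sscanf(line, "cpu %llu %llu %llu %llu",
                   &out->user, &out->nice, &out->system, &out->idle);
    return n == 4 ? 0 : -1;
}

/* Idle share, in percent, of the ticks that elapsed between samples a and b. */
double idle_percent(const struct cpu_ticks *a, const struct cpu_ticks *b)
{
    unsigned long long busy = (b->user - a->user) + (b->nice - a->nice)
                            + (b->system - a->system);
    unsigned long long idle = b->idle - a->idle;
    unsigned long long total = busy + idle;
    return total ? 100.0 * (double)idle / (double)total : 0.0;
}
```

Reading the first line of /proc/stat twice, a few seconds apart, and passing both samples to idle_percent gives a rough equivalent of the idle column reported by tools such as sar.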

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Rik van Riel wrote:
> Lorenzo Allegrucci wrote:
>
>> Hi lkml,
>>
>> according to the test below (sysbench) Linux seems to have scalability
>> problems beyond 8 client threads:
>> http://jeffr-tech.livejournal.com/6268.html#cutid1
>> http://jeffr-tech.livejournal.com/5705.html
>> Hardware is an 8-core amd64 system and jeffr seems willing to try more
>> Linux versions on that machine.
>> Anyway, is there anyone who can reproduce this?
>
>
> I have reproduced it on a quad core test system.
>
> With 4 threads (on 4 cores) I get a high throughput, with
> approximately 58% user time and 42% system time.
>
> With 8 threads (on 4 cores) I get way lower throughput,
> with 37% user time, 29% system time 35% idle time!
>
> The maximum time taken per query also increases from
> 0.0096s to 0.5273s. Ouch!
>
> I don't know if this is MySQL, glibc or Linux kernel,
> but something strange is going on...

Like you, I'm also seeing idle time start going up as threads increase.

I initially thought this was a problem with the multiprocessor scheduler,
because the pattern is exactly like some artificat in the load balancing.

However, after looking at the stats, and testing a couple of things, I
think it may not be after all.

I've reproduced this on an 8-socket/16-way dual-core Opteron. So far what
I am seeing is that MySQL is having trouble putting enough load into the
scheduler.

Virtually all of the sleep time is coming from unix_stream_recvmsg, which
seems to be what the client and server threads use to communicate.
There doesn't seem to be any other tell-tale event that the database is
blocking on.

It seems like it might at least partially be a problem with MySQL
thread/connection management.

I found a couple of interesting issues so far. Firstly, the MySQL version
that I'm using (5.0.26-Max) is making lots of calls to sched_setscheduler
attempting to fiddle with SCHED_OTHER priority in what looks like an
attempt to boost CPU time while holding some resource. All these calls
actually fail, because you cannot change SCHED_OTHER priority like that.
Adding a hack to make it fall through to set_user_nice provides a boost
which eliminates the cliff (but a downward degradation is still there).
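
To illustrate why those calls fail, here is a small user-space sketch. It is not the kernel hack mentioned above, and it is not MySQL's code. It relies on the documented Linux behaviour that sched_priority must be 0 for SCHED_OTHER (anything else yields EINVAL), and that the nice value is the knob that does apply to SCHED_OTHER tasks. The helper names are invented for this example.

```c
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>

/* Ask for static priority `prio` under SCHED_OTHER for the calling process.
 * Returns 0 on success, or the errno value on failure. */
int try_sched_other_priority(int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    if (sched_setscheduler(0, SCHED_OTHER, &sp) == -1)
        return errno;
    return 0;
}

/* The mechanism that does apply to SCHED_OTHER tasks: the nice value.
 * Raising it (lowering priority) needs no privilege; lowering it usually does.
 * Returns 0 on success, -1 on failure (errno set by setpriority). */
int set_own_nice(int nice_value)
{
    return setpriority(PRIO_PROCESS, 0, nice_value);
}
```

Calling try_sched_other_priority(5) is expected to return EINVAL, which matches the failing calls described above; an application that wants a temporary boost would have to go through the nice value instead, subject to the usual permission checks.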

Secondly, I've raised the thread numbers from 16 to 32 for my system,
which also provides a bit more (although doesn't help the downward
slope).

Combined, it looks like around 30-40% improvement past 16 threads. It
isn't anything like making up for the dropoff seen in the blog link, but
different systems, different mysql version... I wonder how close we are
with this hack in place?

Attached is a graph of my numbers, from 1 to 32 clients. plain = 2.6.20.1,
sched is with the attached sched patch, and thread is with 32 rather than
16 clients.

Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
at this, send me a mail (eg. especially with the sched_setscheduler issue,
you might be able to do something better).

Nick

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
Nick Piggin wrote:
> Rik van Riel wrote:
>
>> Lorenzo Allegrucci wrote:
>>
>>> Hi lkml,
>>>
>>> according to the test below (sysbench) Linux seems to have scalability
>>> problems beyond 8 client threads:
>>> http://jeffr-tech.livejournal.com/6268.html#cutid1
>>> http://jeffr-tech.livejournal.com/5705.html
>>> Hardware is an 8-core amd64 system and jeffr seems willing to try more
>>> Linux versions on that machine.
>>> Anyway, is there anyone who can reproduce this?
>>
>>
>>
>> I have reproduced it on a quad core test system.
>>
>> With 4 threads (on 4 cores) I get a high throughput, with
>> approximately 58% user time and 42% system time.
>>
>> With 8 threads (on 4 cores) I get way lower throughput,
>> with 37% user time, 29% system time 35% idle time!
>>
>> The maximum time taken per query also increases from
>> 0.0096s to 0.5273s. Ouch!
>>
>> I don't know if this is MySQL, glibc or Linux kernel,
>> but something strange is going on...
>
>
> Like you, I'm also seeing idle time start going up as threads increase.
>
> I initially thought this was a problem with the multiprocessor scheduler,
> because the pattern is exactly like some artificat in the load balancing.

"artificat"

Wow. I must need some sleep :) Please excuse any other typos!

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench
On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote:
> I found a couple of interesting issues so far. Firstly, the MySQL
> version that I'm using (5.0.26-Max) is making lots of calls to

FYI, MySQL fixed some scalability problems in version 5.0.30, as
mentioned here:

http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/

It may be worth using more recent sources than 5.0.26 if tracking down
scaling problems in MySQL.

--Pete

----------------------------------
Pete Harlan
ArtSelect, Inc.
harlan@artselect.com
http://www.artselect.com
ArtSelect is a subsidiary of a21, Inc.
Re: SMP performance degradation with sysbench
On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote:
> On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote:
> > I found a couple of interesting issues so far. Firstly, the MySQL
> > version that I'm using (5.0.26-Max) is making lots of calls to
>
> FYI, MySQL fixed some scalability problems in version 5.0.30, as
> mentioned here:
>
> http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/
>
> It may be worth using more recent sources than 5.0.26 if tracking down
> scaling problems in MySQL.

The blog post that originated this discussion ran tests on 5.0.33.
Not that the mysql version should really matter. The key point here
is that FreeBSD and Linux were running the *same* version, and
FreeBSD was able to handle the situation better somehow.

Dave

--
http://www.codemonkey.org.uk
Re: SMP performance degradation with sysbench
Howdy,

MySQL 5.0.26 had some scalability issues, which have been solved since 5.0.32:
http://ossipedia.ipa.go.jp/capacity/EV0612260303/
(written in Japanese, but you can still read the graphs; we compared
5.0.24 vs 5.0.32)

The following is oprofile data
==>
cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt
<==
CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples   %        app name             symbol name
47097502  16.8391  libpthread-2.3.4.so  pthread_mutex_trylock
19636300   7.0207  libpthread-2.3.4.so  pthread_mutex_unlock
18600010   6.6502  mysqld               rec_get_offsets_func
18121328   6.4790  mysqld               btr_search_guess_on_hash
11453095   4.0949  mysqld               row_search_for_mysql

MySQL spends about 16.8% of its CPU time in pthread_mutex_trylock, just
trying to get a mutex, on an 8-core machine.

I think there is a lot of room for improvement in the MySQL implementation.

On 2/27/07, Dave Jones <davej@redhat.com> wrote:
> On Mon, Feb 26, 2007 at 04:04:01PM -0600, Pete Harlan wrote:
> > On Tue, Feb 27, 2007 at 12:36:04AM +1100, Nick Piggin wrote:
> > > I found a couple of interesting issues so far. Firstly, the MySQL
> > > version that I'm using (5.0.26-Max) is making lots of calls to
> >
> > FYI, MySQL fixed some scalability problems in version 5.0.30, as
> > mentioned here:
> >
> > http://www.mysqlperformanceblog.com/2007/01/03/innodb-benchmarks/
> >
> > It may be worth using more recent sources than 5.0.26 if tracking down
> > scaling problems in MySQL.
>
> The blog post that originated this discussion ran tests on 5.0.33
> Not that the mysql version should really matter. The key point here
> is that FreeBSD and Linux were running the *same* version, and
> FreeBSD was able to handle the situation better somehow.
>
> Dave
>
> --
> http://www.codemonkey.org.uk

Regards,
Hiro
--
Hiro Yoshioka
mailto:hyoshiok at miraclelinux.com
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote:
> Howdy,
>
> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> (written in Japanese but you may read the graph. We compared
> 5.0.24 vs 5.0.32)
>
> The following is oprofile data
> ==>
> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt
> <==
> CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
> mask of 0x00 (Unhalted core cycles) count 100000
> samples % app name symbol name
> 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock
> 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock
> 18600010 6.6502 mysqld rec_get_offsets_func
> 18121328 6.4790 mysqld btr_search_guess_on_hash
> 11453095 4.0949 mysqld row_search_for_mysql
>
> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> machine.
>
> I think there are a lot of room to be inproved in MySQL implementation.

That's one aspect.

The other aspect of the problem is that when the number of
threads exceeds the number of CPU cores, Linux no longer
manages to keep the CPUs busy and we get a lot of idle time.

On the other hand, with the number of threads being equal to
the number of CPU cores, we are 100% CPU bound...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Hi,

From: Rik van Riel <riel@redhat.com>
> Hiro Yoshioka wrote:
> > Howdy,
> >
> > MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> > http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> > (written in Japanese but you may read the graph. We compared
> > 5.0.24 vs 5.0.32)
snip
> > MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> > machine.
> >
> > I think there are a lot of room to be inproved in MySQL implementation.
>
> That's one aspect.
>
> The other aspect of the problem is that when the number of
> threads exceeds the number of CPU cores, Linux no longer
> manages to keep the CPUs busy and we get a lot of idle time.
>
> On the other hand, with the number of threads being equal to
> the number of CPU cores, we are 100% CPU bound...

I have a question. If so, how does the kernel's view of SMP differ
from its view of CPU cores?

Another question. When the number of threads exceeds the number of
CPU cores, we may get a lot of idle time. Would a workaround for
MySQL then be not to create more threads than there are CPU cores?
Is that right?

Regards,
Hiro
--
Hiro Yoshioka
CTO/Miracle Linux Corporation
http://blog.miraclelinux.com/yume/
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote:
> Hi,
>
> From: Rik van Riel <riel@redhat.com>
>> Hiro Yoshioka wrote:
>>> Howdy,
>>>
>>> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
>>> http://ossipedia.ipa.go.jp/capacity/EV0612260303/
>>> (written in Japanese but you may read the graph. We compared
>>> 5.0.24 vs 5.0.32)
> snip
>>> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
>>> machine.
>>>
>>> I think there are a lot of room to be inproved in MySQL implementation.
>> That's one aspect.
>>
>> The other aspect of the problem is that when the number of
>> threads exceeds the number of CPU cores, Linux no longer
>> manages to keep the CPUs busy and we get a lot of idle time.
>>
>> On the other hand, with the number of threads being equal to
>> the number of CPU cores, we are 100% CPU bound...
>
> I have a question. If so, what is the difference of kernel's
> view between SMP and CPU cores?

None. Each schedulable entity (whether a fully fledged
CPU core or an SMT/HT thread) is treated the same.
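
As a small illustration of that point from user space: the count of online CPUs that a program sees already lumps full cores and SMT/HT siblings together as logical CPUs. The sketch below assumes the widely available (but non-standard) _SC_NPROCESSORS_ONLN sysconf name; online_cpus and oversubscribed are names invented for this example.

```c
#include <unistd.h>

/* Number of logical CPUs currently online. Each core and each SMT/HT
 * sibling counts as one. Returns -1 if the value is unavailable. */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* Returns 1 if `threads` runnable threads would exceed the number of
 * online logical CPUs, 0 otherwise (including when the count is unknown). */
int oversubscribed(long threads)
{
    long n = online_cpus();
    return n > 0 && threads > n;
}
```

In the terms used in this thread, the 8-thread run on the quad core box is the oversubscribed case, and that is where the idle time shows up.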

> Another question. When the number of threads exceeds the number of
> CPU cores, we may get a lot of idle time. Then a workaround of
> MySQL is that do not creat threads which exceeds the number
> of CPU cores. Is it right?

Not really, that would make it impossible for MySQL to
handle more simultaneous database queries than the system
has CPUs.

Besides, it looks like this is not a problem in MySQL
per se (it works on FreeBSD) but some bug in Linux.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: SMP performance degradation with sysbench
On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <riel@redhat.com> wrote:

> Hiro Yoshioka wrote:
> > Hi,
> >
> > From: Rik van Riel <riel@redhat.com>
> >> Hiro Yoshioka wrote:
> >>> Howdy,
> >>>
> >>> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> >>> http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> >>> (written in Japanese but you may read the graph. We compared
> >>> 5.0.24 vs 5.0.32)
> > snip
> >>> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> >>> machine.
> >>>
> >>> I think there are a lot of room to be inproved in MySQL implementation.
> >> That's one aspect.
> >>
> >> The other aspect of the problem is that when the number of
> >> threads exceeds the number of CPU cores, Linux no longer
> >> manages to keep the CPUs busy and we get a lot of idle time.
> >>
> >> On the other hand, with the number of threads being equal to
> >> the number of CPU cores, we are 100% CPU bound...
> >
> > I have a question. If so, what is the difference of kernel's
> > view between SMP and CPU cores?
>
> None. Each schedulable entity (whether a fully fledged
> CPU core or an SMT/HT thread) is treated the same.
>

And what are the SMT and Multi-Core scheduling options in the kernel
config for? Because of this thread I re-read the help text, and
it looks like one could de-select the SMT scheduler option and still
get a working SMP system, so what difference does it make? I suppose
it's related to migration and cache flushing and so on, but where
could I get more details?
And, stranger still, what is the difference between multi-core and
normal SMP configs?

> > Another question. When the number of threads exceeds the number of
> > CPU cores, we may get a lot of idle time. Then a workaround of
> > MySQL is that do not creat threads which exceeds the number
> > of CPU cores. Is it right?
>
> Not really, that would make it impossible for MySQL to
> handle more simultaneous database queries than the system
> has CPUs.
>

I don't know MySQL internals, but you assume one thread per query.
What if it's more like Apache, with one long-lived thread for several
connections? Answering 4+4 queries is the same as answering 8 at half
the speed, isn't it?

> Besides, it looks like this is not a problem in MySQL
> per se (it works on FreeBSD) but some bug in Linux.
>


--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2007.1 (Cooker) for i586
Linux 2.6.19-jam07 (gcc 4.1.2 20070115 (prerelease) (4.1.2-0.20070115.1mdv2007.1)) #2 SMP PREEMPT
Re: SMP performance degradation with sysbench
J.A. Magallón wrote:
> On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <riel@redhat.com> wrote:
>
>> Hiro Yoshioka wrote:

>>> Another question. When the number of threads exceeds the number of
>>> CPU cores, we may get a lot of idle time. Then a workaround of
>>> MySQL is that do not creat threads which exceeds the number
>>> of CPU cores. Is it right?
>> Not really, that would make it impossible for MySQL to
>> handle more simultaneous database queries than the system
>> has CPUs.
>>
>
> I don't know myqsl internals, but you assume one thread per query.
> If its more like Apache, one long living thread for several connections ?

Yes, they are longer lived client connections. One thread
per connection, just like Apache.

> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?

That still doesn't fix the potential Linux problem that this
benchmark identified.

To clarify: I don't care as much about MySQL performance as
I care about identifying and fixing this potential bug in
Linux.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: SMP performance degradation with sysbench
Rik van Riel wrote:
> J.A. Magallón wrote:
>>[...]
>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?
>
> That still doesn't fix the potential Linux problem that this
> benchmark identified.
>
> To clarify: I don't care as much about MySQL performance as
> I care about identifying and fixing this potential bug in
> Linux.

IIRC a long time ago there was a change in the scheduler to prevent a
low prio task running on one sibling of a hyperthreaded processor from
slowing down a higher prio task on another sibling of the same processor.

Basically the scheduler would put the low prio task to sleep for an
adequate time slice to allow the other sibling to run at full speed for
a while.

I don't know the scheduler code well enough, but comments like this one
make me think that the change is still in place:

>     /*
>      * If an SMT sibling task has been put to sleep for priority
>      * reasons reschedule the idle task to see if it can now run.
>      */
>     if (rq->nr_running) {
>         resched_task(rq->idle);
>         ret = 1;
>     }

If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.

--
Paulo Marques - www.grupopie.com

"The face of a child can say it all, especially the
mouth part of the face."
Re: SMP performance degradation with sysbench
On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote:
> J.A. Magallón wrote:
> > On Mon, 26 Feb 2007 23:31:29 -0500, Rik van Riel <riel@redhat.com> wrote:
> >
> >> Hiro Yoshioka wrote:
>
> >>> Another question. When the number of threads exceeds the number of
> >>> CPU cores, we may get a lot of idle time. Then a workaround of
> >>> MySQL is that do not creat threads which exceeds the number
> >>> of CPU cores. Is it right?
> >> Not really, that would make it impossible for MySQL to
> >> handle more simultaneous database queries than the system
> >> has CPUs.
> >>
> >
> > I don't know myqsl internals, but you assume one thread per query.
> > If its more like Apache, one long living thread for several connections ?
>
> Yes, they are longer lived client connections. One thread
> per connection, just like Apache.
>
> > Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?
>
> That still doesn't fix the potential Linux problem that this
> benchmark identified.
>
> To clarify: I don't care as much about MySQL performance as
> I care about identifying and fixing this potential bug in
> Linux.

At http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway
talks about a patch for FreeBSD 7 that addresses the poor scalability
of file descriptor locking, and says it is responsible for almost all
of the performance and scaling improvements.


Re: SMP performance degradation with sysbench
On 2/27/07, Paulo Marques <pmarques@grupopie.com> wrote:
> Rik van Riel wrote:
> > J.A. Magallón wrote:
> >>[...]
> >> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?
> >
> > That still doesn't fix the potential Linux problem that this
> > benchmark identified.
> >
> > To clarify: I don't care as much about MySQL performance as
> > I care about identifying and fixing this potential bug in
> > Linux.
>
> IIRC a long time ago there was a change in the scheduler to prevent a
> low prio task running on a sibling of a hyperthreaded processor to slow
> down a higher prio task on another sibling of the same processor.
>
> Basically the scheduler would put the low prio task to sleep during an
> adequate task slice to allow the other sibling to run at full speed for
> a while.
>
> I don't know the scheduler code well enough, but comments like this one
> make me think that the change is still in place:

<snip>

> If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.

To chime in here, I was attempting to reproduce this on an 8-way Xeon
box (4 dual-core). With SCHED_SMT and SCHED_MC on, I saw scaling issues
above 4 threads (4 threads was the peak), to the point where I
couldn't break 1000 transactions per second. Turning both off (with
2.6.20.1) gives much better performance through 16 threads. I am now
running the cases from 17 to 32 threads to see if I can reproduce the
problem at hand. I'll regenerate my data and post numbers soon.

I don't know if anyone else has those on in their kernel .config, but
I'd suggest turning them off, as Paulo said.
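
For anyone who wants to check a kernel .config for these two options programmatically, here is a rough sketch. It assumes the usual .config conventions, where an enabled boolean option appears as "CONFIG_FOO=y" and a disabled one as "# CONFIG_FOO is not set". The function name config_enabled is made up for this example, and modules ("=m") or string values are deliberately not handled.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `config` (the full text of a kernel .config) contains a
 * line that is exactly "<option>=y", and 0 otherwise (option absent,
 * commented out as "# <option> is not set", or name too long). */
int config_enabled(const char *config, const char *option)
{
    char needle[128];
    int n = snprintf(needle, sizeof needle, "%s=y", option);
    if (n < 0 || (size_t)n >= sizeof needle)
        return 0;

    for (const char *p = config; (p = strstr(p, needle)) != NULL; p += n) {
        /* Only accept a match that spans a whole line. */
        int at_line_start = (p == config) || (p[-1] == '\n');
        char end = p[n];
        if (at_line_start && (end == '\n' || end == '\0'))
            return 1;
    }
    return 0;
}
```

With the file contents loaded into a string, config_enabled(text, "CONFIG_SCHED_SMT") and config_enabled(text, "CONFIG_SCHED_MC") tell you whether a given test kernel matches the "both off" setup described above.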

Thanks,
Nish
Re: SMP performance degradation with sysbench
Hiro Yoshioka wrote:
> Howdy,
>
> MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> (written in Japanese but you may read the graph. We compared
> 5.0.24 vs 5.0.32)
>
> The following is oprofile data
> ==>
> cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt
> <==
> CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
> mask of 0x00 (Unhalted core cycles) count 100000
> samples % app name symbol name
> 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock
> 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock
> 18600010 6.6502 mysqld rec_get_offsets_func
> 18121328 6.4790 mysqld btr_search_guess_on_hash
> 11453095 4.0949 mysqld row_search_for_mysql
>
> MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> machine.

Curious that it calls pthread_mutex_trylock (as opposed to
pthread_mutex_lock) so often. Maybe they're doing some kind of mutex
lock busy-looping?
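
For illustration only, this is roughly what a trylock-based busy loop looks like. It is a hypothetical sketch, not MySQL or InnoDB source: the function name spin_then_lock, the spin limit and the sched_yield() back-off are all assumptions made for the example. A loop of this shape would explain why pthread_mutex_trylock dominates a profile under contention, since every failed attempt burns CPU instead of sleeping.

```c
#include <pthread.h>
#include <sched.h>

/* Spin on pthread_mutex_trylock() up to max_spins times, then fall back
 * to a blocking pthread_mutex_lock(). The mutex is held on return.
 * Returns the number of failed trylock attempts, i.e. how much spinning
 * was done before the lock was obtained. */
int spin_then_lock(pthread_mutex_t *m, int max_spins)
{
    int failed = 0;
    while (failed < max_spins) {
        if (pthread_mutex_trylock(m) == 0)
            return failed;      /* got the lock while spinning */
        failed++;
        sched_yield();          /* assumed back-off; a real loop may differ */
    }
    pthread_mutex_lock(m);      /* stop spinning and sleep until it is free */
    return failed;
}
```

With many more runnable threads than CPUs, the thread holding the mutex may not even be running while the others spin, so spin limits tuned for a lightly loaded system can behave very differently past the core count.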

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

Re: SMP performance degradation with sysbench
On 2/26/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Rik van Riel wrote:
> > Lorenzo Allegrucci wrote:
> >
> >> Hi lkml,
> >>
> >> according to the test below (sysbench) Linux seems to have scalability
> >> problems beyond 8 client threads:
> >> http://jeffr-tech.livejournal.com/6268.html#cutid1
> >> http://jeffr-tech.livejournal.com/5705.html
> >> Hardware is an 8-core amd64 system and jeffr seems willing to try more
> >> Linux versions on that machine.
> >> Anyway, is there anyone who can reproduce this?
> >
> >
> > I have reproduced it on a quad core test system.
> >
> > With 4 threads (on 4 cores) I get a high throughput, with
> > approximately 58% user time and 42% system time.
> >
> > With 8 threads (on 4 cores) I get way lower throughput,
> > with 37% user time, 29% system time 35% idle time!
> >
> > The maximum time taken per query also increases from
> > 0.0096s to 0.5273s. Ouch!
> >
> > I don't know if this is MySQL, glibc or Linux kernel,
> > but something strange is going on...
>
> Like you, I'm also seeing idle time start going up as threads increase.
>
> I initially thought this was a problem with the multiprocessor scheduler,
> because the pattern is exactly like some artificat in the load balancing.
>
> However, after looking at the stats, and testing a couple of things, I
> think it may not be after all.
>
> I've reproduced this on a 8-socket/16-way dual core Opteron. So far what
> I am seeing is that MySQL is having trouble putting enough load into the
> scheduler.

Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC
in .config) I posted about earlier.

transactions.png resembles Nick's results pretty closely, in that a
drop-off occurs, at the same # of threads, too. That seems weird to
me, but I haven't thought about it too closely. Shouldn't Nick's be
dropping off closer to 16 threads (that would be 1 per core, then,
right?)

idle.png is the average % idle according to sar over each run from 1
to 32 threads. This appears to confirm what Rik was seeing.

Not sure if my data is hurting or helping, but this box remains
available for further tests.

Thanks,
Nish
Re: SMP performance degradation with sysbench
From: Robert Hancock <hancockr@shaw.ca>
Subject: Re: SMP performance degradation with sysbench
Date: Tue, 27 Feb 2007 18:20:25 -0600
Message-ID: <45E4CAC9.4070504@shaw.ca>

> Hiro Yoshioka wrote:
> > Howdy,
> >
> > MySQL 5.0.26 had some scalability issues and it solved since 5.0.32
> > http://ossipedia.ipa.go.jp/capacity/EV0612260303/
> > (written in Japanese but you may read the graph. We compared
> > 5.0.24 vs 5.0.32)
> >
> > The following is oprofile data
> > ==>
> > cpu=8-mysql=5.0.32-gcc=3.4/oprofile-eu=2200-op=default-none/opreport-l.txt
> > <==
> > CPU: Core Solo / Duo, speed 2666.76 MHz (estimated)
> > Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit
> > mask of 0x00 (Unhalted core cycles) count 100000
> > samples % app name symbol name
> > 47097502 16.8391 libpthread-2.3.4.so pthread_mutex_trylock
> > 19636300 7.0207 libpthread-2.3.4.so pthread_mutex_unlock
> > 18600010 6.6502 mysqld rec_get_offsets_func
> > 18121328 6.4790 mysqld btr_search_guess_on_hash
> > 11453095 4.0949 mysqld row_search_for_mysql
> >
> > MySQL tries to get a mutex but it spends about 16.8% of CPU on 8 core
> > machine.
>
> Curious that it calls pthread_mutex_trylock (as opposed to
> pthread_mutex_lock) so often. Maybe they're doing some kind of mutex
> lock busy-looping?

Yes, it is.

Regards,
Hiro
Re: SMP performance degradation with sysbench
Paulo Marques wrote:
> Rik van Riel wrote:
>> J.A. Magallón wrote:
>>> [...]
>>> Its the same to answer 4+4 queries than 8 at half the speed, isn't it ?
>>
>> That still doesn't fix the potential Linux problem that this
>> benchmark identified.
>>
>> To clarify: I don't care as much about MySQL performance as
>> I care about identifying and fixing this potential bug in
>> Linux.
>
> IIRC a long time ago there was a change in the scheduler to prevent a
> low prio task running on one sibling of a hyperthreaded processor from
> slowing down a higher prio task on another sibling of the same processor.
>
> Basically the scheduler would put the low prio task to sleep for an
> adequate time slice to allow the other sibling to run at full speed for
> a while.
>
> I don't know the scheduler code well enough, but comments like this one
> make me think that the change is still in place:
>
>>     /*
>>      * If an SMT sibling task has been put to sleep for priority
>>      * reasons reschedule the idle task to see if it can now run.
>>      */
>>     if (rq->nr_running) {
>>         resched_task(rq->idle);
>>         ret = 1;
>>     }
>
> If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.
>
That may be the case, but in my opinion if this helps it doesn't "solve"
the problem, because the real problem is that a process which is not on
a HT is being treated as if it were.

Note that Intel does make multicore HT processors, and hopefully when
this code works as intended it will result in more total throughput. My
supposition is that it currently is NOT working as intended, since
disabling SMT scheduling is reported to help.

A test with MC on and SMT off would be informative for where to look next.

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

Re: SMP performance degradation with sysbench [ In reply to ]
Nish Aravamudan wrote:
> On 2/26/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> Rik van Riel wrote:
>> > Lorenzo Allegrucci wrote:
>> >
>> >> Hi lkml,
>> >>
>> >> according to the test below (sysbench) Linux seems to have scalability
>> >> problems beyond 8 client threads:
>> >> http://jeffr-tech.livejournal.com/6268.html#cutid1
>> >> http://jeffr-tech.livejournal.com/5705.html
>> >> Hardware is an 8-core amd64 system and jeffr seems willing to try more
>> >> Linux versions on that machine.
>> >> Anyway, is there anyone who can reproduce this?
>> >
>> >
>> > I have reproduced it on a quad core test system.
>> >
>> > With 4 threads (on 4 cores) I get a high throughput, with
>> > approximately 58% user time and 42% system time.
>> >
>> > With 8 threads (on 4 cores) I get way lower throughput,
>> > with 37% user time, 29% system time 35% idle time!
>> >
>> > The maximum time taken per query also increases from
>> > 0.0096s to 0.5273s. Ouch!
>> >
>> > I don't know if this is MySQL, glibc or Linux kernel,
>> > but something strange is going on...
>>
>> Like you, I'm also seeing idle time start going up as threads increase.
>>
>> I initially thought this was a problem with the multiprocessor scheduler,
>> because the pattern is exactly like some artifact in the load balancing.
>>
>> However, after looking at the stats, and testing a couple of things, I
>> think it may not be after all.
>>
>> I've reproduced this on an 8-socket/16-way dual core Opteron. So far what
>> I am seeing is that MySQL is having trouble putting enough load into the
>> scheduler.
>
>
> Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC
> in .config) I posted about earlier.
>
> transactions.png resembles Nick's results pretty closely, in that a
> drop-off occurs, at the same # of threads, too. That seems weird to
> me, but I haven't thought about it too closely. Shouldn't Nick's be
> dropping off closer to 16 threads (that would be 1 per core, then,
> right?)

I don't think it is exactly a matter of processes >= cores, but rather
just a general problem at higher concurrency.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
On 2/27/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Nish Aravamudan wrote:
> > On 2/26/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> >> Rik van Riel wrote:
> >> > Lorenzo Allegrucci wrote:
> >> >
> >> >> Hi lkml,
> >> >>
> >> >> according to the test below (sysbench) Linux seems to have scalability
> >> >> problems beyond 8 client threads:
> >> >> http://jeffr-tech.livejournal.com/6268.html#cutid1
> >> >> http://jeffr-tech.livejournal.com/5705.html
> >> >> Hardware is an 8-core amd64 system and jeffr seems willing to try more
> >> >> Linux versions on that machine.
> >> >> Anyway, is there anyone who can reproduce this?
> >> >
> >> >
> >> > I have reproduced it on a quad core test system.
> >> >
> >> > With 4 threads (on 4 cores) I get a high throughput, with
> >> > approximately 58% user time and 42% system time.
> >> >
> >> > With 8 threads (on 4 cores) I get way lower throughput,
> >> > with 37% user time, 29% system time 35% idle time!
> >> >
> >> > The maximum time taken per query also increases from
> >> > 0.0096s to 0.5273s. Ouch!
> >> >
> >> > I don't know if this is MySQL, glibc or Linux kernel,
> >> > but something strange is going on...
> >>
> >> Like you, I'm also seeing idle time start going up as threads increase.
> >>
> >> I initially thought this was a problem with the multiprocessor scheduler,
> >> because the pattern is exactly like some artifact in the load balancing.
> >>
> >> However, after looking at the stats, and testing a couple of things, I
> >> think it may not be after all.
> >>
> >> I've reproduced this on an 8-socket/16-way dual core Opteron. So far what
> >> I am seeing is that MySQL is having trouble putting enough load into the
> >> scheduler.
> >
> >
> > Here are some graphs from the 4-socket/8-way Xeon box (no SMT, no MC
> > in .config) I posted about earlier.
> >
> > transactions.png resembles Nick's results pretty closely, in that a
> > drop-off occurs, at the same # of threads, too. That seems weird to
> > me, but I haven't thought about it too closely. Shouldn't Nick's be
> > dropping off closer to 16 threads (that would be 1 per core, then,
> > right?)
>
> I don't think it is exactly a matter of processes >= cores, but rather
> just a general problem at higher concurrency.

Ok, thanks for the clarification.

-Nish
Re: SMP performance degradation with sysbench [ In reply to ]
On 2/27/07, Bill Davidsen <davidsen@tmr.com> wrote:
> Paulo Marques wrote:
> > Rik van Riel wrote:
> >> J.A. Magallón wrote:
> >>> [...]
> >>> It's the same to answer 4+4 queries as 8 at half the speed, isn't it?
> >>
> >> That still doesn't fix the potential Linux problem that this
> >> benchmark identified.
> >>
> >> To clarify: I don't care as much about MySQL performance as
> >> I care about identifying and fixing this potential bug in
> >> Linux.
> >
> > IIRC a long time ago there was a change in the scheduler to prevent a
> > low prio task running on one sibling of a hyperthreaded processor from
> > slowing down a higher prio task on another sibling of the same processor.
> >
> > Basically the scheduler would put the low prio task to sleep for an
> > adequate time slice to allow the other sibling to run at full speed for
> > a while.
> >
> > I don't know the scheduler code well enough, but comments like this one
> > make me think that the change is still in place:
> >
> >>     /*
> >>      * If an SMT sibling task has been put to sleep for priority
> >>      * reasons reschedule the idle task to see if it can now run.
> >>      */
> >>     if (rq->nr_running) {
> >>         resched_task(rq->idle);
> >>         ret = 1;
> >>     }
> >
> > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.
> >
> That may be the case, but in my opinion if this helps it doesn't "solve"
> the problem, because the real problem is that a process which is not on
> a HT is being treated as if it were.
>
> Note that Intel does make multicore HT processors, and hopefully when
> this code works as intended it will result in more total throughput. My
> supposition is that it currently is NOT working as intended, since
> disabling SMT scheduling is reported to help.

It does help, but we still drop off, clearly. Also, that's my
baseline, so I'm not able to reproduce the *sharp* dropoff from the
blog post yet.

> A test with MC on and SMT off would be informative for where to look next.

I'm rebooting my box with 2.6.20.1 and exactly this setup now.

Thanks,
Nish
Re: SMP performance degradation with sysbench [ In reply to ]
On 2/27/07, Nish Aravamudan <nish.aravamudan@gmail.com> wrote:
> On 2/27/07, Bill Davidsen <davidsen@tmr.com> wrote:
> > Paulo Marques wrote:
> > > Rik van Riel wrote:
> > >> J.A. Magallón wrote:
> > >>> [...]
> > >>> It's the same to answer 4+4 queries as 8 at half the speed, isn't it?
> > >>
> > >> That still doesn't fix the potential Linux problem that this
> > >> benchmark identified.
> > >>
> > >> To clarify: I don't care as much about MySQL performance as
> > >> I care about identifying and fixing this potential bug in
> > >> Linux.
> > >
> > > IIRC a long time ago there was a change in the scheduler to prevent a
> > > low prio task running on one sibling of a hyperthreaded processor from
> > > slowing down a higher prio task on another sibling of the same processor.
> > >
> > > Basically the scheduler would put the low prio task to sleep for an
> > > adequate time slice to allow the other sibling to run at full speed for
> > > a while.
<snip>
> > > If that is the case, turning off CONFIG_SCHED_SMT would solve the problem.
<snip>
> > Note that Intel does make multicore HT processors, and hopefully when
> > this code works as intended it will result in more total throughput. My
> > supposition is that it currently is NOT working as intended, since
> > disabling SMT scheduling is reported to help.
>
> It does help, but we still drop off, clearly. Also, that's my
> baseline, so I'm not able to reproduce the *sharp* dropoff from the
> blog post yet.
>
> > A test with MC on and SMT off would be informative for where to look next.
>
> I'm rebooting my box with 2.6.20.1 and exactly this setup now.

Here are the results:

idle.png: average % idle over 120s runs from 1 to 32 threads
transactions.png: TPS over 120s runs from 1 to 32 threads

Hope the data is useful. All I can conclude right now is that SMT
appears to help (contradicting what I said earlier), but that MC seems
to have no effect (or no substantial effect).

Thanks,
Nish
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, 2007-02-27 at 20:05 +0100, Lorenzo Allegrucci wrote:
> On Tue, 2007-02-27 at 09:02 -0500, Rik van Riel wrote:
> > That still doesn't fix the potential Linux problem that this
> > benchmark identified.
> >
> > To clarify: I don't care as much about MySQL performance as
> > I care about identifying and fixing this potential bug in
> > Linux.
>
> Here http://people.freebsd.org/~kris/scaling/mysql.html Kris Kennaway
> describes a patch for FreeBSD 7 which addresses poor scalability
> of file descriptor locking, and says that it's responsible for almost
> all of the performance and scaling improvements.

How does Linux scale with many threads contending for the file
descriptor lock?
Has anyone tried to run the test with oprofile?


Re: SMP performance degradation with sysbench [ In reply to ]
Hi Nick,

> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> at this, send me a mail (eg. especially with the sched_setscheduler issue,
> you might be able to do something better).

I took a look at this today and figured I'd document it:

http://ozlabs.org/~anton/linux/sysbench/

Bottom line: the problem appears to be in the glibc malloc library;
replacing it with the Google malloc library (tcmalloc) fixes the negative
scaling:

# apt-get install libgoogle-perftools0
# LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld
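
To try the same LD_PRELOAD comparison without a full MySQL/sysbench
setup, a small threaded malloc/free loop along the lines below could be
used. This is a rough sketch under assumptions: run_alloc_threads,
alloc_loop and the allocation sizes are invented for illustration, and
nothing in this thread shows that such a loop reproduces the sysbench
behaviour.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

enum { MAX_THREADS = 64 };      /* arbitrary cap for this sketch */

struct worker {
    pthread_t tid;
    long iters;                 /* allocations to attempt */
    long done;                  /* allocations completed */
};

/* Each thread repeatedly allocates and frees a small block. */
static void *alloc_loop(void *arg)
{
    struct worker *w = arg;

    for (long i = 0; i < w->iters; i++) {
        /* volatile keeps the compiler from eliding malloc/free */
        char *volatile p = malloc(64 + (size_t)(i % 512));
        if (p == NULL)
            break;
        p[0] = 1;
        free(p);
        w->done++;
    }
    return NULL;
}

/* Run nthreads allocators; return total completed allocations. */
static long run_alloc_threads(int nthreads, long iters)
{
    struct worker w[MAX_THREADS];
    long total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int i = 0; i < nthreads; i++) {
        w[i].iters = iters;
        w[i].done = 0;
        int rc = pthread_create(&w[i].tid, NULL, alloc_loop, &w[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(w[i].tid, NULL);
        total += w[i].done;
    }
    return total;
}
```

Calling run_alloc_threads() from a small main() and timing the program
once normally and once under LD_PRELOAD=/usr/lib/libtcmalloc.so, for a
range of thread counts, gives a rough allocator-only comparison.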

Anton
Re: SMP performance degradation with sysbench [ In reply to ]
Anton Blanchard wrote:
>
> Hi Nick,
>
>
>>Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>>at this, send me a mail (eg. especially with the sched_setscheduler issue,
>>you might be able to do something better).
>
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library, replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Hi Anton,

Very cool. Yeah I had come to the conclusion that it wasn't a kernel
issue, and basically was afraid to look into userspace ;)

That bogus setscheduler thing must surely have never worked, though.
I wonder if FreeBSD avoids the scalability issue because it is using
SCHED_RR there, or because it has a decent threaded malloc implementation.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
Anton Blanchard wrote:
>
> Hi Nick,
>
>> Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
>> at this, send me a mail (eg. especially with the sched_setscheduler issue,
>> you might be able to do something better).
>
> I took a look at this today and figured I'd document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library, replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Hi Anton, thanks for the report.
glibc certainly has many scalability problems.

One known problem is its (ab)use of mmap() to allocate one (yes: one!)
page every time you fopen() a file, followed by a munmap() at fclose() time.

mmap()/munmap() should be avoided like the plague in multithreaded programs.
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
> Hi Anton,
>
> Very cool. Yeah I had come to the conclusion that it wasn't a kernel
> issue, and basically was afraid to look into userspace ;)

btw, regardless of what glibc is doing, the cpu still shouldn't go
idle IMHO. Even if we're overscheduling and thrashing over the mmap_sem
with threads (no idea if other OSes schedule the task away when they
find the other cpu in the mmap critical section), or if we're
overscheduling with futex locking, the cpu usage should remain 100%
system time in the worst case. The only legitimate explanation for going
idle could be on HT cpus, where HT may hurt more than help, but
on real multicore it shouldn't happen.
Re: SMP performance degradation with sysbench [ In reply to ]
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 04:11:02PM +1100, Nick Piggin wrote:
>
>>Hi Anton,
>>
>>Very cool. Yeah I had come to the conclusion that it wasn't a kernel
>>issue, and basically was afraid to look into userspace ;)
>
>
> btw, regardless of what glibc is doing, the cpu still shouldn't go
> idle IMHO. Even if we're overscheduling and thrashing over the mmap_sem
> with threads (no idea if other OSes schedule the task away when they
> find the other cpu in the mmap critical section), or if we're
> overscheduling with futex locking, the cpu usage should remain 100%
> system time in the worst case. The only legitimate explanation for going
> idle could be on HT cpus, where HT may hurt more than help, but
> on real multicore it shouldn't happen.
>

Well ignoring the HT issue, I was seeing lots of idle time simply
because userspace could not keep up enough load to the scheduler.
There simply were fewer runnable tasks than CPU cores.

But it wasn't a case of all CPUs going idle, just most of them ;)

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
> Well ignoring the HT issue, I was seeing lots of idle time simply
> because userspace could not keep up enough load to the scheduler.
> There simply were fewer runnable tasks than CPU cores.

When you said idle I thought idle and not waiting for I/O. Waiting for
I/O would be hardly a kernel issue ;). If they're not waiting for I/O
and they're not scheduling in userland with nanosleep/pause, the cpu
shouldn't go idle. Even if they're calling sched_yield in a loop the
cpu should account for zero idle time as far as I can tell.
Re: SMP performance degradation with sysbench [ In reply to ]
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:06:14PM +1100, Nick Piggin wrote:
>
>>Well ignoring the HT issue, I was seeing lots of idle time simply
>>because userspace could not keep up enough load to the scheduler.
>>There simply were fewer runnable tasks than CPU cores.
>
>
> When you said idle I thought idle and not waiting for I/O. Waiting for
> I/O would be hardly a kernel issue ;). If they're not waiting for I/O
> and they're not scheduling in userland with nanosleep/pause, the cpu
> shouldn't go idle. Even if they're calling sched_yield in a loop the
> cpu should account for zero idle time as far as I can tell.

Well it wasn't iowait time. From Anton's analysis, I would probably
say it was time waiting for either the glibc malloc mutex or MySQL
heap mutex.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
> Well it wasn't iowait time. From Anton's analysis, I would probably
> say it was time waiting for either the glibc malloc mutex or MySQL
> heap mutex.

So it again makes little sense to me that this is idle time, unless
some userland mutex has a usleep in the slow path, which would be very
wrong. In the worst case they should yield() (yield can still waste
lots of cpu if two tasks in the slow path call it while the holder
is not scheduled, but at least it wouldn't be idle time).

Idle time suggests either a kernel issue in the scheduler or some
userland inefficiency (the latter sounds more likely).
Re: SMP performance degradation with sysbench [ In reply to ]
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 09:37:54PM +1100, Nick Piggin wrote:
>
>>Well it wasn't iowait time. From Anton's analysis, I would probably
>>say it was time waiting for either the glibc malloc mutex or MySQL
>>heap mutex.
>
>
> So it again makes little sense to me that this is idle time, unless
> some userland mutex has a usleep in the slow path which would be very
> wrong, in the worst case they should yield() (yield can still waste
> lots of cpu if two tasks in the slow paths calls it while the holder
> is not scheduled, but at least it wouldn't be idle time).

They'll be sleeping in futex_wait in the kernel, I think. One thread
will hold the critical mutex, some will be off doing their own thing,
but importantly there will be many sleeping for the mutex to become
available.
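
A small sketch of why such waiters show up as idle time rather than
user or system time: a thread blocked on a contended pthread mutex
sleeps in the kernel and accumulates almost no CPU time of its own.
The names hot, waiter and cpu_ns_while_blocked are hypothetical, and
the futex behaviour described in the comments assumes Linux with NPTL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t hot = PTHREAD_MUTEX_INITIALIZER;

/* Block on the contended mutex, then report this thread's CPU time. */
static void *waiter(void *arg)
{
    long *cpu_ns = arg;
    struct timespec ts;

    pthread_mutex_lock(&hot);   /* sleeps until the holder unlocks */
    pthread_mutex_unlock(&hot);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    *cpu_ns = (long)ts.tv_sec * 1000000000L + ts.tv_nsec;
    return NULL;
}

/*
 * Hold the mutex for hold_ms milliseconds while another thread waits
 * for it; return the CPU time (ns) the waiting thread consumed.
 */
static long cpu_ns_while_blocked(long hold_ms)
{
    pthread_t t;
    long cpu_ns = -1;
    struct timespec hold = {
        .tv_sec = hold_ms / 1000,
        .tv_nsec = (hold_ms % 1000) * 1000000L,
    };

    pthread_mutex_lock(&hot);
    int rc = pthread_create(&t, NULL, waiter, &cpu_ns);
    assert(rc == 0);
    (void)rc;
    nanosleep(&hold, NULL);     /* stand-in for a long critical section */
    pthread_mutex_unlock(&hot);
    pthread_join(t, NULL);
    return cpu_ns;
}
```

With a 200 ms hold, the waiter's own CPU time should be a tiny fraction
of the wall-clock wait; with many such waiters and one holder, the
remaining CPUs have nothing runnable and are accounted as idle.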

> Idle time is suspicious for a kernel issue in the scheduler or some
> userland inefficiency (the latter sounds more likely).

That is what I first suspected, because the dropoff appeared to happen
exactly after we saturated the CPU count: it seems like a scheduler
artifact.

However, I tested with a bigger system and actually the idle time
comes before we saturate all CPUs. Also, increasing the aggressiveness
of the load balancer did not drop idle time at all, so it is not a case
of some runqueues idle while others have many threads on them.


I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
glibc allocator. But I wonder if there are other improvements that glibc
can do here?

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>
> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

I cooked a patch some time ago to speed up threaded apps and got no feedback.

http://lkml.org/lkml/2006/8/9/26

Maybe we have to wait for 32-core CPUs before thinking about cache line
bouncing...

Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
> They'll be sleeping in futex_wait in the kernel, I think. One thread
> will hold the critical mutex, some will be off doing their own thing,
> but importantly there will be many sleeping for the mutex to become
> available.

The initial assumption was that there was zero idle time with threads
= cpus and the idle time showed up only when the number of threads
increased to double the number of cpus. If the idle time didn't
increase with the number of threads, nothing would be suspect.

> However, I tested with a bigger system and actually the idle time
> comes before we saturate all CPUs. Also, increasing the aggressiveness
> of the load balancer did not drop idle time at all, so it is not a case
> of some runqueues idle while others have many threads on them.

It'd be interesting to see the sysrq+t after the idle time
increased.

> I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
> glibc allocator. But I wonder if there are other improvements that glibc
> can do here?

My wild guess is that they're allocating memory after taking
futexes. If they do, something like this will happen:

taskA                   taskB                   taskC
user lock
                        mmap_sem lock
mmap_sem -> schedule
                                                user lock -> schedule

If taskB weren't there triggering more random thrashing over the
mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.

I suspect the real fix is not to allocate memory or to run other
expensive syscalls that can block inside the futex critical sections...
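
The suggestion above can be sketched as follows: do the allocation (and
the free) outside the user-level lock, and hold the lock only for the
pointer updates. This is a hypothetical linked-list example, not MySQL
code; list_lock, push_value and pop_value are invented names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *list_head;

/* Allocate before locking; only the link-in runs under the lock. */
static int push_value(int value)
{
    struct node *n = malloc(sizeof *n);    /* may enter the kernel */

    if (n == NULL)
        return -1;
    n->value = value;

    pthread_mutex_lock(&list_lock);
    n->next = list_head;
    list_head = n;
    pthread_mutex_unlock(&list_lock);
    return 0;
}

/* Unlink under the lock; free after the lock has been dropped. */
static int pop_value(int *out)
{
    struct node *n;

    assert(out != NULL);
    pthread_mutex_lock(&list_lock);
    n = list_head;
    if (n != NULL)
        list_head = n->next;
    pthread_mutex_unlock(&list_lock);

    if (n == NULL)
        return -1;
    *out = n->value;
    free(n);
    return 0;
}
```

The critical section then contains no call that can block on the
mmap_sem, so a waiter on list_lock is never stuck behind a holder that
is itself sleeping in the kernel.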
Re: SMP performance degradation with sysbench [ In reply to ]
Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:12, Nick Piggin wrote:
>
>>I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
>>glibc allocator. But I wonder if there are other improvements that glibc
>>can do here?
>
>
> I cooked a patch some time ago to speedup threaded apps and got no feedback.

Well that doesn't help in this case. I tested and the mmap_sem contention
is not an issue.

> http://lkml.org/lkml/2006/8/9/26
>
> Maybe we have to wait for 32 core cpu before thinking of cache line
> bouncings...

The idea is a good one, and I was half way through implementing similar
myself at one point (some java apps hit this badly).

It is just horribly sad that futexes are supposed to implement a
_scalable_ thread synchronisation mechanism, whilst fundamentally
relying on an mm-wide lock to operate.

I don't like your interface, but then again, the futex interface isn't
exactly pretty anyway.

You should resubmit the patch, and get the glibc guys to use it.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:

> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
> taskA                   taskB                   taskC
> user lock
>                         mmap_sem lock
> mmap_sem -> schedule
>                                                 user lock -> schedule
>
> If taskB weren't there triggering more random thrashing over the
> mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...

glibc malloc uses arenas, and trylock() only. It should not block, because
if an arena is already locked, the thread automatically chooses another
arena, and might create a new one if necessary.

But yes, mmap_sem contention is a big problem, because it's also taken by
the futex code (unfortunately).

Re: SMP performance degradation with sysbench [ In reply to ]
Andrea Arcangeli wrote:
> On Tue, Mar 13, 2007 at 10:12:19PM +1100, Nick Piggin wrote:
>
>>They'll be sleeping in futex_wait in the kernel, I think. One thread
>>will hold the critical mutex, some will be off doing their own thing,
>>but importantly there will be many sleeping for the mutex to become
>>available.
>
>
> The initial assumption was that there was zero idle time with threads
> = cpus and the idle time showed up only when the number of threads
> increased to the double the number of cpus. If the idle time wouldn't
> increase with the number of threads, nothing would be suspect.

Well, I think more threads ~= higher probability that the lock holder is
going to be preempted while holding the mutex?

This might be why FreeBSD works much better, because it looks like MySQL
actually will set RT scheduling for those processes that take critical
resources.

>>However, I tested with a bigger system and actually the idle time
>>comes before we saturate all CPUs. Also, increasing the aggressiveness
>>of the load balancer did not drop idle time at all, so it is not a case
>>of some runqueues idle while others have many threads on them.
>
>
> It'd be interesting to see the sysrq+t after the idle time
> increased.
>
>
>>I guess googlemalloc (tcmalloc?) isn't suitable for a general purpose
>>glibc allocator. But I wonder if there are other improvements that glibc
>>can do here?
>
>
> My wild guess is that they're allocating memory after taking
> futexes. If they do, something like this will happen:
>
> taskA                   taskB                   taskC
> user lock
>                         mmap_sem lock
> mmap_sem -> schedule
>                                                 user lock -> schedule
>
> If taskB weren't there triggering more random thrashing over the
> mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
>
> I suspect the real fix is not to allocate memory or to run other
> expensive syscalls that can block inside the futex critical sections...


I would agree that it points to MySQL scalability issues, however the
fact that such large gains come from tcmalloc is still interesting.

--
SUSE Labs, Novell Inc.
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, Mar 13, 2007 at 01:02:44PM +0100, Eric Dumazet wrote:
> On Tuesday 13 March 2007 12:42, Andrea Arcangeli wrote:
>
> > My wild guess is that they're allocating memory after taking
> > futexes. If they do, something like this will happen:
> >
> > taskA                   taskB                   taskC
> > user lock
> >                         mmap_sem lock
> > mmap_sem -> schedule
> >                                                 user lock -> schedule
> >
> > If taskB weren't there triggering more random thrashing over the
> > mmap_sem, the lock holder wouldn't wait and taskC wouldn't wait either.
> >
> > I suspect the real fix is not to allocate memory or to run other
> > expensive syscalls that can block inside the futex critical sections...
>
> glibc malloc uses arenas, and trylock() only. It should not block because if
> an arena is already locked, thread automatically chose another arena, and
> might create a new one if necessary.

Well, it uses trylock only when allocating; free uses a normal lock.
By default glibc malloc uses the same arena for all threads; only when
it sees contention during allocation does it give different threads
different arenas. So, e.g., if mysql did all allocations while holding some
global heap lock (so glibc wouldn't see any contention on allocation), but
freeing were done outside of the application's critical section, you would
see contention on the main arena's lock in the free path.
Calling malloc_stats() from e.g. an atexit handler could give interesting
details, especially if you recompile glibc malloc with -DTHREAD_STATS=1.
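
A minimal sketch of the atexit idea, assuming glibc (malloc_stats() is
declared in its <malloc.h> and prints to stderr). The wrapper name
install_malloc_stats_at_exit is made up for illustration; how it would
be hooked into mysqld is not covered here.

```c
#include <assert.h>
#include <malloc.h>     /* glibc-specific: malloc_stats() */
#include <stdlib.h>

/*
 * Register malloc_stats() to run at normal process exit.
 * Returns 0 on success, nonzero if registration failed.
 */
static int install_malloc_stats_at_exit(void)
{
    int rc = atexit(malloc_stats);

    assert(rc == 0);    /* debug aid; callers should still check rc */
    return rc;
}
```

Called once early in main(), this prints per-arena usage when the
process exits normally; the extra lock-contention counters only appear
if glibc malloc itself was built with -DTHREAD_STATS=1, as noted above.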

Jakub
Re: SMP performance degradation with sysbench [ In reply to ]
On 3/12/07, Anton Blanchard <anton@samba.org> wrote:
>
> Hi Nick,
>
> > Anyway, I'll keep experimenting. If anyone from MySQL wants to help look
> > at this, send me a mail (eg. especially with the sched_setscheduler issue,
> > you might be able to do something better).
>
> I took a look at this today and figured Id document it:
>
> http://ozlabs.org/~anton/linux/sysbench/
>
> Bottom line: it looks like issues in the glibc malloc library, replacing
> it with the google malloc library fixes the negative scaling:
>
> # apt-get install libgoogle-perftools0
> # LD_PRELOAD=/usr/lib/libtcmalloc.so /usr/sbin/mysqld

Quick datapoint, still collecting data and trying to verify it's
always the case: on my 8-way Xeon, I'm actually seeing *much* worse
performance with libtcmalloc.so compared to mainline. Am generating
graphs and such still, but maybe someone else with x86_64 hardware
could try the google PRELOAD and see if it helps/hurts (to rule out
tester stupidity)?

Thanks,
Nish
Re: SMP performance degradation with sysbench [ In reply to ]
Nish Aravamudan wrote:
> Quick datapoint, still collecting data and trying to verify it's
> always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> performance with libtcmalloc.so compared to mainline. Am generating
> graphs and such still, but maybe someone else with x86_64 hardware
> could try the google PRELOAD and see if it helps/hurts (to rule out
> tester stupidity)?

I wish I had an 8-way test platform :)

Anyway, could you post some oprofile results?

Re: SMP performance degradation with sysbench [ In reply to ]
On 3/13/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> Nish Aravamudan wrote:
> > Quick datapoint, still collecting data and trying to verify it's
> > always the case: on my 8-way Xeon, I'm actually seeing *much* worse
> > performance with libtcmalloc.so compared to mainline. Am generating
> > graphs and such still, but maybe someone else with x86_64 hardware
> > could try the google PRELOAD and see if it helps/hurts (to rule out
> > tester stupidity)?
>
> I wish I had an 8-way test platform :)
>
> Anyway, could you post some oprofile results?

Hopefully soon -- want to still make sure I'm not doing something
dumb. Am also hoping to get some of the gdb backtraces like Anton had.

Thanks,
Nish
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> I would agree that it points to MySQL scalability issues, however the
> fact that such large gains come from tcmalloc is still interesting.

What glibc version are you, Anton, and the others using?

Does that version include this fix?

Dynamically size mmap threshold if the program frees mmaped blocks.

http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc

thanks,
suresh
Re: SMP performance degradation with sysbench [ In reply to ]
On Wed, 2007-03-14 at 16:33 -0700, Siddha, Suresh B wrote:
> On Tue, Mar 13, 2007 at 05:08:59AM -0700, Nick Piggin wrote:
> > I would agree that it points to MySQL scalability issues, however the
> > fact that such large gains come from tcmalloc is still interesting.
>
> What glibc version are you, Anton, and the others using?
>
> Does that version include this fix?
>
> Dynamically size mmap threshold if the program frees mmaped blocks.
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/libc/malloc/malloc.c.diff?r1=1.158&r2=1.159&cvsroot=glibc
>
Last week I reproduced it on RHEL4U3 with glibc 2.3.4-2.19. Today I
installed RHEL5GA and reproduced it again. RHEL5GA uses glibc 2.5-12, which
already includes the dynamic mmap threshold patch, so that patch doesn't
resolve the issue.

The problem really is related to multithreaded malloc/free in glibc.

My Paxville machine has 16 logical CPUs (dual-core + HT). I disabled HT by
hot-removing the last 8 logical processors.

I captured the scheduling statistics. With sysbench thread=8 (best performance),
about 3.4% of context switches are caused by __down_read/__down_write_nested.
With sysbench thread=10, the percentage rises to 11.83%.

I captured the thread states with gdb. With sysbench thread=10, usually 2 threads
are calling mprotect/mmap. With sysbench thread=8, no threads are calling
mprotect/mmap. Each capture is a random sample, but I repeated it many times.

I think the increased share of context switches related to
__down_read/__down_write_nested is caused by mprotect/mmap. mprotect/mmap
take the mm's semaphore, so there is contention on that semaphore, which
brings performance down.

strace shows that mysqld often calls mprotect/mmap with the same data length,
61440. That's further evidence. gdb showed that the mprotect is called via
init_io_malloc=>my_malloc=>malloc=>_int_malloc=>mprotect, and the mmap via
_int_free=>mmap. I checked the glibc source and found that the real call
chains are malloc=>_int_malloc=>grow_heap=>mprotect and _int_free=>heap_trim=>mmap.
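
To make the two call chains concrete, here is a toy C sketch of a heap that reserves address space up front, grows by making more of it accessible with mprotect, and shrinks by mapping inaccessible pages over its tail with mmap. It is only a model of the mechanism: struct toy_heap and the toy_heap_* functions are hypothetical names, not glibc code, and all bookkeeping that real malloc does is left out.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical toy heap: a fixed reservation plus an accessible prefix. */
struct toy_heap {
    char  *base;       /* start of the reserved address range */
    size_t reserved;   /* bytes reserved */
    size_t committed;  /* bytes currently readable and writable */
};

static size_t round_up_to_page(size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (n + page - 1) / page * page;
}

/* Reserve address space with no access rights. Returns 0 on success. */
int toy_heap_init(struct toy_heap *h, size_t reserve)
{
    assert(h != NULL);
    reserve = round_up_to_page(reserve);
    void *p = mmap(NULL, reserve, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    h->base = p;
    h->reserved = reserve;
    h->committed = 0;
    return 0;
}

/* Grow: one mprotect() call makes the next pages usable. */
int toy_heap_grow(struct toy_heap *h, size_t bytes)
{
    bytes = round_up_to_page(bytes);
    if (bytes > h->reserved - h->committed)
        return -1;
    if (mprotect(h->base + h->committed, bytes,
                 PROT_READ | PROT_WRITE) != 0)
        return -1;
    h->committed += bytes;
    return 0;
}

/* Shrink: one mmap() call with MAP_FIXED replaces the tail with fresh
   inaccessible pages, discarding their contents. */
int toy_heap_shrink(struct toy_heap *h, size_t bytes)
{
    bytes = round_up_to_page(bytes);
    if (bytes > h->committed)
        return -1;
    if (bytes == 0)
        return 0;
    void *tail = h->base + (h->committed - bytes);
    if (mmap(tail, bytes, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
        return -1;
    h->committed -= bytes;
    return 0;
}

/* Release the whole reservation. */
void toy_heap_destroy(struct toy_heap *h)
{
    munmap(h->base, h->reserved);
    h->base = NULL;
    h->reserved = 0;
    h->committed = 0;
}
```

In this model, a thread that repeatedly grows and shrinks by the same amount, for example 61440 bytes, issues one mprotect and one mmap per round. That matches the repeated pairs of system calls seen in strace.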

My guess at how mysql/sysbench processes transactions: mysql accepts a connection
and initializes a block for it. After processing a couple of transactions,
sysbench closes the connection. Then the procedure starts over.

So why are there so many mprotect/mmap?

Glibc uses arenas to speed up malloc/free in multithreaded environments.
mp_.trim_threshold only controls main_arena. In function _int_free,
FASTBIN_CONSOLIDATION_THRE might be helpful, but it's a fixed value.

The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.

To verify my idea, I created a small patch. When freeing a block, it always
checks mp_.trim_threshold, even if the block is not in the main arena. The
patch is only meant to verify my idea; it is not the final solution.

--- glibc-2.5-20061008T1257_bak/malloc/malloc.c 2006-09-08 00:06:02.000000000 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c 2007-03-20 07:41:03.000000000 +0800
@@ -4607,10 +4607,13 @@ _int_free(mstate av, Void_t* mem)
       } else {
         /* Always try heap_trim(), even if the top chunk is not
            large, because the corresponding heap might go away. */
+        if ((unsigned long)(chunksize(av->top)) >=
+            (unsigned long)(mp_.trim_threshold)) {
         heap_info *heap = heap_for_ptr(top(av));

         assert(heap->ar_ptr == av);
         heap_trim(heap, mp_.top_pad);
+        }
       }
     }


With the patch, I recompiled glibc and reran sysbench/mysql. The result is good.
When the thread count is larger than 8, the tps and average response time stay
smooth and don't drop severely.

Is anyone able to test it on an AMD machine?

Yanmin
Re: SMP performance degradation with sysbench [ In reply to ]
On Tue, 2007-03-20 at 10:29 +0800, Zhang, Yanmin wrote:
> The *ROOT CAUSE* is that the dynamic thresholds don't apply to non-main arenas.
>
> To verify my idea, I created a small patch. When freeing a block, it always
> checks mp_.trim_threshold, even if the block is not in the main arena. The
> patch is only meant to verify my idea; it is not the final solution.
>
I sent a new patch to the glibc maintainer but didn't get a response, so I'm resending it here.

Glibc arenas exist to reduce malloc/free contention among threads. But an arena
shrinks aggressively, and therefore also grows aggressively. When heaps grow, mprotect
is called. When heaps shrink, mmap is called. In the kernel, both mmap and mprotect
need to hold the write lock of mm->mmap_sem, which introduces new contention. The new
contention effectively cancels the benefit of arenas.
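
The effect of gating the trim on a threshold can be seen in a deliberately simplified C model of the grow/shrink cycle described above. It only counts how often a heap would have to grow (mprotect) or shrink (mmap) over a series of allocate-then-free rounds. The names simulate and struct sim_result are hypothetical, the model ignores top_pad and page rounding, and a trim is assumed to release all free space at the top of the heap. A threshold of 0 stands in for the unpatched behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Counters for the modelled system calls (hypothetical names). */
struct sim_result {
    long grows;    /* times the heap had to be extended (mprotect) */
    long shrinks;  /* times the heap was trimmed (mmap) */
};

/* Simulate `cycles` rounds of "allocate block, then free it" on a single
   non-main-arena heap. A trim happens after a free only if the free
   space at the top of the heap is at least trim_threshold bytes. */
struct sim_result simulate(long cycles, size_t block, size_t trim_threshold)
{
    struct sim_result r = {0, 0};
    size_t top_free = 0;   /* free bytes at the top of the heap */

    assert(cycles >= 0);
    for (long i = 0; i < cycles; i++) {
        if (top_free < block) {
            /* Not enough room: extend the heap by what is missing. */
            top_free = block;
            r.grows++;
        }
        top_free -= block;   /* allocate */
        top_free += block;   /* free: the block rejoins the top chunk */

        if (top_free > 0 && top_free >= trim_threshold) {
            /* Give the free top space back to the kernel. */
            top_free = 0;
            r.shrinks++;
        }
    }
    return r;
}
```

With a block of 61440 bytes, simulate(1000, 61440, 0) reports 1000 grows and 1000 shrinks, while simulate(1000, 61440, 128 * 1024) reports a single grow and no shrinks, because the freed space stays below the threshold and is simply reused.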

Here is a new patch to address this issue.

Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>

---

--- glibc-2.5-20061008T1257_bak/malloc/malloc.c 2006-09-08 00:06:02.000000000 +0800
+++ glibc-2.5-20061008T1257/malloc/malloc.c 2007-03-30 09:01:18.000000000 +0800
@@ -4605,12 +4605,13 @@ _int_free(mstate av, Void_t* mem)
           sYSTRIm(mp_.top_pad, av);
 #endif
       } else {
-        /* Always try heap_trim(), even if the top chunk is not
-           large, because the corresponding heap might go away. */
-        heap_info *heap = heap_for_ptr(top(av));
-
-        assert(heap->ar_ptr == av);
-        heap_trim(heap, mp_.top_pad);
+        if ((unsigned long)(chunksize(av->top)) >=
+            (unsigned long)(mp_.trim_threshold)) {
+          heap_info *heap = heap_for_ptr(top(av));
+
+          assert(heap->ar_ptr == av);
+          heap_trim(heap, mp_.top_pad);
+        }
       }
     }
