Mailing List Archive

Virt overhead with HT [was: Re: Xen 4.5 development update]
[Sorry to those who will receive this mail twice, but I managed to
drop xen-devel when replying the first time :-(]

On mar, 2014-07-01 at 12:43 -0400, konrad.wilk@oracle.com wrote:
> == x86 ==

> * HT enabled, virtualization overhead is high (Xen 4.4) (none)
> kernbench demonstrated it
> looking and tracing it
> - Dario Faggioli
>
I spent some time running kernbench on different boxes and with
different configurations. After all this, here's what I found.

So, on a non-NUMA, HT- and EPT-capable box, both the BAREMETAL and the HVM
cases were using 8G RAM and 8 CPUs/VCPUs. HT was enabled in the BIOS:

Elapsed(stddev) BAREMETAL HVM
kernbench -j4 31.604 (0.0963328) 34.078 (0.168582)
kernbench -j8 26.586 (0.145705) 26.672 (0.0432435)
kernbench -j 27.358 (0.440307) 27.49 (0.364897)

With HT disabled in BIOS (which means only 4 CPUs for both):
Elapsed(stddev) BAREMETAL HVM
kernbench -j4 57.754 (0.0642651) 56.46 (0.0578792)
kernbench -j8 31.228 (0.0775887) 31.362 (0.210998)
kernbench -j 32.316 (0.0270185) 33.084 (0.600442)
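(For context, the three load levels are just kernel builds at half, full
and unbounded parallelism. Below is a minimal sketch of how one could
reproduce something similar with a plain kernel tree and make, in case
kernbench itself is not at hand; the tree path and the job counts are just
examples:)

  # Approximate the three kernbench load levels on the 8-CPU (HT) box.
  # Assumes a kernel source tree in ./linux with a default config.
  cd linux && make defconfig >/dev/null
  for jobs in 4 8 ""; do              # half load, optimal load, maximal load
      make clean >/dev/null
      echo "== jobs: ${jobs:-unbounded} =="
      /usr/bin/time -f "Elapsed: %e s" make -j${jobs} vmlinux >/dev/null
  done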

So, first of all, there is not much difference, in terms of performance
degradation when going from baremetal to guest, between the HT and no-HT
cases.

In the HT-enabled case, there is a slight performance degradation in the
least loaded run, but certainly not of the nature Stefano saw, and all
goes pretty well when the load increases.

I guess I can investigate a bit more about what happens with '-j4'. What
I suspect is that the scheduler may make a few non-optimal decisions wrt
HT, when there are more PCPUs than busy guest VCPUs. This may be due to
the fact that Dom0 (or another guest VCPU doing stuff other than
kernbench) may already be running on PCPUs that are on different cores
than the guest's ones (i.e., the guest VCPUs that want to run
kernbench), and that may force two of the guest's VCPUs to execute on
two sibling HTs some of the time (which of course is something that does
not happen on baremetal!).
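
(One way to check this would be to watch where the guest's VCPUs actually
land while the benchmark runs. A rough sketch with xl; the domain name
"hvm1" is an assumption, and the exact output format of these commands may
vary:)

  # Host topology: which PCPUs are HT siblings on the same core.
  xl info -n

  # Sample the guest's VCPU placement once per second during the kernbench
  # run; two busy VCPUs repeatedly showing up on sibling PCPUs would
  # support the theory above.
  while sleep 1; do
      xl vcpu-list hvm1
  done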

However, that is, I think, a separate issue, and it looks to me that
the original HT perf regression we were chasing, the one this item is
about, may actually have disappeared, or was caused by something
other than HT. :-P

Thoughts? Do we think this is enough to kill the "disable hyperthreading
hint" from the performance tuning page on the Wiki?
http://wiki.xenproject.org/wiki/Tuning_Xen_for_Performance#Hyperthreading_in_Xen_4.3_and_4.4

Regards,
Dario


--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On 07/14/2014 05:12 PM, Dario Faggioli wrote:
> [Sorry to those who will receive this mail twice, but I managed to
> drop xen-devel when replying the first time :-(]
>
> On mar, 2014-07-01 at 12:43 -0400, konrad.wilk@oracle.com wrote:
>> == x86 ==
>
>> * HT enabled, virtualization overhead is high (Xen 4.4) (none)
>> kernbench demonstrated it
>> looking and tracing it
>> - Dario Faggioli
>>
> I spent some time running kernbench on different boxes and with
> different configurations. After all this, here's what I found.
>
> So, on a non-NUMA, HT- and EPT-capable box, both the BAREMETAL and the HVM
> cases were using 8G RAM and 8 CPUs/VCPUs. HT was enabled in the BIOS:
>
> Elapsed(stddev) BAREMETAL HVM
> kernbench -j4 31.604 (0.0963328) 34.078 (0.168582)
> kernbench -j8 26.586 (0.145705) 26.672 (0.0432435)
> kernbench -j 27.358 (0.440307) 27.49 (0.364897)
>
> With HT disabled in BIOS (which means only 4 CPUs for both):
> Elapsed(stddev) BAREMETAL HVM
> kernbench -j4 57.754 (0.0642651) 56.46 (0.0578792)
> kernbench -j8 31.228 (0.0775887) 31.362 (0.210998)
> kernbench -j 32.316 (0.0270185) 33.084 (0.600442)

Just to make sure I'm reading this right - _disabling_ HT causes a near
50% performance drop?



Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On Mon, 2014-07-14 at 17:32 +0100, Gordan Bobic wrote:
> On 07/14/2014 05:12 PM, Dario Faggioli wrote:
> > Elapsed(stddev) BAREMETAL HVM
> > kernbench -j4 31.604 (0.0963328) 34.078 (0.168582)
> > kernbench -j8 26.586 (0.145705) 26.672 (0.0432435)
> > kernbench -j 27.358 (0.440307) 27.49 (0.364897)
> >
> > With HT disabled in BIOS (which means only 4 CPUs for both):
> > Elapsed(stddev) BAREMETAL HVM
> > kernbench -j4 57.754 (0.0642651) 56.46 (0.0578792)
> > kernbench -j8 31.228 (0.0775887) 31.362 (0.210998)
> > kernbench -j 32.316 (0.0270185) 33.084 (0.600442)
>
BTW, there's a mistake here. The three runs, in the no-HT case, are as
follows:
kernbench -j2
kernbench -j4
kernbench -j

I.e., half the number of VCPUs, as many as there are VCPUs, and
unlimited, exactly as in the HT case.

The numbers themselves are the right ones.

> Just to make sure I'm reading this right - _disabling_ HT causes a near
> 50% performance drop?
>
For kernbench, and if you consider the "-j <half_of_nr_cpus>" run, yes,
nearly. And that holds both for baremetal and for the HVM guest. And by
baremetal, I mean just bare Linux, no Xen involved at all.

Doesn't this make sense? Well, perhaps the wrong job counts I originally
reported were misleading... better now?

BTW, the idea here was to compare perf between baremetal and HVM, and
they appear to be consistent.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On 07/14/2014 05:44 PM, Dario Faggioli wrote:
> On Mon, 2014-07-14 at 17:32 +0100, Gordan Bobic wrote:
>> On 07/14/2014 05:12 PM, Dario Faggioli wrote:
>>> Elapsed(stddev) BAREMETAL HVM
>>> kernbench -j4 31.604 (0.0963328) 34.078 (0.168582)
>>> kernbench -j8 26.586 (0.145705) 26.672 (0.0432435)
>>> kernbench -j 27.358 (0.440307) 27.49 (0.364897)
>>>
>>> With HT disabled in BIOS (which means only 4 CPUs for both):
>>> Elapsed(stddev) BAREMETAL HVM
>>> kernbench -j4 57.754 (0.0642651) 56.46 (0.0578792)
>>> kernbench -j8 31.228 (0.0775887) 31.362 (0.210998)
>>> kernbench -j 32.316 (0.0270185) 33.084 (0.600442)
> BTW, there's a mistake here. The three runs, in the no-HT case, are as
> follows:
> kernbench -j2
> kernbench -j4
> kernbench -j
>
> I.e., half the number of VCPUs, as many as there are VCPUs, and
> unlimited, exactly as in the HT case.

Ah -- that's a pretty critical piece of information.

So actually, on native, HT enabled and disabled effectively produce the
same exact thing if HT is not actually being used: 31 seconds in both
cases. But on Xen, enabling HT when it's not being used (i.e., when in
theory each core should have exactly one process running), performance
goes from 31 seconds to 34 seconds -- roughly a 10% degradation.

-George


Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On Mon, 2014-07-14 at 17:55 +0100, George Dunlap wrote:
> On 07/14/2014 05:44 PM, Dario Faggioli wrote:
> > On Mon, 2014-07-14 at 17:32 +0100, Gordan Bobic wrote:
> >> On 07/14/2014 05:12 PM, Dario Faggioli wrote:
> >>> Elapsed(stddev) BAREMETAL HVM
> >>> kernbench -j4 31.604 (0.0963328) 34.078 (0.168582)
> >>> kernbench -j8 26.586 (0.145705) 26.672 (0.0432435)
> >>> kernbench -j 27.358 (0.440307) 27.49 (0.364897)
> >>>
> >>> With HT disabled in BIOS (which means only 4 CPUs for both):
> >>> Elapsed(stddev) BAREMETAL HVM
> >>> kernbench -j4 57.754 (0.0642651) 56.46 (0.0578792)
> >>> kernbench -j8 31.228 (0.0775887) 31.362 (0.210998)
> >>> kernbench -j 32.316 (0.0270185) 33.084 (0.600442)
> > BTW, there's a mistake here. The three runs, in the no-HT case, are as
> > follows:
> > kernbench -j2
> > kernbench -j4
> > kernbench -j
> >
> > I.e., half the number of VCPUs, as many as there are VCPUs, and
> > unlimited, exactly as in the HT case.
>
> Ah -- that's a pretty critical piece of information.
>
> So actually, on native, HT enabled and disabled effectively produce the
> same exact thing if HT is not actually being used: 31 seconds in both
> cases. But on Xen, enabling HT when it's not being used (i.e., when in
> theory each core should have exactly one process running), performance
> goes from 31 seconds to 34 seconds -- roughly a 10% degradation.
>
Yes. 7.96% degradation, to be precise.

I attempted an analysis in my first e-mail. Cutting and pasting it
here... What do you think?

"I guess I can investigate a bit more about what happens with '-j4'.
What I suspect is that the scheduler may make a few non-optimal
decisions wrt HT, when there are more PCPUs than busy guest VCPUs. This
may be due to the fact that Dom0 (or another guest VCPU doing other
stuff than kernbench) may be already running on PCPUs that are on
different cores than the guest's one (i.e., the guest VCPUs that wants
to run kernbench), and that may force two guest's vCPUs to execute on
two HTs some of the time (which of course is something that does not
happen on baremetal!)."

I just re-ran the benchmark with credit2, which has no SMT knowledge,
and the first run (the one that does not use HT) ended up being 37.54,
while the other two were pretty much the same as above (26.81 and
27.92).

This confirms, for me, that it's an SMT balancing issue that we're seeing.
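
(For reference, one way to run the same test under credit2, sketched from
memory; the exact field name in the xl info output may differ:)

  # Repeat the run under credit2 instead of the default credit scheduler:
  # add "sched=credit2" to the Xen line in the bootloader entry and reboot.
  # Then confirm which scheduler is actually active:
  xl info | grep -i xen_scheduler
  # xen_scheduler          : credit2    <- expected if the switch worked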

I'll try more runs, e.g. with the number of VCPUs equal to or less than
nr_cores/2, and see what happens.

Again, thoughts?

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On 07/14/2014 06:22 PM, Dario Faggioli wrote:
> On Mon, 2014-07-14 at 17:55 +0100, George Dunlap wrote:
>> On 07/14/2014 05:44 PM, Dario Faggioli wrote:
>>> On Mon, 2014-07-14 at 17:32 +0100, Gordan Bobic wrote:
>>>> On 07/14/2014 05:12 PM, Dario Faggioli wrote:
>>>>> Elapsed(stddev) BAREMETAL HVM
>>>>> kernbench -j4 31.604 (0.0963328) 34.078 (0.168582)
>>>>> kernbench -j8 26.586 (0.145705) 26.672 (0.0432435)
>>>>> kernbench -j 27.358 (0.440307) 27.49 (0.364897)
>>>>>
>>>>> With HT disabled in BIOS (which means only 4 CPUs for both):
>>>>> Elapsed(stddev) BAREMETAL HVM
>>>>> kernbench -j4 57.754 (0.0642651) 56.46 (0.0578792)
>>>>> kernbench -j8 31.228 (0.0775887) 31.362 (0.210998)
>>>>> kernbench -j 32.316 (0.0270185) 33.084 (0.600442)
>>> BTW, there's a mistake here. The three runs, in the no-HT case, are as
>>> follows:
>>> kernbench -j2
>>> kernbench -j4
>>> kernbench -j
>>>
>>> I.e., half the number of VCPUs, as many as there are VCPUs, and
>>> unlimited, exactly as in the HT case.
>>
>> Ah -- that's a pretty critical piece of information.
>>
>> So actually, on native, HT enabled and disabled effectively produce the
>> same exact thing if HT is not actually being used: 31 seconds in both
>> cases. But on Xen, enabling HT when it's not being used (i.e., when in
>> theory each core should have exactly one process running), performance
>> goes from 31 seconds to 34 seconds -- roughly a 10% degradation.
>>
> Yes. 7.96% degradation, to be precise.
>
> I attempted an analysis in my first e-mail. Cutting and pasting it
> here... What do you think?
>
> "I guess I can investigate a bit more about what happens with '-j4'.
> What I suspect is that the scheduler may make a few non-optimal
> decisions wrt HT, when there are more PCPUs than busy guest VCPUs. This
> may be due to the fact that Dom0 (or another guest VCPU doing other
> stuff than kernbench) may be already running on PCPUs that are on
> different cores than the guest's one (i.e., the guest VCPUs that wants
> to run kernbench), and that may force two guest's vCPUs to execute on
> two HTs some of the time (which of course is something that does not
> happen on baremetal!)."
>
> I just re-ran the benchmark with credit2, which has no SMT knowledge,
> and the first run (the one that does not use HT) ended up being 37.54,
> while the other two were pretty much the same as above (26.81 and
> 27.92).
>
> This confirms, for me, that it's an SMT balancing issue that we're seeing.
>
> I'll try more runs, e.g. with the number of VCPUs equal to or less than
> nr_cores/2, and see what happens.
>
> Again, thoughts?

Have you tried it with VCPUs pinned to appropriate PCPUs?



Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On lun, 2014-07-14 at 19:31 +0100, Gordan Bobic wrote:
> On 07/14/2014 06:22 PM, Dario Faggioli wrote:

> > I'll try more runs, e.g. with the number of VCPUs equal to or less than
> > nr_cores/2, and see what happens.
> >
> > Again, thoughts?
>
> Have you tried it with VCPUs pinned to appropriate PCPUs?
>
Define "appropriate".

I have a run for which I pinned VCPU#1-->PCPU#1, VCPU#2-->PCPU#2, and so
on, and the result is even worse:

Average Half load -j 4 Run (std deviation):
Elapsed Time 37.808 (0.538999)
Average Optimal load -j 8 Run (std deviation):
Elapsed Time 26.594 (0.235223)
Average Maximal load -j Run (std deviation):
Elapsed Time 27.9 (0.131149)

This is actually something I expected, since pinning does not allow the VCPUs
to move away from an HT with a busy sibling, even when they could have.

In fact, you could expect better results from pinning only if you were to
pin not only the VCPUs to the PCPUs, but also kernbench's build jobs to
the appropriate (V)CPUs in the guest... but that is not only rather
impractical, it is also not very representative as a benchmark, I
think.

If you pin VCPU#1 to PCPU#1 and VCPU#2 to PCPU#2, with PCPU#1 and PCPU#2
being HT siblings, what prevents Linux (in the guest) from running two of
the four build jobs on VCPU#1 and VCPU#2 (i.e., on sibling PCPUs!!) for
the whole length of the benchmark? Nothing, I think.

And in fact, pinning would also result in good (near to native,
perhaps?) performance, if we were exposing the SMT topology details to
guests as, in that case, Linux would do the balancing properly. However,
that's not the case either. :-(
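
(For completeness, the 1:1 pinning I used above looks roughly like this
with xl; the domain name "hvm1" is just an example:)

  # Pin VCPU#n of the guest to PCPU#n, one to one, for all 8 VCPUs.
  for v in $(seq 0 7); do
      xl vcpu-pin hvm1 $v $v
  done
  xl vcpu-list hvm1    # check the resulting placement and affinity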

But, perhaps, you were referring to a different pinning strategy?

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On 07/14/2014 11:44 PM, Dario Faggioli wrote:
> On lun, 2014-07-14 at 19:31 +0100, Gordan Bobic wrote:
>> On 07/14/2014 06:22 PM, Dario Faggioli wrote:
>
>>> I'll try more runs, e.g. with the number of VCPUs equal to or less than
>>> nr_cores/2, and see what happens.
>>>
>>> Again, thoughts?
>>
>> Have you tried it with VCPUs pinned to appropriate PCPUs?
>>
> Define "appropriate".
>
> I have a run for which I pinned VCPU#1-->PCPU#1, VCPU#2-->PCPU#2, and so
> on, and the result is even worse:
>
> Average Half load -j 4 Run (std deviation):
> Elapsed Time 37.808 (0.538999)
> Average Optimal load -j 8 Run (std deviation):
> Elapsed Time 26.594 (0.235223)
> Average Maximal load -j Run (std deviation):
> Elapsed Time 27.9 (0.131149)
>
> This is actually something I expected, since pinning does not allow the VCPUs
> to move away from an HT with a busy sibling, even when they could have.
>
> In fact, you could expect better results from pinning only if you were to
> pin not only the VCPUs to the PCPUs, but also kernbench's build jobs to
> the appropriate (V)CPUs in the guest... but that is not only rather
> impractical, it is also not very representative as a benchmark, I
> think.
>
> If you pin VCPU#1 to PCPU#1 and VCPU#2 to PCPU#2, with PCPU#1 and PCPU#2
> being HT siblings, what prevents Linux (in the guest) from running two of
> the four build jobs on VCPU#1 and VCPU#2 (i.e., on sibling PCPUs!!) for
> the whole length of the benchmark? Nothing, I think.

That would imply that Xen can somehow make a better decision than the
domU's kernel scheduler, something that doesn't seem that likely. I
would expect not pinning VCPUs to increase process migration, because Xen
might migrate a VCPU even though the kernel in domU had decided which
presented CPU was most lightly loaded.

> And in fact, pinning would also result in good (near to native,
> perhaps?) performance, if we were exposing the SMT topology details to
> guests as, in that case, Linux would do the balancing properly. However,
> that's not the case either. :-(

I see, so you are referring specifically to the HT case. I can see how
that could cause a problem. Does pinning improve the performance with HT
disabled?

Gordan

Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On mar, 2014-07-15 at 01:10 +0100, Gordan Bobic wrote:
> On 07/14/2014 11:44 PM, Dario Faggioli wrote:

> > If you pin VCPU#1 to PCPU#1 and VCPU#2 to PCPU#2, with PCPU#1 and PCPU#2
> > being HT siblings, what prevents Linux (in the guest) from running two of
> > the four build jobs on VCPU#1 and VCPU#2 (i.e., on sibling PCPUs!!) for
> > the whole length of the benchmark? Nothing, I think.
>
> That would imply that Xen can somehow make a better decision than the
> domU's kernel scheduler, something that doesn't seem that likely.
>
Well, as far as SMT load balancing is concerned, that is _exactly_ the
case. The reason is simple: Xen knows the hw topology, and hence knows
whether the sibling of an idle core is idle or busy. The guest kernel
sees nothing of this; it just treats all its (V)CPUs as full cores, so
it will most likely do a bad job in this case.
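
(The asymmetry is easy to see by comparing the two views; a sketch, with
example output that will of course vary:)

  # On the host: Xen sees the real topology, cores and their HT siblings.
  xl info -n

  # In the guest: all VCPUs look like independent single-thread cores.
  lscpu | grep -i 'thread(s) per core'
  # Thread(s) per core:    1     <- no SMT topology is exposed to the guest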

> > And in fact, pinning would also result in good (near to native,
> > perhaps?) performance, if we were exposing the SMT topology details to
> > guests as, in that case, Linux would do the balancing properly. However,
> > that's not the case either. :-(
>
> I see, so you are referring specifically to the HT case.
>
Yeah, well, that's what these benchmarks were all about :-)

> I can see how
> that could cause a problem. Does pinning improve the performance with HT
> disabled?
>
HT disabled had pretty good performance already. Anyhow, I tried:

Average Half load -j 2 Run (std deviation):
Elapsed Time 56.462 (0.109179)
Average Optimal load -j 4 Run (std deviation):
Elapsed Time 31.526 (0.224789)
Average Maximal load -j Run (std deviation):
Elapsed Time 33.04 (0.439147)

So quite similar to the no-HT unpinned case, which in its turn was
quite similar to baremetal without HT.

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Re: Virt overhead with HT [was: Re: Xen 4.5 development update]
On 07/15/2014 03:30 AM, Dario Faggioli wrote:
> On mar, 2014-07-15 at 01:10 +0100, Gordan Bobic wrote:
>> On 07/14/2014 11:44 PM, Dario Faggioli wrote:
>
>>> If you pin VCPU#1 to PCPU#1 and VCPU#2 to PCPU#2, with PCPU#1 and PCPU#2
>>> being HT siblings, what prevents Linux (in the guest) from running two of
>>> the four build jobs on VCPU#1 and VCPU#2 (i.e., on sibling PCPUs!!) for
>>> the whole length of the benchmark? Nothing, I think.
>>
>> That would imply that Xen can somehow make a better decision than the
>> domU's kernel scheduler, something that doesn't seem that likely.
>>
> Well, as far as SMT load balancing is concerned, that is _exactly_ the
> case. The reason is simple: Xen knows the hw topology, and hence knows
> whether the sibling of an idle core is idle or busy. The guest kernel
> sees nothing of this; it just treats all its (V)CPUs as full cores, so
> it will most likely do a bad job in this case.
>
>>> And in fact, pinning would also result in good (near to native,
>>> perhaps?) performance, if we were exposing the SMT topology details to
>>> guests as, in that case, Linux would do the balancing properly. However,
>>> that's not the case either. :-(
>>
>> I see, so you are referring specifically to the HT case.
>>
> Yeah, well, that's what these benchmarks were all about :-)
>
>> I can see how
>> that could cause a problem. Does pinning improve the performance with HT
>> disabled?
>>
> HT disabled had pretty good performance already. Anyhow, I tried:
>
> Average Half load -j 2 Run (std deviation):
> Elapsed Time 56.462 (0.109179)
> Average Optimal load -j 4 Run (std deviation):
> Elapsed Time 31.526 (0.224789)
> Average Maximal load -j Run (std deviation):
> Elapsed Time 33.04 (0.439147)
>
> So quite similar to the no-HT unpinned case, which in its turn was
> quite similar to baremetal without HT.

Just out of interest - in cases where there is a non-negligible
performance discrepancy with HT enabled (bare metal or Xen), does
disabling the C6 CPU state support in the BIOS help? C6 state
selectively disables CPU threads when the CPU is idle for power saving
purposes, but the disabling threshold can be too sensitive. Does Xen
handle this?
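
(For what it's worth, on a Xen host the C-state residency and limits can
apparently be inspected and capped without touching the BIOS; a rough
sketch, exact output format varies:)

  # Show how much time each PCPU spends in each C-state (C6 included, if present):
  xenpm get-cpuidle-states

  # Cap the deepest C-state Xen will use at runtime (1 here is just an example):
  xenpm set-max-cstate 1

  # The same limit can also be set at boot, via "max_cstate=1" on the Xen command line.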

Gordan
