Mailing List Archive

Optimizing pymalloc (was obmalloc
On Tue, Jul 9, 2019 at 5:29 PM Inada Naoki <songofacandy@gmail.com> wrote:
>
> On Tue, Jul 9, 2019 at 9:46 AM Tim Peters <tim.peters@gmail.com> wrote:
> >
> >> I was more intrigued by your first (speed) comparison:
> >
> > > - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
> >
> > Now _that's_ interesting ;-) Looks like spectral_norm recycles many
> > short-lived Python floats at a swift pace. So memory management
> > should account for a large part of its runtime (the arithmetic it does
> > is cheap in comparison), and obmalloc and mimalloc should both excel
> > at recycling mountains of small objects. Why is mimalloc
> > significantly faster?
>
> Totally agree. I'll investigate this next.
>

I compared "perf" output of mimalloc and pymalloc, and I succeeded to
optimize pymalloc!

$ ./python bm_spectral_norm.py --compare-to ./python-master
python-master: ..................... 199 ms +- 1 ms
python: ..................... 182 ms +- 4 ms

Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
4 ms: 1.10x faster (-9%)

mimalloc uses many small static (inline) functions.
On the other hand, pymalloc_alloc and pymalloc_free is large function
containing slow/rare path.

PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free.
But compiler doesn't know which is the hot part in pymalloc_alloc and
pymalloc_free.
So gcc failed to chose code to inline. Remaining part of
pymalloc_alloc and pymalloc_free
are called and many push/pop are executed because they contains complex logic.

So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
But I need to use
"static inline" for pymalloc_alloc and pymalloc_free yet [1].
Generated assembly is optimized
well, the hot code is in top of the PyObject_Malloc [2] and PyObject_Free [3].
But there are many code duplication in PyObject_Malloc and
PyObject_Calloc, etc...

[1] https://github.com/python/cpython/pull/14674/files
[2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L232-L274
[3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L2-L32

I will try to split pymalloc_alloc and pymalloc_free to smaller functions.

Except above, there is one more important difference.

pymalloc return free pool to freepool list soon when pool become empty.
On the other hand, mimalloc return "page" (it's similar to "pool" in pymalloc)
not everytime when it's empty [4]. So they can avoid rebuilding linked list of
free blocks.
I think pymalloc should do same optimization.

[4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e08b3405/src/page.c#L365-L375

BTW, which is proper name? pymalloc, or obmalloc.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YWNWHGTJUMZ4D34DPRFXECF7O7GRJK2M/
Re: Optimizing pymalloc (was obmalloc [ In reply to ]
[.Inada Naoki <songofacandy@gmail.com>,
looking into why mimalloc did so much better on spectral_norm]
> I compared "perf" output of mimalloc and pymalloc, and I succeeded to
> optimize pymalloc!
>
> $ ./python bm_spectral_norm.py --compare-to ./python-master
> python-master: ..................... 199 ms +- 1 ms
> python: ..................... 182 ms +- 4 ms
>
> Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
> 4 ms: 1.10x faster (-9%)

Yay!! Nice :-)


> mimalloc uses many small static (inline) functions.

Too many , if you ask me ;-) Really, the enormous number of tiny
functions makes the code very hard to follow for me. OTOH, the tiny
number of enormous functions in pymalloc makes that hard to follow too
:-(

> On the other hand, pymalloc_alloc and pymalloc_free is large function
> containing slow/rare path.

obmalloc.c was written when compiler inlining was barely a thing.
Some restructuring is a fine idea - overdue, in fact.


> PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free.
> But compiler doesn't know which is the hot part in pymalloc_alloc and
> pymalloc_free.
> So gcc failed to chose code to inline. Remaining part of
> pymalloc_alloc and pymalloc_free
> are called and many push/pop are executed because they contains complex logic.
>
> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
> But I need to use
> "static inline" for pymalloc_alloc and pymalloc_free yet [1].
> Generated assembly is optimized
> well, the hot code is in top of the PyObject_Malloc [2] and PyObject_Free [3].
> But there are many code duplication in PyObject_Malloc and
> PyObject_Calloc, etc...
>
> [1] https://github.com/python/cpython/pull/14674/files
> [2] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L232-L274
> [3] https://gist.github.com/methane/ab8e71c00423a776cb5819fa1e4f871f#file-obmalloc-s-L2-L32
>
> I will try to split pymalloc_alloc and pymalloc_free to smaller functions.

Yes, splitting out the slower paths should be a win.

Note this is one reason I remain keen to increase pool size: the
bigger the pools, the more objects can be handed out & reclaimed
_from_ the fastest paths. You've now discovered that "the fastest
paths" are, for pragmatic compiler-is-overwhelmed reasons,
significantly slower than they "should be".


> Except above, there is one more important difference.
>
> pymalloc return free pool to freepool list soon when pool become empty.

Fine by me, & I think it should continue to do so. Details below.

> On the other hand, mimalloc return "page" (it's similar to "pool" in pymalloc)
> not everytime when it's empty [4]. So they can avoid rebuilding linked list of
> free blocks.

pymalloc never returns anything to the system at smaller than "arena"
granularity, so it's not a big deal at all _merely_ to link and unlink
a pool to/from an arena's freepool list. There's no possibility than
any byte in the pool can change while the pool is sitting on a
freepools list.

If we freed the last pool of a given size class, and next need to
allocate another object of that size class (most common), it's cheap:
when the pool is unlinked from the pool free list, it sees that the
pool's last size class is the same as the size class being requested,
and skips almost all pool initialization (the pool is already fully
prepared for objects of this size class).

init_pool:
/* Frontlink to used pools. */
next = usedpools[size + size]; /* == prev */
pool->nextpool = next;
pool->prevpool = next;
next->nextpool = pool;
next->prevpool = pool;
pool->nalloc = 1;
if (pool->szidx == size) { // xxxxxxxx HERE xxxxxxxc
/* Luckily, this pool last contained blocks
* of the same size class, so its header
* and free list are already initialized.
*/
bp = pool->freeblock;
assert(bp != NULL);
pool->freeblock = *(block **)bp;
goto success; // xxxxxxxxxxx FASTEST PATH xxxxxxxxxxxxxx
}
// and here there's a mound of code in case
// the size classes don't match, including code
// to restart the process of linking the pool's
// blocks into a free list.

So In the common case, the pool's free list is reused exactly as-is at
the time the pool was linked to the freepool list.

> I think pymalloc should do same optimization.

As above, I believe pymalloc is already optimal in this case,

On the other end, something to consider: when a block is freed from
an entirely used pool, the pool now contains one free block, and is
linked to the front of usedpools[sizeclass]. So the next request for
a block of that size will come from that pool, which will again cause
the pool to become entirely used.

That's a kind of possible thrashing, and acts against mimalloc's
(sane, to my eyes) desire to _entirely_ use up a given pool ("page")
once it starts passing things out from a pool. It could well be
better to, e.g., link the pool-with-one-newly-free-block to the end of
the usedpool list instead.

But it's unclear. The reason it front-links is that the block being
freed is presumably in L1 cache (because, e.g., its refcount just fell
to 0), so best to hand it out again ASAP. But that old reasoning
applied when pymalloc was _only_ used for PyObject-like structs.


> [4] https://github.com/microsoft/mimalloc/blob/1125271c2756ee1db1303918816fea35e08b3405/src/page.c#L365-L375
>
> BTW, which is proper name? pymalloc, or obmalloc.

I think it's pymalloc, so that's what I used above, I got lazy before ;-)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/I4CX5YNCGPS52JZSFKXMTDTGDOP2Y2WG/
Re: Optimizing pymalloc (was obmalloc [ In reply to ]
On 2019-07-09, Inada Naoki wrote:
> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
> But I need to use
> "static inline" for pymalloc_alloc and pymalloc_free yet [1].

I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO
enabled. So, I would try that first. Also, if you use relatively
small static functions that are defined before use (no forward
declarations), I have found that GCC is usually smart about inlining
them. So, I don't think you should have to use "static inline"
rather than just "static".

Good work looking into this. Should be some relatively easy
performance win.

Cheers,

Neil
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TISXQ7E3AA5BPMVPTDWKDWBKV5VPVOTI/
Re: Optimizing pymalloc (was obmalloc [ In reply to ]
On 2019-07-09, Inada Naoki wrote:
> PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free.
> But compiler doesn't know which is the hot part in pymalloc_alloc and
> pymalloc_free.

Hello Inada,

I don't see this on my PC. I'm using GCC 8.3.0. I have configured
the build with --enable-optimizations. To speed up the profile
generation, I have changed PROFILE_TASK to only run these tests:

test_shelve test_set test_pprint test_pickletools
test_ordered_dict test_tabnanny test_difflib test_pickle
test_json test_collections

I haven't spent much time trying to figure out what set of tests is
best but the above set runs pretty quickly and seems to work okay.

I have run pyperformance to compare CPython 'master' with your PR
14674. There doesn't seem to be a difference (table below). If I
look at the disassembly, it seems that the hot paths of
pymalloc_alloc and pymalloc_free are being inlined as you would
hope, without needing the LIKELY/UNLIKELY annotations.

OTOH, your addition of LIKELY() and UNLIKELY() in the PR is a pretty
small change and probably doesn't hurt anything. So, I think it
would be fine to merge it.

Regards,

Neil


+-------------------------+---------+-----------------------------+
| Benchmark | master | PR-14674 |
+=========================+=========+=============================+
| 2to3 | 305 ms | 304 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| chaos | 109 ms | 110 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| crypto_pyaes | 118 ms | 117 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| django_template | 112 ms | 114 ms: 1.02x slower (+2%) |
+-------------------------+---------+-----------------------------+
| fannkuch | 446 ms | 440 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| float | 119 ms | 120 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| go | 247 ms | 250 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| json_loads | 25.1 us | 24.4 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| logging_simple | 8.86 us | 8.66 us: 1.02x faster (-2%) |
+-------------------------+---------+-----------------------------+
| meteor_contest | 97.5 ms | 97.7 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| nbody | 140 ms | 142 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| pathlib | 19.2 ms | 18.9 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle | 8.95 us | 9.08 us: 1.02x slower (+2%) |
+-------------------------+---------+-----------------------------+
| pickle_dict | 18.1 us | 18.0 us: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle_list | 2.75 us | 2.68 us: 1.03x faster (-3%) |
+-------------------------+---------+-----------------------------+
| pidigits | 182 ms | 184 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| python_startup | 7.83 ms | 7.81 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| python_startup_no_site | 5.36 ms | 5.36 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| raytrace | 495 ms | 499 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| regex_dna | 173 ms | 170 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| regex_effbot | 2.79 ms | 2.67 ms: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+
| regex_v8 | 21.1 ms | 21.2 ms: 1.00x slower (+0%) |
+-------------------------+---------+-----------------------------+
| richards | 68.2 ms | 68.7 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| scimark_monte_carlo | 103 ms | 102 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| scimark_sparse_mat_mult | 4.37 ms | 4.35 ms: 1.00x faster (-0%) |
+-------------------------+---------+-----------------------------+
| spectral_norm | 132 ms | 133 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| sqlalchemy_imperative | 30.3 ms | 30.7 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| sympy_sum | 88.2 ms | 89.2 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| telco | 6.63 ms | 6.58 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| tornado_http | 178 ms | 179 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| unpickle | 12.0 us | 12.4 us: 1.03x slower (+3%) |
+-------------------------+---------+-----------------------------+
| unpickle_list | 3.93 us | 3.75 us: 1.05x faster (-4%) |
+-------------------------+---------+-----------------------------+

Not significant (25): deltablue; dulwich_log; hexiom; html5lib; json_dumps; logging_format; logging_silent; mako; nqueens; pickle_pure_python; regex_compile; scimark_fft; scimark_lu; scimark_sor; sqlalchemy_declarative; sqlite_synth; sympy_expand; sympy_integrate; sympy_str; unpack_sequence; unpickle_pure_python; xml_etree_parse; xml_etree_iterparse; xml_etree_generate; xml_etree_process
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/6E44YQ4EOFCO6CNYFXT7PQJUCCFR5YXS/
Re: Optimizing pymalloc (was obmalloc [ In reply to ]
On Wed, Jul 10, 2019 at 5:18 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
>
> On 2019-07-09, Inada Naoki wrote:
> > PyObject_Malloc inlines pymalloc_alloc, and PyObject_Free inlines pymalloc_free.
> > But compiler doesn't know which is the hot part in pymalloc_alloc and
> > pymalloc_free.
>
> Hello Inada,
>
> I don't see this on my PC. I'm using GCC 8.3.0. I have configured
> the build with --enable-optimizations.

I didn't use PGO and that's why GCC didn't know which part is hot.
Maybe, pymalloc performance is similar to mimalloc when PGO is used,
but I had not confirmed it.

While Linux distributions are using PGO, some people use non-PGO Python
(Homebrew, pyenv, etc...). So better performance without PGO is worth.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LKU5FDWGWHHEBUMTNZ5ME23RC73B5JIF/
Re: Optimizing pymalloc (was obmalloc [ In reply to ]
> Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 182 ms +-
> 4 ms: 1.10x faster (-9%)
...
> I will try to split pymalloc_alloc and pymalloc_free to smaller functions.

I did it and pymalloc is now as fast as mimalloc.

$ ./python bm_spectral_norm.py --compare-to=./python-master
python-master: ..................... 199 ms +- 1 ms
python: ..................... 176 ms +- 1 ms

Mean +- std dev: [python-master] 199 ms +- 1 ms -> [python] 176 ms +-
1 ms: 1.13x faster (-11%)

I filed an new issue for this: https://bugs.python.org/issue37543
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HKV6TQAHHLLLK4JS5F5JQ26MGWPLOD2M/
Re: Optimizing pymalloc (was obmalloc [ In reply to ]
[Inada Naoki]
>> So I tried to use LIKELY/UNLIKELY macro to teach compiler hot part.
>> But I need to use
>> "static inline" for pymalloc_alloc and pymalloc_free yet [1].

[Neil Schemenauer]
> I think LIKELY/UNLIKELY is not helpful if you compile with LTO/PGO
> enabled.

I like adding those regardless of whether compilers find them helpful:
they help _people_ reading the code focus on what's important to
speed. While not generally crucial, speed is important in these very
low-level, very heavily used functions.

Speaking of which, another possible teensy win: pymalloc's allocation
has always started with:

if (nbytes == 0) {
return 0;
}
if (nbytes > SMALL_REQUEST_THRESHOLD) {
return 0;
}
size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;

But it could be a bit leaner:

size_t fatsize = (nbytes - 1) >> ALIGNMENT_SHIFT;
if (UNLIKELY(fatsize >= NB_SMALL_SIZE_CLASSES)) {
return 0;'
}
size = (uint)fatsize;

The `nbytes == 0` case ends up mapping to a very large size class
then, although C may not guarantee that. But it doesn't matter: if
it maps to "a real" size class, that's fine. We'll return a unique
pointer into a pymalloc pool then, and "unique pointer" is all that's
required.

An allocation requesting 0 bytes does happen at times, but it's very
rare. It just doesn't merit its own dedicated test-&-branch.

> Good work looking into this. Should be some relatively easy
> performance win.

Ditto!
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RE44X7IP464I4KDPJPG3LF5NV5P27DHU/