Mailing List Archive

obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou, replying to Thomas Wouters]
> Interesting that a 20-year simple allocator (obmalloc) is able to do
> better than the sophisticated TCMalloc.

It's very hard to beat obmalloc (O) at what it does. TCMalloc (T) is
actually very similar where they overlap, but has to be more complex
because it's trying to do more than O.

In either case, for small objects "the fast path" consists merely of
taking the first block of memory off a singly-linked size-segregated
free list. For freeing, the fast path is just linking the block back
to the front of the appropriate free list. What _could_ be faster? A
"bump allocator" allocates faster (just increment a highwater mark),
but creates worlds of problems when freeing.
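
To make that concrete, here's a minimal sketch of the fast path in C.
All the names (free_lists, CLASS_OF, fast_malloc) are invented for
illustration - the real code in Objects/obmalloc.c keeps more
bookkeeping - but the shape is the same:

#include <stddef.h>

/* One singly-linked free list per size class; a free block's first
   word points at the next free block in the same class. */
#define NCLASSES 32
#define CLASS_OF(n) (((n) - 1) >> 4)   /* 16-byte size quantum */

typedef struct block { struct block *next; } block;
block *free_lists[NCLASSES];

void *fast_malloc(size_t n)
{
    block *b = free_lists[CLASS_OF(n)];
    if (b != NULL) {                        /* the fast path */
        free_lists[CLASS_OF(n)] = b->next;  /* pop the head */
        return b;
    }
    return NULL;   /* slow path: carve a fresh block from a pool */
}

void fast_free(void *p, size_t n)
{
    block *b = p;
    b->next = free_lists[CLASS_OF(n)];      /* push on the front, so */
    free_lists[CLASS_OF(n)] = b;            /* reuse is in MRU order */
}

(fast_free() takes the size only for the sketch's sake; in obmalloc
the size class is recovered from the pool header, which is exactly
what the next paragraph is about.)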

But because O is only trying to deal with small (<= 512 bytes)
requests, it can use a very fast method based on trivial address
arithmetic to find the size of an allocated block by just reading it
up from the start of the (4K) "pool" the address belongs to. T can't
do that - it appears to need to look up the address in a more
elaborate radix tree, to find info recording the size of the block
(which may be just about anything - no upper limit).
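
The "trivial address arithmetic" works because every pool is aligned
to its power-of-two size: masking the low bits off any block address
lands on the pool's start, where a small header records the size
class. A cut-down sketch (the real pool header has more fields):

#include <stdint.h>

#define POOL_SIZE      (1 << 12)       /* 4 KiB */
#define POOL_SIZE_MASK (POOL_SIZE - 1)

/* Simplified pool header; obmalloc's real one also tracks the
   pool's free list and used count. */
typedef struct pool_header {
    unsigned int arenaindex;   /* index into the vector of arenas */
    unsigned int szidx;        /* size class of every block in here */
} pool_header;

/* Any address inside a pool maps to the pool header by masking. */
#define POOL_ADDR(p) \
    ((pool_header *)((uintptr_t)(p) & ~(uintptr_t)POOL_SIZE_MASK))

unsigned int size_class_of(void *p)
{
    return POOL_ADDR(p)->szidx;   /* no search structure needed */
}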

> (well, of course, obmalloc doesn't have to worry about concurrent
> scenarios, which explains some of the simplicity)

Right, T has a different collection of free lists for each thread, so
on each entry it has to figure out which collection to use (and so
doesn't need to lock). That's not free. O only has one collection,
and relies on the GIL.

Against that, O burns cycles worrying about something else: because
it was controversial when it was new, O thought it was necessary to
handle free/realloc calls even when passed addresses that had actually
been obtained from the system malloc/realloc. The T docs I saw said
"don't do that - things will blow up in mysterious ways".

That's where O's excruciating "address_in_range()" logic comes from.
While that's zippy and scales extremely well (it doesn't depend on how
many objects/arenas/pools exist), it's not free, and is a significant
part of the "fast path" expense for both allocation and deallocation.
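
For reference, the heart of that check is just a read and a couple of
comparisons. An illustrative approximation (the real version hides
behind macros and careful reads; pool_header and POOL_ADDR are as in
the sketch above):

#include <stdint.h>

#define ARENA_SIZE (256 << 10)   /* 256 KiB */

struct arena_object { uintptr_t address; /* 0 if slot unused */ };
struct arena_object *arenas;     /* grows as arenas are allocated */
unsigned int maxarenas;

/* Did obmalloc hand out p?  Read the arena index from the pool
   header that *would* contain p - possibly uninitialized memory! -
   then validate it against the arenas vector. */
int address_in_range(void *p)
{
    unsigned int idx = POOL_ADDR(p)->arenaindex;   /* may be garbage */
    return idx < maxarenas
        && (uintptr_t)p - arenas[idx].address < ARENA_SIZE
        && arenas[idx].address != 0;
}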

It also limits us to a maximum pool size of 4K (to avoid possible
segfaults when reading up memory that was actually obtained from the
system malloc/realloc), and that's become increasingly painful: on
64-bit boxes the bytes lost to pool headers increased, and O changed
to handle requests up to 512 bytes instead of its original limit of
256. O was intended to supply "a bunch" of usable blocks per pool,
not just a handful. We "should" really at least double the pool and
arena sizes now.

I don't think we need to cater anymore to careless code that mixes
system memory calls with O calls (e.g., if an extension gets memory
via `malloc()`, it's its responsibility to call `free()`), and if not
then `address_in_range()` isn't really necessary anymore either, and
then we could increase the pool size. O would, however, need a new
way to recognize when its version of malloc punted to the system
malloc.

BTW, one more: last I saw T never returns memory to "the system", but
O does - indeed, the parent thread here was all about _enormous_ time
waste due to that in O ;-) That's not free either, but doesn't affect
O's fast paths.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Sun, 2 Jun 2019 00:56:52 -0500
Tim Peters <tim.peters@gmail.com> wrote:
>
> But because O is only trying to deal with small (<= 512 bytes)
> requests, it can use a very fast method based on trivial address
> arithmetic to find the size of an allocated block by just reading it
> up from the start of the (4K) "pool" the address belongs to. T can't
> do that - it appears to need to look up the address in a more
> elaborate radix tree, to find info recording the size of the block
> (which may be just about anything - no upper limit).

The interesting thing here is that in many situations, the size is
known up front when deallocating - it is simply not communicated to the
deallocator because the traditional free() API takes a sole pointer,
not a size. But CPython could communicate that size easily if we
would like to change the deallocation API. Then there's no bother
looking up the allocated size in sophisticated lookup structures.

I'll note that jemalloc provides such APIs:
http://jemalloc.net/jemalloc.3.html

"""The dallocx() function causes the memory referenced by ptr to be
made available for future allocations.

The sdallocx() function is an extension of dallocx() with a size
parameter to allow the caller to pass in the allocation size as an
optimization."""
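
In CPython the deallocation site very often does know the size - a
type's tp_dealloc can compute it from tp_basicsize/tp_itemsize, for
instance. A hedged sketch of what a sized deallocation entry point
could look like; the name PyMem_FreeSized is invented here, and
pool_free() stands in for obmalloc's internals:

#include <stddef.h>
#include <stdlib.h>

#define SMALL_REQUEST_THRESHOLD 512

void pool_free(void *p, size_t n);   /* stand-in for obmalloc's guts */

/* Hypothetical sized-free API: with the size in hand, routing the
   call needs no address lookup at all. */
void PyMem_FreeSized(void *p, size_t n)
{
    if (n <= SMALL_REQUEST_THRESHOLD)
        pool_free(p, n);   /* small: obmalloc's territory */
    else
        free(p);           /* large: was punted to the system */
}

(Tim's footnote a couple of messages down points out a realloc corner
case that makes a bare size test not quite bulletproof in today's
obmalloc.)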

Regards

Antoine.


Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Antoine Pitrou <solipsis@pitrou.net>]
> The interesting thing here is that in many situations, the size is
> known up front when deallocating - it is simply not communicated to the
> deallocator because the traditional free() API takes a sole pointer,
> not a size. But CPython could communicate that size easily if we
> would like to change the deallocation API. Then there's no bother
> looking up the allocated size in sophisticated lookup structures.

That could work (to make it possible to increase obmalloc's pool
size). Except ...

> I'll note that jemalloc provides such APIs:
> http://jemalloc.net/jemalloc.3.html
>
> """The dallocx() function causes the memory referenced by ptr to be
> made available for future allocations.
>
> The sdallocx() function is an extension of dallocx() with a size
> parameter to allow the caller to pass in the allocation size as an
> optimization."""

obmalloc doesn't intend to be a general-purpose allocator - it only
aims at optimizing "small" allocations, punting to the system for
everything beyond that. Unless the size is _always_ passed in (on
every free() and realloc() spelling it supports), an "optimization"
doesn't help it much. It needs a bulletproof way to determine whether
it, or system malloc/realloc, originally obtained an address passed
in. If the size is always passed in, no problem (indeed, a single bit
could suffice). But if it's ever possible that the size won't be
passed in, all the runtime machinery to figure that out on its own
needs to be able to handle all addresses.

Like now: if the size were passed in, obmalloc could test the size
instead of doing the `address_in_range()` dance(*). But if it's ever
possible that the size won't be passed in, all the machinery
supporting `address_in_range()` still needs to be there, and every
obmalloc spelling of malloc/realloc needs to ensure that machinery
will work if the returned address is passed back to an obmalloc
free/realloc spelling without the size.

The "only"problem with address_in_range is that it limits us to a
maximum pool size of 4K. Just for fun, I boosted that to 8K to see
how likely segfaults really are, and a Python built that way couldn't
even get to its first prompt before dying with an access violation
(Windows-speak for segfault).

Alas, that makes it hard to guess how much value there would be for
Python programs if the pool size could be increased - can't even get
Python started.

We could eliminate the pool size restriction in many ways. For
example, we could store the addresses obtained from the system
malloc/realloc - but not yet freed - in a set, perhaps implemented as
a radix tree to cut the memory burden. But digging through 3 or 4
levels of a radix tree to determine membership is probably
significantly slower than address_in_range.

I can think of a way to do it slightly faster than (but related to)
address_in_range, but it would (on a 64-bit box) require adding 24
more bytes for each system-malloc/realloc allocation. 8 of those
bytes would be pure waste, because the Alignment Gods appear to
require 16-byte alignment for every allocation on a 64-bit box now.

In stark contrast, the extra memory burden of the current
address_in_range is an insignificant 8 bytes per _arena_ (256 KB, and
"should be" larger now).

Another approach: keep address_in_range as-is, but add new internal
structure to larger pools, to repeat the arena index every 4KB. But
that fights somewhat against the goal of larger pools.

Etc. ;-)

(*) Actually not quite true. If a "large" object is obtained from
obmalloc now (meaning it actually came from the system malloc), then
cut back to a "small" size by a realloc, it _remains_ under the
control of the system malloc now. Passing in the newer "small" size
to a free() later would cause obmalloc to get fatally confused about
that.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Thu, 6 Jun 2019 13:57:37 -0500
Tim Peters <tim.peters@gmail.com> wrote:
> [Antoine Pitrou <solipsis@pitrou.net>]
> > The interesting thing here is that in many situations, the size is
> > known up front when deallocating - it is simply not communicated to the
> > deallocator because the traditional free() API takes a sole pointer,
> > not a size. But CPython could communicate that size easily if we
> > would like to change the deallocation API. Then there's no bother
> > looking up the allocated size in sophisticated lookup structures.
>
> That could work (to make it possible to increase obmalloc's pool
> size). Except ...
>
> > I'll note that jemalloc provides such APIs:
> > http://jemalloc.net/jemalloc.3.html
> >
> > """The dallocx() function causes the memory referenced by ptr to be
> > made available for future allocations.
> >
> > The sdallocx() function is an extension of dallocx() with a size
> > parameter to allow the caller to pass in the allocation size as an
> > optimization."""
>
> obmalloc doesn't intend to be a general-purpose allocator - it only
> aims at optimizing "small" allocations, punting to the system for
> everything beyond that.

But my response was under the assumption that we would want obmalloc to
deal with all allocations. Which is more or less required anyway to
have an efficient GC that doesn't have to walk linked lists and access
memory in random order to iterate over known objects.

Regards

Antoine.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Antoine Pitrou <solipsis@pitrou.net>]
> But my response was under the assumption that we would want obmalloc to
> deal with all allocations.

I didn't know that. I personally have no interest in that: if we
want an all-purpose allocator, there are several already to choose
from. There's no reason to imagine we could write a better one.

> Which is more or less required anyway to have an efficient GC that doesn't
> have to walk linked lists and access memory in random order to iterate over
> known objects.

As the parent thread showed, obmalloc does at least as well as any
general-purpose allocator known _for Python's purposes_ (a great many
(de)allocations of "small" objects). Already explained too that
size-segregated linked free lists are the core of _why_ it does so
well. Besides making carving off, and freeing, blocks dirt cheap,
linked lists also naturally support recycling memory blocks in MRU
("freshness in cache") order.

But I don't know what you mean by "access memory in random order to
iterate over known objects". obmalloc never needs to iterate over
known objects - indeed, it contains no code capable of doing that.
Our cyclic gc does, but that's independent of obmalloc. Over on
Discourse, Neil is speculating about using radix trees for cyclic gc
instead of _its_ linked lists. In obmalloc, allocated memory regions
aren't linked at all. It's free regions that are linked, and
helpfully so in MRU order.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Thu, 6 Jun 2019 16:03:03 -0500
Tim Peters <tim.peters@gmail.com> wrote:
> But I don't know what you mean by "access memory in random order to
> iterate over known objects". obmalloc never needs to iterate over
> known objects - indeed, it contains no code capable of doing that.
> Our cyclic gc does, but that's independent of obmalloc.

It's not. Cyclic GC needs its own linked lists *because* the allocator
doesn't allow it to iterate over allocated objects.

Regards

Antoine.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Tim]
>> But I don't know what you mean by "access memory in random order to
>> iterate over known objects". obmalloc never needs to iterate over
>> known objects - indeed, it contains no code capable of doing that.
>> Our cyclic gc does, but that's independent of obmalloc.

[Antoine]
> It's not. Cyclic GC needs its own linked lists *because* the allocator
> doesn't allow it to iterate over allocated objects.

The doubly linked lists in gc primarily support efficient
_partitioning_ of objects for gc's purposes (a union of disjoint sets,
with constant-time moving of an object from one set to another, and
constant-time union of disjoint sets). "All objects" is almost never
interesting to it (it is only when the oldest non-frozen generation is
being collected).

Between collections, the partitioning is by generation.

During a collection, the youngest generations are combined into one,
and then that's sub-partitioned in various ways as collection goes
along, ending with a partition into reachable and unreachable objects.
In between, ephemeral partitions are constructed (e.g., the set of
"tentatively unreachable" objects).

None of that was driven by obmalloc's (or malloc's) inability to
iterate over objects. Doubly linked lists were the obvious way to
implement the required operations on partitions efficiently and
simply.
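
For the curious, that's the standard intrusive doubly-linked-list
trick. A schematic version, mirroring CPython's PyGC_Head in spirit
but not in exact layout:

/* Intrusive list node embedded in every gc-tracked object. */
typedef struct gc_head {
    struct gc_head *prev, *next;
} gc_head;

/* Each set (a generation, "tentatively unreachable", ...) is a
   circular list hanging off a dummy header node. */
void gc_list_init(gc_head *list)
{
    list->prev = list->next = list;
}

/* Constant-time move of one object from whatever set it's in now
   to the tail of another set. */
void gc_list_move(gc_head *obj, gc_head *to)
{
    obj->prev->next = obj->next;   /* unlink from current set */
    obj->next->prev = obj->prev;
    obj->prev = to->prev;          /* relink at tail of `to` */
    obj->next = to;
    to->prev->next = obj;
    to->prev = obj;
}

/* Constant-time union: splice all of `from` onto the tail of `to`. */
void gc_list_merge(gc_head *from, gc_head *to)
{
    if (from->next != from) {      /* non-empty */
        gc_head *head = from->next, *tail = from->prev;
        head->prev = to->prev;
        to->prev->next = head;
        tail->next = to;
        to->prev = tail;
        gc_list_init(from);
    }
}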

In any case, it appears to be "a feature" now that people can use any
flavor of the malloc family they like in CPython, so I expect that any
attempt to tie cyclic gc to a specific flavor of malloc would be a
very hard sell. Which, BTW, was the intended meaning of
"independent": cyclic gc right now couldn't care less which version
of malloc a user plugs in - nor could obmalloc care less which cyclic
gc algorithm is used.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Thu, 6 Jun 2019 17:26:17 -0500
Tim Peters <tim.peters@gmail.com> wrote:
>
> The doubly linked lists in gc primarily support efficient
> _partitioning_ of objects for gc's purposes (a union of disjoint sets,
> with constant-time moving of an object from one set to another, and
> constant-time union of disjoint sets). "All objects" is almost never
> interesting to it (it is only when the oldest non-frozen generation is
> being collected).

Right. But the oldest generation is precisely the pain point, since
full collections can induce very long pauses. IOW, perhaps it would be
fine to keep dedicated lists for the young generations, while doing a
heap walk for the full collection case.

> In any case, it appears to be "a feature" now that people can use any
> flavor of the malloc family they like in CPython, so I expect that any
> attempt to tie cyclic gc to a specific flavor of malloc would be a
> very hard sell.

Depends on the benefits of course ;-) Most people use some pre-built
binaries that "naturally" come with obmalloc enabled.

Of course, it's only a very small minority of applications that hit
real performance issues with the GC - and in probably many of those
cases tuning the full collection threshold can alleviate the problem
still.

Regards

Antoine.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On 2019-06-06, Tim Peters wrote:
> Like now: if the size were passed in, obmalloc could test the size
> instead of doing the `address_in_range()` dance(*). But if it's ever
> possible that the size won't be passed in, all the machinery
> supporting `address_in_range()` still needs to be there, and every
> obmalloc spelling of malloc/realloc needs to ensure that machinery
> will work if the returned address is passed back to an obmalloc
> free/realloc spelling without the size.


We can almost make it work for GC objects, the use of obmalloc is
quite well encapsulated. I think I intentionally designed the
PyObject_GC_New/PyObject_GC_Del/etc APIs that way.

Quick and dirty experiment is here:

https://github.com/nascheme/cpython/tree/gc_malloc_free_size

The major hitch seems to be my new gc_obj_size() function. We can't
be sure the 'nbytes' passed to _PyObject_GC_Malloc() is the same as
what is computed by gc_obj_size(). It usually works but there are
exceptions (the freelists for frame objects and tuple objects, for one).

A nasty problem is the weirdness with PyType_GenericAlloc() and the
sentinel item. _PyObject_GC_NewVar() doesn't include space for the
sentinel but PyType_GenericAlloc() does. When you get to
gc_obj_size(), you don't know if you should use "nitems" or "nitems+1".

I'm not sure how to fix the sentinel issue. Maybe a new type slot
or a type flag? In any case, making a change like my git branch
above would almost certainly break extensions that don't play
nicely. It won't be hard to make it a build option, like the
original gcmodule was. Then, assuming there is a performance boost,
people can enable it if their extensions are friendly.


> The "only"problem with address_in_range is that it limits us to a
> maximum pool size of 4K. Just for fun, I boosted that to 8K to see
> how likely segfaults really are, and a Python built that way couldn't
> even get to its first prompt before dying with an access violation
> (Windows-speak for segfault).

If we can make the above idea work, you could set the pool size to
8K without issue. A possible problem is that the obmalloc and
gcmalloc arenas are separate. I suppose that affects
performance testing.

> We could eliminate the pool size restriction in many ways. For
> example, we could store the addresses obtained from the system
> malloc/realloc - but not yet freed - in a set, perhaps implemented as
> a radix tree to cut the memory burden. But digging through 3 or 4
> levels of a radix tree to determine membership is probably
> significantly slower than address_in_range.

You are likely correct. I'm hoping to benchmark the radix tree idea.
I'm not too far from having it working such that it can replace
address_in_range(). Maybe allocating gc_refs as a block would
offset the radix tree cost vs address_in_range(). If the above idea
works, we know the object size at free() and realloc(), so we don't
need address_in_range() for those code paths.

Regards,

Neil
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On 2019-06-06, Tim Peters wrote:
> The doubly linked lists in gc primarily support efficient
> _partitioning_ of objects for gc's purposes (a union of disjoint sets,
> with constant-time moving of an object from one set to another, and
> constant-time union of disjoint sets). "All objects" is almost never
> interesting to it (it is only when the oldest non-frozen generation is
> being collected).

My current idea is to put partitioning flags on the interior radix
tree nodes. If you mark an object as "finalizer reachable", for
example, it would mark all the nodes on the path from the root with
that flag. Then, when you want to iterate over all the GC objects
with a flag, you can avoid uninteresting branches of the tree.

For generations, maybe tracking them at the pool level is good
enough. Interior nodes can track generations too (i.e. the youngest
generation contained under them).

My gut feeling is that the prev/next pointer updates done by
move_unreachable() and similar functions must be quite expensive.
Doing the traversal with an explicit stack is a lot less elegant but
I think it should be faster. At least, when you are dealing with a
big set of GC objects that don't fit in the CPU cache.
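
A minimal sketch of that explicit-stack traversal, with a toy object
type standing in for tp_traverse-style visitation (everything here is
invented for illustration):

#include <stdlib.h>

typedef struct obj {
    int marked;
    int nrefs;
    struct obj **refs;   /* outgoing references */
} obj;

/* Iterative marking with an explicit stack: touches only the
   objects themselves, with no prev/next link updates on the way. */
int mark_reachable(obj *root)
{
    size_t cap = 64, top = 0;
    obj **stack = malloc(cap * sizeof(*stack));
    if (stack == NULL)
        return -1;
    stack[top++] = root;
    while (top > 0) {
        obj *o = stack[--top];
        if (o->marked)
            continue;
        o->marked = 1;
        for (int i = 0; i < o->nrefs; i++) {
            if (top == cap) {   /* grow the stack as needed */
                obj **grown = realloc(stack, 2 * cap * sizeof(*stack));
                if (grown == NULL) {
                    free(stack);
                    return -1;
                }
                stack = grown;
                cap *= 2;
            }
            stack[top++] = o->refs[i];
        }
    }
    free(stack);
    return 0;
}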

Regards,

Neil
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
To be clearer, while knowing the size of allocated objects may be of
some use to some other allocators, "not really" for obmalloc. That
one does small objects by itself in a uniform way, and punts
everything else to the system malloc family. The _only_ thing it
wants to know on a free/realloc is which of those two ways delivered
the address to begin with.

Knowing the original size would be an indirect way to accomplish that,
but it really only needs 1 bit of info. If you're using a radix tree
to implement - in effect - a dict mapping addresses to "stuff", 1 flag
bit in the "stuff" would be ideal for that. If you haven't already, I
assume you'll soon come around to believe you really should be
tracking the addresses of all gc'ed objects (not just the "small"
ones).

> If the above idea works, we know the object size at free() and realloc(), so we don't
> need address_in_range() for those code paths.

Two things: first, a single bit would not only be sufficient, it
would be _better_ than knowing the object size. If the bit says the
object came from the system, the size is of no use at all. If the bit
says it came from obmalloc, what's _really_ wanted is the index of the
object's "size class" in the vector of free lists, and that's already
directly available in the object's pool header (various parts of which
have to be read up &/or updated regardless). Second, if that bit were
available, `address_in_range()` could be thrown away - obmalloc's free
and realloc entries are the only places it's used (or is of any
conceivable use to obmalloc).

For the current obmalloc, I have in mind a different way (briefly:
let a pool span a number of 4K pages, but teach pools about the page
structure so that address_in_range() continues to work after trivial
changes (read the arena index from the containing page's start rather
than from the containing pool's)). Not ideal, but looks remarkably
easy to do, and captures the important part (more objects in a pool ->
more times obmalloc can remain in its fastest "all within the pool"
paths).

> My gut feeling is that the prev/next pointer updates done by
> move_unreachable() and similar functions must be quite expensive.
> Doing the traversal with an explicit stack is a lot less elegant but
> I think it should be faster. At least, when you are dealing with a
> big set of GC objects that don't fit in the CPU cache.

At the start, it was the _reachable_ objects that were moved out of
the collected generation list rather than the unreachable objects.
Since it's (in all my experience) almost all objects that survive
almost all collections in almost all programs, that was an enormous
amount of moving overall. So long ago I rewrote that code to move
the _un_reachable objects instead.

Debug output showed countless billions of pointer updates saved across
the programs I tried at the time, but the timing difference was in the
noise.

Probably a big part of "the problem": when collecting the oldest
generation in a program with "a lot" of objects, no approach at all
will "fit in cache". We have to crawl the entire object graph then.
Note that the test case in the parent thread had over a billion class
instances, and it was way whittled down(!) from the OP's real program.

But I don't _think_ that's typical quite yet ;-) , and:

> You are likely correct. I'm hoping to benchmark the radix tree idea.
> I'm not too far from having it working such that it can replace
> address_in_range(). Maybe allocating gc_refs as a block would
> offset the radix tree cost vs address_in_range().

I certainly agree gc_refs is the big-bang-for-the-buck thing to
target. There's no way to avoid mutating that bunches and bunches of
times, even in small programs, and reducing the number of dirtied
cache lines due to that "should" pay off.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Tim]
> For the current obmalloc, I have in mind a different way ...
> Not ideal, but ... captures the important part (more objects
> in a pool -> more times obmalloc can remain in its
> fastest "all within the pool" paths).

And now there's a PR that removes obmalloc's limit on pool sizes, and,
for a start, quadruples pool (and arena!) sizes on 64-bit boxes:

https://github.com/python/cpython/pull/13934

https://bugs.python.org/issue37211

As the PR says,

"""
It would be great to get feedback from 64-bit apps that do massive
amounts of small-object allocations and deallocations.
"""

:-)
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On 2019-06-09, Tim Peters wrote:
> And now there's a PR that removes obmalloc's limit on pool sizes, and,
> for a start, quadruples pool (and arena!) sizes on 64-bit boxes:

Neat.

> As the PR says,
>
> """
> It would be great to get feedback from 64-bit apps that do massive
> amounts of small-object allocations and deallocations.
> """

I've done a little testing of the pool overhead. I have an application
that uses many small dicts as holders of data. The function:

sys._debugmallocstats()

is useful to get stats for the obmalloc pools. Total data allocated
by obmalloc is 262 MB. At the 4*PAGE_SIZE pool size, the wasted
space due to partly filled pools is only 0.18%. For 16*PAGE_SIZE
pools, 0.71%.

I have a set of stats for another program. In that case, total
memory allocated by obmalloc is 14 MB. For 4*PAGE_SIZE pools,
wasted space is 0.78% of total. At 16*PAGE_SIZE, it is 2.4%.

Based on that small set of data, using 4*PAGE_SIZE seems
conservative. As I'm sure you realize, making pools bigger will
waste actual memory, not just virtual address space, because you
write the arena pointer to each OS page.

I want to do performance profiling using Linux perf. That should
show where the hotspot instructions are in the obmalloc code. Maybe
that will be useful to you.

Another thought about address_in_range(): some operating systems
allow you to allocate memory at specific alignments. Or, you can
even allocate a chunk of memory at a fixed memory location if you do
the correct magic incantation. I noticed that Go does that. I
imagine doing that has a bunch of associated challenges with it.
However, if we could control the alignment and memory location of
obmalloc arenas, we would not have the segv problem of
address_in_range(). It's probably not worth going down that path
due to the problems involved.
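
For the record, the usual portable incantation is to over-allocate
and trim: map twice the arena size, then unmap the misaligned head
and tail. A sketch under POSIX (Windows' VirtualAlloc needs a
different dance), assuming a power-of-two arena size:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define ARENA_SIZE ((size_t)1 << 20)   /* 1 MiB, a power of two */

/* Returns an ARENA_SIZE-aligned region, so "which arena owns p?"
   is just (p & ~(ARENA_SIZE - 1)) - no header read needed. */
void *alloc_aligned_arena(void)
{
    uint8_t *raw = mmap(NULL, 2 * ARENA_SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;
    uintptr_t start = ((uintptr_t)raw + ARENA_SIZE - 1)
                      & ~(uintptr_t)(ARENA_SIZE - 1);
    size_t head = start - (uintptr_t)raw;
    size_t tail = ARENA_SIZE - head;
    if (head != 0)
        munmap(raw, head);                            /* drop head */
    if (tail != 0)
        munmap((void *)(start + ARENA_SIZE), tail);   /* and tail */
    return (void *)start;
}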

Regards,

Neil
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Neil Schemenauer <nas-python@arctrix.com>]
> I've done a little testing of the pool overhead. I have an application
> that uses many small dicts as holders of data. The function:
>
> sys._debugmallocstats()
>
> is useful to get stats for the obmalloc pools. Total data allocated
> by obmalloc is 262 MB. At the 4*PAGE_SIZE pool size, the wasted
> space due to partly filled pools is only 0.18%. For 16*PAGE_SIZE
> pools, 0.71%.
>
> I have a set of stats for another program. In that case, total
> memory allocated by obmalloc is 14 MB. For 4*PAGE_SIZE pools,
> wasted space is 0.78% of total. At 16*PAGE_SIZE, it is 2.4%.

Definitely a tradeoff here: increasing the number of pages per pool
is a pure win for objects of very popular size classes, but a pure
loss for objects of unpopular size classes. New comments about
POOL_SIZE in the patch briefly hint at that.

> Based on that small set of data, using 4*PAGE_SIZE seems
> conservative.

Which is a nice way of saying "pulled out of Tim's ass" ;-)

> As I'm sure you realize, making pools bigger will
> waste actual memory, not just virtual address space, because you
> write the arena pointer to each OS page.

Yes, but mostly no! It remains the case that obmalloc neither writes
nor reads anything in an arena until a malloc/realloc call actually
needs to return a pointer to a never-before accessed page. Writing
the arena index is NOT done to pages when the arena is allocated, or
even when a new pool is carved off an arena, but lazily, one page at a
time, lowest address to highest, as fresh pages actually need to be
returned to the caller.

So arena pointers aren't actually written more frequently than when
using pools of just one page, as is done now (which _still_ writes the
arena index into each page).

Subtlety: currently, if you pass a nonsense address to
address_in_range that happens to be in one of the arena's pools,
address_in_range() says "yes". However, after the patch, it may well
return "no" if the address is in one of the pool's pages that hasn't
yet been returned to a malloc/realloc caller (in which case the arena
index hasn't yet been stored at the start of the page).

I don't care about that, because it's 100% a logic error to pass an
address to free/realloc that wasn't obtained from a previous
malloc/realloc call. So it's a change in undefined behavior.

> I want to do performance profiling using Linux perf. That should
> show where the hotspot instructions are in the obmalloc code. Maybe
> that will be useful to you.

Would be good to know, yes! But may also depend on the app.

> Another thought about address_in_range(): some operating systems
> allow you to allocate memory at specific alignments. Or, you can
> even allocate a chunk of memory at a fixed memory location if you do
> the correct magic incantation. I noticed that Go does that. I
> imagine doing that has a bunch of associated challenges with it.
> However, if we could control the alignment and memory location of
> obmalloc arenas, we would not have the segv problem of
> address_in_range(). It's probably not worth going down that path
> due to the problems involved.

I'm unclear on how that could be exploited. It's not addresses that
come from arenas that create segv headaches, it's addresses that come
from the system's malloc/realloc.

If we were, e.g., to pick a maximum amount of address space obmalloc
can use in advance, we could live with a single arena of that size,
and just check whether an address is within it. All the problems come
from that address space may alternate, any number of times, between
addresses in obmalloc arenas and addresses from the system malloc's
internals.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Sun, Jun 2, 2019 at 7:57 AM Tim Peters <tim.peters@gmail.com> wrote:

> I don't think we need to cater anymore to careless code that mixes
> system memory calls with O calls (e.g., if an extension gets memory
> via `malloc()`, it's its responsibility to call `free()`), and if not
> then `address_in_range()` isn't really necessary anymore either, and
> then we could increase the pool size. O would, however, need a new
> way to recognize when its version of malloc punted to the system
> malloc.
>

Is this really feasible in a world where the allocators can be selected
(and the default changed) at runtime? And what would be an efficient way of
detecting allocations punted to malloc, if not address_in_range?

Getting rid of address_in_range sounds like a nice idea, and I would love
to test how feasible it is -- I can run such a change against a wide
selection of code at work, including a lot of third-party extension
modules, but I don't see an easy way to do it right now.

--
Thomas Wouters <thomas@python.org>

Hi! I'm an email virus! Think twice before sending your email to help me
spread!
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Tim]
>> I don't think we need to cater anymore to careless code that mixes
>> system memory calls with O calls (e.g., if an extension gets memory
>> via `malloc()`, it's its responsibility to call `free()`), and if not
>> then `address_in_range()` isn't really necessary anymore either, and
>> then we could increase the pool size. O would, however, need a new
>> way to recognize when its version of malloc punted to the system
>> malloc.

[Thomas Wouters <thomas@python.org>]
> Is this really feasible in a world where the allocators can be selected (and
> the default changed) at runtime?

I think so. See the "Memory Management" section of the Python/C API
Reference Manual. It's always been "forbidden" to, e.g., allocate a
thing with PyMem_New() but release it with free(). Ditto mixing a
PyMem_Raw... allocator with a PyMem... deallocator, or PyObject...
one. Etc.

A type's tp_dealloc implementation should damn well know which memory
family the type's allocator used.

However, no actual proposal on the table changes any "fact on the
ground" here. They're all as forgiving of slop as the status quo.

> And what would be an efficient way of detecting allocations punted to
> malloc, if not address_in_range?

_The_ most efficient way is the one almost all allocators used long
ago: use some "hidden" bits right before the address returned to the
user to store info about the block being returned. Like 1 bit to
distinguish between "obmalloc took this out of one of its pools" and
"obmalloc got this from PyMem_Raw... (whatever that maps to - obmalloc
doesn't care)". That would be much faster than what we do now.

But on current 64-bit boxes, "1 bit" turns into "16 bytes" to maintain
alignment, so space overhead becomes 100% for the smallest objects
obmalloc can return :-(
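
A sketch of that old header trick, purely illustrative - obmalloc
does not do this, and the 16-byte pad is exactly the 100% overhead
on the smallest objects just mentioned:

#include <stdlib.h>

#define HEADER 16   /* preserves 16-byte alignment on 64-bit boxes */
enum { FROM_POOLS = 1, FROM_SYSTEM = 2 };

/* Punting path: get memory from the system, but stash a tag byte
   just below the address handed to the caller.  The pool path
   would write FROM_POOLS the same way. */
void *tagged_system_malloc(size_t n)
{
    unsigned char *raw = malloc(n + HEADER);
    if (raw == NULL)
        return NULL;
    raw[HEADER - 1] = FROM_SYSTEM;  /* "1 bit" of info, 16 bytes of pad */
    return raw + HEADER;
}

void tagged_free(void *p)
{
    unsigned char *u = p;
    if (u[-1] == FROM_SYSTEM)
        free(u - HEADER);
    /* else: push back onto the matching obmalloc free list */
}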

Neil Schemenauer takes a different approach in the recent "radix tree
arena map for obmalloc" thread here. We exchanged ideas on that until
it got to the point that the tree levels only need to trace out
prefixes of obmalloc arena addresses. That is, the new space burden
of the radix tree appears quite reasonably small.

It doesn't appear to be possible to make it faster than the current
address_in_range(), but in small-scale testing so far speed appears
comparable.
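
A toy rendering of that arena-prefix tree, with invented level widths
(8 + 8 + 12 = 28 bits, enough to cover 48-bit virtual addresses given
hypothetical 1 MiB arenas):

#include <stdbool.h>
#include <stdint.h>

#define ARENA_BITS 20   /* 1 MiB arenas: low 20 bits don't matter */
#define L1_BITS 8
#define L2_BITS 8
#define L3_BITS 12

typedef struct leaf { unsigned char is_arena[1 << L3_BITS]; } leaf;
typedef struct mid  { leaf *kids[1 << L2_BITS]; } mid;
mid *arena_map_root[1 << L1_BITS];   /* cheap: just 256 pointers */

/* Membership test: two pointer chases plus a byte load.  Interior
   nodes are allocated only for address prefixes actually in use. */
bool arena_map_contains(uintptr_t addr)
{
    uintptr_t key = addr >> ARENA_BITS;
    mid *m = arena_map_root[(key >> (L2_BITS + L3_BITS))
                            & ((1 << L1_BITS) - 1)];
    if (m == NULL)
        return false;
    leaf *l = m->kids[(key >> L3_BITS) & ((1 << L2_BITS) - 1)];
    if (l == NULL)
        return false;
    return l->is_arena[key & ((1 << L3_BITS) - 1)] != 0;
}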


> Getting rid of address_in_range sounds like a nice idea, and I would love to test
> how feasible it is -- I can run such a change against a wide selection of code
> at work, including a lot of third-party extension modules, but I don't see an easy
> way to do it right now.

Neil's branch is here:

https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

It's effectively a different _implementation_ of the current
address_in_range(), one that doesn't ever need to read possibly
uninitialized memory, and couldn't care less about the OS page size.

For the latter reason, it's by far the clearest way to enable
expanding pool size above 4 KiB. My PR also eliminates the pool size
limitation:

https://github.com/python/cpython/pull/13934

but at the cost of breaking bigger pools up internally into 4K regions
so the excruciating current address_in_range black magic still works.

Neil and I are both keen _mostly_ to increase pool and arena sizes.
The bigger they are, the more time obmalloc can spend in its fastest
code paths.

A question we can't answer yet (or possibly ever) is how badly that
would hurt Python returning arenas to the system, in long-running apps
that go through phases of low and high memory need.

I don't run anything like that - not a world I've ever lived in. All
my experiments so far say, for programs that are neither horrible nor
wonderful in this respect:

1. An arena size of 4 KiB is most effective for that.
2. There's significant degradation in moving even to 8 KiB arenas.
3. Which continues getting worse the larger the arenas.
4. Until reaching 128 KiB, at which point the rate of degradation falls a lot.

So the current 256 KiB arenas already suck for such programs.

For "horrible" programs, not even tiny 4K arenas help much.

For "wonderful" programs, not even 16 MiB arenas hurt arena recycling
effectiveness.

So if you have real programs keen to "return memory to the system"
periodically, it would be terrific to get info about how changing
arena size affects their behavior in that respect.

My PR uses 16K pools and 1M arenas, quadrupling the status quo.
Because "why not?" ;-)

Neil's branch has _generally_, but not always, used 16 MiB arenas.
The larger the arenas in his branch, the smaller the radix tree needs
to grow.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Fri, Jun 21, 2019 at 11:19 PM, Thomas Wouters <thomas@python.org> wrote:
> Is this really feasible in a world where the allocators can be selected (and the default changed) at runtime?

The memory allocator must not be changed after Python
pre-initialization. What's done after pre-initialization is more
about installing "hooks" which execute code before/after an
allocation, but don't replace the allocator.

It simply doesn't work to switch from pymalloc to malloc "at runtime".
Calling PyMem_Free(ptr) would call free(ptr). If the memory block was
allocated by pymalloc, free(ptr) would simply crash.

Victor
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On 2019-06-21, Tim Peters wrote:
> [Thomas Wouters <thomas@python.org>]
> > Getting rid of address_in_range sounds like a nice idea, and I
> > would love to test how feasible it is -- I can run such a change
> > against a wide selection of code at work, including a lot of
> > third-party extension modules, but I don't see an easy way to do
> > it right now.
>
> Neil's branch is here:
>
> https://github.com/nascheme/cpython/tree/obmalloc_radix_tree

If you can test vs some real-world programs, that would be great.
I was trying to run some tests this afternoon. Testing with Python
3.8+ is a pain because of the PyCode_New and tp_print changes. I've
just added two fixes to the head of the obmalloc_radix_tree branch
so that you can compile code generated by old versions of Cython.
Without those fixes, building 3rd party extensions can be a real
pain.

> My PR uses 16K pools and 1M arenas, quadrupling the status quo.
> Because "why not?" ;-)
>
> Neil's branch has _generally_, but not always, used 16 MiB arenas.
> The larger the arenas in his branch, the smaller the radix tree needs
> to grow.

Currently I have it like your big pool branch (16 KB, 1MB).
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
For those who would like to test with something compatible with
Python 3.7.3, I made re-based branches here:

https://github.com/nascheme/cpython/tree/obmalloc_radix_v37
https://github.com/nascheme/cpython/tree/obmalloc_big_pools_v37

They should be ABI compatible with Python 3.7.3. So, if you just
re-build the "python" executable, you don't have to rebuild anything
else. Both those use the same arena/pool sizes and they both have
Tim's arena thrashing fix.
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
On Fri, 21 Jun 2019 17:18:18 -0500
Tim Peters <tim.peters@gmail.com> wrote:
>
> > And what would be an efficient way of detecting allocations punted to
> > malloc, if not address_in_range?
>
> _The_ most efficient way is the one almost all allocators used long
> ago: use some "hidden" bits right before the address returned to the
> user to store info about the block being returned.

There's a fundamental problem here: you can't be sure that all
allocators reserve such space. If some allocator doesn't, it can
return a pointer just at the very start of the page, and if obmalloc
tries to look at "a few bits before" that address, it could very well
page-fault.

> Neil Schemenauer takes a different approach in the recent "radix tree
> arena map for obmalloc" thread here. We exchanged ideas on that until
> it got to the point that the tree levels only need to trace out
> prefixes of obmalloc arena addresses. That is, the new space burden
> of the radix tree appears quite reasonably small.

One open question is the cache efficiency of the two approaches.
Intuitively, address_in_range() will often look at exactly the same
cache line (since a significant number of allocation will share the
same "page prefix"). Apparently, this benefit may be offset by cache
aliasing issues. Cache aliasing can also be mitigated by the fact that
CPU caches are most of the time N-way with N > 1 (but N generally
remains small, from 2 to 8, for L1 and L2 caches).

So I guess the a priori answer is "it's complicated" :-)

I must also thank both you and Neil for running these experiments, even
though sometimes I may disagree about the conclusions :-)

Regards

Antoine.

Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Thomas]
>>> And what would be an efficient way of detecting allocations punted to
>>> malloc, if not address_in_range?

[Tim]
>> _The_ most efficient way is the one almost all allocators used long
>> ago: use some "hidden" bits right before the address returned to the
>> user to store info about the block being returned.

[Antoine]
> There's a fundamental problem here: you can't be sure that all
> allocators reserve such space. If some allocator doesn't, it can
> return a pointer just at the very start of the page, and if obmalloc
> tries to look at "a few bits before" that address, it could very well
> page-fault.

I snipped some technical but crucial context in my reply to Thomas:
this was assuming users are following the documented requirement to
never mix memory calls from different families.

What you describe certainly could happen in "illegal" code that, e.g.,
obtained a block from the system malloc, but passed it to obmalloc to
free. Which, in reality, works fine today, although the docs forbid
it. (I'm not sure, but there _may_ be some mode of running today that
actually enforces the doc requirements.)

In the other world, where code is playing by the rules, if an obmalloc
malloc/;realloc spelling is called, and it needs to punt to a
different allocator, no problem: first it boosts the size request so
it has room to store "the bit" it needs before the address it actually
returns to the client. Then it's "legal" only to free that memory
with an obmalloc spelling of free() later - obmalloc reads "the bit",
sees "oh - that's not my memory!", and adjusts the pointer accordingly
on _its_ call to the spelling of free() corresponding to the memory
family obmalloc() used to get the memory to begin with.

>> Neil Schemenauer takes a different approach in the recent "radix tree
>> arena map for obmalloc" thread here. ...

> One open question is the cache efficiency of the two approaches.
> Intuitively, address_in_range() will often look at exactly the same
> cache line (since a significant number of allocation will share the
> same "page prefix").

I believe we haven't seen a program yet that used more than one node
at the tree's top level :-) But who knows? mmap() and VirtualAlloc()
don't appear to make any _guarantees_ that the high-order bits of
returned addresses aren't entirely random. In real life so far, they
always appear to be zeroes.

While x86 has a 64-bit virtual address space, the hardware "only"
supports 48 bits of it, and I haven't seen a
virtual address yet where any of the top 16 bits are set.

AMD requires that the top 16 bits of virtual addresses be copies of bit 47.

Blah blah blah - for the foreseeable future, the top level of the tree
has a very easy job.

And Neil keenly observed that the top level of the tree can be
_declared_ as being very broad (and so suck up a lot of the leading
bits), because it's a file static and is effectively just an address
space reservation (at least on Linux) until nodes in it actually get
used.

> Apparently, this benefit may be offset by cache
> aliasing issues. Cache aliasing can also be mitigated by the fact that
> CPU caches are most of the time N-way with N > 1 (but N generally
> remains small, from 2 to 8, for L1 and L2 caches).
>
> So I guess the a priori answer is "it's complicated" :-)

Indeed it is.

> I must also thank both you and Neil for running these experiments, even
> though sometimes I may disagree about the conclusions :-)

Well, there aren't any conclusions yet - just seeing the same things repeatedly.

Over the weekend, Neil ran many variations of a longish-running "real
job" related to his work, which goes through phases of processing bulk
database operations and "trying to" release the space (details are
complicated).

Arena recycling was essentially non-existent in either branch (my PR
or the radix tree).

In 3.7.3 it _appeared_ to recycle hundreds of thousands of arenas, but
on closer examination they were virtually all of the "useless arena
thrashing" flavor. The arenas in use were almost always within one of
the highwater mark.

But it got much better in the branches if arenas were shrunk to a tiny 8 KiB.

Which is just another instance of the "256 KiB arenas are already way
too fat for effective arena recycling unless the program is
exceptionally friendly in its dynamic memory use patterns"
observation.

We haven't seen a case yet where 1 MiB arenas do materially worse than
256 KiB ones in this respect.

Speed is generally a wash between the branches, although they
consistently appear to be faster (by a little, not a lot) than 3.7.3.

The radix tree generally appears to be a little more memory-frugal
than my PR (presumably because my PR needs to break "big pools" into
4K chunks while the tree branch doesn't, which buys the tree more
space to actually store objects than it costs for the new tree).

We need more "real jobs", and especially ones where arena recycling is
actually doing good in 3.7.3 (which can be hard to tell, because the
"number of arenas recycled" output in 3.7.3 is vastly inflated by
arena-thrashing noise).
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.) [ In reply to ]
[Tim]
> The radix tree generally appears to be a little more memory-frugal
> than my PR (presumably because my PR needs to break "big pools" into
> 4K chunks while the tree branch doesn't, which buys the tree more
> space to actually store objects than it costs for the new tree).

It depends a whole lot on the size classes of the most popular
objects. There's a program below to compute it all. For a 64-bit box using
3.8 alignment, and 16 KiB pools:

pages per pool 4
pool size 16,384
alignment 16

The first output block:

size 16
SQ 1012 1.2%
PR 1018 0.6%
RT 1021 0.3%

SQ is the status quo: we have to use four separate 4 KiB pools, and
each burns 48 bytes for a pool header.

PR: my PR. There's one pool spanning 4 pages, with 48 bytes for a
pool header in the first page, and 16 bytes to store the arena index
in each of the other 3 pages.

RT: the radix tree. One 16 KiB block that only "wastes" 48 bytes for
the pool header.

The first number on each line is the number of objects we can get from
a "big pool" under that scheme. The second number is the % of total
pool space that cannot be used for client objects.

So, in the above, PR is a substantial improvement over SQ, and RT a
less substantial improvement over PR. Regardless of size class, PR
never does worse than SQ, and RT never worse than PR.
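
Spelling out the arithmetic behind the first block, for concreteness:
SQ gets 4 * floor((4096 - 48) / 16) = 4 * 253 = 1012 blocks; PR gets
253 from the first page plus 3 * floor((4096 - 16) / 16) = 253 + 765
= 1018; and RT gets floor((16384 - 48) / 16) = 1021.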

The most dramatic difference is in the largest size class:

size 512
SQ 28 12.5%
PR 28 12.5%
RT 31 3.1%

RT is a huge win there. And while it's generally true that RT's
advantages are more striking in the larger size classes, it doesn't
always crush. For example, in the 2nd-largest size class, it doesn't
matter at all which scheme is used:

size 496
SQ 32 3.1%
PR 32 3.1%
RT 32 3.1%

However, in the 3rd-largest size class, RT crushes it again:

size 480
SQ 32 6.2%
PR 32 6.2%
RT 34 0.4%

I trust the general principle at work here is too obvious to need
explanation ;-)

Anyway, these differences can really add up when there are millions of
objects passed out from a size class where RT has an advantage.
That's very attractive to me.

On the other hand, this _effectively_ makes arenas even larger (they
can contain more objects), which apparently makes it even less likely
that arenas can eventually be returned to the system.

But, on the third hand, I've seen no evidence yet that increasing
arena size matters much at all to that after hitting 128 KiB arenas
(smaller than what we already use). "Uncooperative" programs
essentially recycle nothing regardless, and "happy" programs
essentially recycle almost as many arena bytes with 1 MiB arenas as
with 8 KiB arenas.

Here's the code:

PAGES_PER_POOL = 4
ALIGNMENT = 16  # change to 8 for < Python 3.8

PAGE_SIZE = 2**12
POOL_SIZE = PAGE_SIZE * PAGES_PER_POOL
POOL_HEADER_SIZE = 48

def from_block(size, blocksize, overhead):
    return (blocksize - overhead) // size

def from_first_page(size, *, pagesize=PAGE_SIZE):
    return from_block(size, pagesize, POOL_HEADER_SIZE)

# using multiple 4K one-page pools - status quo
def nobj_4K(size):
    return from_first_page(size) * PAGES_PER_POOL

# using the PR
def nobj_PR(size):
    return (from_first_page(size) +
            from_block(size, PAGE_SIZE, ALIGNMENT)
            * (PAGES_PER_POOL - 1))

# using the radix tree branch
def nobj_RT(size):
    return from_first_page(size, pagesize=POOL_SIZE)

print("pages per pool", PAGES_PER_POOL)
print(f"pool size {POOL_SIZE:,}")
print("alignment", ALIGNMENT)

for size in range(ALIGNMENT, 512 + 1, ALIGNMENT):
    print("size", size)
    for tag, f in (("SQ", nobj_4K),
                   ("PR", nobj_PR),
                   ("RT", nobj_RT),
                   ):
        nobj = f(size)
        waste = POOL_SIZE - nobj * size
        print(f"    {tag} {nobj:4} {waste/POOL_SIZE:5.1%}")
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
For the record, there's another contender in the allocator
competition now:
https://github.com/microsoft/mimalloc/

Regards

Antoine.


On Mon, 24 Jun 2019 00:20:03 -0500
Tim Peters <tim.peters@gmail.com> wrote:
> [Tim]
> > The radix tree generally appears to be a little more memory-frugal
> > than my PR (presumably because my need to break "big pools" into 4K
> > chunks, while the tree branch doesn't, buys the tree more space to
> > actually store objects than it costs for the new tree).
>
> It depends a whole lot on the size classes of the most popular
> objects. A program below to compute it all. For a 64-bit box using
> 3.8 alignment, and 16 KiB pools:
[snip]

Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou <solipsis@pitrou.net>]
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/

Thanks! From a quick skim, most of it is addressing things obmalloc doesn't:

1) Efficient thread safety (we rely on the GIL).

2) Directly handling requests of all sizes (we only deal with <= 512 bytes).

3) Funky stuff to help refcounting languages that want to guarantee
that freeing an object takes bounded time (the problem is that large
objects can contain an unbounded number of other objects that
will then also die - so mimalloc has complications to interact with
client-supplied functions that parcel out the tree of objects killed
by a decref, so the freeing can be done incrementally over time).

I don't think it's been explicitly pointed out before, but #2 is an
alternative to my PR and Neil's radix tree. If we insist that
programs play by the rules, and obmalloc controls every piece of
memory it hands out, then the new definition of address_in_range()
becomes

#define address_in_range(p, pool) true

;-)

This deserves more thought than I've ever given to it.

Leaving aside all the above, they emphasize this:

"""
Many allocators use a single free list per size class which can lead
to bad spatial locality where objects belonging to a single structure
can be spread out over the entire heap.
...
To improve the spatial locality of allocation, mimalloc use free list
sharding where the heap is divided into pages (per size-class) with a
free list per page (where pages are usually 64KiB for small objects).
"""

Which is exactly what obmalloc already does, except we call their
"pages" "pools", and ours are 16x smaller (<= Python 3.8) or 4x
smaller (my PR).

They call our "arenas" "segments", and theirs are a minimum of 4 MiB.

Their motivations for making these "large" are the same as mine:
maximize the time they can stay in the very zippy "fast paths".

Something I'm torn on: within a pool, obmalloc has two ways to find
the next free block. The pool has its own block free list (as in
mimalloc), but _also_ has a "bump allocator": The pool is carved into
blocks one at a time, low address to high, whenever the pool's free
list is NULL. This was part of Vladimir's desire to never touch a
piece of memory before it's actually needed to satisfy a client
request.

But it does complicate and slow the code on the second-fastest
allocation path (when the pool's free list _is_ NULL). The
conditional "OK, the free list is NULL - so is there room for another
bump allocation?" is a waste of time after all the pool's blocks have
been passed out at least once. The bump allocator can never succeed
again then, but we keep testing for it forever.
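
Here's a toy model of the two paths, in Python rather than C and with
invented names - just a sketch of the shape of the problem, not
obmalloc's actual code:

class Pool:
    # Toy model of one size-segregated pool; field names are mine.
    def __init__(self, block_size, pool_size=16 * 1024, header_size=48):
        self.block_size = block_size
        self.freelist = []               # offsets of freed blocks
        self.next_offset = header_size   # bump-allocator high-water mark
        self.max_offset = pool_size - block_size

    def alloc(self):
        # Fastest path: pop a previously freed block.
        if self.freelist:
            return self.freelist.pop()
        # Second-fastest path: bump-allocate a never-used block.  This
        # test is pure overhead once the pool has been fully carved.
        if self.next_offset <= self.max_offset:
            offset = self.next_offset
            self.next_offset += self.block_size
            return offset
        return None  # pool exhausted; caller moves on to another pool

    def free(self, offset):
        # Fast path for freeing: push the block back on the free list.
        self.freelist.append(offset)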

The alternative is to break the pool into blocks, and link them into a
free list, in one gulp when a size class is assigned to a free pool.
That mutates all the blocks in the pool (to store the list pointers),
even if all but one will never be used. In return, the code for
testing whether the bump allocator can find room, and the pool header
members supporting the bump allocator, could be thrown out. That's
what mimalloc appears to do, and they said it sped things up a bit
overall.
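
A sketch of that alternative, against the same toy Pool above (my
guess at the approach, not mimalloc's actual code):

class EagerPool(Pool):
    def __init__(self, block_size, pool_size=16 * 1024, header_size=48):
        super().__init__(block_size, pool_size, header_size)
        # Carve every block into the free list up front, touching the
        # whole pool once; the bump-allocator state can then be retired.
        self.freelist = list(range(header_size,
                                   pool_size - block_size + 1,
                                   block_size))
        self.next_offset = self.max_offset + 1  # bump allocator retired

    def alloc(self):
        # One fast path only: pop from the free list, or report failure.
        return self.freelist.pop() if self.freelist else None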

But I think I'll leave that one for someone else ;-)
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
>
>
> For the record, there's another contender in the allocator
> competition now:
> https://github.com/microsoft/mimalloc/
>
> Regards
>
> Antoine.

It's a very strong competitor!

$ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
Faster (14):
- spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
- unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
- json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
- json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
- crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
- regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
- pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
- scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
- float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
- pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
- mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
- scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms: 1.04x faster (-3%)
- xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
- sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x faster (-3%)

Benchmark hidden because not significant (46): ...
