Mailing List Archive

Have a big machine and spare time? Here's a possible Python bug.
There's a Stackoverflow report[1] I suspect is worth looking into, but
it requires far more RAM (over 80GB) than I have. The OP whittled it
down to a reasonably brief & straightforward pure Python 3 program.
It builds a ternary search tree, with perhaps a billion nodes. The
problem is that it "takes forever" for Python to reclaim the tree
storage when it goes out of scope (the author waited at least hours).

Alas, the OP said it takes about 45 minutes to build the tree, and the
problem goes away if the tree is materially smaller. So it takes a
long time just to try once. With a tree about 10x smaller, for me it
takes about 45 seconds for Python to reclaim the tree storage.

The tree is deep enough that the "trashcan" may be implicated, and the
node objects are small enough that obmalloc carves them out of a
(relatively) great many arenas. Those are two ways in which Python
_may_ be to blame. The sheer number of objects involved may be
provoking edge cases we normally never see.

But, for a start, it would be good to know if anyone else can actually
reproduce the problem.

[1] https://stackoverflow.com/questions/56228799/python-hangs-indefinitely-trying-to-delete-deeply-recursive-object
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
I have only 32GB of memory, but AWS has larger-memory machines!

Linux perf shows the bottleneck is here:
https://github.com/python/cpython/blob/master/Objects/obmalloc.c#L1784-L1819

obmalloc keeps arenas sorted by number of free pools.
If there are many arenas and memory blocks are freed in random order,
this sorting becomes O(N^2). That's too bad.

I'm trying address order instead.
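
In other words, the pattern that goes quadratic is roughly the sketch
below. It is illustrative only; the struct, field, and function names
are simplified here and it is not the actual obmalloc.c code.

    struct arena {
        struct arena *prev, *next;
        unsigned int nfreepools;   /* list kept sorted ascending on this */
    };

    static void move_arena_after_pool_freed(struct arena *a)
    {
        a->nfreepools++;
        /* Walk forward past every arena whose free-pool count is now
         * below ours.  When thousands of arenas share the same count,
         * this walk touches thousands of nodes, and doing it once per
         * freed pool makes teardown O(N^2) overall. */
        struct arena *spot = a;
        while (spot->next != NULL && a->nfreepools > spot->next->nfreepools)
            spot = spot->next;
        /* ...then unlink `a` and re-insert it right after `spot`... */
    }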

Regards,
--
Inada Naoki <songofacandy@gmail.com>
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Thu, May 23, 2019 at 11:52 AM Inada Naoki <songofacandy@gmail.com> wrote:

> I have only 32GB of memory, but AWS has larger-memory machines!
>

Be aware that on larger cloud vendor instances, the backing vCPUs and
memory you get allocated might or might not be spread opaquely across
multiple sockets and/or NUMA nodes of the backing hardware. Some of them
have options where you can allocate raw hardware as well so you can opt to
lock your process to run within just one NUMA node and ensure hardware
locality for your performance debugging.

I'm pointing this out in case you experience any mystery jitters in your
benchmark results.

--
Joni Orponen
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
1. perf shows 95% of CPU time is eaten by _PyObject_Free, not kernel space.
2. This loop is clearly hot:
https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819

I attached to the process with gdb and confirmed that many arenas have
the same nfreepools.

I believe this is not jitter caused by NUMA or anything else in the cloud.
--
Inada Naoki <songofacandy@gmail.com>
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On 23May2019 0542, Inada Naoki wrote:
> 1. perf shows 95% of CPU time is eaten by _PyObject_Free, not kernel space.
> 2. This loop is clearly hot:
> https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819
>
> I attached to the process with gdb and confirmed that many arenas have
> the same nfreepools.

It's relatively easy to test replacing our custom allocators with the
system ones, yes? Can we try those to see whether they have the same
characteristic?

Given the relative amount of investment over the last 19 years [1], I
wouldn't be surprised if most system ones are at least as good for our
needs now. Certainly Windows HeapAlloc has had serious improvements in
that time to help with fragmentation and small allocations.

Cheers,
Steve

[1]:
https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L769
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
>
> It's relatively easy to test replacing our custom allocators with the
> system ones, yes? Can we try those to see whether they have the same
> characteristic?
>

Yes.

PYTHONMALLOC=malloc LD_PRELOAD=/path/to/jemalloc.so python script.py

I will try it tomorrow.


Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Fri, 24 May 2019 00:59:08 +0900
Inada Naoki <songofacandy@gmail.com> wrote:

> >
> > It's relatively easy to test replacing our custom allocators with the
> > system ones, yes? Can we try those to see whether they have the same
> > characteristic?
> >
>
> Yes.
>
> PYTHONMALLOC=malloc LD_PRELOAD=/path/to/jemalloc.so python script.py
>
> I will try it tomorrow.

Can you simply test with the system allocator rather than jemalloc?

Regards

Antoine.


Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
[Inada Naoki <songofacandy@gmail.com>]
> ...
> 2. This loop is clearly hot:
> https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819

Which is 3 lines of code plus a closing brace. The OP didn't build
their own Python, and the source from which it was compiled wasn't
available to them. But they did say that when they got into gdb and
single-stepped, it was looping through the same three lines of code,
in obmalloc.c, over & over. Which is 100% consistent with the loop
you identified as being "the hot spot".
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
I suggest filing a bug to track this...

On Thu, May 23, 2019 at 10:13 AM Tim Peters <tim.peters@gmail.com> wrote:

> [Inada Naoki <songofacandy@gmail.com>]
> > ...
> > 2. This loop is clearly hot:
> >
> https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819
>
> Which is 3 lines of code plus a closing brace. The OP didn't build
> their own Python, and the source from which it was compiled wasn't
> available to them. But they did say that when they got into gdb and
> single-stepped, it was looping through the same three lines of code,
> in obmalloc.c, over & over. Which is 100% consistent with the loop
> you identified as being "the hot spot".
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
It seems pretty clear now that the primary cause is keeping arenas
sorted by number of free pools. As deallocation goes on, the number of
distinct "# of free pools" values decreases, leaving large numbers of
arenas sharing the same value. Then, e.g., if there are 10,000 arenas
with 30 free pools remaining, and another pool is freed in one of
them, its free pool count jumps to 31, and on average we have to crawl
over 5,000 of its former peers to locate its new position in the list.

Which is also consistent with what the OP saw, and I did too but in a
much smaller case: the longer deallocation goes on, the slower it
gets (fewer and fewer Nodes freed per unit time).

Which suggests a different - albeit related - approach: it's not
really the order of arenas that matters, but the partitioning of
arenas by number of free pools. So instead of a single doubly linked
list of arenas, have a vector of doubly linked lists, one list per
possible number of free pools. That's a fixed and relatively small
number. For example, even if all arena bytes could be used for pools
(they can't be - there's bookkeeping info at the start of an arena),
the maximum number of free pools in an arena would be 2**18 / 2**12 ==
2**6 == 64.

When a pool is freed, no big deal: unlink the arena from its current
list, and stick it in the list for the next higher number of free
pools. This, of course, is constant-time.
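
A minimal sketch of that partitioning, assuming the 64-pool bound above
and ignoring the edge cases mentioned just below (nothing here is actual
CPython code; the names are invented):

    #define MAX_FREE_POOLS 64

    struct arena {
        struct arena *prev, *next;
        unsigned int nfreepools;
    };

    /* usable[i] heads a doubly linked list of arenas with exactly
     * i free pools. */
    static struct arena *usable[MAX_FREE_POOLS + 1];

    static void unlink_arena(struct arena *a)
    {
        if (a->prev) a->prev->next = a->next;
        else         usable[a->nfreepools] = a->next;
        if (a->next) a->next->prev = a->prev;
        a->prev = a->next = NULL;
    }

    /* O(1): move the arena to the list for one more free pool. */
    static void pool_freed_in(struct arena *a)
    {
        unlink_arena(a);
        a->nfreepools++;
        a->next = usable[a->nfreepools];
        if (a->next) a->next->prev = a;
        usable[a->nfreepools] = a;
    }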

Of course there are edge cases (e.g., if the arena is entirely free, it
should be given back to C instead), but they seem pretty obvious, and
I haven't thought of a real problem.

Tedious, though ;-)
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Fri, May 24, 2019 at 3:49 AM Gregory P. Smith <greg@krypto.org> wrote:
>
> I suggest filing a bug to track this...
>

I created the issue: https://bugs.python.org/issue37029

--
Inada Naoki <songofacandy@gmail.com>
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
> > >
> > > It's relatively easy to test replacing our custom allocators with the
> > > system ones, yes? Can we try those to see whether they have the same
> > > characteristic?
> > >
> >
> > Yes.
> >
> > PYTHONMALLOC=malloc LD_PRELOAD=/path/to/jemalloc.so python script.py
> >
> > I will try it tomorrow.
>
> Can you simply test with the system allocator rather than jemalloc?
>

For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:

$ local/bin/python3 t1.py # default
1138.1123778309993 -- end train, start del
688.7927911250008 -- end

$ arena-1m/bin/python3 t1.py # Changed ARENA_SIZE to 1MiB
1085.3363994170013 -- end train, start del
84.57135540099989 -- end

$ PYTHONMALLOC=malloc local/bin/python3 t1.py
1157.4882792789995 -- end train, start del
27.919834706000074 -- end

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 \
  PYTHONMALLOC=malloc local/bin/python3 t1.py
1098.4383037820007 -- end train, start del
117.93938426599925 -- end

In this case, glibc malloc is the fastest.
glibc is known to be weak against fragmentation, but an algorithm to
avoid fragmentation is just overhead in this script.
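
(Changing ARENA_SIZE for the "arena-1m" build is a one-line edit in
Objects/obmalloc.c, roughly as below; the exact default value and
comment vary by CPython version:)

    #define ARENA_SIZE              (1 << 20)     /* 1 MiB, up from the 256 KiB default */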

Regards,
--
Inada Naoki <songofacandy@gmail.com>
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Fri, May 24, 2019 at 9:41 AM, Inada Naoki <songofacandy@gmail.com> wrote:
> For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:
>
> $ local/bin/python3 t1.py # default
> 1138.1123778309993 -- end train, start del
> 688.7927911250008 -- end
>
> $ arena-1m/bin/python3 t1.py # Changed ARENA_SIZE to 1MiB
> 1085.3363994170013 -- end train, start del
> 84.57135540099989 -- end

688 => 84 looks like an interesting speedup. Would it be technically
possible to make ARENA_SIZE configurable at runtime?

Using the new PEP 587 preinitialization, it shouldn't be too hard to
add a new command line option and/or an environment variable to tune
the memory allocator:
https://www.python.org/dev/peps/pep-0587/#preinitialization-with-pypreconfig
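
As a point of reference, preinitialization already exposes one allocator
knob today (PyPreConfig.allocator); an ARENA_SIZE setting would be a new
field next to it. A minimal sketch using only the documented PEP 587 API
(no arena-size field exists yet; this just shows where such a knob would
be set):

    #include <Python.h>

    int main(void)
    {
        PyPreConfig preconfig;
        PyPreConfig_InitPythonConfig(&preconfig);

        /* Existing knob: bypass obmalloc entirely, same effect as
         * PYTHONMALLOC=malloc.  A hypothetical arena-size field would be
         * configured at the same point, before Py_PreInitialize(). */
        preconfig.allocator = PYMEM_ALLOCATOR_MALLOC;

        PyStatus status = Py_PreInitialize(&preconfig);
        if (PyStatus_Exception(status)) {
            Py_ExitStatusException(status);
        }

        /* Continue with normal interpreter startup. */
        Py_Initialize();
        PyRun_SimpleString("print('allocator configured before init')");
        return Py_FinalizeEx() < 0 ? 1 : 0;
    }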

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Thu, May 23, 2019 at 5:15 PM Steve Dower <steve.dower@python.org> wrote:

> On 23May2019 0542, Inada Naoki wrote:
> > 1. perf shows 95% of CPU time is eaten by _PyObject_Free, not kernel
> space.
> > 2. This loop is clearly hot:
> >
> https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819
> >
> > I attached to the process with gdb and confirmed that many arenas have
> > the same nfreepools.
>
> It's relatively easy to test replacing our custom allocators with the
> system ones, yes? Can we try those to see whether they have the same
> characteristic?
>
> Given the relative amount of investment over the last 19 years [1], I
> wouldn't be surprised if most system ones are at least as good for our
> needs now. Certainly Windows HeapAlloc has had serious improvements in
> that time to help with fragmentation and small allocations.
>

FYI, and I've mentioned this at PyCon to a few people (might've been you,
Steve, I don't remember) -- but at Google we've experimented with disabling
obmalloc when using TCMalloc (a faster and thread-aware malloc, which makes
a huge difference within Google when dealing with multi-threaded C++
libraries), both using the usual Python benchmarks and real-world code with
real-world workloads (a core part of YouTube, for example), all on Linux.
There's still a significant benefit to using obmalloc when using glibc's
malloc, and also a noticeable benefit when using TCMalloc. There are
certainly cases where it doesn't matter much, and there may even be cases
where the overhead of obmalloc isn't worth it, but I do believe it's still
a firm net benefit.


>
> Cheers,
> Steve
>
> [1]:
>
> https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L769


--
Thomas Wouters <thomas@python.org>

Hi! I'm an email virus! Think twice before sending your email to help me
spread!
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
[Inada Naoki <songofacandy@gmail.com>]
> For the record, result for 10M nodes, Ubuntu 18.04 on AWS r5a.4xlarge:

I'm unclear on what "nodes" means. If you mean you changed 27M to 10M
in this line:

for token in random_strings(27_000_000):

that's fine, but there are about 40 times more than that `Node`
objects created by the program. So if you did change that to
10_000_000, you'd have about 400M `Node` objects, consuming about 80x
that many bytes of RAM (or about 32GB).

> $ local/bin/python3 t1.py # default
> 1138.1123778309993 -- end train, start del
> 688.7927911250008 -- end
>
> $ arena-1m/bin/python3 t1.py # Changed ARENA_SIZE to 1MiB
> 1085.3363994170013 -- end train, start del
> 84.57135540099989 -- end

Nice! I assume these are seconds. On Stackoverflow, the OP reported
that boosting ARENA_SIZE the same way cut deallocation time in his
original program to about 13 minutes.


> $ PYTHONMALLOC=malloc local/bin/python3 t1.py
> 1157.4882792789995 -- end train, start del
> 27.919834706000074 -- end

While the OP reported, for the original program:

"""
With PYTHONMALLOC=malloc, insertion takes nearly twice as long, but
deallocation only takes 2 minutes!
"""

The "nearly twice as long" for building the tree is in sharp contrast
to what you saw, but there's about 3x as much of everything in the
original program, and, e.g., it's certainly possible malloc is
tripping over fragmentation there that obmalloc avoids.



> $ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 \
>   PYTHONMALLOC=malloc local/bin/python3 t1.py
> 1098.4383037820007 -- end train, start del
> 117.93938426599925 -- end
>
> In this case, glibc malloc is the fastest.
> glibc is known to be weak against fragmentation, but an algorithm to
> avoid fragmentation is just overhead in this script.

I can't say.

I'll note that the approach I very briefly sketched before
(restructure the list of arenas to partition it into multiple lists
partitioned by number of free pools) "should make" obmalloc
competitive with malloc here (it would eliminate all searches, except
perhaps (depending on how gonzo the code is) a rare need to search for
"the first" non-empty list).
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
[Tim]
> I'll note that the approach I very briefly sketched before
> (restructure the list of arenas to partition it into multiple lists
> partitioned by number of free pools) "should make" obmalloc
> competitive with malloc here ...

But it's also intrusive, breaking up a simple linked list into a
multi-headed beast. That affects all code using it, not just the
parts mutating it.

So instead I suggest leaving `usable_arenas` entirely alone, but adding a
vector of "search fingers", say,

static struct arenaobject* nfp2lasta[MAX_NUMBER_OF_FREE_POOLS_POSSIBLE + 1];

mapping a count of free pools to (a pointer to) the last arena object
in usable_arenas with that number of free pools.

Then the search loop in py_malloc_free() can be replaced with a single
array lookup: it leaps directly to where the search loop ends now.
For example, if we free a pool in an arena that had 23 free pools,
then the arena object's count of free pools has to be increased to 24,
and the arena object unlinked from its current position in
usable_arenas and inserted immediately after nfp2lasta[23]. Note that
usable_arenas is a doubly linked list, so there's no _need_ at all to
iterate over it. Each node in the list knows what's immediately
before and after it. And since a free pool count can only increase by
1, it's always correct to move the arena immediately after the last
arena with the same original free pool count.

Key invariants:

1. nfp2lasta[n] is NULL if and only if no arena in usable_arenas has
nfreepools == n.

2. nfp2lasta[pa->nfreepools] == pa if and only if pa is the only arena
in usable_arenas with that many free pools.

So, in the example above, nfp2lasta[23] has to be set to NULL if and
only if the arena in question was the only one with 23 free pools
(which can be determined cheaply via invariant #2).

And nfp2lasta[24] has to be set to point to the arena in question if
and only if nfp2lasta[24] is NULL.

Tricky bit: if the arena in question is the only one with a given
original free pool count, no pointers in arena objects need to be
changed at all. Following the sketch above literally would leave you
trying to insert it after itself, which wouldn't end well ;-)

Anyway, this "feels like a right solution" to me, eliminating all
searches with simple (albeit delicate) constant-time code, and
requiring code changes only where an arena's number of free pools can
change.
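
To make that concrete, here is a rough sketch of the pool-freed path
under this scheme. It is simplified (invented names, wholly-free-arena
handling and other bookkeeping omitted), it is not the actual patch, and
the finger-maintenance details may differ from what was ultimately merged:

    #define MAX_FREE_POOLS 64

    struct arena {
        struct arena *prev, *next;
        unsigned int nfreepools;
    };

    static struct arena *usable_arenas;       /* sorted ascending by nfreepools */
    static struct arena *nfp2lasta[MAX_FREE_POOLS + 1];  /* last arena per count, or NULL */

    static void pool_freed_in(struct arena *a)
    {
        unsigned int old = a->nfreepools;
        struct arena *lastold = nfp2lasta[old];

        /* Fix the finger for the old count while `a` is still in place. */
        if (lastold == a)
            nfp2lasta[old] = (a->prev && a->prev->nfreepools == old) ? a->prev : NULL;

        a->nfreepools = old + 1;

        if (lastold != a) {
            /* Unlink `a`, then splice it in right after the last arena
             * that still has the old count: one array lookup, no loop. */
            if (a->prev) a->prev->next = a->next; else usable_arenas = a->next;
            if (a->next) a->next->prev = a->prev;

            a->prev = lastold;
            a->next = lastold->next;
            if (lastold->next) lastold->next->prev = a;
            lastold->next = a;
        }
        /* else: `a` was already the last arena with the old count, so it
         * is already in the right position and no links change. */

        /* `a` becomes the finger for the new count only if there wasn't one. */
        if (nfp2lasta[a->nfreepools] == NULL)
            nfp2lasta[a->nfreepools] = a;
    }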
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
[Tim]
> Key invariants:
> ...
> 2. nfp2lasta[pa->nfreepools] == pa if and only if pa is the only arena
> in usable_arenas with that many free pools.

Ack! Scratch that. I need a nap :-(

In fact if that equality holds, it means that nfp2lasta entry has to
change if pa is moved and pa->prevarena has the same count of free
pools.

So forget about the explanation and just think about the goal - the
details will work themselves out ;-)
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Fri, 24 May 2019 14:23:21 +0200
Thomas Wouters <thomas@python.org> wrote:
> On Thu, May 23, 2019 at 5:15 PM Steve Dower <steve.dower@python.org> wrote:
>
> > On 23May2019 0542, Inada Naoki wrote:
> > > 1. perf shows 95% of CPU time is eaten by _PyObject_Free, not kernel
> > space.
> > > 2. This loop is clearly hot:
> > >
> > https://github.com/python/cpython/blob/51aa35e9e17eef60d04add9619fe2a7eb938358c/Objects/obmalloc.c#L1816-L1819
> > >
> > > I attached to the process with gdb and confirmed that many arenas have
> > > the same nfreepools.
> >
> > It's relatively easy to test replacing our custom allocators with the
> > system ones, yes? Can we try those to see whether they have the same
> > characteristic?
> >
> > Given the relative amount of investment over the last 19 years [1], I
> > wouldn't be surprised if most system ones are at least as good for our
> > needs now. Certainly Windows HeapAlloc has had serious improvements in
> > that time to help with fragmentation and small allocations.
> >
>
> FYI, and I've mentioned this at PyCon to a few people (might've been you,
> Steve, I don't remember) -- but at Google we've experimented with disabling
> obmalloc when using TCMalloc (a faster and thread-aware malloc, which makes
> a huge difference within Google when dealing with multi-threaded C++
> libraries), both using the usual Python benchmarks and real-world code with
> real-world workloads (a core part of YouTube, for example), all on Linux.
> There's still a significant benefit to using obmalloc when using glibc's
> malloc, and also a noticeable benefit when using TCMalloc. There are
> certainly cases where it doesn't matter much, and there may even be cases
> where the overhead of obmalloc isn't worth it, but I do believe it's still
> a firm net benefit.

Interesting that a simple 20-year-old allocator (obmalloc) is able to do
better than the sophisticated TCMalloc.

(well, of course, obmalloc doesn't have to worry about concurrent
scenarios, which explains some of the simplicity)

Regards

Antoine.


Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On 5/22/19 12:15 PM, Tim Peters wrote:
> There's a Stackoverflow report[1] I suspect is worth looking into, but
> it requires far more RAM (over 80GB) than I have). [...]
> But, for a start, it would be good to know if anyone else can actually
> reproduce the problem.
>
> [1] https://stackoverflow.com/questions/56228799/python-hangs-indefinitely-trying-to-delete-deeply-recursive-object


I have a computer with two Xeon CPUs and 256GB of RAM.  So, even though
it's NUMA, I still have 128GB of memory per CPU.  It's running a "spin"
of Ubuntu 18.10.

I compiled a fresh Python 3.7.3 --with-optimizations.  I copied the
sample program straight off the StackOverflow page.  The program ran for
about five and a half hours then exited normally.

During the run it printed:

This gets printed!
This doesn't get printed

Statistics reported by "time":

19811.05s user 123.56s system 99% cpu 5:32:15.04 total

Checking in on it now and then, peak observed memory usage (as reported
by "top") was just under 80GB.

I take it that the interesting part was confirming that "This doesn't
get printed" gets printed when you have enough RAM for the program to
run to completion.  So I guess there's no bug here? Just an observation
about CPython's garbage collector being kinda slow?  Or maybe CPython gc
+ swap = super bombad slow?


//arry/
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
[Larry Hastings <larry@hastings.org>]
> I have a computer with two Xeon CPUs and 256GB of RAM. So, even
> though it's NUMA, I still have 128GB of memory per CPU. It's running a
> "spin" of Ubuntu 18.10.
>
> I compiled a fresh Python 3.7.3 --with-optimizations. I copied the sample
> program straight off the StackOverflow page. The program ran for about
> five and a half hours then exited normally.

Thanks, Larry!

> During the run it printed:
>
> This gets printed!
> This doesn't get printed
>
> Statistics reported by "time":
>
> 19811.05s user 123.56s system 99% cpu 5:32:15.04 total
>
> Checking in on it now and then, peak observed memory usage (as reported
> by "top") was just under 80GB.

All of that is consistent with what others reported, although you've
given more quantified details, and that's helpful.

> I take it that the interesting part was confirming that "This doesn't get printed"
> gets printed when you have enough RAM for the program to run to completion.

That was the primary point at first, yes.

> So I guess there's no bug here? Just an observation about CPython's
> garbage collector being kinda slow? Or maybe CPython gc + swap =
> super bombad slow?

No need to confirm this: if you ran it again with a bit more output,
you'd find that the vast bulk of the time was spent between the two
output lines. That is, it takes hours between the time the train()
function starts to return and the time it finishes returning. That's all
consumed as a side effect of `tree` losing its last reference.

We know why now, and it's hard to say whether it's "a bug". It's
functioning as designed, but the design sucks ;-) So I'd call it a
design flaw.

The heuristics that were introduced to try to make it likely we could
return more empty arenas to C were designed when machines had far less
RAM, and can burn time quadratic in the number of arenas. That really
didn't matter in a world with a few hundred arenas, but can be a
timing disaster in a world with hundreds of thousands (or more) of
arenas.

The Stackoverflow test case was whittled way down from the OP's real
program, which ran "for days" without completing.

I've sketched two (albeit closely related, both to each other and to
what we currently do) ways the worst-case overall time could be cut
from quadratic to linear in the number of arenas, which is
asymptotically optimal.

More ambitious would be to dream up an entirely different heuristic.
Keeping the list of usable arenas fully sorted in order of number of
free pools seems a very strong requirement for a poke-and-hope
heuristic.

But, best I can tell, the person who came up with this heuristic to
begin with didn't check in any motivating test programs. So we'd be
starting over from scratch. For this specific Stackoverflow case,
I've confirmed that it doesn't make a real difference if we didn't
bother to sort the list of arenas _at all_ (then it still returns
about the same number of arenas to C, both at the point the tree
finishes building, and at the time it completes tearing down the
tree).

So I have a program showing the current scheme can be a disaster, but
none where it shows it actually helps anything ;-)
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
I made a pull request for this that appears to work fine for my 10x
smaller test case (reduces tear-down time from over 45 seconds to over
7). It implements my second earlier sketch (add a vector of search
fingers, to eliminate searches):

https://github.com/python/cpython/pull/13612

It would be good to know how it works on the OP's 10x larger test
case, which requires a machine with over 80 GB of RAM. Hoping it will
reduce tear-down time from hours to minutes (they have about a billion
`Node` objects, so tear-down time will never be trivial).

It would also be good to get a code review, if you have the time and
inclination. Thanks!
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
The PR for this looks good to go:

https://github.com/python/cpython/pull/13612

But, I still have no idea how it works for the OP's original test
case. So, if you have at least 80 GB of RAM to try it, I added
`arena.py` to the BPO report:

https://bugs.python.org/issue37029

That adds code to the OP's test case to display the times needed to
build the tree and to tear it down (& to display some obmalloc stats).
So there's no need for you to think about anything ;-)

I'm keen to get feedback on this before merging the PR, because this
case is so very much larger than anything I've ever tried that I'm
wary that there may be more than one "surprise" lurking here. The PR
certainly addresses "an obvious" (with hindsight) problem - but is
that the _only_ gross design flaw here?
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
> I'm keen to get feedback on this before merging the PR, because this
> case is so very much larger than anything I've ever tried that I'm
> wary that there may be more than one "surprise" lurking here. The PR
> certainly addresses "an obvious" (with hindsight) problem - but is
> that the _only_ gross design flaw here?

I started an r5a.4xlarge EC2 instance and started arena.py.
I will post the result within the next 12 hours.

Regards,

--
Inada Naoki <songofacandy@gmail.com>
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
[Tim]
>> I'm keen to get feedback on this before merging the PR, because this
>> case is so very much larger than anything I've ever tried that I'm
>> wary that there may be more than one "surprise" lurking here. ...

[Inada Naoki <songofacandy@gmail.com>]
> I started an r5a.4xlarge EC2 instance and started arena.py.
> I will post the result within the next 12 hours.

Thank you! Wrapping this up, Inada attached the results to the bug
report. For the OP's original big case, the time to reclaim the tree
storage dropped from nearly 5 hours to about 70 seconds. The time to
build the tree didn't materially change (it was a bit faster after the
patch, but offhand didn't look significantly faster to me).

Since I called this a "design flaw" rather than "a bug", I'm not
inclined to backport it. If someone else thinks it should be, that's
up to them. I'm _assuming_ Github won't decide to backport it on its
own - but maybe I'm wrong (Github workflow is still pretty magical to
me).
Re: Have a big machine and spare time? Here's a possible Python bug. [ In reply to ]
On Fri, May 31, 2019 at 11:12 AM Tim Peters <tim.peters@gmail.com> wrote:

> [Tim]
> >> I'm keen to get feedback on this before merging the PR, because this
> >> case is so very much larger than anything I've ever tried that I'm
> >> wary that there may be more than one "surprise" lurking here. ...
>
> [Inada Naoki <songofacandy@gmail.com>]
> > I started an r5a.4xlarge EC2 instance and started arena.py.
> > I will post the result within the next 12 hours.
>
> Thank you! Wrapping this up, Inada attached the results to the bug
> report. For the OP's original big case, the time to reclaim the tree
> storage dropped from nearly 5 hours to about 70 seconds. The time to
> build the tree didn't materially change (it was a bit faster after the
> patch, but offhand didn't look significantly faster to me).
>
> Since I called this a "design flaw" rather than "a bug", I'm not
> inclined to backport it. If someone else thinks it should be, that's
> up to them. I'm _assuming_ Github won't decide to backport it on its
> own - but maybe I'm wrong (Github workflow is still pretty magical to
> me).
>

Backports only happen if you add the appropriate labels to the PR itself
(and if that wasn't clear from the devguide then please open an issue so we
can fix that oversight).
