Mailing List Archive

Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 4 Jul 2019 16:19:52 +0900
Inada Naoki <songofacandy@gmail.com> wrote:

> On Tue, Jun 25, 2019 at 5:49 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
> >
> >
> > For the record, there's another contender in the allocator
> > competition now:
> > https://github.com/microsoft/mimalloc/
> >
> > Regards
> >
> > Antoine.
>
> It's a very strong competitor!
>
> $ ./python -m pyperf compare_to pymalloc.json mimalloc.json -G --min-speed=3
> Faster (14):
> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
> - unpickle: 19.7 us +- 1.9 us -> 17.6 us +- 1.3 us: 1.12x faster (-11%)
> - json_dumps: 17.1 ms +- 0.2 ms -> 15.7 ms +- 0.2 ms: 1.09x faster (-8%)
> - json_loads: 39.0 us +- 2.6 us -> 36.2 us +- 1.1 us: 1.08x faster (-7%)
> - crypto_pyaes: 162 ms +- 1 ms -> 150 ms +- 1 ms: 1.08x faster (-7%)
> - regex_effbot: 3.62 ms +- 0.04 ms -> 3.38 ms +- 0.01 ms: 1.07x faster (-7%)
> - pickle_pure_python: 689 us +- 53 us -> 650 us +- 5 us: 1.06x faster (-6%)
> - scimark_fft: 502 ms +- 2 ms -> 478 ms +- 2 ms: 1.05x faster (-5%)
> - float: 156 ms +- 2 ms -> 149 ms +- 1 ms: 1.05x faster (-5%)
> - pathlib: 29.0 ms +- 0.5 ms -> 27.7 ms +- 0.4 ms: 1.05x faster (-4%)
> - mako: 22.4 ms +- 0.1 ms -> 21.6 ms +- 0.2 ms: 1.04x faster (-4%)
> - scimark_sparse_mat_mult: 6.21 ms +- 0.04 ms -> 5.99 ms +- 0.08 ms: 1.04x faster (-3%)
> - xml_etree_parse: 179 ms +- 5 ms -> 173 ms +- 3 ms: 1.04x faster (-3%)
> - sqlalchemy_imperative: 42.0 ms +- 0.9 ms -> 40.7 ms +- 0.9 ms: 1.03x faster (-3%)
>
> Benchmark hidden because not significant (46): ...

Ah, interesting. Were you able to measure the memory footprint as well?

Regards

Antoine.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/47NHQNPQVR6GPZ3PPRCAVZLPRXGV4GNW/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
>
> Ah, interesting. Were you able to measure the memory footprint as well?
>

Hmm, it is not good. mimalloc uses MADV_FREE, so it may affect some
benchmarks. I will look into it later.

```
$ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
Slower (60):
- logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
- logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%)
- regex_dna: 13.3 MB +- 19.1 kB -> 27.0 MB +- 12.1 kB: 2.02x slower (+102%)
- json_dumps: 8351.8 kB +- 19.8 kB -> 15.2 MB +- 18.0 kB: 1.87x slower (+87%)
- regex_v8: 8434.6 kB +- 20.9 kB -> 14.4 MB +- 11.0 kB: 1.75x slower (+75%)
- unpack_sequence: 7521.0 kB +- 17.0 kB -> 9980.8 kB +- 24.7 kB: 1.33x slower (+33%)
- hexiom: 7412.2 kB +- 19.0 kB -> 9307.4 kB +- 8004 bytes: 1.26x slower (+26%)
- xml_etree_process: 12.2 MB +- 26.3 kB -> 15.0 MB +- 28.9 kB: 1.23x slower (+23%)
- genshi_text: 10.2 MB +- 24.0 kB -> 12.5 MB +- 24.8 kB: 1.22x slower (+22%)
- crypto_pyaes: 7602.2 kB +- 35.7 kB -> 9242.8 kB +- 7873 bytes: 1.22x slower (+22%)
- tornado_http: 24.9 MB +- 72.1 kB -> 30.1 MB +- 33.0 kB: 1.21x slower (+21%)
- chameleon: 15.8 MB +- 24.5 kB -> 19.1 MB +- 23.4 kB: 1.21x slower (+21%)
- genshi_xml: 10.9 MB +- 24.0 kB -> 12.9 MB +- 19.6 kB: 1.18x slower (+18%)
- go: 8662.6 kB +- 16.4 kB -> 10082.8 kB +- 26.2 kB: 1.16x slower (+16%)
- pathlib: 8863.6 kB +- 30.2 kB -> 10229.8 kB +- 19.4 kB: 1.15x slower (+15%)
- scimark_fft: 7473.4 kB +- 14.4 kB -> 8606.0 kB +- 28.6 kB: 1.15x slower (+15%)
- scimark_lu: 7463.2 kB +- 15.1 kB -> 8569.8 kB +- 28.6 kB: 1.15x slower (+15%)
- pidigits: 7380.2 kB +- 20.0 kB -> 8436.0 kB +- 24.2 kB: 1.14x slower (+14%)
- scimark_monte_carlo: 7354.4 kB +- 18.2 kB -> 8398.8 kB +- 27.0 kB: 1.14x slower (+14%)
- scimark_sparse_mat_mult: 7889.8 kB +- 20.1 kB -> 9006.2 kB +- 29.4 kB: 1.14x slower (+14%)
- scimark_sor: 7377.2 kB +- 18.9 kB -> 8402.0 kB +- 29.0 kB: 1.14x slower (+14%)
- chaos: 7693.0 kB +- 33.0 kB -> 8747.6 kB +- 10.5 kB: 1.14x slower (+14%)
- richards: 7364.2 kB +- 29.8 kB -> 8331.4 kB +- 20.2 kB: 1.13x slower (+13%)
- raytrace: 7696.0 kB +- 30.3 kB -> 8695.4 kB +- 30.0 kB: 1.13x slower (+13%)
- sqlite_synth: 8799.2 kB +- 25.5 kB -> 9937.4 kB +- 27.1 kB: 1.13x slower (+13%)
- logging_silent: 7533.8 kB +- 32.0 kB -> 8488.2 kB +- 25.1 kB: 1.13x slower (+13%)
- json_loads: 7317.8 kB +- 22.7 kB -> 8215.2 kB +- 21.5 kB: 1.12x slower (+12%)
- unpickle_list: 7513.4 kB +- 9790 bytes -> 8420.6 kB +- 25.6 kB: 1.12x slower (+12%)
- unpickle: 7519.8 kB +- 11.4 kB -> 8425.4 kB +- 27.1 kB: 1.12x slower (+12%)
- fannkuch: 7170.0 kB +- 14.9 kB -> 8033.0 kB +- 22.5 kB: 1.12x slower (+12%)
- pickle_list: 7514.6 kB +- 18.2 kB -> 8414.6 kB +- 24.0 kB: 1.12x slower (+12%)
- telco: 7685.2 kB +- 15.0 kB -> 8598.2 kB +- 17.6 kB: 1.12x slower (+12%)
- nbody: 7214.8 kB +- 10.7 kB -> 8070.2 kB +- 19.5 kB: 1.12x slower (+12%)
- pickle: 7523.2 kB +- 12.4 kB -> 8415.0 kB +- 21.0 kB: 1.12x slower (+12%)
- 2to3: 7171.2 kB +- 35.8 kB -> 8016.4 kB +- 21.7 kB: 1.12x slower (+12%)
- nqueens: 7425.2 kB +- 21.8 kB -> 8296.8 kB +- 25.5 kB: 1.12x slower (+12%)
- spectral_norm: 7212.6 kB +- 19.6 kB -> 8052.8 kB +- 18.4 kB: 1.12x slower (+12%)
- regex_compile: 8538.0 kB +- 21.0 kB -> 9528.6 kB +- 22.1 kB: 1.12x slower (+12%)
- pickle_pure_python: 7559.8 kB +- 19.4 kB -> 8430.0 kB +- 25.6 kB: 1.12x slower (+12%)
- unpickle_pure_python: 7545.4 kB +- 9233 bytes -> 8413.0 kB +- 15.6 kB: 1.11x slower (+11%)
- float: 23.9 MB +- 22.5 kB -> 26.6 MB +- 19.7 kB: 1.11x slower (+11%)
- sqlalchemy_imperative: 18.2 MB +- 46.2 kB -> 20.2 MB +- 36.5 kB: 1.11x slower (+11%)
- regex_effbot: 7910.8 kB +- 15.1 kB -> 8804.8 kB +- 20.9 kB: 1.11x slower (+11%)
- pickle_dict: 7563.4 kB +- 15.3 kB -> 8415.2 kB +- 19.3 kB: 1.11x slower (+11%)
- sqlalchemy_declarative: 18.9 MB +- 40.2 kB -> 21.0 MB +- 26.4 kB: 1.11x slower (+11%)
- xml_etree_parse: 11.8 MB +- 12.6 kB -> 13.0 MB +- 16.3 kB: 1.11x slower (+11%)
- html5lib: 20.1 MB +- 44.9 kB -> 22.2 MB +- 46.6 kB: 1.10x slower (+10%)
- xml_etree_iterparse: 12.0 MB +- 26.5 kB -> 13.2 MB +- 31.3 kB: 1.10x slower (+10%)
- sympy_integrate: 36.4 MB +- 26.7 kB -> 40.0 MB +- 33.2 kB: 1.10x slower (+10%)
- sympy_str: 37.2 MB +- 28.4 kB -> 40.7 MB +- 26.6 kB: 1.10x slower (+10%)
- sympy_expand: 36.2 MB +- 19.9 kB -> 39.7 MB +- 25.1 kB: 1.09x slower (+9%)
- mako: 15.3 MB +- 19.1 kB -> 16.7 MB +- 25.4 kB: 1.09x slower (+9%)
- django_template: 19.3 MB +- 14.9 kB -> 21.0 MB +- 14.6 kB: 1.09x slower (+9%)
- xml_etree_generate: 12.3 MB +- 39.5 kB -> 13.3 MB +- 26.9 kB: 1.08x slower (+8%)
- deltablue: 8918.0 kB +- 19.8 kB -> 9615.8 kB +- 12.5 kB: 1.08x slower (+8%)
- dulwich_log: 12.2 MB +- 102.9 kB -> 13.1 MB +- 26.6 kB: 1.08x slower (+8%)
- meteor_contest: 9296.0 kB +- 11.9 kB -> 9996.8 kB +- 20.7 kB: 1.08x slower (+8%)
- sympy_sum: 62.2 MB +- 20.8 kB -> 66.5 MB +- 21.1 kB: 1.07x slower (+7%)
- python_startup: 7946.6 kB +- 20.4 kB -> 8210.2 kB +- 16.6 kB: 1.03x slower (+3%)
- python_startup_no_site: 7409.0 kB +- 18.3 kB -> 7574.6 kB +- 21.8 kB: 1.02x slower (+2%)
```

--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YPMZIKREWIV7SNFIUI7U6AFXVA2T6CL2/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, 4 Jul 2019 23:32:55 +0900
Inada Naoki <songofacandy@gmail.com> wrote:
> On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
> >
> > Ah, interesting. Were you able to measure the memory footprint as well?
> >
>
> Hmm, it is not good. mimalloc uses MADV_FREE, so it may affect some
> benchmarks. I will look into it later.

Ah, indeed, MADV_FREE will make it complicated to measure actual memory
usage :-/

Regards

Antoine.
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/46QXQLQWHB4ASSBL5LB5QXJIYJLTDY77/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Antoine Pitrou <solipsis@pitrou.net>]
>> Ah, interesting. Were you able to measure the memory footprint as well?

[Inada Naoki <songofacandy@gmail.com>]
> Hmm, it is not good. mimalloc uses MADV_FREE, so it may affect some
> benchmarks. I will look into it later.
>
> ```
> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
> ...

Could you say more about what's being measured here? Like, if this is
on Linux, is it reporting RSS? VSS? Something else?

mimalloc is "most like" obmalloc among those I've looked at in recent
weeks. I noted before that its pools (they call them "pages") and
arenas (called "segments") are at least 16x larger than obmalloc uses
(64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
in mimalloc).

Like all "modern" 64-bit allocators, it cares little about reserving
largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
But it also releases (in some sense, like MADV_FREE) physical RAM
back to the system at a granularity far smaller than an arena/segment.

At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
vector at the start, so no program linking to that can consume less
than half a gig of address space. But it only expects to _use_ a few
4 KiB OS pages out of that. mimalloc doesn't do anything anywhere
near _that_ gonzo (& since mimalloc comes out of Microsoft, it never
will - "total virtual memory" on Windows is a system-wide resource,
limited to the sum of physical RAM + pagefile size - no "overcommit"
is allowed).
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ES3BFCBXS7N56XGUHHSOPHRT3UAEGKVA/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
I guess that INADA-san used pyperformance --track-memory.

pyperf --track-memory doc:
"--track-memory: get the memory peak usage. it is less accurate than
tracemalloc, but has a lower overhead. On Linux, compute the sum of
Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
On Windows, get PeakPagefileUsage of GetProcessMemoryInfo() (of the
current process): the peak value of the Commit Charge during the
lifetime of this process."
https://pyperf.readthedocs.io/en/latest/runner.html#misc

On Linux, pyperf uses read_smap_file() of pyperf._memory:
https://github.com/vstinner/pyperf/blob/master/pyperf/_memory.py

# Code to parse Linux /proc/%d/smaps files.
#
# See http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html for
# a quick introduction to smaps.
#
# Need Linux 2.6.16 or newer.
def read_smap_file():
    total = 0
    fp = open(proc_path("self/smaps"), "rb")
    with fp:
        for line in fp:
            # Include both Private_Clean and Private_Dirty sections.
            line = line.rstrip()
            if line.startswith(b"Private_") and line.endswith(b'kB'):
                parts = line.split()
                total += int(parts[1]) * 1024
    return total

It spawns a thread which reads /proc/self/smaps every millisecond and
then reports the *maximum*.
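As a rough sketch of that design (the class name and structure here are my own, not pyperf's actual API), a peak-tracking sampler might look like:

```python
import threading
import time

class PeakSampler:
    """Poll a measurement function on a background thread, keeping the max.

    A simplified sketch of the pyperf approach described above; `measure`
    would be something like read_smap_file() on Linux.
    """

    def __init__(self, measure, interval=0.001):
        self._measure = measure
        self._interval = interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Sample until asked to stop, remembering only the maximum.
        while not self._stop.is_set():
            self.peak = max(self.peak, self._measure())
            time.sleep(self._interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Because it only samples every millisecond, short-lived peaks between samples are missed, which is part of why pyperf describes this as less accurate than tracemalloc.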

Victor

Le jeu. 4 juil. 2019 à 17:12, Tim Peters <tim.peters@gmail.com> a écrit :
>
> [Antoine Pitrou <solipsis@pitrou.net>]
> >> Ah, interesting. Were you able to measure the memory footprint as well?
>
> [Inada Naoki <songofacandy@gmail.com>]
> > Hmm, it is not good. mimalloc uses MADV_FREE, so it may affect some
> > benchmarks. I will look into it later.
> >
> > ```
> > $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> > Slower (60):
> > - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
> > ...
>
> Could you say more about what's being measured here? Like, if this is
> on Linux, is it reporting RSS? VSS? Something else?
>
> mimalloc is "most like" obmalloc among those I've looked at in recent
> weeks. I noted before that its pools (they call them "pages") and
> arenas (called "segments") are at least 16x larger than obmalloc uses
> (64 KiB minimum pool/page size, and 4 MiB minimum arena/segment size,
> in mimalloc).
>
> Like all "modern" 64-bit allocators, it cares little about reserving
> largish blobs of address space, so I'd _expect_ Linuxy VSS to zoom.
> But it also releases (in some sense, like MADV_FREE) physical RAM
> back to the system at a granularity far smaller than an arena/segment.
>
> At an extreme, the SuperMalloc I linked to earlier reserves a 512 MiB
> vector at the start, so no program linking to that can consume less
> than half a gig of address space. But it only expects to _use_ a few
> 4 KiB OS pages out of that. mimalloc doesn't do anything anywhere
> near _that_ gonzo (& since mimalloc comes out of Microsoft, it never
> will - "total virtual memory" on Windows is a system-wide resource,
> limited to the sum of physical RAM + pagefile size - no "overcommit"
> is allowed).



--
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HF4UIZP5J3KKWQMLCHKJD3G6YZLHWWBE/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
I found the calibrated loop count is not stable, so memory usage differs a lot
in some benchmarks.
In particular, RAM usage of the logging benchmark depends heavily on the loop count:

$ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so \
    ./python bm_logging.py simple --track-memory --fast \
    --inherit-environ PYTHONMALLOC,LD_PRELOAD -v
Run 1: calibrate the number of loops: 512
- calibrate 1: 12.7 MB (loops: 512)
Calibration: 1 warmup, 512 loops
Run 2: 0 warmups, 1 value, 512 loops
- value 1: 12.9 MB
Run 3: 0 warmups, 1 value, 512 loops
- value 1: 12.9 MB
...

$ PYTHONMALLOC=malloc LD_PRELOAD=$HOME/local/lib/libmimalloc.so \
    ./python bm_logging.py simple --track-memory --fast \
    --inherit-environ PYTHONMALLOC,LD_PRELOAD -v -l1024
Run 1: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
Run 2: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
Run 3: 0 warmups, 1 value, 1024 loops
- value 1: 21.4 MB
...

--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QBXLRFXDD5TLLDATV2PWE2QNLLDWRVXY/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Victor Stinner <vstinner@redhat.com>]
> I guess that INADA-san used pyperformance --track-memory.
>
> pyperf --track-memory doc:
> "--track-memory: get the memory peak usage. it is less accurate than
> tracemalloc, but has a lower overhead. On Linux, compute the sum of
> Private_Clean and Private_Dirty memory mappings of /proc/self/smaps.
> ...

So I'll take that as meaning essentially that it's reporting what RSS
would report if it ignored shared pages (so peak # of private pages
actually backed by physical RAM).

Clear as mud how MADV_FREE affects that.
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/6H7HP6TXKGZBIPVNBTLULGYEDFJKVFCQ/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Thu, Jul 4, 2019 at 11:32 PM Inada Naoki <songofacandy@gmail.com> wrote:
>
> On Thu, Jul 4, 2019 at 8:09 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
> >
> > Ah, interesting. Were you able to measure the memory footprint as well?
> >
>
> Hmm, it is not good. mimalloc uses MADV_FREE, so it may affect some
> benchmarks. I will look into it later.
>
> ```
> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
> Slower (60):
> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%)

I think I understand why mimalloc uses more than twice the memory of
pymalloc + glibc malloc in the logging_format and logging_simple benchmarks.

These two benchmarks do something like this:

buf = []  # in StringIO
for _ in range(10*1024):
    buf.append("important: some important information to be logged")
s = "".join(buf)  # StringIO.getvalue()
s.splitlines()

mimalloc uses a size-segregated allocator for blocks up to ~512 KiB, and the
size class is determined by the top three bits.
On the other hand, list increases its capacity by a factor of 9/8, which means
the next size class is used on each realloc. In the end, all size classes
have 1~3 used/cached memory blocks.

This is almost the worst case for mimalloc. In a more complex application,
there may be more chances to reuse memory blocks.
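A toy model of that interaction (my own simplification for illustration; mimalloc's real binning differs in detail) shows repeated 9/8 growth wandering across size classes:

```python
def size_class(size: int):
    """Toy size class keyed by the magnitude plus the top three bits of size."""
    b = size.bit_length()
    shift = max(b - 3, 0)
    return (b, size >> shift)  # (power-of-two range, top three bits)

# A buffer repeatedly grown by 9/8, as list does, touches class after class,
# leaving a few cached blocks behind in each one.
cap, classes = 1024, []
for _ in range(6):
    classes.append(size_class(cap))
    cap = cap * 9 // 8
```

Not every single 9/8 step changes the class (for example, 128 * 9/8 = 144 shares 128's top three bits), but over a long run of reallocs a growing buffer still leaves cached blocks scattered across several classes.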

In a complex or huge application, this overhead becomes relatively small.
Its speed is attractive.

But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for
common cases.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ORVLH5FAEO7LVE7SK44TQR6XK4YRRZ7L/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
[Inada Naoki <songofacandy@gmail.com>, trying mimalloc]
>>> Hmm, it is not good. mimalloc uses MADV_FREE, so it may affect some
>>> benchmarks. I will look into it later.

>> ...
>> $ ./python -m pyperf compare_to pymalloc-mem.json mimalloc-mem.json -G
>> Slower (60):
>> - logging_format: 10.6 MB +- 384.2 kB -> 27.2 MB +- 21.3 kB: 2.58x slower (+158%)
>> - logging_simple: 10028.4 kB +- 371.2 kB -> 22.2 MB +- 24.9 kB: 2.27x slower (+127%)

> I think I understand why mimalloc uses more than twice the memory of
> pymalloc + glibc malloc in the logging_format and logging_simple benchmarks.
>
> These two benchmarks do something like this:
>
> buf = []  # in StringIO
> for _ in range(10*1024):
>     buf.append("important: some important information to be logged")
> s = "".join(buf)  # StringIO.getvalue()
> s.splitlines()
>
> mimalloc uses a size-segregated allocator for ~512 KiB, and the size class
> is determined by the top three bits.
> On the other hand, list increases capacity by 9/8. It means the next size
> class is used on each realloc.

Often, but not always (multiplication by 9/8 may not change the top 3
bits - e.g., 128 * 9/8 = 144).

> At last, all size classes have 1~3 used/cached memory blocks.

No doubt part of it, but hard to believe it's most of it. If the loop
count above really is 10240, then there's only about 80K worth of
pointers in the final `buf`. To account for a difference of over 10M,
it would need to have left behind well over 100 _full_ size copies
from earlier reallocs.

In fact, the number of list elements across resizes goes like so:

0, 4, 8, 16, 25, 35, 46, ..., 7671, 8637, 9723, 10945

Adding all of those sums to 96,113, so accounts for less than 1M of
8-byte pointers if none were ever released. mimalloc will, of course,
add its own slop on top of that - but not a factor of ten's worth.
Unless maybe it's using a dozen such buffers at once?
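For the curious, that arithmetic is easy to reproduce by simulating CPython's list over-allocation rule from Objects/listobject.c (in 3.7: `new_allocated = newsize + (newsize >> 3) + (newsize < 9 ? 3 : 6)`):

```python
def capacities(n):
    """Capacities a CPython (pre-3.9) list passes through while appending
    n items one at a time."""
    caps, cap = [], 0
    for newsize in range(1, n + 1):
        if newsize > cap:  # list is full: over-allocate
            cap = newsize + (newsize >> 3) + (3 if newsize < 9 else 6)
            caps.append(cap)
    return caps

caps = capacities(10 * 1024)
# caps[:6] == [4, 8, 16, 25, 35, 46], caps[-4:] == [7671, 8637, 9723, 10945],
# and sum(caps) == 96113, matching the figures above.
```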

But does it really matter? ;-) mimalloc "should have" done MADV_FREE
on the pages holding the older `buf` instances, so it's not like the
app is demanding to hold on to the RAM (albeit that it may well show
up in the app's RSS unless/until the OS takes the RAM away).


> This is almost worst case for mimalloc. In more complex application,
> there may be more chance to reuse memory blocks.
>
> In complex or huge application, this overhead will become relatively small.
> It's speed is attractive.
>
> But for memory efficiency, pymalloc + jemalloc / tcmalloc may be better for
> common cases.

The mimalloc page says that, in their benchmarks:

"""
In our benchmarks (see below), mimalloc always outperforms all other
leading allocators (jemalloc, tcmalloc, Hoard, etc), and usually uses
less memory (up to 25% more in the worst case).
"""

obmalloc is certainly more "memory efficient" (for some meanings of
that phrase) for smaller objects: in 3.7 it splits objects of <= 512
bytes into 64 size classes. mimalloc also has (close to) 64 "not
gigantic" size classes, but those cover a range of sizes over a
thousand times wider (up to about half a meg). Everything obmalloc
handles fits in mimalloc's first 20 size classes. So mimalloc
routinely needs more memory to satisfy a "small object" request than
obmalloc does.
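Concretely, obmalloc's small-object rounding can be sketched like this (using CPython 3.7's constants, 8-byte alignment and a 512-byte threshold):

```python
def obmalloc_class_size(nbytes: int):
    """Block size obmalloc serves for a request, per CPython 3.7's layout:
    requests of 1..512 bytes round up to a multiple of 8 (64 size classes);
    anything larger falls through to the raw allocator."""
    if not 0 < nbytes <= 512:
        return None  # not handled by obmalloc's small-object pools
    # e.g. a 41-byte request is served from the 48-byte class
    return (nbytes + 7) & ~7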
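Concretely, obmalloc's small-object rounding can be sketched like this (using CPython 3.7's constants, 8-byte alignment and a 512-byte threshold):

```python
def obmalloc_class_size(nbytes: int):
    """Block size obmalloc serves for a request, per CPython 3.7's layout:
    requests of 1..512 bytes round up to a multiple of 8 (64 size classes);
    anything larger falls through to the raw allocator."""
    if not 0 < nbytes <= 512:
        return None  # not handled by obmalloc's small-object pools
    # e.g. a 41-byte request is served from the 48-byte class
    return (nbytes + 7) & ~7
```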

I was more intrigued by your first (speed) comparison:

> - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)

Now _that's_ interesting ;-) Looks like spectral_norm recycles many
short-lived Python floats at a swift pace. So memory management
should account for a large part of its runtime (the arithmetic it does
is cheap in comparison), and obmalloc and mimalloc should both excel
at recycling mountains of small objects. Why is mimalloc
significantly faster? This benchmark should stay in the "fastest
paths" of both allocators most often, and they both have very lean
fastest paths (they both use pool-local singly-linked sized-segregated
free lists, so malloc and free for both should usually amount to just
popping or pushing one block off/on the head of the appropriate list).
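That shared fast path amounts to something like this toy model (a deliberately naive sketch; the real allocators work on raw memory blocks and embedded next-pointers, not Python lists):

```python
class Pool:
    """One pool/page for a single size class, with a LIFO free list."""

    def __init__(self, block_size, nblocks):
        self.block_size = block_size
        # Block indices standing in for addresses; the head is the list end.
        self.freelist = list(range(nblocks))

    def malloc(self):
        # Fast path: pop one block off the head of the free list.
        return self.freelist.pop() if self.freelist else None

    def free(self, block):
        # Fast path: push the block back onto the head.
        self.freelist.append(block)
```

The LIFO discipline means the most recently freed block is handed out first, which is also good for cache locality.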

obmalloc's `address_in_range()` is definitely a major overhead in its
fastest `free()` path, but then mimalloc has to figure out which
thread is doing the freeing (looks cheaper than address_in_range, but
not free). Perhaps the layers of indirection that have been wrapped
around obmalloc over the years are to blame? Perhaps mimalloc's
larger (16x) pools and arenas let it stay in its fastest paths more
often? I don't know why, but it would be interesting to find out :-)
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/554D4PU6LBBIKWJCQI4VKU2BVZD4Z3PM/
Re: obmalloc (was Have a big machine and spare time? Here's a possible Python bug.)
On Tue, Jul 9, 2019 at 9:46 AM Tim Peters <tim.peters@gmail.com> wrote:
>
> > At last, all size classes have 1~3 used/cached memory blocks.
>
> No doubt part of it, but hard to believe it's most of it. If the loop
> count above really is 10240, then there's only about 80K worth of
> pointers in the final `buf`.

You are right. List.append is not the major memory consumer in the
"large" class (8 KiB+1 ~ 512 KiB). There are several causes of
large-size allocations:

* bm_logging uses StringIO.seek(0); StringIO.truncate() to reset the buffer,
  so from the 2nd loop the internal buffer of StringIO becomes a Py_UCS4
  array instead of a list of strings. This buffer uses the same
  capacity-growth policy as list: `size + size >> 8 + (size < 9 ? 3 : 6)`.
  Actually, when I use the `-n 1` option, memory usage is only 9 MiB.
* The intern dict.
* Many modules are loaded, and FileIO.readall() is used to read pyc files.
  This creates and deletes bytes objects of various sizes.
* The logging module uses several regular expressions; `b'\0' * 0xff00` is
  used in sre_compile.
  https://github.com/python/cpython/blob/master/Lib/sre_compile.py#L320


>
> But does it really matter? ;-) mimalloc "should have" done MADV_FREE
> on the pages holding the older `buf` instances, so it's not like the
> app is demanding to hold on to the RAM (albeit that it may well show
> up in the app's RSS unless/until the OS takes the RAM away).
>

mimalloc doesn't call madvise for each free(). Each size class keeps a
64 KiB "page", and several OS pages (4 KiB) within that "page" are
committed but not used.

I dumped all "mimalloc page" stats:
https://paper.dropbox.com/doc/mimalloc-on-CPython--Agg3g6XhoX77KLLmN43V48cfAg-fFyIm8P9aJpymKQN0scpp#:uid=671467140288877659659079&h2=memory-usage-of-logging_format

For example:

bin  block_size  used  capacity  reserved
 29        2560     1        22        25   (14 pages committed; 2560 bytes in use)
 29        2560    14        25        25   (16 pages committed; 2560*14 bytes in use)
 29        2560    11        25        25
 31        3584     1         5        18   (5 pages committed; 3584 bytes in use)
 33        5120     1         4        12
 33        5120     2        12        12
 33        5120     2        12        12
 37       10240     3        11       409
 41       20480     1         6       204
 57      327680     1         2        12

* committed pages can be roughly calculated as `ceil(block_size * capacity / 4096)`.
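Checking that estimate against the first few rows of the dump (the helper name here is mine):

```python
import math

def committed_pages(block_size, capacity, os_page=4096):
    """Rough count of committed 4 KiB OS pages for one mimalloc "page",
    per the formula above."""
    return math.ceil(block_size * capacity / os_page)

# Matches the annotations in the dump:
# bin 29: 2560 * 22 -> 14 pages; 2560 * 25 -> 16 pages; bin 31: 3584 * 5 -> 5
```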

There are dozens of unused memory blocks and committed pages in each size
class. This caused a 10 MiB+ memory overhead in the logging_format and
logging_simple benchmarks.


> I was more intrigued by your first (speed) comparison:
>
> > - spectral_norm: 202 ms +- 5 ms -> 176 ms +- 3 ms: 1.15x faster (-13%)
>
> Now _that's_ interesting ;-) Looks like spectral_norm recycles many
> short-lived Python floats at a swift pace. So memory management
> should account for a large part of its runtime (the arithmetic it does
> is cheap in comparison), and obmalloc and mimalloc should both excel
> at recycling mountains of small objects. Why is mimalloc
> significantly faster?
[snip]
> obmalloc's `address_in_range()` is definitely a major overhead in its
> fastest `free()` path, but then mimalloc has to figure out which
> thread is doing the freeing (looks cheaper than address_in_range, but
> not free). Perhaps the layers of indirection that have been wrapped
> around obmalloc over the years are to blame? Perhaps mimalloc's
> larger (16x) pools and arenas let it stay in its fastest paths more
> often? I don't know why, but it would be interesting to find out :-)

Totally agree. I'll investigate this next.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/MXEE2NOEDAP72RFVTC7H4GJSE2CHP3SX/
