Mailing List Archive

[PATCH] Module::CoreList delta support
Hi,

I've added delta support to Module::CoreList, which makes it a bit smaller.
I'm not sending it to the list or RT because it's huge (mostly removed
lines); it's at:
https://github.com/dgl/perl/commit/e08139414a12cd6d613816f6eeee39b95a63538

Yes, it uses tie, which I'm not entirely pleased with; however, see for
example: http://grep.cpan.me/?q=%5C%24Module%3A%3ACoreList%3A%3Aversion for
the sheer amount of stuff reading from those hashes.
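
To give a rough idea of the approach without wading through the commit: the
data is stored as per-release deltas, with the old hash interface kept
working via tie. Something like this untested sketch (made-up data and
names; see the commit for the real thing):

package Example::TiedDelta;
use strict;
use warnings;
use Tie::Hash ();
our @ISA = ('Tie::StdHash');

# Made-up data: each release stores only what changed relative to the
# release it is a delta from.
my %delta = (
    '5.016'    => { changed => { 'Module::CoreList' => '2.66' } },
    '5.016001' => {
        delta_from => '5.016',
        changed    => { 'Module::CoreList' => '2.70' },
        removed    => {},
    },
);

sub _inflate {
    my ($release) = @_;
    my $d    = $delta{$release} or return {};
    my %full = $d->{delta_from} ? %{ _inflate( $d->{delta_from} ) } : ();
    delete @full{ keys %{ $d->{removed} || {} } };
    return { %full, %{ $d->{changed} || {} } };
}

# Reconstruct (and cache) the full module => version map on first access,
# so $version{'5.016001'}{'Module::CoreList'} keeps working as before.
sub FETCH {
    my ( $self, $release ) = @_;
    return $self->{$release} ||= _inflate($release);
}

package main;
tie my %version, 'Example::TiedDelta';
print $version{'5.016001'}{'Module::CoreList'}, "\n";    # 2.70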

(There's also another cleanup patch on top of it to reduce the manual work
needed to alias releases:
https://github.com/dgl/perl/commit/8d2b6197299cd9c5cf45bbc4016850cab1bd348f)

David
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Sun, Jul 29, 2012 at 3:22 PM, David Leadbeater <dgl@dgl.cx> wrote:
> Yes it uses tie, which I'm not entirely pleased with however see for
> example: http://grep.cpan.me/?q=%5C%24Module%3A%3ACoreList%3A%3Aversion for
> the sheer amount of stuff reading from those hashes.

Maybe it's too late to get rid of those, but let's please add a proper
API (that does the normalization dance that's required but often
forgotten) so that we don't really need to use those hashes directly
in the future. We can do better than this.

Leon
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Sun, Jul 29, 2012 at 01:22:50PM +0100, David Leadbeater wrote:
> Hi,
>
> I've added delta support to Module::CoreList, which makes it a bit smaller,
> I'm not sending it to the list or RT because it's huge (mostly removed
> lines), it's at:
> https://github.com/dgl/perl/commit/e08139414a12cd6d613816f6eeee39b95a63538
>
> Yes it uses tie, which I'm not entirely pleased with however see for
> example: http://grep.cpan.me/?q=%5C%24Module%3A%3ACoreList%3A%3Aversion for
> the sheer amount of stuff reading from those hashes.
>
> (There's also another cleanup patch on top of it to reduce the manual work
> needed to alias releases:
> https://github.com/dgl/perl/commit/8d2b6197299cd9c5cf45bbc4016850cab1bd348f)
>

Dave Golden and I had been working on reducing the size of Module-CoreList
(mainly Dave, to be fair).

The work so far is at https://github.com/bingos/module-snorelist

All the data is compressed and uuencoded.

This does have an effect on performance.

# current corelist
$ time corelist -a CPANPLUS

<snip>

real 0m1.105s
user 0m0.947s
sys 0m0.049s

# snorelist corelist
$ time perl -Mblib blib/script/corelist -a CPANPLUS

<snip>

real 0m3.810s
user 0m1.462s
sys 0m1.839s

Even with delta'd information, CoreList.pm is an 8000+ line file :(

--
Chris Williams
aka BinGOs
PGP ID 0x4658671F
http://www.gumbynet.org.uk
==========================
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Sun, Jul 29, 2012 at 11:08 AM, Chris 'BinGOs' Williams
<chris@bingosnet.co.uk> wrote:
> All the data is compressed and uuencoded.
>
> This does have an effect on performance.

The real win might be to get to a format that can be searched on
demand so we don't load the whole thing into memory just to check one
module.

Ideally, we'd pivot everything to make module name the primary key and
have a data structure per module with the per-perl-release data.

I think it might be possible to adapt some of my Search::Dict wrangling
from the Paris QA hackathon to do it. E.g. have a data file with
"$module::name $json_data\n" per line. Then Search::Dict the data
file and convert the JSON data part and that would give answers to 99%
of questions people ask with corelist.

For the handful of users that need full data per perl, the time cost
of loading it all up should be bearable.

N.B. I'm not planning on doing that work, but if someone is motivated,
it's another way to do it. Eliminating the repeated module names from
the file probably accomplishes a substantial size reduction. Delta
representation could be added on a per-module basis as well, of
course.
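
Generating such a file from the existing data would be fairly trivial,
something along these lines (untested sketch; the file name and exact
layout are only for illustration):

use strict;
use warnings;
use JSON::PP qw(encode_json);
use Module::CoreList;

# Pivot the release-keyed hash into a module-keyed one:
#   $by_module{$module}{$perl_release} = $module_version
my %by_module;
for my $release ( keys %Module::CoreList::version ) {
    my $mods = $Module::CoreList::version{$release};
    $by_module{$_}{$release} = $mods->{$_} for keys %$mods;
}

# One "$module $json\n" record per line, sorted by module name so the
# file can later be bisected with Search::Dict.
open my $fh, '>', 'corelist.dat' or die "corelist.dat: $!";
for my $module ( sort keys %by_module ) {
    print {$fh} $module, ' ', encode_json( $by_module{$module} ), "\n";
}
close $fh or die $!;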

-- David
Re: [PATCH] Module::CoreList delta support [ In reply to ]
Ah, I didn't come across that repo, interesting.

On 29 July 2012 18:07, David Golden <xdaveg@gmail.com> wrote:

> On Sun, Jul 29, 2012 at 11:08 AM, Chris 'BinGOs' Williams
> <chris@bingosnet.co.uk> wrote:
> > All the data is compressed and uuencoded.
> >
> > This does have an effect on performance.
>
> The real win might be to get to a format that can be searched on
> demand so we don't load the whole thing into memory just to check one
> module.
>

Yes, that would be nice. However, I do wonder how useful it is; that seems
like a lot of work and also needs something to generate that format. I was
trying to keep my changes fairly minimal. (I agree with Leon that it would
be nice to deprecate the access as a hash style, as then it would be easier
for a clean API to just load the data it needs.)

How important is performance though?

I notice that my patched version is actually a tad faster for simple
lookups (because presumably perl has fewer lines to parse despite doing more
work after that), although I was mostly aiming for some size reduction. I
also notice that gzip compressing the delta file results in a slightly
smaller file than the compressed one from Module-Snorelist (even after
compressing the uuencoded data for another level of compression).

As an experiment I just tried generating something approximating the JSON
format you mentioned (without deltas) and that alone is nearly 700K of
data. The other interesting thing is that as the data decreases in size it
reaches a point where it won't compress as well, and many things where size
matters are already using some form of compression.

David
Re: [PATCH] Module::CoreList delta support [ In reply to ]
David Leadbeater wrote:
> I notice that my patched version is actually a tad faster for simple
> lookups (because presumably perl has fewer lines to parse despite doing more
> work after that), although I was mostly aiming for some size reduction. I
> also notice that gzip compressing the delta file results in a slightly
> smaller file than the compressed one from Module-Snorelist (even after
> compressing the uuencoded data for another level of compression).

I am in favour of incorporating this patch. If snorelist proves better
we can always revert it.

This patch's advantage is its simplicity and the fact that it is
already written.

If nobody objects too vociferously, I might apply it soon.
Re: [PATCH] Module::CoreList delta support [ In reply to ]
I wrote:
> I am in favour of incorporating this patch. If snorelist proves better
> we can always revert it.
>
> This patch's advantage is its simplicity and the fact that it is
> already written.
>
> If nobody objects too vociferously, I might apply it soon.

Actually, I think there is something wrong with the new data. What
happened to ExtUtils::Miniperl? Why does it show up in the removed
hash in two places?
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On 30 July 2012 00:13, Father Chrysostomos <sprout@cpan.org> wrote:
>
> Actually, I think there is something wrong with the new data. What
> happened to ExtUtils::Miniperl? Why does it show up in the removed
> hash in two places?
>

From the output of the current corelist -a ExtUtils::Miniperl:

[snip]
v5.12.3 undef
v5.13.0 undef
v5.13.1 undef
v5.13.2 undef
v5.13.4 undef
[snip]

For some reason ExtUtils::Miniperl isn't listed as part of 5.12.4 or 5.13.3.

Looking at how Porting/corelist.pl works, this would happen if it was run
in a checkout of perl that wasn't built (i.e. it won't pick up minimod.pl,
only the built version as lib/ExtUtils/Miniperl.pm).

David
Re: [PATCH] Module::CoreList delta support [ In reply to ]

David Leadbeater:
> I've added delta support to Module::CoreList, which makes it a bit smaller,

Thank you. Applied as a272bf38fe56b and 4ffbb0c46e27652ffd.
Re: [PATCH] Module::CoreList delta support [ In reply to ]
* David Golden <xdaveg@gmail.com> [2012-07-29 19:10]:
> I think it might be possible to adapt some of my Search::Dict wrangling
> from the Paris QA hackathon to do it. E.g. have a data file with
> "$module::name $json_data\n" per line. Then Search::Dict the data
> file and convert the JSON data part and that would give answers to 99%
> of questions people ask with corelist.
>
> For the handful of users that need full data per perl, the time cost
> of loading it all up should be bearable.
>
> N.B. I'm not planning on doing that work, but if someone is motivated,
> it's another way to do it. Eliminating the repeated module names from
> the file probably accomplishes a substantial size reduction. Delta
> representation could be added on a per-module basis as well, of
> course.

How about we do something else – e.g. how about an ASCII table? Perl is
good at munging those, right?

Attached is a semi-crude script I wrote that builds a table out of the
current data structures, just to see what the result would look like.
(There’s a bunch of details to fix, e.g. it produces dupes of a few
columns due to perl version aliases.)

The table comes out to around 550KB, which is nowadays entirely
reasonable to load into memory in one fell swoop as a monolithic string.
No need to busy perl with parsing the data.

That is of course a small step back from the 370KB on disk that the
current scheme yields.

But the vast expanses of whitespace gzip-compress to peanuts (<25KB).
Do we have gunzip in core?

With each release it will grow by a couple KB in memory, and if we can
ship it gzipped, a handful of bytes on disk.

For access by perl release, parsing the first line allows building
a column offset list, which can be used to generate unpack formats for
extracting any particular column as an array. That is efficient in both
speed and memory.

For access by module, a map of module to row offset is quickly built.
A line extracted from the string is trivially parsed with `split " "`.

In both cases there is some light follow-up munging for “undef” etc.

That makes a total of 4 variables necessary: the table as a string, an
array of column offsets within a line, a hash of row offsets within the
string, and one more scalar for the width of a line. No nested data
structures, and the hash totals a couple dozen KB.

So we’re looking at <1MB in memory (incl. all overheads), a pittance on
disk, near zero load time, most parsing work done in a few heavy-weight
builtins with almost no looping in Perl code, and equally fast access to
the data by either axis, with no spin-up key index generation for either
of them.
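
If it helps to see it in code, the access side of the scheme comes down to
roughly this (untested sketch against a toy table; the attached script is
what builds the real one):

use strict;
use warnings;

# Toy stand-in for the generated table: a module column followed by one
# fixed-width column per perl release.
my @raw = (
    [ 'MODULE',           '5.014002', '5.016000' ],
    [ 'Carp',             '1.23',     '1.26'     ],
    [ 'Module::CoreList', '2.49',     '2.66'     ],
    [ 'strict',           '1.04',     '1.07'     ],
);
my $table = join '', map { sprintf "%-20s%-10s%-10s\n", @$_ } @raw;

my $width  = index( $table, "\n" ) + 1;    # length of a line, incl. "\n"
my $header = substr $table, 0, $width - 1;

# Column offsets: where each field starts within a line.
my @offsets;
push @offsets, $-[0] while $header =~ /\S+/g;

# Row offsets: where each module's line starts within the string.
my %row_at;
for ( my $pos = $width; $pos < length $table; $pos += $width ) {
    my ($module) = split ' ', substr $table, $pos, $width;
    $row_at{$module} = $pos;
}

# Access by perl release: unpack one column out of every data line.
my $col   = 2;                             # the 5.016000 column
my $start = $offsets[$col];
my $len   = ( $col < $#offsets ? $offsets[ $col + 1 ] : $width - 1 ) - $start;
my $fmt   = sprintf 'x%d A%d x%d', $start, $len, $width - $start - $len;
my @column = unpack "($fmt)*", substr $table, $width;
print "5.016000: @column\n";               # 1.26 2.66 1.07

# Access by module: pull its line out of the string and split it.
my @fields = split ' ', substr $table, $row_at{'Carp'}, $width;
print "Carp: @fields[ 1 .. $#fields ]\n";  # 1.23 1.26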

· —— * —— ·

Will a patch be accepted if I try this and find the results live up to
the promise? Did I miss any reason why this is a bad idea?

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On 4 August 2012 21:29, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
[snip details of space separated table format]
> But the vast expanses of whitespace gzip-compress to peanuts (<25KB).
> Do we have gunzip in core?

Yes, see the approach BinGOs mentioned, code here:
https://github.com/bingos/module-snorelist/blob/master/Data.PL

(Note this is uuencoding the gzipped data, so a tiny bit larger, but
I'm not sure that's really needed.)
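
The round trip there is basically gzip plus uuencode, i.e. roughly this
(not the actual Data.PL -- the serialisation step in particular is just an
assumption for the sketch):

use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use Data::Dumper ();

# Placeholder for the real Module::CoreList data structures.
my $data = { '5.016' => { 'Module::CoreList' => '2.66' } };

# Serialise, gzip, then uuencode so the result is plain ASCII that can be
# embedded in the generated .pm file.
my $dumped  = Data::Dumper->new( [$data] )->Terse(1)->Indent(0)->Sortkeys(1)->Dump;
gzip \$dumped => \my $gzipped or die "gzip failed: $GzipError";
my $encoded = pack 'u', $gzipped;

# Reading it back in: uudecode, gunzip, eval the dump.
my $decoded = unpack 'u', $encoded;
gunzip \$decoded => \my $plain or die "gunzip failed: $GunzipError";
my $restored = eval $plain;
die $@ if $@;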

[snip details of searching data file]
> So we’re looking at <1MB in memory (incl. all overheads), a pittance on
> disk, near zero load time, most parsing work done in a few heavy-weight
> builtins with almost no looping in Perl code, and equally fast access to
> the data by either axis, with no spin-up key index generation for either
> of them.

Putting a tie interface on top of that might not be that nice. I do
wonder if the next step is to make a cleaner API and deprecate the
access via hash "API" as mentioned earlier in the thread.

> Will a patch be accepted if I try this and find the results live up to
> the promise? Did I miss any reason why this is a bad idea?

Sounds sane in general (not my call to accept patches though). With
this approach I assume generation will be involved somewhere and it's
worth thinking about that aspect:

* Where will the generation take place?
(First thought is a .PL file run as part of the core perl build
process and Module-CoreList build process -- in that case we wouldn't
reduce the size of those distributions, but may reduce the size of
built packages)
* What the diff for a release manager will look like
(Presumably the data can be in a nicer format than a table with very
long lines if it's generated from something else).
* How this fits in with scripts in Porting, etc.
(e.g. test_porting at the moment will act as a sanity test on
CoreList.pm due to the strict version check there; it may be worth having
an explicit test for the data consistency).

David
Re: [PATCH] Module::CoreList delta support [ In reply to ]
David Leadbeater:
> * What the diff for a release manager will look like
...
> * How this fits in with scripts in Porting, etc.

Change the instructions to say something like this for maint releases:

- Regenerate Module::CoreList
- Commit to maint
- Then:

$ ./perl -Ilib utils/corelist -v 5.14.2 > /tmp/corelist
$ checkout blead
$ perl Porting/corelist.pl --add-version=5.14.2 < /tmp/corelist
$ commit -a -m 'Update CoreList for 5.14.2'

(And adjust corelist.pl in bleadperl to work with that.)

Then the format does not matter.
Re: [PATCH] Module::CoreList delta support [ In reply to ]
* David Leadbeater <dgl@dgl.cx> [2012-08-04 23:30]:
> On 4 August 2012 21:29, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> > So we’re looking at <1MB in memory (incl. all overheads), a pittance
> > on disk, near zero load time, most parsing work done in a few
> > heavy-weight builtins with almost no looping in Perl code, and
> > equally fast access to the data by either axis, with no spin-up key
> > index generation for either of them.
>
> Putting a tie interface on top of that might not be that nice. I do
> wonder if the next step is to make a cleaner API and deprecate the
> access via hash "API" as mentioned earlier in the thread.

Actually I don’t think it will be terrible. But moving from the hash to
a cleaner interface is a good idea in any case.

> With this approach I assume generation will be involved somewhere and
> it's worth thinking about that aspect:
>
> * Where will the generation take place?
> (First thought is a .PL file run as part of the core perl build
> process and Module-CoreList build process -- in that case we wouldn't
> reduce the size of those distributions, but may reduce the size of
> built packages)

Well, how is the data for a new release generated now? I assumed that
generating the data for the new format would slot right in there.
I really have absolutely no idea about the module; all I was interested
in was whether I could come up with a credible solution for storing the
data table.

> * What the diff for a release manager will look like
> (Presumably the data can be in a nicer format than a table with very
> long lines if it's generated from something else).

That is a weakness in the scheme I outlined. Lines need to be fixed
length for the unpack sleight of hand to work, which means adding a perl
release will change (read: lengthen) every single line in the file. With
git at your disposal you can use a word-based diff, which will nicely show
only the tail of each line getting extended. If presented as a patch,
though, it will be 750+ lines long every time.

> * How this fits in with scripts in Porting, etc.
> (e.g. test_porting at the moment will act as a sanity test on
> CoreList.pm due to the strict version check there, may be worth
> having an explicit test for the data consistency).

I’m afraid I’m not at all aware of how that process works. What is its
purpose and where are the cogs of its machinery which do the main work?
I can just go through the release manager docs (and one of these days
I really should) but if there are any shortcuts to the bits that are
relevant to MCL I shall be glad to hear about them.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Sun, Aug 5, 2012 at 3:20 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> That is a weakness in the scheme I outlined. Lines need to be fixed
> length for the unpack sleight of hand to work, which means adding a perl

(Without actually reading your code) my gut reaction is "don't do
unpack sleight of hand" then. Store lines in sorted module name order
like this:

"$name $json\n" # where $json has no newlines in it

Then use Search::Dict to bisect to the right line, split the line into
two fields, decode the JSON part into a hash and dig into it as
needed. The bisection means it will be faster than reading every line
and loading the whole thing into memory anyway. Only in the rarest of
cases do you need to load the whole file -- most uses are still
"corelist Foo", which benefits hugely from bisection.

And if the JSON data for each module is in some delta format and only
changes when a module is updated, a line only gets touched when it
actually changes in a release, so the diff is annoying (a very long
line) but not horrible (every line changing).

Why JSON vs some other ASCII delimited format? Because it allows
structured data instead of just fields. (Plus anyone with JSON::XS
installed gets a speed boost for free.)

-- David
Re: [PATCH] Module::CoreList delta support [ In reply to ]
* David Golden <xdaveg@gmail.com> [2012-08-05 19:40]:
> On Sun, Aug 5, 2012 at 3:20 AM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> > That is a weakness in the scheme I outlined. Lines need to be fixed
> > length for the unpack sleight of hand to work, which means adding
> > a perl
>
> (Without actually reading your code) my gut reaction is "don't do
> unpack sleight of hand" then. Store lines in sorted module name order
> like this:
>
> "$name $json\n" # where $json has no newlines in it
>
> Then use Search::Dict to bisect to the right line, split the line into
> two fields, decode the JSON part into a hash and dig into it as
> needed.

Err, the whole point was not to have to do that.

You have not considered what happens for queries on the by-perl-release
axis, in which case under your scheme the only solution is to load and
decode every single line and then do a hash access on it.

With fixed width records, queries on both axes are equally fast. The
cost is storing a bunch of whitespace – in memory, but not on disk,
because gzip crunches that away with abandon.

> The bisection means it will be faster than reading every line and
> loading the whole thing into memory anyway.
>
> Only in the rarest of cases do you need to load the whole file -- most
> uses are still "corelist Foo", which benefits hugely from bisection.

Err. You are proposing that doing repeated disk seeks will be faster
than reading 25KB in a single I/O operation and then inflating that to
500KB in memory using gzip (which decompresses at near memcpy speed),
and then scanning that string.

I, uhm… Are you sure you want to have that match?

> And if the JSON data for each module is in some delta format and only
> changes when a module is updated, a line only gets touched when it
> actually changes in a release, so the diff is annoying (a very long
> line) but not horrible (every line changing).

That is the benefit of the JSON-based scheme.

I posit that it loses on every other count.

> Why JSON vs some other ASCII delimited format? Because it allows
> structured data instead of just fields. (Plus anyone with JSON::XS
> installed gets a speed boost for free.)

Microöptimising a fundamentally slow algorithm…

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Sun, Aug 5, 2012 at 4:39 PM, Aristotle Pagaltzis <pagaltzis@gmx.de> wrote:
> You have not considered what happens for queries on the by-perl-release
> axis, in which case under your scheme the only solution is to load and
> decode every single line and then do a hash access on it.

I have considered it and I consider it rare enough to dismiss out of
hand. I suspect it primarily benefits porters. (Which is not to say
that we're not important and worth optimizing for, but I still think
it's the minority case.)

>> Only in the rarest of cases do you need to load the whole file -- most
>> uses are still "corelist Foo", which benefits hugely from bisection.
>
> Err. You are proposing that doing repeated disk seeks will be faster
> than reading 25KB in a single I/O operation and then inflating that to
> 500KB in memory using gzip (which decompresses at near memcpy speed),
> and then scanning that string.
>
> I, uhm… Are you sure you want to have that match?

/me shrugs. Don't know. Don't terribly care, really. For the most
part, I find it ugly that a design requires loading everything only to
throw away 99%+ of it most of the time, even if Moore's law has saved us
from caring.

I'm certainly not suggesting my idea is the *best* way (for any
definition of best); I was reacting to the "it hurts when I do this"
comment you made, and thus my reaction was "so don't do that".

I don't know what anyone really cares about optimizing here. It
started (mostly) with the desire to minimize size on disk. Minimizing
memory size and speed are nice, too. So is minimizing p5p maintenance
hassle. I'd be happy with an approach that does a decent job across
criteria rather than focusing excessively on any single one.

What I *don't* like about compression (even though I played around
with such approaches) is that it's opaque on disk, which makes
reviewing deltas extremely difficult. What changed from version X to
Y? Did module Z actually get updated or not? Ugh. Let's unpack the
data (or run the code) and find out... instead of "git diff". But
it's wizard for minimizing disk size, which is where this whole idea
started.

-- David
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Mon, Aug 6, 2012 at 3:37 AM, David Golden <xdaveg@gmail.com> wrote:
> What I *don't* like about compression (even though I played around
> with such approaches) is that it's opaque on disk, which makes
> reviewing deltas extremely difficult. What changed from version X to
> Y? Did module Z actually get updated or not? Ugh. Let's unpack the
> data (or run the code) and find out... instead of "git diff". But
> it's wizard for minimizing disk size, which is where this whole idea
> started.

That could be solved if the compressed form is created at install
time, instead of at authoring time.

Leon
Re: [PATCH] Module::CoreList delta support [ In reply to ]
On Mon, Aug 6, 2012 at 5:54 AM, Leon Timmermans <fawaka@gmail.com> wrote:
> That could be solved if the compressed form is created at install
> time, instead of at authoring time.

Depends on whether you care about run-time transparency.

It could be compressed at tarball creation time during the release process, too.

-- David