* David Golden <xdaveg@gmail.com> [2012-07-29 19:10]:
> I think it might be possible to adapt some of my Search::Dict wrangling
> from the Paris QA hackathon to do it. E.g. have a data file with
> "$module::name $json_data\n" per line. Then Search::Dict the data
> file and convert the JSON data part and that would give answers to 99%
> of questions people ask with corelist.
>
> For the handful of users that need full data per perl, the time cost
> of loading it all up should be bearable.
>
> N.B. I'm not planning on doing that work, but if someone is motivated,
> it's another way to do it. Eliminating the repeated module names from
> the file probably accomplishes a substantial size reduction. Delta
> representation could be added at a per-module basis as well, of
> course.

How about we do something else – e.g. how about an ASCII table? Perl is
good at munging those, right?

Attached is a semi-crude script I wrote that builds a table out of the
current data structures, just to see what the result would look like.
(There’s a bunch of details to fix, e.g. it produces dupes of a few
columns due to perl version aliases.)

The table comes out to around 550Kb, which is nowadays entirely
reasonable to load into memory in one fell swoop as a monolithic string.
No need to busy perl with parsing the data.

That is of course a small step back from the 370Kb on disk that the
current scheme yields.

But the vast expanses of whitespace gzip-compress to peanuts (<25Kb).
Do we have gunzip in core?

With each release it will grow by a couple Kb in memory, and if we can
ship it gzipped, a handful of bytes on disk.

For access by perl release, parsing the first line allows building
a column offset list, which can be used to generate unpack formats for
extracting any particular column as an array. That is efficient in both
speed and memory.
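To make the by-release access concrete, here is a rough sketch, not the
attached script. The table contents are made up (hypothetical modules and
versions), and it assumes every line is padded to the same width, the
first column holds the module name, and a missing version is spelled
`undef` literally in the cell.

```perl
use strict;
use warnings;

# Hypothetical miniature table; the real one would be loaded from disk.
my @rows = (
    [ 'Module',   '5.006', '5.008', '5.010' ],
    [ 'Foo',      '1.00',  '1.02',  '1.10'  ],
    [ 'Foo::Bar', 'undef', '0.13',  '0.18'  ],
    [ 'Baz',      'undef', 'undef', '0.74'  ],
);
my $table = join '', map { sprintf "%-12s%-8s%-8s%-8s\n", @$_ } @rows;

# Width of one line, including its newline.
my $width = 1 + index $table, "\n";

# Parse the first line once: record where each column starts.
my $header = substr $table, 0, $width - 1;
my (@offset, %col);
while ($header =~ /(\S+)/g) {
    push @offset, $-[1];          # start offset of this column
    $col{$1} = $#offset;          # column name => column index
}

# Build an unpack format that skips to one column, grabs it, skips the
# rest of the line, and repeats that for every line in the string.
sub column_format {
    my ($i)  = @_;
    my $off  = $offset[$i];
    my $end  = $i < $#offset ? $offset[$i + 1] : $width - 1;
    my $len  = $end - $off;
    my $rest = $width - $end;     # remaining columns plus the newline
    return '(' . ($off ? "x$off " : '') . "A$len x$rest)*";
}

# First element of each result is the header cell, so peel it off.
my ($label,   @modules)  = unpack(column_format(0), $table);
my ($release, @versions) = unpack(column_format($col{'5.008'}), $table);

# Light follow-up munging: turn the literal "undef" into a real undef.
my %in_release;
@in_release{@modules} = map { $_ eq 'undef' ? undef : $_ } @versions;
```

The only Perl-level loop here runs over the header line; pulling a whole
column out of the table is a single unpack call.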
For access by module, a map of module to row offset is quickly built.
A line extracted from the string is trivially parsed with `split " "`.
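A similarly rough sketch for the by-module direction, under the same
assumptions as before (made-up table, fixed-width lines, module name in
the first column, literal `undef` cells). The helper name `versions_of`
is just for illustration.

```perl
use strict;
use warnings;

# Hypothetical miniature table; the real one would be loaded from disk.
my @rows = (
    [ 'Module',   '5.006', '5.008', '5.010' ],
    [ 'Foo',      '1.00',  '1.02',  '1.10'  ],
    [ 'Foo::Bar', 'undef', '0.13',  '0.18'  ],
    [ 'Baz',      'undef', 'undef', '0.74'  ],
);
my $table = join '', map { sprintf "%-12s%-8s%-8s%-8s\n", @$_ } @rows;

# Width of one line, including its newline.
my $width = 1 + index $table, "\n";

# Width of the name column: first header cell plus its padding.
$table =~ /^\S+\s+/ or die "unexpected header";
my $name_w = $+[0];

# Pull every module name out with one unpack, then map name => row offset.
my $rest = $width - $name_w;
my ($label, @names) = unpack("(A$name_w x$rest)*", $table);
my %row;
@row{@names} = map { $_ * $width } 1 .. scalar @names;

# Release names come from the header line, minus the label cell.
my (undef, @releases) = split ' ', substr($table, 0, $width);

# Look up one module: cut its line out of the string and split it.
sub versions_of {
    my ($module) = @_;
    my $off = $row{$module};
    return unless defined $off;
    my ($name, @v) = split ' ', substr($table, $off, $width);
    my %version;
    @version{@releases} = map { $_ eq 'undef' ? undef : $_ } @v;
    return \%version;
}

my $v = versions_of('Foo::Bar');
```

Only the one requested line is ever split; the rest of the table stays
an untouched string.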
In both cases there is some light follow-up munging for “undef” etc.

That makes a total of 4 variables necessary: the table as a string, an
array of column offsets within a line, a hash of row offsets within the
string, and one more scalar for the width of a line. No nested data
structures, and the hash totals a couple dozen Kb.

So we’re looking at <1MB in memory (incl. all overheads), a pittance on
disk, near zero load time, most parsing work done in a few heavy-weight
builtins with almost no looping in Perl code, and equally fast access to
the data by either axis, with no spin-up key index generation for either
of them.
· —— * —— ·
Will a patch be accepted if I try this and find the results live up to
the promise? Did I miss any reason why this is a bad idea?

Regards,

--
Aristotle Pagaltzis // <http://plasmasturm.org/>