Reducing your API overhead
In getting ready to optimize some templates for speed, I started looking over the mailing list for anything that talked about the overhead in fetching certain Bricolage objects using the API. One of the most relevant comments seemed to be this one from David just a few months ago when we were discussing large Bricolage installs:

http://www.gossamer-threads.com/lists/bricolage/users/38218#38218

"Well, yes, except that the slowness of publishing is not due to the speed of the database, but to the amount of data loaded from the database. So if you republish your home page, and the template loads the 25 most recent stories from the database, and then loads any of the elements from those 25 stories (such as a teaser field), then that's a lot of data that gets queried."

So this is my question to the list, keeping in mind that the Mason and Perl code you're using will have a large impact on this: Are there some API calls that are better to use than others? Do you gain anything by focusing on pulling relatively small asset attributes like title, slug, and description as opposed to elements in stories? If you're trying to make a template run faster, what sorts of API calls do you look to cut out? Is the question purely one of memory, or are there other factors at play?

Sorry if the question is somewhat noobish, I haven't seen anything in the docs on this issue.

-Matt
Re: Reducing your API overhead
On 2010-05-19, at 9:29 PM, Matthew Rolf wrote:

> In getting ready to optimize some templates for speed, I started looking over the mailing list for anything that talked about the overhead in fetching certain Bricolage objects using the API. One of the most relevant comments seemed to be this one from David just a few months ago when we were discussing large Bricolage installs:
>
> http://www.gossamer-threads.com/lists/bricolage/users/38218#38218
>
> "Well, yes, except that the slowness of publishing is not due to the speed of the database, but to the amount of data loaded from the database. So if you republish your home page, and the template loads the 25 most recent stories from the database, and then loads any of the elements from those 25 stories (such as a teaser field), then that's a lot of data that gets queried."
>
> So this is my question to the list, keeping in mind that the Mason and Perl code you're using will have a large impact on this: Are there some API calls that are better to use than others? Do you gain anything by focusing on pulling relatively small asset attributes like title, slug, and description as opposed to elements in stories? If you're trying to make a template run faster, what sorts of API calls do you look to cut out? Is the question purely one of memory, or are there other factors at play?
>
> Sorry if the question is somewhat noobish, I haven't seen anything in the docs on this issue.

Not "noobish" at all. It's a great question that would be valuable to answer and document in the wiki.

--
Phillip Smith // Simplifier of Technology // COMMUNITY BANDWIDTH
www.communitybandwidth.ca // www.phillipadsmith.com
Re: Reducing your API overhead
On May 19, 2010, at 9:29 PM, Matthew Rolf wrote:

> http://www.gossamer-threads.com/lists/bricolage/users/38218#38218
>
> "Well, yes, except that the slowness of publishing is not due to the speed of the database, but to the amount of data loaded from the database. So if you republish your home page, and the template loads the 25 most recent stories from the database, and then loads any of the elements from those 25 stories (such as a teaser field), then that's a lot of data that gets queried."
>
> So this is my question to the list, keeping in mind that the Mason and Perl code you're using will have a large impact on this: Are there some API calls that are better to use than others? Do you gain anything by focusing on pulling relatively small asset attributes like title, slug, and description as opposed to elements in stories? If you're trying to make a template run faster, what sorts of API calls do you look to cut out? Is the question purely one of memory, or are there other factors at play?

The problem is that elements and fields are fetched separately for each container element. Say you have 25 stories you retrieve in a template via Story->list. If all you grab is the URI and title of the stories it's no big deal: That data is loaded when each story is fetched. But as soon as you get into elements it gets expensive.

Say you want the "summary" field from the top-level element. That's 50 more queries: one to load the top-level element of each of the 25 stories, and then another to fetch all the fields of each of those 25 elements just so you can get at the "summary" value. If you end up fetching subelements, that's more queries still.
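In template terms, the pattern looks roughly like this (a sketch from memory -- the Limit/Order parameters to list() and get_value() vs. get_data() vary by Bricolage version, so treat the names as approximate):

  my @stories = Bric::Biz::Asset::Business::Story->list({
      Order => 'cover_date',
      Limit => 25,
  });

  for my $story (@stories) {
      # Cheap: these attributes come back with the story rows themselves.
      my $title = $story->get_title;
      my $uri   = $story->get_primary_uri;

      # Expensive: one query to fetch the top-level element and another to
      # fetch its fields -- for *each* story. That's the 50 extra queries.
      my $summary = $story->get_element->get_value('summary');
  }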

I have an install I'm working on now where looked-up stories need to be sorted on an event field in a subelement. If I've looked up 25 stories via Story->list, then in addition to that one query I execute 75 more: 25 for the top-level element of each of those stories, another 25 to get the subelements of each of those top-level elements, and 25 to get the fields of the "Event" subelement in each. If there is more than one Event subelement in a given story, add yet another query.

This is insane, frankly.

Ideally, there'd be some way to tell Story->list to load all container and field subelements in the same query. So if I needed to sort on that event date, while Story->list() might lead to 75 queries, Story->list({ WithElements => 1 }) would lead to just one query.

Of course, this assertion should be subject to benchmarking. But I'm pretty sure that Bricolage really just executes way too fucking many queries.

So, if you just use the main attributes of a story (title, URI, description), it'll be pretty fast. If you need to get at any contained objects (source, categories, elements, keywords, etc.), you're going to incur the cost of a lot more database queries.

Best,

David
Re: Reducing your API overhead
> Ideally, there'd be some way to tell Story->list to load all container and field subelements
> in the same query. So if I needed to sort on that event date, while Story->list() might lead
> to 75 queries, Story->list({ WithElements => 1 }) would lead to just one query.
>
> Of course, this assertion should be subject to benchmarking. But I'm pretty sure that Bricolage
> really just executes way too fucking many queries.
>
> So, if you just use the main attributes of a story (title, URI, description),
> it'll be pretty fast. If you need to get at any contained objects (source,

What does Bricolage do differently from a dynamic CMS? All of this is
done in any dynamic system too, on every request, which by that logic would make it unusable.
How different is the data model?
How expensive is loading and executing subelement templates?

I have only worked with a very simple dynamic CMS so I can't answer this, but I'd say
Bricolage publishing shouldn't be much slower than any other dynamic CMS that does
all this on the fly.

The API could have functions that apply a call across an array of stories (a hypothetical Biz::Stories),
so that there would be as many queries as there are distinct elements being accessed.
In your case above, regarding some event date, there would be something like:

  my $allstories = Biz::Stories->new(Story->list(...));
  $allstories->fill_value('some_field');
  my @stories    = $allstories->get_stories();

and then the usual Story interface, get_value(), would work as is.

Just a thought.
Best, Zdravko
Re: Reducing your API overhead
On May 19, 2010, at 11:57 PM, David E. Wheeler wrote:

> The problem is that elements and fields are fetched separately for each container element. Say you have 25 stories you retrieve in a template via Story->list. If all you grab is the URI and title of the stories it's no big deal: That data is loaded when each story is fetched. But as soon as you get into elements it gets expensive.

Thank you, this is kind of what I was expecting.

> Ideally, there'd be some way to tell Story->list to load all container and field subelements in the same query. So if I needed to sort on that event date, while Story->list() might lead to 75 queries, Story->list({ WithElements => 1 }) would lead to just one query.

How hard might something like that be to implement and test?

> So, if you just use the main attributes of a story (title, URI, description), it'll be pretty fast. If you need to get at any contained objects (source, categories, elements, keywords, etc.), you're going to incur the cost of a lot more database queries.

And presumably the more you've nested your elements, the more overhead you're going to incur as the number of queries rises.

Keywords are an interesting one to me. On the one hand, you've got Phillip, who was able to do a moderate amount of publish_another work without much issue using keywords:

http://www.gossamer-threads.com/lists/bricolage/users/34190?search_string=keyword;#34190

And on the other, John, who saw Bricolage publishing thrash horribly under load doing the same sort of thing:

http://www.gossamer-threads.com/lists/bricolage/users/11070?search_string=caching;#11070

I wonder if it would make sense to optimize API calls to keywords, categories and sources separately or in a different way from the Element calls since those are "standard" asset objects and somewhat predictable in their construction.

In regards to this:

On May 20, 2010, at 3:19 AM, Zdravko Balorda wrote:

> I have only worked with a very simple dynamic CMS so I can't answer this, but I'd say
> Bricolage publishing shouldn't be much slower than any other dynamic CMS that does all this on the fly.

I think the key difference is that when you're publishing stuff in Bric, you can easily hit way more objects than a "conventional" dynamic site might. If you're asking for an element two levels deep in a thousand stories in a particular category three times in a single publish, and then doing something with the keywords, and then doing a complex calculation before you hand the data off to the autohandler, that's a lot different than a PHP site handing back a request for a single page.

Of course, WordPress sites fall over and die every single day when they get hit by moderate internet traffic and someone isn't caching effectively. It's not like Bric has a monopoly on inefficiency, nor is it even inherently inefficient. The nature of Bric enables you to flat out *do more* than your garden-variety CMS. Like anything else, this freedom can get you into trouble.

With any dynamically generated content, you're going to have to do the work somewhere at some point, and no matter where you do it there's no single easy answer for streamlining performance. One of Bric's great advantages is being able to output everything to static files, thus moving dynamic content generation to the back end, where it's hit by fewer users, and improving the scalability of the front-end web site. But when things have gotten too slow on the Bric end, the answer has always been to publish some sort of dynamic includes, be it flat-file SSI or something more involved with PHP or Mason. In some instances I think that is the right answer, but in others maybe it shouldn't have to be.

-Matt
Re: Reducing your API overhead
Matt, thank you for an excellent explanation of dynamic CMSes vs. the Bric system.
Bric's concept of arbitrarily structured subelements is brilliant.


In regards to this:

Matthew Rolf wrote:

>> Ideally, there'd be some way to tell Story->list to load all container and field subelements in the same query. So if I needed to sort on that event date, while Story->list() might lead to 75 queries, Story->list({ WithElements => 1 }) would lead to just one query.
>
> How hard might something like that be to implement and test?

I am not a Bric programmer, but I'd guess in advance that it may not be easy, if feasible at all,
mostly because stories don't all have the same elements, and those elements in turn have different structures.
If by some great luck this can be done without looping over queries, then it would improve things.
Otherwise it may result only in rearranging more or less the same number of queries.

Zdravko
Re: Reducing your API overhead
On May 19, 2010, at 11:57 PM, David E. Wheeler wrote:

> The problem is that elements and fields are fetched separately for each container element. Say you have 25 stories you retrieve in a template via Story->list. If all you grab is the URI and title of the stories it's no big deal: That data is loaded when each story is fetched. But as soon as you get into elements it gets expensive.

David, after doing a little reading on my own, I'm wondering if you could clarify your statement a little bit. What exactly is loaded when a story object is fetched through Story->list?

It sounds like if you're listing story objects based on any of the supported lookup keys (including keyword) then that's going to be pretty fast. And if you were just using the title, uri, and description to generate an archive page that gets sorted on publish date, that stuff is all readily at hand. But once you start pulling keywords or categories or elements out of those stories to do stuff, that's where you run the risk of getting bogged down. Is that correct?

And at what point would it be more efficient to look for only story ids as opposed to story objects and then pull various asset pieces based on the id?

-Matt
Re: Reducing your API overhead
On May 20, 2010, at 10:56 AM, Matthew Rolf wrote:

>
> David, after doing a little reading on my own, I'm wondering if you could clarify your statement a little bit. What exactly is loaded when a story object is fetched through Story->list?
>
> It sounds like if you're listing story objects based on any of the supported lookup keys (including keyword) then that's going to be pretty fast. And if you were just using the title, uri, and description to generate an archive page that gets sorted on publish date, that stuff is all readily at hand. But once you start pulling keywords or categories or elements out of those stories to do stuff, that's where you run the risk of getting bogged down. Is that correct?

Yes.

> And at what point would it be more efficient to look for only story ids as opposed to story objects and then pull various asset pieces based on the id?

Um, never, AFAIK. You pretty much always need some of the core attributes (uri, title).

I think it would be pretty hard to change this. The queries for stories are already quite hairy. Ideally, though, to include all subelements (container and field) in a story in a single query, we would need to have an SQL array of RECORDs. So it'd look something like this:

 id | name | primary_uri | elements
----+------+-------------+-----------------------------------------------------------------
  1 | Foo  | /foo/       | {"(1,container,page,0,NULL),(2,field,paragraph,1,\"hi there\")"}

So one row, one story. This story has two elements: the top-level element (parent_id is 0), and a "paragraph" field subelement (its parent is the page element). This is PostgreSQL-specific, but possible by using the ROW() constructor to turn a row from the element table into a RECORD object, and then using array_agg() to aggregate all of the element RECORDs into an array. The Perl code would then have to parse these record objects into hashes and bless them as container and field objects.
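Roughly, the query might look something like this (the table and column names here are made up for illustration, not Bricolage's actual schema; array_agg() requires PostgreSQL 8.4 or later, and depending on the version you may need to cast the ROW() to a named composite type):

  SELECT s.id, s.name, s.primary_uri,
         array_agg(ROW(e.id, e.type, e.key_name, e.parent_id, e.value)) AS elements
    FROM story s
    JOIN element e ON e.story_id = s.id
   GROUP BY s.id, s.name, s.primary_uri;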

I think it would be quite a lot of work to do this. Dealing with permissions might be tricky. And I have no idea if this could be made to work in MySQL without a lot of ugly hackery. But if someone wants to take it on -- or fund me to do it -- be my guest!

Best,

David
Re: Reducing your API overhead
Hi,

On Thu May 20 07:56:31, Matthew Rolf wrote:
> And at what point would it be more efficient to look for only story ids
> as opposed to story objects and then pull various asset pieces based on
> the id?

I think the best starting point here would be to do some profiling and get
some numbers. Christie Wilson did a good talk at Vancouver Perl Mongers
last year on NYTProf along with some other profilers. Slides and
details are here:

http://www.socialtext.net/vanpm/index.cgi?meeting_august_12th_2009

NYTProf is really nice!

But just getting a real handle on where the bottlenecks are would help.
DBIx::Profile is also a real nice module for just getting a log of all
the queries run and timings/counts for them.

The trickiest thing I think is trying to run it outside of
apache/mod_perl, as personally, I find debugging and profiling inside
mod_perl a big pain. Older versions of Mason had debug files:

http://www.masonhq.com/docs/manual/1.05/Devel.html#using_the_perl_debugger

but sadly, that's no longer supported. Was really nice for
debugging/profiling/running through strace/etc.

I'm not sure offhand if there's an easy way to just run a publish from the
shell (not going through SOAP), but bric_queued does it, so it's probably not too
much work. Once you can do this, it's easy to apply regular profiling
techniques.
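
For example, once you have a standalone publish script (the bric_pub name and
--story-id flag here are hypothetical), something like:

  perl -d:NYTProf bric_pub --story-id 1234
  nytprofhtml --open

would give a full call-level breakdown of where the time goes.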

Cheers,

Alex

--
Alex Krohn <alex@gossamer-threads.com>
Re: Reducing your API overhead
On May 20, 2010, at 3:28 PM, Alex Krohn wrote:

> I think the best starting point here would be to do some profiling and get
> some numbers. Christie Wilson did a good talk at Vancouver Perl Mongers
> last year on NYTProf along with some other profilers. Slides and
> details are here:
>
> http://www.socialtext.net/vanpm/index.cgi?meeting_august_12th_2009
>
> NYTProf is really nice!

But useless for benchmarking the database, alas.

> But just getting a real handle on where the bottlenecks are would help.
> DBIx::Profile is also a real nice module for just getting a log of all
> the queries run and timings/counts for them.

Yes. We should update Bricolage to use NYTProf and DBIx::Profile.

> The trickiest thing I think is trying to run it outside of
> apache/mod_perl, as personally, I find debugging and profiling inside
> mod_perl a big pain. Older versions of Mason had debug files:
>
> http://www.masonhq.com/docs/manual/1.05/Devel.html#using_the_perl_debugger
>
> but sadly, that's no longer supported. Was really nice for
> debugging/profiling/running through strace/etc.

NYTProf works under mod_perl.
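
(Via Devel::NYTProf::Apache: roughly, you load it early in httpd.conf --

  PerlModule Devel::NYTProf::Apache

-- and use the NYTPROF environment variable to control where the profile data goes. That's from memory, so check the module's docs.)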

> I'm not sure offhand if there's an easy way to just run a publish from the
> shell (not going through SOAP), but bric_queued does it, so it's probably not too
> much work. Once you can do this, it's easy to apply regular profiling
> techniques.

Yep.

Best,

David
Re: Reducing your API overhead
On May 20, 2010, at 10:00 AM, Zdravko Balorda wrote:

> Matt, thank you for an excellent explanation of dynamic CMSes vs. the Bric system.
> Bric's concept of arbitrarily structured subelements is brilliant.

You're welcome.

On May 20, 2010, at 1:34 PM, David E. Wheeler wrote:

> Yes.

Honestly, if this is what it takes to write the fastest templates possible at this point, then this is the answer: minimize calls to specific elements wherever you can, focus your listing and filtering efforts on supported keys, and lean on the top-level attributes as much as possible.

It's obvious and straightforward, but there it is.

There's a pretty clear bottleneck in bric_queued, which I'm going to submit as a bug/enhancement request. If speed is still an issue, publish more stuff as dynamic includes on the production server side. And if you're really adventurous, move your utility templates into Perl modules.

I wish I had known to use the description fields before. I can think of several uses for just that single field that would speed things up greatly.

As for potential query improvements, I agree with Alex that we should do some systematic profiling before making any kind of call to embark on or fund an improvement. It looks like Scott put something in the contrib directory a long time ago to profile templates, which might help. I had tried to do something with NYTProf a while ago and didn't have time to do it justice. Maybe someone else?

-Matt
Re: Reducing your API overhead
Hi,

> As for potential query improvements, I agree with Alex that we should do
> some systematic profiling before making any kind of call to embark on or
> fund an improvement. It looks like Scott put something in the contrib
> directory a long time ago to profile templates, which might help. I had
> tried to do something with NYTProf a while ago and didn't have time to
> do it justice. Maybe someone else?

If you can get something standalone that can be run from the shell and does
the publishing, I'm happy to do the profiling and analysis for you -- i.e., a
psql dump and a bric_pub that publishes a story (not via SOAP). I can give
you oodles of stats on what it does and where the bottlenecks are.

Cheers,

Alex

--
Alex Krohn <alex@gossamer-threads.com>
Re: Reducing your API overhead
On May 20, 2010, at 7:17 PM, Alex Krohn wrote:

> If you can get something standalone that can be run from the shell and does
> the publishing, I'm happy to do the profiling and analysis for you -- i.e., a
> psql dump and a bric_pub that publishes a story (not via SOAP). I can give
> you oodles of stats on what it does and where the bottlenecks are.

Ok, I'll see what I can do. Many thanks for the motivation.

-Matt