Mailing List Archive

Request for Comments: Cross site data access for Wikidata
Hi all!

The wikidata team has been discussing how to best make data from wikidata
available on local wikis. Fetching the data via HTTP whenever a page is
re-rendered doesn't seem prudent, so we (mainly Jeroen) came up with a
push-based architecture.

The proposal is at
<http://meta.wikimedia.org/wiki/Wikidata/Notes/Caching_investigation#Proposal:_HTTP_push_to_local_db_storage>,
I have copied it below too.

Please have a look and let us know if you think this is viable, and which of the
two variants you deem better!

Thanks,
-- daniel

PS: Please keep the discussion on wikitech-l, so we have it all in one place.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

== Proposal: HTTP push to local db storage ==

* Every time an item on Wikidata is changed, an HTTP push is issued to all
subscribing clients (wikis)
** initially, "subscriptions" are just entries in an array in the configuration.
** Pushes can be done via the job queue.
** pushing is done via the mediawiki API, but other protocols such as PubSub
Hubbub / AtomPub can easily be added to support 3rd parties.
** pushes need to be authenticated, so we don't get malicious crap. Pushes
should be done using a special user with a special user right.
** the push may contain either the full set of information for the item, or just
a delta (diff) + hash for integrity check (in case an update was missed).

* When the client receives a push, it does two things:
*# write the fresh data into a local database table (the local wikidata cache)
*# invalidate the (parser) cache for all pages that use the respective item (for
now we can assume that we know this from the language links)
*#* if we only update language links, the page doesn't even need to be
re-parsed: we just update the languagelinks in the cached ParserOutput object.

* when a page is rendered, interlanguage links and other info is taken from the
local wikidata cache. No queries are made to wikidata during parsing/rendering.

* In case an update is missed, we need a mechanism for requesting a full purge
and re-fetch of all data from the client side, rather than just waiting for the
next push, which might take a very long time to happen.
** There needs to be a manual option for when someone detects this. Maybe
action=purge can be made to do this. Simple cache invalidation, however,
shouldn't pull info from wikidata.
** A time-to-live could be added to the local copy of the data, so that it is
refreshed by a periodic pull and does not stay stale indefinitely after a
failed push (see the sketch below).
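A very rough sketch of what the client side could look like, just to make the
idea concrete (plain Python pseudo-code; the data layout, hash scheme and
helper names are made up for illustration, not actual Wikibase code):

import hashlib
import json
import time

class LocalWikidataCache:
    """Local cache on the client wiki (a dict stands in for the db table)."""

    def __init__(self):
        self.items = {}   # item_id -> (data, cached_at)
        self.usage = {}   # item_id -> set of local page titles using the item

    def handle_push(self, push, parser_cache, refetch):
        """Apply one push: store the data, then invalidate affected pages."""
        item_id = push["item"]
        if "full" in push:
            data = push["full"]
        else:
            # delta push: apply the diff, then verify against the supplied hash
            data = dict(self.items[item_id][0], **push["diff"])
            digest = hashlib.sha1(
                json.dumps(data, sort_keys=True).encode()).hexdigest()
            if digest != push["hash"]:
                data = refetch(item_id)   # an update was missed: full re-fetch
        self.items[item_id] = (data, time.time())      # 1) local wikidata cache
        for page in self.usage.get(item_id, ()):       # 2) parser cache purge
            parser_cache.invalidate(page)

    def refresh_if_stale(self, item_id, refetch, ttl=24 * 3600):
        """Time-to-live fallback, so a failed push can't leave the copy stale."""
        data, cached_at = self.items[item_id]
        if time.time() - cached_at > ttl:
            self.items[item_id] = (refetch(item_id), time.time())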

=== Variation: shared database tables ===

Instead of having a local wikidata cache on each wiki (which may grow big - a
first guesstimate of Jeroen and Reedy is up to 1TB total, for all wikis), all
client wikis could access the same central database table(s) managed by the
wikidata wiki.

* this is similar to the way the globalusage extension tracks the usage of
commons images
* whenever a page is re-rendered, the local wiki would query the table in the
wikidata db. This means a cross-cluster db query whenever a page is rendered,
instead of a local query (see the sketch below).
* the HTTP push mechanism described above would still be needed to purge the
parser cache when needed. But the push requests would not need to contain the
updated data, they may just be requests to purge the cache.
* the ability for full HTTP pushes (using the mediawiki API or some other
interface) would still be desirable for 3rd party integration.

* This approach greatly lowers the amount of space used in the database
* it doesn't change the number of http requests made
** it does however reduce the amount of data transferred via http (but not by
much, at least not compared to pushing diffs)
* it doesn't change the number of database requests, but it introduces
cross-cluster requests
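Roughly, the only difference at render time would be which connection the
lookup goes over (sketch only; the table and column names are invented):

import sqlite3  # stand-in for whatever db abstraction is actually used

def get_item_for_page(conn, item_id):
    """Fetch the item data a page needs while it is being rendered."""
    # With per-wiki caches, `conn` is the local wiki's own database.
    # With shared tables, `conn` is instead a cross-cluster connection to the
    # central wikidata database; the query itself stays the same.
    row = conn.execute(
        "SELECT item_data FROM wikidata_items WHERE item_id = ?", (item_id,)
    ).fetchone()
    return row[0] if row else None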



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
I think it would be much better if the local wikis that are supposed to access
this data had some sort of client extension which would allow them to render
the content using the db of wikidata. That would be much simpler and faster


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
I mean, in simple words:

Your idea: when the data on wikidata is changed the new content is
pushed to all local wikis / somewhere

My idea: local wikis retrieve data from wikidata db directly, no need
to push anything on change


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On Mon, Apr 23, 2012 at 10:07 AM, Petr Bena <benapetr@gmail.com> wrote:
> I think it would be much better if the local wikis that are supposed to access
> this data had some sort of client extension which would allow them to render
> the content using the db of wikidata. That would be much simpler and faster
>

I agree with Petr here. I think doing it like we do FileRepo stuff would make
the most sense: have an abstract base that can either connect directly via the
DB and skip those HTTP requests (for in-cluster usage), or go via the API
(for 3rd-party sites).
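Roughly (just a sketch; the class names, table layout and API URL format are
all made up, not existing code):

from abc import ABC, abstractmethod
import json
import urllib.request

class WikidataRepoBase(ABC):
    """Abstract access to the wikidata repo, in the spirit of FileRepo."""

    @abstractmethod
    def get_item(self, item_id):
        ...

class DBWikidataRepo(WikidataRepoBase):
    """In-cluster: read the repo's tables directly, no HTTP involved."""

    def __init__(self, conn):
        self.conn = conn

    def get_item(self, item_id):
        row = self.conn.execute(
            "SELECT item_data FROM wikidata_items WHERE item_id = ?", (item_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None

class APIWikidataRepo(WikidataRepoBase):
    """3rd-party sites: fetch the same data over the web API."""

    def __init__(self, api_url):
        self.api_url = api_url

    def get_item(self, item_id):
        with urllib.request.urlopen(self.api_url + "?item=" + item_id) as resp:
            return json.load(resp)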

-Chad

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 23.04.2012 16:09, Petr Bena wrote:
> I mean, in simple words:
>
> Your idea: when the data on wikidata is changed the new content is
> pushed to all local wikis / somewhere
>
> My idea: local wikis retrieve data from wikidata db directly, no need
> to push anything on change

Well, the local wiki still needs to notice that something changed on wikidata.
That would require some sort of push, even if that push is just a purge. So this
would mean pushing *and* pulling, which makes things more complex instead of
simpler. Or am I missing something?

Alternatively, one could poll for changes regularly. That's a ton of overhead,
though: the majority of pages will need to be kept in sync with wikidata,
because they have at least their languagelinks there.

-- daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 23/04/12 14:45, Daniel Kinzler wrote:
> *#* if we only update language links, the page doesn't even need to be
> re-parsed: we just update the languagelinks in the cached ParserOutput object.

It's not that simple; for instance, there may be several ParserOutputs
for the same page. On the bright side, you probably don't need it. I'd
expect that if interwikis are handled through wikidata, they are
completely replaced through a hook, so no need to touch the ParserOutput
objects.


> *# invalidate the (parser) cache for all pages that use the respective item (for
> now we can assume that we know this from the language links)
And in such case, you don't need to invalidate the parser cache. Only if
it was factual data embedded into the page.


I think a save/purge shall always fetch the data. We can't store the
copy in the parsed object.
What we can do is to fetch it from a local cache or directly from the
origin one.

You mention the cache for the push model, but I think it deserves a
clearer separation.



> === Variation: shared database tables ===
> (...)
> * This approach greatly lowers the amount of space used in the database
> * it doesn't change the number of http requests made
> ** it does however reduce the amount of data transferred via http (but not by
> much, at least not compared to pushing diffs)
> * it doesn't change the number of database requests, but it introduces
> cross-cluster requests

You'd probably also want multiple dbs (let's call them WikiData
repositories), partitioned by content (and its update frequency). You
could then use different frontends (as Chad says, "similar to FileRepo").
So, a WikiData repository with the atom properties of each element would
happily live in a dba file. Interwikis would have to be on a MySQL db, etc.


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 23.04.2012 17:28, Platonides wrote:
> On 23/04/12 14:45, Daniel Kinzler wrote:
>> *#* if we only update language links, the page doesn't even need to be
>> re-parsed: we just update the languagelinks in the cached ParserOutput object.
>
> It's not that simple; for instance, there may be several ParserOutputs
> for the same page. On the bright side, you probably don't need it. I'd
> expect that if interwikis are handled through wikidata, they are
> completely replaced through a hook, so no need to touch the ParserOutput
> objects.

I would go that way if we were just talking about languagelinks. But we have to
provide for phase II (infoboxes) and III (automated lists) too. Since we'll have
to re-parse in most cases anyway (and parsing pages without infoboxes tends to
be cheaper anyway), I see no benefit in spending time on inventing a way to
bypass parsing. It's tempting, granted, but it seems a distraction atm.

>> *# invalidate the (parser) cache for all pages that use the respective item (for
>> now we can assume that we know this from the language links)
> And in such case, you don't need to invalidate the parser cache. Only if
> it was factual data embedded into the page.

Which will be a very frequent case in the next phase: most infoboxes will (at
some point) work like that.

> I think a save/purge shall always fetch the data. We can't store the
> copy in the parsed object.

well, for languagelinks, we already do, and will probably keep doing it. Other
data, which will be used in the page content, shouldn't be stored in the parser
output. The parser should take them from some cache.

> What we can do is to fetch it from a local cache or directly from the
> origin one.

Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like
plugins for that, sure. But:

The real question is how purging and updating will work. Pushing? Polling?
Purge-and-pull?

> You mention the cache for the push model, but I think it deserves a
> clearer separation.

Can you explain what you have in mind?

> You'd probably also want multiple dbs (let's call them WikiData
> repositories), partitioned by content (and its update frequency). You
> could then use different frontends (as Chad says, "similar to FileRepo").
> So, a WikiData repository with the atom properties of each element would
> happily live in a dba file. Interwikis would have to be on a MySQL db, etc.

This is what I was aiming at with the DataTransclusion extension a while back.

But currently, we are not building a tool for including arbitrary data sources
in wikipedia. We are building a central database for maintaining factual
information. Our main objective is to get that done.

A design that is flexible enough to easily allow for future inclusion of other
data sources would be nice. As long as the abstraction doesn't get in the way.

Anyway, it seems that it boils down to this:

1) The client needs some (abstracted?) way to access the repository/repositories
2) The repo needs to be able to notify the client sites about changes, be it via
push, purge, or polling.
3) We'll need a local cache or cross-site database access.

So, which combination of these techniques would you prefer?

-- daniel


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 23/04/12 18:42, Daniel Kinzler wrote:
> On 23.04.2012 17:28, Platonides wrote:
>> On 23/04/12 14:45, Daniel Kinzler wrote:
>>> *#* if we only update language links, the page doesn't even need to be
>>> re-parsed: we just update the languagelinks in the cached ParserOutput object.
>>
>> It's not that simple; for instance, there may be several ParserOutputs
>> for the same page. On the bright side, you probably don't need it. I'd
>> expect that if interwikis are handled through wikidata, they are
>> completely replaced through a hook, so no need to touch the ParserOutput
>> objects.
>
> I would go that way if we were just talking about languagelinks. But we have to
> provide for phase II (infoboxes) and III (automated lists) too. Since we'll have
> to re-parse in most cases anyway (and parsing pages without infoboxes tends to
> be cheaper anyway), I see no benefit in spending time on inventing a way to
> bypass parsing. It's tempting, granted, but it seems a distraction atm.

Sure, but in those cases you need to reparse the full page. No need for
tricks modifying the ParserOutput. :)
So, if you want to skip the reparsing for interwikis, fine, but just use a hook.



>> I think a save/purge shall always fetch the data. We can't store the
>> copy in the parsed object.
>
> well, for languagelinks, we already do, and will probably keep doing it. Other
> data, which will be used in the page content, shouldn't be stored in the parser
> output. The parser should take them from some cache.

The ParserOutput is a parsed representation of the wikitext. The cached
wikidata interwikis shouldn't be stored there (or at least not only there,
since it would only hold the interwikis as they were at the last full render).



>> What we can do is to fetch it from a local cache or directly from the
>> origin one.
>
> Indeed. Local or remote, DB directly or HTTP... we can have FileRepo-like
> plugins for that, sure. But:
>
> The real question is how purging and updating will work. Pushing? Polling?
> Purge-and-pull?
>
>> You mention the cache for the push model, but I think it deserves a
>> clearer separation.
> Can you explain what you have in mind?

I mean, they are based on the same concept. What really matters is how
things reach the db.
I'd have the WikiData db replicated to {{places}}.
For WMF, all wikis could connect directly to the main instance, or have a
slave "assigned" to each cluster...
Then on each page render, the variables used could be checked against the
latest version (unless checked in the last x minutes) and trigger a rerender
if different.

So, suppose a page uses the fact
Germany{capital:"Berlin";language:"German"};
it would store that along with the version of WikiData used (e.g. WikiData 2.0,
Germany 488584364). A rough sketch in code follows after the steps below.

When going to show it, it would check:
1) Is the latest WikiData version newer than 2.0? (No-> go to 5)
2) Is the Germany module newer than 488584364? (No-> Store that it's up
to date to WikiData 3, go to 5)
3) Fetch Germany data. If the used data hasn't changed, update the
metadata. Go to 5.
4) Re-render the page.
5) Show contents.
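In code, roughly (sketch only; the page structure and the repo interface are
invented to match the example above):

def show_page(page, repo, rerender):
    """Steps 1-5 above; `page` remembers the versions used at the last render."""
    meta = page["meta"]          # e.g. {"wikidata": 2.0, "Germany": 488584364}
    if repo.latest_version() > meta["wikidata"]:                       # 1)
        for module, used_rev in meta.items():
            if module == "wikidata":
                continue
            if repo.module_revision(module) > used_rev:                # 2)
                data = repo.fetch(module)                              # 3)
                if data != page["facts"][module]:
                    rerender(page)                                     # 4)
                    break
                meta[module] = repo.module_revision(module)
        meta["wikidata"] = repo.latest_version()
    return page["html"]                                                # 5)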

As for actively purging the page contents, that's interesting only for
the anons.
You'd need a script able to replay a purge for a range of WikiData changes.
That would basically perform the above steps, but doing the render through
the job queue.
A normal wiki would call those functions while replicating, but wikis with a
shared db (or ones that drop in full files with newer data) would run it
standalone (plus as a utility for screw-ups).


>> You'd probably also want multiple dbs (let's call them WikiData
>> repositories), partitioned by content (and its update frequency). You
>> could then use different frontends (as Chad says, "similar to FileRepo").
>> So, a WikiData repository with the atom properties of each element would
>> happily live in a dba file. Interwikis would have to be on a MySQL db, etc.
>
> This is what I was aiming at with the DataTransclusion extension a while back.
>
> But currently, we are not building a tool for including arbitrary data sources
> in wikipedia. We are building a central database for maintaining factual
> information. Our main objective is to get that done.

Not arbitrary, but having different sources (repositories), even if they
are under the control of the same entity. Mostly interesting for separating
slow- and fast-changing data, although I'm sure reusers would find more use
cases, such as only downloading the db for the section they need.


> A design that is flexible enough to easily allow for future inclusion of other
> data sources would be nice. As long as the abstraction doesn't get in the way.
>
> Anyway, it seems that it boils down to this:
>
> 1) The client needs some (abstracted?) way to access the repository/repositories
> 2) The repo needs to be able to notify the client sites about changes, be it via
> push, purge, or polling.
> 3) We'll need a local cache or cross-site database access.
>
> So, which combination of these techniques would you prefer?
>
> -- daniel

I'd use a pull-based model. That seems to fit better with the current
MediaWiki model. But it isn't too relevant at this time (or maybe you have
advanced a lot by now!).


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
Thanks for the input Platonides.

I'll have to re-read your comments to fully understand them. For now, just a
quick question:

> When going to show it, it would check:
> 1) Is the latest WikiData version newer than 2.0? (No-> go to 5)
> 2) Is the Germany module newer than 488584364? (No-> Store that it's up
> to date to WikiData 3, go to 5)
> 3) Fetch Germany data. If the used data hasn't changed, update the
> metadata. Go to 5.
> 4) Re-render the page.
> 5) Show contents.

You think making a db query to check if the data is up to date, every time the
page is *viewed*, is feasible? I would have thought this prohibitively
expensive... it would be nice and simple, of course.

The approach of marking the rendered page data as stale (using page_touched)
whenever the data changes seems much more efficient. Though it does introduce
some additional complexity.

Also, checking on every page view is out of the question for external sites,
right? So we'd still need a push interface for these...

-- daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 04/23/2012 02:45 PM, Daniel Kinzler wrote:
> * In case an update is missed, we need a mechanism for requesting a full
> purge and re-fetch of all data from the client side, rather than just waiting
> for the next push, which might take a very long time to happen.

Once the data set becomes large and the change rate drops, this would be
a very expensive way to catch up. You could use sequence numbers for
changes to allow clients to detect missed changes and selectively
retrieve all changes since the last contact.
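On the client that could look something like this (sketch; the sequence
numbering and the change-feed call are assumptions, not an existing interface):

def apply_notification(state, notification, fetch_changes_since):
    """Apply one pushed change, or catch up if its sequence number shows a gap."""
    expected = state["last_seq"] + 1
    if notification["seq"] == expected:
        apply_change(state, notification)
        state["last_seq"] = notification["seq"]
    elif notification["seq"] > expected:
        # one or more pushes were missed: pull everything since the last contact
        for change in fetch_changes_since(state["last_seq"]):
            apply_change(state, change)
            state["last_seq"] = change["seq"]
    # notifications with seq <= last_seq are duplicates and can be ignored

def apply_change(state, change):
    state.setdefault("items", {})[change["item"]] = change["data"]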

In general, best-effort push / change notifications with bounded waiting
for slow clients combined with an efficient way to catch up should be
more reliable than push only. You don't really want to do a lot of
buffering for slow clients while pushing.

If you are planning to render large, standardized page fragments with
little to no input from the wiki page, then it might also become
interesting to directly load fragments using JS in the browser or
through a proxy with ESI-like capabilities for clients without JS.

Gabriel


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
Why not just use a queue? We use the job queue for this right now for
nearly the same purpose. The job queue isn't amazing, but it works.
Maybe someone should replace this with a better system while they are
at it?
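Conceptually something like this, I guess (sketch; a plain in-process queue
standing in for the real job queue):

import queue

change_queue = queue.Queue()

def on_item_change(item_id, subscribers):
    """Repo side: enqueue one notification job per subscribing wiki."""
    for wiki in subscribers:
        change_queue.put({"wiki": wiki, "item": item_id})

def run_pending_jobs(apply_update):
    """Client side: drain the jobs, updating the local cache / purging pages."""
    while not change_queue.empty():
        job = change_queue.get()
        apply_update(job["wiki"], job["item"])
        change_queue.task_done()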


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 23/04/12 19:34, Daniel Kinzler wrote:
> You think making a db query to check if the data is up to date, every time the
> page is *viewed*, is feasible? I would have thought this prohibitively
> expensive... it would be nice and simple, of course.
>
> The approach of marking the rendered page data as stale (using page_touched)
> whenever the data changes seems much more efficient. Though it does introduce
> some additional complexity.

Viewed by a logged-in user, i.e. the same case in which we check page_touched.
Also note we are checking against a local cache. The way it's updated is
unspecified :)


> Also, checking on every page view is out of the question for external sites,
> right? So we'd still need a push interface for these...

I think they'd use a cache with a configured ttl. So they wouldn't
actually be fetching it on each view, only every X hours.
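For example (sketch; the ttl is just whatever the external site configures):

import time

def get_item(cache, item_id, fetch_from_wikidata, ttl=6 * 3600):
    """Serve from a local cache; only re-fetch when the copy is older than ttl."""
    entry = cache.get(item_id)
    if entry is None or time.time() - entry["fetched_at"] > ttl:
        entry = {"data": fetch_from_wikidata(item_id), "fetched_at": time.time()}
        cache[item_id] = entry
    return entry["data"]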


_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
Hoi,
One of the KEY reasons to have Wikidata is that it DOES update when there
is a change in the data. For instance, how many Wikipedias have an article
on my home town of Almere that says that Mrs Jorritsma is the mayor ... She
will not be mayor forever ... There are many villages, towns and cities like
Almere.

I positively do not like the idea of all the wasted effort when a pushy
Wikidata can be and should be the solution.
Thanks,
Gerard

On 23 April 2012 16:09, Petr Bena <benapetr@gmail.com> wrote:

> I mean, in simple words:
>
> Your idea: when the data on wikidata is changed the new content is
> pushed to all local wikis / somewhere
>
> My idea: local wikis retrieve data from wikidata db directly, no need
> to push anything on change
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Request for Comments: Cross site data access for Wikidata
On 23.04.2012 20:06, Ryan Lane wrote:
> Why not just use a queue?

Well yes, a queue... but what exactly would be queued, and what precisely would
the consumer do? Where's the queue? On the repo or on the client?

-- daniel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l