Mailing List Archive

RDFa and Microdata in MediaWiki
Duesentrieb checked in RDFa support for MediaWiki in r58712:

http://www.mediawiki.org/wiki/Special:Code/MediaWiki/58712

I discussed this with him at some length, and Tim commented on how it
ties into the parser. I'd like to discuss this a bit more broadly
because we're talking about extending wikitext -- whatever markup we
allow on Wikipedia (and in this case, particularly on Commons) at the
next scap is probably going to have to be allowed forever by default
in MediaWiki, because everyone will start using it and pages will
break if we disable it.

RDFa is a way to embed data in HTML more robustly than with attributes
like class and title, which are reserved for author use or have
existing functionality. It allows you to specify an external
vocabulary that adds some semantics to your page that HTML is not
capable of expressing by itself. RDFa is based on the RDF standard,
and is relatively old. Microdata is a new competing standard, created
last year as part of HTML5, that aims to be much simpler to
use.

The major use case we have is marking up Commons image licenses.
Either RDFa or Microdata could allow machines to more easily tell what
licenses the images we use are under. But in the long term, it seems
likely that only one of these technologies will win, and the other
will die. We don't want to have to support the loser forever. So IMO
we should choose the better one and go with that alone.

Now, which to choose? RDFa is better-established, and the W3C is
still attached to it, but Microdata has much greater support among the
parties that matter, including Google, Mozilla, Apple, and Opera (as
judged from discussions in the WHATWG and W3C). It's a lot more
concise and simpler to use, is better integrated into HTML, and can
represent any semantics we'd want. At the bottom of this post is an
example exhibiting how much simpler microdata is. Both RDFa+HTML and
Microdata are Working Drafts at the W3C right now, although RDFa in
XHTML1 (which we won't be using for much longer) is a Recommendation.

I should note that currently Google and a couple of others support
RDFa but not Microdata. But come on -- we're Wikipedia. Google
already screen-scrapes our templates to figure out what licenses we
use anyway; parsing microdata has got to be easier. We shouldn't let
existing market shares deter us from picking the better technology.
My personal opinion on this is that we should enable Microdata by
default (which is much less intrusive than enabling RDFa -- just
whitelist a few extra attributes) and encourage Commons to use that
instead of RDFa. We can leave RDFa support in, but disabled by
default. What does everyone else think?
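
To make "just whitelist a few extra attributes" concrete, here is a rough
Python sketch of an attribute whitelist. It is not MediaWiki's actual
sanitizer code; BASE_ATTRS, MICRODATA_ATTRS and filter_attributes are
hypothetical names, the base list is only an illustrative subset, and the
Microdata list contains just the three attributes used in the example at
the bottom of this post.

```python
# Hypothetical sketch; none of these names come from MediaWiki itself.
BASE_ATTRS = {"id", "class", "title", "lang", "dir", "style"}
MICRODATA_ATTRS = {"itemscope", "itemtype", "itemprop"}


def filter_attributes(attrs, enable_microdata=True):
    """Return only the (name, value) pairs the whitelist allows."""
    allowed = (BASE_ATTRS | MICRODATA_ATTRS) if enable_microdata else BASE_ATTRS
    return [(name, value) for name, value in attrs if name.lower() in allowed]
```

With enable_microdata=False the same function drops the three extra
attributes again, which is roughly what "disabled by default" would mean
for a site that does not want them.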


== Example of RDFa vs. Microdata ==
Suppose we have the following markup right now:

[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p>EmeryMolyneux-terrestrialglobe-1592-20061127.jpg by Bob Smith is
licensed under a <a
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]

Sample RDFa code to say an image is under a CC-BY-SA 3.0 license seems
to be something like this, based off the license generator on the CC
website:

[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" id="mw-image">
...
<p><span xmlns:dc="http://purl.org/dc/elements/1.1/"
href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]

This adds an id to the image, rel="license" to the license link, and
two extra tags with lots of lengthy attributes. To be valid RDFa, we
would need to add further markup somewhere, at least a version attribute on
the <html> tag on every page AFAIK. Equivalent microdata is this:

[[
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" itemprop="work">
...
<p><span itemprop="title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]

This adds two attributes to an ancestor to indicate that the contents
form a work -- these could be moved to lower elements if desired,
AFAICT, but then they'd have to be duplicated. Instead of adding an
id to the <img>, it uses itemprop="work" to directly say it's the work
being referred to. Instead of <span
xmlns:dc="http://purl.org/dc/elements/1.1/"
href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
rel="dc:type">, we have <span itemprop="title">. Instead of <span
xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
property="cc:attributionName" rel="cc:attributionURL">, we have <span
itemprop="author">.

Overall, I think it's clear from this example that microdata is much
more concise and also more coherent. It's easy to see from this
example exactly how the microdata model works: you have a bunch of
stuff grouped as an item using itemscope, itemtype tells you what type
of item it is, and then itemprop tells you what role each piece
has. It's barely longer than the un-annotated markup. RDFa, by
contrast, is a mess of boilerplate that's impossible to understand
unless you actually read the specs. Microdata's syntax has actually
been refined by a usability study run on it by Google.
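
The itemscope / itemtype / itemprop model described above can be sketched
with Python's standard html.parser. This is a deliberately simplified
reading of the model, not the extraction algorithm from the spec: it
handles only the first item on the page, ignores nested items, and assumes
a text property contains no child elements. WorkItemParser, URL_ATTR and
the example.org image URL are made up for illustration.

```python
from html.parser import HTMLParser

# Elements whose property value is taken from a URL attribute, not from text.
URL_ATTR = {"a": "href", "img": "src"}


class WorkItemParser(HTMLParser):
    """Collect the properties of the first itemscope element found."""

    def __init__(self):
        super().__init__()
        self.item = None        # {"type": ..., "properties": {...}}
        self._text_prop = None  # name of the property currently collecting text
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.item is None:
            if "itemscope" in attrs:
                self.item = {"type": attrs.get("itemtype"), "properties": {}}
            return
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag in URL_ATTR:
            self.item["properties"][prop] = attrs.get(URL_ATTR[tag])
        else:
            self._text_prop, self._text = prop, []

    def handle_data(self, data):
        if self._text_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: any end tag closes the text property being collected.
        if self._text_prop is not None:
            self.item["properties"][self._text_prop] = "".join(self._text)
            self._text_prop = None


SAMPLE = """
<div itemscope="" itemtype="http://n.whatwg.org/work">
<img src="http://example.org/globe.jpg" itemprop="work">
<p><span itemprop="title">globe.jpg</span>
by <span itemprop="author">Bob Smith</span> is licensed under a
<a itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">CC BY-SA 3.0 US</a>.</p>
</div>
"""

parser = WorkItemParser()
parser.feed(SAMPLE)
print(parser.item)
```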

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Fri, Jan 15, 2010 at 10:47 AM, Aryeh Gregor
<Simetrical+wikilist@gmail.com> wrote:
> Sample RDFa code to say an image is under a CC-BY-SA 3.0 license seems
> to be something like this, based off the license generator on the CC
> website:
>
> [[
> <div id="bodyContent">
> ...
> <img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
> width="640" height="480" id="mw-image">
> ...
> <p><span xmlns:dc="http://purl.org/dc/elements/1.1/"
> href="http://purl.org/dc/dcmitype/StillImage" property="dc:title"
> rel="dc:type">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
> by <span xmlns:cc="http://creativecommons.org/ns#" href="#mw-image"
> property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
> is licensed under a <a rel="license"
> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
> Commons Attribution-Share Alike 3.0 United States License</a>.</p>
> ]]

It was pointed out in #whatwg on freenode that to be fair, I should
leave off the fact that the work being pointed to is a still image
(since Microdata does). On the other hand, the span needs to point to
the actual URL of the image, not just an ID, so I *think* this is the
markup I actually wanted:

[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p><span xmlns:dc="http://purl.org/dc/elements/1.1/"
property="dc:title">EmeryMolyneux-terrestrialglobe-1592-20061127.jpg</span>
by <span xmlns:cc="http://creativecommons.org/ns#"
href="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
property="cc:attributionName" rel="cc:attributionURL">Bob Smith</span>
is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]

This is about as long as before, but it might still be wrong. The
general points I made are still accurate, anyway.

Second, it was pointed out that the RDFa example here mixes two
existing vocabularies, while the Microdata example uses a vocabulary
specifically designed for our use-case. However, I think this is fair
-- we'd likely use the standard applicable vocabularies in each case,
and the Microdata vocabulary is simpler for our primary use-case.

Third of all, it was also pointed out that RDFa 1.1 is supposed to
be simpler. But RDFa 1.1 probably has about the same deployment right
now as Microdata, i.e., roughly none, so that gets rid of RDF's
biggest advantage.

But in the end, personal opinion aside, Microdata looks like the
technology with a future right now, for good reason. The consensus of
almost everyone I've talked to who's not precommitted to RDF is that
Microdata is the better technology. Since existing deployment isn't a
huge issue for us given our size -- we'll become one of the biggest
web users of whichever technology we choose -- I think we should go
with Microdata as the apparent better solution, unless anyone has
reasons not to.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:
> This is about as long as before, but it might still be wrong. The
> general points I made are still accurate, anyway.

The general points that you made were riddled with technical
inaccuracies, bad advice, and if implemented by the MediaWiki community,
would have resulted in semantic data that would have been ambiguous at
best and erroneous at worst. I don't know if you intended the tone of
your e-mail in the way that I read it, but it came off as purposefully
misleading based on the discussions that both you and I have had as
members of the HTMLWG and WHATWG. I'll address the technical and factual
errors that I believe have been made in your posts as well as provide
alternative guidance.

Just to briefly introduce myself to this community, I do standards work
in a variety of online communities including the Microformats community
(lead editor for hAudio, hMedia and hVideo), contract my expertise to
the music industry and I am also an Invited Expert to the W3C's Semantic
Web Deployment Working Group and co-chair of the upcoming RDFa Working
Group and editor of the HTML5+RDFa spec. The company I founded is
interested in expressing digital content online via semantic languages
and builds open source software for the creation and standardization of
copyright-aware, DRM-free, peer-to-peer networks.

For guidance on how to implement semantic markup in a CMS, we might want
to look at the Drupal Community, who have done a superb job of
integrating RDFa into their platform. They expect several hundred
thousand websites to start using RDFa within the next year or two.

One lesson that we learned during implementation of RDFa in Drupal is
that it is helpful for CMS designers to pre-define vocabularies that are
usable with their CMS systems if manual markup is necessary. Most markup
of both Microdata and RDFa should also be left to the CMS code unless
there is a very good reason to not do so.

If you want to allow manual markup of RDFa, MediaWiki should probably
pre-define at least Dublin Core (used to describe creative works), FOAF
(used to describe people and organizations), and Creative Commons (used
to describe licenses). There are many RDF vocabularies to choose from
and Wikipedia might consider creating a few of their own. Pre-defining
vocabularies would greatly simplify the markup in case someone would
want to mark up something by hand.

Let's revisit Aryeh's example:

Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a
Creative Commons Attribution-Share Alike 3.0 United States License.

The above could be marked up in RDFa, with pre-defined vocabs, like so:

<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
typeof="dctype:StillImage">
<span property="dc:title">Emery Molyneux Terrestrial Globe</span>
by <a rel="cc:attributionURL" href="http://example.org/bob/"
property="cc:attributionName">Bob Smith</a>
is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>

this would produce the following triples (I haven't expanded the CURIEs
out in order to make it easier to read):

<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg>
rdf:type
dctype:StillImage .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg>
dc:title
"Emery Molyneux Terrestrial Globe" .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg>
cc:attributionName
"Bob Smith" .
<EmeryMolyneux-terrestrialglobe-1592-20061127.jpg>
xhv:license
<http://creativecommons.org/licenses/by-sa/3.0/us/> .
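
For readers who want the expanded form, CURIE expansion against a
pre-defined prefix table is mechanical. The sketch below is illustrative
only: which prefixes MediaWiki would actually pre-define is an assumption,
and expand_curie is a hypothetical helper. The dc, dctype, cc and xhv
namespace URIs are the ones that appear elsewhere in this thread; rdf is
the standard RDF syntax namespace.

```python
# Assumed prefix table; a real deployment would choose its own set.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dctype": "http://purl.org/dc/dcmitype/",
    "cc": "http://creativecommons.org/ns#",
    "xhv": "http://www.w3.org/1999/xhtml/vocab#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:reference' into a full URI, or raise ValueError."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + reference


for term in ("rdf:type", "dctype:StillImage", "dc:title",
             "cc:attributionName", "xhv:license"):
    print(term, "->", expand_curie(term))
```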

So, four pieces of data, which is pretty good considering the
compactness of the HTML code. The Microdata looks like this:

<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
...
<img
src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" itemprop="work">
...
<p><span itemprop="title">Emery Molyneux Terrestrial Globe</span>
by <span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>

The compactness of the markup between Microdata and RDFa is more or less
the same in this particular example. There are some things that are
easier to express in Microdata and there are some things that are easier
to express in RDFa. We get the following Microdata out:

type http://n.whatwg.org/work
work http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
title "Emery Molyneux Terrestrial Globe"
author "Bob Smith"
license http://creativecommons.org/licenses/by-sa/3.0/us/

So, we get more-or-less the same number of data items out, but there is
a problem. What does "title" mean in the semantic sense? Does it mean
"job title" or does it mean "work title"? The term "title" in this case
is ambiguous.

Concern #1:

Ambiguity is a big problem when it comes to semantics - make sure that
if this community does use Microdata markup, that you fully qualify
terms. It is far easier to be ambiguous in Microdata than it is in RDFa.
So, instead of using itemprop="title" you should be using
itemprop="http://purl.org/dc/terms/title" - which will inflate the
markup required for Microdata, but is necessary when it comes to
classifying this information accurately for semantic data processors
(such as via SPARQL or higher-level reasoning agents).

Concern #2:

Getting Microdata and RDFa markup correct is easier if there are
templates or if the semantic markup is performed automatically by the
CMS based on a pre-defined form. For example,
http://en.wikipedia.org/wiki/Augustus, note the Infobox on the
right. It would be much better for the RDFa markup to happen
automatically via MediaWiki's template process than for it to be marked
up by hand.

Concern #3:

Intentional or not, Aryeh has painted RDFa in a negative light by not
outlining a number of points related to adoption and both RDFa and
Microdata's current status in the HTML Working Group. Adopting either
RDFa or Microdata in an HTML5 document type would be premature
at this time because neither has progressed past the Editor's Draft
stage yet. Either is subject to change as far as HTML5 is concerned
and we really don't want you to ship HTML5 features before they've had
a chance to solidify a bit more.

However - XHTML1+RDFa is a published W3C Recommendation and it is safe
to use it for deployment. Google[1] is actively indexing RDFa today as
is Yahoo[2]. Sites such as Digg, Whitehouse.gov, the UK Government, The
Public Library of Science, and O'Reilly are
high-profile sites that publish their pages using RDFa. Data formats
such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of
their language. Best Buy saw a 30% traffic increase after publishing
their pages in RDFa using the GoodRelations vocabulary. I'm sure
everyone here is aware of dbpedia.org[3] and Freebase[4] - which use RDF
as a semantic representation format. dbpedia, which gets its data from
Wikipedia, shows 479 million triples available - so that
should give you folks some idea of the treasure trove of immediately
extractable semantic data we're talking about.

Make no mistake - RDFa has very strong deployment at this point and it
will continue to grow past 100,000+ sites with the upcoming release of
Drupal 7.

Concern #4:

While I can't fault Aryeh's enthusiasm, I am now concerned that there
may be questions in this community that are going unanswered related to
RDFa and Microdata. I hope this will be a deliberate process as it is
easy to get semantic data markup wrong (regardless of the implementation
language - Microformats, Microdata or RDFa).

I hope that those that have an interest in semantic data will discuss
concerns and ask us about the lessons we've learned when implementing
metadata markup. The best place to send RDFa development questions at
the moment is:

http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/

We have a very friendly community that would love to answer any
questions that this community may have related to semantic data markup.
Please do respond to me directly or in this thread if you have lingering
concerns or questions - either the RDFa community or I will do our best
to answer any questions.

-- manu

[1]http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html
[2]http://developer.yahoo.net/blog/archives/2008/09/searchmonkey_support_for_rdfa_enabled.html
[3]http://en.wikipedia.org/wiki/DBpedia#Example
[4]http://en.wikipedia.org/wiki/Freebase_(database)

--
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: Monarch - Next Generation REST Web Services
http://blog.digitalbazaar.com/2009/12/14/monarch/

Re: RDFa and Microdata in MediaWiki [ In reply to ]
I don't suppose that the members of this list appreciate the epic
Microdata vs. RDFa battle leaking into this mailing list, but I want
to address a few inaccuracies below.

Introduction: I work for Opera Software and have been active in the
WHATWG and W3C HTML WG developing HTML5 for the last year and a half. I
believe I have a good understanding of what browser vendors are likely
and not likely to support, although I don't speak for or make any
promises on behalf of Opera Software in this mail.

I have also worked on implementing the microdata DOM API in
JavaScript, an ongoing experiment at http://gitorious.org/microdatajs
and will be able to answer any technical questions about the
processing of microdata. In short, I can only say that it is really
quite intuitive and simple, with few surprises. It maps well to the
RDF model if you want it, but doesn't force authors to think in terms
of subject, predicate, object triples.

On Sat, Jan 16, 2010 at 06:32, Manu Sporny <msporny@digitalbazaar.com> wrote:
> Aryeh Gregor <Simetrical+wikilist <at> gmail.com> writes:

[snip]

> The compactness of the markup between Microdata and RDFa is more or less
> the same in this particular example. There are some things that are
> easier to express in Microdata and there are some things that are easier
> to express in RDFa. We get the following Microdata out:
>
> type  http://n.whatwg.org/work
> work  http://upload.wikimedia.org/...terrestrialglobe-1592-20061127.jpg
> title   "Emery Molyneux Terrestrial Globe"
> author  "Bob Smith"
> license http://creativecommons.org/licenses/by-sa/3.0/us/
>
> So, we get more-or-less the same number of data items out, but there is
> a problem. What does "title" mean in the semantic sense? Does it mean
> "job title" or does it mean "work title"? The term "title" in this case
> is ambiguous.

No, as long as an item type is used (http://n.whatwg.org/work) there
is no ambiguity. This particular item type is defined at
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#licensing-works

Title here "Gives the name of the work." without ambiguity.

Furthermore, for this particular vocabulary the mapping to RDF is
defined, as such:

title: http://purl.org/dc/elements/1.1/title
author: http://creativecommons.org/ns#attributionName
license: http://www.w3.org/1999/xhtml/vocab#license

In other words you express the exact same information as with RDFa but
without the mental overhead of triples or mixing multiple
vocabularies.
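
As a rough illustration of that mapping, the Python sketch below turns one
extracted work item into triples using the three property URIs listed
above. The dictionary shape of the item, the names WORK_TO_RDF and
work_item_to_triples, and the choice to use the "work" URL as the subject
of every triple are assumptions made for this sketch; the spec text is the
authority on the actual conversion.

```python
WORK_TYPE = "http://n.whatwg.org/work"

# Property-to-predicate mapping as listed above.
WORK_TO_RDF = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "author": "http://creativecommons.org/ns#attributionName",
    "license": "http://www.w3.org/1999/xhtml/vocab#license",
}


def work_item_to_triples(item):
    """Return (subject, predicate, object) tuples for one work item."""
    if item.get("type") != WORK_TYPE:
        raise ValueError("not a work item")
    props = item["properties"]
    subject = props["work"]  # assumption: the work URL is the subject
    return [(subject, WORK_TO_RDF[name], value)
            for name, value in props.items() if name in WORK_TO_RDF]
```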

> Concern #2:
>
> Getting Microdata and RDFa markup correct is easier if there are
> templates or if the semantic markup is performed automatically by the
> CMS based on a pre-defined form. For example,
> http://en.wikipedia.org/wiki/Augustus, note the Infobox on the
> right. It would be much better for the RDFa markup to happen
> automatically via MediaWiki's template process, than for it to be marked
> up by
> hand.

Certainly, but if wiki editors are *able* to do it by hand, then IMHO
microdata is much less error-prone.

> However - XHTML1+RDFa is a published W3C Recommendation and it is safe

Is Wikipedia using XHTML served as application/xhtml+xml? It seems
that RDFa in "XHTML" as deployed only works because consumers pretend
that the data is XHTML even though it is served as text/html and
treated as such by browsers. I would assume that most pages using RDFa
today are neither valid XHTML, nor served with the XHTML MIME type.
Any attempts to use browser DOM APIs to access the data will have
surprising/confusing results, as HTML doesn't have namespaces but RDFa
uses the syntax.

> Concern #4:
>
> While I can't fault Aryeh's enthusiasm, I am now concerned that there
> may be questions in this community that are going unanswered related to
> RDFa and Microdata. I hope this will be a deliberate process as it is
> easy to get semantic data markup wrong (regardless of the implementation
> language - Microformats, Microdata or RDFa).

Agreed.

The microdata spec for the curious:
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html

Finally I will note that it is very likely that the microdata DOM APIs
will get implemented in browsers, making the semantic data available
to scrapers, to native browser interfaces, and to browser
extensions such as user JavaScript. As an example, you might see an
icon in the address bar for saving events to a calendar, or the
license information of an image displayed in the native properties
dialog. I stress again that I don't make any promises on behalf of
Opera or any other browser vendor, these are just my predictions.

In other goodies, microdata already has a defined mapping to JSON, so
dumping all embedded data as JSON via a web interface would be quite
trivial and be using the same format that you will get from browsers
when they have implemented some of the DOM APIs.
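
A minimal Python sketch of such a dump, assuming the items have already
been extracted into plain dictionaries. The key layout used here ("items",
"type", "properties" with lists of values) is an assumption for
illustration and should be checked against the JSON section of the spec;
dump_items, example_item and the example.org URL are hypothetical.

```python
import json


def dump_items(items):
    """Serialize a list of extracted microdata items as a JSON document."""
    return json.dumps({"items": items}, indent=2, sort_keys=True)


# Hand-written stand-in for the output of a microdata extractor.
example_item = {
    "type": "http://n.whatwg.org/work",
    "properties": {
        "work": ["http://example.org/globe.jpg"],
        "title": ["Emery Molyneux Terrestrial Globe"],
        "author": ["Bob Smith"],
        "license": ["http://creativecommons.org/licenses/by-sa/3.0/us/"],
    },
}

print(dump_items([example_item]))
```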

--
Philip Jägenstedt

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Philip Jägenstedt wrote:
> I don't suppose that the members of this list appreciate the epic
> Microdata vs. RDFa battle leaking into this mailing list

I wouldn't use such terms to frame the debate. The Microformats,
Microdata and RDFa communities are not "battling" or working against
each other - they're having a very necessary, spirited debate. Clearly,
both communities are influencing the design of the other and clearly we
need to have these discussions in order to make sure that we're creating
the best possible technology for the future of the Web.

More importantly, the reason that all of us are working on this
technology is because we care about how it is used to better humanity.
At least, I hope that's why people are working on this stuff :).
Certainly, we all hold Wikipedia in high regard and want what's best for
this community as well.

It's not /unfortunate/ that we're having the discussion here - it was
inevitable.

I'm delighted by the fact that we're even having this debate. It took
ages to convince the WHATWG that this was a problem that needed to be
addressed[1] just 18 months ago.

So, we can either grit our teeth and begrudgingly go through the
motions, or we can welcome the debate to come.

I choose to do the latter because I know that all of us will learn
something from it and better understand the requirements for Wikimedia
implementations. What we learn here will further influence guidance
given to future communities, just as integrating RDFa with Drupal has
influenced the advice that we may give to this community.

> [ed: Microdata] is really quite intuitive and simple, with few
> surprises.

I agree on the first point - Microdata is pretty intuitive and simple,
with few surprises. Although, I'd say the same for RDFa as well. I think
we tend to forget, though, that Web semantics require a bit of effort to
learn and the audience that is using the technology should be taken into
account when deciding how to expose an authoring environment for the
community.

I don't think that the best approach for Wikipedia is to allow direct
Microdata or RDFa markup. There are already many templates in use at
Wikipedia via Infobox - those templates could be leveraged to
automatically generate RDFa in the same way that dbpedia.org uses those
templates to generate RDF. The risk this community runs by allowing
arbitrary semantic data markup is that contributors make mistakes
causing half of the semantic data to be corrupted - making the rest of
the data useless.

Neither Microdata nor RDFa is free of surprises for the beginner.
Like all new web technologies, there is a learning curve for both of
them and it's pretty similar since Microdata's design was influenced by
RDFa and Microformats. More about the surprises with each, below.

> [ed: Microdata] maps well to the
> RDF model if you want it, but doesn't force authors to think in terms
> of subject, predicate, object triples.

Well, Microdata /almost/ maps to the RDF model. Microdata doesn't
support RDF literal typing, which is basically a fancy way of saying
that you can't verify that weights, volumes, speeds, the full range of
dates in different calendars, encodings such as chemical compositions,
and varying other typed information is expressed cleanly by the
Wikipedia contributors.

So, if you wanted to say something like this:

The speed of light is 299792458 m/s.

You would do this in RDFa:

<div about="#light">
The speed of light is <span property="measure:speed"
datatype="measure:meters-per-second">299792458</span> m/s.
</div>

which would generate the following triple:

<#light>
measure:speed
"299792458"^^measure:meters-per-second .

AFAIK, there is no way to do the equivalent in Microdata, is there, Philip?

Some of you may be asking yourselves "Why is that so important?". The
primary concern has to do with data validation. Good RDF vocabularies
are built to be able to validate their data and this is important for
large sites like Wikipedia to ensure that the data that they're exposing
is valid. Since measure:speed's range is measure:meters-per-second, and
meters-per-second is presumably a sub-class of xsd:decimal, then a data
validator would know that it's expecting some sort of number. So, if a
Wikipedia author enters some markup that generates this data:

<#baseball>
measure:speed
"fast enough to hurt" .

An RDF reasoner would know that not only is the data not typed, but even
if it were typed, the value "fast enough to hurt" is not valid. I would
expect that this most basic level of data validation would be important
to Wikipedia as you want to make sure that contributors are being
careful with their markup.
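
A toy Python stand-in for that check, to show the idea rather than how an
actual RDF reasoner works. The "measure:" vocabulary is illustrative in
this thread, and NUMERIC_DATATYPES and is_valid_literal are hypothetical
names; the sketch simply assumes the datatype requires a finite decimal
number.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical: datatypes assumed to require a decimal number.
NUMERIC_DATATYPES = {"measure:meters-per-second"}


def is_valid_literal(value, datatype):
    """Return True if the literal is typed and its value fits the datatype."""
    if datatype is None:
        return False  # untyped literal: nothing to validate against
    if datatype in NUMERIC_DATATYPES:
        try:
            return Decimal(value).is_finite()
        except InvalidOperation:
            return False  # not parseable as a number at all
    raise ValueError(f"no validation rule for {datatype!r}")
```

Under these assumptions "299792458" typed as measure:meters-per-second
passes, while "fast enough to hurt" fails both untyped and typed.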

The above is how you would do it in RDFa. Philip, I haven't seen any
work related to this in Microdata - have there been any recent
developments with regard to data validation in Microdata?

>> So, we get more-or-less the same number of data items out, but there is
>> a problem. What does "title" mean in the semantic sense? Does it mean
>> "job title" or does it mean "work title"? The term "title" in this case
>> is ambiguous.
>
> No, as long as an item type is used (http://n.whatwg.org/work) there
> is no ambiguity. This particular item type is defined at
> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#licensing-works
>
> Title here "Gives the name of the work." without ambiguity.

This is new! I'm glad this issue was addressed in Microdata as it was
one of my criticisms of it when I last read the Microdata spec about six
months ago. Looks like that section of the spec was last changed on
October 23rd 2009? Do you know when this was put in there, Philip?

What happens when an author forgets to include itemtype? So, if somebody
does this:

<div itemscope>
<span itemprop="title">Emery Molyneux Terrestrial Globe</span>
</div>

There's nothing to ground the "title" property. The way I'm reading the
spec, it becomes ambiguous at that point, right?

RDFa is very careful to never let something like this happen... as this
data ambiguity results in questionable data that you wouldn't want to
pass to a reasoning agent.
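
If Microdata were allowed by hand, one cheap safeguard would be a check
that flags every itemscope lacking an itemtype. The Python sketch below
uses the standard html.parser; UntypedItemFinder and find_untyped_items
are hypothetical names, and the check says nothing about whether full-URL
property names would make such an item acceptable.

```python
from html.parser import HTMLParser


class UntypedItemFinder(HTMLParser):
    """Record the position of each itemscope element without an itemtype."""

    def __init__(self):
        super().__init__()
        self.untyped = []  # (line, column) pairs as reported by getpos()

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        if "itemscope" in names and "itemtype" not in names:
            self.untyped.append(self.getpos())


def find_untyped_items(html):
    """Return positions of all itemscope elements that lack an itemtype."""
    finder = UntypedItemFinder()
    finder.feed(html)
    finder.close()
    return finder.untyped
```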

> Furthermore, for this particular vocabulary the mapping to RDF is
> defined, as such:
>
> title: http://purl.org/dc/elements/1.1/title
> author: http://creativecommons.org/ns#attributionName
> license: http://www.w3.org/1999/xhtml/vocab#license
>
> In other words you express the exact same information as with RDFa but
> without the mental overhead of triples or mixing multiple
> vocabularies.

... and with the added danger of expressing ambiguous data. This is not
the real danger, though. While data ambiguity is really bad when it
comes to data stores, centralized vocabulary management is even worse.

RDFa is built on a concept called "follow your nose", which means that
all vocabulary term URLs in RDFa, such as
http://purl.org/media/audio#Recording, should be dereference-able and at
the end of that URL should be a machine-readable description of the
vocabulary term. Preferably, a human-readable description should also
exist at that URL.

Dereference http://n.whatwg.org/work and you get a 404 Error. Even
worse, the Microdata work vocabulary is hard-coded in the HTML5
specification. If one wanted to extend the vocabulary, you would have to
convince the only editor of that specification, who has a track record
of being both very easy and very difficult to work with (based on
whether or not he agrees with you), that your vocabulary term warrants
addition.

There are currently 3 Microdata vocabularies in the spec[2].

To contrast, there are over 250 active RDF vocabularies[3].

That is the true power of decentralized vocabulary development, which is
a corner-stone of RDFa. The RDFa community understands that Wikipedia
should be in charge of choosing and extending vocabularies since this
community has the appropriate domain experts. You are the experts, we
are not - and it's important to recognize that in the design of any
semantic data expression language.

If Wikipedia agrees that embedding semantics in their pages is of worth
to humanity (and I certainly think it is of great worth), then there
will come a time that this community will want to develop their own
vocabulary. RDFa allows that vocabulary to be developed independently of
any standards body and allows this community to have full control of it.

Sure, you could make the argument that Microdata allows RDF to be
expressed (as long as you use the complete vocabulary URL), but at that
point the Microdata markup is far more cumbersome than the RDFa markup.
Similarly, if the goal is to express RDF, that is what RDFa was designed
to accomplish.

Philip, could you give us an update on what the WHATWG sees as the
publishing process for Microdata vocabularies? For example, if Wikipedia
wanted to start expressing royal bloodlines using a vocabulary specific
to Wikipedia, how would they go about getting that vocabulary into the
HTML5 Microdata specification?

> Certainly, but if wiki editors are *able* to do it by hand, then IMHO
> microdata is much less error-prone.

IMHO, there are ways to shoot yourself in the foot with both Microdata
and RDFa - as I've outlined above. I suppose that you could use both and
pick which foot you're going to shoot with which technology :), but my
suggestion is that nobody should be making such generalized statements -
that one is more error-prone than the other.

It's like saying that programming in Python is more error prone than
programming in PHP - it depends entirely on the skill of the developer,
what you're doing, and many other factors that are out of the hands of
language designers.

>> However - XHTML1+RDFa is a published W3C Recommendation and it is safe
>
> Is Wikipedia using XHTML served as application/xhtml+xml? It seems
> that RDFa in "XHTML" as deployed only works because consumers pretend
> that the data is XHTML even though it is served as text/html and
> treated as such by browsers. I would assume that most pages using RDFa
> today are neither valid XHTML, nor served with the XHTML MIME type.
> Any attempts to use browser DOM APIs to access the data will have
> surprising/confusing results, as HTML doesn't have namespaces but RDFa
> uses the syntax.

Frankly, this is something that nobody who uses this technology cares
about because all they are ever going to see are key-value pairs
(Microdata) or triples (RDFa).

This is something that only concerns browser manufacturers and RDFa
parser writers. That's why there is a Microdata API, and there is going to
be an RDFa API. There also exist many RDFa parser implementations to
abstract this low-level stuff away.
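
To make the "key-value pairs" view concrete, here is a minimal sketch of
what a consumer might do with the Microdata example from earlier in this
thread. It is deliberately simplified: the class name is invented for this
example, it ignores itemscope nesting and itemref, and it is not the
extraction algorithm from the HTML5 specification.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (name, value) pairs from itemprop attributes (simplified)."""

    URL_ATTRS = {"a": "href", "img": "src"}  # elements whose value is a URL

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # (tag, name, text chunks) for a text-valued property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag in self.URL_ATTRS:
            # Links and images contribute their URL, not their text.
            self.pairs.append((name, attrs.get(self.URL_ATTRS[tag], "")))
        else:
            self._open = (tag, name, [])

    def handle_data(self, data):
        if self._open:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        # Assumes no same-named element is nested inside the property.
        if self._open and tag == self._open[0]:
            _, name, chunks = self._open
            self.pairs.append((name, "".join(chunks).strip()))
            self._open = None


markup = """<div itemscope="" itemtype="http://n.whatwg.org/work">
<p><span itemprop="title">Emery Molyneux Terrestrial Globe</span>
by <span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>"""

collector = ItempropCollector()
collector.feed(markup)
pairs = collector.pairs
for name, value in pairs:
    print(name, "=", value)
```

Nothing in it depends on the MIME type the page was served with.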

Both Microdata and RDFa are being designed to operate in "dirty"
environments with invalid markup and will work regardless of the MIME
type, file extension, markup botching and namespace support across
websites and web browsers.

There are a number of RDFa JavaScript implementations that work just[4]
fine[5] on badly authored/served XHTML documents.

Besides, the Wikipedia community has done a fantastic job of generating
valid XHTML:

http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&charset=(detect+automatically)&doctype=Inline&group=0
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_National_Park&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator/1.654
http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shunei&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator/1.654

The migration to XHTML+RDFa would only require the DOCTYPE to change...
which shouldn't be any more difficult than transitioning to HTML5 (or
HTML5+RDFa) in the future.

> Finally I will note that it is very likely that the microdata DOM APIs
> will get implemented in browsers, making the semantic data available
> to both scrapers, to native browser interfaces and to browser
> extensions such as user JavaScript. As an example, you might see an
> icon in the address bar for saving events to a calendar, or the
> license information of an image displayed in the native properties
> dialog. I stress again that I don't make any promises on behalf of
> Opera or any other browser vendor, these are just my predictions.

Again, this is exciting news and while I don't think Microdata is the
proper solution for the Web, for the same reasons that are outlined
above and many more, I'm delighted to hear that Opera is taking
in-browser semantic data expression very seriously. How far we have come
in just 18 months! :)

-- manu

[1]http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-August/015971.html
[2]http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#mdvocabs
[3]http://prefix.cc/popular/all
[4]http://code.google.com/p/rdfquery/
[5]http://code.google.com/p/ubiquity-rdfa/

--
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: Monarch - Next Generation REST Web Services
http://blog.digitalbazaar.com/2009/12/14/monarch/

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: RDFa and Microdata in MediaWiki [ In reply to ]
2010/1/16 Manu Sporny <msporny@digitalbazaar.com>:

> I don't know if you intended the tone of
> your e-mail in the way that I read it, but it came off as purposefully
> misleading based on the discussions that both you and I have had as
> members of the HTMLWG and WHATWG.
[...]
> We have a very friendly community


- d.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Philip wrote:
> Certainly, but if wiki editors are *able* to do it by hand, then IMHO
> microdata is much less error-prone.


Manu Sporny wrote:
> I don't think that the best approach for Wikipedia is to allow direct
> Microdata or RDFa markup. There are already many templates in use at
> Wikipedia via Infobox - those templates could be leveraged to
> automatically generate RDFa in the same way that dbpedia.org uses those
> templates to generate RDF. The risk this community runs by allowing
> arbitrary semantic data markup is that contributors make mistakes
> causing half of the semantic data to be corrupted - making the rest of
> the data useless.


Both of you seem to think that wikipedia editors would start placing
RDF/Microdata interleaved with wiki markup.
I don't think that could ever happen. The "direct markup" would be
inserted into infoboxes (which are themselves wikitext, although they
can get quite complex).


Perhaps we shouldn't provide the full power of RDF or Microdata yet, and
provide instead an extension able to handle a subset, using one or the other.


> (long text about whether Wikipedia's XHTML is served as
> application/xhtml+xml and why it doesn't matter)
>
> Besides, the Wikipedia community has done a fantastic job of generating
> valid XHTML:
>
> http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Augustus&charset=(detect+automatically)&doctype=Inline&group=0
> http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Walyunga_National_Park&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator/1.654
> http://validator.w3.org/check?uri=http://en.wikipedia.org/wiki/Nishida_Shunei&charset=(detect+automatically)&doctype=Inline&group=0&user-agent=W3C_Validator/1.654
>
> The migration to XHTML+RDFa would only require the DOCTYPE to change...
> which shouldn't be any more difficult than transitioning to HTML5 (or
> HTML5+RDFa) in the future.

MediaWiki is expected to produce good XHTML (the output is passed through
Tidy), but it still sometimes fails. And there are IE users, too.
There is also a switch in MediaWiki for using HTML5 instead of XHTML.


Re: RDFa and Microdata in MediaWiki [ In reply to ]
2010/1/16 Platonides <Platonides@gmail.com>:

> Both of you seem to think that wikipedia editors would start placing
> RDF/Microdata interleaved with wiki markup.
> I don't think that could ever happen. The "direct markup" would be
> inserted into infoboxes (which are themselves wikitext, although they
> can get quite complex).


Something deep inside the plumbing of a template would be the place for this.


- d.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
I could see the flames rising at the start of this thread, so thank you both
for steering away from them.

Essentially we have a format war here, in which one format will win and the
other will go extinct. It may be fueled by altruism rather than
capitalism, and that's brilliant, but VHS and Betamax are watching from
the wings. I know sod all about either of them except what has been posted
here, but I can see that they're incredibly similar, yet just different
enough to be incompatible; and I see that they are both horribly difficult
for the lay editor to use. By that I mean that arguing over "oh this one
only requires us to put in two new attributes instead of three" misses the
elephant in the room: *both* formats require us to whitelist, and start
filling our wikitext with, the HTML tags that the most iconic piece of
wikimarkup, the double brackets, has kept hidden for nine years. The
reason we brought in that now-ubiquitous syntax hasn't changed: the damn
thing was too difficult for the layman to understand and use.

We do, without a doubt, need to implement this metadata-capture in MediaWiki
somehow, but we need to do it not only in a way that the majority of people
can use and understand, but in a way which doesn't make wikitext even more
complicated for everyone. If either syntax were enabled, yes it would end
up at the bottom of a template stack, but a) that's not going to do anything
to ensure that the tags aren't being misused elsewhere, and b) even the most
careful implementation is going to manifest itself in article wikitext along
the lines of "{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a
{{occupation|football player}} for {{organisation|Puddlemere United}}". Or
something like that. If we encourage editors to go the whole hog on this,
we might as well install SMW.

There seem to be two use cases for these systems. First, marking up the
'stuff' that MediaWiki serves: images, copyright links, author links, etc.
That requires MW to be able to get hold of the raw data for, for instance,
an image license; and that's begging for things like new magic words to put
on the image description page, not for enabling either format directly in
wikitext. The only reason to do *that* is to support editors marking up
*their own stuff*, and that's where we have problems.

I think that it would be foolish beyond belief to encourage editors to
divert their volunteer time to implementing a system that could turn out to
be totally anachronistic within two years; and while I think it's a laudable
long-term goal I think it would thus be very silly to let editors insert
*either* format directly into wikitext at this point, or for a good year to
come. By far the top priority should be implementing structures by which
MediaWiki can *collect* semantic data. If we implement a {{COPYRIGHT:...}}
parser function, or a metadata form, or (as I've been musing over for a
while) a Category-esque system that wasn't based on wikitext and so which
could have a fine-grained permissions interface; we create a feature that is
useful whatever happens in the metadata world. We can implement RDFa with
that data, microdata, both, neither or something else entirely. We could
certainly expose it through our own API. Whatever happens, editor work is
not wasted.
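
A minimal sketch of that idea, assuming the license URL and label have
already been collected by some format-neutral mechanism (the function name
and the "plain" fallback are invented for this example): the stored record
stays the same, and only the last rendering step knows which syntax is in
fashion.

```python
from html import escape


def render_license_link(url: str, label: str, fmt: str) -> str:
    """Render one stored license record as an HTML link in the chosen format."""
    if fmt == "rdfa":
        attr = 'rel="license"'
    elif fmt == "microdata":
        # Only meaningful inside an element that carries itemscope.
        attr = 'itemprop="license"'
    elif fmt == "plain":
        attr = ""
    else:
        raise ValueError(f"unknown format: {fmt}")
    attr_text = f"{attr} " if attr else ""
    return f'<a {attr_text}href="{escape(url)}">{escape(label)}</a>'


url = "http://creativecommons.org/licenses/by/2.0/"
for fmt in ("rdfa", "microdata", "plain"):
    print(render_license_link(url, "CC BY 2.0", fmt))
```

Switching formats later means changing the renderer, not re-editing pages.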

TLDR version: jumping on either bandwagon is neither necessary nor sensible,
and we should avoid getting drawn into the issue. Implementing either of
the proposed methods in raw wikitext actively defeats one of the purposes of
MediaWiki: to make it as easy as possible for anyone to edit stuff. It would
need to be carefully thought through, and there's no point putting that
effort in until we know which format has come out on top. Adding metadata to
MW's own stuff is much easier, but its groundwork should be format-independent.

In this world of economic crisis, £0.02 seems to go quite a long way :-D

--HM



Re: RDFa and Microdata in MediaWiki [ In reply to ]
Platonides wrote:
> Both of you seem to think that wikipedia editors would start placing
> RDF/Microdata interleaved with wiki markup.
> I don't think that could ever happen. The "direct markup" would be
> inserted into infoboxes (which are themselves wikitext, although they
> can get quite complex).

Just to be clear - I'm not trying to propose that wikipedia editors
should start writing wiki markup interleaved with RDFa/Microdata. Quite
the opposite - I think that allowing contributors to hand author RDFa or
Microdata would be a very bad idea for Wikipedia. However, it seems like
what you are saying is that interleaving HTML like this is not possible
anyway - which is a good thing, IMHO.

> Perhaps we shouldn't provide the full power of RDF or Microdata yet,
> and provide instead an extension able to handle a subset, using one or
> the other.

XHTML1+RDFa is certainly ready for prime-time, so it would be up to this
community to decide if it should go that route and put it into the core
distribution or have it implemented as an extension.

I think our preference would be that it is implemented as an extension
first and in such a way as to make it very easy to integrate it into
MediaWiki core once all of the bugs are worked out in the extension.

Does anybody have a link to a previous discussion about how to get
Wikipedia to output the same data that dbpedia.org is publishing?

David Gerard wrote:
> Something deep inside the plumbing of a template would be the place
> for this.

I agree.

-- manu


Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sat, Jan 16, 2010 at 12:32 AM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> I don't know if you intended the tone of
> your e-mail in the way that I read it, but it came off as purposefully
> misleading based on the discussions that both you and I have had as
> members of the HTMLWG and WHATWG.

I do not claim to be an expert on RDFa, Microdata, or any similar
technology. I'd prefer not to have to make a decision here at all,
and I've said so. However, it looks like we (MediaWiki) have good
reason to use something or other. For the reasons I gave, I think we
should choose whatever we believe is more likely to succeed, and
failing that, whatever we think is better (e.g., on grounds of
aesthetics or intuitiveness). The example markup I gave might not be
ideal or accurate, but it serves to give a general idea of what the
markup looks like in each case, at least. Thank you for your better
RDFa examples -- although it's telling that I was able to get
Microdata right on the first try, but apparently it took an RDFa
expert to figure out the correct RDFa.

However, as a Wikimedian, I'd like to point you to one of our core
guiding principles:

http://en.wikipedia.org/wiki/Wikipedia:Assume_good_faith

> One lesson that we learned during implementation of RDFa in Drupal is
> that it is helpful for CMS designers to pre-define vocabularies that are
> usable with their CMS systems if manual markup is necessary. Most markup
> of both Microdata and RDFa should also be left to the CMS code unless
> there is a very good reason to not do so.

The major use case for us is image licensing on Commons. Currently
the license templates are written "by hand", in the sense that they are
not hardcoded in the software, but they're maintained by technically
advanced community members, so ordinary users don't see the markup. To
use my example image, look at this page:

http://commons.wikimedia.org/wiki/File:EmeryMolyneux-terrestrialglobe-1592-20061127.jpg

You can see the wikitext source of the page by hitting "view source"
(or "edit" if it's unprotected by the time you read this) at the top.
The license info is generated by:

{{cc-by-2.0}}

This expands to:

<table cellspacing="8" cellpadding="0" style="width:100%; clear:both;
text-align:center; margin:0.5em auto; background-color:#f9f9f9;
border:2px solid #e0e0e0; direction: ltr;" class="layouttemplate">
<tr>
<td style="width:90px;" rowspan="3"><img alt="w:en:Creative Commons"
src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/79/CC_some_rights_reserved.svg/90px-CC_some_rights_reserved.svg.png"
width="90" height="36" /><br />
<img alt="attribution"
src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/11/Cc-by_new_white.svg/24px-Cc-by_new_white.svg.png"
width="24" height="24" /></td>
<td>This file is licensed under the <a
href="http://en.wikipedia.org/wiki/en:Creative_Commons" class="extiw"
title="w:en:Creative Commons">Creative Commons</a> <a
href="http://creativecommons.org/licenses/by/2.0/deed.en"
class="external text" rel="nofollow">Attribution 2.0 Generic</a>
license.</td>
<td style="width:90px;" rowspan="3"></td>
</tr>
<tr style="text-align:center;">
<td></td>
</tr>
<tr style="text-align:left;">
<td>
<dl>
<dd>You are free:
<ul>
<li><b>to share</b> – to copy, distribute and transmit the work</li>
<li><b>to remix</b> – to adapt the work</li>
</ul>
</dd>
<dd>Under the following conditions:
<ul>
<li><b>attribution</b> – You must attribute the work in the manner
specified by the author or licensor (but not in any way that suggests
that they endorse you or your use of the work).</li>
</ul>
</dd>
</dl>
</td>
</tr>
</table>

(Not cutting-edge markup, but oh well.) This is generated by the
contents of <http://commons.wikimedia.org/wiki/Template:Cc-by-2.0>,
which was created by the Commons community. Pretty much all
boilerplate on Wikimedia projects is created by such templates. So
when we enable Microdata and/or RDFa in MediaWiki wikitext, I'd expect
it to be used almost exclusively in templates, with few users actually
being directly exposed to it. Since the content of MediaWiki pages
has no structure other than wikitext, basically we have to allow this
in wikitext to make it useful to mark up content.

I'll emphasize from the start that I do *not* think either RDFa or
microdata is suitable for dbpedia.org-style content. There's no
reason we should put that in the HTML output, where it will take up
tons of space and not be useful to HTML consumers (e.g., browsers and
search engines). That sort of data should be made available in a
separate stream for consumers who want it, in a dedicated format like
RDF. That way HTML consumers aren't forced to download loads of
useless metadata, and metadata consumers aren't forced to download
loads of useless (and expensive-to-generate) HTML. RDFa/Microdata
should *only* be used for metadata that's useful to HTML consumers of
some kind.

> If you want to allow manual markup of RDFa, MediaWiki should probably
> pre-define at least Dublin Core (used to describe creative works), FOAF
> (used to describe people and organizations), and Creative Commons (used
> to describe licenses).

I expect that we'd allow contributors to use whatever vocabularies
they'd like. It's a wiki, after all. :)

> The above could be marked up in RDFa, with pre-defined vocabs, like so:
>
> <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
>   typeof="dctype:StillImage">
> <span property="dc:title">Emery Molyneux Terrestrial Globe</span>
> by <a rel="cc:attributionUrl" href="http://example.org/bob/"
>      property="cc:attributionName">Bob Smith</a>
> is licensed under a <a rel="license"
> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
> Commons Attribution-Share Alike 3.0 United States License</a>.</p>
>
> . . .
>
> So, four pieces of data, which is pretty good considering the
> compactness of the HTML code. The Microdata looks like this:
>
> <div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
> ...
> <img
> src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
> width="640" height="480" itemprop="work">
> ...
> <p><span itemprop="title">Emery Molyneux Terrestrial Globe</span>
> by <span itemprop="author">Bob Smith</span> is licensed under a <a
> itemprop="license"
> href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
> Commons Attribution-Share Alike 3.0 United States License</a>.</p>
> </div>
>
> The compactness of the markup between Microdata and RDFa is more or less
> the same in this particular example.

You're comparing apples to oranges here: you included the div and img
for Microdata but not RDFa. If you include that for RDFa, and also
count the xmlns:, it becomes (correct me if I'm wrong):

[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
typeof="dctype:StillImage"><span
xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery
Molyneux Terrestrial Globe</span>
by <a xmlns:cc="http://creativecommons.org/ns#"
rel="cc:attributionUrl" href="http://example.org/bob/"
property="cc:attributionName">Bob Smith</span> is licensed under a <a
rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
]]

You do have to count the xmlns: somewhere. Even if you put them on
the <html>, they still count at least once, and in this case they're
only used once on the page, so they deserve to count in full. This is
685 characters. On the other hand, Microdata:

[[
<div id="bodyContent" itemscope="" itemtype="http://n.whatwg.org/work">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480" itemprop="work">
...
<p><span itemprop="title">Emery Molyneux Terrestrial Globe</span> by
<span itemprop="author">Bob Smith</span> is licensed under a <a
itemprop="license"
href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
]]

525 characters. Compare to the original with no extra semantics:

[[
<div id="bodyContent">
...
<img src="http://upload.wikimedia.org/wikipedia/commons/e/ef/EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
width="640" height="480">
...
<p>Emery Molyneux Terrestrial Globe by Bob Smith is licensed under a
<a href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
</div>
]]

380 characters. So Microdata adds 145 bytes, while RDFa adds 305: 2.1
times as much extra markup. To be fair, you included an extra link to
http://example.org/bob/ which wasn't in the original example, but RDFa
is still about twice as many bytes.
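
Spelling the arithmetic out, using only the three counts given above (they
are taken as stated here, not re-measured):

```python
# Character counts quoted above for each version of the markup.
plain, microdata, rdfa = 380, 525, 685

microdata_extra = microdata - plain   # markup added by Microdata
rdfa_extra = rdfa - plain             # markup added by RDFa
ratio = rdfa_extra / microdata_extra  # how much more RDFa adds

print(microdata_extra, rdfa_extra, round(ratio, 1))
```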

It's not just bytes, though. It's also complexity. The Microdata is
*obvious*. I've never used Microdata before in my life, or RDFa, but
somehow I got the Microdata right on the first try, while making
several errors in the RDFa. It's not at all obvious what those xmlns:
things do, or what those cryptic prefixes mean. Microdata is simpler
to understand at first glance for people from an HTML background.
Since you've been working with RDF for years, the magnitude of the
difference is probably not apparent to you.

> Getting Microdata and RDFa markup correct is easier if there are
> templates or if the semantic markup is performed automatically by the
> CMS based on a pre-defined form. For example,
> http://en.wikipedia.org/wiki/Augustus, note the Infobox on the
> right. It would be much better for the RDFa markup to happen
> automatically via MediaWiki's template process, than for it to be marked
> up by
> hand.

As I noted, the templates are made by hand, by each community. The
software just gives the ability to include one page in another with
simple substitutions made. The infobox on the Augustus article is
<http://en.wikipedia.org/wiki/Template:Infobox_royalty>, invoked like
so:

{{Infobox royalty
| name = Caesar Augustus
| title = [[Roman Emperor|Emperor]] of the [[Roman Empire]]
. . . snip 18 lines . . .
| place of death = [[Nola]], [[Italia (Roman Empire)|Italia]], [[Roman Empire]]
| place of burial = [[Mausoleum of Augustus]], Rome
|}}

The template authors would be the ones to add semantics here, not the
software developers. There are a couple orders of magnitude more wiki
editors than software developers, so it just wouldn't be practical for
the developers to be the ones to assign semantic markup to each and
every template. Moreover, as you can tell from the HTML output of the
templates, template editors tend to be of the "copy-paste stuff until
it works" school of HTML authorship. So you cannot argue here that
RDFa is just as good if we abstract away the actual markup. We aren't
in a position to do that -- users with little to no knowledge of RDFa
or microdata will be editing the raw markup, and that has to be taken
into account.
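
To show how little the software does here, the following is a toy sketch of
the "simple substitutions" step. It is not MediaWiki's parser: it handles
only {{{name}}} and {{{name|default}}} placeholders, and the template body
with its itemprop attribute is a hypothetical example of what a template
author might write.

```python
import re

# Matches {{{name}}} or {{{name|default}}}.
PARAM = re.compile(r"\{\{\{([^{}|]+)(?:\|([^{}]*))?\}\}\}")


def expand(template_body: str, args: dict) -> str:
    """Replace template parameters with the supplied arguments."""
    def substitute(match):
        name, default = match.group(1).strip(), match.group(2)
        if name in args:
            return args[name]
        # No argument given: use the default, else leave the placeholder.
        return default if default is not None else match.group(0)

    return PARAM.sub(substitute, template_body)


# Hypothetical template body written by a template author.
body = '<span itemprop="author">{{{author|Unknown}}}</span>'
print(expand(body, {"author": "Bob Smith"}))
print(expand(body, {}))
```

Whatever semantic attributes appear in the output are exactly the ones the
template author typed into the body.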

> Intentional or not, Aryeh has painted RDFa in a negative light by not
> outlining a number of points related to adoption and both RDFa and
> Microdata's current status in the HTML Working Group. Adopting either
> RDFa or Microdata in an HTML5 document type would be premature
> at this time because both have not progressed past the Editors Draft
> stage yet. Either is subject to change as far as HTML5 is concerned
> and we really don't want you to ship HTML5 features before they've had
> a chance to solidify a bit more.
>
> However - XHTML1+RDFa is a published W3C Recommendation and it is safe
> to use it for deployment.

Microdata is also safe to use for deployment. Like other web
technologies maintained by the WHATWG, it will not change once it's
widely adopted, and Wikipedia adoption would probably count as wide
adoption by itself. Note that microdata, like all of HTML5, is at
Last Call at the WHATWG, independent of its status as Working Draft in
the W3C.

I've asked Hixie how stable Microdata is. Since he's the sole person
who decides on changes to HTML5 at the WHATWG, as you know, his answer
should be authoritative.

> Google[1] is actively indexing RDFa today as
> is Yahoo[2]. Sites such as Digg, Whitehouse.gov, the UK Government, The
> Public Library of Science, and O'Reilly are
> high-profile sites that publish their pages using RDFa. Data formats
> such as XHTML1, SVG 1.2 and ODF have integrated RDFa as a core part of
> their language. Best Buy saw a 30% traffic increase after publishing
> their pages in RDFa using the GoodRelations vocabulary. I'm sure
> everyone here is aware of dbpedia.org[3] and Freebase[4] - which use RDF
> as a semantic representation format. dbpedia, which gets its data from
> Wikipedia, shows 479 million triples available - so that
> should give you folks some idea of the treasure trove of immediately
> extractable semantic data we're talking about.
>
> Make no mistake - RDFa has very strong deployment at this point and it
> will continue to grow past 100,000+ sites with the upcoming release of
> Drupal 7.

Right -- because microdata is so new. How many of those groups
actually considered using microdata? I'd guess roughly none, because
in most cases, microdata either didn't exist or was barely known. If
microdata is much more intuitive and simpler to use, I'd expect it to
win in the long run, say five years from now. RDFa isn't so widely
used that it can't be easily defeated by a clearly superior
technology.

On Sat, Jan 16, 2010 at 6:37 AM, Philip Jägenstedt <philip@foolip.org> wrote:
> Is Wikipedia using XHTML served as application/xhtml+xml?

No. We're currently using XHTML1.0 served as text/html. I expect us
to switch to HTML5 served as text/html (which happens to also be
well-formed XML) before we deploy support for either microdata or
RDFa.

On Sat, Jan 16, 2010 at 5:16 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> You would do this in RDFa:
>
> <div about="#light">
> The speed of light is <span property="measure:speed"
> datatype="measure:meters-per-second">299792458</span> m/s.
> </div>
>
> which would generate the following triple:
>
> <#light>
>   measure:speed
>      "299792458"^^measure:meters-per-second .
>
> AFAIK, there is no way to do the equivalent in Microdata, is there Philip?

You could define different properties for different units, or allow
the data to include unit info directly. Like

<span itemprop="speed">299792458 m/s</span>

and have the format itself define what "m/s" means. I don't see this
as a practical issue in MediaWiki, given our use-cases (in particular,
emphatically excluding markup of data that's useless to typical HTML
consumers).
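
As a sketch of the second option, a consumer of such a vocabulary could
split the value from its unit itself. The function name and the
"number, space, unit" convention are assumptions for this example; a real
vocabulary would have to spell out its own rules.

```python
def parse_quantity(text: str) -> tuple[float, str]:
    """Split a value such as '299792458 m/s' into a number and a unit."""
    number, _, unit = text.strip().partition(" ")
    if not unit:
        raise ValueError(f"no unit in {text!r}")
    # float() raises ValueError if the first token is not a number.
    return float(number), unit.strip()


speed, unit = parse_quantity("299792458 m/s")
print(speed, unit)
```

A value like "fast enough to hurt" is rejected, because its first token is
not a number.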

> An RDF reasoner would know that not only is the data not typed, but even
> if it were typed, the value "fast enough to hurt" is not valid.

A microdata standard would also define what type of data is valid.
For instance, from the license vocabulary: "The value must be an
absolute URL." "The value must be either an item with the type
http://microformats.org/profile/hcard, or text."
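
A consumer could enforce the first of those rules with a check along these
lines. This is only a loose sketch: is_absolute_url is a name invented here,
and requiring both a scheme and a host is a simplification that rejects
some legitimate absolute URLs (mailto: ones, for instance).

```python
from urllib.parse import urlsplit


def is_absolute_url(value: str) -> bool:
    """Loose check: the value has both a scheme and a network location."""
    parts = urlsplit(value.strip())
    return bool(parts.scheme and parts.netloc)


print(is_absolute_url("http://creativecommons.org/licenses/by-sa/3.0/us/"))
print(is_absolute_url("fast enough to hurt"))
```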

> What happens when an author forgets to include itemtype?

The same as if an author forgets to include xmlns:. The data isn't tied
to any vocabulary, so you have to either guess or ignore it. It's not
ambiguous, just meaningless. There's no difference to RDFa here,
except that RDFa encourages you to link to the profile IDs on the
<html> element, which is much more likely to break under copy-paste.

> RDFa is built on a concept called "follow your nose", which means that
> all vocabulary term URLs in RDFa, such as
> http://purl.org/media/audio#Recording, should be dereference-able and at
> the end of that URL should be a machine-readable description of the
> vocabulary term. Preferably, a human-readable description should also
> exist at that URL.

The perils of using URLs like this are well-known. Just ask the W3C
how many hits it gets for DTDs every second. Microdata deliberately
and wisely avoids using URLs that machines are intended to
dereference. On the other hand, humans can find the info easily:

http://www.google.com/search?q=http://n.whatwg.org/work

I imagine it's meant to resolve to a human-readable spec, though, for
the same discoverability as RDFa. It's probably an oversight; I've
asked Hixie to clarify.

> Philip, could you give us an update on what the WHATWG sees as the
> publishing process for Microdata vocabularies? For example, if Wikipedia
> wanted to start expressing royal bloodlines using a vocabulary specific
> to Wikipedia, how would they go about getting that vocabulary into the
> HTML5 Microdata specification?

We don't have to. See the spec:

"The item type must be a type defined in an applicable specification."
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#item-type

"Applicable specification" links to

"When vendor-neutral extensions to this specification are needed,
either this specification can be updated accordingly, or an extension
specification can be written that overrides the requirements in this
specification. When someone applying this specification to their
activities decides that they will recognise the requirements of such
an extension specification, it becomes an applicable specification for
the purposes of conformance requirements in this specification."
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#other-applicable-specifications

Anyone can write their own extension specification -- it becomes
"applicable" as soon as anyone decides to use it.

> It's like saying that programming in Python is more error prone than
> programming in PHP - it depends entirely on the skill of the developer,
> what you're doing, and many other factors that are out of the hands of
> language designers.

I think you'll find most MediaWiki developers strongly agree that PHP
is a terrible language and Python is way better, so maybe that was a
bad analogy. :)

> Besides, the Wikipedia community has done a fantastic job of generating
> valid XHTML:

Well, rather, MediaWiki has done a good job there, despite all
attempts by the community. ;) Community inputs tag soup, MediaWiki
converts to valid XHTML. But that's purely syntactic. You can tell
from the extensive usage of tables that Wikipedians don't care about
standards or theoretical purity, they just try to get things to work
right. That has to be taken into account.

On Sat, Jan 16, 2010 at 5:39 PM, Platonides <Platonides@gmail.com> wrote:
> Perhaps we shouldn't provide the full power of RDF or Microdata yet, and
> provide instead an extension able to handle a subset, using one or the other.

What sort of user-visible syntax would you suggest? We'd still have
to use either RDFa or microdata for the actual output, so it doesn't
save us much.

On Sat, Jan 16, 2010 at 7:09 PM, Happy-melon <happy-melon@live.com> wrote:
> I know sod all about either of them except what has been posted
> here, but I see that they're incredibly similar, but just different enough
> to be incompatible; and I see that they are both horribly difficult for the
> lay-editor to use.  By that I mean that the discussion between "oh this one
> only requires us to put in two new attributes instead of three" misses the
> elephant in the room: *both* formats require us to whitelist and start
> filling our wikitext with the HTML tag that the most iconic piece of
> wikimarkup, the double brackets, have kept hidden for nine years.

I don't think microdata is harder to use than HTML generally. It's
sure a lot easier to use than wikitext template syntax (look at some
of those enwiki monstrosities).

> and b) even the most
> careful implementation is going to manifest itself in article wikitext along
> the lines of ""{{person|John Smith}}, born {{birthdate|12 June 1987}}, was a
> {{occupation|football player}} for {{organisation|Puddlemere United}}"".  Or
> something like that.

No, I don't think we'd do that at all. We'd add microdata (or RDFa)
to things like license templates, and maybe infobox templates. So
this would all be hidden behind templates people are already using
anyway. The goal is immediately useful metadata like licenses -- we
want web crawlers to be able to automatically tell what licenses
images are under, say. Abstract data like what you're marking up
shouldn't be embedded in the HTML output; it should be entered
through infoboxes (since people do that anyway).

> There seem to be two usecases for these systems.  First, marking up the
> 'stuff' that MediaWiki serves: images, copyright links, author links, etc.
> That requires MW to be able to get hold of the raw data for, for instance,
> an image license; and that's begging for things like new magic words to put
> on the image description page, not for enabling either format directly in
> wikitext.  The only reason to do *that*, is to support editors marking up
> *their own stuff*, and that's where we have problems.

I don't follow. Why can't you just alter {{cc-by-2.0}} or whatever on
Commons so it outputs the right markup? MediaWiki doesn't have to do
anything beyond allowing the markup to begin with.
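
As a rough sketch of what such a template's output could look like, the
snippet below builds microdata markup in Python. The item type is the one
cited elsewhere in this thread; the "license" property name, the function
name, and the sample file name are assumptions for illustration, not
anything Commons actually emits.

```python
from html import escape

# Item type quoted in this thread; the "license" property name is an assumption.
WORK_ITEMTYPE = "http://n.whatwg.org/work"


def license_box(title: str, license_url: str, license_name: str) -> str:
    """Return the HTML a hypothetical license template might emit."""
    return (
        f'<div itemscope itemtype="{WORK_ITEMTYPE}">'
        f'<span itemprop="title">{escape(title)}</span> is licensed under '
        f'<a itemprop="license" href="{escape(license_url)}">'
        f"{escape(license_name)}</a>."
        "</div>"
    )


print(license_box(
    "Example.jpg",
    "http://creativecommons.org/licenses/by/2.0/",
    "CC BY 2.0",
))
```

The point is only that the markup lives in one place (the template), so
editors who already use the template never see it.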

> TLDR version: jumping on either bandwagon is neither necessary nor sensible,
> and we should avoid getting drawn into the issue.

I would agree, except that we have an immediate potential use: marking
up image licenses so image crawlers know how the images are licensed.
Google already hardcodes Wikipedia licenses, apparently, but we should
use standards-based machine-readable markup for the benefit of all the
other MediaWikis, and any Wikimedia wikis they haven't hardcoded, and
Commons too if they change a template name or something and break the
scraping, etc. This is why Duesentrieb added the feature. Unless we
all agree it's not worth getting into this for the sake of that
use-case, we do have to address the issue now.

On Sat, Jan 16, 2010 at 7:13 PM, Manu Sporny <msporny@digitalbazaar.com> wrote:
> Just to be clear - I'm not trying to propose that wikipedia editors
> should start writing wiki markup interleaved with RDFa/Microdata. Quite
> the opposite - I think that allowing contributors to hand author RDFa or
> Microdata would be a very bad idea for Wikipedia. However, it seems like
> what you are saying is that interleaving HTML like this is not possible
> anyway - which is a good thing, IMHO.

HTML can be interleaved with wikitext. This is needed because all
templates are written in wikitext, for instance. Templates are just
chunks of wikitext that can get included in other pages, optionally
with some predefined parameters substituted with strings of yet more
wikitext. So MediaWiki recursively substitutes all templates (along
with other things like conditional constructs) with their wikitext
output before evaluating the whole resulting mess as a single wikitext
string.
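
A toy model of that substitution loop, in Python. This is a sketch under
heavy assumptions, not MediaWiki's actual preprocessor: the TEMPLATES
store, the regex, and the depth limit are all invented here, and only
named parameters are handled.

```python
import re

# Hypothetical template store: name -> wikitext body with {{{param}}} placeholders.
TEMPLATES = {
    "cc-by-2.0": "{{license|name=CC BY 2.0}}",
    "license": '<div class="license">Licensed under {{{name}}}</div>',
}

# Matches an innermost {{name|k=v|...}} call; the lookbehind skips {{{param}}}.
CALL = re.compile(r"(?<!\{)\{\{([^{}|]+)((?:\|[^{}]*)?)\}\}")


def expand(wikitext: str, depth: int = 0) -> str:
    """Substitute template calls repeatedly until the text stops changing."""
    if depth > 40:
        raise RecursionError("template loop suspected")

    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        if name not in TEMPLATES:
            return match.group(0)  # leave unknown templates untouched
        args = dict(
            part.split("=", 1)
            for part in match.group(2).split("|")[1:]
            if "=" in part
        )
        body = TEMPLATES[name]
        for key, value in args.items():
            body = body.replace("{{{" + key.strip() + "}}}", value.strip())
        return body

    expanded = CALL.sub(substitute, wikitext)
    return expanded if expanded == wikitext else expand(expanded, depth + 1)


print(expand("Photo: {{cc-by-2.0}}"))
```

Any HTML in a template body passes straight through to the final string,
which is why the markup has to be allowed in wikitext at all.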

> Does anybody have a link to a previous discussion about how to get
> Wikipedia to output the same data that dbpedia.org is publishing?

As far as I can tell, dbpedia.org just has people manually sift
through Wikipedia templates and translate them to RDF. Things like
infoboxes naturally lend themselves to users inputting key-value
pairs, which can easily be translated to RDF triples. I don't think
we should use either microdata or RDFa for this kind of data-mining
use-case -- it would be way too much markup and not useful to
practically any viewers. People who want to data-mine can use a
separate data stream, possibly RDF, possibly autogenerated by
MediaWiki. Inline metadata is only ideal for things you want
browsers, search engines, etc. to see.

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sat, Jan 16, 2010 at 8:25 PM, Aryeh Gregor
<Simetrical+wikilist@gmail.com> wrote:
> Microdata is also safe to use for deployment.  Like other web
> technologies maintained by the WHATWG, it will not change once it's
> widely adopted, and Wikipedia adoption would probably count as wide
> adoption by itself.  Note that microdata, like all of HTML5, is at
> Last Call at the WHATWG, independent of its status as Working Draft in
> the W3C.
>
> I've asked Hixie how stable Microdata is.  Since he's the sole person
> who decides on changes to HTML5 at the WHATWG, as you know, his answer
> should be authoritative.

[100116 20:35:42] <AryehGregor> I assume that if Wikipedia starts
using it on a large scale and we do a MediaWiki release and such,
though, you won't change it after that and break all our content,
right?
[100116 20:35:56] <Hixie> correct

So it's certainly stable enough for us to use.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Hello,

The discussion so far has been about biographical data on Wikipedia
and licensing data on Commons, but other projects have their own needs
for it.

Wikisource, especially, is in desperate need of metadata. We have some
140,000 pages on the English wiki alone that represent poems,
chapters, tables of contents, and so forth. These are essentially
disorganized: we have human-usable templates and categories, but
there's really no good way to find works besides searching their
titles.

A few years ago we combined our metadata templates into two standard
templates, {{header}} (for works) and {{author}} (for authors). Every
single page already provides metadata to these templates, so
implementing a metadata format for machine use is trivial once it is
available on MediaWiki. We *really* want this; it would allow us to
index our jumbled pile of works and authors in all sorts of very
useful and interesting ways. Just a few examples are author search and
autocompletion (we currently list works manually), finding works by
genre and year and subject and so forth, searching work descriptions,
and distinguishing works from subpages.

Both formats have their own advantages and disadvantages. Microdata's
simplicity is a significant advantage, but RDFa's built-in validation
is also nice. Whichever format we choose, we'll make it all work
behind the scenes in the murky depths of our templates. But it would
be nice if you'd include creative works, authors, navigation, and
indexes in the equation. There's more here than biographies and image
licenses. :)

--
Yours cordially,
Jesse Plamondon-Willard

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Trying my best to limit length of reply.

On Sat, Jan 16, 2010 at 23:16, Manu Sporny <msporny@digitalbazaar.com> wrote:
> Philip Jägenstedt wrote:

>> [ed: Microdata] maps well to the
>> RDF model if you want it, but doesn't force authors to think in terms
>> of subject, predicate, object triples.
>
> Well, Microdata /almost/ maps to the RDF model. Microdata doesn't
> support RDF literal typing, which is basically a fancy way of saying
> that you can't verify that weights, volumes, speeds, the full range of
> dates in different calendars, encodings such as chemical compositions,
> and varying other typed information is expressed cleanly by the
> Wikipedia contributors.
>
> So, if you wanted to say something like this:
>
> The speed of light is 299792458 m/s.
>
> You would do this in RDFa:
>
> <div about="#light">
> The speed of light is <span property="measure:speed"
> datatype="measure:meters-per-second">299792458</span> m/s.
> </div>
>
> which would generate the following triple:
>
> <#light>
>   measure:speed
>      "299792458"^^measure:meters-per-second .
>
> AFAIK, there is no way to do the equivalent in Microdata, is there Philip?

The datatype is part of the vocabulary. If you want to validate your
data, you validate it against the vocabulary, not against what the
author claims. For example, you'll see that the vCard vocabulary defines its
own datatypes: http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#vcard

Allowing different types to be mixed (like m/s and km/h) seems risky,
but you are correct that it is one of the things in the RDF model that
can't be expressed directly using microdata.

> The above is how you would do it in RDFa. Philip, I haven't seen any
> work related to this in Microdata - have there been any recent
> developments with regard to data validation in Microdata?

There is nothing like automatic validation; your software has to
understand a certain vocabulary to be able to say whether the data
conforms to the constraints of that particular vocabulary. (I don't know if
this is any different from the RDF model or if RDF software is able to
"automatically" learn how to validate measure:meters-per-second from
just seeing the string "measure:meters-per-second".)
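
To make "validate against the vocabulary" concrete, here is a minimal
Python sketch. The VOCABULARY table and its two properties are made up
for illustration; a real consumer would encode the rules of whatever
vocabulary it actually understands.

```python
import re

# Hypothetical vocabulary: property name -> check the consumer applies.
# The author's markup carries no datatype; the rules live in the consumer.
VOCABULARY = {
    # Assumed to be a plain decimal number of metres per second.
    "speed": lambda value: re.fullmatch(r"\d+(\.\d+)?", value) is not None,
    # Assumed to be any non-empty text.
    "title": lambda value: bool(value.strip()),
}


def validate(item: dict) -> list:
    """Return a list of problems found in one item's property values."""
    errors = []
    for prop, value in item.items():
        check = VOCABULARY.get(prop)
        if check is None:
            errors.append(f"unknown property: {prop}")
        elif not check(value):
            errors.append(f"invalid value for {prop}: {value!r}")
    return errors


print(validate({"speed": "299792458"}))
print(validate({"speed": "fast", "colour": "red"}))
```

Software that has never heard of the vocabulary has no entry in the table
and cannot validate anything, which is the limitation described above.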

>>> So, we get more-or-less the same number of data items out, but there is
>>> a problem. What does "title" mean in the semantic sense? Does it mean
>>> "job title" or does it mean "work title"? The term "title" in this case
>>> is ambiguous.
>>
>> No, as long as an item type is used (http://n.whatwg.org/work) there
>> is no ambiguity. This particular item type is defined at
>> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#licensing-works
>>
>> Title here "Gives the name of the work." without ambiguity.
>
> This is new! I'm glad this issue was addressed in Microdata as it was
> one of my criticisms of it when I last read the Microdata spec about six
> months ago. Looks like that section of the spec was last changed on
> October 23rd 2009? Do you know when this was put in there, Philip?

Originally microdata used item="http://n.whatwg.org/work", but even
then there was no ambiguity about what a particular property meant.

> What happens when an author forgets to include itemtype? So, if somebody
> does this:
>
> <div itemscope>
> <span itemprop="title">Emery Molyneux Terrestrial Globe</span>
> </div>
>
> There's nothing to ground the "title" property. The way I'm reading the
> spec, it becomes ambiguous at that point, right?

As Aryeh said, it's not ambiguous, it's meaningless. Microdata allows
typeless items for site-private use (much like data-*), but such data
*should not* be used by external parties and is in fact ignored by the
RDF extraction algorithm.
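
A small Python sketch of a consumer that behaves this way, using only the
standard library's html.parser. It is a simplification, not the spec's
extraction algorithm: it handles flat items with text-valued properties
only, and the class and function names are invented here.

```python
from html.parser import HTMLParser


class ItemExtractor(HTMLParser):
    """Collect microdata items with text properties (flat items only)."""

    def __init__(self):
        super().__init__()
        self.items = []  # every item seen, typed or not
        self.open = []   # stack of (tag, item-or-None, itemprop-or-None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        item = None
        if "itemscope" in attrs:
            item = {"type": attrs.get("itemtype"), "properties": {}}
            self.items.append(item)
        self.open.append((tag, item, attrs.get("itemprop")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if any(open_tag == tag for open_tag, _, _ in self.open):
            while self.open.pop()[0] != tag:
                pass

    def handle_data(self, data):
        prop = None
        for _, item, name in reversed(self.open):
            if item is not None:
                if prop:
                    props = item["properties"]
                    props[prop] = props.get(prop, "") + data
                return
            prop = prop or name


def typed_items(html: str) -> list:
    """Return only items that declare an itemtype; typeless ones are skipped."""
    parser = ItemExtractor()
    parser.feed(html)
    parser.close()
    return [item for item in parser.items if item["type"]]


typeless = '<div itemscope><span itemprop="title">Emery Molyneux Terrestrial Globe</span></div>'
print(typed_items(typeless))
```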

> ... and with the added danger of expressing ambiguous data. This is not
> the real danger, though. While data ambiguity is really bad when it
> comes to data stores, centralized vocabulary management is even worse.

Anyone can make up a vocabulary, just point to it in itemtype. The
WHATWG maintains a few core vocabularies, but I expect that new
vocabularies will be developed independently by communities like
microformats.

> Philip, could you give us an update on what the WHATWG sees as the
> publishing process for Microdata vocabularies? For example, if Wikipedia
> wanted to start expressing royal bloodlines using a vocabulary specific
> to Wikipedia, how would they go about getting that vocabulary into the
> HTML5 Microdata specification?

No process, just do it :)

>> Finally I will note that it is very likely that the microdata DOM APIs
>> will get implemented in browsers, making the semantic data available
>> to both scrapers, to native browser interfaces and to browser
>> extensions such as user JavaScript. As an example, you might see an
>> icon in the address bar for saving events to a calendar, or the
>> license information of an image displayed in the native properties
>> dialog. I stress again that I don't make any promises on behalf of
>> Opera or any other browser vendor, these are just my predictions.
>
> Again, this is exciting news and while I don't think Microdata is the
> proper solution for the Web, for the same reasons that are outlined
> above and many more, I'm delighted to hear that Opera is taking
> in-browser semantic data expression very seriously. How far we have come
> in just 18 months! :)

I will stress again that I don't speak for Opera in these matters, but
I do think that microdata in many ways bridges the gap between the
"browsable web" and the "semantic web" (actually, there is only one
web). Browsers already do add some UI features based on the data in
documents (apart from rendering), e.g. exposing RSS feeds in the
address bar or navigating to the next page based on rel="next".
Microdata isn't really new in that regard, it just adds some new data
for browsers to expose.

--
Philip Jägenstedt

Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sat, Jan 16, 2010 at 9:07 PM, Jesse (Pathoschild)
<pathoschild@gmail.com> wrote:
> Wikisource, especially, is in desperate need of metadata. We have some
> 140,000 pages on the English wiki alone that represent poems,
> chapters, tables of contents, and so forth. These are essentially
> disorganized: we have human-usable templates and categories, but
> there's really no good way to find works besides searching their
> titles.
>
> A few years ago we combined our metadata templates into two standard
> templates, {{header}} (for works) and {{author}} (for authors). Every
> single page already provides metadata to these templates, so
> implementing a metadata format for machine use is trivial once it is
> available on MediaWiki. We *really* want this; it would allow us to
> index our jumbled pile of works and authors in all sorts of very
> useful and interesting ways. Just a few examples are author search and
> autocompletion (we currently list works manually), finding works by
> genre and year and subject and so forth, searching work descriptions,
> and distinguishing works from subpages.

What we're talking about (microdata, RDFa, RDF, etc.) is categorically
useless for Wikimedia-internal use. The only use that any of this
metadata stuff has to us is exposing info to *non*-Wikimedia agents.
For internal use, we can make up our own custom formats and use plain
old database queries much more easily than resorting to any standard
format.

For instance, we have lots of images on Commons under various
licenses. *We* know which license each is under, because we use
MediaWiki's category system. But *other* people (e.g., search
engines) also want to know what licenses our images are under. So for
this we want a standard format like microdata or RDFa, so they don't
have to keep track of our internal data formats.

What Wikisource needs here is a MediaWiki extension. Standard
metadata languages are not going to help at all. If no one is willing
to write an extension for it now, no one will be willing with RDF
support -- since that won't make the job the slightest bit easier.

> Both formats have their own advantages and disadvantages. Microdata's
> simplicity is a significant advantage, but RDFa's built-in validation
> is also nice.

Neither has more built-in validation than the other. Both allow
arbitrary validation. RDFa seems to allow validation to be encoded in
a more machine-readable format, but whether that's an advantage at all
is debatable. HTML5 does not provide a DTD, XML Schema, or any other
machine-readable language description, for good reason.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sat, Jan 16, 2010 at 9:37 PM, Aryeh Gregor
<Simetrical+wikilist@gmail.com> wrote:
> What we're talking about (microdata, RDFa, RDF, etc.) is categorically
> useless for Wikimedia-internal use.  The only use that any of this
> metadata stuff has to us is exposing info to *non*-Wikimedia agents.
> For internal use, we can make up our own custom formats and use plain
> old database queries much more easily than resorting to any standard
> format.
> [...]
> For instance, we have lots of images on Commons under various
> licenses. *We* know which license each is under, because we use
> MediaWiki's category system.

Unfortunately, categories and database queries are inadequate for our
needs. Someone can indeed navigate to Categories::Works::Works by
genre::Non-fiction::Governmental::Biographies::Ancient biographies,
and they'll find all 5 pages that someone thought to categorize to
this depth. But if someone hopes to find our 1872 American
biographies, they are going to be sorely disappointed.

Metadata, whether a standard or internal format, allows machines to
extract this data from template output and store it in a database for
human use. If you want 1872 American biographies mentioning a Willard,
just fill in the year, location, and description fields, and check off
the relevant genres from the database. This will return a list of
actual works that match the exact criteria given, not subpages or
mid-text false matches which are the best we can get now.

If we simply extend MediaWiki to support metadata for works or
authors, the metadata is limited to these types and fields. Public
metadata can be extended and parsed in any way the local community or
our content users feel useful. Users can add their own metadata
(translators? publishers? work licenses?) to templates, and add their
own tools and databases to the collection.

This is also not possible with database queries, since the metadata is
not provided to the software except as part of the wiki text. It's
conceivable to extract it directly from the wiki text of a wiki dump,
but this would be horrendously complex given the number of different
options and combinations. It's possible to use an internal Wikimedia
format, but this would be useless outside Wikimedia.

There is very little difference between internal and external use;
it's no easier for a Wikisource editor to find those 1872 American
biographies. Editors are also users. Categories are inadequate beyond
the simplest one-dimensional criteria.

So, these metadata formats are definitely *not* useless for internal
community use.

--
Yours cordially,
Jesse Plamondon-Willard

Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sat, Jan 16, 2010 at 10:07 PM, Jesse (Pathoschild)
<pathoschild@gmail.com> wrote:
> Unfortunately, categories and database queries are inadequate for our
> needs. Someone can indeed navigate to Categories::Works::Works by
> genre::Non-fiction::Governmental::Biographies::Ancient biographies,
> and they'll find all 5 pages that someone thought to categorize to
> this depth. But if someone hopes to find our 1872 American
> biographies, they are going to be sorely disappointed.

You can do this with database queries fine -- there are already
several different toolserver tools that will do category intersections
for you, and a couple extensions. In fact, bog-standard search will
do it for you, although AFAIK only for categories added literally (not
by templates):

http://en.wikipedia.org/w/index.php?title=Special:Search&redirs=1&search=incategory:"Living+people"+incategory:"1944+births"&fulltext=Search&ns0=1

It wouldn't be that hard to allow template-added categories too. I
assume you have categories like "books published in America", "books
published in 1872", and "biographies" -- if not, you can easily add
them via your templates (although that wouldn't work right now with
standard search AFAIK, it would work with things like CatScan).
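
To illustrate the kind of query involved, here is a minimal Python sketch
using an in-memory SQLite table loosely modelled on MediaWiki's
categorylinks table. It is simplified: real cl_from values are page IDs
rather than titles, and the page names and categories below are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified stand-in for categorylinks: one row per (page, category) pair.
db.execute("CREATE TABLE categorylinks (cl_from TEXT, cl_to TEXT)")
db.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        ("Work A", "Biographies"),
        ("Work A", "Books published in 1872"),
        ("Work A", "Books published in America"),
        ("Work B", "Books published in 1872"),
        ("Work B", "Books published in America"),
        ("Work C", "Biographies"),
    ],
)


def in_all_categories(categories):
    """Return pages that are members of every listed category."""
    marks = ",".join("?" * len(categories))
    rows = db.execute(
        f"SELECT cl_from FROM categorylinks WHERE cl_to IN ({marks}) "
        "GROUP BY cl_from HAVING COUNT(DISTINCT cl_to) = ? ORDER BY cl_from",
        [*categories, len(categories)],
    )
    return [row[0] for row in rows]


print(in_all_categories(
    ["Biographies", "Books published in 1872", "Books published in America"]
))
```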

> If we simply extend MediaWiki to support metadata for works or
> authors, the metadata is limited to these types and fields. Public
> metadata can be extended and parsed in any way the local community or
> our content users feel useful.

Sure, but this is not internal use, so not relevant to my last post.

> This is also not possible with database queries, since the metadata is
> not provided to the software except as part of the wiki text.

It is if you use categories. It would also be possible to hack up
some tool to store all template parameter-value pairs, which are
strikingly similar to the idea of RDFa triples: (article,
template+parameter name, parameter value).
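
A minimal Python sketch of such a tool, assuming un-nested templates with
named parameters only. The regex, the function name, and the "header"
parameter names in the sample are assumptions for illustration, not an
existing MediaWiki feature.

```python
import re

# Matches a {{template|...}} call that has parameters and no nested braces.
TEMPLATE_CALL = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def template_triples(article, wikitext):
    """Return (article, template.parameter, value) triples found in wikitext."""
    triples = []
    for match in TEMPLATE_CALL.finditer(wikitext):
        template = match.group(1).strip()
        for part in match.group(2).split("|"):
            if "=" not in part:
                continue  # positional parameters are ignored in this sketch
            key, value = part.split("=", 1)
            triples.append((article, f"{template}.{key.strip()}", value.strip()))
    return triples


sample = "{{header|title=Walden|author=Henry David Thoreau}}"
for triple in template_triples("Walden", sample):
    print(triple)
```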

> There is very little difference between internal and external use;
> it's no easier for a Wikisource editor to find those 1872 American
> biographies. Editors are also users.

By "internal use" I mean "use by software designed only to work with
MediaWiki", not "use by Wikimedia users". Standards are only needed
if we want to be useful to software that's also meant to work with
other sites. That way, the software can use the same code to process
both our site and the other sites, since all output the same standard
markup. If the software is only processing MediaWiki sites to begin
with, then standard markup is useless. (Unless it happens to expose
convenient libraries, like with XML or such -- but that's probably not
the case here.)

> So, these metadata formats are definitely *not* useless for internal
> community use.

No, they really are. It's almost certainly more work for us to use a
standard of any kind than to make up our own internal format, so if we
only care about internal use, bothering with standards is
counterproductive. The real use-cases are for external users only.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Jesse (Pathoschild) <pathoschild <at> gmail.com> writes:
> If we simply extend MediaWiki to support metadata for works or
> authors, the metadata is limited to these types and fields. Public
> metadata can be extended and parsed in any way the local community or
> our content users feel useful. Users can add their own metadata
> (translators? publishers? work licenses?) to templates, and add their
> own tools and databases to the collection.


Hi Jesse,

the use you may need seems to be a lot like what Semantic MediaWiki is offering.
I don't know if Wikisource would consider it, but adding user-curated metadata
using a user-generated vocabulary, and being able to query it internally (as
well as exporting it externally) is pretty much what we do.

If you have any questions on it, feel free to contact me.

Cheers,
denny




Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sun, Jan 17, 2010 at 9:20 AM, Denny Vrandecic
<denny.vrandecic@kit.edu> wrote:
> the use you may need seems to be a lot like what Semantic MediaWiki is offering.
> I don't know if Wikisource would consider it, but adding user-curated metadata
> using a user-generated vocabulary, and being able to query it internally (as
> well as exporting it externally) is pretty much what we do.

The major problem with SMW in the past has been, AFAIK, that it's an
enormous amount of code written totally separately from MediaWiki by
different people, and would need to be reviewed in its entirety by
someone like Tim Starling before it could be enabled on any Wikimedia
site. I recall Tim looking briefly at the code and taking a few
minutes to find an XSS exploit. There are also likely to be major
performance issues scaling to Wikipedia (correct me if I'm wrong). So
I wouldn't bet on any progress here anytime soon, especially since
we're way behind on reviewing even existing core code, let alone large
new extensions.

A much more probable method of progress would be to try committing
more modest features incrementally to core, or to small
special-purpose extensions. I don't think it would be very hard at
all to have the API output a machine-readable summary of the template
parameters used on a given page. I might do that today as a
proof-of-concept. If I do, then someone familiar with RDF and PHP
could probably write a fairly simple patch to turn this code into RDF
output. From there it would be pretty simple to write a maintenance
script to output RDF for the template parameters on all pages on a
wiki, and we could see about incorporating that into the regular
Wikipedia data dump.
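
For a sense of scale, the RDF step could be about this small. The Python
sketch below serializes a per-page parameter summary as N-Triples; the
input shape, the example.org URIs, and the function names are all
assumptions, since no such API output exists yet.

```python
from urllib.parse import quote

# Hypothetical URI prefixes; a real wiki would choose its own.
PAGE_BASE = "http://example.org/wiki/"
PARAM_BASE = "http://example.org/template-param/"


def literal(value: str) -> str:
    """Quote a string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(page: str, templates: dict) -> str:
    """templates maps template name -> {parameter name: value}."""
    subject = f"<{PAGE_BASE}{quote(page.replace(' ', '_'))}>"
    lines = []
    for template, params in templates.items():
        for name, value in params.items():
            predicate = f"<{PARAM_BASE}{quote(template)}/{quote(name)}>"
            lines.append(f"{subject} {predicate} {literal(value)} .")
    return "\n".join(lines)


print(to_ntriples("Walden", {"header": {"author": "Henry David Thoreau"}}))
```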

Notably, this doesn't try to actually use the data on the wiki, so
should have no scalability issues. It should also be small enough to
put in core with no problems, so all MW wikis could be outputting RDF
for their template parameters out of the box. My understanding is
that it's expected that data providers may output RDF in whatever
format is convenient to them, and someone will have to write OWL to
turn this into more conventional formats. But we can output the raw
data reasonably easily, at least.

Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Jan 17, 2010, at 16:11, Aryeh Gregor wrote:

> On Sun, Jan 17, 2010 at 9:20 AM, Denny Vrandecic
> <denny.vrandecic@kit.edu> wrote:
>> the use you may need seems to be a lot like what Semantic MediaWiki is offering.
>> I don't know if Wikisource would consider it, but adding user-curated metadata
>> using a user-generated vocabulary, and being able to query it internally (as
>> well as exporting it externally) is pretty much what we do.
>
> The major problem with SMW in the past has been, AFAIK, that it's an
> enormous amount of code written totally separately from MediaWiki by
> different people, and would need to be reviewed in its entirety by
> someone like Tim Starling before it could be enabled on any Wikimedia
> site. I recall Tim looking briefly at the code and taking a few
> minutes to find an XSS exploit. There are also likely to be major
> performance issues scaling to Wikipedia (correct me if I'm wrong). So
> I wouldn't bet on any progress here anytime soon, especially since
> we're way behind on reviewing even existing core code, let alone large
> new extensions.

I was not talking about Wikipedia -- our scalability tests suggest that it could work there, although it is hard to say in advance without testing on the actual WMF server farm. I am merely talking about Wikisource, and wondering if it could be used to solve the problems they have, right now.

Furthermore, the code has had some peer review by now; it is used by sites like Wikia. Our code is getting smaller and we are incorporating comments. It would be great to get further reviews.

So, as said, I am only talking about Wikisource. I think it could be a viable solution for them.

> Notably, this doesn't try to actually use the data on the wiki, so
> should have no scalability issues. It should also be small enough to
> put in core with no problems, so all MW wikis could be outputting RDF
> for their template parameters out of the box. My understanding is
> that it's expected that data providers may output RDF in whatever
> format is convenient to them, and someone will have to write OWL to
> turn this into more conventional formats. But we can output the raw
> data reasonably easily, at least.

Since Wikisource's requirements seem to call for the wiki itself to store and use the data (e.g. give me all the chapters, in order, of the book written by X between 1920 and 1940), I was wondering if an extension that does that could be helpful. It is of course entirely possible to have the metadata generated by the RDFa extension, harvested by an external tool, the queries processed by an external tool, and the result uploaded to the wiki. It may be a bit easier for Wikisource if the wiki did it, since it could potentially enable more users to perform these tasks.

In the case of Wikisource I'd further suggest switching off the additional annotation syntax of SMW and going for a mode where the templates do the whole annotation, but that again is an implementation detail that has to be decided by the Wikisource community.

Cheers,
denny
Re: RDFa and Microdata in MediaWiki [ In reply to ]
* Aryeh Gregor <Simetrical+wikilist@gmail.com> [Sat, 16 Jan 2010
23:06:06 -0500]:
> You can do this with database queries fine -- there are already
> several different toolserver tools that will do category intersections
> for you, and a couple extensions. In fact, bog-standard search will
> do it for you, although AFAIK only for categories added literally (not
> by templates):
>
> http://en.wikipedia.org/w/index.php?title=Special:Search&redirs=1&search=incategory:"Living+people"+incategory:"1944+births"&fulltext=Search&ns0=1
>
Intersections are probably inefficient when someone needs a range search
between, let's say, 1944 and 1965. SMW probably has the right approach:
something sequential and numerical like a date, mass, or speed should
not be a Category but a Property.
Also, it's a bit sad that so many toolserver tools are standalone and
are not part of the MediaWiki distribution. That tool should be a part of
Special:Search.
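
To illustrate the range-search point above, here is a minimal Python
sketch. It is not how SMW or MediaWiki search is implemented; the page
titles, the "<year> births" category naming, and both function names are
hypothetical. It only shows why a numeric property with a sorted index
answers a range query with two lookups, while per-year categories need
one lookup per year in the range.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data: page title -> birth year (a numeric "property").
birth_year = {
    "Page A": 1944,
    "Page B": 1950,
    "Page C": 1965,
    "Page D": 1970,
}

# Category approach: one set of pages per "<year> births" category.
categories = {}
for title, year in birth_year.items():
    categories.setdefault(f"{year} births", set()).add(title)


def range_via_categories(first, last):
    """Union one category per year: (last - first + 1) category lookups."""
    result = set()
    for year in range(first, last + 1):
        result |= categories.get(f"{year} births", set())
    return result


# Property approach: a single sorted index over the numeric value.
index = sorted((year, title) for title, year in birth_year.items())
years = [year for year, _ in index]


def range_via_property(first, last):
    """Two binary searches on the sorted index, then one contiguous slice."""
    lo = bisect_left(years, first)
    hi = bisect_right(years, last)
    return {title for _, title in index[lo:hi]}
```

Both functions return the same pages for 1944 to 1965; the difference is
that the category version touches 22 separate categories to do so.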

> It wouldn't be that hard to allow template-added categories too. I
> assume you have categories like "books published in America", "books
> published in 1872", and "biographies" -- if not, you can easily add
> them via your templates (although that wouldn't work right now with
> standard search AFAIK, it would work with things like CatScan).
>
When it comes to subcategories, I always wondered why they have to
include the name of the parent category:
http://en.wikipedia.org/wiki/Category:Books
The word "Books" is repeated many times through the nested categories,
although we already know these are the "Books".
However, this brings up the problem of "de-parenting" categories,
which is hard to resolve, because categories are part of the source
text.
Perhaps a full category name and a shortened subcategory alias could be
defined on NS_CATEGORY pages.
Dmitriy

Re: RDFa and Microdata in MediaWiki [ In reply to ]
On Sun, Jan 17, 2010 at 11:32 AM, Denny Vrandecic
<denny.vrandecic@kit.edu> wrote:
> I was not talking about Wikipedia. Our scalability tests suggest that it could work there, but it is hard to say in advance without testing on the actual WMF server farm. I am merely talking about Wikisource, and wondering if it could be used to solve the problems they have right now.

The code still must undergo security review to be enabled on any
Wikimedia site. As I said, we don't even have enough reviewers right
now to review core code, let alone large new extensions, so it's
really not likely in the near future. Even small extensions would
probably have a hard time getting enabled right now.

On Sun, Jan 17, 2010 at 11:40 AM, Dmitriy Sintsov <questpc@rambler.ru> wrote:
> Intersections are probably inefficient when someone needs a range search
> between, let's say, 1944 and 1965. SMW probably has the right approach:
> something sequential and numerical like date, mass, or speed should not
> be a Category but a Property.

Yes, that would be awkward to phrase in Lucene search. The point is,
anyway, that enabling something like SMW (probably with fewer
features) is orthogonal to RDFa/microdata/RDF support -- the extension
could incidentally output RDF or whatnot, but it doesn't matter for
internal use.

> Also, it's a bit sad that so many toolserver tools are standalone and
> are not a part of MediaWiki distribution. That tool should be a part of
> Special:Search.

Most toolserver tool authors just don't bother applying for commit
access, for whatever reason. Most tools also perform badly, would need
to be rewritten to meet coding standards, or both. Toolserver roots
routinely have to kill processes for using up unreasonable amounts of
resources.

> When it comes to subcategories, I always wondered why they have to
> include the name of the parent category:
> http://en.wikipedia.org/wiki/Category:Books
> The word "Books" is repeated many times through the nested categories,
> although we already know these are the "Books".

Because categories in MediaWiki form a directed graph, not a tree.
Categories don't have a unique parent. Whether this is good or bad is
debatable.
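
A small Python sketch of the graph-versus-tree point. The category names
and the `parents` mapping are hypothetical, not taken from any real wiki
or from MediaWiki's schema. Because a category can have several parents,
there is no single parent whose name could safely be dropped from the
child's name.

```python
# Hypothetical category graph: child category -> set of parent categories.
parents = {
    "Books by country": {"Books", "Works by country"},
    "American books": {"Books by country", "American literature"},
    "Books": set(),
    "Works by country": set(),
    "American literature": set(),
}


def ancestors(category):
    """Collect every category reachable by following parent links."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, set()):
            if parent not in seen:  # also stops on cycles, which a graph permits
                seen.add(parent)
                stack.append(parent)
    return seen
```

Here "American books" has two direct parents and four ancestors in
total, only some of which are "Books" categories.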

Re: RDFa and Microdata in MediaWiki [ In reply to ]
Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:

> [...]
>> Also, it's a bit sad that so many toolserver tools are standalone and
>> are not a part of MediaWiki distribution. That tool should be a part of
>> Special:Search.

> Most toolserver tool authors just don't bother applying for commit
> access, for whatever reason. Most tools also perform badly, would need
> to be rewritten to meet coding standards, or both. Toolserver roots
> routinely have to kill processes for using up unreasonable amounts of
> resources.
> [...]

How many of those tools would stand a chance, as extensions, of not
being disabled in $wgMiserMode? If such a small feature as the
namespace filter in Special:Linksearch risks server meltdown
(bug #10593), I doubt more complex searches are on the horizon.

Tim


Re: RDFa and Microdata in MediaWiki [ In reply to ]
Aryeh Gregor wrote:
> RDFa is a way to embed data in HTML more robustly than with attributes
> like class and title, which are reserved for author use or have
> existing functionality. It allows you to specify an external
> vocabulary that adds some semantics to your page that HTML is not
> capable of expressing by itself.

More to the point, it allows an RDF graph to be overlaid onto an XHTML document so that the XHTML document and the RDF graph can share some strings. The XHTML data model isn't extended per se. Instead, a separate RDF graph can be extracted.

> Both RDFa+HTML and Microdata are Working Drafts at the W3C right now

It's true that both HTML+RDFa and Microdata have been published in Working Drafts at the W3C. However, Microdata has never been through a Working Group Decision to publish as a First Public Working Draft while HTML+RDFa has. Microdata was added to a Working Draft after FPWD and there has since been a Working Group decision to take Microdata out of that spec.

It is reasonable to expect that soon HTML+RDFa and Microdata could be in the same stage Process-wise, but it's inaccurate to portray them as being at the same stage Process-wise right now.

> I should note that currently Google and a couple of others support
> RDFa but not Microdata.

See http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2009Sep/0126.html (search for the word "deviate").

Manu Sporny wrote:
> The general points that you made were riddled with technical
> inaccuracies, bad advice, and if implemented by the MediaWiki community,
> would have resulted in semantic data that would have been ambiguous at
> best and erroneous at worst.

With that introduction, I think it's fair to evaluate your message for inaccuracies or relevant omissions as well.

> The above could be marked up in RDFa, with pre-defined vocabs, like so:

It should be noted that the concept of "pre-defined vocabs" is neither in the HTML+RDFa draft nor in the RDFa in XHTML spec from the XHTML2 WG.

> <p about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
>    typeof="dctype:StillImage">
> <span property="dc:title">Emery Molyneux Terrestrial Globe</span>
> by <a rel="cc:attributionUrl" href="http://example.org/bob/"
>    property="cc:attributionName">Bob Smith</a>
> is licensed under a <a rel="license"
>    href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
> Commons Attribution-Share Alike 3.0 United States License</a>.</p>

Hiding the CURIE declarations is a common pattern when advocating RDFa: It makes RDFa appear tidier than it is. To write this in RDFa in XHTML (the RDFa spec you say is safe to use for deployment), one would need to declare the CURIE prefixes:

<p xmlns:dctype="http://purl.org/dc/dcmitype/"
   about="EmeryMolyneux-terrestrialglobe-1592-20061127.jpg"
   typeof="dctype:StillImage">
<span xmlns:dc="http://purl.org/dc/elements/1.1/" property="dc:title">Emery Molyneux Terrestrial Globe</span>
by <a xmlns:cc="http://creativecommons.org/ns#" rel="cc:attributionUrl"
   href="http://example.org/bob/"
   property="cc:attributionName">Bob Smith</a>
is licensed under a <a rel="license"
   href="http://creativecommons.org/licenses/by-sa/3.0/us/">Creative
Commons Attribution-Share Alike 3.0 United States License</a>.</p>
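
To make concrete why the declarations matter, here is a minimal Python
sketch of CURIE expansion. It is an illustration only, not an RDFa
processor: the function name `expand_curie` is hypothetical, and the
prefix table simply copies the xmlns declarations from the example
above. A value such as "dc:title" means nothing until its prefix is
mapped to a namespace. Bare values such as rel="license" are not CURIEs
and are handled by separate rules in RDFa; this sketch just rejects
them.

```python
# Prefix mappings copied from the xmlns declarations in the example above.
PREFIXES = {
    "dctype": "http://purl.org/dc/dcmitype/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "cc": "http://creativecommons.org/ns#",
}


def expand_curie(curie, prefixes):
    """Expand 'prefix:reference' to a full IRI; fail if the prefix is undeclared."""
    prefix, sep, reference = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + reference
```

With an empty prefix table, as in the tidier-looking version of the
markup, every lookup fails and no property can be resolved.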

Philip Jägenstedt already covered other points about the examples.

> However - XHTML1+RDFa is a published W3C Recommendation and it is safe to use it for deployment.

RDFa in XHTML has indeed been published as a Recommendation jointly by the Semantic Web Deployment Working Group and the XHTML2 Working Group. However, you fail to mention that even though the document mentions "HTML" in its first sentence, all the normative matter concerns strictly XHTML and the document has gone through the W3C Process as a specification that applies to XML.

MediaWiki serves its pages as text/html, so they get processed as HTML. It would therefore be inappropriate to rely on a spec that had been reviewed as an XML spec.

I think it's misleading to promote text/html deployment of specs whose normative matter has been written and reviewed for XML. The most egregious example of this is that the XHTML2 WG has written the normative matter of XHTML 1.x specs for XML but then published a Working Group Note (Notes can be pretty much anything and don't go through the W3C Recommendation track Process) that gives advice on deployment as text/html (http://www.w3.org/TR/xhtml-media-types/).

Furthermore, the ease of getting a spec to REC at the W3C depends on how many people are interested in the spec. The more people are interested in a spec, the more review comments there are. The flip side is that when there's *less* interest in a spec, it's easier to get it to Recommendation due to fewer comments raised. Thus, progress along the REC track isn't a commensurable indicator of technical merit or technical maturity across different specs and WGs.

Also, when assessing the "safe" deployability of RDFa in XHTML, it's relevant to consider that
1) RDFa in XHTML was knowingly (see http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-August/015913.html) progressed on the Recommendation track without resolving how RDFa works with HTML first.
2) An RDFa 1.1 is in the works, and the changes being considered make RDFa 1.0 look like a beta release. (Which is understandable, since a good part of the technical review of RDFa has occurred after RDFa in XHTML was rushed to REC.)

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/



Re: RDFa and Microdata in MediaWiki [ In reply to ]
Since both RDFa and Microdata support the same underlying data model,
and it's likely to take some time to resolve which will be the eventual
winner, perhaps we should decouple the generation of the final HTML
output from the markup of semantic text in articles.

Since it makes no sense to implement yet another incompatible "semantic
wikitext" format for internal use, we will probably end up using
something that is pretty close to one or the other, buried inside
templates, to perform the actual in-wiki markup. Given this, is it worth
considering which is easier for template authors to write, and which is
easier to convert to the other -- RDFa to microdata or vice-versa?
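
One way to picture the decoupling suggested above is a syntax-neutral
description of an item with one renderer per output format. The
following Python sketch is only that, a picture: the `item` structure
and both function names are hypothetical, and the attribute usage is
simplified. In particular, it writes full IRIs into the RDFa attributes,
which RDFa in XHTML 1.0 does not accept (it expects CURIEs with declared
prefixes), and the Microdata attribute names are assumed to be
itemscope, itemtype, and itemprop.

```python
from html import escape

# Hypothetical syntax-neutral description of one annotated item.
item = {
    "type": "http://purl.org/dc/dcmitype/StillImage",
    "properties": [
        ("http://purl.org/dc/elements/1.1/title",
         "Emery Molyneux Terrestrial Globe"),
    ],
}


def render_rdfa(item):
    """Emit RDFa-style attributes (simplified: full IRIs instead of CURIEs)."""
    spans = "".join(
        f'<span property="{escape(name)}">{escape(value)}</span>'
        for name, value in item["properties"]
    )
    return f'<p typeof="{escape(item["type"])}">{spans}</p>'


def render_microdata(item):
    """Emit Microdata-style attributes for the same item."""
    spans = "".join(
        f'<span itemprop="{escape(name)}">{escape(value)}</span>'
        for name, value in item["properties"]
    )
    return f'<p itemscope itemtype="{escape(item["type"])}">{spans}</p>'
```

Template authors would then write against the neutral form, and the
choice of output syntax could be changed in one place later.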

-- Neil

