
Parsoid announcement: Main roundtrip quality target achieved
Hello everyone,

On behalf of the parsing team, here is an update about Parsoid, the
bidirectional wikitext <-> HTML parser that supports Visual Editor,
Flow, and Content Translation.

Subbu.

-----------------------------------------------------------------------
TL;DR:

1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
without introducing semantic diffs[2].
2. With trivial simulated edits, the HTML -> wikitext serializer used
in production (selective serialization) introduces ZERO dirty diffs
in 99.986% of those edits[3]. Of the 23 edits that did produce
dirty diffs, 10 were minor newline diffs.
-----------------------------------------------------------------------

A couple of days back (June 23rd), Parsoid achieved 99.95%[2] semantic
accuracy in the wikitext -> HTML -> wikitext roundtripping process on a
set of about 158K pages randomly picked from about 16 wikis back in 2013.
Keeping this test set constant has let us monitor our progress over time.
We were at 99.75% around this time last year.
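
Concretely, each of the ~158K pages goes through a check that looks
roughly like the following TypeScript sketch. The parseToHtml,
serializeToWikitext, and classifyDiff hooks are illustrative stand-ins
for our harness, not Parsoid's actual API:

    type DiffClass = "none" | "syntactic" | "semantic";

    // One roundtrip check for a single page, parameterized on the
    // parser pair and a diff classifier so the sketch is self-contained.
    function roundtripTest(
      original: string,
      parseToHtml: (wt: string) => string,
      serializeToWikitext: (html: string) => string,
      classifyDiff: (before: string, after: string) => DiffClass
    ): DiffClass {
      const roundtripped = serializeToWikitext(parseToHtml(original));
      if (roundtripped === original) return "none";
      // Not byte-identical: decide whether the difference changes
      // meaning (semantic) or is only syntactic (e.g. whitespace).
      return classifyDiff(original, roundtripped);
    }

Only pages classified "semantic" count against the 99.95% figure.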

What does this mean?
--------------------
* Despite the practical complexities of wikitext, the mismatch between
the processing models of wikitext (string-based) and Parsoid (DOM-based),
and the various wikitext "errors" found on pages, Parsoid is able to
maintain a reversible mapping between wikitext constructs and their
equivalent HTML DOM trees that HTML editors and other tools can
manipulate.

The majority of the differences in the remaining 0.05% arise from
wikitext errors: links nested in links, 'fosterable'[4] content in
tables, and some scenarios with unmatched quotes in attributes. Parsoid
does not support round-tripping (RT) of these.

* While this is not a big change from where things have stood for about
a year now in terms of Parsoid's support for editing, it is a notable
milestone for us in terms of the confidence we have in Parsoid's ability
to handle the wikitext usage seen on production wikis and our ability to
RT it accurately without corrupting pages. This should also boost the
confidence of all applications that rely on Parsoid.

* In production, Parsoid uses a selective serialization strategy that
tries to preserve unedited parts of the wikitext as far as possible.

As part of regular testing, we also simulate a trivial edit by adding
a new comment to the page and running the edited HTML through this
selective serializer. All but 23 pages (0.014% of trivial edits) had
ZERO dirty diffs[3]. Of these 23, 10 of the diffs were minor newline
diffs.
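
Roughly, each simulated edit amounts to the following sketch, where
selectiveSerialize is an illustrative stand-in for Parsoid's selective
serializer rather than its real API:

    // Simulate a trivial edit: append an HTML comment to the page DOM,
    // selectively serialize, and flag any wikitext change beyond the
    // intended one as a dirty diff.
    function trivialEditTest(
      originalWikitext: string,
      html: string,
      selectiveSerialize: (editedHtml: string, originalWt: string) => string
    ): boolean {
      const edited = html.replace("</body>", "<!--rt-test--></body>");
      const newWikitext = selectiveSerialize(edited, originalWikitext);
      // Strip the one intentional change; anything left over is dirty.
      return newWikitext.replace("<!--rt-test-->", "") !== originalWikitext;
    }

The 23 failing pages are the ones for which a check like this comes back
true.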

In production, the dirty diff rate will be higher than 0.014% because
of more complex edits and because of bugs in any of the 3 components
involved in visual editing on Wikipedias (Parsoid, RESTBase[5], and
Visual Editor) and their interactions. But the base accuracy of
Parsoid's roundtripping (in terms of both full and selective
serialization) is critical to ensuring clean visual edits. The above
milestones are part of ensuring that.

What does this not mean?
------------------------
* If you edit one of those 0.05% of pages in VE, the VE-Parsoid combination
will break the page. NO!

If you edit the broken part of the page, Parsoid will very likely
normalize the broken wikitext to the non-erroneous form (break up
nested links, move fostered content out of the table, drop duplicate
transclusion parameters, etc.). In the odd case, it could cause a dirty
diff that changes the semantics of those broken constructs.
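
For example (an illustrative case; the exact output depends on the
page's context), 'fosterable' content sitting inside a table but
outside any cell, as in:

    {|
    stray text
    |-
    | cell
    |}

may, after an edit near it, serialize with the stray text moved in
front of the table, which is where the HTML5 tree builder fosters it:

    stray text
    {|
    |-
    | cell
    |}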

* Parsoid's visual rendering is 99.95% identical to PHP parser
rendering. NO!

RT tests are focused on Parsoid's ability to support editing without
introducing dirty diffs. Even though Parsoid might render a page
differently than the default read view (and might even be incorrect),
we are nevertheless able to RT it without breaking the wikitext.

On the way to 99.95% RT accuracy, we have improved Parsoid's rendering
and fixed several bugs in it. The rendering is also fairly close to the
default read view (otherwise, VE editors would definitely complain).
However, we haven't done sufficient testing to systematically identify
rendering incompatibilities and quantify this. In the coming quarters,
we are going to turn our attention to this problem. We have a visual
diffing infrastructure to help us with this (we take screenshots of
Parsoid's output and the default output, compare the images, and find
diffs). We'll have to tweak and fix our visual-diffing setup and then
fix the rendering problems we find.
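
At its core, the image-comparison step boils down to something like
this sketch (a deliberately naive per-pixel compare; the real setup
also has to deal with alignment, cropping, and anti-aliasing noise):

    // Count pixels whose RGB channels differ by more than a threshold,
    // given two same-sized RGBA screenshots as decoded byte buffers.
    function diffPixelCount(
      a: Uint8Array, // screenshot of the default (PHP parser) output
      b: Uint8Array, // screenshot of the Parsoid output
      threshold = 16 // max per-channel difference still treated as "same"
    ): number {
      if (a.length !== b.length) throw new Error("screenshot sizes differ");
      let diffs = 0;
      for (let i = 0; i < a.length; i += 4) {
        const delta = Math.max(
          Math.abs(a[i] - b[i]),         // R
          Math.abs(a[i + 1] - b[i + 1]), // G
          Math.abs(a[i + 2] - b[i + 2])  // B
        );
        if (delta > threshold) diffs++;
      }
      return diffs; // 0 => visually identical at this threshold
    }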

* 100% roundtripping accuracy is within reach. NO!

The reality is that a lot of pages out there have various kinds of
broken markup (mis-nested HTML tags, unmatched HTML tags, broken
templates) in production. There are probably other edge-case scenarios
that trigger different behavior in Parsoid and the PHP parser. Because
we go to great lengths in Parsoid to avoid dirty diffs, our selective
serialization works quite well. There have been very few reports of
page corruption over the last year, and where they have surfaced, we've
usually moved pretty quickly to fix them; we'll continue to do so.

In addition, our diff classification algorithm will never be perfect
and there will always be false positives. Overall, we may crawl further
along by 0.01% or 0.02%, but we are not holding our breath and neither
should you.

* If we pick a new corpus of 100K pages, we'll have similar accuracy. MAYBE!

Because we've tested against a random sample of pages across multiple
Wikipedias, we expect that we've encountered the vast majority of
scenarios that Parsoid will encounter in production. So we have a very
high degree of confidence that our fixes are not tailored to our test
pages.

As part of https://phabricator.wikimedia.org/T101928, we will be doing
a refresh of our test set, focusing more on enwp pages and non-Wikipedia
test pages, and probably introducing a set of high-traffic pages.

Next steps
----------
Given where we are, we can now start thinking about the next level with
a bit more focus and energy. Our next steps are to bring the PHP parser
and Parsoid closer, both in terms of output and long-term capabilities.

Some possibilities:
* Replace Tidy ( https://phabricator.wikimedia.org/T89331 )
* Pare down rendering differences between the two systems so that
we can start thinking about using Parsoid HTML instead of MWParser HTML
for read views. ( https://phabricator.wikimedia.org/T55784 )
* Use Parsoid as a WikiLint tool
https://phabricator.wikimedia.org/T48705
https://www.mediawiki.org/wiki/Parsoid/Linting/GSoC_2014_Application
* Support improved templating abilities (data-driven tables, etc.)
* Improve Parsoid's parsing performance.
* Implement stable ids to be able to attach long-lived metadata to the
DOM and track it across edits.
* Move wikitext to a DOM-based processing model, using Parsoid as a bridge.
This could make several useful things possible, e.g. much better
automatic edit conflict resolution.
* Long-term: Make Parsoid redundant in its current complex avatar.

References
----------
[1] https://www.mediawiki.org/wiki/Parsoid -- bidirectional parser
    supporting visual editing
[2] http://parsoid-tests.wikimedia.org/failsDistr
    http://parsoid-tests.wikimedia.org/topfails shows the actual failures
[3] http://parsoid-tests.wikimedia.org/rtselsererrors/aa5804ca89dc644f744af24c474cbc736f2edbe1
[4] http://dev.w3.org/html5/spec-LC/tree-construction.html#foster-parenting
[5] https://www.mediawiki.org/wiki/RESTBase#Use_cases

Re: [Engineering] Parsoid announcement: Main roundtrip quality target achieved
<quote name="Subramanya Sastry" date="2015-06-25" time="17:22:53 -0500">
> -----------------------------------------------------------------------
> TL;DR:
>
> 1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
> without introducing semantic diffs[2].
> 2. With trivial simulated edits, the HTML -> wikitext serializer used
> in production (selective serialization) introduces ZERO dirty diffs
> in 99.986% of those edits[3]. Of the 23 edits that did produce
> dirty diffs, 10 were minor newline diffs.
> -----------------------------------------------------------------------

Huge congrats, Subbu and team!

--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |

Re: Parsoid announcement: Main roundtrip quality target achieved
On 25 June 2015 at 23:22, Subramanya Sastry <ssastry@wikimedia.org> wrote:

> On behalf of the parsing team, here is an update about Parsoid, the
> bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
> and Content Translation.

eeeexcellent. How close are we to binning the PHP parser? (I realise
that's a way off, but grant me my dreams.)


- d.

Re: [Engineering] Parsoid announcement: Main roundtrip quality target achieved
On Thu, Jun 25, 2015 at 3:22 PM, Subramanya Sastry <ssastry@wikimedia.org>
wrote:

> Hello everyone,
>
> On behalf of the parsing team, here is an update about Parsoid, the
> bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
> and Content Translation.
>
> Subbu.
>
> -----------------------------------------------------------------------
> TL;DR:
>
> 1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
> without introducing semantic diffs[2].
>

Congratulations, parsing team. This is very cool.


...and, pssst, wink wink, nudge nudge, etc:
http://cacm.acm.org/about-communications/author-center/author-guidelines
http://queue.acm.org/author_guidelines.cfm

:)
Re: [Engineering] Parsoid announcement: Main roundtrip quality target achieved
On 25 June 2015 at 15:22, Subramanya Sastry <ssastry@wikimedia.org> wrote:

> TL;DR:
>
> 1. Parsoid[1] roundtrips 99.95% of the 158K pages in round-trip testing
> without introducing semantic diffs[2].
> 2. With trivial simulated edits, the HTML -> wikitext serializer used
> in production (selective serialization) introduces ZERO dirty diffs
> in 99.986% of those edits[3]. Of the 23 edits that did produce
> dirty diffs, 10 were minor newline diffs.
>

Subbu,

You and your team have done, and keep on doing, amazing stuff. Thank you
all so very much. "Congratulations" doesn't come close. :-)

Yours,
--
James D. Forrester
Lead Product Manager, Editing
Wikimedia Foundation, Inc.

jforrester@wikimedia.org | @jdforrester
Re: Parsoid announcement: Main roundtrip quality target achieved
On 06/25/2015 06:29 PM, David Gerard wrote:
> On 25 June 2015 at 23:22, Subramanya Sastry <ssastry@wikimedia.org> wrote:
>
>> On behalf of the parsing team, here is an update about Parsoid, the
>> bidirectional wikitext <-> HTML parser that supports Visual Editor, Flow,
>> and Content Translation.
> eeeexcellent. How close are we to binning the PHP parser? (I realise
> that's a way off, but grant me my dreams.)

The "PHP parser" used in production has 3 components: the preprocessor,
the core parser, Tidy. Parsoid relies on the PHP preprocessor (access
via the mediawiki API), so that part of the PHP parser will continue to
be in operation.

As noted in my update, we are working towards read views served by
Parsoid HTML which requires several ducks to be lined up in a row. When
that happens everywhere, the core PHP parser and Tidy will no longer be
used.

However, I imagine your question is not so much about the PHP parser ...
but more about wikitext and templating. Since I don't want to go off on
a tangent here based on an assumption, maybe you can say more about what
you had in mind when you asked about "binning the PHP parser".

Subbu.

Re: Parsoid announcement: Main roundtrip quality target achieved
I didn't have anything in mind, evidently I was just vague on what the
stuff in there is and does :-)

On 26 June 2015 at 16:52, Subramanya Sastry <ssastry@wikimedia.org> wrote:
> On 06/25/2015 06:29 PM, David Gerard wrote:
>>
>> On 25 June 2015 at 23:22, Subramanya Sastry <ssastry@wikimedia.org> wrote:
>>
>>> On behalf of the parsing team, here is an update about Parsoid, the
>>> bidirectional wikitext <-> HTML parser that supports Visual Editor,
>>> Flow, and Content Translation.
>>
>> eeeexcellent. How close are we to binning the PHP parser? (I realise
>> that's a way off, but grant me my dreams.)
>
>
> The "PHP parser" used in production has 3 components: the preprocessor, the
> core parser, Tidy. Parsoid relies on the PHP preprocessor (access via the
> mediawiki API), so that part of the PHP parser will continue to be in
> operation.
>
> As noted in my update, we are working towards read views served by Parsoid
> HTML which requires several ducks to be lined up in a row. When that happens
> everywhere, the core PHP parser and Tidy will no longer be used.
>
> However, I imagine your question is not so much about the PHP parser ... but
> more about wikitext and templating. Since I don't want to go off on a
> tangent here based on an assumption, maybe you can say more about what you
> had in mind when you asked about "binning the PHP parser".
>
> Subbu.

Re: [Engineering] Parsoid announcement: Main roundtrip quality target achieved
On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry <ssastry@wikimedia.org>
wrote:

> * Pare down rendering differences between the two systems so that
> we can start thinking about using Parsoid HTML instead of MWParser HTML
> for read views. ( https://phabricator.wikimedia.org/T55784 )
>

Any hope of adding the Parsoid metadata to the MWParser HTML so various
fancy things can be done in core MediaWiki for smaller installations
instead of having to run a separate service? Or does that fall under "Make
Parsoid redundant in its current complex avatar"?

--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
Re: Parsoid announcement: Main roundtrip quality target achieved
On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry <ssastry@wikimedia.org>
wrote:

> On 06/25/2015 06:29 PM, David Gerard wrote:
>
>> On 25 June 2015 at 23:22, Subramanya Sastry <ssastry@wikimedia.org>
>> wrote:
>>
>>> On behalf of the parsing team, here is an update about Parsoid, the
>>> bidirectional wikitext <-> HTML parser that supports Visual Editor,
>>> Flow, and Content Translation.
>>>
>> eeeexcellent. How close are we to binning the PHP parser? (I realise
>> that's a way off, but grant me my dreams.)
>>
>
> The "PHP parser" used in production has 3 components: the preprocessor,
> the core parser, Tidy. Parsoid relies on the PHP preprocessor (access via
> the mediawiki API), so that part of the PHP parser will continue to be in
> operation.
>
> As noted in my update, we are working towards read views served by Parsoid
> HTML which requires several ducks to be lined up in a row. When that
> happens everywhere, the core PHP parser and Tidy will no longer be used.
>

Do we have plans for avoiding code rot in the "unused" PHP parser code
that would affect smaller third-party sites that don't use Parsoid?


--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation
Re: [Engineering] Parsoid announcement: Main roundtrip quality target achieved
On 06/29/2015 09:19 AM, Brad Jorsch (Anomie) wrote:
> On Thu, Jun 25, 2015 at 6:22 PM, Subramanya Sastry
> <ssastry@wikimedia.org> wrote:
>
> * Pare down rendering differences between the two systems so that
> we can start thinking about using Parsoid HTML instead of MWParser
> HTML for read views. ( https://phabricator.wikimedia.org/T55784 )
>
>
> Any hope of adding the Parsoid metadata to the MWParser HTML so
> various fancy things can be done in core MediaWiki for smaller
> installations instead of having to run a separate service? Or does
> that fall under "Make Parsoid redundant in its current complex avatar"?

Short answer: the latter.
Long answer: read on.

Our immediate focus in the coming months would be to bring PHP parser
and Parsoid output closer. Some of that work would be to tweak Parsoid
output / CSS where required, but also to bring PHP parser output closer
to Parsoid output. https://gerrit.wikimedia.org/r/#/c/196532/ is one
step along those lines, for example. Scott has said he will review that
closely with this goal in mind. Another step is to get rid of Tidy and
use an HTML5-compliant tree builder similar to what Parsoid uses.

Beyond these initial steps, bringing the two together (both in terms of
output and functionality) will require bridging the computational models
... string-based vs. DOM-based. For example, we cannot really add
Parsoid-style metadata for templates to the PHP parser output without
being able to analyze the DOM -- that requires us to access the DOM
after Tidy (or, ideally, the Tidy replacement) has a go at it. It
requires us to implement all the dirty tricks we use to identify
template boundaries in the presence of unclosed tags, misnested tags,
fostered content from tables, and the DOM restructuring the HTML tree
builder does to comply with HTML5 semantics.
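
To make "template boundaries" concrete: per Parsoid's published DOM
spec, a transclusion's first node carries typeof="mw:Transclusion" and
multi-node template output is tied together by a shared "about"
attribute, so the consumer side is a walk over siblings, roughly:

    // Collect the contiguous sibling nodes that belong to the same
    // transclusion, identified by a shared "about" attribute.
    function templateBoundary(start: Element): Element[] {
      const about = start.getAttribute("about");
      const group: Element[] = [start];
      if (!about) return group; // single-node transclusion
      let sibling = start.nextElementSibling;
      while (sibling && sibling.getAttribute("about") === about) {
        group.push(sibling);
        sibling = sibling.nextElementSibling;
      }
      return group;
    }

The hard part, of course, is getting the PHP parser to *produce* such
metadata reliably in the presence of the broken markup above.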

Besides that, if you want to also serialize this back to wikitext
without introducing dirty diffs (there is really no reason to do all
this extra work if you cannot also serialize it back to wikitext), you
need to be able to either (a) maintain a lot of extra state in the DOM
beyond what Parsoid maintains, or (b) do all the additional work that
Parsoid does to maintain an extremely precise mapping between wikitext
strings and DOM trees. Once again, the only reason (b) is complicated
is because of unclosed tags, misnested tags, fostered content, and DOM
restructuring because of HTML5 semantics.
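
As a rough illustration of what (b) involves: Parsoid records a DOM
source range ("dsr" in data-parsoid) of offsets into the original
wikitext for each node, and selective serialization can then emit the
original source verbatim for unedited nodes. A simplified sketch (the
interface is illustrative, not Parsoid's exact internal shape):

    interface DataParsoid {
      // [start, end, openTagWidth, closeTagWidth] offsets into the
      // original wikitext source
      dsr?: [number, number, number, number];
    }

    function sourceForUneditedNode(
      originalWikitext: string,
      dp: DataParsoid
    ): string | null {
      if (!dp.dsr) return null; // no mapping: must fully re-serialize
      const [start, end] = dp.dsr;
      return originalWikitext.slice(start, end); // verbatim original
    }

Unclosed and misnested tags are exactly what make these offsets hard to
compute correctly in the first place.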

There is a fair amount of complexity hidden in those 2 steps, and it
really does not make sense to reimplement all of it in the PHP parser.
If you do, at that point you've effectively reimplemented Parsoid in
PHP -- the PHP parser in its current form is unlikely to stay as is.

So, the only real way out here is to move the wikitext computational
model closer to a DOM model. This is not a done deal really, but we have
talked about several ideas over the last couple of years to move this
forward in increments. I don't want to go into a lot of detail in this
email since it is already getting lengthy, but I am happy to talk more
about it if there is interest.

To summarize, here are the steps as we see them:

* Bring PHP parser and Parsoid output as close as we can (replace Tidy,
fix PHP parser output wherever possible to be closer to Parsoid output).
* Incrementally move the wikitext computational model to be DOM-based,
using Parsoid as the bridge that preserves compatibility. This is easier
if we have removed Tidy from the equation.
* Smooth out the harder edge cases, which simplifies the problem and
eliminates the complexity.
* At this point, Parsoid's current complexity will be unnecessary
(specifics depend on the previous steps) => you could have this
functionality back in PHP if it is so desired. But by then, hopefully,
there will also be better clarity about MediaWiki packaging that will
also influence this. Or, some small wikis might decide to be HTML-only
wikis.

Subbu.
Re: Parsoid announcement: Main roundtrip quality target achieved
On 06/29/2015 09:20 AM, Brad Jorsch (Anomie) wrote:
> On Fri, Jun 26, 2015 at 11:52 AM, Subramanya Sastry <ssastry@wikimedia.org>
> wrote:
>
>> The "PHP parser" used in production has 3 components: the preprocessor,
>> the core parser, Tidy. Parsoid relies on the PHP preprocessor (access via
>> the mediawiki API), so that part of the PHP parser will continue to be in
>> operation.
>>
>> As noted in my update, we are working towards read views served by Parsoid
>> HTML which requires several ducks to be lined up in a row. When that
>> happens everywhere, the core PHP parser and Tidy will no longer be used.
> Do we have plans for avoiding code rot in the "unused" PHP parser code
> that would affect smaller third-party sites that don't use Parsoid?

My response to your other email covers quite a bit of this.

As far as I have observed, the PHP parser code has been quite stable
for a while. And small third-party sites are unlikely to have complex
requirements and are less likely to hit serious bugs. In any case, we'll
make a good-faith effort to keep the PHP parser maintained, and we'll
fix critical and really high-priority bugs. But, simply by virtue of us
being a small team with multiple responsibilities, we will prioritize
reducing complexity in Parsoid over keeping the PHP parser maintained.
In the long run, I think that is a better path to bringing the two
systems together.

Subbu.
