
Key: OFFICE-2207
Type: Bug
Status: Applied
Resolution: Fixed
Priority: Major
Assignee: Svante Schubert
Reporter: Robert Weir
Watchers: 0
OASIS Open Document Format for Office Applications (OpenDocument) TC

Whitespace processing [N 1309]

Created: 09/Nov/09 11:49 AM   Updated: 05/Aug/10 10:13 PM
Component/s: General
Affects Version/s: ODF 1.0, ODF 1.0 (second edition)
Fix Version/s: ODF 1.0 Errata CD 5

Proposal: Adopt resolution of OFFICE-1243/OFFICE-1211
Resolution:
Replace the entire section and title with the following:
"1.6 White-Space Processing

ODF processing of whitespace characters is in conformance with the provisions of [XML 1.0].

In addition, ODF processors shall ignore all element children ([RNG] section 5, Data Model) of ODF-defined elements that are strings consisting entirely of whitespace characters and which do not satisfy a pattern of the ODF schema definition for the element.

Any special treatment of additional occurrences of whitespace characters depends on the specific definitions of individual ODF elements, attributes, and their datatypes. See, in particular, section 5.1.1."


Description
Submitter ID
    GB-26300-34
Nature of defect
    Technical
Document
    ISO/IEC 26300:2006
Clause
    1.6
Page
    34
Description of issue

It is stated that "In conformance with the W3C XML specification [XML1.0], optional white-space characters that are contained in elements that have element content (in other words that must contain elements only but not text) are ignored".

    * It is not clear what "optional white-space characters" are (the term is not defined in XML 1.0), or how the described behaviour conforms to XML 1.0.
    * Does the phrase "elements that have element content" mean elements that have only element content? This cannot make sense, as whitespace is itself text content.
    * Consider the markup <text:p><text:span>Hello</text:span> <text:span>world</text:span></text:p>. If processed according to the text above, the space between the words here would be ignored, yet no known ODF processor actually respects this provision.
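
The third point can be checked with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` as the XML processor; the helper name `visible_text` and its flag are made up for illustration and are not part of any ODF processor. It shows that the parser reports the space between the two spans as character data, and what the text would become if that space were dropped as the quoted provision seems to require.

```python
import xml.etree.ElementTree as ET

# Namespace of the ODF text module.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

MARKUP = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    "<text:span>Hello</text:span> <text:span>world</text:span>"
    "</text:p>"
)


def visible_text(markup: str, drop_whitespace_only: bool) -> str:
    """Concatenate the character data of the paragraph (hypothetical helper).

    With drop_whitespace_only=True, every string child made up solely of
    XML whitespace is discarded, which is the literal reading questioned above.
    """
    root = ET.fromstring(markup)
    parts = []
    for chunk in root.itertext():
        if drop_whitespace_only and chunk.strip(" \t\r\n") == "":
            continue
        parts.append(chunk)
    return "".join(parts)


print(visible_text(MARKUP, drop_whitespace_only=False))
print(visible_text(MARKUP, drop_whitespace_only=True))
```

The first call keeps the inter-word space; the second runs the two words together, which is not what existing documents expect.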

Proposal

Reform the text to answer the above queries and modify the stated processing behaviour to accord with the existing corpus of documents and processors.



Comments
Michael Brauer added a comment - 05/Jan/10 03:10 AM
This issue equals more or less OFFICE-1243, which is resolved by OFFICE-1211

Dennis Hamilton added a comment - 04/Jun/10 09:16 PM - edited
In reviewing the Errata CD04-rev02 resolution of this, and looking at the original defect text, I think we may be confusing the issue by saying too much about what [XML 1.0] is thought to say, rather than simply referencing what it is that [XML 1.0] says precisely.

I'm not sure where this should be handled, but I want to record these observations here so they are not lost in the context of Errata discussion:

 1. I think the distinction is between when white-space characters appear in element-content and when white-space characters appear in the PCDATA of mixed content. (In [XML 1.0], element-content is element-only content.)

 2. In [XML 1.0], it is presumed that all character data that occurs in the content of the root element, directly or indirectly, is character data of the XML document.

 3. How white-space characters are handled in element-content is determined by the application of [XML 1.0]. The xml:space="preserve" attribute applies, but the attribute value is a recommendation here.

  3.1 We can simply say that white-space characters encountered as character data in the immediate content of elements having element-content shall be ignored and any setting of xml:space has no effect for those particular occurrences.

  3.2 I do not see anywhere that [XML 1.0] says such white-space characters are to be ignored. Instead it is specified that the application be informed which white-space characters are of this kind. If we want them ignored, we need to make it our rule. [Update 2010-06-05T17:07Z: It is more involved than this. A non-validating XML processor does not make the distinction because it cannot: it has no way of determining whether the white space it encounters belongs to the syntax of element content or is part of PCDATA in mixed content. That simply affirms that the rules described here are not rules of [XML 1.0].]

4. With regard to how mixed-content white-space characters are handled by default, it may be better simply to refer to section 5.1 (or 5.1.1). It is also valuable to assert that xml:space does not over-ride the specified behavior in any case.

5. Since there is no change being made in the [XML 1.0] rules for treatment of white-space in attribute values, and for elimination of carriage-return characters, it is not necessary to say anything about that.

   5.1 The statement that "conforming to the XML specification, all possible line-ending cases are ..." can be misleading. It is preferable to simply observe that the only way that a carriage-return character, #xD, can occur as character data is via a character reference, and that should probably be in a note that refers to [XML 1.0] for that fact. This should be before the statement that makes reference to section 5.1 of ODF 1.0. [Update 2010-06-05T17:07Z: clarified the sentence about carriage-return character occurrences.]

  5.2 Since the treatment of attribute values is mandatory, section 3.3.3 of [XML 1.0] might be referenced, but probably in a note that serves as a reminder, rather than appearing as normative content.

6. We should make sure this is cleaned up in ODF 1.2 regardless of handling this in ODF 1.0 Errata CD05 or later.
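
Points 5, 5.1 and 5.2 describe behaviour that the XML processor already performs before an ODF processor sees anything. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (backed by expat) as the XML processor; the element and attribute names `a` and `v` are placeholders, not ODF names:

```python
import xml.etree.ElementTree as ET


def text_of(markup: str) -> str:
    """Return the character data directly inside the root element."""
    return ET.fromstring(markup).text


# A literal CR LF pair in the document is normalized to a single LF
# by the XML processor (end-of-line handling in [XML 1.0]).
literal_crlf = text_of("<a>x\r\ny</a>")

# A carriage return reaches the application only via a character reference.
char_ref_cr = text_of("<a>x&#13;y</a>")

# Attribute-value normalization ([XML 1.0] section 3.3.3): a literal tab
# inside an attribute value is delivered as a space.
attr_value = ET.fromstring('<a v="x\ty"/>').get("v")

print(repr(literal_crlf), repr(char_ref_cr), repr(attr_value))
```

None of this needs restating as ODF rules; it only illustrates why a note referring to [XML 1.0] would be sufficient.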

Dennis Hamilton added a comment - 05/Jun/10 01:11 AM
Looking at the specification for MathML, that specification and schema have their own rules for handling of white space in "token elements" and outside of "token elements."

So it is probably important to qualify the rules in section 1.6 as applying to elements in namespaces for which the ODF specification is the authority.

We should ensure this is covered for ODF 1.2 also.

Dennis Hamilton added a comment - 11/Jun/10 08:06 PM - edited
After researching this further, I think we should apply the ODF 1.2 change for this back to ODF 1.0 (and eventually 1.1). See OFFICE-1211 and the new comments on how to simplify this.

The advantage of using the ODF 1.2 version is that it eliminates all of the incorrect and/or misleading statements in section 1.6.

There is also a new Public Comment about this Errata item, at OFFICE-2575.

The question about DTDs and MathML in the original Alex Brown comment needs to be dealt with elsewhere.

There remains a disagreement about whether element content or element-only content are meaningful here in the absence of a DTD. But the RNG Schema has elements that do not have character data (not mixed content) in their immediate content, and ones that do. It may be just a matter of using our terms carefully. In the RNG Schema this is distinguished by whether or not the grammar permits text in the immediate element content.
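
To make the schema-driven distinction concrete, here is a rough sketch in Python. It is not derived from the actual RNG schema: the set `PERMITS_TEXT` is a hand-written, hypothetical stand-in for "the grammar permits text in the immediate element content", and the namespace-prefix test is an assumed way of restricting the rule to ODF-defined elements. `xml.etree.ElementTree` stores a string child either as `text` of the parent or as `tail` of the preceding child, so both are checked.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Assumption: ODF-defined elements are recognized by this namespace prefix.
ODF_NS_PREFIX = "{urn:oasis:names:tc:opendocument:xmlns:"

# Hypothetical lookup table; a real processor would consult the schema.
PERMITS_TEXT = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}span", f"{{{TEXT_NS}}}h"}

XML_WS = " \t\r\n"


def is_ws_only(s):
    """True for a non-empty string made up solely of XML whitespace."""
    return bool(s) and s.strip(XML_WS) == ""


def drop_ignorable_whitespace(elem):
    """Remove whitespace-only string children where text is not permitted."""
    strip_here = elem.tag.startswith(ODF_NS_PREFIX) and elem.tag not in PERMITS_TEXT
    if strip_here and is_ws_only(elem.text):
        elem.text = None
    for child in elem:
        # child.tail is a string child of elem, not of child.
        if strip_here and is_ws_only(child.tail):
            child.tail = None
        drop_ignorable_whitespace(child)
    return elem


DOC = (
    f'<office:text xmlns:office="{OFFICE_NS}" xmlns:text="{TEXT_NS}">\n'
    "  <text:p><text:span>Hello</text:span> <text:span>world</text:span></text:p>\n"
    "</office:text>"
)

root = drop_ignorable_whitespace(ET.fromstring(DOC))
print("".join(root.itertext()))
```

In this sketch the indentation inside `office:text` is discarded, while the space between the two spans survives because `text:p` is listed as permitting text.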


Svante Schubert added a comment - 24/Jun/10 12:40 PM
Regarding comment from http://tools.oasis-open.org/issues/browse/OFFICE-2733

"Get rid of 1.6 entirely. It is merely useless and confusing, since
it is merely repeats what is normatively specified in XML 1.0.

Moreover, what do you mean by "element contents" here?
In XML 1.0, this term makes sense only when you have DTDs.
Since no ODF documents have DTDs, this term means nothing
here."

The XML specification defines element content without the usage of DTD, see http://www.w3.org/TR/REC-xml/#dt-elemcontent.
Whitespace handling is indeed already specified in the XML spec, but as ODF does not use the xml:space attribute, this chapter seems helpful.

Adapted resolution from feedback from Alex Brown on the SC34 WG6 list.


Dennis Hamilton added a comment - 28/Jun/10 12:10 AM
Let's try something else.

Because the only XML document model we rely on is that of Relax NG, and we do not depend on there being a validating XML processor, we need to state this condition in terms that fit the ODF schema and that do not depend on anything of XML 1.0 related to DTDs and validating processors.

Proposed Errata that removes the objection concerning how we repeat [XML 1.0] content and misattribute some provisions to [XML 1.0] that are not made there.

Replace section 1.6 with the following

 = = = = = = = = =
1.6 White-Space Processing

An ODF processor shall ignore all element children ([RNG] section 5, Data Model) that are strings consisting entirely of whitespace characters and that do not satisfy a text pattern of the ODF schema definition for the element.

The whitespace characters are the following [XML 1.0] Unicode characters:

 * HORIZONTAL TABULATION (#x9)

 * LINE FEED (#xA)

 * CARRIAGE RETURN (#xD)

 * SPACE (#x20)

For all other occurrences of whitespace characters, treatment is in conformance with the provisions of [XML 1.0]. Any additional provisions to those of [XML 1.0] are included in the specifications of the elements where such provisions apply. See, for example, section 5.1.1.



Dennis Hamilton added a comment - 29/Jun/10 05:36 PM
Replace section 1.6 with the following (adjusted to use Unicode code-point notation consistently)

 = = = = = = = = =
1.6 White-Space Processing

An ODF processor shall ignore all element children ([RNG] section 5, Data Model) that are strings consisting entirely of whitespace characters and that do not satisfy a text pattern of the ODF schema definition for the element.

The whitespace characters are the following [XML 1.0] Unicode characters:

 * HORIZONTAL TABULATION (U+0009)

 * LINE FEED (U+000A)

 * CARRIAGE RETURN (U+000D)

 * SPACE (U+0020)

For all other occurrences of whitespace characters, treatment is in conformance with the provisions of [XML 1.0]. Any additional provisions to those of [XML 1.0] are included in the specifications of the elements where such provisions apply. See, for example, section 5.1.1.
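
As an illustration of why the four characters are listed explicitly, here is a small Python sketch. The names `XML_WHITESPACE` and `is_xml_whitespace_only` are made up for this example. The point is that a general-purpose "is whitespace" test from a programming language may accept more characters than the four above.

```python
# The four whitespace characters listed in the proposed text.
XML_WHITESPACE = frozenset("\u0009\u000A\u000D\u0020")


def is_xml_whitespace_only(s: str) -> bool:
    """True if s is non-empty and contains only the four listed characters."""
    return len(s) > 0 and all(ch in XML_WHITESPACE for ch in s)


print(is_xml_whitespace_only(" \t\r\n"))   # the four listed characters
print(is_xml_whitespace_only("\u00A0"))    # NO-BREAK SPACE is not in the list
print("\u00A0".isspace())                  # Python's own notion of whitespace
```

Python's `str.isspace()` treats NO-BREAK SPACE (U+00A0) as whitespace, so an implementation relying on it would ignore strings that the proposed text does not cover.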

Dennis Hamilton added a comment - 05/Aug/10 10:09 PM
After going around a fair amount in discussion of this, I offered the following proposal for consideration in an off-list discussion (on Monday, 2010-07-26):

The text is not the exact text used, but the rationale remains the same:

Here is my proposal for a complete replacement for section 1.6 of ODF 1.0 (and thereby IS 26300 and ODF 1.1). A replacement can be made in ODF 1.2 as well:

[1] "1.6 White-Space Processing"

[2] "ODF processing of whitespace characters is in conformance with the provisions of [XML 1.0].

[3] "In addition, ODF processors shall ignore all element children ([RNG] section 5, Data Model) of ODF-defined elements that are strings consisting entirely of whitespace characters and that do not satisfy a pattern of the ODF schema definition for the element.

[4] "Any special treatment of additional occurrences of whitespace characters depends on the specific definitions of individual ODF elements, attributes, and datatypes. See, in particular, section 5.1.1."

RATIONALE

1. There is no reason to use "EOL" or introduce the notion. The definition of whitespace characters and the rules for collapsing CR-LF sequences are already dictated by the XML processing at [2].

2. The processing of whitespace characters specified by [XML 1.0] happens first, being performed in-effect by an XML processor that delivers its results to the ODF processor as an XML application.

3. The specific rule for ignoring all-whitespace element children (in the data model) when they do not satisfy the pattern for the element is a requirement of ODF as an application of XML. It is the RNG Schema definitions for those elements that determine whether element-children that are strings are accepted or not. (There is an edge case where an element instance that has a non-whitespace text pattern is now seen as empty. It should properly fail to be accepted depending on whether the element is allowed to be empty or not.)

4. I replaced the "for example" phrase because it creates confusion with examples and non-normative text. There might be a preferable restatement. There are numerous special treatments for valid white-space occurrences that remain after [2-3] are applied. For example: in white-space-separated lists, in the W3C Schema definitions for anyURI and base64Binary, in attributes whose values are formulas that include string constants, and in the special case where consecutive white-space sequences are treated as occurrences of single space characters in the HTML-like processing of text streams for layout in ODF documents. This statement [4] allows for all of those situations.
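
Two of the special treatments named in item 4 can be sketched as follows. This is a simplified illustration in Python with made-up function names. It covers only the collapsing of consecutive whitespace into a single space and the splitting of a white-space-separated list; it does not attempt to model the remaining provisions of section 5.1.1.

```python
import re

# One or more of the four XML whitespace characters.
_WS_RUN = re.compile(r"[\t\n\r ]+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace characters with a single SPACE."""
    return _WS_RUN.sub(" ", text)


def split_ws_list(value: str) -> list:
    """Split a white-space-separated list value into its tokens."""
    return [token for token in _WS_RUN.split(value) if token]


print(collapse_whitespace("Hello \t\n  world"))
print(split_ws_list("  a\tb \n c "))
```

Both operations apply to whitespace that remains after [2] and [3]; which one applies, if any, depends on the definition of the individual element, attribute, or datatype, as [4] states.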

 - Dennis

PS: As far as Svante and I know, only 5.1.1 has a special rule for elements that deliver text to the layout of the document.

Technically, I think it might have been better to describe this as part of a text layout definition rather than treating it as something that happens in the consumption/interpretation of the document. This could also define what happens to white space that trails introduction of a (soft) line break, for example, along with the treatment of &nbsp; and other character entities and special ODF elements that affect layout. It also avoids statements about elements that deliver text that is not necessarily delivered into the layout stream, such as material in the change-tracking history. (I note that the showing of tracked changes in the layout can reveal white space introduced adjacent to where white space has been removed.)

Dennis Hamilton added a comment - 05/Aug/10 10:13 PM
The resolution is restated to have the exact text of CD04-rev07. Reconciliation with OFFICE-1243 and OFFICE-1211 has not yet been carried out.