Character Tool 1.0.1
This document is a W3C Recommendation. This fifth edition is not a new version of XML. As a convenience to readers, it incorporates the changes dictated by the accumulated errata (available at -V10-4e-errata) to the Fourth Edition of XML 1.0, dated 16 August 2006. In particular, erratum [E09] relaxes the restrictions on element and attribute names, thereby providing in XML 1.0 the major end user benefit currently achievable only by using XML 1.1. As a consequence, many possible documents which were not well-formed according to previous editions of this specification are now well-formed, and previously invalid documents using the newly-allowed name characters in, for example, ID attributes, are now valid.
XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.
This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO/IEC 10646] for characters, Internet BCP 47 [IETF BCP 47] and the Language Subtag Registry [IANA-LANGCODES] for language identification tags), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.
[Definition: A fatal error is an error which a conforming XML processor MUST detect and report to the application. After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way).]
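As an illustration of how an application might observe a fatal error, here is a minimal sketch that assumes Python's standard `xml.etree.ElementTree` module, an expat-based, non-validating parser that is not part of this specification. The helper name `check_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(document: str):
    """Return (True, None) if the text parses, else (False, error message).

    Hypothetical helper: it only wraps the standard-library parser.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        # The parser reports the fatal error and stops normal processing;
        # no element tree is handed to the application.
        return False, str(err)
    return True, None


print(check_well_formed("<doc><p>ok</p></doc>"))
print(check_well_formed("<doc><p>mismatched</doc>"))
```

This parser stops at the first fatal error rather than searching for further ones, which the definition above permits but does not require.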
[Definition (match): (Of strings or names:) Two strings or names being compared are identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production. (Of content and content models:) An element matches its declaration when it conforms in the fashion described in the constraint [VC: Element Valid].]
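The string form of matching can be sketched as a plain code-point comparison. The function name `xml_match` is hypothetical. The NFC call at the end only demonstrates that the two spellings are canonically equivalent in Unicode; an XML processor does not perform that normalization when matching.

```python
import unicodedata


def xml_match(a: str, b: str) -> bool:
    """Compare code point by code point: no normalization, no case folding."""
    return a == b


precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(xml_match(precomposed, decomposed))   # False: different representations
print(xml_match("Name", "name"))            # False: no case folding
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```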
Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures MUST nest properly, as described in 4.3.2 Well-Formed Parsed Entities.
[Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]
The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.
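The Char production itself is not reproduced in this section. The sketch below assumes the ranges given for Char in the published XML 1.0 Recommendation (#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF]). The function names are hypothetical.

```python
def is_xml_char(ch: str) -> bool:
    """True if the single character ch falls in the assumed Char ranges."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD      # skips the surrogate block
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_illegal_char(text: str):
    """Return the index of the first character outside Char, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ch):
            return index
    return None


print(first_illegal_char("plain text\n"))   # None
print(first_illegal_char("ab\x01c"))        # 2
```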
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
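The list of ranges is missing from this copy of the text. As a hedged sketch, assuming the ranges listed in the published Recommendation ([#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF], and the last two code points of each of planes 1 through 16, from [#x1FFFE-#x1FFFF] to [#x10FFFE-#x10FFFF]), a checker might look like this. The function name `is_discouraged` is hypothetical.

```python
def is_discouraged(ch: str) -> bool:
    """True if ch is in one of the assumed 'discouraged' ranges."""
    cp = ord(ch)
    # C0/C1 control characters, except #x85 (NEL), which is not listed.
    if 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F:
        return True
    # Noncharacters in the Arabic Presentation Forms-A block.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points (xFFFE, xFFFF) of every supplementary plane.
    return cp >= 0x10000 and (cp & 0xFFFF) >= 0xFFFE


print(is_discouraged("\x7f"))   # True
print(is_discouraged("\x85"))   # False
print(is_discouraged("A"))      # False
```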
The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.
The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.
Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.
The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.
Text consists of intermingled character data and markup. [Definition: Markup takes the form of start-tags, end-tags, empty-element tags, entity references, character references, comments, CDATA section delimiters, document type declarations, processing instructions, XML declarations, text declarations, and any white space that is at the top level of the document entity (that is, outside the document element and not inside any other markup).]
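The naming rules above can be sketched as a range lookup. The NameStartChar and NameChar productions are not reproduced in this section, so the tables below are an assumption based on the Fifth Edition of the Recommendation. The names `NAME_START_RANGES`, `NAME_EXTRA_RANGES`, and `is_name` are hypothetical.

```python
# Assumed NameStartChar ranges (inclusive code point pairs).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),   # : A-Z _ a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Assumed additions allowed after the first character.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),       # HYPHEN-MINUS, FULL STOP
    (0x30, 0x39),       # ASCII digits
    (0xB7, 0xB7),       # MIDDLE DOT
    (0x300, 0x36F),     # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(low <= cp <= high for low, high in ranges)


def is_name(text: str) -> bool:
    """True if text is non-empty, starts with a NameStartChar, and
    continues with NameChars only."""
    if not text or not _in_ranges(ord(text[0]), NAME_START_RANGES):
        return False
    return all(
        _in_ranges(ord(ch), NAME_START_RANGES)
        or _in_ranges(ord(ch), NAME_EXTRA_RANGES)
        for ch in text[1:]
    )


print(is_name("ns:item-1.2"))   # True
print(is_name("1st"))           # False: starts with a digit
print(is_name("a\u037eb"))      # False: GREEK QUESTION MARK is excluded
```

Note that the gap between #x37D and #x37F in the start ranges is what excludes #x037E.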
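The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.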
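A minimal sketch of escaping character data, assuming Python's standard `xml.sax.saxutils.escape`, which replaces the ampersand, the left angle bracket, and the right angle bracket with the corresponding entity references. Escaping every right angle bracket is more than is strictly needed, but it guarantees that "]]>" never appears literally in content. The wrapper name `escape_text` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_text(text: str) -> str:
    """Escape &, < and > so the result is safe as element content."""
    return escape(text)


original = "a < b && c ]]> d"
escaped = escape_text(original)
print(escaped)   # a &lt; b &amp;&amp; c ]]&gt; d

# Round trip: a parser turns the references back into the original text.
print(ET.fromstring("<t>" + escaped + "</t>").text == original)   # True
```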
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".
[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments.] Parameter entity references MUST NOT be recognized within comments.
PIs are not part of the document's character data, but MUST be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references MUST NOT be recognized within processing instructions.
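The double-hyphen rule can be checked before a comment is written out. The sketch below is illustrative only, and the function name `make_comment` is hypothetical. The extra check for a trailing hyphen is an assumption drawn from the Comment production of the Recommendation, under which a comment body ending in "-" would produce the forbidden "--" immediately before the closing ">".

```python
def make_comment(text: str) -> str:
    """Wrap text in comment delimiters, rejecting bodies that would
    make the comment ill-formed."""
    if "--" in text:
        raise ValueError("'--' must not occur within a comment")
    if text.endswith("-"):
        # Assumed from the grammar: the body may not end with a hyphen.
        raise ValueError("a comment body must not end with '-'")
    return "<!--" + text + "-->"


print(make_comment(" declarations for head and body "))
```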
[Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string "<![CDATA[" and end with the string "]]>":]
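Because a CDATA section ends at the first "]]>", text that itself contains that string cannot be wrapped in a single section. One common workaround, shown here as a sketch and not as something this specification prescribes, is to close the section after "]]" and open a new one before ">". The function name `to_cdata` is hypothetical, and the round trip assumes Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA sections, splitting any embedded ']]>'."""
    # "]]>" becomes "]]" + end of section + start of section + ">".
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


source = "if (a < b && c[d[0]]> 1) { run(); }"
wrapped = to_cdata(source)
print(wrapped)

# A parser concatenates the adjacent sections back into the original text.
print(ET.fromstring("<code>" + wrapped + "</code>").text == source)   # True
```

The text must still consist of legal characters; a CDATA section does not lift the restrictions on Char.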
Parameter-entity replacement text MUST be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both MUST be contained in the same replacement text.
An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content.
XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).
To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
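The translation described above can be sketched in two replacements, applied in this order so that a #xD #xA pair collapses to one #xA and is not turned into two. The function name `normalize_line_breaks` is hypothetical.

```python
def normalize_line_breaks(text: str) -> str:
    """Translate #xD #xA pairs and lone #xD characters to a single #xA."""
    # Pairs first; any "\r" left over is not followed by "\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_line_breaks("a\r\nb\rc\nd")))   # 'a\nb\nc\nd'
```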
The language specified by xml:lang applies to the element where it is specified (including the values of its attributes), and to all elements in its content unless overridden with another instance of xml:lang. In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors. Applications determine which of an element's attribute values and which parts of its character content, if any, are treated as language-dependent values described by xml:lang.
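The inheritance and override behavior of xml:lang can be sketched as a tree walk. This assumes Python's standard `xml.etree.ElementTree`, which exposes the attribute under the XML namespace URI. The function name `effective_languages` is hypothetical, and `None` stands for "no language information available".

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(root):
    """Return (tag, language) pairs in document order."""
    result = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG)
        if lang is None:
            lang = inherited      # not specified: inherit from the parent
        elif lang == "":
            lang = None           # empty value: override with "no language"
        result.append((elem.tag, lang))
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result


doc = ET.fromstring(
    '<a xml:lang="en"><b xml:lang=""><c/></b><d xml:lang="fr"/><e/></a>'
)
print(effective_languages(doc))
# [('a', 'en'), ('b', None), ('c', None), ('d', 'fr'), ('e', 'en')]
```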