XML5
...
1. Writing XML documents
...
2. Parsing XML documents
This section only applies to user agents, data mining tools, and
conformance checkers.
The rules in this section define the XML
parser.
This specification defines the parsing rules for XML documents, whether
they are syntactically valid or not. Certain points in the parsing
algorithm are said to be parse
errors. The error handling for parse errors is well-defined: user
agents must either act as described below when
encountering such problems, or must abort processing at
the first error that they encounter for which they do not wish to apply
the rules described below.
2.1. Overview of the parsing model
The input to the XML parsing process consists of a stream of Unicode
characters, which is passed through a tokenization stage
(lexical analysis) followed by a tree construction stage
(semantic analysis). The output is a Document object.
The stream of Unicode characters that constitutes the input to the
tokenization stage will be initially seen by the user agent as a stream of
bytes (typically coming over the network or from the local file system).
The bytes encode the actual characters according to a particular
character encoding, which the user agent must use to decode the
bytes into characters.
Define how to find the character encoding...
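As a minimal sketch of the decoding step only: the function below assumes the character encoding is already known and supplied by the caller, since how to determine it is not yet defined here. The name `decode_input` and the UTF-8 default are illustrative assumptions, not part of this specification.

```python
def decode_input(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the byte stream into the character stream fed to the tokenizer.

    The encoding is a parameter because encoding detection is undefined in
    the text above; "utf-8" is only an illustrative default.
    """
    return raw.decode(encoding)


# Bytes as they might arrive from the network or the file system.
chars = decode_input("<a>caf\u00e9</a>".encode("utf-8"))
```

The resulting string is what the tokenization stage would consume one character at a time.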
2.3. The tokenization stage
Implementations must act as if they used the following
state machine to tokenize XML. The state machine must
start in the data state. Most states consume a
single character, which can have various side effects, and then either switch
the state machine to a new state to reconsume the same character, switch
it to a new state (to consume the next character), or repeat the
same state (to consume the next character). Some states have more
complicated behaviour and can consume several characters before switching
to another state.
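The consume, switch and reconsume mechanics can be sketched as a loop over a state variable. The following is an illustrative subset only, covering the data, tag, tag name, empty tag and end tag states defined below; attribute, entity, processing instruction and markup declaration handling are deliberately left out and raise `NotImplementedError`. The function name, the tuple-shaped tokens and the `EOF` sentinel are assumptions made for this sketch, and parse errors are only marked in comments rather than reported.

```python
EOF = None  # sentinel standing in for the end of the input stream
SPACE = {"\t", "\n", " "}  # U+0009, U+000A, U+0020


def tokenize(text):
    """Run a subset of the tokenizer states over `text` and return tokens."""
    tokens, state, pos, name = [], "data", 0, ""
    while True:
        ch = text[pos] if pos < len(text) else EOF
        pos += 1  # consume; stepping pos back later means "reconsume"
        if state == "data":
            if ch is EOF:
                tokens.append(("end-of-file",))
                return tokens
            if ch == "&":
                raise NotImplementedError("entity handling is elided in the text")
            if ch == "<":
                state = "tag"
            else:
                tokens.append(("character", ch))
        elif state == "tag":
            if ch == "/":
                state = "end tag"
            elif ch in ("?", "!"):
                raise NotImplementedError("pi and markup declarations omitted")
            elif ch is EOF or ch in SPACE or ch in (":", "<", ">"):
                # Parse error: emit "<", then reconsume in the data state.
                tokens.append(("character", "<"))
                pos -= 1
                state = "data"
            else:
                name, state = ch, "tag name"
        elif state == "tag name":
            if ch == ">" or ch is EOF:
                tokens.append(("start tag", name))
                if ch is EOF:
                    pos -= 1  # parse error: reconsume EOF in the data state
                state = "data"
            elif ch == "/":
                state = "empty tag"
            elif ch in SPACE:
                raise NotImplementedError("attribute states omitted")
            else:
                name += ch
        elif state == "empty tag":
            if ch == ">":
                tokens.append(("empty tag", name))
                state = "data"
            else:
                raise NotImplementedError("attribute states omitted")
        elif state == "end tag":
            if ch == ">":
                tokens.append(("short end tag",))
                state = "data"
            elif ch is EOF or ch in SPACE or ch in ("<", ":"):
                # Parse error: emit "<" and "/", reconsume in the data state.
                tokens.extend([("character", "<"), ("character", "/")])
                pos -= 1
                state = "data"
            else:
                name, state = ch, "end tag name"
        elif state in ("end tag name", "end tag name after"):
            if ch == ">" or ch is EOF:
                tokens.append(("end tag", name))
                if ch is EOF:
                    pos -= 1  # parse error: reconsume EOF in the data state
                state = "data"
            elif ch in SPACE:
                state = "end tag name after"
            elif state == "end tag name":
                name += ch
            # Otherwise: parse error in the "after" state; stay in it.
```

For example, `tokenize("<a>x</a>")` yields a start tag token, one character token, an end tag token and an end-of-file token, while `tokenize("< a")` shows the reconsume path: the space after `<` triggers a parse error and is then handled again by the data state.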
The output of the tokenization stage is a series of zero or more of the
following tokens: start tag, empty tag, end tag, short end tag, comment,
character, processing instruction and end-of-file. Start and empty tag
tokens have a tag name and a list of attributes, each of which has a name
and a value. End tags have a tag name. Comment and character tokens have
data. Processing instructions have a name and data.
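One possible in-memory shape for these tokens is sketched below. The class and field names are assumptions chosen for illustration; only the set of token kinds and the data each carries come from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    value: str = ""


@dataclass
class TagToken:
    kind: str  # "start" or "empty"
    name: str
    attributes: list = field(default_factory=list)  # list of Attribute


@dataclass
class EndTagToken:
    name: str


@dataclass
class ShortEndTagToken:
    pass  # carries no data: it closes whatever element is current


@dataclass
class CommentToken:
    data: str = ""


@dataclass
class CharacterToken:
    data: str


@dataclass
class ProcessingInstructionToken:
    name: str  # the states below also call this the target
    data: str = ""


@dataclass
class EndOfFileToken:
    pass


# Hand-built token sequence for the input: <a b="c">x</>
tokens = [
    TagToken("start", "a", [Attribute("b", "c")]),
    CharacterToken("x"),
    ShortEndTagToken(),
    EndOfFileToken(),
]
```

Keeping the start and empty tag tokens in one class mirrors the tokenizer, which creates a tag token first and only decides at emission time whether it is a start tag or an empty tag.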
The tokenization stage also uses a list of
entities and a list of parameter entities.
Both lists are populated with tokens consisting of a name and value during
the tokenization stage and are also used within this stage.
Whenever the steps below indicate that the user agent has to append an entity, the entity has to be appended to
the list of entities, unless the entity flag has
been set to "parameter", in which case it has to be appended to the list of parameter entities. The entity
flag has two values: "normal" and "parameter". Its default value is
"normal". It is set to "normal" after an entity has been appended.
The tokenization stage also has a list of attribute
declarations each consisting of a tag name and a list of attributes
which consist of an attribute name, type and default value.
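The bookkeeping described in the preceding paragraphs (the two entity lists, the entity flag and the list of attribute declarations) can be sketched as follows. The names `TokenizerTables`, `AttributeDeclaration` and `append_entity` are hypothetical, and entities are stored as plain (name, value) pairs for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDeclaration:
    tag_name: str
    # Each entry is (attribute name, type, default value).
    attributes: list = field(default_factory=list)


class TokenizerTables:
    def __init__(self):
        self.entities = []
        self.parameter_entities = []
        self.attribute_declarations = []
        self.entity_flag = "normal"  # the default value

    def append_entity(self, name, value):
        """Append to the list selected by the entity flag, then reset the flag."""
        if self.entity_flag == "parameter":
            self.parameter_entities.append((name, value))
        else:
            self.entities.append((name, value))
        self.entity_flag = "normal"


tables = TokenizerTables()
tables.append_entity("copy", "\u00a9")       # flag is "normal"
tables.entity_flag = "parameter"             # as set by the tokenizer states
tables.append_entity("common", "id CDATA")   # goes to the parameter list
tables.attribute_declarations.append(
    AttributeDeclaration("a", [("href", "CDATA", "")])
)
```

Resetting the flag inside `append_entity` keeps the "set to normal after an entity has been appended" rule in one place, so no state has to remember to do it.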
- Data state
-
Consume the next input character:
- U+0026 (
&)
- ...
- U+003C (
<)
- Switch to the tag state.
- EOF
- Emit an end-of-file token.
- Anything else
- Emit the input character as character token. Stay in this state.
- Tag state
-
Consume the next input character:
- U+002F (
/)
- Switch to the end tag state.
- U+003F (
?)
- Switch to the pi state.
- U+0021 (
!)
- Switch to the markup declaration state.
- U+0009
- U+000A
- U+0020
- U+003A (
:)
- U+003C (
<)
- U+003E (
>)
- EOF
- Parse error. Emit a U+003C (
<)
character token. Reconsume the current input character in the data state.
- Anything else
- Create a new tag token and set its name to the input character, then
switch to the tag name state.
- End tag state
-
Consume the next input character:
- U+003E (
>)
- Emit a short end tag token and then switch to the data state.
- U+0009
- U+000A
- U+0020
- U+003C (
<)
- U+003A (
:)
- EOF
- Parse error. Emit a U+003C (
<)
character token and a U+002F (/) character token.
Reconsume the current input character in the data
state.
- Anything else
- Create an end tag token and set its name to the input character,
then switch to the end tag name state.
- End tag name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the end tag name after state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- U+003E (
>)
- Emit the current token and then switch to the data state.
- Anything else
- Append the current input character to the tag name and stay in the
current state.
- End tag name after state
-
Consume the next input character:
- U+003E (
>)
- Emit the current token and then switch to the data state.
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Parse error. Stay in the current state.
- Pi state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- EOF
- Parse error. Reprocess the current input
character in the bogus comment state.
- Anything else
- Create a new processing instruction token. Set target to the current
input character and data to the empty string. Then switch to the pi target state.
- Pi target state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the pi target after state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- U+003F (
?)
- Switch to the pi after state.
- Anything else
- Append the current input character to the pi's target and stay in the
current state.
- Pi target after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- Anything else
- Reprocess the current input character in the pi
data state.
- Pi data state
-
Consume the next input character:
- U+003F (
?)
- Switch to the pi after state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Append the current input character to the pi's data and stay in the
current state.
- Pi after state
-
Consume the next input character:
- U+003E (
>)
- Emit the current token and then switch to the data state.
- U+003F (
?)
- Append the current input character to the pi's data and stay in the
current state.
- Anything else
- Reprocess the current input character in the pi
data state.
- Markup declaration state
-
If the next two characters are both U+002D (-)
characters, consume those two characters, create a comment token whose
data is the empty string and then switch to the comment state.
Otherwise, if the next seven characters are an exact match for
"[CDATA[", then consume those characters and switch to the CDATA state.
Otherwise, if the next seven characters are an exact match for
"DOCTYPE", then this is a parse error. Consume
those characters and switch to the DOCTYPE state.
Otherwise, this is a parse error. Switch to the
bogus comment state.
- Comment state
-
Consume the next input character:
- U+002D (
-)
- Switch to the comment dash state.
- EOF
- Parse error. Emit the comment token and then
reprocess the current input character in the data
state.
- Anything else
- Append the current input character to the comment token's data. Stay in the current state.
- Comment dash state
-
Consume the next input character:
- U+002D (
-)
- Switch to the comment end state.
- EOF
- Parse error. Emit the comment token and then
reprocess the current input character in the data
state.
- Anything else
- Append a U+002D (
-) and the current input character to
the comment token's data. Stay in the current state.
- Comment end state
-
Consume the next input character:
- U+003E (
>)
- Emit the comment token. Switch to the data
state.
- U+002D (
-)
- Append the current input character to the comment token's data. Stay
in the current state.
- EOF
- Parse error. Emit the comment token and then
reprocess the current input character in the data
state.
- Anything else
- Append two U+002D (
-) characters and the current input
character to the comment token's data. Switch to the comment state.
- CDATA state
-
Consume the next input character:
- U+005D (
])
- Switch to the CDATA bracket state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Emit the current input character as character token. Stay in the
current state.
- CDATA bracket state
-
Consume the next input character:
- U+005D (
])
- Switch to the CDATA end state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Emit a U+005D (
]) character as character token and also
emit the current input character as character token. Stay in the
current state.
- CDATA end state
-
Consume the next input character:
- U+003E (
>)
- Switch to the data state.
- U+005D (
])
- Emit the current input character as character token. Stay in the
current state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Emit two U+005D (
]) characters as character tokens and
also emit the current input character as character token. Switch to the
CDATA state.
- DOCTYPE state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE root name before
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Reprocess the current input character in the bogus
comment state.
- DOCTYPE root name before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>)
- Switch to the data state.
- EOF
- Parse error. Switch to the data state.
- Anything else
- Switch to the DOCTYPE root name state.
- DOCTYPE root name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE root name after state.
- U+003E (
>)
- Switch to the data state.
- U+005B (
[)
- Switch to the DOCTYPE internal subset state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE root name after state
-
Consume the next input character:
- U+003E (
>)
- Switch to the data state.
- U+0022 (
")
- Switch to the DOCTYPE identifier double quoted
state.
- U+0027 (
')
- Switch to the DOCTYPE identifier single quoted
state.
- U+005B (
[)
- Switch to the DOCTYPE internal subset state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE identifier double quoted state
-
Consume the next input character:
- U+0022 (
")
- Switch to the DOCTYPE root name after state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE identifier single quoted state
-
Consume the next input character:
- U+0027 (
')
- Switch to the DOCTYPE root name after state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE internal subset state
-
Consume the next input character:
- U+003C (
<)
- Switch to the DOCTYPE tag state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- U+0025 (
%)
- consume parameter entity
- U+005D (
])
- Switch to the DOCTYPE internal subset after
state.
- Anything else
- Stay in the current state.
- DOCTYPE internal subset after state
-
Consume the next input character:
- U+003E (
>)
- Switch to the data state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE tag state
-
Consume the next input character:
- U+0021 (
!)
- Switch to the DOCTYPE markup declaration
state.
- U+003F (
?)
- Switch to the DOCTYPE pi state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE markup declaration state
-
If the next two characters are both U+002D (-)
characters, then consume those characters and switch to the DOCTYPE comment state.
Otherwise, if the next six characters are an exact match for "ENTITY",
then consume those characters and switch to the DOCTYPE ENTITY state.
Otherwise, if the next seven characters are an exact match for
"ATTLIST", then consume those characters and switch to the DOCTYPE ATTLIST state.
Otherwise, if the next eight characters are an exact match for
"NOTATION", then consume those characters and switch to the DOCTYPE NOTATION state.
Otherwise, switch to the DOCTYPE bogus comment
state.
- DOCTYPE comment state
-
Consume the next input character:
- U+002D (
-)
- Switch to the DOCTYPE comment dash state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE comment dash state
-
Consume the next input character:
- U+002D (
-)
- Switch to the DOCTYPE comment end state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE comment state.
- DOCTYPE comment end state
-
Consume the next input character:
- U+003E (
>)
- Switch to the DOCTYPE internal subset state.
- U+002D (
-)
- Switch to the DOCTYPE comment dash state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE comment state.
- DOCTYPE ENTITY state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ENTITY type before
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ENTITY type before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0025 (
%)
- Switch to the DOCTYPE ENTITY parameter before
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Create an entity token with the name set to the current input
character and the value set to the empty string. Then switch to the DOCTYPE ENTITY name state.
- DOCTYPE ENTITY parameter before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ENTITY parameter
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ENTITY parameter state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Create an entity token with the name set to the current input
character and the value set to the empty string. Set the entity flag to "parameter". Switch to the DOCTYPE ENTITY name state.
- DOCTYPE ENTITY name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ENTITY name after
state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Append the current input character to the name of the entity.
- DOCTYPE ENTITY name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0022 (
")
- Switch to the DOCTYPE ENTITY value double
quoted state.
- U+0027 (
')
- Switch to the DOCTYPE ENTITY value single
quoted state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE ENTITY identifier
state.
- DOCTYPE ENTITY value double quoted state
-
Consume the next input character:
- U+0022 (
")
- Switch to the DOCTYPE ENTITY value after
state.
- U+0026 (
&)
- ... normalize numeric entities only
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Append the current input character to the current entity token's
value.
- DOCTYPE ENTITY value single quoted state
-
Consume the next input character:
- U+0027 (
')
- Switch to the DOCTYPE ENTITY value after
state.
- U+0026 (
&)
- ... normalize numeric entities only
- EOF
- Switch to the data state.
- Anything else
- Append the current input character to the current entity token's
value.
- DOCTYPE ENTITY value after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>)
- Append an entity. Switch to the DOCTYPE internal subset state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE ENTITY identifier state
-
Consume the next input character:
- U+003E (
>)
- append entity ...
- U+0022 (
")
- Switch to the DOCTYPE ENTITY identifier double
quoted state.
- U+0027 (
')
- Switch to the DOCTYPE ENTITY identifier single
quoted state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE ENTITY identifier double quoted state
-
Consume the next input character:
- U+0022 (
")
- Switch to the DOCTYPE ENTITY identifier
state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE ENTITY identifier single quoted state
-
Consume the next input character:
- U+0027 (
')
- Switch to the DOCTYPE ENTITY identifier
state.
- EOF
- Parse error. Reconsume the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE ATTLIST state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST name before
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ATTLIST name before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- ...
- DOCTYPE ATTLIST name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST name after
state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>)
- Switch to the DOCTYPE internal subset state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST attribute name
after state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute type state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST attribute type
after state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute type after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0023 (
#)
- Switch to the DOCTYPE ATTLIST attribute
declaration before state.
- EOF
- Switch to the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE ATTLIST attribute declaration before
state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE bogus comment state.
- EOF
- Switch to the data state.
- Anything else
- Switch to the DOCTYPE ATTLIST attribute
declaration state.
- DOCTYPE ATTLIST attribute declaration state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE ATTLIST attribute
declaration after state.
- EOF
- Switch to the data state.
- Anything else
- Stay in the current state.
- DOCTYPE ATTLIST attribute declaration after
state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>)
- Switch to the DOCTYPE internal subset state.
- U+0022 (
")
- Switch to the DOCTYPE ATTLIST attribute value
double quoted state.
- U+0027 (
')
- Switch to the DOCTYPE ATTLIST attribute value
single quoted state.
- EOF
- Switch to the data state.
- Anything else
- ...
- DOCTYPE ATTLIST attribute value double quoted
state
-
Consume the next input character:
- U+0022 (
")
- Switch to the DOCTYPE ATTLIST name after
state.
- U+0026 (
&)
- ...
- Anything else
- ...
- DOCTYPE ATTLIST attribute value single quoted
state
-
Consume the next input character:
- U+0027 (
')
- Switch to the DOCTYPE ATTLIST name after
state.
- U+0026 (
&)
- ...
- Anything else
- ...
- DOCTYPE NOTATION state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the DOCTYPE NOTATION identifier
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE bogus comment state.
- DOCTYPE NOTATION identifier state
-
Consume the next input character:
- U+003E (
>)
- Switch to the DOCTYPE internal subset state.
- U+0022 (
")
- Switch to the DOCTYPE NOTATION identifier
double quoted state.
- U+0027 (
')
- Switch to the DOCTYPE NOTATION identifier
single quoted state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE NOTATION identifier double quoted
state
-
Consume the next input character:
- U+0022 (
")
- Switch to the DOCTYPE NOTATION identifier
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE NOTATION identifier single quoted
state
-
Consume the next input character:
- U+0027 (
')
- Switch to the DOCTYPE NOTATION identifier
state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE pi state
-
Consume the next input character:
- U+003F (
?)
- Switch to the DOCTYPE pi after state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Stay in the current state.
- DOCTYPE pi after state
-
Consume the next input character:
- U+003E (
>)
- Switch to the DOCTYPE internal subset state.
- U+003F (
?)
- Stay in the current state.
- EOF
- Parse error. Reprocess the current input
character in the data state.
- Anything else
- Switch to the DOCTYPE pi state.
- DOCTYPE bogus comment state
-
Consume every character up to the first U+003E (>) or
EOF, whichever comes first. Emit a comment token whose data is the
concatenation of all those consumed characters. Then consume the next
input character and switch to the DOCTYPE internal
subset state reprocessing the EOF character if that was the
character consumed.
- Tag name state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the tag attribute name before
state.
- U+003E (
>)
- Emit the current token and then switch to the data state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- U+002F (
/)
- Switch to the empty tag state.
- Anything else
- Append the current input character to the tag name and stay in the
current state.
- Empty tag state
-
Consume the next input character:
- U+003E (
>)
- Emit the current tag token as empty tag token and then switch to the
data state.
- Anything else
- Parse error. Reprocess the current input
character in the tag attribute name before
state.
- Tag attribute name before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003E (
>)
- Emit the current token and then switch to the data state.
- U+002F (
/)
- Switch to the empty tag state.
- U+003A (
:)
- Parse error. Stay in the current state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Start a new attribute in the current tag token. Set that attribute's
name to the current input character and its value to the empty string
and then switch to the tag attribute name
state.
- Tag attribute name state
-
Consume the next input character:
- U+003D (
=)
- Switch to the tag attribute value before
state.
- U+003E (
>)
- Emit the current token as start tag token. Switch to the data state.
- U+0009
- U+000A
- U+0020
- Switch to the tag attribute name after
state.
- U+002F (
/)
- Switch to the empty tag state.
- EOF
- Parse error. Emit the current token as start
tag token and then reprocess the current input character in the data state.
- Anything else
- Append the current input character to the current attribute's name.
Stay in the current state.
When the user agent leaves this state (and before emitting the tag
token, if appropriate), the complete attribute's name must be compared to the other attributes on the same
token; if there is already an attribute on the token with the exact same
name, then this is a parse error and the new attribute must be dropped, along with the value that gets associated
with it (if any).
- Tag attribute name after state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+003D (
=)
- Switch to the tag attribute value before
state.
- U+003E (
>)
- Emit the current token and then switch to the data state.
- U+002F (
/)
- Switch to the empty tag state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Start a new attribute in the current tag token. Set that attribute's
name to the current input character and its value to the empty string
and then switch to the tag attribute name
state.
- Tag attribute value before state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Stay in the current state.
- U+0022 (
")
- Switch to the tag attribute value double
quoted state.
- U+0027 (
')
- Switch to the tag attribute value single
quoted state.
- U+0026 (
&)
- Reprocess the input character in the tag
attribute value unquoted state.
- U+003E (
>)
- Emit the current token and then switch to the data state.
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Append the current input character to the current attribute's value
and then switch to the tag attribute value
unquoted state.
- Tag attribute value double quoted state
-
Consume the next input character:
- U+0022 (
")
- Switch to the tag attribute name before
state.
- U+0026 (
&)
- ...
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Append the input character to the current attribute's value. Stay in
the current state.
- Tag attribute value single quoted state
-
Consume the next input character:
- U+0027 (
')
- Switch to the tag attribute name before state.
- U+0026 (
&)
- ...
- EOF
- Parse error. Emit the current token and then
reprocess the current input character in the data
state.
- Anything else
- Append the input character to the current attribute's value. Stay in
the current state.
- Tag attribute value unquoted state
-
Consume the next input character:
- U+0009
- U+000A
- U+0020
- Switch to the tag attribute name before
state.
- U+0026 (
&)
- ...
- U+003E (
>)
- Emit the current token as start tag token and then switch to the data state.
- EOF
- Parse error. Emit the current token as start
tag token and then reprocess the current input character in the data state.
- Anything else
- Append the input character to the current attribute's value. Stay in
the current state.
- Bogus comment state
-
Consume every character up to the first U+003E (>) or
EOF, whichever comes first. Emit a comment token whose data is the
concatenation of all those consumed characters. Then consume the next
input character and switch to the data state
reprocessing the EOF character if that was the character consumed.
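The bogus comment state is one of the states that consume several characters at once. A minimal sketch, assuming the input is held as a string and the tokenizer tracks an index into it (both assumptions of this example, as is the function name):

```python
def bogus_comment(text: str, pos: int):
    """Return (comment data, new position) for a bogus comment starting at pos.

    Consumes up to the first ">" or the end of input, whichever comes first.
    """
    end = text.find(">", pos)
    if end == -1:
        # EOF came first: leave the position at the end of input so that the
        # data state sees EOF when it reprocesses.
        return text[pos:], len(text)
    # ">" came first: it is consumed but is not part of the comment data.
    return text[pos:end], end + 1


data, next_pos = bogus_comment("<!foo>bar", 2)
```

Here the markup declaration state has consumed nothing after `<!`, so the comment starts at index 2; the data state resumes at `next_pos`, which points at `bar`.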
2.4. The tree construction stage
The input to the tree construction stage is a sequence of tokens from
the tokenization stage. The output of this stage is a tree
model represented by a Document object.
The tree construction stage passes through several phases. The initial
phase is the start phase.
The stack of open elements contains all elements of
which the closing tag has not yet been encountered. Once the first start
tag token in the start phase is encountered it will contain
one open element. The rest of the elements are added during the main
phase.
The current element is the bottommost node in this
stack.
The stack of open elements is said to have an element in scope if the target element is in the
stack of open elements.
When the steps below require the user agent to append a
character to a node, the user agent must collect
it and all subsequent consecutive characters that would be appended to
that node and insert one Text node whose data is the
concatenation of all those characters.
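One way to obtain the single Text node is to extend the last child when it is already a Text node, which gives the same tree as collecting the run of characters first. This is a sketch with stand-in `Text` and `Element` classes, not DOM interfaces.

```python
class Text:
    def __init__(self, data):
        self.data = data


class Element:
    def __init__(self, name):
        self.name = name
        self.children = []


def append_character(node, ch):
    """Append ch to node, merging it into a trailing Text node if present."""
    last = node.children[-1] if node.children else None
    if isinstance(last, Text):
        last.data += ch
    else:
        node.children.append(Text(ch))


p = Element("p")
for ch in "hi":
    append_character(p, ch)      # one Text node containing "hi"
p.children.append(Element("br"))
append_character(p, "!")         # a new Text node after the element
```

Consecutive characters therefore never produce adjacent Text nodes, while characters separated by another node start a new one.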
Need to define create an element for the
token...
When the steps below require the user agent to insert an
element for a token the user agent must create an element for the token and then append it to
the current element and push it into the stack of open elements so that it becomes the new current element.
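The insertion step can be sketched with a Python list as the stack of open elements, whose last item is the current element. Because "create an element for the token" is not yet defined, `create_element` below is a hypothetical stand-in that copies a name and attributes from a dictionary-shaped token.

```python
class Element:
    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = dict(attributes or {})
        self.children = []


def create_element(token):
    # Hypothetical stand-in: the text has not yet defined this step.
    return Element(token["name"], token.get("attributes"))


def insert_element(stack, token):
    """Create an element, append it to the current element, and push it."""
    element = create_element(token)
    stack[-1].children.append(element)  # stack[-1] is the current element
    stack.append(element)               # it becomes the new current element
    return element


root = Element("root")
stack = [root]
child = insert_element(stack, {"name": "item", "attributes": {"id": "1"}})
```

After the call, `child` is both the last child of `root` and the new current element.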
2.4.1. The start phase
Each token emitted from the tokenization stage must be
processed as follows until the algorithm below switches to a different
phase:
- A start tag token
-
Create an element for the token and then append
it to the Document node and push it into the stack of open elements. This element is the root
element and the first current element. Then
switch to the main phase.
- An empty tag token
-
Create an element for the token and append it to
the Document node. Then switch to the end
phase.
- A comment token
-
Append a Comment node to the Document node
with the data attribute set to the data given in the token.
- A processing instruction token
-
Append a ProcessingInstruction node to the
Document node with the target and
data attributes set to the target and data given in the
token.
- An end-of-file token
-
Parse error. Reprocess the token in the end
phase.
- Anything else
-
Parse error. Ignore the token.
2.4.2. The main phase
Once a start tag token has been encountered (as detailed in the previous
phase) each token must be processed using the following
steps until further notice:
- A character token
-
Append a character to the current element.
- A start tag token
-
Insert an element for the token.
- An empty tag token
-
Create an element for the token and append it to
the current element.
- An end tag token
-
If the tag name of the current element does not match the
tag name of the end tag token this is a parse
error.
If there is an element in scope with the same tag name as
that of the token pop nodes from the stack of open
elements until the first such element has been popped from the
stack.
If there are no more elements on the stack of open elements at this
point switch to the end phase.
- A short end tag token
-
Pop an element from the stack of open elements.
If there are no more elements on the stack of open elements switch to
the end phase.
- A comment token
-
Append a Comment node to the current
element with the data attribute set to the data given
in the token.
- A processing instruction token
-
Append a ProcessingInstruction node to the current element with the target and
data attributes set to the target and data given in the
token.
- An end-of-file token
-
Parse error. Reprocess the token in the end
phase.
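The end tag and short end tag rules of the main phase are mostly stack manipulation. The sketch below simplifies the stack of open elements to a list of tag names and returns the name of the phase to continue in; the function names and the error list are assumptions of this example.

```python
def process_end_tag(stack, tag_name, errors):
    """Apply the end tag rule; return the phase to continue in."""
    if stack[-1] != tag_name:
        errors.append(f"end tag {tag_name!r} does not match {stack[-1]!r}")
    if tag_name in stack:  # an element with that name is in scope
        # Pop until the first matching element has itself been popped.
        while stack.pop() != tag_name:
            pass
    return "main" if stack else "end"


def process_short_end_tag(stack):
    """Apply the short end tag rule; return the phase to continue in."""
    stack.pop()
    return "main" if stack else "end"


errors = []
stack = ["a", "b", "c"]
phase = process_end_tag(stack, "b", errors)  # closes c and b, with an error
```

A mismatched end tag whose name is not in scope is reported as a parse error and otherwise leaves the stack untouched, so stray end tags cannot close unrelated elements.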
2.4.3. The end phase
Tokens in the end phase must be handled as follows:
- A comment token
-
Append a Comment node to the Document node
with the data attribute set to the data given in the token.
- A processing instruction token
-
Append a ProcessingInstruction node to the
Document node with the target and
data attributes set to the target and data given in the
token.
- An end-of-file token
-
Stop parsing.
- Anything else
-
Parse error. Ignore the token.
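Taken together, the three phases can be sketched as a single loop that tracks the current phase. This is a simplified illustration: tokens are (kind, value) pairs without attributes, processing instruction tokens are omitted, parse errors are only counted, and the `Node` class and `build_tree` function are hypothetical names.

```python
class Node:
    def __init__(self, kind, name=None, data=None):
        self.kind, self.name, self.data = kind, name, data
        self.children = []


def build_tree(tokens):
    """Return (document node, number of parse errors) for a token sequence."""
    document, stack, errors, phase = Node("document"), [], 0, "start"
    for kind, value in tokens:
        if kind == "comment":
            # Comments go to the current element in the main phase and to
            # the Document node in the start and end phases.
            parent = stack[-1] if phase == "main" else document
            parent.children.append(Node("comment", data=value))
        elif kind == "end-of-file":
            if phase != "end":
                errors += 1  # parse error, then reprocessed in the end phase
            break  # stop parsing
        elif phase == "start":
            if kind == "start tag":
                root = Node("element", name=value)
                document.children.append(root)
                stack.append(root)
                phase = "main"
            elif kind == "empty tag":
                document.children.append(Node("element", name=value))
                phase = "end"
            else:
                errors += 1  # parse error: ignore the token
        elif phase == "main":
            current = stack[-1]
            if kind == "character":
                last = current.children[-1] if current.children else None
                if last is not None and last.kind == "text":
                    last.data += value
                else:
                    current.children.append(Node("text", data=value))
            elif kind in ("start tag", "empty tag"):
                element = Node("element", name=value)
                current.children.append(element)
                if kind == "start tag":
                    stack.append(element)
            elif kind == "end tag":
                if current.name != value:
                    errors += 1
                if any(node.name == value for node in stack):
                    while stack.pop().name != value:
                        pass
            elif kind == "short end tag":
                stack.pop()
            if not stack:
                phase = "end"
        else:
            errors += 1  # end phase: parse error, ignore the token
    return document, errors


# Tokens for: <a><b>hi</></a><!--c-->
document, errors = build_tree([
    ("start tag", "a"), ("start tag", "b"),
    ("character", "h"), ("character", "i"),
    ("short end tag", None), ("end tag", "a"),
    ("comment", "c"), ("end-of-file", None),
])
```

In this example the trailing comment arrives after the root element has been closed, so it is handled in the end phase and attached to the Document node rather than to an element.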
Once the user agent stops
parsing the document, it must follow these steps:
- ...