org.millscript.commons.xml.tokenizer
Class AbstractXmlTokenizerImpl

java.lang.Object
  extended by org.millscript.commons.xml.tokenizer.AbstractXmlTokenizerImpl
All Implemented Interfaces:
XmlTokenizer
Direct Known Subclasses:
Xml10Tokenizer, Xml11Tokenizer

public abstract class AbstractXmlTokenizerImpl
extends java.lang.Object
implements XmlTokenizer

This class provides an XmlTokenizer implementation that breaks an XML document into tokens, such as a start tag, end tag, character data, etc. This tokenizer will only perform a minimum number of well-formedness checks, such as for illegal characters, attributes, etc. This tokenizer does not perform checks such as for matching start/end tags, or that a DTD appears at the start of a document.
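
To make the division of labour concrete, here is a toy sketch of tokenizing without structural checks. It is not this class's implementation: ToyTokenizerSketch and its string-based token format are hypothetical, and it ignores attributes, empty-element tags, comments and entities.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class ToyTokenizerSketch {
    /** Splits input into "START:name", "END:name" and "TEXT:data" strings. */
    static List<String> tokenize(String xml) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < xml.length()) {
            if (xml.charAt(i) == '<') {
                int close = xml.indexOf('>', i);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated tag at offset " + i);
                }
                boolean end = xml.startsWith("</", i);
                String name = xml.substring(i + (end ? 2 : 1), close).trim();
                tokens.add((end ? "END:" : "START:") + name);
                i = close + 1;
            } else {
                // Character data runs up to the next '<' or the end of input.
                int next = xml.indexOf('<', i);
                if (next < 0) {
                    next = xml.length();
                }
                tokens.add("TEXT:" + xml.substring(i, next));
                i = next;
            }
        }
        return tokens;
    }
}
```

Given the input <a>hi</b>, the sketch yields START:a, TEXT:hi, END:b without complaint. Noticing that the end tag does not match the start tag is left to whatever consumes the tokens.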


Field Summary
protected  int columnNumber
          The number of the current character on the current line.
protected  int lineNumber
          The current line number.
 
Constructor Summary
protected AbstractXmlTokenizerImpl(AbstractXmlTokenizerImpl axti)
          Constructs a new XML tokenizer which will copy its state from the specified existing tokenizer.
protected AbstractXmlTokenizerImpl(AbstractXmlTokenizerImpl axti, java.io.Reader rr)
          Constructs a new XML tokenizer which will copy its state from the specified existing tokenizer, but will use the specified reader instead of the one from the existing tokenizer.
protected AbstractXmlTokenizerImpl(java.io.InputStream is, java.nio.charset.Charset cs, boolean namespaceAware)
          Constructs a new XML tokenizer to read from the specified input stream, using the specified character set, with optional namespace support.
protected AbstractXmlTokenizerImpl(java.io.Reader r, boolean namespaceAware)
          Constructs a new XML tokenizer to read from the specified reader, with optional namespace support.
 
Method Summary
 void appendCurrentTokenData(char ch)
          Appends the specified char to the current token.
 void dropS()
          Drops any characters from the input stream that match the S production in the XML specification.
 char getChar()
          Returns the next character from the input stream, throwing an alert if the end of file is reached.
 int getIntChar()
          Returns the raw int version of the next char, handling any push back characters and XML version dependencies.
 int getLineNumber()
          Returns the current one-based line number in the source document.
 char getQuoteChar()
          Returns the next char, checking that it is a legal quote character.
abstract  int handleIntChar(int ch)
          Handles the specified character, performing any XML version dependent line break conversions and checks on its validity.
 boolean hasNextToken()
          Indicates if this XML tokenizer has any more tokens to return.
abstract  boolean isChar(int ch)
          Tests if the specified character matches the Char production in the XML specification.
abstract  boolean isNameChar(char ch)
          Tests if the specified character matches the NameChar production in the XML specification.
abstract  boolean isNameStartChar(char ch)
          Tests if the specified character matches the NameStartChar production in the XML specification.
 boolean isNCNameChar(char ch)
          Tests if the specified character matches the NCNameChar production in the XML namespace specification.
 boolean isNCNameStartChar(char ch)
          Tests if the specified character matches the NCNameStartChar production in the XML namespace specification.
 boolean isS(int ch)
          Tests if the specified character matches the S production in the XML specification.
 void mustRead(char testch)
          Tests that the next character is the specified one, otherwise it throws an Alert.
 void mustReadEq()
          Tests if the next input sequence matches the Eq production in the XML specification, otherwise it throws an Alert.
 void mustReadS()
          Tests if the next input sequence matches the S production in the XML specification, otherwise it throws an Alert.
 Token nextToken()
          Returns this tokenizer's next token.
 boolean peekRead(char testch)
          Tests that the next character is the specified one.
 boolean peekS()
          Tests if the next available character matches the S production in the XML specification.
 void pushBack(char ch)
          Pushes back the specified character so it will be the next one returned by the getChar() method.
 void pushBack(java.lang.String s)
          Pushes back all the characters in the string, so they will be returned by subsequent calls to the getChar() method.
 AttListDeclToken readAttlistDecl()
          Returns the next input sequence as an attribute list declaration token.
 java.lang.String readAttValue()
          Returns the next input sequence as an attribute value string.
 CharDataToken readCDSect()
          Returns the next input sequence as a CDATA section.
 CharDataToken readCharData()
          Returns the next input sequence as a character data token.
 CommentToken readComment()
          Returns the next input sequence as a comment token.
 DTDToken readDoctypeDecl()
          Returns the next input sequence as a document type declaration token.
 ElementDeclToken readElementDecl()
          Returns the next input sequence as an element declaration token.
 java.lang.String readEncodingDecl()
          Returns the next input sequence as an encoding declaration.
 EntityDeclToken readEntityDecl()
          Returns the next input sequence as an entity declaration token.
 EndTagToken readETag()
          Returns the next input sequence as an end tag token.
 void readIntSubset()
          Reads the next input sequence as the internal subset of a document type declaration.
 java.lang.String readNmtoken()
          Returns the next input sequence as an nmtoken.
 NotationDeclToken readNotationDecl()
          Returns the next input sequence as a notation declaration token.
 PIToken readPI()
          Returns the next input sequence as a processing instruction token.
 java.lang.String readPubidLiteral()
          Returns the next input sequence as a public literal.
 EntityImpl readReference()
          Returns the next input sequence as an entity reference.
 java.lang.String readSDDecl()
          Returns the next input sequence as a standalone declaration.
 StartTagToken readSTag()
          Returns the next input sequence as a start tag token.
 java.lang.String readSystemLiteral()
          Returns the next input sequence as a system literal.
 java.lang.String readVersionInfo()
          Returns the next input sequence as a version declaration.
 void setNamespaces(org.millscript.commons.util.IMap<java.lang.String,java.lang.String> spaces)
          Sets the mapping of namespace prefix to namespace IRI for tokenizing subsequent prefixed and unprefixed names.
 boolean tryRead(char testch)
          Tests that the next character is the specified one.
 boolean tryRead(char testch, char testch2)
          Tests that the next characters match the two character sequence.
 boolean tryReadS()
          Tests if the next available characters match the S production in the XML specification.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

columnNumber

protected int columnNumber
The number of the current character on the current line.


lineNumber

protected int lineNumber
The current line number.

Constructor Detail

AbstractXmlTokenizerImpl

protected AbstractXmlTokenizerImpl(java.io.InputStream is,
                                   java.nio.charset.Charset cs,
                                   boolean namespaceAware)
Constructs a new XML tokenizer to read from the specified input stream, using the specified character set, with optional namespace support.

Parameters:
is - the InputStream to read from
cs - the Charset to decode the InputStream with
namespaceAware - indicates if the tokenizer should be namespace aware

AbstractXmlTokenizerImpl

protected AbstractXmlTokenizerImpl(java.io.Reader r,
                                   boolean namespaceAware)
Constructs a new XML tokenizer to read from the specified reader, with optional namespace support.

Parameters:
r - the Reader to obtain characters from
namespaceAware - indicates if the tokenizer should be namespace aware

AbstractXmlTokenizerImpl

protected AbstractXmlTokenizerImpl(AbstractXmlTokenizerImpl axti)
Constructs a new XML tokenizer which will copy its state from the specified existing tokenizer.

Parameters:
axti - the existing tokenizer to copy state from

AbstractXmlTokenizerImpl

protected AbstractXmlTokenizerImpl(AbstractXmlTokenizerImpl axti,
                                   java.io.Reader rr)
Constructs a new XML tokenizer which will copy its state from the specified existing tokenizer, but will use the specified reader instead of the one from the existing tokenizer.

Parameters:
axti - the existing tokenizer to copy state from
rr - the new reader this tokenizer should read characters from
Method Detail

appendCurrentTokenData

public void appendCurrentTokenData(char ch)
Appends the specified char to the current token.

Parameters:
ch - the char to append

dropS

public void dropS()
Drops any characters from the input stream that match the S production in the XML specification.
 [3] S ::= (#x20 | #x9 | #xD | #xA)+
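
The S production can be sketched over a standard java.io.PushbackReader. This is a minimal illustration under assumptions, not the tokenizer's own code: DropSSketch and remainderAfterDropS are hypothetical names, and the real class keeps its own push-back state.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class DropSSketch {
    /** True for the four characters allowed by production [3] S. */
    static boolean isS(int ch) {
        return ch == 0x20 || ch == 0x9 || ch == 0xD || ch == 0xA;
    }

    /** Discards leading S characters and leaves the first non-S character unread. */
    static void dropS(PushbackReader in) throws IOException {
        int ch = in.read();
        while (isS(ch)) {
            ch = in.read();
        }
        if (ch != -1) {
            in.unread(ch); // put the first non-S character back
        }
    }

    /** Helper for experimenting: returns what is left of the text after dropS. */
    static String remainderAfterDropS(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            dropS(in);
            StringBuilder rest = new StringBuilder();
            for (int ch = in.read(); ch != -1; ch = in.read()) {
                rest.append((char) ch);
            }
            return rest.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```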
 


getChar

public char getChar()
Returns the next character from the input stream, throwing an alert if the end of file is reached.

Returns:
the next char from the input stream

getIntChar

public int getIntChar()
Returns the raw int version of the next char, handling any push back characters and XML version dependencies. This method accounts for the set of legal characters in an XML document.

Returns:
the int version of the next char or -1 if there are no more characters

getLineNumber

public int getLineNumber()
Description copied from interface: XmlTokenizer
Returns the current one-based line number in the source document.

Specified by:
getLineNumber in interface XmlTokenizer
Returns:
an int value for the one-based line number in the source document
See Also:
XmlTokenizer.getLineNumber()

getQuoteChar

public char getQuoteChar()
Returns the next char, checking that it is a legal quote character.

Returns:
the next char, if it is a legal quote character

handleIntChar

public abstract int handleIntChar(int ch)
Handles the specified character, performing any XML version dependent line break conversions and checks on its validity.

Parameters:
ch - the character to test
Returns:
the handled character, which may not be the same as that supplied as the argument
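
As an illustration of the kind of line break conversion involved, the sketch below applies the XML 1.0 end-of-line rule, under which a "\r\n" pair and a lone "\r" both become "\n". It works on a whole string for simplicity, whereas this method receives one character at a time; LineBreakSketch and its method names are hypothetical, and the XML 1.1 rule (which also covers #x85 and #x2028) is not shown.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class LineBreakSketch {
    /** Applies the XML 1.0 end-of-line rule to a complete string. */
    static String normalize(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char ch = raw.charAt(i);
            if (ch == '\r') {
                out.append('\n');
                // Skip the '\n' of a "\r\n" pair so the pair counts as one line break.
                if (i + 1 < raw.length() && raw.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    /** One-based number of the line that the end of the text falls on. */
    static int lastLineNumber(String raw) {
        int line = 1;
        for (char ch : normalize(raw).toCharArray()) {
            if (ch == '\n') {
                line++;
            }
        }
        return line;
    }
}
```

Normalizing before counting is what keeps a one-based line number consistent across platforms: a Windows-style "\r\n" advances the count once, not twice.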

hasNextToken

public boolean hasNextToken()
Description copied from interface: XmlTokenizer
Indicates if this XML tokenizer has any more tokens to return.

Specified by:
hasNextToken in interface XmlTokenizer
Returns:
true if this tokenizer has any more tokens to return
See Also:
XmlTokenizer.hasNextToken()

isChar

public abstract boolean isChar(int ch)
Tests if the specified character matches the Char production in the XML specification.

Parameters:
ch - the character to test
Returns:
true if the character is a Char and false otherwise
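
Because the Char production differs between XML 1.0 and XML 1.1, this method is abstract. The sketch below shows what the two range checks could look like, using the ranges from the respective specifications; CharProductionSketch and its method names are hypothetical and are not taken from Xml10Tokenizer or Xml11Tokenizer.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class CharProductionSketch {
    /** XML 1.0 production [2] Char, applied to a Unicode code point. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1 production [2] Char: most C0 controls are admitted, #x0 is not. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Taking an int rather than a char lets the end-of-input value -1 be passed in safely; both checks reject it.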

isNameChar

public abstract boolean isNameChar(char ch)
Tests if the specified character matches the NameChar production in the XML specification.

Parameters:
ch - the character to test
Returns:
true if the character is a NameChar and false otherwise

isNameStartChar

public abstract boolean isNameStartChar(char ch)
Tests if the specified character matches the NameStartChar production in the XML specification.

Parameters:
ch - the character to test
Returns:
true if the character is a NameStartChar and false otherwise

isNCNameChar

public boolean isNCNameChar(char ch)
Tests if the specified character matches the NCNameChar production in the XML namespace specification.

Parameters:
ch - the character to test
Returns:
true if the character is an NCNameChar and false otherwise

isNCNameStartChar

public boolean isNCNameStartChar(char ch)
Tests if the specified character matches the NCNameStartChar production in the XML namespace specification.

Parameters:
ch - the character to test
Returns:
true if the character is an NCNameStartChar and false otherwise

isS

public boolean isS(int ch)
Tests if the specified character matches the S production in the XML specification.
 [3] S ::= (#x20 | #x9 | #xD | #xA)+
 

Parameters:
ch - the character to test
Returns:
true if the character is an S character and false otherwise

mustRead

public void mustRead(char testch)
Tests that the next character is the specified one, otherwise it throws an Alert.

Parameters:
testch - the character we must read next

mustReadEq

public void mustReadEq()
Tests if the next input sequence matches the Eq production in the XML specification, otherwise it throws an Alert. If the sequence matches, it will be dropped.
 [25] Eq ::= S? '=' S?
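
The Eq production amounts to "optional white space, an equals sign, optional white space". A minimal index-based sketch follows; EqSketch is a hypothetical name, and IllegalStateException stands in for the library's Alert, whose API is not shown on this page.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class EqSketch {
    /**
     * Returns the index just past an Eq match starting at pos,
     * or throws if no '=' is found there.
     */
    static int mustReadEq(String input, int pos) {
        int i = skipS(input, pos);
        if (i >= input.length() || input.charAt(i) != '=') {
            // Stand-in for the Alert thrown by the real tokenizer.
            throw new IllegalStateException("expected '=' at offset " + i);
        }
        return skipS(input, i + 1);
    }

    /** Advances past any run of S characters (space, tab, CR, LF). */
    private static int skipS(String input, int pos) {
        int i = pos;
        while (i < input.length() && " \t\r\n".indexOf(input.charAt(i)) >= 0) {
            i++;
        }
        return i;
    }
}
```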
 


mustReadS

public void mustReadS()
Tests if the next input sequence matches the S production in the XML specification, otherwise it throws an Alert. If the sequence matches, it will be dropped.
 [3] S ::= (#x20 | #x9 | #xD | #xA)+
 


nextToken

public Token nextToken()
Description copied from interface: XmlTokenizer
Returns this tokenizer's next token.

Specified by:
nextToken in interface XmlTokenizer
Returns:
this tokenizer's next Token
See Also:
XmlTokenizer.nextToken()

peekRead

public boolean peekRead(char testch)
Tests that the next character is the specified one.

Parameters:
testch - the character to test for.
Returns:
true if the character is the required one and false otherwise

peekS

public boolean peekS()
Tests if the next available character matches the S production in the XML specification.
 [3] S ::= (#x20 | #x9 | #xD | #xA)+
 

Returns:
true if the next character is an S character and false otherwise

pushBack

public void pushBack(char ch)
Pushes back the specified character so it will be the next one returned by the getChar() method.

Parameters:
ch - the char to push back

pushBack

public void pushBack(java.lang.String s)
Pushes back all the characters in the string, so they will be returned by subsequent calls to the getChar() method. The characters are pushed in reverse order, so that the first character in the string will be the first character returned by getChar().

Parameters:
s - the String to push back
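
The reverse-order rule falls out naturally if the push-back store is a stack. The sketch below assumes that design; PushBackSketch and its fields are hypothetical, and the real class may store pushed-back characters differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class PushBackSketch {
    private final Deque<Character> pushedBack = new ArrayDeque<>();
    private final String source;
    private int pos;

    PushBackSketch(String source) {
        this.source = source;
    }

    /** The pushed character becomes the next one returned. */
    void pushBack(char ch) {
        pushedBack.push(ch);
    }

    /** Pushes the last character first, so the first character ends up on top. */
    void pushBack(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            pushBack(s.charAt(i));
        }
    }

    /** Returns pushed-back characters before reading on; -1 at end of input. */
    int getIntChar() {
        if (!pushedBack.isEmpty()) {
            return pushedBack.pop();
        }
        return pos < source.length() ? source.charAt(pos++) : -1;
    }
}
```

With a source of "c" and a call to pushBack("ab"), successive reads return 'a', 'b', 'c' and then -1.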

readAttlistDecl

public AttListDeclToken readAttlistDecl()
Returns the next input sequence as an attribute list declaration token. This will generate an Alert if the input sequence doesn't match the AttlistDecl production in the XML specification.
 [52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'
 [53] AttDef ::= S Name S AttType S DefaultDecl
 [54] AttType ::= StringType | TokenizedType | EnumeratedType
 [55] StringType ::= 'CDATA'
 [56] TokenizedType ::= 'ID' [VC: ID][VC: One ID per Element Type][VC: ID Attribute Default]
                      | 'IDREF' [VC: IDREF]
                      | 'IDREFS' [VC: IDREF]
                      | 'ENTITY' [VC: Entity Name]
                      | 'ENTITIES' [VC: Entity Name]
                      | 'NMTOKEN' [VC: Name Token]
                      | 'NMTOKENS' [VC: Name Token]
 [57] EnumeratedType ::= NotationType | Enumeration
 [58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')' [VC: Notation Attributes][VC: One Notation Per Element Type][VC: No Notation on Empty Element][VC: No Duplicate Tokens]
 [59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')' [VC: Enumeration] [VC: No Duplicate Tokens]
 [60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue) [VC: Required Attribute][VC: Attribute Default Value Syntactically Correct][WFC: No < in Attribute Values][VC: Fixed Attribute Default]
 

When this method is called the identifying sequence, i.e. '<!ATTLIST', will have already been processed and should NOT be expected.

Returns:
an AttListDeclToken for the attribute list declaration

readAttValue

public java.lang.String readAttValue()
Returns the next input sequence as an attribute value string. This will generate an Alert if the input sequence doesn't match the AttValue production in the XML specification.

Returns:
a String holding the attribute value

readCDSect

public CharDataToken readCDSect()
Returns the next input sequence as a CDATA section. This will generate an Alert if the input sequence doesn't match the CDSect production in the XML specification.
 [18] CDSect ::= CDStart CData CDEnd
 [19] CDStart ::= '<![CDATA['
 [20] CData ::= (Char* - (Char* ']]>' Char*))
 [21] CDEnd ::= ']]>'
 

When this method is called the first three characters '<![' will have already been processed and should NOT be expected.

Returns:
a CharDataToken for the CDATA section

readCharData

public CharDataToken readCharData()
Returns the next input sequence as a character data token. This will generate an Alert if the input sequence doesn't match the CharData production in the XML specification.
 [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
 

Returns:
a CharDataToken for the character data
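
Production [14] reads as "any run of characters up to the next '<' or '&', provided the run does not contain ']]>'". A minimal index-based sketch of that rule follows; CharDataSketch is a hypothetical name, it returns a plain String instead of a CharDataToken, and IllegalStateException stands in for the library's Alert.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class CharDataSketch {
    /** Returns the character data starting at pos, stopping before '<' or '&'. */
    static String readCharData(String input, int pos) {
        int end = pos;
        while (end < input.length()
                && input.charAt(end) != '<'
                && input.charAt(end) != '&') {
            end++;
        }
        String data = input.substring(pos, end);
        if (data.contains("]]>")) {
            // Stand-in for the Alert thrown by the real tokenizer.
            throw new IllegalStateException("']]>' is not allowed in character data");
        }
        return data;
    }
}
```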

readComment

public CommentToken readComment()
Returns the next input sequence as a comment token. This will generate an Alert if the input sequence doesn't match the Comment production in the XML specification.
 [15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
 

When this method is called the first four characters '<!--' will have already been processed and should NOT be expected.

Returns:
a CommentToken for the comment

readDoctypeDecl

public DTDToken readDoctypeDecl()
Returns the next input sequence as a document type declaration token. This will generate an Alert if the input sequence doesn't match the doctypedecl production in the XML specification.
 [28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' [VC: Root Element Type] [WFC: External Subset]
 [75] ExternalID ::= 'SYSTEM' S SystemLiteral | 'PUBLIC' S PubidLiteral S SystemLiteral
 

When this method is called the identifying sequence, i.e. '<!DOCTYPE', will have already been processed and should NOT be expected.

Returns:
a DTDToken for the document type declaration

readElementDecl

public ElementDeclToken readElementDecl()
Returns the next input sequence as an element declaration token. This will generate an Alert if the input sequence doesn't match the elementdecl production in the XML specification.
 [45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>' [VC: Unique Element Type Declaration]
 

When this method is called the identifying sequence, i.e. '<!ELEMENT', will have already been processed and should NOT be expected.

Returns:
an ElementDeclToken for the element declaration

readEncodingDecl

public java.lang.String readEncodingDecl()
Returns the next input sequence as an encoding declaration. This will generate an Alert if the input sequence doesn't match the encodingDecl production in the XML specification.
 [80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
 [81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
 

Returns:
a String holding the value of the encoding declaration
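
The EncName production maps directly onto a regular expression. The sketch below checks only the name itself, not the surrounding quotes or the 'encoding' keyword; EncNameSketch is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class EncNameSketch {
    // [81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
    private static final Pattern ENC_NAME = Pattern.compile("[A-Za-z][A-Za-z0-9._\\-]*");

    /** True if the whole string matches the EncName production. */
    static boolean isEncName(String s) {
        return ENC_NAME.matcher(s).matches();
    }
}
```

Names such as UTF-8 and ISO-8859-1 pass, while a name starting with a digit or containing a space does not.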

readEntityDecl

public EntityDeclToken readEntityDecl()
Returns the next input sequence as an entity declaration token. This will generate an Alert if the input sequence doesn't match the EntityDecl production in the XML specification.
 [9] EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"'
                   | "'" ([^%&'] | PEReference | Reference)* "'"
 [70] EntityDecl ::= GEDecl | PEDecl
 [71] GEDecl ::= '<!ENTITY' S Name S EntityDef S? '>'
 [72] PEDecl ::= '<!ENTITY' S '%' S Name S PEDef S? '>'
 [73] EntityDef ::= EntityValue| (ExternalID NDataDecl?)
 [74] PEDef ::= EntityValue | ExternalID
 

When this method is called the identifying sequence, i.e. '<!ENTITY', will have already been processed and should NOT be expected.

Returns:
an EntityDeclToken for the entity declaration

readETag

public EndTagToken readETag()
Returns the next input sequence as an end tag token. This will generate an Alert if the input sequence doesn't match the ETag production in the XML specification.
 [42] ETag ::= '</' Name S? '>'
 

When this method is called the first two characters '</' will have already been processed and should NOT be expected.

Returns:
an EndTagToken for the end tag

readIntSubset

public void readIntSubset()
Reads the next input sequence as the internal subset of a document type declaration. This will generate an Alert if the input sequence doesn't match the intSubset production in the XML specification.
 [28a] DeclSep ::= PEReference | S [WFC: PE Between Declarations]
 [28b] intSubset ::= (markupdecl | DeclSep)*
 [29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment
 [69] PEReference ::= '%' Name ';' [VC: Entity Declared] [WFC: No Recursion] [WFC: In DTD]
 


readNmtoken

public java.lang.String readNmtoken()
Returns the next input sequence as an nmtoken. This will generate an Alert if the input sequence doesn't match the Nmtoken production in the XML specification.
 [7] Nmtoken ::= (NameChar)+
 

Returns:
a String holding the Nmtoken

readNotationDecl

public NotationDeclToken readNotationDecl()
Returns the next input sequence as a notation declaration token. This will generate an Alert if the input sequence doesn't match the NotationDecl production in the XML specification.
 [82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>' [VC: Unique Notation Name]
 [83] PublicID ::= 'PUBLIC' S PubidLiteral
 

When this method is called the identifying sequence, i.e. '<!NOTATION', will have already been processed and should NOT be expected.

Returns:
a NotationDeclToken for the notation declaration

readPI

public PIToken readPI()
Returns the next input sequence as a processing instruction token. This will generate an Alert if the input sequence doesn't match the PI production in the XML specification.
 [16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
 [17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
 

When this method is called the first two characters '<?' will have already been processed and should NOT be expected.

Returns:
a PIToken for the processing instruction
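
Production [17] excludes any target that spells "xml" in any mix of upper and lower case. The check can be sketched as below, assuming the name has already been validated against the Name production; PITargetSketch is a hypothetical name, and how the real tokenizer treats the XML declaration itself is not covered here.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class PITargetSketch {
    /** True if the name matches ('X' | 'x') ('M' | 'm') ('L' | 'l') exactly. */
    static boolean isReservedTarget(String name) {
        return name.length() == 3
            && (name.charAt(0) == 'X' || name.charAt(0) == 'x')
            && (name.charAt(1) == 'M' || name.charAt(1) == 'm')
            && (name.charAt(2) == 'L' || name.charAt(2) == 'l');
    }
}
```

Only the exact three-letter name is excluded, so a target such as xml-stylesheet is still a legal PITarget.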

readPubidLiteral

public java.lang.String readPubidLiteral()
Returns the next input sequence as a public literal. This will generate an Alert if the input sequence doesn't match the PubidLiteral production in the XML specification.
 [12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
 [13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
 

Returns:
a String holding the public identifier
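
The PubidChar set is small enough to test directly. The sketch below checks the body of a public identifier literal, given the quote character that delimits it; PubidSketch and its method names are hypothetical.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class PubidSketch {
    /** Production [13] PubidChar. Note that tab (#x9) is not included. */
    static boolean isPubidChar(char ch) {
        return ch == 0x20 || ch == 0xD || ch == 0xA
            || (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z')
            || (ch >= '0' && ch <= '9')
            || "-'()+,./:=?;!*#@$_%".indexOf(ch) >= 0;
    }

    /** True if every character is a PubidChar other than the delimiting quote. */
    static boolean isPubidLiteralBody(String body, char quote) {
        for (int i = 0; i < body.length(); i++) {
            char ch = body.charAt(i);
            if (ch == quote || !isPubidChar(ch)) {
                return false;
            }
        }
        return true;
    }
}
```

The quote parameter captures the asymmetry in production [12]: an apostrophe is a valid PubidChar inside a double-quoted literal but ends a single-quoted one.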

readReference

public EntityImpl readReference()
Returns the next input sequence as an entity reference. This will generate an Alert if the input sequence doesn't match the Reference production in the XML specification.

When this method is called the first character '&' will have already been processed and should NOT be expected.

Returns:
an Entity for the reference

readSDDecl

public java.lang.String readSDDecl()
Returns the next input sequence as a standalone declaration. This will generate an Alert if the input sequence doesn't match the SDDecl production in the XML specification.
 [32] SDDecl ::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))
 

Returns:
a String holding the value of the standalone declaration
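
Production [32] can be sketched as a regular expression anchored at the current position. SDDeclSketch is a hypothetical name, and IllegalStateException stands in for the library's Alert.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class SDDeclSketch {
    // S 'standalone' Eq, then 'yes' or 'no' in matching single or double quotes.
    private static final Pattern SD_DECL = Pattern.compile(
        "[ \\t\\r\\n]+standalone[ \\t\\r\\n]*=[ \\t\\r\\n]*(?:'(yes|no)'|\"(yes|no)\")");

    /** Returns "yes" or "no" if the input starts with an SDDecl, otherwise throws. */
    static String standaloneValue(String input) {
        Matcher m = SD_DECL.matcher(input);
        if (!m.lookingAt()) {
            // Stand-in for the Alert thrown by the real tokenizer.
            throw new IllegalStateException("input does not start with an SDDecl");
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }
}
```

Note that the leading S is mandatory in the production, so input that begins directly with the word standalone is rejected.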

readSTag

public StartTagToken readSTag()
Returns the next input sequence as a start tag token. This will generate an Alert if the input sequence doesn't match the STag or EmptyElemTag production in the XML specification.
 [44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>' [WFC: Unique Att Spec]
 [40] STag ::= '<' Name (S Attribute)* S? '>' [WFC: Unique Att Spec]
 [41] Attribute ::= Name Eq AttValue [VC: Attribute Value Type] [WFC: No External Entity References] [WFC: No < in Attribute Values]
 

When this method is called the first character '<' will have already been processed and should NOT be expected.

Returns:
a StartTagToken or EmptyElemToken for the next tag

readSystemLiteral

public java.lang.String readSystemLiteral()
Returns the next input sequence as a system literal. This will generate an Alert if the input sequence doesn't match the SystemLiteral production in the XML specification.
 [11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
 

Returns:
a String holding the system identifier
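
A system literal is simply everything between a quote character and the next occurrence of the same quote. A minimal index-based sketch follows; SystemLiteralSketch is a hypothetical name, and IllegalStateException stands in for the library's Alert.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class SystemLiteralSketch {
    /** Returns the text between the quote at pos and its matching close quote. */
    static String readSystemLiteral(String input, int pos) {
        if (pos >= input.length()) {
            throw new IllegalStateException("unexpected end of input");
        }
        char quote = input.charAt(pos);
        if (quote != '"' && quote != '\'') {
            throw new IllegalStateException("expected a quote at offset " + pos);
        }
        int close = input.indexOf(quote, pos + 1);
        if (close < 0) {
            throw new IllegalStateException("unterminated system literal");
        }
        return input.substring(pos + 1, close);
    }
}
```

Unlike PubidLiteral, the body is otherwise unrestricted, so a double-quoted system literal may contain an apostrophe and vice versa.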

readVersionInfo

public java.lang.String readVersionInfo()
Returns the next input sequence as a version declaration. This will generate an Alert if the input sequence doesn't match the VersionInfo production in the XML specification.
 [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
 

Returns:
a String holding the value of the version declaration

setNamespaces

public void setNamespaces(org.millscript.commons.util.IMap<java.lang.String,java.lang.String> spaces)
Description copied from interface: XmlTokenizer
Sets the mapping of namespace prefix to namespace IRI for tokenizing subsequent prefixed and unprefixed names.

Specified by:
setNamespaces in interface XmlTokenizer
Parameters:
spaces - an IMap containing the namespace prefix to IRI mapping for subsequent names
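
To show what a prefix to IRI mapping is used for, the sketch below resolves a possibly prefixed name against such a mapping. Everything here is an assumption made for illustration: java.util.Map stands in for IMap, the empty string is used as the key for the default namespace, and the "{iri}local" output format is just one common notation, not something this library is known to produce.

```java
import java.util.Map;

// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class NamespaceSketch {
    /** Expands "prefix:local" or "local" to "{iri}local" using the mapping. */
    static String resolve(Map<String, String> spaces, String qName) {
        int colon = qName.indexOf(':');
        String prefix = colon < 0 ? "" : qName.substring(0, colon);
        String local = colon < 0 ? qName : qName.substring(colon + 1);
        String iri = spaces.get(prefix);
        if (iri == null) {
            if (prefix.isEmpty()) {
                return local; // unprefixed name with no default namespace
            }
            throw new IllegalStateException("undeclared prefix: " + prefix);
        }
        return "{" + iri + "}" + local;
    }
}
```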
See Also:
XmlTokenizer.setNamespaces(org.millscript.commons.util.IMap)

tryRead

public boolean tryRead(char testch)
Tests that the next character is the specified one. If the character matches it will be dropped.

Parameters:
testch - the character to test for.
Returns:
true if the character is the required one and false otherwise

tryRead

public boolean tryRead(char testch,
                       char testch2)
Tests that the next characters match the two character sequence. If the characters match they will be dropped.

Parameters:
testch - the first character to test for.
testch2 - the second character to test for.
Returns:
true if both characters match and false otherwise
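
The difference between testing and consuming can be sketched with a simple cursor. This assumes that peekRead leaves the input untouched, which its description implies but does not state outright; TryReadSketch and its position() accessor are hypothetical.

```java
// Hypothetical illustration only; not part of org.millscript.commons.xml.tokenizer.
final class TryReadSketch {
    private final String input;
    private int pos;

    TryReadSketch(String input) {
        this.input = input;
    }

    /** Tests the next character without consuming it. */
    boolean peekRead(char testch) {
        return pos < input.length() && input.charAt(pos) == testch;
    }

    /** Consumes the next character only if it matches. */
    boolean tryRead(char testch) {
        if (peekRead(testch)) {
            pos++;
            return true;
        }
        return false;
    }

    /** Consumes the next two characters only if both match; otherwise consumes nothing. */
    boolean tryRead(char testch, char testch2) {
        if (pos + 1 < input.length()
                && input.charAt(pos) == testch
                && input.charAt(pos + 1) == testch2) {
            pos += 2;
            return true;
        }
        return false;
    }

    /** Current read offset, exposed so the effect of each call can be observed. */
    int position() {
        return pos;
    }
}
```

The two-character form is all-or-nothing: on the input "/>x", tryRead('/', 'x') returns false and leaves the '/' in place, so a following tryRead('/', '>') still succeeds.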

tryReadS

public boolean tryReadS()
Tests if the next available characters match the S production in the XML specification. Any sequence of matching characters will be dropped.
 [3] S ::= (#x20 | #x9 | #xD | #xA)+
 

Returns:
true if any characters matched the S production and false otherwise


Copyright © 2005-2007 Open World Ltd. All Rights Reserved.