Introduction to textwolf

How to use textwolf

To use textwolf we have to include include/textwolf.hpp:
#include "textwolf.hpp"
The namespace textwolf is defined there for all templates introduced here.

W3C standards and textwolf

textwolf follows the standard of W3C XML 1.0 but does not implement all of it. The following list shows the exceptions:

Wide character set encodings

textwolf is a template library that is instantiated with an encoding type as template parameter. To honor the encoding declared in the XML header, we first have to parse the header to determine the encoding and then use the matching instance of textwolf. The helper class XmlHdrParser in textwolf/xmlhdrparser.hpp parses the XML header and finds out the encoding of the document.
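
The two-step approach described above (read the declared encoding first, then pick the instance) can be sketched without textwolf. The function below is a hypothetical stand-in, not the XmlHdrParser interface, whose signature is not shown here. It assumes the header is readable as ASCII-compatible bytes, so it would not work for a document stored in a two-byte encoding.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of textwolf): returns the value of the
// encoding attribute of the XML declaration at the start of 'doc'.
// Falls back to "UTF-8", the XML 1.0 default, if nothing is declared.
std::string declaredEncoding( const std::string& doc)
{
	const std::string fallback = "UTF-8";
	if (doc.compare( 0, 5, "<?xml") != 0) return fallback;

	// Only look inside the declaration itself.
	std::string::size_type end = doc.find( "?>");
	if (end == std::string::npos) return fallback;
	std::string::size_type key = doc.find( "encoding", 5);
	if (key == std::string::npos || key > end) return fallback;

	// Skip white space and the '=' between the name and the quoted value.
	std::string::size_type pos = key + 8;
	while (pos < end
		&& (std::isspace( (unsigned char)doc[ pos]) || doc[ pos] == '='))
	{
		++pos;
	}
	if (pos >= end) return fallback;
	char quote = doc[ pos];
	if (quote != '"' && quote != '\'') return fallback;

	std::string::size_type close = doc.find( quote, pos + 1);
	if (close == std::string::npos || close > end) return fallback;
	assert( close > pos);
	return doc.substr( pos + 1, close - pos - 1);
}
```

The returned name would then select the instantiation, for example one using charset::IsoLatin for a document declared as ISO-8859-1.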

Definitions

textwolf defines the following templates and types

TextScanner

The TextScanner class template defines an iterator on the characters of the input as

Template parameters

  1. InputIterator = input iterator type (explained in the section "Iterators")
  2. Charset = encoding of the input (explained in the section "Character set encodings")

Example

char* input = ...;
textwolf::TextScanner<char*,charset::UCS2<LE> > itr( input);
while (itr->chr()) itr++;

XMLScanner

The XMLScanner class template defines the state of a parser that scans XML elements (tags, attributes, values, content, etc.) on a source defined by an input iterator.

Template parameters

  1. InputIterator = input iterator type (explained in the section "Iterators")
  2. InputCharSet = character set encoding of the input (explained in the section "Character set encodings")
  3. OutputCharSet = character set encoding of the output (explained in the section "Character set encodings")
  4. OutputBuffer = buffer type to use for the tokens parsed (STL back insertion sequence interface)

Constructor arguments

Example

char* input = ...;
std::string outputbuf;
typedef textwolf::XMLScanner
	<
		char*,
		charset::IsoLatin,
		charset::IsoLatin,
		std::string
	> Scan;
Scan xs( input, outputbuf);
for (Scan::iterator itr = xs.begin(); itr != xs.end(); itr++)
{
	switch (itr->type())
	{
		case Scan::ErrorOccurred: throw(...);
		case Scan::OpenTag: ... break;
		case Scan::Content: ... break;
		...
	}
}
Close-tag events are reported without the tag name. It is therefore not possible to validate an XML document with textwolf.

XMLPathSelectAutomaton

This class template defines an automaton for selecting tokens by XML path expressions and for assigning an integer type to each selected token. It has one template argument:
  1. OutputCharSet = the encoding in which the tokens are stored and therefore in which format they are processed.
The automaton construction is described in the section "How to define an XML path expression automaton".

XMLPathSelect

The XMLPathSelect class template defines the state of a set of XML path selections to filter the output of an XMLScanner iterator. It is constructed by passing
  1. textwolf::Automaton* atm = the pointer to an XML path selection automaton
The class template definition has the following parameters
  1. CharSet = encoding used for processing
The XMLPathSelect class is fed with the output of an XMLScanner. The idea is to iterate on an input with XMLScanner and to push every element fetched to the XMLPathSelect. After every push we can iterate on the new results we got with this push.

Character set encodings

Predefined

The following encodings are defined in the textwolf::charset namespace as examples:

How to define our own

We can define our own character set encodings. A structure with the following interface is passed to a textwolf iterator as a character set encoding definition. textwolf uses a single Unicode character type (UChar) as the common representation through which encodings are mapped to each other.
/// \class Interface
/// \brief Interface that has to be implemented for a character set encoding
struct Interface
{
	/// \brief Defines the highest Unicode character representation 
	///	of a character in this encoding.
	enum {MaxChar=0xFF};

	/// \brief Skip to start of the next character
	/// \param [in] buf buffer for the character data
	/// \param [in,out] bufpos position in 'buf'
	/// \param [in,out] itr iterator to skip
	template <class Iterator>
	static void skip( char* buf, unsigned int& bufpos, Iterator& itr);

	/// \brief Fetches the ascii char representation of the current character
	/// \param [in] buf buffer for the parsed character data
	/// \param [in,out] bufpos position in 'buf'
	/// \param [in,out] itr iterator on the source
	/// \return the value of the ascii character or -1
	template <class Iterator>
	static char asciichar( char* buf, unsigned int& bufpos, Iterator& itr);

	/// \brief Fetches the bytes of the current character into a buffer
	/// \param [in] buf buffer for the parsed character data
	/// \param [in,out] bufpos position in 'buf'
	/// \param [in,out] itr iterator on the source
	template <class Iterator>
	static void fetchbytes( char* buf, unsigned int& bufpos, Iterator& itr);

	/// \brief Fetches the Unicode character representation of the current character
	/// \param [in] buf buffer for the parsed character data
	/// \param [in,out] bufpos position in 'buf'
	/// \param [in,out] itr iterator on the source
	/// \return the value of the Unicode character
	template <class Iterator>
	static UChar value( char* buf, unsigned int& bufpos, Iterator& itr);

	/// \brief Prints a Unicode character to a buffer
	/// \tparam Buffer STL back insertion sequence
	/// \param [in] chr character to print
	/// \param [out] buf buffer to print to
	template <class Buffer>
	static void print( UChar chr, Buffer& buf);
};
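
As an illustration of the print member of this interface, the following sketch writes a Unicode character as UTF-8 into an STL back insertion sequence. It is a partial example under assumptions: the name Utf8PrintSketch is hypothetical, UChar is assumed here to be an unsigned 32 bit integer, and the other members of the interface (skip, asciichar, fetchbytes, value) are left out.

```cpp
#include <cassert>
#include <string>

typedef unsigned int UChar;	// assumption: at least 32 bits wide

// Hypothetical partial encoding definition: only 'print' is shown.
struct Utf8PrintSketch
{
	enum {MaxChar=0x10FFFF};

	/// \brief Appends the UTF-8 byte sequence of 'chr' to 'buf'
	template <class Buffer>
	static void print( UChar chr, Buffer& buf)
	{
		assert( chr <= (UChar)MaxChar);
		if (chr < 0x80)
		{
			// 1 byte: 0xxxxxxx
			buf.push_back( (char)chr);
		}
		else if (chr < 0x800)
		{
			// 2 bytes: 110xxxxx 10xxxxxx
			buf.push_back( (char)(0xC0 | (chr >> 6)));
			buf.push_back( (char)(0x80 | (chr & 0x3F)));
		}
		else if (chr < 0x10000)
		{
			// 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
			buf.push_back( (char)(0xE0 | (chr >> 12)));
			buf.push_back( (char)(0x80 | ((chr >> 6) & 0x3F)));
			buf.push_back( (char)(0x80 | (chr & 0x3F)));
		}
		else
		{
			// 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
			buf.push_back( (char)(0xF0 | (chr >> 18)));
			buf.push_back( (char)(0x80 | ((chr >> 12) & 0x3F)));
			buf.push_back( (char)(0x80 | ((chr >> 6) & 0x3F)));
			buf.push_back( (char)(0x80 | (chr & 0x3F)));
		}
	}
};
```

Because print only relies on push_back, the same code works with std::string, std::vector<char> or any other back insertion sequence.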

Iterators

Input

The TextScanner and XMLScanner template classes expect an input iterator on a sequence of bytes with the following properties:
The following iterators for various purposes are already defined:
The following example shows how we can define our own input iterator. It shows a simple textwolf iterator based on an STL iterator. We have to define a structure like this:
template <typename iterator>
struct twiterator
{
	iterator m_itr;
	iterator m_end;

	twiterator( const iterator& b, const iterator& e)
			:m_itr(b),m_end(e) {}

	char operator*() const
	{
		if (m_itr >= m_end)
		{
			return 0;
		}
		else
		{
			return *m_itr;
		}
	}
	twiterator& operator++()
	{
		++m_itr; return *this;
	}
};
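
The same shape can be given to other sources. The sketch below wraps a std::istream, assuming (from the example above, not from a stated requirement) that a dereference yielding 0 at the end of data and a prefix increment are sufficient. StreamSource, countBytes and countBytesOfString are hypothetical names; countBytes only stands in for a consumer such as a scanner.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical input iterator on a std::istream.
struct StreamSource
{
	std::istream* m_in;

	explicit StreamSource( std::istream* in)
		:m_in(in) { assert( in != 0); }

	// Current byte without consuming it, 0 at end of data
	char operator*() const
	{
		std::istream::int_type c = m_in->peek();
		if (c == std::istream::traits_type::eof()) return 0;
		return std::istream::traits_type::to_char_type( c);
	}
	// Consume the current byte
	StreamSource& operator++()
	{
		m_in->get();
		return *this;
	}
};

// Stand-in for a consumer: reads until the source yields 0.
template <class Source>
std::size_t countBytes( Source src)
{
	std::size_t n = 0;
	for (; *src != 0; ++src) ++n;
	return n;
}

std::size_t countBytesOfString( const std::string& text)
{
	std::istringstream in( text);
	return countBytes( StreamSource( &in));
}
```

As with twiterator, a 0 byte inside the data cannot be distinguished from the end of data with this convention.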

Chunkwise processing of input

For chunk-by-chunk feeding of input, a longjmp structure (jmp_buf) can be defined and passed to the source iterator. The iterator can issue a longjmp when it reaches the end of the chunk currently processed. The jmp_buf has to be initialized by the caller of textwolf. textwolf only guarantees that its state is preserved, so that it can be called again and continue once more data is available. Because textwolf is completely table driven, it can save its state in a stable way. The caller continues the iteration when the next chunk has been prepared and passed to the source iterator. If no jmp_buf is defined, the end of a chunk is treated as the end of the data. The following example extends the iterator from the example above with a jmp_buf for handling the end of a chunk in chunkwise processing of input:
template <typename iterator>
struct twiterator
{
	iterator m_itr;
	iterator m_end;
	jmp_buf* m_eom;

	twiterator( const iterator& b, const iterator& e, jmp_buf* eom=0)
			:m_itr(b),m_end(e),m_eom(eom) {}

	char operator*() const
	{
		if (m_itr >= m_end)
		{
			if (m_eom) longjmp(*m_eom,1);
			return 0;
		}
		else
		{
			return *m_itr;
		}
	}
	twiterator& operator++()
	{
		++m_itr; return *this;
	}
};
Using the iterator example for chunk by chunk data processing:
jmp_buf eom;
typedef twiterator<std::string::iterator> ItrType;
std::string outputbuf;
typedef textwolf::XMLScanner
	<
		ItrType,
		charset::IsoLatin,
		charset::IsoLatin,
		std::string
	> Scan;
std::string chunk = <first chunk>;
Scan xs( ItrType( chunk.begin(), chunk.end(), &eom), outputbuf);
Scan::iterator itr = xs.begin();

if (setjmp(eom) != 0)
{
	//... the iterator processed the whole chunk and needs a new one
	chunk = <call for getting the next chunk>;
	xs.setSource( ItrType( chunk.begin(), chunk.end(), &eom));
}
for (; itr != xs.end(); itr++)
{
	... < processing the found elements > ...
}
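
The control flow of the setjmp/longjmp handshake can be tried without textwolf. In the following sketch a plain byte counting loop stands in for the scanner, and all names (ChunkSource, CountState, loadNextChunk, run, countBytes) are hypothetical. The mutable state is kept outside the function that calls setjmp, because automatic variables of that function that change between setjmp and longjmp have indeterminate values afterwards.

```cpp
#include <cassert>
#include <csetjmp>
#include <cstddef>
#include <cstring>

// Byte source on one chunk; jumps to *eom when the chunk is exhausted.
struct ChunkSource
{
	const char* pos;
	const char* end;
	std::jmp_buf* eom;	// 0 = end of chunk is end of data

	char operator*() const
	{
		if (pos >= end)
		{
			if (eom) std::longjmp( *eom, 1);
			return 0;
		}
		return *pos;
	}
	ChunkSource& operator++() { ++pos; return *this; }
};

// State that has to survive the jump.
struct CountState
{
	std::jmp_buf eom;
	ChunkSource src;
	std::size_t next;	// index of the next chunk to load
	std::size_t count;	// bytes consumed so far
};

static void loadNextChunk( CountState& st, const char* const* chunks, std::size_t n)
{
	const char* c = chunks[ st.next++];
	st.src.pos = c;
	st.src.end = c + std::strlen( c);
	// The last chunk gets no jump target: its end is the end of the data.
	st.src.eom = (st.next < n) ? &st.eom : 0;
}

static std::size_t run( CountState& st, const char* const* chunks, std::size_t n)
{
	loadNextChunk( st, chunks, n);
	if (setjmp( st.eom) != 0)
	{
		// The source reached the end of a chunk: supply the next one.
		loadNextChunk( st, chunks, n);
	}
	// Stand-in for the scanner loop; it is resumed after every jump.
	while (*st.src != 0) { ++st.count; ++st.src; }
	return st.count;
}

std::size_t countBytes( const char* const* chunks, std::size_t n)
{
	assert( n > 0);
	CountState st;
	st.next = 0;
	st.count = 0;
	return run( st, chunks, n);
}
```

In C++ a longjmp does not run destructors, so the frames that are skipped by the jump should not hold objects with non-trivial destructors. In this sketch only the call of operator* is skipped.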

Output

The following iterators refer to InputIterator as the required input iterator type (see input):

How to define an XML path expression automaton

An XML path expression automaton is defined as a tree. Every expression defines a set containing the root node as start of the selection. Subsequent node selections in the expression reference the previously selected nodes and create a new set of selected nodes. The set of nodes we get at the end of the selection defines the set of nodes selected by the expression. The following operators are defined for building node selection expressions.
For a node 'A'
The operator '--' has a special role. It corresponds to the operator '//' in the abbreviated syntax of XPath expressions: the selection that follows it also applies to all descendants of the current node.

textwolf and XPath

textwolf does not have the power of XPath and does not aim to. It does not buffer more than the token currently processed and therefore cannot detect patterns that require buffering. It cannot even cope with the fact that tag attributes in XML are unordered. For expressions that cannot be expressed in this model, we have to build the logic around textwolf. textwolf is not XPath, but with some additional effort we get an engine that can process at least the abbreviated syntax of XPath without parent references and content conditions.
For example
A//ter[@id='5' and @name='kaspar']
has to be translated to
A--["ter"]("id","5")("name","kaspar")
A--["ter"]("name","kaspar")("id","5")
and
A//ter[@id='5' or @name='kaspar']
to
A--["ter"]("id","5")
A--["ter"]("name","kaspar")
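The translation of an 'and' condition into one expression per attribute order can be generated mechanically. The sketch below only builds the expressions as text in the notation used above; expandAnd and Condition are hypothetical names, and feeding the result to an automaton is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string,std::string> Condition;	// (attribute name, value)

// Returns one expression per ordering of the attribute conditions.
std::vector<std::string> expandAnd( const std::string& prefix, std::vector<Condition> conds)
{
	// n conditions produce n! expressions, so keep n small.
	assert( conds.size() <= 8);

	// std::next_permutation enumerates all orderings when it
	// starts from the sorted sequence.
	std::sort( conds.begin(), conds.end());
	std::vector<std::string> rt;
	do
	{
		std::string expr = prefix;
		for (std::size_t ii = 0; ii < conds.size(); ++ii)
		{
			expr += "(\"" + conds[ ii].first + "\",\"" + conds[ ii].second + "\")";
		}
		rt.push_back( expr);
	}
	while (std::next_permutation( conds.begin(), conds.end()));
	return rt;
}
```

For the two conditions on 'id' and 'name' this yields exactly the two expressions of the 'and' example.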
Some cases are even worse. If we select attribute values of a tag that also has attribute conditions, we can solve this only in the filter functions applied to the iterator results after calling textwolf. Selections have to be at the end of an expression, because they are not buffered. Therefore something like
A//ter[@id='se1']/@name
cannot be expressed in textwolf for the case where 'name' appears before 'id' in the XML. The expression
A--["ter"]("id","se1")("name",0)
works only for the case where 'id' appears before 'name'. A possible solution is to define
A--["ter"]("id","se1") = 201;
A--["ter"]("name",0) = 202;
and to combine the two results in the code that consumes them: a result of type 202 stores the value on the element, a result of type 201 sets a flag, and the element is created only if both have been seen.
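
The combination of the two selections 201 and 202 could look like the following sketch. ElementCollector and its members are hypothetical names. How the consumer learns that the attributes of the current element are complete is not covered by the text above, so onElementDone is an assumption and has to be called by the surrounding code.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum { IdMatched = 201, NameValue = 202 };	// types assigned in the automaton

// Hypothetical consumer of the selection results for one element at a time.
struct ElementCollector
{
	bool m_idMatched;			// a result of type 201 was seen
	bool m_hasName;				// a result of type 202 was seen
	std::string m_name;			// value delivered with type 202
	std::vector<std::string> m_accepted;	// names of the accepted elements

	ElementCollector()
		:m_idMatched(false),m_hasName(false) {}

	// Called for every result, in whatever order the attributes appear
	void onResult( int type, const std::string& value)
	{
		assert( type == IdMatched || type == NameValue);
		if (type == IdMatched)
		{
			m_idMatched = true;
		}
		else
		{
			m_name = value;
			m_hasName = true;
		}
	}

	// Assumption: called when the attributes of the element are complete
	void onElementDone()
	{
		if (m_idMatched && m_hasName) m_accepted.push_back( m_name);
		m_idMatched = false;
		m_hasName = false;
		m_name.clear();
	}
};
```

Because the flag and the value are stored independently, the element is accepted regardless of whether 'id' or 'name' comes first in the XML.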


Copyright © Patrick Frey, 2010-2014