IncredibleXMLParser  3.05
Incredible XML Parser library

Introduction

Version
V3.05
Author
Frank Vanden Berghen

Copyright (c) 2013, Frank Vanden Berghen - All rights reserved.
See the file AFPL-license.txt about the licensing terms

The Incredible XML Parser library is an advanced non-validating XML parser written in ANSI C++ for portability.
The main objectives of the Incredible XML Parser library are:

  1. user-friendliness (i.e. it should be easy to use).
  2. speed & scalability (it should process terabyte size XML files in a few hours).
  3. Small foot-print & no dependencies (i.e. this must remain a small library, easy to include & compile).

The Incredible XML Parser library includes 3 parsers:

  1. A very fast XML Pull Parser (IXMLPullParser) that requires very little memory to run. The Pull Parser is very fast but it does not offer the flexibility and the user-friendliness of a full-fledged DOM parser.
  2. A fast XML DOM parser (IXMLDomParser), built on top of the Pull Parsers, that provides more comfort when manipulating XML elements. It works by using recursion and building a node tree for breaking down the elements of an XML document.
  3. An ultra fast JSON Pull Parser (IJSONPullParser) that requires very little memory to run. The JSON Pull Parser is ultra fast and is compatible with the Incredible XML DOM Parser so that you can build in-memory a node tree that allows you to easily explore your JSON file.

The Incredible XML DOM Parser, the Incredible XML Pull Parser and the Incredible JSON Pull Parser can all process terabyte-size XML/JSON files in a few hours on commodity hardware with very low memory consumption (i.e. less than a few megabytes).

The three parsers (the two Pull Parsers and the DOM parser) generate strings either in "char*" or in "wchar_t*" mode. In "char*" mode, the Incredible XML Parser supports nearly all currently known character encodings (and it's very easy to add new ones if required). In "wchar_t*" mode, the Incredible XML Parser manipulates utf-16 strings. The Incredible XML Parser also automatically converts between character encodings (e.g. it automatically converts from "utf-8" to "utf-16" when using the "wchar_t*" version of the library). The Incredible XML Parser is the only small foot-print, non-validating XML parser that supports a very wide range of different character encodings.

The three parsers (the Incredible Pull Parsers and the Incredible DOM parser) work on a stream of data. This means that you don't need to load the complete XML/JSON file (or the complete XML/JSON string) into memory: you only need to provide a function (i.e. the "read" function of an IXMLReader object) that returns successive small "chunks" of the XML/JSON stream. This has several advantages:

  1. you are no longer limited by the size of your RAM.
  2. very low and (more or less) constant memory consumption.
  3. you can very easily process streamed data (such as data coming from an HTTP connection or data coming from the decompression of a ZIP file).
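As a rough illustration of the chunk-based idea described above, here is a minimal sketch in standard C++. It does not use the IXMLParser API: `ChunkSource`, its `read` signature and `countOpeningTags` are hypothetical names invented for this example, and the "parsing" is deliberately naive (it ignores comments, CDATA and processing instructions). It only shows that a consumer fed through a small fixed-size buffer needs memory proportional to the buffer, not to the stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a reader object: it hands out the stream in
// small chunks. The name and the signature are illustrative only.
struct ChunkSource {
    std::string data;      // pretend this is a file or a socket
    std::size_t pos = 0;   // how many bytes have already been handed out

    // Copies at most `capacity` bytes into `buffer` and returns the number
    // of bytes copied, or 0 once the stream is exhausted.
    std::size_t read(char* buffer, std::size_t capacity) {
        std::size_t n = std::min(capacity, data.size() - pos);
        std::memcpy(buffer, data.data() + pos, n);
        pos += n;
        return n;
    }
};

// Counts opening tags ('<' not followed by '/') using one fixed-size buffer.
// Memory use depends on bufferSize, not on the length of the stream.
std::size_t countOpeningTags(ChunkSource& source, std::size_t bufferSize) {
    assert(bufferSize > 0);
    std::string buffer(bufferSize, '\0');
    std::size_t count = 0;
    bool previousWasLessThan = false;  // survives across chunk boundaries
    for (;;) {
        std::size_t n = source.read(&buffer[0], buffer.size());
        if (n == 0) break;
        for (std::size_t i = 0; i < n; ++i) {
            if (previousWasLessThan && buffer[i] != '/') ++count;
            previousWasLessThan = (buffer[i] == '<');
        }
    }
    return count;
}
```

The result is the same whatever buffer size is chosen, because the only state carried from one chunk to the next is a single flag.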

The Incredible XML&JSON Pull parsers are 100% "in-place" parsers. This means that they do NOT copy strings: they only initialize different pointers to the memory buffer containing the XML/JSON data (There is however one inevitable memory copy when converting between different character encodings: for example when the Pull Parser is forced to convert the characters from "utf-16" to "utf-8"). "In-place" parsers are a lot faster because they do not require copying the whole data into separate buffers. The Incredible XML&JSON Pull parsers are thus among the fastest XML parsers available (and they might even be the fastest).

You can configure the size of all the memory buffers used inside the Incredible XML&JSON Parsers. Setting small buffer sizes reduces the memory consumption of the parser but usually also slightly increases the computation time. The default buffer sizes are chosen to give good speed on a normal-size PC.

These 3 functionalities:

  1. the possibility to work on streamed data (i.e. constant memory consumption).
  2. 100% "in-place" parser (no memory buffer required to store "copied" strings).
  3. all memory buffer sizes are configurable.

...ensure that the Incredible XML/JSON Pull Parser is the XML/JSON Parser with the SMALLEST memory consumption amongst all parsers.

All the strings returned by the XML/JSON Pull Parser are zero-terminated so that you can directly and very easily use them. For example, you can write:

printf("name=%s",pullParser.getName());

because "getName()" returns a zero-terminated char* (or wchar_t*). The Incredible XML/JSON Parser is the only "in-place" Pull parser that returns zero-terminated strings without a performance penalty (i.e. without copying the whole string into a separate buffer). It's thus a lot more "usable" than all other "in-place" Pull Parsers.
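One common way to obtain zero-terminated strings without copying is to overwrite the delimiter that follows each token with '\0' directly in the input buffer. The sketch below shows that general technique in standard C++; it is an assumption about how such parsers can work, not a description of the library's internals, and `InPlaceElement` and `parseInPlace` are hypothetical names.

```cpp
#include <cassert>
#include <cstring>

// Both pointers refer to bytes inside the caller's buffer: nothing is copied.
struct InPlaceElement {
    char* name;
    char* text;
};

// Parses a single "<name>text</name>" element in place.
// The buffer is modified: the '>' that ends the opening tag and the '<' that
// starts the closing tag are overwritten with '\0', which zero-terminates
// the name and the text where they already are.
// Returns false if the input does not have that simple shape.
bool parseInPlace(char* buffer, InPlaceElement& out) {
    assert(buffer != nullptr);
    if (buffer[0] != '<') return false;
    char* nameEnd = std::strchr(buffer + 1, '>');
    if (!nameEnd) return false;
    char* textEnd = std::strchr(nameEnd + 1, '<');
    if (!textEnd) return false;
    *nameEnd = '\0';
    *textEnd = '\0';
    out.name = buffer + 1;
    out.text = nameEnd + 1;
    return true;
}
```

After a successful call on a writable buffer holding `<name>Frank</name>`, `out.name` and `out.text` can be passed straight to printf with "%s". The price is that the input buffer is altered, so it cannot be parsed a second time.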

The XML DOM parser is able to "hot start" to create a node tree out of a sub-section of the original XML file. This means that, if you have an XML file such as this one:

<AllCustomers>
<OneCustomer> <name>Frank </name> <age>38</age> </OneCustomer>
<OneCustomer> <name>Sabrina</name> <age>36</age> </OneCustomer>
<OneCustomer> <name>David </name> <age>33</age> </OneCustomer>
</AllCustomers>

...you will typically call the XML DOM parser 3 times (i.e. one time for each customer). When the DOM parser "hot starts", it always re-uses the same RAM memory space as the previous call so that no additional memory allocations occur. It is thus extremely fast. Since we are building in memory an XMLNode structure that only contains ONE customer at a time, the memory consumption is very small (and independent of the total size of the XML file!). The "hot start" functionality is unique and very important because it allows us to use a very flexible DOM-style Parser on UNLIMITED XML file size (see example7()).
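The memory re-use behind a "hot start" can be pictured with a preallocated node pool that is rewound, not freed, before each record. The sketch below is plain C++ and only illustrates that general idea; `Node`, `NodePool` and `loadCustomer` are hypothetical names, and the library's real data structures are certainly richer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A very reduced "node": it only points at strings owned by someone else.
struct Node {
    const char* name;
    const char* text;
};

// All nodes are allocated once, up front. Building a new tree rewinds the
// pool instead of allocating, so the previous tree is overwritten.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) : nodes_(capacity), used_(0) {
        assert(capacity > 0);
    }

    // Forget the previous tree; the storage itself is kept for the next one.
    void reset() { used_ = 0; }

    // Hands out the next preallocated node, or nullptr when the pool is full.
    Node* create(const char* name, const char* text) {
        if (used_ == nodes_.size()) return nullptr;
        Node* n = &nodes_[used_++];
        n->name = name;
        n->text = text;
        return n;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<Node> nodes_;
    std::size_t used_;
};

// Builds the two nodes of one customer record, re-using the pool.
// Returns the first node of the record (nullptr if the pool is too small).
Node* loadCustomer(NodePool& pool, const char* name, const char* age) {
    pool.reset();  // "hot start": the previous record's nodes are gone
    Node* first = pool.create("name", name);
    pool.create("age", age);
    return first;
}
```

Calling `loadCustomer` once per customer returns the same address every time: the nodes of the previous customer have been overwritten, which is why a tree built this way has to be used (or copied) before the next call.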

The Incredible XML Parser is the only DOM-Style parser that is able to work on UNLIMITED XML/JSON file size (all other DOM-Style parsers are always limited to files smaller than a few megabytes). The Incredible XML Parser is thus the only parser that allows you to very easily analyze very complex XML/JSON files (thanks to the easy-to-use DOM-style parser) of UNLIMITED size.

The main bottleneck in any DOM-Style parser is always the memory allocations. If you remove this bottleneck (as inside the Incredible XML Parser), you obtain a parser that is between 10 and 100 times faster (depending on the structure of the XML/JSON). This explains why the Incredible XML DOM parser is also the fastest DOM-Style parser currently available. Thanks to the "Hot Start" functionality, the Incredible XML DOM parser does not perform any memory allocations to build the different node trees (i.e. in the above example, there are no memory allocations to build the XMLNode structure of each customer). The extreme speed of the Incredible XML DOM parser allows you to easily manipulate extremely large XML files (i.e. terabyte XML files are processed in a few hours on commodity hardware).

To summarize:

  1. The Incredible XML Pull Parser has one of the lowest memory consumptions amongst all XML Pull parsers.
  2. The Incredible JSON Pull Parser has one of the lowest memory consumptions amongst all JSON Pull parsers.
  3. The Incredible XML DOM Parser has the lowest memory consumption amongst all XML DOM parsers.
  4. The Incredible XML Pull Parser is one of the fastest XML Pull parsers.
  5. The Incredible JSON Pull Parser is one of the fastest JSON Pull parsers.
  6. The Incredible XML DOM Parser is the fastest XML DOM parser.
  7. The Incredible XML DOM Parser is the only DOM parser able to work on UNLIMITED file size.
  8. The 2 Incredible XML Parsers are able to handle nearly any character encoding.
  9. The 3 Incredible Parsers fully support "char*" mode and "wchar_t*" mode.
  10. The 3 Incredible Parsers are able to handle streamed data.
  11. The 3 Incredible Parsers are 100% thread-safe (more precisely: they are reentrant).
  12. The Incredible XML DOM Parser provides ultra fast XPATH support.
  13. The Incredible XML Pull Parser is one of the easiest-to-use XML Pull parsers.
  14. The Incredible JSON Pull Parser is one of the easiest-to-use JSON Pull parsers.
  15. The Incredible XML Dom Parser is a good replacement for the old XMLParser.

Points 8 to 13 above are very UNCOMMON in small foot-print, non-validating XML Parsers.

First Tutorial

You can follow a simple Tutorial to know the basics...

By default, the Incredible XML DOM parser creates a tree of ITCXMLNode. Because of the "hot start" functionality, this tree will disappear at the next call to the DOM parser (because the Incredible DOM parser always re-uses the same memory space to store the tree to avoid any memory allocation). The name ITCXMLNode is the acronym of "Incredible Transient Constant XMLNode". "Transient" means that the tree disappears at each call to the DOM parser. "Constant" means that you cannot change the tree (i.e. it's read-only: for example, you cannot add or remove any child nodes).

You can always convert an ITCXMLNode to an ICXMLNode (note that the 'T' letter in ICXMLNode is missing because it's not "transient" anymore), so that the tree obtained with the DOM parser still remains in memory after a new call to the DOM parser.

You can always convert an ITCXMLNode (or an ICXMLNode) to an IXMLNode (note that the 'T' and 'C' letters in IXMLNode are missing because this object is neither "transient" nor "constant" anymore). You can edit/update IXMLNode's using the classical, well-known functions (e.g. the functions addChildNode(), addAttribute(), deleteNodeContent(), etc.). The IXMLNode class is 100% compatible with the old, well-known XMLNode class from the old XMLParser library.

To summarize:

When using the Incredible DOM parser, you have access to 3 types of XMLNodes:

  1. ITCXMLNode: Transient & Constant XMLNode:
    • upside: Very fast to create because the DOM Parser does not need to perform any memory allocation to create ITCXMLNode's.
    • downside: Disappears at each new call to the DOM Parser, not editable.
  2. ICXMLNode: Constant XMLNode:
    • upside: Quite efficient because we need to perform only ONE memory allocation to create a complete tree of ICXMLNode.
    • downside: Not editable.
  3. IXMLNode: Editable XMLNode:
    • upside: Editable, 100% equivalent to the XMLNode class from the old library.
    • downside: Quite Slow because we need to perform many memory allocations to create each IXMLNode.

For most operations, these 3 types of XMLNodes are interchangeable (however, only the IXMLNode supports "editing" operations). The main difference between these 3 XMLNode classes is the way they manage memory allocations.

General usage: How to include the IXMLParser library inside your project.

The library is composed of only two files: IXMLParser.cpp and IXMLParser.h. These are the ONLY 2 files that you need when using the library inside your own projects.

All the functions of the library are documented inside the comments of the file IXMLParser.h. These comments can be transformed into full-fledged HTML documentation using the DOXYGEN software: simply type: "doxygen doxy.cfg"

By default, the IXMLParser library uses (char*) for string representation. To use the (wchar_t*) version of the library, you need to define the "_UNICODE" preprocessor symbol (this is usually done inside your project definition file; if you are using Visual Studio, the IDE does it automatically for you).
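The usual way a single code base offers both a "char*" and a "wchar_t*" flavour is a compile-time switch on such a symbol. The sketch below shows the general pattern only: `DemoChar`, `DEMO_TEXT` and `demoLength` are hypothetical names made up for this example, and the actual type names used by the library are the ones declared in IXMLParser.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical aliases selected at compile time by the _UNICODE symbol.
#ifdef _UNICODE
typedef wchar_t DemoChar;               // wide-character build
#define DEMO_TEXT(s) L##s               // turns "abc" into L"abc"
inline std::size_t demoLength(const DemoChar* s) {
    assert(s != nullptr);
    return std::wcslen(s);
}
#else
typedef char DemoChar;                  // default narrow-character build
#define DEMO_TEXT(s) s                  // leaves "abc" unchanged
inline std::size_t demoLength(const DemoChar* s) {
    assert(s != nullptr);
    return std::strlen(s);
}
#endif
```

Code written against `DemoChar` and `DEMO_TEXT` compiles unchanged in both modes. With a command-line compiler, the symbol is typically set by passing -D_UNICODE (or /D_UNICODE with Visual C++).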

Advanced Tutorial and Many Examples of usage.

Some very small introductory examples are described inside the Tutorial file IXMLParser.html.

Some additional small examples are also inside the file IXMLTest.cpp (for the "char*" version of the library) and inside the file IXMLTestUnicode.cpp (for the "wchar_t*" version of the library). If you have a question, please review these additional examples before sending an e-mail to the author.

To build the examples:

Debugging with the IXMLParser library

Debugging under WINDOWS

The file IXML_Autoexp.txt contains some "tweaks" that improve substantially the display of the content of the ITCXMLNode, ICXMLNode & IXMLNode objects inside the Visual Studio Debugger. Believe me, once you have seen inside the debugger the "smooth" display of the ITCXMLNode objects, you cannot live without it anymore!

The Incredible XML Parser library is designed to minimize the quantity of memory allocations. As long as you are using ITCXMLNode objects or ICXMLNode objects, the number of memory allocations remains extremely low. However, manipulating IXMLNode objects requires many allocations or re-allocations. Inside Visual C++, the "debug versions" of the memory allocation functions are very slow: Do not forget to compile in "release mode" to get maximum speed.

When I had to debug software that was using the IXMLNode objects, it was usually a nightmare because the library was really slow in debug mode (because of the slow memory allocations in Debug mode). To solve this problem, during debugging sessions of code that includes IXMLNode, I am now using a very fast DLL version of the IXMLParser Library (the DLL is compiled in release mode). Using the DLL version of the IXMLParser Library allows me to have lightning XML parsing speed even in debug! Other than that, the DLL version is useless: In the release version of my tool, I always use the normal, ".cpp"-based, IXMLParser Library (I simply include the IXMLParser.cpp and IXMLParser.h files into the project).