Small, simple, cross-platform, free and fast C++ XML Parser

This project started from my frustration that I could not find any simple, portable XML parser to use inside all my projects (for example, inside the award-winning TIMi software suite commercialized by the TIMi company). Let's look at the well-known Xerces C++ library: the complete Xerces project is 53 MB (11 MB compressed in a zipfile)! In 2003, I was developing many small tools. I was using XML as the standard for all my input/output configuration and data files. The source code of my small tools was usually around 600 KB. Under these conditions, don't you think that 53 MB just to be able to read an XML file is a little bit "too much"? So I created my own XML parser. My XML parser "library" is composed of only 2 files: a .cpp file and a .h file. The total size is 149 KB.

Here is how it works: the XML parser loads a full XML file into memory, parses the file and generates a tree structure representing the XML file. Of course, you can also parse XML data that you have already stored yourself in a memory buffer. Thereafter, you can easily "explore" the tree to get your data. You can also modify the tree using "add" and "delete" functions and regenerate a formatted XML string from a subtree. Memory management is totally transparent through the use of smart pointers: in other words, you will never have to do any new, delete, malloc or free on the XMLNode objects ("smart pointers" are a primitive version of the garbage collector in Java).
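
The "smart pointer" idea mentioned above can be illustrated with a minimal reference-counting handle. This is only a sketch under stated assumptions, not the library's actual code: the names Handle, Payload and g_livePayloads are hypothetical, and the real XMLNode class is more elaborate.

```cpp
#include <cassert>

// Hypothetical payload type; in a real parser this would hold the node data.
struct Payload {
    int refCount = 0;
    int value = 0;
};

// Number of payloads currently alive, tracked only so the sketch can be checked.
static int g_livePayloads = 0;

// Minimal reference-counted handle: copies share one payload, and the
// payload is destroyed when the last handle referring to it goes away.
class Handle {
public:
    Handle() : p_(nullptr) {}
    explicit Handle(int value) : p_(new Payload) {
        p_->value = value;
        p_->refCount = 1;
        ++g_livePayloads;
    }
    Handle(const Handle& other) : p_(other.p_) { acquire(); }
    Handle& operator=(const Handle& other) {
        if (p_ != other.p_) {
            release();        // drop the previously referenced payload
            p_ = other.p_;
            acquire();
        }
        return *this;
    }
    ~Handle() { release(); }

    bool isEmpty() const { return p_ == nullptr; }
    int value() const { assert(p_ != nullptr); return p_->value; }
    int useCount() const { return p_ ? p_->refCount : 0; }

private:
    void acquire() { if (p_) ++p_->refCount; }
    void release() {
        if (p_ && --p_->refCount == 0) {
            delete p_;
            --g_livePayloads;
        }
        p_ = nullptr;
    }
    Payload* p_;
};
```

Assigning a new value to a handle releases the old payload automatically, which is the behaviour the tutorial further down calls "garbage collected".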

UPDATE: Based on the expertise gained during the development of this XML parsing library, I created a new, improved XML parser: the Incredible XML Parser. The Incredible XML Parser has all the nice features of the library described on this page AND it's even faster, more scalable, less memory-hungry and easier to use. To the best of my knowledge, the Incredible XML Parser is the best "non-validating C++ XML parser" currently available 😄 (and by a large margin!). You should definitely check it out!

Here are the characteristics of the (old) XMLparser library:

Non-validating XML parser written in standard C++ (DTD or XSD information is ignored).

Cross-platform: the library is currently used every day on Solaris, Linux (32-bit and 64-bit) and Windows to manipulate "small" PMML documents (10 MB).
The library has been tested and works flawlessly with the following compilers: gcc (under Linux, Mac OS X Tiger and many Unix flavours), Visual Studio 6.0, Visual Studio .NET (under Windows 9x, NT, 2000, XP, Vista, CE, Mobile), Intel C/C++ compiler, SUN CC compiler, Borland C++ compiler. The library is also used under Apple OS, iPhone/iPad OS, Amiga OS, QNX and on the Netburner platform. To the best of my knowledge, all platforms are now supported.

The parser builds a tree structure that you can "explore" easily (DOM-type parser).

The parser can be used to generate XML strings from subtrees (it's called rendering). You can also save subtrees directly to files (automatic "Byte Order Mark"-BOM support).

Modification or "from scratch" creation of large XML tree structures in memory using functions like addChild, addAttribute, updateAttribute, deleteAttribute, ...

It's SIMPLE: no need to learn how to use dozens of classes: there is only one simple class: the 'XMLNode' class (that represents one node of the XML tree).

Quite efficient (efficiency is required to be able to handle BIG files)... But my new XML Parser library (the Incredible XML Parser) is now several orders of magnitude faster:

The string parser is quite efficient: it makes only one pass over the XML string to create the tree. It also tries to limit the number of memory allocations required to build the structure of XMLNode's (the Incredible XML Parser does a significantly better job on this specific point). Inside Visual C++, the "debug versions" of the memory allocation functions are very slow: do not forget to compile in "release mode" to get maximum speed.
The "tree exploration" is very efficient because all operations on the 'XMLNode' class are handled through references: there is never any memory copy or memory allocation.
The XML string rendering is very efficient: it makes one pass to compute the total memory size of the XML string and a second pass to actually create the string. There is thus only one memory allocation and no extra memory copy. Other libraries are slower because they use the string concatenation operator, which requires many memory (re-)allocations and memory copies.
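
The two-pass rendering described above can be sketched as follows. This is a simplified illustration and not the library's code: SketchNode, renderedSize, renderInto and render are hypothetical names, and attributes, escaping and indentation are left out on purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, simplified node: a tag name, optional text, and children.
struct SketchNode {
    std::string name;
    std::string text;
    std::vector<SketchNode> children;
};

// Pass 1: compute the exact number of characters the rendering will need.
std::size_t renderedSize(const SketchNode& n) {
    // "<name>" + text + "</name>"
    std::size_t size = (n.name.size() + 2) + n.text.size() + (n.name.size() + 3);
    for (const SketchNode& c : n.children) size += renderedSize(c);
    return size;
}

// Pass 2: write into a buffer that is already large enough. Returns the
// position just after the last character written.
char* renderInto(const SketchNode& n, char* out) {
    *out++ = '<';
    std::memcpy(out, n.name.data(), n.name.size()); out += n.name.size();
    *out++ = '>';
    std::memcpy(out, n.text.data(), n.text.size()); out += n.text.size();
    for (const SketchNode& c : n.children) out = renderInto(c, out);
    *out++ = '<';
    *out++ = '/';
    std::memcpy(out, n.name.data(), n.name.size()); out += n.name.size();
    *out++ = '>';
    return out;
}

// Allocates the output once, then fills it without any reallocation.
std::string render(const SketchNode& n) {
    std::string result(renderedSize(n), '\0');   // the single allocation
    char* end = renderInto(n, &result[0]);
    assert(end == &result[0] + result.size());   // both passes must agree
    (void)end;
    return result;
}
```

The price of this design is that the tree is walked twice, which is usually cheaper than growing a string repeatedly.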

Complete in-memory parsing (as are 99% of the DOM-style XML parsers).

This is somewhat annoying when working on large files because you need to load the complete file into memory before parsing it. My new XML Parser library (the Incredible XML Parser) reads the file progressively, chunk by chunk (it's a stream-oriented parser), and thus avoids loading the complete file into memory. The memory consumption of the Incredible XML Parser is thus several orders of magnitude better (i.e. smaller).

Supports XML namespaces

Very small and totally stand-alone (not built on top of something else). Uses only standard <stdio.h> library (and only for the 'fopen' and the 'fread' functions to load the XML file).

Easy to integrate into your own projects: it's only 2 files! The .h file does not contain any implementation code. Compilation is thus very fast.

Robust (we have been using it every day since 2004 inside the TIMi company).
Optionally, if you define the C++ preprocessor directives STRICT_PARSING and/or APPROXIMATE_PARSING, the library can be "forgiving" in case of errors inside the XML.
I have tried to respect the XML-specs given at: http://www.w3.org/TR/REC-xml/

Fully integrated error handling :

The string parser gives you the precise position and type of the error inside the XML string (if an error is detected).
The library allows you to "explore" a part of the tree that is missing. However, data extracted from "missing subtrees" will be NULL. This way, it's really easy to code "error handling" procedures.
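
The "missing subtree" behaviour can be sketched with a shared empty node that is returned whenever a child is not found. This is only an illustration of the idea: TreeNode, child, attribute and isEmpty are hypothetical names and do not claim to match the XMLNode implementation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified tree node used only for this sketch.
struct TreeNode {
    std::string name;
    std::vector<std::pair<std::string, std::string>> attributes;
    std::vector<TreeNode> children;

    // A node without a name stands for "nothing was found".
    bool isEmpty() const { return name.empty(); }

    // Returns the first child with the given name, or a shared empty node
    // when there is none, so that calls can be chained safely.
    const TreeNode& child(const char* wanted) const {
        assert(wanted != nullptr);
        static const TreeNode missing{};
        for (const TreeNode& c : children)
            if (c.name == wanted) return c;
        return missing;
    }

    // Returns the attribute value, or nullptr when the attribute (or the
    // whole node) is missing.
    const char* attribute(const char* wanted) const {
        assert(wanted != nullptr);
        for (const auto& a : attributes)
            if (a.first == wanted) return a.second.c_str();
        return nullptr;
    }
};
```

With this approach, a single NULL check at the end of a chain of lookups is enough to detect a problem anywhere along the path.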

Thread-safe (more precisely: reentrant). However, there exist some global parameters (the "guessUnicodeChar" parameter, the "character Encoding" parameter, and the "strictUTF8Parsing" parameter) that you can't change on a per-thread basis because they are shared by all threads. My newest XML Parser library (the Incredible XML Parser) is 100% thread-safe.

Full native support for a wide range of character encodings: ANSI (legacy) / UTF-8 / Shift-JIS / GB2312 / Big5 / GBK.
Under Windows, Linux, Linux 64-bit & Solaris, we additionally have Unicode 16-bit / Unicode 32-bit wide-character support that includes:

For the Unicode version of the library: automatic conversion to Unicode before parsing (if the input XML file uses standard ANSI 8-bit characters).
For the ASCII version of the library: automatic conversion to legacy or UTF-8 before parsing (if the input XML file uses Unicode 16-bit or 32-bit wide characters).

The XMLParser library is able to successfully handle Chinese, Japanese, Cyrillic and other extended characters thanks to extended UTF-8 encoding support, Shift-JIS (Japanese) support and GB2312/Big5/GBK encoding support (Chinese) (see this UTF-8-demo that shows the characters available). If you are still experiencing character encoding problems, I suggest that you convert your XML files to UTF-8 using a tool like iconv (precompiled win32 binary).

Transparent memory management through the use of smart pointers.

Limited support for character entities. The currently known character entities are:

&lt;	<	less than
&gt;	>	greater than
&amp;	&	ampersand
&apos;	'	apostrophe
&quot;	"	quotation mark
&#x4B;	K	direct access to the ascii code of any character (in hexadecimal)
&#75;	K	direct access to the ascii code of any character (in standard decimal)
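
To make the table concrete, here is a lenient sketch of how such entities could be decoded. It is not the library's routine: decodeEntities is a hypothetical function, it only handles the five named entities and numeric references in the ASCII range, and it copies anything it does not recognize unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Decodes the five predefined XML entities plus numeric character
// references below 128. Unrecognized sequences are copied as they are.
std::string decodeEntities(const std::string& in) {
    static const struct { const char* name; char value; } kNamed[] = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'}, {"apos", '\''}, {"quot", '"'}
    };
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        // Only a '&' followed later by a ';' can start an entity.
        std::size_t semi = (in[i] == '&') ? in.find(';', i + 1) : std::string::npos;
        if (semi == std::string::npos) { out += in[i++]; continue; }
        assert(semi > i);
        const std::string body = in.substr(i + 1, semi - i - 1);
        bool decoded = false;
        if (body.size() > 1 && body[0] == '#') {
            // "&#x4B;" is hexadecimal, "&#75;" is decimal.
            const bool hex = (body[1] == 'x' || body[1] == 'X');
            const char* digits = body.c_str() + (hex ? 2 : 1);
            char* end = nullptr;
            const long code = std::strtol(digits, &end, hex ? 16 : 10);
            if (end != digits && *end == '\0' && code > 0 && code < 128) {
                out += static_cast<char>(code);
                decoded = true;
            }
        } else {
            for (const auto& e : kNamed)
                if (body == e.name) { out += e.value; decoded = true; break; }
        }
        if (decoded) i = semi + 1;   // skip the whole entity
        else out += in[i++];         // keep the '&' and move on
    }
    return out;
}
```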

Support for a wide range of clearTags that contain unformatted text:
<![CDATA[ ... ]]>, <!-- ... -->, <PRE> ... </PRE>, <!DOCTYPE ... >
Unformatted text is not parsed by the library and can contain items that are usually 'forbidden' in XML (for example: HTML code).

Support for inclusion of pure binary data (images, sounds,...) into the XML document using the four provided ultrafast Base64 conversion functions.
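
For readers unfamiliar with Base64, the encoding step can be sketched as below. This is a plain, standard Base64 encoder written for illustration; base64Encode is a hypothetical name and this is not one of the four functions shipped with the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encodes `len` bytes as standard Base64 text, padded with '='.
std::string base64Encode(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((len + 2) / 3) * 4);   // 4 output characters per 3 input bytes
    for (std::size_t i = 0; i < len; i += 3) {
        const std::size_t remaining = len - i;
        // Pack up to three bytes into one 24-bit block.
        unsigned long block = static_cast<unsigned long>(data[i]) << 16;
        if (remaining > 1) block |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (remaining > 2) block |= static_cast<unsigned long>(data[i + 2]);
        // Emit four 6-bit groups, replacing the missing ones by padding.
        out += kAlphabet[(block >> 18) & 63];
        out += kAlphabet[(block >> 12) & 63];
        out += (remaining > 1) ? kAlphabet[(block >> 6) & 63] : '=';
        out += (remaining > 2) ? kAlphabet[block & 63] : '=';
    }
    return out;
}

// Convenience overload for data held in a std::string.
std::string base64Encode(const std::string& bytes) {
    return base64Encode(reinterpret_cast<const unsigned char*>(bytes.data()),
                        bytes.size());
}
```

The resulting text only contains characters that are legal inside XML text and attribute values, which is why Base64 is a convenient way to embed binary data.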

Nice & Complete Doxygen documentation.

The library is under the Aladdin Free Public License(AFPL).
If you need another license, simply ask (I don't want any money for the XMLParser).

Easy to customize: the code is small, commented and written in a plain and simple way. Thus, if you really need to change something (but I doubt it), it's easy.

Download

If you like this library, you can create a URL-link towards this page from your website (use this URL: http://www.applied-mathematics.net/tools/xmlParser.html). If you want to help other people to produce better software using XML technology, you can increase the visibility of this library by adding a URL-link towards this page (so that its Google ranking increases).

If you like this library, please add a message in the guestbook!

Download here: small, simple, multi-platform XMLParser library with examples (zipfile).
Inside the zip file, you will find 5 examples:

ansi unix/solaris project example (makefile based)
wide char unix/solaris project example (makefile based)
ansi windows project example (for Visual Studio 6 and .NET)
wide char windows project example (for Visual Studio 6 and .NET)
ansi windows .dll project with a small test project to check the generated .dll

If you have a Kindle, you might also be interested in KKCM: the Kranf Kindle Collection Manager.

Log

Version changes:

V1.00: February 20, 2002: initial version
V1.20: July 22, 2006: After 13 minor changes, 2 major changes, 8 bug fixes and 23 functionality additions (at users' request), I decided to switch to V2.01.
V2.01 to v2.16: 2006: 1 major change, 7 minor changes, 14 additions and 6 bug fixes
V2.17 to v2.33: 2007: 5 minor changes, 13 additions and 14 bug fixes
- added a Visual Studio project file to build a DLL version of the library.
  Under Windows, when I have to debug a piece of software that uses the XMLParser library, it's usually a nightmare because the library is sooOOOoooo slow in debug mode. To solve this problem, during the whole debugging session, I use a very fast DLL version of the XMLParser library (the DLL is compiled in release mode). Using the DLL version of the XMLParser library allows me to have reasonable XML parsing speed, even in debug mode! Other than that, the DLL version is useless: in the release version of my tool, I always use the normal, ".cpp"-based XMLParser library.
  Please note that my newest XML Parser library (the Incredible XML Parser) is ultra fast, even in debug mode.
V2.34 to v2.40: 2008: 5 minor changes, 11 additions and 5 bug fixes
v2.41 to v2.41: 2009: 1 minor change
v2.42 to 2.43: 2011: 4 minor changes and 1 bug fix
v2.44: May 19, 2013: 1 minor change, 1 bug fix
- FIX: the "xmltol()" function now returns 64 bit integers instead of 32 bit integers.
- updated the documentation

A small tutorial

Let's assume that you want to parse the XML file "PMMLModel.xml" that contains:

<?xml version="1.0" encoding="ISO-8859-1"?>
<PMML version="3.0"
  xmlns="http://www.dmg.org/PMML-3-0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema_instance" >
  <Header copyright="Frank Vanden Berghen">
     Hello TIMi!
     <Application name="&lt;Condor>" version="1.99beta" />
  </Header>
  <Extension name="keys"> <Key name="urn"> </Key> </Extension>
  <DataDictionary>
    <DataField name="persfam" optype="continuous" dataType="double">
       <Value value="9.900000e+001" property="missing" />
    </DataField>
    <DataField name="prov" optype="continuous" dataType="double" />
    <DataField name="urb" optype="continuous" dataType="double" />
    <DataField name="ses" optype="continuous" dataType="double" />
  </DataDictionary>
  <RegressionModel functionName="regression" modelType="linearRegression">
    <RegressionTable intercept="0.00796037">
      <NumericPredictor name="persfam" coefficient="-0.00275951" />
      <NumericPredictor name="prov" coefficient="0.000319433" />
      <NumericPredictor name="ses" coefficient="-0.000454307" />
      <NONNumericPredictor name="testXmlExample" />
    </RegressionTable>
  </RegressionModel>
</PMML>

Let's analyse line by line the following small example program:

#include <stdio.h>    // to get "printf" function
#include <stdlib.h>   // to get "free" function
#include "xmlParser.h"

int main(int argc, char **argv)
{
  // this opens and parses the XML file:
  XMLNode xMainNode=XMLNode::openFileHelper("PMMLModel.xml","PMML");

  // this prints "<Condor>":
  XMLNode xNode=xMainNode.getChildNode("Header");
  printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));
  
  // this prints "Hello TIMi!":
  printf("Text inside Header tag is :'%s'\n", xNode.getText());

  // this gets the number of "NumericPredictor" tags:
  xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
  int n=xNode.nChildNode("NumericPredictor");

  // this prints the "coefficient" value for all the "NumericPredictor" tags:
  for (int i=0; i<n; i++)
    printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));

  // this prints a formatted output based on the content of the first "Extension" tag of the XML file:
  char *t=xMainNode.getChildNode("Extension").createXMLString(true);
  printf("%s\n",t);
  free(t);
  return 0;
}

To manipulate the data contained inside the XML file, the first operation is to get an instance of the class XMLNode that is representing the XML file in memory. You can use:

XMLNode xMainNode=XMLNode::openFileHelper("PMMLModel.xml","PMML");

or, if you use the UNICODE windows version of the library:

XMLNode xMainNode=XMLNode::openFileHelper("PMMLModel.xml",_T("PMML"));

or, if the XML document is already in a memory buffer pointed to by the variable "char *xmlDoc":

XMLNode xMainNode=XMLNode::parseString(xmlDoc,"PMML");

This will create an object called xMainNode that represents the first tag named PMML found inside the XML document. This object is the top of the tree structure representing the XML file in memory. The following command creates a new object called xNode that represents the "Header" tag inside the "PMML" tag.

XMLNode xNode=xMainNode.getChildNode("Header");

The following command prints on the screen "<Condor>" (note that the "&lt;" character entity has been replaced by "<"):

printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));

The following command prints on the screen "Hello TIMi!":

printf("Text inside Header tag is :'%s'\n", xNode.getText());

Let's assume you want to "go to" the tag named "RegressionTable":

xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");

Note that the previous value of the object named xNode has been "garbage collected" so that no memory leak occurs. If you want to know how many tags named "NumericPredictor" are contained inside the tag named "RegressionTable":

int n=xNode.nChildNode("NumericPredictor");

The variable n now contains the value 3. If you want to print the value of the coefficient attribute for all the NumericPredictor tags:

for (int i=0; i<n; i++)
  printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));

Or equivalently, but faster at runtime:

int iterator=0;
for (int i=0; i<n; i++)
  printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",&iterator).getAttribute("coefficient")));
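
One plausible reason why the second form is faster: looking up the i-th child by index has to rescan the children from the start on every call, while a cursor lets each call resume where the previous one stopped. The sketch below illustrates that difference on a plain list of tag names; findNth and findNext are hypothetical helpers and do not reproduce the library's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Finds the i-th child called `name` by scanning from the start every time.
// Returns its position in `children`, or -1 when there is no such child.
int findNth(const std::vector<std::string>& children, const std::string& name, int i) {
    for (std::size_t pos = 0; pos < children.size(); ++pos)
        if (children[pos] == name && i-- == 0) return static_cast<int>(pos);
    return -1;
}

// Finds the next child called `name` starting at *cursor, then moves
// *cursor past the match so that the following call resumes from there.
int findNext(const std::vector<std::string>& children, const std::string& name, int* cursor) {
    assert(cursor != nullptr && *cursor >= 0);
    for (std::size_t pos = static_cast<std::size_t>(*cursor); pos < children.size(); ++pos) {
        if (children[pos] == name) {
            *cursor = static_cast<int>(pos) + 1;
            return static_cast<int>(pos);
        }
    }
    *cursor = static_cast<int>(children.size());   // nothing left to scan
    return -1;
}
```

Visiting all n matching children costs a number of comparisons that grows quadratically with findNth, but only linearly with findNext.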

If you want to generate and print on the screen the following XML formatted text:

<Extension name="keys">
  <Key name="urn" />
</Extension>

You can use:

char *t=xMainNode.getChildNode("Extension").createXMLString(true);
printf("%s\n",t);
free(t);

Note that you must free the memory yourself (by calling "free(t);"): only the XMLNode objects and their contents are "garbage collected". The parameter true passed to the function createXMLString means that we want formatted output.
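
If you prefer not to call free yourself, one option in modern C++ is to wrap the returned buffer in a std::unique_ptr with a custom deleter. The sketch below is an assumption-laden illustration: FreeDeleter, MallocString and makeMallocedString are hypothetical names, and makeMallocedString is only a stand-in for any function that returns a malloc'ed C string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter that releases memory obtained from malloc().
struct FreeDeleter {
    void operator()(char* p) const { std::free(p); }
};

// Owning pointer that calls free() automatically when it goes out of scope.
using MallocString = std::unique_ptr<char, FreeDeleter>;

// Hypothetical stand-in for a function that returns a malloc'ed C string
// which the caller is responsible for freeing.
char* makeMallocedString(const char* text) {
    assert(text != nullptr);
    const std::size_t size = std::strlen(text) + 1;
    char* copy = static_cast<char*>(std::malloc(size));
    if (copy != nullptr) std::memcpy(copy, text, size);
    return copy;
}
```

Assuming the string returned by createXMLString is released with free as shown in the tutorial, the same wrapper could take ownership of it, for example MallocString t(xNode.createXMLString(true)); followed by printf("%s\n", t.get());.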

The XML Parser library contains many other small useful methods that are not described here (the zip file contains some additional examples explaining other functionalities and complete Doxygen documentation for the XMLParser). These methods allow you to:

navigate easily inside the structure of the XML document
create, update & save your own XML structure of XMLNode's.

That's all folks! With this basic knowledge, you should be able to easily retrieve any data from any XML file!