cross-platform, free and fast
C++ XML Parser
This project started from my frustration that I could not find any simple,
portable XML parser to use inside all my projects (for example, inside the award-winning TIMi software suite created by
the Business-Insight company). Let's look at the well-known Xerces C++ library: the complete
Xerces project is 53 MB (11 MB compressed in a zip file)! In 2003, I was developing
many small tools. I was using XML as the standard for all my input/output configuration and data
files. The source code of my small tools was usually around 600 KB. Under
these conditions, don't you think that 53 MB just to be able to read an XML file is
a little bit "too much"? So I created my own XML parser. My XML parser
"library" is composed of only 2 files: a .cpp file and a .h file.
The total size is 104 KB.
Here is how it works: the XML parser loads a full XML file into memory, parses
the file and generates a tree structure representing the XML file. Of course,
you can also parse XML data that you have already stored yourself in a memory
buffer. Thereafter, you can easily "explore" the tree to get your
data. You can also modify the tree using "add" and "delete" functions
and regenerate a formatted XML string from a subtree. Memory management
is totally transparent through the use of smart pointers (in other words, you
will never have to do any new, delete, malloc or free). ("Smart pointers"
are a primitive version of the garbage collector in Java.)
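To make the "smart pointer" idea more concrete, here is a minimal sketch of a reference-counted node handle. It is not the XMLParser implementation: the names NodeData and Handle are hypothetical, and std::shared_ptr (C++11) stands in for whatever counting scheme the library uses internally.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical payload of one tree node (not the library's real layout).
struct NodeData {
    std::string name;
    std::vector<std::shared_ptr<NodeData>> children;
    explicit NodeData(std::string n) : name(std::move(n)) {}
};

// Hypothetical handle: copying it shares the node instead of duplicating it.
class Handle {
public:
    explicit Handle(std::string name)
        : d_(std::make_shared<NodeData>(std::move(name))) {}

    // Create a child node, attach it to this node and return a handle to it.
    Handle addChild(std::string name) {
        assert(d_ != nullptr);              // a moved-from handle must not be used
        Handle child(std::move(name));
        d_->children.push_back(child.d_);   // the parent keeps the child alive
        return child;
    }

    const std::string& name() const { return d_->name; }
    std::size_t childCount() const { return d_->children.size(); }
    long owners() const { return d_.use_count(); }  // number of live references

private:
    std::shared_ptr<NodeData> d_;
};
```

Copying a Handle only increments a counter. When the last reference to a node disappears, the node is destroyed automatically, without any explicit delete in user code.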
To the best of my knowledge, there is no other "non-validating C++ XML parser"
that is as simple and as powerful.
Here are the characteristics of the XMLparser library:
- Non-validating XML parser written in standard C++ (DTD or XSD information is ignored).
- Cross-platform: the library is currently used every day on Solaris, Linux
(32-bit and 64-bit) and Windows to manipulate "small" PMML
documents (10 MB).
The library has been tested and works flawlessly with the following
compilers: gcc (under Linux, Mac OS X Tiger and many Unix flavours),
Visual Studio 6.0, Visual Studio .NET (under Windows 9x, NT, 2000, XP, Vista, CE, Mobile),
Intel C/C++ compiler, Sun CC compiler, Borland C++ compiler. The library is
also used under Apple OS, iPhone/iPad OS, Amiga OS, QNX and on the Netburner platform.
To the best of my knowledge, all platforms are now supported.
- The parser builds a tree structure that you can "explore" easily
- The parser can be used to generate XML strings from subtrees (it's called
rendering). You can also save subtrees directly to files (automatic "Byte
Order Mark"-BOM support).
- Modification or "from scratch creation" of large XML tree structures
in memory using functions like addChild,
addAttribute, updateAttribute, deleteAttribute,...
- It's SIMPLE: no need to learn how to use dozens of classes.
There is only one simple class: the 'XMLNode' class (which represents one node
of the XML tree).
- Very efficient (efficiency is required to be able to handle BIG files):
- The string parser is very efficient: it makes only one
pass over the XML string to create the tree. It performs the minimal number
of memory allocations. For example, it does NOT use the slow std::string class
but plain, simple and fast C malloc calls. It also allocates large chunks
of memory instead of many small chunks. Inside Visual C++, the "debug
versions" of the memory allocation functions are very slow: do not
forget to compile in "release mode" to get maximum speed.
- The "tree exploration" is very efficient because
all operations on the 'XMLNode' class are handled through references:
there is never any memory copy or memory allocation.
- The XML string rendering is very efficient: it makes
one pass to compute the total memory size of the XML string and a second
pass to actually create the string. There is thus only one memory allocation
and no extra memory copy. Other libraries are slower because they
use the string concatenation operator, which requires many memory (re-)allocations
and memory copies.
- In-memory parsing
- Supports XML namespaces
- Very small and totally stand-alone (not built on top of something else).
Uses only the standard <stdio.h> library (and only for the 'fopen' and
'fread' functions to load the XML file).
- Easy to integrate into your own projects: it's only 2 files! The .h file
does not contain any implementation code. Compilation is thus very fast.
- Robust (we have been using it every day since 2004 inside the Business-Insight company).
Optionally, if you define the C++ preprocessor directives STRICT_PARSING and/or
APPROXIMATE_PARSING, the library can be "forgiving" in case of errors
inside the XML.
I have tried to respect the XML-specs given at: http://www.w3.org/TR/REC-xml/
- Fully integrated error handling:
- The string parser gives you the precise position and
type of the error inside the XML string (if an error is detected).
- The library allows you to "explore" a part
of the tree that is missing. However data extracted from "missing
subtrees" will be NULL. This way, it's really easy to code "error
- Thread-safe (however the global parameters "guessUnicodeChar"
and "strictUTF8Parsing" must be unique because they are shared by all threads).
- Full native support for a wide range of character sets & encodings: ANSI (legacy) /
UTF-8 / Shift-JIS / GB2312 / Big5 / GBK.
Linux, Linux 64 bits & Solaris, we have additionally:
Unicode 16bit / Unicode 32bit widechar characters support that includes:
The XMLParser library is able to successfully handle Chinese, Japanese, Cyrillic and other extended
characters thanks to extended UTF-8 encoding support, Shift-JIS (Japanese) support and GB2312/Big5/GBK (Chinese) encoding support (see this UTF-8-demo
that shows the characters available). If you are still experiencing character
encoding problems, I suggest that you convert your XML files to UTF-8 using
a tool like iconv.
- For the unicode version of the library: Automatic conversion
to Unicode before parsing (if the input XML file is standard ansi 8bit
- For the ascii version of the library: Automatic conversion
to legacy or UTF-8 before parsing (if the input XML file is unicode 16 or 32bit
- Transparent memory management through the use of smart pointers.
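- Limited support for character entities. The currently known character entities are:
  &lt; (<), &gt; (>), &amp; (&), &apos; ('), &quot; (")
  &#xNN; direct access to the ASCII code of any character (in hexadecimal)
  &#NN; direct access to the ASCII code of any character (in standard decimal)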
- Support for a wide range of clearTags that contain unformatted text:
<![CDATA[ ... ]]>,
<!-- ... -->, <PRE> ... </PRE>,
<!DOCTYPE ... >
Unformatted text is not parsed by the library and can contain items that
are usually 'forbidden' in XML (for example: HTML code).
- Support for inclusion of pure binary data (images, sounds,...) into the
XML document using the four provided ultrafast Base64 conversion functions.
- The library is under the Aladdin Free Public License (AFPL).
If you need another license, simply contact me
(I don't want any money for the XMLParser).
- Easy to customize: the code is small, commented and written in a plain and
simple way, in case you really need to change something (but I doubt it).
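The two-pass rendering strategy described in the efficiency notes above (one pass to compute the size, one pass to write into a single allocation) can be sketched as follows. This is only an illustration under simplifying assumptions, not the library's code: the Elem structure and the functions renderedSize, renderInto and renderToCString are hypothetical, and the sketch handles tag names only (no attributes, no text, no formatting).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical minimal element: a tag name plus child elements.
struct Elem {
    std::string name;
    std::vector<Elem> children;
};

// Pass 1: number of characters needed to render e (without the final '\0').
std::size_t renderedSize(const Elem& e) {
    if (e.children.empty()) return e.name.size() + 3;   // "<name/>"
    std::size_t n = 2 * e.name.size() + 5;              // "<name>" + "</name>"
    for (const Elem& c : e.children) n += renderedSize(c);
    return n;
}

// Pass 2: write e at 'out' and return the position just past what was written.
char* renderInto(const Elem& e, char* out) {
    *out++ = '<';
    std::memcpy(out, e.name.data(), e.name.size());
    out += e.name.size();
    if (e.children.empty()) { *out++ = '/'; *out++ = '>'; return out; }
    *out++ = '>';
    for (const Elem& c : e.children) out = renderInto(c, out);
    *out++ = '<';
    *out++ = '/';
    std::memcpy(out, e.name.data(), e.name.size());
    out += e.name.size();
    *out++ = '>';
    return out;
}

// One single allocation for the whole string; the caller releases it with free().
char* renderToCString(const Elem& root) {
    const std::size_t n = renderedSize(root);
    char* buf = static_cast<char*>(std::malloc(n + 1));
    if (buf == nullptr) return nullptr;
    char* end = renderInto(root, buf);
    assert(static_cast<std::size_t>(end - buf) == n);   // both passes must agree
    *end = '\0';
    return buf;
}
```

Compared with repeated string concatenation, the buffer is never re-allocated or copied while the string grows, because its final size is known before the first character is written.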
If you like this library, you can create a link to this page from your
website (use this URL: http://www.applied-mathematics.net/tools/xmlParser.html).
If you want to help other people produce better software using XML technology, you
can increase the visibility of this library by adding a link to this
page (so that its Google ranking increases).
If you like this library, please add
a message in the guestbook.
To obtain the library, simply contact me,
and I will send you the XMLParser library directly, the same day (I will most certainly restore
a direct link to download the XMLParser library in a few weeks). You will receive a zip file by e-mail.
Inside the zip file, you will find 5 examples:
- ansi unix/solaris project example (makefile based)
- wide char unix/solaris project example (makefile based)
- ansi windows project example (for Visual Studio 6 and .NET)
- wide char windows project example (for Visual Studio 6 and .NET)
- ansi windows .dll project with a small test project to check the
Version history
- V1.00: February 20, 2002: initial version from M.C.Brown.
- V1.20: July 22, 2006: After 13 minor changes, 2 major changes,
8 bug fixes and 23 functionality additions (at users' request), I decided to
switch to V2.01.
- V2.01: July 24, 2006: 1 major change, 2 minor changes, 3
- Major Change: no more "stringDup"
required for functions like "addText", "addAttribute",...
The old behavior is still accessible through functions like "addText_WOSD",
"addAttribute_WOSD",... ("_WOSD" stands for "WithOut StringDup").
This change greatly simplifies user code. Unfortunately, existing user
code must be updated to work with the new version.
Fortunately, all the user code used to READ the content of an XML file
is left unchanged: only the "creation of XML" and the "update
of XML" user code requires a little updating work.
- V2.02: July 25, 2006: 1 minor change
- V2.03: July 28, 2006: 1 minor change
- V2.04: August 6, 2006: 1 addition
- V2.05: August 15, 2006: 1 addition
- V2.06: August 16, 2006: 2 additions
- V2.07: August 22, 2006: 1 addition
- V2.08: August 22, 2006: 1 bug fix
- V2.09: August 31, 2006: 1 bug fix
- V2.10: September 21, 2006: 1 bug fix
- V2.11: October 24, 2006: 3 additions, 1 bug fix.
- added the function getParentNode(). Thanks to Jakub
Siudzinski for pointing out a good way to do it easily.
- V2.12: October 25, 2006: 2 additions
- V2.13: October 31, 2006: 1 minor change, 1 bug fix
- V2.14: November 13, 2006: 1 minor change, 1 bug fix
- V2.15: December 22, 2006: 2 additions
- V2.16: December 27, 2006: 1 minor change
- V2.17: January 9, 2007: 1 addition, 1 minor change
- V2.18: January 15, 2007: 1 bug fix
- V2.19: January 30, 2007: 1 bug fix, 3 additions
- V2.20: February 17, 2007: 1 addition
- added a Visual Studio project file to build a DLL version
of the library.
Under Windows, when I have to debug software that uses the XMLParser
Library, it's usually a nightmare because the library is sooOOOoooo slow
in debug mode. To solve this problem, during the whole debugging session,
I use a very fast DLL version of the XMLParser Library (the DLL is compiled
in release mode). Using the DLL version of the XMLParser Library gives
me lightning-fast XML parsing speed, even in debug mode! Other than
that, the DLL version is useless: in the release version of my tool, I
always use the normal, ".cpp"-based, XMLParser Library.
- V2.21: March 1, 2007: 1 minor change, 1 bug fix
- V2.22: March 6, 2007: 1 bug fix
- V2.23: March 13, 2007: 1 bug fix
- V2.24: April 24, 2007: 1 bug fix, 1 addition
- V2.25: May 18, 2007: 1 bug fix
- V2.26: May 22, 2007: 1 bug fix
- V2.27: May 28, 2007: 2 additions, 1 minor change, 2 bug fixes
- V2.28: June 27, 2007: 2 additions, 2 minor changes
- v2.29: July 3, 2007: 1 bug fix
- v2.30: July 31, 2007: 2 bug fixes, 1 addition
- v2.31: August 29, 2007: 1 bug fix
- v2.32: October 4, 2007: 1 addition
- v2.33: October 11, 2007: 1 addition
- v2.34: January 25, 2008: 2 additions
- v2.35: February 2, 2008: 1 minor change
- v2.36: March 9, 2008: 2 bug fixes, 2 additions, 4 minor changes
- v2.37: March 24, 2008: 1 bug fix
- v2.38: June 2, 2008: 3 additions
- v2.39: August 9, 2008: 4 additions, 2 bug fixes
- v2.40: December 19, 2008: 1 minor change
- v2.41: June 25, 2009: 1 minor change
- v2.42: January 4, 2011: 4 minor changes
- modified the function "writeToFile" to gracefully handle the case where it's not possible to write to the file
- slight speed improvement in the parser (inside the tokenizer)
- changed some enumeration names to avoid any "name collision" with user code
- better handling of the BOM when loading an XML file
A small tutorial
Let's assume that you want to parse the XML file "PMMLModel.xml":
<?xml version="1.0" encoding="ISO-8859-1"?>
<PMML>
  <Header copyright="Frank Vanden Berghen">
    Hello world!
    <Application name="&lt;Condor>" version="1.99beta" />
  </Header>
  <Extension name="keys"> <Key name="urn"> </Key> </Extension>
  <DataField name="persfam" optype="continuous" dataType="double">
    <Value value="9.900000e+001" property="missing" />
  </DataField>
  <DataField name="prov" optype="continuous" dataType="double" />
  <DataField name="urb" optype="continuous" dataType="double" />
  <DataField name="ses" optype="continuous" dataType="double" />
  <RegressionModel functionName="regression" modelType="linearRegression">
    <RegressionTable>
      <NumericPredictor name="persfam" coefficient="-0.00275951" />
      <NumericPredictor name="prov" coefficient="0.000319433" />
      <NumericPredictor name="ses" coefficient="-0.000454307" />
      <NONNumericPredictor name="testXmlExample" />
    </RegressionTable>
  </RegressionModel>
</PMML>
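Let's analyse line by line the following small example program:
#include <stdio.h> // to get "printf" function
#include <stdlib.h> // to get "free" and "atof" functions
#include "xmlParser.h"
int main(int argc, char **argv)
{
  // this opens and parses the XML file:
  XMLNode xMainNode=XMLNode::openFileHelper("PMMLModel.xml","PMML");
  // this prints "<Condor>":
  XMLNode xNode=xMainNode.getChildNode("Header");
  printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));
  // this prints "Hello world!":
  printf("Text inside Header tag is :'%s'\n", xNode.getText());
  // this gets the number of "NumericPredictor" tags:
  xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
  int n=xNode.nChildNode("NumericPredictor");
  // this prints the "coefficient" value for all the "NumericPredictor" tags:
  for (int i=0; i<n; i++)
    printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));
  // this prints a formatted output based on the content of the first "Extension" tag of the XML file:
  char *t=xMainNode.getChildNode("Extension").createXMLString(true);
  printf("%s\n",t);
  free(t);
  return 0;
}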
To manipulate the data contained inside the XML file, the first operation is to get an
instance of the class XMLNode that represents the XML file in
memory. You can use:
XMLNode xMainNode=XMLNode::openFileHelper("PMMLModel.xml","PMML");
or, if you use the UNICODE windows version of the library:
or, if the XML document is already in a memory buffer pointed by variable "char
This will create an object called xMainNode
that represents the first tag named PMML
found inside the XML document. This object is the top of the tree structure representing
the XML file in memory. The following command creates a new object called xNode
that represents the "Header"
tag inside the "PMML" tag:
XMLNode xNode=xMainNode.getChildNode("Header");
The following command prints on the screen "<Condor>"
(note that the "&lt;"
character entity has been replaced by "<"):
printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));
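To illustrate what replacing a character entity involves, here is a small stand-alone sketch of an entity decoder. It is not the decoder used inside the XMLParser library: the function name decodeEntities is hypothetical, and the sketch only handles the five predefined XML entities and numeric character references for codes below 128.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Replace the five predefined XML entities and numeric character references
// (&#NN; decimal, &#xHH; hexadecimal, codes below 128 only).
// Anything that is not recognised is copied unchanged.
std::string decodeEntities(const std::string& in) {
    static const struct { const char* name; char ch; } table[] = {
        {"lt", '<'}, {"gt", '>'}, {"amp", '&'}, {"apos", '\''}, {"quot", '"'}};
    std::string out;
    out.reserve(in.size());                 // decoded text is never longer than the input
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '&') { out += in[i]; continue; }
        const std::size_t semi = in.find(';', i + 1);
        if (semi == std::string::npos) { out += in[i]; continue; }
        const std::string body = in.substr(i + 1, semi - i - 1);
        bool done = false;
        for (const auto& e : table) {
            if (body == e.name) { out += e.ch; done = true; break; }
        }
        if (!done && body.size() > 1 && body[0] == '#') {
            const bool hex = (body[1] == 'x');
            const char* digits = body.c_str() + (hex ? 2 : 1);
            char* end = nullptr;
            const long code = std::strtol(digits, &end, hex ? 16 : 10);
            if (end != digits && *end == '\0' && code > 0 && code < 128) {
                out += static_cast<char>(code);
                done = true;
            }
        }
        if (done) i = semi;                 // continue after the ';'
        else out += in[i];                  // unknown entity: keep the '&' as-is
    }
    return out;
}
```

With this sketch, an attribute value stored in the file as an entity followed by "Condor>" is returned to the caller with the entity already replaced, which is the behavior observed in the printf call above.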
The following command prints on the screen "Hello world!":
printf("Text inside Header tag is :'%s'\n", xNode.getText());
Let's assume you want to "go to"
the tag named "RegressionTable":
xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
Note that the previous value of the object named xNode
has been "garbage collected" so that no memory leak occurs. If you
want to know how many tags named "NumericPredictor"
are contained inside the tag named "RegressionTable":
int n=xNode.nChildNode("NumericPredictor");
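The variable n now
contains the value 3. If you want to print the value of the coefficient
attribute for all the NumericPredictor tags:
for (int i=0; i<n; i++)
  printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));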
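Or equivalently, but faster at runtime:
int iterator=0;
for (int i=0; i<n; i++)
  printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",&iterator).getAttribute("coefficient")));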
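If you want to generate and print on the screen the following XML formatted
text:
<Extension name="keys">
  <Key name="urn" />
</Extension>
You can use:
char *t=xMainNode.getChildNode("Extension").createXMLString(true);
printf("%s\n",t);
free(t);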
Note that you must free the memory yourself (using the "free(t);"
statement): only the XMLNode objects and their contents are "garbage collected".
The parameter true passed to
the function createXMLString
means that we want formatted output.
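Because the returned string has to be released with free, callers using C++11 or later may prefer to hand it to an owning wrapper right away, so that it cannot be forgotten. The sketch below shows one possible way to do this. It does not use the library: makeMallocedString is a hypothetical stand-in for any function that returns a malloc'ed string, and FreeDeleter, CStringPtr and copyAndRelease are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Deleter that releases a malloc'ed buffer with free().
struct FreeDeleter {
    void operator()(char* p) const { std::free(p); }
};
using CStringPtr = std::unique_ptr<char, FreeDeleter>;

// Hypothetical stand-in for a function that returns a malloc'ed,
// NUL-terminated string which the caller must free.
char* makeMallocedString(const char* text) {
    assert(text != nullptr);
    const std::size_t n = std::strlen(text) + 1;
    char* p = static_cast<char*>(std::malloc(n));
    if (p != nullptr) std::memcpy(p, text, n);
    return p;
}

// Take ownership immediately so that every exit path frees the buffer,
// then hand back an ordinary std::string copy.
std::string copyAndRelease(char* raw) {
    CStringPtr owner(raw);
    return owner ? std::string(owner.get()) : std::string();
}
```

In the tutorial above, the pointer t could be wrapped in the same way (for example "CStringPtr owner(t);") instead of calling free(t) by hand.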
The XMLParser library contains many other small useful methods that are
not described here (the zip file contains some additional examples that explain
other functionalities, and complete Doxygen documentation of the XMLParser). These methods allow you to:
- navigate easily inside the structure of the XML document
- create, update & save your own XML structure of XMLNode's.
That's all folks! With this basic knowledge, you should be able to easily retrieve
any data from any XML file!