In 2003, I started working on XML technology and I produced my
first XMLParser
library. This old library is now used in thousands of applications all around the
world (and also in space! 😲 ). The main objective of the old XMLParser library was to allow me to easily manipulate
input/output configuration files and some small XML data files. The old library was limited to relatively
small data files (typically, smaller than 10 MB) because it's a pure DOM-style parser 😒 .
During the next 10 years, I received many emails from coders using the old XMLParser library to parse
larger and larger XML files (some individuals use it to parse 300 MB XML files!). Although the old library managed to parse these larger files, it consumed a
very large amount of RAM (sometimes up to 10 GB) and of CPU resources. Furthermore, I am now
manipulating (inside Anatella) terabyte-size XML files. In May 2013, I decided that it was
time for an "upgrade"! 😉 ...and the Ultimate XML Parser was born! 😊
The Ultimate XML Parser is composed of only 2 files: a .cpp file and a .h file.
The total size is 220 KB.
The Ultimate XML Parser library includes two parsers. The example below uses the following XML file (named "PMMLModel.xml" in the code that follows):
<?xml version="1.0" encoding="ISO-8859-1"?>
<PMML version="3.0"
xmlns="http://www.dmg.org/PMML-3-0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema_instance" >
<Header copyright="Frank Vanden Berghen"> Hello World!
<Application name="&lt;Condor&gt;" version="1.99beta" />
</Header> <Extension name="keys"> <Key name="urn"> </Key> </Extension>
<DataDictionary>
<DataField name="persfam" optype="continuous" dataType="double">
<Value value="9.900000e+001" property="missing" />
</DataField>
<DataField name="prov" optype="continuous" dataType="double" />
<DataField name="urb" optype="continuous" dataType="double" />
<DataField name="ses" optype="continuous" dataType="double" />
</DataDictionary>
<RegressionModel functionName="regression" modelType="linearRegression">
<RegressionTable intercept="0.00796037">
<NumericPredictor name="persfam" coefficient="-0.00275951" />
<NumericPredictor name="prov" coefficient="0.000319433" />
<NumericPredictor name="ses" coefficient="-0.000454307" /> <NONNumericPredictor name="testXmlExample" />
</RegressionTable>
</RegressionModel>
</PMML>
Let's analyse line by line the following small example program:
#include <stdio.h>   // to get the "printf" function
#include <stdlib.h>  // to get the "atof" function
#include "xmlParser.h"

int main(int argc, char **argv)
{
  // This creates a new Ultimate XML DOM parser:
  UXMLDomParser uDom;

  // This opens and parses the XML file:
  UTCXMLNode xMainNode=uDom.openFileHelper("PMMLModel.xml","PMML");

  // This prints "<Condor>":
  UTCXMLNode xNode=xMainNode.getChildNode("Header");
  printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));

  // This prints "Hello World!":
  printf("Text inside Header tag is :'%s'\n", xNode.getText());

  // This gets the number of "NumericPredictor" tags:
  xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
  int n=xNode.nChildNode("NumericPredictor");

  // This prints the "coefficient" value for all the "NumericPredictor" tags:
  for (int i=0; i<n; i++)
    printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));

  // This creates a UXMLRenderer object and uses this object to print a formatted XML string based on
  // the content of the first "Extension" tag of the XML file (more details below):
  printf("%s\n",UXMLRenderer().getString(xMainNode.getChildNode("Extension")));
  return 0;
}
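A side note on the first printf of the program above: it prints <Condor>, whereas a well-formed XML file has to store that attribute value as &lt;Condor&gt;, because a raw "<" is not allowed inside an attribute value. The parser does the replacement for you. As a rough illustration of what such a replacement involves, here is a minimal, self-contained sketch that handles the five predefined XML entities. The function decodeEntities is hypothetical: it is not part of the library and makes no claim about how the library implements this step.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (NOT part of the Ultimate XML Parser): replaces the
// five predefined XML character entities by the character they stand for.
// Anything else, including an unknown "&name;", is copied unchanged.
std::string decodeEntities(const std::string &in)
{
    static const struct { const char *entity; char ch; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        bool replaced = false;
        if (in[i] == '&') {
            for (const auto &e : table) {
                const std::string name(e.entity);
                // Does the input contain this entity at position i?
                if (in.compare(i, name.size(), name) == 0) {
                    out += e.ch;
                    i += name.size();
                    replaced = true;
                    break;
                }
            }
        }
        if (!replaced) out += in[i++]; // ordinary character: copy it
    }
    return out;
}
```

With this sketch, decodeEntities("&lt;Condor&gt;") returns the string <Condor>. The input is scanned only once, so a text such as &amp;lt; becomes &lt; and is not decoded a second time.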
To easily manipulate the data contained inside the XML file, the first operation is to create a UXMLDomParser object (in the above example, it's named "uDom") and use it to get an instance of the class UTCXMLNode that represents the XML file in memory. You can use:
UTCXMLNode xMainNode=uDom.openFileHelper("PMMLModel.xml","PMML");
or, if you use the UNICODE Windows version of the library:
UTCXMLNode xMainNode=uDom.openFileHelper(L"PMMLModel.xml",L"PMML");
or, if the XML document is already in a memory buffer pointed to by the variable "char *xmlDoc":
UTCXMLNode xMainNode=uDom.parseString(xmlDoc,"PMML");
This will create an object called xMainNode that represents the first tag named PMML found inside the XML document. This object is the top of the tree structure representing the XML file in memory. The following command creates a new object called xNode that represents the "Header" tag inside the "PMML" tag:
UTCXMLNode xNode=xMainNode.getChildNode("Header");
The following command prints on the screen "<Condor>" (note that the "&lt;" character entity has been replaced by "<"):
printf("Application Name is: '%s'\n", xNode.getChildNode("Application").getAttribute("name"));
The following command prints on the screen "Hello World!":
printf("Text inside Header tag is :'%s'\n", xNode.getText());
Let's assume you want to "go to" the tag named "RegressionTable":
xNode=xMainNode.getChildNode("RegressionModel").getChildNode("RegressionTable");
Note that the previous value of the object named xNode has been "garbage collected" so that no memory leak occurs.
If you want to know how many tags named "NumericPredictor" are contained inside the tag named "RegressionTable":
int n=xNode.nChildNode("NumericPredictor");
The variable n now contains the value 3. If you want to print the value of the coefficient attribute for all the NumericPredictor tags:
for (int i=0; i<n; i++) printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",i).getAttribute("coefficient")));
Or equivalently, but faster at runtime:
int iterator=0;
for (int i=0; i<n; i++) printf("coeff %i=%f\n",i+1,atof(xNode.getChildNode("NumericPredictor",&iterator).getAttribute("coefficient")));
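Why would the iterator form be faster? The text above does not say how the library implements either form, so the following is only a plausible explanation under one assumption: the index form rescans the children from the first one at every call, while the iterator form resumes where the previous call stopped. The toy model below (ToyNode, findByIndex, findNext and the two cost functions are all hypothetical names, unrelated to the real UTCXMLNode class) counts the name comparisons needed by each strategy.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Toy stand-in for a DOM node (NOT the library's UTCXMLNode): it only
// stores the ordered list of child tag names.
struct ToyNode {
    std::vector<const char *> children;
    long comparisons = 0; // number of name comparisons done so far

    // Index form: position of the j-th child called "name" (or -1).
    // Every call scans again from the first child.
    int findByIndex(const char *name, int j) {
        for (std::size_t k = 0; k < children.size(); k++) {
            comparisons++;
            if (std::strcmp(children[k], name) == 0 && j-- == 0) return (int)k;
        }
        return -1;
    }

    // Iterator form: starts at *pos and leaves *pos just after the match,
    // so that the next call resumes where this one stopped.
    int findNext(const char *name, int *pos) {
        for (std::size_t k = (std::size_t)*pos; k < children.size(); k++) {
            comparisons++;
            if (std::strcmp(children[k], name) == 0) { *pos = (int)k + 1; return (int)k; }
        }
        *pos = (int)children.size();
        return -1;
    }
};

// Number of comparisons needed to visit the n children called "name".
long costByIndex(ToyNode node, const char *name, int n) {
    node.comparisons = 0;
    for (int i = 0; i < n; i++) node.findByIndex(name, i);
    return node.comparisons;
}

long costByIterator(ToyNode node, const char *name, int n) {
    node.comparisons = 0;
    int it = 0;
    for (int i = 0; i < n; i++) node.findNext(name, &it);
    return node.comparisons;
}
```

For the four children of the "RegressionTable" tag of the example file, this toy model needs 1+2+3 = 6 comparisons with the index form and 3 with the iterator form. Under the stated assumption, the gap grows quadratically with the number of matching children, which would matter for very large files.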
If you want to generate and print on the screen the following XML formatted text:
<Extension name="keys"> <Key name="urn" /> </Extension>
You can use:
UXMLRenderer uRenderer;
char *t=uRenderer.getString(xMainNode.getChildNode("Extension"),true);
printf("%s\n",t);
Note that you must NOT free the memory buffer containing the returned XML string yourself (you must NOT write
any "free(t);"): the memory buffer
containing the XML string is owned by the uRenderer
object and it will be freed when the uRenderer object
is destroyed (i.e. when it goes out of scope).
The parameter true passed to
the function getString() means that we want formatted output.
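The ownership rule above can be illustrated with a small, self-contained sketch. The class ToyRenderer and the function renderAndKeep are hypothetical and are not the library's UXMLRenderer; in particular, the fact that each call to getString() replaces the previous buffer is an assumption of this toy class only. The sketch shows a class that owns the buffer it returns, and one way to keep the text after the owner has been destroyed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical class (NOT the library's UXMLRenderer): it only mimics the
// rule "the object owns the buffer returned by getString()".
class ToyRenderer {
public:
    ToyRenderer() : buffer(nullptr) {}
    ~ToyRenderer() { std::free(buffer); }                   // buffer released here
    ToyRenderer(const ToyRenderer &) = delete;              // a copy would free twice
    ToyRenderer &operator=(const ToyRenderer &) = delete;

    // Returns a string owned by this object: the caller must NOT free it.
    // In this toy class, the pointer stays valid until the next call to
    // getString() or until the object is destroyed.
    char *getString(const char *text) {
        std::free(buffer);
        std::size_t len = std::strlen(text);
        buffer = static_cast<char *>(std::malloc(len + 1));
        if (buffer) std::memcpy(buffer, text, len + 1);
        return buffer;
    }

private:
    char *buffer;
};

// To keep the text longer than the renderer lives, copy it while the
// renderer still exists.
std::string renderAndKeep(const char *text) {
    ToyRenderer r;
    const char *t = r.getString(text);
    return t ? std::string(t) : std::string(); // copy made before r is destroyed
}
```

With a design like this one, calling "free(t);" yourself would release the same buffer a second time when the object is destroyed. Copying the string is the safe way to use it after the renderer goes out of scope.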
The Ultimate XML Parser library contains many other small, useful methods that are
not described here (the zip file contains some additional examples that explain
other functionalities, and a complete Doxygen documentation of the UXMParser). These methods allow you to: