I present a small XML library that parses or creates XML documents with an extremely light-weight DOM that uses just 8 bytes per node. This post shows some of the interesting algorithms that make this possible.
Another XML Library?
I know what you are thinking: not another XML library! And you would be right, especially considering that XML is becoming more obsolete each year in favor of arguably better formats like JSON and YAML. But there is still a lot of XML out there, both old and new, and it will remain relevant for years to come. Besides, this post is not about presenting another XML library, but about some of the algorithms it uses to create a very lightweight DOM. You may find (parts of) those algorithms useful in other situations as well. But if you don’t care about that and just want to see how this library holds up against other libraries, then feel free to skip to the end for a comparison.
As you probably know, an XML DOM is a tree of nodes, where nodes are usually either elements or text. Each node needs to hold the following information:
- The node type. We support 4 node types: Element, Text, Comment and CData. XML has some other node types such as processing instructions, declarations and document type declarations, but we don’t support those.
- The parent node.
- For Element nodes:
- A list of child nodes.
- A list of attributes.
- The element name.
- For Text, Comment and CData nodes:
- The string value.
So how are you going to fit all this information into just 8 bytes? Well, obviously you can’t: a single pointer to a parent node for example is already 8 bytes on a 64-bit system. But we can try to fit most of this information into 8 bytes.
Key here is to store this information in a slightly different way. For example, storing a list of child nodes and a list of attributes is inefficient. The reference to the list object itself is already 8 bytes on 64-bit systems. Furthermore, these lists allocate additional dynamic memory to store their contents. Another way to store this information is to use a linked list instead. This is nothing new, and many other XML libraries use this approach. In this case, each XML node holds the following information:
- Node type.
- Parent node.
- For Element nodes:
- Pointer to the first child node.
- Pointer to the next sibling node.
- Optional pointer to the previous sibling node (to create a doubly linked list).
- Pointer to the first attribute.
- Element name.
- For Text, Comment and CData nodes:
- The string value.
- And for each attribute:
- Pointer to the next attribute.
- Attribute name.
- Attribute value.
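The first-child/next-sibling scheme described above can be sketched in a few lines of Python (the library itself is written in Delphi; the class and field names here are purely illustrative):

```python
class Node:
    """A DOM node using the first-child/next-sibling linked-list scheme."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.first_child = None   # head of this node's child list
        self.next_sibling = None  # next child of the same parent

    def append_child(self, child):
        child.parent = self
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.next_sibling is not None:  # walk to the end of the chain
                last = last.next_sibling
            last.next_sibling = child

    def children(self):
        child = self.first_child
        while child is not None:
            yield child
            child = child.next_sibling

root = Node('foo')
root.append_child(Node('bar'))
root.append_child(Node('baz'))
```

Note that no node stores a list object: every parent/child relation is expressed through just two links per node, which is what makes the compact encoding below possible.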
At first sight, this only seems to make things worse, since you are storing a lot of pointers now for each node.
Storing a pointer to the previous sibling node is not required, but it improves performance for certain node operations (like deleting nodes). With a previous-sibling pointer, the children form a doubly linked list, which avoids a linear search for the preceding sibling when a node is removed.
But this organization is key to squishing everything down into 8 bytes. We’ll see how a bit later.
Why 8 Bytes?
But why this seemingly arbitrary choice to try to store all this information in 8 bytes? Well, I got interested in this issue when looking at the pugixml library. This is a popular XML library for C++ that supports a “compact mode”. In compact mode, each node is stored in just 12 bytes. Basically, I wanted to see if I could do better. And also, 12 bytes didn’t seem like an ideal number to me, not being a power of 2 and all.
Another motivation is that I wanted to reduce memory fragmentation by allocating complete memory pages at a time to store the nodes. Then, instead of freeing individual nodes, I could just free a complete memory page, freeing a whole bunch of nodes at once. This model avoids small memory allocations and reduces memory fragmentation.
On all operating systems that Delphi supports, memory pages are at least 4096 bytes (4 KB) in size. On some operating systems, the size of a memory page may be larger, but always a multiple of 4096. In my testing, memory pages are 4 KB on macOS and Android and 16 KB on 64-bit iOS. On Windows, you actually reserve larger blocks that are 64 KB in size (called the ‘allocation granularity’), and then allocate (commit) smaller 4 KB pages in these larger blocks.
The XML library doesn’t assume these page sizes. It uses GetSystemInfo on Windows to check the allocation granularity and page size, and uses sysconf(_SC_PAGESIZE) on all other platforms to get the page size. To allocate pages, we use VirtualAlloc on Windows and mmap on all other platforms.
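As an aside, you can inspect the same system values from Python's standard library, whose mmap module wraps the sysconf/GetSystemInfo queries mentioned above:

```python
import mmap

# Hardware page size: 4096 on most platforms, 16384 on e.g. 64-bit iOS.
print(mmap.PAGESIZE)

# On Windows this is 65536 (the 64 KB allocation granularity);
# on POSIX systems it simply equals the page size.
print(mmap.ALLOCATIONGRANULARITY)
```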
So the library allocates 4096 bytes at a time. Using a node size of 8 bytes, this fits 512 nodes.
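A page-at-a-time allocator along these lines can be sketched in Python. This is a simplified model using bytearray objects as blocks; the real library allocates raw pages with VirtualAlloc/mmap, and the BlockAllocator name is hypothetical:

```python
BLOCK_SIZE = 4096
SEGMENT_SIZE = 8
SEGMENTS_PER_BLOCK = BLOCK_SIZE // SEGMENT_SIZE  # 512

class BlockAllocator:
    """Hands out 8-byte segments carved from 4 KB blocks."""

    def __init__(self):
        self.blocks = []
        self.next_segment = SEGMENTS_PER_BLOCK  # forces a new block on first alloc

    def alloc_segment(self):
        """Return (block, index) of a fresh 8-byte segment."""
        if self.next_segment == SEGMENTS_PER_BLOCK:
            self.blocks.append(bytearray(BLOCK_SIZE))  # one new 4 KB block
            self.next_segment = 0
        block, index = self.blocks[-1], self.next_segment
        self.next_segment += 1
        return block, index  # segment occupies block[index*8 : index*8 + 8]

    def free_all(self):
        # Segments are never freed individually; whole blocks go at once.
        self.blocks.clear()
        self.next_segment = SEGMENTS_PER_BLOCK

alloc = BlockAllocator()
for _ in range(513):  # 513 segments need two 4 KB blocks
    alloc.alloc_segment()
```

Freeing is the key simplification: there is no per-node bookkeeping at all, which is exactly what avoids small allocations and fragmentation.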
From Pointers to Indices
So let’s assume for now that our XML documents contain at most 512 nodes. This makes the following discussion a bit easier to follow. I will show later how this limitation is removed.
In this case, we can store all these nodes into a single memory page. This has the following important implication: instead of each node storing pointers to other nodes, we could store indices to other nodes in the same memory page instead. Since a page can hold up to 512 nodes, an index to another node is in the range 0..511. And it takes just 9 bits to store a value between 0 and 511. So instead of using 32 or 64 bits (depending on platform) to store a pointer, we use just 9 bits instead (regardless of whether you are running on a 32-bit or 64-bit OS).
A Compact Element
So let’s see how we can partition the 64 bits of a node to store as much information as possible. For Element nodes, we store the following:
| Field | Bits |
| --- | --- |
| Node type | 2 |
| Index of parent node | 9 |
| Index of first child node | 9 |
| Index of next sibling node | 9 |
| Index of previous sibling node | 9 |
| Index of first attribute | 9 |
| Index of element name | 17 |
Since we support 4 node types, 2 bits suffice to store the type. We store attributes in the same way as nodes; that is, each attribute is also stored in an 8-byte segment of a memory page. So we also just need 9 bits to store the index of the first attribute.
Index 0 is reserved to represent a nil-node or nil-attribute. So if an element has no attributes, the “Index of first attribute” field will be 0.
This amounts to 2+5×9=47 bits to store the node type and indices. This leaves 17 bits to store the element name. Of course, we cannot store a string in these 17 bits, but we can use these 17 bits as an index into a list of strings. This has the additional advantage that elements with the same name all use the same index into the string list, and the actual string value is only stored once in total, instead of once for each element with that name. Since many database-like XML files use the same element names over and over again, this can lead to considerable savings.
This way of storing strings only once is similar to “string interning”, and is also used in some other libraries.
With 17 bits, we can store up to 131,072 strings in the string list (where index 0 is reserved for an empty string). This doesn’t mean that your XML document can’t have more than 131,072 nodes though: it means that your XML document can’t have more than 131,072 distinct element names, which would be very unlikely.
This is not an entirely accurate statement, since the library uses the same string list to store attribute names (for the same reason). But let’s not overcomplicate things for this discussion.
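The element layout above can be modeled directly with bit operations. The Python sketch below packs the 2-bit type, the five 9-bit indices and a 17-bit interned-name index into one 64-bit integer. The type bits occupy the lowest 2 bits and the parent index the next 9, as described in the article; the ordering of the remaining fields is an assumption made for illustration:

```python
TYPE_ELEMENT = 0  # 2-bit node type; Text/Comment/CData would use 1..3

def pack_element(parent, first_child, next_sib, prev_sib, first_attr, name_idx):
    """Pack an element node into 64 bits: 2 + 5*9 + 17 = 64."""
    for index in (parent, first_child, next_sib, prev_sib, first_attr):
        assert 0 <= index < 512       # 9-bit indices; 0 means nil
    assert 0 <= name_idx < (1 << 17)  # 17-bit index into the string list
    bits = TYPE_ELEMENT               # bits 0..1: node type
    bits |= parent << 2               # bits 2..10: parent index
    bits |= first_child << 11
    bits |= next_sib << 20
    bits |= prev_sib << 29
    bits |= first_attr << 38
    bits |= name_idx << 47            # bits 47..63: interned element name
    return bits

def unpack_parent(bits):
    return (bits >> 2) & 0x1FF        # extract the 9-bit parent index

# String interning: each distinct name is stored once; index 0 = empty string.
names = ['']
name_index = {'': 0}

def intern(name):
    if name not in name_index:
        name_index[name] = len(names)
        names.append(name)
    return name_index[name]

node = pack_element(parent=1, first_child=0, next_sib=0, prev_sib=0,
                    first_attr=0, name_idx=intern('bar'))
```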
A Compact Text/Comment/CData Node
For these types of nodes, we don’t need to store a name and list of child nodes and attributes. So the layout could be as follows:
| Field | Bits |
| --- | --- |
| Node type | 2 |
| Index of parent node | 9 |
| Index of next sibling node | 9 |
| Index of previous sibling node | 9 |
| String value | 35 |
After the required bits for node type and indices, 35 bits remain to store the string value somehow. We could use an index into a string list for this, like we did for element names, but in many cases this is not needed. And since the contents of text nodes are much more varied than element names, there isn’t much to be gained from string interning.
However, a string itself is basically just a pointer to some reference counted piece of memory. So on 32-bit systems, these 35 bits are more than enough to store the string pointer, and that is what we do.
We store the string pointer in such a way that it increases the reference count of the string, so the string will remain alive until we no longer need the node. At that point, we decrease the reference count.
On 64-bit systems though, the value of a string pointer can theoretically take any 64-bit value, and thus wouldn’t fit into 35 bits. You also cannot assume that these values fit into 32 bits even if your app uses far less than 4 GB of memory. For example, I noticed that on some Android devices, the pointer values of strings often exceed the 4 GB virtual address space.
But it is safe to assume that the addresses of almost all strings lie within a 4 GB range of some other arbitrary string. So the library calculates the difference between the actual string pointer and some other arbitrary dynamically allocated memory address. If the result fits in 32 bits, then it is stored in 32 of the 35 available bits. In the (very unlikely) case that the result doesn’t fit, we set one of the 35 bits to 1 to indicate that the string doesn’t fit, and use 32 of the other 34 bits to store an index into a special string list that is used just for this scenario. This all complicates things a bit, but is still safe and efficient.
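The pointer-delta scheme can be sketched as follows. Python integers stand in for raw addresses here; whether the delta is treated as signed, and which of the 35 bits serves as the escape flag, are assumptions made for the sketch:

```python
ESCAPE_FLAG = 1 << 34     # one of the 35 bits marks the rare fallback case

fallback_strings = []     # dedicated list for strings whose delta doesn't fit

def encode_string_ref(str_addr, base_addr):
    """Encode a string address in 35 bits, relative to a base address."""
    delta = str_addr - base_addr
    if -(1 << 31) <= delta < (1 << 31):   # fits in a signed 32-bit delta
        return delta & 0xFFFFFFFF         # stored in 32 of the 35 bits
    # Very unlikely: fall back to an index into the special string list.
    fallback_strings.append(str_addr)
    return ESCAPE_FLAG | (len(fallback_strings) - 1)

def decode_string_ref(bits, base_addr):
    if bits & ESCAPE_FLAG:
        return fallback_strings[bits & 0xFFFFFFFF]
    delta = bits & 0xFFFFFFFF
    if delta >= (1 << 31):                # sign-extend back to a full integer
        delta -= 1 << 32
    return base_addr + delta
```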
A Compact Attribute
Finally, we need storage space for attributes. An attribute contains a name/value pair and an index to the next attribute in the list:
| Field | Bits |
| --- | --- |
| Index of next sibling (attribute) | 9 |
| Index of attribute name | 17 |
| Attribute value | 38 |
Attributes are stored along nodes in a memory page. So each attribute uses 8 bytes as well. Again, we use 9 bits as the index of the next attribute. And as we did for element names, we use 17 bits for an index into the (shared) list of attribute and element names.
That leaves 38 bits for the attribute value. This value is stored in the same way as strings are stored for text, comment and CData nodes as described above.
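The same bit-packing idea applies to attributes. A short sketch, again with an assumed field order (9 + 17 + 38 = 64):

```python
def pack_attribute(next_attr, name_idx, value_bits):
    """Pack an attribute into 64 bits: 9 + 17 + 38 = 64."""
    assert 0 <= next_attr < 512        # 9-bit index of the next attribute
    assert 0 <= name_idx < (1 << 17)   # shared element/attribute name list
    assert 0 <= value_bits < (1 << 38) # value encoded like a text-node string
    return next_attr | (name_idx << 9) | (value_bits << 26)

def unpack_attribute(bits):
    return bits & 0x1FF, (bits >> 9) & 0x1FFFF, bits >> 26

a = pack_attribute(next_attr=5, name_idx=42, value_bits=7)
```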
A Simple Example
So, what would this look like in practice? Consider the following small XML document:
<foo>
  <bar>baz</bar>
  <bar a1="val1" a2="val2" />
</foo>
This results in the following memory page layout:
This may look a bit complicated but it isn’t too bad if you look at it one piece at a time. The 4096 byte memory page is divided into 512 segments of 8 bytes each (numbered from 0 to 511 in the image). The first segment (segment 0) is not used because index 0 represents a nil-pointer. The image shows the 3 XML elements in light blue, the text node in light green, and the 2 attributes in light red. The arrows indicate how these nodes and attributes are connected to each other.
Elements and attributes have a name index (N) that refers to element and attribute names in a string list (shown in light orange). Text nodes and attributes also have a string value (S) which points directly to a Delphi string as discussed earlier.
Let’s take a quick look at how all this can be used to create an XML node API. Since all information is stored in a 64-bit segment of a memory page, an XML node is nothing more than a pointer to a 64-bit integer (PUInt64). The library wraps this inside a record called TXmlNode:
type
  TXmlNodeType = (Element, Text, Comment, CData);

  TXmlNode = record
  private
    FNode: PUInt64;
    function GetNodeType: TXmlNodeType; inline;
    function GetParent: TXmlNode;
    ...
  public
    property NodeType: TXmlNodeType read GetNodeType;
    property Parent: TXmlNode read GetParent;
    ...
  end;
As mentioned, the NodeType can be stored using just 2 bits, and “pointers” to other nodes (like Parent) are stored as 9-bit integer indices. By convention, we store the NodeType in the lowest 2 bits of the 64-bit integer, so the property getter looks like this:
function TXmlNode.GetNodeType: TXmlNodeType;
begin
  if (FNode = nil) then
    Result := TXmlNodeType.Element
  else
    Result := TXmlNodeType(FNode^ and 3);
end;
We first perform a sanity check to see if the node is nil. In this library, it is perfectly legal for the underlying pointer to be nil: this means that the actual node is a nil-node. It is also perfectly legal to call methods and use properties of a nil-node. This way, you don’t have to check if a node is nil each time before you use it, which can simplify your code considerably.
We will take a look at the implementation of GetParent a bit later.
So far we have assumed that all nodes and attributes fit into a single 4 KB memory page. Of course, that is not always the case. In fact, this compact memory layout is especially useful for (very) large documents, otherwise there would be little point in coming up with this design.
Before looking at a solution for this problem, we need to take a look at how a node is able to locate another (child or parent) node in the same memory page. Normally, the only information we have is the index of another node in the same memory page. How can we use this index to locate the node? If there is just one memory page, then we can take the address of this memory page and increment it by 8 × Index. But if there are multiple memory pages, then this doesn’t work.
From Node to Memory Page
The node needs to know to which memory page it belongs, so it can use the address of that memory page to locate other nodes. We cannot store the memory page with each node, since we have already used up all 64 bits to store other information. However, since we allocate complete memory pages, we don’t actually need to store this information with each node. The reason for allocating complete pages is that these are always aligned to an address that is a multiple of the page size. In this case, this means that a memory page is always aligned to a 4 KB boundary. We can then use this fact to find the memory page that a node belongs to by simply rounding down the address of the node to a multiple of 4 KB. For example, when a node points to an 8-byte segment at address $1FB53248, then it belongs to the memory page that starts at address $1FB53000. Since 4 KB is $1000 in hexadecimal, we can just set the lowest 3 hexadecimal digits to 0 to find the memory page. In code, this can quickly be done with a bitwise AND operation:
const
  BLOCK_SIZE = 4096;

function TXmlNode.GetBlock: PByte;
begin
  Result := PByte(UIntPtr(FNode) and not (BLOCK_SIZE - 1));
end;
The library uses the term “block” instead of “page” since it is possible that a physical memory page contains multiple 4 KB blocks (for example, memory pages on iOS are 16 KB in size, and thus fit 4 blocks).
Linking to Nodes in another Memory Page
Now that we know how to find the memory page for a node, we can tackle the problem of needing multiple pages. The solution I used is to reserve another index value for the case that a referenced node is in another page. Remember, indices take on values between 0 and 511. We already reserved 0 to represent a nil-node. The library uses value 511 to indicate that the referenced node is in another page. This is kind of an arbitrary value, but it makes some logical sense.
Whenever index 511 is encountered, it uses a hash map (a kind of dictionary) to lookup the referenced node instead. The key of the hash map is a combination of the address of the source node, and a number that indicates what kind of node we are looking up (for example, the parent, first child or next sibling). The value of the hash map is the referenced node (that is, a pointer to a 64-bit integer in another memory page). Using a hash map for this purpose adds some memory and CPU overhead. But the idea is that the majority of node references will be in the same memory page, so the hash map is only used in the relatively few situations that this is not the case.
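The out-of-page lookup can be modeled with an ordinary dictionary keyed by the source node's address plus a link-kind tag (the ID_* constants and function names here are illustrative):

```python
HASH_BITS = 511   # sentinel index: "the target lives in another memory page"
ID_PARENT, ID_FIRST_CHILD, ID_NEXT_SIBLING = 0, 1, 2   # link kinds

pointer_map = {}  # (source node address, link kind) -> target node address

def set_link(node_addr, link_kind, target_addr, same_page_index):
    """Return the 9-bit value to store in the node for this link."""
    if same_page_index is not None:
        return same_page_index        # common case: target in the same page
    pointer_map[(node_addr, link_kind)] = target_addr
    return HASH_BITS                  # rare case: go through the hash map

def resolve_link(node_addr, link_kind, index, page_base):
    """Follow a stored 9-bit link back to a node address (None = nil-node)."""
    if index == 0:
        return None
    if index == HASH_BITS:
        return pointer_map[(node_addr, link_kind)]
    return page_base + 8 * index      # in-page: page base + 8 * index
```

The design bet is that set_link almost always takes the first branch, so the map stays small no matter how large the document grows.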
But how does the node know where to find the hash map? Again, all 64 bits of node storage are already used. Here we can use another trick: since indices 0 and 511 never refer to actual segments, the first 8 bytes and the last 8 bytes of each memory page aren’t used either. So we use the first 8 bytes to store a pointer to the XML document object that owns the memory page. The last 8 bytes are currently unused.
We haven’t talked about an XML document class yet. This is the only public class in the library. It implements the IXmlDocument interface and owns all memory pages (and thus nodes). It also owns the hash map that is used to locate out-of-page nodes.
With this information, let’s take a look at how the getter for the Parent property is implemented. As a recap, the lowest 2 bits of a node represent the node type (see the TXmlNode.GetNodeType example earlier). By convention, the next 9 bits are the index of the parent node:
const
  NIL_BITS  = 0;
  HASH_BITS = $1FF;

function TXmlNode.GetParent: TXmlNode;
begin
  if (FNode = nil) then
    Result.FNode := nil
  else
  begin
    var Bits: UIntPtr := (FNode^ shr 2) and $1FF;
    if (Bits = NIL_BITS) then
      Result.FNode := nil
    else if (Bits = HASH_BITS) then
    begin
      var Doc := GetDocument;
      Result.FNode := Doc.FPointerMap.Get(FNode, ID_PARENT);
    end
    else
    begin
      var Block := GetBlock;
      Result.FNode := Pointer(Block + (Bits shl 3));
    end;
  end;
end;
Let’s look at this step by step:
- Again, we first check if the node itself is nil, and return the default value if that is the case. Remember that this enables the user to safely access properties and methods of a nil-node.
- Otherwise, we retrieve the 9 index bits. This can easily and quickly be done with some bit shifting and a bitwise AND operation. Since the lowest 2 bits are used for the node type, we need to shift the value 2 bits to the right, and then AND the result with $1FF (511) to extract the lowest 9 bits.
- If the index has the predefined value 0 (NIL_BITS), then it references a nil-node.
- Otherwise, if the index has the other predefined value 511 (HASH_BITS), then the hash map is used to locate the parent (ID_PARENT). To get to the hash map, we first need to get to the document that owns this node. As mentioned earlier, a reference to the document is stored in the first 8 bytes of the memory page.
- For all other indices, the parent node is in the same memory page (block). So we retrieve the block (as shown earlier) and add 8 times the index (using a left shift by 3) to locate the parent node.
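Putting the pieces together, the logic of this getter can be mirrored in Python. Addresses are plain integers and a dictionary stands in for the 64-bit segments; this is a sketch of the algorithm, not the library's actual code:

```python
BLOCK_SIZE = 4096
NIL_BITS, HASH_BITS = 0, 511
ID_PARENT = 0

def get_parent(node_addr, memory, pointer_map):
    """Return the parent node's address, or None for a nil-node.

    `memory` maps a segment address to the 64-bit word stored there."""
    if node_addr is None:
        return None                                 # nil-node: parent is nil too
    bits = (memory[node_addr] >> 2) & 0x1FF         # skip 2 type bits, take 9
    if bits == NIL_BITS:
        return None
    if bits == HASH_BITS:
        return pointer_map[(node_addr, ID_PARENT)]  # out-of-page: use the map
    block = node_addr & ~(BLOCK_SIZE - 1)           # round down to 4 KB boundary
    return block + (bits << 3)                      # block + 8 * index

# A node at segment 2 of the page at $4000 whose parent sits at segment 1:
memory = {0x4000 + 2 * 8: 1 << 2}   # type bits 0, parent index 1
parent = get_parent(0x4010, memory, {})
```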
These are the main algorithms used to create a very light-weight DOM. For more details you can check the source code.
So how much memory does this library save compared to other XML libraries? Before I get into that, first some notes about this library.
The library is a personal project of mine called Neslib.Xml, available on GitHub. It is not only light-weight in memory (and CPU) usage, but also in features. It has fewer features than many other XML libraries and focuses only on the core features that should satisfy most use cases:
- It can read and write XML from and to files, streams or strings.
- It supports source files in UTF-8, UTF-16 (BE and LE) and UTF-32 (BE and LE) formats, but only through a BOM marker. It ignores any “encoding” attributes in the XML declaration. If there is no BOM marker, it assumes UTF-8. File output is always in UTF-8 format.
- It supports comments and CData sections.
- It has the usual APIs to traverse a DOM, create a DOM and search for immediate children by element or attribute name.
- It can be compiled in either Unicode (default) or UTF-8 mode for even lower memory usage.
But it doesn’t support the following features:
- There is no explicit support for namespaces. You can still use namespaces, but they are just considered part of a name.
- There is no support for DTDs (<!DOCTYPE ...>) or validating against a DTD.
- There is no support for custom entity references. Only the standard character references (&lt;, &gt;, &amp;, &apos; and &quot;) are supported.
- There is no support for processing instructions (<? ... ?>) or declarations (<?xml ...?>). They may still be present in an XML file, but are ignored. When writing an XML file, a standard XML declaration is used.
- There is no support for XPath or XSLT.
If any of those features are important to you, then check out the other libraries used for the comparisons.
The following other XML libraries are used for the comparisons below:
- Delphi’s built-in XML library
- MSXML, using Delphi’s COM library
- Alcinoe XML
- OXml, tested both object-based and record-based implementations
You can reproduce these performance tests yourself using the XmlPerfTests app in the Neslib.Xml repository. Please let me know if you think I implemented one or more tests incorrectly. I want to be fair in my comparisons.
Memory Usage Tests
These tests load the 24 MB nasa.xml file from the XML Data Repository into a DOM. For the Neslib.Xml library, this is done with the following code:
var Document := TXmlDocument.Create;
Document.Load('nasa.xml');
TXmlDocument.Create is a class function that returns an IXmlDocument interface, so memory management is automatic.
As I hoped, Neslib.Xml has the lowest memory usage. In UTF-8 mode, it uses only slightly more memory than the size of the XML document itself. Also, the difference between the 32-bit and 64-bit versions is pretty small. This is mostly because Neslib.Xml uses 9-bit indices to reference other nodes regardless of platform, whereas other libraries mostly use pointers, which are twice as large on 64-bit platforms.
I only used the 32-bit version of OXml for testing because the demo version is not available for 64-bit Windows.
Load Time Tests
I expected that loading a large XML file would take longer in my library, because of the additional code needed to manage 9-bit indices:
But the opposite is true. It looks like the CPU time needed for node management is more than compensated for by decreased memory usage and fragmentation. I assume this is mostly because my approach is more cache-friendly and results in less (expensive) cache misses. Only DIXml (which is a wrapper around libxml2) performs comparably or slightly better.
This chart is truncated at 2,500 ms because SimpleXML takes over 90,000 ms (or 90 seconds) to load, which would make the chart unreadable otherwise.
Traversal Tests
The traversal tests just iterate over all nodes and attributes in the document. For the Neslib.Xml library, this is done using the following recursive code:
procedure Traverse(const ANode: TXmlNode);
begin
  var Attr := ANode.FirstAttribute;
  while (Attr <> nil) do
  begin
    MarkAttribute(Attr.Name, Attr.Value);
    Attr := Attr.Next;
  end;

  var Child := ANode.FirstChild;
  while (Child <> nil) do
  begin
    case Child.NodeType of
      TXmlNodeType.Element:
        begin
          MarkElement(Child.Value);
          Traverse(Child);
        end;

      TXmlNodeType.Text:
        MarkText(Child.Value);
    end;
    Child := Child.NextSibling;
  end;
end;
The Mark* methods are used to count the number of elements, attributes and text nodes in the document. These counts are used to check if the document loaded correctly.
The library also supports enumerators, which you may find easier to use to enumerate nodes and attributes.
Traversal is also very fast, on par with DIXml and OXml.
Again, this chart is truncated at 250 ms since the Delphi and MSXML implementations take over 2400 ms and 1200 ms respectively.
Query Tests
The query tests walk through all the datasets in the document. For each dataset, they search for the first element with name ‘tableHead’, then search that element for the first element with name ‘fields’, and so on, for a few levels deep. Finally, they check if a ‘units’ element has a text node with value ‘arcsec’. If so, a counter is incremented.
I have written these tests without using XPath, since not all libraries (including mine) support XPath. For Neslib.Xml, the query is done using this code:
function QueryArcsecFields: Integer;
begin
  Result := 0;
  var Datasets := FDocument.DocumentElement;
  var Dataset := Datasets.FirstChild;
  while (Dataset <> nil) do
  begin
    var Units := Dataset.ElementByName('tableHead')
                        .ElementByName('fields')
                        .ElementByName('field')
                        .ElementByName('units');
    var Text := Units.FirstChild;
    if (Text.NodeType = TXmlNodeType.Text) and (Text.Value = 'arcsec') then
      Inc(Result);
    Dataset := Dataset.NextSibling;
  end;
end;
Note that any of the ElementByName methods can return a nil-node. It is perfectly legal to use methods or properties on a nil-node, so you don’t have to check for this at every step. This can simplify your code considerably.
Again, Neslib.Xml is very fast for this use case.
And again, the chart is truncated because MSXML takes over 75 ms for this test.
Cleanup Tests
Finally, I measured how long it takes to clean up (or destroy) a document. Since an XML document is an interface in Neslib.Xml (as it is in many other libraries), this is just a matter of setting this interface reference to nil.
Here, the record-based OXml implementation beats Neslib.Xml and is about twice as fast. But still, Neslib.Xml is very fast compared to most other libraries.
I had to truncate the chart once more because the default Delphi implementation takes over 900 ms here.
If you regularly work with large XML files and need to load them into a DOM, then try out Neslib.Xml. It uses some interesting algorithms to consume as little memory as possible. The library is not as feature-rich as many other ones, but it may be sufficient for your needs.