Topic
  • No replies
Mathias Mamsch
Mathias Mamsch
2618 Posts

Pinned topic Large Scale DXL (Tutorial) - How to store large amounts of arbitrarily nested data

‏2017-04-12T11:36:02Z | array class data dxl hack large-scale memory object object-orientation performance struct structs tutorial

Hi all, 

I was recently asked about the possibility to store large amounts of linked data efficiently in DXL. 

So in this post we want to take a look at two questions: 
- how can I store data in DXL efficiently without producing a lot of allocations, which will slow down all scripts? 
- how can I store complex data in DXL so, I do not need to cope with skip lists and key names / types all the time? 

Motivation

A very simple example would be to simply store the contents (or the links) of a module from DXL. Most people would probably use a "Nested Skip" approach. The code could look like this: 

// ********** library that caches our objects **********
Skip skObjects = create() // key: AbsNo, value: AttributeSki

void cacheObject(Object o, Skip skAttrs) {
   int nr = o."Absolute Number"; 
 // key: Attribute Name, Value: Attribute Value
   Skip skValues = createString() 
 // collect attribute values in skValues
   string s; for s in skAttrs do put(skValues, s, o.s ""); 
 // put skip in skip 
   put(skObjects, nr, skValues) 
}

Skip getValues(int nr) {
   Skip sk = null; 
   find(skObjects, nr, sk); 
   return sk; 
}

// ****************************************************

Module m = current
Skip attrList = create(); 
AttrDef ad; 

for ad in m do {
   if (ad.object) {
      string sName = ad.name; 
      put(attrList, sName, sName); 
   }
}

Object o; for o in entire m do cacheObject(o, attrList);

// Get attribute values of object 4
Skip sk = getValues(4)
string s; for s in sk do { print ((key sk) string) ":" s "\n"; }

The Problem of allocations and complexity of nested structures


While simple, this approach has two drawbacks: 
- it allocates one skip list per object
- the developers need to remember the skip list layout (string key? what value kind?) 

One can imagine that this becomes very complex, once more complex data is being stored, e.g. store all links of an object. The same is true for storing "class instances", i.e. object like data in DXL. Whenever a large amount of instances will be created, Skip lists are a bad choice for storage. 

While for more recent DOORS versions the allocations and deallocations are much quicker, the effect of having a large number of allocations is still relevant. You can read about allocations here: 

https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014657796&ps=25

or run the attached example script (allocation_problem.dxl), to see the effect. 

Example Problem

So imagine you want to store the structure of a complete DOORS project, with all the links inside, so you can make your awesome traceability analysis, without constantly closing and opening modules. So normally you will come up with a class structure like this (classes prefixed with S to avoid name clashes with DOORS types): 

 class SModule {
     // modules have top level objects, and module attributes
     SObject[] topLevel; 
     SAttrValue[] moduleAttributeValues; 
 }
 
 // Objects have childs, links and attribute values
 class SObject {
    SObject[] childs; 
    SLink[] outlinks; 
    SAttrValue[] attributeValues; 
 }
 
 // Links have a source object, a target object and link attributes
 class SLink {
    SObject source; 
    SObject target; 
    SAttrValue[] linkAttributes; 
 }
 
 // an attribute value has an attribute name and an attribute value
 class SAttrValue {
    string name; 
    string value; 
 }

 As you can see, with this structure there a two problems: 
 - The objects have a lot of cross references
 - There is a nesting hierarchy of 4   (module->object->link->attribute value)
 
 Implementing this with a Nested Skip approach can turn out to be a nightmare. From the allocations perspective you can say: 
 - each module in the project will allocate 2 skips (top level objects, links)
 - each object in the project will allocate 3 skips (childs, attributes, links)
 - each link in the project will allocate 1 Skip (attributes)
 
 That being said, imagining 100000 objects inside the project in 100 modules, containing 20000 links you will get an insane number of allocations (320200). Testing this with the below script  (allocation_problem) you get a runtime penalty for allocating and deleting 10000 Buffers: 
 - without allocations: 30ms
 - with 320200 allocations: 30779ms
 
 which might give you a factor 1000(!!!) slowdown of your script after some time. From a code perspective, managing a nesting level of 4 with 4 different skip types will give you spaghetti code. So lets take a look on what we can do to change this. 
 
Note: We recently found, that DxlObjects do not suffer from the allocation problem, since (at least newer versions of) DOORS will NOT store DxlObjects inside the allocations list. The good news about this is, you will not suffer from any slowdown inside your scripts! The bad news is you will suffer from bad permanent memory leaks, since the memory for the DXLObjects is never freed automatically!

Using classes - how to use arrays to reduce allocations

Most of you should by now know, that we can translate the above class hierarchy to DXL code, by creating custom structs. I already described how to make structs at several places in the forum, e.g. here the first time 2010: 
 https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014444661&ps=25
 
A detailed tutorial on classes in DXL you can find here: 
https://github.com/domoran/DXLParallels/blob/master/TUTORIAL.TXT
 
Usually these example use DxlObjects, which are nothing else but Skip lists. And while IBM tried to remove alowdown (in my opinion with a stupid approach) by removing DxlObjects from the allocation list, probably as a esult of some PR, there is a better way to store objects without producing the risk for permanent leaks. 
 
The idea is, instead of using a skip list to store each object instance, to simply use an Array(). Each column inside the array is a property value, each row is an object. This way: 
- All class instances are stored inside the same array instance
- Instead of a handle to the DxlObject/Skip we simply store the array index of the instance

So to create a class like this: 

class myType {
    string StringProperty;
}

instead of doing this:

struct myType {}
 
DxlObject DxlObjectOf (myType T) { return (addr_ T) DxlObject }
 
string getStringProperty (myType T)          { return ((DxlObjectOf T)->"stprop") string }
void   setStringProperty (myType T,string s) {         (DxlObjectOf T)->"stprop" = s    }
 
myType create_myType () {
     DxlObject x = new() 
     myType mt = (addr_ x) myType
     setStringProperty (mt, "Initial Value") 
     return mt
}

we will be doing this: 

struct myType {}
 
Array garmyTypeInstances = create(100000, 1)  // one column
int gmyTypeInstanceCount = 0; 

string getStringProperty (myType T)          { return return (get(garmyTypeInstances, (addr_ T) int, 0)) string }
void   setStringProperty (myType T,string s) {   put(garmyTypeInstances, s, (addr_ T) int, 0)  }
 
myType create_myType () {
     myType mt = (addr_ gmyTypeInstanceCount) myType
     setStringProperty (mt, "Initial Value") 
     gmyTypeInstanceCount++
     return mt
}

What have we won?

- Instead of allocating one DxlObject per instance, we only allocate 1 (!) array for all instances!
- Since the array will automatically resize, we do not really need to worry about the initial size of the array
- Since each array element will allocate 8 Bytes on DOORS 64 you can easily set the initial array size to 1 Million if you care to. This will allocate 8 MB x number of propreties which will not kill your memory, but 100k is a good starting size. 
- Since we do not need to allocate memory for new items (just increase a counter), we also do not need to free memory! 
 
BUT: you will probably say, what if I need to free and reallocate items a lot? Memory will not be freed and pile up!
Right. But this can be easily fixed by introducing a stack of freed indexes to be reused, whenever a new object is allocated. This slighly complicates the class boilerplate, but the code will be the same for ALL classes. 

So the following template can be used for ALL classes, you just need to replace <<<TYPE NAME>>> by the name of the class: 

// atomatic generated interface for class <<<CLASSNAME>>>

int gi<<<CLASSNAME>>>PropertyCount = 20; // Number of properties
Array gar<<<CLASSNAME>>>PropertyMemory = create(10000, gi<<<CLASSNAME>>>PropertyCount );  

// The maximum allocated object count
int gi<<<CLASSNAME>>>HandleCount = 0;

// Stack of Free Indices to be reused on next allocation
Skip skFree<<<CLASSNAME>>>Nodes = create();
int giFree<<<CLASSNAME>>>Nodes = 0;

struct <<<CLASSNAME>>> {};

// constructor & destructor for List
<<<CLASSNAME>>> create<<<CLASSNAME>>>_ () {
    <<<CLASSNAME>>> x = null
     
    // Do we have a free node to reassign? 
    if (giFree<<<CLASSNAME>>>Nodes > 0) {
        //print "Reallocating Stack Nr " giFree<<<CLASSNAME>>>Nodes "\n"
        if (!find(skFree<<<CLASSNAME>>>Nodes, giFree<<<CLASSNAME>>>Nodes--, x)) error "<<<CLASSNAME>>> memory error!"               
    } else {
        // allocate a new node
        x = (addr_ (++gi<<<CLASSNAME>>>HandleCount) ) <<<CLASSNAME>>>;
    }

    //print "Allocated <<<CLASSNAME>>> node " ((addr_ x) int) "\n"
    
    return x
}

void delete<<<CLASSNAME>>>_ (<<<CLASSNAME>>> &t) {
    // put in free list nodes
    <<<CLASSNAME>>> deref = t
    if (null deref) {
        print "Trying to delete null <<<CLASSNAME>>> at : " dxlHere()
        halt
    }

    put (skFree<<<CLASSNAME>>>Nodes, ++giFree<<<CLASSNAME>>>Nodes, deref, true)  
    // print "Freeing <<<CLASSNAME>>> Slot: " (<<<CLASSNAME>>>Handle t) " Value " (strVal t) " in Stack Nr: " (giFree<<<CLASSNAME>>>Nodes-1) "\n"
}

Note the following though:

- When a node is deleted and then later reallocated it will still have the values of the previous allocation, so make sure you properly initialize all values in the real constructur. 
- Since our class instances are merely integer indexes to a large array, the value "null" refers to the first entry inside the array. Therfore if you do instance = null; instance is still a valid class instance and all methods will work and you will get no null ptr errors! If you care about this behaviour you need to check for null on each property access!

While this template solves the problem of not creating allocations for class instances we still need to look on how to create properties for the class and how to solve the nesting problem (i.e. storing arrays inside a class instance without creating allocations). 

Class Properties

Properties can be added to an array managed class in an easy way (as already shown above): 

string getMyProperty  ( MyClass x ) { return (get(garMyPropertyPropertyMemory, (addr_ x) int, 3))  string }
void setMyProperty ( MyClass x,  string val) { put(garMyPropertyPropertyMemory, val, (addr_ x) int, 3) }

So making this a generic template we can do: 

// Simple property of type <<<INDEX>>>

// Getter for <<<PROPERTY NAME>>>
<<<PROPERTY TYPE>>> get<<<PROPERTY NAME>>>  ( <<<CLASSNAME>>> x ) { return (get(gar<<<CLASSNAME>>>PropertyMemory, (addr_ x) int, <<<INDEX>>>))  <<<PROPERTY TYPE>>> }

// Setter for <<<PROPERTY NAME>>>
void set<<<PROPERTY NAME>>> ( <<<CLASSNAME>>> x,  <<<PROPERTY TYPE>>> val) { put(gar<<<CLASSNAME>>>>PropertyMemory, val, (addr_ x) int, <<<INDEX>>>) }

where 
- <<<CLASSNAME>>> needs to be replaced by the name of the class, e.g. "MyClass". 
- <<<PROPERTY TYPE>>> needs to be replaced by the type of the property, e.g. "string" or "MyClass". 
- <<<PROPERTY NAME>>> needs to be replaced by the name of the property, e.g. "MyProperty"
- <<<INDEX>>> needs to be replaced by the index of the property in the array

Making a "virtual function" template is left as an exercise for the reader (See github tutorial for virtual functions).

Solving the problem of array properties

Ok, now lets take a look on how we can implement those array properties efficiently. As already mentioned using a Skip property is not indicated since it will again create one allocation per instance. Therefore what we will do is replace the Skip by our own implementation called "List" which has the same efficient memory management as shown above. 

class ListNode {
    _x value
    ListNode nextNode;
    ListNode currentNode; // for iteration over the list
}

Note that we have an "underscore" type parameter in the list, so our list can use any data type. For these kinds of properties the property definition is a bit different (see the attached List.dxl file) We added a currentNode property to our list so we can iterate it later from DXL with an easy syntax. 

Unfortunately in DXL we cannot implement a custom for loop, for iterating over the list. To implement an iteration protocol for DXL with a nice syntax I propose: 

ListNode iterator = iterate(myList); 
int value; 
while (next(iterator, value)) { print "Value: " value "\n"; }

which is pretty similar to iterating over a skip list:

Skip sk; 
int value; 
for value in sk do { ...}

Feel free to adapt the example list implementation in the attached file as you which (fast append to the end of the list, etc.). The important stuff is, that now having a List which items do not produce any allocation we are free to implement our classes that will not use any allocation. 

Applying the approach to the Example Problem

So assuming that we have made our classes SObject, SModule, SLink and SAttrValue according to the above description we can now easily code our complete project cache script: 

Project p = current; 
Item I; 

Skip skModules = createString(); 

for I in P do {
    if (type I != "Formal") continue; 
    Module mod = read(fullName I, false); 
    SModule smod = cacheModule(mod); 
    put(skModules, fullName sMod, smod); 
}

with the cacheModule code looking like this: 

void cacheLink(Link L) {
 // ... we need some cache here to make sure the link target modules and objects are not created twice ...
 // ... left as excercise for the reader ...
}

void cacheObject(Object obj) {
    SObject sobj = createSObject(); 
    
    Object oChild; for oChild in all obj do {
        appendNode(getChilds sobj, cacheObject oChild); 
    }
    
    Link L; for L in all obj->"*" do {
        appendNode(getLinks sobj, cacheLink L); 
    }
    
    AttrDef ad; for ad in module obj do {
        if (ad.object) {
          string sName = ad.name; 
          string sValue = obj.sName; 
          appendNode(getAttributeValues sobj, createSAttrValue(sName, sValue)); 
        }
    }
}

void cacheModule(Module mod) {
     SModule smod = createSModule(); 
     appendNode(getLinks obj, createListNode(slink))
     
     AttrDef ad; for ad in module obj do {
        if (ad.module) {
          string sName = ad.name; 
          string sValue = mod.sName; 
          appendNode(getAttributeValues smod, createSAttrValue(sName, sValue)); 
        }
    }
}

Imagine now the difference in the code if we had used Skips or DXLObjects to store all values. Look how nice and easy the code looks, using our classes. We have complete type safety, so that we will never have a wrong type inside a skip list and get crashes. This is a huge benefit!

Conclusion

As we saw, with the above approach we can 

- store arbitrary nested data type safe with an easy syntax
- we can store millions of data records without producing ANY allocation, i.e. without any slowdown

For all that made it to the end, thanks for reading. Like the post if you read it, so I know there is still DXL programmers interesting in tutorials. 

As always, have phun, regards, Mathias

 

P.S: For all of those who need to store tree data and are looking for an optimized Tree implementation look here: 

https://www.ibm.com/developerworks/community/forums/html/topic?id=7001c279-66c7-40f2-ac02-09c27547c81f&ps=25

Attachments

Updated on 2017-04-12T13:02:18Z at 2017-04-12T13:02:18Z by Mathias Mamsch