Topic
IC4NOTICE: developerWorks Community will be offline May 29-30, 2015 while we upgrade to the latest version of IBM Connections. For more information, read our upgrade FAQ.
10 replies Latest Post - ‏2014-09-25T10:28:23Z by Wolfgang Uhr
Mathias Mamsch
Mathias Mamsch
1954 Posts
ACCEPTED ANSWER

Pinned topic The DOORS StringTable - Unveiled

‏2012-09-14T20:03:40Z |

One of the last mysteries of DXL unveiled - the DOORS string table. Despite my earlier attempts to discuss the string table away (https://www.ibm.com/developerworks/forums/thread.jspa?messageID=14549758) I now publish the code, that will allow you to check for DXL programs for string table littering or just roam around the DOORS string table to find all funny things that are in there.

Okay, so first what is a string table and what is it good for? DXL has no way of freeing a string, once it is created. In C a string is allocated as a character array or by allocating some memory for it. Then the program needs to explicitly free the string once it does not need it anymore, so the memory is not leaked. So in DXL there is no delete (string) function in DXL. So one would think that a DXL program keeps track of the strings in creates, and when it ends it 'garbage collects' the strings and frees the memory. But in that case, the string created by the DXL program must not be used anywhere in the DOORS client anymore. That means, that DOORS would very strictly need to track where a string comes from, from DXL or from an internal function. Imagine you write a string to a DOORS attribute and the client would just take that string without copying it. When the DXL program would end and the string would be freed, then suddenly the module would refer to an invalid memory - crashs and data corruption would then occur. Maybe it is kind of a legacy problem, that strings from DXL are freely used in the internal DOORS client and separating the DXL strings from the internal string is too problematic. Additionally DOORS has this complex design, where a global DXL context is lingering around while several 'child' contexts of DXL programs you launch, DXL Layouts, eval_ code, DXL triggers, DXL GUIs are active somewhere - so that tracking string usage in the whole source code of DOORS, including all foreign libraries included is just to hard.

So, freeing the strings after DXL ends might be problematic. So one use for a string table would be to make sure that each string stays alive, but to avoid completely littering the memory each string is stored only once. If a string already exists, the DXL program will just get a reference to that string, that has already been stored. For this DOORS would need a data structure, where it can quickly check if a string already exists and if not, store it there. And this data structure is the string table. (By the way, this is only speculation. I still find it highly unusual. Anyway at some point one needs to accept that a string table exists for whatever reason.)

Alright, lets take a look at how the DOORS string table is constructed. You can think of the string table like an array of exactly 52021 buckets/bins where each bucket/bin stores a list of strings. A bucket can also be empty if it contains no strings.
 

Bucket    String List 
[1]       "abcd" -> "hello" -> "world"
[2]       empty 
[3]       "orly?" 
...
[52021]   "yarly!" -> "noway!"

 


So why and how are the strings assigned to buckets? Well the main problem is, that DOORS would need to check if a string already exists in the string table. One could think, well lets keep a sorted list of strings, then we can find if a string is already in there very fast. Well the problem is, that if you want to add a new string in such a list, then you would need to shift the strings in the list to insert a new string, which is time consuming. Another idea would be to use a Skip list. The problem with skip lists is, that they do a full string compare. For long strings this is extremely inefficient. So the string table works differently. Given a string like "Hello World!" a (hash) function will calculate the bucket/bin for that string. This function works extremely fast (its like a checksum). Once you know the bucket for that string in the string table, you only need to search inside the bucket if that "Hello World!" string already exists in there. If not, you just add it to the end of the list of the bucket. So imaging in the above example we would calculate the bucket 3 for the string "Hello World!". Then we would search the bucket for that string - we notice that "orly?" is the only string in there, and "Hello World!" is not in there yet. Therefore we add it to bucket three.

 

 

 

Bucket    String
...
[3]       "orly?" -> "Hello World"
...


Now notice that we already have two strings in the bucket 3. If more and more strings go to bucket 3 then the time for checking if a string is in there will get longer and longer. This is called a clash. So if only one string is in a bucket you got no clash. If there is more, you have a clash, that will slow down your string table. So you can see one of the main goal of the function that assigns buckets is, that it will distribute the strings evenly in the buckets.

Imagine the following code:

 

 

 

string s = "1" 
int i; for (i = 0; i < 100000; i++) s = s "1"



This program will create 100001 different strings "1", "11", "111", ... So if those string would all go to the same bucket, then this code would slow down drastically, since the time to check each string, if it is already in the string table, would become longer and longer. If the strings are distributed totally evenly in the buckets, then the buckets will have a maximum of 2 strings each. This means you would have no slowdown, since checking if the string is already present is still very efficient.

So as general rules for DXL you can take with you:



 

 

  • The more clashes you have in your string table the slower will DXL become
  • large strings in the string table will also slow DOORS down, since the string comparison that takes place will take a longer time, depending on the efficiency of the string compare.


Great, so now you know what the string table is and how it works. Take the attached code and explore for yourself. What I found out by now is, that at least in DOORS 8 / 9 there seems to be a mechanism for freeing DXL strings. If you run the following code multiple times, you will see, that at the next start of the DXL scripts, the number of entries in the string table are back to the value before you started the DXL script:

 

 

 

#include <petools.inc>
#include <stringTable.inc>
 
printStringTable () 
 
string s = "1" 
 
// lets litter the string table
int i; for (i = 0; i < 1000; i++) s = s "1"
 
printStringTable () // 1000 strings have been created.



Now when we pass that string to a DOORS perm now:



 

 

 

 

#include <src/test/core/petools.inc>
#include <src/test/core/stringTable.inc>
 
printStringTable () 
 
string s = "1" 
 
// lets litter the string table
int i; for (i = 0; i < 1000; i++) {
    s = s "1"
    // pass the string to the richText perm will persist the string
    richText s  
    // passing it to the length perm will not persist the string 
    // length s
 
}
 
printStringTable ()



Then we will find, that the second time we start the script, the string table count will not rise anymore since all strings have been persisted. The good thing is, that as long as we really create the same strings, the string count and the clashes will not rise any more, no matter how often we start the script.

This shows, that DOORS does a cleanup and strings will only be forever in the string table, when you pass them to certain perms. Which perms are those? Find out for yourself! I have no idea. I know that stringOf (Buffer) will do that.

Regards, Mathias



 

 

 


Mathias Mamsch, IT-QBase GmbH, Consultant for Requirement Engineering and D00RS

 

Updated on 2014-01-06T21:42:35Z at 2014-01-06T21:42:35Z by JAntley
  • Doug.Zawacki
    Doug.Zawacki
    80 Posts
    ACCEPTED ANSWER

    Re: The DOORS StringTable - Unveiled

    ‏2012-10-25T13:22:09Z  in response to Mathias Mamsch
    First, Thank you very much for the hard work. After looking at your code I see that you have certainly done your homework and have a lot of insight into the DOORS interpreter.

    I am extremely interested in using this tool.

    Unfortunately, when I include the files you provided I receive the following.

    String Table coud not be found.

    I am running DOORS 9.2.0.5

    Any assistance you can provide will be greatly appreciated. Thanks
    • Doug.Zawacki
      Doug.Zawacki
      80 Posts
      ACCEPTED ANSWER

      Re: The DOORS StringTable - Unveiled

      ‏2012-10-25T13:38:04Z  in response to Doug.Zawacki

      Update, After reviewing the code I modified the lines below which doesn't fail.

      if ((doorsInfo infoVersion)[0] == '9') // changed from 8 to 9 and it works
      {
        p = createPEFile 0x400000 
      } 
      else 
      {
        p = createPEFile 0x10000000
      }
      

       


      Is it safe to assume the address you are passing into createPEFile is just some random address to place your PEFile struct at?

       

      Updated on 2014-01-06T21:43:10Z at 2014-01-06T21:43:10Z by JAntley
      • Mathias Mamsch
        Mathias Mamsch
        1954 Posts
        ACCEPTED ANSWER

        Re: The DOORS StringTable - Unveiled

        ‏2012-10-26T07:35:55Z  in response to Doug.Zawacki
        Hmm ... The addresses that are in the code are not random at all. To hook the internal properties of DOORS, you need to find them in memory. For this you need to locate the executable in memory and search the data or code segment for the structure you are looking for. Normally you would call some Windows API function to get the base adress, but this is not so easy from DXL.

        Fortunately due to the memory management in Microsoft Windows an executable has its own address space, so it will almost always be loaded at the same virtual memory address. For the doors.exe executable this is 0x400000. Unfortunately the string table functionality moved from doors.exe to general_library.dll which is located at another base address 0x10000000.

        I was under the impression that this happened with DOORS 9, but obviously that is not the case in 9.2.0.5. So I guess we would need to find out from what version exactly the string table moved to the dll and adapt the if statement to account for it.

        Regards, Mathias

        Mathias Mamsch, IT-QBase GmbH, Consultant for Requirement Engineering and D00RS
        • Doug.Zawacki
          Doug.Zawacki
          80 Posts
          ACCEPTED ANSWER

          Re: The DOORS StringTable - Unveiled

          ‏2012-10-29T18:20:25Z  in response to Mathias Mamsch
          Mathias,

          I now understand why it must be that my address does not work with your code.

          We are using a DOORS.EXE (Large Address aware) version. This version of DOORS allows us to address up to 3gb of memory instead of the 2gb limitation.

          If you need more information please let me know.
  • SystemAdmin
    SystemAdmin
    3180 Posts
    ACCEPTED ANSWER

    Re: The DOORS StringTable - Unveiled

    ‏2012-10-29T13:42:46Z  in response to Mathias Mamsch

    I'd also like to thank you for posting this.
    I was on holiday when you posted it so I didn't see it until the the first reply bumped it up the forum.
    So the 'string table' is really just a hash table of strings which 'some' functions cause to persist.

    I've tried these so far:

    // Persist
        // string ::[] (Buffer, Range_)
        // string ::.. (Buffer, string)
        // string stringOf(Buffer)
        // richText(string)
    // No Persist
        // tempStringOf(Buffer)
    
    Updated on 2014-01-06T21:44:09Z at 2014-01-06T21:44:09Z by JAntley
    • llandale
      llandale
      2950 Posts
      ACCEPTED ANSWER

      Re: The DOORS StringTable - Unveiled

      ‏2012-10-29T18:12:38Z  in response to SystemAdmin

      Since the string returned from tempStringOf(Buffer) does not persist then you must dispose of it immediately; where "immediately" means before changing or deleting the Buffer. Usually this means:

      • compare it to something else
      • display on a dialog
      • assign to an obj.attr value
      • write it to a file.


      -Louie

      tempStringOf(Buffer) does not respect the "length(Buffer, int)" perm. Below results show "L4" differences.

       

      Buffer buf = create()
       
      void PrintIt(string Label)
      {  
         print Label "\t" (tempStringOf(buf) == stringOf(buf)) "\t[" tempStringOf(buf) "]\t<">\n"
      }
       
      buf = "ABCDEFG";      PrintIt("init")
      buf += "xyz";         PrintIt("xyz")
      length(buf, 4);         PrintIt("L4")
      buf     = "";         PrintIt("erase")
      buf += "abcdefg";             PrintIt("abc")
      buf += "tuv";         PrintIt("tuv")
      setempty(buf);          PrintIt("empty")
      /*      Results:
      init    true    [ABCDEFG]       <ABCDEFG>
      xyz     true    [ABCDEFGxyz]    <ABCDEFGxyz>
      L4      false   [ABCDEFGxyz]    <ABCD>
      erase   true    []      
      abc     true    [abcdefg]       <abcdefg>
      tuv     true    [abcdefgtuv]    <abcdefgtuv>
      empty   true    []      */
      

       

      Updated on 2014-01-06T21:44:50Z at 2014-01-06T21:44:50Z by JAntley
      • Mathias Mamsch
        Mathias Mamsch
        1954 Posts
        ACCEPTED ANSWER

        Re: The DOORS StringTable - Unveiled

        ‏2012-11-01T11:26:54Z  in response to llandale

        I never heard of a large memory aware version of DOORS. Therefore I have no idea on how to detect this version. It seems that it will locate the DOORS.exe at 0x10000000. If you find out on how to detect this DOORS version from DXL code, feel free to post it ;-)

        @Louie: tempStringOf is a case of its own. It will return a string that points to the string memory of the buffer. This memory should be stable as long you do not do any changes to the buffer, but that is not proven. However the important point that I wanted to make in my post is, that NOT ALL strings you create in DXL will be in the string table forever.

        • There are functions that will create a temporary string table entry (string concatenation), that will be removed from the string table after the DXL ends
        • There are perms that will create a permanent string table entry (e.g. string stringOf(Buffer))
        • and well ... tempStringOf will of course not create a real string at all, it only returns a type casted buffer memory pointer, which will a) as you showed not take into account buffer length b) will become invalid when the buffer reallocates its memory.


        So if we would take the work and for every perm in DXL that takes or returns a string would find out, whether it creates no, a temporary or a permanent string table entry, developers could tune their programs for string table optimization. It is probable that all functions returning a new string will create a string table entry. Interestingly substring s0:3 or concatenation s = s "123" will only generate a temporary string table entry. Passing the string to another function that will return a string like utf8 will make the existing string permanent. Interestingly some perms will also create permanent string table entries even though they do not handle strings at all:

        For example if a developer does:

         

        #include <c:/tmp/petools.inc>
        #include <c:/tmp/stringTable.inc>
         
        printStringTable () 
         
        Project p 
        for p in database do {  }
         
        printStringTable ()
        



        Running this code a first time and a second time will show an increase in string table entries the first time, the second time the string table entries will stay at the high level. Therefore the loop created permanent string table entries. My conclusion is that to avoid string table entries of course one has to use Buffers wherever possible - the fact that string operations only produce temporary string table entries is not very useful, since you can do all that with Buffers which produces no string table entries. However the indication that all functions that return a string (except the string operators) produce a permanent string table entry is rather disapointing, because it leaves no way for optimizing. It is clear that eval_ is of no use for those. Therefore it would be interesting to find out, if there is any other function despite the string operations that will only produce temporary string table entries (which would allow optimization using eval_). Additionally it would be interesting if any of the functions that operate on Buffers will create string table residuals.

        Regards, Mathias



         

         


        Mathias Mamsch, IT-QBase GmbH, Consultant for Requirement Engineering and D00RS

         

        Updated on 2014-01-06T21:45:32Z at 2014-01-06T21:45:32Z by JAntley
  • EHcnck
    EHcnck
    78 Posts
    ACCEPTED ANSWER

    Re: The DOORS StringTable - Unveiled

    ‏2014-09-24T23:20:27Z  in response to Mathias Mamsch

    Are you able to upload the code again for the string table, I'm unable to download the current version? Thank you.

    • Mathias Mamsch
      Mathias Mamsch
      1954 Posts
      ACCEPTED ANSWER

      Re: The DOORS StringTable - Unveiled

      ‏2014-09-25T08:09:56Z  in response to EHcnck

      Here you go - but I cannot imagine what problem you got with downloading? The download worked for me ...

      Please note, that this (and all other hacky code, like MemoryManagement, etc. ) will not work on DOORS 9.6 (64bit), since all memory offsets changed in there due to the word length increase to 8 bytes and due to massive changes inside the interpreter.

      Regards, Mathias

      Attachments

      • Wolfgang Uhr
        Wolfgang Uhr
        212 Posts
        ACCEPTED ANSWER

        Re: The DOORS StringTable - Unveiled

        ‏2014-09-25T10:28:23Z  in response to Mathias Mamsch

        What do you think, did IBM reduce the problem or did they only paint it with a different color?