IC5Notice: We have upgraded developerWorks Community to the latest version of IBM Connections. For more information, read our upgrade FAQ.
Topic
  • 2 replies
  • Latest Post - ‏2014-01-16T21:55:47Z by cedricvb
cedricvb
cedricvb
2 Posts

Pinned topic Problems modeling shannon entropy minimization

‏2014-01-16T01:46:16Z |

Hello,

I've got a model in which I'd like to minimize the entropy of a given text. As soon as I launch the model however, the memory gets exhausted.

using CP;

int encrypted_len = ...;
range encrange = 0..encrypted_len-1;
int encrypted[0..encrypted_len-1] = ...;
int xor[0..255][0..255];

execute {
    for (var i = 0; i < 256; i++) {
                for (var j = 0; j < 256; j++) {
                          xor[i][j] = i^j;
                }             
        }        
}

dvar int ckey[0..511] in 0..255;
dexpr int plain[i in encrange] = xor[ckey[i % 512]][encrypted[i]];
// shannon

dexpr int frequency[o in 0..255] = sum(i in encrange) plain[i] == o;
dexpr int freqsum = sum(o in 0..255) frequency[o];
dexpr float prob[o in 0..255] = frequency[o]/freqsum;
dexpr float comb[o in 0..255] = prob[o] * lg(prob[o]);
dexpr float entropy = -sum(o in 0..255) comb[o];

minimize entropy;

The data file contains the data as an array of ints which are the ordinals for the characters in the string.

I imagine that this model is far from optimal:
   - can it better be represented as an LP?
   - how can a XOR be done more efficiently (rather than precalculating a XOR table...)

At this moment I won't get any results at all, regardless of the size of the data file. Any help on what is going wrong would be welcome :-)

Kind regards,
Cedric

Attachments

  • PhilippeLaborie
    PhilippeLaborie
    20 Posts
    ACCEPTED ANSWER

    Re: Problems modeling shannon entropy minimization

    ‏2014-01-16T10:54:34Z  

    Hello,

    I think a part of the problem comes from the xor array expression. The model uses a large number of such expressions. They could be factorized by transforming plain into a decision variable.

    Furthermore, for computing the frequency, the count expression can be used to count the number of occurrences of a given value.

    Another thing: it seems to me expression freqsum is not required as the sum should be equal to constant encrypte_len.

    Also, in this problem, there are almost no constraint except for modeling the objective function. In this context it can be interesting to try the MultiPoint search type.

    Here is the model with all these changes. At least, it is able to find some feasible solutions on your instance and memory usage is more reasonable (note that extraction of the model takes a couple of minutes). I also tried a search phase that first works on the original ckey decision variables but it does not seem to help in producing better solutions. I'm using the latest CPLEX Optimization Studio version (V12.6).

    using CP;
    
    int encrypted_len = ...;
    int encrypted[0..encrypted_len-1] = ...;
    
    range encrange = 0..encrypted_len-1;
    int xor[0..255][0..255];
    
    execute {
      for (var i = 0; i < 256; i++) {
        for (var j = 0; j < 256; j++) {
          xor[i][j] = i^j;
        }         
      }           
    }
    
    dvar int ckey[0..511] in 0..255;
    dvar int plain[i in encrange] in 0..255;
    
    // shannon
    
    dexpr int frequency[o in 0..255] = count(plain, o); 
    dexpr float prob[o in 0..255] = frequency[o]/encrypted_len; 
    dexpr float comb[o in 0..255] = prob[o] * lg(prob[o]);
    dexpr float entropy = -sum(o in 0..255) comb[o];
    
    execute {
      cp.param.Workers = 1;
      cp.param.SearchType = "MultiPoint";
      // cp.setSearchPhases(cp.factory.searchPhase(ckey));
    }
    
    minimize entropy;
    constraints {
      forall(i in encrange) {
        plain[i] == xor[encrypted[i]][ckey[i % 512]];
      }
    }
    

    Philippe

     

  • PhilippeLaborie
    PhilippeLaborie
    20 Posts

    Re: Problems modeling shannon entropy minimization

    ‏2014-01-16T10:54:34Z  

    Hello,

    I think a part of the problem comes from the xor array expression. The model uses a large number of such expressions. They could be factorized by transforming plain into a decision variable.

    Furthermore, for computing the frequency, the count expression can be used to count the number of occurrences of a given value.

    Another thing: it seems to me expression freqsum is not required as the sum should be equal to constant encrypte_len.

    Also, in this problem, there are almost no constraint except for modeling the objective function. In this context it can be interesting to try the MultiPoint search type.

    Here is the model with all these changes. At least, it is able to find some feasible solutions on your instance and memory usage is more reasonable (note that extraction of the model takes a couple of minutes). I also tried a search phase that first works on the original ckey decision variables but it does not seem to help in producing better solutions. I'm using the latest CPLEX Optimization Studio version (V12.6).

    using CP;
    
    int encrypted_len = ...;
    int encrypted[0..encrypted_len-1] = ...;
    
    range encrange = 0..encrypted_len-1;
    int xor[0..255][0..255];
    
    execute {
      for (var i = 0; i < 256; i++) {
        for (var j = 0; j < 256; j++) {
          xor[i][j] = i^j;
        }         
      }           
    }
    
    dvar int ckey[0..511] in 0..255;
    dvar int plain[i in encrange] in 0..255;
    
    // shannon
    
    dexpr int frequency[o in 0..255] = count(plain, o); 
    dexpr float prob[o in 0..255] = frequency[o]/encrypted_len; 
    dexpr float comb[o in 0..255] = prob[o] * lg(prob[o]);
    dexpr float entropy = -sum(o in 0..255) comb[o];
    
    execute {
      cp.param.Workers = 1;
      cp.param.SearchType = "MultiPoint";
      // cp.setSearchPhases(cp.factory.searchPhase(ckey));
    }
    
    minimize entropy;
    constraints {
      forall(i in encrange) {
        plain[i] == xor[encrypted[i]][ckey[i % 512]];
      }
    }
    

    Philippe

     

  • cedricvb
    cedricvb
    2 Posts

    Re: Problems modeling shannon entropy minimization

    ‏2014-01-16T21:55:47Z  

    Hello,

    I think a part of the problem comes from the xor array expression. The model uses a large number of such expressions. They could be factorized by transforming plain into a decision variable.

    Furthermore, for computing the frequency, the count expression can be used to count the number of occurrences of a given value.

    Another thing: it seems to me expression freqsum is not required as the sum should be equal to constant encrypte_len.

    Also, in this problem, there are almost no constraint except for modeling the objective function. In this context it can be interesting to try the MultiPoint search type.

    Here is the model with all these changes. At least, it is able to find some feasible solutions on your instance and memory usage is more reasonable (note that extraction of the model takes a couple of minutes). I also tried a search phase that first works on the original ckey decision variables but it does not seem to help in producing better solutions. I'm using the latest CPLEX Optimization Studio version (V12.6).

    <pre class="html dw" data-editor-lang="js" data-pbcklang="html" dir="ltr">using CP; int encrypted_len = ...; int encrypted[0..encrypted_len-1] = ...; range encrange = 0..encrypted_len-1; int xor[0..255][0..255]; execute { for (var i = 0; i < 256; i++) { for (var j = 0; j < 256; j++) { xor[i][j] = i^j; } } } dvar int ckey[0..511] in 0..255; dvar int plain[i in encrange] in 0..255; // shannon dexpr int frequency[o in 0..255] = count(plain, o); dexpr float prob[o in 0..255] = frequency[o]/encrypted_len; dexpr float comb[o in 0..255] = prob[o] * lg(prob[o]); dexpr float entropy = -sum(o in 0..255) comb[o]; execute { cp.param.Workers = 1; cp.param.SearchType = "MultiPoint"; // cp.setSearchPhases(cp.factory.searchPhase(ckey)); } minimize entropy; constraints { forall(i in encrange) { plain[i] == xor[encrypted[i]][ckey[i % 512]]; } } </pre>

    Philippe

     

    That made a huge difference. Thanks for your elaborate answer and advice!