Topic
• 2 replies
• Latest Post - 2014-01-16T21:55:47Z by cedricvb
cedricvb
2 Posts

# Pinned topic: Problems modeling Shannon entropy minimization

2014-01-16T01:46:16Z

Hello,

I've got a model in which I'd like to minimize the Shannon entropy of a given text. As soon as I launch the model, however, memory gets exhausted.

```
using CP;

int encrypted_len = ...;
range encrange = 0..encrypted_len-1;
int encrypted[0..encrypted_len-1] = ...;
int xor[0..255][0..255];

execute {
  for (var i = 0; i < 256; i++) {
    for (var j = 0; j < 256; j++) {
      xor[i][j] = i^j;
    }
  }
}

dvar int ckey[0..511] in 0..255;
dexpr int plain[i in encrange] = xor[ckey[i % 512]][encrypted[i]];
// shannon

dexpr int frequency[o in 0..255] = sum(i in encrange) plain[i] == o;
dexpr int freqsum = sum(o in 0..255) frequency[o];
dexpr float prob[o in 0..255] = frequency[o]/freqsum;
dexpr float comb[o in 0..255] = prob[o] * lg(prob[o]);
dexpr float entropy = -sum(o in 0..255) comb[o];

minimize entropy;
```

The data file contains the data as an array of ints which are the ordinals for the characters in the string.
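For example (illustrative Python only, not part of the model — the actual data file isn't shown here), a plaintext string can be turned into that ordinal-array form, assuming the usual OPL `.dat` syntax:

```python
# Hypothetical helper: convert a string to the ordinal-array .dat form
# described above (the real data file is not part of this post).
def to_dat(text):
    ords = [ord(c) for c in text]
    return "encrypted_len = %d;\nencrypted = %s;" % (len(ords), ords)

print(to_dat("HELLO"))
# encrypted_len = 5;
# encrypted = [72, 69, 76, 76, 79];
```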

I imagine that this model is far from optimal:
- could it be better represented as an LP?
- how can an XOR be done more efficiently (rather than precalculating an XOR table)?
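For reference, here is a small Python sketch of what the model is meant to compute: a repeating-key XOR followed by the Shannon entropy of the resulting byte frequencies (assuming `lg` is the base-2 logarithm):

```python
import math
from collections import Counter

def xor_apply(data, key):
    # Repeating-key XOR, as in the model: out[i] = key[i % len(key)] ^ data[i]
    return [b ^ key[i % len(key)] for i, b in enumerate(data)]

def shannon_entropy(values):
    # -sum p * log2(p) over observed symbols; p == 0 terms contribute nothing
    n = len(values)
    return -sum((f / n) * math.log2(f / n) for f in Counter(values).values())
```

XOR is its own inverse, so applying the same key twice recovers the original bytes; the model searches for the 512-byte key whose decryption has minimal entropy.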

At the moment I don't get any results at all, regardless of the size of the data file. Any help on what is going wrong would be welcome. :-)

Kind regards,
Cedric

• PhilippeLaborie
49 Posts

#### Re: Problems modeling shannon entropy minimization

2014-01-16T10:54:34Z

Hello,

I think part of the problem comes from the xor array element expressions: the model uses a very large number of them. They can be factorized by turning plain into a decision variable.

Furthermore, for computing the frequencies, the count expression can be used to count the number of occurrences of a given value.

Another thing: the expression freqsum is not needed, as the sum is always equal to the constant encrypted_len.

Also, this problem has almost no constraints except those modeling the objective function. In this context it can be interesting to try the MultiPoint search type.

Here is the model with all these changes. At least it is able to find some feasible solutions on your instance, and memory usage is more reasonable (note that extraction of the model takes a couple of minutes). I also tried a search phase that first branches on the original ckey decision variables, but it does not seem to help produce better solutions. I'm using the latest CPLEX Optimization Studio version (V12.6).

```
using CP;

int encrypted_len = ...;
int encrypted[0..encrypted_len-1] = ...;

range encrange = 0..encrypted_len-1;
int xor[0..255][0..255];

execute {
  for (var i = 0; i < 256; i++) {
    for (var j = 0; j < 256; j++) {
      xor[i][j] = i^j;
    }
  }
}

dvar int ckey[0..511] in 0..255;
dvar int plain[i in encrange] in 0..255;

// shannon

dexpr int frequency[o in 0..255] = count(plain, o);
dexpr float prob[o in 0..255] = frequency[o]/encrypted_len;
dexpr float comb[o in 0..255] = prob[o] * lg(prob[o]);
dexpr float entropy = -sum(o in 0..255) comb[o];

execute {
  cp.param.Workers = 1;
  cp.param.SearchType = "MultiPoint";
  // cp.setSearchPhases(cp.factory.searchPhase(ckey));
}

minimize entropy;
constraints {
  forall(i in encrange) {
    plain[i] == xor[encrypted[i]][ckey[i % 512]];
  }
}
```

Philippe

