Algorithm for computing the plausibility of a function / Monte Carlo Method

I am writing a program that attempts to duplicate the algorithm discussed at the beginning of this article,

http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf

f is a function from char to char. Assume that Pl(f) is a 'plausibility' measure of that function. The algorithm is:

Starting with a preliminary guess at the function, say f, and then a new function f*:

  • Compute Pl(f).
  • Change to f* by making a random transposition of the values f assigns to two symbols.
  • Compute Pl(f*); if this is larger than Pl(f), accept f*.
  • If not, flip a Pl(f*)/Pl(f) coin; if it comes up heads, accept f*.
  • If the coin toss comes up tails, stay at f.

I am implementing this using the following code. I'm using C#, but I've tried to simplify it somewhat so it is easier to follow. If there is a better forum for this, please let me know.

var current_f = Initial();       // current accepted function f
var current_Pl_f = InitialPl();  // current plausibility of accepted function f

for (int i = 0; i < 10000; i++)
{
    var candidate_f = Transpose(current_f);       // create a candidate function
    var candidate_Pl_f = ComputePl(candidate_f);  // compute its plausibility

    if (candidate_Pl_f > current_Pl_f)            // candidate Pl has improved
    {
        current_f = candidate_f;                  // accept the candidate
        current_Pl_f = candidate_Pl_f;
    }
    else                                          // otherwise flip a coin
    {
        int flip = Flip();

        if (flip == 1)                            // heads
        {
            current_f = candidate_f;              // accept it anyway
            current_Pl_f = candidate_Pl_f;
        }
        else if (flip == 0)                       // tails
        {
            // what to do here ?
        }
    }
}

My question is basically whether this looks like the optimal approach to implementing that algorithm. It seems like I may be getting stuck in some local maxima / local minima despite implementing this method.

EDIT - Here is essentially what's behind the Transpose() method. I use a dictionary / hash table of type <char, char> that the candidate function(s) use to look up any given char -> char transformation. So the Transpose method simply swaps two values in the dictionary that dictates the behavior of the function.

    private Dictionary<char, char> Transpose(Dictionary<char, char> map, params int[] indices)
    {
        foreach (var index in indices)
        {
            char target_val = map.ElementAt(index).Value;   // get the value at the index

            char target_key = map.ElementAt(index).Key;     // get the key at the index

            int _rand = _random.Next(map.Count);   // get a random key (char) to swap with

            char rand_key = map.ElementAt(_rand).Key;

            char source_val = map[rand_key]; // the value that currently is used by the source of the swap

            map[target_key] = source_val; // make the swap

            map[rand_key] = target_val;
        }

        return map; 
    }

Keep in mind that a candidate function that uses the underlying dictionary is basically just:

public char GetChar(char c, Dictionary<char, char> theMap)
{
     return theMap[c];
}

And this is the function that computes Pl(f):

    public decimal ComputePl(Func<char, char> candidate, string encrypted, decimal[][] _matrix)
    {
        decimal product = default(decimal);

        for (int i = 0; i < encrypted.Length; i++)
        {
            int j = i + 1;

            if (j >= encrypted.Length)
            {
                break;
            }

            char a = candidate(encrypted[i]);
            char b = candidate(encrypted[j]);

            int _a = GetIndex(_alphabet, a); // _alphabet is just a string/char[] of all avl chars 
            int _b = GetIndex(_alphabet, b);

            decimal _freq = _matrix[_a][_b]; 

            if (product == default(decimal))
            {
                product = _freq;
            }
            else
            {
                product = product * _freq;
            }
        }

        return product;
    }
Impediment answered 14/9, 2011 at 21:36 Comment(0)

Tentatively, codereview.stackexchange.com may be a "better forum for this".
Nevertheless, I'll take a quick stab at it:

  • At a glance, the snippet shown is a correct implementation of the algorithm.
  • Whether the algorithm will "get stuck" in local minima is an issue pertaining to the algorithm, not to the implementation (see discussion below).
  • Your quest for an "optimal approach" seems to be directed at tweaks in the algorithm (deviations from the original algorithm) rather than at tweaks in the implementation (to make it faster and/or to eradicate some possible flaws). For considerations regarding the algorithm, see the discussion below; for the implementation, consider the following:
    • ensure the Flip() method is fair
    • ensure ComputePl() is correct: errors often arise from arithmetic-precision issues in the value function (see the log-space sketch after this list).
    • ensure fairness (equiprobability) with the Transpose() method.
    • Performance improvements would likely come from optimizations in the ComputePl() method (not shown) rather than in the main loop.
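
On the arithmetic-precision point, a common safeguard is to accumulate a sum of logarithms instead of a product of many small numbers. The following is only a sketch: it reuses the question's _matrix, _alphabet and GetIndex, clamps zero entries to an arbitrary small floor, and, since log is monotonic, the "larger is better" comparisons in the main loop work unchanged on the log scale.

    // Sketch: log-space variant of the question's ComputePl.
    // Assumes the same _matrix, _alphabet and GetIndex as in the question.
    // Summing logs avoids the underflow/rounding issues of multiplying many small decimals.
    public double ComputeLogPl(Func<char, char> candidate, string encrypted, decimal[][] _matrix)
    {
        double logProduct = 0.0;

        for (int i = 0; i + 1 < encrypted.Length; i++)
        {
            char a = candidate(encrypted[i]);
            char b = candidate(encrypted[i + 1]);

            decimal freq = _matrix[GetIndex(_alphabet, a)][GetIndex(_alphabet, b)];

            // Clamp zero entries (e.g. a 'qz' transition) to a small floor so one
            // impossible pair does not collapse the whole plausibility to zero.
            double p = freq > 0m ? (double)freq : 1e-6;

            logProduct += Math.Log(p);
        }

        return logProduct;  // larger is still more plausible
    }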

Discussion regarding the algorithm per se and its applicability to different problems.
In a nutshell, the algorithm is a guided stochastic search, whereby the [huge] solution space is sampled with two random devices: the Transpose() method, which modifies (very slightly each time) the current candidate function, and the Flip() method, which decides whether a [locally] suboptimal solution should survive. The search is guided by an objective function, ComputePl(), which is itself based on a matrix of first-order transitions in some reference corpus.

In this context, local minima "tar pits" can be avoided by augmenting the probability of selecting "suboptimal" functions: rather than a fair 50-50 Flip(), maybe try a probability of retaining "suboptimal" solutions of 66% or even 75%. This approach typically extends the number of generations necessary to converge toward the optimal solution but, as said, may avoid getting stuck in local minima.
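
As a rough sketch (the parameterized signature is hypothetical and assumes a shared System.Random field such as the _random used in the question's Transpose()):

    // Sketch: a biased coin. Returns 1 ("heads": accept the suboptimal candidate)
    // with probability p, and 0 otherwise. Assumes a shared System.Random field.
    private int Flip(double p)
    {
        return _random.NextDouble() < p ? 1 : 0;
    }

    // e.g. Flip(0.66) or Flip(0.75) to retain suboptimal candidates more often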

Another way of ensuring the applicability of the algorithm is to ensure a better evaluation of the plausibility of candidate functions. Likely explanations for the relative success and generality of the algorithm are:

  • The distribution of first-order transitions in the English language is not very uniform (and hence models rather well the plausibility of a given function, rewarding it for its matches and punishing it for its mismatches). This kind of multi-dimensional statistic is more resilient to departures from the reference corpus than, say, the "order 0" distribution of characters in a given language/corpus.
  • The distributions of first-order transitions, while language-specific, are generally similar across different languages and/or in the context of jargon vocabularies [based on said languages]. Things do break down in the case of shorthand jargons whereby letters such as vowels are typically omitted.

Therefore, to improve the applicability of the algorithm to a given problem, ensure that the distribution matrix used matches as closely as possible the language and domain of the underlying text.
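
For example, such a matrix could be derived from a reference text roughly as follows. This is only a sketch: BuildTransitionMatrix, corpus and alphabet are hypothetical names, and each row is normalized to conditional relative frequencies (any consistent scaling would work equally well).

    // Sketch: build a first-order transition matrix from a reference corpus.
    // matrix[a][b] approximates P(next char = alphabet[b] | current char = alphabet[a]).
    public static decimal[][] BuildTransitionMatrix(string corpus, string alphabet)
    {
        int n = alphabet.Length;
        var counts = new long[n][];
        var matrix = new decimal[n][];
        for (int i = 0; i < n; i++) { counts[i] = new long[n]; matrix[i] = new decimal[n]; }

        // count adjacent character pairs in the reference text
        for (int i = 0; i + 1 < corpus.Length; i++)
        {
            int a = alphabet.IndexOf(corpus[i]);
            int b = alphabet.IndexOf(corpus[i + 1]);
            if (a >= 0 && b >= 0) counts[a][b]++;
        }

        // normalize each row into conditional relative frequencies
        for (int a = 0; a < n; a++)
        {
            long rowTotal = 0;
            for (int b = 0; b < n; b++) rowTotal += counts[a][b];
            for (int b = 0; b < n; b++)
                matrix[a][b] = rowTotal > 0 ? (decimal)counts[a][b] / rowTotal : 0m;
        }

        return matrix;
    }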

Inquisition answered 14/9, 2011 at 22:17 Comment(4)
It seems like the value of Pl(f) just stops improving at a certain point. I need to reach a very high value but it seems to hang before getting there. Maybe there is something wrong with my method of Transposition.Impediment
or I need a way for the algorithm to 'zero-in' on the solution more aggressively.Impediment
@Sean: (I just noted your edit with the ComputePl() method). It is hard to know precisely because it depends on the way the values in _matrix were normalized, but the type of computations done there are likely to introduce rounding errors that could make the implementation "insensitive" to minor (but desirable) improvements, hence explaining why the progress towards optimal solutions seems to plateau. Concerning your suggestion of "zeroing in more aggressively on the solution": typically not a good idea... (this contradicts the fact that you need a random search)Inquisition
For the matrix values -- M['a']['b'] would be "given the first character 'a', what is the probability that the next character is 'b'". So the value that goes in that cell is ( (total cases where 'b' follows 'a') / (total instances of 'a') ) x 100Impediment

From the description in the article, your implementation seems correct (the part you mark as "what to do here" should indeed do nothing).

If you are having problems with local maxima (something the article claims the coin toss should avoid), make sure your implementations of Initial, Transpose, ComputePl and Flip are correct.

You can also try making the coin toss biased (increasing the probability of Flip() == 1 will make this closer to a random walk and less susceptible to getting stuck).

Here is a slightly tighter version of your code:

var current_f = Initial();    // current accepted function f
var current_Pl_f = ComputePl(current_f);  // current plausibility of accepted function f

for (int i = 0; i < 10000; i++)
{
    var candidate_f = Transpose(current_f); // create a candidate function
    var candidate_Pl_f = ComputePl(candidate_f);  // compute its plausibility

    if (candidate_Pl_f > current_Pl_f  ||  Flip() == 1)
    {
        // either candidate Pl has improved,
        // or it hasn't and we flipped the coin in candidate's favor
        //  - accept the candidate
        current_f = candidate_f;            
        current_Pl_f = candidate_Pl_f; 
    }
}
Almeida answered 14/9, 2011 at 22:15 Comment(7)
That's much clearer, thanks. One other thing I realized is that the article doesn't really mention whether you should keep track of candidates already attempted and be sure not to repeat them; I imagine that should be the case... I posted in my edit the additional functions used for transposition and computing Pl.Impediment
Perhaps it would be useful to hash certain things about the candidates already attempted (e.g., their plausibility - though be careful since the candidate space is large), but I don't think you want to eliminate repetition altogether. Even though you have already visited a candidate, the random transpose from that candidate can take you to a previously unvisited candidate. By eliminating repetition, you could even "box yourself in" (be at a candidate where you have already visited all of its transposes).Almeida
With regard to the Transpose() method -- I've only been passing in one or two int values as the indices argument. Basically, more indices would mean a more aggressive shuffling of the dictionary. So typically it's just one randomly generated integer that gets passed to that method. Also, the alphabet of available chars contains many chars 'ABCDEFG...' etc., but if the encoded message is just 'bveb' or something, it will only make transpositions for mappings of b -> (), v -> (), e -> (), etc., where () represents any of the possible chars in the alphabet.Impediment
That sounds correct to me, I'll take out my edit regarding the Transpose method.Almeida
I think that Pl(f) values of 0 might be throwing the whole thing off. If a candidate creates an output that for all practical purposes never occurs in the English language, it will get a value of 0 even if it is otherwise very close. Something like 'qz', which as far as I know never or rarely occurs adjacently in any English word.Impediment
That causes candidate functions that produce an "impossible" result (e.g. including qz) to be given a plausibility of 0, but doesn't necessarily break the search. Can you post a link to the matrix values you are using? How are you handling spaces?Almeida
let us continue this discussion in chatAlmeida

This algorithm seems to be related to http://en.wikipedia.org/wiki/Simulated_annealing. If that is the case, the behaviour might be helped by changing the probability with which you accept poorer alternatives to the current solution, especially if you reduce this probability over time.
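
A rough sketch of that idea (purely illustrative: the starting probability and decay factor are made-up parameters layered onto the question's loop, reusing its Initial, Transpose, ComputePl and _random):

    // Sketch: decay the probability of accepting an inferior candidate over time,
    // in the spirit of simulated annealing. The parameters are illustrative only.
    var current_f = Initial();
    var current_Pl_f = ComputePl(current_f);

    double acceptWorseProbability = 0.5;   // start out like a fair coin
    const double decay = 0.9995;           // shrink the bias a little every iteration

    for (int i = 0; i < 10000; i++)
    {
        var candidate_f = Transpose(current_f);
        var candidate_Pl_f = ComputePl(candidate_f);

        if (candidate_Pl_f > current_Pl_f || _random.NextDouble() < acceptWorseProbability)
        {
            current_f = candidate_f;
            current_Pl_f = candidate_Pl_f;
        }

        acceptWorseProbability *= decay;   // accept worse candidates less and less often
    }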

Alternatively, you could try simple hill-climbing from multiple random starts - never accept poorer alternatives, which means that you will get stuck in local maxima more often, but repeatedly run the algorithm from different starts.
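
And a rough shape for the multiple-restart idea (again only a sketch: it reuses the question's Initial, Transpose and ComputePl, assumes Initial() returns a fresh random mapping on each call, and the restart count is arbitrary):

    // Sketch: hill-climbing with random restarts. Never accept a worse candidate,
    // but rerun the climb from several random starting functions and keep the best.
    var best_f = Initial();
    var best_Pl_f = ComputePl(best_f);

    for (int restart = 0; restart < 20; restart++)      // 20 restarts is arbitrary
    {
        var f = Initial();                              // fresh random starting function
        var pl = ComputePl(f);

        for (int i = 0; i < 10000; i++)
        {
            var candidate_f = Transpose(f);
            var candidate_Pl_f = ComputePl(candidate_f);
            if (candidate_Pl_f > pl) { f = candidate_f; pl = candidate_Pl_f; }   // strictly uphill
        }

        if (pl > best_Pl_f) { best_f = f; best_Pl_f = pl; }   // keep the best climb so far
    }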

When you test this out, you usually know the right answer for your test problem. It is a good idea to compare the plausibility value at the right answer with those your algorithm is coming up with, just in case the weakness is in the plausibility formula, in which case your algorithm will come up with wrong answers which appear more plausible than the correct one.

Slather answered 15/9, 2011 at 3:48 Comment(1)
It's worth a try. I suppose that would amount to making the coin flip less and less likely to favor an inferior result over time.Impediment
