Is there a fast way to sample from a subset of GLn?
The rules of this problem are fairly specific because I'm actually looking at a subset of GLn, where the row and column vectors must have a certain form (call these vectors valid -- examples below), so please bear with me. Here are the rules:

  1. You can select a valid nonzero vector of length n uniformly at random.

  2. Given valid vectors v1, v2, ..., vk, you can determine whether or not the partial columns they form are prefixes of valid vectors (e.g. whether [v1_1 v2_1 ... vk_1] occurs as the first k entries of a valid vector) using the function isPrefix.

  3. Given valid vectors v1, v2, ..., vk, you can determine whether or not they are linearly dependent using the function areIndependent.

The goal is to sample from this subset of GLn uniformly at random. Here is a naive solution that fails:

    Step 1: Select a valid v1 uniformly at random. 
            If isPrefix(v1) then Go to Step 2.
            Otherwise repeat Step 1.

    Step 2: Select a valid v2 uniformly at random. 
            If areIndependent(v1,v2) & isPrefix(v1,v2), then go to Step 3. 
            Otherwise, repeat Step 2.

    ...

    Step n: Select a valid vn uniformly at random. 
            If areIndependent(v1,v2,...,vn) & isPrefix(v1,v2,...,vn), then END. 
            Otherwise, repeat Step n.

The problem with this would-be solution is that it can get stuck in an infinite loop in the not-too-unlikely event that areIndependent(v1,v2,...,vk) & isPrefix(v1,v2,...,vk) correctly returns TRUE but there is no way to complete this k-tuple to a linearly independent valid n-tuple. A common instance of this is when k=n-1 and there is a unique valid vector vn such that isPrefix(v1,v2,...,vn) is TRUE but this vector is not independent of the previous n-1 vectors.

Therefore I need some way to back up a step or two when I'm stuck in this loop, even though I don't necessarily know that I'm in it. I'm looking for a solution along these lines: if Step k fails f(k) times, for some explicit function f that may depend on the distribution of valid vectors, then go back to Step k-1 (or even further back, in some explicit way).
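To make this concrete, here is a minimal sketch of that idea. The names sample_valid, is_prefix, and are_independent are stand-ins for the problem-specific primitives in rules 1-3 above, and the retry budget f(k) = 10k is an arbitrary placeholder:

```python
import random

def sample_matrix(n, sample_valid, is_prefix, are_independent,
                  f=lambda k: 10 * k):
    """Build a matrix row by row.  If step k fails more than f(k) times
    in a row, give up on that step and back up to a randomly chosen
    earlier step, discarding everything after it."""
    rows, failures = [], 0
    while len(rows) < n:
        v = sample_valid(n)
        candidate = rows + [v]
        if are_independent(candidate) and is_prefix(candidate):
            rows.append(v)
            failures = 0
        else:
            failures += 1
            if failures > f(len(rows) + 1) and rows:
                # back up: keep a random prefix of the rows built so far
                rows = rows[:random.randrange(len(rows))]
                failures = 0
    return rows
```

Note that ad-hoc backtracking like this does not obviously preserve uniformity of the final sample; it only guarantees that the sampler cannot get permanently stuck at one step.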

Any suggestions, comments, references, etc will be greatly appreciated! Thanks!

EXAMPLES:

I'm actually looking for a general reference or guideline for how to proceed with the sampling. I have several examples of valid vectors on which I would like to do this, and it's more helpful to me to be able to do it on my own eventually than to list every example and have the SO community hash out solutions. However, to be concrete and give a sense of the difficulties involved, here is one example:

Alternating Sign Matrices: A vector is valid iff its entries all lie in {-1, 0, 1}, the nonzero entries alternate between 1s and -1s, and the sum of the entries is 1. For instance, every permutation matrix consists of valid vectors, as does the following:

 0  1  0
 1 -1  1
 0  1  0
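As a sanity check, the validity rule for this example is easy to state in code (a sketch; is_valid_asm_vector is a name introduced here, not part of the problem statement):

```python
def is_valid_asm_vector(v):
    """A vector is valid iff every entry is in {-1, 0, 1}, the entries
    sum to 1, and the nonzero entries alternate in sign (which, together
    with the sum condition, forces the first and last nonzero entries
    to be 1)."""
    if any(x not in (-1, 0, 1) for x in v):
        return False
    if sum(v) != 1:
        return False
    nonzero = [x for x in v if x != 0]
    return all(a != b for a, b in zip(nonzero, nonzero[1:]))
```

Every row and column of the 3x3 matrix above passes this check.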

Distinct Entries: A vector is valid iff it has distinct entries. This one is particularly annoying because the condition applies to rows as well as columns.
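For this example the per-vector check is one line, but the matrix-level check couples rows and columns, which is the annoying part (a sketch; the helper names are mine):

```python
def has_distinct_entries(v):
    """Per-vector validity: all entries distinct."""
    return len(set(v)) == len(v)

def matrix_is_valid(M):
    # the rule applies to rows and columns together, so a row-by-row
    # sampler must also track the partial columns it is creating
    return (all(has_distinct_entries(row) for row in M)
            and all(has_distinct_entries(col) for col in zip(*M)))
```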

Thanks again to all the folks who've looked at this!

Asked by Given, 17/5/2011 at 5:28. Comments (14):
C: Is pure rejection sampling too slow? Why not just go back to step 1 when you hit your first sign of difficulty?
Given: Far too slow, sadly. The valid-vectors condition limits the possibilities because the rows and columns must all be valid.
Compulsive: It would help greatly if we had an idea what your validity check looks like, because that is the heart of the problem.
Benzocaine: Please describe the subset. Also, the last step terminates almost never. Also, what law do you expect on this subset? Uniform as a subset of R^(n x n)? Other? Irrelevant?
Benzocaine: Also, GLn is not a vector space, so any sub-something cannot be a subspace (it could have been a subgroup, though). We also need to know what you plan to use this sampling for. If you want to compute integrals over a matrix space, then we'll need to look for a sampling procedure that gives the correct distribution. Most importantly, describe the subset. We at least need to know whether sampling over GLn and then doing rejection over your subset is possible.
Benzocaine: Your examples have nothing to do with GLn. Are the coefficients always integers?
Gun: I'm a bit lost when you state: "it can get stuck in an infinite loop in the not-too-unlikely event that areIndependent(v1,v2,...,vk) correctly returns TRUE but there is no way to complete this k-tuple to a linearly independent valid n-tuple". If you've got k < n vectors, then they span at most a k-dimensional space, so nearly every possible random vector (i.e., with probability 1) can be added to form a (k+1)-dimensional space. Then again, we need to know which subset of GLn you're thinking of.
Given: @Alexandre C.: The vectors are not necessarily integer vectors, and the matrices must be invertible... this is the connection with GLn, the group of invertible matrices.
Given: @Fezvez: You are correct if I allow any possible vector to be selected. The difficulty is that I am allowing only certain types of vectors to be selected, and it may be that no vector in the complement of the row space is valid.
Benzocaine: @PengOne: We still don't know anything about what probability distribution we should expect. Say I take the second example: then almost every matrix in Mn(R) is invertible and has distinct entries. So drawing a matrix at random and discarding whatever doesn't fit the bill is fine. Now what does it mean for a matrix to be random? Normally distributed coefficients? Coefficients uniform in [0,1]? The matrix is a product of an orthogonal (Haar-distributed) matrix and a Wishart-distributed one?
Given: @Alexandre C.: Yes, I am aware of that. I have dozens of problems of this form, each with a different distribution. I fully expect the answer to depend upon the distribution. I am not looking for a one-off solution to one example, but a general solution that takes the distribution into account. If you have a solution for a single example, then I can perhaps extrapolate from that, which is better than where I am now, but I'm looking more for a theoretical answer.
Given: @Alexandre C.: I need to sample vector by vector as described in the problem. If you absolutely must know why, then shoot me an e-mail and I can send you a 22-page paper that explains it. The real question here is to find a rule for when to "give up" on a step and how far back up the chain to go.
Gun: I may have an idea. I have no idea how slow it is, but it does not get stuck. The main idea: when you have your vectors v1..vk, instead of generating v(k+1) again and again until they are all independent (which may never happen), you generate v(k+1) once; then, if they are not independent, you choose a number r at random in [1:k+1] and replace vr with a new randomly sampled vector. Repeat until v1..v(k+1) are independent. I don't know if it's much better than pure rejection sampling, though. :P
Given: @Fezvez: That should be an answer, not a comment. I'm voting up any non-trivial, sensible answer, and this certainly qualifies. I'm skeptical that it will be faster, but willing to try it on a few examples. Thanks for thinking on this!

I suspect that you might have to move to a Markov chain Monte Carlo algorithm (http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm), not necessarily for speed, but to ensure that your random samples are sensibly distributed.

One definition of random sampling is to produce the same distribution as if you generated matrices at random from your original distribution of columns and then kept only the valid ones.

If you generate items from a tree, and the nodes do not all have the same degree, the leaves will not be visited with equal probability. Consider a simple tree with two non-leaf nodes, one of which has a single leaf child, with the other having one million leaf children. If you sample by moving down from the root, making a random choice at each node, the single leaf child will be visited more often than any particular leaf from the set with a million siblings.

Since you got stuck in an infinite loop above, you found a case where there is a node with no children. Assuming that there are any leaves at all, you have a tree where the nodes do not all have the same degree.

You may end up having to code different approaches for the different validity rules, and you will have to worry about how long your Markov chain takes to "burn in" and become reasonably random. There is one (sort of) exception to the latter point. If you are trying to work out a tail probability to rule out the possibility that a given configuration was selected at random, you can start your Markov chain from that configuration, because, under the null hypothesis, that point is already randomly chosen; so if all of your generated values have a larger statistic than that starting point, something fishy is going on.
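A minimal sketch of this suggestion, under some assumptions not in the original question: matrix_ok checks full validity (valid rows and columns, invertibility), sample_valid draws uniformly from the valid vectors, and M0 is any valid starting matrix. With this symmetric single-row proposal and a uniform target, the Metropolis-Hastings acceptance rule reduces to "accept iff the proposal is valid":

```python
import random

def metropolis_sample(M0, sample_valid, matrix_ok, n_steps=10_000):
    """Random walk over the valid invertible matrices: propose replacing
    one row with a fresh uniformly drawn valid vector, accept iff the
    result is still fully valid.  The proposal is symmetric and the
    target is uniform, so the acceptance probability is 1 on valid
    proposals and 0 otherwise."""
    M = [row[:] for row in M0]
    n = len(M)
    for _ in range(n_steps):
        i = random.randrange(n)
        old_row = M[i]
        M[i] = sample_valid(n)      # symmetric proposal: resample one row
        if not matrix_ok(M):
            M[i] = old_row          # reject: stay at the current state
    return M
```

Be aware that single-row moves can mix slowly, or not at all, on rigid sets such as alternating-sign matrices (where a single row is often forced by the others), so larger proposal moves and a burn-in period may be needed in practice.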

Answered by Gibeonite, 17/5/2011 at 18:23. Comment (1):
Given: +1 for a good suggestion. I'm more interested in speed than in achieving perfect uniformity. When I implement Metropolis, I lose the sequential aspect of choosing vectors. In many, but not all, cases it turns out that once k vectors are chosen, the possibilities for the (k+1)st are severely limited, which speeds things up considerably. In all of the cases I tried, Metropolis was much slower than repeating a step 10 times and then returning to a randomly chosen higher step. Still, a very good suggestion that ensures uniformity much better than the outlined method.
