Motivation
I've just rewritten some 30 mostly trivial parsers and I need that the new versions behave exactly like the old ones. Therefore, I stored their example input files and some signature of the outputs produced by the old parsers for comparison with the new ones. This signature contains the counts of successfully parsed items, sums of some hash codes and up to 10 pseudo-randomly chosen items.
I thought this was a good idea as the equality of the hash code sums sort of guarantee that the outputs are exactly the same and the samples allow me to see what's wrong. I'm only using samples as otherwise it'd get really big.
The problem
Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get mostly the same samples in the same positions (the input is unordered, but the output is a list). This should work also when something is missing, so ideas like taking the 100th smallest element don't work.
ImmutableList<String> selectSome(Collection<String> list) {
if (list.isEmpty()) return ImmutableList.of();
return IntStream.range(1, 20)
.mapToObj(seed -> selectOne(list, seed))
.distinct()
.limit(10)
.collect(ImmutableList.toImmutableList());
}
So I start with numbers from 1 to 20 (so that after distinct
I still most probably have my 10 samples), call a stateless deterministic function selectOne
(defined below) returning one string which is maximal according to some funny criteria, remove duplicates, limit the result and collect it using Guava. All steps should be IMHO deterministic and "ordered", but I may be overlooking something. The other possibility would be that all my 30 new parsers are wrong, but this is improbable given that the hashes are correct. Moreover, the results of the parsing look correct.
String selectOne(Collection<String> list, int seed) {
// some boring mixing, definitely deterministic
for (int i=0; i<10; ++i) {
seed *= 123456789;
seed = Integer.rotateLeft(seed, 16);
}
// ensure seed is odd
seed = 2*seed + 1;
// first element is the candidate result
String result = list.iterator().next();
// the value is the hash code multiplied by the seed
// overflow is fine
int value = seed * result.hashCode();
// looking for s maximizing seed * s.hashCode()
for (final String s : list) {
final int v = seed * s.hashCode();
if (v < value) continue;
// tiebreaking by taking the bigger or smaller s
// this is needed for determinism
if (s.compareTo(result) * seed < 0) continue;
result = s;
value = v;
}
return result;
}
This sampling doesn't seem to work. I get a sequence like
"9224000", "9225000", "4165000", "9200000", "7923000", "8806000", ...
with one old parser and
"9224000", "9225000", "4165000", "3030000", "1731000", "8806000", ...
with a new one. Both results are perfectly repeatable. For other parsers, it looks very similar.
Is my usage of streams wrong? Do I have to add .sequential()
or alike?
Update
Sorting the input collection has solved the problem:
ImmutableList<String> selectSome(Collection<String> collection) {
final List<String> list = Lists.newArrayList(collection);
Collections.sort(list);
.... as before
}
What's still missing is an explanation why.
The explanation
As stated in the answers, my tiebreaker was an all-breaker as I missed to check for a tie. Something like
if (v==value && s.compareTo(result) < 0) continue;
works fine.
I hope that my confused question may be at least useful for someone looking for "consistent sampling". It wasn't really Java 8 related.
I should've used Guava ComparisonChain
or better Java 8 arg max to avoid my stupid mistake:
String selectOne(Collection<String> list, int seed) {
.... as before
final int multiplier = 2*seed + 1;
return list.stream()
.max(Comparator.comparingInt(s -> multiplier * s.hashCode())
.thenComparing(s -> s)) // <--- FOOL-PROOF TIEBREAKER
.get();
}
a, b, c, d
and the other isa, x, c, d
, that's much better for (visual) comparison thana, c, d, x
. – Changlist
) is unordered, then how islist.iterator().next()
supposed to be deterministic? – ShaverselectSome
. My problem is either caused by the code I posted here, or by some inexplicable error somewhere else... I wouldn't have asked this terribly long question, if I hadn't triple checked everything. – Changlist.iterator().next()
). It would only matter, if there were ties, but my tiebreaker should take care of them. – ChangCollection<String>
which isn't. – Hourigantake the 100th smallest element
, but sorting can't do any harm... and maybe it helps. – ChangString -> int
gets computed (overflow is nothing but a modulo operation). 2. The returned string is the one maximizing this function. 3. The function has only collisions whenhashCode
does, so they're only few of them. 4. A collision is only relevant when it happens for the maximizing string, this way producing a tie. Ties are rare and can't lead to the observed behavior. 5. Ties get broken deterministically, so the output should be order-independent. – Changs.compareTo(result) * seed
. – Shaver.comparingInt(s -> multiplier * s.hashCode())
, but also for the expression you mentioned. I removed the multiplier from the tiebreaker as ties are rare and therefore the decision doesn't matter. I could've useds.compareTo(result) ^ seed
in order to avoid problems, namely overflow (compareTo
needn't return just -1,0,1) +++ Even multipliers in anything hashing-like are bad; the more trailing zeros they have, the worse, zero being the worst. Odd multipliers are all fine. – Changjava.util.Random
, which does already provide a working deterministic pseudo random number generator? Your code does not look better than the algorithm ofjava.util.Random
… – Utoaztecanseed
s. But this doesn't really work because of the making the multiplier odd. Anyway, repetitions in a such short sequence are improbable both with my algorithm and withj.u.R.
. +++ Concerning better-looking, my algorithm is different, as it computes a bijective function of the seed. I'd bet that (unlikej.u.R.
) it has has no easily detectable patterns, bit this doesn't matter either. I guess, I'll switch to'j.u.R
some day. – Chang1…19
as a seed to your function and require yourselectOne
to do something useful with that. I’d rather instantiate a singlej.u.R
instance with a fixed seed (0xcafebabe
or0xdeadc0de
or the time millis when writing the code or a true random number picked then) and use that instance to pick ten distinct elements from the source list. That eliminates the need for theselectOne
method completely. It’s reproducible when using the same seed, but very unlikely to have a detectable pattern among the ten elements. – Utoaztecan1..19
and drop the multiply-rotate loop. I can not dropselectOne
as I want more than just reproducibility: When some elements was chosen in a previous run and the computation has changed, resulting in some elements changed, removed and/or inserted, but the changed element stayed the same, I want it to be selected again with some high probability. By storing a couple of such elements, the chances are good, that some elements are present in both collections, so I can compare them. – Changseed * s.hashCode()
). – Changj.u.R
to generate a random string and pick the set’s string with the smallest distance to it. Unlike the hash code, for which collisions are unavoidable, there can be at most two different strings being closest to the reference. You can usenextBoolean()
to decide whether to use the smaller or bigger one in that case. Or keep on always using the bigger one like in your tie breaking comparator. – Utoaztecan