Determinism of Java 8 streams
Asked Answered
C

2

12

Motivation

I've just rewritten some 30 mostly trivial parsers and I need that the new versions behave exactly like the old ones. Therefore, I stored their example input files and some signature of the outputs produced by the old parsers for comparison with the new ones. This signature contains the counts of successfully parsed items, sums of some hash codes and up to 10 pseudo-randomly chosen items.

I thought this was a good idea as the equality of the hash code sums sort of guarantee that the outputs are exactly the same and the samples allow me to see what's wrong. I'm only using samples as otherwise it'd get really big.

The problem

Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get mostly the same samples in the same positions (the input is unordered, but the output is a list). This should work also when something is missing, so ideas like taking the 100th smallest element don't work.

ImmutableList<String> selectSome(Collection<String> list) {
        if (list.isEmpty()) return ImmutableList.of();
        return IntStream.range(1, 20)
            .mapToObj(seed -> selectOne(list, seed))
            .distinct()
            .limit(10)
            .collect(ImmutableList.toImmutableList());
    }

So I start with numbers from 1 to 20 (so that after distinct I still most probably have my 10 samples), call a stateless deterministic function selectOne (defined below) returning one string which is maximal according to some funny criteria, remove duplicates, limit the result and collect it using Guava. All steps should be IMHO deterministic and "ordered", but I may be overlooking something. The other possibility would be that all my 30 new parsers are wrong, but this is improbable given that the hashes are correct. Moreover, the results of the parsing look correct.

String selectOne(Collection<String> list, int seed) {
    // some boring mixing, definitely deterministic
    for (int i=0; i<10; ++i) {
        seed *= 123456789;
        seed = Integer.rotateLeft(seed, 16);
    }
    // ensure seed is odd
    seed = 2*seed + 1;

    // first element is the candidate result
    String result = list.iterator().next();
    // the value is the hash code multiplied by the seed
    // overflow is fine
    int value = seed * result.hashCode();

    // looking for s maximizing seed * s.hashCode()
    for (final String s : list) {
        final int v = seed * s.hashCode();
        if (v < value) continue;
        // tiebreaking by taking the bigger or smaller s
        // this is needed for determinism
        if (s.compareTo(result) * seed < 0) continue;
        result = s;
        value = v;
    }
    return result;
}

This sampling doesn't seem to work. I get a sequence like

"9224000", "9225000", "4165000", "9200000", "7923000", "8806000", ...

with one old parser and

"9224000", "9225000", "4165000", "3030000", "1731000", "8806000", ...

with a new one. Both results are perfectly repeatable. For other parsers, it looks very similar.

Is my usage of streams wrong? Do I have to add .sequential() or alike?

Update

Sorting the input collection has solved the problem:

ImmutableList<String> selectSome(Collection<String> collection) {
    final List<String> list = Lists.newArrayList(collection);
    Collections.sort(list);
    .... as before
}

What's still missing is an explanation why.

The explanation

As stated in the answers, my tiebreaker was an all-breaker as I missed to check for a tie. Something like

if (v==value && s.compareTo(result) < 0) continue;

works fine.

I hope that my confused question may be at least useful for someone looking for "consistent sampling". It wasn't really Java 8 related.

I should've used Guava ComparisonChain or better Java 8 arg max to avoid my stupid mistake:

String selectOne(Collection<String> list, int seed) {
    .... as before
    final int multiplier = 2*seed + 1;
    return list.stream()
          .max(Comparator.comparingInt(s -> multiplier * s.hashCode())
          .thenComparing(s -> s)) // <--- FOOL-PROOF TIEBREAKER
          .get();
}
Chang answered 25/8, 2017 at 6:56 Comment(28)
"Basically, given an unordered collection of strings, I want to get a list of up to 10 of them, so that when the collection changes a bit, I still get about the same samples in about the same positions. " What do "positions" mean in an unordered collection?Hourigan
Also, you posted the new code, said it's supposed to work the same way as the old one, but didn't post the old one.Trippet
@AndyTurner Bad formulation, hopefully fixed. The input is unordered, the output is a list. When one output is a, b, c, d and the other is a, x, c, d, that's much better for (visual) comparison than a, c, d, x.Chang
I don't understand. If the input collection (list) is unordered, then how is list.iterator().next() supposed to be deterministic?Shaver
@JBNizet I see, I was very unclear... What I posted is the part common to the old and new code. What has changed, are the parsers producing some items and their item numbers are the inputs of selectSome. My problem is either caused by the code I posted here, or by some inexplicable error somewhere else... I wouldn't have asked this terribly long question, if I hadn't triple checked everything.Chang
@Shaver It is not. But I'm searching a single string which maximizes a function, so the iteration order does not matter neither does the starting element (provided by list.iterator().next()). It would only matter, if there were ties, but my tiebreaker should take care of them.Chang
@Chang I've read your question four times and I still do not understand how these parsers should work, what they are used for?Incurve
@Chang I'll have to agree with the previous comment - it's probably a great question (I've actually read and seriously like some of your answers), so I bet there is something really good here, but the overall question is really unclear IMO, I've read it like 10 times now...Dorolisa
When you say that "it is entirely repeatable", what I take from this is that the process you are employing here is deterministic, including the streams. It's the process generating the Collection<String> which isn't.Hourigan
@KrzysztofCichocki They're parsing something. They have become very ugly in the meantime, so I decided to rewrite them... and what I'm describing in this question is some code used for testing that they work the same as the old versions. I was hoping I could explain the motivation, but I maybe should have leave it out.Chang
Why do you need this hashed ordering? Have you tried sorting the input naturally?Shaver
@Dorolisa I recall and like your answers, too. I wanted to simplify the explanation of what I'm doing by giving the motivation, but it probably was no good idea as the motivation seems to be the harder part. :DChang
@Shaver Sure, I could sort the sequence... I can't do things like take take the 100th smallest element, but sorting can't do any harm... and maybe it helps.Chang
@Chang Can you please post the code for old parsers? It would be easier to spot the possible problem having the "old" code to compare.Incurve
Can we assume the input ordering is consistent between runs, but not between the old and new parsers?Shaver
@KrzysztofCichocki See my above comment answering to JBNizet. This is neither the new nor the old code, this is the common part used for testing that the new code works just like the old one.Chang
@Shaver Yes. And you were right... there's something wrong with the posted code as sorting the input changes the outcome. No idea how it's possible, but it's a big step forwards.Chang
@Shaver Yes. More precisely, there are still some differences, but they're rare and obviously caused by minor bugs in some of the parsers (that's why I wrote this code).Chang
I can't see how the maximizing loop in ´selectOne´ should give results independent of the order of the input. The overflow is in fact a problem, I think.Projector
@Projector Given that it indeed doesn't work, you may be right. My reasoning was as follows: 1. With or without overflow, some function String -> int gets computed (overflow is nothing but a modulo operation). 2. The returned string is the one maximizing this function. 3. The function has only collisions when hashCode does, so they're only few of them. 4. A collision is only relevant when it happens for the maximizing string, this way producing a tie. Ties are rare and can't lead to the observed behavior. 5. Ties get broken deterministically, so the output should be order-independent.Chang
I don't know if it was intentional, but you dodged a bullet by using odd seeds. Even seeds can break symmetry for s.compareTo(result) * seed.Shaver
@Shaver Even multipliers (seeds) are terrible in general., e.g, for .comparingInt(s -> multiplier * s.hashCode()), but also for the expression you mentioned. I removed the multiplier from the tiebreaker as ties are rare and therefore the decision doesn't matter. I could've used s.compareTo(result) ^ seed in order to avoid problems, namely overflow (compareTo needn't return just -1,0,1) +++ Even multipliers in anything hashing-like are bad; the more trailing zeros they have, the worse, zero being the worst. Odd multipliers are all fine.Chang
Why didn’t you use java.util.Random, which does already provide a working deterministic pseudo random number generator? Your code does not look better than the algorithm of java.util.RandomUtoaztecan
@Utoaztecan You mean instead of my multiply-rotate algorithm? I probably wanted to avoid repetitions, i.e., get different multipliers from a range of seeds. But this doesn't really work because of the making the multiplier odd. Anyway, repetitions in a such short sequence are improbable both with my algorithm and with j.u.R.. +++ Concerning better-looking, my algorithm is different, as it computes a bijective function of the seed. I'd bet that (unlike j.u.R.) it has has no easily detectable patterns, bit this doesn't matter either. I guess, I'll switch to 'j.u.R some day.Chang
Well, you are feeding the numbers 1…19 as a seed to your function and require your selectOne to do something useful with that. I’d rather instantiate a single j.u.R instance with a fixed seed (0xcafebabe or 0xdeadc0de or the time millis when writing the code or a true random number picked then) and use that instance to pick ten distinct elements from the source list. That eliminates the need for the selectOne method completely. It’s reproducible when using the same seed, but very unlikely to have a detectable pattern among the ten elements.Utoaztecan
@Utoaztecan I can generate a pseudorandom sequence, use it instead of 1..19 and drop the multiply-rotate loop. I can not drop selectOne as I want more than just reproducibility: When some elements was chosen in a previous run and the computation has changed, resulting in some elements changed, removed and/or inserted, but the changed element stayed the same, I want it to be selected again with some high probability. By storing a couple of such elements, the chances are good, that some elements are present in both collections, so I can compare them.Chang
@Utoaztecan My explanation is surely confusing. Imagine you want to see how a group of people (a school class) change over the time without storing the information of all of them. Let's assume, each of them has a unique SSN. When I store the info of the ones with the smallest and the biggest SSNs, then the chances are good, that at least one of the now chosen persons will be present in the class the next year and can be found using the same criterion. What I did, is just a generalization to a bigger selection from a bigger set (instead of smallest and biggest I use arg max seed * s.hashCode()).Chang
If I understood correctly, you could just use the j.u.R to generate a random string and pick the set’s string with the smallest distance to it. Unlike the hash code, for which collisions are unavoidable, there can be at most two different strings being closest to the reference. You can use nextBoolean() to decide whether to use the smaller or bigger one in that case. Or keep on always using the bigger one like in your tie breaking comparator.Utoaztecan
I
7

In selectOne you just want to select String s with max rank of value = seed * s.hashCode(); for that given seed.

The problem is with the "tiebreaking" line: if (s.compareTo(result) * seed < 0) continue;

It is not deterministic - for different order of elements it omits different elements from being check, and thus change in order of elements is changing the result.

Remove the tiebreaking if and the result will be insensitive to the order of elements in input list.

Incurve answered 25/8, 2017 at 7:55 Comment(0)
S
13

The mistake is that your tiebreaker is not in fact breaking a tie. We should be selecting s when v > value, but instead we're falling back to compareTo(). This breaks comparison symmetry, making your algorithm dependent on encounter order.

As a bonus, here's a simple test case to reproduce the bug:

System.out.println(selectOne(Arrays.asList("1", "2"), 4));  // 1
System.out.println(selectOne(Arrays.asList("2", "1"), 4));  // 2
Shaver answered 25/8, 2017 at 8:16 Comment(0)
I
7

In selectOne you just want to select String s with max rank of value = seed * s.hashCode(); for that given seed.

The problem is with the "tiebreaking" line: if (s.compareTo(result) * seed < 0) continue;

It is not deterministic - for different order of elements it omits different elements from being check, and thus change in order of elements is changing the result.

Remove the tiebreaking if and the result will be insensitive to the order of elements in input list.

Incurve answered 25/8, 2017 at 7:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.