How do you efficiently find a union of a list of lists of values in haskell?
Since a code example is worth a thousand words I'll start with that:

import Data.List (tails)

testList = [1,2,2,3,4,5]
testSet = map sumMapper $ tails testList
          where sumMapper []    = []
                sumMapper (a:b) = map (+ a) b

This code takes a list and produces the sums of every pair of its elements (I'd also be interested in the efficiency of this part). The output of testSet is:

[[3,3,4,5,6],[4,5,6,7],[5,6,7],[7,8],[9],[],[]]

I would like to find the union of these lists (to make it into a set) but I feel that:

whatIWant = foldl1 union testSet

will have bad performance (the real lists will be thousands of elements long).

Is this the correct solution or am I missing something obvious?

Convalesce asked 25/9, 2014 at 17:31 Comment(3)
so, is testList always in non-decreasing order (as is the case with your example here)?Shaff
Yes, I think in my case testList will always be in non-decreasing order which (provided I'm understanding everything correctly) makes your solution the most efficient. I should really start using fold-trees more in my code, thanks for pointing them out.Convalesce
ok, thanks for the clarification. if you ever get to test these approaches, it'd be very interesting to see the results, space and time. :)Shaff
You might want to try

nub $ concat theListOfLists

In the version using union, the code that removes duplicates runs once for every sublist; here it runs only once, over the full list.
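
For example, with the testSet from the question (nub comes from Data.List):

nub $ concat testSet    -- [3,4,5,6,7,8,9]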

There is also the Data.Set library; you could alternatively use

import qualified Data.Set as S

S.fromList $ concat theListOfLists
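
Note that this yields a Set rather than a list; with the question's testSet, for example:

S.toList . S.fromList $ concat testSet    -- [3,4,5,6,7,8,9]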

The important point is that, in both versions, the code that pulls out duplicates runs over the full list once, rather than over and over again.


edit - Rein mentions below that nub is O(n^2), so you should avoid the first solution above in favor of something O(n log n), as Data.Set.fromList should be. As others have mentioned in the comments, you need an Ord a constraint to get the O(n log n) complexity; Data.Set requires it, while nub does not.

I will leave the two solutions (poor performance and good performance) because I think the resulting discussion was useful.
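
To illustrate what the Ord constraint buys you, here is a minimal sketch of an order-preserving, O(n log n) de-duplication built on Data.Set (ordNubBySet is a made-up name here, not a library function):

import qualified Data.Set as S

-- Keeps the first occurrence of each element; the Ord constraint
-- is what allows the tree-based Set membership tests.
ordNubBySet :: Ord a => [a] -> [a]
ordNubBySet = go S.empty
  where
    go _    []     = []
    go seen (x:xs)
      | x `S.member` seen =     go seen xs
      | otherwise         = x : go (S.insert x seen) xs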

Kopeisk answered 25/9, 2014 at 17:49 Comment(4)
If "efficiently" is a selection criteria for answers, I should point out that nub is O(n^2).Andersen
@ReinHenrichs- wow, I didn't realize that! Why would it be implemented that way? At any rate I just gave an alternate answer using Data.Set, I'll add some more commentary above (next you are going to tell me that S.fromList was implemented stupidly as O(n^2) also?.... :P )Kopeisk
nub only requires Eq, not Ord, hence it cannot even sort the list. It is arguably a design defect: it is more general this way, but its performance is not what you would expect. :-/ Writing an efficient nubSort is easy, though.Mylander
I expand on @chi's comment slightly in my answer :)Andersen
If you're using elements that are members of the Ord typeclass, as in your example, you can use Data.Set:

import Data.List (foldl')
import qualified Data.Set as Set

whatYouWant = foldl' (\acc xs -> acc `Set.union` Set.fromList xs) Set.empty testSet
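
With the testSet from the question, converting back to a list gives:

Set.toList whatYouWant    -- [3,4,5,6,7,8,9]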

This has the advantage of taking space proportional to the size of the largest sublist, rather than to the size of the entire concatenated list as the Set.fromList . concat solution does. The strict foldl' also avoids the buildup of unevaluated thunks that would otherwise cost O(n) stack and heap space.

Generally speaking, an Ord constraint allows more efficient algorithms than an Eq constraint because it allows you to build a tree. This is also the reason that nub is O(n^2): the more efficient algorithm requires Ord rather than just Eq.

Andersen answered 25/9, 2014 at 18:1 Comment(11)
Hang on, let me check if my intuition is correct: does this have better space complexity than using deDup list = (S.toList . S.fromList) $ concat list (from @jamshidh's answer)? I would initially have thought that that one would be better, but now, thinking about optimisations a compiler could be making, I'm not sure.Convalesce
@MikeH-R In these cases, the optimizer should be able to eliminate the intermediate lists via short-cut fusion since foldl is a "good consumer" and map is both a "good consumer" and a "good producer".Andersen
So, just to be clear, (have just pushed short-cut fusion onto my to-read stack) which method do you suggest would be more performant? And would that be in space, time complexity or both?Convalesce
@MikeH-R Actually, I'm going to retract my statement about foldl and short-cut fusion with map for now because I am having doubts. I think I was conflating foldr laws with foldl. Still, the foldl solution without map should require less space than the concat solution since it doesn't allocate the entire joined list.Andersen
I would also expect foldl' to be more efficient, but I notice that Data.Set uses foldl for, e.g., unions.Andersen
Ok, thank you. Have you got any thoughts on how the time complexity compares? I might think that it would be better in the other (poorer space complexity) method but I can't back that up with proper facts (at least not enough to assuage any doubts I have).Convalesce
I'm pretty sure Data.Set.unions leaks space precisely because it uses foldl instead of foldl'.Tanager
@GabrielGonzalez Suspicions confirmed :) Updating my answer accordingly.Andersen
@MikeH-R I would expect the two versions to be within a constant factor of each other. Benchmarking suggests that the foldl' version will be slightly faster for larger lists, but I haven't thoroughly tested with different workloads (lots of small lists, fewer large lists, etc). Here's my benchmarking script: lpaste.net/1744879705800048640Andersen
if the original input list is non-decreasing then so will be all the produced lists that are to be merged, but union ... fromList doesn't take advantage of that. Using fromAscList could be advantageous then, theoretically. Still, a foldl'-based processing will invariably be faced with having to merge progressively larger sets with smaller and smaller lists (and no, the space won't be proportional to largest sublist's (setified) necessarily, it will be the final set's size). A tree-shaped merging solution seems to be more balanced.Shaff
(... contd.) also, it should operate in space no larger than the original list's size (and possibly even less), unlike the set-based solution which must have the whole set forced through, as it is being merged into. (all the above predicated of course on the original list's being in non-decreasing order).Shaff
Since union is an associative operation (a+(b+c)==(a+b)+c), you can use tree-shaped folding for a logarithmic advantage in time complexity:

import Data.List (union)

_U []     = []
_U (xs:t) = union xs (_U (pairs t))

pairs (xs:ys:t)  = union xs ys : pairs t
pairs t          = t

Of course Data.List.union itself is O(n^2) in general, but if your testList is ordered non-decreasing, all the produced lists will be too, and you can use a linear ordUnion instead of union, for a solution which is linearithmic overall and shouldn't leak space:

ordUnion :: (Ord a) => [a] -> [a] -> [a]
ordUnion a      []     = a
ordUnion []     b      = b
ordUnion (x:xs) (y:ys) = case compare x y of
   LT -> x : ordUnion xs  (y:ys)
   EQ -> x : ordUnion xs     ys
   GT -> y : ordUnion (x:xs) ys

To prevent duplicates which might slip through, one more function is needed to process _U's output—a linear ordNub :: (Ord a) => [a] -> [a], with an obvious implementation.
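
For instance, one such linear implementation (a sketch; since _U's output here is non-decreasing, it only has to drop adjacent duplicates):

ordNub :: (Ord a) => [a] -> [a]
ordNub (x:t@(y:_)) | x == y    = ordNub t
                   | otherwise = x : ordNub t
ordNub t           = t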

Using the left-preferential variant (\(x:xs) ys -> x : ordUnion xs ys) could be even more productive overall, since it forces smaller portions of the input at any given moment:

g testList = ordNub . _U $ [map (+ a) b | (a:b) <- tails testList]
  where
    _U []         = []
    _U ((x:xs):t) = x : ordUnion xs (_U (pairs t))
    pairs ((x:xs):ys:t) = (x : ordUnion xs ys) : pairs t
    pairs t             = t
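
As a quick check, with the question's testList (assuming Data.List.tails and the ordUnion definition above are in scope):

g [1,2,2,3,4,5]    -- [3,4,5,6,7,8,9]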


Shaff answered 26/9, 2014 at 10:1 Comment(4)
Wow, thanks for the new approach. I couldn't ask you about your line: "Using the left-preferential (\(x:xs) ys -> x:ordUnion xs ys) could be even more productive overall (force smaller portions of the input at each given moment)." As I don't think I fully grok what that means?Convalesce
with your example, [[3,3,4,5,6],[4,5,6,7],[5,6,7],[7,8],[9]], we know that 3>4>5>7>9, so we can pull that first 3 right away, without comparing it to the head of result of merging the rest of them - which would have to force them all through. More at the primes link that I added (and on the main primes page as well). And even though they are '>=', we can still do that, as we need the ordNub anyway.Shaff
Ahhh, brilliant. Took me a while to grok, but thank you for those links. Very helpful.Convalesce
@MikeH-R you're welcome. (there was a typo above. >, <, who cares, right?! ;) ).Shaff
