A faster way of generating combinations with a given length, preserving the order
Asked Answered
E

2

4

TL;DR: I want the exact behavior as filter ((== 4) . length) . subsequences. Just using subsequences also creates variable length of lists, which takes a lot of time to process. Since in the end only lists of length 4 are needed, I was thinking there must be a faster way.


I have a list of functions. The list has the type [Wor -> Wor]

The list looks something like this

[f1, f2, f3 .. fn]

What I want is a list of lists of n functions while preserving order like this

input : [f1, f2, f3 .. fn]

argument : 4 functions

output : A list of lists of 4 functions.

Expected output would be where if there's an f1 in the sublist, it'll always be at the head of the list.

If there's a f2 in the sublist and if the sublist doens't have f1, f2 would be at head. If fn is in the sublist, it'll be at last.

In general if there's a fx in the list, it never will be infront of f(x - 1) .

Basically preserving the main list's order when generating sublists.

It can be assumed that length of list will always be greater then given argument.

I'm just starting to learn Haskell so I haven't tried all that much but so far this is what I have tried is this:

Generation permutations with subsequences function and applying (filter (== 4) . length) on it seems to generate correct permutations -but it doesn't preserve order- (It preserves order, I was confusing it with my own function).

So what should I do?

Also if possible, is there a function or a combination of functions present in Hackage or Stackage which can do this? Because I would like to understand the source.

Evince answered 10/1, 2019 at 13:34 Comment(4)
I'm a little confused by the description. Do you just want to take a list (of functions, although nowhere do you seem to use the fact that these are functions, so it should apply to a list of arbitrary type) and generate all sublists of length exactly 4, but have each sublist in the same order as the elements appear in the original list?Gosselin
This sounds like take 4 [f1, ..., fn]Narthex
subsequences should definitely preserve order in the way it sounds like you want. Can you show an example where it does something wrong? (It doesn’t even have to involve functions, right? Could be lists of numbers?)Antimonyl
I have updated the question. Sorry if it was confusing.Evince
S
5

You describe a nondeterministic take:

ndtake :: Int -> [a] -> [[a]]
ndtake 0 _      = [[]]
ndtake n []     = []
ndtake n (x:xs) = map (x:) (ndtake (n-1) xs) ++ ndtake n xs

Either we take an x, and have n-1 more to take from xs; or we don't take the x and have n more elements to take from xs.

Running:

> ndtake 3 [1..4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]

Update: you wanted efficiency. If we're sure the input list is finite, we can aim at stopping as soon as possible:

ndetake n xs = go (length xs) n xs
    where
    go spare n _  | n >  spare = []
    go spare n xs | n == spare = [xs]
    go spare 0 _      =  [[]]
    go spare n []     =  []
    go spare n (x:xs) =  map (x:) (go (spare-1) (n-1) xs) 
                            ++     go (spare-1)  n   xs

Trying it:

> length $ ndetake 443 [1..444]
444

The former version seems to be stuck on this input, but the latter one returns immediately.


But, it measures the length of the whole list, and needlessly so, as pointed out by @dfeuer in the comments. We can achieve the same improvement in efficiency while retaining a bit more laziness:

ndzetake :: Int -> [a] -> [[a]]
ndzetake n xs | n > 0 = 
    go n (length (take n xs) == n) (drop n xs) xs
    where
    go n b p ~(x:xs)
         | n == 0 = [[]]
         | not b  = []
         | null p = [(x:xs)]
         | otherwise = map (x:) (go (n-1) b p xs)
                          ++ go n b (tail p) xs

Now the last test also works instantly with this code as well.

There's still room for improvement here. Just as with the library function subsequences, the search space could be explored even more lazily. Right now we have

> take 9 $ ndzetake 3 [1..]
[[1,2,3],[1,2,4],[1,2,5],[1,2,6],[1,2,7],[1,2,8],[1,2,9],[1,2,10],[1,2,11]]

but it could be finding [2,3,4] before forcing the 5 out of the input list. Shall we leave it as an exercise?

Sadye answered 10/1, 2019 at 13:59 Comment(14)
Oh, you can surely do better than those! I would consider ndetake a dead end, but let's go back to ndtake. First off, you can almost certainly use a DList-style optimization to get rid of the ++. Second off, there's a critical piece of information you fail to use: in the first recursive call, ndtake (n-1) xs is never empty (unless the list is too short), so you should be able to produce a cons somewhat more lazily. Basically, there are two cases: n < length xs or not. Once you know which one, you either produce the empty list or do something efficient.Vicennial
Another option that should be even lazier (I think) is to produce a single too-short list when the list is short. ndtake 4 [1,2,3] = [[1,2,3]].Vicennial
@Vicennial the OP didn't want shorter subsequences, so your last suggestion doesn't fit the requirements. as for n < length xs, how am I to know that, if not by measuring the length at the start? and that's what ndetake is doing. If we don't know the length, we keep skipping the elements even when the suffix is already too short, that's why ndtake was so slow on the last example. If it goes on blind, it will always do that. I don't see why you dismiss ndetake, really. I think ndtake is hopeless. We should measure length, & try to stop on short suffixes as ndetake does, I think.Sadye
but indeed there is another way to do this: instead of map (x:), have another argument to go which is all the lists built so far; popping off it all the lists of required length as soon as they are built (this needs to be combined with the too-short-suffixes detection, too). but I'm not sure whether ndetake isn't already doing that, effectively. will need to think it over (or experiment :) ).Sadye
you really don't need the length of the list; you just need to know if it's long enough. That's a much cheaper problem to solve.Vicennial
ah, wait, you meant to look ahead just as much as needed. aha.Sadye
Right. I haven't worked out the precise details, but pass down two pointers ("tortoise" and "hare"). You know you're done when the hare runs out. Also pass down a function representing the current prefix. And take the result to append to (DList style). Roughly speaking, at least, go :: ([t] -> [t]) -> Int -> [t] -> [t] -> [[t]] -> [[t]]. To get the process started, use some sort of dropMay or splitAtMay.Vicennial
I think for n-long required segments, we just need to always look ahead n notches on the list ahead of the current element, that's it. either that, or it is good enough :) I'm not worried about DL related stuff; I can always build them in reverse (again, 'good enough", complexity-wise :) ) but I still fear the explosion in numbers of these semi-built segments...Sadye
Yes, we need to look ahead n notches. The base cases seem a little fussy, which is why I haven't produced a solution yet. And ... life calls. I bet you'll find a good one!Vicennial
No no! Building in reverse means it will fall over for infinite lists! Don't do that! And that part is really easy anyway (just the final argument I showed above). The fussy bit is base cases.Vicennial
no, I'll pop off the built ones as they are produced. the segments-being-built will have to be maintained ordered by their length, for that. n is finite! and yes, life calls for me, too. :)Sadye
I'll have to look at it when you're done. There are certainly a number of ways to do it, which will lead to different orderings of the results. Not sure how they'll affect, e.g., space complexity. Once you work out the base cases, there are lots of ways to experiment with the easy bits.Vicennial
This b argument is kind of gross. Luckily it's not needed; you can follow the exact recursive structure from the original ndtake. Here's how: ndtake n xs = go n (drop (n-1) xs) xs where go 0 _ _ = [[]]; go n [] xs = []; go n succm@(_:m) ~(x:xs) = map (x:) (go (n-1) succm xs) ++ go n m xs.Simulant
@DanielWagner yeah I remember looking back at that code some time after I posted it and wondering what's going on, it doesn't look like b is changing there at all...Sadye
V
2

Here's the best I've been able to come up with. It answers the challenge Will Ness laid down to be as lazy as possible in the input. In particular, ndtake m ([1..n]++undefined) will produce as many entries as possible before throwing an exception. Furthermore, it strives to maximize sharing among the result lists (note the treatment of end in ndtakeEnding'). It avoids problems with badly balanced list appends using a difference list. This sequence-based version is considerably faster than any pure-list version I've come up with, but I haven't teased apart just why that is. I have the feeling it may be possible to do even better with a better understanding of just what's going on, but this seems to work pretty well.

Here's the general idea. Suppose we ask for ndtake 3 [1..5]. We first produce all the results ending in 3 (of which there is one). Then we produce all the results ending in 4. We do this by (essentially) calling ndtake 2 [1..3] and adding the 4 onto each result. We continue in this manner until we have no more elements.

import qualified Data.Sequence as S
import Data.Sequence (Seq, (|>))
import Data.Foldable (toList)

We will use the following simple utility function. It's almost the same as splitAtExactMay from the 'safe' package, but hopefully a bit easier to understand. For reasons I haven't investigated, letting this produce a result when its argument is negative leads to ndtake with a negative argument being equivalent to subsequences. If you want, you can easily change ndtake to do something else for negative arguments.

-- to return an empty list in the negative case.
splitAtMay :: Int -> [a] -> Maybe ([a], [a])
splitAtMay n xs
  | n <= 0 = Just ([], xs)
splitAtMay _ [] = Nothing
splitAtMay n (x : xs) = flip fmap (splitAtMay (n - 1) xs) $
  \(front, rear) -> (x : front, rear)

Now we really get started. ndtake is implemented using ndtakeEnding, which produces a sort of "difference list", allowing all the partial results to be concatenated cheaply.

ndtake :: Int -> [t] -> [[t]]
ndtake n xs = ndtakeEnding n xs []

ndtakeEnding :: Int -> [t] -> ([[t]] -> [[t]])
ndtakeEnding 0 _xs = ([]:)
ndtakeEnding n xs = case splitAtMay n xs of
    Nothing -> id -- Not enough elements
    Just (front, rear) ->
        (front :) . go rear (S.fromList front)
  where
    -- For each element, produce a list of all combinations
    -- *ending* with that element.
    go [] _front = id
    go (r : rs) front =
      ndtakeEnding' [r] (n - 1) front
        . go rs (front |> r)

ndtakeEnding doesn't call itself recursively. Rather, it calls ndtakeEnding' to calculate the combinations of the front part. ndtakeEnding' is very much like ndtakeEnding, but with a few differences:

  1. We use a Seq rather than a list to represent the input sequence. This lets us split and snoc cheaply, but I'm not yet sure why that seems to give amortized performance that is so much better in this case.
  2. We already know that the input sequence is long enough, so we don't need to check.
  3. We're passed a tail (end) to add to each result. This lets us share tails when possible. There are lots of opportunities for sharing tails, so this can be expected to be a substantial optimization.
  4. We use foldr rather than pattern matching. Doing this manually with pattern matching gives clearer code, but worse constant factors. That's because the :<|, and :|> patterns exported from Data.Sequence are non-trivial pattern synonyms that perform a bit of calculation, including amortized O(1) allocation, to build the tail or initial segment, whereas folds don't need to build those.

NB: this implementation of ndtakeEnding' works well for recent GHC and containers; it seems less efficient for earlier versions. That might be the work of Donnacha Kidney on foldr for Data.Sequence. In earlier versions, it might be more efficient to pattern match by hand, using viewl for versions that don't offer the pattern synonyms.

ndtakeEnding' :: [t] -> Int -> Seq t -> ([[t]] -> [[t]])
ndtakeEnding' end 0 _xs = (end:)
ndtakeEnding' end n xs = case S.splitAt n xs of
     (front, rear) ->
        ((toList front ++ end) :) . go rear front
  where
    go = foldr go' (const id) where
      go' r k !front = ndtakeEnding' (r : end) (n - 1) front . k (front |> r)
    -- With patterns, a bit less efficiently:
    -- go Empty _front = id
    -- go (r :<| rs) !front =
    --  ndtakeEnding' (r : end) (n - 1) front
    --    . go rs (front :|> r)
Vicennial answered 13/1, 2019 at 6:46 Comment(4)
Will work through this later, after I've mended my last code some more... (like you said initially, trying the DL route, for then the sharing of prefix structure is possible -- the map (x:) bit prevents that, and that's what makes it quadratic, as per old inits discussion... )Sadye
@WillNess, note that in this solution (to meet your challenge) the prefix structure has become a suffix structure.Vicennial
Hold my beer: go acc xs = zipWith (map . (:)) xs (scanl1 (flip (++)) acc); allSubseqs xs = scanl go ([[]] : ([] <$ xs)) (tails xs); subseqs n = take 1 . drop n . allSubseqs >=> id >=> map reverseSimulant
@DanielWagner, that is certainly inscrutable. Is it also fast? Could you expand it into an answer?Vicennial

© 2022 - 2024 — McMap. All rights reserved.