Haskell to find the distance between the two closest points
Asked Answered
V

2

6

Given a list of points in a two dimensional space, you want to perform a function in Haskell to find the distance between the two closest points. example: Input: project [(1,5), (3,4), (2,8), (-1,2), (-8.6), (7.0), (1.5), (5.5), (4.8), (7.4)] Output: 2.0

Assume that the distance between the two farthest points in the list is at most 10000.

Here´s my code:

import Data.List
import System.Random

sort_ :: Ord a => [a] -> [a]
sort_ []    =  []
sort_ [x]  =  [x]
sort_ xs   =  merge (sort_ left) (sort_ right)
  where
    (left, right) = splitAt (length xs `div` 2) xs
    merge [] xs = xs
    merge xs [] = xs
    merge (x:xs) (y:ys)=
    if x <= y then 
        x : merge xs (y:ys)
    else  y : merge (x:xs) ys     

project :: [(Float,Float)] -> Float
project [] = 0
project (x:xs)=
    if null (xs) then 
        error "The list have only 1 point"
    else head(sort_(dstList(x:xs)))

distance :: (Float,Float)->(Float,Float) -> Float
distance (x1,y1) (x2,y2) = sqrt((x1 - x2)^2 + (y1 - y2)^2)


dstList :: [(Float,Float)] -> [Float]
dstList (x:xs)=
    if length xs == 1 then 
        (dstBetween x xs):[]
    else (dstBetween x xs):(dstList xs)


dstBetween :: (Float,Float) -> [(Float,Float)] -> Float
dstBetween pnt (x:xs)=
    if null (xs) then 
        distance pnt x
    else  minimum ((distance pnt ):((dstBetween pnt xs)):[])

{-
Calling generator to create a file created at random points
-}
generator = do
    putStrLn "Enter File Name"
    file <- getLine
    g <- newStdGen
    let pts = take 1000 . unfoldr (Just . (\([a,b],c)->((a,b),c)) . splitAt 2) 
                $ randomRs(-1,1) g :: [(Float,Float)]
    writeFile file . show $ pts

{-
Call the main to read a file and pass it to the function of project
The function of the project should keep the name 'project' as described 
in the statement
-}
main= do
    putStrLn "Enter filename to read"
    name <- getLine
    file <- readFile name
    putStrLn . show . project $ readA file

readA::String->[(Float,Float)]
readA = read

I can perform a run of the program as in the example or using the generator as follows:

in haskell interpreter must type "generator", the program will ask for a file name containing a thousand points here. and after the file is generated in the Haskell interpreter must write main, and request a file name, which is the name of the file you create with "generator".

The problem is that for 1000 points randomly generated my program takes a long time, about 3 minutes on a computer with dual core processor. What am I doing wrong? How I can optimize my code to work faster?

Volcanic answered 17/11, 2012 at 5:56 Comment(3)
Did you profile your program?Gravitate
Why are you deleting so much of your post? It's helpful to see what you tried.Deceive
I've restored the 2nd version, to recover the context.Maximilien
M
13

You are using a quadratic algorithm:

project []  = error "Empty list of points"
project [_] = error "Single point is given"
project ps  = go 10000 ps
  where
    go a [_]    = a
    go a (p:ps) = let a2 = min a $ minimum [distance p q | q<-ps]
                  in a2 `seq` go a2 ps

You should use a better algorithm. Search computational-geometry tag on SO for a better algorithm.

See also http://en.wikipedia.org/wiki/Closest_pair_of_points_problem .


@maxtaldykin proposes a nice, simple and effective change to the algorithm, which should make a real difference for random data -- pre-sort the points by X coordinate, and never try points more than d units away from the current point, in X coordinate (where d is the currently known minimal distance):

import Data.Ord (comparing)
import Data.List (sortBy)

project2 ps@(_:_:_) = go 10000 p1 t 
  where
    (p1:t) = sortBy (comparing fst) ps
    go d _         [] = d
    go d p1@(x1,_) t  = g2 d t
      where
        g2 d []          = go d (head t) (tail t)
        g2 d (p2@(x2,_):r)
           | x2-x1 >= d  = go d (head t) (tail t)
           | d2 >= d     = g2 d  r
           | otherwise   = g2 d2 r   -- change it "mid-flight"
               where
                 d2 = distance p1 p2

On random data, g2 will work in O(1) time so that go will take O(n) and the whole thing will be bounded by a sort, ~ n log n.

Empirical orders of growth show ~ n^2.1 for the first code (on 1k/2k range) and ~n^1.1 for the second, on 10k/20k range, testing it quick'n'dirty compiled-loaded into GHCi (with second code running 50 times faster than first for 2,000 points, and 80 times faster for 3,000 points).

Maximilien answered 17/11, 2012 at 8:14 Comment(9)
Excellent point. O(n^2) gives a very rough estimate of 1 000 000 calculations and O(n log n) gives a very rough estimate of 3000. You should also combine your generate and main functions into one main and compile your file with ghc -O2, which will speed it up a bit compared to the interpreter.Deceive
@AndrewC, don't think ghc -O2 will help, the question is tagged with hugsDingess
@maxtaldykin Oops. Then revised advice is to also download the Haskell Platform.Deceive
@Will, can you please elaborate on how "g2 will work in O(1)". I think this is completely incorrect. And n log n also: this algorithm is still O(n^2). Proposed modification results in O((n/c)^2) where c depends on data.Dingess
@maxtaldykin "on random data". I was thinking along the lines of "from each point we will go to the right only a fixed amount of x-distance, which translates into a fixed amount of points, on random data". This under likely supposition that for random data after the very first g2 scan (that indeed will be O(n)) we will find a value which will be "very close" to the final answer (so all the rest of g2 scans will only go on for similar distances). There won't be a gradual lessening of the currently known minimum, because that would contradict the randomness.Maximilien
@maxtaldykin or maybe this will happen for the 2nd point. The rigorous prove would need to consider the likely graph of currently known minimum value and prove, for random data, that it will be diminishing very fast at first, with high probability, so there will be only a few first O(n) g2 scans. And 2 or 3 O(n) operations is still O(n) of course. And empirical data supports this. As I mentioned, the empirical order of growth is n^1.1 which is exactly what you get from logarithmic complexity - a very low power coefficient. So your algorithm is very good, it seems. :)Maximilien
What mean these instructions? ps@(::_) and p1@(x1,_), I don't know the meaning of "@" in haskellVolcanic
@Melkhiah66 @ denotes at-pattern: ps@(p:t) is a pattern (p:t) that matches a value, which we can also call ps, as a whole. I.e. for some value val, ps = val ; p = head val; t = tail val. _ is an anonymous variable - no need to name it if we're not going to refer to it. ps@(_:_:_) will match a list (which we call ps) with no less than two elements in it. p1@(x1,_) will match a pair (which we call p1) whose first element we call x1.Maximilien
(correction to a comment 5 steps above - "the empirical order of growth is n^1.1 which is exactly what you get from" linearithmic "complexity - a very low" above 1.0 "power coefficient").Maximilien
D
6

It's possible to slightly modify your bruteforce search to get better performance on random data.

Main idea is to sort points by x coordinate and, while comparing distances in loop, consider only points that have horizontal distance not grater than current minimum distance.

This could be order of magnitude faster but in the worst case it is still O(n^2).
Actually, on 2000 points it is 50 times faster on my machine.

project points = loop1 10000 byX
  where
    -- sort points by x coordinate
    --  (you need import Data.Ord to use `comparing`)
    byX = sortBy (comparing fst) points

    -- loop through all points from left to right
    -- threading `d` through iterations as a minimum distance so far
    loop1 d = foldl' loop2 d . tails

    -- `tail` drops leftmost points one by one so `x` is moving from left to right
    -- and `xs` contains all points to the right of `x`
    loop2 d [] = d
    loop2 d (x:xs) = let
        -- we take only those points of `xs` whose horizontal distance
        -- is not greater than current minimum distance
        xs' = takeWhile ((<=d) . distanceX x) xs
        distanceX (a,_) (b,_) = b - a

        -- then just get minimum distance from `x` to those `xs'`
      in minimum $ d : map (distance x) xs'

Btw, please don't use so many parentheses. Haskell does not require to enclose function arguments.

Dingess answered 17/11, 2012 at 11:34 Comment(1)
very nice! simple and effective. :)Maximilien

© 2022 - 2024 — McMap. All rights reserved.