Purity of functions generating ByteString (or any object with ForeignPtr component)
Asked Answered
H

1

14

Since a ByteString is a constructor with ForeignPtr:

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- length

If I have a function that returns ByteString, then given an input, say a constant Word8, the function will return a ByteString with non-deterministic ForeignPtr value - as to what that value will be is determined by the memory manager.

So, does that mean that a function that returns ByteString is not pure? That doesn't seem to be the case obviously, if you have used ByteString and Vector libraries. Surely, it would have been discussed widely if it were the case (and hopefully show up on top of google search). How is that purity enforced?

The reason for asking this question is I am curious what are the subtle points involved in using ByteString and Vector objects, from the GHC compiler perspective, given ForeignPtr member in their constructor.

Hildahildagard answered 23/12, 2011 at 13:52 Comment(0)
H
18

There is no way to observe the value of the pointer inside the ForeignPtr from outside the Data.ByteString module; its implementation is internally impure, but externally pure, because it makes sure that the invariants required to be pure are maintained as long as you cannot see inside the ByteString constructor — which you can't, because it's not exported.

This is a common technique in Haskell: implementing something with unsafe techniques under the hood, but exposing a pure interface; you get both the performance and power unsafe techniques bring, without compromising Haskell's safety. (Of course, the implementation modules can have bugs, but do you think ByteString would be less likely to leak its abstraction if it was written in C? :))

As far as the subtle points go, if you're talking from a user's perspective, don't worry: you can use any function the ByteString and Vector libraries export without worrying, as long as they don't start with unsafe. They are both very mature and well-tested libraries, so you shouldn't run into any purity problems at all, and if you do, that's a bug in the library, and you should report it.

As far as writing your own code that provides external safety with an unsafe internal implementation, the rule is very simple: maintain referential transparency.

Taking ByteString as an example, the functions to construct ByteStrings use unsafePerformIO to allocate blocks of data, which they then mutate and put in the constructor. If we exported the constructor, then user code would be able to get at the ForeignPtr. Is this problematic? To determine whether it is, we need to find a pure function (i.e. not in IO) that lets us distinguish two ForeignPtrs allocated in this way. A quick glance at the documentation shows that there is such a function: instance Eq (ForeignPtr a) would let us distinguish these. So we must not allow user code to access the ForeignPtr. The easiest way to do this is to not export the constructor.

In summary: When you use an unsafe mechanism to implement something, verify that the impurity it introduces cannot leak outside of the module, e.g. by inspecting the values you produce with it.

As far as compiler issues go, you shouldn't really have to worry about them; while the functions are unsafe, they shouldn't allow you to do anything more dangerous, beyond violating purity, than you can do in the IO monad to start with. Generally, if you want to do something that could produce really unexpected results, you'll have to go out of your way to do so: for instance, you can use unsafeDupablePerformIO if you can deal with the possibility of two threads evaluating the same thunk of the form unsafeDupablePerformIO m simultaneously. unsafePerformIO is slightly slower than unsafeDupablePerformIO because it prevents this from happening. (Thunks in your program can be evaluated by two threads simultaneously during normal execution with GHC; this is normally not a problem, as evaluating the same pure value twice should have no adverse side-effects (by definition), but when writing unsafe code, it's something you have to take into account.)

The GHC documentation for unsafePerformIO (and unsafeDupablePerformIO, as I linked above) details some pitfalls you might run into; similarly the documentation for unsafeCoerce# (which should be used through its portable name, Unsafe.Coerce.unsafeCoerce).

Hanging answered 23/12, 2011 at 14:2 Comment(4)
Well, I do plan to use unsafe operations. Hence, this question :) I will like to learn of the issues I should be aware of, as the library writers were. Those insights will be very useful, when writing our own code that needs to be fast, but still externally pure, for parallel and concurrent extensions.Hildahildagard
Ah, OK; that wasn't clear to me from the question. I'll try and incorporate some of that information into my answer, although it's tricky since the basic rule is just "ensure referential transparency from outside the module".Hanging
I've expanded it some more, hope this helps :)Hanging
thank you so much for taking time to expand on it. It is very useful. I too read the GHC documentation about unsafePerformIO and unsafeDupablePerformIO, which got me thinking, which got me to this question :)Hildahildagard

© 2022 - 2024 — McMap. All rights reserved.