How to implement the Hashable Protocol in Swift for an Int array (a custom string struct)
Asked Answered
C

4

46

I am making a structure that acts like a String, except that it only deals with Unicode UTF-32 scalar values. Thus, it is an array of UInt32. (See this question for more background.)

What I want to do

I want to be able to use my custom ScalarString struct as a key in a dictionary. For example:

var suffixDictionary = [ScalarString: ScalarString]() // Unicode key, rendered glyph value

// populate dictionary
suffixDictionary[keyScalarString] = valueScalarString
// ...

// check if dictionary contains Unicode scalar string key
if let renderedSuffix = suffixDictionary[unicodeScalarString] {
    // do something with value
}

Problem

In order to do that, ScalarString needs to implement the Hashable Protocol. I thought I would be able to do something like this:

struct ScalarString: Hashable {

    private var scalarArray: [UInt32] = []

    var hashValue : Int {
        get {
            return self.scalarArray.hashValue // error
        }
    }
}

func ==(left: ScalarString, right: ScalarString) -> Bool {
    return left.hashValue == right.hashValue
}

but then I discovered that Swift arrays don't have a hashValue.

What I read

The article Strategies for Implementing the Hashable Protocol in Swift had a lot of great ideas, but I didn't see any that seemed like they would work well in this case. Specifically,

  • Object property (array is does not have hashValue)
  • ID property (not sure how this could be implemented well)
  • Formula (seems like any formula for a string of 32 bit integers would be processor heavy and have lots of integer overflow)
  • ObjectIdentifier (I'm using a struct, not a class)
  • Inheriting from NSObject (I'm using a struct, not a class)

Here are some other things I read:

Question

Swift Strings have a hashValue property, so I know it is possible to do.

How would I create a hashValue for my custom structure?

Updates

Update 1: I would like to do something that does not involve converting to String and then using String's hashValue. My whole point for making my own structure was so that I could avoid doing lots of String conversions. String gets it's hashValue from somewhere. It seems like I could get it using the same method.

Update 2: I've been looking into the implementation of string hash codes algorithms from other contexts. I'm having a little difficulty knowing which is best and expressing them in Swift, though.

Update 3

I would prefer not to import any external frameworks unless that is the recommended way to go for these things.

I submitted a possible solution using the DJB Hash Function.

Calumet answered 15/7, 2015 at 18:24 Comment(4)
Aside: UTF-32 does not handle all unicode characters in one code unit. There are surrogate pairs, emoji color and weird things like flags that require more than one.Pedology
Unicode uses 21 bits to store all of its code points, so UTF-32 is enough to handle things like flags, emojis and other things in the upper planes. However, UTF-16 encoding requires surrogate pairs to refer to any Unicode character above Plane 0. That said, UInt32 cannot handle Swift Characters (extended grapheme clusters), which can be composed of multiple Unicode scalar values. The reason I am making my own ScalarString is to avoid some of the ambiguity of Character.Calumet
As of Swift 4.2 there is no need anymore do define your own hash combiner, see my update to codereview.stackexchange.com/a/111573/35991.Maryammaryann
@MartinR, thanks for leaving a link here. I also quoted you in my answer below.Calumet
C
50

Update

Martin R writes:

As of Swift 4.1, the compiler can synthesize Equatable and Hashable for types conformance automatically, if all members conform to Equatable/Hashable (SE0185). And as of Swift 4.2, a high-quality hash combiner is built-in into the Swift standard library (SE-0206).

Therefore there is no need anymore to define your own hashing function, it suffices to declare the conformance:

struct ScalarString: Hashable, ... {

    private var scalarArray: [UInt32] = []

    // ... }

Thus, the answer below needs to be rewritten (yet again). Until that happens refer to Martin R's answer from the link above.


Old Answer:

This answer has been completely rewritten after submitting my original answer to code review.

How to implement to Hashable protocol

The Hashable protocol allows you to use your custom class or struct as a dictionary key. In order to implement this protocol you need to

  1. Implement the Equatable protocol (Hashable inherits from Equatable)
  2. Return a computed hashValue

These points follow from the axiom given in the documentation:

x == y implies x.hashValue == y.hashValue

where x and y are values of some Type.

Implement the Equatable protocol

In order to implement the Equatable protocol, you define how your type uses the == (equivalence) operator. In your example, equivalence can be determined like this:

func ==(left: ScalarString, right: ScalarString) -> Bool {
    return left.scalarArray == right.scalarArray
}

The == function is global so it goes outside of your class or struct.

Return a computed hashValue

Your custom class or struct must also have a computed hashValue variable. A good hash algorithm will provide a wide range of hash values. However, it should be noted that you do not need to guarantee that the hash values are all unique. When two different values have identical hash values, this is called a hash collision. It requires some extra work when there is a collision (which is why a good distribution is desirable), but some collisions are to be expected. As I understand it, the == function does that extra work. (Update: It looks like == may do all the work.)

There are a number of ways to calculate the hash value. For example, you could do something as simple as returning the number of elements in the array.

var hashValue: Int {
    return self.scalarArray.count
} 

This would give a hash collision every time two arrays had the same number of elements but different values. NSArray apparently uses this approach.

DJB Hash Function

A common hash function that works with strings is the DJB hash function. This is the one I will be using, but check out some others here.

A Swift implementation provided by @MartinR follows:

var hashValue: Int {
    return self.scalarArray.reduce(5381) {
        ($0 << 5) &+ $0 &+ Int($1)
    }
}

This is an improved version of my original implementation, but let me also include the older expanded form, which may be more readable for people not familiar with reduce. This is equivalent, I believe:

var hashValue: Int {

    // DJB Hash Function
    var hash = 5381

    for(var i = 0; i < self.scalarArray.count; i++)
    {
        hash = ((hash << 5) &+ hash) &+ Int(self.scalarArray[i])
    }

    return hash
} 

The &+ operator allows Int to overflow and start over again for long strings.

Big Picture

We have looked at the pieces, but let me now show the whole example code as it relates to the Hashable protocol. ScalarString is the custom type from the question. This will be different for different people, of course.

// Include the Hashable keyword after the class/struct name
struct ScalarString: Hashable {

    private var scalarArray: [UInt32] = []

    // required var for the Hashable protocol
    var hashValue: Int {
        // DJB hash function
        return self.scalarArray.reduce(5381) {
            ($0 << 5) &+ $0 &+ Int($1)
        }
    }
}

// required function for the Equatable protocol, which Hashable inheirits from
func ==(left: ScalarString, right: ScalarString) -> Bool {
    return left.scalarArray == right.scalarArray
}

Other helpful reading

Credits

A big thanks to Martin R over in Code Review. My rewrite is largely based on his answer. If you found this helpful, then please give him an upvote.

Update

Swift is open source now so it is possible to see how hashValue is implemented for String from the source code. It appears to be more complex than the answer I have given here, and I have not taken the time to analyze it fully. Feel free to do so yourself.

Calumet answered 16/7, 2015 at 2:47 Comment(4)
I have deleted my comment, I was overlooking one important thing, this is used in isEqual which will handle hash collisions. So this is probably a good solution and a lot faster than SHA-256. The only thing missing is insuring the classes of the two objects are the same. Since it is not an extension to Array and just a method in the class this is a workable solution for a homogenous array.Pedology
Refer to Equality & Identity by Mattt Thompson. "One of the most common misconceptions about implementing a custom hash function comes from affirming the consequent, thinking that hash values must be distinct. ... In reality, a simple XOR over the hash values of critical properties is sufficient 99% of the time."Pedology
Thanks for idea of using reduce!Gossett
FYI - I don't believe that the expanded form of your method example above is equivalent to the reduce method. In the reduce method the value of $0 & $1 will change each iteration as the value is reduced (so initially will be 5381 but next will be the value of the first reduction), whereas in the for loop example the variable 'hash' remains constant at 5381. I'm not sure if this has any affect on performance but I wanted to point out the two are not functionally equivalent.Zama
G
5

Edit (31 May '17): Please refer to the accepted answer. This answer is pretty much just a demonstration on how to use the CommonCrypto Framework

Okay, I got ahead and extended all arrays with the Hashable protocol by using the SHA-256 hashing algorithm from the CommonCrypto framework. You have to put

#import <CommonCrypto/CommonDigest.h>

into your bridging header for this to work. It's a shame that pointers have to be used though:

extension Array : Hashable, Equatable {
    public var hashValue : Int {
        var hash = [Int](count: Int(CC_SHA256_DIGEST_LENGTH) / sizeof(Int), repeatedValue: 0)
        withUnsafeBufferPointer { ptr in
            hash.withUnsafeMutableBufferPointer { (inout hPtr: UnsafeMutableBufferPointer<Int>) -> Void in
                CC_SHA256(UnsafePointer<Void>(ptr.baseAddress), CC_LONG(count * sizeof(Element)), UnsafeMutablePointer<UInt8>(hPtr.baseAddress))
            }
        }

        return hash[0]
    }
}

Edit (31 May '17): Don't do this, even though SHA256 has pretty much no hash collisions, it's the wrong idea to define equality by hash equality

public func ==<T>(lhs: [T], rhs: [T]) -> Bool {
    return lhs.hashValue == rhs.hashValue
}

This is as good as it gets with CommonCrypto. It's ugly, but fast and not manypretty much no hash collisions for sure

Edit (15 July '15): I just made some speed tests:

Randomly filled Int arrays of size n took on average over 1000 runs

n      -> time
1000   -> 0.000037 s
10000  -> 0.000379 s
100000 -> 0.003402 s

Whereas with the string hashing method:

n      -> time
1000   -> 0.001359 s
10000  -> 0.011036 s
100000 -> 0.122177 s

So the SHA-256 way is about 33 times faster than the string way. I'm not saying that using a string is a very good solution, but it's the only one we can compare it to right now

Grobe answered 15/7, 2015 at 20:12 Comment(11)
Have you considered how that might affect Arrays of other types?Pedology
@Pedology It doesn't. This implementation is completely safe for every array of any type.Grobe
Just for my education, why does this not affect Arrays of other types.Pedology
I just read another answer that says cryptographic hashes should be avoided for hash tables. https://mcmap.net/q/23366/-hash-function-for-string Does that not apply in this case?Calumet
I just made some speed tests, check out the edit. @Calumet Maybe SHA-256 is slow for hashing standards but it's still pretty fast and for sure doesn't cause lots (or even any) hash collisions. It depends on what you're looking for I guessGrobe
@Pedology The reason it's not affected by the type and works for all of them is that the function uses the raw memory to generate the hash, so whether there's an Int, a pointer to an object, or a character all doesn't matter because it just uses the memory, which will be exactly the same for two equal arraysGrobe
It is possible that objects contained in an Array have different pointers yet the same contents, one would expect these to compare isEqual true. But if the addresses are compared they will compare false. So it seems that this extension may upset equality.Pedology
@Pedology objects are generally compared by the location they're pointing to. Objects are passed by reference, not by value. So I think this makes more sense like this. But I get your pointGrobe
I disagree that comparison is the general case. Even compile-time NSString can no longer be handled that way because there may be identical constant strings in separate compilation modules, libraries and frameworks. That is why hash on the contents needs to be implemented. There may be special cases where addresses can be used for comparison. Further, why would now want to has an address, it is already a primitive and unique.Pedology
This implementation is only valid if it is part of the class. In this case, depending on the array size and general contents, a item by item comparison of the values may be as fast or faster than a cryptographic hash.Pedology
Refer to Equality & Identity by Mattt Thompson. "For container classes like NSArray, NSDictionary, and NSString, the expected and indeed more useful behavior would be to do a deep equality comparison, to test that each member in the collection is equal."Pedology
G
4

It is not a very elegant solution but it works nicely:

"\(scalarArray)".hashValue

or

scalarArray.description.hashValue

Which just uses the textual representation as a hash source

Grobe answered 15/7, 2015 at 18:57 Comment(2)
Both of these ways are converting the array to a String and then getting the String's hashValue. Do you know how String gets its hashValue? It seems like if I knew that, I could just do that directly without having to go through String.Calumet
interesting solution anyway.Duvetyn
D
3

One suggestion - since you are modeling a String, would it work to convert your [UInt32] array to a String and use the String's hashValue? Like this:

var hashValue : Int {
    get {
        return String(self.scalarArray.map { UnicodeScalar($0) }).hashValue
    }
}

That could conveniently allow you to compare your custom struct against Strings as well, though whether or not that is a good idea depends on what you are trying to do...

Note also that, using this approach, instances of ScalarString would have the same hashValue if their String representations were canonically equivalent, which may or may not be what you desire.

So I suppose that if you want the hashValue to represent a unique String, my approach would be good. If you want the hashValue to represent a unique sequence of UInt32 values, @Kametrixom's answer is the way to go...

Delenadeleon answered 15/7, 2015 at 18:57 Comment(2)
This is an interesting idea and it works. I'm a little hesitant to use it, though, because the whole idea for making this custom UInt32 array was to avoid converting to String. Do you know how String gets its hashValue?Calumet
I would trust Swift, however, to have optimized the process of calculating a String's hashValue better than I could by winging it...Delenadeleon

© 2022 - 2024 — McMap. All rights reserved.