Hashable / Hasher

Written by Mattt August 13^th, 2018

When you make a Genius Bar reservation at an Apple Store, you’re instructed to show up at a particular time of day and check in with the concierge. After directing you to pull up a stool, the concierge adds you to the queue and makes a note about how to identify you.

According to anonymous reports from former retail employees, there are strict guidelines about how customers can be described. Nothing about their physical appearance is used: age, gender, ethnicity, height — not even hair color. Instead, all customers are described by their clothing, as in “Person with black turtleneck, jeans, and glasses”.

This practice of describing customers has a lot in common with a hashing function in programming. Like any good hashing function, it’s consistent and easy to compute, and can be used to quickly find what (or who) you’re looking for. Much better than a queue, I think you’ll agree!

Our topic this week is Hashable and its new related type, Hasher. Together, they comprise the functionality underlying two of Swift’s most beloved collection types: Dictionary and Set.

Let’s say you have a list of objects that can be compared for equality with one another. To find a particular object in that list, you iterate all the elements until you find a match. As you add more elements to the array, the average amount of time necessary to find any one of them increases linearly (O(n)).

If you instead store those objects in a set, you can theoretically find any one of them in constant time (O(1)) — that is, a lookup on a set with 10 elements takes the same amount of time as a lookup on a set with 10,000^*. How does this work? Instead of storing objects sequentially, a set computes a hash as an index based on the contents of the object. When you perform a lookup of an object in a set, you use the same function to compute a new hash and look for the object there.

^* Two objects produce a hash collision when they have the same hash value but aren’t equal. When a collision occurs on insertion, they’re stored in a list at that address. The higher the rate of collision between objects, the more linear the performance of a hash collection becomes.

Hashable

In Swift, Array provides the standard interface for lists and Set for sets. In order for an object to be stored in a Set, its type must conform to Hashable (and by extension, Equatable). Swift’s standard map interface, Dictionary has a similar constraint on its associated Key type.

In previous versions of the language, it took quite a bit of boilerplate code to satisfy the requirements for storing a custom type in a Set or Dictionary.

Consider the following Color type, which represents a color using 8-bit values for red, green, and blue intensity:

struct Color {
    let red: UInt8
    let green: UInt8
    let blue: UInt8
}

To conform to Equatable, you had to provide an implementation for the == operator. To conform to Hashable, you had to provide an implementation of the computed hashValue property:

// Swift < 4.1
extension Color: Equatable {
    static func ==(lhs: Color, rhs: Color) -> Bool {
        return lhs.red == rhs.red &&
               lhs.green == rhs.green &&
               lhs.blue == rhs.blue
    }
}

extension Color: Hashable {
    var hashValue: Int {
        return self.red.hashValue ^
               self.green.hashValue ^
               self.blue.hashValue
    }
}

For most developers, implementing Hashable was a speed bump on the way to getting real work done, so they’d simply XOR over all the stored properties and call it a day.

One downside to this approach is its high rate of hash collisions. Because XOR is commutative, colors as different as cyan and yellow produce a hash collision:

// Swift < 4.2
let cyan = Color(red: 0x00, green: 0xFF, blue: 0xFF)
let yellow = Color(red: 0xFF, green: 0xFF, blue: 0x00)

cyan.hashValue == yellow.hashValue // true, collision

Most of the time, this isn’t a problem; modern computers are so powerful that you have to get a lot of implementation details wrong in order to notice any decrease in performance.

But that’s not to say that details don’t matter — they often matter immensely. More on that later.

Automatic Synthesis of Hashable Conformance

As of Swift 4.1, the compiler automatically synthesizes conformance to the Equatable and Hashable protocols for types that adopt these protocols in their declaration if their members also conform to those protocols.

In addition to being a huge boost to developer productivity, this can drastically reduce the size of a codebase. For instance, our Color example from before — is now ⅓ of its original size:

// Swift >= 4.1
struct Color: Hashable {
    let red: UInt8
    let green: UInt8
    let blue: UInt8
}

Despite these unambiguous improvements to the language, there was still a lingering question about some of the implementation details.

In his Swift Evolution proposal SE-0185: Synthesizing Equatable and Hashable conformance, Tony Allevato offered this note about hashing functions:

The choice of hash function is left as an implementation detail, not a fixed part of the design; as such, users should not depend on specific characteristics of its behavior. The most likely implementation would call the standard library’s _mixInt function on each member’s hash value and then combine them with exclusive-or (^), which mirrors the way Collection types are hashed today.

Fortunately, it didn’t take long for Swift to settle on a hash function. We got our answer in the very next release:

Hasher

Swift 4.2 refines Hashable even further by introducing the Hasher type and adopting a new universal hashing function.

From the Swift Evolution proposal, SE-0206: Hashable Enhancements:

With a good hash function, simple lookups, insertions and removals take constant time on average. However, when the hash function isn’t carefully chosen to suit the data, the expected time of such operations can become proportional to the number of elements stored in the table.

As Karoy Lorentey and Vincent Esche note, the main draw of hash-based collections like Set and Dictionary is their ability to look up values in constant time. If the hash function doesn’t produce an even distribution of values, these collections effectively become linked lists.

Swift 4.2 implements hashing based on the SipHash family of pseudorandom functions, specifically SipHash-1-3 and SipHash-2-4, with 1 or 2 rounds of hashing per message block and 3 or 4 rounds of finalization, respectively.

Now if you want to customize how your type implements Hashable, you can override the hash(into:) method instead of hashValue. The hash(into:) method passes a Hasher object by reference, which you call combine(_:) on to add the essential state information of your type.

// Swift >= 4.2
struct Color: Hashable {
    let red: UInt8
    let green: UInt8
    let blue: UInt8

    // Synthesized by compiler
    func hash(into hasher: inout Hasher) {
        hasher.combine(self.red)
        hasher.combine(self.green)
        hasher.combine(self.blue)
    }

    // Default implementation from protocol extension
    var hashValue: Int {
        var hasher = Hasher()
        self.hash(into: &hasher)
        return hasher.finalize()
    }
}

By abstracting away low-level bit manipulation details, developers automatically take advantage of Swift’s built-in hashing function, which has the extra benefit of not reproducing the collisions we had with our original XOR-based implementation:

// Swift >= 4.2
let cyan = Color(red: 0x00, green: 0xFF, blue: 0xFF)
let yellow = Color(red: 0xFF, green: 0xFF, blue: 0x00)

cyan.hashValue == yellow.hashValue // false, no collision

Customizing Hash Function

By default, Swift uses a universal hash function that reduces a sequence of bytes to a single integer.

However, you can improve on this by tailoring your hash function to your domain. For example, if you were writing a program to play a board game like chess or go, you might implement Zobrist hashing to quickly store the game state.

Guarding Against Hash-Flooding

Selecting a cryptographic algorithm like SipHash helps protect against hash-flooding DoS attacks, which deliberately try to generate hash collisions in an attempt to enforce the worst case of hashing data structures and cause a program to slow to a halt. This caused a bunch of problems for the web in the early 2010’s.

To make things even safer, Hasher generates random seed values each time an app is launched, to make hash values even less predictable.

The challenge of programming analogies is they normalize antisocial behavior by way of edge cases.

We excel as software engineers when we can think through all the ways that an attacker might leverage a particular behavior to some sinister end — as in the case of hash-flooding DoS attacks. But by doing so, we risk failing as humans when we apply that knowledge AFK.

That is to say… I’m not in any way encouraging you, dear reader, to coordinate outfits with your besties the next time you visit your local Apple retailer in an attempt to sow confusion and discord among Geniuses.

Please don’t.

Instead, please let your takeaway be this:

If you’re waiting at the Genius Bar, stay away from anyone wearing the same color shirt as you. It’ll make things a lot easier for everyone.