Hashable / Hasher
When you make a Genius Bar reservation at an Apple Store, you’re instructed to show up at a particular time of day and check in with the concierge. After directing you to pull up a stool, the concierge adds you to the queue and makes a note about how to identify you.
According to anonymous reports from former retail employees, there are strict guidelines about how customers can be described. Nothing about their physical appearance is used: age, gender, ethnicity, height — not even hair color. Instead, all customers are described by their clothing, as in “Person with black turtleneck, jeans, and glasses”.
This practice of describing customers has a lot in common with a hashing function in programming. Like any good hashing function, it’s consistent and easy to compute, and can be used to quickly find what (or who) you’re looking for. Much better than a queue, I think you’ll agree!
Our topic this week is Hashable
and its new related type, Hasher
.
Together, they comprise the functionality underlying two of Swift’s
most beloved collection types: Dictionary
and Set
.
Let’s say you have a
list
of objects that can be compared for equality with one another.
To find a particular object in that list,
you iterate all the elements until you find a match.
As you add more elements to the array,
the average amount of time necessary to find any one of them increases linearly
(O(n)
).
If you instead store those objects in a
set,
you can theoretically find any one of them in constant time (O(1)
) —
that is, a lookup on a set with 10 elements
takes the same amount of time as a lookup on a set with 10,000*.
How does this work?
Instead of storing objects sequentially,
a set computes a hash as an index based on the contents of the object.
When you perform a lookup of an object in a set,
you use the same function to compute a new hash
and look for the object there.
* Two objects produce a hash collision when they have the same hash value but aren’t equal. When a collision occurs on insertion, they’re stored in a list at that address. The higher the rate of collision between objects, the more linear the performance of a hash collection becomes.
Hashable
In Swift,
Array
provides the standard interface for lists
and Set
for sets.
In order for an object to be stored in a Set
,
its type must conform to Hashable
(and by extension, Equatable
).
Swift’s standard map
interface, Dictionary
has a similar constraint on its associated Key
type.
In previous versions of the language,
it took quite a bit of boilerplate code
to satisfy the requirements for storing a custom type
in a Set
or Dictionary
.
Consider the following Color
type,
which represents a color using 8-bit values for red, green, and blue intensity:
struct Color {
let red: UInt8
let green: UInt8
let blue: UInt8
}
To conform to Equatable
,
you had to provide an implementation for the ==
operator.
To conform to Hashable
,
you had to provide an implementation of the computed hash
property:
// Swift < 4.1
extension Color: Equatable {
static func ==(lhs: Color, rhs: Color) -> Bool {
return lhs.red == rhs.red &&
lhs.green == rhs.green &&
lhs.blue == rhs.blue
}
}
extension Color: Hashable {
var hash Value: Int {
return self.red.hash Value ^
self.green.hash Value ^
self.blue.hash Value
}
}
For most developers,
implementing Hashable
was a speed bump on the way to getting real work done,
so they’d simply
XOR
over all the stored properties and call it a day.
One downside to this approach is its high rate of hash collisions. Because XOR is commutative, colors as different as cyan and yellow produce a hash collision:
// Swift < 4.2
let cyan = Color(red: 0x00, green: 0x FF, blue: 0x FF)
let yellow = Color(red: 0x FF, green: 0x FF, blue: 0x00)
cyan.hash Value == yellow.hash Value // true, collision
Most of the time, this isn’t a problem; modern computers are so powerful that you have to get a lot of implementation details wrong in order to notice any decrease in performance.
But that’s not to say that details don’t matter — they often matter immensely. More on that later.
Automatic Synthesis of Hashable Conformance
As of Swift 4.1,
the compiler automatically synthesizes conformance to
the Equatable
and Hashable
protocols
for types that adopt these protocols in their declaration
if their members also conform to those protocols.
In addition to being a huge boost to developer productivity,
this can drastically reduce the size of a codebase.
For instance, our Color
example from before —
is now ⅓ of its original size:
// Swift >= 4.1
struct Color: Hashable {
let red: UInt8
let green: UInt8
let blue: UInt8
}
Despite these unambiguous improvements to the language, there was still a lingering question about some of the implementation details.
In his Swift Evolution proposal SE-0185: Synthesizing Equatable and Hashable conformance, Tony Allevato offered this note about hashing functions:
The choice of hash function is left as an implementation detail, not a fixed part of the design; as such, users should not depend on specific characteristics of its behavior. The most likely implementation would call the standard library’s
_mix
function on each member’s hash value and then combine them with exclusive-or (Int ^
), which mirrors the wayCollection
types are hashed today.
Fortunately, it didn’t take long for Swift to settle on a hash function. We got our answer in the very next release:
Hasher
Swift 4.2 refines Hashable
even further
by introducing the Hasher
type
and adopting a new universal hashing function.
From the Swift Evolution proposal, SE-0206: Hashable Enhancements:
With a good hash function, simple lookups, insertions and removals take constant time on average. However, when the hash function isn’t carefully chosen to suit the data, the expected time of such operations can become proportional to the number of elements stored in the table.
As Karoy Lorentey
and Vincent Esche note,
the main draw of hash-based collections like Set
and Dictionary
is their ability to look up values in constant time.
If the hash function doesn’t produce an even distribution of values,
these collections effectively become linked lists.
Swift 4.2 implements hashing based on the SipHash family of pseudorandom functions, specifically SipHash-1-3 and SipHash-2-4, with 1 or 2 rounds of hashing per message block and 3 or 4 rounds of finalization, respectively.
Now if you want to customize how your type implements Hashable
,
you can override the hash(into:)
method instead of hash
.
The hash(into:)
method passes a Hasher
object by reference,
which you call combine(_:)
on
to add the essential state information of your type.
// Swift >= 4.2
struct Color: Hashable {
let red: UInt8
let green: UInt8
let blue: UInt8
// Synthesized by compiler
func hash(into hasher: inout Hasher) {
hasher.combine(self.red)
hasher.combine(self.green)
hasher.combine(self.blue)
}
// Default implementation from protocol extension
var hash Value: Int {
var hasher = Hasher()
self.hash(into: &hasher)
return hasher.finalize()
}
}
By abstracting away low-level bit manipulation details,
developers automatically take advantage of Swift’s built-in hashing function,
which has the extra benefit of not reproducing the collisions
we had with our original XOR
-based implementation:
// Swift >= 4.2
let cyan = Color(red: 0x00, green: 0x FF, blue: 0x FF)
let yellow = Color(red: 0x FF, green: 0x FF, blue: 0x00)
cyan.hash Value == yellow.hash Value // false, no collision
Customizing Hash Function
By default, Swift uses a universal hash function that reduces a sequence of bytes to a single integer.
However, you can improve on this by tailoring your hash function to your domain. For example, if you were writing a program to play a board game like chess or go, you might implement Zobrist hashing to quickly store the game state.
Guarding Against Hash-Flooding
Selecting a cryptographic algorithm like SipHash helps protect against hash-flooding DoS attacks, which deliberately try to generate hash collisions in an attempt to enforce the worst case of hashing data structures and cause a program to slow to a halt. This caused a bunch of problems for the web in the early 2010’s.
To make things even safer,
Hasher
generates random seed values each time an app is launched,
to make hash values even less predictable.
The challenge of programming analogies is they normalize antisocial behavior by way of edge cases.
We excel as software engineers when we can think through all the ways that an attacker might leverage a particular behavior to some sinister end — as in the case of hash-flooding DoS attacks. But by doing so, we risk failing as humans when we apply that knowledge AFK.
That is to say… I’m not in any way encouraging you, dear reader, to coordinate outfits with your besties the next time you visit your local Apple retailer in an attempt to sow confusion and discord among Geniuses.
Please don’t.
Instead, please let your takeaway be this:
If you’re waiting at the Genius Bar, stay away from anyone wearing the same color shirt as you. It’ll make things a lot easier for everyone.