CharacterSet
In Japan, there’s a comedy tradition known as Manzai (漫才). It’s kind of a cross between stand-up and vaudeville, with a straight man and a funny man delivering rapid-fire jokes that revolve around miscommunication and wordplay.
As it were, we’ve been working on a new routine
as a way to introduce the subject for this week’s article, CharacterSet,
and wanted to see what you thought:
Is a CharacterSet a Set<Character>?
Of course not!
What about NSCharacterSet?
That's an old reference.
Then what do you call a collection of characters?
That would be a String!
(╯° 益 °)╯ 彡 ┻━┻
(Yeah, we might need to workshop this one a bit more.)
All kidding aside,
CharacterSet is indeed ripe for miscommunication and wordplay (so to speak):
it doesn’t store Character values,
and it’s not a Set in the literal sense.
So what is CharacterSet and how can we use it?
Let’s find out! (行きましょう!)
CharacterSet (and its reference type counterpart, NSCharacterSet)
is a Foundation type used to trim, filter, and search for
characters in text.
In Swift,
a Character is an extended grapheme cluster
(really just a String with a length of 1)
that comprises one or more scalar values.
CharacterSet stores those underlying Unicode.Scalar values,
rather than Character values, as the name might imply.
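To make the distinction concrete, here is a minimal sketch (the variable names are illustrative and not part of any API) in which a single Character is made up of two scalar values, and CharacterSet answers questions about each scalar rather than about the character as a whole:

```swift
import Foundation

// One user-perceived character: "e" followed by a combining acute accent.
let accented: Character = "e\u{301}"
let scalars = Array(accented.unicodeScalars)
print(scalars.count) // 2

// CharacterSet membership is tested one Unicode.Scalar at a time.
let base = scalars[0]      // U+0065 LATIN SMALL LETTER E
let combining = scalars[1] // U+0301 COMBINING ACUTE ACCENT
print(CharacterSet.lowercaseLetters.contains(base))       // true
print(CharacterSet.nonBaseCharacters.contains(combining)) // true
```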
The “set” part of CharacterSet
refers not to Set from the Swift standard library,
but instead to the SetAlgebra protocol,
which bestows the type with the same interface:
contains(_:), insert(_:), union(_:), intersection(_:), and so on.
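As a rough illustration of that shared interface, the following sketch exercises a few of the SetAlgebra methods on small, hand-built sets. The set names are made up for the example:

```swift
import Foundation

let vowels = CharacterSet(charactersIn: "aeiou")
var firstFour = CharacterSet(charactersIn: "abc")
firstFour.insert("d") // the literal is inferred as a Unicode.Scalar

let common = firstFour.intersection(vowels) // only "a"
let combined = firstFour.union(vowels)      // a, b, c, d, e, i, o, u

print(common.contains("a"))   // true
print(common.contains("b"))   // false
print(combined.contains("u")) // true
```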
Predefined Character Sets
CharacterSet defines constants
for sets of characters that you’re likely to work with,
such as letters, numbers, punctuation, and whitespace.
Most of them are self-explanatory and,
with only a few exceptions,
correspond to one or more
Unicode General Categories.
Type Property | Unicode General Categories & Code Points |
---|---|
alphanumerics | L*, M*, N* |
letters | L*, M* |
capitalizedLetters | Lt |
lowercaseLetters | Ll |
uppercaseLetters | Lu, Lt |
nonBaseCharacters | M* |
decimalDigits | Nd |
punctuationCharacters | P* |
symbols | S* |
whitespaces | Zs, U+0009 |
newlines | U+000A – U+000D, U+0085, U+2028, U+2029 |
whitespacesAndNewlines | Z*, U+000A – U+000D, U+0085 |
controlCharacters | Cc, Cf |
illegalCharacters | Cn |
The remaining predefined character set, decomposables,
is derived from the
decomposition type and mapping
of characters.
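One way to get a feel for these constants is to filter a string's scalars against them. The sketch below uses a hypothetical helper, keeping(_:in:), which is not part of Foundation and exists only for this example:

```swift
import Foundation

/// Hypothetical helper: keeps only the scalars of `text` that belong to `set`.
func keeping(_ set: CharacterSet, in text: String) -> String {
    var kept = String.UnicodeScalarView()
    kept.append(contentsOf: text.unicodeScalars.filter { set.contains($0) })
    return String(kept)
}

let sample = "Hello, 世界 42!"
print(keeping(.letters, in: sample))               // Hello世界
print(keeping(.decimalDigits, in: sample))         // 42
print(keeping(.punctuationCharacters, in: sample)) // ,!
```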
Trimming Leading and Trailing Whitespace
Perhaps the most common use for CharacterSet
is to remove leading and trailing whitespace from text.
"""

    😴

""".trimmingCharacters(in: .whitespacesAndNewlines) // "😴"
You can use this, for example, when sanitizing user input or preprocessing text.
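A slightly longer sketch, using made-up input strings, shows two details that are easy to miss: trimming only affects the ends of the string, and any character set can be passed, not only whitespace. The last step is one possible way to collapse interior whitespace as well, not the only one:

```swift
import Foundation

let raw = "  hello   world \n"

// Only the ends are trimmed; the run of spaces in the middle is untouched.
let trimmed = raw.trimmingCharacters(in: .whitespacesAndNewlines)
print(trimmed) // hello   world

// Any CharacterSet works, not just whitespace.
let stars = CharacterSet(charactersIn: "*")
let unstarred = "***note***".trimmingCharacters(in: stars)
print(unstarred) // note

// To also collapse interior whitespace: split on the set, drop empties, rejoin.
let collapsed = raw
    .components(separatedBy: .whitespacesAndNewlines)
    .filter { !$0.isEmpty }
    .joined(separator: " ")
print(collapsed) // hello world
```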
Predefined URL Component Character Sets
In addition to the aforementioned constants,
CharacterSet provides predefined values
that correspond to the characters allowed in various
components of a URL:
urlUserAllowed
urlPasswordAllowed
urlHostAllowed
urlPathAllowed
urlQueryAllowed
urlFragmentAllowed
Escaping Special Characters in URLs
Only certain characters are allowed in certain parts of a URL
without first being escaped.
For example, spaces must be percent-encoded as %20
(or +)
when part of a query string like
https://nshipster.com/search/?q=character%20set.
URLComponents takes care of percent-encoding components automatically,
but you can replicate this functionality yourself
using the addingPercentEncoding(withAllowedCharacters:) method
and passing the appropriate character set:
let query = "character set"
query.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed)
// "character%20set"
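For comparison, here is a sketch of the URLComponents route, followed by a caveat about the manual route. Because "&" and "=" are legal in a query string, .urlQueryAllowed leaves them alone; if you are encoding a single value by hand, you may want a stricter set. The valueAllowed set below is an assumption made for this example, not a Foundation constant:

```swift
import Foundation

// URLComponents encodes each query item for you.
var components = URLComponents(string: "https://nshipster.com/search/")!
components.queryItems = [URLQueryItem(name: "q", value: "character set")]
let encodedQuery = components.percentEncodedQuery ?? ""
print(encodedQuery) // q=character%20set

// Manual encoding: "&" survives because it is allowed in a query.
let value = "salt & pepper"
let loose = value.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed) ?? ""
print(loose) // salt%20&%20pepper

// A stricter, hand-rolled set for individual values (illustrative only).
var valueAllowed = CharacterSet.urlQueryAllowed
valueAllowed.remove(charactersIn: "&=+")
let strict = value.addingPercentEncoding(withAllowedCharacters: valueAllowed) ?? ""
print(strict) // salt%20%26%20pepper
```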
Building Your Own
In addition to these predefined character sets, you can create your own. You can build them up character by character, insert multiple characters at a time by passing a string, or mix and match any of the predefined sets.
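Each of those three approaches might look something like this. The set names are invented for the example:

```swift
import Foundation

// Character by character (each literal is inferred as a Unicode.Scalar).
var vowels = CharacterSet()
vowels.insert("a")
vowels.insert("e")

// Several characters at once, from a string.
vowels.insert(charactersIn: "iou")

// Mixing and matching predefined sets with your own.
let identifierCharacters = CharacterSet.alphanumerics
    .union(CharacterSet(charactersIn: "_"))
let consonants = CharacterSet.lowercaseLetters.subtracting(vowels)

print(vowels.contains("o"))               // true
print(identifierCharacters.contains("_")) // true
print(identifierCharacters.contains("-")) // false
print(consonants.contains("b"))           // true
print(consonants.contains("a"))           // false
```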
Validating User Input
You might create a CharacterSet
to validate some user input to, for example,
allow only lowercase and uppercase letters, digits, and certain punctuation.
var allowed = CharacterSet()
allowed.formUnion(.lowercaseLetters)
allowed.formUnion(.uppercaseLetters)
allowed.formUnion(.decimalDigits)
allowed.insert(charactersIn: "!@#$%&")

// Valid only if every scalar in the input belongs to the allowed set.
func validate(_ input: String) -> Bool {
    return input.unicodeScalars.allSatisfy { allowed.contains($0) }
}
Depending on your use case,
you might find it easier to think in terms of what shouldn’t be allowed,
in which case you can compute the inverse character set
using the inverted
property:
let disallowed = allowed.inverted

// Valid only if no disallowed character can be found in the input.
func validate(_ input: String) -> Bool {
    return input.rangeOfCharacter(from: disallowed) == nil
}
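To see how such a validator behaves, here is a self-contained sketch that takes the set as a parameter (isValid(_:allowing:) is a hypothetical name, not the function defined above). It also illustrates a point worth keeping in mind: lowercaseLetters and uppercaseLetters cover all of Unicode, so accented letters pass. If you need ASCII only, one option is to spell the characters out:

```swift
import Foundation

/// Hypothetical variant of the validator that accepts the set as an argument.
func isValid(_ input: String, allowing allowed: CharacterSet) -> Bool {
    return input.unicodeScalars.allSatisfy { allowed.contains($0) }
}

var unicodeAllowed = CharacterSet()
unicodeAllowed.formUnion(.lowercaseLetters)
unicodeAllowed.formUnion(.uppercaseLetters)
unicodeAllowed.formUnion(.decimalDigits)
unicodeAllowed.insert(charactersIn: "!@#$%&")

// ASCII-only alternative, listed explicitly (an assumption for this example).
let asciiAllowed = CharacterSet(charactersIn:
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!@#$%&")

print(isValid("Passw0rd!", allowing: unicodeAllowed)) // true
print(isValid("pass word", allowing: unicodeAllowed)) // false (space)
print(isValid("caf\u{E9}", allowing: unicodeAllowed)) // true (é is a lowercase letter)
print(isValid("caf\u{E9}", allowing: asciiAllowed))   // false
```

Note that an empty string passes either check, so a length requirement would need to be tested separately.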
Caching Character Sets
If a CharacterSet is created as the result of an expensive operation,
you may consider caching its bitmapRepresentation
for later reuse.
For example,
if you wanted to create a CharacterSet for Emoji,
you might do so by enumerating over the Unicode code space (U+0000 – U+10FFFF)
and inserting the scalar values for any characters with
Emoji properties
using the properties
property added in Swift 5 by
SE-0221 “Character Properties”:
import Foundation

var emoji = CharacterSet()

// The Unicode code space ends at U+10FFFF.
for codePoint in 0x0000...0x10FFFF {
    // Surrogate code points are not valid scalars, so skip them.
    guard let scalarValue = Unicode.Scalar(codePoint) else {
        continue
    }

    // Implemented in Swift 5 (SE-0221)
    // https://github.com/apple/swift-evolution/blob/master/proposals/0221-character-properties.md
    if scalarValue.properties.isEmoji {
        emoji.insert(scalarValue)
    }
}
The resulting bitmapRepresentation
is a 16KB Data object.
emoji.bitmapRepresentation // 16385 bytes
You could store that in a file somewhere in your app bundle, or embed its Base64 encoding as a string literal directly in the source code itself.
extension CharacterSet {
    static var emoji: CharacterSet {
        let base64Encoded = """
        AAAAAAgE/wMAAAAAAAAAAAAAAAAA...
        """

        let data = Data(base64Encoded: base64Encoded)!
        return CharacterSet(bitmapRepresentation: data)
    }
}

CharacterSet.emoji.contains("👺") // true
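The full round trip can be sketched with a much smaller stand-in set. Everything named here (original, encoded, restored, and the sleepy property) is hypothetical and only meant to show the shape of the technique: serialize the bitmap once, then rebuild the set from the stored bytes.

```swift
import Foundation

// Stand-in for a set that is expensive to build.
let original = CharacterSet(charactersIn: "😴👺")

// Serialize: bitmap -> Base64 text that could go in a file or a string literal.
let encoded = original.bitmapRepresentation.base64EncodedString()

// Deserialize later without repeating the expensive work.
// Force-unwrapping is safe here only because we produced the string ourselves.
let restored = CharacterSet(bitmapRepresentation: Data(base64Encoded: encoded)!)
print(restored.contains("👺")) // true
print(restored.contains("a"))  // false

extension CharacterSet {
    // A stored `static let` is initialized lazily and only once,
    // so the decoding work is not repeated on every access.
    static let sleepy: CharacterSet = {
        let data = CharacterSet(charactersIn: "😴").bitmapRepresentation
        return CharacterSet(bitmapRepresentation: data)
    }()
}

print(CharacterSet.sleepy.contains("😴")) // true
```

A computed `static var` re-decodes the data each time it is read, so a stored `static let` may be the better fit when the set is used often.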
Much like our attempt at a Manzai routine at the top of the article,
some of the meaning behind CharacterSet is lost in translation.
NSCharacterSet was designed for NSString
at a time when characters were equivalent to 16-bit UCS-2 code units
and text rarely had occasion to leave the Basic Multilingual Plane.
But with Swift’s modern,
Unicode-compliant implementations of String and Character,
the definition of terms has drifted slightly;
along with its NS prefix,
CharacterSet lost some essential understanding along the way.
Nevertheless,
CharacterSet remains a performant, specialized container type
for working with collections of scalar values.
FIN