CFStringTransform
There are two indicators that tell you everything you need to know about how nice a language is to use:
- API Consistency
- Quality of String Implementation
`NSString` is the crown jewel of Foundation. In an age where other languages still struggle to handle Unicode correctly, `NSString` is especially impressive. Not content to just work with whatever is thrown at it, `NSString` can parse strings into linguistic tags, determine the dominant language of the content, and convert between every string encoding imaginable. It’s unfairly good.
But as powerful as `NSString` / `NSMutableString` are, one would be remiss not to mention their toll-free bridged cousin, `CFMutableString`, or more specifically, `CFStringTransform`.
As denoted by the `CF` prefix, `CFStringTransform` is part of Core Foundation. The function takes the following arguments and returns a `Boolean` indicating whether the transform was successful:
- `string`: The string to be transformed. Since this argument is a `CFMutableStringRef`, an `NSMutableString` can be passed using a toll-free bridging cast.
- `range`: The range of the string over which the transformation should be applied. This argument is a pointer to a `CFRange`, rather than an `NSRange` value; passing `NULL` applies the transformation to the whole string.
- `transform`: The transformation to apply. This argument takes an ICU transform string, including any one of the string constants described below.
- `reverse`: Whether to run the transformation in reverse, where applicable.
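To make the four arguments concrete, here is a minimal sketch in current Swift (where the types are spelled `CFMutableString` and `Bool`). The sample string and the choice of a four-unit range are assumptions made for illustration; the transform constant is the one described in the next section.

```swift
import Foundation

// "café café", with each é written as the precomposed U+00E9.
let text = NSMutableString(string: "caf\u{E9} caf\u{E9}")

// Restrict the transform to the first word (UTF-16 offsets 0..<4).
var range = CFRangeMake(0, 4)

// The range is passed by pointer; pass nil instead to transform the whole string.
// The Boolean result reports whether the transform succeeded.
let succeeded = CFStringTransform(text as CFMutableString, &range,
                                  kCFStringTransformStripCombiningMarks, false)

print(succeeded, text)
```

Only the first word should lose its accent here, since the second one lies outside the range.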
`CFStringTransform` covers a lot of ground with its `transform` argument. Here’s a rundown of what it can do:
Strip Accents and Diacritics
Énġlišh långuãge lẳcks iñterêßţing diaçrïtičş. As such, it can be useful to normalize extended Latin characters into ASCII-friendly representations. Rid any string of its squiggly bits using the `kCFStringTransformStripCombiningMarks` transformation.
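As a rough sketch, the transformation can be wrapped in a small `String` extension. The method name `strippingCombiningMarks()` is hypothetical and not part of Foundation; it works on a mutable copy and returns `nil` if the transform reports failure.

```swift
import Foundation

extension String {
    /// Hypothetical convenience wrapper (not a Foundation API):
    /// returns a copy of the string with combining marks removed.
    func strippingCombiningMarks() -> String? {
        let buffer = NSMutableString(string: self)
        // nil range: transform the entire string.
        guard CFStringTransform(buffer as CFMutableString, nil,
                                kCFStringTransformStripCombiningMarks, false) else {
            return nil
        }
        return buffer as String
    }
}

print("Énġlišh långuãge".strippingCombiningMarks() ?? "transform failed")
```

Characters that are not a base letter plus a combining mark, such as ß, are left alone by this transform.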
Name Unicode Characters
`kCFStringTransformToUnicodeName` allows you to determine the Unicode standard name for special characters, including Emoji. For instance, “🐑💨✨” is transformed into “{SHEEP} {DASH SYMBOL} {SPARKLES}”, and “🐷” becomes “{PIG FACE}”.
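A minimal sketch of the name lookup in current Swift follows. The exact delimiters around each name are an assumption to verify on your target OS, since some system versions are known to emit a `\N{...}` form instead of bare braces.

```swift
import Foundation

let emoji = NSMutableString(string: "🐷")

// Replace each character with its Unicode name, in place.
if CFStringTransform(emoji as CFMutableString, nil,
                     kCFStringTransformToUnicodeName, false) {
    print(emoji) // contains the name PIG FACE
}
```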
Transliterate Between Orthographies
With the notable exception of English (and its delightful spelling inconsistencies), writing systems generally encode speech sounds into a consistent written representation. European languages generally use the Latin alphabet (with a few added diacritics), Russian uses Cyrillic, Japanese uses Hiragana & Katakana, and Thai, Korean, & Arabic each have their own scripts.
Although each language has a particular inventory of sounds, some of which other languages may lack, the overlap across all of the major writing systems is remarkably high—enough so that one can rather effectively transliterate (not to be confused with translation) from one script to another.
`CFStringTransform` can transliterate back and forth between Latin and Arabic, Cyrillic, Greek, Korean (Hangul), Hebrew, Japanese (Hiragana & Katakana), Mandarin Chinese, and Thai.
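Transformation | Input | Output |
---|---|---|
`kCFStringTransformLatinArabic` | mrḥbạ | مرحبا |
`kCFStringTransformLatinCyrillic` | privet | привет |
`kCFStringTransformLatinGreek` | geiá sou | γειά σου |
`kCFStringTransformLatinHangul` | annyeonghaseyo | 안녕하세요 |
`kCFStringTransformLatinHebrew` | şlwm | שלום |
`kCFStringTransformLatinHiragana` | hiragana | ひらがな |
`kCFStringTransformLatinKatakana` | katakana | カタカナ |
`kCFStringTransformLatinThai` | s̄wạs̄dī | สวัสดี |
`kCFStringTransformHiraganaKatakana` | にほんご | ニホンゴ |
`kCFStringTransformMandarinLatin` | 中文 | zhōng wén |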
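The "back and forth" part comes from the `reverse` argument. The sketch below, written against current Swift, takes the Cyrillic row from the table and runs it in both directions; the variable names are illustrative only.

```swift
import Foundation

let word = NSMutableString(string: "privet")

// Forward: Latin to Cyrillic.
CFStringTransform(word as CFMutableString, nil, kCFStringTransformLatinCyrillic, false)
let cyrillic = word as String // snapshot of the transliterated text

// Reverse: the same constant with reverse = true goes from Cyrillic back to Latin.
CFStringTransform(word as CFMutableString, nil, kCFStringTransformLatinCyrillic, true)
let latin = word as String

print(cyrillic, latin)
```

Round trips are not guaranteed to be lossless for every script, so treat the reverse direction as a best effort rather than an exact inverse.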
And that’s only using the constants defined in Core Foundation! By passing an ICU transform directly, `CFStringTransform` can transliterate between Latin and Arabic, Armenian, Bopomofo, Cyrillic, Georgian, Greek, Han, Hangul, Hebrew, Hiragana, Indic (Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, & Telugu), Jamo, Katakana, Syriac, Thaana, & Thai.
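Passing an ICU transform directly just means handing the function an ICU transform identifier as a string instead of one of the constants. The sketch below assumes the identifier `"Latin-Georgian"` is available in the ICU data shipped with the system, which is worth checking through the return value; the sample word is only an illustration.

```swift
import Foundation

let greeting = NSMutableString(string: "gamarjoba")

// An ICU transform ID, bridged to CFString, takes the place of a kCFStringTransform constant.
let ok = CFStringTransform(greeting as CFMutableString, nil,
                           "Latin-Georgian" as CFString, false)

// A false result means the identifier was not recognized or the transform failed.
print(ok ? (greeting as String) : "transform not available")
```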
Normalize User-Generated Content
One of the more practical applications for string transformation is to normalize unpredictable user input. Even if your application doesn’t specifically deal with other languages, you should be able to intelligently process anything the user types into your app.
For example, let’s say you want to build a searchable index of movies on the device, which includes greetings from around the world:
let mutableString = NSMutableString(string: "Hello! こんにちは! สวัสดี! مرحبا! 您好!") as CFMutableString
- First, apply the `kCFStringTransformToLatin` transform to transliterate all non-English text into a Latin alphabetic representation.
CFStringTransform(mutableString, nil, kCFStringTransformToLatin, false)
Hello! こんにちは! สวัสดี! مرحبا! 您好! → Hello! kon’nichiha! s̄wạs̄dī! mrḥbạ! nín hǎo!
- Next, apply the `kCFStringTransformStripCombiningMarks` transform to remove any diacritics or accents.
CFStringTransform(mutableString, nil, kCFStringTransformStripCombiningMarks, false)
Hello! kon’nichiha! s̄wạs̄dī! mrḥbạ! nín hǎo! → Hello! kon’nichiha! swasdi! mrhba! nin hao!
- Finally, downcase the text with `CFStringLowercase`, and split the text into tokens with `CFStringTokenizer` to use as an index for the text.
// Downcase in place so the index is case-insensitive.
CFStringLowercase(mutableString, CFLocaleCopyCurrent())
let tokenizer = CFStringTokenizerCreate(nil, mutableString, CFRangeMake(0, CFStringGetLength(mutableString)), 0, CFLocaleCopyCurrent())
var mutableTokens: [String] = []
// Advance until the tokenizer reports no more tokens (an empty token type).
while !CFStringTokenizerAdvanceToNextToken(tokenizer).isEmpty {
    let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    let token = CFStringCreateWithSubstring(nil, mutableString, range) as NSString
    mutableTokens.append(token as String)
}
(hello, kon’nichiha, swasdi, mrhba, nin, hao)
By applying the same set of transformations on search text entered by the user, you have a universal way to search regardless of either the language of the search string or content!
For anyone wanting to be especially clever, all of the necessary transformations can actually be done in a single pass, by specifying the ICU transform `"Any-Latin; Latin-ASCII; Any-Lower"`.
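A sketch of that single-pass variant, written against current Swift, might look like the following. The function name `searchKey(for:)` is hypothetical, and falling back to the untransformed input on failure is a design choice made here for illustration.

```swift
import Foundation

/// Hypothetical helper: normalizes text into a lowercase, ASCII-friendly search key.
func searchKey(for text: String) -> String {
    let buffer = NSMutableString(string: text)
    // Compound ICU transform: transliterate to Latin, fold to ASCII, then lowercase.
    let transform = "Any-Latin; Latin-ASCII; Any-Lower" as CFString
    guard CFStringTransform(buffer as CFMutableString, nil, transform, false) else {
        return text // leave the input untouched if the transform fails
    }
    return buffer as String
}

print(searchKey(for: "Café Zürich"))
print(searchKey(for: "Hello! こんにちは! สวัสดี! مرحبا! 您好!"))
```

The same function can be applied to both indexed content and user queries. Its output may not match the multi-step pipeline character for character (for example, in how punctuation is folded), so it is safest to pick one approach and use it on both sides of the comparison.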
`CFStringTransform` can be an insanely powerful way to bend language to your will. And it’s but one of many powerful features that await you if you’re brave enough to explore outside of Objective-C’s warm OO embrace.