Search Kit
NSHipsters love irony, right? How about this for irony:
There’s this framework called Search Kit, which despite being insanely powerful and useful for finding information, is something that almost no one has ever heard of.
It’s true! I’d reckon that more of you have implemented your own search functionality from scratch than have ever even heard of Search Kit. (Heck, most people haven’t even heard of Core Services, its parent framework.)
If only everyone knew that they could harness the same killer search functionality that Apple uses for their own applications…
Search Kit is a C framework for searching and indexing content in human languages. It supports matching on phrase or partial word, including logical (AND, OR) and wildcard (*) operators, and can rank results by relevance. Search Kit also provides document summarization, which is useful for generating representative excerpts. And best of all: it’s thread-safe.
All of the whiz-bang search-as-you-type features in OS X—from Mail.app and Xcode to System Preferences and Spotlight—use Search Kit under the hood.
But to understand how Search Kit does its magic, it’s important to explain some of the basics of Information Retrieval and Natural Language Processing.
Be sure to check out Apple’s Search Kit Programming Guide for an authoritative explanation of the what’s, why’s, and how’s of this great framework.
Search 101
Quoth Apple:
You have an information need. But before you can ask a question, you need someone or something to ask. That is, you need to establish who or what you will accept as an authority for an answer. So before you ask a question you need to define the target of your question.
Finding the answer in a reasonable amount of time requires effort from the start. This is what that process looks like in general terms:
Extract
First, content must be extracted from a corpus. For a text document, this could involve removing any styling, formatting, or other meta-information. For a data record, such as an NSManagedObject, this means taking all of the salient fields and combining them into a single text representation.
Once extracted, the content is tokenized for further processing.
Filter
In order to get the most relevant matches, it’s important to filter out common, or “stop” words like articles, pronouns, and helping verbs, that don’t really contribute to overall meaning.
Reduce
Along the same lines, words that mean basically the same thing should be reduced down into a common form. Morpheme clusters, such as grammatical conjugations like “computer”, “computers”, “computed”, and “computing”, for example, can all be simplified to be just “compute”, using a stemmer. Synonyms, likewise, can be lumped into a common entry using a thesaurus lookup.
Index
The end result of extracting, filtering, and reducing content into an array of normalized tokens is to form an inverted index, such that each token points to its origin in the index.
After repeating this process for each document or record in the corpus, each token can point to many different articles. In the process of searching, a query is mapped onto one or many of these tokens, retrieving the union of the articles associated with each token.
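The four steps above can be sketched as a toy inverted index in plain Swift. This is only an illustration of the general technique, not how Search Kit is implemented: the stop word list, the naive suffix-stripping `stem` function, and the `InvertedIndex` type are all made up for this example.

```swift
// Toy extract / filter / reduce / index pipeline. Every name here is hypothetical.

let stopWords: Set<String> = ["a", "an", "and", "is", "of", "the", "to"]

/// Extract: lowercase the text and split it into alphanumeric tokens.
func tokenize(_ text: String) -> [String] {
    return text.lowercased()
        .split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        .map { String($0) }
}

/// Reduce: a deliberately naive suffix stripper standing in for a real stemmer.
func stem(_ token: String) -> String {
    for suffix in ["ing", "ers", "er", "ed", "s"]
    where token.count > suffix.count + 2 && token.hasSuffix(suffix) {
        return String(token.dropLast(suffix.count))
    }
    return token
}

/// Filter, then reduce: drop stop words and normalize whatever is left.
func normalizedTerms(in text: String) -> [String] {
    return tokenize(text).filter { !stopWords.contains($0) }.map { stem($0) }
}

/// Index: map each normalized term to the set of documents that contain it.
struct InvertedIndex {
    var postings: [String: Set<String>] = [:]

    mutating func add(documentID: String, text: String) {
        for term in normalizedTerms(in: text) {
            postings[term, default: []].insert(documentID)
        }
    }

    /// Returns the union of the documents associated with each query term.
    func search(_ query: String) -> Set<String> {
        var result = Set<String>()
        for term in normalizedTerms(in: query) {
            result.formUnion(postings[term] ?? [])
        }
        return result
    }
}

var toyIndex = InvertedIndex()
toyIndex.add(documentID: "doc1", text: "Computers and the art of computing")
toyIndex.add(documentID: "doc2", text: "Kind of Blue is a jazz album")
```

Because queries go through the same normalization as documents, searching for "computed" finds the document that only mentions "computers" and "computing". A real stemmer and thesaurus would handle far more cases than this suffix list does.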
Using Search Kit
Creating an Index
SKIndex is the central data type in Search Kit, containing all of the information needed to process and fulfill searches, and add information from new documents. Indexes can be persistent / file-based or ephemeral / in-memory. Indexes can either be created from scratch, or loaded from an existing file or data object. Once you are finished with an index, you close it, as with many other C APIs.
When starting a new in-memory index, use an empty NSMutableData instance as the data store:
let mutableData = NSMutableData()
let index = SKIndexCreateWithMutableData(mutableData, nil, SKIndexType(kSKIndexInverted.rawValue), nil).takeRetainedValue()

NSMutableData *mutableData = [NSMutableData data];
SKIndexRef index = SKIndexCreateWithMutableData((__bridge CFMutableDataRef)mutableData, NULL, kSKIndexInverted, NULL);
Adding Documents to an Index
SKDocument is the data type associated with entries in the index. When a search is performed, documents (along with their context and relevance) are the results.
Each SKDocument is associated with a URI.
For documents on the file system, the URI is simply the location of the file on disk:
let fileURL = NSURL(fileURLWithPath: "/path/to/document")
let document = SKDocumentCreateWithURL(fileURL).takeRetainedValue()

NSURL *fileURL = [NSURL fileURLWithPath:@"/path/to/document"];
SKDocumentRef document = SKDocumentCreateWithURL((__bridge CFURLRef)fileURL);
For Core Data managed objects, the URI representation of the NSManagedObjectID can be used:
let objectURL = objectID.URIRepresentation()
let document = SKDocumentCreateWithURL(objectURL).takeRetainedValue()

NSURL *objectURL = [objectID URIRepresentation];
SKDocumentRef document = SKDocumentCreateWithURL((__bridge CFURLRef)objectURL);
For any other kinds of data, it would be up to the developer to define a URI representation.
When adding the contents of an SKDocument to an SKIndex, the text can either be specified manually:
let string = "Lorem ipsum dolor sit amet"
SKIndexAddDocumentWithText(index, document, string, true)

NSString *string = @"Lorem ipsum dolor sit amet";
SKIndexAddDocumentWithText(index, document, (__bridge CFStringRef)string, true);
…or collected automatically from a file:
let mimeTypeHint = "text/rtf"
SKIndexAddDocument(index, document, mimeTypeHint, true)

NSString *mimeTypeHint = @"text/rtf";
SKIndexAddDocument(index, document, (__bridge CFStringRef)mimeTypeHint, true);
To change the way a file-based document’s contents are processed, properties can be defined when creating the index:
let stopwords: Set = ["all", "and", "its", "it's", "the"]
let properties: [NSObject: AnyObject] = [
    "kSKStartTermChars": "", // additional starting-characters for terms
    "kSKTermChars": "-_@.'", // additional characters within terms
    "kSKEndTermChars": "", // additional ending-characters for terms
    "kSKMinTermLength": 3, // terms shorter than this are not indexed
    "kSKStopWords": stopwords // words to leave out of the index
]
let index = SKIndexCreateWithURL(url, nil, SKIndexType(kSKIndexInverted.rawValue), properties).takeRetainedValue()

NSSet *stopwords = [NSSet setWithObjects:@"all", @"and", @"its", @"it's", @"the", nil];
NSDictionary *properties = @{
    @"kSKStartTermChars": @"", // additional starting-characters for terms
    @"kSKTermChars": @"-_@.'", // additional characters within terms
    @"kSKEndTermChars": @"", // additional ending-characters for terms
    @"kSKMinTermLength": @(3), // terms shorter than this are not indexed
    @"kSKStopWords": stopwords // words to leave out of the index
};
SKIndexRef index = SKIndexCreateWithURL((__bridge CFURLRef)url, NULL, kSKIndexInverted, (__bridge CFDictionaryRef)properties);
After adding to or modifying an index’s documents, you’ll need to commit the changes to the backing store via SKIndexFlush to make your changes available to a search.
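As a rough sketch of that step, assuming a current Swift toolchain on macOS (where the call sites differ slightly from the older Swift syntax used elsewhere in this article), adding text and then flushing might look like this. The `example://notes/1` URL and the variable names are placeholders, not anything Search Kit requires.

```swift
#if canImport(CoreServices)
import CoreServices
import Foundation

// Assumption: an in-memory index and a made-up document URL, purely for illustration.
let flushData = NSMutableData()
let flushIndex = SKIndexCreateWithMutableData(flushData, nil, kSKIndexInverted, nil).takeRetainedValue()
let noteURL = URL(string: "example://notes/1")!
let note = SKDocumentCreateWithURL(noteURL as CFURL).takeRetainedValue()

// Text added here is not visible to searches until the index is flushed.
let added = SKIndexAddDocumentWithText(flushIndex, note, "Lorem ipsum dolor sit amet" as CFString, true)

// Commit pending changes to the backing store; the call reports success as a Bool.
let flushed = SKIndexFlush(flushIndex)
print("added: \(added), flushed: \(flushed)")
#endif
```

Flushing after a batch of additions, rather than after every document, is probably the more economical pattern, since each flush commits the index to its backing store.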
Searching
SKSearch is the data type constructed to perform a search on an SKIndex. It contains a reference to the index, a query string, and a set of options:
let query = "kind of blue"
let options = SKSearchOptions(kSKSearchOptionDefault)
let search = SKSearchCreate(index, query, options).takeRetainedValue()

NSString *query = @"kind of blue";
SKSearchOptions options = kSKSearchOptionDefault;
SKSearchRef search = SKSearchCreate(index, (__bridge CFStringRef)query, options);
SKSearchOptions is a bitmask with the following possible values:
kSKSearchOptionDefault: Default search options include:
- Relevance scores will be computed.
- Spaces in a query are interpreted as Boolean AND operators.
- Do not use similarity searching.
These options can be specified individually as well:
kSKSearchOptionNoRelevanceScores: This option saves time during a search by suppressing the computation of relevance scores.
kSKSearchOptionSpaceMeansOR: This option alters query behavior so that spaces are interpreted as Boolean OR operators.
kSKSearchOptionFindSimilar: This option alters query behavior so that Search Kit returns references to documents that are similar to an example text string. When this option is specified, Search Kit ignores all query operators.
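Since the options form a bitmask, individual flags can be combined with a bitwise OR. The snippet below is a small sketch of that idea, assuming the constants are imported into Swift as plain integers (as the `SKSearchOptions(kSKSearchOptionDefault)` conversion elsewhere in this article suggests); the variable name is made up.

```swift
#if canImport(CoreServices)
import CoreServices

// Skip relevance scoring and treat spaces in the query as OR instead of AND.
let fastLooseOptions = SKSearchOptions(kSKSearchOptionNoRelevanceScores | kSKSearchOptionSpaceMeansOR)
print(fastLooseOptions)
#endif
```

The resulting value can be passed to SKSearchCreate in place of the default options.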
Just creating an SKSearch kicks off the asynchronous search; results can be accessed with one or more calls to SKSearchFindMatches, which returns a batch of results at a time until you’ve seen all the matching documents. Iterating through the range of found matches provides access to the document URL and relevance score (if calculated):
let limit = ... // Maximum number of results
let time: NSTimeInterval = ... // Maximum time to get results, in seconds
var documentIDs: [SKDocumentID] = Array(count: limit, repeatedValue: 0)
var urls: [Unmanaged<CFURL>?] = Array(count: limit, repeatedValue: nil)
var scores: [Float] = Array(count: limit, repeatedValue: 0)
var foundCount = 0
// foundCount receives the number of matches actually written into the buffers
let hasMoreResults = SKSearchFindMatches(search, limit, &documentIDs, &scores, time, &foundCount)
SKIndexCopyDocumentURLsForDocumentIDs(index, foundCount, &documentIDs, &urls)
let results: [NSURL] = zip(urls[0 ..< foundCount], scores).flatMap({
    (cfurl, score) -> NSURL? in
    guard let url = cfurl?.takeRetainedValue() as NSURL? else { return nil }
    print("- \(url): \(score)")
    return url
})
NSUInteger limit = ...; // Maximum number of results
NSTimeInterval time = ...; // Maximum time to get results, in seconds
SKDocumentID documentIDs[limit];
CFURLRef urls[limit];
float scores[limit];
CFIndex foundCount;
Boolean hasMoreResults = SKSearchFindMatches(search, limit, documentIDs, scores, time, &foundCount);
SKIndexCopyDocumentURLsForDocumentIDs(index, foundCount, documentIDs, urls);
NSMutableArray *mutableResults = [NSMutableArray array];
// A plain loop is used here because blocks cannot capture C arrays
for (CFIndex idx = 0; idx < foundCount; idx++) {
    CFURLRef url = urls[idx];
    float relevance = scores[idx];
    NSLog(@"- %@: %f", url, relevance);
    if (url) {
        [mutableResults addObject:(__bridge NSURL *)url];
        CFRelease(url); // Balance the copy made by SKIndexCopyDocumentURLsForDocumentIDs
    }
}
For more examples of Search Kit in action, be sure to check out Indragie Karunaratne’s project, SNRSearchIndex.
And so this article becomes yet another document in the corpus we call the Internet. By pointing to Search Kit, and explaining even the briefest of its features, this string of tokens you read at this very moment is (perhaps) making it easier for others to find Search Kit.
…and it’s a good thing, too, because Search Kit is a wonderful and all-too-obscure framework, which anyone building a content-based system would do well to investigate.