Parsing Words for Pronunciation

Apple's VoiceOver technology is a powerful screen reader built into every Apple device. It can read both the visible text and the various accessibility attributes available to views and controls to provide an audio description of your user interface. However, even with the advances in computer speech synthesis over the last few decades, it's not always able to deduce your intended pronunciation from the context. This is particularly irksome in English, where there are several common homographs: words that are spelled the same but pronounced differently. If VoiceOver doesn't pick the right version it can be very annoying and even misleading for your users. So how do we fix it?

Luckily Apple offers support for the International Phonetic Alphabet through annotations. You can add these annotations to NSAttributedString representations of your text content, even if you don't otherwise use attributed strings in your interface. This attribute, .accessibilitySpeechIPANotation, is available in iOS 11 and later. For example, to correct the pronunciation of lead (as in the metal) to lead (as in leader), we add the attribute with the appropriate phonetic string. This attributed string can then be set as your view's accessibilityAttributedLabel.

// Create an NSMutableAttributedString from the original string so we can add an attribute.
let attributedString = NSMutableAttributedString(string: string)
// Find the range of "lead".
let range = attributedString.mutableString.range(of: "lead")
// Use IPA notation to set the long e pronunciation: /i:/.
attributedString.addAttributes([.accessibilitySpeechIPANotation: "l/i:/d"], range: range)

However, this simple usage has several drawbacks.

  • It incorrectly finds any usage of lead, even as part of other words.
  • It only applies the attribute to the first instance of the word it finds.
  • If we want to change the pronunciation of the plural we have to search for separate ranges, which is inefficient.
  • It only works for lowercased values. We could normalize the initial attributedString with the string.lowercased() value, but that breaks pronunciation emphasis rules around capital letters.
  • It only works for English. Of course, your pronunciation issues probably only exist in English, so that might be okay, but it would be good to be internationalized.
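
The first two drawbacks, along with the case problem, are easy to reproduce in isolation. The following is a minimal sketch rather than part of the final solution; the sample sentence and variable names are made up for illustration.

```swift
import Foundation

// Hypothetical sample: "lead" occurs inside "unleaded" before it occurs as a word.
let drawbackSample = NSString(string: "Use unleaded fuel, not lead. Lead is toxic.")

// range(of:) reports only the first match, and that match is inside "unleaded".
let firstMatch = drawbackSample.range(of: "lead")
print(firstMatch.location, firstMatch.length) // 6 4

// Searching the remainder finds the standalone "lead", but every further
// occurrence needs another manual search like this one.
let searchStart = NSMaxRange(firstMatch)
let remainder = NSRange(location: searchStart, length: drawbackSample.length - searchStart)
let secondMatch = drawbackSample.range(of: "lead", options: [], range: remainder)
print(secondMatch.location, secondMatch.length) // 23 4

// The capitalized "Lead" is never found by this case-sensitive search.
let afterSecond = NSMaxRange(secondMatch)
let tail = NSRange(location: afterSecond, length: drawbackSample.length - afterSecond)
let thirdMatch = drawbackSample.range(of: "lead", options: [], range: tail)
print(thirdMatch.location == NSNotFound) // true
```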

So we need a solution that allows us to find all instances of lead, but only when used as a word on its own, makes it efficient to fix multiple pronunciations, leaves capitalization intact, and can be internationalized. A tall order! Or perhaps not.

Like all good programmers in the modern age we can start by standing on the shoulders of giants (or a large pile of smaller works created over the last 50 years). While Swift provides no native String scanning or tokenization APIs other than simple, manual slicing, there exist word and other String scanning APIs in various Apple frameworks that can be used in Swift, so let's start there. Various summaries of these APIs are available online, but this one by Søren L Kristiansen is a good summary of some word-based approaches. However, it is quite dated (Swift used to age very quickly) so we can't just copy the code directly. Instead, we can take the article's performance results and choose the basis for our solution: CFStringTokenizer. While its API is not the most Swift-friendly, it's highly performant and accurate enough for our use. So let's get started.


We start by constructing the CFStringTokenizer instance that we'll use to find the words in our Strings. All of these examples take place within a String extension.

let enUS = Locale(identifier: "en_US")
let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,         // 1
                                        self as CFString,            // 2
                                        CFRangeMake(0, utf16.count), // 3
                                        kCFStringTokenizerUnitWord,  // 4
                                        enUS as CFLocale)            // 5

CoreFoundation uses free functions rather than an initializer, as we'd usually see in Swift, and its unfortunate lack of parameter labels makes this somewhat unintelligible, so let's break it down.

  1. We must provide a CFAllocator. This allows low level customization of our memory allocation, but we don't care, so just pass the default allocator kCFAllocatorDefault.
  2. Next the actual string, but it must be a CFString. Luckily Swift's String can be cast to that representation directly due to its automatic bridging to NSString and NSString's bridging to CFString.
  3. Now we provide the tokenizer the CFRange we want to operate over. A CFRange is composed of a starting location (0 for the beginning of the string) and a length. Given that CFString, like NSString, measures its length in UTF-16 code units, while Swift's String counts Characters and stores its contents as UTF-8, we can't just provide the count of the String directly. Instead we must calculate that length in UTF-16. Luckily String provides a convenient utf16 view we can use to get that count.
  4. CFStringTokenizer can tokenize on different types of boundaries so we must provide a CFOptionFlags value to tell it which boundaries we care about. In this case we only care about word boundaries, so we provide kCFStringTokenizerUnitWord.
  5. We can provide a CFLocale to indicate under what language's rules we want tokenization to be performed, as different languages have different tokenization logic. Apple's documentation says to use CFLocaleCopyCurrent() to provide the user's current locale. This would be important if we were tokenizing text entered by the user in their preferred language but in this case we're customizing the pronunciation for a specific language, English. So we provide the US English Locale, cast to CFLocale using the same sort of bridging String has, which should also work for other English dialects. If your app is fully localized you could use this setting to customize the CFLocale based on the current active localization, but this example won't go that far.
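
To see why step 3 uses utf16.count rather than count, here is a small sketch comparing the different lengths of the same String. The sample text is hypothetical and is written with Unicode escapes so the exact scalars are unambiguous.

```swift
// Hypothetical sample: "café" with a precomposed é, a space, and a thumbs-up emoji.
let lengthSample = "caf\u{E9} \u{1F44D}"

let characterCount = lengthSample.count      // user-perceived Characters
let utf16Count = lengthSample.utf16.count    // what CFString and NSString measure
let utf8Count = lengthSample.utf8.count      // Swift's native storage encoding

print(characterCount, utf16Count, utf8Count) // 6 7 10
```

Passing count here would give the tokenizer a range one code unit too short, cutting the emoji's surrogate pair in half.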

Once we've built our tokenizer, we need to iterate over all of the tokens it finds. We do this by looping over the CFStringTokenizerTokenType values produced by CFStringTokenizerAdvanceToNextToken until there is no result. CFStringTokenizerTokenType allows us to check the kind of boundary (defined by the Unicode standard) used to parse the token but in this case we don't care. Once there are no more boundaries we know we've reached the end of the string.

while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] { // 1
    let cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer) // 2
    guard let range = Range(NSRange(location: cfRange.location, length: cfRange.length), in: self) else { return } // 3
    let word = self[range] // 4
}

We can examine this loop more closely.

  1. To advance through the tokens generated by the tokenizer, we call CFStringTokenizerAdvanceToNextToken and give it a reference to our already created tokenizer. We continue this advancement only while there are boundaries. This results in a somewhat peculiar API in Swift, as a native API would likely just return an Optional result directly, but that's the price we pay for using such a low level API.
  2. For each token we need to grab its CFRange. This should be the range of the word the tokenizer has found for us.
  3. Unlike the CFString -> NSString -> String bridging we get for free, there's no such relationship between CFRange, NSRange, and String's native Range<String.Index> type. Instead we must manually create an NSRange from the location and length of the CFRange and then translate that NSRange into a native Range<String.Index> by using the Range(_:in:) initializer. The initializer can fail if the range is outside the String instance, so we guard to unwrap it. We should never see a failure here since we're operating on ranges returned by the tokenizer from within the String.
  4. We can then slice out the word from the String, giving us a Substring for each word.
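
The range translation in step 3 can be tried on its own, without a tokenizer. This is a minimal sketch with a hypothetical sample string; it also shows the reverse NSRange(_:in:) initializer, which converts back in the other direction.

```swift
import Foundation

// Hypothetical sample containing an emoji that occupies two UTF-16 code units.
let rangeSample = "I \u{1F44D} Swift"

// "Swift" starts at UTF-16 offset 5: "I" (1) + space (1) + emoji (2) + space (1).
let nsRange = NSRange(location: 5, length: 5)

// Translate the UTF-16 based NSRange into a native Range<String.Index>.
let converted = Range(nsRange, in: rangeSample)
if let range = converted {
    print(rangeSample[range])                            // Swift
    // Converting back yields the NSRange we started with.
    print(NSRange(range, in: rangeSample) == nsRange)    // true
}

// A range that lies outside the string cannot be converted, hence the guard.
let outOfBounds = Range(NSRange(location: 50, length: 1), in: rangeSample)
print(outOfBounds == nil)                                // true
```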

Now that we can get the words of a String, how do we accomplish our goal of annotating particular pronunciation? How should we access words to add the pronunciation attribute?

Building an API

Simply providing access to a collection of words in a String with a function, such as words() -> [String], isn't enough for our intended use. We also need the range of each word so we can properly apply the attribute. We could instead return an array of tuples of (word: String, range: Range<String.Index>) rather than just the word, but this may introduce other inefficiencies. For instance, we'd have to create Strings from every word's Substring, which duplicates almost our entire String into memory. Additionally, creating an entire collection first and then iterating it again to perform our work is fundamentally unnecessary. If we design an API that lets us iterate and perform work on Substrings at the same time we can be more efficient. With this efficient base API we can then compose new APIs with more complex capabilities.

Let's start simple and provide a way to iterate every word in a String as a Substring. Since we'll also need the Range, our API should make it available too. We can start by composing our properly configured CFStringTokenizer into a function that takes a closure to provide access to each word and its range.

func byWord(perform closure: (_ word: Substring, _ wordRange: Range<String.Index>) -> Void) {
    let enUS = Locale(identifier: "en_US")
    let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                            self as CFString,
                                            CFRangeMake(0, utf16.count),
                                            kCFStringTokenizerUnitWord,
                                            enUS as CFLocale)

    while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
        let cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        guard let range = Range(NSRange(location: cfRange.location, length: cfRange.length), in: self) else { return }
        closure(self[range], range)
    }
}

This provides maximum flexibility while requiring only a single iteration to perform whatever work we need. Let's try it out.

let string = "Swift is a programming language."
string.byWord { word, range in
    print("\(word): \(range)")
}

This gives us the output:

Swift: Index(_rawBits: 1)..<Index(_rawBits: 327680)
is: Index(_rawBits: 393216)..<Index(_rawBits: 524288)
a: Index(_rawBits: 589824)..<Index(_rawBits: 655360)
programming: Index(_rawBits: 720896)..<Index(_rawBits: 1441792)
language: Index(_rawBits: 1507328)..<Index(_rawBits: 2031616)

(String's Index type doesn't correspond to Character indexes, so they don't really make sense to read like this.)
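
If you want offsets you can actually read while debugging, one option is to measure each bound's distance from the start of the String. This is a small sketch; characterOffsets(of:in:) is a hypothetical helper, not part of the API built in this post.

```swift
import Foundation

// Hypothetical debugging helper: converts a Range<String.Index> into Character offsets.
func characterOffsets(of range: Range<String.Index>, in string: String) -> Range<Int> {
    let lower = string.distance(from: string.startIndex, to: range.lowerBound)
    let upper = string.distance(from: string.startIndex, to: range.upperBound)
    return lower..<upper
}

let sentence = "Swift is a programming language."
// Any Range<String.Index> works here; range(of:) is just a quick way to get one.
if let range = sentence.range(of: "programming") {
    print(characterOffsets(of: range, in: sentence)) // 11..<22
}
```

Computing these distances walks the String from its start, so this is best kept to logging and tests rather than hot paths.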

So we can get our words and ranges. We could use this API directly to find the words we need but it would be simpler if we didn't have to filter out the words we don't care about manually. So let's add a convenience function that only calls the closure when it encounters a word we care about.

func onWords(_ words: Set<Substring>, perform closure: (_ word: Substring, _ range: Range<String.Index>) -> Void) {
    byWord { word, range in
        guard words.contains(word) else { return }
        closure(word, range)
    }
}

This onWords function lets us pass any number of words (as a Set for fast contains checking) to use as a filter for when the closure is called with a word. We can use it to filter down our list to only the words we care about.

let string = "Swift is a programming language."
string.onWords(["is", "programming"]) { word, range in
    print("\(word): \(range)")
}

Running this snippet gives us the output:

is: Index(_rawBits: 393216)..<Index(_rawBits: 524288)
programming: Index(_rawBits: 720896)..<Index(_rawBits: 1441792)

However, this convenience method is missing one of our previous requirements: detection of every instance of a word regardless of case. There are several ways we could provide a normalization to deal with this, but in this case simply enabling a case-insensitive comparison is enough. Unfortunately this means we lose our fast contains checking in the insensitive case, but since our words Set is expected to be very small the overall difference should be minimal. We'll default to the fast path just in case. By putting this complexity in our convenience function we leave the base implementation untouched.

func onWords(_ words: Set<Substring>, caseSensitively: Bool = true, perform closure: (_ word: Substring, _ range: Range<String.Index>) -> Void) {
    byWord { word, range in
        let wordsContainsWord: Bool
        if caseSensitively {
            wordsContainsWord = words.contains(word)
        } else {
            wordsContainsWord = words.contains { $0.caseInsensitiveCompare(word) == .orderedSame }
        }
        guard wordsContainsWord else { return }

        closure(word, range)
    }
}

This allows us to insensitively match words. For example:

let string = "Swift is a programming language."
string.onWords(["swift", "programming"], caseSensitively: false) { word, range in
    print("\(word): \(range)")
}

Running this snippet gives us the output:

Swift: Index(_rawBits: 1)..<Index(_rawBits: 327680)
programming: Index(_rawBits: 720896)..<Index(_rawBits: 1441792)
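
For comparison, here is a sketch of one of the normalization alternatives mentioned above: lowercase the filter words once and keep the fast Set lookup. This is not the approach onWords takes; makeCaseInsensitiveMatcher is a hypothetical helper, and plain split(separator:) stands in for the tokenizer to keep the example self-contained.

```swift
// Hypothetical helper: lowercases the filter words once so that each lookup
// remains a constant-time Set membership check.
func makeCaseInsensitiveMatcher(for words: Set<String>) -> (Substring) -> Bool {
    let normalized = Set(words.map { $0.lowercased() })
    return { word in normalized.contains(word.lowercased()) }
}

let matches = makeCaseInsensitiveMatcher(for: ["Swift", "programming"])

// Splitting on spaces is only a stand-in for real word tokenization.
let candidates = "swift is a Programming language".split(separator: " ")
let hits = candidates.filter(matches)
print(hits) // ["swift", "Programming"]
```

The trade-off is an extra String allocation per word for lowercased(), and simple lowercasing may not agree with caseInsensitiveCompare for every language, which is part of why the small-Set linear scan is a reasonable choice here.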

Now we're ready to change some pronunciation.

Putting It All Together

We're now ready to update our original example to use our new, more accurate word parsing API.

func leadPronunciationCorrectedAttributedString() -> NSAttributedString {
    let attributedString = NSMutableAttributedString(string: self) // 1
    onWords(["lead", "leads"], caseSensitively: false) { word, range in
        let pronunciation = (word.lowercased() == "lead") ? "l/i:/d" : "l/i:/ds" // 2
        attributedString.addAttribute(.accessibilitySpeechIPANotation, value: pronunciation, range: NSRange(range, in: self)) // 3
    }

    return attributedString.copy() as! NSAttributedString // 4
}

Our extra logic here is as follows:

  1. Create the NSMutableAttributedString just like before.
  2. Inside our onWords closure, look at which version of lead we've identified and set the appropriate pronunciation String, singular or plural. This check is simple enough a ternary expression is nicely compact while still readable.
  3. Add the attribute to the attributed string using the proper pronunciation for the proper NSRange. Once again we must translate our ranges between types, this time from Range<String.Index> to NSRange. In this case there's another NSRange initializer to do the work for us.
  4. Given NSAttributedString's Objective-C heritage we must manually copy our result to an immutable type, otherwise the mutability could return down the line.

This code now produces the proper output for our two words in the new NSAttributedString.

let lotsOfLeads = "lead Leads leadership unleaded lead Leads leads"
let corrected = lotsOfLeads.leadPronunciationCorrectedAttributedString()
print(corrected)

This snippet produces the following output:

lead{
    UIAccessibilitySpeechAttributeIPANotation = "l/i:/d";
} {
}Leads{
    UIAccessibilitySpeechAttributeIPANotation = "l/i:/ds";
} leadership unleaded {
}lead{
    UIAccessibilitySpeechAttributeIPANotation = "l/i:/d";
} {
}Leads{
    UIAccessibilitySpeechAttributeIPANotation = "l/i:/ds";
} {
}leads{
    UIAccessibilitySpeechAttributeIPANotation = "l/i:/ds";
}

As you can see, our attributes are correctly set for both the singular and plural versions, regardless of case, with no overlap with spaces or other words. The code skips words that merely contain "lead", and it does all of this while iterating the original String only once.

Wrapping Up

In this post we've seen how to use CFStringTokenizer to provide a performant general API to find the words in a String, as well as how to create a convenience API that makes our use case much nicer while not compromising functionality or performance. This sort of API could be extended in several additional ways, including:

  • API to make it easier to map between many words and pronunciations.
  • A lazy wrapper for our string tokenizer so that we don't need to tokenize an entire string if we only want the first word.
  • Extensions to relevant views, like UILabel, to add these corrections automatically.

But I leave these as an exercise for the reader. 😉

Thanks for reading!
