Parsing Words for Pronunciation
Apple's VoiceOver technology is a powerful screen reader built into every Apple device. It can read both the visible text and the various accessibility attributes of views and controls to provide an audio description of your user interface. However, even with advances in computer speech synthesis over the last few decades, it can't always deduce your intended pronunciation from context. This is particularly irksome in English, which has many common homographs: words that are spelled the same but pronounced differently. If VoiceOver doesn't pick the right pronunciation, it can be annoying and even misleading for your users. So how do we fix it?
Luckily, Apple offers support for the International Phonetic Alphabet (IPA) through annotations. You can add these annotations to NSAttributedString representations of your text content, even if you don't otherwise use attributed strings in your interface. This attribute, .accessibilitySpeechIPANotation, is available in iOS 11 and later. For example, to correct the pronunciation of lead (as in the metal) to lead (as in leader), we add the attribute with the appropriate phonetic string. This attributed string can then be set as your view's accessibilityAttributedLabel.
import UIKit

// A sample string containing the word we want to correct.
let string = "Follow my lead."
// Create an NSMutableAttributedString from the original string so we can add an attribute.
let attributedString = NSMutableAttributedString(string: string)
// Find the range of the first occurrence of "lead".
let range = attributedString.mutableString.range(of: "lead")
// Use IPA notation to set the long e pronunciation: /i:/.
attributedString.addAttributes([.accessibilitySpeechIPANotation: "l/i:/d"], range: range)
However, this simple usage has several drawbacks.
- It incorrectly finds any usage of lead, even as part of other words.
- It only applies the attribute to the first instance of the word it finds.
- If we want to change the pronunciation of the plural, we have to search for separate ranges, which is inefficient.
- It only works for lowercased values. We could normalize the initial attributedString with the string.lowercased() value, but that breaks pronunciation emphasis rules around capital letters.
- It only works for English. Of course, your pronunciation issues probably only exist in English, so that might be okay, but it would be better to be ready for internationalization.
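The first two drawbacks can be demonstrated with a small sketch that uses only Foundation. The sample sentence is made up for illustration. NSString's range(of:) performs a plain, case-sensitive substring search, so its first match can fall inside another word. To find every match, you have to loop over the rest of the string yourself:

```swift
import Foundation

// A made-up sentence with "lead" capitalized, inside another word, and on its own.
let text = NSString(string: "Lead on, our leadership will lead.")

// range(of:) returns only the first case-sensitive match: here, inside "leadership".
let firstMatch = text.range(of: "lead")
print(firstMatch.location) // 13

// Finding every match means repeatedly searching what's left of the string.
var matchLocations: [Int] = []
var searchRange = NSRange(location: 0, length: text.length)
while true {
    let found = text.range(of: "lead", options: [], range: searchRange)
    if found.location == NSNotFound { break }
    matchLocations.append(found.location)
    let nextStart = found.location + found.length
    searchRange = NSRange(location: nextStart, length: text.length - nextStart)
}
// "Lead" at 0 is missed because of its case, and 13 is part of "leadership".
print(matchLocations) // [13, 29]
```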
So we need a solution that finds all instances of lead, but only when it is used as a word on its own. It should also make it efficient to fix multiple pronunciations, leave capitalization intact, and be ready for internationalization. A tall order! Or perhaps not.
Like all good programmers in the modern age, we can start by standing on the shoulders of giants (or on a large pile of smaller works created over the last 50 years). Swift provides no native String scanning or tokenization APIs beyond simple, manual slicing. However, various Apple frameworks offer word-based and other String scanning APIs that can be used from Swift, so let's start there. Several comparisons of these APIs are available online. One by Søren L Kristiansen gives a good summary of word-based approaches, but it is quite dated (Swift used to age very quickly), so we can't just copy its code directly. Instead, we can use the article's performance results to choose the basis for our solution: CFStringTokenizer. Its API is not the most Swift-friendly, but it's highly performant and accurate enough for our use. So let's get started.
CFStringTokenizer
We start by constructing the CFStringTokenizer instance that we'll use to find the words in our Strings. All of these examples take place within a String extension.
let enUS = Locale(identifier: "en_US")
let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault, // 1
self as CFString, // 2
CFRangeMake(0, utf16.count), // 3
kCFStringTokenizerUnitWord, // 4
enUS as CFLocale) // 5
Core Foundation uses free functions rather than the initializers we'd usually see in Swift. Its lack of parameter labels also makes this call hard to read, so let's break it down.
- We must provide a CFAllocator. This allows low-level customization of memory allocation, but we don't need that, so we just pass the default allocator, kCFAllocatorDefault.
- Next comes the actual string, which must be a CFString. Luckily, Swift's String can be cast to that representation directly. This works because String bridges automatically to NSString, and NSString bridges to CFString.
- Now we give the tokenizer the CFRange we want it to operate over. A CFRange is composed of a starting location (0 for the beginning of the string) and a length. CFString, like NSString, measures lengths in UTF-16 code units. A Swift String's count, by contrast, is measured in Characters, and native Swift strings are stored as UTF-8. That means we can't just pass the String's count directly. Instead we must calculate the length in UTF-16. Luckily, String provides a convenient utf16 view whose count gives us exactly that.
- CFStringTokenizer can tokenize on different types of boundaries, so we must provide a CFOptionFlags value that tells it which boundaries we care about. In this case we only care about word boundaries, so we provide kCFStringTokenizerUnitWord.
- We can provide a CFLocale to indicate which language's rules should govern tokenization, since different languages have different tokenization logic. Apple's documentation says to use CFLocaleCopyCurrent() to provide the user's current locale. That would matter if we were tokenizing text the user entered in their preferred language. Here, though, we're customizing the pronunciation for one specific language: English. So we provide a US English Locale, cast to CFLocale through the same kind of bridging String has. Its word rules should also suit other English dialects. If your app is fully localized, you could use this parameter to choose the CFLocale based on the currently active localization, but this example won't go that far.
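The UTF-16 point is easy to overlook because, for plain ASCII text, the two counts agree. The following illustration uses standard Swift and Foundation, with arbitrary sample strings. It shows where the counts diverge, and it confirms that NSString's length, which CFString shares, matches the utf16 count:

```swift
import Foundation

let greeting = "Hi \u{1F44B}"     // "Hi 👋"
let flag = "\u{1F1FA}\u{1F1F8}"   // 🇺🇸: two regional indicator scalars

// Swift's count is measured in Characters (extended grapheme clusters).
print(greeting.count, flag.count)             // 4 1
// CFRange/NSRange lengths are in UTF-16 code units. Each of these scalars lies
// outside the Basic Multilingual Plane, so each needs a surrogate pair.
print(greeting.utf16.count, flag.utf16.count) // 5 4
// NSString (and therefore CFString) reports the UTF-16 length.
print(NSString(string: greeting).length)      // 5
```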
Once we've built our tokenizer, we need to iterate over all of the tokens it finds. We do this by calling CFStringTokenizerAdvanceToNextToken in a loop until it returns an empty CFStringTokenizerTokenType (kCFStringTokenizerTokenNone). An empty result means there are no more tokens and we've reached the end of the string. The returned CFStringTokenizerTokenType also describes the token that was found (for example, whether it contains numbers or non-letter characters), but in this case we don't care.
while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] { // 1
let cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer) // 2
guard let range = Range(NSRange(location: cfRange.location, length: cfRange.length), in: self) else { return } // 3
let word = self[range] // 4
}
We can examine this loop more closely.
- To advance through the tokens generated by the tokenizer, we call CFStringTokenizerAdvanceToNextToken and pass it our previously created tokenizer. We keep advancing only while it returns a non-empty token type. This is a somewhat peculiar API in Swift, where a native API would likely just return an Optional, but that's the price we pay for using such a low-level API.
- For each token we grab its CFRange, measured in UTF-16 code units. This is the range of the word the tokenizer has found for us.
- We get CFString -> NSString -> String bridging for free, but no such relationship exists between CFRange, NSRange, and String's native Range<String.Index> type. Instead we must manually create an NSRange from the location and length of the CFRange. We then convert that NSRange into a native Range<String.Index> using the Range(_:in:) initializer. The initializer can fail if the range falls outside the String, so we guard to unwrap it. We should never see a failure here, since we're operating on ranges that the tokenizer returned from within the String.
- We can then slice the word out of the String, giving us a Substring for each word.
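The range translation is where UTF-16 offsets matter most. The sketch below fakes a tokenizer result with a hand-written NSRange, standing in for a CFRange's location and length, and converts it in both directions. The sample string and offsets are illustrative, not real tokenizer output:

```swift
import Foundation

let sentence = "\u{1F44B} lead the way" // "👋 lead the way"

// Stand-in for a tokenizer's CFRange. 👋 takes two UTF-16 units and is followed
// by one space, so "lead" starts at UTF-16 offset 3 and is 4 units long.
let nsRange = NSRange(location: 3, length: 4)

// NSRange -> Range<String.Index>. The result is nil if the range doesn't fit the string.
let range = Range(nsRange, in: sentence)
let word = range.map { sentence[$0] }
print(word ?? "none") // lead

// Range<String.Index> -> NSRange recovers the same UTF-16 offsets.
let roundTrip = range.map { NSRange($0, in: sentence) }
print(roundTrip == nsRange) // true

// An out-of-bounds range produces nil instead of trapping.
print(Range(NSRange(location: 10, length: 100), in: sentence) == nil) // true
```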
Now that we can get the words of a String, how do we accomplish our goal of annotating the pronunciation of particular words? How should we access the words so we can add the pronunciation attribute?
Building an API
Simply providing access to a collection of the words in a String, through a function such as words() -> [String], isn't enough for our intended use. We also need the range of each word so we can apply the attribute correctly. We could instead return an array of (word: String, range: Range<String.Index>) tuples, but this may introduce other inefficiencies. For instance, we'd have to create a String from every word's Substring, which copies almost our entire String in memory a second time. Additionally, building an entire collection first and then iterating it again to perform our work is fundamentally unnecessary. If we design an API that lets us iterate over Substrings and perform work on them at the same time, we can be more efficient. With this efficient base API, we can then compose new APIs with more complex capabilities.
Let's start simple and provide a way to iterate over every word in a String as a Substring. Since we'll also need each word's Range, our API should provide that too. We can start by wrapping our properly configured CFStringTokenizer in a function that takes a closure, which receives each word and its range.
func byWord(perform closure: (_ word: Substring, _ wordRange: Range<String.Index>) -> Void) {
let enUS = Locale(identifier: "en_US")
let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
self as CFString,
CFRangeMake(0, utf16.count),
kCFStringTokenizerUnitWord,
enUS as CFLocale)
while CFStringTokenizerAdvanceToNextToken(tokenizer) != [] {
let cfRange = CFStringTokenizerGetCurrentTokenRange(tokenizer)
guard let range = Range(NSRange(location: cfRange.location, length: cfRange.length), in: self) else { return }
closure(self[range], range)
}
}
This provides maximum flexibility while requiring only a single iteration to perform whatever work we need. Let's try it out.
let string = "Swift is a programming language."
string.byWord { word, range in
print("\(word): \(range)")
}
This gives us the output:
Swift: Index(_rawBits: 1)..<Index(_rawBits: 327680)
is: Index(_rawBits: 393216)..<Index(_rawBits: 524288)
a: Index(_rawBits: 589824)..<Index(_rawBits: 655360)
programming: Index(_rawBits: 720896)..<Index(_rawBits: 1441792)
language: Index(_rawBits: 1507328)..<Index(_rawBits: 2031616)
(String's Index type is an opaque position rather than an integer Character offset, so these values don't really make sense to read like this.)
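If you want readable positions while experimenting, one option is to convert the indexes into Character offsets with distance(from:to:). The characterOffsets(of:in:) helper below is hypothetical. It uses Foundation's range(of:) only to obtain a sample range; inside the byWord closure, you would pass the range that closure provides:

```swift
import Foundation

let string = "Swift is a programming language."

// Hypothetical debugging helper: turns an opaque String.Index range into
// Character offsets from the start of the string.
func characterOffsets(of range: Range<String.Index>, in string: String) -> Range<Int> {
    let lower = string.distance(from: string.startIndex, to: range.lowerBound)
    let upper = string.distance(from: string.startIndex, to: range.upperBound)
    return lower..<upper
}

if let range = string.range(of: "programming") {
    print(characterOffsets(of: range, in: string)) // 11..<22
}
```

Because distance(from:to:) walks the string from the start, a helper like this is better suited to debugging output than to hot paths.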
So we can get our words and ranges. We could use this API directly to find the words we need but it would be simpler if we didn't have to filter out the words we don't care about manually. So let's add a convenience function that only calls the closure when it encounters a word we care about.
func onWords(_ words: Set<Substring>, perform closure: (_ word: Substring, _ range: Range<String.Index>) -> Void) {
byWord { word, range in
guard words.contains(word) else { return }
closure(word, range)
}
}
This onWords function lets us pass any number of words (as a Set, for fast contains checking) to act as a filter, so the closure is only called for words in that set. We can use it to narrow our list down to only the words we care about.
let string = "Swift is a programming language."
string.onWords(["is", "programming"]) { word, range in
print("\(word): \(range)")
}
Running this snippet gives us the output:
is: Index(_rawBits: 393216)..<Index(_rawBits: 524288)
programming: Index(_rawBits: 720896)..<Index(_rawBits: 1441792)
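Two details make the Set<Substring> parameter convenient. First, a set literal works directly because Substring is ExpressibleByStringLiteral. Second, Substring equality compares contents, so a slice of one string matches an equal slice of another. Words loaded at runtime as [String] do need converting first. The following pure-Swift sketch uses a made-up word list:

```swift
// Hypothetical word list, for example loaded from a configuration file at runtime.
let loadedWords: [String] = ["is", "programming", "is"]

// Convert to the Set<Substring> that onWords(_:) expects. Duplicates collapse.
let filter = Set(loadedWords.map { Substring($0) })
print(filter.count) // 2

// A slice of a different string still matches, because equality is by content.
let sentence = "Swift is a programming language."
let slice = sentence.dropFirst(6).prefix(2) // "is"
print(filter.contains(slice)) // true
```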
However, this convenience method is missing one of our earlier requirements: detecting every instance of a word regardless of case. There are several ways we could normalize the words to deal with this, but here simply enabling a case-insensitive comparison is enough. Unfortunately, this means we lose our fast contains check in the case-insensitive path. Since our words Set is expected to be very small, though, the overall difference should be minimal. We'll still default to the fast, case-sensitive path. By putting this complexity in our convenience function, we leave the base implementation untouched.
func onWords(_ words: Set<Substring>, caseSensitively: Bool = true, perform closure: (_ word: Substring, _ range: Range<String.Index>) -> Void) {
byWord { word, range in
let wordsContainsWord: Bool
if caseSensitively {
wordsContainsWord = words.contains(word)
} else {
wordsContainsWord = words.contains { $0.caseInsensitiveCompare(word) == .orderedSame }
}
guard wordsContainsWord else { return }
closure(word, range)
}
}
This allows us to match words case-insensitively. For example:
let string = "Swift is a programming language."
string.onWords(["swift", "programming"], caseSensitively: false) { word, range in
print("\(word): \(range)")
}
Running this snippet gives us the output:
Swift: Index(_rawBits: 1)..<Index(_rawBits: 327680)
programming: Index(_rawBits: 720896)..<Index(_rawBits: 1441792)
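For larger word sets, one alternative (not the approach used above) is to normalize once up front. You lowercase the filter words into a Set<String>, then lowercase each candidate word before the lookup. This keeps constant-time lookups at the cost of allocating a lowercased String for every word. Note that lowercased() and caseInsensitiveCompare(_:) are not guaranteed to agree under every language's case rules. Here is a pure-Swift sketch with a hypothetical helper name:

```swift
// Hypothetical helper: builds a case-insensitive membership test by lowercasing
// the filter words once, then lowercasing each candidate before the Set lookup.
func makeCaseInsensitiveMatcher(for words: Set<Substring>) -> (Substring) -> Bool {
    let normalized = Set(words.map { $0.lowercased() })
    return { candidate in normalized.contains(candidate.lowercased()) }
}

let isTargetWord = makeCaseInsensitiveMatcher(for: ["swift", "programming"])
print(isTargetWord("Swift"))    // true
print(isTargetWord("LANGUAGE")) // false
```

Inside a function like onWords, you would build the matcher once, before calling byWord, and then call it from the closure.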
Now we're ready to change some pronunciation.
Putting It All Together
We're now ready to update our original example to use our new, more accurate word-parsing API.
func leadPronunciationCorrectedAttributedString() -> NSAttributedString {
let attributedString = NSMutableAttributedString(string: self) // 1
onWords(["lead", "leads"], caseSensitively: false) { word, range in
let pronunciation = (word.lowercased() == "lead") ? "l/i:/d" : "l/i:/ds" // 2
attributedString.addAttribute(.accessibilitySpeechIPANotation, value: pronunciation, range: NSRange(range, in: self)) // 3
}
return attributedString.copy() as! NSAttributedString // 4
}
Our extra logic here is as follows:
- Create the NSMutableAttributedString just like before.
- Inside our onWords closure, check which version of lead we've found and choose the appropriate pronunciation String, singular or plural. This check is simple enough that a ternary expression keeps it compact while still readable.
- Add the attribute to the attributed string, using the right pronunciation for the right NSRange. Once again we must translate our ranges between types, this time from Range<String.Index> to NSRange. Here another NSRange initializer, NSRange(_:in:), does the work for us.
- Given NSAttributedString's Objective-C heritage, we copy our result to an immutable instance before returning it. Otherwise the caller would receive our NSMutableAttributedString, merely typed as NSAttributedString, and any later mutation of that instance would show through.
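The reason for that final copy can be demonstrated with Foundation alone. In the sketch below, which uses arbitrary strings, a reference typed as NSAttributedString still observes later mutations of the underlying NSMutableAttributedString, while a copy does not:

```swift
import Foundation

let mutable = NSMutableAttributedString(string: "lead")

// Upcasting doesn't create a new object: this reference shares the mutable instance.
let shared: NSAttributedString = mutable
// copy() produces an independent snapshot of the current contents.
let snapshot = mutable.copy() as! NSAttributedString

mutable.append(NSAttributedString(string: "ership"))
print(shared.string)   // leadership
print(snapshot.string) // lead
```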
The leadPronunciationCorrectedAttributedString() function now produces the correct attributes for our two words in the new NSAttributedString.
let lotsOfLeads = "lead Leads leadership unleaded lead Leads leads"
let corrected = lotsOfLeads.leadPronunciationCorrectedAttributedString()
print(corrected)
This snippet produces the following output:
lead{
UIAccessibilitySpeechAttributeIPANotation = "l/i:/d";
} {
}Leads{
UIAccessibilitySpeechAttributeIPANotation = "l/i:/ds";
} leadership unleaded {
}lead{
UIAccessibilitySpeechAttributeIPANotation = "l/i:/d";
} {
}Leads{
UIAccessibilitySpeechAttributeIPANotation = "l/i:/ds";
} {
}leads{
UIAccessibilitySpeechAttributeIPANotation = "l/i:/ds";
}
As you can see, the attributes are set correctly on both the singular and plural versions, regardless of case. They don't spill onto spaces or neighboring words, and words that merely contain "lead" are skipped. All of this happens while iterating over the original String only once.
Wrapping Up
In this post we've seen how to use CFStringTokenizer to provide a performant, general API for finding the words in a String. We've also seen how to build a convenience API that makes our use case much nicer without compromising functionality or performance. This sort of API could be extended in several additional ways, including:
- API to make it easier to map between many words and pronunciations.
- A lazy wrapper for our string tokenizer so that we don't need to tokenize an entire string if we only want the first word.
- Extensions to relevant views, like UILabel, to add these corrections automatically.
But I leave these as an exercise for the reader. 😉
Thanks for reading!