Fuzzy search, graphemes, and getting it just right
Sometimes getting things done is a matter of compromises, this is not such a time.
I have been working on codepoint n-gram search methods, I use one for the index page of this very blog, and I have published a basic version of the index in Clojure as je.suis/un-petit-index on Clojars. This method of searching produces large indexes, typically about 15-20x larger than the text indexed; but I think the quality of results, particularly on arbitrary data, can make it worthwhile.
At my company, I have put this codepoint n-gram method to work on various search indexing tasks where embedded incremental search indexes are a good fit. One such task is client profile recall by name or email address.
Authorized staff can use a fuzzy search box to recall client profiles, in our agent portal; and in our case we wanted to display the position of the match. The first version of this only showed exact matches of the whole query string by bolding the matching UCS-2 chars. That just didn't seem right, and it breaks down in some edge cases, so I got to thinking: What if we could show the user precisely what matched and how much?
When I started to think through the problem, it became apparent that there would be a lot more to it. First, the existing solution has a bug: when the text contains surrogate pairs, if the high surrogate matches, half of it will be in the bold span, and half in the non-bold span. Second, the browser does not display text by UCS-2 character, but rather by Extended Grapheme Cluster; and the tools for segmenting text by EGC are not built in to every browser yet. Third, it would be nice to be able to see exactly which parts of the results matched the query, but unfortunately there is no straightforward way to display different components of a composed character in different shades, let alone different weights.
The solution I came to, given these constraints, is to display as a shade the degree of the match within an EGC by the ratio of the number of its decomposed codepoints which participate in an n-gram from the query. There are some additional stylistic decisions, like whether to reduce the maximum shade for lower results in the list, or to not show 1-gram matches of single alphabet characters, but this is a start.
Extended What Cluster?
A ‘grapheme’ is generally defined as “the smallest functional unit of a writing system”, and in the context of Unicode, a ‘grapheme cluster’ is a set or stack of one or more graphemes that make up what the user might consider a ‘character’.
For example, you might consider ‘á’ to be a character, and in Unicode there are two ways to represent that: one involves a single codepoint, U+00E1 LATIN SMALL LETTER A WITH ACUTE, the other involves two codepoints, U+0061 LATIN SMALL LETTER A and U+0301 COMBINING ACUTE ACCENT. Both of these are one Grapheme Cluster according to Unicode.
My search index decomposes text in both the query and the index into those individual codepoints before indexing, because this allows me to partially match accented characters with non-accented ones, or in Korean text, match hangul with typos in the individual jamo, analogous to mistyping a letter.
The ‘Extended’ in ‘Extended Grapheme Cluster’ means that the Grapheme Cluster can include instructions for composing complex emoji, like the ones with different sets of family members, some regional flags, and roles with non-default sexes (like the lady doctor emoji).
Each of these is an element of getting things just right, and together they make for an interesting and pleasant experience. You can see a live demonstration of this in English and in Korean.
Each of these demos is a configuration of the code at a repository on Github