How we implemented Keyword Sort
Searching for properties is one of the pillars on which our portal is built. What we used to understand as search before May 2018 was a set of rules to describe a binary logic – a given property either matches a user’s search or it doesn’t. As simple as that. But there’s much more to it that Elasticsearch allows us to do. This reason, together with our long-running ambition to move away from search and towards (what I like to call) relevance engineering, made us take a stab at full text search. This was our first step in the direction of “guessing” what the user wants and needs, as opposed to just letting them tell us straight away via a set of predefined options (see prices, number of bedrooms and all that).
Precision and recall
Or precision versus recall. Because it’s hard to get both and sometimes achieving one of them means giving up on the other. This is the inherent dilemma of any relevance engineer. Once we embark on the relevance boat, we better embrace it and live with it. So what is it? Here’s an example.
We have a universe with three horses (where someone misspelled one horse as “hosre”), a cow, a house and a barn. And we’ve got someone who wants to find all the horses in the universe using text search. So they run a search for “horse”, obviously. At this point we have (at least) two options:
- Search strictly for the word “horse”, which will return two results – Horse 1 and Horse 2. What’s important to note at this point is that from our user’s perspective the fact that Horse 3 was misspelled when saved into our data store doesn’t make that horse any less relevant. Conceptually, the correct answer to this query will contain all three horses, because this is what’s relevant to the user. That’s good precision and bad recall. In other words, we’re likely to get some false negatives.
- If this situation is frequent enough we can proactively work towards meeting the users’ relevance requirements by building some degree of fuzziness into our searches. Let’s say we allow words to contain some misspelled characters and still be counted as a match. That would include the third horse, of course. However, it’d have the side effect of including the house to the results, as there’s only one letter difference between the two. In this case you get bad precision and good recall. In other words, we’re likely to get some false positives.
To formalise, according to Wikipedia, precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved over the total amount of relevant instances. For those interested in a more in depth read on the topic of relevance engineering, I highly recommend the book Relevant Search by Doug Turnbull and John Berryman (https://www.manning.com/books/relevant-search). Chapter 4.2 talks extensively about precision and recall.
All the decisions that we made while building our keyword search solution were influenced to some degree by the inherent tension between these two concepts.
The big decisions
Sort vs filter
In a way, we addressed the tradeoff between precision and recall by not filtering out results at all based on text search and only sorting them instead based on how relevant we think they are to the user. This allowed us to achieve a virtually perfect recall. One can argue it came at the price of achieving virtually no precision. However, this is not as bad it sounds in reality, because here’s where relevance scoring kicks in; while we don’t filter out any properties based on keywords, we order them based on how relevant (we think) they are to the user running the search.
There’s one more thing to consider here, with a bit of background information on what we search on. We get our property listings from estate agents and new home developers. They provide a lot of information about their listings, including a summary that goes on the property card on the search results page (highlighted in red), a longer description and a set of features that go on the property details page (highlighted in blue). These are the bits of data that we run our text search on. Now let’s imagine that a user searches for the word “door”. I think it’s fair to say that most of our properties have at least one door, am I right? I also think I’m right when saying that not all the agents will mention that their properties have a door in the description or summary. Would the user be happy to end up on a zero results page when searching for “door”? Would that mean that we don’t have properties with doors on our website? No and no.
The bottom line is we didn’t want to write just another filter, but to develop a means of prioritising results.
Elasticsearch comes with a default formula for scoring matched documents against a query, starting from two core principles – term frequency and inverse document frequency. While I won’t go into the theoretical details of what it means – if you’re the curious type, please find them here – it revolves around the following rules:
- The more times a query word appears in a document, the higher the document scores;
- The fewer times a query word appears in other documents, the higher the document scores;
- When there are more query words, the scores for all words are added together to give the score of the document.
In the good spirit of starting simple and not doing any unneeded customisation, we first wanted to see how this algorithm would behave with our real world data set. We didn’t know what to expect so we were quite eager to deploy our work and let it loose in the pre-prod environments. The way we highlight results is a simple “this keyword matched, that one didn’t” as illustrated. So we asked people internally to play with keyword search and the most frequent feedback we got was that the ordering was confusing; rather frequently, people would get what looked like a random ordering because they would expect results to be ordered by the number of keywords that matched. Term frequency/inverse document frequency formula doesn’t sort results by how many words matched, but rather by how relevant it “feels” words are for a given result; i.e. a property matching three “slightly relevant” words would score lower than a property matching two “very relevant” words.
It was then that we realised we needed to adapt the scoring algorithm to our domain. The reality is that people who search properties to move into use keyword search as a feature prioritisation tool. Using an example, they don’t care whether “garden” appears once or ten times in the description of a property. In all cases, it is equally a good sign that the property has a garden. Moreover, if they searched for “garden” and “en suite”, they expect all properties containing “garden” and “en suite” to appear above the ones matching only one of the words. This is probably the first important lesson we learned while working on keyword search – your use case is unique and most likely you will need to customise the default implementation of your search engine of choice. It’s also one of the main ideas advocated in Relevant Search by Doug Turnbull and John Berryman, which I mentioned above.
So we went on and changed it. And here’s the beauty of it – because we actually went with a much simpler way of scoring properties. We thought we’d simply give one score point per keyword matched, no matter how many times it matched. This is where the constant score query in Elasticsearch came in handy. The outcome was a beautifully sorted list of properties with a few well-defined blocks grouped by number of keywords matched.
Do you like when you type “eggplant” in Google and you also get results for “aubergine”? Or you probably didn’t even notice that is happening because it feels so natural? Well… we thought the same. However, Elasticsearch doesn’t come bundled with a built-in English synonym dictionary. And you can imagine it was out of the scope of the project to define our own one. We had neither the capacity, nor the willingness to do that. But all was not lost. We realised we didn’t need a full English dictionary just for the sake of it because we could achieve a good return on investment by addressing the most common search terms first. And then we’ll see, in the good ol’ agile spirit. Therefore we chose inaction to begin with. We waited a couple of months to gain enough search history data so that we can figure out what are the top searched words in Rightmove. Then we had a look at the top few hundreds and defined our own synonym series for those that made sense.
The result was eleven synonym pairs or series, e.g. flat/apartment or loft/attic. Unsurprisingly, most of them were domain-specific words from the real estate universe. You can now understand why it wouldn’t have been worth looking at the whole English dictionary. No one searches for eggplant on Rightmove.
Misspelled words. We’re all guilty of that. Keyboards are small, keys are close to each other, especially when on the phone. You wanted to hit the O, but you hit the P. I know, I know. However, we didn’t think our users had to be punished for such a benign mistake, so we needed to find a way to meet them halfway and at least try to figure out what they meant. Long story short, there were a couple of solutions we considered, out of which we picked a winner.
One of the alternatives was to suggest words as users would type them, something like a typeahead. There were a number of reasons for which we dismissed this one at an early stage:
- User experience and design: we didn’t want to add any additional, unnecessary clutter to the page. Finding some real estate on the screen for the suggested word would have been quite difficult without squeezing things together.
- Performance: obtaining the most accurate suggested word is not trivial. Any scenario would eventually involve additional calls to the backend from the user’s device, which in turn would add computation strain on our apps and Elasticsearch. It’s not that we’re close to reaching capacity in terms of hardware, but we weren’t sold on the benefit in return. Maybe we were a bit too careful? Maybe.
Another option was to enable fuzzy matching in our Elasticsearch queries. It is based on using the Damerau-Levenshtein distance formula, which essentially defines the “distance” between two words as the smallest number of changes that the first word needs to go through in order to become the second word, where a change can be either:
- A character being added, e.g. detached -> detatched
- A character being removed, e.g. detached -> deached
- Two neighbouring characters being swapped, e.g. detached -> deatched
See an extensive, more reader friendly explanation at https://www.elastic.co/blog/found-fuzzy-search#determining-edit-distance.
We decided to give this solution a go. It has the benefits of being quick to implement – hence quick to release to the user and get quick feedback from the real world – and of using a proven solution out of the box, i.e. Elasticsearch’s fuzzy matching module. We didn’t think reinventing the wheel would make sense at this point. Nevertheless, we knew it wasn’t going to be all sunshine and rainbows; we knew it was one of the aspects where the right balance between precision and recall was going to be harder to achieve. Applying fuzziness had the potential to cause a lot of false positives.
There are quite a few knobs that we can turn in order to fine tune the fuzziness. The most prominent being the maximum Damerau-Levenshtein distance we allow for a word to be considered a match. Looking at that, we could quickly find a fair number of examples where a distance of 2 would be too much. Think attached/detached, possible/impossible, where a 2-character modification means a 180-degree change of the word meaning. Add this to our being extremely cautious about false positives, so we decided to begin with a distance of 1.
The current state
I’ll be honest and admit there have been hiccups. Every now and then, we get the odd enquiry where a puzzled user wonders why a certain property does match one of their keywords. One common example is “acre”, which matches “care” (see how there is only one character swap between the two words) in constructs such as “…property taken care of by…”.
What’s inherently difficult about fuzzy matching is assessing its effectiveness and return on investment. One could easily grow disappointed by the amount of negative feedback, like the one I mentioned above. It’s because negative feedback is very visible, precise and straight to the point. We realised we don’t have an easy way to tell when we actually helped a user by applying fuzziness. Tracking the positive impact is the hard problem around fuzzy matching for us. For instance, if we get a query for “horse”, we will surely return everything that contains the word “house”. But how do we know if the user made a mistake or they were really looking for horses? We are still looking for the best way to identify the positive impact. Let’s be honest, who would send a thank you note when they got what they wanted after spelling a word wrong?
Other alternatives to potentially improve the fuzzy matching behaviour we are currently evaluating are:
- Skip the first N characters of a word when applying fuzziness. This is a configuration parameter that Elasticsearch exposes. They do it based on the theory that misspellings occur more towards the end of words; arguably.
- Refine the scoring model and demote the results that matched on a word only as a result of applying fuzziness, i.e. not as a perfect match.
- Only run a fuzzy search when the query word is not an actual English word. We’d need a third party library to tell us what’s a legit English word.
Feedback is equally as important as search
In any modern search solution, there are (at least) two first class citizens – querying and feedback. Finding the relevant results for the user is very important. Equally as important is giving the user a bit of context and making it obvious to them why they are seeing a certain result. Let’s look at what Google Search does.
I ran a search for “properties for sale in Camden and Wimbledon”. Looking at this particular result, we can see that Google found the relevant excerpts anywhere on the target page where my search phrase, or parts thereof, appear and put them under the actual page title and link. Also, they highlighted the exact words from my query. Last but not least, if there’s any bit from my query that wasn’t found anywhere on the page, I can see the word crossed – Wimbledon in this case -, so I know what to expect if I click through to the page.
But what did we do?
There are a couple things to notice here:
- Right at the top, the label that reads “Matching 1 keyword”. That one doesn’t appear above each result matching one property, though. Remember that the properties are scored based on the number of keywords that matched? This label is just a separator between the block of properties that matched two keywords and the block of properties that matched only one. This is the first visual cue users get so they know there’s a caveat about the following results. In this very case, that it only matched one of “camden” and “wimbledon”
- At the bottom of the property card there’s a list of all the search words, where the one(s) that didn’t match are highlighted more strikingly. This is similar in intention and purpose to what Google does when crossing out words that were not found.
To sum up, we did it our own way. Google set a good standard, but we need to stay aware of the fact that what we do is not web page search. We do property search, which is different. Our users expect a certain set of data to be presented, which is not the same to what they expect – even if it’s the same person – when they do a Google search. Just like Google, we set some expectations over time and we educated our own public to a certain way of distributing information on screen. We can’t mess with that. One change at a time is safer.
We like to think of keyword sort as a product rather than a project, in that we don’t see it as a time-bound endeavour with a precise ending date. We’ll always have a backlog of things that at least we’d like to try. At the top of the list we’ve got currently:
- Fight fuzzy matching false positives by all means described a few sections above
- Relax the phrase matching rules. Currently we allow a slop factor of 1, which is similar to the fuzziness factor for words, but applied to phrases. Taking the concept of relaxing phrase matching further would mean in the end searching for all words separately – like Google does – and probably providing a single text field rather than three. But this is quite far-fetched to discuss now, given that it would involve redesigning the visual experience from scratch
- Improve feedback when clicking through to the details page. Currently there is no remembering the context of a keyword search. Ideally, users arriving on the property details page would get some visual confirmation of things that matched from their original search, for instance words highlighted in the extended property description, which is not shown on the search results page.
All in all, it’s been an interesting journey so far. The main goal was to help users find more quickly something that actually fits their needs. We’re at a point where we collect usage statistics and try to read through them and understand whether we’ve been successful in our effort. We’ll keep you posted.