Relevant Search by Doug Turnbull and John Berryman
Mental Model for Search
- Lucene indexes tokens, not words. The goal is to find the token encoding that best captures the search user's intent. A token doesn't need to be a word: it can be one, but it can also be multiple words (e.g. a bigram) or something else entirely, e.g. a geohash.
- Both an indexed document and a user's search query can then be considered vectors in n-dimensional space, where each dimension represents a feature (here, a token). The relevance score is a measure of the similarity between the two vectors (see the sketch after this list).
- A typical pipeline is:

      intent -> search query -> retrieval -> display

  With token-based matching, this becomes:

      intent -> search query -> tokens <-(intersection)-> tokens <- documents
                                           |
                                       Filtering
                                           |
                                        Ranking
                                           |
                                 UX <------'
- The key is to use analysis to break down the documents and the user's query into a shared set of tokens.
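A minimal sketch of this mental model in plain Python; `analyze` is a hypothetical stand-in for a Lucene analysis chain, and cosine similarity stands in for Lucene's actual scoring:

```python
from collections import Counter
import math

def analyze(text):
    # Hypothetical stand-in for a Lucene analysis chain:
    # lowercase + whitespace split. Real analyzers may also stem,
    # strip stopwords, emit bigrams, geohashes, etc.
    return text.lower().split()

def to_vector(tokens):
    # Each dimension is a token; the value is its frequency.
    return Counter(tokens)

def cosine_similarity(v1, v2):
    # Relevance as similarity between query and document vectors.
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

doc = to_vector(analyze("A basset hound is a short-legged breed of dog"))
query = to_vector(analyze("basset hound dog"))
print(cosine_similarity(query, doc))  # higher = more similar
```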
Relevance Scores
- Term Frequency (TF): how often a term occurs in a document. The higher the value, the more relevant the document.
- Inverse Document Frequency (IDF): the inverse of the fraction of documents containing the term. The inverse matters because the fewer documents a term appears in, the more discriminating a match on that term is.
- Coord: the coordination factor; it damps the score of documents that match only some of the query's clauses (the score is multiplied by # matching clauses / # total clauses).
- Field Norm: length normalization; Lucene favors matches in shorter fields over matches in longer ones.
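These four components combine in Lucene's classic (pre-BM25) practical scoring function. A rough sketch in plain Python, omitting query norms and boosts:

```python
import math

def tf(term_freq):
    # Term frequency, dampened: the 10th occurrence adds less than the 1st.
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    # Inverse document frequency: the rarer the term, the higher the weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(field_length):
    # Length normalization: matches in shorter fields weigh more.
    return 1.0 / math.sqrt(field_length)

def classic_score(query_terms, doc_stats, num_docs):
    """doc_stats: term -> (freq in this field, corpus doc freq, field length)."""
    matched = [t for t in query_terms if t in doc_stats]
    coord = len(matched) / len(query_terms)  # coordination factor
    score = 0.0
    for t in matched:
        freq, doc_freq, field_length = doc_stats[t]
        # idf is squared because it weights both the query and the document.
        score += tf(freq) * idf(num_docs, doc_freq) ** 2 * field_norm(field_length)
    return coord * score

stats = {"basset": (1, 5, 4), "hound": (2, 40, 4)}
print(classic_score(["basset", "hound", "dog"], stats, num_docs=1000))
```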
Precision vs Recall
Precision is the % of documents in the result set that are relevant. Recall is the % of all relevant documents that are returned in the result set. The two are often at odds with each other, especially because the user's query is a lossy translation of their intent. Improving precision can tighten the criteria to the extent that some results the user expects are missing from the result set. On the other hand, improving recall at the expense of precision means the user may find some results irrelevant.
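A quick worked example in plain Python:

```python
def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned)  # how clean the results are
    recall = len(hits) / len(relevant)     # how complete the results are
    return precision, recall

# 10 documents are truly relevant; the engine returns 8, 6 of them relevant.
relevant = {f"doc{i}" for i in range(10)}
returned = {f"doc{i}" for i in range(6)} | {"docX", "docY"}
print(precision_recall(returned, relevant))  # (0.75, 0.6)
```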
Signal modeling
To the relevance engineer, fields exist to return signals: each field's relevance score measures the presence of one specific piece of information in the document.
Signal modeling can be considered feature extraction. The goal of signal modeling is to find an optimal set of fields such that they capture the search user's intent completely. It's called signal modeling because the goal is to maximize the signal (the field-specific relevance score) when it detects the feature it's tuned for, without any false positives.
The key here is to avoid being biased in favor of the source data model/schema, and instead to think from the point of view of the search user's intent, finding the model that best represents that intent for all users of the system.
A symptom of sub-optimal signal modeling is signal discordance, which can cause false positives or false negatives. Finding the optimal set is not guaranteed, as the data may not support it. Lucene's scoring mechanism, being based on TF & IDF, also causes issues. This is where a hybrid of field-centric and term-centric search comes in.
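As an illustration, a single source field can be decomposed into several sub-fields, each tuned to emit one precise signal. A sketch as an Elasticsearch-style mapping expressed as a Python dict; `my_bigram_analyzer` is a hypothetical shingle-based analyzer:

```python
# One source value, three signals: stemmed free-text match, exact-ish
# match, and bigram/phrase match.
mapping = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "english",  # stemmed, stopword-stripped tokens
            "fields": {
                "exact": {
                    "type": "text",
                    "analyzer": "standard",  # unstemmed tokens
                },
                "bigrams": {
                    "type": "text",
                    "analyzer": "my_bigram_analyzer",  # hypothetical: shingles
                },
            },
        },
    },
}
```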
Field-centric search
- Field centric vs term centric
- Field-centric search scores each field against the entire query string; the per-field scores are then combined in a reduction step.
- Term-centric search gives primacy to terms/tokens instead of fields. Users generally rank a document containing all of the search terms above one containing only a subset, even if that subset appears in multiple fields.
Scoring formula
- best_fields:
- S[F1] + tie_breaker * (S[F2] + S[F3] + ...), where S[F1] is the highest per-field score (both reductions are sketched in code after this list)
- Works well when documents rarely have multiple fields that match the search string
- Rewards rare terms over common ones, since TF * IDF scores are higher for rare terms.
- You need to provide precise signals - signals so precise that it's unlikely that the search string will match many documents.
- most_fields:
- (S[F1] + S[F2] + ...) * coord, where coord = # matching clauses / # total clauses
- Provides a way to specify what an ideal document looks like
- Works well when documents commonly have multiple fields matching the search string
- Equivalent to best_fields with a tie_breaker of 1
- Beware of lopsided fields, i.e. fields that bias the final score due to their TF * IDF scores. Use a combination of down-boosts/up-boosts to even things out a bit.
- Beware of a single token matching multiple fields. You need to decide if that translates to the user's intent.
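A sketch of the two reductions in plain Python, over per-field scores that are assumed to be already computed (e.g. via TF * IDF):

```python
def best_fields(field_scores, tie_breaker=0.0):
    # Winner takes most: the best field dominates; the rest contribute
    # only through the tie_breaker.
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

def most_fields(field_scores, matching_clauses, total_clauses):
    # Every matching field counts; coord rewards matching more clauses.
    coord = matching_clauses / total_clauses
    return sum(field_scores) * coord

scores = [2.0, 0.5, 0.3]  # e.g. title, body, tags
print(best_fields(scores))                   # 2.0
print(best_fields(scores, tie_breaker=1.0))  # 2.8, i.e. sum(scores)
print(most_fields(scores, 3, 3))             # 2.8: best_fields with tie_breaker=1
```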
Term-centric search
As noted above, term-centric search gives primacy to terms/tokens instead of fields, reflecting that users prefer documents containing all of the search terms over documents containing only a subset. It can address some deficiencies of field-centric search, such as:
- The Albino Elephant problem. Field-centric search (especially best_fields) is biased in favor of strong single-field matches, and often in favor of a subset of the search terms matching instead of all of them: for the query "albino elephant", a document containing only "elephant" can outscore one containing both terms. A worked toy example follows this list.
- Signal Discordance. This is where the signal modeling in terms of fields works against the user's term-focused intent; for example, when no single field contains all of the search terms. Again, TF * IDF can result in documents matching only a subset of the terms being ranked higher.
- Field synchronicity:
- The terms must be matchable in the same way across fields. An example is the term "william" matched against a fieldset that includes a bigrammed field: the bigrammed field makes no sense here, since its tokens are word pairs, so it's left out of the query plan to comply with field synchronicity.
- A similar synchronicity requirement applies to analyzers in cross_fields: the analyzers of the fields searched together should match.
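A worked toy example of the Albino Elephant problem (plain Python; the per-field scores are made up): for the query "albino elephant", best_fields ranks a document that mentions only "elephant" above one that contains both terms.

```python
def best_fields(field_scores, tie_breaker=0.0):
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# Illustrative per-field scores for the query "albino elephant":
# Doc A contains both query terms, split across fields.
doc_a = {"title": 1.0, "body": 1.0}  # title: "albino", body: "elephant"
# Doc B contains only "elephant", repeated, so its body scores higher.
doc_b = {"title": 1.0, "body": 1.5}  # title: "elephant", body: "elephant elephant"

print(best_fields(list(doc_a.values())))  # 1.0
print(best_fields(list(doc_b.values())))  # 1.5: the doc missing "albino" wins
```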
Options for solving these issues:
- _all field
- Addresses the Albino Elephant problem, but loses per-field signals, hurting relevance.
- Costly in terms of storage, and has to be set up at index time.
- cross_fields
- (Max(S[T1|F1-FN]) + Max(S[T2|F1-FN]) + ...) * coord
- This uses a blended scoring mechanism that tries to even out lopsided IDFs for terms.
- Can be used at query time
- Solves both Albino Elephant and Signal Discordance, but can have issues due to the way the blended IDF is calculated (see the sketch below).
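A sketch of the term-centric reduction in plain Python, followed by what the corresponding Elasticsearch multi_match query looks like (field names and scores are illustrative):

```python
def cross_fields(term_field_scores, total_terms):
    # For each term, the best score across fields wins; then sum across
    # terms and apply coord. Terms, not fields, carry the score.
    per_term_best = [max(f.values()) for f in term_field_scores.values()]
    coord = len(per_term_best) / total_terms
    return sum(per_term_best) * coord

# Query "albino elephant" over title and body (illustrative scores):
scores = {
    "albino":   {"title": 1.2, "body": 0.0},
    "elephant": {"title": 0.0, "body": 0.9},
}
print(cross_fields(scores, total_terms=2))  # 2.1: both terms pull their weight

# The corresponding Elasticsearch query:
query = {
    "multi_match": {
        "query": "albino elephant",
        "type": "cross_fields",
        "fields": ["title", "body"],
        "operator": "and",  # require every term to match in some field
    }
}
```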
- custom all
- A special case of the _all field where similar fields are aggregated, and the aggregated field is then used in field matches. This doesn't always work, e.g. when the fields cannot be partitioned into disjoint sets.
- The drawback is that this needs to be done at index time and increases storage cost (see the copy_to sketch below).
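In Elasticsearch, the usual mechanism for a custom all field is copy_to in the mapping. A sketch (field names are illustrative):

```python
# first_name and last_name are copied into a combined full_name field at
# index time, so "william turner" can match against a single field.
mapping = {
    "properties": {
        "first_name": {"type": "text", "copy_to": "full_name"},
        "last_name":  {"type": "text", "copy_to": "full_name"},
        "full_name":  {"type": "text"},
    }
}
```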
- Use a layered approach: a base query that has high recall, then a more specialized query whose matches are a subset of the former, used as a boost to improve precision. A sketch follows.
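A sketch of the layered approach as an Elasticsearch-style bool query (Python dict; fields and boosts are illustrative): must supplies the high-recall base and decides what is returned, while should only boosts, improving precision at the top of the ranking.

```python
query = {
    "bool": {
        # Base layer: broad, high-recall match; defines the result set.
        "must": [{
            "multi_match": {
                "query": "albino elephant",
                "type": "cross_fields",
                "fields": ["title", "body"],
            }
        }],
        # Boost layer: stricter match; reorders but never filters.
        "should": [{
            "match_phrase": {
                "title": {"query": "albino elephant", "boost": 2.0}
            }
        }],
    }
}
```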