Relevant Search by Doug Turnbull and John Berryman

Mental Model for Search

Relevance Scores

Precision vs Recall

Precision is the % of documents in the result set that are relevant. Recall is the % of all relevant documents that appear in the result set. The two are often at odds, especially since the user's query is a lossy translation of their intent. Improving precision can tighten the criteria to the point that some results the user expects are excluded. On the other hand, trading precision for recall means the user may find some results irrelevant.
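The two definitions above can be made concrete with a small helper. This is a sketch; the function and variable names are illustrative, not from the book.

```python
def precision_recall(results, relevant):
    """Compute precision and recall for one result set.

    results  -- ordered list of doc ids returned by the engine
    relevant -- set of doc ids the user would judge relevant
    """
    returned = set(results)
    hits = returned & set(relevant)
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs returned, 2 of them relevant; 5 relevant docs exist overall
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        {"d1", "d3", "d5", "d6", "d7"})
# precision = 2/4 = 0.5, recall = 2/5 = 0.4
```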

Signal modeling

To the relevance engineer, a field exists to emit a signal, a measurement of some piece of information, in the form of that field's relevance score.

Signal modeling can be thought of as feature extraction. The goal is to find an optimal set of fields that together capture the user's intent completely. It's called signal modeling because the aim is for each field-specific relevance score to fire strongly when it detects the feature it's tuned for, without any false positives.

The key is to avoid being biased toward the source data model/schema and instead to think from the point of view of the search user's intent, finding the model that best represents that intent for all users of the system.

A symptom of sub-optimal signal modeling is signal discordance, which can cause false positives or false negatives. Finding the optimal set is not guaranteed, as the data may not support it. Lucene's scoring mechanism, which is based on TF and IDF, also causes issues. This is where a hybrid of field-centric and term-centric search comes in.

Field-centric search

Scoring formula
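As a hedged sketch of the field-centric (best_fields) combination: Lucene's dis-max style scoring takes the best-matching field's score and optionally adds a tie_breaker fraction of the remaining fields' scores. The per-field scores themselves would come from TF*IDF; the numbers below are made up.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Dis-max combination of per-field scores: best field wins,
    other fields contribute only tie_breaker * their score."""
    best = max(field_scores)
    rest = sum(field_scores) - best
    return best + tie_breaker * rest

# title scores 2.0, body scores 0.5 for the same query
best_fields_score([2.0, 0.5], tie_breaker=0.3)  # 2.0 + 0.3 * 0.5 = 2.15
```

With tie_breaker=0.0 this is a pure "best field wins" score, which is exactly the bias the Albino Elephant discussion below is about.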

Term-centric search

Term-centric search gives primacy to terms/tokens instead of fields. Users generally rank a document containing all of their query terms above one that matches only a subset, even when that subset appears across multiple fields.

It can address some deficiencies of field-centric search, such as:

  - Albino Elephant problem. Field-centric search (especially best_fields) biases in favor of the best-matching field, and often favors documents where only a subset of the query terms match strongly over documents where all terms match.
  - Signal Discordance. This is where signal modeling in terms of fields works against the user's term-focused intent, for example when no single field contains all the search terms. Again, TF * IDF can rank documents matching only a subset of the terms above documents matching all of them.
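The Albino Elephant problem can be shown with a toy example. Assume the query "albino elephant" and a simplistic score of 1.0 per term match per field, summed across fields (most_fields style); the documents and scores here are illustrative, not from the book.

```python
# doc_a covers only one query term, but in two fields;
# doc_b covers both query terms, once each.
doc_a = {"title": {"albino"}, "body": {"albino"}}
doc_b = {"title": {"albino"}, "body": {"elephant"}}

def most_fields_score(doc, query_terms):
    # Each field scores its own matches independently;
    # the per-field scores are then summed.
    return sum(len(matched & query_terms) for matched in doc.values())

q = {"albino", "elephant"}
most_fields_score(doc_a, q)  # 2 -- despite covering only one query term
most_fields_score(doc_b, q)  # 2 -- ties, even though it covers both terms
```

A field-centric sum cannot separate these two documents, while a user would clearly prefer doc_b; that is the bias term-centric search corrects.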

Options to solve issues:

  1. _all field
     - Addresses the Albino Elephant problem, but loses the per-field signals, hurting relevance.
     - Bad in terms of storage, and must be built at index time.
  2. cross_fields
     - Score = (Max(S[T1|F1-FN]) + Max(S[T2|F1-FN]) + ...) * coord
     - Uses a blended scoring mechanism that evens out lopsided IDFs across fields for each term.
     - Can be applied at query time.
     - Solves both Albino Elephant and Signal Discordance, but can have issues due to the way the blended IDF is calculated.
  3. custom all fields
     - A special case of #1 where similar fields are aggregated, and the aggregated field is used in field matches. This doesn't always work, e.g. when the fields cannot be split into disjoint sets.
     - Drawback: must be done at index time and increases storage cost.
  4. Use a layered approach: a base query with high recall, plus a more specialized query that matches a subset of the base results, used as a boost to improve precision.
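Option 4 can be sketched as an Elasticsearch-style query body: a broad match in must supplies recall, while a stricter phrase match in should can only boost, never exclude. The field name and boost value are illustrative assumptions.

```python
# Layered query: high-recall base + precision-boosting layer.
layered_query = {
    "query": {
        "bool": {
            "must": [
                # broad base query: any term may match (high recall)
                {"match": {"title": {"query": "albino elephant"}}}
            ],
            "should": [
                # stricter layer: exact phrase, boosts matching docs
                # without removing anything from the result set
                {"match_phrase": {"title": {"query": "albino elephant",
                                            "boost": 2.0}}}
            ],
        }
    }
}
```

Because should clauses in a bool query add to the score without filtering, documents matching only the base query still appear, just ranked below the phrase matches.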