Relevant Search by Doug Turnbull and John Berryman
Mental Model for Search
- Lucene indexes tokens, not words. The goal is to find the token encoding that best captures the search user's intent. A token doesn't need to be a word: it can be one, but it can also be multiple words (e.g. a bigram) or something else entirely, e.g. a geohash.
- Both an indexed document and a user's search query can then be considered vectors in n-dimensional space, where each dimension represents a feature (here, a token). The relevance score is a measure of the similarity between the two vectors (see the sketch after this list).
- A typical pipeline is:

      intent -> search query -> retrieval -> display

  With token-based matching, this becomes:

      intent -> search query -> tokens <-(intersection)-> tokens <- documents
                                           |
                                       Filtering
                                           |
                                        Ranking
                                           |
                                 UX <------'
- The key is to use analysis to break down the documents and the user's query into a shared set of tokens.
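A minimal sketch of this mental model in plain Python; `analyze` is a hypothetical stand-in for a Lucene analysis chain, and cosine similarity stands in for Lucene's actual scoring:

```python
from collections import Counter
import math

def analyze(text):
    # Hypothetical stand-in for a Lucene analysis chain:
    # lowercase + whitespace split. Real analyzers may also stem,
    # strip stopwords, emit bigrams, geohashes, etc.
    return text.lower().split()

def to_vector(tokens):
    # Each dimension is a token; the value is its frequency.
    return Counter(tokens)

def cosine_similarity(v1, v2):
    # Relevance as similarity between query and document vectors.
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

doc = to_vector(analyze("A basset hound is a short-legged breed of dog"))
query = to_vector(analyze("basset hound dog"))
print(cosine_similarity(query, doc))  # higher = more similar
```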
Relevance Scores
- Term Frequency (TF): how often a term occurs in a document. The higher the value, the more relevant the document.
- Inverse Document Frequency (IDF): the inverse of the fraction of documents containing the term. The inverse matters because the fewer documents a term appears in, the more discriminating a match on that term is.
- Coord: the coordination factor; it damps the score of documents that match only some of the query's clauses (the score is multiplied by # matching clauses / # total clauses).
- Field Norm: length normalization; Lucene favors matches in shorter fields over matches in longer ones.
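These four components combine in Lucene's classic (pre-BM25) practical scoring function. A rough sketch in plain Python, omitting query norms and boosts:

```python
import math

def tf(term_freq):
    # Term frequency, dampened: the 10th occurrence adds less than the 1st.
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    # Inverse document frequency: the rarer the term, the higher the weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(field_length):
    # Length normalization: matches in shorter fields weigh more.
    return 1.0 / math.sqrt(field_length)

def classic_score(query_terms, doc_stats, num_docs):
    """doc_stats: term -> (freq in this field, corpus doc freq, field length)."""
    matched = [t for t in query_terms if t in doc_stats]
    coord = len(matched) / len(query_terms)  # coordination factor
    score = 0.0
    for t in matched:
        freq, doc_freq, field_length = doc_stats[t]
        # idf is squared because it weights both the query and the document.
        score += tf(freq) * idf(num_docs, doc_freq) ** 2 * field_norm(field_length)
    return coord * score

stats = {"basset": (1, 5, 4), "hound": (2, 40, 4)}
print(classic_score(["basset", "hound", "dog"], stats, num_docs=1000))
```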
Precision vs Recall
Precision is the % of documents in the result set that are relevant. Recall is the % of all relevant documents that are returned in the result set. The two are often at odds with each other, especially because the user's query is a lossy translation of their intent. Improving precision can tighten the criteria to the extent that some results the user expects are missing from the result set. On the other hand, improving recall at the expense of precision means the user may find some results irrelevant.
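A quick worked example in plain Python:

```python
def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned)  # how clean the results are
    recall = len(hits) / len(relevant)     # how complete the results are
    return precision, recall

# 10 documents are truly relevant; the engine returns 8, 6 of them relevant.
relevant = {f"doc{i}" for i in range(10)}
returned = {f"doc{i}" for i in range(6)} | {"docX", "docY"}
print(precision_recall(returned, relevant))  # (0.75, 0.6)
```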
Signal modeling
To the relevance engineer, fields exist to return signals: each field's relevance score measures the presence of one specific piece of information in the document.
Signal modeling can be considered feature extraction. The goal of signal modeling is to find an optimal set of fields such that they capture the search user's intent completely. It's called signal modeling because the goal is to maximize the signal (the field-specific relevance score) when it detects the feature it's tuned for, without any false positives.
The key here is to avoid being biased in favor of the source data model/schema, and instead to think from the point of view of the search user's intent, finding the model that best represents that intent for all users of the system.
A symptom of sub-optimal signal modeling is signal discordance, which can cause false positives or false negatives. Finding the optimal set is not guaranteed, as the data may not support it. Lucene's scoring mechanism, being based on TF & IDF, also causes issues. This is where a hybrid of field-centric and term-centric search comes in.
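As an illustration, a single source field can be decomposed into several sub-fields, each tuned to emit one precise signal. A sketch as an Elasticsearch-style mapping expressed as a Python dict; `my_bigram_analyzer` is a hypothetical shingle-based analyzer:

```python
# One source value, three signals: stemmed free-text match, exact-ish
# match, and bigram/phrase match.
mapping = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "english",  # stemmed, stopword-stripped tokens
            "fields": {
                "exact": {
                    "type": "text",
                    "analyzer": "standard",  # unstemmed tokens
                },
                "bigrams": {
                    "type": "text",
                    "analyzer": "my_bigram_analyzer",  # hypothetical: shingles
                },
            },
        },
    },
}
```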
Field-centric search
- Field centric vs term centric
- Field-centric search scores each field against the entire query string; the per-field scores are then combined in a reduction step.
- Term-centric search gives primacy to terms/tokens instead of fields. Users generally rank a document containing all of the search terms above one containing only a subset, even if that subset appears in multiple fields.
Scoring formula
- best_fields:
- S[F1] + tie_breaker * (S[F2] + S[F3] + ...), where S[F1] is the highest per-field score (both reductions are sketched in code after this list)
- Works well when documents rarely have multiple fields that match the search string
- Rewards rare terms over common ones, since TF * IDF scores are higher for rare terms.
- You need to provide precise signals - signals so precise that it's unlikely that the search string will match many documents.
- most_fields:
- (S[F1] + S[F2] + ...) * coord, where coord = # matching clauses / # total clauses
- Provides a way to specify what an ideal document looks like
- Works well when documents commonly have multiple fields matching the search string
- Equivalent to best_fields with a tie_breaker of 1
- Beware of lopsided fields, i.e. fields that bias the final score due to their TF * IDF scores. Use a combination of down-boosts/up-boosts to even things out a bit.
- Beware of a single token matching multiple fields. You need to decide if that translates to the user's intent.
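A sketch of the two reductions in plain Python, over per-field scores that are assumed to be already computed (e.g. via TF * IDF):

```python
def best_fields(field_scores, tie_breaker=0.0):
    # Winner takes most: the best field dominates; the rest contribute
    # only through the tie_breaker.
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

def most_fields(field_scores, matching_clauses, total_clauses):
    # Every matching field counts; coord rewards matching more clauses.
    coord = matching_clauses / total_clauses
    return sum(field_scores) * coord

scores = [2.0, 0.5, 0.3]  # e.g. title, body, tags
print(best_fields(scores))                   # 2.0
print(best_fields(scores, tie_breaker=1.0))  # 2.8, i.e. sum(scores)
print(most_fields(scores, 3, 3))             # 2.8: best_fields with tie_breaker=1
```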
Term-centric search
As noted above, term-centric search gives primacy to terms/tokens instead of fields, reflecting that users prefer documents containing all of the search terms over documents containing only a subset. It can address some deficiencies of field-centric search, such as:
- The Albino Elephant problem. Field-centric search (especially best_fields) is biased in favor of strong single-field matches, and often in favor of a subset of the search terms matching instead of all of them: for the query "albino elephant", a document containing only "elephant" can outscore one containing both terms. A worked toy example follows this list.
- Signal Discordance. This is where the signal modeling in terms of fields works against the user's term-focused intent; for example, when no single field contains all of the search terms. Again, TF * IDF can result in documents matching only a subset of the terms being ranked higher.
- Field synchronicity:
- The terms must be matchable in the same way across fields. An example is the term "william" matched against a fieldset that includes a bigrammed field: the bigrammed field makes no sense here, since its tokens are word pairs, so it's left out of the query plan to comply with field synchronicity.
- A similar synchronicity requirement applies to analyzers in cross_fields: the analyzers of the fields searched together should match.
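A worked toy example of the Albino Elephant problem (plain Python; the per-field scores are made up): for the query "albino elephant", best_fields ranks a document that mentions only "elephant" above one that contains both terms.

```python
def best_fields(field_scores, tie_breaker=0.0):
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# Illustrative per-field scores for the query "albino elephant":
# Doc A contains both query terms, split across fields.
doc_a = {"title": 1.0, "body": 1.0}  # title: "albino", body: "elephant"
# Doc B contains only "elephant", repeated, so its body scores higher.
doc_b = {"title": 1.0, "body": 1.5}  # title: "elephant", body: "elephant elephant"

print(best_fields(list(doc_a.values())))  # 1.0
print(best_fields(list(doc_b.values())))  # 1.5: the doc missing "albino" wins
```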
Options for solving these issues:
- _all field
- Addresses the Albino Elephant problem, but loses per-field signals, hurting relevance.
- Costly in terms of storage, and has to be set up at index time.
- cross_fields
- (Max(S[T1|F1-FN]) + Max(S[T2|F1-FN]) + ...) * coord
- This uses a blended scoring mechanism that tries to even out lopsided IDFs for terms.
- Can be used at query time
- Solves both Albino Elephant and Signal Discordance, but can have issues due to the way the blended IDF is calculated (see the sketch below).
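A sketch of the term-centric reduction in plain Python, followed by what the corresponding Elasticsearch multi_match query looks like (field names and scores are illustrative):

```python
def cross_fields(term_field_scores, total_terms):
    # For each term, the best score across fields wins; then sum across
    # terms and apply coord. Terms, not fields, carry the score.
    per_term_best = [max(f.values()) for f in term_field_scores.values()]
    coord = len(per_term_best) / total_terms
    return sum(per_term_best) * coord

# Query "albino elephant" over title and body (illustrative scores):
scores = {
    "albino":   {"title": 1.2, "body": 0.0},
    "elephant": {"title": 0.0, "body": 0.9},
}
print(cross_fields(scores, total_terms=2))  # 2.1: both terms pull their weight

# The corresponding Elasticsearch query:
query = {
    "multi_match": {
        "query": "albino elephant",
        "type": "cross_fields",
        "fields": ["title", "body"],
        "operator": "and",  # require every term to match in some field
    }
}
```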
- custom all
- A special case of the _all field where similar fields are aggregated, and the aggregated field is then used in field matches. This doesn't always work, e.g. when the fields cannot be partitioned into disjoint sets.
- The drawback is that this needs to be done at index time and increases storage cost (see the copy_to sketch below).
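In Elasticsearch, the usual mechanism for a custom all field is copy_to in the mapping. A sketch (field names are illustrative):

```python
# first_name and last_name are copied into a combined full_name field at
# index time, so "william turner" can match against a single field.
mapping = {
    "properties": {
        "first_name": {"type": "text", "copy_to": "full_name"},
        "last_name":  {"type": "text", "copy_to": "full_name"},
        "full_name":  {"type": "text"},
    }
}
```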
- Use a layered approach: a base query that has high recall, then a more specialized query whose matches are a subset of the former, used as a boost to improve precision. A sketch follows.
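A sketch of the layered approach as an Elasticsearch-style bool query (Python dict; fields and boosts are illustrative): must supplies the high-recall base and decides what is returned, while should only boosts, improving precision at the top of the ranking.

```python
query = {
    "bool": {
        # Base layer: broad, high-recall match; defines the result set.
        "must": [{
            "multi_match": {
                "query": "albino elephant",
                "type": "cross_fields",
                "fields": ["title", "body"],
            }
        }],
        # Boost layer: stricter match; reorders but never filters.
        "should": [{
            "match_phrase": {
                "title": {"query": "albino elephant", "boost": 2.0}
            }
        }],
    }
}
```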