Traditionally used in conjunction with the Probability Ranking Principle (PRP)
“Binary” = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1): x = (x_1, ..., x_n), where x_i = 1 if term i is present in the document and x_i = 0 otherwise
“Independence”: terms occur in documents independently
Different documents can be modeled as the same vector
Queries: binary term incidence vectors
Given query q, for each document d we need to compute p(R|q,d).
Replace this with computing p(R|q,x), where x is the binary term incidence vector representing d.
Interested only in ranking.
Will use odds and Bayes’ Rule:
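The equation for this step does not appear in the text. A reconstruction of the standard derivation, writing NR for non-relevance as the later lines do, would look like this:

```latex
\begin{aligned}
O(R \mid q, x)
  &= \frac{p(R \mid q, x)}{p(NR \mid q, x)}
   = \frac{\dfrac{p(R \mid q)\, p(x \mid R, q)}{p(x \mid q)}}
          {\dfrac{p(NR \mid q)\, p(x \mid NR, q)}{p(x \mid q)}} \\
  &= O(R \mid q) \cdot \frac{p(x \mid R, q)}{p(x \mid NR, q)}
\end{aligned}
```

O(R|q) is constant for a given query, so for ranking only the second factor needs to be estimated.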
Using Independence Assumption:
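As a sketch of what the independence assumption gives here (n, the number of terms in the vocabulary, is notation assumed for this illustration), the likelihood ratio over the whole vector x factors into per-term ratios:

```latex
\frac{p(x \mid R, q)}{p(x \mid NR, q)}
  = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}
```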
So, since each x_i is either 0 or 1:
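The formula for this step is missing from the text. Assuming the usual derivation, the odds become a product over terms, which can then be split according to whether the term is present (x_i = 1) or absent (x_i = 0) in the document:

```latex
\begin{aligned}
O(R \mid q, x)
  &= O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)} \\
  &= O(R \mid q)
     \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, q)}{p(x_i = 1 \mid NR, q)}
     \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, q)}{p(x_i = 0 \mid NR, q)}
\end{aligned}
```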
Let p_i = p(x_i = 1 | R,q); r_i = p(x_i = 1 | NR,q)
Assume, for all terms not occurring in the query (q_i = 0), that p_i = r_i. (This can be changed, e.g., in relevance feedback.)
Then:
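The resulting expression is not reproduced in the text. A reconstruction under the stated assumption, with p_i and r_i denoting the probabilities that term i is present in a relevant and in a non-relevant document respectively: terms with q_i = 0 contribute a factor of 1 and drop out, leaving only query terms.

```latex
\begin{aligned}
O(R \mid q, x)
  &= O(R \mid q)
     \cdot \prod_{x_i = q_i = 1} \frac{p_i}{r_i}
     \cdot \prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i} \\
  &= O(R \mid q)
     \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
     \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}
\end{aligned}
```

The second line multiplies and divides by (1 - p_i)/(1 - r_i) for the matching terms, so that the last product runs over all query terms and no longer depends on the document.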
Retrieval Status Value:
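The definition itself is missing from the text. In the standard formulation, the query-only factors (the prior odds O(R|q) and the product over all query terms) are the same for every document, so only the product over terms present in both query and document matters for ranking. Taking its logarithm gives a sum of per-term coefficients c_i:

```latex
\begin{aligned}
RSV
  &= \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
   = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \\
  &= \sum_{x_i = q_i = 1} c_i,
  \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}
\end{aligned}
```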
It all boils down to computing the RSV.
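A minimal sketch of ranking by RSV, assuming the per-term probabilities p_i and r_i are already known. The function name `rsv`, the set-based document representation, and all numbers below are hypothetical and chosen only for illustration.

```python
import math


def rsv(query_terms, doc_terms, p, r):
    """Sum c_i over the terms that occur in both the query and the document."""
    score = 0.0
    for term in query_terms & doc_terms:
        # c_i = log [ p_i (1 - r_i) / (r_i (1 - p_i)) ]
        score += math.log(p[term] * (1 - r[term]) / (r[term] * (1 - p[term])))
    return score


# Hypothetical probabilities for three query terms (not taken from any data).
p = {"cat": 0.8, "dog": 0.5, "fish": 0.6}  # p_i = p(x_i = 1 | R, q)
r = {"cat": 0.2, "dog": 0.5, "fish": 0.3}  # r_i = p(x_i = 1 | NR, q)

query = {"cat", "dog", "fish"}
docs = {
    "d1": {"cat", "dog", "bird"},
    "d2": {"fish"},
    "d3": {"bird"},
}

# Rank documents by decreasing RSV.
ranking = sorted(docs, key=lambda d: rsv(query, docs[d], p, r), reverse=True)
print(ranking)
```

A term with p_i = r_i (here "dog") contributes log 1 = 0, so it does not affect the ranking, and terms outside the query (here "bird") are never looked at.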
So, how do we compute the c_i's from our data?
Estimating RSV coefficients.
For each term i look at this table of document counts:
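The table itself is not reproduced in the text. The sketch below assumes the usual contingency counts for a term i: N documents in total, S of them relevant, n containing the term, and s both relevant and containing the term. These symbol names, the function name `estimate_c`, and the add-0.5 smoothing default are assumptions made for this illustration. With these counts, p_i is estimated as s/S and r_i as (n - s)/(N - S).

```python
import math


def estimate_c(s, S, n, N, k=0.5):
    """Estimate c_i from document counts (assumed notation, see lead-in).

    s: relevant documents containing the term
    S: relevant documents
    n: documents containing the term
    N: all documents
    k: smoothing constant added to each cell; k=0 gives the unsmoothed estimate
    """
    rel_with = s + k                    # relevant, term present
    rel_without = S - s + k             # relevant, term absent
    nonrel_with = n - s + k             # non-relevant, term present
    nonrel_without = N - n - S + s + k  # non-relevant, term absent
    # c_i = log [ (p_i / (1 - p_i)) / (r_i / (1 - r_i)) ] expressed in counts
    return math.log((rel_with / rel_without) / (nonrel_with / nonrel_without))


# Hypothetical counts: 1000 documents, 10 relevant, term in 100 documents,
# 8 of which are relevant.
c_i = estimate_c(s=8, S=10, n=100, N=1000)
print(c_i)
```

The smoothing constant keeps the estimate finite when a cell of the table is zero, for example when no relevant document contains the term.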