Evaluation at large search engines

  • Search engines have test collections of queries and hand-ranked results

  • Recall is difficult to measure on the web (the full set of relevant pages for a query is unknown)

  • Search engines often use precision at top k, e.g., k = 10

  • … or measures that reward you more for getting rank 1 right than for getting rank 10 right.

    • NDCG (Normalized Discounted Cumulative Gain)

  • Search engines also use non-relevance-based measures.

    • Clickthrough on first result

      • Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate.

    • Studies of user behavior in the lab

    • A/B testing
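
To make the two relevance-based measures above concrete, here is a minimal sketch of precision at k and NDCG at k. It assumes graded relevance judgments given as a list in ranked order (0 = not relevant), uses the common log2(rank + 1) discount, and computes the ideal ranking by re-sorting the same judged list. The function names and the example judgments are illustrative, not taken from any particular search engine's evaluation code.

```python
import math


def precision_at_k(gains, k):
    """Fraction of the top-k results judged relevant (gain > 0)."""
    return sum(1 for g in gains[:k] if g > 0) / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain: the gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the ideal (descending-gain) ordering.

    Assumption: `gains` holds every judged document for the query, so
    sorting it yields the ideal ranking.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical judgments for one query, in ranked order.
good_ranking = [3, 2, 0, 1]   # best document at rank 1
poor_ranking = [1, 2, 0, 3]   # best document buried at rank 4

# Precision at 4 cannot tell the two rankings apart...
print(precision_at_k(good_ranking, 4), precision_at_k(poor_ranking, 4))
# ...but NDCG rewards placing the best document first.
print(ndcg_at_k(good_ranking, 4), ndcg_at_k(poor_ranking, 4))
```

Both rankings contain the same three relevant documents in the top 4, so precision at k is identical, while NDCG is higher for the ranking that gets rank 1 right.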

