Basic assumptions of Information Retrieval

  • Collection: Fixed set of documents
  • Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.



The classic search model



How good are the retrieved docs?

  • Precision : Fraction of retrieved docs that are relevant to user’s information need

  • Recall : Fraction of relevant docs in collection that are retrieved

  • More precise definitions and measurements to follow in later lectures



Bigger collections

  • Consider N = 1 million documents, each with about 1000 words.

  • Avg 6 bytes/word including spaces/punctuation

    • 6GB of data in the documents.

  • Say there are M = 500K distinct terms among these.



Can’t build the matrix

  • 500K x 1M matrix has half-a-trillion 0’s and 1’s.

  • But it has no more than one billion 1’s.                               Why? 

    • matrix is extremely sparse.

  • What’s a better representation?

    • We only record the 1 positions.



Inverted index

  • For each term t, we must store a list of all documents that contain t.
      • Identify each by a docID, a document serial number
  • Can we use fixed-size arrays for this?
Brutus124113145 173 174 
Caesar 124 5 616 57 32
Calpurnia 23154 101  

 


                     What happens if the word Caesar is added to document 14? 



Inverted index

  • We need variable-size postings lists

    • On disk, a continuous run of postings is normal and best

    • In memory, can use linked lists or variable length arrays

      • Some tradeoffs in size/ease of insertion                         Posting(below)

Brutus124113145 173 174 
Caesar 124 5 616 57 32
Calpurnia 23154 101 


 

 


Brutus , Caeser and Calpurnia are dictionaries.

Numbers are postings.

Sorted by docID (more later on why).



Inverted index construction



Indexer steps: Token sequence



Indexer steps: Sort



Indexer steps: Dictionary & Postings



Where do we pay in storage?



The index we just built

  • How do we process a query?

    • Later - what kinds of queries can we process?                ← Today’s focus 



Query processing: AND

  • Consider processing the query: Brutus AND Caesar

    • Locate Brutus in the Dictionary;

      • Retrieve its postings.

    • Locate Caesar in the Dictionary;

      • Retrieve its postings.

    • Merge” the two postings:



The merge

  • Walk through the two postings simultaneously, in time linear in the total number of postings entries


Intersecting two postings lists (a "merge algorithm")





Creator: Tgbyrdmc

Contributors:
-


Licensed under the Creative Commons
Attribution ShareAlike CC-BY-SA license


This deck was created using SlideWiki.