Information Retrieval---Search Engines in the Web

This is a graduate level course on Information Retrieval with emphasis on search engines in the Web. The goal of the course is to guide the students in a journey---a journey aimed at the design, implementation, and testing of a Web search engine. This search engine should be able to receive and answer queries posed in the domain of the Brazilian Web.

While the design and conception of the various modules that compose the engine---crawler, indexer, and query processor---will be discussed in class, the students will be responsible for implementing them in C++ and validating their implementations during the course. Additionally, the students will explore techniques of machine learning to improve the ranking of the results. Evaluation will consist of three implementation assignments (crawler, indexer, and query processor) and two exams.

This is a rigorous and intensive course with a non-trivial load of theoretical material and practical tasks in the form of coding in C++. At the end of the course, the students should gain a technical understanding of how a search engine works and the challenges at the frontier of the technology---they should understand why search is hard.

Required Textbook

Ricardo Baeza-Yates & Berthier Ribeiro-Neto. Modern Information Retrieval. Pearson, 2011, 2nd Edition.
Slides can be found here.

Other Useful References

1. Ian Witten, Alistair Moffat, Timothy Bell. Managing Gigabytes. Morgan Kauffman, 1999, 2nd Edition.
2. Christopher Manning, Prabakhar Raghavan, Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008.
3. Bruce Croft, Donald Metzer, Trevor Strohman. Search Engines: Information Retrieval in Practice. Pearson, 2009.
4. Charles Clarke, Gordon Cormack, Stefan Buttcher. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press, 2016.
5. Gerard Salton. Introduction to Modern Information Retrieval. McGraw Hill, 1983.

Calendar: Classes, Assignments, Tests
Mar 5 Introduction: the e-publishing era, basic architecture of a search engine (Chapter 1)
Mar 7 Crawling the Web: history, taxonomy of crawlers, architecture and implementation, scheduling of URLs (Chapter 12, slides 1-44)
Mar 12 Crawling the Web: page events, freshness, exclusion protocol, efficiency, evaluation (Chapter 12, slides 45-91)

Assignment 1: Implementation of a Crawler (20 points)
Due Date: Apr 8th, 23:59pm, BR time
Penalty for late delivery: 20% by Apr 15 and 40% by Apr 22
Mar 14 Modeling: basic concepts, Boolean Model, TF-IDF weights (Chapter 3, slides 1-48)
Mar 19 Modeling: document normalization, Vector Model, Link Analysis, Probabilistic Model (Chapter 3, slides 49-86 and Chapter 11, slides 70-85)
Mar 21 Modeling: Set-based Model, Generalized Vector Model, BM25 (Chapter 3, slides 89-107, 133-144, 164-176)
Mar 26 Modeling: Language models, Belief Network models (Chapter 3, slides 177-205, 219-247)
Mar 28 Indexing: basics, inverted indexes, compressed inverted indexes (slides 1-48)
Apr 2 RA assistantship for Assignment 1.
Apr 4 RA assistantship for Assignment 1.
Apr 9 Indexing: implementation of compressed inverted indexes (slides: indexing, vocabulary, compression)

Assignment 2: Implementation of a Compressed Index (20 points).
Due Date: May 13th, 23:59pm, BR time.
Penalty for late delivery: 20% by May 20th and 40% by May 27th.
Apr 11 Test I (20 points)
Apr 16 Retrieval Evaluation: basics, precision-recall, single value summaries, user evaluation (Chapter 4, slides 1-41)
Apr 18 Retrieval Evaluation: DCG, Bpref, rank correlation metrics, Spearman, Kendall-Tau (Chapter 4, slides 42-93)
Apr 23 Retrieval Evaluation: reference collections, TREC, OSHUMED, INEX, user based evaluation, side-by-side, A/B testing, clickthrough data (Chapter 4, slides 94-144)
Apr 25 Relevance Feedback (RF): basics, explicit RF, Rocchio, clicks (Chapter 5, slides 1-53)
Apr 30 Relevance Feedback (RF): implicit RF, local analysis, global analysis (Chapter 5, slides 54-104)
May 2 Text Classification: basics, machine learning, unsupervised learning, supervised learning, decision trees (Chapter 8, slides 1-51)
May 7 RA assistantship for assignment 2.
May 9 RA assistantship for assignment 2.
May 14 Text Classification: KNN, Rocchio, Naive Bayes, SVM, Ensemble, Stacking (Chapter 8, slides 52-111)

Assignment 3: Implementation of a Query Processor with Page Rank (20 points).
Due Date: Jun 17th, 23:59pm, BR time.
Hard deadline: assignments not accepted after due date.
May 16 Text Classification: feature selection, tf-idf, mutual information, information gain, chi-square, evaluation metrics, precision-recall, F-measures, cross-validation, standard collections (Chapter 8, slides 112-157)
May 21 Web Retrieval: characterizing the Web, the Web graph, modeling the Web, link analysis, search engine architectures, caching (Chapter 11, slides 1-50)
May 23 Web Retrieval: multiple indices, multi-site architecture, search engine ranking signals, learning to rank, clickthrough data, Web spam (Chapter 11, slides 51-69, 86-98)
May 28 Web Retrieval: managing Web data, document identifiers, metadata, compressing the Web graph, duplicated data, user interaction, user interfaces, the retangle search box, title-snippet-url entity, universal search, spelling assistance, query recommendations, query refinements, actionable results (Chapter 11, slides 99-163)
May 30 Test II (20 points)
Jun 4 RA assistantship for assignment 3.
Jun 6 RA assistantship for assignment 3.
Jun 11 RA assistantship for assignment 3.
Jun 13 RA assistantship for assignment 3.
Jun 18 No class.
Jun 20 Discussion, evaluation of course, final results.