This is a graduate level course on Information Retrieval with emphasis on search engines in the Web. The goal of the course is to guide the students in a journey---a journey aimed at the design, implementation, and testing of a Web search engine. This search engine should be able to receive and answer queries posed in the domain of the Brazilian Web.
While the design and conception of the various modules that compose the engine---crawler, indexer, and query processor---will be discussed in class, the students will be responsible for implementing them in C++ and validating their implementations during the course. Additionally, the students will explore techniques of machine learning to improve the ranking of the results. Evaluation will consist of three implementation assignments (crawler, indexer, and query processor) and two exams.
This is a rigorous and intensive course with a non-trivial load of theoretical material and practical tasks in the form of coding in C++. At the end of the course, the students should gain a technical understanding of how a search engine works and the challenges at the frontier of the technology---they should understand why search is hard.
Required Textbook
Ricardo Baeza-Yates & Berthier Ribeiro-Neto. Modern Information
Retrieval. Pearson, 2011, 2nd Edition.
Slides can be
found here.
Other Useful References
1. Ian Witten, Alistair Moffat, Timothy Bell. Managing
Gigabytes. Morgan Kauffman, 1999, 2nd Edition.
2. Christopher Manning, Prabakhar Raghavan, Hinrich
Schutze. Introduction to Information Retrieval. Cambridge
University Press, 2008.
3. Bruce Croft, Donald Metzer, Trevor Strohman. Search Engines:
Information Retrieval in Practice. Pearson, 2009.
4. Charles Clarke, Gordon Cormack, Stefan Buttcher. Information
Retrieval: Implementing and Evaluating Search Engines. MIT
Press, 2016.
5. Gerard Salton. Introduction to Modern Information
Retrieval. McGraw Hill, 1983.
| Calendar: Classes, Assignments, Tests | |
| Mar 5 | Introduction: the e-publishing era, basic architecture of a search engine (Chapter 1) |
| Mar 7 | Crawling the Web: history, taxonomy of crawlers, architecture and implementation, scheduling of URLs (Chapter 12, slides 1-44) |
| Mar 12 | Crawling the Web: page events, freshness, exclusion protocol,
efficiency, evaluation (Chapter 12, slides 45-91) Assignment 1: Implementation of a Crawler (20 points) Due Date: Apr 8th, 23:59pm, BR time Penalty for late delivery: 20% by Apr 15 and 40% by Apr 22 |
| Mar 14 | Modeling: basic concepts, Boolean Model, TF-IDF weights (Chapter 3, slides 1-48) |
| Mar 19 | Modeling: document normalization, Vector Model, Link Analysis, Probabilistic Model (Chapter 3, slides 49-86 and Chapter 11, slides 70-85) |
| Mar 21 | Modeling: Set-based Model, Generalized Vector Model, BM25 (Chapter 3, slides 89-107, 133-144, 164-176) |
| Mar 26 | Modeling: Language models, Belief Network models (Chapter 3, slides 177-205, 219-247) |
| Mar 28 | Indexing: basics, inverted indexes, compressed inverted indexes (slides 1-48) |
| Apr 2 | RA assistantship for Assignment 1. |
| Apr 4 | RA assistantship for Assignment 1. |
| Apr 9 | Indexing: implementation of compressed inverted indexes
(slides: indexing,
vocabulary,
compression) Assignment 2: Implementation of a Compressed Index (20 points). Due Date: May 13th, 23:59pm, BR time. Penalty for late delivery: 20% by May 20th and 40% by May 27th. |
| Apr 11 | Test I (20 points) |
| Apr 16 | Retrieval Evaluation: basics, precision-recall, single value summaries, user evaluation (Chapter 4, slides 1-41) |
| Apr 18 | Retrieval Evaluation: DCG, Bpref, rank correlation metrics, Spearman, Kendall-Tau (Chapter 4, slides 42-93) |
| Apr 23 | Retrieval Evaluation: reference collections, TREC, OSHUMED, INEX, user based evaluation, side-by-side, A/B testing, clickthrough data (Chapter 4, slides 94-144) |
| Apr 25 | Relevance Feedback (RF): basics, explicit RF, Rocchio, clicks (Chapter 5, slides 1-53) |
| Apr 30 | Relevance Feedback (RF): implicit RF, local analysis, global analysis (Chapter 5, slides 54-104) |
| May 2 | Text Classification: basics, machine learning, unsupervised learning, supervised learning, decision trees (Chapter 8, slides 1-51) |
| May 7 | RA assistantship for assignment 2. |
| May 9 | RA assistantship for assignment 2. |
| May 14 | Text Classification: KNN, Rocchio, Naive Bayes, SVM, Ensemble,
Stacking (Chapter 8, slides 52-111) Assignment 3: Implementation of a Query Processor with Page Rank (20 points). Due Date: Jun 17th, 23:59pm, BR time. Hard deadline: assignments not accepted after due date. |
| May 16 | Text Classification: feature selection, tf-idf, mutual information, information gain, chi-square, evaluation metrics, precision-recall, F-measures, cross-validation, standard collections (Chapter 8, slides 112-157) |
| May 21 | Web Retrieval: characterizing the Web, the Web graph, modeling the Web, link analysis, search engine architectures, caching (Chapter 11, slides 1-50) |
| May 23 | Web Retrieval: multiple indices, multi-site architecture, search engine ranking signals, learning to rank, clickthrough data, Web spam (Chapter 11, slides 51-69, 86-98) |
| May 28 | Web Retrieval: managing Web data, document identifiers, metadata, compressing the Web graph, duplicated data, user interaction, user interfaces, the retangle search box, title-snippet-url entity, universal search, spelling assistance, query recommendations, query refinements, actionable results (Chapter 11, slides 99-163) |
| May 30 | Test II (20 points) |
| Jun 4 | RA assistantship for assignment 3. |
| Jun 6 | RA assistantship for assignment 3. |
| Jun 11 | RA assistantship for assignment 3. |
| Jun 13 | RA assistantship for assignment 3. |
| Jun 18 | No class. |
| Jun 20 | Discussion, evaluation of course, final results. |