605.744: Information Retrieval, Spring 2011
- Lecturer
- Paul McNamee
Textbooks
Course Times and Location
- Lecture: Monday, 7:20pm - 10:00pm, Room K-2.
- Email should be the primary means of out-of-class communication; however,
I can meet with students by appointment.
- Course Overview
- This course covers the storage and retrieval of unstructured digital
information. Topics include automatic index construction,
retrieval models, textual representations, efficiency issues,
search engines, text classification, and multilingual retrieval.
- Grading Policy
- Grades are assigned on a letter scale (A, B, C, etc.), per
university policy. Work for the class will include homework
assignments, an independent research project, exams,
and classroom participation (e.g., quizzes, oral presentations, paper summaries).
Refer to the course outline for details.
- Academic Integrity
- Work for this class is expected to be the result of individual
effort; however, unless explicitly prohibited, it is perfectly
acceptable to make use of published examples and source code
from the literature or public domain - but only if attribution
is given.
Furthermore, while it is permissible to discuss the general
nature of lecture material and assignments with your peers, this
does not extend to discussing or revealing solutions or source
code.
Students are expected to uphold the academic integrity of the
university.
Students who use published material without attribution, or who copy
the work (particularly source code) of another individual,
will face consequences such as receiving a zero on the
assignment and having the matter referred to the dean.
Contact me if you have any questions, no matter how
slight, about this policy, or if you have questions about a
particular assignment.
Assigned Readings
- 1/31/11 Chapters 1 and 2 in Manning, Raghavan, and Schütze.
- Michael Lesk, The Seven Ages of Information Retrieval
- 2/7/11 Chapters 3 and 4 in Manning, Raghavan, and Schütze.
- 2/14/11 Chapters 5 - 7 in Manning, Raghavan, and Schütze.
- 2/21/11 Chapters 8 and 9 in Manning, Raghavan, and Schütze.
- Economic Impact of TREC. Executive Summary and Chapters 1-3
- 2/28/11 Chapters 11 and 12 in Manning, Raghavan, and Schütze.
- 3/7/11 Chapters 13, 14, and 15 in Manning, Raghavan, and Schütze.
- 3/7/11 Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features.
- 3/7/11 Optional. Goodman et al., Spam and the ongoing battle for the inbox, CACM 50(2), pp. 24-33, 2007.
- 3/14/11 Chapters 19, 20, and 21 in Manning, Raghavan, and Schütze.
- 4/4/11 K. Kishida, Technical
issues of cross-language information retrieval: a review,
IPM 41, pp. 433-455, 2005.
- 4/18/11 M. Sanderson, Retrieving with Good Sense, Information Retrieval 2(1), pp. 49-69, 2000.
Handouts
- 1/31/11 Course Description and tentative outline
- 1/31/11 Lecture notes for week 1. Introduction. Boolean Retrieval.
- 2/7/11 Lecture notes for week 2. Tolerant Retrieval. Index Construction.
- 2/14/11 Lecture notes for week 3. Efficiency. Vector models.
- 2/21/11 Lecture notes for week 4. Evaluation. Relevance Feedback.
- 2/28/11 Lecture notes for week 5. Binary Independence Model and Statistical Language Models.
- 3/7/11 Lecture notes for week 6. Text classification.
- 3/14/11 Lecture notes for week 7. Web IR.
- 3/14/11 Practice exam distributed in class (Email me if you didn't get it).
- 4/4/11 Lecture notes for week 9. Multilingual IR.
- 4/11/11 Lecture notes for week 10. Distributed IR and Multimedia Retrieval.
- 4/18/11 Lecture notes for week 11. NLP and IR.
- 4/25/11 Lecture notes for week 12. Information Extraction and Question Answering.
Assignments
Course related web-links
- Sources for on-line papers:
CiteSeer
ACL Anthology
TREC Publications
ACM Digital Library
- IR Textbooks:
Information Retrieval: Implementing and Evaluating Search Engines,
Managing Gigabytes,
Information Retrieval: Algorithms and Heuristics
Readings in Information Retrieval (Amazon),
Foundations of Statistical Natural Language Processing
- IR Evaluations: TREC,
CLEF,
NTCIR,
FIRE
ROMIP (a Russian language evaluation)
- Organizations that distribute corpora:
LDC,
ELRA
- IR Journals:
JASIST,
IP&M,
IR
ACM Transactions on Speech and Language Processing
- IR-related conferences:
SIGIR,
CIKM 2010,
KDD,
ACL 2010
WWW-2010
CEAS 2010 (email spam)
AIRWeb (web spam)
ISMIR (music IR)
- Blogs
Probably Irrelevant,
NLPers
- On-line magazines:
The Noisy Channel,
Search Engine Watch,
D-Lib Magazine
- Peter Norvig's tutorial on spelling correction.
- Berkeley Primer: Finding Information on the Internet
- HLT Central Repository
- Discrete Mathematics Primer
- Web Protocols:
HTML,
Z39.50 (Information Retrieval)
Software Resources
- Lucene: a popular open-source search engine library
- Wumpus system
- Lemur / Indri: a language modelling IR toolkit
- Cornell's SMART system
- Managing Gigabytes mg system
- A very nice list of NLP, IR, and CL resources (e.g., parsers, taggers) at
Stanford.
- University of Michigan tool suite: Clairlib
- Trigrams'n'Tags
(TnT) toolkit, a visible Markov model tagger written by Thorsten
Brants (now of Google).
- QTag
a probabilistic POS-tagger.
- On-line translators: Systran,
FreeTranslation.com
- Google's on-line translation service: Google Translate
- WordNet, a
lexical database for English
- Andrew McCallum's MALLET
toolkit, a Java-based API for machine learning applications
using Conditional Random Fields
- Wget
- Perl LWP library (at CPAN).
- Public
Domain Speech Recognition Software (from Mississippi State University).
- Machine Learning / Data Mining tool: WEKA
- Joachims' Support Vector Machine toolkit: SVMlight
- SVM-Multiclass, a multi-class version of SVMlight.
- Python-based set of tools for NLP tasks (parsing, POS tagging, etc.): NLTK
Cool Demos
- A 'meta' search engine: Dogpile
- A question-answering system: START
- An online joke recommendation system that demonstrates
collaborative filtering:
JESTER
- A faux computer science paper generator,
SCIgen, from MIT
- No IR system with 3 billion queries a day is going to be perfect. Best of Google Bloopers ;-).
IR Test collections
Web Search Engines
JHU Links
Regional Links
Paul McNamee:
http://apl.jhu.edu/~paulmac/
(paulmac@apl.jhu.edu)