CSCI E-108
Data Mining, Discovery, and Exploration
Extracting actionable insights and relationships from massive, complex datasets is the domain of data mining.
Data mining has wide-ranging applications in science and technology, where datasets are often too large for the algorithms commonly applied at small scale.
This course addresses several key aspects of data mining. These include the use of key-value pairs and hashing methods to manage and compute analytics for massive-scale datasets; highly scalable approximate similarity search and embedding algorithms for information retrieval, as used in retrieval-augmented generation (RAG), web search, image search, and recommendation systems; and algorithms for ranking search and recommendation results. The course also covers highly memory-efficient sketch algorithms for unbounded data, such as streaming data and online processing of massive datasets; unsupervised learning, including clustering models and dimensionality reduction algorithms, for finding and exploring relationships in massive, complex datasets; and graph representations and algorithms for search and social network analysis.
The course comprises readings and lectures on theory along with hands-on exercises and projects where students apply the theory through Python coding and interpretation of results.
The hands-on component of the course uses a variety of Python libraries, including scikit-learn, NetworkX, and FAISS, along with deep learning platforms and packages.
Students enrolled for graduate credit are required to perform, present, and report on an independent project.
This project must demonstrate mastery of the methods covered in the course as applied to a suitable real-world dataset.
Students may not take both CSCI E-96 and CSCI E-108 for degree or certificate credit.