George Karypis

Data Mining

The goal of this project is to develop effective and computationally efficient algorithms for analyzing large volumes of data. The ultimate purpose of these analyses is to discover key and actionable information and gain insights about the underlying processes/systems that created the data (or are being described by the data).

Data mining is becoming increasingly important as advances in data collection have led to explosive growth in the amount of available data. Data mining algorithms are used extensively to analyze business, commerce, scientific, engineering, and security data, and they dramatically improve the effectiveness of applications in areas such as marketing, predictive modeling, life sciences, information retrieval, and engineering.

Our research initially focused on developing high-performance, scalable parallel algorithms for solving core data mining problems. In recent years, it has expanded to include fundamental data mining algorithms for data clustering, classification, pattern discovery, sequence mining, and graph mining, as well as their applications in information retrieval, collaborative filtering, and bioinformatics.

Our latest research focuses on the following areas:

Algorithms for finding meaningful clusters in large sparse graphs like those arising in relational/social networks and the web.
Large-margin and kernel-based classification algorithms, with an emphasis on algorithms that can learn arbitrary output spaces.
Algorithms that can mine large and complex graphs.

Over the years, this research has been funded by a number of federal agencies, including ARL, NSF, and NIH.

Software

Many of the algorithms that we have developed are publicly available as stand-alone programs and libraries, which are used extensively at many academic, government, and industrial sites. These include the CLUTO clustering toolkit, which implements different classes of feature- and similarity-based clustering algorithms; the SUGGEST and SLIM libraries of scalable collaborative-filtering-based recommendation algorithms; and the PAFI pattern-finding toolkit, which contains various algorithms for finding frequent patterns in transaction, sequence, and graph databases.

All of these tools are available for download from our Software page.