Practical Data Complexity Analysis in Pattern Recognition
Tin Kam Ho
Data complexity analysis aims to provide a scientific basis for relating the behavior of classifiers to intrinsic characteristics of the data available for training them. The analysis seeks to explain why a classifier's observed performance varies across tasks, and thereby offers guidance on selecting a classification method for a given task. The data complexity measures can also be used to evaluate alternative set-ups of a classification task, or to assess how different feature transformations may affect the difficulty of the underlying task. In a way, the complexity measures are “features” of a classification task, which can support meta-learning for the decisions to be made for each task, e.g. which classifier to use.
In this tutorial we review the concepts and methods for data complexity analysis, and its successes and shortcomings. We then describe available code repositories for performing such analysis on arbitrary learning tasks. A hands-on exercise section is designed for attendees, who can choose (or bring) a data set and launch an analysis using a set of online tools that host the analysis code.
Discussion of the results is to be conducted in class following the online trials.
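To make the idea of a complexity measure as a “feature” of a task concrete, the following is a minimal sketch of one measure commonly cited in the data complexity literature, the maximum Fisher's discriminant ratio for a two-class problem. It is an illustration only, not the code used in the tutorial: the function name max_fisher_ratio and the toy data are assumptions made here, and the sketch uses population variances and plain Python lists.

```python
from statistics import fmean, pvariance


def max_fisher_ratio(X, y):
    """Maximum over features of (mean_a - mean_b)^2 / (var_a + var_b).

    X is a list of equal-length numeric rows, y a list of two distinct labels.
    Larger values suggest that at least one feature separates the classes well.
    (Hypothetical helper written for this illustration.)
    """
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")

    best = 0.0
    for j in range(len(X[0])):
        # Split the j-th feature by class label.
        a = [float(row[j]) for row, lab in zip(X, y) if lab == labels[0]]
        b = [float(row[j]) for row, lab in zip(X, y) if lab == labels[1]]
        num = (fmean(a) - fmean(b)) ** 2
        den = pvariance(a) + pvariance(b)
        if den > 0:
            ratio = num / den
        else:
            # Zero within-class spread: perfectly separable or uninformative.
            ratio = float("inf") if num > 0 else 0.0
        best = max(best, ratio)
    return best


# Toy task: feature 0 separates the classes, feature 1 does not.
X = [[0, 0], [2, 1], [10, 0], [12, 1]]
y = [0, 0, 1, 1]
print(max_fisher_ratio(X, y))  # 50.0
```

A value computed this way could serve as one entry in a vector of task descriptors for meta-learning; a real analysis would combine several measures rather than rely on a single one.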
The tutorial is planned for 3 hours, and will be given in two parts, about 1.5 hours each, with a brief break in between:
Part I: Review of data complexity analysis: concepts, tools, and use cases.
· Background, motivation, methods, and tools for data complexity analysis.
· Previous applications: comparing classifiers' domains of competence, comparing families of problems generated by systematic changes (e.g. to the parameters of Gaussian distributions), and evaluating feature transformations.
Part II: Applications: accessible code, with hands-on exercises.
· Available code libraries and online accessible functions.
· Suggested experiments with data complexity analysis in selected domains, with available examples from an image classification task, a text categorization task, choices of data from ML repositories, or “bring-your-own tasks”.
· Presentation and discussion of results.
· Open problems and recommendations for follow-up studies.
Tin Kam Ho is a lead scientist in artificial intelligence research and applications at IBM Watson. Previously, she led a department in statistics and machine learning research at Bell Labs. She pioneered research in multiple classifier systems, random decision forests, and data complexity analysis. Over her career she has contributed to many application domains of pattern recognition and data analysis, including multilingual reading machines, optical network design and monitoring, wireless geolocation, and smart grid demand forecasting. She served as Editor-in-Chief of Pattern Recognition Letters from 2004 to 2010, and as Editor or Associate Editor for several other journals, including IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Recognition, and International Journal on Document Analysis and Recognition. Her work has been honored with the Pierre Devijver Award in statistical pattern recognition, several Bell Labs awards, and the Young Scientist Award of the International Conference on Document Analysis and Recognition. Her publications have received over 9000 citations. She is a Fellow of the IAPR and the IEEE.