IEEE Toronto Section


Ground Truth Bias in External Cluster Validity Indices

On June 28, 2016 at 2:00 p.m., IEEE CIS Distinguished Lecturer James C. Bezdek will present “Ground Truth Bias in External Cluster Validity Indices”.

Speaker: James C. Bezdek
IEEE CIS Distinguished Lecturer

Day & Time: Tuesday, June 28, 2016
2:00 p.m. – 4:00 p.m.

Location: Room ENG 106, George Vary Engineering & Computing Centre
245 Church St., Toronto, ON, M5B 2K3
(Intersection of Church and Gould)


Contact: Dr. Maryam Davoudpour, Dr. Glaucio Carvalho, Dr. Alireza Sadeghian

Organizers: Signals & Computational Intelligence Chapter, Magnetics Chapter, Instrumentation & Measurement/Robotics & Automation Chapter

Abstract: This talk begins with a short review of clustering that emphasizes external cluster validity indices (CVIs). A method for generalizing external pair-based CVIs (e.g., the crisp Rand and Jaccard indices) to evaluate soft partitions is described and illustrated. Three types of validation experiments conducted with synthetic and real-world labeled data are discussed: “best c” (internal validation with labeled data), and “best I/E” (agreement between an internal and external CVI pair).
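
For readers unfamiliar with pair-based indices, the following is a minimal Python sketch of the crisp Rand and Jaccard indices mentioned above, computed from the four pair counts. It is an illustration only, not the speaker's code; the function names and the count names n11, n10, n01 and n00 are assumptions made for this example, and the soft-partition generalization described in the talk is not covered.

```python
from itertools import combinations


def pair_counts(truth, candidate):
    """Classify every pair of objects by how two crisp labelings treat it.

    Returns (n11, n10, n01, n00):
      n11 = pairs grouped together in both labelings
      n10 = pairs together in `truth` only
      n01 = pairs together in `candidate` only
      n00 = pairs separated in both labelings
    """
    if len(truth) != len(candidate):
        raise ValueError("label lists must have the same length")
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_candidate = candidate[i] == candidate[j]
        if same_truth and same_candidate:
            n11 += 1
        elif same_truth:
            n10 += 1
        elif same_candidate:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def rand_index(truth, candidate):
    """Fraction of pairs on which the two labelings agree."""
    n11, n10, n01, n00 = pair_counts(truth, candidate)
    total = n11 + n10 + n01 + n00
    return (n11 + n00) / total if total else 1.0


def jaccard_index(truth, candidate):
    """Like Rand, but ignores pairs separated in both labelings."""
    n11, n10, n01, _ = pair_counts(truth, candidate)
    denom = n11 + n10 + n01
    return n11 / denom if denom else 1.0
```

For example, with ground truth labels [0, 0, 1, 1] and candidate labels [0, 0, 1, 2], one pair is together in both labelings, one is together in the ground truth only, and four are separated in both, so the Rand index is 5/6 and the Jaccard index is 1/2.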

As is always the case in cluster validity, conclusions based on empirical evidence are at the mercy of the data, so the reported results might be invalid for different data sets and/or clustering models and algorithms. But much more importantly, we discovered during these tests that some external cluster validity indices are also at the mercy of the distribution of the ground truth itself. We believe that our study of this surprising fact is the first systematic analysis of a largely unknown but very important problem: bias due to the distribution of the ground truth partition.

Specifically, in addition to the well-known bias in many external CVIs caused by monotonic dependency on c, the number of clusters in candidate partitions, there are two additional kinds of bias that can be caused by an unusual distribution of the clusters in the ground truth partition provided with labeled data. The most important ground truth bias is caused by imbalance (unequally sized labeled subsets). We demonstrate these effects with randomized experiments on 25 pair-based external CVIs. Then we provide a theoretical analysis of bias due to ground truth for several CVIs by relating Rand’s index to the Havrda-Charvat quadratic entropy.
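
As a rough illustration of how Rand’s index can be tied to a quadratic entropy, the sketch below numerically checks one standard identity: RI = 1 - (n/(n-1))(2H(U,V) - H(U) - H(V)), where H is taken to be 1 minus the sum of squared cluster proportions. This is not necessarily the derivation presented in the talk. The unnormalized form of the entropy is an assumption (Havrda-Charvat entropies are often written with an extra constant factor), and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def quadratic_entropy(labels):
    """Quadratic entropy 1 - sum(p_k ** 2) of a crisp labeling.

    Assumption: the unnormalized form is used; other conventions
    multiply this value by a constant.
    """
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def rand_via_entropy(u, v):
    """Rand index from the quadratic entropies of U, V and their joint labeling."""
    n = len(u)
    h_u = quadratic_entropy(u)
    h_v = quadratic_entropy(v)
    h_uv = quadratic_entropy(list(zip(u, v)))  # joint (contingency) cells
    return 1.0 - (n / (n - 1)) * (2.0 * h_uv - h_u - h_v)


def rand_by_pairs(u, v):
    """Rand index by direct pair counting, used here as a cross-check."""
    pairs = list(combinations(range(len(u)), 2))
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)
```

Because the ground truth enters the identity through H(U), which depends only on the sizes of the labeled subsets, this form makes it plausible that imbalance in the ground truth can shift the index; the sketch only verifies the identity and does not reproduce any experiment from the talk.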

Biography: Jim received the PhD in Applied Mathematics from Cornell University in 1973. Jim is past president of NAFIPS (North American Fuzzy Information Processing Society), IFSA (International Fuzzy Systems Association) and the IEEE CIS (Computational Intelligence Society); founding editor of the Int’l. Jo. Approximate Reasoning and the IEEE Transactions on Fuzzy Systems; Life Fellow of the IEEE and IFSA; and a recipient of the IEEE 3rd Millennium, IEEE CIS Fuzzy Systems Pioneer, and IEEE technical field award Rosenblatt medals. Jim’s interests: woodworking, optimization, motorcycles, pattern recognition, cigars, clustering in very large data, fishing, co-clustering, blues music, wireless sensor networks, poker and visual clustering. And of course, clustering in big data. Jim retired in 2007, and will be coming to a university near you soon.
