Many real-world domains exhibit heterogeneity in their data: different partitions of the data show different relationships between explanatory and response variables. Heterogeneity increases the overall model complexity of predictive learning. Additionally, many real-world domains lack sufficient training data, making the learning algorithm prone to over-fitting, especially when the model complexity is large. However, there often exists a structure among the data instances and their partitions that can be leveraged to reduce model complexity while addressing heterogeneity. In this paper, we present a framework for learning robust predictive models in real-world heterogeneous datasets that lack a sufficient number of training samples. We demonstrate the usefulness of our framework in the domain of remote sensing for forest cover estimation. Through a series of comparative experiments with baseline approaches, we show that our framework: (a) captures meaningful information about heterogeneity in the data, (b) improves prediction performance by addressing data heterogeneity, (c) is robust to over-fitting in the presence of limited training data, and (d) is robust to the choice of the number of partitions used for representing heterogeneity.
|Original language||English (US)|
|Title of host publication||SIAM International Conference on Data Mining 2014, SDM 2014|
|Editors||Mohammed J. Zaki, Arindam Banerjee, Srinivasan Parthasarathy, Pang-Ning Tan, Zoran Obradovic, Chandrika Kamath|
|Publisher||Society for Industrial and Applied Mathematics Publications|
|Number of pages||9|
|State||Published - 2014|
|Event||14th SIAM International Conference on Data Mining, SDM 2014 - Philadelphia, United States|
Duration: Apr 24 2014 → Apr 26 2014
Bibliographical note
Funding Information:
This research was supported in part by the National Science Foundation under Grants IIS-1029711 and IIS-0905581, as well as the Planetary Skin Institute. Access to computing facilities was provided by the University of Minnesota Supercomputing Institute.