توضیحات
ABSTRACT
A dataset is imbalanced if the classification categories are not approximately qually represented. Recent years brought increased interest in applying machine learning techniquesto difficult”real-world” problems, many of which are characterized by imbalanced data. Additionally the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating performanceof a classifier, might not be appropriatewhen the data is imbalanced andlor the costs of different errors vary markedly. In this Chapter, we discuss some of the sampling techniquesused for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.
INTRODUCTION
The issue with imbalance in the class distribution became more pronounced with the applications of the machine learning algorithms to the real world. These applications range from telecommunications management (Ezawaet al., 1996), bioinformatics (Radivojac et al., 2004), text classification (Lewis and Catlett, 1994; Dumais et al., 1998; Mladeni6 and Grobelnik, 1999; Cohen, 1995b), speechrecognition (Liu et al., 2004), to detection of oil spills in satellite images (Kubat et al., 1998). The imbalance can be an artifact of class distribution and/or different costs of errors or examples. It has received attention from machine learningand Data Mining community in form of Workshops (Japkowicz, 2000b; Chawla et al., 2003a; Dietterich et al., 2003; Fem et al. 2004) and Special Issues (Chawla et al., 2004a). The range of papers in these venues exhibitedthe pervasive and ubiquitous nature of the class imbalanceissues faced by the Data Mining community. Samplingmethodologies continue to be popular in the research work. However, the research continues to evolve with different applications,as each application providesa compellingproblem. One focus of the initial workshops was primarily the performance evaluation criteria for mining imbalanced datasets. The limitation of the accuracy as the performance measure was quickly established. ROC curves soon emerged as a popular choice (Fem et al., 2004). 2004) and Special Issues (Chawla et al., 2004a). The range of papers in these venues exhibitedthe pervasive and ubiquitous nature of the class imbalanceissues faced by the Data Mining community. Samplingmethodologies continue to be popular in the research work. However, the research continues to evolve with different applications,as each application providesa compellingproblem. One focus of the initial workshops was primarily the performance evaluation criteria for mining imbalanced datasets. The limitation of the accuracy as the performance measure was quickly established. ROC curves soon emerged as a popular choice (Fem et al., 2004).
Year: 2004
Publishe: Universityof Notre Dame
By: Nitesh V. Chawla
File Information: English Language/ 15 Page / size:2,855KB
Download: click
سال : 2004
ناشر : Universityof Notre Dame
کاری از : Nitesh V. Chawla
اطلاعات فایل : زبان انگلیسی / 15 صفحه / حجم :2,855KB
لینک دانلود : روی همین لینک کلیک کنید
نقد و بررسیها
هیچ دیدگاهی برای این محصول نوشته نشده است.