A hybrid evolutionary algorithm for attribute selection in data mining

ABSTRACT

Real-life data sets are often contaminated with noise, which makes the subsequent data mining process difficult. The classifier's task can be simplified by eliminating attributes that are redundant for classification: retaining only the pertinent attributes reduces the size of the data set and allows a more comprehensible analysis of the extracted patterns or rules. This article proposes a new hybrid approach to attribute selection that combines two conventional machine learning algorithms. Genetic algorithms (GAs) and support vector machines (SVMs) are integrated using a wrapper approach. Specifically, the GA component searches for the best attribute subset by applying the principles of an evolutionary process, and the SVM classifies the patterns in the reduced data sets that correspond to the attribute subsets encoded by the GA chromosomes. The proposed GA-SVM hybrid is validated on data sets obtained from the UCI machine learning repository. Simulation results demonstrate that the GA-SVM hybrid produces good classification accuracy and a higher level of consistency, with results comparable to those of other established algorithms. The hybrid is then improved by using a correlation measure between attributes as a fitness measure to replace the weaker members of the population with newly formed chromosomes, which injects greater diversity and increases the overall fitness of the population. The improved mechanism is validated on the same data sets used in the first stage. The results confirm the improvement in classification accuracy and demonstrate the method's potential as a good classifier for future data mining purposes.
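The wrapper idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: to stay within the Python standard library, a leave-one-out 1-nearest-neighbour classifier stands in for the SVM, and the names `loo_accuracy`, `ga_select`, and `make_toy`, the toy data, and all parameter values are assumptions made for illustration. Each chromosome is a bit mask over the attributes, and its fitness is the accuracy the classifier reaches on the reduced data set.

```python
import random


def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-NN classifier on the masked attributes.

    Assumption: 1-NN is only a stand-in for the SVM used in the paper.
    """
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty attribute subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        best_d, best_label = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue  # leave the query pattern out
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if best_d is None or d < best_d:
                best_d, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def ga_select(X, y, pop_size=12, generations=15, p_mut=0.1, seed=0):
    """Evolve attribute bit masks; fitness is the wrapped classifier's accuracy."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]

    def fitness(mask):
        return loo_accuracy(X, y, mask)

    for _ in range(generations):
        # Keep the fitter half unchanged (elitism), refill with offspring.
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = elite + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def make_toy(seed=1, n_rows=30):
    """Hypothetical data: attribute 0 is informative, attributes 1-3 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_rows):
        label = i % 2
        noise = [rng.uniform(-5, 5) for _ in range(3)]
        X.append([label + rng.uniform(-0.2, 0.2)] + noise)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy()
    mask, score = ga_select(X, y)
    print("selected attributes:", mask, "accuracy:", score)
```

In this sketch the classifier is retrained for every chromosome, which is what makes a wrapper approach more expensive than a filter approach; swapping the stand-in for an SVM would only change the body of `loo_accuracy`.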

INTRODUCTION

Data mining has become an important application because of the abundance of data and the need to extract useful information from raw data. Many useful patterns can be extracted, which helps predict the outcomes of previously unseen scenarios. The knowledge gained from data mining can subsequently be used in applications ranging from business management to medical diagnosis, allowing decision makers to assess situations more accurately. Support vector machines (SVMs) have recently gained recognition as a powerful data mining technique for tackling the problem of knowledge extraction (Burges, 1998). SVMs use kernel functions to map input features from a lower-dimensional space into a higher-dimensional one. Many practical applications exploit the efficiency and accuracy of SVMs, such as intrusion detection (Mukkamala, Janoski, & Sung, 2002) and bioinformatics, where the input features are of very high dimension.
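As a small illustration of the kernel mapping mentioned above, the sketch below uses the homogeneous quadratic kernel on two-dimensional inputs. The kernel choice, the function names, and the sample vectors are assumptions made for illustration; the text does not state which kernel the authors use. The point is that the kernel value computed in the original two-dimensional space equals an ordinary inner product in a three-dimensional feature space.

```python
import math


def poly2_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map for the kernel above: R^2 -> R^3."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, 4.0)
    # Both values agree up to floating-point rounding.
    print(poly2_kernel(x, z), dot(phi(x), phi(z)))
```

Because the kernel gives the higher-dimensional inner product directly, the mapped features never need to be computed explicitly.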

Year: 2009

Publisher: Elsevier

By: K.C. Tan, E.J. Teoh, Q. Yu, K.C. Goh