Database Preprocessing and Comparison between Data Mining Methods

ABSTRACT

Database preprocessing is important for using memory efficiently. Compression is one of the preprocessing steps needed to reduce the memory required to store and load data for processing. The compression method introduced in this paper was tested on proposed examples to show the effect of repetition in the database, as well as of database size; the results showed that the compression ratio increases as repetition increases. Compression is therefore one of the important data preprocessing activities to carry out before data mining. Three data mining methods were tested: Naïve Bayes, Nearest Neighbor and Decision Tree. The implementation of the three methods showed that Naïve Bayes is effective when the data attributes are categorical, and that it can be used successfully in machine learning. Nearest Neighbor is most suitable when the data attributes are continuous or categorical. The third method tested, the Decision Tree, is a simple predictive method that classifies data by applying simple rules. The success of a data mining implementation depends on the completeness of the database, represented by the data warehouse, which must be organized according to the important characteristics of a data warehouse.
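The relationship between repetition and compression ratio can be illustrated with a small experiment. The sketch below is not the compression method proposed in the paper, which is not described in this abstract; it uses Python's standard zlib (DEFLATE) compressor as a stand-in, and the helper names `make_column` and `compression_ratio`, the synthetic values, and the row counts are all assumptions made for illustration.

```python
import random
import zlib


def make_column(n_rows, n_distinct, seed=0):
    """Build a synthetic column of n_rows values drawn from n_distinct strings.

    Fewer distinct values means more repetition in the column.
    """
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    values = [f"value_{i:06d}" for i in range(n_distinct)]
    return [rng.choice(values) for _ in range(n_rows)]


def compression_ratio(rows):
    """Return original size / compressed size, with rows stored one per line."""
    raw = "\n".join(rows).encode("utf-8")
    packed = zlib.compress(raw, 9)  # stand-in compressor, not the paper's method
    return len(raw) / len(packed)


if __name__ == "__main__":
    n_rows = 10_000
    # Same number of rows each time; only the amount of repetition changes.
    for n_distinct in (10_000, 1_000, 100, 10):
        ratio = compression_ratio(make_column(n_rows, n_distinct))
        print(f"{n_distinct:>6} distinct values -> compression ratio {ratio:.2f}")
```

With the number of rows held fixed, the column with fewer distinct values (more repetition) should compress to a smaller size, which is the qualitative effect the abstract reports. The exact ratios depend on the compressor and on how the data is serialized.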

INTRODUCTION

The extraction of useful and non-trivial information from the huge amounts of data that can be collected in many diverse fields of science, business and engineering is called Data Mining (DM). DM is part of a bigger framework, referred to as Knowledge Discovery in Databases (KDD), which covers a complex process running from data preparation to knowledge modeling. Data compression is one of the preparation methods needed to reduce the size of very large databases. Data mining is a process used to identify hidden, unexpected patterns or relationships in large quantities of data. Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts and the Management Information Systems (MIS) community. The phrase knowledge discovery in databases was coined at the first KDD workshop to emphasize that knowledge is the end product of data-driven discovery.

Year: 2017

Publisher: IEEE

By: Yas A. Alsultanny