Description
ABSTRACT
The duplicate elimination problem of detecting multiple tuples that describe the same real-world entity is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
INTRODUCTION
Decision support analysis on data warehouses influences important business decisions; therefore, the accuracy of such analysis is crucial. However, data received at the data warehouse from external sources usually contains errors: spelling mistakes, inconsistent conventions, etc. Hence, a significant amount of time and money is spent on data cleaning, the task of detecting and correcting errors in data.
The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality [e.g., HS95, ME97, RD00]. Often, the same logical real-world entity has multiple representations in the data warehouse. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, [Lisa Simson, Seattle, WA, United States, 98025] and [Lisa Simpson, Seattle, WA, USA, 98025], due to data entry errors.
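To make the example concrete, the following minimal sketch computes a per-attribute edit distance between the two customer tuples above. It is only an illustration of the standard textual similarity functions mentioned in the abstract, not the algorithm developed in the paper; the function name `edit_distance` and the variables `t1`, `t2`, and `distances` are assumptions introduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# The two representations of the same customer from the text above.
t1 = ("Lisa Simson", "Seattle", "WA", "United States", "98025")
t2 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")

# Compare the tuples attribute by attribute.
distances = [edit_distance(x, y) for x, y in zip(t1, t2)]
print(distances)  # [1, 0, 0, 11, 0]
```

The spelling mistake in the name costs a single edit, while the convention difference between "United States" and "USA" costs eleven, even though both values denote the same country. A purely textual measure therefore treats the two kinds of discrepancy very differently.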
Year : 2002
Publisher : Proceedings of the 28th VLDB Conference
By : Rohit Ananthakrishna , Surajit Chaudhuri , Venkatesh Ganti
File Information : English Language / 12 Pages / Size : 150 K