Efficient storage and querying of sequential patterns in database systems
Alexandros Nanopoulos, Maciej Zakrzewicz, Tadeusz Morzy, and Yannis Manolopoulos
The number of patterns discovered by data mining can become tremendous, in some cases exceeding the size of the original database. Therefore, there is a need to query previously generated mining results and to query the database against discovered patterns. In this paper, we focus on developing methods for the storage and querying of large collections of sequential patterns. We describe a family of algorithms that address the ordering among elements, which is crucial when dealing with sequential patterns. Moreover, we take into account the fact that the distribution of elements within sequential patterns is highly skewed, and use it to propose a novel approach for the effective encoding of patterns. Experimental results, which examine a variety of factors, illustrate the efficiency of the proposed method.
Mining from large databases (also referred to as database mining) poses new challenges and opportunities for database technology itself. There is a need for new query languages and query processing methods that address the requirements posed by database mining. Most existing data mining applications, however, assume a loose coupling between the data mining environment and the database. Narrowing the ‘gap’ between data mining and databases refers to the problem of developing data mining algorithms that achieve a tighter coupling with the DBMS. This problem has recently started to be addressed by introducing new design specifications for the DBMS (not having to adhere to third normal form, reduced concurrency control and recovery overhead, synergy between OLTP and data mining). Moreover, efficient algorithms that exploit the support and achieve a tighter coupling with existing DBMS have been proposed (for a comparison of several implementations). Nevertheless, it is important to observe that a major obstacle to the widespread use of data mining technology is not only insufficient performance, but also the absence of a paradigm for the robust development of data mining applications and their integration with the DBMS. Along the lines of the latter observation, Imieliński and Mannila describe a long-term paradigm, called KDDMS (Knowledge and Data Discovery Management System), which is based on developing KDD query languages (see Section 1.1 for a more detailed description), building optimizing compilers for ad hoc mining queries, and providing application programming interfaces (APIs). Although several KDD query languages have been proposed (e.g. Mine-Rule, MSQL, DMQL of DBMiner), few methods have been proposed in the other directions; for instance, OLE DB for Data Mining, the Discovery Board system, or the system proposed.