Summary of a Paper:

 

Data Mining and the Management of Large Dynamic Data Warehouses where Requirements Change over Time

 

Management of large databases, especially those that contain significant narrative data and those which are characterized by changing requirements for field structures, data definitions and retrieval interests, has not been good.

Although significant advances have been made in the sheer administration of large databases up to 50 Terabytes (known as data warehouses) these technologies do nearly nothing for the problem of getting information from the data. This problem gets exponentially harder as the database increases in size, while the problems addressed by data warehousing tend not to increase in complexity nearly as fast.

Mathematically the more "mechanical problem" of storing large databases and retrieving data against a known, correct specification where the specifier also knows the entire structure of the database, increases with size only as (n · log2n) if optimally solved, and increases with size as much as (nk, k>2) for ordinary Boolean searches. However, the challenging problems are those of finding information, not data, against a specification which

1. does not use the same words or terms as the database uses,

2. has uncertainties and preferences implicit in the search

3. if attacked logically often produces too few or too many "hits", and

4. does not demand that the user completely understand the database itself.

That is our challenge.

The purpose of this paper is to examine some of the reasons that data mining has not worked well in the context of "programmers' solutions", and to recommend some mathematically-based solutions which can be implemented today. We do not discuss research in the field, and try to distinguish clearly between what can be done today and what may be done some day in the future.

 

What Doesn't Work Well , and Why

1: Data mining is a proper superset of search and retrieval. In the latter case the user generally knows what is being sought. Example: Finding the telephone number of all Terrell Allens who live in Virginia. In the former case, the user has a general need which can be expressed in a sentence, but not precisely. Example: Using the same telephone book to find a source of equipment for bird watching A human has to make the leap from function required to function catalogued.

In general, the process of transferring a human need to a formal computer query remains difficult, despite recent research, particularly sponsored by ARPA and a Government agency.

Moreover, the facility for proactive software-driven data mining, where the software discovers links in the data, remains a topic for research, because one can always find relationships in a set of random numbers.

2: Nearly all so-called expert systems used for data mining are based on logical search techniques, i.e. derived from Boolean algebra or its isomorph, propositional calculus. There is one major exception: search engines such as those used with Nexus/Lexis, which use a <within> function, e.g., "'Juvenile' within 5 words of 'Delinquent'." However, humans do not think with Boolean Algebra, except in rare instances (at best using formal logic to lend an aura of certainty to conclusions arrived at by induction).

This is, in our experience, the principal reason for the consistent failure of Boolean search methods to retrieve data from large file systems&emdash;whether organized via flat files, relational files or narrative material.

There is another exception: the experimental use of neural net structures to perform searches based on experience. This technology is in its infancy, and at best finds extensional relations among small numbers of semantically-related terms. Neural nets have found their metier primarily where highly repetitive processes with simple, highly repetitive algorithms are dsought, and when a given neural net has learned its program, it can usually be replaced by a simple equation. The price paid in numbers of neurons required (of the order of 3N3, for two-word groups, where N is the number of distinct words on which to search) and in the numbers of training data sets required (usually of the order of N!, which tends to be a big number. Since the practical bound for neural net size is still of the order of 105 neurons, and the brain has somewhere around 1012 neurons, one can understand the limitations which researcher face. In our opinion the neural-network approach to data mining is still a deep research issue for any but "toy problems."

The need exists for using methods which more closely correspond to human reasoning, which re both probabilistic and adaptive to their users. Fortunately, such methods exist.

3: There is no way to completely index a database prospectively in light of the questions that are likely to be asked, assuming that the database has a degree of richness and the user a degree of sophistication. This is the problem that lawyers face with litigation-support databases, which can easily model truckloads of paper. The complexity of indexing is not in finding the words, but the relationships among words and the semantic extensions. A search for "aircraft" may not find "helicopter", but is most unlikely to find "flying machine", the pilot's favorite surrogate.

The need is for indexing methods which rol mechanisms to decide when to do so, and the data mechanisms to adjust prior data to the new formats, are still manual and complex.

 

A Practical Solution Approach

The challenge is to implement qualitatively new solutions to the data mining challenge without additional basic research. This paper responds to that need.

Overview: Formatted and Unformatted Data

In this discussion we consider both formatted field-based files (whether flat, hierarchical or relational) and unformatted fields or entire files which are narrative in nature. The reason: Many databases (and certainly all the challenging ones) have both formatted and unformatted fields, and the fields themselves may not be fixed-length.

The interesting data are often found in the Remarks column.

The interesting information lies not only in the Remarks but also in the semantic and statistical analysis of all the data.

The solution consists of three parts, each of which is in some degree independent but contributed to the whole. Unfortunately the solution does not "map" simply onto the identified challenges, but the relationships will be clear from the discussion.

The Semantic Search Engine: Data-Driven Adaptivity

A search engine which is semantically sensitive, i.e. based on a semantic network in which the relationships among terms (nouns, adjectives, etc.) are defined in terms of a matrix of transitive, asymmetric values derived from the data themselves using the NINA system. The basic word relationship are derived by a probabilistic extension of the Princeton University WordNet System.

(TechLabs has had a continuous dialog with the Princeton University Linguistic Laboratory on this matter. The underlying WordNet is public-domain, and it will be TechLabs' intention to retain this Laboratory for scientific support.)

The context of a record, or narrative may be implicit in the field or file itself, or can be derived from an on-the-fly statistical analysis of the data. Since the data are assumed to be very large and/or streaming, the Modified Bayesian portions of the Nina algorithms are used for this process.

Thus, for example:

Pot is ordinarily identified as a superset of Flowerpot and a parallel level (bi-directional) semantic relative of Kettle., among others. This is not derivable from the data set, but matrices of these relationships are available (see appendix). On the other hand, if the statistical analysis identifies the Frame or context as being that of drugs, Pot is then linked to Ganja.

Formatting from Data Themselves

In any database, whether flat-file, relational or object-oriented, the question of identifying fields is important, because it is difficult to determine in advance what fields are required and what data should be either ignored or consigned to "Remarks" fields.

Actually, in the context proposed, the use of fields is focused on faster retrieval, and the easier development of relational pointers by constraining the entropy of the database itself.

We approach this issue as follows:

Initially, fields have been defined for a database on some logical, historical or legacy-based grounds. Other data, as they occur, are put into narrative fields.

Adding Fields: In a background process, searches and retrievals are examined in terms of the frequency with which hit occur in the narrative sections. When such hits occur substantially in excess of random expectation, a candidate is established for modifying the database architecture to include new field. The data are identified. In terms of what is conservatively acceptable from the current research, the definition of that field however requires human cooperation. The data sets are presented off-line to data specialists along with the software's recommendations. The software specialists then actually accept or produce the field definitions, and the fields are automatically filled by the software. Indexing is of course done in background (see below).

Deleting Fields: The utility of a field is determined by its frequency of access and by the negentropy of its contents. The software (drawn from the entropy-estimation portions of NINA) can, on the fly, estimate the latter and monitor the former. If a given field has little utility for the user and has relatively little information content, it then becomes a candidate for deletion, with the data reverted to a narrative field. Final decisions must however be made by data administrators, since there may be non-informational (e.g. administrative) reasons for having certain fields in a database.

Automated Indexing

The real purpose of indexing is to save time in search and retrieval, reducing processes which grow linearly or worse with database size to processes which grow logarithmically with database size. However, as we have pointed out, classical indexing creates downstream problems because no person is likely to be wise enough to anticipate future needs for indexing terms.

Our solution: as data enter the system, scan each field and narrative section for terms which have the following characteristics:

1. Their occurrence is nonconventional in terms of general statistics of usage, whether higher or lower.

2. Their occurrence is rare enough that usage in a query is unlikely or futile (such as occurrences of the word "the".

3, Pairs and n-tuples of words linked by strong semantic relevance are treated by using the incorporated phrases for indexing.

This is another application of the NINA algorithms relating to the appropriate processing Type I and Type II variation from the norm.

The process is carried out in background using the actual data as received. If the database i relatively static, the job is reduced in size over time. If he database is dynamic, the NINA algorithms will allow for decisions to remove index terms and phrases as well, since a forever-growing index can also be self-defeating.

 

<-- Please click here to return to Papers, Papers, Papers.