Company Info
Technology
Products
Services

 

Spacer gifSpacer gif
Spacer gif
Scianta Intelligence Turning Knowledge into Intelligence

Fuzzy SQL

A Tool for Finding the Truth.
The Power of Approximate Database Queries

©1999 Earl Cox


Dans les champs de l’observation le hasard ne favorise que les espirits prepares
Where observation is concerned, chance favors only the prepared mind.

--Louis Pasteur

precise, pri-sis, adj. definite: exact: accurate: free from vagueness
[Fr. Precis. –e –L. praecisus, pa.p. of praecidere, to cut]

imprecise im-pre-sis, adj. not precise.

Accurate adj. exact, correct,.
[L. accuratus, performed with care]

Precision and Accuracy

We live a world obsessed and infused with precision. We wear wrist watches accurate to a hundredth of a second a year (often automatically coupled to the atomic clock at the National Bureau of Standards). Ultra high density computer chips are made possible by nearly nanometer level placement of circuits on a silicate substrate;. Hunters, hikers, and drivers use Global Positioning Systems – GPS’s – that can locate them within two meters anywhere on earth. IBM writes its name within individual Xenon atoms. Intelligent missiles speed near ground level across a thousand miles of mountainous terrain to strike a single target in a crowded city. We can calculate irrational numbers, such as pi, to over a billion decimal places. Our space craft sling shot around the sun and rendezvous a year later with a distant planet on a time schedule that is exact to a few hundredth of a second. And changes in the economy are measured, and reported, down to a tenth of a percent.

These are our models of precision – models that provide the gauge for measuring other physical phenomena. We judge our weather forecasts, as an example, as highly imprecise – and consequently less accurate -- because they cannot predict the exact weather for this weekend. We believe statistical analysis is less precise than calculus. And we believe, above all other things, that numbers are more precise than words. In our modern analytical world, tempered by the rise of the scientific method, fields of study that cannot reduce their axiomatic truths to sets of equations are viewed with skepticism and placed outside the scope of the “hard” sciences. And this kind of drive for precision has been carried forward to our tools for data analysis and knowledge discovery. It is, in fact, the often unstated assumption that precision is equivalent to rigor and truth that has impeded the use of such technologies as fuzzy logic. In this article, we look at ways that imprecision can be made to work for us in important and powerful ways when searching production databases. In fact, we will see that a decrease in precision gives us an unexpected increase in accuracy.

Why do we search a database?

This seems like a trivial question to ask, yet it lies at the heart of our need for precision and accuracy (as well as how we measure both of these properties.) The simple answer is: to retrieve data we need to make a decision. The “we” can be either a person sitting at a computer terminal or an application running on some computer. And this simple answer is important. It links the process of data retrieval with the cognitive process of decision making. In making a decision we want to have all the facts. And we want the facts to be as accurate as possible. The question is, how do we go about making a database query that improves the accuracy of our results? Lets consider a very simple database of project data1 as shown in Table1.

project database

Table 1. A Project Database

Suppose we need to find all the projects with a high Budget and a short Duration. In a conventional query we must decide what set of budgets are considered “high” and what set of durations are considered “short.” In this case we decide a budget over $80,000 is high and a duration less than 90 days is short. The SQL query is cast as,

Select ProjectNo, Budget, Duration
From Projects
Where Budget > 80 and Duration < 90

The query selects the projects which have both a budget greater than $80,000 and a duration less than 90 days. Table 2 shows the results of this query.

selected rows

Table 2. Selected Rows (1)

As we can see, the query meets the precise requirements of the selection criteria expressed in the where statement. However, the query is not very accurate in terms of our intended objective. There are projects with budget and duration values that are conceptually close to our ideas of high and short, but are not selected by the query. In a small database such as our example this is not a problem – we can visually see other candidate records. But in large production databases, such a visual pattern recognition is not 1 This is actually a table in a database (which would consists of multiple tables). I use the term database here in a more informal way to mean a collection of related information.possible. So…how can we improve our ability to find the needed data? There are two ways: by expanding the scope of our selection criteria or by changing the way we perform our query against the database. We will look at each in turn.

Expanding the Query Scope

One intuitively obvious way of enriching the amount of data retrieved from the database is by expanding the scope of our query. This means changing the selection criteria so that it encompasses more data. As an example, in our previous query we might extend its reach by saying,

Select ProjectNo, Budget, Duration
From Projects
Where Budget > 75 and Duration < 100

This moves the Budget limit down to projects with a budget greater than $75,000 and moves the duration threshold up to durations less than 100 days. Indeed, as Table 3 shows, we now pick up more rows from the database. Project Budget (x00) Duration Staff

selected rows

Table 3. Selected Rows (2)

Yet even this expanded range is not sufficient to capture the actual intent of our query. While it selects more records, it still fails to find records lying just outside the explicit range of the selection criteria (as an example, it fails to find project P10 with a duration of 73 and a budget of 102, which conceptually, is a high budget and short duration.) Further, even when we change our criteria to include more records, there is no measure of inclusion – that is, we have no way of know which records are strongly a representative of the concept and which are only weakly characteristic of the concept. To solve both of these problems we need to change how we perform our selection. This brings us to a fuzzy query approach.

Using Fuzzy Query Criteria

Instead of applying crisp boundaries to delineate the search space, we can represent the space linguistically using the concept of fuzzy sets. Fuzzy sets provide a flexible and fluid method of defining semantic concepts within the database and provide the basis for a much richer and much more powerful method of looking through a database. In a fuzzy query, the selected records are ranked according to their compatibility with the semantics – the intent - of the query. This provides a measure of how well a record fits in with the complete set of records retrieved. Records are no longer simply lumped together as selected and not selected. To see how we gain this additional expressiveness and robustness from the use of fuzzy logic, let’s examine a simple query in a fuzzy SQL system. Figure 1 shows the main definition dialog with the available database columns appearing in the let hand list. Checked columns are those with defined fuzzy sets.

Forming the FuzzySQL Query

Figure 1. Forming the FuzzySQL Query

For each column that we want to use in the FuzzySQL where statement (the selection criteria), a set of underlying fuzzy sets is defined. In our simple example, we define a single fuzzy set for each column. Figure 2 shows the definition of the fuzzy set High associated with the Budget column. This is an S-curve starting at 75 and covering the remainder of the fuzzy set.

The High Budget Fuzzy Set

Figure 2. The High Budget Fuzzy Set

Figure 3 shows the definition of the fuzzy set Short associated with the Duration column. This is an S-curve starting at 87 (the extreme left hand edge of the column’s current domain) and moving to zero membership at 100 days.

The Short Duration Fuzzy Set

Figure 3. The Short Duration Fuzzy Set

After the underlying fuzzy sets have been defined, we can use them to define which rows in the database table should be selected. As Figure 4 illustrates this is done by selecting a column and matching it to one of the its fuzzy sets. In this example, we have specified budget is High.

Constructing the Selection Criteria (Where Statement)

Figure 4. Constructing the Selection Criteria (Where Statement)

The light blue spikes in the background show the actual frequency distribution of data points in the table, providing a valuable guideline for selecting the proper fuzzy set. When all the selection criteria have been specified, the actual query (shown in Figure 5) is executed.

The Fuzzy Query Definition

Figure 5. The Fuzzy Query Definition

The fuzzy query system finds the membership of each column’s data in the associated fuzzy set. This membership is in the interval [0,1], where zero indicates no membership and one indicates complete membership. The memberships are combined (usually by taking the minimum of the values (this is the Fuzzy AND operator) or their weighted average) to produce a composite membership call the Query Compatibility Index or CIX. The higher the outcome’s CIX, the more closely the outcome reflects the intent of the query. The outcome from a fuzzy query are those records in the database (actually the database tables) which have a composite membership greater than zero (or greater than some threshold). Figure 6 shows the outcome from our simple fuzzy query (where we have also included the zero CIX rows) when the minimum composition technique is used. The outcome is ordered in descending CIX order so that the most compatible rows appear at the top of the outcome window.

Figure 6. The Query Outcome (Minimum CIX)

Figure 7 shows the same query when the average composition approach is used. In this case we not only have a different number of candidate records, but their order has been changed, reflecting a greater degree of truth in one fuzzy comparison than in another. The minimum method requires that some truth exist in each of the where statement expressions (modeling the traditional AND) while the average method can often find elements where one or more where expressions have a high degree of truth but one or two other expressions have very little or no truth.

Figure 7. The Query Outcome (Average CIX)

As we have seen, by lowering the precision of our query’s selection criteria, we actually increase the accuracy of the search. The retrieved records are more closely connected to the idea behind our query than is possible with a conventional crisp search. This connection is expressed directly by each row’s compatibility index (CIX). Furthermore, we can easily fine tune our queries linguistically by changing the shape or the domain of the fuzzy sets or by apply hedges to one or more our fuzzy expressions. A hedge is a modifier that changes the shape of a fuzzy set in a specific, linguistic way. Thus, we might say where budget is very high or where duration is somewhat short. The hedge very intensifies High’s membership function while the hedge somewhat dilutes Short’s membership function.

The FuzzySQL Approach

FuzzySQL provides a powerful, easy to implement, and robust tool for searching corporate databases as well as private data sources (such as local Access databases, Excel spreadsheets, and flat files – as our example illustrates). Its user friendly approach to query qualification makes it an ideal component of the data warehouse and data mart, allowing marketing and systems analysts, data mining engineers, project managers, department managers, and a wide spectrum of computer literate staff officers convenient and open access to the operational data store as well as on-line archival information. Fuzzy queries enhance our ability to probe databases with increased confidence that the results will more accurately reflect the intent instead of simply the mathematics of our selection criteria. Based on linguistic properties, fuzzy queries are noise tolerant. They allow the scouring of databases for patterns not easily discovered with conventional SQL, thus providing the data analyst and the data mining knowledge engineer with a valuable tool for exploring complex and often noisy databases in such broad and diverse areas as retailing, manufacturing, financial services, pharmaceuticals, petrochemical, and medicine.

TOP OF PAGE

Scianta SI
© Scianta Intelligence 2005 all rights reserved
For more information or to schedule a presentation call (919) 678-0477

Spacer gif Spacer gif nav_top nav_top nav_top nav_top