NATS 1700 6.0 COMPUTERS, INFORMATION AND SOCIETY
Lecture 14: Databases and Data Mining
Introduction
- The notion of computer has involved, from its very beginning, the notion of data and data
storage. It is time now to ask how data is in fact stored. For small amounts of data a table or similar
structure is adequate: we can store and retrieve data by accessing the appropriate cell in the table. This system,
however, quickly becomes inadequate as the amount of data grows. It would be rather impractical to manipulate, say,
the Canada census data using this method: millions of households, each with pages of information about their family
composition, vital statistics, employment, income, etc. Such huge amounts of data must be stored more efficiently and
in a more orderly fashion, and above all they must be practically accessible. We just can't go on a fishing expedition. Enter the
database.
- The Aspen Institute Communications and Society Program has
just released a report entitled The Promise and Peril of Big Data which
"explores the implications of inferential technologies used to analyze massive amounts of data and the ways in which these techniques
can positively affect business, medicine, and government."
-
Read the clear Introduction to Databases
by Professor Marcovitz at Loyola University. Selena Sol offers a very lucid and non-technical Introduction to Databases for the Web,
with examples of the use of databases in libraries. Jane Chandler has a very visual Introduction to Object-Oriented Databases.
-
Bill Palace has a very nice introduction to Data Mining,
where he also discusses some of the problems of the technology. See also Web Mining.
- In Surf like a Bushman, Rachel Chalmers discusses a surprising recent finding:
"According to researchers in the US, the strategies you use when you surf the Web are exactly the same as the ones
hunter-gatherers used to find food. You may be plugged into the information superhighway, but deep down you're still
a caveman."
Topics
- A database is a large set of data stored together with its description, reports, queries, etc., that is, data about
the data, or metadata. Databases were first introduced in the fifties, and the data were stored either in memory
or on punched cards or magnetic tapes. Access was sequential--you had to traverse the database until you found the item
you were looking for. The technology quickly improved, allowing direct, concurrent access to the data and interactive
data processing. By the late sixties the very notion of data had been enlarged to encompass complex objects, such
as images. New forms of data organization and standardization were introduced, and entirely new computer languages
were developed to permit very powerful queries and the automatic preparation of comprehensive reports.
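The interactive queries such languages made possible can be sketched with Python's built-in sqlite3 module. The census-style table and all its figures below are invented for illustration:

```python
import sqlite3

# Hypothetical census-style table; all names and figures are invented.
conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE household (id INTEGER, province TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO household VALUES (?, ?, ?)",
    [(1, "ON", 52000), (2, "QC", 47000), (3, "ON", 61000)],
)

# A declarative query: the user states *what* is wanted, not how to
# traverse the data; the database engine plans the access itself.
rows = conn.execute(
    "SELECT province, COUNT(*), AVG(income) FROM household "
    "GROUP BY province ORDER BY province"
).fetchall()
print(rows)   # [('ON', 2, 56500.0), ('QC', 1, 47000.0)]
```

Note how the query asks for a summary (a count and an average per province) without spelling out any traversal of the records, which is exactly what made such languages so much more powerful than sequential access.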
- One of the crucial problems in constructing a database is the definition of a record. Each record should
contain all the information desired about a particular item such as a certain car model, a taxpayer, an endangered
species, etc. Such information cannot (yet) be stored as a narrative, but must be classified in some sort of logical
scheme. There are currently two main types of databases: object oriented databases, in which
data are stored together with the appropriate methods for accessing them, and relational databases,
which, in the words of Jane Chandler, "'hammer the world flat' by normalisation." A relational database
consists of tables which allow records pertaining to two or more such tables to be linked to each other
on the basis of the contents of a common field. For example, a cat would be represented in a relational database
as a collection of fields such as shape of the nose, tail, etc., color of the fur, size of the animal, etc., while
in an object-oriented database the cat may appear as a whole animal. The object-oriented approach allows a wide variety
of complex data types to be stored and manipulated in the same database, making it easier (see Jane Chandler's site)
"to follow objects through time (e.g. 'evolutionary applications')." New types of databases are emerging, such as
object-relational database systems, which combine the two major traditional kinds of databases, and
deductive databases. The latter take direct advantage of logic programming, which
essentially approaches computation as deduction from a set of rules or axioms. This approach allows a sort of
generalization of relational databases and makes it possible to do automated searching, extraction and processing of
data. A particular version of these databases is the so-called lightweight deductive database, which
promises to have important applications on the web, for example by allowing "distributed maintenance. A Web
page can contain a part of a deductive database. The completed database can be created as necessary by retrieving the
relevant Web pages, and composing them together. This allows the components of a database to be separately maintained,
and combined only during query processing... Because not all application modifications are fully documented, the
creation of a Data Warehouse involves a process of reverse-engineering which involves the processes of:
- identifying the components of an application and how they are linked
- extracting the business rules enforced by the application
- capturing the transformation rules that govern how data is shared between various applications
"
[ see Lightweight Deductive Databases on the
World-Wide Web ]
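The relational linking described earlier, in which records in separate tables are joined on a common field, can be sketched with sqlite3. The pet-and-owner tables and their contents are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two tables sharing the common field owner_id (all data invented).
conn.execute("CREATE TABLE owner (owner_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE cat (cat_id INTEGER, owner_id INTEGER, fur_color TEXT)")
conn.execute("INSERT INTO owner VALUES (1, 'Alice')")
conn.executemany("INSERT INTO cat VALUES (?, ?, ?)",
                 [(10, 1, 'black'), (11, 1, 'tabby')])

# Records from the two tables are linked on the common field owner_id:
# this JOIN reassembles the "flattened" cat from its separate fields.
rows = conn.execute("""
    SELECT owner.name, cat.fur_color
    FROM owner JOIN cat ON owner.owner_id = cat.owner_id
    ORDER BY cat.cat_id
""").fetchall()
print(rows)   # [('Alice', 'black'), ('Alice', 'tabby')]
```

The "hammering flat" is visible here: the cat exists only as rows of fields, and the JOIN is what reconstitutes the connections between them at query time.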
One of the difficult issues with databases is that often the data are stored in different locations and in different
databases. This problem is further complicated by the fact that each database reflects, often in unclear ways, the
assumptions, experience, habits and values of the local managers, making queries from outside users very difficult.
Data warehouses are used to consolidate data located in separate databases. Data warehousing
requires that we be able to gather and to manage (for example by aggregating and summarizing) historical data from a variety of
sources and applications without duplications, errors, or loss of data. It is a useful tool in decision processes that
must be continuously reworked and reformulated. By making information easily available, such a tool empowers those involved
in decision-making processes. Some data warehouses are read-only databases that distribute data to other applications.
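The consolidation step can be sketched in plain Python: two hypothetical source databases are merged without duplication and then summarized into the kind of aggregate a decision-maker would query. All records below are invented:

```python
from collections import defaultdict

# Two hypothetical source databases, with one record duplicated between them.
source_a = [{"id": 1, "region": "East", "sales": 100},
            {"id": 2, "region": "West", "sales": 250}]
source_b = [{"id": 2, "region": "West", "sales": 250},   # duplicate of a source_a record
            {"id": 3, "region": "East", "sales": 175}]

# Consolidate without duplication: key each record by its id, so the
# duplicate collapses into a single warehouse entry.
warehouse = {rec["id"]: rec for rec in source_a + source_b}

# Summarize: total sales per region, a typical decision-support aggregate.
totals = defaultdict(int)
for rec in warehouse.values():
    totals[rec["region"]] += rec["sales"]

print(dict(totals))   # {'East': 275, 'West': 250}
```

A real warehouse must also reconcile conflicting field names, units, and the local assumptions mentioned above; this sketch shows only the deduplicate-and-aggregate core.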
The Data Mining Process
Bill Palace defines data mining (sometimes called data or knowledge discovery) as
"the process of analyzing data from different perspectives and summarizing it into useful information--information
that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools
for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases." Data mining is one of the best ways to illustrate the
difference between data and information: data mining transforms data into information. Such
transformation takes place in all three of the basic components of data mining: the creation of the data set (the
data captured), the mining process proper, and the organization of the presentation of the information mined.
Here is an example from the same source: "one Midwest grocery chain used the data mining capacity of Oracle
software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays,
they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping
on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer
to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in
various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And,
they could make sure beer and diapers were sold at full price on Thursdays."
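The pattern in this story can be quantified as an association rule, "diapers imply beer", using two standard measures: support (how often the pair occurs at all) and confidence (how often beer appears given diapers). The tiny basket data below is invented for illustration:

```python
# Hypothetical Thursday shopping baskets; contents are invented.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(baskets)
both = sum(1 for b in baskets if {"diapers", "beer"} <= b)   # pair together
diapers = sum(1 for b in baskets if "diapers" in b)          # antecedent alone

support = both / n            # fraction of all baskets containing the pair
confidence = both / diapers   # of the diaper-buyers, how many also bought beer

print(f"support={support:.2f} confidence={confidence:.2f}")
# support=0.40 confidence=0.67
```

A miner would report the rule only when both measures clear chosen thresholds; the grocery chain's finding amounts to a rule like this with unusually high confidence on Thursdays.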
Data mining is largely applied in business, but important scientific applications have been the initial impetus
and continue to be developed. For example, the deployment of growing numbers of artificial satellites has yielded
huge amounts of data about the earth, from its geology to its ecology and climatology. NASA's space program has also
accumulated immense sets of data which will take decades to analyze. The human genome project has done the same in biology,
and the major particle physics laboratories in the world (e.g. Fermi Lab and CERN) in physics. And so on. See for
example Sapphire Overview.
Another site well worth visiting is Good Prospects Ahead for Data Mining.
Note that the underlying principle of the scientific method, namely the cycle observation-hypothesis-experiment, fits well with the process of data mining.
Discovery-driven mining works well in the observation-hypothesis step, and verification-driven mining works well
in the hypothesis-experiment step. "Data mining is a component of a wider process called 'knowledge discovery from databases'.
It involves scientists from a wide range of disciplines, including mathematicians, computer scientists and statisticians,
as well as those working in fields such as machine learning, artificial intelligence, information retrieval and pattern recognition."
Questions and Exercises
- Having reviewed some of the major achievements of AI, has your sense of the debate on human vs artificial intelligence changed?
How? Why?
- In your own words, find and describe an example of AI technology not covered in these lectures.
Picture Credit: Data Mining in a Scientific Environment
Last Modification Date: 15 January 2010