NATS 1700 6.0 COMPUTERS, INFORMATION AND SOCIETY
Lecture 14: Databases and Data Mining
Introduction
- The notion of computer has involved, from its very beginning, the notion of data and data
storage. It is time now to ask how data is in fact stored. For small amounts of data a table or similar
structure is adequate: we can store and retrieve data by accessing the appropriate cell in the table. This system,
however, quickly becomes inadequate as the amount of data grows. It would be rather impractical to manipulate, say,
the Canada census data using this method: millions of households, each with pages of information about their family
composition, vital statistics, employment, income, etc. Such huge amounts of data must be stored more efficiently and
in a more orderly fashion, and above all they must be practically accessible. We just can't go on a fishing expedition. Enter the
database.
- The Aspen Institute Communications and Society Program has
just released a report entitled The Promise and Peril of Big Data which
"explores the implications of inferential technologies used to analyze massive amounts of data and the ways in which these techniques
can positively affect business, medicine, and government."
-
Read the clear Introduction to Databases
by Professor Marcovitz at Loyola University. Selena Sol offers a very lucid and non-technical Introduction to Databases for the Web,
with examples of the use of databases in libraries. Jane Chandler has a very visual Introduction to Object-Oriented Databases.
-
Bill Palace has a very nice introduction to Data Mining,
where he also discusses some of the problems of the technology. See also Web Mining.
- In Surf like a Bushman, Rachel Chalmers discusses a surprising recent finding:
"According to researchers in the US, the strategies you use when you surf the Web are exactly the same as the ones
hunter-gatherers used to find food. You may be plugged into the information superhighway, but deep down you're still
a caveman."
Topics
- A database is a large set of data stored together with its description, reports, queries, etc., that is, data about
the data, or metadata. Databases were first introduced in the fifties, and the data were stored either in memory
or on punched cards or magnetic tapes. Access was sequential--you had to traverse the database until you found the item
you were looking for. The technology quickly improved, allowing direct, concurrent access to the data and interactive
data processing. By the late sixties the very notion of data had been enlarged to encompass complex objects, such
as images. New forms of data organization and standardization were introduced, and entirely new computer languages
were developed to permit very powerful queries and the automatic preparation of comprehensive reports.
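The interactive queries such languages made possible can be sketched with Python's built-in sqlite3 module. The census-style table and all its figures below are invented for illustration:

```python
import sqlite3

# Hypothetical census-style table; all names and figures are invented.
conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("CREATE TABLE household (id INTEGER, province TEXT, income INTEGER)")
conn.executemany(
    "INSERT INTO household VALUES (?, ?, ?)",
    [(1, "ON", 52000), (2, "QC", 47000), (3, "ON", 61000)],
)

# A declarative query: the user states *what* is wanted, not how to
# traverse the data; the database engine plans the access itself.
rows = conn.execute(
    "SELECT province, COUNT(*), AVG(income) FROM household "
    "GROUP BY province ORDER BY province"
).fetchall()
print(rows)   # [('ON', 2, 56500.0), ('QC', 1, 47000.0)]
```

Note how the query asks for a summary (a count and an average per province) without spelling out any traversal of the records, which is exactly what made such languages so much more powerful than sequential access.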
- One of the crucial problems in constructing a database is the definition of a record. Each record should
contain all the information desired about a particular item such as a certain car model, a taxpayer, an endangered
species, etc. Such information cannot (yet) be stored as a narrative, but must be classified in some sort of logical
scheme. There are currently two main types of databases: object oriented databases, in which
data are stored together with the appropriate methods for accessing them, and relational databases,
which, in the words of Jane Chandler, "'hammer the world flat' by normalisation." A relational database
consists of tables which allow records pertaining to two or more such tables to be linked to each other
on the basis of the contents of a common field. For example, a cat would be represented in a relational database
as a collection of fields such as shape of the nose, tail, etc., color of the fur, size of the animal, etc., while
in an object-oriented database the cat may appear as a whole animal. The object-oriented approach allows a wide variety
of complex data types to be stored and manipulated in the same database, making it easier (see Jane Chandler's site)
"to follow objects through time (e.g. 'evolutionary applications')." New types of databases are emerging, such as
object-relational database systems, which combine the two major traditional kinds of databases, and
deductive databases. The latter take direct advantage of logic programming, which
essentially approaches computation as deduction from a set of rules or axioms. This approach allows a sort of
generalization of relational databases and makes it possible to do automated searching, extraction and processing of
data. A particular version of these databases is the so-called lightweight deductive database, which
promises to have important applications on the web, for example by allowing "distributed maintenance. A Web
page can contain a part of a deductive database. The completed database can be created as necessary by retrieving the
relevant Web pages, and composing them together. This allows the components of a database to be separately maintained,
and combined only during query processing... Because not all application modifications are fully documented, the
creation of a Data Warehouse involves a process of reverse-engineering which involves the processes of:
- identifying the components of an application and how they are linked
- extracting the business rules enforced by the application
- capturing the transformation rules that govern how data is shared between various applications
"
[ see Lightweight Deductive Databases on the
World-Wide Web ]
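The relational linking described earlier, in which records in separate tables are joined on a common field, can be sketched with sqlite3. The pet-and-owner tables and their contents are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two tables sharing the common field owner_id (all data invented).
conn.execute("CREATE TABLE owner (owner_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE cat (cat_id INTEGER, owner_id INTEGER, fur_color TEXT)")
conn.execute("INSERT INTO owner VALUES (1, 'Alice')")
conn.executemany("INSERT INTO cat VALUES (?, ?, ?)",
                 [(10, 1, 'black'), (11, 1, 'tabby')])

# Records from the two tables are linked on the common field owner_id:
# this JOIN reassembles the "flattened" cat from its separate fields.
rows = conn.execute("""
    SELECT owner.name, cat.fur_color
    FROM owner JOIN cat ON owner.owner_id = cat.owner_id
    ORDER BY cat.cat_id
""").fetchall()
print(rows)   # [('Alice', 'black'), ('Alice', 'tabby')]
```

The "hammering flat" is visible here: the cat exists only as rows of fields, and the JOIN is what reconstitutes the connections between them at query time.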
One of the difficult issues with databases is that often the data are stored in different locations and in different
databases. This problem is further complicated by the fact that each database reflects, often in unclear ways, the
assumptions, experience, habits and values of the local managers, making queries from outside users very difficult.
Data warehouses are used to consolidate data located in separate databases. Data warehousing
requires that we be able to gather and to manage (for example by aggregating and summarizing) historical data from a variety of
sources and applications without duplications, errors, or loss of data. It is a useful tool in decision processes that
must be continuously reworked and reformulated. By making information easily available, such a tool empowers those involved
in decision-making processes. Some data warehouses are read-only databases that distribute data to other applications.
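The consolidation step can be sketched in plain Python: two hypothetical source databases are merged without duplication and then summarized into the kind of aggregate a decision-maker would query. All records below are invented:

```python
from collections import defaultdict

# Two hypothetical source databases, with one record duplicated between them.
source_a = [{"id": 1, "region": "East", "sales": 100},
            {"id": 2, "region": "West", "sales": 250}]
source_b = [{"id": 2, "region": "West", "sales": 250},   # duplicate of a source_a record
            {"id": 3, "region": "East", "sales": 175}]

# Consolidate without duplication: key each record by its id, so the
# duplicate collapses into a single warehouse entry.
warehouse = {rec["id"]: rec for rec in source_a + source_b}

# Summarize: total sales per region, a typical decision-support aggregate.
totals = defaultdict(int)
for rec in warehouse.values():
    totals[rec["region"]] += rec["sales"]

print(dict(totals))   # {'East': 275, 'West': 250}
```

A real warehouse must also reconcile conflicting field names, units, and the local assumptions mentioned above; this sketch shows only the deduplicate-and-aggregate core.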
The Data Mining Process
Bill Palace defines data mining (sometimes called data or knowledge discovery) as
"the process of analyzing data from different perspectives and summarizing it into useful information--information
that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools
for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases." Data mining is one of the best ways to illustrate the
difference between data and information: data mining transforms data into information. Such
transformation takes place in all three of the basic components of data mining: the creation of the data set (the
data captured), the mining process proper, and the organization of the presentation of the information mined.
Here is an example from the same source: "one Midwest grocery chain used the data mining capacity of Oracle
software to analyze local buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays,
they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping
on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer
to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in
various ways to increase revenue. For example, they could move the beer display closer to the diaper display. And,
they could make sure beer and diapers were sold at full price on Thursdays."
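The pattern in this story can be quantified as an association rule, "diapers imply beer", using two standard measures: support (how often the pair occurs at all) and confidence (how often beer appears given diapers). The tiny basket data below is invented for illustration:

```python
# Hypothetical Thursday shopping baskets; contents are invented.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(baskets)
both = sum(1 for b in baskets if {"diapers", "beer"} <= b)   # pair together
diapers = sum(1 for b in baskets if "diapers" in b)          # antecedent alone

support = both / n            # fraction of all baskets containing the pair
confidence = both / diapers   # of the diaper-buyers, how many also bought beer

print(f"support={support:.2f} confidence={confidence:.2f}")
# support=0.40 confidence=0.67
```

A miner would report the rule only when both measures clear chosen thresholds; the grocery chain's finding amounts to a rule like this with unusually high confidence on Thursdays.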
Data mining is largely applied in business, but important scientific applications have been the initial impetus
and continue to be developed. For example, the deployment of growing numbers of artificial satellites has yielded
huge amounts of data about the earth, from its geology to its ecology and climatology. NASA's space program has also
accumulated immense sets of data which will take decades to analyze. The human genome project has done the same in biology,
and the major particle physics laboratories in the world (e.g. Fermi Lab and CERN) in physics. And so on. See for
example Sapphire Overview.
Another site well worth visiting is Good Prospects Ahead for Data Mining.
Note that the underlying principle of the scientific method, namely the cycle observation-hypothesis-experiment, fits well with the process of data mining.
Discovery-driven mining works well in the observation-hypothesis step, and verification-driven mining works well
in the hypothesis-experiment step. "Data mining is a component of a wider process called 'knowledge discovery from databases'.
It involves scientists from a wide range of disciplines, including mathematicians, computer scientists and statisticians,
as well as those working in fields such as machine learning, artificial intelligence, information retrieval and pattern recognition."
Questions and Exercises
- Having reviewed some of the major achievements of AI, has your sense of the debate on human vs artificial intelligence changed?
How? Why?
- In your own words, find and describe an example of AI technology not covered in these lectures.
Picture Credit: Data Mining in a Scientific Environment
Last Modification Date: 15 January 2010