IDS 410
Our previous discussions of conceptual data modeling and the logical database design phase of the database development process (including E-R/EER notation, the relational data model, and normalization) do not explain how data will be processed or stored. The purpose of physical database design is to translate the logical description of data into the technical specifications for storing and retrieving data. The goal is a design for storing data that provides adequate performance and ensures database integrity, security, and recoverability.
The information needed for physical file and database design includes these requirements:
Physical database design requires the following critical decisions:
An easy way to show statistics about data volumes and usage (access frequency) is to add notation to the EER diagram that represents the final set of normalized relations from logical database design. The volume and frequency statistics are generated during the systems analysis phase of the systems development process, when systems analysts study current and proposed data processing and business activities.
Fortunately, precise numbers are not necessary. What is crucial is the relative size of the numbers, which will suggest where the greatest attention needs to be given in order to achieve the best possible performance.
Field: The smallest unit of named application data recognized by system software, such as a programming language or database management system.
A field corresponds to a simple attribute in the logical data model, so each component of a composite attribute is represented by its own field.
Data Type: A detailed coding scheme recognized by system software, such as a DBMS, for representing organizational data.
Selecting a data type involves four objectives that will have different relative importance for different applications.
A field with a limited number of possible values can be translated into a code that requires less space. By creating a code or translation table for the product Finish field (with possible values Birch, Maple, and Oak), each Finish field value can be replaced by a code, a cross-reference to the look-up table, similar to a foreign key. (Note that the code table would not appear in the conceptual or logical model; the code table is a physical construct to achieve data processing performance improvements, not a set of data with business value.)
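A minimal sketch of such a code table, using Python's built-in sqlite3 module (the table and column names here are illustrative, not from any actual schema):

```python
import sqlite3

# In-memory database for illustration; all names are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Look-up (code) table: each finish value is stored once.
cur.execute("CREATE TABLE FinishCode (code INTEGER PRIMARY KEY, finish TEXT)")
cur.executemany("INSERT INTO FinishCode VALUES (?, ?)",
                [(1, "Birch"), (2, "Maple"), (3, "Oak")])

# Product stores only the compact code, cross-referencing the look-up
# table much like a foreign key.
cur.execute("CREATE TABLE Product (product_id INTEGER, finish_code INTEGER)")
cur.execute("INSERT INTO Product VALUES (100, 2)")

# A join translates the code back to the full value on retrieval.
cur.execute("""SELECT p.product_id, f.finish
               FROM Product p JOIN FinishCode f ON p.finish_code = f.code""")
row = cur.fetchone()
print(row)  # (100, 'Maple')
```

Each Product row now carries a small integer instead of a repeated text value; the translation cost is paid only at retrieval time.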
A form of a code table is used by data compression techniques, such as file zip routines. Also related to data compression techniques are encryption techniques.
For many DBMSs, data integrity controls (i.e., controls on the possible values a field can assume) can be built into the physical structure of the fields and enforced by the DBMS on those fields. The data type itself enforces one type of data integrity control, since it limits the type and length of a field value. Some other integrity controls that a DBMS might support are:
Default Value--the value a field will assume unless a user enters an explicit value for an instance of that field. (Assigning a default value can (1) reduce data entry time, and (2) reduce data entry errors.)
Range control--to limit the set of permissible values a field may assume. (Range controls should be used with caution. Also, it is better to implement any range controls through the DBMS rather than through a program--since range controls in programs may be inconsistently enforced and are difficult to maintain.)
Null value control--each primary key must have an integrity control that prohibits a null value. (Any other required field may also have a null value control placed on it if it is the policy of the organization.)
Referential integrity--a form of range control in which the value of that field must exist as the value in some field in another row of the same or different table. That is, the range of legitimate values comes from the dynamic contents of a field in a database table, not from some prespecified set of values. (Note that referential integrity guarantees that only some existing cross-referencing value is used, not that it is the correct one.)
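The controls above can be declared directly in the table definition. The sketch below uses Python's sqlite3 with hypothetical table names (Oracle's syntax differs in minor ways, and sqlite requires a pragma to enforce foreign keys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when asked
cur = conn.cursor()

cur.execute("CREATE TABLE FinishCode (code INTEGER PRIMARY KEY)")
cur.execute("INSERT INTO FinishCode VALUES (1)")

# Default value, range control (CHECK), null value control (NOT NULL),
# and referential integrity (REFERENCES) declared on the fields.
cur.execute("""
    CREATE TABLE Product (
        product_id  INTEGER NOT NULL PRIMARY KEY,        -- null prohibited
        price       REAL DEFAULT 0 CHECK (price >= 0),   -- default + range
        finish_code INTEGER REFERENCES FinishCode(code)  -- referential
    )""")

# Omitting price lets the default value take effect.
cur.execute("INSERT INTO Product (product_id) VALUES (1)")
default_price = cur.execute(
    "SELECT price FROM Product WHERE product_id = 1").fetchone()[0]

try:
    cur.execute("INSERT INTO Product VALUES (2, -5, 1)")   # violates range
except sqlite3.IntegrityError as e:
    range_error = str(e)

try:
    cur.execute("INSERT INTO Product VALUES (3, 10, 99)")  # no such code
except sqlite3.IntegrityError as e:
    fk_error = str(e)

print(default_price, range_error, fk_error)
```

Because the DBMS rejects the bad rows itself, no application program can bypass these controls, which is the maintenance advantage noted above for range controls.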
When a field may be null, simply entering no value may be sufficient. But if we needed to generate, say, a report of customers by zip code, how would we handle customers with unknown zip codes? Two options are discussed in the data integrity controls mentioned above: use a default value or do not permit missing (null) values. Other possible methods for handling missing data are (Babad & Hoffer, 1984):
Trigger: A routine that will automatically execute when some event occurs or time period elapses.
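A small sketch of a trigger, again in sqlite3 (the audit-log scenario and all names are hypothetical): once the trigger is defined, the DBMS runs it automatically on the triggering event, with no application code involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Product (product_id INTEGER, price REAL)")
cur.execute("""CREATE TABLE PriceLog
               (product_id INTEGER, old_price REAL, new_price REAL)""")

# The trigger fires automatically after every price update.
cur.execute("""
    CREATE TRIGGER log_price_change
    AFTER UPDATE OF price ON Product
    BEGIN
        INSERT INTO PriceLog VALUES (OLD.product_id, OLD.price, NEW.price);
    END""")

cur.execute("INSERT INTO Product VALUES (1, 100.0)")
cur.execute("UPDATE Product SET price = 120.0 WHERE product_id = 1")

log_row = cur.execute("SELECT * FROM PriceLog").fetchone()
print(log_row)  # (1, 100.0, 120.0)
```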
In a logical data model, we group into a relation those attributes that are determined by the same primary key. In contrast:
Physical Record: A group of fields stored in adjacent memory locations and retrieved and written together as a unit by a DBMS.
The design of a physical record involves choosing the sequencing of fields into adjacent storage locations to achieve two goals: efficient use of secondary storage and data processing speed.
The efficient use of secondary storage is influenced by both the size of the physical record and the structure of secondary storage. Operating systems read data from hard disks in units called pages (not physical records) ...
Page: The amount of data read or written by an operating system in one secondary storage (disk) input or output operation. For I/O with a magnetic tape, the equivalent term is record block.
If page length is not an integer multiple of the physical record length, wasted space may occur at the end of the page.
Blocking Factor: The number of physical records per page.
If storage space is scarce and physical records cannot span pages, creating multiple physical records from one logical relation will minimize wasted storage space. Some DBMSs will also block multiple physical records into a data block; in this case, the DBMS manages the data block, whereas the OS manages a page.
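The blocking-factor arithmetic is straightforward; the page and record sizes below are illustrative, not from any particular system:

```python
# Blocking factor and wasted space per page (hypothetical sizes).
page_size = 4096     # bytes the OS reads or writes in one I/O operation
record_size = 300    # bytes per physical record

# Records that fit in one page when records cannot span pages.
blocking_factor = page_size // record_size

# The page length is not an integer multiple of the record length,
# so the tail of each page is wasted.
wasted = page_size - blocking_factor * record_size

print(blocking_factor, wasted)  # 13 records per page, 196 bytes wasted
```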
Our preceding discussion focused on efficient use of storage space. In most cases, the second goal of physical record design--efficient data processing--dominates the design process. Efficient processing of data depends on how close together related data are.
Often all the attributes that appear within a relation are not used together, and data from different relations are needed together to answer a query or produce a report. Thus, although normalized relations solve data maintenance anomalies and minimize redundancies, implementing them one for one as physical records may not yield efficient data processing.
Denormalization: The process of transforming normalized relations into unnormalized physical record specifications.
The following are some common denormalization opportunities:
The opportunities listed above deal with combining tables to avoid doing joins. However, denormalization can also be used to create more tables by partitioning a relation into multiple tables.
Horizontal Partitioning: Distributing the rows of a table into several separate files. (Breaking a relation into multiple record specifications by placing different rows in different records based upon common column values.) Example: breaking a customer relation into four regional customer files based on the value of a field region.
We should note that horizontal partitioning is very similar to creating supertype/subtype relationships.
The Oracle DBMS supports several forms of horizontal partitioning ... :
Vertical Partitioning: Distributing the columns of a table into several separate physical records, repeating the primary key in each of the records. Example: breaking a part relation by placing the part number along with accounting-related part data in one record specification, the part number along with engineering-related part data in another record specification, and the part number along with sales-related part data in yet another record specification.
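Both forms of partitioning can be sketched in plain Python; Oracle does this declaratively with its own partitioning syntax, but the effect on the data is the same. The rows and column layout below are hypothetical:

```python
# Hypothetical customer rows: (customer_id, region, name).
customers = [
    (1, "East", "Alda"),
    (2, "West", "Biru"),
    (3, "East", "Chen"),
]

# Horizontal partitioning: same columns, rows split by the value
# of the region column.
horizontal = {}
for row in customers:
    horizontal.setdefault(row[1], []).append(row)

# Vertical partitioning: same rows, columns split into separate
# record specifications, with the primary key repeated in each.
ids_regions = [(cid, region) for cid, region, _ in customers]
ids_names = [(cid, name) for cid, _, name in customers]

print(sorted(horizontal))  # ['East', 'West']
print(ids_names)           # [(1, 'Alda'), (2, 'Biru'), (3, 'Chen')]
```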
Advantages of Partitioning
Disadvantages of Partitioning
In data replication, the same data are purposely stored in multiple places in the database in order to achieve increased processing (data retrieval) speed. (Improved speed is worthwhile only if the replicated data are frequently accessed in each of the relations in which they are replicated, and if the costs of extra secondary storage and data maintenance are not prohibitive.)
Physical File: A named portion of secondary memory (such as a magnetic tape or hard disk) allocated for the purpose of storing physical records.
An important structure for physical storage space in Oracle is a tablespace.
A tablespace is a named set of disk storage elements in which physical files from one or more database tables may be stored.
Oracle has responsibility for managing the storage of data inside a tablespace, whereas the operating system has many responsibilities for managing a tablespace as a whole as it would any operating system file (e.g. handling file level security, allocating space, and responding to disk read and write errors).
A tablespace may be spread over several extents (where an extent is a contiguous section of disk storage space). When a tablespace needs to enlarge to hold more data, it is assigned another extent ... Managing tablespaces, or physical database files, is a significant job of the database administrator in an Oracle environment.
All files are organized by using two basic constructs to link one piece of data with another piece of data: sequential storage and pointers.
Pointer: A field of data that can be used to locate a related field or record of data. Generally, a pointer contains the address, or location, of the associated data.
(We define pointers here because a knowledge of what pointers are is necessary for understanding file organization, discussed below.)
File Organization: A technique for physically arranging the records of a file on secondary storage devices.
With modern DBMSs, we do not have to design file organizations, though we may be allowed to select an organization and its parameters for a table or physical file. In choosing a file organization for a particular file in a database, we should consider seven important factors:
Often these objectives conflict, and we must select a file organization that provides a reasonable balance among the criteria within the resources available. We now consider three families of basic file organizations:
Sequential File Organization: The storage of records in a file in sequence according to a primary key value.
Because of their inflexibility (a program must normally scan a file from the beginning until the desired record is located), sequential files are not used in a database, but may be used for files that backup data from a database.
Indexed File Organization: The storage of records either sequentially or nonsequentially with an index that allows software to locate individual records.
An index is a table or other data structure used to determine the location of rows in a file that satisfy some condition.
Each index entry matches a key value with one or more records, and can point to unique records or to potentially more than one record. An index that allows each entry to point to more than one record is called a secondary key.
Secondary key: One field or combination of fields for which more than one record may have the same combination of values. Also called nonunique key.
When the terms primary and secondary index are used, there are four types of indexes:
Indexes can be built on top of indexes, since indexes (being files themselves) can become so large as to benefit from indexing. Each index entry then, has a key value and a pointer to another index or to a data record.
One of the most powerful capabilities of indexed file organization is the ability to create multiple indexes (such as in a library, where there are indexes on title, author, subject, etc., all pointing to the same set of books). Also, multiple indexes can be manipulated (e.g., "find products with a Birch finish AND that cost under $500") to increase gains in processing efficiency. Logical AND, OR, and NOT operations can be handled by simply manipulating results from an index scan, thus avoiding the more costly access to data records that do not meet all the qualifications of the query.
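The idea can be sketched with secondary-key indexes modeled as sets of record numbers (the product data here is hypothetical): the AND is answered by intersecting index entries, and only the surviving record numbers would ever be fetched from storage.

```python
# Hypothetical product records keyed by record number.
products = {
    1: {"finish": "Birch", "price": 450},
    2: {"finish": "Oak",   "price": 300},
    3: {"finish": "Birch", "price": 600},
}

# Secondary-key indexes: each key value points to a set of record numbers.
finish_index = {}
price_under_500 = set()
for rec_no, rec in products.items():
    finish_index.setdefault(rec["finish"], set()).add(rec_no)
    if rec["price"] < 500:
        price_under_500.add(rec_no)

# "Birch finish AND cost under $500": intersect the index results,
# never touching data records that fail either condition.
qualifying = finish_index["Birch"] & price_under_500
print(qualifying)  # {1}
```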
The general structure of a hierarchical index is called a tree (albeit an upside-down one, with the root node at the top and leaf nodes at the bottom). The performance of a tree of index entries is influenced by the properties of the tree structure used. A common type of tree used by DBMSs to store indexes is a balanced tree, or B-tree.
Another type of index is called a bitmap index.
Bitmap index: A table of bits in which each row represents a distinct value of a key and each column is a bit that, when on, indicates that the record in that column position has the associated field value.
A bitmap is ideal for attributes that have only a few possible values. In proper situations, because bit manipulation and searching is so fast, the speed of query processing with a bitmap can be 10 times faster than a conventional tree index.
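A bitmap index can be sketched with Python integers as bit strings (the Finish data is hypothetical); note how the OR query is a single bitwise operation over the whole record set:

```python
# Records 0..5 with a Finish field (hypothetical data).
finishes = ["Birch", "Oak", "Birch", "Maple", "Oak", "Birch"]

# Bitmap index: one row of bits per distinct value; bit i is on
# exactly when record i has that value.
bitmaps = {}
for i, value in enumerate(finishes):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Bit operations answer queries without touching the data records:
# which records have a Birch OR Oak finish?
birch_or_oak = bitmaps["Birch"] | bitmaps["Oak"]
matches = [i for i in range(len(finishes)) if birch_or_oak & (1 << i)]
print(matches)  # [0, 1, 2, 4, 5]
```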
Another important type of index, especially in data warehousing and other decision support applications, is a join index.
A join index is an index on columns from two or more tables that come from the same domain of values.
Simply stated, a join says find rows in the same or different tables that have values that match some criterion.
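A join index can be sketched as a precomputed list of matching row pairs, built once and reused so that the join need not be recomputed per query (the Store/Customer data and the City matching column are hypothetical):

```python
# Hypothetical rows: Store and Customer both carry a City column
# drawn from the same domain of values.
stores = {1: "Chicago", 2: "Peoria"}       # store_row -> city
customers = {10: "Chicago", 11: "Urbana"}  # customer_row -> city

# Join index: for each shared city, the (store_row, customer_row)
# pairs whose City values match.
join_index = {}
for s, s_city in stores.items():
    for c, c_city in customers.items():
        if s_city == c_city:
            join_index.setdefault(s_city, []).append((s, c))

print(join_index)  # {'Chicago': [(1, 10)]}
```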
Hashed file organization: A storage system in which the address of each record is determined using a hashing algorithm.
Hashing algorithm: A routine that converts a primary key value into a relative record number (or relative file address).
A typical hashing algorithm uses the technique of dividing each primary key value by a suitable prime number and then using the remainder of the division as the relative storage location. (Another technique must then be used to resolve the duplicates, or overflow, that occur when two or more keys hash to the same address, known as a "hash clash".)
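The division-remainder technique in a few lines (the prime and the key values are illustrative), including a deliberate hash clash:

```python
def hash_address(key: int, prime: int = 997) -> int:
    """Relative record number = remainder of dividing the key by a prime."""
    return key % prime

# Two different keys can hash to the same relative address ("hash clash"),
# which is why an overflow-resolution technique is still required.
a = hash_address(1000)   # 1000 % 997 = 3
b = hash_address(1997)   # 1997 % 997 = 3 -- collides with key 1000
print(a, b)  # 3 3
```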
Hash index table: A file organization that uses hashing to map a key into a location in an index, where there is a pointer to the actual data record matching the hash key.
The DBMS will handle the management of the file organization. What is important for the DB designer is to understand the properties of the different file organizations so that we can choose the most appropriate one for the type of database processing required in the database and application we are designing.
Some DBMSs allow adjacent secondary memory space to contain rows from several tables. In this case, a physical file does not contain records with identical structures. For example, in Oracle, rows from one, two, or more related tables that are often joined together can be stored in the same disk area. A cluster is defined by the tables and the column or columns by which the tables are usually joined.
Clustering reduces the time to access related records compared to the normal allocation of different files to different areas of a disk. Time is reduced since related records are closer to each other than if the records were stored in separate files in separate areas of the disk.
One additional aspect of a database file about which we may have design options is the set of controls we can use to protect the file from destruction or contamination, or to reconstruct the file if it is damaged. The proprietary storage format provided by the DBMS offers a basic level of control, but we may require additional security controls on fields, files (tables), or databases. Backup procedures provide a copy of a file and of the transactions that have changed the file. When a file is damaged, the file copy or current file, along with the log of transactions, is used to recover the file to an uncontaminated state. Further, in terms of security, the most effective method is to encrypt the contents of the file so that only programs with access to the decryption routine will be able to see the file contents.
Indexes on a file can be created for either a primary or secondary key or both. The index is itself a table with two columns: the key and the address of the record or records that contain that key value. For a primary key, there will be only one entry in the index for each key value.
CREATE UNIQUE INDEX CUSTINDEX ON CUSTOMER(CUSTOMER_ID)
CREATE INDEX DESCINDX ON PRODUCT(DESCRIPTION)
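The same CREATE UNIQUE INDEX statement runs in sqlite3, which lets us sketch the "only one entry per key value" property: once the unique index exists, the DBMS rejects a second row with the same key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE CUSTOMER (CUSTOMER_ID INTEGER, NAME TEXT)")

# A unique index on the primary key field: one index entry per key value.
cur.execute("CREATE UNIQUE INDEX CUSTINDEX ON CUSTOMER(CUSTOMER_ID)")

cur.execute("INSERT INTO CUSTOMER VALUES (1, 'Alda')")
try:
    cur.execute("INSERT INTO CUSTOMER VALUES (1, 'Biru')")  # duplicate key
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```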
During physical database design, we must choose which attributes to use to create indexes. There is a trade-off between improved performance for retrievals through the use of indexes and degraded performance for inserting, deleting, and updating the records in a file. Thus, indexes should be used generously for databases intended primarily to support data retrievals, such as decision support and data warehouse applications, and should be used judiciously for databases that support transaction processing and other applications with heavy updating requirements, since the indexes impose additional overhead.
The following are some rules of thumb for choosing indexes for relational databases.
The increasing feasibility of using multiple computer components (a result of the constant reduction in the cost and size of computer technologies) allows parallel processing of data--so that, for example, four I/O operations done in parallel take only as long as one such operation.
Parallel database processing and fault tolerance can be accomplished by a database designer using hardware or software technology called Redundant Array of Inexpensive Disks (RAID).
Redundant Array of Inexpensive Disks (RAID): A set, or array, of physical disk drives that appear to the database user (and programs) as if they form one large logical storage unit. (Thus, RAID does not change the logical or physical structure of application programs or database queries.)
In order to maximize I/O performance of RAID, all the disk drives need to be kept busy. This is accomplished by striping--the balancing of the workload across the disk drives.
Stripe: The set of pages on all disks in a RAID that are the same relative distance from the beginning of the disk drive.
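Round-robin striping can be sketched with simple modular arithmetic (the page numbering and disk count are illustrative): page p lands on disk p mod n at stripe p // n, so consecutive pages spread across all the drives.

```python
# Round-robin striping across a hypothetical 4-disk array.
n_disks = 4

def placement(page: int) -> tuple:
    """Return (disk, stripe) for a given page number."""
    return (page % n_disks, page // n_disks)

# Four consecutive pages land on four different disks in the same
# stripe, so one request for all four keeps every drive busy at once.
print([placement(p) for p in range(4)])
# [(0, 0), (1, 0), (2, 0), (3, 0)]
```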
The details of parallel processing are usually implemented at the OS level.
RAID has one important risk: the increased likelihood that a disk drive failure will affect the whole database. To cope with this risk and to make disk storage fault tolerant, many types of RAID technologies redundantly store data so that at least one copy of the data is accessible when demanded, or store extra error correction codes so that damaged or lost data can be rebuilt.
A particular database management or data warehousing system supports one of five different architectures:
Charles E. Oyibo