Database Design - A Complete Study
Part 1 - The Theory
Written by Stéphane Richard (Mystikshadows)
Today, with FreeBasic and it's new realm of possibilities, the Qmunity can now see
database application as a definite possibility. Indeed efforts are being made to integrate
ODBC to FreeBasic and with it, a whole new world falls within the grasp of freebasic
developers. It's not enough today to say that your program can handle 10s of thousands of records
the industry is much more demanding than this. Even if you could code to handly more than this
say using random access files, would you really want to knowing there's a better solution out there? I
didn't think so.
In this document, I will try to tell you everything you need to know to equip yourself with a good solid knowledge about databases and database design concepts so you can sit down, look at a problem and be able to intelligently design a database to fit the very specific needs of a given database project. This will be a 3 part study. The first part, which you are reading now will cover alot of the basic database concepts, What types of database exist and some design considerations. The Second part will take a sample situation and design a database for it. The third part will cover everything there is to know about database management, interrogation and reporting. Throughout this series we will be using MySQL as the database environment because it's free and will most likely be the first "bigger" database you'll want to get your hands on. So all SQL you'll learn will be as it is implemented in MySQL. With a few changes you can addapt your knowledge to most well known databases fairly quickly.
We'll start things off by explaining some basic database concepts. Like I mentionned this first part is all about theory so brace yourselves. Here I'll just explain some parts of database design that you can expect to see in all "database related projects" by today's standards as they have changed over the years.
When this term was first used, it's original definition was that a database was the name of a file on disk that could be opened (flat file). With the coming of the different early databases like DBase Foxpro and Paradox, this definition change to say that a Database was a folder name. Today a database regained it's original meaning in that it now represents a filename only the structure inside the file has changed dramatically.
Table started to exist only recently and only had one definition. A table is a name that represent one fixed set of records, all of the same structure definition. Back in the days of DBase, FoxPro and Paradox, tables were each filename found inside of a folder with the proper file extension and inner structure. Today, a table is defined as one of the tables found inside a database file. Like it's first definition it has a fixed set of structures to define one given set of information.
A record always remained the same, it's one row of information that pertains to a given table. For instance in a Customer Table, each row of data (be it in a flat file, or a database, still represent the information that pertains to one given customer only. The means to store that one set of information changed drastically over the years, but the definition of a record still remains the same today.
Indexes always had just one purpose in history and still do today. Indexes server to present the records in a database in a specific order. In the days of DBase, FoxPro and Paradox (and others of the time) an index was a seperate file on the drive that held the information pertaining to the order of the table. An index also serve to accelerate the search process to find a specific record most database had a quick way to search the index to get to a specific record number in the main data file quite instantly even in indexes of 10s of thousands of records. It can still serve that purpose today as well.
With the coming of the truely relational database management system and their concept of a field in a table relating to a field of another table, the term Key was used to represent those fields that helped link two or more tables together. There are two types of keys, Primary Keys and Foreign keys. Here's a brief description of both:
- Primary Keys:
This is usually a field that is chosen in a table that will more than likely not allow duplicate records with the same value as the primary key. The reason is this helps keep the referential integrity intact by making sure that no related set of records can be mistaken as belonging to a different record being processed.
- Foreign Keys:
A Foreign key was given it's name because it helps maintain a relationship between outside tables and the main table being used. This one could typically allow duplicates however (see below about Relationship for the reasons why).
- Primary Keys:
When the Relational DBMS was invented is when the concept of relationships was created. The first implementation of a relationship system was in DBase with the "SET RELATION TO" instruction. When a relation was created in DBase it made the whole set of related tables synchronize themselves to each other automatically but today these relation only server to help maintain a database's referential integrity (see below). There are four types of relationships that you can make between tables of a database. Here they are:
- One To One Relationships:
This relationship states that a record in a table must be related to one record of another table. Not more than one but one record in both related tables must be present for the same record.
- One To Many Relationships:
This is the most common relationship used. this states that of one record in one table, atleast one (or more) records must exist in the related table. This is where you'd do your typical Invoice Header and Invoice Purchase Details type of relationships.
- Many To One Relationships:
This one is the inverted version of the previous relationship. This says that one or more records of the first table must belong to one record in the related table. A good example of that is a customer opening more than one bank account in a bank and the bank works with the bank account as the main table and not the customers themselves.
- Many To Many Relationships:
This relationship implies that one or more records from the first table must have one or more record in the related table. This can be usefull for cross referencing tables. There are many cases where more than one piece of a car, for example can be used to build one or more model of cars. There's other examples as well. This type of relationship isn't used as often as the one to many relationship but it is used often in these types of situations.
- One To One Relationships:
- Referencial Integrity:
This concept is defined as if all the tables that are related to each other all have data to respect the contents of all the relationships that are defined then one can say that the database's referencial integrity is intact. This means that for each of the fore types of relationship defined in a database system, all data that lies in each of the related table must be present. This also means that if, for example, in a one to one relationship, one of the tables should have 2 elements relating one element in another table then the referencial integrity of the database has been compromised.
Indeed, over the years there have been many types of databases developed. The main reason for the different types was simply because of the evolution of the data that were needed by the industry. It was, in my opinion a very evolutionnary (not revolutionnary) sequence of database types that simply fit the needs of the data that would be handled by the databases of the times. Here we'll cover these database types.
- Flat File tables:
This was the first type of data storage available, it consisted of flat files (text files) in which a line of text data represented a record. It was later called the Fixed Length Text file because all rows were lined up with the previous ones to allow to easily parse it's contents to retreive the different elements that made up a racord. It was created just to have a means of storing the data (which was the main focus of databases at the time, to store the data. Later, flat files evolved a bit because a new need became obvious, the need to find the data. The first evolution was found in spreadsheets which allowed to sort the data and then you could just page up and down to go to the area where your record would most likely reside. And even later systems like DBase and FOX database environment gave flat files even more flexibility. But they were still flat files (one filename per table) which had it's limits.
- Navigational Database Management Systems:
Not too long after the Flat File model, the needs to browse through the data, the ability to go to specific places in the database became apparent. Data became bigger and bigger and in the mean the "conventional" means of retreiving data was becoming less and less of a "proper" solution. Hence came the Navigational DBMS. It was the first model that actually allowed a database to be interrogated. For instance you could answer questions like "give me all the people in Washington" and it would go about perusing the database collecting the records that matched the criteria and return it back to you. However good this was for global maintenance, when you only wanted one record in the whole database, this system had to browse through all the records nonetheless until it reached the end of the database.0
- Relational Database Management Systems:
The Relational Model was the first Model that boasted the seperation of data elements in order to avoid unnecessary redundancy of information in a record. Bring the information elements back together was the key to this model's success. A relational model assumes that there will be more than one table to navigate and search in order to piece information together about a given criteria. A "key" field is defined to connect the pieces of information throughout all the "related" tables of data. This model was the first model that could allow a very streamlined organization of the data so that the tables would stay as small as possible in size. It was invented by Edgar Codd (who worked at IBM at the time). It was also then that what they call a "set oritented language" was developed for this need of looping. This set oriented language became what we now know as SQL. Main difference between navigational DBMS and Relational DBMS was that in the Navigational DBMS you looped to find a particular record, in the Relational DBMS you looped to find all the information pertaining to a particular record.
- Multidimensional Database Management Systems:
Multidimensional DBMS was shortlived. In essence it was invented because at times, doing a query in a fully normalized Relating DMBS could sometimes be quite long. And so it was created to denormalize a database in order to be able to query the data faster (somewhat of a step backwards which is probably why it was so short lived). Indeed, Multidimensional DBMSs ignore the logical/physical independence of the relational model and instead exposes pointers to the programmer. Instead of finding an address by looking up the "key" in the address table, the multidimensional DBMS store a pointer to the data in question. In fact, if the data is "owned" by the original record (that is, no other records in the main table point to it), it can be stored in the same physical location, thereby increasing the speed at which it can be accessed. needless to say referential integrity was darn near impossible to achieve hence another reason for it's short lived existence There was no 100% sure way to secure the database from itself.
- Object Oriented Database Management Systems:
The only good thing about Multidimentional DBMS was that they led directly to the development of Object-Oriented DBMS. The Relational DBMS serves to represent the relationship between the tables of a database. An Object Oriented DBMS is there to represent containment of one table as part of another table. Hence you could say that an invoice needs a customer from the customer table, details of items purchased in the details table and so on. When normalizing the database for this model it still resembles the classic normalization approach with a twist on records being part of other records in other tables. But it is still about avoiding redundancy in information as much as possible.
You seen this term earlier in this document. Database normalization is the process in which duplicate information is prevented and avoided when designing a database. There are 6 defined normalization levels (5 distinct levels and an modified version of the 3 normal form). Each of these normalization phase is designed to refine the previous level a degree more to assure that the database definition is properly normalized. Let's see these Normalization Forms along with their description. After we'll take an example situation and create a normalized database to reflect the needs of that situation.
- First Normal Form (1NF):
This first normal form requires that all column values in a table are atomic (a number is an atomic value while a list or a set is not). This essentially means that you shouldn't have Repetitive fields in a table. For example if your table has Item1ID, Item1Description, Item2ID, Item2Description and so on, those should be put in a seperate table definition.
- Second Normal Form (2NF):
The Second Normal Form requires that all elements of a table that are not dependant on the primary key, be put in a separate table. For example if you have the following fields in a table: PartNumber, SupplierID, Price, SupplierAddress, you don't need to go by the PartNumber to get the supplier information. Therefore Supplier Information should be in it's own table definition.
- Third Normal Form (3NF):
The Third normal form requires that there are no non-trivial functional dependencies of non-key attributes on something other than a superset of a candidate key. A relation is in 3NF if none of the non-Primary Key attributes are a fact about any other non-Primary Key attribute. Hence if you have the following fields: PartNumber ManufacturerName and ManufacturerAddress the manufacturer information is not information about the part and therefore should be in a seperate table definition.
- Boyce-Codd Normal Form (BCNF):
This is the modified version of the Third Normal Form. It requires that the database be 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. This essentially forces the fields in a table definition to only be about what the Key field is about (except for foreign keys that relate to information in other tables.
- Fourth Normal Form (4NF):
The Fourth normal form (or 4NF) requires that there are no non-trivial multi-valued dependencies of attribute sets on something else than a superset of a candidate key. For Example If an employee relation has multiple job categories and multiple locations where they work it might seem to make sense to just add these fields in: EmployeeID, LocationID and JobCategoryID. However, we might have to enter Employee's Job Category more than once if they fulfill the same Job Category at more than one location. There should be 3 many to many relations here: EmployeeID, JobCategoryID is one, EmployeeID, LocationID is another and finally JobCategoryID, LocationID is the last one. Then the relation is truely 4NF.
- Fifth Normal Form (5NF):
The Fifth normal form (or 5NF or PJ/NF) requires that there are no non-trivial join dependencies that do not follow from the key constraints. A relation is said to be in the 5NF if and only if it is in 4NF and every join dependency is implied by the candidate keys. For example, if a field can be calculated it shouldn't be stored. This is a level of normalization that optimized disk space usage a great deal but at the cost of performance.
I've stated here all the normalization forms. However, more than likely most database designs consider only the first three normaliziation levels. The reason why is because the other normalization levels usually cost too much in performance to be useful. All these normalization phases can be applied in both a relation DBMS and an Object-Oriented DBMS, the only difference is how you imply a relationship between two tables. In a relation DBMS you simply create a relationship stating which fields are the foreign keys going into which table and if it's a one to one, one to many, etcetc type of relationship. In Object-Oriented DBMS you create a relationship in terms of what the parent table contains (An InvoiceHeader table has a InvoiceDetails Table to hold the items purchased by the customer and specify the key that holds this containment relationship together).
IN CONCLUSION TO THIS FIRST INSTALLEMENT:
And this is it for the bigger part of the theory (the first installement of this series). There's alot of material that I've covered and that's only to prepare you for what's up ahead. I tried to make
it as concise and clear as I could to help make sure that the next installement will go smoothly. If you have questions, be sure to let me know so I can explain it further if needed.
In the next installement we'll take a complete sample situation and we'll define a database to represent the situation. We'll also normalize the database as per what we've learned here in terms of normalization. In other words, we'll begin to take the theory you learned here and apply it to a realworld situation. I'll give you tips and pointers to help you figure out what type of table you'd probably need as well as what fields you'd typically need to properly describe the items being stored in the table. Database design is really about domain specific knowledge and common sense. The way the situation is described to you often holds key information when you try to evaluate the database needs of a project. I'll show you how to extract that information from a typical description and you'll see it becaomes a very straightforward process in the ned. Until then, Happy coding and remember that you can always email me with questions or comments.