Monday, 4 January 2016

What is database normalization?

Some years ago, when I was start working as junior database administrator, one of seniors given me a conceptual data model sheet and asked me to read the model and prepare a logical model from the sheet. I studied ER diagram (Entity-Relationship diagram) during my engineering course, so I must have some basic knowledge, but the given conceptual model is little different from what I studied (only the symbols are different). So I was searching in internet with queries, "what is logical data model?" and "how to convert a conceptual data model to a logical data model?". I found some pages related to what I was searching and I understood the concept, after a week somehow I submitted my newly created incorrect logical data model to my senior (it is a simple user management application).


During that process I came across a term "Normalization" (I did incorrect Normalization without knowing the actual term for that process in my first task given by my senior) in some web pages with examples. At that time I could not understand the Normalization with the mentioned example. So I thought to write a post on normalization, intent of this post is to share the concepts of database normalization for beginners with a simple example. After reading this post you will be able to answer the following question.  

What is logical data model?
What is normalization?
What are the levels of normalization?
What is first normal form (1NF)?
What is second normal form (2NF)?
What is third normal form (3NF)?

Logical Data Model:


We have seen about the conceptual data model in the previous post, in which we used the school management application as our example. In CDM we capture the basic and critical concepts of a school, so the next step will be capturing the business requirements. The logical data model is a business solution. This model is independent of implementation concerns such as software or hardware. Logical data model uses business names for entities and data fields such as Customer, Order Number.

The logical model includes all of the data fields and business rules. For example, in our conceptual data model example, A Teacher may teach many Subjects. The logical data model will capture all of the details behind the Teacher and Subject entities, such as teacher's name, identity, address, subject name, code, level, etc. Logical data model is based on the conceptual data model and it provides a base for physical data model (database). 


Normalization:


When I was working in our carrot field with my Mom, she asked me to pack the carrot basket for selling in the market to our customer. So I categorized each basket according to size, color and variety. Organizing the carrot baskets in this way helped me to understand the carrot size and their variety. This helped us to sell our products in better price in the market. 

We apply set of rules to achieve the goal of organizing something. In our carrot field, I applied the rules, such as size, color and variety to organize the carrot baskets. Similarly, in simple words normalization, is the process of applying set of rules to organize entities and data fields within our database. These rules are based on the business process in our organisation.  

Normalization process involves the spliting of a entity into less redundant and smaller entities without losing data instances, in our example (below) we will see why we are splitting. Series of rules are applied at different levels, each level includes the rules of the lower level. Let us see those normalization levels in detail.


Levels of normalization:


Levels of normalization is given below,

First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Boyce/Codd normal form (BCNF)
Fourth normal form (4NF)
Fifth normal form (5NF)

Each level includes the rules of the lower levels. For example, if a model is 4NF, it is also in BCNF, 3NF, and so on. Usually we call a model is normalized if it is in 3NF. This is because the higher levels normalization (BCNF, 4NF and 5NF) used for specific situations that occurs very rarely. So in this post we are going to see about only first, second and third normal forms.

In normalization process we need to make sure that, every data field is atomic and uniquely provides a fact and only about its primary key.

For example, lets consider a website company, a subset of conceptual data model for this website is given below.

User Conceptual Model
USER CONCEPTUAL MODEL

Initially all the data fields can be assigned to one entity, the User itself. The following diagram shows the User entity with data fields:

USER ENTITY AT BEGINNING
USER ENTITY

When we assigning all the data fields to User, we need to have a clear definition for User. Since different people or organisations may have different interpretation for the term User. 

A person, a company or a person from a company who access our website.

We know the definition for the User, now we have to answer the question, how can a website owner or admin identify a specific User?, What is uniquely identify a User?, or What data field used to uniquely identify a User?

In order to answer to above questions, we can assume that, user id with user email address uniquely identify a User. So the new User entity model with the primary key will be as look like follows:

USER ENTITY WITH CANDIDATE KEY

Note that, while choosing primary key, we need to consider many rules, most important of them are, we cannot use sensitive information as primary key also, we need to choose the data field which is short in length. In our model, the selected primary keys are just for illustration purpose.

In the above entity model, the User Id is just a number and therefore the values do not have to be a sequential number. So it may contain duplicate values,    we are going to remove redundancies by applying the series of rules mentioned in the normalization process.


First normal form (1NF)


First normal form ensures that, domain of every data field must contain only atomic (single) values and the value of each data field contains only single value.

In simple words, when a business person asks for name of the user id 1, the system must return only one value Steve. It should not return John and Steve. Also, it should return single value from that domain. For example, Person Name may contain multi values such as "Steve Jobs", first normal form split this multi value into a single value by spliting the data field as Person First Name and Person Last Name.
  
First normal form will eliminate the repeating data fields and multi valued data fields. Let us discuss about this rules here.

Now come to our website company model, what if someone has two or three Phone Numbers? For example, consider a User Raj having two phone numbers, in order to accommodate this requirement, our User model will be altered as showing below:

REPEATING DATA FIELDS

In this model, the data field Phone Number repeated twice, so this violates the first normal form. Also there is a chance in future, a User may use more than two phone numbers like three or five and so on. In general a User can have more than one Phone Number. So we need to resolve this repeating data fields issue.

In order to resolve this repeating data fileds, we are creating a separate entity called User Phone Numbers. By creating one-to-many relationship with this User Phone Number, we can satisfy the business rule "Each User can have one or many User Phone Number", so the new User model may look like this:

REPEATING DATA FIELDS RESOLVED
REPEATING DATA FIELDS RESOLVED

Now, lets say, a Phone Number is used by more than one User. For example, a Phone Number 97872 11111 used by both Raj and Kumar. A model must resolve this need, we are creating a many to many relationship with the User. So the User model will be altered as follows,

USER AND PHONE NUMBER MANY TO MANY
USER AND PHONE NUMBER MANY TO MANY

We have solved the repeating data fields in the User model, yet the model is not in the first order. What if the business people asks for only Persons first name? what if the business asks for persons last name? and what if the business people asks for phone number of a particular country or particular area? As a data modeler you need to ask such questions with business people as well as to yourself. 

Most of the time, we are dealing with present business requirements, but as a data modeler you have to think in terms of future business needs, in future we have additional set of requirements, so we need to model our system in such a way that accommodate  future requirements as well.

Now lets come to our User model, in our model, what if the website owner asks for only person first name or last name? In our model, we have only Person Name which contains name values. For example, Person Name contains the value Bill Gates. When we need only first name we need to split the data field into separate data fields, such as Person First Name and Person Last Name, so that we can solve this multi-valued data fields issues. 

Every data field must contain only single atomic value, so the new model may looks like as follows:

USER MODEL IN 1NF
USER MODEL IN 1NF

Lets say you are modeling an application which is used only within your particular geographical area. For example, I was working with one of my clients, and my assignment was model for a billing application, which is used only by their employees who are all from within our country. Out of my interest, I referred some modeling books, I developed a model, which include the separate data field for Area Code, Country Code and Phone Number.

When I submitted my model, the requirement of the client was completely different, They don't need the separate fields for area and country code. Since there are no outside employees and they don't hire the outside employees for their field work. So they just needs the Phone number with area code in the same field. My effort for modeling those entities was not required. In our model just for illustration purpose we added a new data field Area Code in Phone Number entity.

Some times we may face such kind of situation where even the data field contains more than one value we need to consider them as atomic values. Make sure to ask questions to business people and make the requirements clear so that you can save the extra modeling efforts and time.













Second normal form (2NF):


We have already know that, each level of normalization includes the rules of the lower level of normalization, so a model in second normal form includes the rules of first normal form.

A model is in second normal form (2NF), if it is in first normal form (1NF) and every data field in the entity is dependent on any proper subset of candidate key of the entity. This means that, each entity must have minimal set of candidate keys that uniquely identifies each entity instance. 

In our User model example, the minimal set of candidate key instances are User Id and Email Address. Do we need both the User Id and Email Address to retrieve a single instances of each data field? No, we do not need both the User Id and and Email Address as primary keys in User entity. Either one of these is enough to uniquely identify a user and return a single instance of any of the data fields in the User. So we just need one of these as primary key.

We have already seen about the keys in data modeling and choosing a primary key from the set of candidate keys in previous post. By applying those criteria we chose User Id as the primary key and Email address as an alternate key. So the new updated model will be shown below:


USER MODEL WITH MINIMAL SET OF CANDIDATE KEY
USER MODEL - MINIMAL SET OF CANDIDATE KEY

Also analyzing the User model with the same view, we are getting the following questions, Is data field Gender applies to Company? Is data field Logo Image applies to Person? The answer is no as per our website company requirement, so we are introduce sub-typing into our model.  The updated model will be shown below:

USER MODEL WITH SUBTYPING
USER MODEL WITH SUB-TYPING

In the above diagram, you can see that, the data fields Person First Name, Last Name Title, Gender, Birth Date belongs to entity Person and Company Name and Logo Image data fields belongs to Company and data fields Email Address, Web Address, Address Line , City, State and Country shared by both Persona and Company. Also you can see that, both the Person and Company share same primary key User Id, that is primary key of the super-type user, since super-type and subtype share one-to-one relationship.

Lets recall the definition of the entity User, "a person, a company or a person from a company who access our website". As per the definition, we have to assign the relationship between the Person and Company. The following questions reveal the relationship between the Persona and Company.

Can a Person work for more than one Company? 
Can a Person exist without a Company?
Can a Company employ more than one Person?
Can a Company exist without a Person?

The answer for these questions are, a Person can work for one Company or no company at all and a Company can employ one or many Persons or no Person at all. The following diagram shows this model:

USER MODEL IN 2NF


Third normal form (3NF):


In simple words, third normal form deals with "nothing but the key". A model is in third normal form, if and only if the model is in second normal form (2NF) and every non-candidate key in the entity determined only by the candidate keys of that entity and not by any non-candidate key.

In third normal form we are going to remove hidden dependencies. Except alternate key and foreign keys all other data fields in the entity must directly dependent on the primary key of the same entity. When analyzing our User model, we see that there are some hidden dependencies within the address data fields. If we know the PIN code we can determine the city, state and country. Therefore, we move postal code, city, state and country to a different entity. 

The relationship between the User and new entity Address can be identified by asking the following questions, 

Can a User belongs to more than one City? 
Can a User exists without a City? 
Can a City contains more than one User? 
Can a City exists without a User?

A user belongs only one city and a user cannot exists without a city, also a city can contain more than one user also there are cities without a user. Therefore, the redefined model will like as follows:

USER MODEL IN 3NF
USER MODEL IN 3NF

The above diagram explains the model which is third normal form (3NF), once you learn the concept of normalization, it is not necessary to start the normalization from step by step. You can apply all levels at once. Also you can further normalize this model depends upon the requirements, but keep in mind this extra modeling efforts must be used only when it is necessary, since it may further delay the project allotted time and resources.

USER MODEL IN NORMALIZED FORM
USER MODEL IN NORMALIZED FORM

Note that in the above mentioned model, primary key names of the sub-types are changed for our convenience and it still reference the super-type primary key. I hope this post will help you to understand the normalization in logical data modeling. In next post let us discuss about the abstraction in logical data modeling. I expect your valuable feedback, kindly share your thoughts on comments section.
Post a Comment