Database Normalization

Normalization is a systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like Insertion, Update and Deletion Anamolies. It is a multi-step process that puts data into tabular form by removing duplicated data from the relation tables.

Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. Normalization usually involves dividing large tables into smaller tables and defining relationships between them. It also deals with the insertion, update and deletion anomalies.

First Normal Form:
Deals with atomicity. Columns should not have multiple values.

Student       Age      Subject 
Adam          15       Biology, Maths 
Alex          14       Maths

Second Normal Form:
Removes partial dependency.
a table is in 2NF if and only if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table.

As per the Second Normal Form there must not be any partial dependency of any column on primary key. It means that for a table that has concatenated primary key, each column in the table that is not part of the primary key must depend upon the entire concatenated key for its existence. If any column depends only on one part of the concatenated key, then the table fails Second normal form.

Student    Age    Subject 
Adam       15     Biology 
Adam       15     Maths 
Alex       14     Maths 

Candidate Key is {Student, Subject}. Age is dependent only on Student.

Third Normal Form:
Removes transitive dependency. Table is in 3NF if
– It is in second normal form
– There is no transitive functional dependency
By transitive functional dependency, we mean we have the following relationships in the table: A is functionally dependent on B, and B is functionally dependent on C. In this case, C is transitively dependent on A via B

Student_id    Student_name  Student_DOB Student_Zip  City State 

Student Zip is dependent on Student_id and City / State are dependent on Zip.

BCNF (Boyce Codd Normal Form):
– Strict form of 3NF

Fourth Normal Form:
-Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

——————————-
| EMPLOYEE | SKILL | LANGUAGE |
===============================

Fifth Normal Form:

http://www.studytonight.com/dbms/database-normalization.php

Sharding

Sharding is the equivalent of “horizontal partitioning”. When you shard a database, you create replica’s of the schema, and then divide what data is stored in each shard based on a shard key. For example, I might shard my customer database using CustomerId as a shard key – I’d store ranges 0-10000 in one shard and 10001-20000 in a different shard. When choosing a shard key, the DBA will typically look at data-access patterns and space issues to ensure that they are distributing load and space across shards evenly.

?”Vertical partitioning” is the act of splitting up the data stored in one entity into multiple entities – again for space and performance reasons. For example, a customer might only have one billing address, yet I might choose to put the billing address information into a separate table with a CustomerId reference so that I have the flexibility to move that information into a separate database, or different security context, etc.

To summarize – partitioning is a generic term that just means dividing your logical entities into different physical entities for performance, availability, or some other purpose. “Horizontal partitioning”, or sharding, is replicating the schema, and then dividing the data based on a shard key. “Vertical partitioning” involves dividing up the schema (and the data goes along for the ride).

MySQL Cluster vs MySQL Replication

MySQL currently supports two different solutions for creating a high availability environment and achieving multi-server scalability.

MySQL Replication
The first form is replication, which MySQL has supported since MySQL version 3.23. Replication in MySQL is currently implemented as an asyncronous master-slave setup that uses a logical log-shipping backend.
A master-slave setup means that one server is designated to act as the master. It is then required to receive all of the write queries. The master then executes and logs the queries, which is then shipped to the slave to execute and hence to keep the same data across all of the replication members.
Replication is asyncronous, which means that the slave server is not guaranteed to have the data when the master performs the change. Normally, replication will be as real-time as possible. However, there is no guarantee about the time required for the change to propagate to the slave.
Replication can be used for many reasons. Some of the more common reasons include scalability, server failover, and for backup solutions.
Scalability can be achieved due to the fact that you can now do can do SELECT queries across any of the slaves. Write statements however are not improved generally due to the fact that writes have to occur on each of the replication member.
Failover can be implemented fairly easily using an external monitoring utility that uses a heartbeat or similar mechanism to detect the failure of a master server. MySQL does not currently do automatic failover as the logic is generally very application dependent. Keep in mind that due to the fact that replication is asynchronous that it is possible that not all of the changes done on the master will have propagated to the slave.

MySQL Cluster
MySQL Cluster is a shared nothing, distributed, partitioning system that uses synchronous replication and automatic sharding in order to maintain high availability and performance.
MySQL Cluster is implemented through a separate storage engine called NDB Cluster. This storage engine will automatically partition data across a number of data nodes. The automatic partitioning of data allows for parallelization of queries that are executed. Both reads and writes can be scaled in this fashion since the writes can be distributed across many nodes.
Internally, MySQL Cluster also uses synchronous replication in order to remove any single point of failure from the system. Since two or more nodes are always guaranteed to have the data fragment, at least one node can fail without any impact on running transactions. Failure detection is automatically handled with the dead node being removed transparent to the application. Upon node restart, it will automatically be re-integrated into the cluster and begin handling requests as soon as possible.

A shared nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system

Storage Engines for MySQL Tables

MySQL provides various storage engines for its tables as below:

MyISAM
InnoDB
MERGE
MEMORY (HEAP)
ARCHIVE
CSV
FEDERATED
NDB
EXAMPLE
BLACKHOLE

Each storage engine has its own advantages and disadvantages. It is crucial to understand each storage engine features and choose the most appropriate one for your tables to maximize the performance of the database. In the following sections we will discuss about each storage engine and its features so that you can decide which one to use.

MyISAM

MyISAM extends the former ISAM storage engine. The MyISAM tables are optimized for compression an speed. MyISAM tables are also portable between platforms and OSes.

The size of MyISAM table can be up to 256TB, which is huge. In addition, MyISAM tables can be compressed into read-only tables to save space. At startup, MySQL checks MyISAM tables for corruption and even repair them in case of errors. The MyISAM tables are not transaction-safe.

Before MySQL version 5.5, MyISAM is the default storage engine when you create a table without explicitly specify the storage engine. From version 5.5, MySQL uses InnoDB as the default storage engine.

InnoDB

The InnoDB tables fully support ACID-compliant and transactions. They are also very optimal for performance. InnoDB table supports foreign keys, commit, rollback, roll-and forward operations. The size of the InnoDB table can be up to 64TB.

Like MyISAM, the InnoDB tables are portable between different platforms and OSes. MySQL also checks and repair InnoDB tables, if necessary, at startup.

MERGE

A MERGE table is a virtual table that combines multiple MyISAM tables, which has similar structure, into one table. The MERGE storage engine is also known as the MRG_MyISAM engine. The MERGE table does not have its own indexes; it uses indexes of the component tables instead.

Using MERGE table, you can speed up performance in joining multiple tables. MySQL only allows you to perform SELECT, DELETE, UPDATE and INSERT operations on the MERGE tables. If you use DROP TABLE statement on a MERGE table, only MERGE specification is removed. The underlying tables will not be affected.

Memory

The memory tables are stored in memory and used hash indexes so that they are faster than MyISAM tables. The lifetime of the data of the memory tables depends on the up time of the database server. The memory storage engine is formerly known as HEAP.

Archive

The archive storage engine allows you to store a large number of records, which for archiving purpose, into a compressed format to save disk space. The archive storage engine compresses a record when it is inserted and decompress it using zlib library as it is read.

The archive tables only allow INSERT and SELECT commands. The archive tables do not support indexes, so reading records requires a full table scanning.

CSV

The CSV storage engine stores data in comma-separated values file format. A CSV table brings a convenient way to migrate data into non-SQL applications such as spreadsheet software.

CSV table does not support NULL data type and read operation requires a full table scan.

FEDERATED

The FEDERATED storage engine allows you to manage data from a remote MySQL server without using cluster or replication technology. The local federated table stores no data. When you query data from a local federated table, the data is pull automatically from the remote federated tables.

NDB
This clustered database engine is particularly suited for applications that require the highest possible degree of uptime and availability.

BLACKHOLE
The Blackhole storage engine accepts but does not store data and retrievals always return an empty set. Useful for testing.

EXAMPLE
The Example storage engine is “stub” engine that does nothing. You can create tables with this engine, but no data can be stored in them or retrieved from them.

MyISAM vs InnoDB

MyISAM was the default storage engine chosen by MySQL database, when creating a new table. But since 5.5 version, InnoDB is the default storage engine.

The major differences between these two storage engines are :

  • InnoDB supports transactions which is not supported by tables which use MyISAM storage engine. InnoDB table supports foreign keys, commit, rollback, roll-and forward operations
  • InnoDB has row-level locking, relational integrity i.e. supports foreign keys, which is not possible in MyISAM.
  • InnoDB ‘s performance for high volume data cannot be beaten by any other storage engines available.
  • The size of MyISAM table can be up to 256TB, which is huge. InnoDB tables can be upto 64 TB
  • In addition, MyISAM tables can be compressed into read-only tables to save space.

Tables created in MyISAM are known to have higher speed compared to tables in InnoDB. But since InnoDB supports volume, transactions, integrity it’s always a better option which you are dealing with a larger database. It is worth mentioning that a single database can have tables of different storage engines.

MyISAM is good for read-heavy applications, but it doesn’t scale very well when there are a lot of writes. Even if you are updating one field of one row, the whole table gets locked, and no other process can even read from it until that query is finished. MyISAM is very fast at calculating SELECT COUNT(*) types of queries.

InnoDB tends to be a more complicated storage engine and can be slower than MyISAM for most small applications. But it supports row-based locking, which scales better. It also supports some more advanced features such as transactions.