Architecting reliable backup and recovery strategy


I have been managing multiple databases, mostly in Microsoft SQL Server and MySQL server, both on on-premise and cloud. We have faced a lot of challenging issues such as records deleted from a user table, backup file is corrupted, backup file is not compatible, backup files got deleted, backup storage is full and backup is running for long time, etc. When you are facing this issues for the first time, it is surprising to see new kind of issues every day and if you are not good in documentation, repetitive issues will keep occurring and we keep fixing rather than suctioning. If you are facing same challenges, then you need to focus on your backup and recovery strategy.

A well-designed backup and recovery strategy maximizes data availability and minimizes data loss without tolerating business requirement. In this post, we will discuss about the following topics:
  1. Recovery Time Objective (RTO)
  2. Recovery Point Objective (RPO)
  3. Business Strategy
  4. Backup 
    • Backup Type
    • Backup Frequency
    • Backup Device
    • Retention and Archival plans
  5. Restore
    • Backup Availability
    • Restoration Access
    • Documentation
  6. Testing
Recovery Time Objective (RTO): Time taken to recover the data

Recovery Time Objective defines, how long would an interruption to data access need to be, before the business ceased to be viable / capable of working successfully?

Best Practice:

Backup and Restoration plan that involves quick recovery with small data loss might be acceptable to business than a plan that reduces data loss but takes much longer to implement.
Deciding optimal RTO might involve following things:
  •  Identifying the correct backup media
  •  Identifying the authorized persons to perform restore
  •  Identifying documentation related to the restore
Example:

RTO for the Stock Exchange System might be 1 minutes – which means, in case of any failure on database system, backup should be restored and system should be online within a minute.

Recovery Point Objective (RPO): The amount of data loss

Recovery Point Objective defines how much data loss will be acceptable to business.

Best Practice:

Backup plan should be able to recover all committed data without any loss and it should be acceptable to loss the data which is in process at the time of failure.

Example:

RPO for the banking system might be 1 seconds – which means, in case of any failure, the bank can loss at most 1 second worth of data.

Business Strategy:

Backup and recovery plan should be in sync with business requirement. As a best practice, it should be communicated with business stakeholders and expectations of the business users should be managed in line with the agreed strategy.

Each database must be categorized based on the importance to the core functions of the organization and RPO and RTO for each database must be documented well in advance and communicated to all stakeholders.

Example:

In certain business, Sales database is very critical than Marketing database and needs to restored in high priority, in case of failure. The RPO and RTO for Sales and Marketing databases will be defined accordingly.

The business requirements will determine all aspects of the backup strategy, including:
  •          How frequently backups need to occur?
  •          How much data is to be backed up each time?
  •          The type of media, that the backups will be held on?
  •          Retention and Archival plans for the media? 
Backup Type:

Backup and recovery plan should clearly define the backup type required for different scenarios. A full backup might be appropriate in certain cases and partial backup will be appropriate in other scenario.

For example, Product database may have huge number of records and backup and restoration of a huge backup is time consuming. When you have RTO of 1 minutes, it is not right strategy to have a full backup which needs 1 hour of restoration time. Similarly, when RPO is defined as 5 seconds, we cannot run full backup on every 5 seconds.

Backup and Recovery strategy should address this question. It should choose the combination of different backup type to achieve the defined RPO and RTO metrics.
Generally, there are 3 types of backups are used:
  •  Full backup.
  •  Differential backup. 
  •  Log backup.
Backup Frequency:

Most of the places, the backups are implemented in below frequency:
  •  Full backup – every week
  •  Differential backup – every day
  •  Log backup – every hour

Review the above schedule and think, will it satisfy the business requirement. There is no right or wrong questions, the above frequency might satisfy the RPO of 1 hours and depends on the size of the database and where backup stored, RTO might be achievable. There is no one size fit all approach here, you have to understand the business requirement and define the RTO and RPO.

When business demand no data loss, log backup of every 1 hour will not help, in case any failure, with the above backup plan, you may lose the data worth 1 hour. A well-designed backup and restore plan will address this issue and it should be documented and shared across business stakeholders.





Backup Device:

This is one of the important factors, choosing the wrong backup device, will negatively impact the RTO and business requirement. Generally, backup device can be either one or combination of below devices:
  •  Disk Files – (SAN, NAS, etc.)
  •  Cloud Storage (Azure Blob Storage, Amazon Cloud Storage, etc.)
Best Practice:

Make sure, the backup device is easily accessible and secured. There are instances, SAN administrator mistakenly deleted the backup files while freeing disk spaces and other instances, where DBA does not have access to backup storage device and the person who is having access is on vacation. If you are in Enterprise setup, usually different things are managed by multiple persons, if you are in small or start-up setup, you might have access to everything. Plan your backup and recovery strategy accordingly, where to write the backup and how you can access the backup, when there is a need.

Also, there are compliance policy, where it demands the backup must be secured, might be using encryption algorithms.

Archival and Retention Policy:

Archival and Retention policy should be defined considering the legal requirements and compliance requirement. Depends upon your region compliance requirements may impact how you store and how long you retain the backup.

Restore:

Backup Availability:

There are scenarios, where backup file got corrupted or mistakenly deleted from storage, if you have only one backup copy, it will lead to loss of data. Backup plan should consider this and implement, more than one copy of backup in an accessible location.

Best Practice:

Generally, it is good practice to have more than 2 copies of backup. Keep 1 one copy in the local machine or attached drive, other copy in SAN/NAS storage and another copy in remote location, such as cloud or other geo.

Keeping a copy in local help in reduce the time needed to transfer the file to local machine and it will help in achieving RTO.






Restoration Access:

In a large enterprise setup, everyone may not have access to restore or recover data. These permissions include, access to backup storage, access to restoration and security access, if encryptions are in place. It is advisable to capture all these permissions in Backup and Recovery strategy and share it with business and technical stakeholders.

In case of emergency, everyone knows their role and understand the sequence, when to execute their job. It will remove the confusions and keep the team in calm environment.

Documentation:

I have interviewed a lot of people for the database administration role and I realized, the documentation is least important task in their day to day job. For a DBA, most of the tasks are repetitive and we are using a lot of scripts and automations. Even something developed or written by us few days back will be very difficult to understand later point of time. When you are in emergency, searching on Google or using others scripts will not help you and it seems it is simple to recover the data but it is not. It is mandatory to document each and every steps and business requirements in clear manner and communicate with all stakeholders.

Also, it is important to agree both the business and technical stakeholders on how quickly data can be retrieved and how much data can be lost and get sign-off on the documentation.

Testing:

As John Ruskin said, “Quality is never an accident; it is always the result of intelligent effort.” It does not matter, how perfectly you designed your backup and recovery strategy, unless you tested many number of times and documented all the challenges and resolution, it is going to be surprise at the time of emergency.

As a best practice, restore the backup on testing and staging frequently in a defined interval and restore on the production machine in longer intervals. There are instances, a backup copy may work in one machine with same database engine and configuration, may not work in other machine with same engine and configuration. Make a habit of test, test and test frequently in a defined period.

Summary:

A well-designed Backup and Recovery strategy will ensure the data availability and minimize the data lose. A documented strategy will act as guiding light in terms of emergency. It will remove the confusions and dependency on individuals. A good tested and orchestrated strategy ensures the business continuity and give guarantee to your sleep J.

I hope this helps in planning your strategy, please share your thoughts on this article and let me know, if I need to edit or add content on this article. Thanks you for your time. Be proactive rather than being reactive!!!

[Solved] MySQL User Operation - ERROR 1396 (HY000): Operation CREATE / DROP USER failed for 'user'@'host'

Error Message:
ERROR 1396 (HY000): Operation CREATE USER failed for 'admin'@'%'

Generic Error Message:
Operation %s failed for %s

Error Scenario:
Operation CREATE USER failed
Operation DROP USER failed

Reason:
The reason for this error is, you are trying to do some user operation but the user does not exist on the MySQL system. Also, for drop user, the user details are stored somewhere in the system, even though, you have already dropped the user from MySQL server.

Resolution:
Revoke all access granted to the user, drop the user, and run FLUSH PRIVILEGE command to remove the caches. Now create/drop/alter the user, it will work.

REVOKE ALL ON *.* FROM 'user'@'host';
DROP USER 'user'@'host';
FLUSH PRIVILEGES;

Grant Tables:
The following tables will help you in identifying the user related informations (as of MySQL 5.7):

mysql.user: User accounts, global privileges, and other non-privilege columns
mysql.db: Database-level privileges
mysql.tables_priv: Table-level privileges
mysql.columns_priv: Column-level privileges
mysql.procs_priv: Stored procedure and function privileges
mysql.proxies_priv: Proxy-user privilege




Related MySQL Bug Reports:

I hope this post will help you, if you faced this error on some other scenarios or if you know, some other workaround / solution for this error, please add on the comment section. It will be helpful for other readers.




MySQL Replication Notes


The MySQL Replication was my first project as a Database Administrator (DBA) and I have been working with Replication technologies for last few years and I am indebted to contribute my little part for development of this technology. MySQL supports different replication topologies, having better understanding of basic concepts will help you in building and managing various and complex topologies.
I am writing here, some of the key points to taken care when you are building MySQL replication. I consider this post as a starting point for building a high performance and consistent MySQL servers.  Let me start with below key points
Hardware.
MySQL Server Version
MySQL Server Configuration
Primary Key
Storage Engine
I will update this post with relevant points, whenever I get time. I am trying to provide generic concepts and it will be applicable to all version of MySQL, however, some of the concepts are new and applicable to latest versions (>5.0).


Hardware:
Resourcing of the slave must be on par (or better than) for any Master to keep up with the Master. The slave resource includes the following things:

Disk IO
Computation (vCPU)
InnoDB Buffer Pool (RAM)
MySQL 5.7 supports Multi Threaded Replication, but are limited to one thread per database. In case of heavy writes (multiple threads) on Master databases, there is a chance that, Slave will be lag behind the Master, since only one thread is applying BINLOG to the Slave per database and its writes are all serialised.

MySQL Version:

It is highly recommended to have Master and Slave servers should run on same version. Different version of MySQL on slave can affect the SQL execution timings.
For example, MySQL 8.0 is comparatively much faster than 5.5. Also, it is worth to consider the features addition, deletion and modifications.

MySQL Server Configuration:

The MySQL server configuration should be identical, we may have identical hardware resources and same MySQL version, but if MySQL is not configured to utilize the available resources in similar method, there will be changes in execution plan.
For example, InnoDB buffer pool size should be configured on MySQL server to utilize the memory. Even if we have a identical hardwares, buffer pool must be configured at the MySQL instance.

Primary Key:

The primary key plays an important role in Row-Based-Replication (when binlog_format is either ROW or MIXED). Most often, slave lagging behind master while applying RBR event is due to the lack of primary key on the table involved.
When no primary key is defined, for each affected row on master, the entire row image has to be compared on a row-by-row basis against the matching table’s data on the slave.
This can be explained by how a transaction is performed on master and slave based on the availability of primary key:


With Primary Key
Without Primary Key
On Master
Uniquely identifies the row
Make use of any available key or performs a full table scan
On Slave
Uniquely identifies each rows & changes can be quickly applied to appropriate row images on the slave.
Entire row image is compared on a row-by-row basis against the matching table’s data on slave.
Row-by-row scan can be very expensive and time consuming and cause slave to lag behind master.
When there is no primary key defined on a table, InnoDB internally generates a hidden clustered index named GEN_CLUST_INDEX containing row ID values. MySQL replication cannot use this hidden primary key for sort operations, because this hidden row IDs are unique to each MySQL instance and are not consistent between a master and a slave.
The best solution is to ensure all tables have a primary key. When there is no unique not null key available on table, at least create an auto-incrementing integer column (surrogate key) as primary key.
If immediately, it is not possible to create a primary key on all such tables, there is a workaround to overcome this for short period of time by changing slave rows search algorithm. This is not the scope of this post, I will write future post on this topic.

Mixing of Storage Engines:

MySQL Replication supports different storage engines on master and slave servers. But, there are few important configuration to be taken care when mixing of storage engines.
It should be noted that, InnoDB is a transactional storage engine and MyISAM is a non-transactional.
On Rollback: If binlog_format is STATEMENT and when a transaction updates, InnoDB and MyISAM tables and then performs ROLLBACK, only InnoDB tables data is removed and when this statement is written to binlog it will be send to slave, on slave where both the tables are MyISAM will not perform the ROLLBACK, since it does not supports transaction. It will leave the table inconsistent with master.
Auto-Increment column: This should be noted that, the way auto-increment is implemented on MyISAM and InnoDB different, MyISAM will lock a entire table to generate auto-increment and the auto-increment is part of a composite key, insert operation on MyISAM table marked as unsafe. Refer this page for better understanding https://dev.mysql.com/doc/refman/8.0/en/replication-features-auto-increment.html
Referential Integrity Constraints: InnoDB supports foreign keys and MyISAM does not. Cascading updates and deletes operations on InnoDB tables on master will replicate to slave, only if the tables are InnoDB on both master and slave. This is true for both STATEMENT and ROW based replications. Refer this page for explanation: https://dev.mysql.com/doc/refman/5.7/en/innodb-and-mysql-replication.html
Locking: InnoDB performs row-level locking and MyISAM performs table-level locking and all transaction on the slave are executed in a serialized manner, this will negatively impact the slave performance and end up in slave lagging behind the master.
Logging: MyISAM is a non-transactional storage engine and transactions are logged into binary log by client thread, immediately after execution, but before the locks are released.
If the query is part of the transaction and if there is a InnoDB table involved on same transaction and it is executed before the MyISAM query, then it will not written to binlog immediately after execution, it will wait for either commit or rollback. This is done to ensure, order of execution is same in slave as in the master.
Transaction on InnoDB tables will be written to the binary log, only when the transaction is committed.
It is highly advisable to use transactional storage engine on MySQL Replication. Mixing of storage engine may leads to inconsistency and performance issues between master and slave server. Though MySQL does not produce any warnings, it should be noted and taken care from our end.
Also, the introduction of MySQL 8.0 (from 5.6) with default storage engine as InnoDB and deprecating older ISAM feature indicates the future of MySQL database, it is going to be completely transactional and it is recommended to have InnoDB storage engine.
There is a discussion online, about the removal of other storage engines and development on InnoDB engine by Oracle, though it is not scope of this article, as a Database Administrator, I prefer having different storage engine for different use cases and it has been unique feature of MySQL.
Reference:


I hope this post is useful, please share your thoughts / feedbacks on comment section.

[Solved] How to install MySQL Server on CentOS 7?


Recently, when I am working on setting up MySQL Enterprise Server, I found, there is too much information available over internet and it was very difficult for a newbie to get what is needed for direct implementation. So, I decided to write a quick reference guide for setting up the server, covering end to end, starting from planning to production to maintenance. This is a first post in that direction, in this post, we will discuss about installing MySQL Enterprise Server on CentOS 7 machine. Note that, the steps are same for both the Enterprise and Community editions, only binary files are different, and downloaded from different repositories.

If you are looking for installing MySQL on Windows operating system, please visit this page https://www.rathishkumar.in/2016/01/how-to-install-mysql-server-on-windows.html. I am assuming, hardware and the Operating System is installed and configured as per the requirement. Let us begin with the installation.

Removing MariaDB:

The CentOS comes with MariaDB as a default database, if you try to install, MySQL on top of it, you will encounter an error message stating the MySQL library files conflict with MariaDB library files. Remove the MariaDB to avoid errors and to have a clean installation.  Use below statements to remove MariaDB completely:

sudo yum remove MariaDB-server
sudo yum remove MariaDB-client (This can be done in single step)
sudo rm –rf /var/lib/mysql
sudo rm /etc/my.cnf

(Run with sudo, if you are not logged in as Super Admin).

Downloading RPM files:

MySQL installation files (On CentOS 7 – rpm packages) can be downloaded from MySQL yum repository.

For MySQL Community Edition – there is clear and step-by-step guide available at the MySQL website - https://dev.mysql.com/doc/mysql-yum-repo-quick-guide/en/. The only step missing is downloading MySQL yum repository to your local machine. (This might looks very simple step, but most of the newbies, it is very helpful). 


For MySQL Enterprise Edition – the binary files can be downloaded from Oracle Software Delivery Cloud (http://edelivery.oracle.com) for latest version or previous versions visit My Oracle Support (https://support.oracle.com/).

As mentioned earlier, there is a clear and step-by-step guide available at the MySQL website for Community Edition, I will be continue with installing Enterprise Edition, steps are almost same.

Choosing the RPM file:

For MySQL Community Edition, all the RPM files will be included in the downloaded YUM repository, but for Enterprise Editions, these files will be downloaded separately. (For system administration purpose, all these files can be created under a MySQL repository).

For newbies, it may be confusing to understand, the different RPM files and its contents, I am concentrating on only files required for stand-alone MySQL instances. If there is requirement for embedded MySQL or if you working on developing plugins for MySQL, can install other files. It is completely depends on your requirement. The following tables, describe the required files and where to install.

RPM File
Description
Location
mysql-commercial-server-5.7.23-1.1.el7.x86_64.rpm
MySQL Server and related utilities to run and administer a MySQL server.
On Server
mysql-commercial-client-5.7.23-1.1.el7.x86_64.rpm
Standard MySQL clients and administration tools.
On Server & On Client
mysql-commercial-common-5.7.23-1.1.el7.x86_64.rpm
Common files needed by MySQL client library, MySQL database server, and MySQL embedded server.
On Server
mysql-commercial-libs-5.7.23-1.1.el7.x86_64.rpm
Shared libraries for MySQL Client applications
On Server

Installing MySQL:

Install the MySQL binary files in the following order, this is to avoid dependency errors, the following statements will install MySQL on local machine:

sudo yum localinstall mysql-commercial-libs-5.7.23-1.1.el7.x86_64.rpm
sudo yum localinstall mysql-commercial-client-5.7.23-1.1.el7.x86_64.rpm
sudo yum localinstall mysql-commercial-common-5.7.23-1.1.el7.x86_64.rpm
sudo yum localinstall mysql-commercial-server-5.7.23-1.1.el7.x86_64.rpm

Starting the MySQL Service:

On CentOS 7, the mysql service can be started by following:

sudo systemctl start mysqld.service
sudo systemctl status mysqld.service

Login to MySQL Server for first time:

Once the service is started, the superuser account ‘root’@’localhost’ created and temporary password is stored at the error log file (default /var/log/mysqld.log). The password can be retrieved by using the following command:

sudo grep 'temporary password' /var/log/mysqld.log

As soon as logged in to MySQL with the temporary password, need to reset the root password, until that, you cannot run any queries on MySQL server. You can reset the root account password by running below command.

mysql –u root –h localhost -p
alter user 'root'@'localhost' identified by 'NewPassword';

You can verify the MySQL status and edition by running the following commands, sample output provided below for MySQL 8.0 Community Edition (GPL License) running on Windows machine.

MySQL License Status
Notes:

MySQL conflicts with MariaDB: in case if there is conflict with MariaDB, you will see the error message as below:

file /usr/share/mysql/xxx from install of MySQL-server-xxx conflicts with file from package mariadb-libs-xxx

To resolve this error remove mariadb server and its related files from CentOS server. Refer the section - Removing mariadb.

Can’t connect to mysql server: MySQL server is installed but unable to connect from client.

Check this page for possible causes and solutions: https://www.rathishkumar.in/2017/08/solved-cannot-connect-to-mysql-server.html

Please let me know, if you are facing any other errors on comment section. I hope this post is helpful.