SQL Server Always On Availability Group (AG) Listener: step by step guide


Availability databases hosted on SQL Server Always On Availability Group (AG) can be connected using a unique Virtual Network Name (VNN) called Availability Group Listener. When Availability Group is enabled, clients can connect to databases in both primary and secondary replicas without explicitly specifying the SQL Server instance name. You don’t even need to know the instance name to connect to Availability Group (AG).

When you have configured read-only routing for secondary replicas and the application or client connection contains the application intent as read-only, the listener will redirect the connection to secondary readable replicas, Otherwise, the listener redirects both read-write and read-only intent connections to primary replica. If there is a fail-over, the listener will redirect connections to new primary and secondary connections based on the application intent.
Availability group listener consist of following objects:
  • Domain Name Systems (DNS)
  • Listener Port
  • One or more IP addresses (VIP)
The Listener is always owned by the SQL Server instance where the primary replica resides. At the time of failover, the new primary replica will own the listener.

Example:
  • DNS: salesag.mscorp.com
  • Port: 1433
  • IP: static or DHCP

Creating Availability Group Listener:

Availability Group Listener can be created while creating Availability Group or it can be created post creating the AG. The Listener creation options are available on Specify Replicas window and under the Listener tab.

If you want to skip the Listener creation while creating Availability Group, you can leave the default option Do not create an availability group listener now check the below image for reference:

New Availability Group - Specify Replicas
New Availability Group - Specify Replicas


Step by step guide to add SSISDB to SQL Server 2016 Always On Availability Group (AG)

Starting from SQL Server 2016, you can add SSISDB as Availability Database in Always On Availability Group (AG). This enables high availability and automated failover for SSISDB. This article provides step by step guide to add SSISDB to Availability Group.

Make sure, you have installed SQL Server Integration Services on both primary and secondary replicas.

Step 1: Create SSIDB on primary replica:

Right-click on the Integration Services Catalogs and click on Create Catalog Wizard option

Creating Integration Service (SSIS) Catalog
Creating Integration Service (SSIS) Catalog

Understanding and configuring Windows Server Failover Cluster Quorum for SQL Server Always On


What is quorum?

SQL Server availability solutions such as Always On Availability Group (AG) and Failover Cluster Instance (FCI) depends on Windows Server Failover Cluster (WSFC) for high availability and failover operations. WSFC failover works based on the quorum. Generally, quorum is minimum requirement of something to be up and running. In Windows Cluster, quorum decides, minimum number of nodes required for cluster to be up and running.

In Windows Server cluster, more than half of the nodes required for cluster to be up and run. For example, if you have 4 node cluster, you need (4 / 2) + 1 nodes for cluster to work.  This is called quorum in WSFC.
Quorum and Roles:

Windows Server designed in a way, quorum understands the node which is active and nodes which are in stand-by. Also, at a given point of time, there cannot be more than active nodes in a Windows Server Failover Cluster. The quorum aware of this active node and in case of fail-over, it will decide the next active node and stand-by nodes. In AG and FCI, active means, nodes which owns the Resource Group at a given point of time.

Quorum and Partitioned Clusters:

In Windows Server Failover Cluster, each node communicates to other nodes in the cluster through a dedicated network connection. When there is communication failure between nodes, there will be partition of nodes and each node will think, other nodes are down and trying to host Resource Group (active) for keeping the system up and running. In Windows Server Failover Cluster (WSFC), more than one node cannot be active at a given point of time. This conflict is called as split-brain scenario.

Quorum designed in a way, it is aware of this communication, and if there are partitions in cluster, due to network failure or some other issues, it will intervene to prevent the split-brain scenarios. The partition with majority quorum will own the resource group and cluster will force stop the cluster service from nodes of other partitions and its removed from the WSFC.

WSFC designed as when the communication between removed nodes established either manually or automatically, nodes can communicate to current cluster nodes, they will automatically join the cluster and start their cluster service.

How Windows Server Failover Cluster Quorum works:

Let’s assume, the cluster is partitioned due to network failure, each partitioned nodes will try to own the resource group. In order to achieve this, each subset of nodes has to prove the quorum majority.

For example, you have 5 node cluster, post partition each subset will try to prove the quorum majority and subset with 3 nodes will own the resources. Here, cluster quorum used to avoid the split-brain situation.

This works well, when you have odd number of nodes, what happens when you have even number of nodes? For example, you have a cluster with 4 nodes, and if network failure, you have partitioned cluster of 2 subsets with 2 nodes each. It is 50/50 quorum, and again both subset will consider itself as majority and try to own the resources, and fail to prove the majority hence cluster will be down.

How the cluster manages quorum, when you have even number of nodes? There are 2 options:

Add a cluster witness and increase the vote count by one or decrease the vote of node as zero and make total number of votes as odd number.

Starting from Windows Server 2012 quorum is configured automatically using Dynamic Quorum. However cluster witness must be added manually to cluster.

Dynamic Quorum:

Post verifying the quorum majority, the new majority definition will be updated among new cluster nodes. The concept of the total number of votes adopted after a successive failure is known as Dynamic Quorum. This allows cluster to lose one node, then another, then another until the last standing member and majority definition is updated on each loss of nodes dynamically.

For example, you have a 3 node cluster and if there is a failure, the subset of 2 nodes will survive and other node will be removed from cluster. The quorum needed for cluster to be up and running is more than half, in our case, we have left with only 2 nodes, the quorum is 50/50. Now cluster will automatically zero down vote of either one of the nodes and other node will be assigned as majority node with total vote of 1 out 1. This is called dynamic quorum.

Dynamic Witness:

Starting from Windows Server 2012 R2, the vote of cluster witness is calculated dynamically, when you have odd number nodes, the witness does not have vote and, if there are even number of nodes, the witness will have votes make total number of vote as odd. The process of dynamically deciding the witness vote is called Dynamic Witness.

For example, you have a 5 node cluster with a cluster witness. By default, each node and witness will have 1 vote each. The total vote for quorum of this cluster is 5, in this case, the value of witness vote is 0, since it is already an odd number. Now assume, there is a failure and one node removed from cluster, the total vote value will be 4, now the cluster will automatically assign the voting of witness by 1 to make the total value of vote as odd number. This process is called, dynamic witness.

Quorum Witness Types:

The Windows Server Failover Cluster (WSFC) supports the following 3 types of quorum witnesses:

Disk Witness:

A small size of storage disk (typically 1 GB) attached with the cluster. The quorum disk is accessible from all nodes in the cluster and is highly available. Disk witness contains the copy of the cluster database.

File Share Witness:

A shared folder from an external server, which is accessible from all the nodes in the cluster. Usually from Windows File Share or a folder from Domain Controller. It should be reliable and DBA should aware of any changes to the file share, either access related or maintenance related. File Share witness maintains the clustering information in a witness.log file, but does not store copy of the cluster database.
  
Cloud Witness:

Introduced on Windows Server 2016, BLOB storage account in Azure, which is accessible to all the nodes in the cluster. Cloud witness maintains the clustering information in a witness.log file, but does not store copy of the cluster database.

Quorum Models:

Choosing the correct quorum model is very critical decision for your availability solution to work. As we discussed earlier, Windows Server provides multiple combination of quorum models. Let’s look at what works best for your availability solution.
  • Node Majority
  • Node & Disk Majority
  • Node & File Share Majority
  • Node & Cloud Storage Majority
  • Disk-only - No Majority.
Let’s look at each models in detail.

Node Majority:

Quorum majority is calculated based on the active number of nodes in the cluster. By default, each node will have 1 vote each. There is not witness configured here. Recommended quorum mode for Availability Groups (AGs) and Failover Cluster Instances (FCIs) when there is an odd number of nodes.

Node & Disk Majority:

Quorum majority vote is calculated as number of active nodes in the cluster and shared disk cluster resource. Connectivity by any node to disk resource count as an affirmative vote for the disk. Size of the disk should be 512MB minimum and it should be excluded from Anti-Virus scanning.  The disk resource should be able to failover as a stand-alone instance. This model is recommended for Failover Cluster Instance but not recommended for Availability Group (AGs).

Node & File Share Majority:

In this model, voting is based on the active number of nodes in the cluster and a file share resource. By default, each node will have a vote and file share will have 1 vote. As a best practice, the file share resource should not be physically located on any node of the cluster, in-case of loss of that node will result in loss of 2 votes. This model is recommended for Availability Groups and Failover Cluster Instances, when there are even number of nodes in the cluster.

Node & Cloud Storage Majority:

In this model, voting is calculated based on the active number of nodes in the cluster and Azure Blob Storage. By default, each node will have 1 vote and Azure Blob Storage will have a vote. The Cloud witness is available from Windows Server 2016 and previous versions can continue to use other models. Cloud Storage witness act as data centre and will provide reliable voting. This is recommended for Always On Availability Group and Failover Cluster Instance, when there is an even number of nodes.
  
Disk-Only - No majority:

In this model, there is no quorum majority calculated, active node is determined by the shared disk cluster resource. There are no votes for the nodes in the cluster, only single-vote of the disk-resource is required to be online as primary. All nodes in the cluster must have connectivity to disk to gain an affirmative vote and be online. This model is not recommended for Always On Availability Group and Failover Cluster Instances. This model exists only for backward compatibility.

Example Scenarios:

Scenario 1: Two nodes without witness:

Cluster will dynamically decide the quorum, vote of either one of two nodes will be zeroed down and another node will be assigned as total votes 1 out 1 and it has quorum majority and own the resources.

If the current active node fails unexpectedly, then the cluster will go down, it is a single point of failure, but if is a graceful shutdown, the voting will be shifted to the another node and the cluster will be still up. There is a fifty percent chance, the cluster will survive one failure. This is the reason we need a cluster witness.

Scenario 2: Two nodes with witness:

In this case, the quorum has total 3 votes. If any of a node or witness goes down, you will still have a node and witness or 2 nodes with quorum majority and cluster will be up and running. This cluster will survive maximum one failure.

Scenario 3: Three nodes without witness:

Total vote is 3, if any one of the nodes is, the vote will be 2/3 and the cluster can survive, at this point, dynamic quorum updated and cluster will be scenario 1. This cluster can survive one failure and there is a fifty percent chance of next failure.

Scenario 4: Three nodes with witness:

In this case, the witness does not have vote (dynamic witness), so the total vote is an odd number 3. If any one of the nodes fails, the cluster will be scenario 2. This cluster can survive 2 subsequent failures of node.

Scenario 5: Four nodes without witness:

In this scenario, the cluster automatically zero downs one vote from a node and make quorum majority as odd number. If a node fails, the cluster becomes scenario 3. This cluster can survive 2 subsequent failures of nodes and there is a fifty percent chance of next subsequent failure of nodes.

Scenario 6: Four nodes with witness:

All four nodes and witness will have a vote and make the quorum majority an odd number. In case of failure, the cluster will become scenario 4. The cluster will survive 3 node failures.

Configuring the failover cluster quorum:

There are two ways you can configure the cluster quorum, Failover Cluster Manager and Failover Cluster Windows PowerShell cmdlets. In this article, we are going to configure node and file share majority for a cluster. The Failover Cluster Manager wizard is self-explanatory and having understanding of the quorum concept will help you in configuring other quorum models and it is very similar.


Go to Failover Cluster Manager and connect to the cluster. Right click on the cluster name and go to More Actions and choose the Configure Cluster Quorum Settings.

Failover Cluster Manager - Quorum Configuration
Failover Cluster Manager - Quorum Configuration

How to configure Read-Only routing on SQL Server 2016 Always On Availability Group (AG)?

The great advantage of Always On Availability Group over other availability solutions is their ability to scale-out read operations (SELECT queries). Read-only routing is a feature of Always On Availability Group, which redirects connection requests from applications into readable secondary. In SQL Server 2016 Always On Availability Group, you can configure upto 8 readable secondary replicas.

The connections are redirected based on the routing rules. Always On Availability Group provides the following options to define the rules:
  • Read-Only Routing URL
  • Read-Only Routing List
Before defining the routing rules, we must understand the following conditions:

Architecting reliable backup and recovery strategy


I have been managing multiple databases, mostly in Microsoft SQL Server and MySQL server, both on on-premise and cloud. We have faced a lot of challenging issues such as records deleted from a user table, backup file is corrupted, backup file is not compatible, backup files got deleted, backup storage is full and backup is running for long time, etc. When you are facing this issues for the first time, it is surprising to see new kind of issues every day and if you are not good in documentation, repetitive issues will keep occurring and we keep fixing rather than suctioning. If you are facing same challenges, then you need to focus on your backup and recovery strategy.

A well-designed backup and recovery strategy maximizes data availability and minimizes data loss without tolerating business requirement. In this post, we will discuss about the following topics:
  1. Recovery Time Objective (RTO)
  2. Recovery Point Objective (RPO)
  3. Business Strategy
  4. Backup 
    • Backup Type
    • Backup Frequency
    • Backup Device
    • Retention and Archival plans
  5. Restore
    • Backup Availability
    • Restoration Access
    • Documentation
  6. Testing
Recovery Time Objective (RTO): Time taken to recover the data

Recovery Time Objective defines, how long would an interruption to data access need to be, before the business ceased to be viable / capable of working successfully?

Best Practice:

Backup and Restoration plan that involves quick recovery with small data loss might be acceptable to business than a plan that reduces data loss but takes much longer to implement.
Deciding optimal RTO might involve following things:
  •  Identifying the correct backup media
  •  Identifying the authorized persons to perform restore
  •  Identifying documentation related to the restore
Example:

RTO for the Stock Exchange System might be 1 minutes – which means, in case of any failure on database system, backup should be restored and system should be online within a minute.

Recovery Point Objective (RPO): The amount of data loss

Recovery Point Objective defines how much data loss will be acceptable to business.

Best Practice:

Backup plan should be able to recover all committed data without any loss and it should be acceptable to loss the data which is in process at the time of failure.

Example:

RPO for the banking system might be 1 seconds – which means, in case of any failure, the bank can loss at most 1 second worth of data.

Business Strategy:

Backup and recovery plan should be in sync with business requirement. As a best practice, it should be communicated with business stakeholders and expectations of the business users should be managed in line with the agreed strategy.

Each database must be categorized based on the importance to the core functions of the organization and RPO and RTO for each database must be documented well in advance and communicated to all stakeholders.

Example:

In certain business, Sales database is very critical than Marketing database and needs to restored in high priority, in case of failure. The RPO and RTO for Sales and Marketing databases will be defined accordingly.

The business requirements will determine all aspects of the backup strategy, including:
  •          How frequently backups need to occur?
  •          How much data is to be backed up each time?
  •          The type of media, that the backups will be held on?
  •          Retention and Archival plans for the media? 
Backup Type:

Backup and recovery plan should clearly define the backup type required for different scenarios. A full backup might be appropriate in certain cases and partial backup will be appropriate in other scenario.

For example, Product database may have huge number of records and backup and restoration of a huge backup is time consuming. When you have RTO of 1 minutes, it is not right strategy to have a full backup which needs 1 hour of restoration time. Similarly, when RPO is defined as 5 seconds, we cannot run full backup on every 5 seconds.

Backup and Recovery strategy should address this question. It should choose the combination of different backup type to achieve the defined RPO and RTO metrics.
Generally, there are 3 types of backups are used:
  •  Full backup.
  •  Differential backup. 
  •  Log backup.
Backup Frequency:

Most of the places, the backups are implemented in below frequency:
  •  Full backup – every week
  •  Differential backup – every day
  •  Log backup – every hour

Review the above schedule and think, will it satisfy the business requirement. There is no right or wrong questions, the above frequency might satisfy the RPO of 1 hours and depends on the size of the database and where backup stored, RTO might be achievable. There is no one size fit all approach here, you have to understand the business requirement and define the RTO and RPO.

When business demand no data loss, log backup of every 1 hour will not help, in case any failure, with the above backup plan, you may lose the data worth 1 hour. A well-designed backup and restore plan will address this issue and it should be documented and shared across business stakeholders.





Backup Device:

This is one of the important factors, choosing the wrong backup device, will negatively impact the RTO and business requirement. Generally, backup device can be either one or combination of below devices:
  •  Disk Files – (SAN, NAS, etc.)
  •  Cloud Storage (Azure Blob Storage, Amazon Cloud Storage, etc.)
Best Practice:

Make sure, the backup device is easily accessible and secured. There are instances, SAN administrator mistakenly deleted the backup files while freeing disk spaces and other instances, where DBA does not have access to backup storage device and the person who is having access is on vacation. If you are in Enterprise setup, usually different things are managed by multiple persons, if you are in small or start-up setup, you might have access to everything. Plan your backup and recovery strategy accordingly, where to write the backup and how you can access the backup, when there is a need.

Also, there are compliance policy, where it demands the backup must be secured, might be using encryption algorithms.

Archival and Retention Policy:

Archival and Retention policy should be defined considering the legal requirements and compliance requirement. Depends upon your region compliance requirements may impact how you store and how long you retain the backup.

Restore:

Backup Availability:

There are scenarios, where backup file got corrupted or mistakenly deleted from storage, if you have only one backup copy, it will lead to loss of data. Backup plan should consider this and implement, more than one copy of backup in an accessible location.

Best Practice:

Generally, it is good practice to have more than 2 copies of backup. Keep 1 one copy in the local machine or attached drive, other copy in SAN/NAS storage and another copy in remote location, such as cloud or other geo.

Keeping a copy in local help in reduce the time needed to transfer the file to local machine and it will help in achieving RTO.






Restoration Access:

In a large enterprise setup, everyone may not have access to restore or recover data. These permissions include, access to backup storage, access to restoration and security access, if encryptions are in place. It is advisable to capture all these permissions in Backup and Recovery strategy and share it with business and technical stakeholders.

In case of emergency, everyone knows their role and understand the sequence, when to execute their job. It will remove the confusions and keep the team in calm environment.

Documentation:

I have interviewed a lot of people for the database administration role and I realized, the documentation is least important task in their day to day job. For a DBA, most of the tasks are repetitive and we are using a lot of scripts and automations. Even something developed or written by us few days back will be very difficult to understand later point of time. When you are in emergency, searching on Google or using others scripts will not help you and it seems it is simple to recover the data but it is not. It is mandatory to document each and every steps and business requirements in clear manner and communicate with all stakeholders.

Also, it is important to agree both the business and technical stakeholders on how quickly data can be retrieved and how much data can be lost and get sign-off on the documentation.

Testing:

As John Ruskin said, “Quality is never an accident; it is always the result of intelligent effort.” It does not matter, how perfectly you designed your backup and recovery strategy, unless you tested many number of times and documented all the challenges and resolution, it is going to be surprise at the time of emergency.

As a best practice, restore the backup on testing and staging frequently in a defined interval and restore on the production machine in longer intervals. There are instances, a backup copy may work in one machine with same database engine and configuration, may not work in other machine with same engine and configuration. Make a habit of test, test and test frequently in a defined period.

Summary:

A well-designed Backup and Recovery strategy will ensure the data availability and minimize the data lose. A documented strategy will act as guiding light in terms of emergency. It will remove the confusions and dependency on individuals. A good tested and orchestrated strategy ensures the business continuity and give guarantee to your sleep J.

I hope this helps in planning your strategy, please share your thoughts on this article and let me know, if I need to edit or add content on this article. Thanks you for your time. Be proactive rather than being reactive!!!