Availability Group issues fixed with Alerts

 Availability Group issues fixed with Alerts

Being able to work with the Availability Groups and closely analyzing the SQL service, I am amazed by how all of that works. These SQL servers work fast and quickly without any Glitch. However, the service still requires the monitoring and analysis to ensure they are working in the right form. But don’t you think SQL server would be perfect to solve issues of AG?

Well, the answer to this would be a big yes!

In this post regarding availability group issues, I will mention some of the problems that I came across while working with the Availability Groups. For those issues and problems, what automatic solution can be done.

Problem: 1 Suspended data movement

The first and foremost issues that most people encounter in the Availability Group is the data that is glitching and not moving between the primary and secondary replica. This action or Glitch creates a suspended state for the movement of the data.

Because of the suspended data issue, one would want to restore the data as soon as possible and get the flow going again. But if the data is suspended for a long while, one has to take actions that will require a lot of effort. You might have to remove the database from the Availability Groups, or you will have to get the data or the actions back in sync by attaching the files and restoring the log files. So, why not try the SQL server to resolve this issue and get the data out of the suspended state.

Solution resuming the data

The solution to this data resuming problem is usually very easy. You can Resume the suspended data back into the data movement by issuing the t-SQL statement.

ALTER DATABASE [DB] SET HADR RESUME;

One just has to know that the data movement has been suspended for a particular database. However, time matters a lot to determine and respond to the suspended data movement. One has to understand the drastic major that could be taken to get the data back in sync. The SQL Server will raise an error 35264 in case of data suspension. This sys message is; 

AlwaysOn Availability Groups data movement for database ‘%.*ls’ has been suspended for the following reason: “%S_MSG” (Source ID %d; Source string: ‘%.*ls’). One has to Resume the database manually to Resume the data Movement in a particular database. For resuming the database and availability of databases, you can check the SQL online server books.

Automating with SQL server agent alerts for the resumption of data movement

For a particular error that is running the job, you have to create the SQL agent alert. It will help in the automation of the resuming data movement. To discover and determine which data movement has been suspended out of which particular database, we have to hit this system of dynamic management views to ensure that we know a particular database for further resuming the data movement. The DMV sys DM HADR replica database will guide you through and tell you what particular database is suspended and which data movement needs to be resumed. You can create the Alert and the job as follows.

Job script to resume data movement

By creating the job and alert, the availability group will get the Alert as soon as the data is suspended. It will immediately create the t SQL statement for resuming the data and get it out of the suspended state. However, when you have finally created the job, you need to take another step. An alert also needs to set up to trap the error and to run the job of data resuming successfully.

Trapping suspended data movement with an alertConfigure alert to run the job

When the database is in the suspended state, the SQL server will finally trap the error and will help in running the job after the creation of the job and the Alert. But if you are wondering how you can start the data movement and get it flowing again and out of the suspended state, you will have to work on the script below.

Wrapping it all up in a script

The T-SQL script to create the job and alert is:

/******************************************************************************


Create job to resume AG database data movement for suspended databases.


Create alert to catch when data movement has been suspended, and run the job.


*******************************************************************************


MODIFICATION LOG


*******************************************************************************


2018-11-20 WGS Initial creation.


******************************************************************************/


USE [msdb]


GO


BEGIN TRANSACTION;


DECLARE @ReturnCode INT;


SELECT @ReturnCode = 0;DECLARE @jobname sysname = N’AG – Resume Data Movement’;


DECLARE @categoryname sysname = ‘HADR-Availability Group’;


IF EXISTS (SELECT name FROM dbo.sysjobs where name = @jobname)


BEGIN


EXECUTE msdb.dbo.sp_delete_job @job_name = @jobname;


END;


IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=@categoryname AND category_class=1)


BEGIN


EXECUTE @ReturnCode = msdb.dbo.sp_add_category @class=N’JOB’, @type=N’LOCAL’, @name=@categoryname;


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


END;


EXECUTE @ReturnCode = msdb.dbo.sp_add_job


@job_name=@jobname,


@enabled=1,


@notify_level_eventlog=0,


@notify_level_email=2,


@notify_level_netsend=2,


@notify_level_page=2,


@delete_level=0,


@description=N’Resume data movement on suspended Availability Group databases.


This job can be run manually, or from an alert’,


@category_name=N’HADR-Availability Group’,


@owner_login_name=N’sa’;


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


EXECUTE @ReturnCode = msdb.dbo.sp_add_jobserver @job_name=@jobname, @server_name = N'(local)’


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep


@job_name=@jobname,


@step_name=N’Resume data movement in AG databases’,


@step_id=1,


@cmdexec_success_code=0,


@on_success_action=1,


@on_fail_action=2,


@retry_attempts=0,


@retry_interval=0,


@os_run_priority=0,


@subsystem=N’TSQL’,


@command=N’DECLARE @SQLCMD VARCHAR(1000);


DECLARE cDBSuspended CURSOR FOR


SELECT ”ALTER DATABASE [” + DB_NAME(database_id) + ”] SET HADR RESUME;”


FROM sys.dm_hadr_database_replica_states drs


JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id


WHERE ar.replica_server_name = @@SERVERNAME


AND drs.is_suspended = 1;


OPEN cDBSuspended;


FETCH NEXT FROM cDBSuspended INTO @SQLCMD;


WHILE @@FETCH_STATUS = 0


BEGIN


EXECUTE (@SQLCMD);


FETCH NEXT FROM cDBSuspended INTO @SQLCMD;


END;


CLOSE cDBSuspended;


DEALLOCATE cDBSuspended;


‘,


@database_name=N’master’,


@flags=0;


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


EXECUTE @ReturnCode = msdb.dbo.sp_update_job


@job_name=@jobname,


@enabled=1,


@start_step_id=1,


@notify_level_eventlog=0,


@notify_level_email=2,


@notify_level_netsend=2,


@notify_level_page=2,


@delete_level=0,


@description=N’Resume data movement on suspended Availability Group databases.’,


@category_name=N’HADR-Availability Group’,


@owner_login_name=N’sa’,


@notify_email_operator_name=N”,


@notify_netsend_operator_name=N”,


@notify_page_operator_name=N”;


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


DECLARE @alertname sysname = N’AG Data Movement suspended’;


IF EXISTS (SELECT * FROM msdb.dbo.sysalerts WHERE name = @alertname)


BEGIN


EXECUTE @ReturnCode = msdb.dbo.sp_delete_alert @alertname;


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


END;


/*


Alert text:


AlwaysOn Availability Groups data movement for database ‘%.*ls’ has been suspended for the following reason: “%S_MSG” (Source ID %d; Source string: ‘%.*ls’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.


*/


EXECUTE @ReturnCode = msdb.dbo.sp_add_alert


@name=@alertname,


@message_id=35264,


@severity=0,


@enabled=1,


@delay_between_responses=0,


@include_event_description_in=0,


@job_name = @jobname;


IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


COMMIT TRANSACTION;


GOTO EndSave


QuitWithRollback:


IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION;


EndSave:


GO

Problem: 2 Failover

One other issue that you will come across while you are working with the availability groups is the failover condition. When some data needs to be updated from the database, the available third party intervenes for showcasing the primary replica. However, the adjustment is required and needed for the cluster witness configuration. There may be a possibility that you will have to turn off the old primary replica. In this case, the SQL shows an error 1480 that interprets as failover of Availability Group. However, it can occur for both the primary replica of new and old databases. You can read the text of the error from sys messages as;

The %S_MSG database “%.*ls” is changing roles from “%ls” to “%ls” because the mirroring session or availability group failed over due to %S_MSG. However, the message that is shown after the error is only for the information of the user, and you cannot take any action on this message.

There are three roles that each replica will go through in the process of failover. When there is an old primary replica, it will change from primary to resolving state and then to do the secondary state. However, the old secondary replica I will change from secondary to resolving state and then to the primary state. You can track all of these roles and Change of these states in the error message.

There are three possible conditions for which the alerts can be shown. The first condition is if the failover occurred. Secondly, if the replica is in the instance of the primary role and the failover has occurred. Whereas the last and the third potential alert can show that the replica is now at the secondary instance. The script that is available below will help in creating the last two potential conditions of the alerts.

/*


Use this event to run a job when the replica becomes primary


*/

USE [msdb]

GO

EXEC msdb.dbo.sp_add_alert @name=N’AG Failover Detected – Now Primary’,

@message_id=1480,

@severity=0,

@enabled=1,

@delay_between_responses=0,

@include_event_description_in=0,

@event_description_keyword=N'”RESOLVING” to “PRIMARY”‘,

@job_id=N’00000000-0000-0000-0000-000000000000′

GO

/*

Use this event to run a job when the replica becomes secondary

*/

USE [msdb]

GO

EXEC msdb.dbo.sp_add_alert @name=N’AG Failover Detected – Now Secondary’,

@message_id=1480,

@severity=0,

@enabled=1,

@delay_between_responses=0,

@include_event_description_in=0,

@event_description_keyword=N'”RESOLVING” to “SECONDARY”‘,

@job_id=N’00000000-0000-0000-0000-000000000000′

GO

When you have successfully added the alerts, it will help in the modification of the error to successfully run a particular job. These alerts will not only tell you about the failover condition but also help you to know the instance of the replica. If you just want to know about the failover, you can conveniently remove the message on the Alert to alert on.

Configuring alert to detect AG failover

Wrap up

In the above article, we have successfully learned how to capture the issues in the system and fix the availability groups using the alerting system of the SQL server agent. We have also learned to respond to these errors and alerts by running the jobs after the creation of alerts and fix these issues using the SQL server.

Reference: Article 

No comments:

Post a Comment