8 Eylül 2016 Perşembe

An alternative way to manage jobs during an AlwaysOn AG failover

Hello beautiful SQL people!

In this article, I'd like to share a solution to handle SQL Server Agent Jobs (I'll call them "jobs" from now on) during a failover scenario in an AlwaysOn Availability Group environment. I know, there are some different suggestions around but I promise, this is a new one as far as I am aware.

I'll skip the basics and I assume you already know about SQL Server AlwaysOn Availability Group (AG), jobs and SQL Server Agent Alerts (alerts from now on) features. This is all about managing the status (enabled/disabled) of the jobs and not synchronising / copying them among AG replicas. Also for now, this solution works between 2 replicas only, but it can be expanded to cover more with some extra work.

Note: I haven't tested it yet, but I strongly believe that with some tweaking, this solution can be implemented in a Log Shipping or Database Mirroring environment too.

The solution is absolutely not a totally-new one. I love to read published articles written by my peers about SQL Server and try to follow them daily and up until now I've seen some variations of this solution around. However, I find this way much simpler to implement and manage. That's why I want to share it with the community and to reach out to a wider portion of the community, I'm writing it in English obviously which I do not do so often.

As you know SQL Server does not have a native solution to handle jobs after a AG failover. In terms of its enabled/disabled status, a job stays as is after a failover occurs, they are not failover-aware. In some environments, jobs are critical if not vital. So at least for a lot of DBAs out there, it's very important to switch the status of the jobs at the primary and secondary replicas immediately after a failover.

Some suggest adding conditional checks to each job, however it is sometimes almost impossible (SSIS, too many jobs to modify etc.), some suggest adding some tags to each job and schedule a job at both replicas to enable / disable them accordingly after a failover. Michael K. Campbell has a very long series of articles about AG on SQLMag. I'd like to thank to him to let me use his Function code below to detect the primary and secondary AG replicas in this solution. I prefer to use this function so that I can implement the solution to my SQL Server 2012 instances too.

With this solution, my aim is to set up & forget about jobs should a failover occurs. After implementing this, the only thing you may want to do is adding the names of some jobs to the exclusion table to exclude them from switching status after failovers. The rest will be handled by the solution and when a failover occurs, the status of your jobs will be the same at the new primary and secondary as it was previously. Enabled/disabled jobs at the previously-primary replica will be enabled/disabled at the new-primary replica and the same for the secondary. It does not matter if you perform a manual failover or an automatic failover in case the primary shuts down. Even if your primary shuts down / powered off, it will update the status of the jobs when the SQL Server Agent service is up. It just works.

I recommend you to create the following objects in your DBA / Tools database if you have any. Here are the codes and descriptions to create the solution.


************************** Beginning of the script **************************

-- REMINDER: The following codes must be executed at both of the replicas.


-- Create a Linked Server named "PARTNER" that goes to the other replica and same for the other one. 


-- [Optional] Create a schema named "jobs" to distinguish this solution from others. If you do not prefer to create this schema or if it's already taken and can not be used for this purpose, then please modify rest of the code below.

USE ;
GO
CREATE SCHEMA [jobs];
GO


-- This table will be updated automatically by a stored procedure "[jobs].[update_job_status_accross_nodes]". I prefer to execute this SP hourly by another scheduled job which is not scripted here. But hey! You already know how to do it.

CREATE TABLE [jobs].[job_status](
[job_name] [nvarchar](256) NULL,
[primary_enabled] [bit] NULL,
[secondary_enabled] [bit] NULL,
[collection_date] [datetime] NULL
) ON [PRIMARY]
GO

CREATE UNIQUE CLUSTERED INDEX CUIX_jobname ON [jobs].[job_status](job_name);
GO


-- This is the table you may want to add some jobs to exclude from switching automatically.

CREATE TABLE [jobs].[job_status_exclusion_list](
[job_name] [nvarchar](256) NULL
) ON [PRIMARY]

GO

CREATE UNIQUE CLUSTERED INDEX CUIX_jobname ON [jobs].[job_status_exclusion_list](job_name);
GO


-- To find out if the current replica is primary or not (courtesy of Michael K. Campbell)
CREATE FUNCTION [dbo].[fn_hadr_group_is_primary] (@AGName sysname)
RETURNS bit 
AS
        BEGIN 
                DECLARE @PrimaryReplica sysname; 

                SELECT TOP (1) @PrimaryReplica = hags.primary_replica 
                FROM 
                        sys.dm_hadr_availability_group_states hags
                        INNER JOIN sys.availability_groups ag ON ag.group_id = hags.group_id
                WHERE
                        ag.name = CASE WHEN @AGName IS NULL THEN ag.name ELSE @AGName END;

                IF UPPER(@PrimaryReplica) =  UPPER(@@SERVERNAME)
                        RETURN 1; -- primary

                RETURN 0; -- not primary
                
        END; 



-- Updates the "[jobs].[job_status]" table about the status of the jobs at the primary and secondary separately.

CREATE PROCEDURE [jobs].[update_job_status_accross_nodes]
AS
SET NOCOUNT ON;

IF (SELECT dbo.fn_hadr_group_is_primary(NULL)) = 1
BEGIN
DECLARE @date DATETIME;
SELECT @date = GETDATE();

DELETE FROM [jobs].[job_status];
INSERT INTO [jobs].[job_status]([job_name], [primary_enabled], [secondary_enabled], [collection_date]) 
SELECT 
[name] = CASE WHEN [primary].[name] IS NULL THEN [secondary].[name] ELSE [primary].[name] END, 
[primary].[enabled], 
[secondary].[enabled], 
@date 
FROM [msdb].[dbo].[sysjobs] AS [primary]
FULL OUTER JOIN [PARTNER].[msdb].[dbo].[sysjobs] AS [secondary] ON [primary].[name] = [secondary].[name];

DELETE FROM [PARTNER].[DBA].[jobs].[job_status];
INSERT INTO [PARTNER].[DBA].[jobs].[job_status]([job_name], [primary_enabled], [secondary_enabled], [collection_date]) 
SELECT 
[name] = CASE WHEN [primary].[name] IS NULL THEN [secondary].[name] ELSE [primary].[name] END, 
[primary].[enabled], 
[secondary].[enabled], 
@date 
FROM [msdb].[dbo].[sysjobs] AS [primary]
FULL OUTER JOIN [PARTNER].[msdb].[dbo].[sysjobs] AS [secondary] ON [primary].[name] = [secondary].[name];
END


-- Enables the jobs at the primary and secondary according to the records in "[jobs].[job_status]" and "[jobs].[job_status_exclusion_list]" tables and is executed by a job named "_DBA: Configure SQL Server Agent jobs after a failover" which is triggered by an alert. These are going to be created later.

CREATE PROCEDURE [jobs].[enable_jobs_after_failover]
AS
BEGIN
SET NOCOUNT ON;

-- If the SP is running at the new-Primary Replica, except for the ones in the exclusion list, enable all jobs which was enabled at the previously Primary Replica.
IF (SELECT [dbo].[fn_hadr_group_is_primary](NULL)) = 1
BEGIN
DECLARE @sql NVARCHAR(max) = '';

SELECT 
@sql += N'EXEC msdb.dbo.sp_update_job @job_name = ''' + [name] + N''', @enabled = 1; '
FROM [msdb].[dbo].[sysjobs] AS [j]
INNER JOIN [jobs].[job_status] AS [js] ON [j].[name] = [js].[job_name]
WHERE 
NOT EXISTS(SELECT [job_name] 
FROM [jobs].[job_status_exclusion_list] AS [jsel] WHERE [jsel].[job_name] = [j].[name]) AND
[js].[primary_enabled] = 1
ORDER BY [j].[name];

EXEC (@sql);
END

-- If the SP is running at the new-Secondary Replica, except for the ones in the exclusion list, enable all jobs which was enabled at the previously Secondary Replica.
IF (SELECT [dbo].[fn_hadr_group_is_primary](NULL)) = 0
BEGIN
DECLARE @sql2 NVARCHAR(max) = '';

SELECT 
@sql2 += N'EXEC msdb.dbo.sp_update_job @job_name = ''' + [name] + N''', @enabled = 1; '
FROM [msdb].[dbo].[sysjobs] AS [j]
INNER JOIN [jobs].[job_status] AS [js] ON [j].[name] = [js].[job_name]
WHERE 
NOT EXISTS(SELECT [job_name] 
FROM [jobs].[job_status_exclusion_list] AS [jsel] WHERE [jsel].[job_name] = [j].[name]) AND
[js].[secondary_enabled] = 1
ORDER BY [j].[name];

EXEC (@sql2);
END
END


-- Disables the jobs at the primary and secondary according to the records in "[jobs].[job_status]" and "[jobs].[job_status_exclusion_list]" tables and is executed by a job named "_DBA: Configure SQL Server Agent jobs after a failover" which is triggered by an alert.

CREATE PROCEDURE [jobs].[disable_jobs_after_failover]
AS
BEGIN
SET NOCOUNT ON;

-- If the SP is running at the new-Secondary Replica, except for the ones in the exclusion list, disable all jobs which was disabled at the previously Secondary Replica.
IF (SELECT [dbo].[fn_hadr_group_is_primary](NULL)) = 0
BEGIN
DECLARE @sql NVARCHAR(max) = '';

SELECT 
@sql += N'EXEC msdb.dbo.sp_update_job @job_name = ''' + [job_name] + N''', @enabled = 0;' 
FROM [jobs].[job_status] AS [j]
WHERE 
NOT EXISTS(SELECT [job_name] FROM [jobs].[job_status_exclusion_list] AS [ex] WHERE [ex].[job_name] = [j].[job_name]) AND
[j].[secondary_enabled] = 0
ORDER BY [j].[job_name];

EXEC (@sql);
END

-- If the SP is running at the new-Primary Replica, except for the ones in the exclusion list, disable all jobs which was disabled at the previously Primary Replica.
IF (SELECT [dbo].[fn_hadr_group_is_primary](NULL)) = 1
BEGIN
DECLARE @sql2 NVARCHAR(max) = '';

SELECT 
@sql2 += N'EXEC msdb.dbo.sp_update_job @job_name = ''' + [job_name] + N''', @enabled = 0;' 
FROM [jobs].[job_status] AS [j]
WHERE 
NOT EXISTS(SELECT [job_name] FROM [jobs].[job_status_exclusion_list] AS [ex] WHERE [ex].[job_name] = [j].[job_name]) AND
[j].[primary_enabled] = 0
ORDER BY [j].[job_name];

EXEC (@sql2);
END
END


-- The job that executes the stored procedures above during a failover.

I find it unnecessary to copy/paste the whole script of the job here, it's a simple job consisting of 3 steps.

The name of the job must be "_DBA: Configure SQL Server Agent jobs after a failover", if you do not agree, than please modify the code below accordingly.

REMINDER: Ensure the following code is executed in the correct database where these objects are created.

1. step's code: EXEC [jobs].[enable_jobs_after_failover];
2. step's code: EXEC [jobs].[disable_jobs_after_failover];
3. step's code: EXEC [jobs].[update_job_status_accross_nodes];

A schedule in the type of "Start automatically when SQL Server Agent starts" is needed so that the steps above will be executed at startup in case a replica shuts down unexpectedly.

You also may want to add a notification (Database Mail configuration and an existing operator is needed) so that you'll be informed about a failover.


-- The following alert is triggered when a failover occurs and it is used to execute the job at both of the replicas.

DECLARE @job_id_var uniqueidentifier;
SELECT @job_id_var = job_id FROM msdb..sysjobs WHERE name = '_DBA: Configure SQL Server Agent jobs after a failover'

EXEC msdb.dbo.sp_add_alert
        @name = N'AG Role Change',
        @message_id = 1480,
    @severity = 0,
    @enabled = 1,
    @delay_between_responses = 0,
    @include_event_description_in = 1,
    @category_name=N'[Uncategorized]', 
    @job_id=@job_id_var
GO

REMINDER: Take the following part out if you do not have a SQL Server Agent Operator.

EXEC msdb.dbo.sp_add_notification 
        @alert_name = N'AG Role Change', 
        @operator_name = N'DBA', 
        @notification_method = 1; 

GO

************************** End of the script **************************

After executing the scripts above, the SP "[jobs].[update_job_status_accross_nodes]" needs to be executed at the primary replica at least once to update the "[jobs].[job_status]" table so that the solution will know about the status of your jobs at both of the replicas.

During an AG failover the alert is triggered at the previously-primary (if it's still up) and new-primary and every time it is triggered, it executes the job that is created above. Then job enables and disables the related jobs at both of the replicas. Simple.

Some critical points:
- If you happen to decide to use a different name for the job, then please be careful about the create alert step, because the default job name is used to find the correct job_id from the related system table.
- The job must be enabled all the time and at all replicas. This is vital. Otherwise the alert does not execute it.
- If you want to change the name of the Linked Server, then you must update the script accordingly, otherwise it'll be broken.
- Make sure that your SQL Server Agent service's startup parameter is "Automatic", so that it executes the job at startup and update the status of the jobs accordingly.

As I said from the beginning, most of these code blocks are familiar to almost all of us thanks to my peers' hard work. I guess the only bit I add is during a failover, executing a job by a triggered alert at both replicas and enabling/disabling related jobs this way. If a method exactly like this (obviously I'm unaware) has already been shared by someone else, please let me know so that I can write a note about it.

If you have a suggestion, question or if there's something wrong with the code or method above, please tell me and let's make it better together!

That's it folks! I hope you find this solution useful and practical as I do.

Thanks for reading,
Ekrem Onsoy
Microsoft SQL Server Consultant

Hiç yorum yok: