Snippet Name: AAG_Failover_Timeout_Settings
Tags:
Created Date: 2025-06-16 13:12:44
Last Modified Date: 2025-05-31 10:32:10
Dependencies:
DBA Notes:
Sample Output:
Title: AAG Failover Timeout Settings
Author: user
In my SQL Server AAG, I am seeing a failover, and I think it is related to some timeout. Where can I find that setting?
Author: ChatGPT
In your SQL Server Always On Availability Group (AAG), failovers can occur due to various timeout settings that control health detection and response times. The key timeout settings you should check include:
### 1. **Cluster Heartbeat Timeout (Windows Failover Cluster)**
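- Two cluster properties per network scope determine how long a node waits before considering another node unresponsive: the delay is the interval between heartbeats in milliseconds, and the threshold is the number of consecutive missed heartbeats that is tolerated.
- You can check them using PowerShell:
```powershell
# List the heartbeat delay and threshold properties of the cluster
Get-Cluster | Format-List *Subnet*
```
Look for `SameSubnetDelay`, `SameSubnetThreshold`, `CrossSubnetDelay`, and `CrossSubnetThreshold`.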
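The time a node can stay silent before the cluster declares it down is the heartbeat interval (`SameSubnetDelay` or `CrossSubnetDelay`, in milliseconds) multiplied by the matching threshold. The sketch below only does that arithmetic; the four numbers are example values (common on recent Windows Server versions, but an assumption here) and should be replaced with what `Get-Cluster` reports.

```sql
-- Example values only (assumed); replace with the numbers Get-Cluster reports.
DECLARE @SameSubnetDelayMs    int = 1000;  -- interval between heartbeats, same subnet
DECLARE @SameSubnetThreshold  int = 10;    -- missed heartbeats tolerated, same subnet
DECLARE @CrossSubnetDelayMs   int = 1000;  -- interval between heartbeats, across subnets
DECLARE @CrossSubnetThreshold int = 20;    -- missed heartbeats tolerated, across subnets

-- Detection window = delay x threshold, converted from milliseconds to seconds
SELECT
    @SameSubnetDelayMs  * @SameSubnetThreshold  / 1000.0 AS SameSubnetDetectionSeconds,
    @CrossSubnetDelayMs * @CrossSubnetThreshold / 1000.0 AS CrossSubnetDetectionSeconds;
```

With these example values the windows work out to 10 and 20 seconds, so a network interruption shorter than that should not, by itself, cause the cluster to remove a node.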
### 2. **Availability Group Lease Timeout**
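- The lease is a handshake between the SQL Server instance and the cluster resource host on the same node. If it is not renewed within the lease timeout, the primary replica goes offline, which can lead to a failover.
- It is stored as a property of the availability group's cluster resource rather than as a column in `sys.availability_groups`, so check it with PowerShell (replace `MyAG` with the name of your AG resource):
```powershell
# LeaseTimeout is a private property of the AG cluster resource (value in milliseconds)
Get-ClusterResource -Name "MyAG" | Get-ClusterParameter -Name LeaseTimeout
```
- The default is 20000 milliseconds (20 seconds). If lease timeouts are too aggressive, consider increasing this value.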
### 3. **Failover Detection Timeout (Health Check Interval)**
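- The health check timeout controls how long the cluster waits for a response from `sp_server_diagnostics` before treating the instance as unresponsive; `failure_condition_level` controls which conditions trigger a failover.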
- To check the current setting:
```sql
SELECT failure_condition_level, health_check_timeout
FROM sys.availability_groups;
```
- The `health_check_timeout` value is in milliseconds (default is 30000 ms or 30 seconds).
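If the health check timeout turns out to be too tight, it can be changed with `ALTER AVAILABILITY GROUP`. This is a minimal sketch, assuming a hypothetical AG named `MyAG` and an example value of 60 seconds; it would be run on the primary replica.

```sql
-- MyAG and 60000 are placeholders (assumptions), not values from this environment
ALTER AVAILABILITY GROUP [MyAG] SET (HEALTH_CHECK_TIMEOUT = 60000);  -- milliseconds

-- Confirm the new value
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups
WHERE name = N'MyAG';
```

Raising the timeout makes the cluster slower to react to a genuinely hung instance, so it is usually worth finding out why the health check was slow before changing it.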
### 4. **Session Timeout (Replica Communication)**
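- This controls how long a replica waits for a ping response from a connected replica before considering the connection failed. A session timeout on its own disconnects the replicas but does not trigger an automatic failover.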
- Check the session timeout setting with:
```sql
SELECT replica_server_name, session_timeout
FROM sys.availability_replicas;
```
- The default is 10 seconds but can be adjusted based on network conditions.
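The session timeout is set per replica. The following is a sketch only, assuming a hypothetical AG `MyAG`, a hypothetical replica `SQLNODE2`, and an example value of 20 seconds; it would be run on the primary replica.

```sql
-- MyAG, SQLNODE2 and 20 are placeholders (assumptions)
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE2' WITH (SESSION_TIMEOUT = 20);  -- seconds

-- Confirm the value for every replica in the group
SELECT ar.replica_server_name, ar.session_timeout
FROM sys.availability_replicas AS ar
JOIN sys.availability_groups AS ag ON ag.group_id = ar.group_id
WHERE ag.name = N'MyAG';
```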
### 5. **Cluster Quorum and Witness Settings**
- If quorum is lost, the AG can fail over or go offline.
- Check quorum settings using:
```powershell
Get-ClusterQuorum | Format-List
```
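The same quorum information is also visible from inside SQL Server, which can be convenient when you do not have a PowerShell session on the node. A minimal sketch using the cluster DMVs:

```sql
-- Cluster name, quorum model and current quorum state as seen by this instance
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- Each cluster member (nodes and witness), its state and its quorum vote
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;
```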
### 6. **SQL Server Error Logs and Extended Events**
- Check the SQL Server error logs for entries related to lease timeouts or health detection failures.
```sql
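-- Current error log (0), SQL Server log type (1); 'lease' matches both lease-expired and lease-timeout messages
EXEC sp_readerrorlog 0, 1, N'lease';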
```
- Extended Events can capture failover-related events.
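Before relying on Extended Events, it may help to confirm that the built-in `AlwaysOn_health` session exists and is running. A small sketch, assuming the session has its default name:

```sql
-- sys.server_event_sessions lists defined sessions; sys.dm_xe_sessions lists only running ones
SELECT ses.name,
       CASE WHEN run.name IS NULL THEN 'stopped' ELSE 'running' END AS session_state
FROM sys.server_event_sessions AS ses
LEFT JOIN sys.dm_xe_sessions AS run
       ON run.name = ses.name
WHERE ses.name = N'AlwaysOn_health';
```

If the query returns no row, the session is not defined on this instance; if it returns `stopped`, no events were captured while it was off.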
Would you like to analyze any of these settings in more detail?
Author: user
Can I query whether a timeout was exceeded and forced a failover?
Author: ChatGPT
Yes, you can query SQL Server logs and availability group-related DMVs to determine if a timeout-related issue caused the failover. Here are a few ways to investigate:
### 1. **Check the SQL Server Error Logs**
Timeout-related issues (such as lease timeouts or session timeouts) will be logged in the SQL Server error log. Run the following query to check for timeout-related messages:
```sql
EXEC sp_readerrorlog 0, 1, 'timeout';
EXEC sp_readerrorlog 0, 1, 'failover';
EXEC sp_readerrorlog 0, 1, 'availability group';
```
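Look for messages that mention, for example:
- A lease that has expired or was not renewed within the lease timeout period
- The local availability replica changing state (for example from `PRIMARY_NORMAL` to `RESOLVING_NORMAL`)
- A connection timeout between availability replicas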
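To read the matches in time order instead of as separate result sets, the output of `sp_readerrorlog` can be collected into a temporary table. This is a sketch under the assumption that the procedure returns its usual three columns (log date, process info, text); the table and column names are made up for illustration.

```sql
-- Hypothetical temp table; column names are illustrative only
CREATE TABLE #errlog (
    LogDate     datetime,
    ProcessInfo nvarchar(100),
    LogText     nvarchar(max)
);

-- Collect the matches for several search terms from the current error log
INSERT INTO #errlog EXEC sp_readerrorlog 0, 1, N'lease';
INSERT INTO #errlog EXEC sp_readerrorlog 0, 1, N'timeout';
INSERT INTO #errlog EXEC sp_readerrorlog 0, 1, N'availability';

-- One timeline, with duplicate rows (lines matching more than one term) removed
SELECT DISTINCT LogDate, ProcessInfo, LogText
FROM #errlog
ORDER BY LogDate;

DROP TABLE #errlog;
```

Change the first argument from `0` to `1`, `2`, and so on to read older error log files if the failover happened before the last log rollover.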
### 2. **Query the Always On Health Events (Extended Events)**
If Always On Health Extended Events are enabled, you can check for failover reasons, including timeout issues:
```sql
-- event_data is returned as nvarchar(max), so cast it to xml before calling .value()
SELECT x.event_xml.value('(event/@timestamp)[1]', 'datetime2') AS EventTime,
x.event_xml.value('(event/data[@name="error_number"]/value)[1]', 'int') AS ErrorNumber,
x.event_xml.value('(event/data[@name="message"]/value)[1]', 'nvarchar(max)') AS Message
FROM sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL) AS f
CROSS APPLY (SELECT CAST(f.event_data AS xml) AS event_xml) AS x
WHERE x.event_xml.value('(event/data[@name="message"]/value)[1]', 'nvarchar(max)') LIKE '%timeout%'
OR x.event_xml.value('(event/data[@name="message"]/value)[1]', 'nvarchar(max)') LIKE '%failover%';
```
Look for messages indicating lease timeout or session timeout errors.
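The same files also record replica role and state transitions, which gives a timeline of when the failover actually happened. A sketch, assuming the default `AlwaysOn_health` session and its `availability_replica_state_change` event; timestamps are in UTC.

```sql
-- Timeline of replica state changes captured by the AlwaysOn_health session
SELECT x.event_xml.value('(event/@timestamp)[1]', 'datetime2') AS EventTimeUtc,
       x.event_xml.value('(event/data[@name="availability_group_name"]/value)[1]', 'nvarchar(128)') AS AGName,
       x.event_xml.value('(event/data[@name="previous_state"]/text)[1]', 'nvarchar(60)') AS PreviousState,
       x.event_xml.value('(event/data[@name="current_state"]/text)[1]', 'nvarchar(60)') AS CurrentState
FROM sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL) AS f
CROSS APPLY (SELECT CAST(f.event_data AS xml) AS event_xml) AS x
WHERE f.object_name = N'availability_replica_state_change'
ORDER BY EventTimeUtc;
```

Matching the time of a change away from the primary role against the error log entries around the same moment usually shows whether a lease or health check timeout came first.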
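### 3. **Check Current Availability Group and Replica State**
The `sys.dm_hadr_availability_group_states` and `sys.dm_hadr_availability_replica_states` DMVs report the current state of the group and its replicas. They do not keep a failover history, but the last connection error recorded for each replica can hint at a timeout.
```sql
SELECT
ag.name AS AGName,
ar.replica_server_name,
ags.primary_replica,
ars.role_desc,
ars.connected_state_desc,
ars.synchronization_health_desc,
ars.last_connect_error_number,
ars.last_connect_error_description,
ars.last_connect_error_timestamp
FROM sys.availability_groups ag
JOIN sys.dm_hadr_availability_group_states ags ON ag.group_id = ags.group_id
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id
JOIN sys.dm_hadr_availability_replica_states ars ON ar.replica_id = ars.replica_id
ORDER BY ag.name, ar.replica_server_name;
```
This shows the current primary, each replica's role and health, and the most recent connection error. Use the error log or Extended Events to find out when a failover happened.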
### 4. **Windows Failover Cluster Logs**
If the failover was triggered at the cluster level (e.g., node failure detection timeout), check Windows Failover Cluster logs with:
```powershell
# Get-WinEvent returns the newest events first, so -First 20 keeps the 20 most recent matches
Get-WinEvent -LogName "Microsoft-Windows-FailoverClustering/Operational" | Where-Object { $_.Message -like "*timed out*" } | Select-Object TimeCreated, Id, Message -First 20
```
This log can reveal if the cluster service detected a timeout leading to the failover.
### 5. **Query AG Lease Timeout Settings**
To check the current lease timeout setting, read it from the availability group's cluster resource, because it is a cluster resource property rather than a column in `sys.availability_groups` (replace `MyAG` with the name of your AG resource):
```powershell
# LeaseTimeout is reported in milliseconds
Get-ClusterResource -Name "MyAG" | Get-ClusterParameter -Name LeaseTimeout
```
If the value is too low, it might be worth increasing it to avoid unnecessary failovers.
---
Would you like assistance interpreting the results of these queries?