Incident Management

Incident management is the set of standardized processes for quickly detecting and resolving system outages and performance problems. Effective incident management is critical to preserving system reliability and user satisfaction.

Incident Definition and Classification

Incident Types

1. Severity Levels

yaml
# incident-severity-levels.yaml
severity_levels:
  SEV1_CRITICAL:
    description: "Complete system outage"
    response_time: "15 minutes"
    escalation_time: "30 minutes"
    communication: "Every 30 minutes"
    example: "The main application is completely unreachable"
    
  SEV2_HIGH:
    description: "Serious problem in core functionality"
    response_time: "1 hour"
    escalation_time: "2 hours"
    communication: "Every 2 hours"
    example: "The payment system is not working"
    
  SEV3_MEDIUM:
    description: "Performance problems"
    response_time: "4 hours"
    escalation_time: "8 hours"
    communication: "Daily"
    example: "Slow response times"
    
  SEV4_LOW:
    description: "Minor problems"
    response_time: "24 hours"
    escalation_time: "48 hours"
    communication: "Weekly"
    example: "Small UI bugs"
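
The severity table above lends itself to a small lookup type. The sketch below is one possible way to encode the response and escalation times in Java and to check whether an incident has passed its escalation deadline. `SeverityPolicy` and `isEscalationDue` are illustrative names and do not appear in any listing in this article.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical enum mirroring response_time and escalation_time from the YAML above.
enum SeverityPolicy {
    SEV1_CRITICAL(Duration.ofMinutes(15), Duration.ofMinutes(30)),
    SEV2_HIGH(Duration.ofHours(1), Duration.ofHours(2)),
    SEV3_MEDIUM(Duration.ofHours(4), Duration.ofHours(8)),
    SEV4_LOW(Duration.ofHours(24), Duration.ofHours(48));

    private final Duration responseTime;
    private final Duration escalationTime;

    SeverityPolicy(Duration responseTime, Duration escalationTime) {
        this.responseTime = responseTime;
        this.escalationTime = escalationTime;
    }

    Duration responseTime() { return responseTime; }

    Duration escalationTime() { return escalationTime; }

    // True once the incident has been open for at least the escalation time.
    boolean isEscalationDue(Instant openedAt, Instant now) {
        return Duration.between(openedAt, now).compareTo(escalationTime) >= 0;
    }
}
```

Keeping the durations next to the enum constants means a scheduler can call `isEscalationDue` without re-parsing strings such as "30 minutes".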

2. Impact Assessment

java
// Incident Impact Calculator
@Component
public class IncidentImpactCalculator {
    
    public IncidentImpact calculateImpact(IncidentDetails incident) {
        IncidentImpact impact = new IncidentImpact();
        
        // Affected users calculation
        int affectedUsers = calculateAffectedUsers(incident);
        impact.setAffectedUsers(affectedUsers);
        
        // Financial impact
        double revenueImpact = calculateRevenueImpact(incident);
        impact.setRevenueImpact(revenueImpact);
        
        // Business process impact
        BusinessProcessImpact processImpact = assessBusinessProcessImpact(incident);
        impact.setBusinessProcessImpact(processImpact);
        
        // Brand reputation impact
        ReputationImpact reputationImpact = assessReputationImpact(incident);
        impact.setReputationImpact(reputationImpact);
        
        return impact;
    }
    
    private double calculateRevenueImpact(IncidentDetails incident) {
        double avgRevenuePerMinute = getAverageRevenuePerMinute();
        long durationMinutes = incident.getDurationMinutes();
        double impactPercentage = incident.getImpactPercentage();
        
        return avgRevenuePerMinute * durationMinutes * (impactPercentage / 100);
    }
}
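
To make the revenue formula in `calculateRevenueImpact` concrete, here is a standalone version with invented numbers. The class name and every figure are assumptions chosen only to show the arithmetic.

```java
// Standalone copy of the revenue-impact formula; all input figures are made up.
final class RevenueImpactExample {

    // avgRevenuePerMinute * durationMinutes * (impactPercentage / 100)
    static double revenueImpact(double avgRevenuePerMinute, long durationMinutes, double impactPercentage) {
        return avgRevenuePerMinute * durationMinutes * (impactPercentage / 100.0);
    }

    public static void main(String[] args) {
        // 200 currency units per minute, a 45-minute incident, 25% of traffic affected
        System.out.println(revenueImpact(200.0, 45, 25.0)); // prints 2250.0
    }
}
```

The formula treats the affected share as constant for the whole incident, so an incident whose blast radius grows or shrinks over time would need to be split into intervals.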

Incident Response Process

1. Detection Phase

Monitoring and Alerting

yaml
# prometheus-incident-detection.yaml
groups:
- name: incident_detection
  rules:
  - alert: ServiceDown
    expr: up{job="api-service"} == 0
    for: 5m
    labels:
      severity: critical
      incident_type: "service_unavailable"
    annotations:
      summary: "Service {{ $labels.instance }} is down"
      description: "Service has been down for more than 5 minutes"
      runbook_url: "https://wiki.company.com/runbooks/service-down"
      
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
    for: 2m
    labels:
      severity: high
      incident_type: "high_error_rate"
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }}"

Automated Detection System

java
// Anomaly Detection Service
@Service
public class AnomalyDetectionService {
    
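    @Autowired
    private MetricsService metricsService;
    
    @Autowired
    private IncidentService incidentService;
    
    // Sends alerts for newly created incidents
    @Autowired
    private NotificationService notificationService;
    
    @Scheduled(fixedRate = 30000) // Check every 30 seconds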
    public void detectAnomalies() {
        List<ServiceMetrics> metrics = metricsService.getCurrentMetrics();
        
        for (ServiceMetrics metric : metrics) {
            AnomalyDetectionResult result = analyzeMetric(metric);
            
            if (result.isAnomalyDetected()) {
                Incident incident = createIncidentFromAnomaly(result);
                incidentService.createIncident(incident);
                notificationService.sendAlert(incident);
            }
        }
    }
    
    private AnomalyDetectionResult analyzeMetric(ServiceMetrics metric) {
        // Statistical analysis, machine learning models
        double threshold = calculateDynamicThreshold(metric);
        boolean isAnomaly = metric.getValue() > threshold;
        
        return AnomalyDetectionResult.builder()
            .isAnomalyDetected(isAnomaly)
            .confidence(calculateConfidence(metric))
            .metricName(metric.getName())
            .actualValue(metric.getValue())
            .expectedValue(threshold)
            .build();
    }
}
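
The service above calls `calculateDynamicThreshold` without showing it. One common baseline, sketched here under the assumption that a window of recent samples is available, is the mean plus a multiple of the standard deviation. `DynamicThreshold`, its parameters, and the choice of population standard deviation are illustrative and not taken from the original listing.

```java
import java.util.List;

// Hypothetical helper: threshold = mean + k * standard deviation of recent samples.
final class DynamicThreshold {

    static double calculate(List<Double> recentValues, double k) {
        if (recentValues.isEmpty()) {
            throw new IllegalArgumentException("need at least one sample");
        }
        double mean = recentValues.stream()
            .mapToDouble(Double::doubleValue)
            .average()
            .orElseThrow();
        // Population variance: average squared distance from the mean
        double variance = recentValues.stream()
            .mapToDouble(v -> (v - mean) * (v - mean))
            .average()
            .orElseThrow();
        return mean + k * Math.sqrt(variance);
    }
}
```

With samples 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so `k = 3` yields a threshold of 11. A larger `k` trades fewer false alarms for slower detection.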

2. Response Phase

Incident Commander System

java
// Incident Commander Selection
@Component
public class IncidentCommanderSelector {
    
    public IncidentCommander selectCommander(Incident incident) {
        Severity severity = incident.getSeverity();
        String serviceArea = incident.getServiceArea();
        
        // Select the commander based on severity
        switch (severity) {
            case SEV1:
                return selectSeniorCommander(serviceArea);
            case SEV2:
                return selectExperiencedCommander(serviceArea);
            case SEV3:
            case SEV4:
                return selectAvailableCommander(serviceArea);
            default:
                throw new IllegalArgumentException("Unknown severity: " + severity);
        }
    }
    
    private IncidentCommander selectSeniorCommander(String serviceArea) {
        return commanderRepository.findSeniorCommandersByArea(serviceArea)
            .stream()
            .filter(IncidentCommander::isAvailable)
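            // Assumes ExperienceLevel is an enum declared in ascending order; enums cannot be compared with >=
            .filter(commander -> commander.getExperienceLevel().compareTo(ExperienceLevel.SENIOR) >= 0)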
            .findFirst()
            .orElseThrow(() -> new NoAvailableCommanderException("No senior commander available"));
    }
}

Communication Bridge

java
// Incident Communication Service
@Service
public class IncidentCommunicationService {
    
    public void establishCommunicationBridge(Incident incident) {
        // Create the Slack channel
        String channelName = createIncidentChannel(incident);
        
        // Set up the war room
        WarRoom warRoom = warRoomService.setupWarRoom(incident);
        
        // Stakeholder notifications
        notifyStakeholders(incident);
        
        // Status page update
        updateStatusPage(incident);
    }
    
    private void notifyStakeholders(Incident incident) {
        List<Stakeholder> stakeholders = getStakeholdersForIncident(incident);
        
        for (Stakeholder stakeholder : stakeholders) {
            NotificationPreference preference = stakeholder.getNotificationPreference();
            
            switch (preference.getChannel()) {
                case SMS:
                    smsService.sendIncidentNotification(stakeholder, incident);
                    break;
                case EMAIL:
                    emailService.sendIncidentNotification(stakeholder, incident);
                    break;
                case SLACK:
                    slackService.sendIncidentNotification(stakeholder, incident);
                    break;
                case PHONE:
                    phoneService.initiateIncidentCall(stakeholder, incident);
                    break;
            }
        }
    }
}

3. Investigation Phase

Root Cause Analysis

java
// RCA Framework
@Component
public class RootCauseAnalyzer {
    
    public RootCauseAnalysis performRCA(Incident incident) {
        RootCauseAnalysis rca = new RootCauseAnalysis();
        
        // Timeline analysis
        Timeline timeline = buildTimeline(incident);
        rca.setTimeline(timeline);
        
        // 5 Whys analysis
        FiveWhysAnalysis fiveWhys = performFiveWhysAnalysis(incident);
        rca.setFiveWhysAnalysis(fiveWhys);
        
        // System architecture analysis
        ArchitectureAnalysis archAnalysis = analyzeSystemArchitecture(incident);
        rca.setArchitectureAnalysis(archAnalysis);
        
        // Contributing factors
        List<ContributingFactor> factors = identifyContributingFactors(incident);
        rca.setContributingFactors(factors);
        
        // Prevention measures
        List<PreventionMeasure> measures = identifyPreventionMeasures(rca);
        rca.setPreventionMeasures(measures);
        
        return rca;
    }
    
    private FiveWhysAnalysis performFiveWhysAnalysis(Incident incident) {
        FiveWhysAnalysis analysis = new FiveWhysAnalysis();
        String currentProblem = incident.getDescription();
        
        for (int i = 1; i <= 5; i++) {
            String why = analyzeWhy(currentProblem, i);
            analysis.addWhy(i, why);
            currentProblem = why;
        }
        
        return analysis;
    }
}

Ishikawa Diagram Implementation

java
// Fishbone Diagram for RCA
public class IshikawaDiagramBuilder {
    
    public IshikawaDiagram buildDiagram(Incident incident) {
        IshikawaDiagram diagram = new IshikawaDiagram(incident.getDescription());
        
        // Main categories
        diagram.addCategory(CategoryType.PEOPLE, analyzePeopleFactors(incident));
        diagram.addCategory(CategoryType.PROCESS, analyzeProcessFactors(incident));
        diagram.addCategory(CategoryType.TECHNOLOGY, analyzeTechnologyFactors(incident));
        diagram.addCategory(CategoryType.ENVIRONMENT, analyzeEnvironmentFactors(incident));
        diagram.addCategory(CategoryType.MATERIALS, analyzeMaterialFactors(incident));
        diagram.addCategory(CategoryType.METHODS, analyzeMethodFactors(incident));
        
        return diagram;
    }
    
    private List<CauseFactor> analyzeTechnologyFactors(Incident incident) {
        List<CauseFactor> factors = new ArrayList<>();
        
        // Hardware failures
        if (hasHardwareRelatedLogs(incident)) {
            factors.add(new CauseFactor("Hardware Failure", "Server hardware malfunction"));
        }
        
        // Software bugs
        if (hasSoftwareBugIndicators(incident)) {
            factors.add(new CauseFactor("Software Bug", "Application code defect"));
        }
        
        // Configuration issues
        if (hasConfigurationChanges(incident)) {
            factors.add(new CauseFactor("Configuration Error", "Incorrect system configuration"));
        }
        
        return factors;
    }
}

4. Resolution Phase

Automated Remediation

java
// Automated Incident Remediation
@Component
public class AutomatedRemediationService {
    
    @EventListener
    public void handleIncident(IncidentCreatedEvent event) {
        Incident incident = event.getIncident();
        
        // Determine the automated remediation strategy
        RemediationStrategy strategy = determineRemediationStrategy(incident);
        
        if (strategy.isAutomationApplicable()) {
            executeAutomatedRemediation(incident, strategy);
        } else {
            escalateToHuman(incident);
        }
    }
    
    private void executeAutomatedRemediation(Incident incident, RemediationStrategy strategy) {
        switch (strategy.getType()) {
            case RESTART_SERVICE:
                restartService(incident.getAffectedService());
                break;
                
            case SCALE_UP:
                scaleUpService(incident.getAffectedService(), strategy.getScaleFactor());
                break;
                
            case FAILOVER:
                executeFailover(incident.getAffectedService());
                break;
                
            case ROLLBACK:
                rollbackDeployment(incident.getAffectedService());
                break;
                
            case TRAFFIC_REDIRECT:
                redirectTraffic(incident.getAffectedService(), strategy.getTargetService());
                break;
        }
        
        // Verify the result after remediation
        scheduleVerification(incident, strategy);
    }
}
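
The listing ends by scheduling a verification step but does not show what verification looks like. A minimal sketch, assuming a health check that can be expressed as a boolean probe, is a bounded polling loop. `RemediationVerifier` and its parameters are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Hypothetical verifier: polls a health check a bounded number of times.
final class RemediationVerifier {

    // Returns true as soon as the check passes, false if every attempt fails
    // or the thread is interrupted while waiting between attempts.
    static boolean verify(BooleanSupplier healthCheck, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (healthCheck.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }
}
```

When `verify` returns false, a reasonable next step is the same `escalateToHuman` path used when automation is not applicable.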

Manual Resolution Procedures

bash
#!/bin/bash
# incident-response-playbook.sh

# Service restart procedure
restart_service() {
    local service_name="$1"
    echo "Restarting service: $service_name"
    
    # Graceful shutdown
    kubectl scale deployment "$service_name" --replicas=0
    sleep 30
    
    # Restart with new replicas
    kubectl scale deployment "$service_name" --replicas=3
    
    # Wait for healthy state
    kubectl rollout status "deployment/$service_name" --timeout=300s
    
    # Verify service health
    verify_service_health "$service_name"
}

# Database connection issue resolution
resolve_db_connection_issue() {
    echo "Resolving database connection issues..."
    
    # Check connection pool status
    check_connection_pool_status
    
    # Reset connection pool
    reset_connection_pool
    
    # Verify database connectivity
    verify_database_connectivity
    
    # Update monitoring dashboards
    update_incident_dashboard "Database connection restored"
}

# Network issue resolution
resolve_network_issue() {
    echo "Diagnosing network issues..."
    
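    # Check network connectivity and remember whether any check failed
    local failed=0
    ping -c 5 database.internal || failed=1
    # nc -z only probes the port and exits, unlike an interactive telnet session
    nc -z -w 5 database.internal 5432 || failed=1
    
    # Check DNS resolution
    nslookup database.internal || failed=1
    
    # Restart network services if any check failed
    if [ "$failed" -ne 0 ]; then
        sudo systemctl restart networking
    fi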
}

Incident Tools and Integration

1. PagerDuty Integration

java
// PagerDuty Integration Service
@Service
public class PagerDutyService {
    
    @Value("${pagerduty.routing.key}")
    private String routingKey;
    
    public void createIncident(Incident incident) {
        PagerDutyEvent event = PagerDutyEvent.builder()
            .routingKey(routingKey)
            .eventAction(EventAction.TRIGGER)
            .dedupeKey(incident.getId())
            .payload(createPayload(incident))
            .build();
            
        pagerDutyClient.sendEvent(event);
    }
    
    private PagerDutyPayload createPayload(Incident incident) {
        return PagerDutyPayload.builder()
            .summary(incident.getTitle())
            .source(incident.getSource())
            .severity(mapSeverity(incident.getSeverity()))
            .timestamp(incident.getCreatedAt())
            .customDetails(createCustomDetails(incident))
            .build();
    }
    
    public void resolveIncident(String incidentId) {
        PagerDutyEvent event = PagerDutyEvent.builder()
            .routingKey(routingKey)
            .eventAction(EventAction.RESOLVE)
            .dedupeKey(incidentId)
            .build();
            
        pagerDutyClient.sendEvent(event);
    }
}
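
The `mapSeverity` helper called in `createPayload` is not shown. A minimal sketch follows, assuming the SEV1 to SEV4 levels used earlier and the four severity values that the PagerDuty Events API v2 accepts (`critical`, `error`, `warning`, `info`). The particular mapping is a judgment call rather than a PagerDuty requirement, and `PagerDutySeverityMapper` is a hypothetical class.

```java
// Hypothetical mapping from internal SEV levels to PagerDuty Events API v2 severities.
final class PagerDutySeverityMapper {

    enum Severity { SEV1, SEV2, SEV3, SEV4 }

    static String mapSeverity(Severity severity) {
        switch (severity) {
            case SEV1: return "critical";
            case SEV2: return "error";
            case SEV3: return "warning";
            case SEV4: return "info";
            default: throw new IllegalArgumentException("Unknown severity: " + severity);
        }
    }
}
```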

2. Jira Integration

java
// Jira Incident Tracking
@Service
public class JiraIncidentService {
    
    public JiraIssue createIncidentTicket(Incident incident) {
        JiraIssueRequest request = JiraIssueRequest.builder()
            .projectKey("INC")
            .issueType("Incident")
            .summary(incident.getTitle())
            .description(buildIncidentDescription(incident))
            .priority(mapPriority(incident.getSeverity()))
            .labels(Arrays.asList("incident", incident.getServiceArea()))
            .customFields(buildCustomFields(incident))
            .build();
            
        return jiraClient.createIssue(request);
    }
    
    private Map<String, Object> buildCustomFields(Incident incident) {
        Map<String, Object> customFields = new HashMap<>();
        customFields.put("Affected Service", incident.getAffectedService());
        customFields.put("Impact Level", incident.getImpactLevel());
        customFields.put("Customer Impact", incident.getCustomerImpact());
        customFields.put("Incident Commander", incident.getCommanderEmail());
        return customFields;
    }
}

3. Slack Integration

java
// Slack Incident Bot
@Component
public class SlackIncidentBot {
    
    public void createIncidentChannel(Incident incident) {
        String channelName = generateChannelName(incident);
        
        SlackChannel channel = slackClient.createChannel(
            CreateChannelRequest.builder()
                .name(channelName)
                .topic(String.format("Incident: %s", incident.getTitle()))
                .purpose("Incident response and coordination")
                .build()
        );
        
        // Invite relevant team members
        inviteTeamMembers(channel, incident);
        
        // Post incident summary
        postIncidentSummary(channel, incident);
        
        // Pin important information
        pinImportantInfo(channel, incident);
    }
    
    private void postIncidentSummary(SlackChannel channel, Incident incident) {
        SlackMessage message = SlackMessage.builder()
            .channel(channel.getId())
            .blocks(buildIncidentSummaryBlocks(incident))
            .build();
            
        slackClient.postMessage(message);
    }
    
    private List<Block> buildIncidentSummaryBlocks(Incident incident) {
        return Arrays.asList(
            SectionBlock.builder()
                .text(markdownText(String.format("*Incident: %s*", incident.getTitle())))
                .fields(Arrays.asList(
                    markdownText(String.format("*Severity:* %s", incident.getSeverity())),
                    markdownText(String.format("*Commander:* %s", incident.getCommanderName())),
                    markdownText(String.format("*Affected Service:* %s", incident.getAffectedService())),
                    markdownText(String.format("*Impact:* %s", incident.getImpactDescription()))
                ))
                .build(),
            ActionsBlock.builder()
                .elements(Arrays.asList(
                    ButtonElement.builder()
                        .text(plainText("Update Status"))
                        .actionId("update_incident_status")
                        .value(incident.getId())
                        .build(),
                    ButtonElement.builder()
                        .text(plainText("Escalate"))
                        .actionId("escalate_incident")
                        .value(incident.getId())
                        .style("danger")
                        .build()
                ))
                .build()
        );
    }
}
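
`generateChannelName` is left undefined above. Slack channel names may only contain lowercase letters, digits, hyphens, and underscores, and are limited to 80 characters, so one option is to slugify the incident ID and title. The `inc-` prefix and the `IncidentChannelNames` class are assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical helper that turns an incident ID and title into a valid Slack channel name.
final class IncidentChannelNames {

    private static final int MAX_LENGTH = 80;

    static String generate(String incidentId, String title) {
        String raw = "inc-" + incidentId + "-" + title;
        String slug = raw.toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9_-]+", "-")  // replace runs of disallowed characters
            .replaceAll("-{2,}", "-");        // collapse repeated hyphens
        if (slug.length() > MAX_LENGTH) {
            slug = slug.substring(0, MAX_LENGTH);
        }
        // Trim hyphens left at either end by the replacements or the truncation
        return slug.replaceAll("^-+|-+$", "");
    }
}
```

For example, `generate("1234", "Payment API: 5xx errors!")` returns `inc-1234-payment-api-5xx-errors`.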

Incident Metrics and Reporting

1. MTTR (Mean Time To Recovery)

java
// MTTR Calculator
@Component
public class MTTRCalculator {
    
    public MTTRMetrics calculateMTTR(LocalDate startDate, LocalDate endDate) {
        List<Incident> incidents = incidentRepository.findByDateRange(startDate, endDate);
        
        Map<Severity, List<Duration>> mttrBySeverity = incidents.stream()
            .filter(incident -> incident.getStatus() == IncidentStatus.RESOLVED)
            .collect(Collectors.groupingBy(
                Incident::getSeverity,
                Collectors.mapping(
                    incident -> Duration.between(incident.getCreatedAt(), incident.getResolvedAt()),
                    Collectors.toList()
                )
            ));
        
        MTTRMetrics metrics = new MTTRMetrics();
        
        for (Map.Entry<Severity, List<Duration>> entry : mttrBySeverity.entrySet()) {
            Severity severity = entry.getKey();
            List<Duration> durations = entry.getValue();
            
            Duration averageMTTR = calculateAverage(durations);
            Duration medianMTTR = calculateMedian(durations);
            Duration p95MTTR = calculatePercentile(durations, 95);
            
            metrics.addSeverityMTTR(severity, SeverityMTTR.builder()
                .average(averageMTTR)
                .median(medianMTTR)
                .p95(p95MTTR)
                .count(durations.size())
                .build());
        }
        
        return metrics;
    }
}
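
The calculator relies on `calculateAverage`, `calculateMedian`, and `calculatePercentile`, none of which are shown. The sketch below is one way to implement them for `Duration` values, using the nearest-rank definition of a percentile. `DurationStats` is a hypothetical class, and other percentile definitions (for example with interpolation) would give slightly different results.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical statistics helpers for resolution times.
final class DurationStats {

    static Duration average(List<Duration> durations) {
        requireNonEmpty(durations);
        Duration total = Duration.ZERO;
        for (Duration d : durations) {
            total = total.plus(d);
        }
        return total.dividedBy(durations.size());
    }

    static Duration median(List<Duration> durations) {
        List<Duration> sorted = sortedCopy(durations);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        // Even count: average the two middle values
        return sorted.get(n / 2 - 1).plus(sorted.get(n / 2)).dividedBy(2);
    }

    // Nearest rank: the smallest value with at least p percent of samples at or below it.
    static Duration percentile(List<Duration> durations, int p) {
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be between 1 and 100");
        }
        List<Duration> sorted = sortedCopy(durations);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(rank - 1);
    }

    private static List<Duration> sortedCopy(List<Duration> durations) {
        requireNonEmpty(durations);
        List<Duration> sorted = new ArrayList<>(durations);
        Collections.sort(sorted);
        return sorted;
    }

    private static void requireNonEmpty(List<Duration> durations) {
        if (durations.isEmpty()) {
            throw new IllegalArgumentException("need at least one duration");
        }
    }
}
```

Reporting the median and p95 alongside the average matters because a single very long incident can dominate the mean.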

2. Incident Trend Analysis

java
// Incident Analytics Service
@Service
public class IncidentAnalyticsService {
    
    public IncidentTrendReport generateTrendReport(AnalyticsPeriod period) {
        IncidentTrendReport report = new IncidentTrendReport();
        
        // Incident frequency trends
        Map<LocalDate, Long> dailyIncidentCount = calculateDailyIncidentCount(period);
        report.setIncidentFrequencyTrend(dailyIncidentCount);
        
        // Top incident categories
        Map<String, Long> topCategories = getTopIncidentCategories(period);
        report.setTopIncidentCategories(topCategories);
        
        // Most affected services
        Map<String, Long> affectedServices = getMostAffectedServices(period);
        report.setMostAffectedServices(affectedServices);
        
        // Resolution time trends
        Map<LocalDate, Duration> avgResolutionTime = calculateAvgResolutionTime(period);
        report.setResolutionTimeTrend(avgResolutionTime);
        
        // Recurring incidents
        List<RecurringIncidentPattern> patterns = identifyRecurringPatterns(period);
        report.setRecurringPatterns(patterns);
        
        return report;
    }
    
    private List<RecurringIncidentPattern> identifyRecurringPatterns(AnalyticsPeriod period) {
        // Machine learning based pattern recognition
        List<Incident> incidents = getIncidentsForPeriod(period);
        
        return patternAnalysisService.findPatterns(incidents)
            .stream()
            .filter(pattern -> pattern.getOccurrenceCount() >= 3)
            .sorted(Comparator.comparing(RecurringIncidentPattern::getOccurrenceCount).reversed())
            .collect(Collectors.toList());
    }
}
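
The listing delegates to a `patternAnalysisService` described as machine-learning based. As a much simpler baseline, and explicitly not the approach named above, recurring incidents can be found by counting how often the same service and category pair appears. Everything in this sketch (`RecurringPatternSketch`, `IncidentKey`, `RecurringPattern`) is hypothetical.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical non-ML baseline: count incidents per (service, category) pair.
final class RecurringPatternSketch {

    record IncidentKey(String service, String category) {}

    record RecurringPattern(IncidentKey key, long occurrenceCount) {}

    static List<RecurringPattern> findPatterns(List<IncidentKey> incidents, long minOccurrences) {
        Map<IncidentKey, Long> counts = incidents.stream()
            .collect(Collectors.groupingBy(key -> key, Collectors.counting()));

        return counts.entrySet().stream()
            .filter(entry -> entry.getValue() >= minOccurrences)
            .map(entry -> new RecurringPattern(entry.getKey(), entry.getValue()))
            // Most frequent patterns first
            .sorted(Comparator.comparingLong(RecurringPattern::occurrenceCount).reversed())
            .collect(Collectors.toList());
    }
}
```

Passing `minOccurrences = 3` matches the filter used in `identifyRecurringPatterns`.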

Post-Incident Activities

1. Post-Mortem Process

java
// Post-Mortem Service
@Service
public class PostMortemService {
    
    public PostMortem createPostMortem(Incident incident) {
        PostMortem postMortem = PostMortem.builder()
            .incidentId(incident.getId())
            .title(String.format("Post-Mortem: %s", incident.getTitle()))
            .createdBy(incident.getCommanderEmail())
            .status(PostMortemStatus.DRAFT)
            .build();
        
        // Template sections
        postMortem.addSection(createSummarySection(incident));
        postMortem.addSection(createTimelineSection(incident));
        postMortem.addSection(createRootCauseSection(incident));
        postMortem.addSection(createImpactSection(incident));
        postMortem.addSection(createLessonsLearnedSection());
        postMortem.addSection(createActionItemsSection());
        
        return postMortemRepository.save(postMortem);
    }
    
    private PostMortemSection createActionItemsSection() {
        return PostMortemSection.builder()
            .title("Action Items")
            .content("""
                ## Action Items
                
                | Action | Owner | Priority | Due Date | Status |
                |--------|-------|----------|----------|--------|
                | [Describe action] | [Team/Person] | [High/Medium/Low] | [Date] | [Open/In Progress/Done] |
                
                ### Prevention Measures
                - [ ] [Specific prevention measure]
                - [ ] [Another prevention measure]
                
                ### Process Improvements
                - [ ] [Process improvement item]
                - [ ] [Another process improvement]
                
                ### Technical Improvements
                - [ ] [Technical improvement item]
                - [ ] [Another technical improvement]
            """)
            .build();
    }
}

2. Action Item Tracking

java
// Action Item Management
@Entity
public class ActionItem {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    
    private String incidentId;
    private String title;
    private String description;
    private String assignee;
    private ActionItemPriority priority;
    private LocalDate dueDate;
    private ActionItemStatus status;
    private String implementationNotes;
    
    // getters, setters, constructors
}

@Service
public class ActionItemService {
    
    @Scheduled(cron = "0 0 9 * * MON") // Every Monday at 9:00; Spring cron expressions need six fields, starting with seconds
    public void sendWeeklyActionItemReport() {
        List<ActionItem> overdueItems = findOverdueActionItems();
        List<ActionItem> dueSoonItems = findItemsDueSoon(7); // Due within 7 days
        
        ActionItemReport report = ActionItemReport.builder()
            .overdueItems(overdueItems)
            .dueSoonItems(dueSoonItems)
            .completionRate(calculateCompletionRate())
            .build();
        
        emailService.sendActionItemReport(report);
        slackService.postActionItemSummary(report);
    }
    
    public void trackActionItemProgress(String incidentId) {
        List<ActionItem> actionItems = actionItemRepository.findByIncidentId(incidentId);
        
        ActionItemProgress progress = ActionItemProgress.builder()
            .totalItems(actionItems.size())
            .completedItems(countCompletedItems(actionItems))
            .inProgressItems(countInProgressItems(actionItems))
            .notStartedItems(countNotStartedItems(actionItems))
            .build();
        
        // Publish the progress tracking event
        eventPublisher.publishEvent(new ActionItemProgressEvent(incidentId, progress));
    }
}
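
`calculateCompletionRate` and `findOverdueActionItems` are referenced above without bodies. A minimal sketch of both follows, using an in-memory list instead of a repository. The `Item` record and `Status` enum are simplified stand-ins for the `ActionItem` entity and are assumptions, as is the rule that an item is overdue only when its due date is strictly before today.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory versions of the action item calculations.
final class ActionItemStats {

    enum Status { NOT_STARTED, IN_PROGRESS, DONE }

    record Item(String title, LocalDate dueDate, Status status) {}

    // Share of items marked DONE, from 0.0 to 1.0; an empty list counts as 0.0.
    static double completionRate(List<Item> items) {
        if (items.isEmpty()) {
            return 0.0;
        }
        long done = items.stream().filter(item -> item.status() == Status.DONE).count();
        return (double) done / items.size();
    }

    // Items that are not DONE and whose due date is strictly before today.
    static List<Item> overdue(List<Item> items, LocalDate today) {
        return items.stream()
            .filter(item -> item.status() != Status.DONE && item.dueDate().isBefore(today))
            .collect(Collectors.toList());
    }
}
```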

Best Practices

1. Preparation

  • Keep runbooks up to date: Detailed troubleshooting guides for every service
  • On-call rotation plan: A fair and sustainable on-call schedule
  • Cross-training: Make sure team members are knowledgeable in more than one area

2. Response

  • Fast acknowledgment: Acknowledge alerts as quickly as possible
  • Clear communication: Regular, clear updates to stakeholders
  • Focus on resolution: A solution-oriented approach instead of a blame culture

3. Learning

  • Blameless post-mortems: Analysis focused on systems, not individuals
  • Action item follow-up: Track and implement the agreed actions
  • Knowledge sharing: Share lessons learned across the whole organization

4. Continuous Improvement

  • Incident metrics tracking: Track MTTR, frequency, and impact metrics
  • Process refinement: Review processes regularly
  • Tool optimization: Improve the effectiveness of the tools in use

Effective incident management does more than resolve problems quickly; it also increases the organization's capacity to learn and helps prevent future problems.

Created by Eren Demir.