2 Fast 2 MCM · bacchanalian madness

2 Fast 2 MCM

These visual representations help explain the complex workflows within the Machine Controller Manager (MCM).

Machine Controller Manager Architecture

  • The system consists of three main controllers working in concert
  • Each controller handles specific aspects of machine lifecycle management
  • Interfaces with both cloud providers and Kubernetes clusters
  • Manages the full lifecycle of machines from creation to deletion

Let’s start with an overview of the main components and their interactions:

stateDiagram-v2
    direction TB
    
    state "Machine Controller Manager" as MCM {
        state "Machine Controller" as MC
        state "Safety Controller" as SC
        state "MCM Controller" as MCMC
        
        [*] --> MC
        [*] --> SC
        [*] --> MCMC
    }
    
    state "Cloud Provider" as CP {
        VMs
        API
    }
    
    state "Kubernetes Cluster" as K8S {
        state "Control Plane" as CP_K8S {
            API_Server
            etcd
        }
        
        state "Node Components" as NC {
            kubelet
            container_runtime
        }
    }
    
    MCM --> CP : Manages VMs
    MCM --> K8S : Manages Nodes
    
    note right of MCM
        Handles:
        - Machine lifecycle
        - Safety checks
        - Deployments/Sets
    end note

Machine Controller Core Flows

Now, let’s dive into the Machine Controller’s core reconciliation flows for different resources. It handles three main types of reconciliation:

  • Secret Reconciliation: Manages secrets referenced by MachineClasses
  • MachineClass Reconciliation: Handles machine class lifecycle
  • Machine Reconciliation: Core machine lifecycle management

---
  config:
    layout: elk
---
stateDiagram-v2
    state "Machine Controller" as MC {
        state "Secret Reconciliation" as SR {
            [*] --> FetchSecret
            FetchSecret --> GetMachineClass
            GetMachineClass --> CheckReferences
            CheckReferences --> FinalizerAdd : Has References
            CheckReferences --> FinalizerRemove : No References
            FinalizerAdd --> [*]
            FinalizerRemove --> [*]
        }

        state "MachineClass Reconciliation" as MCR {
            [*] --> FetchClass
            FetchClass --> GetMachines
            GetMachines --> CheckMachines
            CheckMachines --> AddFinalizer : Has Machines
            CheckMachines --> RemoveFinalizer : No Machines
            AddFinalizer --> EnqueueMachines
            EnqueueMachines --> [*]
            RemoveFinalizer --> [*]
        }

        state "Machine Reconciliation" as MR {
            [*] --> FetchMachine
            FetchMachine --> CheckFrozen
            
            CheckFrozen --> ValidateMachine : Not Frozen
            CheckFrozen --> RetryLater : Frozen
            
            ValidateMachine --> ValidateMachineClass
            ValidateMachineClass --> DeletionTimestamp

            DeletionTimestamp --> DeletionFlow : Deletion Requested
            DeletionTimestamp --> AddFinalizers : No Deletion
            
            AddFinalizers --> CheckPhase&NodeLabel
            
            CheckPhase&NodeLabel --> ReconcileHealth : Has Node & Non-empty phase
            CheckPhase&NodeLabel --> CreationFlow : No Node or<br/>CrashLoopBackOff<br/>or EmptyPhase
            
            ReconcileHealth --> SyncNodeName
            SyncNodeName --> SyncTemplates
            SyncTemplates --> [*]
            
            CreationFlow --> [*]
            DeletionFlow --> [*]
        }
    }
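
The branching in the Machine Reconciliation diagram can be read as a single routing decision. Here is a minimal sketch of that decision in Go. The `machineState` type, its fields, and `nextFlow` are hypothetical names made up for illustration, not MCM's actual types, and the two validation steps are left out.

```go
package main

import "fmt"

// machineState is a hypothetical, flattened view of a Machine object.
type machineState struct {
	Frozen            bool   // machine controller is frozen by the safety controller
	DeletionRequested bool   // deletion timestamp is set
	HasNode           bool   // machine is already associated with a node
	Phase             string // "", "Pending", "CrashLoopBackOff", "Running", ...
}

// nextFlow returns the branch of the Machine Reconciliation diagram
// that a machine in the given state would take. Validation is omitted.
func nextFlow(m machineState) string {
	switch {
	case m.Frozen:
		return "RetryLater"
	case m.DeletionRequested:
		return "DeletionFlow"
	case !m.HasNode || m.Phase == "" || m.Phase == "CrashLoopBackOff":
		return "CreationFlow"
	default:
		return "ReconcileHealth"
	}
}

func main() {
	fmt.Println(nextFlow(machineState{}))                                 // brand-new machine
	fmt.Println(nextFlow(machineState{HasNode: true, Phase: "Running"})) // established machine
	fmt.Println(nextFlow(machineState{DeletionRequested: true}))         // machine being deleted
}
```

The order of the cases matters: the frozen check comes first and the deletion check comes before the creation check, mirroring the order of the states in the diagram.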

Machine Creation

Machine Creation Flow:

  • Complex process involving multiple status checks
  • Handles initialization and error cases
  • Includes node verification and cleanup of stale resources
  • Multiple retry mechanisms for resilience

---
  config:
    look: handDrawn
---
stateDiagram-v2
    classDef imp font-weight:bold,stroke-width:5px;
        state "From <u>CreateResponse</u>: Assign Node Name & ProviderID" as ANPIDCMR
        state "From <u>GetMachineStatusResponse</u>: Assign Node Name & ProviderID" as ANPIDGMS
        state "From <u>GetMachineStatusResponse</u>: Assign Node Name & ProviderID" as ANPIDGMSR
        state "Assign Node Name<br/>from Machine label" as ANML
        state "Phase: <i>Pending</i><br/>State: <i>Processing</i><br/>OpType: Create" as CPPP
        state "State: <i>Failed</i><br/>OpType: <i>Create</i>" as SFFF
        
        [*] --> AddBootToken&MachineName
        AddBootToken&MachineName --> GetMachineStatus:::imp
        
        GetMachineStatus:::imp --> ANPIDGMS : Success
        ANPIDGMS --> UpdateAnnotationsLabels
        UpdateAnnotationsLabels --> CPPP : Phase <i>""(empty) or CrashLoopBackOff</i>
        CPPP --> StatusUpdate
        StatusUpdate --> [*]
        
        GetMachineStatus:::imp --> CheckNodeExists : NotFound or Unimplemented
        CheckNodeExists --> ANML : Node Exists
        ANML --> UpdateAnnotationsLabels
        
        CheckNodeExists --> CreateMachine:::imp : No Node
        CreateMachine:::imp --> ANPIDCMR : Successful creation
        CreateMachine:::imp --> CheckFailurePhase : Creation Error
        ANPIDCMR --> SetUninitialized : Node name is Machine Name
        SetUninitialized --> UpdateAnnotationLabel
        UpdateAnnotationLabel --> InitializeMachine:::imp
        InitializeMachine:::imp --> [*]
        
        ANPIDCMR --> DeleteMachine:::imp : <u>Stale Node</u><br/>NodeName is not MachineName
        DeleteMachine:::imp --> SFFF: "VM using old node obj"
        
        GetMachineStatus:::imp --> ANPIDGMSR : Uninitialized
        ANPIDGMSR --> SetUninitialized
        
        GetMachineStatus:::imp --> CheckFailurePhase : Other Errors
        CheckFailurePhase --> Failed : Timeout
        CheckFailurePhase --> CrashLoopBackOff : Not timed out
        Failed --> SFFF
        CrashLoopBackOff --> SFFF
        
        SFFF --> [*]
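
To make the fan-out after `GetMachineStatus` easier to follow, here is a rough Go sketch of the two decisions in the creation flow: where to go after the status call, and which phase a failed creation ends up in. The `statusResult` values and both function names are assumptions for illustration. The diagram also does not say what the timeout is measured against, so the sketch just takes an elapsed duration.

```go
package main

import (
	"fmt"
	"time"
)

// statusResult is a hypothetical classification of the GetMachineStatus outcome.
type statusResult int

const (
	statusOK statusResult = iota
	statusNotFound
	statusUnimplemented
	statusUninitialized
	statusOtherError
)

// stepAfterGetMachineStatus returns the next step in the creation diagram.
func stepAfterGetMachineStatus(r statusResult, nodeExists bool) string {
	switch r {
	case statusOK:
		return "UpdateAnnotationsLabels"
	case statusUninitialized:
		return "SetUninitialized"
	case statusNotFound, statusUnimplemented:
		if nodeExists {
			return "AssignNodeNameFromMachineLabel"
		}
		return "CreateMachine"
	default:
		return "CheckFailurePhase"
	}
}

// failurePhase picks the phase after a creation error:
// Failed once the timeout has passed, CrashLoopBackOff otherwise.
func failurePhase(elapsed, timeout time.Duration) string {
	if elapsed > timeout {
		return "Failed"
	}
	return "CrashLoopBackOff"
}

func main() {
	fmt.Println(stepAfterGetMachineStatus(statusNotFound, false))
	fmt.Println(stepAfterGetMachineStatus(statusOtherError, false))
	fmt.Println(failurePhase(5*time.Minute, 20*time.Minute))
}
```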

Health Check

---
  config:
    layout: elk
---
stateDiagram-v2
    state "Health Reconciliation" as HR {
        state "Phase: <i>Unknown</i><br/>State: <i>Processing</i><br/>LastOp: <i>HealthChk</i>" as PUSP
        state "Phase: <i>Failed</i><br/>State: <i>Failed</i>" as PFSF
        state "LastOp State: Successful<br/>Phase: Running" as SSPR

        [*] --> GetMachineNode
        GetMachineNode --> PUSP : Not Found & RunningPhase<br/>Node object missing
        GetMachineNode --> Found

        Found --> MachineCondSetToNodeCond : NodeCondition != MachineCondition
        Found --> isHealthy : TODO (isHealthy)

        GetMachineNode --> CreationTimeout : PendingPhase
        GetMachineNode --> HealthTimeout : UnknownPhase

        CreationTimeout --> PFSF : Now - LastUpdateTime > Timeout
        HealthTimeout --> GetDeploymentName : Now - LastUpdateTime > Timeout
        CreationTimeout --> EnqueueAfter : Not timed out
        HealthTimeout --> EnqueueAfter : Not timed out


        GetDeploymentName --> RegisterPermit
        RegisterPermit --> TryMarkingMachineFailed
        TryMarkingMachineFailed --> InProgressMachines++ : Phase not<br/>Unknown or Running<br/>Machines "getting replaced"
        InProgressMachines++ --> PFSF:  InProgressMachines < MaxReplacements(1)

        MachineCondSetToNodeCond --> isHealthy
        isHealthy --> PUSP: Not Healthy & RunningPhase
        isHealthy --> CheckLastOp : Healthy & NotRunningPhase &<br/>NoCriticalComponentNotReadyTaint

        CheckLastOp --> DeleteBootstrapToken: TypeCreate &<br/> State is not Successful<br/>(Machine creation happened)
        CheckLastOp --> LastOpType=HealthChk: Not Create<br/>(Machine re-joined)

        DeleteBootstrapToken --> SSPR
        LastOpType=HealthChk --> SSPR

        SSPR --> UpdateStatus
        PUSP --> UpdateStatus
        PFSF --> UpdateStatus

        UpdateStatus --> [*]
        EnqueueAfter --> [*]
    }       
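
The two timeout branches (`CreationTimeout` and `HealthTimeout`) are the easiest part of this diagram to get wrong, so here is a small Go sketch of them. All names are hypothetical. The diagram does not show what happens when the replacement limit is already reached, so the sketch assumes the machine is simply re-enqueued.

```go
package main

import (
	"fmt"
	"time"
)

// timeouts holds the two limits used by the health check (hypothetical type).
type timeouts struct {
	Creation time.Duration
	Health   time.Duration
}

// checkTimeout decides what happens to a Pending or Unknown machine.
// sinceLastUpdate stands for Now - LastUpdateTime in the diagram.
func checkTimeout(phase string, sinceLastUpdate time.Duration, t timeouts, inProgress, maxReplacements int) string {
	switch phase {
	case "Pending":
		if sinceLastUpdate > t.Creation {
			return "Failed"
		}
		return "EnqueueAfter"
	case "Unknown":
		if sinceLastUpdate <= t.Health {
			return "EnqueueAfter"
		}
		// Only a limited number of machines may be "getting replaced" at once.
		if inProgress < maxReplacements {
			return "Failed"
		}
		return "EnqueueAfter" // assumption: wait for a replacement slot
	default:
		return "NoTimeoutCheck"
	}
}

func main() {
	t := timeouts{Creation: 20 * time.Minute, Health: 10 * time.Minute}
	fmt.Println(checkTimeout("Pending", 25*time.Minute, t, 0, 1))
	fmt.Println(checkTimeout("Unknown", 15*time.Minute, t, 1, 1))
}
```

Capping the number of in-progress replacements keeps a widespread outage from marking every Unknown machine as Failed at the same moment.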

Machine Deletion

Machine Deletion Flow:

  • Carefully orchestrated process to ensure clean resource cleanup
  • Involves multiple phases from drain to final cleanup
  • Handles volume attachments and node cleanup
  • Includes finalizer management for resource protection

---
  config:
    layout: elk
---
stateDiagram-v2
    state "Deletion Flow" as DF {
        direction LR
        state "ProcessPhase" as PP
        state "UpdateStatus" as US

        [*] --> CheckFinalizers
        CheckFinalizers --> SetTerminating
        SetTerminating --> PP

        PP --> GetVMStatus
        GetVMStatus --> [*]
        PP --> InitiateDrain
        InitiateDrain --> [*]
        PP --> DeleteVolumeAttachments
        DeleteVolumeAttachments --> [*]
        PP --> InitiateVMDeletion
        InitiateVMDeletion --> [*]
        PP --> InitiateNodeDeletion
        InitiateNodeDeletion --> [*]
        PP --> RemoveFinalizers
        RemoveFinalizers --> [*]
        PP --> US
        US --> [*]
    }

---
  config:
    layout: elk
---
stateDiagram-v2
    state "Initiate Drain" as ND {
        [*] --> ValidateNode
        state "UpdateStatus" as USD
        state "State: Processing<br/>Type: Delete" as SPTD
        state "CheckNodeCondition<br/>'Ready' or 'Read-only FS'" as CNC
        state "Phase is not Terminating" as NAT
        state "Terminating<br/>Reason: Unhealthy" as TRU
        state "Terminating<br/>Reason: ScaleDown" as TRSD
        state "SkipDrain<br/>State: Failed" as CUFail
        state "State: Processing<br/>Desc: DelVolAttachments" as SPDDVA
        state "State: Processing<br/>Desc: InitVMDeletion" as SPDIVD
        state "State: Failed<br/>Desc: InitiateDrain" as SFDID

        ValidateNode --> SPTD : NodeName is empty
        SPTD --> USD
        ValidateNode --> CNC
        CNC --> ForceDeletion : Read-Only/NotReady &<br/>Last-transition Timeout
        CNC --> NormalDrain : Healthy
        CNC --> ForceDeletion : "force-delete" label on machine or Drain<br/> Timeout on deletion

        ForceDeletion --> UpdateTerminationCondition
        NormalDrain --> UpdateTerminationCondition

        UpdateTerminationCondition --> RunDrain : Phase is empty or CrashLoopBackOff
        UpdateTerminationCondition --> NAT : Non-creation Phase
        NAT --> TRU : Phase is failed
        NAT --> TRSD : Phase not failed
        TRU --> TerminationConditionUpdate
        TRSD --> TerminationConditionUpdate

        TerminationConditionUpdate --> CUFail : Update failure<br/>during NormalDrain
        TerminationConditionUpdate --> RunDrain : Update failure<br/>during ForceDeletion
        TerminationConditionUpdate --> RunDrain : Update Successful
        CUFail --> USD

        RunDrain --> SPDDVA : Drain successful<br/>during ForceDeletion
        RunDrain --> SPDIVD : Drain successful<br/>during NormalDrain
        RunDrain --> SPDDVA : Drain failed<br/>"force-delete" label present
        RunDrain --> SFDID : Drain failed<br/>"force-delete" label absent

        SPDDVA --> USD
        SPDIVD --> USD
        SFDID --> USD

        USD --> [*]
    }
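
The Initiate Drain diagram boils down to two questions: which drain mode to use, and where to go once the drain has run. Here is a hedged Go sketch of both. The types and function names are invented for illustration. The diagram only labels the `Healthy` case for a normal drain, so treating an unhealthy node that has not yet hit the timeout as a normal drain is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// drainInput is a hypothetical summary of what the drain decision looks at.
type drainInput struct {
	ForceDeleteLabel bool          // "force-delete" label on the machine
	UnhealthyFor     time.Duration // time since the node went NotReady or read-only; 0 if healthy
	SinceDeletion    time.Duration // time since deletion was requested
}

// drainTimeouts holds the two limits from the diagram (hypothetical type).
type drainTimeouts struct {
	Unhealthy time.Duration
	Drain     time.Duration
}

// drainMode chooses between ForceDeletion and NormalDrain.
func drainMode(in drainInput, t drainTimeouts) string {
	if in.ForceDeleteLabel || in.SinceDeletion > t.Drain || in.UnhealthyFor > t.Unhealthy {
		return "ForceDeletion"
	}
	return "NormalDrain"
}

// afterDrain returns the description of the next deletion step.
func afterDrain(mode string, drainFailed, forceDeleteLabel bool) string {
	switch {
	case drainFailed && forceDeleteLabel:
		return "DelVolAttachments"
	case drainFailed:
		return "Failed: InitiateDrain"
	case mode == "ForceDeletion":
		return "DelVolAttachments"
	default:
		return "InitVMDeletion"
	}
}

func main() {
	t := drainTimeouts{Unhealthy: 5 * time.Minute, Drain: 2 * time.Hour}
	mode := drainMode(drainInput{UnhealthyFor: 10 * time.Minute}, t)
	fmt.Println(mode, "->", afterDrain(mode, false, false))
}
```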

Let’s visualize the Node Drain process, which is a critical part of machine deletion:

  • Sophisticated pod eviction handling
  • Supports both forced and normal drain scenarios
  • Handles PDB (Pod Disruption Budget) violations
  • Includes parallel and serial eviction strategies

---
  config:
    layout: elk
---
stateDiagram-v2
    state "RunDrain" as Normal {
        state "CordonNode (Sealing off)<br/>(Set Unschedulable to true)" as CN
        [*] --> CN
        CN --> WaitForPodCacheSync
        WaitForPodCacheSync --> GetPodsForDeletion : TODO
        
        %% http://localhost:3000/machine-controller/node_drain.html#drainoptionsgetpodsfordeletion
        %% mirrorPodFilter: pod does not have MirrorPodAnnotation (set by kubelet when creating mirror pods)
        %% localStorageFilter
        %% unreplicatedFilter
        %% daemonSetFilter
        
        GetPodsForDeletion --> DeleteOrEvictPods

        DeleteOrEvictPods --> UpdateNodeCondition
        UpdateNodeCondition --> [*]
        
        state "DeleteOrEvictPods" as EP {
            [*] --> CheckEvictionSupport

            CheckEvictionSupport --> ParallelEviction : ForceDeletion
            CheckEvictionSupport --> MixedEviction : NormalDrain

            MixedEviction --> ParallelEvictNoPV
            MixedEviction --> SerialEvictWithPV

            ParallelEvictNoPV --> WaitForEviction
            SerialEvictWithPV --> WaitForEviction
            ParallelEviction --> WaitForEviction
            WaitForEviction --> HandlePDBViolation
            HandlePDBViolation --> RetryEviction
            RetryEviction --> [*]
        }
    }
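
The `MixedEviction` branch splits pods by whether they use persistent volumes. Here is a minimal Go sketch of that split, assuming a simplified `pod` type and a caller-supplied `evict` callback (both hypothetical). Pods without volumes are evicted concurrently, while pods with volumes are handled one at a time.

```go
package main

import (
	"fmt"
	"sync"
)

// pod is a hypothetical, minimal stand-in for a Kubernetes pod.
type pod struct {
	Name  string
	HasPV bool // true if the pod mounts at least one persistent volume
}

// partition splits pods into those with and without persistent volumes.
func partition(pods []pod) (withPV, withoutPV []pod) {
	for _, p := range pods {
		if p.HasPV {
			withPV = append(withPV, p)
		} else {
			withoutPV = append(withoutPV, p)
		}
	}
	return withPV, withoutPV
}

// evictMixed evicts pods without volumes in parallel and pods with
// volumes serially. evict must be safe for concurrent use.
func evictMixed(pods []pod, evict func(pod)) {
	withPV, withoutPV := partition(pods)
	var wg sync.WaitGroup
	for _, p := range withoutPV {
		wg.Add(1)
		go func(p pod) {
			defer wg.Done()
			evict(p)
		}(p)
	}
	for _, p := range withPV {
		evict(p) // one at a time, so volumes can detach in order
	}
	wg.Wait()
}

func main() {
	pods := []pod{{"web-1", false}, {"db-0", true}, {"web-2", false}, {"db-1", true}}
	var mu sync.Mutex
	evicted := 0
	evictMixed(pods, func(p pod) {
		mu.Lock()
		defer mu.Unlock()
		evicted++
	})
	fmt.Println("evicted pods:", evicted)
}
```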

---
title: EvictPodsNoPV
---
stateDiagram-v2
    classDef imp font-weight:bold,stroke-width:5px;
        state "Retry count >= MaxEvictRetries" as Term
        state "Set attemptEvict as False" as AEF
        state "Sleep(EvictRetryInterval)" as SRC

        [*] --> Term:::imp

        Term:::imp --> CheckAttemptEvict : No
        Term:::imp --> AEF : Yes
        AEF --> CheckAttemptEvict

        CheckAttemptEvict --> EvictPod : True
        CheckAttemptEvict --> DeletePod : False

        EvictPod --> CheckErr
        DeletePod --> CheckErr

        CheckErr --> BreakLoop:::imp : nil
        CheckErr --> LogEvict : notFound
        CheckErr --> EvictFailErr : AttemptEvict is False
        CheckErr --> PDBViolation : APIErr too many req

        PDBViolation --> GetPDB

        GetPDB --> SRC : No PDB
        GetPDB --> CheckMisconfigured : PDB exists

        CheckMisconfigured --> MisconfigErr : Generation is ObservedGeneration<br/>HealthyPods >= ExpectedPods<br/>DisruptionsAllowed is 0
        CheckMisconfigured --> SRC : No

        SRC:::imp --> Term : count++


        BreakLoop:::imp --> ReturnSuccess:::imp : ForceDeletion
        BreakLoop:::imp --> GetTerminationGracePeriod : NormalDrain

        GetTerminationGracePeriod --> SetToTimeout : GracePeriod > Timeout
        GetTerminationGracePeriod --> WaitForDeletion : GracePeriod <= Timeout
        SetToTimeout --> WaitForDeletion

        WaitForDeletion --> TimeoutErr : timeout &<br/>pod exists
        WaitForDeletion --> WaitErr : err
        WaitForDeletion --> ReturnSuccess:::imp : timeout &<br/>pod deleted

        LogEvict --> [*]
        EvictFailErr --> [*]
        MisconfigErr --> [*]
        TimeoutErr --> [*]
        WaitErr --> [*]
        ReturnSuccess:::imp --> [*]
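
The retry loop at the top of this diagram is easier to see in code. Here is a rough Go sketch, assuming a hypothetical `podClient` interface and sentinel errors that stand in for the real API errors. It leaves out the PDB lookup and the wait for pod deletion. The handling of errors the diagram does not label (returning them to the caller) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for Kubernetes API errors.
var (
	errNotFound        = errors.New("pod not found")
	errTooManyRequests = errors.New("too many requests") // eviction blocked by a PDB
)

// podClient is a hypothetical interface for the two ways of removing a pod.
type podClient interface {
	Evict(name string) error
	Delete(name string) error
}

// evictWithRetry tries eviction first and falls back to plain deletion
// once maxRetries attempts have been blocked.
func evictWithRetry(c podClient, name string, attemptEvict bool, maxRetries int, interval time.Duration) error {
	for retries := 0; ; retries++ {
		if retries >= maxRetries {
			attemptEvict = false
		}
		var err error
		if attemptEvict {
			err = c.Evict(name)
		} else {
			err = c.Delete(name)
		}
		switch {
		case err == nil, errors.Is(err, errNotFound):
			return nil // pod is gone or on its way out
		case attemptEvict && errors.Is(err, errTooManyRequests):
			time.Sleep(interval) // PDB violation: wait and retry
		default:
			return fmt.Errorf("removing pod %s: %w", name, err)
		}
	}
}

// fakeClient rejects the first blockedEvictions eviction calls.
type fakeClient struct {
	blockedEvictions int
	evicts, deletes  int
}

func (f *fakeClient) Evict(string) error {
	f.evicts++
	if f.evicts <= f.blockedEvictions {
		return errTooManyRequests
	}
	return nil
}

func (f *fakeClient) Delete(string) error {
	f.deletes++
	return nil
}

func main() {
	c := &fakeClient{blockedEvictions: 2}
	err := evictWithRetry(c, "web-1", true, 5, 10*time.Millisecond)
	fmt.Println(err, c.evicts, c.deletes)
}
```

Falling back to deletion after a bounded number of attempts means a drain cannot be blocked forever by a disruption budget that never frees up.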

---
title: TODO EvictPodsWithPV
config:
  layout: elk
---
stateDiagram-v2
    classDef imp font-weight:bold,stroke-width:5px;
        state "Retry count < MaxEvictRetries" as Term
        state "Sleep(EvictRetryInterval)" as SRC
        state "CheckRemainingPods" as CRP
        
        [*] --> SortPodsByPriority
        SortPodsByPriority --> podVolumeInfoMap : Create a map from pod to list of attached PVs (VolName, VolID -> GetVolumeID)

        podVolumeInfoMap --> AttemptEvict
        AttemptEvict --> evictPodPVInternal(Delete):::imp : false
        AttemptEvict --> Term:::imp : true
        Term:::imp --> evictPodPVInternal(Evict):::imp : true
        evictPodPVInternal(Evict):::imp --> break:::imp : FastTrack or<br/>All pods evicted
        evictPodPVInternal(Evict):::imp --> SRC : Not FastTrack and<br/>Pods Remaining
        SRC --> Term:::imp : count++

        Term:::imp --> evictPodPVInternal(Delete):::imp : false<br/>Not FastTrack and<br/>Pods Remaining
        break:::imp --> [*] : All pods evicted

        break:::imp --> CRP : FastTrack
        evictPodPVInternal(Delete):::imp --> CRP

        CRP --> Success:::imp : Node Not Found
        CRP --> ChkAttemptEvict
        ChkAttemptEvict --> EvictErr : True
        ChkAttemptEvict --> DeleteErr : False

---
title: EvictPodsWithPVInternal
config:
  layout: elk
---
stateDiagram-v2
    classDef imp font-weight:bold,stroke-width:5px;
        state "Add Pod to RetryPods" as Retry
        state "Log NotFound<br/>DeleteWorker" as LogNotFound
        [*] --> SelectPod : Start Eviction Process

        SelectPod --> CheckContextTimeout:::imp

        CheckContextTimeout:::imp --> AbortProcess : Context Done
        CheckContextTimeout:::imp --> AddWorker(AttachmentHandler) : Context Not Done

        AddWorker(AttachmentHandler) --> EvictOrDelete

        EvictOrDelete --> CheckEvictionResult:::imp

        CheckEvictionResult:::imp --> EvictionFailed
        EvictionFailed --> PDBViolation : Eviction Attempted &<br/>TooManyRequests
        EvictionFailed --> PodAlreadyGone : Pod Not Found
        EvictionFailed --> EvictionError : Other Errors
        CheckEvictionResult:::imp --> WaitForVolumeDetach : Successful Eviction

        PDBViolation --> GetPDB
        GetPDB --> CheckMisconfigured : PDB Exists
        GetPDB --> Retry : NoPDB
        CheckMisconfigured --> MisconfigErr : Generation is ObservedGeneration<br/>HealthyPods >= ExpectedPods<br/>DisruptionsAllowed is 0
        CheckMisconfigured --> Retry:::imp : NotMisconfig
        MisconfigErr --> DeleteWorker

        PodAlreadyGone --> DeleteWorker

        EvictionError --> Retry:::imp

        WaitForVolumeDetach --> CheckDetachResult:::imp : TerminationGracePeriod + DetachTimeout

        CheckDetachResult:::imp --> LogNotFound : Node Not Found
        CheckDetachResult:::imp --> DetachError : Detach Failed
        CheckDetachResult:::imp --> WaitForReattach : Successful Detach

        LogNotFound --> AbortProcess
        DetachError --> DeleteWorker

        WaitForReattach --> CheckReattachResult:::imp : PvReattachTimeout

        CheckReattachResult:::imp --> ReattachTimeout : Timeout
        CheckReattachResult:::imp --> LogError : Reattach Failed
        CheckReattachResult:::imp --> SuccessfulEviction:::imp : Successful Reattach

        ReattachTimeout --> DeleteWorker : TODO IsThisCorrect?
        LogError --> DeleteWorker
        SuccessfulEviction:::imp --> DeleteWorker : Pod Processed

        DeleteWorker --> [*]
        Retry:::imp --> DeleteWorker
        AbortProcess --> Exit:::imp : Terminate (FastTrack)<br/>Return Remaining Pods
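
Both eviction diagrams use the same `CheckMisconfigured` test on the Pod Disruption Budget. Here is a small Go sketch of that condition. The `pdbStatus` struct is a hypothetical, trimmed-down stand-in for the real PodDisruptionBudget object, keeping only the fields the diagram label mentions.

```go
package main

import "fmt"

// pdbStatus is a hypothetical, minimal view of a PodDisruptionBudget.
type pdbStatus struct {
	Generation         int64
	ObservedGeneration int64
	HealthyPods        int32
	ExpectedPods       int32
	DisruptionsAllowed int32
}

// isMisconfigured reports whether the budget is up to date, every expected
// pod is healthy, and yet no disruption is allowed. In that situation
// waiting and retrying the eviction cannot help.
func isMisconfigured(p pdbStatus) bool {
	return p.Generation == p.ObservedGeneration &&
		p.HealthyPods >= p.ExpectedPods &&
		p.DisruptionsAllowed == 0
}

func main() {
	stuck := pdbStatus{Generation: 2, ObservedGeneration: 2, HealthyPods: 3, ExpectedPods: 3}
	recovering := pdbStatus{Generation: 2, ObservedGeneration: 2, HealthyPods: 2, ExpectedPods: 3}
	fmt.Println(isMisconfigured(stuck), isMisconfigured(recovering))
}
```

If the check is false, the budget may open up once more pods become healthy, which is why the diagrams send that case back to the retry path.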

Safety Controller

  1. Orphan VM Check:

    • Runs periodically (every 15 minutes) to detect and clean up orphaned VMs
    • Lists all VMs in the cloud provider matching the cluster’s tag
    • Maps VMs to machine objects using ProviderID
    • Handles nodes without machine objects:
      • Adds NotManagedByMCM annotation after timeout
      • Removes annotation if machine object is found
    • Logs all cleanup operations for audit purposes
  2. API Server Safety:

    • Monitors connectivity to both control and target API servers
    • Implements a freezing mechanism when API servers are unreachable
    • Manages machine controller state based on API server health:
      • Freezes operations if timeout exceeded
      • Unfreezes when API servers become available
    • Handles machine status updates during API server recovery

---
  config:
    layout: elk
---
stateDiagram-v2
    state "Safety Controller" as SC {
        state "Orphan VM Check" as OVC {
            [*] --> ListCloudVMs
            ListCloudVMs --> MapToMachines
            MapToMachines --> CheckOrphans
            
            state "CheckOrphans" as CO {
                [*] --> NoMachineObject
                NoMachineObject --> ConfirmDeletion
                ConfirmDeletion --> DeleteVM
                DeleteVM --> LogDeletion
            }
            
            CheckOrphans --> AnnotateNodes
            
            state "AnnotateNodes" as AN {
                [*] --> CheckNodeMachine
                CheckNodeMachine --> MultipleMatch : Multiple Machines
                CheckNodeMachine --> NoMatch : No Machine
                CheckNodeMachine --> SingleMatch : One Machine
                
                NoMatch --> TimeoutCheck
                TimeoutCheck --> AddAnnotation : Timeout Exceeded
                
                SingleMatch --> RemoveAnnotation : Has Annotation
                
                AddAnnotation --> UpdateNode
                RemoveAnnotation --> UpdateNode
            }
        }

        state "API Server Safety" as ASS {
            [*] --> CheckFrozen
            CheckFrozen --> CheckAPIServer : Frozen
            CheckFrozen --> MonitorAPI : Not Frozen
            
            CheckAPIServer --> Unfreeze : API Up
            CheckAPIServer --> Requeue : API Down
            
            MonitorAPI --> SetInactiveTime : API Down
            MonitorAPI --> ClearInactiveTime : API Up
            
            SetInactiveTime --> CheckTimeout
            CheckTimeout --> Freeze : Timeout Exceeded
            
            Unfreeze --> UpdateMachines
            UpdateMachines --> ResetTimeout
        }
    }
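
The heart of the Orphan VM Check is a set difference: VMs that exist in the cloud provider but have no matching machine object. Here is a minimal Go sketch of that comparison, assuming both sides have already been reduced to ProviderID strings. The function name and the example IDs are made up for illustration.

```go
package main

import "fmt"

// findOrphanVMs returns the provider IDs of VMs that no machine object
// refers to, preserving the order of vmProviderIDs.
func findOrphanVMs(vmProviderIDs, machineProviderIDs []string) []string {
	known := make(map[string]struct{}, len(machineProviderIDs))
	for _, id := range machineProviderIDs {
		known[id] = struct{}{}
	}
	var orphans []string
	for _, id := range vmProviderIDs {
		if _, ok := known[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	return orphans
}

func main() {
	vms := []string{"vm-a", "vm-b", "vm-c"} // hypothetical IDs listed from the cloud provider
	machines := []string{"vm-a", "vm-c"}    // hypothetical IDs taken from machine objects
	fmt.Println(findOrphanVMs(vms, machines))
}
```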

MachineSet Controller

  1. Core Reconciliation:

    • Validates MachineSet specifications
    • Manages finalizers for proper cleanup
    • Implements machine ownership through controller references
    • Synchronizes node templates and configurations
  2. Replica Management:

    • Implements sophisticated scaling logic:
      • Slow-start batching for scale-up operations
      • Prioritized scale-down based on machine health
    • Handles stale machine cleanup
    • Maintains desired replica count
    • Updates status to reflect current state

---
  config:
    layout: elk
---
stateDiagram-v2
    state "MachineSet Controller" as MSC {
        [*] --> FetchMachineSet
        FetchMachineSet --> ValidateSpec
        ValidateSpec --> AddFinalizers : No Deletion
        ValidateSpec --> ProcessDeletion : Deletion Requested
        
        AddFinalizers --> ClaimMachines
        
        state "ClaimMachines" as CM {
            [*] --> CreateControllerRef
            CreateControllerRef --> MatchSelector
            MatchSelector --> AdoptOrphan : No Owner
            MatchSelector --> ReleaseClaimed : Wrong Owner
            
            AdoptOrphan --> UpdateOwnerRef
            ReleaseClaimed --> RemoveOwnerRef
        }
        
        ClaimMachines --> SyncNodeTemplates
        SyncNodeTemplates --> SyncMachineConfig
        SyncMachineConfig --> HandleDeletion : Deletion Requested
        SyncMachineConfig --> ManageReplicas : No Deletion
        
        state "ManageReplicas" as MR {
            [*] --> GetActiveMachines
            GetActiveMachines --> DeleteStale
            DeleteStale --> CheckReplicas
            
            CheckReplicas --> ScaleUp : Too Few Active Machines
            CheckReplicas --> ScaleDown : Too Many Active Machines
            
            ScaleUp --> SlowStartBatch
            SlowStartBatch --> CreateMachines
            
            ScaleDown --> SortMachines
            SortMachines --> DeleteExcess
        }
        
        ManageReplicas --> UpdateStatus
        HandleDeletion --> UpdateStatus
        UpdateStatus --> [*]
    }
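
Slow-start batching is the least obvious step in `ManageReplicas`, so here is a simplified Go sketch of the idea: start with a small batch, double the batch size after each fully successful batch, and stop as soon as a batch contains a failure. The function names and the `errQuota` error are hypothetical. This version also runs each batch sequentially to stay short, whereas a real controller would typically issue the calls in a batch concurrently.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuota is a hypothetical provider error used only for the demo.
var errQuota = errors.New("quota exceeded")

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// slowStart calls create up to count times in batches of growing size.
// It returns the number of successful calls and the first error seen.
func slowStart(count, initialBatch int, create func() error) (int, error) {
	created := 0
	remaining := count
	for batch := minInt(remaining, initialBatch); batch > 0; batch = minInt(2*batch, remaining) {
		var firstErr error
		for i := 0; i < batch; i++ {
			if err := create(); err != nil {
				if firstErr == nil {
					firstErr = err
				}
				continue
			}
			created++
		}
		if firstErr != nil {
			return created, firstErr // stop before starting a larger batch
		}
		remaining -= batch
	}
	return created, nil
}

func main() {
	calls := 0
	created, err := slowStart(10, 1, func() error {
		calls++
		if calls > 1 {
			return errQuota
		}
		return nil
	})
	fmt.Println(created, err, calls)
}
```

The benefit shows up when creation keeps failing: only a few calls are made before the controller gives up for this round, instead of one failing call per missing replica.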

MachineDeployment Controller

Deployment Management:

  • Handles multiple MachineSets for a deployment
  • Maintains deployment history through revisions
  • Supports pausing and resuming deployments
  • Implements rollback functionality
  1. Deployment Strategies:

    • Recreate Strategy:

      • Scales down old MachineSets completely
      • Creates and scales up new MachineSet
      • Ensures clean cutover between versions
    • Rolling Update Strategy:

      • Gradually scales up new MachineSet
      • Gradually scales down old MachineSets
      • Maintains availability during updates
      • Handles surge and unavailability constraints
  2. Scaling Operations:

    • Detects and handles scaling events
    • Manages desired replica counts across MachineSets
    • Updates annotations for autoscaler integration
    • Ensures proper resource cleanup

---
  config:
    layout: elk
---
stateDiagram-v2
    state "MachineDeployment Controller" as MDC {
        [*] --> FetchDeployment
        FetchDeployment --> ValidateSpec
        ValidateSpec --> GetMachineSets
        
        state "GetMachineSets" as GMS {
            [*] --> ClaimMachineSets
            ClaimMachineSets --> BuildMachineMap
            BuildMachineMap --> SyncRevision
            
            state "ClaimMachineSets" as CMS {
                [*] --> CreateControllerRef
                CreateControllerRef --> MatchSelector
                MatchSelector --> AdoptOrphan : No Owner
                MatchSelector --> ReleaseClaimed : Wrong Owner

                AdoptOrphan --> UpdateOwnerRef
                ReleaseClaimed --> RemoveOwnerRef
            }
        }
        
        GetMachineSets --> CheckDeletion
        CheckDeletion --> HandleDeletion : Deletion Requested
        CheckDeletion --> CheckPaused : No Deletion
        
        
        CheckPaused --> Sync : Paused
        CheckPaused --> CheckRollback : Not Paused
        
        state "Rollback" as RB {
            [*] --> FindRevision
            FindRevision --> RemoveTaints
            RemoveTaints --> UpdateTemplate
            UpdateTemplate --> SyncStatus
        }
        
        CheckRollback --> Rollback : Rollback Requested
        CheckRollback --> CheckScaling : No Rollback
        
        state "Scaling" as SC {
            [*] --> CheckActiveMS
            CheckActiveMS --> CheckReplicas
            CheckReplicas --> SyncScale
        }
        
        CheckScaling --> Scaling : Scale Event
        CheckScaling --> DeployStrategy : No Scale Event
        
        state "DeployStrategy" as DS {
            state "Recreate" as RC {
                [*] --> OldScaleDown
                OldScaleDown --> CreateNew
                CreateNew --> NewScaleUp
            }
            
            state "RollingUpdate" as RU {
                [*] --> ScaleUpNew
                [*] --> ScaleDownOld
                ScaleDownOld --> CleanupOld
            }
        }
        
        DeployStrategy --> UpdateStatus
        UpdateStatus --> [*]
    }
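
The surge and unavailability constraints of the rolling update come down to a bit of arithmetic. Here is a simplified Go sketch modeled on how Kubernetes Deployments treat `maxSurge` and `maxUnavailable`. Whether MCM computes this in exactly the same way is an assumption, and both function names are hypothetical.

```go
package main

import "fmt"

// newSetReplicas returns the replica count the new MachineSet may be
// scaled to without exceeding desired + maxSurge machines in total.
func newSetReplicas(desired, maxSurge, currentTotal, currentNew int) int {
	maxTotal := desired + maxSurge
	if currentTotal >= maxTotal {
		return currentNew // no room to surge right now
	}
	scaleUp := maxTotal - currentTotal
	room := desired - currentNew
	if room < 0 {
		room = 0
	}
	if scaleUp > room {
		scaleUp = room // never exceed the desired count
	}
	return currentNew + scaleUp
}

// oldSetsScaleDown returns how many old machines may be removed while
// keeping at least desired - maxUnavailable machines available.
func oldSetsScaleDown(desired, maxUnavailable, availableTotal int) int {
	minAvailable := desired - maxUnavailable
	if availableTotal <= minAvailable {
		return 0
	}
	return availableTotal - minAvailable
}

func main() {
	// 3 desired machines, all still in the old MachineSet.
	fmt.Println(newSetReplicas(3, 1, 3, 0))
	fmt.Println(oldSetsScaleDown(3, 1, 3))
}
```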
        

Summary

The controllers share a common set of mechanisms for error handling, resource protection, performance, and observability:

  1. Error Handling:

    • Categorizes errors into recoverable and non-recoverable
    • Implements exponential backoff for retries
    • Maintains error counters and conditions
    • Updates status to reflect error states
  2. Resource Protection:

    • Uses finalizers to prevent premature deletion
    • Implements owner references for proper garbage collection
    • Maintains consistent state through careful status updates
    • Handles race conditions through proper locking
  3. Performance Considerations:

    • Implements work queues for efficient processing
    • Uses informers for efficient cache handling
    • Batches operations when possible
    • Implements rate limiting for API calls
  4. Monitoring and Metrics:

    • Tracks operation durations
    • Records error counts and types
    • Provides health metrics
    • Implements proper logging for debugging
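
To give a feel for the exponential backoff mentioned under Error Handling, here is a minimal Go sketch of a capped backoff calculation. The function name and the delay values are assumptions chosen for illustration, not MCM's actual settings.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay doubles the base delay once per previous failure and
// caps the result at maxDelay.
func backoffDelay(base, maxDelay time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	for failures := 0; failures <= 6; failures++ {
		fmt.Println(failures, backoffDelay(time.Second, 30*time.Second, failures))
	}
}
```

Capping the delay keeps a persistently failing machine from being retried too rarely, while the doubling keeps early retries from hammering the cloud provider API.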

The entire system works together to provide:

  1. Reliable machine lifecycle management
  2. Proper cleanup of resources
  3. Scaling capabilities
  4. Rolling updates and rollbacks
  5. Protection against race conditions and API server issues
  6. Efficient resource utilization
  7. Proper monitoring and debugging capabilities

Together, the controllers keep machines at their desired state while handling edge cases and failure scenarios such as stale nodes, PDB violations, and unreachable API servers.