MCM
First, go through the README; later on, check the local setup guide as well.
Components
Check CRDs present in the MCM tree
All the below class diagrams omit TypeMeta and ObjectMeta for brevity.
Machine
---
title: Machine
config:
class:
hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
%% note "apiVersion, kind & metadata components have not been shown"
%% Machine ---------------------------------
note for Machine "Representation of a physical or virtual machine that corresponds to a front-end kubernetes node object"
Machine <|-- MachineSpec
Machine <|-- MachineStatus
class Machine{
MachineSpec spec
MachineStatus status
}
%%--
MachineSpec <|-- ClassSpec
MachineSpec <|-- MachineConfiguration
MachineSpec <|-- NodeTemplateSpec
class MachineSpec{
ClassSpec class
string providerID
NodeTemplateSpec nodeTemplate
MachineConfiguration MachineConfiguration
}
class ClassSpec{
string apiGroup
string kind
string name
}
note for ClassSpec "Refers to the MachineClass template for the Machine"
class NodeTemplateSpec{
string podCIDR: IP range
[]string podCIDRs
string providerID
bool unschedulable
[]taint taints
}
note for NodeTemplateSpec "Actually part of NodeSpec which is a field in NodeTemplateSpec"
class MachineConfiguration{
v1.Duration drainTimeout
v1.Duration healthTimeout
v1.Duration creationTimeout
int32* maxEvictRetries
string* nodeConditions
}
%%--
MachineStatus <|-- LastOperation
MachineStatus <|-- CurrentStatus
class MachineStatus{
[]v1.NodeCondition conditions
LastOperation lastOperation
CurrentStatus currentStatus
string lastKnownState
}
class LastOperation{
string description
string errorCode
v1.Time lastUpdateTime
MachineState state: string
MachineOperationType type: string
}
note for LastOperation "State can be 'Processing', 'Failed' or 'Successful'<br>Type can be 'Create', 'Update', 'HealthCheck' or 'Delete'"
class CurrentStatus{
MachinePhase phase: string
bool timeoutActive
v1.Time lastUpdateTime
}
link MachineStatus "https://kubernetes.io/docs/reference/node/node-status/#condition"
Machine Class
---
title: MachineClass
config:
class:
hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
%% MachineClass ---------------------------------
note for MachineClass "Represents a template that contains cloud provider specific details used to create machines"
MachineClass <|-- NodeTemplate
class MachineClass{
NodeTemplate nodeTemplate
v1.SecretReference credentialsSecretRef
RawExtension providerSpec
string provider
v1.SecretReference secretRef
}
note for MachineClass "credentialsSecretRef has cloud provider credentials<br>secretRef has cloud init script: userData containing bootstrap token, API server address, kubeconfig"
note for MachineClass "provider specific configuration to create a node is specified in providerSpec of the extensible custom type rawExtension which permits instances of different structure types to be held within a single type"
class NodeTemplate{
v1.ResourceList capacity
string instanceType
string region
string zone
string* architecture
}
Machine Set
---
title: MachineSet
config:
class:
hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
%% MachineSet ---------------------------------
note for MachineSet "Ensures that the specified number of Machine replicas are running at any time"
MachineSet <|-- MachineSetSpec
MachineSet <|-- MachineSetStatus
class MachineSet{
MachineSetSpec spec
MachineSetStatus status
}
MachineSetSpec <|-- ClassSpec
MachineSetSpec <|-- MachineTemplateSpec
class MachineSetSpec{
int32 replicas
v1.LabelSelector selector
ClassSpec machineClass
MachineTemplateSpec template
int32 minReadySeconds
}
MachineTemplateSpec <|-- MachineSpec
class MachineTemplateSpec{
MachineSpec spec
}
MachineSetStatus <|-- MachineSetCondition
MachineSetStatus <|-- LastOperation
MachineSetStatus <|-- MachineSummary
class MachineSetStatus{
int32 replicas
int32 fullyLabeledReplicas
int32 readyReplicas
int32 availableReplicas
int64 observedGeneration
[]MachineSetCondition machineSetCondition
LastOperation lastOperation
MachineSummary failedMachines
}
MachineSummary <|-- LastOperation
class MachineSummary{
string name
string providerID
LastOperation lastOperation
string ownerRef
}
class MachineSetCondition{
MachineSetConditionType type: string
ConditionStatus status: string
v1.Time lastTransitionTime
string reason
string message
}
note for LastOperation "One has information about lastOp for Machine (summary) and the other is for MachineSet"
Machine Deployment
---
title: MachineDeployment
config:
class:
hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
%% Deployment ---------------------------------
note for MachineDeployment "Provides a declarative update for MachineSet and Machines"
MachineDeployment <|-- MachineDeploymentSpec
MachineDeployment <|-- MachineDeploymentStatus
class MachineDeployment{
MachineDeploymentSpec spec
MachineDeploymentStatus status
}
%%--
MachineDeploymentSpec <|-- MachineTemplateSpec
MachineDeploymentSpec <|-- MachineDeploymentStrategy
class MachineDeploymentSpec{
int32 replicas
v1.LabelSelector selector
MachineTemplateSpec template
MachineDeploymentStrategy strategy
int32 minReadySeconds
int32* revisionHistoryLimit
bool paused
RollbackConfig* rollbackTo
int32* progressDeadlineSeconds
}
MachineDeploymentStrategy <|-- RollingUpdateMachineDeployment
class MachineDeploymentStrategy{
MachineDeploymentStrategyType type: string
RollingUpdateMachineDeployment rollingUpdate
}
class RollingUpdateMachineDeployment{
IntOrString MaxUnavailable
IntOrString MaxSurge
}
%%--
MachineDeploymentStatus <|-- MachineDeploymentCondition
MachineDeploymentStatus <|-- MachineSummary
class MachineDeploymentStatus{
int64 observedGeneration
int32 replicas
int32 updatedReplicas
int32 readyReplicas
int32 availableReplicas
int32 unavailableReplicas
[]MachineDeploymentCondition condition
int32* collisionCount
MachineSummary failedMachines
}
class MachineDeploymentCondition{
MachineDeploymentConditionType type: string
ConditionStatus status: string
v1.Time lastUpdateTime
v1.Time lastTransitionTime
string reason
string message
}
Miscellaneous
These are added for reference only; there is no need to dissect them in detail.
---
title: SafetyOptions & PodDisruptionBudget
config:
class:
hideEmptyMembersBox: true
---
classDiagram
class SafetyOptions{
metav1.Duration MachineCreationTimeout
metav1.Duration MachineHealthTimeout
metav1.Duration MachineDrainTimeout
metav1.Duration PvDetachTimeout
metav1.Duration PvReattachTimeout
metav1.Duration MachineSafetyAPIServerStatusCheckTimeout
metav1.Duration MachineSafetyAPIServerStatusCheckPeriod
metav1.Duration MachineSafetyOrphanVMsPeriod
time.Time APIserverInactiveStartTime
bool MachineControllerFrozen
int32 MaxEvictRetries
}
PodDisruptionBudget <|-- PodDisruptionBudgetSpec
PodDisruptionBudget <|-- PodDisruptionBudgetStatus
class PodDisruptionBudget{
PodDisruptionBudgetSpec spec
PodDisruptionBudgetStatus status
}
class PodDisruptionBudgetSpec{
IntOrString* minAvailable
v1.LabelSelector* selector
IntOrString* maxUnavailable
}
class PodDisruptionBudgetStatus{
int64 observedGeneration
map[string]v1.Time disruptedPods
int32 disruptionsAllowed
int32 currentHealthy
int32 desiredHealthy
int32 expectedPods
[]v1.Condition conditions
}
---
title: VolumeAttachmentHandler
config:
class:
hideEmptyMembersBox: true
---
classDiagram
VolumeAttachmentHandler <|-- VolumeAttachment
note for VolumeAttachmentHandler "Distribute incoming VolumeAttachment requests to all listening workers"
class VolumeAttachmentHandler{
sync.Mutex
[]chan storagev1.VolumeAttachment* workers
dispatch()
AddWorker()
DeleteWorker()
}
VolumeAttachment <|-- VolumeAttachmentSpec
VolumeAttachment <|-- VolumeAttachmentStatus
note for VolumeAttachment "Captures the intent to attach/detach specified volume to/from the specified node"
class VolumeAttachment{
VolumeAttachmentSpec spec
VolumeAttachmentStatus status
}
VolumeAttachmentSpec <|-- VolumeAttachmentSource
class VolumeAttachmentSpec{
string attacher
VolumeAttachmentSource source
string nodeName
}
VolumeAttachmentStatus <|-- VolumeError
class VolumeAttachmentStatus{
bool attached
map[string]string attachmentMetadata
VolumeError* attachError
VolumeError* detachError
}
class VolumeAttachmentSource{
string* persistentVolumeName
v1.PersistentVolumeSpec* inlineVolumeSpec
}
class VolumeError{
metav1.Time time
string message
}
Machine Controller
Reconcile Cluster Secret
- Fetch the secret.
- Get the `MachineClass` objects referring to the passed secret.
- In case there are machine classes that refer to it, add the finalizer to the secret; otherwise remove it. (The removal of the finalizer is needed because, if the secret isn’t referred to by any `MachineClass`, its deletion shouldn’t depend on the finalizer, Ref)
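The add-or-remove decision above can be sketched as a pure function over the secret's finalizer list. This is a minimal illustration, not MCM's code: `syncSecretFinalizers` and the value of `finalizerName` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// finalizerName is a placeholder value, not the real MCM constant.
const finalizerName = "example.com/mcm"

// syncSecretFinalizers returns the finalizer list a secret should carry,
// given how many MachineClasses currently refer to it.
func syncSecretFinalizers(current []string, referringClasses int) []string {
	keep := referringClasses > 0
	out := make([]string, 0, len(current)+1)
	present := false
	for _, f := range current {
		if f == finalizerName {
			if !keep || present {
				continue // drop when unreferenced, and drop duplicates
			}
			present = true
		}
		out = append(out, f)
	}
	if keep && !present {
		out = append(out, finalizerName)
	}
	return out
}

func main() {
	// Referenced secret: the finalizer is added.
	fmt.Println(syncSecretFinalizers(nil, 2))
	// Unreferenced secret: the finalizer is removed, others are kept.
	fmt.Println(syncSecretFinalizers([]string{finalizerName, "other"}, 0))
}
```

Other finalizers on the secret are left untouched, so only the MCM-owned entry ever gates deletion.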
Reconcile Cluster Machine Class
- Fetch the `MachineClass`.
- Get all the `Machines` referring to the `MachineClass`.
- If there are machines that match the given `MachineClass` and deletion hasn’t been requested yet for the `Class`, add the finalizers to the class, after which all the machines referring to the class are enqueued for further processing. (TODO: Learn about workqueue)
- Otherwise, if deletion has been requested and there are still machines referring to the `Class`, add them to the `machineQueue`.
- If there aren’t any machines referring to the `Class`, then the finalizers, if any, on it can be deleted.
Reconcile Cluster Machine
- Fetch the required `Machine`.
- Check if the machine controller is frozen (`MachineControllerFrozen`) and the machine isn’t scheduled for deletion. If so, retry reconciliation after `longRetry`.
- If it isn’t frozen, then validate the `Machine` (Name and Kind for the Class must not be empty). Also validate the `MachineClass`’ `NodeTemplate` fields (capacity: cpu, gpu and memory; `instanceType`, `region` and `zone`).
- In case the machine is scheduled for deletion, TriggerDeletionFlow with a `DeleteMachineRequest`.
- If the `Machine` isn’t scheduled for deletion, then add the finalizers on the machine if not present.
- Check the Machine Phases State Diagram. Check if the machine has an associated node (look for the label with key “`node`”) and if the machine has been created (i.e. `phase` != “”; an empty phase corresponds to the machine being in the creation stage). If so, then:
  - Reconcile Machine Health: Updates the machine object with any change in node conditions or health. Ref
  - Sync Machine Name to Node: Syncs the machine name on the corresponding node object by adding the machine name label to its metadata.
  - Sync Machine NodeTemplates: Syncs the `nodeTemplate` between the machine and the node object by fetching it, then syncs the annotations, labels and taints.
- If the machine phase is `empty` or `crashloopbackoff`, or `providerId` is empty, i.e. the machine doesn’t exist, then TriggerCreationFlow with `CreateMachineRequest`.
TriggerDeletionFlow
Check out FAQ: Why is my machine deleted?
Check if finalizers are present on the machine; if not, retry after a while. In case they are present and the machine isn’t in the terminating phase, then set it to terminating (this also sets the LastOperation state to Processing and the description to GetVMStatus). If the machine is in the terminating phase, then check the LastOperation Description:
- If it contains the string `GetVMStatus`, i.e. “Set machine status to termination. Now, getting VM Status”, then `updateMachineStatus&NodeLabel` is called, which adds the “node” label if not present while checking for any errors in getting the machine status. After confirming that the node label is updated, the LastOperation `type` is set to `delete` and the LastOperation `description` to `InitiateDrain`.
- If it contains the string `InitiateDrain`, then call DrainNode.
- If it’s `DelVolumesAttachments`, then `deleteNodeVA` is invoked:
  - If the node is not found or if `node.VolumesAttached` is zero, then `description` is updated to `InitiateVMDeletion` and `state` is set to `Processing`.
  - If there are non-zero attached volumes, then `getLiveVAForNode` fetches all the volumes attached to our node that don’t have DeletionTimestamp set.
  - If there are live volumes, `deleteVAForNode` removes them, and when zero live volumes are attached `InitiateVMDeletion` is set as the `description`.
  - After this the machine status is updated with the corresponding state and description.
- When it’s `InitiateVMDeletion`, then `deleteVM` delegates the task to the driver.DeleteMachine method. If deletion is successful, then the description is set to `InitiateNodeDeletion` with the `processing` state. If there are any errors:
  - In case it’s `NotFound`, then the deletion flow is continued and `state` is set to `processing` with `description` as `InitiateNodeDeletion`.
  - If it’s `Unknown`, `DeadlineExceeded`, `Aborted` or `Unavailable`, or any other error code (default), or a failure in decoding the machineErr itself: the `state` is set as `failed` and `description` remains `InitiateVMDeletion` as deletion is retried.
  - After this the machine status is updated with the corresponding state and description.
- Whereas `InitiateNodeDeletion` calls upon `deleteNodeObject`. If it can’t find the associated node, it continues the deletion flow with `description` set to `InitiateFinalizerRemoval` and `state` set to `Processing`. Otherwise `Delete()` is called for the node, which on success moves to the `Processing` state and the `InitiateFinalizerRemoval` stage, but on error (any error other than object not found) switches to the `failed` state. After this the machine status is updated.
- At last, when we have `InitiateFinalizerRemoval`, `deleteMachineFinalizers` is invoked, which deletes `MCMFinalizerName` if present. If there are any errors while updating the machine with the new set of finalizers, then a re-queue is triggered after a ShortRetry.
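The stages above form a small state machine keyed on the LastOperation description. The sketch below models only the success path. The short stage strings are placeholders (the real descriptions are longer sentences, such as the `GetVMStatus` one quoted above), `nextStage` is a hypothetical helper, and the branch taken after `InitiateDrain` is described in the DrainNode section.

```go
package main

import "fmt"

// Stage names follow the LastOperation descriptions used in these notes.
// The string values are placeholders, not the real description texts.
const (
	GetVMStatus              = "GetVMStatus"
	InitiateDrain            = "InitiateDrain"
	DelVolumesAttachments    = "DelVolumesAttachments"
	InitiateVMDeletion       = "InitiateVMDeletion"
	InitiateNodeDeletion     = "InitiateNodeDeletion"
	InitiateFinalizerRemoval = "InitiateFinalizerRemoval"
)

// nextStage returns the stage that follows current when its step succeeds.
// forceDeletion only matters after a drain: a forced drain goes through
// DelVolumesAttachments, a normal drain goes straight to InitiateVMDeletion.
// The second result is false when current is the last stage or unknown.
func nextStage(current string, forceDeletion bool) (string, bool) {
	switch current {
	case GetVMStatus:
		return InitiateDrain, true
	case InitiateDrain:
		if forceDeletion {
			return DelVolumesAttachments, true
		}
		return InitiateVMDeletion, true
	case DelVolumesAttachments:
		return InitiateVMDeletion, true
	case InitiateVMDeletion:
		return InitiateNodeDeletion, true
	case InitiateNodeDeletion:
		return InitiateFinalizerRemoval, true
	default:
		return "", false
	}
}

func main() {
	// Walk the forced-deletion path from start to finish.
	stage := GetVMStatus
	for {
		fmt.Println(stage)
		next, ok := nextStage(stage, true)
		if !ok {
			break
		}
		stage = next
	}
}
```

On failure the real flow keeps the same description with a `failed` state, so the next reconcile retries the same stage.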
DrainNode
Check out FAQ: How is drain implemented?, and the flowchart.
- Check that the nodeName is valid; if not, set `skipDrain`.
- Check if the machine is in `Ready` or `ReadOnly` state.
- If the node condition is not empty and it’s `not Ready` or it’s `ReadOnly` and it’s been 5 minutes, then the `forceDeleteMachines` and `forceDeletePods` labels are set and `printLogInitError()` is called, i.e. `description` is set to `InitiateVMDeletion`.
- If skip isn’t set, then check if the timeout (MachineDrainTimeout) has occurred.
  - If the `force-deletion` label is present on the machine or if the timeout has happened, then the `forceDeleteMachines` and `forceDeletePods` labels are set and force deletion is triggered; otherwise normal deletion is triggered.
  - After that, `UpdateNodeTerminationCondition` is called, which checks that the `phase` is not `CrashLoopBackOff` or `empty`, i.e. the machine is not in the creation stage. Then we check if there are nodes that are already in `terminating` condition; if not, then if `phase` is `failed` the `terminationCondition` reason is set to `Unhealthy`, otherwise it’s set to `ScaleDown`.
  - If the termination condition update results in an error in a normal deletion, then `skipDrain` is set and `state` is set to `failed`.
  - If the termination condition update doesn’t result in an error, then RunDrain is invoked. If the drain is successful, `state` is set to `Processing`; for a normal deletion `description` is set to `InitiateVMDeletion` (since regular deletion already waits for volume detach/attach), whereas for a force deletion `description` is set to `DelVolumesAttachments` along with the `Processing` state. If `RunDrain` is unsuccessful, however, and it’s a normal deletion, then `description` is set to `InitiateDrain` with `failed` `state` for a retry.
  - To summarize the above, a forceful machine drain/delete is performed when any of the following holds:
    1. The `force-deletion: "True"` label is present.
    2. The deletion operation is more than drain-timeout minutes old.
    3. The last machine drain had failed.
- If `skipDrain` is set, then set the state to `Processing` and the LastOperation `type` to `delete`; `description` is set to `InitiateVMDeletion` by `printLogInitError()` and it’s re-queued.
RunDrain
- drainContext
- `RunCordonOrUncordon`: First we try to get the associated node; if that results in an error, the node doesn’t exist, i.e. deletion was triggered during the machine creation process, hence there’s nothing to do. If the node is there and its `Spec.Unschedulable` is true, then there’s nothing to do; otherwise `Unschedulable` is set to true. (Cordoning means sealing off/preventing access)
- `WaitForCacheSync` is processed for the pod informer cache to sync.
- `deleteOrEvictPodsSimple`:
  - Calls `getPodsForDeletion`, which lists all pods whose `pods.Spec.NodeName` matches our node (TODO: check podFilter) and, if there aren’t any errors, returns the list of pods.
  - Following which we `deleteOrEvictPods`: Here, we first check if the server supports eviction via `SupportEviction`. If `ForceDeletePods` is false, we call `evictPods`, which performs normal eviction of the pods: in parallel via `evictPodsWithoutPv`, and serially (sorted by Priority) via `evictPodsWithPv` for the ones having volumes attached, since it needs to wait for volume detachment. Both of these ways attempt to `evictPod` (if supported) `MaxEvictRetries` times, failing which they `deletePod`. If `forceDeletePods` is true, then forceful eviction of the pods happens and all pods are evicted in parallel by calling `evictPodsWithPv`.
  - In case there are errors while performing either evictPod or deletePod, it could be that the pod wasn’t evicted due to a PDB violation. Hence we get the PDB for the pod and check if it’s misconfigured by verifying that the `Generation` is the same as `Status.ObservedGeneration`, that `CurrentHealthy` pods are more than `ExpectedPods` (which should be more than 0), and that `DisruptionsAllowed` is zero. Together these indicate a misconfigured PDB.
  - If it’s not a case of misconfigured PDB, then we sleep for `PodEvictionRetryInterval` and then move to the next iteration of the eviction attempt. `TODO` `TerminationGracePeriod`, `waitForDelete`. References: Disruptions, Configuring PDB, API Eviction, evictPodsWoPVInternal
  - `TODO` Check evictPodsWithPVInternal
  - If there are any failures in `deleteOrEvictPods`, they are reported as pendingPods.
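The misconfigured-PDB check described above is a conjunction of a few status comparisons. The sketch below mirrors the wording of these notes. `pdbView` is a hypothetical stand-in for the real PodDisruptionBudget type, and the operators follow the notes literally, so the actual implementation may differ (for example by using `>=` where the notes say "more than").

```go
package main

import "fmt"

// pdbView holds only the PodDisruptionBudget fields the check reads.
// It is a hypothetical stand-in for the real policy API type.
type pdbView struct {
	Generation         int64
	ObservedGeneration int64
	CurrentHealthy     int32
	ExpectedPods       int32
	DisruptionsAllowed int32
}

// isMisconfigured follows the description in the notes: the status must be
// current, there must be expected pods, healthy pods must exceed expected
// pods, and yet no disruption is allowed.
func isMisconfigured(p pdbView) bool {
	if p.Generation != p.ObservedGeneration {
		return false // status is stale, so no verdict can be drawn yet
	}
	return p.ExpectedPods > 0 &&
		p.CurrentHealthy > p.ExpectedPods &&
		p.DisruptionsAllowed == 0
}

func main() {
	blocked := pdbView{Generation: 2, ObservedGeneration: 2, CurrentHealthy: 4, ExpectedPods: 3}
	fmt.Println("misconfigured:", isMisconfigured(blocked))
}
```

A PDB flagged this way will never permit an eviction, which is why the drain does not keep sleeping and retrying for it.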
TriggerCreationFlow
- Update the MachineRequest’s secret by processing the secret with (TODO) `addBootstrapTokenToUserData` and `addMachineNameToUserData`, which checks if `secret["userData"]` exists and then replaces the `MachineNamePlaceholder` `<MACHINE_NAME>` with the actual `machine.Name`.
- After that, the secret is used along with the Machine and MachineClass to create a GetMachineStatusRequest, which responds with the `providerId` and the associated `nodeName`.
- If there is no error with the GetMachineStatusRequest and the machine’s “node” label or `providerID` is empty, then the required VM is found and the node and providerID for the machine are updated. (updateLabels: adds the necessary labels, annotations with MachinePriority as 3 if unset and updates the providerID)
- If there is an error in getting the machine status, the ErrCode is `NotFound` or `Unimplemented`, and the “node” label is not present on the machine, then driver.CreateMachine() is invoked, which on success gives us the `providerId` and the `nodeName`. We then try to `Get()` the returned `nodeName` to check if a node object already exists. If so, it is treated as a stale node: `driver.DeleteMachine()` is triggered, the machine object is marked with `state` `failed` and `phase` `failed`, with the LastOperation `description` set as “VM using old node obj”, and then its status is updated. NOTE: Also, `uninitializedMachine` is set to true.
- If the returned codes are `Unknown`, `DeadlineExceeded`, `Aborted` or `Unavailable`, then again `machineStatusUpdate` is called with `state` set to `failed` and `phase` being `failed` or `crashloopbackoff` depending upon whether or not MachineCreationTimeout happened.
- Whereas when the returned code is `Uninitialized`, the flag `uninitializedMachine` is set to true. In any other scenario, a `machineStatusUpdate` call similar to the previous case is made with a re-queue call.
- In case the `uninitializedMachine` flag is set, `initializeMachine` is called to start the VM instance for the machine. It delegates the task to driver.InitializeMachine; if that returns the `Unimplemented` error code, VM initialization is skipped as the provider doesn’t support it, otherwise a `machineStatusUpdate` call is made with `failed` `state` and `crashloopbackoff` or `failed` `phase` based on MachineCreationTimeout. Upon successful initialization, an error is returned by `triggerCreationFlow` for another re-queue.
- If however the machine `phase` is `empty` (creation) or `crashloopbackoff`, then `state` is set as `Processing`, `phase` as `Pending` and an `UpdateStatus` call is made for the machine.
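The first step of the creation flow, substituting the machine name into the user data, can be sketched in a few lines. This is an illustration only: the function body is an assumption about how such a replacement might look, with the placeholder value `<MACHINE_NAME>` and the `userData` key taken from the notes above.

```go
package main

import (
	"fmt"
	"strings"
)

// machineNamePlaceholder is the token named in the notes.
const machineNamePlaceholder = "<MACHINE_NAME>"

// addMachineNameToUserData replaces the placeholder inside
// secretData["userData"] with the machine name. It is a no-op when the
// key is missing, so a nil map is safe to pass.
func addMachineNameToUserData(secretData map[string][]byte, machineName string) {
	userData, ok := secretData["userData"]
	if !ok {
		return
	}
	replaced := strings.ReplaceAll(string(userData), machineNamePlaceholder, machineName)
	secretData["userData"] = []byte(replaced)
}

func main() {
	secret := map[string][]byte{
		"userData": []byte("hostname: <MACHINE_NAME>\n"),
	}
	addMachineNameToUserData(secret, "worker-abc12")
	fmt.Print(string(secret["userData"]))
}
```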
Safety Controller
Reconcile Cluster Machine Safety Orphan VMs
It lists all the VMs in the cloud matching the tag of given cluster name and maps the VMs with the machine objects using the ProviderID field. VMs without any backing machine objects are logged and deleted after confirmation. It runs every MachineSafetyOrphanVMsPeriod duration (15 mins). It runs:
- `checkMachineClasses`: fetches all the MachineClasses and for each of them calls `checkMachineClass`:
  - It fetches their `secretData`, lists all the `Machines` referencing that class, and updates their cache via `WaitForCacheSync`.
  - Then it iterates over the MachineList returned by ListMachines (`<providerID, machineName>`), fetching each machine and ensuring that the providerID matches.
  - If it doesn’t and the `phase` is "" or `crashloopbackoff`, then it continues iterating; otherwise it creates a dummy machine object for `DeleteMachineRequest`.
- `AnnotateNodesUnmanagedByMCM`: fetches all the nodes on the target cluster, then iterates over them:
  - Calls `getMachineFromNode`, which returns `errMultipleMachineMatch` or `errNoMachineMatch` depending on the number of machines associated with the node. If there’s only one machine, then it returns the machine.
  - If we get `errNoMachineMatch`, then we compare `currentTime - nodeCreationTime` with MachineCreationTimeout. If the timeout has occurred and the node doesn’t have the `NotManagedByMCM` annotation present on it, then it means there’s no backing machine object for the node. So we annotate it with `NotManagedByMCM` and then update the node.
  - If it returns a machine and the node has the `NotManagedByMCM` annotation, then we remove the annotation and update the node.
Reconcile Cluster Machine Safety API Server
Freezes MachineDeployment and MachineSet controller if number of machine objects goes beyond a certain threshold on top of Spec.Replicas by applying a freeze label. It also freezes the functionality of MCM if target-apiserver or control-apiserver is not reachable and unfreezes once situation becomes normal.
- Checks if `MachineControllerFrozen` is true; if it is, then it checks `isApiServerUp`.
- `isApiServerUp` makes a dummy call to the control APIServer (`controlMachineClient`) and the target APIServer (`targetCoreClient`), and if either gives an error different from the isNotFound error, it returns false. If it returns true, then we unfreeze the machines by iterating over them: if their `phase` is `Unknown`, we get the machine from `controlMachineClient` and set the phase as `Running`, `state` as `successful` and `description` as “Machine Health Timeout was reset due to APIServer being unreachable”. After which we update the status and `enqueueMachineAfter` 30 seconds for reconcile, and set `MachineControllerFrozen` to false & `APIserverInactiveStartTime` to nil.
- If initially `MachineControllerFrozen` was false and `isApiServerUp` returned false, i.e. it’s not up, then we set `APIserverInactiveStartTime` to the current time if it’s unset and check if the time passed since the inactive time is more than `MachineSafetyAPIServerStatusCheckTimeout`; if so, we set `MachineControllerFrozen` to true and then re-enqueue the API Server check.
Machine Controller Manager
Handles reconciliation of MachineDeployment and MachineSet objects.
Reconcile Cluster Machine Set
- Fetch the `MachineSet`.
- `ValidateMachineSetSpec`: Ensures that `spec.Replicas` is greater than or equal to zero and `selector.matchLabels` matches `template.metadata.labels`.
- If DeletionTimestamp is not set, then `addMachineSetFinalizers`.
- LabelSelectorAsSelector converts the MachineSet selector to a selector object that’s used by the `claimMachines` function, which creates a `MachineControllerRefManager` tagged with a `CanAdopt` function that first fetches the `fresh`, i.e. latest, version of the machineSet and then verifies its UID against the existing `machineSet`’s UID to ensure that the machineSet isn’t being deleted. (This function is later used by `AdoptMachine`)
- `ClaimMachine` indirectly calls `ClaimObject` with `Match`, `Adopt` and `Release`:
  - `Match`: Checks if the selector matches, after which it runs a set of `filter` functions. On success, it returns true.
  - `AdoptMachine` (`Adopt`): Runs the previously created `CanAdopt` function to verify that the machine can be adopted by the machineSet; if so, then `PatchMachine` with the OwnerRef to the MachineSet and add it to `claimedMachines`.
  - `ReleaseMachine` (`Release`): Sends a `deleteOwnerRefPatch` to free the Machine from the controller. On patching, `NotFound` and `IsInvalid` errors are ignored, since it’s fine if the machine doesn’t exist, has no ownerRef, or the UID doesn’t match (which implies the machine was recreated after deletion).
- `ClaimObject`: It reconciles the following: adopt orphans if the match function returns true, and release owned objects if the match function returns false.
  - Get the `ownerRef` using `GetControllerOf`; if it’s not nil, check if ownerRef.UID is the same as the machineSet’s UID (the UID is a k8s system-generated string that uniquely identifies an object). If the DeletionTimestamp is not set (nil) and the selector `match` fails, then attempt a `release`.
  - In case the ownerRef was nil, check if the machineSet is not scheduled for deletion (`DeletionTimestamp` nil) and the `match` was successful. If so, then if the machine isn’t scheduled for deletion, attempt `Adopt`.
- After the list of claimedMachines is procured, we call `syncMachinesNodeTemplates`, `syncMachinesConfig` & `syncMachineClassKind`, which go through the list, check if a machine requires an update of the `NodeTemplate`, `MachineConfiguration` and `Class.Kind`, and copy them from the `machineSet` to the `machine` if needed.
- Then check if the machineSet is scheduled for deletion, i.e. `DeletionTimestamp` is not nil. If there aren’t any claimedMachines for this machineSet (len(filteredMachines) is 0), then remove the finalizer from the machineSet; otherwise trigger deletion of the claimedMachines via `terminateMachines`.
- In case the machineSet wasn’t scheduled for deletion, `manageReplicas` is called:
  - Gets the MetaNameSpaceKey for the machineSet and creates a list of activeMachines (`phase` != `failed` or `terminating`) and staleMachines (`phase` is `failed`) from the list of claimedMachines.
  - If there are staleMachines, then they are scheduled for deletion: `terminateMachines` is called.
  - Then we check the difference between the number of activeMachines and the number of `Replicas` specified by the machineSet.
  - If the activeMachines are fewer, then count (diff) is set to `min(BurstReplicas: 100, Replicas - len(activeMachines))` and <TODO ExpectCreations> `slowStartBatch` is called, which calls the function `CreateMachinesWithControllerRef` `count` times, starting slow with a group of `SlowStartInitialBatchSize` (1) to check for errors, then speeding up if calls succeed. Within each batch, it may call the function multiple times concurrently. If a whole batch succeeds, the next batch gets exponentially larger.
  - <TODO> skippedMachines CreationObserved
  - In case the activeMachines are more than the `Replica` count, then deletion must be triggered. `diff` is set to `min(BurstReplicas: 100, len(activeMachines) - Replicas)` and then the activeMachines are sorted in the order `NonReady, Ready, Unscheduled, Scheduled, Pending, Running`. <TODO: ExpectDeletions>. After getting the sorted list, the machines are deleted by calling `terminateMachines`.
- Then the MachineSet status is calculated and updated, following which reconcile is enqueued after 10 minutes.
Expectations
Reference, and implementation
Expectations are in-memory data structures that we use to register an action we performed during a reconciliation. They are needed since the Kubernetes client caches all resources locally and relies on apiserver watches to keep the cache up to date, and they are helpful when trying to control how many creations/deletions/upgrades happen in parallel. A controller isn’t synced until its expectations are either fulfilled or expire, and controllers that don’t set expectations will get woken up for every matching controllee.
There’s a SatisfiedExpectations method that returns true if the required adds/dels for the given controller have been observed. Add/del counts are established by the controller at sync time, and updated as controllees are observed by the controller manager. That is, if the resources in the local cache do not match what is expected, it’s safer to not move on with reconciliation, as that may lead to an unexpected number of creations/deletions.
For machineSet controller, the Satisfied method is inactive hence all the creation/deletion expectations that are modified are ineffective and can be ignored for now.
Reconcile Cluster Machine Deployment
- Fetch the deployment by getting the name from
SplitMetaNamespaceKey, then validate it by ensuringReplicasis not negative and thatselector.matchLabelsis consistent withtemplate.metadata.labels. - Add finalizers if DeletionTimestamp is nil, then check if
Selectoris empty i.e. the deployment is selecting all machines, if so record that event and setStatus.ObservedGenerationtoGenerationif it’s less and then update the status. - Then
getMachineSetsis called which uses a controller reference manager to handle ownership and internally callsClaimMachineSets(acts in a similar fashion toClaimMachinein Reconcile Cluster Machine Set by leveragingClaimObjecthowever it doesn’t run any filters) to get the machineSets owned by the deployment via adoption/orphaning with a similarCanAdoptfunction that fetches a fresh copy of the deployment object and compares UID to ensure that the deployment isn’t being deleted. getMachineMap: Creates a map from MachineSet UID to list of Machines it controls; used to verify machine labels and handle recreate deployments; it groups machines by their controller reference.- If DeletionTimestamp is set and deployment has
DeleteFinalizerNameand if there are no MachineSets referring to the deployment then remove the finalizers. If there are MachineSets owned by the deployment thenterminateMachineSetsis called after whichsyncStatusOnlycallsgetMSAndSyncRevisionthat lists the old and new (with updated revision) MachineSets and machines for the deployment and appliesmachine-template-hashto the adopted ones and then deployment status is synced. - Ref: Pausing a deployment, Deployment Status <TODO: Why?! are we doing this>
checkPausedConditionsfetches the condition withtypeasProgressing- If
cond.ReasonisTimedOutReason, return without overwriting with paused condition. - Check if
pausedCondExistsby seeing if cond.Reason isPausedMachineDeployReason. - If
Spec.Pausedis true andpausedCondExistsis false, then update the condition withReasonasPausedMachineDeployReasonandmessageas “Deployment is paused”. - Otherwise if
Spec.Pausedis false andpausedCondExistsis true, then update the condition withReasonasResumedMachineDeployReasonandmessageas “Deployment is resumed”. - Update the deployment status.
- If
- Then we check if
Spec.Pausedis true, if so it implies that Scaling was detected for deployment which is paused, sosyncis called (which has the responsibility of reconciling deployments on scaling/when paused)- First
getMSAndSyncRevisionis called to get list of new and old machineSets. - Then Scale() method is invoked, following which
spec.Pausedis checked, if deployment is paused and there’s no rollback in progress i.e.spec.RollbackTois nil, then deploymentcleanupis called for. The cleanup filters the old machineSet to get ones which are currently active (i.e. DeletionTimestamp is not set) and then deletes (note that if any delete fails, it stops the cleanup process since the failure could be due to api server overload) all butRevisionHistoryLimitnumber of machineSets (in this process it checks if the filtered machineSets have non-zero replica counts and avoids deleting them). - Otherwise the new machineSets, if any are added to set of all machineSets and their status is synced by calling
sync MachineDeployment Status.
- First
- If the deployment wasn’t paused and `Spec.RollbackTo` is non-nil, i.e. there exists a revision that the deployment is rolling back to, the `rollback()` method is invoked:
  - First, `getMSAndSyncRevision` is called to get the list of new and old machineSets.
  - Ref: k8s rollbacks. Then we check the rollback revision (`RollbackTo.Revision`). If it’s zero, we find the `LastRevision()`, i.e. the second-highest revision number, by iterating over all machineSets. If that is also zero, we were unable to find the last revision and we give up the rollback (update Deployment and Clear Rollback To is called).
  - Otherwise we proceed with the rollback: iterating over the list of machineSets, we get the `Revision` for each machineSet and, if it matches `RollbackTo.Revision`, the `PreferNoSchedule` taint is removed from the nodes backing that machineSet.
  - After taint removal, the `rollbackToTemplate()` method is called, which compares the templates (hash) of the deployment with the matching machineSet. In case they’re different, it updates the template of the deployment (Set From Machine Set Template) and sets the annotations of the deployment to those of the rolled-back machineSet revision (Set Machine Deployment Annotations To).
- If it wasn’t a rollback event either, then we check if it’s a `ScalingEvent` by looking at the activeMachineSets (found by filtering for MachineSets with Replicas more than 0). For scaling events `sync` is called (check the Paused discussion above to see what it does):
  - In case there are no activeMachineSets and the `Replicas` on the deployment is more than 0, it’s a scale-from-zero scenario, so it’s considered a scalingEvent.
  - Otherwise, when there are activeMachineSets with a `desired-replicas` annotation not equal to the deployment replicas, it’s also a scalingEvent.
  - If no such machineSets exist, i.e. all the active ones have the same number of desired-replicas as the deployment replicas, then it’s not a scaling event.
- In case none of the scenarios were true i.e. not paused, not rolling back, and not scaling, then we execute the deployment strategy which is either Recreate or Rolling.
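The scaling-event check from the list above can be sketched roughly as follows. This is a minimal illustration and not MCM’s actual code: the `machineSet` struct and the `isScalingEvent` helper are hypothetical, and the `desired-replicas` annotation is modelled as a plain integer field.

```go
package main

import "fmt"

// machineSet is a hypothetical, pared-down stand-in for an MCM MachineSet.
type machineSet struct {
	Name            string
	Replicas        int32
	DesiredReplicas int32 // stands in for the desired-replicas annotation
}

// isScalingEvent mirrors the three cases described in the notes.
func isScalingEvent(deploymentReplicas int32, sets []machineSet) bool {
	// Active machineSets are the ones with more than 0 replicas.
	var active []machineSet
	for _, ms := range sets {
		if ms.Replicas > 0 {
			active = append(active, ms)
		}
	}
	// Scale from zero: nothing is active but the deployment wants machines.
	if len(active) == 0 {
		return deploymentReplicas > 0
	}
	// Any active machineSet whose recorded desired count differs is a scaling event.
	for _, ms := range active {
		if ms.DesiredReplicas != deploymentReplicas {
			return true
		}
	}
	return false
}

func main() {
	sets := []machineSet{{Name: "ms-a", Replicas: 3, DesiredReplicas: 3}}
	fmt.Println(isScalingEvent(3, sets)) // false
	fmt.Println(isScalingEvent(5, sets)) // true
	fmt.Println(isScalingEvent(2, nil))  // true
}
```

Only when this check, the paused check and the rollback check all come back negative does the rollout strategy itself run.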
Scale
Recreate
- First, `getMSAndSyncRevision` is called to get the list of new and old machineSets. The old MS are filtered to find the active ones, i.e. those having non-zero replicas.
- Create annotations to disable CA performing scale-downs during rollout and apply them to all machineSets if the `autoscaler ScaleDown Annotation During Rollout` bool is true and there are old MS that haven’t been scaled to zero yet.
- Then the active old MS are scaled down:
  - If the replica count for the MS is already zero, then scaling is not required.
  - Otherwise the MS has to be scaled down. It’s annotated with `desiredReplicas` being the deployment Replicas and `maxReplicas` being `maxSurge` more than that. `maxSurge` is 0 for non-rolling updates.
  - In case scaling was required, after adding the annotations, the old MS is updated.
- If a scale-down occurred or if there are old MS having non-zero replicas, then `Sync RolloutStatus` is invoked:
  - TODO calculateDeploymentStatus
  - Then we get all the MD conditions having `type` as `Progressing` and check if the current deployment is complete or not by:
    - Ensuring `Replicas` and `UpdatedReplicas` are the same.
    - Checking that there is a MD condition of type `Progressing` having `Reason` as “New Machine Set Available”.
  - If it’s not a complete deployment, i.e. the rollout hasn’t finished, then progress estimation happens:
    - In case the deployment is completed (UpdatedReplicas, AvailableReplicas and Replicas are the same and the new observed generation is more than or equal to the MD generation), i.e. all desired replicas are updated and available and no old machines are running, then the condition is updated to `Progressing` with “New Machine Set Available” (the condition talked about in the previous point).
    - If it’s in the progressing state, then the progress is estimated by comparing the current and new status of the deployment. Specifically, it counts as progress when new machines are scaled up or become ready/available, or old machines are scaled down. In this case, the `Reason` is set to “Machine Set Updated”.
    - If it’s in the timed-out state, then the “Progressing” condition is set to false and the reason is updated to “Progress Deadline Exceeded”.
  - Then we fetch all the failure conditions of the `type` “Replica Failure” for the MS and convert them to deployment conditions. In case we don’t find any, we remove the “Replica Failure” condition from the deployment if present.
  - If there aren’t any status changes to be done on the deployment, then it’s re-queued; otherwise the status is updated.
- If there is a need to create a new MS, i.e. the new MS from the initial `getMSAndSyncRevision` is nil, then the function is called again with the creation flag set to true, which creates the new MS.
- The new MS is scaled up to the deployment’s `Replicas` count and the annotations for `desiredReplicas` and `maxReplicas` are added.
- Again the progress is estimated and, if the deployment is completed, the autoscaler annotations are removed and the deployment is cleaned up. At the end `Sync RolloutStatus` is called again.
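The progress estimation used in the status sync above can be sketched as below. This is an illustrative approximation under stated assumptions: the `status` struct, the helper names and the returned reason strings (taken verbatim from these notes) are hypothetical stand-ins, not MCM’s real types or constants.

```go
package main

import "fmt"

// status is a hypothetical, trimmed-down view of a MachineDeployment status.
type status struct {
	Replicas           int32
	UpdatedReplicas    int32
	ReadyReplicas      int32
	AvailableReplicas  int32
	ObservedGeneration int64
}

// isComplete: all desired replicas are updated and available, no old
// machines remain, and the latest generation has been observed.
func isComplete(desired int32, generation int64, s status) bool {
	return s.UpdatedReplicas == desired &&
		s.Replicas == desired &&
		s.AvailableReplicas == desired &&
		s.ObservedGeneration >= generation
}

// isProgressing: new machines were scaled up or became ready/available,
// or old machines were scaled down, between two observations.
func isProgressing(prev, cur status) bool {
	prevOld := prev.Replicas - prev.UpdatedReplicas
	curOld := cur.Replicas - cur.UpdatedReplicas
	return cur.UpdatedReplicas > prev.UpdatedReplicas ||
		curOld < prevOld ||
		cur.ReadyReplicas > prev.ReadyReplicas ||
		cur.AvailableReplicas > prev.AvailableReplicas
}

// progressReason returns the reason for the Progressing condition and
// whether that condition should be true. ok is false if no case applies.
func progressReason(desired int32, generation int64, prev, cur status, timedOut bool) (reason string, progressing, ok bool) {
	switch {
	case isComplete(desired, generation, cur):
		return "New Machine Set Available", true, true
	case isProgressing(prev, cur):
		return "Machine Set Updated", true, true
	case timedOut:
		return "Progress Deadline Exceeded", false, true
	}
	return "", false, false
}

func main() {
	prev := status{Replicas: 3, UpdatedReplicas: 1, ReadyReplicas: 3, AvailableReplicas: 3, ObservedGeneration: 2}
	cur := status{Replicas: 3, UpdatedReplicas: 2, ReadyReplicas: 3, AvailableReplicas: 3, ObservedGeneration: 2}
	reason, _, _ := progressReason(3, 2, prev, cur, false)
	fmt.Println(reason) // Machine Set Updated
}
```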
Rolling
- First, `getMSAndSyncRevision` is called to get the list of new and old machineSets. Create annotations to disable CA performing scale-downs during rollout and apply them to all machineSets if the `autoscaler ScaleDown Annotation During Rollout` bool is true and there are old MS that haven’t been scaled to zero yet.
- Also, nodes backing all the old MS have the “PreferNoSchedule” taint added to them to avoid scheduling on them. Then `reconcileNewMS` is called to try to scale up, if possible:
  - Check if newMS.Replicas is the same as the deployment’s Replicas, which means there’s no scaling needed.
  - If newMS.Replicas is more, then a scale-down needs to happen along with an update of the `desiredReplicas` and `maxReplicas` annotations. In the case of rolling updates, `surge` is not simply 0; it is calculated via `ResolveFenceposts` as \(surge = \lceil\frac{maxSurge * Replicas}{100}\rceil\) when `maxSurge` is passed as a % value, otherwise it’s taken as is. The `maxReplicas` value is annotated as `Replicas + surge`.
  - If it’s neither, then a scale-up needs to happen. `NewISNewReplicas` is used to get the target Replicas count: first `surge` is computed in a similar fashion as above, and then `maxReplicas` is compared with the currently active MS replica count. If the current count is more, scale-up is not possible and the target count is returned as the same as the newMS replica count. Otherwise the scale-up amount is \(scaleUp = \min(max - current, dep.Replicas - newMS.Replicas)\) and `scaleUp + newIS.Replicas` is returned as the target replicas count.
  - If any scaling happens, then `Sync RolloutStatus` is called to update the deployment status.
- After this, an attempt is made to scale down the active oldMS. First check if their total replica count is 0, in which case no further scale-down is possible. Otherwise we attempt the scale-down:
  - Compute the minAvailable needed number of replicas, `deployment.Replicas - unavailable`, where `unavailable` is calculated via `ResolveFenceposts` as \(unavail = \lfloor\frac{maxUnavailable * Replicas}{100}\rfloor\) when `maxUnavailable` is passed as a % value, otherwise it’s taken as is.
  - Compute the newMS replicas that are still unavailable, `newMS.Replicas - newMS.AvailableReplicas`.
  - By removing (minAvailable + newMSUnavailable) from the total active replicas, we get the number of replicas that can be cleaned up without any issues (maxScaledDown).
  - An attempt is made to clean up those replicas from the oldMS, where first the oldMS are sorted in the order `notready, ready, unscheduled, scheduled, pending, running` so that unhealthy replicas are cleaned up first, to not increase unavailability:
    - Iterating over the sorted oldMS, we first check if the targetMS’s replica count is 0 or equal to its availableReplicas, in which case it can’t be downscaled or has no unhealthy replicas, so we skip its cleanup.
    - Otherwise compute `scaledownCount` for the targetMS as \(\min(maxCleanupCount - totalScaledDown, Replicas - AvailableReplicas)\), where `totalScaledDown` keeps incrementing by `scaleDownCount` as we iterate over all oldMS.
    - We then compute `newReplicacount` as `Replicas - scaledownCount` and attempt to scale down the targetMS to it. <TODO: Rewrite this>
  - After attempting the cleanup of old unhealthy MS, we then attempt a scale-down by first checking if it’s possible: we compute the `minAvailable` needed replica count and `totalAvailable`, and if the total is less than the min we can’t scale down. Otherwise we compute `totalScaledDownCount` (aka target) as `total - minavailable` replica count, sort the oldMS similar to how we did in the cleanup process above, and start iterating over them:
    - Check if scaling is done by comparing the `totalScaledDown` number with the target, in which case we stop.
    - If the MS’s replica count is 0, then scaling isn’t possible for it, so skip it.
    - Otherwise a scale-down is attempted with `scaleDownCount` being \(\min(Replicas, target - totalScaledDown)\) and `newReplicacount` as `Replicas - scaleDownCount`, after which `totalScaledDown` is updated with `scaleDownCount`.
  - If any scaling happens, then `sync RolloutStatus` is invoked later for updating the status.
- At the end, the progress is estimated and, if the deployment is completed, the autoscaler annotations are removed and the deployment is cleaned up. At the end `Sync RolloutStatus` is called again.
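The arithmetic of a rolling update described above can be worked through with a small sketch. It is a simplification under stated assumptions: `intOrPercent`, `resolve`, `newMSTarget` and `maxScaledDown` are hypothetical helpers, not the real `ResolveFenceposts` or `NewISNewReplicas`, and any special-casing the real functions perform (for example when both values resolve to zero) is left out.

```go
package main

import "fmt"

// intOrPercent is a hypothetical simplification of an int-or-percentage value.
type intOrPercent struct {
	Value     int
	IsPercent bool
}

// resolve turns a percentage into an absolute count, rounding up for
// surge and down for unavailable; absolute values are taken as is.
func resolve(v intOrPercent, replicas int, roundUp bool) int {
	if !v.IsPercent {
		return v.Value
	}
	scaled := v.Value * replicas
	if roundUp {
		return (scaled + 99) / 100 // integer ceiling
	}
	return scaled / 100 // integer floor
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// newMSTarget covers the scale-up branch: the new MS may grow until the
// total active replica count reaches depReplicas + surge.
func newMSTarget(depReplicas, newMSReplicas, totalActive, surge int) int {
	maxTotal := depReplicas + surge
	if totalActive >= maxTotal {
		return newMSReplicas // no room to scale up
	}
	scaleUp := minInt(maxTotal-totalActive, depReplicas-newMSReplicas)
	return newMSReplicas + scaleUp
}

// maxScaledDown is how many old replicas can go without dropping below
// minAvailable, accounting for new replicas that are not yet available.
func maxScaledDown(depReplicas, totalActive, newMSReplicas, newMSAvailable, unavailable int) int {
	minAvailable := depReplicas - unavailable
	newMSUnavailable := newMSReplicas - newMSAvailable
	return totalActive - minAvailable - newMSUnavailable
}

func main() {
	replicas := 10
	surge := resolve(intOrPercent{25, true}, replicas, true)        // ceil(2.5)
	unavailable := resolve(intOrPercent{25, true}, replicas, false) // floor(2.5)
	target := newMSTarget(replicas, 0, 10, surge)
	down := maxScaledDown(replicas, 10+target, target, 0, unavailable)
	fmt.Println(surge, unavailable, target, down) // 3 2 3 2
}
```

With 10 replicas and 25% for both settings, the new MS can grow to 3 first, after which 2 old replicas can be removed while the 3 new ones are still coming up.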
TODO Reconcile Machine Safety Overshooting
Extension
All methods in Driver interface take a Request and provide a Response.
type Driver interface {
    CreateMachine(context.Context, *CreateMachineRequest) (*CreateMachineResponse, error)
    InitializeMachine(context.Context, *InitializeMachineRequest) (*InitializeMachineResponse, error)
    DeleteMachine(context.Context, *DeleteMachineRequest) (*DeleteMachineResponse, error)
    GetMachineStatus(context.Context, *GetMachineStatusRequest) (*GetMachineStatusResponse, error)
    ListMachines(context.Context, *ListMachinesRequest) (*ListMachinesResponse, error)
    GetVolumeIDs(context.Context, *GetVolumeIDsRequest) (*GetVolumeIDsResponse, error)
}
The Machine Error Codes document goes a bit more in-depth into the workings of the various driver interface methods.
All methods follow a similar pattern at the start. They first check that the MachineClass.provider in the request is the correct one; otherwise InvalidArgument is returned as the error code. Then the request’s MachineClass and Secret are used to extract the providerSpec, and validation of the spec happens during the extraction process. If there’s an issue while unmarshalling the rawSpec, an Internal error is returned, and if there’s an issue with validation, InvalidArgument is returned. The other common part is the creation of a service client, which is handled by the createSVC method (it takes in the region and secret).
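The shared preamble could look roughly like the sketch below. Everything here is a hypothetical stand-in chosen to keep the example self-contained: the `code` and `statusError` types replace the real status/codes packages, `machineClass` and `providerSpec` are trimmed-down structs, and the validation rule (AMI and region must be set) is an assumption made only for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// code and statusError are hypothetical stand-ins for the real error codes.
type code string

const (
	invalidArgument code = "InvalidArgument"
	internal        code = "Internal"
)

type statusError struct {
	Code code
	Msg  string
}

func (e *statusError) Error() string { return string(e.Code) + ": " + e.Msg }

// machineClass and providerSpec are simplified, hypothetical shapes.
type machineClass struct {
	Provider     string
	ProviderSpec []byte // raw JSON spec
}

type providerSpec struct {
	AMI    string `json:"ami"`
	Region string `json:"region"`
}

const providerAWS = "AWS"

// decodeProviderSpec follows the order from the notes: provider check,
// unmarshal (Internal on failure), then validation (InvalidArgument).
func decodeProviderSpec(mc machineClass) (*providerSpec, error) {
	if mc.Provider != providerAWS {
		return nil, &statusError{invalidArgument, "unsupported provider " + mc.Provider}
	}
	var spec providerSpec
	if err := json.Unmarshal(mc.ProviderSpec, &spec); err != nil {
		return nil, &statusError{internal, err.Error()}
	}
	if spec.AMI == "" || spec.Region == "" {
		return nil, &statusError{invalidArgument, "ami and region are required"}
	}
	return &spec, nil
}

// codeOf extracts the code from an error produced above.
func codeOf(err error) code {
	var se *statusError
	if errors.As(err, &se) {
		return se.Code
	}
	return ""
}

func main() {
	_, err := decodeProviderSpec(machineClass{Provider: "GCP"})
	fmt.Println(codeOf(err)) // InvalidArgument
	_, err = decodeProviderSpec(machineClass{Provider: providerAWS, ProviderSpec: []byte("{")})
	fmt.Println(codeOf(err)) // Internal
}
```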
Note that during the invocation of some methods, Record Driver API Metric function runs which captures the duration of the operation and records any failures. Coming to the specifics of the methods:
CreateMachine
- The Secret’s `userData` is checked for existence and is encoded if found (later used in `RunInstancesInput`).
- Validate the given AMI (Amazon Machine Image): verify that the image exists and is valid using a DescribeImages `POST` request.
- Block devices are the storage volumes attached to an `Instance`, and `BlockDeviceMappings` define the EBS volumes and instance store volumes to attach to an instance at launch. So, using the list of devices from providerSpec and the `RootDeviceName` from the image, block device mappings are generated for storage.
- Then tags are generated for `instance`, `volume` and `NetworkInterface`. These tags are useful as they act as identifiers which map resources to the clusters they belong to, and for management (cleanup, security) etc. Failure in tag generation results in an `Internal` error code being returned.
- Afterwards, `NetworkInterfaceSpec` is defined for all the interfaces by iterating over the providerSpec NetworkInterfaces, and `IamInstanceProfile` is set up by going through the IAM name and ARN. After that, instance metadata options are initialized.
- Following all this, the `RunInstancesInput` config is initialized using all the above fetched/initialized fields. Some additional fields such as `Placement`, `Capacity Reservation Target`, `CPU Options` and `Spot Price` are further set up before the RunInstances call, which launches the EC2 instance.
- Then we wait for the VM with `ProviderID` for machine `Name` to be visible to AWS endpoints by checking using `confirmInstance` calls. Once that succeeds, the MachineResponse consisting of the `ProviderID` and `Machine Name` is returned.
InitializeMachine
The InitializeMachine method should only be invoked as a post-VM-creation initialization step, e.g. to set up network configuration.
- First we decode the spec, create a service request and then fetch the running `instances` for the requested `Machine` by leveraging the `tags` populated earlier. In case searching via tags fails, we fall back to using the `instanceID` from `providerID` for fetching the VM instances. Failure returns `codes.Uninitialized` as the error code.
- After getting the instance, we disable `SrcDstCheck` for NAT instances if specified as false, by targeting the `instanceID`. If this operation fails, `Uninitialized` is returned as the error code.
- Then we iterate over the `Network Interfaces` and assign IPv6 prefixes to them if configured. Failure returns `codes.Uninitialized` as the error code. If everything is successful, we return `Response` with providerID and node name.
DeleteMachine
We do the provider check, decode the spec and create the service client. Afterwards we check if the requested Machine has providerID populated. If so, we extract the instanceID from it and invoke terminateInstance. If providerID is not populated, then we get the instances by tags and terminate them.
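The branch between the two lookup paths might be sketched like this. It is only an illustration: both helper names are hypothetical, and the providerID layout `aws:///<region>/<instance-id>` is an assumption that these notes do not confirm.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// instanceIDFromProviderID takes the last path segment of the providerID.
// Assumed format (not confirmed by the notes): aws:///<region>/<instance-id>.
func instanceIDFromProviderID(providerID string) (string, error) {
	parts := strings.Split(providerID, "/")
	id := parts[len(parts)-1]
	if len(parts) < 2 || id == "" {
		return "", errors.New("unable to decode provider ID " + providerID)
	}
	return id, nil
}

// instancesToTerminate prefers the providerID and only falls back to the
// tag-based lookup (passed in as a function) when providerID is empty.
func instancesToTerminate(providerID string, findByTags func() []string) ([]string, error) {
	if providerID == "" {
		return findByTags(), nil
	}
	id, err := instanceIDFromProviderID(providerID)
	if err != nil {
		return nil, err
	}
	return []string{id}, nil
}

func main() {
	byTags := func() []string { return []string{"i-0aaa", "i-0bbb"} }
	ids, _ := instancesToTerminate("aws:///eu-west-1/i-0123456789abcdef0", byTags)
	fmt.Println(ids) // [i-0123456789abcdef0]
	ids, _ = instancesToTerminate("", byTags)
	fmt.Println(ids) // [i-0aaa i-0bbb]
}
```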
GetMachineStatus
Again we start with checking the provider, decoding the spec and creating the service client. Then we get the matching instances by tags, falling back to the instanceID if the lookup by tags fails. If multiple VM instances are returned for the machine object, it is flagged as an OutOfRange error and returned.
Otherwise we check if the SourceDestCheck on the instance matches the one on providerSpec, in case they don’t match it returns Uninitialized else we return the response with NodeName and ProviderID.
ListMachines
Do the usual provider check, spec decoding and service creation, following which we iterate over the tags to check if they contain the clusterName tag or the nodeRole tag. These are then used to create “tag-key” filters along with an “instance-state-name” filter (with pending, running, stopping & stopped values). These filters are used to fetch the relevant VM instances for the machineClass. After getting the instances, a map of providerID to machineName is created and returned as the response.
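The filter construction described here can be sketched as follows. The `filter` struct is a hypothetical stand-in for the AWS SDK’s filter type, and the two tag prefixes are assumptions about what the clusterName and nodeRole tags look like; only the filter names and state values come from the notes.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filter is a hypothetical stand-in for the AWS SDK filter type.
type filter struct {
	Name   string
	Values []string
}

// Assumed tag key prefixes for the clusterName and nodeRole tags.
const (
	clusterTagPrefix = "kubernetes.io/cluster/"
	roleTagPrefix    = "kubernetes.io/role/"
)

// buildFilters creates one "tag-key" filter per matching tag, plus a
// single "instance-state-name" filter for the non-terminated states.
func buildFilters(tags map[string]string) []filter {
	var keys []string
	for k := range tags {
		if strings.Contains(k, clusterTagPrefix) || strings.Contains(k, roleTagPrefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys) // deterministic order for the example
	filters := make([]filter, 0, len(keys)+1)
	for _, k := range keys {
		filters = append(filters, filter{Name: "tag-key", Values: []string{k}})
	}
	filters = append(filters, filter{
		Name:   "instance-state-name",
		Values: []string{"pending", "running", "stopping", "stopped"},
	})
	return filters
}

func main() {
	tags := map[string]string{
		"kubernetes.io/cluster/shoot--foo": "1",
		"kubernetes.io/role/node":          "1",
		"Name":                             "worker",
	}
	for _, f := range buildFilters(tags) {
		fmt.Println(f.Name, f.Values)
	}
}
```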
GetVolumeIDs
Extracts EBS volume IDs from PersistentVolume specification by handling both in-tree AWS EBS volumes and CSI driver volumes and returns list of volume IDs as response. (Invoked indirectly by evictPodsWithPv)
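A rough sketch of handling both volume sources is shown below. The struct types are hypothetical miniatures of the Kubernetes PersistentVolume spec, the CSI driver name `ebs.csi.aws.com` is an assumption, and so is the idea that an in-tree volume ID may carry a `aws://<zone>/` prefix that needs stripping.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical miniatures of the PersistentVolume source types.
type awsEBSSource struct{ VolumeID string }

type csiSource struct {
	Driver       string
	VolumeHandle string
}

type pvSpec struct {
	AWSElasticBlockStore *awsEBSSource
	CSI                  *csiSource
}

// ebsCSIDriver is the assumed name of the EBS CSI driver.
const ebsCSIDriver = "ebs.csi.aws.com"

// lastSegment strips an assumed "aws://<zone>/" style prefix, if any.
func lastSegment(s string) string {
	if i := strings.LastIndex(s, "/"); i >= 0 {
		return s[i+1:]
	}
	return s
}

// volumeIDs collects EBS volume IDs from in-tree and CSI volume sources
// and ignores volumes backed by anything else.
func volumeIDs(specs []pvSpec) []string {
	var ids []string
	for _, s := range specs {
		switch {
		case s.AWSElasticBlockStore != nil:
			ids = append(ids, lastSegment(s.AWSElasticBlockStore.VolumeID))
		case s.CSI != nil && s.CSI.Driver == ebsCSIDriver:
			ids = append(ids, s.CSI.VolumeHandle)
		}
	}
	return ids
}

func main() {
	specs := []pvSpec{
		{AWSElasticBlockStore: &awsEBSSource{VolumeID: "aws://eu-west-1a/vol-1"}},
		{CSI: &csiSource{Driver: ebsCSIDriver, VolumeHandle: "vol-2"}},
		{CSI: &csiSource{Driver: "other.csi.example.com", VolumeHandle: "x"}},
	}
	fmt.Println(volumeIDs(specs)) // [vol-1 vol-2]
}
```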
Queries
- How is MCM different from CCM? Is there any overlap between the two?
- How much access/info does the end user have about machines? How much can they control them or is the access restricted to editing nodes only?
- Check control and target clusters, inClusterConfig. Control cluster is the one where the machine-* objects are stored. Target cluster is where all the node objects are registered.
- When node deletion checks are happening, how is the `ReadOnlyFileSystem` flag checked for?
- Why is this required in two places? FAQ: API Change
- Is local setup guide enough to get started or is there something else to look out for as well?
- In the `ClaimMachines` function, we’re not passing the filters, what’s the point of having them then? In what scenarios will they be needed?
- How is the `CanAdopt()` function working?
- What is this controllerRef for?
- What is the `KeyFunc()` doing in manageReplicas: cache.MetaNamespaceKeyFunc
- What is this ControllerExpectations, why is it needed? Handle race conditions. Expectations are supposed to be used to await previous controller actions to be seen in watch. ContExpectations is a cache mapping controllers to what they expect to see before being woken up for a sync.
- What is this `FakeDriver` business?
- In a lot of functions, we clone the resource, then work with the cloned copy, why is that necessary?
- Why is bootstrap token deletion required when machine creation goes through?
- What is this permit that’s used when in the `Unknown` phase and the machine times out?
- Why is internalExternalScheme conversion needed?