MCM
First, go through the README; later on, check the local setup guide as well.
Components
Check CRDs present in the MCM tree
All the below class diagrams omit TypeMeta and ObjectMeta for brevity.
Machine
```mermaid
---
title: Machine
config:
  class:
    hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
    %% note "apiVersion, kind & metadata components have not been shown"
    %% Machine ---------------------------------
    note for Machine "Representation of a physical or virtual machine that corresponds to a front-end kubernetes node object"
    Machine <|-- MachineSpec
    Machine <|-- MachineStatus
    class Machine{
        MachineSpec spec
        MachineStatus status
    }
    %%--
    MachineSpec <|-- ClassSpec
    MachineSpec <|-- MachineConfiguration
    MachineSpec <|-- NodeTemplateSpec
    class MachineSpec{
        ClassSpec class
        string providerID
        NodeTemplateSpec nodeTemplate
        MachineConfiguration MachineConfiguration
    }
    class ClassSpec{
        string apiGroup
        string kind
        string name
    }
    note for ClassSpec "Refers to the MachineClass template for the Machine"
    class NodeTemplateSpec{
        string podCIDR: IP range
        []string podCIDRs
        string providerID
        bool unschedulable
        []taint taints
    }
    note for NodeTemplateSpec "Actually part of NodeSpec which is a field in NodeTemplateSpec"
    class MachineConfiguration{
        v1.Duration drainTimeout
        v1.Duration healthTimeout
        v1.Duration creationTimeout
        int32* maxEvictRetries
        string* nodeConditions
    }
    %%--
    MachineStatus <|-- LastOperation
    MachineStatus <|-- CurrentStatus
    class MachineStatus{
        v1.NodeCondition conditions
        LastOperation lastOperation
        CurrentStatus currentStatus
        string lastKnownState
    }
    class LastOperation{
        string description
        string errorCode
        v1.Time lastUpdateTime
        MachineState state: string
        MachineOperationType type: string
    }
    note for LastOperation "State can be 'Processing', 'Failed' or 'Successful'<br>Type can be 'Create', 'Update', 'HealthCheck' or 'Delete'"
    class CurrentStatus{
        MachinePhase phase: string
        bool timeoutActive
        v1.Time lastUpdateTime
    }
    link MachineStatus "https://kubernetes.io/docs/reference/node/node-status/#condition"
```
Machine Class
```mermaid
---
title: MachineClass
config:
  class:
    hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
    %% MachineClass ---------------------------------
    note for MachineClass "Represents a template that contains cloud provider specific details used to create machines"
    MachineClass <|-- NodeTemplate
    class MachineClass{
        NodeTemplate nodeTemplate
        v1.SecretReference credentialsSecretRef
        RawExtension providerSpec
        string provider
        v1.SecretReference secretRef
    }
    note for MachineClass "credentialsSecretRef has cloud provider credentials<br>secretRef has cloud init script: userData containing bootstrap token, API server address, kubeconfig"
    note for MachineClass "provider specific configuration to create a node is specified in providerSpec of the extensible custom type rawExtension which permits instances of different structure types to be held within a single type"
    class NodeTemplate{
        v1.ResourceList capacity
        string instanceType
        string region
        string zone
        string* architecture
    }
```
Machine Set
```mermaid
---
title: MachineSet
config:
  class:
    hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
    %% MachineSet ---------------------------------
    note for MachineSet "Ensures that the specified number of Machine replicas are running at any time"
    MachineSet <|-- MachineSetSpec
    MachineSet <|-- MachineSetStatus
    class MachineSet{
        MachineSetSpec spec
        MachineSetStatus status
    }
    MachineSetSpec <|-- ClassSpec
    MachineSetSpec <|-- MachineTemplateSpec
    class MachineSetSpec{
        int32 replicas
        v1.LabelSelector selector
        ClassSpec machineClass
        MachineTemplateSpec template
        int32 minReadySeconds
    }
    MachineTemplateSpec <|-- MachineSpec
    class MachineTemplateSpec{
        MachineSpec spec
    }
    MachineSetStatus <|-- MachineSetCondition
    MachineSetStatus <|-- LastOperation
    MachineSetStatus <|-- MachineSummary
    class MachineSetStatus{
        int32 replicas
        int32 fullyLabeledReplicas
        int32 readyReplicas
        int32 availableReplicas
        int64 observedGeneration
        []MachineSetCondition machineSetCondition
        LastOperation lastOperation
        MachineSummary failedMachines
    }
    MachineSummary <|-- LastOperation
    class MachineSummary{
        string name
        string providerID
        LastOperation lastOperation
        string ownerRef
    }
    class MachineSetCondition{
        MachineSetConditionType type: string
        ConditionStatus status: string
        v1.Time lastTransitionTime
        string reason
        string message
    }
    note for LastOperation "One has information about lastOp for Machine (summary) and the other is for MachineSet"
```
Machine Deployment
```mermaid
---
title: MachineDeployment
config:
  class:
    hideEmptyMembersBox: true
---
%%{init: { 'themeVariables': { 'lineColor': '#51565b'}}}%%
classDiagram
    %% Deployment ---------------------------------
    note for MachineDeployment "Provides a declarative update for MachineSet and Machines"
    MachineDeployment <|-- MachineDeploymentSpec
    MachineDeployment <|-- MachineDeploymentStatus
    class MachineDeployment{
        MachineDeploymentSpec spec
        MachineDeploymentStatus status
    }
    %%--
    MachineDeploymentSpec <|-- MachineTemplateSpec
    MachineDeploymentSpec <|-- MachineDeploymentStrategy
    class MachineDeploymentSpec{
        int32 replicas
        v1.LabelSelector selector
        MachineTemplateSpec template
        MachineDeploymentStrategy strategy
        int32 minReadySeconds
        int32* revisionHistoryLimit
        bool paused
        RollbackConfig* rollbackTo
        int32* progressDeadlineSeconds
    }
    MachineDeploymentStrategy <|-- RollingUpdateMachineDeployment
    class MachineDeploymentStrategy{
        MachineDeploymentStrategyType type: string
        RollingUpdateMachineDeployment rollingUpdate
    }
    class RollingUpdateMachineDeployment{
        IntOrString MaxUnavailable
        IntOrString MaxSurge
    }
    %%--
    MachineDeploymentStatus <|-- MachineDeploymentCondition
    MachineDeploymentStatus <|-- MachineSummary
    class MachineDeploymentStatus{
        int64 observedGeneration
        int32 replicas
        int32 updatedReplicas
        int32 readyReplicas
        int32 availableReplicas
        int32 unavailableReplicas
        []MachineDeploymentCondition condition
        int32* collisionCount
        MachineSummary failedMachines
    }
    class MachineDeploymentCondition{
        MachineDeploymentConditionType type: string
        ConditionStatus status: string
        v1.Time lastUpdateTime
        v1.Time lastTransitionTime
        string reason
        string message
    }
```
Miscellaneous
These are added for reference purposes only; there is no need to delve too deeply into them.
```mermaid
---
title: SafetyOptions & PodDisruptionBudget
config:
  class:
    hideEmptyMembersBox: true
---
classDiagram
    class SafetyOptions{
        metav1.Duration MachineCreationTimeout
        metav1.Duration MachineHealthTimeout
        metav1.Duration MachineDrainTimeout
        metav1.Duration PvDetachTimeout
        metav1.Duration PvReattachTimeout
        metav1.Duration MachineSafetyAPIServerStatusCheckTimeout
        metav1.Duration MachineSafetyAPIServerStatusCheckPeriod
        metav1.Duration MachineSafetyOrphanVMsPeriod
        time.Time APIserverInactiveStartTime
        bool MachineControllerFrozen
        int32 MaxEvictRetries
    }
    PodDisruptionBudget <|-- PodDisruptionBudgetSpec
    PodDisruptionBudget <|-- PodDisruptionBudgetStatus
    class PodDisruptionBudget{
        PodDisruptionBudgetSpec spec
        PodDisruptionBudgetStatus status
    }
    class PodDisruptionBudgetSpec{
        IntOrString* minAvailable
        v1.LabelSelector* selector
        IntOrString* maxUnavailable
    }
    class PodDisruptionBudgetStatus{
        int64 observedGeneration
        map[string]v1.Time disruptedPods
        int32 disruptionsAllowed
        int32 currentHealthy
        int32 desiredHealthy
        int32 expectedPods
        []v1.Condition conditions
    }
```
```mermaid
---
title: VolumeAttachmentHandler
config:
  class:
    hideEmptyMembersBox: true
---
classDiagram
    VolumeAttachmentHandler <|-- VolumeAttachment
    note for VolumeAttachmentHandler "Distribute incoming VolumeAttachment requests to all listening workers"
    class VolumeAttachmentHandler{
        sync.Mutex
        []chan storagev1.VolumeAttachment* workers
        dispatch()
        AddWorker()
        DeleteWorker()
    }
    VolumeAttachment <|-- VolumeAttachmentSpec
    VolumeAttachment <|-- VolumeAttachmentStatus
    note for VolumeAttachment "Captures the intent to attach/detach specified volume to/from the specified node"
    class VolumeAttachment{
        VolumeAttachmentSpec spec
        VolumeAttachmentStatus status
    }
    VolumeAttachmentSpec <|-- VolumeAttachmentSource
    class VolumeAttachmentSpec{
        string attacher
        VolumeAttachmentSource source
        string nodeName
    }
    VolumeAttachmentStatus <|-- VolumeError
    class VolumeAttachmentStatus{
        bool attached
        map[string]string attachmentMetadata
        VolumeError* attachError
        VolumeError* detachError
    }
    class VolumeAttachmentSource{
        string* persistentVolumeName
        v1.PersistentVolumeSpec* inlineVolumeSpec
    }
    class VolumeError{
        metav1.Time time
        string message
    }
```
Machine Controller
Reconcile Cluster Secret
- Fetch secrets.
- Get the MachineClass referring to the passed secret.
- In case there are machine classes that refer to it, add the finalizer to the secret, else remove it. (The removal of the finalizer is needed because if a secret isn't referred to by any MachineClass, then its deletion shouldn't depend on the finalizer. Ref) (See the sketch below.)
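The decision above boils down to "keep the finalizer only while at least one MachineClass still references the secret". Below is a minimal sketch of that idea with simplified local types; the helper names and the finalizer value are illustrative, not the actual MCM identifiers.

```go
// Sketch of the secret-finalizer decision with simplified local types.
package main

import "fmt"

const mcFinalizer = "machine.sapcloud.io/machine-controller-manager" // illustrative value

type Secret struct {
	Name       string
	Finalizers []string
}

type MachineClass struct {
	Name      string
	SecretRef string // name of the referenced secret
}

// reconcileSecret keeps the finalizer on the secret only while some MachineClass refers to it.
func reconcileSecret(s *Secret, classes []MachineClass) {
	referenced := false
	for _, mc := range classes {
		if mc.SecretRef == s.Name {
			referenced = true
			break
		}
	}
	if referenced {
		addFinalizer(s, mcFinalizer)
	} else {
		// no MachineClass refers to the secret, so its deletion
		// should not be blocked by our finalizer
		removeFinalizer(s, mcFinalizer)
	}
}

func addFinalizer(s *Secret, f string) {
	for _, existing := range s.Finalizers {
		if existing == f {
			return
		}
	}
	s.Finalizers = append(s.Finalizers, f)
}

func removeFinalizer(s *Secret, f string) {
	kept := s.Finalizers[:0]
	for _, existing := range s.Finalizers {
		if existing != f {
			kept = append(kept, existing)
		}
	}
	s.Finalizers = kept
}

func main() {
	secret := &Secret{Name: "provider-credentials"}
	classes := []MachineClass{{Name: "mc-1", SecretRef: "provider-credentials"}}
	reconcileSecret(secret, classes)
	fmt.Println(secret.Finalizers) // finalizer kept while mc-1 still refers to the secret
}
```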
Reconcile Cluster Machine Class
- Fetch the MachineClass.
- Get all the Machines referring to the MachineClass.
- If there are machines that match the given MachineClass and deletion hasn't been requested yet for the class, then add the finalizers to the class, after which all the machines referring to the class are enqueued for further processing. (TODO: Learn about workqueue)
- Otherwise, if deletion is scheduled and there are still machines referring to the class, add them to the machineQueue.
- If there aren't any machines referring to the class, then the finalizers, if any, on it can be deleted.
Reconcile Cluster Machine
- Fetch the required Machine.
- Check if the machine controller is frozen (MachineControllerFrozen) and the machine isn't scheduled for deletion. If so, retry reconciliation after longRetry.
- If the machine isn't frozen, then validate the Machine (the Name and Kind of the Class shouldn't be empty). Also validate the MachineClass' NodeTemplate fields (capacity: cpu, gpu and memory; instanceType, region and zone).
- In case the machine is scheduled for deletion, TriggerDeletionFlow with a DeleteMachineRequest.
- If the Machine isn't scheduled for deletion, then add the finalizers to the machine if not present.
- Check Machine Phases State Diagram
- Check if the machine has an associated node (look for the label with key "node") and if the machine has been created (i.e. phase != ""; an empty phase corresponds to the machine being in the creation stage). If so, then:
  - Reconcile Machine Health: updates the machine object with any change in node conditions or health. Ref
  - Sync Machine Name to Node: syncs the machine name onto the corresponding node object by adding a machine-name label to its metadata.
  - Sync Machine NodeTemplates: syncs the nodeTemplate between the machine and the node object by fetching it and then syncing the annotations, labels and taints.
- If the machine phase is empty or crashloopbackoff, or providerId is empty (i.e. the machine doesn't exist), then TriggerCreationFlow with a CreateMachineRequest. (The overall decision is sketched below.)
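The bullets above reduce to a small decision tree. A minimal sketch, using a simplified Machine type whose field names are stand-ins for the real API:

```go
// Sketch of the top-level decision in the machine reconcile; simplified types only.
package main

import "fmt"

type Machine struct {
	Deleting   bool              // stand-in for metadata.deletionTimestamp being set
	Labels     map[string]string // the "node" label links the machine to its node object
	Phase      string            // "" while the machine is still being created
	ProviderID string
}

func reconcile(m *Machine) string {
	if m.Deleting {
		return "triggerDeletionFlow (DeleteMachineRequest)"
	}
	if _, hasNode := m.Labels["node"]; hasNode && m.Phase != "" {
		// machine exists and has a backing node: keep health, name label and nodeTemplate in sync
		return "reconcileMachineHealth + syncMachineNameToNode + syncNodeTemplates"
	}
	if m.Phase == "" || m.Phase == "CrashLoopBackOff" || m.ProviderID == "" {
		return "triggerCreationFlow (CreateMachineRequest)"
	}
	return "requeue"
}

func main() {
	fmt.Println(reconcile(&Machine{}))                                                          // creation flow
	fmt.Println(reconcile(&Machine{Phase: "Running", Labels: map[string]string{"node": "n1"}})) // sync path
}
```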
TriggerDeletionFlow
Check out FAQ: Why is my machine deleted?
Check if finalizers are present on the machine; if not, retry after a while. In case they are present and the machine isn't in the terminating phase, set it to terminating (this also sets the LastOperation state to Processing and the description to GetVMStatus). If the machine is in the terminating phase, then check the LastOperation Description (a sketch of this dispatch follows the list below):
- If it contains the string GetVMStatus, i.e. "Set machine status to termination. Now, getting VM Status", then updateMachineStatus&NodeLabel is called, which adds the "node" label if not present while checking for any errors in getting the machine's status. After confirming that the node label is updated, the LastOperation type is set to delete and the LastOperation description to InitiateDrain.
- If it contains the string InitiateDrain, then call DrainNode.
- If it's DelVolumesAttachments, then deleteNodeVA is invoked:
  - If the node is not found or node.VolumesAttached is zero, then the description is updated to InitiateVMDeletion and the state is set to Processing.
  - If there are non-zero attached volumes, then getLiveVAForNode fetches all the volumes attached to our node that don't have a DeletionTimestamp set.
  - If there are live volumes, deleteVAForNode removes them, and once zero live volumes are attached, InitiateVMDeletion is set as the description.
  - After this the machine status is updated with the corresponding state and description.
- When it's InitiateVMDeletion, deleteVM delegates the task to the driver.DeleteMachine method. If deletion is successful, the description is set to InitiateNodeDeletion with Processing state. If there are any errors:
  - In case it's NotFound, the deletion flow is continued: the state is set to Processing with the description as InitiateNodeDeletion.
  - If it's Unknown, DeadlineExceeded, Aborted or Unavailable, any other error code (default), or decoding the machineErr itself fails: the state is set to failed and the description remains InitiateVMDeletion, as deletion is retried.
  - After this the machine status is updated with the corresponding state and description.
- InitiateNodeDeletion calls upon deleteNodeObject which, if it can't find the associated node, continues the deletion flow with the description set to InitiateFinalizerRemoval and the state as Processing. Otherwise Delete() is called for the node, which on success moves to the Processing state and the InitiateFinalizerRemoval stage, but on any error other than "object not found" switches to the failed state. After this the machine status is updated.
- At last, when we have InitiateFinalizerRemoval, deleteMachineFinalizers is invoked, which deletes MCMFinalizerName if present; if there are any errors while updating the machine with the new set of finalizers, a re-queue is triggered after a ShortRetry.
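A sketch of the stage dispatch described above: the next step is chosen by matching substrings of LastOperation.Description. The stage strings follow the list; the dispatch function itself is illustrative, not the MCM implementation.

```go
// Sketch of moving a terminating machine through its deletion stages.
package main

import (
	"fmt"
	"strings"
)

func nextDeletionStep(lastOpDescription string) string {
	switch {
	case strings.Contains(lastOpDescription, "GetVMStatus"):
		return "updateMachineStatus&NodeLabel, then set description to InitiateDrain"
	case strings.Contains(lastOpDescription, "InitiateDrain"):
		return "drainNode"
	case strings.Contains(lastOpDescription, "DelVolumesAttachments"):
		return "delete live VolumeAttachments, then set description to InitiateVMDeletion"
	case strings.Contains(lastOpDescription, "InitiateVMDeletion"):
		return "driver.DeleteMachine, then set description to InitiateNodeDeletion"
	case strings.Contains(lastOpDescription, "InitiateNodeDeletion"):
		return "delete the node object, then set description to InitiateFinalizerRemoval"
	case strings.Contains(lastOpDescription, "InitiateFinalizerRemoval"):
		return "deleteMachineFinalizers"
	default:
		return "requeue"
	}
}

func main() {
	fmt.Println(nextDeletionStep("Set machine status to termination. Now, getting VM Status"))
}
```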
DrainNode
Check out the FAQ: How is drain implemented? and the flowchart.
- Check that the nodeName is valid; if not, set skipDrain.
- Check if the machine is in Ready or ReadOnly state.
- If the node condition is not empty and it's not Ready, or it's ReadOnly and it's been 5 minutes, then the forceDeleteMachines and forceDeletePods labels are set and printLogInitError() is called, i.e. the description is set to InitiateVMDeletion.
- If skip isn't set, then check if the drain timeout (MachineDrainTimeout) has occurred.
  - If the force-deletion label is present on the machine or the timeout has happened, then the forceDeleteMachines and forceDeletePods labels are set and force deletion is triggered; otherwise normal deletion is triggered.
  - After that, UpdateNode TerminationCondition is called, which checks that the phase is not CrashLoopBackOff or empty, i.e. the machine is not in the creation stage. Then we check whether the node is already in a terminating condition; if not, and the phase is failed, the terminationCondition reason is set to Unhealthy, otherwise it's set to ScaleDown.
  - If the termination-condition update results in an error during a normal deletion, then skipDrain is set and the state is set to failed.
  - If the termination-condition update doesn't result in an error, then RunDrain is invoked. If the drain is successful, the state is set to Processing; for a normal deletion the description is set to InitiateVMDeletion (since regular deletion already waits for volume detach/attach), whereas for force deletion the description is set to DelVolumesAttachments along with the Processing state. If RunDrain is unsuccessful and it's a normal deletion, the description is set to InitiateDrain with failed state for a retry.
  - To summarize the above, a forceful machine drain/delete is performed when either: 1. the force-deletion: "True" label is present, 2. the deletion operation is more than drain-timeout minutes old, or 3. the last machine drain failed. (See the sketch after this list.)
- If skipDrain is set, then the state is set to Processing, the LastOperation type to delete and the description to InitiateVMDeletion by printLogInitError(), and it's re-queued.
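The summary above can be read as a single predicate. A minimal sketch, assuming a simplified input struct whose field names are illustrative stand-ins for the machine labels and SafetyOptions:

```go
// Sketch of the "force delete or normal drain?" decision summarized above.
package main

import (
	"fmt"
	"time"
)

type drainInput struct {
	forceDeletionLabel bool          // machine carries the force-deletion: "True" label
	deletionStarted    time.Time     // when the delete operation began
	drainTimeout       time.Duration // MachineDrainTimeout
	lastDrainFailed    bool
}

func useForcefulDrain(in drainInput, now time.Time) bool {
	timedOut := now.Sub(in.deletionStarted) > in.drainTimeout
	return in.forceDeletionLabel || timedOut || in.lastDrainFailed
}

func main() {
	in := drainInput{
		deletionStarted: time.Now().Add(-3 * time.Hour),
		drainTimeout:    2 * time.Hour,
	}
	fmt.Println(useForcefulDrain(in, time.Now())) // true: deletion is older than the drain timeout
}
```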
RunDrain
- drainContext
- RunCordonOrUncordon: first we try to get the associated node; if that errors, the node doesn't exist, i.e. deletion was triggered during the machine creation process, hence there's nothing to do. If the node is there and its Spec.Unschedulable is already true, there's nothing to do; otherwise Unschedulable is set to true. (Cordoning means sealing off/preventing access.)
- WaitForCacheSync is processed for the pod informer cache to sync.
- deleteOrEvictPodsSimple
  - Calls getPodsForDeletion, which lists all pods with Spec.NodeName matching our node (TODO: check podFilter) and, if there aren't any errors, returns the list of pods.
  - Following which we deleteOrEvictPods: here we first check if the server supports eviction via SupportEviction, and if ForceDeletePods is false we call evictPods, which performs normal eviction of the pods: evictPodsWithoutPv in parallel, and evictPodsWithPv serially (sorted by priority) for pods with attached volumes, since it needs to wait for volume detachment. Both paths attempt evictPod, if supported, up to MaxEvictRetries times, failing which they deletePod (see the eviction sketch after this list). If forceDeletePods is true, then forceful eviction of the pods happens and all pods are evicted in parallel by calling evictPodsWithPv.
  - In case there are errors while performing either evictPod or deletePod, it could be that the pod wasn't evicted due to a PDB violation, hence we get the PDB for the pod and check if it's misconfigured by verifying that the Generation is the same as Status.ObservedGeneration, that CurrentHealthy pods are more than ExpectedPods (which should be more than 0), and that DisruptionsAllowed is zero. All of these together indicate a misconfigured PDB.
  - If it's not a case of a misconfigured PDB, then we sleep for PodEvictionRetryInterval and move to the next iteration of the eviction attempt. TODO: TerminationGracePeriod, waitForDelete. References: Disruptions, Configuring PDB, API Eviction, evictPodsWoPVInternal
- TODO: check evictPodsWithPVInternal
- If there are any failures in deleteOrEvictPods, they are reported as pendingPods.
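A sketch of the per-pod eviction behaviour described above: try eviction up to MaxEvictRetries times, sleeping between attempts, then fall back to deletion. evictPod/deletePod are stand-ins passed as function values, and the retry interval is shortened for the example.

```go
// Sketch of "evict with retries, then delete" for a single pod.
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxEvictRetries          = 10
	podEvictionRetryInterval = 20 * time.Millisecond // shortened for the example
)

func evictOrDeletePod(pod string, evictPod func(string) error, deletePod func(string) error) error {
	for i := 0; i < maxEvictRetries; i++ {
		if err := evictPod(pod); err == nil {
			return nil
		}
		// e.g. eviction blocked by a (possibly misconfigured) PodDisruptionBudget
		time.Sleep(podEvictionRetryInterval)
	}
	// eviction kept failing: fall back to a plain delete
	return deletePod(pod)
}

func main() {
	evict := func(string) error { return errors.New("eviction blocked by PDB") }
	del := func(p string) error { fmt.Println("deleted", p); return nil }
	_ = evictOrDeletePod("web-0", evict, del)
}
```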
TriggerCreationFlow
- Update the MachineRequest's secret by processing it with (TODO) addBootstrapTokenToUserData and addMachineNameToUserData, which checks if secret["userData"] exists and then replaces the MachineNamePlaceholder <MACHINE_NAME> with the actual machine.Name. (See the sketch after this list.)
- After that, the secret is used along with the Machine and MachineClass to create a GetMachineStatusRequest, which responds with the providerId and the associated nodeName.
- If there is no error with the GetMachineStatusRequest and the machine's "node" label or providerID is empty, then the required VM is found and the node and providerID for the machine are updated (updateLabels: adds the necessary labels and annotations, with MachinePriority set to 3 if unset, and updates the providerID).
- If there is an error in getting the machine status, the ErrCode is NotFound or Unimplemented and the "node" label is not present on the machine, then driver.CreateMachine() is invoked, which on success gives us the providerId and the nodeName. We then try to Get() (TODO: nodeName != machineName significance?) the returned nodeName to check if a node object already exists. If so, it is treated as a stale node, driver.DeleteMachine() is triggered and the machine object is marked with state failed and phase failed, with the LastOperation description set to "VM using old node obj", and then its status is updated. NOTE: uninitializedMachine is also set to true.
- If the returned codes are Unknown, DeadlineExceeded, Aborted or Unavailable, then machineStatusUpdate is again called with state set to failed and phase being failed or crashloopbackoff, depending on whether or not the MachineCreationTimeout has happened.
- Whereas, when the returned code is Uninitialized, the uninitializedMachine flag is set to true. In any other scenario, a machineStatusUpdate call similar to the previous case is made along with a re-queue.
- In case the uninitializedMachine flag is set, initializeMachine is called to start the VM instance for the machine; it delegates the task to driver.InitializeMachine, which, if it returns the Unimplemented error code, means VM initialization is skipped as the provider doesn't support it; otherwise a machineStatusUpdate call is made with failed state and crashloopbackoff or failed phase based on the MachineCreationTimeout. Upon successful initialization, an error is returned by triggerCreationFlow for another re-queue.
- If, however, the machine phase is empty (creation) or crashloopbackoff, then the state is set to Processing, the phase to Pending and an UpdateStatus call is made for the machine.
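A sketch of the userData preprocessing mentioned in the first bullet: the <MACHINE_NAME> placeholder is replaced with the actual machine name before the secret is handed to the driver. The placeholder string follows the text above; the helper itself is illustrative.

```go
// Sketch of replacing the machine-name placeholder in the secret's userData.
package main

import (
	"bytes"
	"fmt"
)

const machineNamePlaceholder = "<MACHINE_NAME>"

func addMachineNameToUserData(secretData map[string][]byte, machineName string) error {
	userData, ok := secretData["userData"]
	if !ok {
		return fmt.Errorf("secret has no userData key")
	}
	secretData["userData"] = bytes.ReplaceAll(userData, []byte(machineNamePlaceholder), []byte(machineName))
	return nil
}

func main() {
	secret := map[string][]byte{"userData": []byte("hostname: <MACHINE_NAME>")}
	_ = addMachineNameToUserData(secret, "shoot-worker-z1-abc12")
	fmt.Println(string(secret["userData"]))
}
```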
Safety Controller
Reconcile Cluster Machine Safety Orphan VMs
It lists all the VMs in the cloud matching the tag of the given cluster name and maps the VMs to the machine objects using the ProviderID field. VMs without any backing machine objects are logged and deleted after confirmation. It runs every MachineSafetyOrphanVMsPeriod duration (15 mins). It runs:
- checkMachineClasses: fetches all the MachineClasses and for each of them calls checkMachineClass:
  - It fetches their secretData, lists all the Machines referencing that class, and updates their cache via WaitForCacheSync.
  - Then it iterates over the MachineList returned by ListMachines: <providerID, machineName>, fetching each machine and ensuring that the providerID matches.
  - If it doesn't and the phase is "" or crashloopbackoff, then it continues iterating; otherwise it creates a dummy machine object for a DeleteMachineRequest. (The orphan check is sketched after this list.)
- AnnotateNodesUnmanagedByMCM: fetches all the nodes on the target cluster and then iterates over them:
  - Calls getMachineFromNode, which returns errMultipleMachineMatch or errNoMachineMatch depending on the number of machines associated with the node. If there's only one machine, then it returns the machine.
  - If we get errNoMachineMatch, then we compare currentTime - nodeCreationTime with MachineCreationTimeout; if the timeout has occurred and the node doesn't have the NotManagedByMCM annotation present on it, it means there's no backing machine object for the node, so we annotate it with NotManagedByMCM and then update the node.
  - If it returns a machine that has the NotManagedByMCM annotation, then we remove the annotation and update the node.
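A sketch of the orphan-VM check described above: VMs listed from the cloud (keyed by providerID) that have no backing Machine object are the candidates for deletion. Types are simplified stand-ins for the real objects.

```go
// Sketch of detecting VMs without a backing Machine object.
package main

import "fmt"

type Machine struct {
	Name       string
	ProviderID string
}

// findOrphanVMs returns providerIDs of listed VMs that no Machine object points to.
func findOrphanVMs(listedVMs map[string]string /* providerID -> VM name */, machines []Machine) []string {
	backed := make(map[string]bool, len(machines))
	for _, m := range machines {
		backed[m.ProviderID] = true
	}
	var orphans []string
	for providerID := range listedVMs {
		if !backed[providerID] {
			orphans = append(orphans, providerID) // would be logged and deleted after confirmation
		}
	}
	return orphans
}

func main() {
	vms := map[string]string{"aws:///eu-west-1/i-0abc": "node-1", "aws:///eu-west-1/i-0def": "node-2"}
	machines := []Machine{{Name: "m-1", ProviderID: "aws:///eu-west-1/i-0abc"}}
	fmt.Println(findOrphanVMs(vms, machines)) // only i-0def is orphaned
}
```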
Reconcile Cluster Machine Safety API Server
Freezes the MachineDeployment and MachineSet controllers if the number of machine objects goes beyond a certain threshold on top of Spec.Replicas, by applying a freeze label. It also freezes the functionality of MCM if the target-apiserver or control-apiserver is not reachable, and unfreezes once the situation becomes normal.
- Checks if MachineControllerFrozen is true; if it is, then it checks isApiServerUp.
- isApiServerUp makes a dummy call to the control APIServer (controlMachineClient) and the target APIServer (targetCoreClient), and if either gives an error other than an IsNotFound error, it returns false. If it returns true, then we unfreeze: iterating over the machines, if their phase is Unknown, we get the machine from controlMachineClient and set the phase to Running, the state to successful and the description to "Machine Health Timeout was reset due to APIServer being unreachable". After this we update the status, enqueueMachineAfter 30 seconds for reconcile, and set MachineControllerFrozen to false and APIserverInactiveStartTime to nil.
- If initially MachineControllerFrozen was false and isApiServerUp returned false, i.e. it's not up, then we set APIserverInactiveStartTime to the current time if it's unset and check if the time passed since the inactive time is more than MachineSafety APIServer StatusCheckTimeout; if so, we set MachineControllerFrozen to true and then re-enqueue the API server check. (See the sketch after this list.)
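A sketch of the freeze/unfreeze decision described above, assuming simplified stand-ins for the SafetyOptions fields and for the result of isApiServerUp:

```go
// Sketch of freezing the machine controller when the API servers stay unreachable.
package main

import (
	"fmt"
	"time"
)

type safety struct {
	frozen                 bool
	apiServerInactiveStart *time.Time
	statusCheckTimeout     time.Duration // stand-in for MachineSafetyAPIServerStatusCheckTimeout
}

func reconcileAPIServerSafety(s *safety, apiServerUp bool, now time.Time) {
	switch {
	case s.frozen && apiServerUp:
		// both API servers are reachable again: unfreeze and reset the inactivity timer
		s.frozen = false
		s.apiServerInactiveStart = nil
	case !s.frozen && !apiServerUp:
		if s.apiServerInactiveStart == nil {
			s.apiServerInactiveStart = &now
		}
		if now.Sub(*s.apiServerInactiveStart) > s.statusCheckTimeout {
			s.frozen = true // stop health-driven machine operations until the API server is back
		}
	}
}

func main() {
	s := &safety{statusCheckTimeout: 30 * time.Minute}
	past := time.Now().Add(-time.Hour)
	s.apiServerInactiveStart = &past
	reconcileAPIServerSafety(s, false, time.Now())
	fmt.Println(s.frozen) // true: unreachable for longer than the check timeout
}
```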
Machine Controller Manager
Handles reconciliation of MachineDeployment
and MachineSet
objects.
Reconcile Cluster Machine Set
- Fetch the MachineSet.
- ValidateMachineSetSpec: ensures that spec.Replicas is greater than or equal to zero and that selector.matchLabels matches template.metadata.labels.
- If the DeletionTimestamp is not set, then addMachineSetFinalizers.
- LabelSelectorAsSelector converts the MachineSet selector to a selector object that's used by the claimMachines function, which creates a *Machine Controller Ref Manager* tagged with a CanAdopt function that first fetches fresh, i.e. the latest version of the machineSet, and then verifies its UID against the existing machineSet's UID to ensure that the machineSet isn't being deleted. (This function is later used by AdoptMachine.)
- ClaimMachines indirectly calls ClaimObject with Match, Adopt and Release:
  - Match: checks if the selector matches, after which it runs a set of filter functions. On success, it returns true.
  - AdoptMachine (Adopt): runs the previously created CanAdopt function to verify that the machine can be adopted by the machineSet; if so, PatchMachine with the OwnerRef to the MachineSet and add it to claimedMachines.
  - ReleaseMachine (Release): sends a deleteOwnerRefPatch to free the Machine from the controller. On patching, NotFound and IsInvalid errors are ignored, since if the machine doesn't exist, has no ownerRef, or its UID doesn't match (which implies the machine was recreated after deletion), then it's fine.
- ClaimObject: it will reconcile the following: adopt orphans if the match function returns true; release owned objects if the match function returns false.
  - Get the ownerRef using GetControllerOf; if it's not nil, check if ownerRef.UID is the same as the machineSet's UID. (A UID is a k8s system-generated string to uniquely identify an object.) If the DeletionTimestamp is not set (nil) and the selector match fails, then attempt a release.
  - In case the ownerRef was nil, check that the machineSet is not scheduled for deletion (DeletionTimestamp nil) and that the match was successful. If so, then if the machine isn't scheduled for deletion, attempt Adopt.
- After the list of claimedMachines is procured, we call syncMachinesNodeTemplates, syncMachinesConfig & syncMachineClassKind, which go through the list, check whether a machine requires an update of the NodeTemplate, MachineConfiguration and Class.Kind, and copy them from the machineSet to the machine if needed.
- Then check if the machineSet is scheduled for deletion, i.e. DeletionTimestamp is not nil. If there aren't any claimedMachines for this machineSet (len(filteredMachines) is 0), remove the finalizer from the machineSet; otherwise trigger deletion of the claimedMachines via terminateMachines.
- In case the machineSet wasn't scheduled for deletion, manageReplicas is called:
  - Gets the MetaNamespaceKey for the machineSet and creates a list of activeMachines (phase != failed or terminating) and staleMachines (phase is failed) from the list of claimedMachines.
  - If there are staleMachines, then they are scheduled for deletion: terminateMachines is called.
  - Then we check the difference between the number of activeMachines and the number of Replicas specified by the machineSet.
  - If the activeMachines are fewer, then the count (diff) is set to min(BurstReplicas: 100, Replicas - len(activeMachines)) and <TODO ExpectCreations> slowStartBatch is called, which calls a function CreateMachines With ControllerRef count times, starting slow with a group of SlowStartInitialBatchSize (1) to check for errors, then speeding up if calls succeed. Within each batch, it may call the function multiple times concurrently. If a whole batch succeeds, the next batch gets exponentially larger. (See the slowStartBatch sketch after this list.) <TODO> skippedMachines CreationObserved
  - In case the activeMachines are more than the Replica count, then deletion must be triggered. diff is set to min(BurstReplicas: 100, len(activeMachines) - Replicas), and then the activeMachines are sorted in the order NotReady, Ready, Unscheduled, Scheduled, Pending, Running. <TODO: ExpectDeletions> After getting the sorted list, the machines are deleted by calling terminateMachines.
- Then the MachineSet status is calculated and updated, following which reconcile is enqueued after 10 minutes.
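A sketch of the slowStartBatch behaviour described above: batches start at SlowStartInitialBatchSize and double after each fully successful batch, stopping early once a batch reports errors. The implementation below is illustrative rather than the actual MCM helper (it relies on Go 1.21's built-in min).

```go
// Sketch of slow-start batched machine creation.
package main

import (
	"fmt"
	"sync"
)

func slowStartBatch(count, initialBatchSize int, fn func() error) (successes int) {
	remaining := count
	for batch := min(initialBatchSize, remaining); batch > 0; batch = min(2*batch, remaining) {
		var wg sync.WaitGroup
		errCh := make(chan error, batch)
		for i := 0; i < batch; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		successes += batch - len(errCh)
		if len(errCh) > 0 {
			return successes // stop speeding up once a batch fails
		}
		remaining -= batch
	}
	return successes
}

func main() {
	created := slowStartBatch(7, 1, func() error { fmt.Println("create machine"); return nil })
	fmt.Println("created:", created) // batches of 1, 2, 4
}
```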
Reconcile Cluster Machine Deployment
- Fetch the deployment by getting the name from SplitMetaNamespaceKey, then validate it by ensuring Replicas is not negative and that selector.matchLabels is consistent with template.metadata.labels.
- Add finalizers if the DeletionTimestamp is nil, then check if the Selector is empty, i.e. the deployment is selecting all machines. If so, record that event, set Status.ObservedGeneration to Generation if it's less, and then update the status.
- Then getMachineSets is called, which uses a controller reference manager to handle ownership and internally calls ClaimMachineSets (it acts in a similar fashion to ClaimMachines in Reconcile Cluster Machine Set by leveraging ClaimObject, however it doesn't run any filters) to get the machineSets owned by the deployment via adoption/orphaning, with a similar CanAdopt function that fetches a fresh copy of the deployment object and compares UIDs to ensure that the deployment isn't being deleted.
- getMachineMap: creates a map from MachineSet UID to the list of Machines it controls; used to verify machine labels and handle recreate deployments; it groups machines by their controller reference.
- If the DeletionTimestamp is set, the deployment has DeleteFinalizerName and there are no MachineSets referring to the deployment, then remove the finalizers. If there are MachineSets owned by the deployment, then terminateMachineSets is called, after which syncStatusOnly calls getMSAndSyncRevision, which lists the old and new (with updated revision) MachineSets and machines for the deployment and applies machine-template-hash to the adopted ones; then the deployment status is synced.
- Ref: Pausing a deployment, Deployment Status. <TODO: Why?! are we doing this> checkPausedConditions fetches the condition with type Progressing:
  - If cond.Reason is TimedOutReason, return without overwriting with the paused condition.
  - Check pausedCondExists by seeing if cond.Reason is PausedMachineDeployReason.
  - If Spec.Paused is true and pausedCondExists is false, then update the condition with Reason as PausedMachineDeployReason and message as "Deployment is paused".
  - Otherwise, if Spec.Paused is false and pausedCondExists is true, then update the condition with Reason as ResumedMachineDeployReason and message as "Deployment is resumed".
  - Update the deployment status.
- Then we check if Spec.Paused is true; if so, it implies that scaling was detected for a deployment which is paused, so sync is called (which has the responsibility of reconciling deployments on scaling/when paused):
  - First getMSAndSyncRevision is called to get the lists of new and old machineSets.
  - Then the Scale() method is invoked, following which spec.Paused is checked: if the deployment is paused and there's no rollback in progress, i.e. spec.RollbackTo is nil, then deployment cleanup is called for. The cleanup filters the old machineSets to get the ones that are currently active (i.e. DeletionTimestamp is not set) and then deletes all but RevisionHistoryLimit machineSets (note that if any delete fails, it stops the cleanup process, since the failure could be due to API server overload); in this process it checks whether the filtered machineSets have non-zero replica counts and avoids deleting those.
  - Otherwise the new machineSets, if any, are added to the set of all machineSets and their status is synced by calling sync MachineDeployment Status.
- If the deployment wasn't paused and Spec.RollbackTo is non-nil, i.e. there exists a revision that the deployment is rolling back to, the rollback() method is invoked:
  - First getMSAndSyncRevision is called to get the lists of new and old machineSets.
  - Ref: k8s rollbacks. Then we check the rollback revision version (RollbackTo.Revision); if it's zero, then we find LastRevision(), i.e. the second-highest revision number, by iterating over all machineSets, and if that's also zero, it means we were unable to find the last revision and we give up the rollback (update Deployment and Clear Rollback To is called).
  - Otherwise we proceed with the rollback: iterating over the list of machineSets, we get the Revision of each machineSet, and if it matches RollbackTo.Revision, then the PreferNoSchedule taint is removed from the nodes backing that machineSet.
  - After taint removal, the rollbackToTemplate() method is called, which compares the template (hash) of the deployment with that of the matching machineSet, updates the template of the deployment in case they're different (Set From Machine Set Template), and sets the annotations of the deployment to the rolled-back machineSet revision (Set Machine Deployment Annotations To).
- If it wasn't a rollback event either, then we check if it's a ScalingEvent, i.e. there are activeMachineSets (found by filtering for machineSets with Replicas more than 0). For scaling events, sync is called (check the Paused discussion above to see what it does); the detection is sketched after this list:
  - In case there are no activeMachineSets and Replicas on the deployment is more than 0, it's a scale-from-zero scenario, so it's considered a scalingEvent.
  - Otherwise, when there are activeMachineSets with the desired-replicas annotation not equal to the deployment's replicas, it's also a scalingEvent.
  - If no such machineSets exist, i.e. all the active ones have the same desired-replicas as the deployment's replicas, then it's not a scaling event.
- In case none of the scenarios were true, i.e. not paused, not rolling back and not scaling, then we execute the deployment strategy, which is either Recreate or Rolling.
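A sketch of the scaling-event detection described above, comparing the deployment's replica count against the desired-replicas annotation on its active machineSets; the annotation key and types are illustrative stand-ins.

```go
// Sketch of "is this a scaling event?" for a machine deployment.
package main

import (
	"fmt"
	"strconv"
)

const desiredReplicasAnnotation = "machinedeployment.machine.sapcloud.io/desired-replicas" // illustrative key

type MachineSet struct {
	Replicas    int
	Annotations map[string]string
}

func isScalingEvent(deploymentReplicas int, machineSets []MachineSet) bool {
	var active []MachineSet
	for _, ms := range machineSets {
		if ms.Replicas > 0 {
			active = append(active, ms)
		}
	}
	if len(active) == 0 {
		return deploymentReplicas > 0 // scale from zero
	}
	for _, ms := range active {
		desired, err := strconv.Atoi(ms.Annotations[desiredReplicasAnnotation])
		if err != nil || desired != deploymentReplicas {
			return true // annotation missing or out of date => scaling detected
		}
	}
	return false
}

func main() {
	ms := MachineSet{Replicas: 2, Annotations: map[string]string{desiredReplicasAnnotation: "2"}}
	fmt.Println(isScalingEvent(3, []MachineSet{ms})) // true: deployment wants 3, annotation says 2
}
```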
TODO Scale
Recreate
- First getMSAndSyncRevision is called to get the lists of new and old machineSets. The old MS are filtered to find the active ones, i.e. those having non-zero replicas.
- Annotations to stop the cluster autoscaler from performing scale-downs during the rollout are created and applied to all machineSets if the autoscaler ScaleDown Annotation During Rollout bool is true and there are old MS that haven't been scaled to zero yet.
- Then the active old MS are scaled down:
  - If the replica count for the MS is already zero, then scaling is not required.
  - Otherwise the MS has to be scaled down: it's annotated with desiredReplicas being the deployment Replicas and maxReplicas being maxSurge more than that. maxSurge is 0 for non-rolling updates.
  - In case scaling was required, after adding the annotations, the old MS is updated.
- If a scale-down occurred, or if there are old MS having non-zero replicas, then Sync RolloutStatus is invoked:
  - TODO calculateDeploymentStatus
  - Then we get all the MD conditions having type Progressing and check whether the current deployment is complete by:
    - Ensuring Replicas and UpdatedReplicas are the same.
    - There being an MD condition of Progressing type having Reason "New Machine Set Available".
  - If it's not a complete deployment, i.e. the rollout hasn't happened yet, then progress estimation happens:
    - In case the deployment is completed (UpdatedReplicas, AvailableReplicas and Replicas are the same and the new ObservedGeneration is more than or equal to the MD generation), i.e. all desired replicas are updated and available and no old machines are running, the condition is updated to Progressing with "New Machine Set Available" (the condition talked about in the previous point). (See the sketch after this list.)
    - If it's in a progressing state, then the progress is estimated by comparing the current and new status of the deployment, specifically when new machines are scaled up or become ready/available, or old machines are scaled down. In this case, the Reason is set to "Machine Set Updated".
    - If it has timed out, then the "Progressing" condition is set to false and the reason is updated to "Progress Deadline Exceeded".
  - Then we fetch all the failure conditions of type "Replica Failure" for the MS and convert them to deployment conditions. In case we don't find any, we remove the "Replica Failure" condition from the deployment if present.
  - If there aren't any status changes to be made on the deployment, then it's re-queued, otherwise the status is updated.
- If there is a need to create a new MS, i.e. the new MS from the initial getMSAndSyncRevision is nil, then the function is called again with the creation flag set to true, which creates the new MS.
- The new MS is scaled up to the deployment's Replicas count and the annotations for desiredReplicas and maxReplicas are added.
- Again the progress is estimated and, if the deployment is completed, the autoscaler annotations are removed and the deployment is cleaned up. At the end, Sync RolloutStatus is called again.
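A sketch of the "deployment complete" check referenced above (UpdatedReplicas, AvailableReplicas and Replicas equal, and the observed generation caught up); the struct is a simplified stand-in for the MachineDeployment status fields.

```go
// Sketch of checking whether a machine deployment rollout is complete.
package main

import "fmt"

type deploymentState struct {
	Generation         int64
	Replicas           int32 // desired
	UpdatedReplicas    int32
	AvailableReplicas  int32
	ObservedGeneration int64
}

func deploymentComplete(d deploymentState) bool {
	return d.UpdatedReplicas == d.Replicas &&
		d.AvailableReplicas == d.Replicas &&
		d.ObservedGeneration >= d.Generation
}

func main() {
	fmt.Println(deploymentComplete(deploymentState{
		Generation: 4, Replicas: 3, UpdatedReplicas: 3, AvailableReplicas: 3, ObservedGeneration: 4,
	})) // true: all desired replicas are updated and available
}
```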
Rolling
- First getMSAndSyncRevision is called to get the lists of new and old machineSets. Annotations to stop the cluster autoscaler from performing scale-downs during the rollout are created and applied to all machineSets if the autoscaler ScaleDown Annotation During Rollout bool is true and there are old MS that haven't been scaled to zero yet.
- Also, nodes backing all the old MS have the "PreferNoSchedule" taint added to them to avoid scheduling on them. Then reconcileNewMS is called to try to scale up, if possible:
  - Check if newMS.Replicas is the same as the deployment's Replicas; if so, no scaling is needed.
  - If newMS.Replicas is more, then a scale-down needs to happen, along with an update of the desiredReplicas and maxReplicas annotations. In rolling updates, surge is not simply 0; it is calculated via ResolveFenceposts as \(surge = \lceil\frac{maxSurge * Replicas}{100}\rceil\) when maxSurge is passed as a % value, otherwise it's taken as is. The maxReplicas value is annotated as Replicas + surge. (The percentage resolution is sketched after this list.)
  - If it's neither, then a scale-up needs to happen. NewISNewReplicas is used to get the target replica count: first surge is computed in a similar fashion as above, then maxReplicas is compared with the currently active MS replica count. If the current count is more, scale-up is not possible and the target count returned is the same as newMS's replica count; otherwise scale-up happens by min(max - current, dep.Replicas - newMS.Replicas) and scaleUp + newMS.Replicas is returned as the target replica count.
  - If any scaling happens, then Sync RolloutStatus is called to update the deployment status.
- After this, an attempt is made to scale down the active old MS. First check if their total replica count is 0, in which case no further scale-down is possible. Otherwise we attempt the scale-down:
  - Compute the minAvailable number of replicas needed, deployment.Replicas - unavailable, where unavailable is calculated via ResolveFenceposts as \(unavail = \lfloor\frac{maxUnavailable * Replicas}{100}\rfloor\) when maxUnavailable is passed as a % value, otherwise it's taken as is.
  - Compute the newMS replicas that are still unavailable: newMS.Replicas - newMS.AvailableReplicas.
  - Hence, by removing (minAvailable + newMSUnavailable) from the total active replicas, we get the number of replicas that can be cleaned up without any issues (maxScaledDown).
  - An attempt is made to clean up those replicas from the old MS, where the old MS are first sorted in the order notready, ready, unscheduled, scheduled, pending, running so that unhealthy replicas are cleaned up first and unavailability doesn't increase:
    - Iterating over the sorted old MS, we first check if the target MS's replica count is 0 or equal to its availableReplicas, in which case it can't be downscaled or has no unhealthy replicas, so we skip its cleanup.
    - Otherwise we compute scaleDownCount for the target MS as min(maxCleanupCount - totalScaledDown, Replicas - AvailableReplicas), where totalScaledDown keeps incrementing by scaleDownCount as we iterate over all old MS.
    - We then compute newReplicaCount as Replicas - scaleDownCount and attempt to scale down the target MS to it.
  - After attempting the cleanup of old unhealthy MS, we then attempt a scale-down: first check if it's possible by computing the minAvailable needed replica count and the totalAvailable count; if the total is less than the minimum, we can't scale down. Otherwise we compute totalScaledDownCount (the target) as total - min available replica count, sort the old MS as in the cleanup process above, and start iterating over them:
    - Check if scaling is done by comparing the totalScaledDown number with the target, in which case we stop.
    - If the MS's replicas is 0, then scaling isn't possible for it, so skip it.
    - Otherwise a scale-down is attempted with scaleDownCount being min(Replicas, target - totalScaledDown) and newReplicaCount as Replicas - scaleDownCount, after which totalScaledDown is incremented by scaleDownCount.
  - If any scaling happens, then Sync RolloutStatus is invoked later to update the status.
- At the end, the progress is estimated and, if the deployment is completed, the autoscaler annotations are removed and the deployment is cleaned up. At the end, Sync RolloutStatus is called again.
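A sketch of the ResolveFenceposts-style resolution used in the formulas above: surge is rounded up and unavailability rounded down, and if both resolve to zero the unavailability is bumped to 1 so the rollout can still make progress. The implementation is simplified (it accepts either a "25%" string or a plain integer string) and is not the actual MCM helper.

```go
// Sketch of resolving maxSurge / maxUnavailable against the replica count.
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

func resolve(value string, replicas int, roundUp bool) int {
	if strings.HasSuffix(value, "%") {
		pct, _ := strconv.Atoi(strings.TrimSuffix(value, "%"))
		f := float64(pct) * float64(replicas) / 100.0
		if roundUp {
			return int(math.Ceil(f))
		}
		return int(math.Floor(f))
	}
	n, _ := strconv.Atoi(value)
	return n
}

func resolveFenceposts(maxSurge, maxUnavailable string, replicas int) (surge, unavailable int) {
	surge = resolve(maxSurge, replicas, true)               // ceil for surge
	unavailable = resolve(maxUnavailable, replicas, false)  // floor for unavailability
	if surge == 0 && unavailable == 0 {
		unavailable = 1 // avoid a deadlocked rollout where nothing can move
	}
	return surge, unavailable
}

func main() {
	s, u := resolveFenceposts("25%", "25%", 10)
	fmt.Println(s, u) // 3 2
}
```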
TODO Reconcile Machine Safety Overshooting
Extension
All methods in the Driver interface take a Request and return a Response.
```go
type Driver interface {
	CreateMachine(context.Context, *CreateMachineRequest) (*CreateMachineResponse, error)
	InitializeMachine(context.Context, *InitializeMachineRequest) (*InitializeMachineResponse, error)
	DeleteMachine(context.Context, *DeleteMachineRequest) (*DeleteMachineResponse, error)
	GetMachineStatus(context.Context, *GetMachineStatusRequest) (*GetMachineStatusResponse, error)
	ListMachines(context.Context, *ListMachinesRequest) (*ListMachinesResponse, error)
	GetVolumeIDs(context.Context, *GetVolumeIDsRequest) (*GetVolumeIDsResponse, error)
}
```
The Machine Error Codes document goes a bit more in-depth into the workings of the various Driver interface methods.
All methods follow a similar pattern at the start: they first check that the MachineClass.provider in the request is the correct one, otherwise InvalidArgument is returned as the error code. Then the request's MachineClass and Secret are used to extract the providerSpec; validation of the spec happens during the extraction process. If there's an issue while unmarshalling the rawSpec, an Internal error is returned, and if there's an issue with validation, InvalidArgument is returned. The other common part is the creation of a service client, which is handled by the createSVC method (takes in region and secret).
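A sketch of that common prologue, with illustrative types, provider constant and spec fields standing in for the real request objects and provider extension code:

```go
// Sketch of the shared provider-check / spec-decode / validate prologue.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const awsProviderName = "AWS" // illustrative

type MachineClass struct {
	Provider     string
	ProviderSpec []byte // raw JSON held in the RawExtension
}

type providerSpec struct {
	Region string `json:"region"`
	AMI    string `json:"ami"`
}

func decodeAndValidate(mc *MachineClass) (*providerSpec, error) {
	if mc.Provider != awsProviderName {
		return nil, errors.New("InvalidArgument: requested provider is not supported by this driver")
	}
	var spec providerSpec
	if err := json.Unmarshal(mc.ProviderSpec, &spec); err != nil {
		return nil, fmt.Errorf("Internal: unable to unmarshal providerSpec: %w", err)
	}
	if spec.Region == "" || spec.AMI == "" {
		return nil, errors.New("InvalidArgument: providerSpec validation failed")
	}
	return &spec, nil // next step would be createSVC(region, secret)
}

func main() {
	mc := &MachineClass{Provider: "AWS", ProviderSpec: []byte(`{"region":"eu-west-1","ami":"ami-123"}`)}
	spec, err := decodeAndValidate(mc)
	fmt.Println(spec, err)
}
```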
Note that during the invocation of some methods, the Record Driver API Metric function runs, which captures the duration of the operation and records any failures. Coming to the specifics of the methods:
CreateMachine
- The Secret's userData is checked for existence and is encoded if found (later used in RunInstancesInput).
- Validate the given AMI's (Amazon Machine Image) existence: verify that the image exists and is valid using a DescribeImages POST request.
- Block devices are the storage volumes attached to an Instance, and BlockDeviceMappings define the EBS volumes and instance store volumes to attach to an instance at launch. So, using the list of devices from the providerSpec and the RootDeviceName from the image, block device mappings are generated for storage.
- Then tags are generated for instance, volume and NetworkInterface. These tags are useful as they act as identifiers that map resources to the clusters they belong to, and help with management (cleanup, security) etc. Failure in tag generation results in an Internal error code being returned.
- Afterwards, the NetworkInterfaceSpec is defined for all the interfaces by iterating over the providerSpec NetworkInterfaces, and the IamInstanceProfile is set up by going through the IAM name and ARN. After that, instance metadata options are initialized.
- Following all this, the RunInstancesInput config is initialized using all the above fetched/initialized fields. Some additional fields such as Placement, Capacity Reservation Target, CPU Options and Spot Price are further set up before the RunInstances call, which launches the EC2 instance.
- Then we wait for the VM with ProviderID for machineName to be visible to AWS endpoints by checking with confirmInstance calls, on whose success the MachineResponse consisting of the ProviderID and Machine Name is returned.
InitializeMachine
The InitializeMachine method should only be invoked as a post-VM-creation initialization step, to configure network settings etc.
- First we decode the spec, create a service client and then fetch the running instances for the requested Machine by leveraging the tags populated earlier. In case searching via tags fails, fall back to using the instanceID from the providerID to fetch the VM instances. Failure returns codes.Uninitialized as the error code.
- After getting the instance, we disable SrcDstCheck for NAT instances if it is specified as false, by targeting the instanceID. If this operation fails, Uninitialized is returned as the error code.
- Then we iterate over the Network Interfaces and assign IPv6 prefixes to them if configured. Failure returns codes.Uninitialized as the error code. If everything is successful, we return the Response with the providerID and node name.
DeleteMachine
We do the provider check, decode the spec and create the service client. Afterwards we check whether the requested Machine has its providerID populated; if so, we extract the instanceID from it and invoke terminateInstance. If the providerID is not populated, then we get the instances by tags and terminate them.
GetMachineStatus
Again we start by checking the provider, decoding the spec and creating the service client. Then we get the matching instances by tags, or by instanceID if fetching by the former fails. If multiple VM instances are returned for the machine object, it is flagged as an OutOfRange error and returned.
Otherwise we check whether the SourceDestCheck on the instance matches the one in the providerSpec; in case they don't match, Uninitialized is returned, else we return the response with the NodeName and ProviderID.
ListMachines
Do the needful provider check, spec decoding and service creation, following which we iterate over the tags to check whether they contain the clusterName tag or the nodeRole tag. These are then used to create "tag-key" filters along with an "instance-state-name" filter (with pending, running, stopping and stopped values). These filters are used to fetch the relevant VM instances for the machineClass. After getting the instances, a map of providerID to machineName is created and returned as the response.
GetVolumeIDs
Extracts EBS volume IDs from the PersistentVolume specification by handling both in-tree AWS EBS volumes and CSI driver volumes, and returns the list of volume IDs as the response. (Invoked indirectly by evictPodsWithPv.)
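A sketch of that extraction, handling both the in-tree AWS EBS volume source and the EBS CSI driver (ebs.csi.aws.com); the PV representation is a simplified stand-in for the real PersistentVolumeSpec.

```go
// Sketch of collecting EBS volume IDs from PV specs.
package main

import "fmt"

type pvSpec struct {
	AWSElasticBlockStoreVolumeID string // in-tree volume plugin
	CSIDriver                    string // e.g. "ebs.csi.aws.com"
	CSIVolumeHandle              string
}

func getVolumeIDs(specs []pvSpec) []string {
	var ids []string
	for _, s := range specs {
		switch {
		case s.AWSElasticBlockStoreVolumeID != "":
			ids = append(ids, s.AWSElasticBlockStoreVolumeID)
		case s.CSIDriver == "ebs.csi.aws.com" && s.CSIVolumeHandle != "":
			ids = append(ids, s.CSIVolumeHandle)
		}
	}
	return ids
}

func main() {
	specs := []pvSpec{
		{AWSElasticBlockStoreVolumeID: "aws://eu-west-1a/vol-0abc"},
		{CSIDriver: "ebs.csi.aws.com", CSIVolumeHandle: "vol-0def"},
	}
	fmt.Println(getVolumeIDs(specs))
}
```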
Queries
- How is MCM different from CCM? Is there any overlap between the two?
- How much access/info does the end user have about machines? How much can they control them or is the access restricted to editing nodes only?
- Check control and target clusters, inClusterConfig. Control cluster is the one where the machine-* objects are stored. Target cluster is where all the node objects are registered.
- When node deletion checks are happening, how is the ReadOnlyFileSystem flag checked?
- Why is this required in two places? FAQ: API Change
- Is the local setup guide enough to get started, or is there something else to look out for as well?
- In the ClaimMachines function, we're not passing the filters, so what's the point of having them? In what scenarios will they be needed? How is the CanAdopt() function working?
- What is this controllerRef for?
- What is the KeyFunc() doing in manageReplicas: cache.MetaNamespaceKeyFunc
- What is this ControllerExpectations, why is it needed? Handles race conditions; expectations are supposed to be used to await previous controller actions being seen in the watch. ControllerExpectations is a cache mapping controllers to what they expect to see before being woken up for a sync.
- What is this FakeDriver business?
- In a lot of functions we clone the resource and then work with the cloned copy; why is that necessary?
- Why is bootstrap token deletion required when machine creation goes through?
- What is this permit that's used when the phase is Unknown and the machine times out?
- Why is the internalExternalScheme conversion needed?