Node restart

Node restart disrupts the state of the node by restarting it.

Use cases

  • Node restart fault determines the deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod in the event of an unexpected node restart.
  • It simulates loss of critical services (or node-crash).
  • It verifies resource budgeting on cluster nodes (whether request or limit settings are honored on available nodes).
  • It verifies whether topology constraints are adhered to (node selectors, tolerations, zone distribution, affinity or anti-affinity policies) or not.

Permissions required

Below is a sample Kubernetes role that defines the permissions required to execute the fault.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: hce
  name: node-restart
spec:
  definition:
    scope: Cluster
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "deletecollection", "update"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create", "get", "list", "patch", "update"]
      - apiGroups: [""]
        resources: ["chaosEngines", "chaosExperiments", "chaosResults"]
        verbs: ["create", "delete", "get", "list", "patch", "update"]
      - apiGroups: [""]
        resources: ["pods/log"]
        verbs: ["get", "list", "watch"]
      - apiGroups: [""]
        resources: ["pods/exec"]
        verbs: ["get", "list", "create"]
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["create", "delete", "get", "list", "deletecollection"]
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get", "list"]
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["get", "list"]

Prerequisites

  • Kubernetes > 1.16

  • Create a Kubernetes secret named id-rsa where the fault will be executed. Its ssh-privatekey field must contain the private SSH key that SSH_USER uses to connect to the node that hosts the target pod.

    • Below is a sample secret file:

      apiVersion: v1
      kind: Secret
      metadata:
        name: id-rsa
      type: kubernetes.io/ssh-auth
      stringData:
        ssh-privatekey: |-
          # SSH private key for ssh contained here

      For those already familiar with an SSH client, the steps to create the RSA key pair for remote SSH access are summarized below.

      1. Create a new key pair, storing the private key in a file named my-id-rsa-key and the public key in my-id-rsa-key.pub:
         ssh-keygen -f ~/my-id-rsa-key -t rsa -b 4096
      2. For each available node, run the following command to copy the public key of my-id-rsa-key to that node:
         ssh-copy-id -i ~/my-id-rsa-key user@node

      For further details, refer to this documentation. After copying the public key to all nodes and creating the secret, you are all set to execute the fault.

  • The target nodes should be in the ready state before and after injecting chaos.

Mandatory tunables

Tunable | Description | Notes
TARGET_NODE | Name of the target node subject to chaos. | If this is not provided, a random node is selected. For more information, go to target node.
NODE_LABEL | Node label that is used to filter the target nodes. | It is mutually exclusive with the TARGET_NODE environment variable. If both are provided, TARGET_NODE takes precedence. For more information, go to target node with labels.
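
As a rough illustration of the NODE_LABEL tunable, the sketch below selects target nodes by label instead of by name. It reuses the ChaosEngine layout from the other examples on this page. The label node-role=chaos-target is a hypothetical placeholder and the key=value format is an assumption, so substitute a label that actually exists on your nodes.

```yaml
# select target nodes by label instead of by name
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # hypothetical label; TARGET_NODE is left unset so the label is used
        - name: NODE_LABEL
          value: 'node-role=chaos-target'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```

Leaving TARGET_NODE out keeps the selection unambiguous, since the two tunables are mutually exclusive.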

Optional tunables

Tunable | Description | Notes
LIB_IMAGE | Image used to run the stress command. | Default: chaosnative/chaos-go-runner:main-latest. For more information, go to image used by the helper pod.
SSH_USER | Name of the SSH user. | Default: root. For more information, go to SSH user.
TARGET_NODE_IP | Internal IP of the target node subject to chaos. | If not provided, the fault uses the node IP of the TARGET_NODE. Default: empty. For more information, go to target node internal IP.
REBOOT_COMMAND | Command used to reboot. | Default: sudo systemctl reboot. For more information, go to reboot command.
TOTAL_CHAOS_DURATION | Duration for which chaos is injected into the target resource (in seconds). | Default: 120 s. For more information, go to duration of the chaos.
RAMP_TIME | Period to wait before and after injecting chaos (in seconds). | For example, 30 s. For more information, go to ramp time.
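
The following sketch shows how several of the optional tunables above might be combined in one ChaosEngine. It assumes they are passed as environment variables in the same way as the examples later on this page. The values are taken from the defaults and examples in the table (a 30 s ramp time, the default helper image), and the node name node01 is a placeholder.

```yaml
# combine optional tunables in one engine
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # wait 30 seconds before and after injecting chaos
        - name: RAMP_TIME
          value: '30'
        # helper image (the documented default, set explicitly here)
        - name: LIB_IMAGE
          value: 'chaosnative/chaos-go-runner:main-latest'
        # placeholder node name
        - name: TARGET_NODE
          value: 'node01'
        - name: TOTAL_CHAOS_DURATION
          value: '120'
```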

Reboot command

Command to restart the target node. Tune it by using the REBOOT_COMMAND environment variable.

The following YAML snippet illustrates the use of this environment variable:

# provide the reboot command
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # command used for the reboot
        - name: REBOOT_COMMAND
          value: 'sudo systemctl reboot'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        # chaos duration in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '60'

SSH user

Name of the SSH user for the target node. Tune it by using the SSH_USER environment variable.

The following YAML snippet illustrates the use of this environment variable:

# name of the SSH user used to SSH into the target node
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # name of the SSH user
        - name: SSH_USER
          value: 'root'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        # chaos duration in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Target node internal IP

Internal IP of the target node (optional). If the internal IP is not provided, the fault derives the internal IP of the target node. Tune it by using the TARGET_NODE_IP environment variable.

The following YAML snippet illustrates the use of this environment variable:

# internal IP of the target node
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # internal IP of the target node
        - name: TARGET_NODE_IP
          value: '10.0.170.92'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        # chaos duration in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '60'