14 Troubleshooting Applications

Prerequisites

  • You have access to the OpenShift Web Console URL. Ask your workshop coordinator for the URL if you don't have one.

  • You have credentials to log in. Ask your workshop coordinator for credentials to log onto the OpenShift cluster.

Introduction

Below are common troubleshooting techniques to use while developing an application. We will work through several exercises to see how to resolve issues that may come up during development.

Exercises

Prepare Exercise

  • Log in to OpenShift

  • Create a new project

$ oc new-project workshop-troubleshooting-apps-YourName
Change YourName to your name.
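
If you are working from a terminal, the login step with the oc CLI looks roughly like this; the API server URL and username are placeholders, so use the values provided by your workshop coordinator:

$ oc login https://api.cluster.example.com:6443 -u developer
$ oc whoami
developer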

Image Pull Failures

Things to consider: why did the container fail to pull its image?
  • The image tag is incorrect

  • The image doesn't exist (or is in a different registry)

  • Kubernetes doesn't have permission to pull that image (see the pull-secret sketch after this list)
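
If the failure is due to a private registry, one common fix is to create a pull secret and link it to the service account that runs the pod. The sketch below is an illustration only; the registry server, credentials, and secret name are placeholders:

$ oc create secret docker-registry my-pull-secret \
    --docker-server=registry.example.com \
    --docker-username=myuser \
    --docker-password=mypassword
$ oc secrets link default my-pull-secret --for=pull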

Create a pod with an image that fails to pull

$ oc run fail --image=tcij1013/dne:v1.0.0

Check the status of the pod

$ oc get pods
NAME   READY   STATUS         RESTARTS   AGE
fail   0/1     ErrImagePull   0          11s
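
If you only need the failure reason rather than the full describe output, you can also read it straight from the pod status with a JSONPath query (an optional convenience, not part of the original steps; depending on timing the reason may show ErrImagePull or ImagePullBackOff):

$ oc get pod fail -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
ErrImagePull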

Inspect the pod

$ oc describe pod fail

As we can see in the Events, the pod failed because it could not pull the image.

Events:
  Type     Reason          Age                From                                                  Message
  ----     ------          ----               ----                                                  -------
  Normal   Scheduled       44s                default-scheduler                                     Successfully assigned workshop-troubleshooting-apps-tim/fail to ip-10-1-156-6.eu-central-1.compute.internal
  Normal   AddedInterface  42s                multus                                                Add eth0 [10.34.4.227/23]
  Normal   Pulling         19s (x2 over 42s)  kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Pulling image "tcij1013/dne:v1.0.0"
  Warning  Failed          16s (x2 over 35s)  kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Failed to pull image "tcij1013/dne:v1.0.0": rpc error: code = Unknown desc = Error reading manifest v1.0.0 in docker.io/tcij1013/dne: errors:
denied: requested access to the resource is denied
unauthorized: authentication required
  Warning  Failed   16s (x2 over 35s)  kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Error: ErrImagePull
  Normal   BackOff  3s (x2 over 34s)   kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Back-off pulling image "tcij1013/dne:v1.0.0"
  Warning  Failed   3s (x2 over 34s)   kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Error: ImagePullBackOff
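
To double-check whether the tag exists in the registry at all, you can inspect it directly with a recent oc client; for our deliberately bogus image this will also fail, which confirms the cause:

$ oc image info docker.io/tcij1013/dne:v1.0.0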

Delete the pod

$ oc delete pod/fail
pod "fail" deleted

Invalid ConfigMap or Secret

Create a pod YAML file that references a missing ConfigMap

$ cat >bad-configmap-pod.yml<<YAML
# bad-configmap-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how
YAML

Create the bad configmap pod

$ oc create -f bad-configmap-pod.yml

When we check the status of the pod, we see a CreateContainerConfigError

$ oc get pods
NAME            READY   STATUS                       RESTARTS   AGE
configmap-pod   0/1     CreateContainerConfigError   0          31s

When we run the oc describe command, we see under Events an error showing that the configmap could not be found.

$ oc describe pod configmap-pod
Events:
  Type     Reason          Age               From                                                  Message
  ----     ------          ----              ----                                                  -------
  Normal   Scheduled       13s               default-scheduler                                     Successfully assigned workshop-troubleshooting-apps-tim/configmap-pod to ip-10-1-156-6.eu-central-1.compute.internal
  Normal   AddedInterface  11s               multus                                                Add eth0 [10.34.4.236/23]
  Normal   Pulling         9s (x2 over 11s)  kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Pulling image "gcr.io/google_containers/busybox"
  Normal   Pulled          8s (x2 over 9s)   kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Successfully pulled image "gcr.io/google_containers/busybox"
  Warning  Failed          8s (x2 over 9s)   kubelet, ip-10-1-156-6.eu-central-1.compute.internal  Error: configmap "special-config" not found
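
Rather than deleting the pod right away, you could also resolve the error by creating the ConfigMap the pod expects. A minimal sketch, assuming any value will do for the special.how key:

$ oc create configmap special-config --from-literal=special.how=very

With the ConfigMap in place, the kubelet should be able to create the container on its next retry. For this exercise, though, we simply clean up.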

Delete the bad configmap pod

$ oc delete -f bad-configmap-pod.yml

Validation Errors

Let's validate a sample nginx app. The manifest below intentionally misspells apiVersion as apps/vl (instead of apps/v1) so we can see what a validation error looks like.

$ cat >validate-deployment.yaml<<EOF
apiVersion: apps/vl
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
EOF

Run the oc apply command with the --dry-run and --validate=true flags

$ oc apply -f validate-deployment.yaml --dry-run --validate=true
error: unable to recognize "validate-deployment.yaml": no matches for kind "Deployment" in version "apps/vl"

Now add two extra spaces of indentation in front of metadata in validate-deployment.yaml

$  cat validate-deployment.yaml
apiVersion: apps/vl
kind: Deployment
  metadata:
  name: nginx-deployment

Check for any spacing errors using the python -c command

$  python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' <  validate-deployment.yaml
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/yaml/__init__.py", line 93, in safe_load
    return load(stream, SafeLoader)
  File "/usr/lib64/python2.7/site-packages/yaml/__init__.py", line 71, in load
    return loader.get_single_data()
  File "/usr/lib64/python2.7/site-packages/yaml/constructor.py", line 37, in get_single_data
    node = self.get_single_node()
  File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 55, in compose_document
    node = self.compose_node(None, None)
  File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 84, in compose_node
    node = self.compose_mapping_node(anchor)
  File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 127, in compose_mapping_node
    while not self.check_event(MappingEndEvent):
  File "/usr/lib64/python2.7/site-packages/yaml/parser.py", line 98, in check_event
    self.current_event = self.state()
  File "/usr/lib64/python2.7/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
    if self.check_token(KeyToken):
  File "/usr/lib64/python2.7/site-packages/yaml/scanner.py", line 116, in check_token
    self.fetch_more_tokens()
  File "/usr/lib64/python2.7/site-packages/yaml/scanner.py", line 220, in fetch_more_tokens
    return self.fetch_value()
  File "/usr/lib64/python2.7/site-packages/yaml/scanner.py", line 580, in fetch_value
    self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
  in "<stdin>", line 3, column 11

Change apiVersion back to apps/v1 and correct the indentation of metadata (the ScannerError above points at line 3, column 11, which is the mis-indented metadata key)

$ cat validate-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment

Validate YAML

$ python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' <  validate-deployment.yaml
$ oc apply -f validate-deployment.yaml --dry-run --validate=true
deployment.apps/nginx-deployment created (dry run)
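
Once the dry run passes, you could apply the file for real and watch the rollout; this is an optional follow-up rather than part of the exercise:

$ oc apply -f validate-deployment.yaml
$ oc rollout status deployment/nginx-deployment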

Container not updating

A container not updating is often the result of the following scenario:

  • Create a deployment using an image tag (e.g. tcij1013/myapp:v1)

  • Notice there is a bug in myapp

  • Build a new image and push it to the same tag (tcij1013/myapp:v1)

  • Delete all the myapp Pods, and watch the new ones get created by the deployment

  • Realize that the bug is still present

This problem relates to how Kubernetes decides whether to do a docker pull when starting a container in a Pod.

In the v1.Container specification there's an option called imagePullPolicy:

Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.

Since the image is tagged as v1 in the above example, the default pull policy is IfNotPresent. The OpenShift cluster already has a local copy of tcij1013/myapp:v1, so it does not attempt to do a docker pull. When the new Pods come up, they are still using the old broken container image.

Ways to resolve this issue
  • Use unique tags (e.g. based on your source control commit id)

  • Specify imagePullPolicy: Always in your deployment (see the snippet after this list)
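
For illustration, the relevant part of a Deployment pod template with an explicit pull policy might look like the following; the image name reuses the hypothetical tcij1013/myapp:v1 from the scenario above:

# Excerpt from a Deployment's pod template (hypothetical)
spec:
  containers:
  - name: myapp
    image: tcij1013/myapp:v1
    # Always pull, even if the node already has this tag cached
    imagePullPolicy: Always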

Delete the project

$ oc delete project workshop-troubleshooting-apps-YourName

Summary

In this lab we learned how to troubleshoot the following:
  • Image Pull Failures

  • Invalid ConfigMap or Secrets

  • Validation Errors

  • Container not updating