14 Troubleshooting Applications
Prerequisites
- You have access to the OpenShift Web Console URL. Ask your workshop coordinator for the URL if you don't have one.
- You have credentials to log in. Ask your workshop coordinator for credentials to log onto the OpenShift cluster.
Introduction
Below are common application troubleshooting techniques to use while developing an application. We will work through several exercises that show how to diagnose and resolve issues that may come up during development.
Exercises
Prepare Exercise
- Log in to OpenShift
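A minimal login sketch; the API URL, username, and password are placeholders for the values your workshop coordinator provides:

$ oc login https://api.cluster.example.com:6443 -u YourUser -p YourPassword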
- Create a new project (change YourName to your name)
$ oc new-project workshop-troubleshooting-apps-YourName
Image Pull Failures
An image pull can fail for a few common reasons:
- The image tag is incorrect
- The image doesn't exist (or is in a different registry)
- Kubernetes doesn't have permissions to pull the image
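If the cause is missing registry credentials, you can create a pull secret and link it to the service account that runs the pod. A minimal sketch; the registry server, username, and password are placeholders:

$ oc create secret docker-registry my-pull-secret \
    --docker-server=registry.example.com \
    --docker-username=YourUser \
    --docker-password=YourPassword
$ oc secrets link default my-pull-secret --for=pull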
Create a pod with an image that fails to pull
$ oc run fail --image=tcij1013/dne:v1.0.0
Check the status of the pod
$ oc get pods
NAME READY STATUS RESTARTS AGE
fail 0/1 ErrImagePull 0 11s
Inspect the pod
$ oc describe pod fail
As we can see in the Events, the pod failed because it could not pull the image.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 44s default-scheduler Successfully assigned workshop-troubleshooting-apps-tim/fail to ip-10-1-156-6.eu-central-1.compute.internal
Normal AddedInterface 42s multus Add eth0 [10.34.4.227/23]
Normal Pulling 19s (x2 over 42s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Pulling image "tcij1013/dne:v1.0.0"
Warning Failed 16s (x2 over 35s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Failed to pull image "tcij1013/dne:v1.0.0": rpc error: code = Unknown desc = Error reading manifest v1.0.0 in docker.io/tcij1013/dne: errors:
denied: requested access to the resource is denied
unauthorized: authentication required
Warning Failed 16s (x2 over 35s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Error: ErrImagePull
Normal BackOff 3s (x2 over 34s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Back-off pulling image "tcij1013/dne:v1.0.0"
Warning Failed 3s (x2 over 34s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Error: ImagePullBackOff
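Rather than scanning the events, you can also read the waiting reason directly with jsonpath; a sketch assuming the pod has a single container (the reason alternates between ErrImagePull and ImagePullBackOff as the kubelet retries):

$ oc get pod fail -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'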
Delete the pod
$ oc delete pod/fail
pod "fail" deleted
Invalid ConfigMap or Secret
Create a bad configmap YAML file
$ cat >bad-configmap-pod.yml<<YAML
# bad-configmap-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
  - name: test-container
    image: gcr.io/google_containers/busybox
    command: [ "/bin/sh", "-c", "env" ]
    env:
    - name: SPECIAL_LEVEL_KEY
      valueFrom:
        configMapKeyRef:
          name: special-config
          key: special.how
YAML
Create the pod from the bad configmap file
$ oc create -f bad-configmap-pod.yml
When we get the status of the pod, we see a CreateContainerConfigError
$ oc get pods
NAME READY STATUS RESTARTS AGE
configmap-pod 0/1 CreateContainerConfigError 0 31s
When we run the oc describe command, we see under Events an error showing that the configmap could not be found.
$ oc describe pod configmap-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13s default-scheduler Successfully assigned workshop-troubleshooting-apps-tim/configmap-pod to ip-10-1-156-6.eu-central-1.compute.internal
Normal AddedInterface 11s multus Add eth0 [10.34.4.236/23]
Normal Pulling 9s (x2 over 11s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Pulling image "gcr.io/google_containers/busybox"
Normal Pulled 8s (x2 over 9s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Successfully pulled image "gcr.io/google_containers/busybox"
Warning Failed 8s (x2 over 9s) kubelet, ip-10-1-156-6.eu-central-1.compute.internal Error: configmap "special-config" not found
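One way to resolve the error is to create the ConfigMap the pod expects; a sketch where the value very is an arbitrary placeholder:

$ oc create configmap special-config --from-literal=special.how=very

Once the ConfigMap exists, the kubelet retries and the container should start.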
Delete the bad configmap pod
$ oc delete -f bad-configmap-pod.yml
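Alternatively, if the ConfigMap may legitimately be absent, the key reference can be marked optional so the container starts with the variable simply unset; a sketch of the relevant env fragment:

env:
- name: SPECIAL_LEVEL_KEY
  valueFrom:
    configMapKeyRef:
      name: special-config
      key: special.how
      optional: true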
Validation Errors
Let's validate a sample nginx app
$ cat >validate-deployment.yaml<<EOF
apiVersion: apps/vl
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
EOF
Run the oc apply command with the --dry-run and --validate=true flags
$ oc apply -f validate-deployment.yaml --dry-run --validate=true
error: unable to recognize "validate-deployment.yaml": no matches for kind "Deployment" in version "apps/vl"
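To see which API versions the cluster actually serves (and spot the vl/v1 typo), you can list them; the grep filter is just an example:

$ oc api-versions | grep ^apps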
Now introduce a spacing error by adding two extra spaces in front of metadata in validate-deployment.yaml
$ cat validate-deployment.yaml
apiVersion: apps/vl
kind: Deployment
  metadata:
  name: nginx-deployment
Check for spacing errors using the python -c command
$ python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' < validate-deployment.yaml
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/yaml/__init__.py", line 93, in safe_load
return load(stream, SafeLoader)
File "/usr/lib64/python2.7/site-packages/yaml/__init__.py", line 71, in load
return loader.get_single_data()
File "/usr/lib64/python2.7/site-packages/yaml/constructor.py", line 37, in get_single_data
node = self.get_single_node()
File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 84, in compose_node
node = self.compose_mapping_node(anchor)
File "/usr/lib64/python2.7/site-packages/yaml/composer.py", line 127, in compose_mapping_node
while not self.check_event(MappingEndEvent):
File "/usr/lib64/python2.7/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/lib64/python2.7/site-packages/yaml/parser.py", line 428, in parse_block_mapping_key
if self.check_token(KeyToken):
File "/usr/lib64/python2.7/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/lib64/python2.7/site-packages/yaml/scanner.py", line 220, in fetch_more_tokens
return self.fetch_value()
File "/usr/lib64/python2.7/site-packages/yaml/scanner.py", line 580, in fetch_value
self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
in "<stdin>", line 3, column 11
Change the apiVersion back to apps/v1 and correct the spacing
$ cat validate-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
Validate the YAML and rerun the dry run
$ python -c 'import yaml,sys;yaml.safe_load(sys.stdin)' < validate-deployment.yaml
$ oc apply -f validate-deployment.yaml --dry-run --validate=true
deployment.apps/nginx-deployment created (dry run)
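On newer oc clients the bare --dry-run flag is deprecated in favor of an explicit mode; a client-side check, plus an optional server-side one that also runs admission validation (assuming a recent oc release):

$ oc apply -f validate-deployment.yaml --dry-run=client --validate=true
$ oc apply -f validate-deployment.yaml --dry-run=server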
Container not updating
A container not updating is commonly caused by the following scenario:
- You notice there is a bug in myapp
- You build a new image and push it to the same tag (tcij1013/myapp:v1)
- You delete all the myapp Pods, and watch the new ones get created by the deployment
- You realize that the bug is still present

This problem relates to how Kubernetes decides whether to do a docker pull when starting a container in a Pod.
In the V1.Container specification there's an option called ImagePullPolicy:
Image pull policy. One of Always, Never, IfNotPresent. Defaults to Always if :latest tag is specified, or IfNotPresent otherwise.
Since the image is tagged v1 in the example above, the default pull policy is IfNotPresent. The OpenShift cluster already has a local copy of tcij1013/myapp:v1, so it does not attempt a docker pull. When the new Pods come up, they are still using the old, broken container image.
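You can confirm the effective policy on a running workload; a sketch assuming a deployment named myapp with a single container:

$ oc get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].imagePullPolicy}'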
To avoid this:
- Use unique tags (e.g. based on your source control commit id)
- Specify imagePullPolicy: Always in your deployment, as in the fragment below
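A minimal fragment of the container spec with the policy pinned (image name taken from the example above):

spec:
  containers:
  - name: myapp
    image: tcij1013/myapp:v1
    imagePullPolicy: Always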
Delete Project
$ oc delete project workshop-troubleshooting-apps-YourName
Summary
- Image Pull Failures
- Application Crashing
- Invalid ConfigMap or Secrets
- Liveness/Readiness Probe Failure
- Validation Errors
- Container not updating