Datadog (kubernetes)
This chart installs the Datadog Agent into your Kubernetes cluster to collect and send observability data to Datadog.
Made by: Massdriver
Official: Yes
Kubernetes Datadog Integration
This service manages the integration of Datadog for monitoring and logging within a Kubernetes cluster. Datadog provides comprehensive observability into your Kubernetes environments, tracking metrics, logs, and other performance indicators in real-time.
Design Decisions
- Helm Chart Version: The module deploys the Datadog agent using Helm. The version is pinned to 3.25.4 to ensure consistent behavior across deployments.
- Namespace Management: The Helm release creates a namespace for the Datadog agent to avoid conflicts with other resources.
- Values Management: Key values for the Helm chart are populated from Terraform variables and local settings to maintain flexibility and customization.
- Authentication: The Kubernetes cluster authentication information is retrieved dynamically from the input variables, simplifying the integration process.
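For readers more familiar with the Helm CLI than with Terraform, the pinned version and namespace handling described above can be approximated on the command line. This is only a sketch: the module itself drives Helm through Terraform, and the release name `datadog-agent`, the chart reference `datadog/datadog`, and the `values.yaml` file name are taken from the runbook commands later in this document rather than from the module source. The helper name `helm_install_cmd` is hypothetical.

```shell
# Prints (does not run) a Helm command that approximates the module's behavior.
# The chart version 3.25.4 comes from the design notes; release and chart
# names are assumptions borrowed from the runbook.
helm_install_cmd() {
  namespace="$1"
  printf 'helm upgrade --install datadog-agent datadog/datadog --version %s --namespace %s --create-namespace -f values.yaml\n' \
    "3.25.4" "$namespace"
}

# Example: print the command for a namespace called "monitoring".
helm_install_cmd monitoring
```

Printing the command instead of executing it makes it easy to review the pinned version and target namespace before applying anything to a cluster.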
Runbook
Issue: Datadog Agent Not Collecting Logs
Sometimes the Datadog agent may not collect logs as expected. This could be due to configuration issues or agent deployment problems.
To verify the log collection settings:
kubectl get configmap datadog -n <namespace> -o yaml | grep -A 5 logs_config
Expect to see:
logs_config:
  container_collect_all: true
If container_collect_all is not set to true, update the values.yaml file and reapply the Helm chart.
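The check above can be wrapped in a small helper so the result is unambiguous. This is a minimal sketch: the function name `check_collect_all` and the `NAMESPACE` variable are hypothetical, and it assumes the setting appears on its own line in the ConfigMap output, as in the expected output shown above.

```shell
# Reads ConfigMap YAML on stdin and reports whether
# container_collect_all is set to true.
check_collect_all() {
  if grep -Eq '^[[:space:]]*container_collect_all:[[:space:]]*true[[:space:]]*$'; then
    echo "log collection enabled"
  else
    echo "log collection disabled"
    return 1
  fi
}

# Usage (NAMESPACE is a placeholder for the agent's namespace):
#   kubectl get configmap datadog -n "$NAMESPACE" -o yaml | check_collect_all
```

The non-zero return code on failure lets the helper be used in scripts or CI checks.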
Issue: Datadog Agent Failing to Start
If the Datadog agent pods are not starting, check their status:
kubectl get pods -n <namespace> | grep datadog
If the pods are in a CrashLoopBackOff or Error state, describe the pod for more details:
kubectl describe pod <pod_name> -n <namespace>
Look for any errors or events that indicate why the pod is failing to start.
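When several agent pods exist, it can help to filter the pod listing down to the ones that are not healthy before describing them. The sketch below assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, ...); the function name `list_unhealthy_pods` and the `NAMESPACE` and `POD_NAME` variables are hypothetical.

```shell
# Reads `kubectl get pods` output on stdin and prints the name and status
# of every pod that is neither Running nor Completed.
list_unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage:
#   kubectl get pods -n "$NAMESPACE" | grep datadog | list_unhealthy_pods
#   (when piping through grep, the header row is dropped; call the helper
#    directly on the kubectl output if the first pod must not be skipped)
#
# For a pod in CrashLoopBackOff, logs from the previous container run
# often show the actual error:
#   kubectl logs "$POD_NAME" -n "$NAMESPACE" --previous
```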
Issue: Missing Metrics in Datadog Dashboard
If metrics from your Kubernetes cluster are not appearing in the Datadog dashboard, verify that the Datadog API key is correctly configured in the cluster:
kubectl get secret datadog -o yaml -n <namespace> | grep api-key
This should show the base64-encoded API key that was configured during the Helm installation. If the key is missing or incorrect, update the secret and restart the Datadog agent pods so they pick up the new value.
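Because Secret values are stored base64-encoded, the stored key has to be decoded before it can be compared with the key in your Datadog account. The following sketch decodes the value and checks its shape without printing it. It assumes the usual Datadog API key format of 32 hexadecimal characters and the `api-key` data field implied by the grep above; the function name `check_api_key` and the `NAMESPACE` variable are hypothetical.

```shell
# Reads a base64-encoded value on stdin, decodes it, and checks that it
# looks like a Datadog API key (32 hex characters) without echoing the key.
check_api_key() {
  key=$(base64 --decode | tr -d '\n')
  if printf '%s' "$key" | grep -Eq '^[0-9a-fA-F]{32}$'; then
    echo "api key format ok"
  else
    echo "api key missing or malformed"
    return 1
  fi
}

# Usage:
#   kubectl get secret datadog -n "$NAMESPACE" \
#     -o jsonpath='{.data.api-key}' | check_api_key
```

A key that passes this format check can still be wrong or revoked, so a failed check is conclusive but a passing one is not.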
Issue: High Memory/CPU Usage by Datadog Agent
Check the resource usage of the Datadog agent pods:
kubectl top pod -n <namespace> | grep datadog
If the usage is too high, consider adjusting the resource requests/limits in the values.yaml file:
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "200m"
Apply the updated values by re-running the Helm deployment:
helm upgrade --install datadog-agent -f values.yaml datadog/datadog -n <namespace>
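To decide whether usage is "too high" relative to a configured limit, the `kubectl top pod` output can be compared against a threshold. This is a rough sketch that assumes the default three-column output (NAME, CPU, MEMORY) with memory reported in Mi; the function name `pods_over_memory` and the `NAMESPACE` variable are hypothetical.

```shell
# Reads `kubectl top pod` output on stdin and prints pods whose memory
# usage exceeds the threshold (in Mi) given as the first argument.
# Rows whose memory is not reported in Mi are skipped.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}

# Usage: list agent pods using more than 512Mi.
#   kubectl top pod -n "$NAMESPACE" | pods_over_memory 512
```

Pods that repeatedly appear in this list are candidates for a higher memory limit or for disabling optional features.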
Issue: Kubernetes Cluster Connectivity
If the Datadog agent is unable to communicate with the Kubernetes API, ensure that the service account and the associated roles/role bindings have the necessary permissions.
Check the status of the service account:
kubectl get serviceaccount datadog-agent -n <namespace>
Verify the roles and role bindings:
kubectl get roles,rolebindings -n <namespace>
Ensure that the datadog-agent service account has the required permissions to access cluster resources. If not, adjust the roles/role bindings accordingly.
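Instead of reading role definitions by hand, you can ask the API server directly whether the service account is allowed to perform an action, using `kubectl auth can-i` with impersonation. The sketch below is an assumption-laden example: it takes the service account name `datadog-agent` from the commands above, picks "list pods" as a representative permission, and requires that your own user is allowed to impersonate service accounts. The function names are hypothetical.

```shell
# Builds the identity string Kubernetes uses for a service account.
sa_identity() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Asks the API server whether the datadog-agent service account in the
# given namespace may list pods in all namespaces. Prints "yes" or "no".
can_list_pods() {
  kubectl auth can-i list pods --all-namespaces \
    --as="$(sa_identity "$1" datadog-agent)"
}

# Usage:
#   can_list_pods "$NAMESPACE"
#
# If permissions are granted cluster-wide, they may be defined as
# ClusterRoles rather than namespaced Roles:
#   kubectl get clusterroles,clusterrolebindings | grep datadog
```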
Variable | Type | Description |
---|---|---|
clusterAgent.metricsProvider.enabled | boolean | No description |
datadog.apiKey | string | No description |
datadog.apm.portEnabled | boolean | Enable Application Performance Monitoring |
datadog.dogstatsd.useHostPort | boolean | Bind to and expose the Host port. This is required for custom metrics. |
datadog.env[].name | string | No description |
datadog.env[].value | string | No description |
datadog.logs.enabled | boolean | No description |
datadog.site | string | The site of the Datadog intake to send Agent data to. Normally the default "datadoghq.com" is fine, but during Datadog setup you may need to use a specific endpoint. |
namespace | string | No description |
networkMonitoring.enabled | boolean | Enable network performance monitoring |
securityAgent.runtime.enabled | boolean | Set to true to enable Cloud Workload Security (CWS) |
systemProbe.enableOOMKill | boolean | Enable the OOM kill eBPF-based check |
systemProbe.enableTCPQueueLength | boolean | Enable the TCP queue length eBPF-based check |