Etcd Troubleshooting Skill

This document defines the Claude Code skill for troubleshooting etcd issues on two-node OpenShift clusters with fencing topology. When activated, Claude becomes an expert etcd/Pacemaker troubleshooter capable of iterative diagnosis and remediation.

Install in one line

mfkvault install openshift-eng-etcd-troubleshooting-skill

Requires the MFKVault CLI.

Description

## Skill Overview

This skill enables Claude to:

- Validate and test access to cluster components via Ansible and the OpenShift CLI
- Iteratively collect diagnostic data from Pacemaker, etcd, and OpenShift
- Analyze symptoms and identify root causes
- Propose and execute remediation steps
- Verify fixes and adjust the approach based on results
- Provide comprehensive troubleshooting throughout the diagnostic process

## Step-by-Step Procedure

### 1. Validate Access

**1.1 Ansible Inventory Validation:**

- Check whether `deploy/openshift-clusters/inventory.ini` exists
- Verify the inventory file has valid cluster node entries
- Test SSH connectivity to cluster nodes using the Ansible ping module

**1.2 OpenShift Cluster Access Validation:**

- Test direct cluster access with `oc version`
- If direct access fails, check for `deploy/openshift-clusters/proxy.env`
- If `proxy.env` exists, source it before running `oc` commands
- Verify cluster access with `oc get nodes`
- Remember the proxy requirement for all subsequent `oc` commands

**IMPORTANT: No Cluster Access Scenario**

If OpenShift cluster API access is unavailable (which is expected when etcd is down), **all diagnostics and remediation must be performed via Ansible** using direct VM access.
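The access checks in step 1 can be sketched as a small script. The inventory and proxy.env paths come from this description; the `cluster_vms` group name is the one this skill's host-group note uses, and the `DRY_RUN` guard is an addition so the sketch can be previewed on a workstation without cluster access:

```shell
# Sketch of step 1 (access validation). DRY_RUN=1 (the default here) prints
# each external command instead of running it; clear DRY_RUN on a real
# workstation with ansible and oc installed.
DRY_RUN=${DRY_RUN-1}
INVENTORY=deploy/openshift-clusters/inventory.ini
PROXY_ENV=deploy/openshift-clusters/proxy.env

run() { if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi; }

# 1.1 Inventory validation: file present, then SSH reachability via ping module
[ -f "$INVENTORY" ] || echo "warning: $INVENTORY not found" >&2
run ansible -i "$INVENTORY" cluster_vms -m ping

# 1.2 Cluster access: try oc directly; on failure, source proxy.env (if it
# exists) so the proxy settings apply to this and all later oc commands
if ! run oc version >/dev/null 2>&1; then
  [ -f "$PROXY_ENV" ] && . "$PROXY_ENV"
fi
run oc get nodes
```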
The troubleshooting workflow remains fully functional using only:

- Ansible ad-hoc commands to cluster VMs
- Ansible playbooks for diagnostics collection
- Direct SSH access to nodes via Ansible

When cluster access is unavailable:

- ✓ You can still diagnose and fix etcd issues completely
- ✓ All Pacemaker operations work via Ansible
- ✓ All etcd container operations work via Ansible (podman commands)
- ✓ All logs are accessible via Ansible (journalctl commands)
- ✗ Cannot query OpenShift operators or cluster-level resources
- ✗ Cannot use `oc` commands for verification (use Ansible equivalents instead)

This is a **normal scenario** when etcd is down - proceed with VM-based troubleshooting.

### 2. Collect Data

**Choose Your Diagnostic Approach**

There are two approaches to data collection:

**Quick Manual Triage (recommended for initial assessment)**

Start with a few targeted commands to assess the situation:

- `pcs status` - check Pacemaker cluster state and failed actions
- `podman ps -a --filter name=etcd` - verify etcd containers are running
- `etcdctl endpoint health` - confirm etcd health

This takes ~30 seconds and is often sufficient to identify simple issues (stale failures, container restarts, etc.) that can be fixed immediately with `pcs resource cleanup etcd`.
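When the cluster API is down and you are working over Ansible rather than an interactive shell on a node, the three triage commands above can be wrapped in a small ad-hoc helper. A sketch, assuming the inventory path and `cluster_vms` group from this description; the helper prints each invocation so it can be reviewed first, then piped to `sh` to run:

```shell
# triage prints the Ansible ad-hoc invocation for one quick-triage command
# without executing it. -b escalates privileges, which pcs and podman
# typically require on the cluster nodes.
triage() {
  printf "ansible -i deploy/openshift-clusters/inventory.ini cluster_vms -b -m shell -a '%s'\n" "$1"
}

triage "pcs status"
triage "podman ps -a --filter name=etcd"
triage "etcdctl endpoint health"

# Execute one for real by piping the printed command to a shell:
#   triage "pcs status" | sh
```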
**Full Diagnostic Collection (for complex/unclear issues)**

If quick triage reveals complex problems or the root cause is unclear, run the comprehensive diagnostic script:

```bash
./helpers/etcd/collect-all-diagnostics.sh
```

This script (~5-10 minutes):

- Validates Ansible and cluster access automatically
- Collects all VM-level diagnostics via Ansible playbook
- Collects OpenShift cluster-level data (if accessible)
- Saves everything to `/tmp/etcd-diagnostics-<timestamp>/`
- Generates a `DIAGNOSTIC_REPORT.txt` with analysis commands

Use the full collection when:

- Quick triage doesn't reveal the cause
- Multiple components appear affected
- You need to preserve diagnostic data for later analysis
- The issue involves cluster ID mismatches or split-brain scenarios

**Manual Collection Commands**

For manual data collection, use the commands below.

**IMPORTANT: Target the Correct Host Group**

- **All etcd/Pacemaker commands** must target the `cluster_vms` host group (the OpenShift cluster nodes)
- **VM lifecycle commands** (start/stop VMs) target the hypervisor host
- Use Ansible ad-hoc commands with `-m shell` or run playbooks that target `cluster_vms`
- All
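The host-group split above can be illustrated with a minimal inventory layout. This is a hypothetical sketch: the host names, addresses, and the `hypervisor` group name are illustrative assumptions; only the `cluster_vms` group name is given by this skill.

```ini
; deploy/openshift-clusters/inventory.ini (illustrative layout only)

; OpenShift cluster nodes: all etcd/Pacemaker commands target this group
[cluster_vms]
master-0 ansible_host=192.0.2.10
master-1 ansible_host=192.0.2.11

; Host running the VMs: VM start/stop (lifecycle) commands target this group
[hypervisor]
kvm-host ansible_host=192.0.2.1
```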
