# Etcd Troubleshooting Skill
This document defines the Claude Code skill for troubleshooting etcd issues on two-node OpenShift clusters with fencing topology. When activated, Claude becomes an expert etcd/Pacemaker troubleshooter capable of iterative diagnosis and remediation.

## Skill Overview

This skill enables Claude to:

- Validate and test access to cluster components via Ansible and OpenShift CLI
- Iteratively collect diagnostic data from Pacemaker, etcd, and OpenShift
- Analyze symptoms and identify root causes
- Propose and execute remediation steps
- Verify fixes and adjust approach based on results
- Provide comprehensive troubleshooting throughout the diagnostic process

## Step-by-Step Procedure

### 1. Validate Access

**1.1 Ansible Inventory Validation:**

- Check if `deploy/openshift-clusters/inventory.ini` exists
- Verify the inventory file has valid cluster node entries
- Test SSH connectivity to cluster nodes using Ansible ping module

**1.2 OpenShift Cluster Access Validation:**

- Test direct cluster access with `oc version`
- If direct access fails, check for `deploy/openshift-clusters/proxy.env`
- If proxy.env exists, source it before running oc commands
- Verify cluster access with `oc get nodes`
- Remember proxy requirement for all subsequent oc commands

**IMPORTANT: No Cluster Access Scenario**

If OpenShift cluster API access is unavailable (which is expected when etcd is down), **all diagnostics and remediation must be performed via Ansible** using direct VM access.
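The access checks in steps 1.1 and 1.2 could be scripted roughly as follows. This is a minimal sketch, not part of the skill itself: the file paths and the `cluster_vms` group name come from this document, while the function names (`check_inventory`, `check_ssh`, `detect_oc_access`) and the `INVENTORY`/`PROXY_ENV` variables are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch of the section 1 access checks. File paths and the cluster_vms group
# come from this document; the function names are illustrative.

INVENTORY="${INVENTORY:-deploy/openshift-clusters/inventory.ini}"
PROXY_ENV="${PROXY_ENV:-deploy/openshift-clusters/proxy.env}"

# 1.1: the inventory file must exist and define the cluster_vms group.
check_inventory() {
    [ -f "$INVENTORY" ] && grep -q '^\[cluster_vms' "$INVENTORY"
}

# 1.1: SSH reachability of the cluster nodes via the Ansible ping module.
check_ssh() {
    ansible cluster_vms -i "$INVENTORY" -m ping
}

# 1.2: try oc directly, then once more with proxy.env sourced.
# Prints "direct", "proxy", or "none".
detect_oc_access() {
    if oc version >/dev/null 2>&1 && oc get nodes >/dev/null 2>&1; then
        echo direct
    elif [ -f "$PROXY_ENV" ] &&
         ( . "$PROXY_ENV" && oc get nodes >/dev/null 2>&1 ); then
        echo proxy
    else
        echo none
    fi
}
```

Running `check_inventory && check_ssh` covers step 1.1. For step 1.2, the proxy attempt runs in a subshell so that a failed attempt leaves the current environment untouched; when `detect_oc_access` prints `proxy`, source `proxy.env` in the current shell so that later `oc` commands inherit it. A result of `none` corresponds to the no-cluster-access scenario described here.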
The troubleshooting workflow remains fully functional using only:

- Ansible ad-hoc commands to cluster VMs
- Ansible playbooks for diagnostics collection
- Direct SSH access to nodes via Ansible

When cluster access is unavailable:

- ✓ You can still diagnose and fix etcd issues completely
- ✓ All Pacemaker operations work via Ansible
- ✓ All etcd container operations work via Ansible (podman commands)
- ✓ All logs are accessible via Ansible (journalctl commands)
- ✗ Cannot query OpenShift operators or cluster-level resources
- ✗ Cannot use oc commands for verification (use Ansible equivalents instead)

This is a **normal scenario** when etcd is down - proceed with VM-based troubleshooting.

### 2. Collect Data

**Choose Your Diagnostic Approach**

There are two approaches to data collection:

**Quick Manual Triage (recommended for initial assessment)**

Start with a few targeted commands to assess the situation:

- `pcs status` - Check Pacemaker cluster state and failed actions
- `podman ps -a --filter name=etcd` - Verify etcd containers are running
- `etcdctl endpoint health` - Confirm etcd health

This takes ~30 seconds and is often sufficient to identify simple issues (stale failures, container restarts, etc.) that can be fixed immediately with `pcs resource cleanup etcd`.
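Because these commands have to run on the cluster nodes, and `oc` may be unavailable, the quick triage could be driven through Ansible ad-hoc commands along the following lines. This is a sketch under assumptions: the helper names `run_on_cluster` and `quick_triage` are hypothetical, `--become` assumes `pcs` and `podman` need root on the nodes, and the default `ETCDCTL` value assumes `etcdctl` is reachable inside a container named `etcd`, which this document does not confirm.

```bash
#!/usr/bin/env bash
# Sketch: run the three quick-triage commands on every node in cluster_vms.

INVENTORY="${INVENTORY:-deploy/openshift-clusters/inventory.ini}"
# Assumption: etcdctl is invoked inside a container named "etcd".
# Override ETCDCTL if your nodes expose etcdctl differently.
ETCDCTL="${ETCDCTL:-podman exec etcd etcdctl}"

# Run one shell command on all cluster nodes via an Ansible ad-hoc call.
run_on_cluster() {
    ansible cluster_vms -i "$INVENTORY" --become -m shell -a "$1"
}

# Run each triage command in turn; return non-zero if any of them failed.
quick_triage() {
    local cmd rc=0
    for cmd in \
        "pcs status" \
        "podman ps -a --filter name=etcd" \
        "$ETCDCTL endpoint health"
    do
        echo "=== $cmd ==="
        # Keep going on failure so the output of all three checks is visible.
        run_on_cluster "$cmd" || rc=1
    done
    return "$rc"
}
```

Calling `quick_triage` prints one labelled section per command. If the output shows only stale failed actions, the cleanup mentioned above could be sent through the same helper, for example `run_on_cluster "pcs resource cleanup etcd"`.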
**Full Diagnostic Collection (for complex/unclear issues)**

If quick triage reveals complex problems or the root cause is unclear, run the comprehensive diagnostic script:

```bash
./helpers/etcd/collect-all-diagnostics.sh
```

This script (~5-10 minutes):

- Validates Ansible and cluster access automatically
- Collects all VM-level diagnostics via Ansible playbook
- Collects OpenShift cluster-level data (if accessible)
- Saves everything to `/tmp/etcd-diagnostics-<timestamp>/`
- Generates a `DIAGNOSTIC_REPORT.txt` with analysis commands

Use the full collection when:

- Quick triage doesn't reveal the cause
- Multiple components appear affected
- You need to preserve diagnostic data for later analysis
- The issue involves cluster ID mismatches or split-brain scenarios

**Manual Collection Commands**

For manual data collection, use the commands below.

**IMPORTANT: Target the Correct Host Group**

- **All etcd/Pacemaker commands** must target the `cluster_vms` host group (the OpenShift cluster nodes)
- **VM lifecycle commands** (start/stop VMs) target the hypervisor host
- Use Ansible ad-hoc commands with `-m shell` or run playbooks that target `cluster_vms`
- All