Etcd Troubleshooting Skill

This document defines the Claude Code skill for troubleshooting etcd issues on two-node OpenShift clusters with fencing topology. When activated, Claude becomes an expert etcd/Pacemaker troubleshooter capable of iterative diagnosis and remediation.

Install in one line

mfkvault install openshift-eng-etcd-troubleshooting-skill

Requires the MFKVault CLI.

Description

## Skill Overview

This skill enables Claude to:

- Validate and test access to cluster components via Ansible and the OpenShift CLI
- Iteratively collect diagnostic data from Pacemaker, etcd, and OpenShift
- Analyze symptoms and identify root causes
- Propose and execute remediation steps
- Verify fixes and adjust the approach based on results
- Provide comprehensive troubleshooting throughout the diagnostic process

## Step-by-Step Procedure

### 1. Validate Access

**1.1 Ansible Inventory Validation:**

- Check whether `deploy/openshift-clusters/inventory.ini` exists
- Verify the inventory file has valid cluster node entries
- Test SSH connectivity to cluster nodes using the Ansible ping module

**1.2 OpenShift Cluster Access Validation:**

- Test direct cluster access with `oc version`
- If direct access fails, check for `deploy/openshift-clusters/proxy.env`
- If `proxy.env` exists, source it before running `oc` commands
- Verify cluster access with `oc get nodes`
- Remember the proxy requirement for all subsequent `oc` commands

**IMPORTANT: No Cluster Access Scenario**

If OpenShift cluster API access is unavailable (which is expected when etcd is down), **all diagnostics and remediation must be performed via Ansible** using direct VM access.
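The access checks in step 1 can be sketched as a small script. The inventory and proxy.env paths come from this description; the `cluster_vms` group name is the one this skill's host-group note uses, and the `DRY_RUN` guard is an addition so the sketch can be previewed on a workstation without cluster access:

```shell
# Sketch of step 1 (access validation). DRY_RUN=1 (the default here) prints
# each external command instead of running it; clear DRY_RUN on a real
# workstation with ansible and oc installed.
DRY_RUN=${DRY_RUN-1}
INVENTORY=deploy/openshift-clusters/inventory.ini
PROXY_ENV=deploy/openshift-clusters/proxy.env

run() { if [ -n "$DRY_RUN" ]; then echo "+ $*"; else "$@"; fi; }

# 1.1 Inventory validation: file present, then SSH reachability via ping module
[ -f "$INVENTORY" ] || echo "warning: $INVENTORY not found" >&2
run ansible -i "$INVENTORY" cluster_vms -m ping

# 1.2 Cluster access: try oc directly; on failure, source proxy.env (if it
# exists) so the proxy settings apply to this and all later oc commands
if ! run oc version >/dev/null 2>&1; then
  [ -f "$PROXY_ENV" ] && . "$PROXY_ENV"
fi
run oc get nodes
```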
The troubleshooting workflow remains fully functional using only:

- Ansible ad-hoc commands to cluster VMs
- Ansible playbooks for diagnostics collection
- Direct SSH access to nodes via Ansible

When cluster access is unavailable:

- ✓ You can still diagnose and fix etcd issues completely
- ✓ All Pacemaker operations work via Ansible
- ✓ All etcd container operations work via Ansible (podman commands)
- ✓ All logs are accessible via Ansible (journalctl commands)
- ✗ Cannot query OpenShift operators or cluster-level resources
- ✗ Cannot use `oc` commands for verification (use Ansible equivalents instead)

This is a **normal scenario** when etcd is down - proceed with VM-based troubleshooting.

### 2. Collect Data

**Choose Your Diagnostic Approach**

There are two approaches to data collection:

**Quick Manual Triage (recommended for initial assessment)**

Start with a few targeted commands to assess the situation:

- `pcs status` - check Pacemaker cluster state and failed actions
- `podman ps -a --filter name=etcd` - verify etcd containers are running
- `etcdctl endpoint health` - confirm etcd health

This takes ~30 seconds and is often sufficient to identify simple issues (stale failures, container restarts, etc.) that can be fixed immediately with `pcs resource cleanup etcd`.
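When the cluster API is down and you are working over Ansible rather than an interactive shell on a node, the three triage commands above can be wrapped in a small ad-hoc helper. A sketch, assuming the inventory path and `cluster_vms` group from this description; the helper prints each invocation so it can be reviewed first, then piped to `sh` to run:

```shell
# triage prints the Ansible ad-hoc invocation for one quick-triage command
# without executing it. -b escalates privileges, which pcs and podman
# typically require on the cluster nodes.
triage() {
  printf "ansible -i deploy/openshift-clusters/inventory.ini cluster_vms -b -m shell -a '%s'\n" "$1"
}

triage "pcs status"
triage "podman ps -a --filter name=etcd"
triage "etcdctl endpoint health"

# Execute one for real by piping the printed command to a shell:
#   triage "pcs status" | sh
```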
**Full Diagnostic Collection (for complex/unclear issues)**

If quick triage reveals complex problems or the root cause is unclear, run the comprehensive diagnostic script:

```bash
./helpers/etcd/collect-all-diagnostics.sh
```

This script (~5-10 minutes):

- Validates Ansible and cluster access automatically
- Collects all VM-level diagnostics via Ansible playbook
- Collects OpenShift cluster-level data (if accessible)
- Saves everything to `/tmp/etcd-diagnostics-<timestamp>/`
- Generates a `DIAGNOSTIC_REPORT.txt` with analysis commands

Use the full collection when:

- Quick triage doesn't reveal the cause
- Multiple components appear affected
- You need to preserve diagnostic data for later analysis
- The issue involves cluster ID mismatches or split-brain scenarios

**Manual Collection Commands**

For manual data collection, use the commands below.

**IMPORTANT: Target the Correct Host Group**

- **All etcd/Pacemaker commands** must target the `cluster_vms` host group (the OpenShift cluster nodes)
- **VM lifecycle commands** (start/stop VMs) target the hypervisor host
- Use Ansible ad-hoc commands with `-m shell` or run playbooks that target `cluster_vms`
- All
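The host-group split above can be illustrated with a minimal inventory layout. This is a hypothetical sketch: the host names, addresses, and the `hypervisor` group name are illustrative assumptions; only the `cluster_vms` group name is given by this skill.

```ini
; deploy/openshift-clusters/inventory.ini (illustrative layout only)

; OpenShift cluster nodes: all etcd/Pacemaker commands target this group
[cluster_vms]
master-0 ansible_host=192.0.2.10
master-1 ansible_host=192.0.2.11

; Host running the VMs: VM start/stop (lifecycle) commands target this group
[hypervisor]
kvm-host ansible_host=192.0.2.1
```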
