Worker fails every message immediately (retry_count=0), 29k+ messages stuck in 'failed' status — restart dies during startup #2718

@ankurelexbit


Summary

claude-mem v10.6.2 is in a state where the worker accepts queued messages from hooks but fails every single message on the first attempt with zero retries. The queue has grown to ~29,400 failed rows; session_summaries and observations tables have been frozen since 2026-04-17 despite hooks firing on every session daily.

Evidence

sqlite3 ~/.claude-mem/claude-mem.db \
  "SELECT status, message_type, COUNT(*), MAX(retry_count) AS max_retries
   FROM pending_messages GROUP BY status, message_type"
failed  | observation | 22490 | 0
failed  | summarize   |  6907 | 0
pending | summarize   |    46 | 0

Date range of failures (oldest → newest):

2026-03-26 20:01:17 | failed | summarize   | retry_count=0
2026-05-30 12:03:11 | failed | observation | retry_count=0

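Failures span 2026-03-26 through 2026-05-30. Some messages evidently did succeed before 2026-04-17 (session_summaries and observations were still being written until then), but every row still in the queue is either failed or pending. retry_count=0 on ALL of them means that whatever exception the worker hits is not in the retry path, so the message goes straight to failed.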

For comparison, here is the state of the other tables. Only user_prompts is still being written; the two tables fed by the worker are frozen:

  • user_prompts: 10,417 rows, current through today (every user prompt captured fine)
  • session_summaries: 307 rows, frozen at 2026-04-17T07:26 (when the rich-summary write last succeeded)
  • observations: 512 rows, frozen at 2026-04-17T07:21

Restart fails

$ bun ~/.claude/plugins/.../scripts/worker-cli.js restart
Failed to restart: Process died during startup

The status command also misbehaves: it returns what looks like a Claude Code hook ACK ({"continue": true, "suppressOutput": true}) instead of worker status, which suggests that worker-cli.js is being intercepted by the hook layer.

The worker process IS bound to port 37777 (lsof -i :37777 shows the bun process running), but every message it processes lands in failed. So it's not a port-binding issue — it's a per-message processing exception.
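
For anyone who wants to repeat the port check without lsof, here is a minimal sketch in Python. The helper name port_is_listening is hypothetical and not part of claude-mem. A successful connect only shows that something is listening on the port; it says nothing about whether messages are processed correctly, which is exactly the distinction made above.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 37777 is the worker port named in this report.
    print("worker port open:", port_is_listening("127.0.0.1", 37777))
```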

Schema observation

pending_messages schema has failed_at_epoch but no error-reason column — failures are recorded silently with no diagnostic detail. Adding a failure_reason TEXT (or error TEXT) column and writing the exception message there would massively help future debugging.
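
To make the suggestion concrete, here is a rough sketch using Python's built-in sqlite3 module against an in-memory database. It is not the plugin's code. The table name and the status, retry_count and failed_at_epoch columns come from this report; the id primary key, the helper names, and the unit of failed_at_epoch (seconds here) are assumptions.

```python
import sqlite3
import time


def ensure_failure_reason_column(conn: sqlite3.Connection) -> None:
    """Add failure_reason to pending_messages unless it already exists."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(pending_messages)")]
    if "failure_reason" not in columns:
        conn.execute("ALTER TABLE pending_messages ADD COLUMN failure_reason TEXT")


def mark_failed(conn: sqlite3.Connection, message_id: int,
                exc: BaseException, now_epoch: int) -> None:
    """Mark a message as failed and store the exception type and text."""
    conn.execute(
        "UPDATE pending_messages "
        "SET status = 'failed', failed_at_epoch = ?, failure_reason = ? "
        "WHERE id = ?",  # `id` is an assumed primary-key name
        (now_epoch, f"{type(exc).__name__}: {exc}", message_id),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema, for illustration only.
    conn.execute(
        "CREATE TABLE pending_messages ("
        "id INTEGER PRIMARY KEY, status TEXT, message_type TEXT, "
        "retry_count INTEGER DEFAULT 0, failed_at_epoch INTEGER)"
    )
    conn.execute(
        "INSERT INTO pending_messages (status, message_type) "
        "VALUES ('pending', 'summarize')"
    )
    ensure_failure_reason_column(conn)
    ensure_failure_reason_column(conn)  # second call is a no-op
    try:
        raise ValueError("example failure")
    except ValueError as exc:
        mark_failed(conn, 1, exc, int(time.time()))
    print(conn.execute(
        "SELECT status, failure_reason FROM pending_messages").fetchall())
```

With a column like this, the GROUP BY query from the Evidence section could also group on failure_reason and show at a glance whether all ~29,400 rows share one cause.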

What I tried

  1. worker-cli.js status — broken (returns hook ACK)
  2. worker-cli.js restart — fails with "Process died during startup"
  3. Direct bun .../worker-service.cjs — runs but disrupts the MCP server bound to port 37777 mid-execution (MCP search tools disconnected from the session)
  4. Inspecting logs in ~/.claude-mem/logs/ — only INFO-level entries about hook firing; no ERROR/WARN from the failing message-processing path

Environment

  • Plugin: claude-mem v10.6.2
  • Claude Code: 2.1.158
  • Node: v24.2.0
  • Bun: 1.3.11
  • macOS: 15.4.1
  • Install: via thedotmack marketplace
  • Active hooks per hooks.json: beforeSubmitPrompt, afterMCPExecution, afterShellExecution, afterFileEdit, stop

Suggested fixes

  1. Log the exception when a message transitions to failed. Either add a failure_reason column to pending_messages or write structured errors to the log file with the message ID. Currently failures are completely silent — impossible to diagnose without source-diving the worker.
  2. Implement retry. Every message having retry_count=0 while in failed status suggests the retry path isn't wired. Even a basic exponential backoff (max 3 retries) would surface transient failures vs persistent ones.
  3. Fix the restart path. "Process died during startup" with no further info is unhelpful — propagate the startup exception.
  4. Document a recovery procedure in the README — e.g. how to clear the queue, reset the worker, etc., for users in this exact state.
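
To illustrate fixes 1 and 2 together, here is a minimal sketch of the behaviour I would expect, written in Python only for brevity. It is not how the worker is implemented (I have not read that code). The dict-shaped message, the 'processed' status, the one-second base delay and all function names are assumptions; only retry_count, the failed status and the three-retry cap come from this report.

```python
import time
from typing import Callable

MAX_RETRIES = 3      # cap suggested above
BASE_DELAY_S = 1.0   # assumed base delay


def backoff_delay(retry_count: int, base: float = BASE_DELAY_S) -> float:
    """Delay before the next retry: base, 2*base, 4*base, ..."""
    return base * (2 ** retry_count)


def process_with_retry(message: dict,
                       handler: Callable[[dict], None],
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Run handler(message); retry with backoff, then fail with a reason."""
    while True:
        try:
            handler(message)
            message["status"] = "processed"  # hypothetical success status
            return message
        except Exception as exc:
            if message["retry_count"] >= MAX_RETRIES:
                # Retries exhausted: record why, instead of failing silently.
                message["status"] = "failed"
                message["failure_reason"] = f"{type(exc).__name__}: {exc}"
                return message
            sleep(backoff_delay(message["retry_count"]))
            message["retry_count"] += 1


if __name__ == "__main__":
    def always_fails(_msg: dict) -> None:
        raise RuntimeError("simulated processing error")

    msg = {"id": 1, "status": "pending", "retry_count": 0}
    # Pass a no-op sleep so the demo does not actually wait.
    print(process_with_retry(msg, always_fails, sleep=lambda _s: None))
```

A real worker would more likely reschedule the message than block in a sleep loop, but the observable result is the same: a persistently failing message ends up in failed with retry_count=3 and a recorded reason, while a transient failure recovers on a later attempt.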

Workaround for now

Falling back to Claude Code's native auto-memory system (writes to ~/.claude/projects/<slug>/memory/*.md during sessions) which is independent of this plugin and unaffected.

Happy to provide more diagnostic detail if useful — DB dump, log files, anything else.
