Worker fails every message immediately (retry_count=0), 29k+ messages stuck in 'failed' status — restart dies during startup

## Summary

claude-mem v10.6.2 is in a state where the worker accepts queued messages from hooks but **fails every single message on the first attempt with zero retries**. The queue has grown to ~29,400 failed rows; `session_summaries` and `observations` tables have been frozen since 2026-04-17 despite hooks firing on every session daily.

## Evidence

```bash
sqlite3 ~/.claude-mem/claude-mem.db \
  "SELECT status, message_type, COUNT(*), MAX(retry_count) AS max_retries
   FROM pending_messages GROUP BY status, message_type"
```

```
failed | observation | 22490 | 0
failed | summarize   |  6907 | 0
pending| summarize   |    46 | 0
```

Date range of failures (oldest → newest):

```
2026-03-26 20:01:17 | failed | summarize   | retry_count=0
2026-05-30 12:03:11 | failed | observation | retry_count=0
```

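Failures span the entire install window: the oldest failed row dates from 2026-03-26, even though some writes evidently still succeeded until 2026-04-17 (see the table counts below). `retry_count=0` on every row suggests that whatever exception the worker hits is outside the retry path, so the message goes straight to `failed` without a second attempt.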

For comparison, the state of the other tables (only `user_prompts` is still being written):
- `user_prompts`: 10,417 rows, current through today (every user prompt captured fine)
- `session_summaries`: 307 rows, frozen at 2026-04-17T07:26 (when the rich-summary write last succeeded)
- `observations`: 512 rows, frozen at 2026-04-17T07:21
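
In case it helps with reproducing these numbers, here is a minimal Python sketch of the same queue summary plus the oldest/newest failure timestamp per group. The column names `status`, `message_type`, `retry_count` and `failed_at_epoch` come from the schema described in this issue; treating `failed_at_epoch` as milliseconds is an assumption on my part, and the helper names are made up for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def summarize_queue(conn):
    """Return one row per (status, message_type):
    (status, message_type, count, max_retries, oldest_failure, newest_failure).
    """
    return conn.execute(
        """
        SELECT status, message_type, COUNT(*), MAX(retry_count),
               MIN(failed_at_epoch), MAX(failed_at_epoch)
        FROM pending_messages
        GROUP BY status, message_type
        ORDER BY status, message_type
        """
    ).fetchall()


def fmt_epoch_ms(value):
    """Format an epoch value as UTC text.

    Assumption: the value is milliseconds since 1970-01-01 UTC.
    Rows that never failed have NULL here, which is shown as "-".
    """
    if value is None:
        return "-"
    moment = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")
```

To run it against the real database without risking a write, the connection can be opened read-only, e.g. `sqlite3.connect("file:/path/to/claude-mem.db?mode=ro", uri=True)`, then each row of `summarize_queue(conn)` printed with `fmt_epoch_ms` applied to the last two fields.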

## Restart fails

```bash
$ bun ~/.claude/plugins/.../scripts/worker-cli.js restart
Failed to restart: Process died during startup
```

`status` command also misbehaves — returns what looks like a Claude Code hook ACK `{"continue": true, "suppressOutput": true}` instead of worker status, suggesting `worker-cli.js` is being intercepted by the hook layer.

The worker process IS bound to port 37777 (`lsof -i :37777` shows the `bun` process running), but every message it processes lands in `failed`. So it's not a port-binding issue — it's a per-message processing exception.

## Schema observation

`pending_messages` schema has `failed_at_epoch` but **no error-reason column** — failures are recorded silently with no diagnostic detail. Adding a `failure_reason TEXT` (or `error TEXT`) column and writing the exception message there would massively help future debugging.
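
To make the suggestion concrete, here is a rough sketch of what I mean, written in Python against SQLite only because it is easy to run standalone; the worker itself is JavaScript, so this is not a patch. The `failure_reason` column is the proposed addition, the `id` primary key and the millisecond epoch are assumptions about the real schema, and both function names are hypothetical.

```python
import sqlite3
import time


def ensure_failure_reason_column(conn):
    """Add the proposed failure_reason column if it is not there yet."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(pending_messages)")]
    if "failure_reason" not in columns:
        conn.execute("ALTER TABLE pending_messages ADD COLUMN failure_reason TEXT")
        conn.commit()


def mark_failed(conn, message_id, exc):
    """Record the failure together with the exception that caused it."""
    reason = f"{type(exc).__name__}: {exc}"
    conn.execute(
        "UPDATE pending_messages "
        "SET status = 'failed', failed_at_epoch = ?, failure_reason = ? "
        "WHERE id = ?",  # assumption: rows are keyed by an `id` column
        (int(time.time() * 1000), reason, message_id),  # assumption: epoch in ms
    )
    conn.commit()
```

With something like this in place, a `GROUP BY failure_reason` over the failed rows would show right away whether all 29k failures share one cause.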

## What I tried

1. `worker-cli.js status` — broken (returns hook ACK)
2. `worker-cli.js restart` — fails with "Process died during startup"
3. Direct `bun .../worker-service.cjs` — runs but disrupts the MCP server bound to port 37777 mid-execution (MCP search tools disconnected from the session)
4. Inspecting logs in `~/.claude-mem/logs/` — only INFO-level entries about hook firing; no ERROR/WARN from the failing message-processing path

## Environment

- Plugin: claude-mem v10.6.2
- Claude Code: 2.1.158
- Node: v24.2.0
- Bun: 1.3.11
- macOS: 15.4.1
- Install: via thedotmack marketplace
- Active hooks per `hooks.json`: `beforeSubmitPrompt`, `afterMCPExecution`, `afterShellExecution`, `afterFileEdit`, `stop`

## Suggested fixes

1. **Log the exception** when a message transitions to `failed`. Either add a `failure_reason` column to `pending_messages` or write structured errors to the log file with the message ID. Currently failures are completely silent — impossible to diagnose without source-diving the worker.
2. **Implement retry**. Every message having `retry_count=0` while in `failed` status suggests the retry path isn't wired. Even a basic exponential backoff (max 3 retries) would distinguish transient failures from persistent ones.
3. **Fix the restart path**. "Process died during startup" with no further info is unhelpful — propagate the startup exception.
4. **Document a recovery procedure** in the README — e.g. how to clear the queue, reset the worker, etc., for users in this exact state.
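
For fixes 1 and 2 together, the behaviour I have in mind looks roughly like the sketch below. It is Python purely for illustration and makes no claim about how the worker is actually structured: `process_with_retry`, `handler` and the returned dict are all hypothetical, and the 3-retry limit and doubling delay are just the defaults suggested above.

```python
import time


def process_with_retry(handler, message, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call handler(message), retrying with exponential backoff on any exception.

    Returns a dict describing the final outcome so the caller can persist
    status, retry_count and failure_reason on the queue row.
    """
    retry_count = 0
    while True:
        try:
            handler(message)
            return {"status": "processed", "retry_count": retry_count, "failure_reason": None}
        except Exception as exc:
            if retry_count >= max_retries:
                # Retries exhausted: give up, but keep the reason for diagnosis.
                return {
                    "status": "failed",
                    "retry_count": retry_count,
                    "failure_reason": f"{type(exc).__name__}: {exc}",
                }
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** retry_count)
            retry_count += 1
```

The point of returning `retry_count` and `failure_reason` is that a persistently failing message would end up as `failed` with `retry_count=3` and a recorded reason, instead of `retry_count=0` and nothing else to go on.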

## Workaround for now

Falling back to Claude Code's native auto-memory system (writes to `~/.claude/projects/<slug>/memory/*.md` during sessions) which is independent of this plugin and unaffected.

Happy to provide more diagnostic detail if useful — DB dump, log files, anything else.
