Bug: Auto-resume sends to stale thread_id causing Feishu 99992402 'field validation failed' #35576

@whalemalus

Summary

When the gateway restarts and auto-resumes a previously interrupted session, the synthesized empty message carries the original source (including thread_id). If the Feishu message referenced by thread_id has been deleted/withdrawn during the downtime, _send_raw_message() uses that stale thread_id as receive_id, and the Feishu API returns [99992402] field validation failed. Both the primary send and the plain-text fallback fail with the same error, silently dropping the response.

This is a persistent, reproducible issue — not a transient API glitch. The error recurs on every gateway restart until the stale session is cleared.

Environment

| Item | Value |
| --- | --- |
| Hermes Agent | v0.15.1 (2026.5.29) |
| Python | 3.12.3 |
| OS | Linux 6.8.0-111-generic (Ubuntu, x86_64) |
| lark-oapi | 1.5.3 |
| Platform | Feishu (websocket mode) |
| Branch | main (commit 689ef5e23) |
| Upstream remote | https://github.com/NousResearch/hermes-agent.git |

Steps to Reproduce

  1. Start the gateway with Feishu connected via websocket.
  2. Send a message to the bot in a Feishu DM (creates a thread_id in the session source).
  3. While the agent is mid-turn processing, restart the gateway (hermes gateway restart).
  4. On startup, _schedule_resume_pending_sessions() detects the interrupted session and synthesizes a MessageEvent(text="", source=<original_source>).
  5. If the original message referenced by source.thread_id was deleted/withdrawn during the restart window, the response send fails.

Simplified reproduction: Delete or withdraw the Feishu message that originated the session, then restart the gateway.

Expected Behavior

The auto-resume mechanism should gracefully handle stale thread_id references. When 99992402 is returned, the adapter should fall back to sending the message at the chat level (without thread_id), similar to how it already handles 230011/231003 (reply target withdrawn/missing).

Actual Behavior

Both the primary send and the plain-text fallback fail with [99992402] field validation failed. The response is silently dropped — the user never sees it.

Logs

2026-05-30 22:39:45 gateway restart
2026-05-30 22:40:07 [Feishu] Connected in websocket mode (feishu)
2026-05-30 22:40:08 Scheduled auto-resume for 1 restart-interrupted session(s)
2026-05-30 22:40:08 inbound message: platform=feishu user=ou_xxx chat=oc_xxx msg=''  ← synthetic auto-resume event
2026-05-30 22:40:21 response ready: platform=feishu chat=oc_xxx time=12.5s api_calls=1 response=270 chars
2026-05-30 22:40:21 [Feishu] Sending response (270 chars) to oc_xxx
2026-05-30 22:40:22 WARNING [Feishu] Send failed: [99992402] field validation failed — trying plain-text fallback
2026-05-30 22:40:22 ERROR [Feishu] Fallback send also failed: [99992402] field validation failed

This has been observed consistently across multiple gateway restarts since May 16. The same error pattern also affects cron job deliveries that carry a thread_id in their origin metadata pointing to a deleted message.
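
To gauge how often the error recurs, the gateway log can be scanned for the bracketed Feishu code. The following is a minimal sketch, assuming log lines follow the format shown above; `count_feishu_codes` is a hypothetical helper, not part of the gateway, and the sample lines are abbreviated from the log excerpt.

```python
import re
from collections import Counter

# Matches the bracketed numeric code in lines such as
# "WARNING [Feishu] Send failed: [99992402] field validation failed".
_CODE_RE = re.compile(r"\[Feishu\].*?\[(\d+)\]")


def count_feishu_codes(lines):
    """Return a Counter mapping Feishu error code -> number of log lines."""
    counts = Counter()
    for line in lines:
        match = _CODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Abbreviated sample lines taken from the log excerpt in this report.
sample = [
    "2026-05-30 22:40:21 [Feishu] Sending response (270 chars) to oc_xxx",
    "2026-05-30 22:40:22 WARNING [Feishu] Send failed: [99992402] field validation failed",
    "2026-05-30 22:40:22 ERROR [Feishu] Fallback send also failed: [99992402] field validation failed",
]
print(count_feishu_codes(sample))
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines.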

Root Cause Analysis

1. Auto-resume carries stale source with thread_id

_schedule_resume_pending_sessions() at gateway/run.py:3915 creates a MessageEvent with the original source object:

# gateway/run.py:3915
event = MessageEvent(
    text="",
    message_type=MessageType.TEXT,
    source=source,  # ← carries the ORIGINAL thread_id
    internal=True,
)

2. _send_raw_message uses thread_id as receive_id

gateway/platforms/feishu.py:4408-4416 — when metadata.thread_id is set, the method sends with receive_id_type="thread_id":

_thread_id = (metadata or {}).get("thread_id")
if _thread_id:
    body = self._build_create_message_body(
        receive_id=_thread_id,  # ← stale thread_id
        msg_type=msg_type,
        content=payload,
        ...
    )
    request = self._build_create_message_request("thread_id", body)

3. 99992402 is NOT in _FEISHU_REPLY_FALLBACK_CODES

The retry logic in _feishu_send_with_retry() (line 4568) handles codes 230011 and 231003 (reply target withdrawn/missing) but not 99992402 (field validation failed):

# gateway/platforms/feishu.py:231
_FEISHU_REPLY_FALLBACK_CODES = frozenset({230011, 231003})

So 99992402 is treated as a non-network, non-retryable error, and falls through to the plain-text fallback in base.py:3034 — which sends with the same stale metadata, producing the same error.

4. Impact scope

This is not limited to auto-resume. Any send path that carries a thread_id pointing to a deleted/withdrawn Feishu message will hit this — including cron job deliveries where the origin.thread_id is stale.

Suggested Fix

Option A: Add 99992402 to _FEISHU_REPLY_FALLBACK_CODES (minimal)

# gateway/platforms/feishu.py:231
_FEISHU_REPLY_FALLBACK_CODES = frozenset({230011, 231003, 99992402})

This would cause _feishu_send_with_retry to fall back from thread_id to chat_id routing when 99992402 is returned, which is the same fallback already used for withdrawn reply targets.
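
The intended effect of Option A can be illustrated with a self-contained simulation. This is only a sketch under stated assumptions: `FeishuSendError`, `fake_api_send`, `send_with_fallback`, and the thread ID `omt_deleted` are hypothetical stand-ins, not the adapter's real API, and the real retry logic in _feishu_send_with_retry may be structured differently.

```python
# Hypothetical stand-ins: none of these names are the adapter's real API.
STALE_THREAD_ERROR = 99992402


class FeishuSendError(Exception):
    def __init__(self, code, msg):
        super().__init__(f"[{code}] {msg}")
        self.code = code


def fake_api_send(receive_id_type, receive_id, live_threads):
    """Simulated Feishu endpoint: rejects thread IDs that no longer exist."""
    if receive_id_type == "thread_id" and receive_id not in live_threads:
        raise FeishuSendError(STALE_THREAD_ERROR, "field validation failed")
    return f"sent via {receive_id_type}={receive_id}"


def send_with_fallback(chat_id, metadata, fallback_codes, live_threads):
    """Try thread routing first; retry at chat level on a fallback code."""
    thread_id = (metadata or {}).get("thread_id")
    if not thread_id:
        return fake_api_send("chat_id", chat_id, live_threads)
    try:
        return fake_api_send("thread_id", thread_id, live_threads)
    except FeishuSendError as exc:
        if exc.code not in fallback_codes:
            raise  # not a known fallback code: surface the error
        return fake_api_send("chat_id", chat_id, live_threads)


current_codes = frozenset({230011, 231003})
patched_codes = current_codes | {99992402}
stale = {"thread_id": "omt_deleted"}  # hypothetical ID of a deleted thread

try:
    send_with_fallback("oc_xxx", stale, current_codes, live_threads=set())
except FeishuSendError as exc:
    print("current behaviour:", exc)

print("patched behaviour:",
      send_with_fallback("oc_xxx", stale, patched_codes, live_threads=set()))
```

With the current code set the stale thread_id error propagates. With 99992402 added, the same call succeeds at chat level.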

Option B: Strip thread_id from auto-resume source (targeted)

# gateway/run.py — in _schedule_resume_pending_sessions()
import copy
safe_source = copy.copy(source)
safe_source.thread_id = None  # auto-resume doesn't need thread routing
event = MessageEvent(
    text="",
    message_type=MessageType.TEXT,
    source=safe_source,
    internal=True,
)
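
The copy step in Option B matters because the stored session source should not be mutated. The sketch below shows this in isolation, assuming the source is a simple object with a top-level thread_id attribute; `SessionSource` and `source_for_resume` are hypothetical names, and the real source type may differ.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionSource:  # hypothetical; the real source type may differ
    platform: str
    chat_id: str
    thread_id: Optional[str] = None


def source_for_resume(source):
    """Return a shallow copy of source with thread routing removed."""
    safe = copy.copy(source)
    safe.thread_id = None
    return safe


original = SessionSource("feishu", "oc_xxx", thread_id="omt_deleted")
resumed = source_for_resume(original)
print(resumed.thread_id, original.thread_id)
```

A shallow copy is enough here because only a top-level attribute is reassigned. Nested mutable attributes, if the real type has any, would still be shared between the two objects.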

Recommendation

Option A is preferred because it handles all stale-thread_id paths (auto-resume, cron delivery, any future code path), not just auto-resume. It's also a one-line change with clear precedent in the existing fallback code. Option B could be added as a defense-in-depth layer on top.

Additional Context

  • The error has been observed since at least May 16, 2026.
  • Cron jobs with origin.thread_id pointing to deleted messages produce the same error (workaround: deliver: local).
  • 53 occurrences of 99992402 in the gateway log from a single session.
  • The Feishu error code 99992402 means "field validation failed" — in this context, the thread_id field value fails validation because the referenced message no longer exists.

Labels: P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), platform/feishu (Feishu / Lark adapter), type/bug (Something isn't working)
