Skip to content

<fix>[mn]: synchronize hash ring operations to prevent dual-MN task stalling#4157

Open
zstack-robot-1 wants to merge 1 commit into
4.8.38from
sync/yaohua.wu/bugfix/ZSTAC-78182
Open

<fix>[mn]: synchronize hash ring operations to prevent dual-MN task stalling#4157
zstack-robot-1 wants to merge 1 commit into
4.8.38from
sync/yaohua.wu/bugfix/ZSTAC-78182

Conversation

@zstack-robot-1
Copy link
Copy Markdown
Collaborator

Backport ZSTAC-78182 to 4.8.38

Cherry-pick of original fix from 原单 ZSTAC-77711 (source: 5.5.12).

Original commit: 26b8b1a
Local cherry-pick HEAD: 66ab572

Changes

 .../cloudbus/ResourceDestinationMakerImpl.java     |  28 +++---
 .../managementnode/ManagementNodeManagerImpl.java  | 101 ++++++++++++++-------
 2 files changed, 82 insertions(+), 47 deletions(-)

Related: ZSTAC-78182

sync from gitlab !10062

…talling

In dual management node scenarios, concurrent modifications to the
consistent hash ring from heartbeat reconciliation and canonical event
callbacks can cause NodeHash/Nodes inconsistency, leading to message
routing failures and task timeouts.

Fix: (1) synchronized all ResourceDestinationMakerImpl methods to
ensure atomic nodeHash+nodes updates, (2) added lifecycleLock in
ManagementNodeManagerImpl to serialize heartbeat reconciliation with
event callbacks, (3) added two-round delayed confirmation before
removing nodes from hash ring to avoid race with NodeJoin events.

Resolves: ZSTAC-77711

Change-Id: I3d33d53595dd302784dff17417a5b25f2d0f3426
(cherry picked from commit 26b8b1a)
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

走查总结

该 PR 通过两个关键文件的改动,解决了管理节点哈希环在并发场景下的数据竞态和状态不一致问题。ResourceDestinationMaker 通过添加 synchronized 关键字和防御性拷贝强化了并发访问控制;ManagementNodeManager 通过引入生命周期锁和两轮确认缺失机制,协调节点生命周期事件处理与心跳对账流程,防止错误的节点驱逐。

变更

ResourceDestinationMaker 哈希环同步化

层 / 文件(s) 说明
哈希环状态修改操作同步化
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java
nodeJoinnodeLeftiAmDeadiJoin 方法添加 synchronized 关键字,序列化对 nodeHash 和 nodes 内部状态的更新。
哈希环状态读取和防御性拷贝
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java
makeDestinationisManagedByUsgetManagementNodesInHashRinggetNodeInfogetAllNodeInfogetManagementNodeCountisNodeInCircle 添加 synchronized 关键字;getManagementNodesInHashRinggetAllNodeInfo 返回防御性拷贝而非直接暴露底层集合;getManagementNodeCount 优化为直接读取 nodes.size()

ManagementNodeManager 生命周期和心跳协调

层 / 文件(s) 说明
并发协调基础设施
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
新增 HashSet 导入;定义 lifecycleLock 对象锁用于序列化生命周期事件处理,以及 suspectedMissingFromDb HashSet 用于两轮确认缺失节点的追踪。
生命周期事件回调同步化
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
nodeLifeCycleCallbacklifecycleLock 上添加同步块;NodeJoin 事件处理中清除对应节点的 suspectedMissingFromDb 标记后转发事件;NodeLeft 和未知生命周期分支在同一锁内按原有语义处理。
心跳线程对账逻辑重构和两轮确认机制
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
将节点对账逻辑移入 lifecycleLock 保护的临界区;对哈希环中存在但数据库缺失的节点实现两轮确认机制(第一轮加入 suspectedMissingFromDb,第二轮确认缺失后才驱逐);对数据库中存在但哈希环缺失的节点在锁内调度异步补齐。

代码审查工作量评估

🎯 4 (复杂) | ⏱️ ~45 分钟

诗歌

哈希环中兔儿忙,
添把锁来护状态,
生命周期须同步,
心跳对账两轮验,🐰
竞态消散安心强。

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and specifically describes the main change: synchronizing hash ring operations to prevent dual-MN task stalling, which aligns with the core modifications to ResourceDestinationMakerImpl and ManagementNodeManagerImpl.
Description check ✅ Passed The PR description provides relevant context: it references the backport of ZSTAC-78182 from 5.5.12 to 4.8.38, includes commit information, and describes the synchronization changes made to prevent task stalling.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/yaohua.wu/bugfix/ZSTAC-78182

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java (1)

80-93: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

修复 getNodeInfo 首次加载时返回 null 的逻辑错误

Line 89 使用了 Map.put 的返回值(旧值)赋给 info。在缓存未命中时旧值是 null,方法会返回 null,下游容易触发 NPE。

建议修复
     `@Override`
     public synchronized NodeInfo getNodeInfo(String nodeUuid) {
         NodeInfo info = nodes.get(nodeUuid);
         if (info == null) {
             ManagementNodeVO vo = dbf.findByUuid(nodeUuid, ManagementNodeVO.class);
             if (vo == null) {
                 throw new ManagementNodeNotFoundException(nodeUuid);
             }

             nodeHash.add(nodeUuid);
-            info = nodes.put(nodeUuid, new NodeInfo(vo));
+            info = new NodeInfo(vo);
+            nodes.put(nodeUuid, info);
         }

         return info;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`
around lines 80 - 93, The getNodeInfo method incorrectly assigns info =
nodes.put(nodeUuid, new NodeInfo(vo)) which returns the previous value (null on
cache miss); change it to create a new NodeInfo instance (e.g., NodeInfo newInfo
= new NodeInfo(vo)), add the uuid to nodeHash, put the newInfo into the nodes
map (nodes.put(nodeUuid, newInfo)) and set info to newInfo (or use
nodes.computeIfAbsent/node-put-if-absent patterns) so the method returns the
actual NodeInfo instead of null; reference symbols: getNodeInfo, NodeInfo,
nodes, nodeHash, ManagementNodeVO, dbf.findByUuid.
🧹 Nitpick comments (1)
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java (1)

869-873: 💤 Low value

考虑记录更具体的异常信息。

当前 catch (Exception e) 捕获所有异常并仅打印通用警告。虽然回退到空 hostname 是合理的容错策略,但建议在日志中包含异常详情,以便排查 getNodeInfo 失败的根本原因。

♻️ 建议的改进
                                 try {
                                     inv.setHostName(destinationMaker.getNodeInfo(nodeUuid).getNodeIP());
                                 } catch (Exception e) {
-                                    logger.warn(String.format("cannot get node info for node[uuid:%s], use empty hostname", nodeUuid));
+                                    logger.warn(String.format("cannot get node info for node[uuid:%s], use empty hostname: %s", nodeUuid, e.getMessage()));
                                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`
around lines 869 - 873, The catch block in ManagementNodeManagerImpl around
destinationMaker.getNodeInfo(nodeUuid) currently swallows all exceptions and
logs only a generic warning; change the logger call to include the caught
exception details (e.g., pass the exception object or append e.toString()) so
the warning shows the failure reason for getNodeInfo(nodeUuid) while still
falling back to empty hostname; update the catch for the block that calls
inv.setHostName(...) to log the exception (logger.warn with message and e) to
aid debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`:
- Around line 80-93: The getNodeInfo method incorrectly assigns info =
nodes.put(nodeUuid, new NodeInfo(vo)) which returns the previous value (null on
cache miss); change it to create a new NodeInfo instance (e.g., NodeInfo newInfo
= new NodeInfo(vo)), add the uuid to nodeHash, put the newInfo into the nodes
map (nodes.put(nodeUuid, newInfo)) and set info to newInfo (or use
nodes.computeIfAbsent/node-put-if-absent patterns) so the method returns the
actual NodeInfo instead of null; reference symbols: getNodeInfo, NodeInfo,
nodes, nodeHash, ManagementNodeVO, dbf.findByUuid.

---

Nitpick comments:
In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`:
- Around line 869-873: The catch block in ManagementNodeManagerImpl around
destinationMaker.getNodeInfo(nodeUuid) currently swallows all exceptions and
logs only a generic warning; change the logger call to include the caught
exception details (e.g., pass the exception object or append e.toString()) so
the warning shows the failure reason for getNodeInfo(nodeUuid) while still
falling back to empty hostname; update the catch for the block that calls
inv.setHostName(...) to log the exception (logger.warn with message and e) to
aid debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 41c9c53a-dae7-467b-bb4f-2767ca354f2b

📥 Commits

Reviewing files that changed from the base of the PR and between 00abf62 and 66ab572.

📒 Files selected for processing (2)
  • core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java
  • portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants