<fix>[mn]: synchronize hash ring operations to prevent dual-MN task stalling#4157
<fix>[mn]: synchronize hash ring operations to prevent dual-MN task stalling#4157zstack-robot-1 wants to merge 1 commit into
Conversation
…talling In dual management node scenarios, concurrent modifications to the consistent hash ring from heartbeat reconciliation and canonical event callbacks can cause NodeHash/Nodes inconsistency, leading to message routing failures and task timeouts. Fix: (1) synchronized all ResourceDestinationMakerImpl methods to ensure atomic nodeHash+nodes updates, (2) added lifecycleLock in ManagementNodeManagerImpl to serialize heartbeat reconciliation with event callbacks, (3) added two-round delayed confirmation before removing nodes from hash ring to avoid race with NodeJoin events. Resolves: ZSTAC-77711 Change-Id: I3d33d53595dd302784dff17417a5b25f2d0f3426 (cherry picked from commit 26b8b1a)
走查总结该 PR 通过两个关键文件的改动,解决了管理节点哈希环在并发场景下的数据竞态和状态不一致问题。ResourceDestinationMaker 通过添加 synchronized 关键字和防御性拷贝强化了并发访问控制;ManagementNodeManager 通过引入生命周期锁和两轮确认缺失机制,协调节点生命周期事件处理与心跳对账流程,防止错误的节点驱逐。 变更ResourceDestinationMaker 哈希环同步化
ManagementNodeManager 生命周期和心跳协调
代码审查工作量评估🎯 4 (复杂) | ⏱️ ~45 分钟 诗歌
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java (1)
80-93:⚠️ Potential issue | 🟠 Major | ⚡ Quick win修复
getNodeInfo首次加载时返回null的逻辑错误Line 89 使用了
Map.put的返回值(旧值)赋给info。在缓存未命中时旧值是null,方法会返回null,下游容易触发 NPE。建议修复
`@Override` public synchronized NodeInfo getNodeInfo(String nodeUuid) { NodeInfo info = nodes.get(nodeUuid); if (info == null) { ManagementNodeVO vo = dbf.findByUuid(nodeUuid, ManagementNodeVO.class); if (vo == null) { throw new ManagementNodeNotFoundException(nodeUuid); } nodeHash.add(nodeUuid); - info = nodes.put(nodeUuid, new NodeInfo(vo)); + info = new NodeInfo(vo); + nodes.put(nodeUuid, info); } return info; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java` around lines 80 - 93, The getNodeInfo method incorrectly assigns info = nodes.put(nodeUuid, new NodeInfo(vo)) which returns the previous value (null on cache miss); change it to create a new NodeInfo instance (e.g., NodeInfo newInfo = new NodeInfo(vo)), add the uuid to nodeHash, put the newInfo into the nodes map (nodes.put(nodeUuid, newInfo)) and set info to newInfo (or use nodes.computeIfAbsent/node-put-if-absent patterns) so the method returns the actual NodeInfo instead of null; reference symbols: getNodeInfo, NodeInfo, nodes, nodeHash, ManagementNodeVO, dbf.findByUuid.
🧹 Nitpick comments (1)
portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java (1)
869-873: 💤 Low value考虑记录更具体的异常信息。
当前
catch (Exception e)捕获所有异常并仅打印通用警告。虽然回退到空 hostname 是合理的容错策略,但建议在日志中包含异常详情,以便排查getNodeInfo失败的根本原因。♻️ 建议的改进
try { inv.setHostName(destinationMaker.getNodeInfo(nodeUuid).getNodeIP()); } catch (Exception e) { - logger.warn(String.format("cannot get node info for node[uuid:%s], use empty hostname", nodeUuid)); + logger.warn(String.format("cannot get node info for node[uuid:%s], use empty hostname: %s", nodeUuid, e.getMessage())); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java` around lines 869 - 873, The catch block in ManagementNodeManagerImpl around destinationMaker.getNodeInfo(nodeUuid) currently swallows all exceptions and logs only a generic warning; change the logger call to include the caught exception details (e.g., pass the exception object or append e.toString()) so the warning shows the failure reason for getNodeInfo(nodeUuid) while still falling back to empty hostname; update the catch for the block that calls inv.setHostName(...) to log the exception (logger.warn with message and e) to aid debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In
`@core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.java`:
- Around line 80-93: The getNodeInfo method incorrectly assigns info =
nodes.put(nodeUuid, new NodeInfo(vo)) which returns the previous value (null on
cache miss); change it to create a new NodeInfo instance (e.g., NodeInfo newInfo
= new NodeInfo(vo)), add the uuid to nodeHash, put the newInfo into the nodes
map (nodes.put(nodeUuid, newInfo)) and set info to newInfo (or use
nodes.computeIfAbsent/node-put-if-absent patterns) so the method returns the
actual NodeInfo instead of null; reference symbols: getNodeInfo, NodeInfo,
nodes, nodeHash, ManagementNodeVO, dbf.findByUuid.
---
Nitpick comments:
In
`@portal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java`:
- Around line 869-873: The catch block in ManagementNodeManagerImpl around
destinationMaker.getNodeInfo(nodeUuid) currently swallows all exceptions and
logs only a generic warning; change the logger call to include the caught
exception details (e.g., pass the exception object or append e.toString()) so
the warning shows the failure reason for getNodeInfo(nodeUuid) while still
falling back to empty hostname; update the catch for the block that calls
inv.setHostName(...) to log the exception (logger.warn with message and e) to
aid debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 41c9c53a-dae7-467b-bb4f-2767ca354f2b
📒 Files selected for processing (2)
core/src/main/java/org/zstack/core/cloudbus/ResourceDestinationMakerImpl.javaportal/src/main/java/org/zstack/portal/managementnode/ManagementNodeManagerImpl.java
Backport ZSTAC-78182 to 4.8.38
Cherry-pick of original fix from 原单 ZSTAC-77711 (source: 5.5.12).
Original commit: 26b8b1a
Local cherry-pick HEAD: 66ab572
Changes
Related: ZSTAC-78182
sync from gitlab !10062