test: ui-smoke phase 2, g-code execution and endpoint check #4054

Open
grandixximo wants to merge 7 commits into LinuxCNC:master from grandixximo:ui-tests-phase2

Conversation

@grandixximo
Contributor

@grandixximo grandixximo commented May 24, 2026

Phase 2 of the GUI test work tracked in #3756. Stacked on #3999 (now merged).

Summary

Extends each per-GUI ui-smoke test landing in #3999 with an end-to-end g-code execution stage:

  • Estop reset, machine on, home all (via c.home(-1), respects HOME_SEQUENCE).
  • MODE_AUTO, program_open on a shared _lib/smoke.ngc, auto(AUTO_RUN, 0).
  • Poll linuxcnc.stat until interp_state == INTERP_IDLE and queue == 0 for 5 consecutive 10 ms polls (settle window guards the inter-line moment where interp briefly reports IDLE while queueing the next move).
  • Assert stat.position[:3] delta against --expect-delta-mm 1,1,0, converted via stat.linear_units so the same arg works on inch (axis, touchy) and mm (gmoccapy, qtdragon) sims.
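
The settle window in the polling step can be sketched as follows. This is a minimal illustration, not the code in drive.py: `wait_sustained_idle`, `read_state`, and the default timeout are hypothetical, and `INTERP_IDLE` is a local stand-in for `linuxcnc.INTERP_IDLE`. Only the "5 consecutive 10 ms polls" figures come from the description above.

```python
import time

INTERP_IDLE = 1  # stand-in; a real driver would compare against linuxcnc.INTERP_IDLE


def wait_sustained_idle(read_state, settle_polls=5, poll_s=0.01, timeout_s=30.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Return True once (interp_state, queue) reads idle/empty on
    `settle_polls` consecutive polls, or False on timeout.

    read_state is a hypothetical callable returning (interp_state, queue),
    e.g. after calling stat.poll() on a linuxcnc.stat object.
    """
    deadline = clock() + timeout_s
    streak = 0
    while clock() < deadline:
        interp_state, queue = read_state()
        if interp_state == INTERP_IDLE and queue == 0:
            streak += 1
            if streak >= settle_polls:
                return True
        else:
            # A brief IDLE between two lines does not count: start over.
            streak = 0
        sleep(poll_s)
    return False
```

Resetting the streak on any non-idle sample is what filters out the inter-line moment where the interpreter reports IDLE for a single poll.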

No new test directories. The four existing tests/ui-smoke/{axis,touchy,gmoccapy,qtdragon}/test.sh now also pass the --run-program and --expect-delta-mm flags through run-gui.sh. The Phase 1 connect-and-settle path stays as-is when those flags are omitted.

smoke.ngc

G21
G91
G0 X1 Y1
G90
M2

Forces mm input units (G21) and uses relative motion (G91) so the same file is sim-agnostic. The driver records stat.position[:3] after homing and checks (final - start) against the expected delta converted to machine units. This sidesteps each sim's HOME offset (axis homes to 0/0/0, qtdragon to 20/20/-10, etc).
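
A rough sketch of the delta check, assuming `stat.linear_units` reports machine units per millimetre (1.0 on a mm machine, 1/25.4 on an inch machine). The function name `delta_matches` and the tolerance `tol_mm` are hypothetical and not taken from drive.py.

```python
def delta_matches(start, final, expect_mm, linear_units, tol_mm=0.01):
    """Compare (final - start) with an expected delta given in millimetres.

    start, final  : first three axes of stat.position, in machine units
    expect_mm     : expected X, Y, Z delta in mm (e.g. from --expect-delta-mm)
    linear_units  : machine units per mm (assumed meaning of stat.linear_units)
    tol_mm        : hypothetical tolerance, also expressed in mm
    """
    expect_machine = [d * linear_units for d in expect_mm]
    tol = tol_mm * linear_units
    actual = [f - s for f, s in zip(final, start)]
    return all(abs(a - e) <= tol for a, e in zip(actual, expect_machine))


# mm sim homed at 20/20/-10: a 1 mm relative move in X and Y
print(delta_matches((20.0, 20.0, -10.0), (21.0, 21.0, -10.0), (1, 1, 0), 1.0))

# inch sim homed at 0/0/0: the same move is 1/25.4 inch per axis
inch = 1 / 25.4
print(delta_matches((0.0, 0.0, 0.0), (inch, inch, 0.0), (1, 1, 0), inch))
```

Because only the difference is compared, the home offset of each sim drops out of the check.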

State machine handling

linuxcnc.command.wait_complete() on a state or mode change only proves the NML message was acked, not that task_state / task_mode has transitioned. Polling the stat fields is the only deterministic signal. Two GUIs (gmoccapy reliably, qtdragon intermittently) re-issue their own mode commands during their own startup and revert task_mode AUTO -> MANUAL immediately after the driver sets it; without retry, auto(AUTO_RUN, 0) is then rejected and interp_state stays at IDLE.

ensure_state / ensure_mode helpers implement an issue + wait + stability-check + retry pattern, up to STATE_RETRY_BUDGET = 6 attempts. Intermediate timeouts use a quiet wait variant so spurious UI_SMOKE_FAIL lines do not pollute the log during retries (checkresult.sh greps for ^UI_SMOKE_FAIL on any line, so a retry that ultimately succeeds must not emit the marker).
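
The issue + wait + stability-check + retry pattern might look roughly like this. It is a generic sketch, not the actual helpers: `ensure`, `issue`, `read`, and the timing defaults are hypothetical; only `STATE_RETRY_BUDGET = 6` is taken from the text above.

```python
import time

STATE_RETRY_BUDGET = 6


def ensure(issue, read, want, wait_s=2.0, stability_s=0.2, poll_s=0.01,
           budget=STATE_RETRY_BUDGET, sleep=time.sleep, clock=time.monotonic):
    """Issue a command until `read()` reports `want` and still does so after
    a short stability delay. Returns the attempt number that succeeded.

    issue : hypothetical callable sending the state/mode command
    read  : hypothetical callable polling stat and returning the field
    """
    for attempt in range(1, budget + 1):
        issue()
        deadline = clock() + wait_s
        reached = False
        while clock() < deadline:
            if read() == want:
                reached = True
                break
            sleep(poll_s)
        if not reached:
            # Quiet intermediate timeout: no failure marker is printed here,
            # because a later attempt may still succeed.
            continue
        sleep(stability_s)
        if read() == want:
            return attempt
        # The GUI reverted the value during the stability window: retry.
    raise TimeoutError(f"{want!r} not stable after {budget} attempts")
```

Only the final `TimeoutError` would be turned into a `UI_SMOKE_FAIL` line by the caller, which keeps the log clean when a retry recovers.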

qtdragon-specific bits

The qtdragon sim needs three CI-only workarounds layered in tests/ui-smoke/qtdragon/test.sh:

  1. Writable config mirror. qtvcp's INI-driven LOG_FILE is rooted in the config dir. CI mounts the workspace read-only for the runtime user, so a relative LOG_FILE = qtdragon.log resolves to a path qtvcp cannot create and hal_bridge exits before the driver can attach. The test now mirrors configs/sim/qtdragon/qtdragon_xyz/ into mktemp -d and rewrites LOG_FILE to ~/qtdragon.log.
  2. Offscreen Qt platform. qtvcp under xvfb + xcb on Ubuntu 24.04 segfaults during widget construction (no backtrace). Setting QT_QPA_PLATFORM=offscreen and LINUXCNC_OPENGL_PLATFORM=offscreen renders entirely in memory; xvfb-run still wraps the call so scripts/linuxcnc's X-display assumptions hold.
  3. Block QtWebEngine import. qtdragon's UI embeds WebWidget (QWebEngineView, Chromium). Chromium's browser-process init segfaults inside the qtvcp PID under offscreen + xvfb on the Ubuntu runner even with --no-sandbox --single-process --disable-gpu (Chromium logs Sandboxing disabled by user. and then crashes). Rather than chase Chromium flags for a widget the smoke test never touches, the test drops a sitecustomize.py on PYTHONPATH that installs a sys.meta_path finder blocking qtpy.QtWebEngineWidgets (and PyQt5.QtWebEngineWidgets). lib/python/qtvcp/widgets/web_widget.py already has a fail-safe path that swaps the QWebEngineView for a plain QWidget when that import fails, so the UI loads cleanly with the Web tab inert.
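
For the third workaround, a `sitecustomize.py` along these lines would block the import. This is a sketch of the technique rather than the file shipped in the test: the class name `BlockFinder` and the error message are hypothetical, while the two blocked module names come from the description above.

```python
# sitecustomize.py (sketch): imported automatically at interpreter startup
# when its directory is on PYTHONPATH.
import importlib.abc
import sys

BLOCKED = {"qtpy.QtWebEngineWidgets", "PyQt5.QtWebEngineWidgets"}


class BlockFinder(importlib.abc.MetaPathFinder):
    """Meta path finder that refuses to locate the blocked modules."""

    def find_spec(self, fullname, path, target=None):
        if fullname in BLOCKED:
            # Raising here aborts the import before any Chromium code loads.
            raise ImportError(f"{fullname} blocked for ui-smoke")
        # None means "not handled here": the remaining finders run as usual.
        return None


# Insert first so this finder is consulted before the default ones.
sys.meta_path.insert(0, BlockFinder())
```

Since the finder raises a plain `ImportError`, code that already guards the import with `try: ... except ImportError:` takes its fallback path without any change.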

debian/control

Adds python3-zmq under the <!nocheck> profile. Without it qtdragon's hal_bridge fails on startup and qtvcp tears down before the driver can attach.

Performance

ui-smoke total wall time on this machine:

| | Tests | Wall | Marginal over Phase 1 baseline (~2 min) |
|---|---|---|---|
| Phase 1 only (#3999) | 4 | ~2 min | baseline |
| Phase 1 + Phase 2 (this PR) | 4 | ~2m43s | +~40s |

Phase 2 reuses each test's existing linuxcnc startup and xvfb instance; CI cost is the Phase 2 sequence per GUI (estop + home + mode + run + verify, ~5 to 16 s depending on GUI), not the full ~30 s startup/shutdown overhead each test pays once.

Sequential by necessity: every ui-smoke test launches linuxcnc which claims a fixed set of SHM keys (SHM_KEYS in _lib/cleanup-runtime.sh), so parallel ui-smoke runs would race. Per-test SHM-key isolation would be a separate refactor.

Test plan

  • 5 consecutive local scripts/runtests tests/ui-smoke runs, 4/4 pass, 0 shmem errors, clean tree
  • Per-test diagnostics confirm gmoccapy hits the mode-revert path on every run and the retry recovers cleanly
  • scripts/shellcheck.sh clean on changed scripts
  • python3 -m py_compile clean on drive.py
  • CI: rip-and-test, rip-and-test-clang, rip-rtai, and all 6 package-arch / 3 package-indep jobs green on run 26361844630

Out of scope

@grandixximo grandixximo marked this pull request as ready for review May 25, 2026 03:13
@grandixximo grandixximo marked this pull request as draft May 27, 2026 12:10
@grandixximo
Contributor Author

I will do more local testing on Ubuntu 24.04 before marking this as ready.

grandixximo added a commit to grandixximo/linuxcnc that referenced this pull request May 28, 2026
Three debug-only additions to diagnose the rip-and-test-clang failure
(homing timeout after 60s + Fatal glibc pthread mutex assertion on
shutdown) which my local docker cannot reproduce:

- launch.sh: PYTHONFAULTHANDLER=1, ulimit -c unlimited,
  LIBC_FATAL_STDERR_=1, MALLOC_CHECK_=3 so SIGABRT/SIGSEGV in any Python
  child prints a Python+native stack to stderr (visible via linuxcnc.err).
- qtvcp.py: faulthandler.enable() + register on SIGUSR1 so the smoke
  driver can dump qtvcp's interpreter stack without killing it.
- drive.py: on homing timeout, dump per-joint state, halui machine
  pin, locate qtvcp processes and send SIGUSR1; sleep briefly so the
  stack dump lands in the log before we tear down.

Will be reverted once the clang-only failure mode is understood.
@grandixximo
Contributor Author

CI will fail on this branch until a fix for the qtdragon ui-smoke dbus
segfault lands on master. Root cause is a half-initialized
dbus.SessionBus in qtvcp/lib/sys_notify.py whose mainloop-attached
QSocketNotifier keeps the C connection alive after init raises
ServiceUnknown; the next Qt event-loop tick dispatches into
_dbus_list_unlink and SEGVs. Captured under gdb, full backtrace in
#4064 which proposes the one-line fix. Idling this PR (no further
pushes) until #4064 or an equivalent fix is merged, then I will rebase.

@BsAtHome
Contributor

Yes, they are all existing troubles in the GUIs :-)
Now that we test them, we should clean it up too.

What happens when sending SIGINT or SIGQUIT (or, fwiw, SIGHUP)? Lowering the grace period fixes the symptom, not the actual problem. (Maybe some X events should be injected to click on exit/quit?)
I think we need to consider our options and possibilities here to make the "right" fix, whatever that may be.

The NML problem needs to be looked into a bit more. That Ascii updater should never have been instantiated and I'd like to know why it was. The default NML files in configs/common correctly use xdr. Somehow none of those files were used.

@grandixximo
Contributor Author

I dug into the signal behavior. Measured each GUI in its real main loop (uspace RIP, sim configs under xvfb), sending each signal and checking whether the process exits:

| Signal | gmoccapy | touchy | qtdragon |
|---|---|---|---|
| SIGINT | stays up | stays up | stays up |
| SIGQUIT | stays up | stays up | stays up |
| SIGHUP | exits | exits | exits |
| SIGTERM | stays up | stays up | stays up |

So all three ignore INT, QUIT and TERM (absorbed by GLib/PyGObject for the GTK GUIs and by Qt for qtdragon), and only SIGHUP terminates them, because none of them install a SIGHUP handler so it hits the default action. That is why the harness has to escalate SIGTERM to SIGKILL.

On the right fix: I think the proper answer is that each GUI should honor SIGTERM and shut down gracefully (estop plus quit), the same way it would on a normal window close. SIGTERM is the standard "please terminate" signal, so the harness should not have to escalate to SIGKILL, lower the grace, or resort to a SIGHUP/X-event workaround; those treat the symptom. gmoccapy's SIGTERM hang traced to a concrete bug (PyGObject surfaces the signal as a KeyboardInterrupt, which reaches gmoccapy's sys.excepthook, which pops a modal dialog and blocks); PR #4076 makes it exit cleanly. touchy and qtdragon hang for their own toolkit reasons, and I will start working on the equivalent clean-shutdown handlers for each so they honor SIGTERM too. Once they do, the existing teardown works as-is with no long grace.

One question while I do that: is it worth adding ui-smoke coverage for the quit paths, so I do not regress them? Concretely, per GUI: (1) a clean quit via the GUI's own exit/quit, and (2) SIGTERM, asserting the process exits within a short window. I could extend it to SIGINT/SIGQUIT/SIGHUP later once we agree what "correct" is for each. Each extra quit case is roughly one more start-and-quit cycle (about 3 to 4 s per GUI in my measurements), so clean-quit plus SIGTERM across the four GUIs adds on the order of 30 s. Happy to wire that in if you think it is worth the coverage.

Separately, on the NML question: root-caused. The emcCommand buffer is correctly xdr. The CMS_DISPLAY_ASCII_UPDATER is a transient string-conversion updater created by NML::msg2str(), called from emctaskmain.cc only under EMC_DEBUG_TASK_ISSUE (0x10). touchy.ini was the only sim shipping DEBUG = 0x10, so task logged every command as an ASCII string and instantiated it. PR #4073 sets touchy DEBUG = 0; PR #4074 also gates the updater's unconditional stderr banner behind the CMS debug flag so it stays quiet for any legitimate transient use.

One more thing I hit while testing: gmoccapy pops a modal "Important change(s)" release-notes dialog at startup on a fresh profile, which blocks headless startup entirely (it spins a nested loop in the constructor). Filed as #4072.

@grandixximo
Contributor Author

Update on the SIGTERM right fix: clean-shutdown handlers done for the two remaining GUIs.

  • touchy: SIGTERM handler quits the GTK loop via the idle queue, so atexit cleanup (child panels, saved prefs) still runs.
  • qtdragon (qtvcp): quits the Qt loop on SIGTERM and SIGINT, leaving the X-button "are you sure?" confirmation intact for interactive use.

Both exit promptly now instead of being escalated to SIGKILL.

Caveat on touchy: PyGObject still prints one cosmetic KeyboardInterrupt line on exit. It comes from PyGObject's C-level wakeup-fd bridge, not Python, so it is unsuppressible from signal.signal, GLib.unix_signal_add, sys.excepthook or sys.unraisablehook (tried all). Only os._exit hides it, at the cost of skipping atexit cleanup. It is pre-existing, so no regression.

PRs to follow shortly.

@grandixximo
Contributor Author

Addendum to my quit-path question above: I built it, so it is concrete now.

Shared harness plus per-GUI tests (touchy, gmoccapy, qtdragon): boot the GUI, wait for the task, send SIGTERM to the GUI alone, assert it exits in a short grace. Validated on a uspace RIP build under xvfb: touchy fails without its handler and passes with it; qtdragon and gmoccapy pass with theirs. They depend on #4076 / #4077 / #4078, so they fail on current master by design until those land.
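
The core assertion of such a quit test can be sketched like this. It is an illustration only: `exits_on_signal` and the grace value are hypothetical, and the real harness also boots the GUI and waits for the task before signalling.

```python
import signal
import subprocess


def exits_on_signal(proc, sig=signal.SIGTERM, grace_s=5.0):
    """Send `sig` to an already running subprocess.Popen and report whether
    it exits on its own within `grace_s` seconds.

    If the grace period expires, the process is killed so the test run does
    not leak it, and False is returned.
    """
    proc.send_signal(sig)
    try:
        proc.wait(timeout=grace_s)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()   # escalate only to clean up, the check already failed
        proc.wait()
        return False
```

A GUI that is blocked in its constructor never installs its handler, so under this check it shows up as a timeout rather than a pass.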

Worth noting: neither phase catches a GUI wedged in its constructor. gmoccapy on a fresh profile blocks in a modal startup dialog (#4072) and never reaches its main loop, yet both phases report UI_SMOKE_OK since only the NML task is checked. The quit test catches it: a constructor-blocked GUI never installs its handler, so it cannot exit on SIGTERM. So it guards two things, signal handling and "the GUI reached its event loop". I did only the SIGTERM path; a clean GUI-native quit is per-toolkit and messier, left for later.

Packaging: fold into this PR, or a separate follow-up after the three SIGTERM fixes? I lean follow-up to keep this PR focused, but either works.

c-morley pushed a commit that referenced this pull request May 30, 2026
qtvcp registered self.shutdown as the SIGTERM/SIGINT handler, but two
things stopped it working: Qt's C++ event loop (APP.exec()) does not
return to Python often enough to run Python signal handlers, and
shutdown() cleans up without quitting the loop, so even when reached the
process kept running. A caller tearing qtvcp down therefore had to
escalate to SIGKILL.

Quit the event loop on the signal (APP.quit()) so APP.exec() returns and
the existing post-exec shutdown() runs the normal cleanup once, and add a
periodic no-op QTimer so the interpreter gets a chance to run the signal
handler during exec(). Quitting the loop bypasses the window-close
confirmation dialog, which is correct for a terminate request and leaves
that interactive feature untouched.

Fixes the hang for all qtvcp screens (qtdragon, qtplasmac, ...). Verified
qtdragon exits cleanly on SIGTERM and SIGINT.

Surfaced by the ui-smoke tests (PR #4054).
Each per-GUI test now also drives estop reset, machine on, home all,
mode auto, program_open + auto(RUN) on a tiny shared smoke.ngc, waits
for sustained INTERP_IDLE, and asserts stat.position delta against
--expect-delta-mm 1,1,0 converted via stat.linear_units so the same
arg works on inch (axis, touchy) and mm (gmoccapy, qtdragon) sims.

State/mode commands use ensure_state/ensure_mode helpers with a
retry-and-stability pattern: gmoccapy and qtdragon re-issue their own
mode commands during startup and can revert task_mode AUTO -> MANUAL
right after we set it. The helpers wait for the desired state, then
re-check after STATE_STABILITY_S; on revert they retry up to
STATE_RETRY_BUDGET times. Intermediate timeouts use a quiet variant
so spurious UI_SMOKE_FAIL lines do not pollute the log during retries
(checkresult.sh greps for ^UI_SMOKE_FAIL on any line).

smoke.ngc is G21 G91 G0 X1 Y1 G90 M2 - relative move in mm, sim-
agnostic. The driver snapshots stat.position[:3] after homing and
checks (final - start) against the converted delta, sidestepping each
sim's HOME offset.

Adds python3-zmq and python3-opencv to debian/control.top.in under
!nocheck: qtdragon's hal_bridge and the camview widget segfault on
startup without them, which is invisible to the connect-only Phase 1
smoke but breaks the run-program path before the program can start.

5 consecutive local runs all green at 2m43s wall each.
CI run hit 'timeout waiting for all joints homed after 60.0s' on
qtdragon only; locally homing completes in <4s on all four sims.
Likely cause: same task_mode revert race as ensure_mode catches for
MODE_AUTO, except home() lives outside that helper, so a mid-sequence
mode flip back to a non-MANUAL mode silently drops the home command.

Wrap the post-c.home(-1) wait in a poll loop that re-asserts MANUAL
and re-issues home(-1) every HOME_REISSUE_S (10s). Final timeout now
also dumps homed[], task_state, task_mode and exec_state so the next
CI failure has actionable diagnostics.
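
A sketch of the re-issue loop described in the commit above. The names `home_all`, `issue_home`, and `all_homed` are hypothetical; `issue_home` stands for "re-assert MANUAL, then call c.home(-1)". The 10 s re-issue interval and 60 s timeout are the values quoted in the message.

```python
import time

HOME_REISSUE_S = 10.0


def home_all(issue_home, all_homed, timeout_s=60.0, reissue_s=HOME_REISSUE_S,
             poll_s=0.05, sleep=time.sleep, clock=time.monotonic):
    """Issue a home-all command and re-issue it periodically until every
    joint reports homed. Returns how many times the command was issued."""
    start = clock()
    issue_home()
    last_issue = start
    issued = 1
    while clock() - start < timeout_s:
        if all_homed():
            return issued
        if clock() - last_issue >= reissue_s:
            # The earlier command may have been dropped by a mode flip.
            issue_home()
            last_issue = clock()
            issued += 1
        sleep(poll_s)
    raise TimeoutError(f"timeout waiting for all joints homed after {timeout_s}s")
```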
CI run hit a PermissionError in qtvcp's logger when it tried to open
configs/sim/qtdragon/qtdragon_xyz/qtdragon.log for write: the GitHub
Actions workspace is mounted read-only for the docker build user, and
qtvcp resolves LOG_FILE = qtdragon.log into the config dir. hal_bridge
then exits, linuxcnc tears down, and the driver retries ESTOP_RESET
until the budget is exhausted.

qtdragon test.sh now mirrors the qtdragon_xyz config dir to a mktemp
directory, seds LOG_FILE to ~/qtdragon.log, and passes the absolute
INI path to run-gui.sh. run-gui.sh treats any path starting with /
as absolute; everything else still resolves under configs/sim. Trap
cleans the tmp dir on exit so the working tree stays clean.

Does not touch the shipped qtdragon config to avoid changing default
behaviour for real users. The same fix would work for any other
config that turns out to write into its own dir on CI.
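
The test does this in shell with mktemp and sed; an equivalent sketch in Python is shown here for illustration. The function name `mirror_config` and the `ini_name` parameter are hypothetical. Removing the temporary directory is left to the caller, as the shell version does with a trap.

```python
import re
import shutil
import tempfile
from pathlib import Path


def mirror_config(src_dir, ini_name, log_file="~/qtdragon.log"):
    """Copy a config directory to a writable temp location and point
    LOG_FILE at `log_file`. Returns the absolute path of the mirrored INI."""
    tmp = Path(tempfile.mkdtemp(prefix="ui-smoke-"))
    dst = tmp / Path(src_dir).name
    shutil.copytree(src_dir, dst)

    ini = dst / ini_name
    text = ini.read_text()
    # Rewrite only the value; keep the key and its spacing as they were.
    text = re.sub(r"(?m)^([ \t]*LOG_FILE[ \t]*=[ \t]*).*$",
                  lambda m: m.group(1) + log_file, text)
    ini.write_text(text)
    return ini
```

The shipped config is never modified, so default behaviour for real users stays the same.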
Ubuntu 24.04 rip-and-test runs hit a qtvcp segfault after the log-
permission fix let qtvcp get further than Phase 1 had. Debian
package-arch passes the same code. Two known asymmetries match:

- python3-opencv on Ubuntu pulls Qt5 GUI bits whose cv2/qt/plugins
  directory overrides the system PyQt5 platform plugin path under
  xvfb (opencv-python issue LinuxCNC#572, Qt Forum 119109). qtvcp's
  camview_widget tolerates ImportError on cv2 and just logs a
  warning, so dropping the dep restores the harmless fallback path
  Phase 1 was already exercising.
- xcb_glx is the historical fragile integration under xvfb
  (Launchpad #1761708, QTBUG-67537); xcb_egl is what software-GL
  stacks expect anyway. Set as defense in depth.

Local 4/4 still green with both changes.
xvfb + xcb + xcb_egl was not enough for Ubuntu 24.04 rip-and-test:
qtvcp still segfaults during widget construction even with opencv
and qtwebengine paths quiet, and the same code passes on Debian
package-arch. Offscreen renders entirely in memory and exercises a
different Qt plugin entirely, dodging the xcb-stack instability.

scripts/linuxcnc itself forces QT_QPA_PLATFORM=xcb unless
LINUXCNC_OPENGL_PLATFORM is set to a non-glx value, so pin both.
Only qtdragon needs this; axis (Tk), touchy and gmoccapy (GTK) are
unaffected. Trade-off: no Phase 3 screenshot from qtdragon under
this config; Phase 3 would need an opt-out for offscreen tests.
qtdragon embeds QWebEngineView. On rip-and-test (gcc) CI it crashed
intermittently during Chromium browser-process spawn under offscreen +
xvfb, with no GPU and no user namespaces. rip-and-test-clang got past it
by luck.

Force --no-sandbox --single-process --no-zygote --disable-gpu so the
renderer runs in-process with software rendering.
QtWebEngine browser-process init segfaults inside the qtvcp process on
Ubuntu 24.04 CI even with --no-sandbox --single-process --disable-gpu.
The smoke test never touches the WebWidget, so block the
qtpy.QtWebEngineWidgets import via a sitecustomize meta_path finder;
WebWidget already has a fallback that swaps in a plain QWidget when
that import fails. No Chromium spawn, no segfault.

The previous chromium-flags attempt was retracted: 'Sandboxing disabled
by user.' confirmed Chromium got the flags but still crashed during
init, so we are not going to win that race.
@grandixximo
Contributor Author

Just rebasing. The extended ui-smoke quit tests will probably be a follow-up PR.

@grandixximo
Contributor Author

grandixximo commented Jun 1, 2026

#4088 introduced a regression on headless systems. The Phase 2 CI would have caught it, since it fails now. I will send a PR to fix it; once that is merged, Phase 2 should pass again. The smoke test is already catching real problems, doing its job properly :-)

BsAtHome pushed a commit that referenced this pull request Jun 1, 2026
touchy.ini was the only sim config shipping a non-zero DEBUG (0x10 =
EMC_DEBUG_TASK_ISSUE). That flag makes task log every issued NML command
as an ASCII string via emcCommandBuffer->msg2str(), which lazily creates
a CMS_DISPLAY_ASCII_UPDATER and prints its scary 'may not function
properly due to range limitations' banner on startup.

The emcCommand buffer is correctly xdr-encoded; the ASCII updater is only
a temporary string-conversion helper for the debug log. Setting DEBUG=0
(matching axis and gmoccapy sims) stops the task-issue logging and the
spurious banner. No functional change to touchy.

Surfaced by the ui-smoke tests (#4054).
BsAtHome pushed a commit that referenced this pull request Jun 1, 2026
gtk_action.py had 'return npath' inside a finally: block. A return in
finally swallows any in-flight exception and overrides earlier returns,
so the except branch's 'return None' (the error path) was always
clobbered: a failed save still returned the path and hid the exception.

Dedent the return out of the finally so the finally only closes the
file. On success the function returns npath; on a write error it now
correctly returns None. Also silences the Python 3.12+
SyntaxWarning: 'return' in a 'finally' block.

Surfaced by the ui-smoke tests (PR #4054).

Fixes #4067
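
The `return`-in-`finally` behaviour described in the commit above can be reproduced with a small standalone example. The function names and the `write` callable are hypothetical and are not the code in gtk_action.py.

```python
def save_buggy(write, npath="out.ngc"):
    """Mimics the bug: the return in `finally` always wins."""
    try:
        write(npath)
    except OSError:
        return None        # intended error result, but it gets overridden
    finally:
        return npath       # also swallows any exception still in flight


def save_fixed(write, npath="out.ngc"):
    """Return moved out of `finally`, which now only does cleanup."""
    cleaned_up = []
    try:
        write(npath)
    except OSError:
        return None
    finally:
        cleaned_up.append(npath)   # stand-in for closing the file
    return npath


def failing_write(path):
    raise OSError("disk full")


print(save_buggy(failing_write))   # out.ngc
print(save_fixed(failing_write))   # None
```

With the buggy variant a failed save is indistinguishable from a successful one, which is exactly what hid the error path.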
BsAtHome pushed a commit that referenced this pull request Jun 1, 2026
touchy ignored SIGTERM (GLib/PyGObject absorbs the signal), so a caller
tearing it down had to escalate to SIGKILL. Install a SIGTERM handler
that quits the GTK main loop, deferring the quit to the loop via
GLib.idle_add since GTK calls are not safe directly from a signal
handler. atexit cleanup (child processes, prefs) still runs.

This does not touch the interactive window-close path. SIGTERM is a
system terminate request and should not require confirmation.

Note: PyGObject still prints a benign KeyboardInterrupt on this shutdown;
that is pre-existing (touchy printed it on SIGTERM before too, it just
did not exit) and is emitted from PyGObject's C layer, so it is not
suppressible from Python here.

Surfaced by the ui-smoke tests (PR #4054).