ENT-14108: cf-execd.service: drain cf-agent on stop #6146

Merged: larsewi merged 1 commit into cfengine:master from larsewi:drain-cf-agent-systemd on May 31, 2026

@larsewi
Contributor

@larsewi larsewi commented May 27, 2026

`KillMode=process` only signals cf-execd. Any cf-agent spawned by cf-execd keeps running after `systemctl stop` returns. A mid-run agent can then re-trigger cf-php-fpm (`Wants=cf-postgres`), causing dependencies to be pulled back in after the stop was reported successful.

This fix adds an `ExecStopPost=` command that waits up to 60s for cf-agent to drain, then `SIGKILL`s any survivor. It runs after cf-execd has exited, so no new agents are spawned during the drain.
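
The wait-then-kill behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the contents of the merged unit file: the function name `drain_pids` and its return codes are hypothetical, and it assumes the caller has permission to signal the listed processes (as a command run by systemd as root would).

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): wait up to TIMEOUT seconds for the
# given PIDs to exit, then SIGKILL whatever is still running.
# Returns 0 if every process exited on its own, 1 if any had to be killed.
drain_pids() {
    timeout=$1
    shift
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        alive=0
        for pid in "$@"; do
            # kill -0 delivers no signal; it only checks that the PID exists
            # and that we are allowed to signal it.
            if kill -0 "$pid" 2>/dev/null; then
                alive=1
            fi
        done
        [ "$alive" -eq 0 ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    killed=0
    for pid in "$@"; do
        if kill -0 "$pid" 2>/dev/null; then
            # The process may exit between the check and the kill; ignore that.
            kill -KILL "$pid" 2>/dev/null || true
            killed=1
        fi
    done
    return "$killed"
}
```

One way to call it, assuming `pgrep -x` is an acceptable way to find agents on the host, would be `drain_pids 60 $(pgrep -x cf-agent)`. Polling once per second keeps the stop responsive when the agent finishes early, while the timeout bounds how long `systemctl stop` can block.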

Ticket: ENT-14108

Backported to:

@larsewi
Contributor Author

larsewi commented May 27, 2026

@cf-bottom Jenkins please :)

larsewi added the cherry-pick? label (Fixes which may need to be cherry-picked to LTS branches) on May 27, 2026
@cf-bottom

Member

@nickanderson nickanderson left a comment

Not sure about the 60s wait. Is it not possible for cf-agent to start cf-execd inside those 60s?

@larsewi
Contributor Author

larsewi commented May 28, 2026

> Not sure about the 60s wait. Is it not possible for cf-agent to start cf-execd inside those 60s?

Not sure I follow, @nickanderson. Is it not cf-execd that starts cf-agent, and not the other way around?

With this fix, stopping cf-execd now waits for cf-agent to finish. If the agent does not finish within 60 seconds, it gets killed.

This fixes the issue where a lingering agent can start pulling in dependencies again after `systemctl stop cfengine3`, causing upgrades to fail. E.g., the cfengine3 umbrella stops postgres, and the agent then starts it again.

@nickanderson
Member

> Not sure I follow, @nickanderson. Is it not cf-execd that starts cf-agent, and not the other way around?

Both things can be true. There is policy in the MPF that watches over CFEngine's own processes, although I think it is mostly skipped in the case of systemd. For example:

  processes:

    !windows::

      "bin/cf-execd" -> { "CFE-2974" }
        restart_class => "cf_execd_not_running",
        comment => "If cf-execd isn't running, define a class so that it will be started",
        handle => "cfe_internal_limit_robot_agents_processes_cf_execd_not_running";

      "bin/cf-monitord" -> { "CFE-2963" }
        restart_class => "cf_monitord_not_running",
        handle => "cfe_internal_limit_robot_agents_classify_cf_monitord_not_running",
        comment => "We want cf-monitord to be running, but in order to avoid
                    non-convergent promises, this must be separated from the
                    promise to terminate misbehaving daemons";

  commands:

    cf_execd_not_running::

      "$(sys.cf_execd)"
        comment => "Restart cf-execd process",
        handle => "cfe_internal_limit_robot_agents_commands_restart_cf_execd";

    cf_monitord_not_running::

      "$(sys.cf_monitord)"
        comment => "Restart cf-monitord process",
        handle => "cfe_internal_limit_robot_agents_commands_restart_cf_monitord";

There are also some promises that target systemd, but notice that cf-execd is commented out because of FUD.

  services:

    systemd::

      "cf-serverd"
        service_policy => "restart",
        if => "(server_controls_repaired|runagent_controls_repaired)";

      "cf-monitord"
        service_policy => "restart",
        if => "monitor_controls_repaired";

    systemd.enterprise_edition.(am_policy_hub|policy_server)::

      "cf-hub"
        service_policy => "restart",
        if => "hub_controls_repaired";


      # Well, this is dangerous we might kill our own agent
      # "cf-execd"
      #   service_policy => "restart",
      #   if => "(execd_controls_repaired|runagent_controls_repaired)";

I guess I am wondering what waiting for an arbitrary time is really gaining us. If I `systemctl stop cf-execd`, what is the real difference between waiting 2 seconds and 60 seconds? Neither is based on the actual system state or how long we expect an agent process to take.

@larsewi
Contributor Author

larsewi commented May 28, 2026

> I guess I am wondering what waiting for an arbitrary time is really gaining us. If I `systemctl stop cf-execd`, what is the real difference between waiting 2 seconds and 60 seconds? Neither is based on the actual system state or how long we expect an agent process to take.

So what you're saying, @nickanderson, is: why not just kill the agent right away instead of waiting for it to finish?

@nickanderson
Member

> So what you're saying, @nickanderson, is: why not just kill the agent right away instead of waiting for it to finish?

Maybe I am, I dunno. I am probably just overthinking it. Why not give it at least 60s to finish up? That's why I went ahead and approved it. It just seemed arbitrary, and I was looking for meaning.

Comment thread misc/systemd/cf-execd.service.in Outdated
`KillMode=process` only signals cf-execd. Any cf-agent spawned by
cf-execd keeps running after systemctl stop returns. A mid-run agent can
then re-trigger cf-php-fpm (`Wants=cf-postgres`), causing dependencies
to be pulled back in after the stop was reported successful.

This fix adds `ExecStopPost=` that waits up to 60s for cf-agent to
drain, then `SIGKILL`s any survivor. It runs after cf-execd has exited,
so no new agents are spawned during the drain.

Ticket: ENT-14108
Changelog: cf-execd systemctl stop now waits for in-flight cf-agent to finish
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
larsewi force-pushed the drain-cf-agent-systemd branch from 3a97435 to cd78895 on May 29, 2026 13:50
@larsewi
Contributor Author

larsewi commented May 29, 2026

@cf-bottom Jenkins please :)

@cf-bottom

@larsewi larsewi merged commit a031415 into cfengine:master May 31, 2026
41 checks passed
larsewi removed the cherry-pick? label (Fixes which may need to be cherry-picked to LTS branches) on May 31, 2026
@larsewi larsewi deleted the drain-cf-agent-systemd branch May 31, 2026 16:59