#278 possible multiple certificates requested
Opened by dmulder. Modified

During fresh system setup, certmonger is started with a new CEPCES based CA, also immediately a getcert request is started (all via salt). This results in multiple times a certificate request is sent to CA, as whenever the state of a CA phase is changed, all related requests are aborted and restart, not taking care that there might be already results to a submitted certificate request. See -> cm_restart_entries_by_ca()

Possible fix:
certificate request need to be queued until a newly added CA is ready to serve requests.
Attempt at a fix:
https://github.com/paolo-depa/certmonger/tree/bsc1221406


What is the CEPCES CA doing that it takes time to be available to start running cm_iterate_ca()

It would be easier to see what you are doing with a single patch rather than two separate and overlapping commits.

From f5577c45fbe0fa3c9a9625ba6f520743412e0b0e Mon Sep 17 00:00:00 2001
From: paolodepa <paolo.depasquale@suse.com>
Date: Mon, 15 Apr 2024 10:31:43 +0200
Subject: [PATCH] cm_start_entry: only start entries for ready CAs
---
 src/cm.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/src/cm.c b/src/cm.c
index f3a9bd58..dc33223c 100644
--- a/src/cm.c
+++ b/src/cm.c
@@ -859,17 +859,35 @@ cm_stop_all(struct cm_context *context)
 dbus_bool_t
 cm_start_entry(struct cm_context *context, const char *nickname)
 {
-   int i;
+   int i,ca;
+   enum cm_ca_phase phase;
    i = cm_find_entry_by_nickname(context, nickname);
    if (i != -1) {
        if (cm_iterate_entry_init(context->entries[i],
                      &context->entry_events[i].iterate_state) == 0) {
-           context->entry_events[i].next_event =
-               cm_service_entry(context, NULL, i);
-           cm_log(3, "Started %s('%s').\n",
-                  context->entries[i]->cm_busname, nickname);
-           return TRUE;
+           /* check if ca is ready */
+           if (ca = cm_find_ca_by_nickname(context, context->entries[i]->cm_ca_nickname)) {
+               /* a phase needs to be idle,unreachable or disabled to be ready*/
+               for (phase = 0; phase < cm_ca_phase_invalid; phase++) {
+                   if ((context->ca_events[ca].iterate_state[phase] != CM_CA_IDLE) && 
+                           (context->ca_events[ca].iterate_state[phase] != CM_CA_DISABLED) &&
+                           (context->ca_events[ca].iterate_state[phase] != CM_CA_DATA_UNREACHABLE)) {
+                       cm_log(3, "CA ('%s') is not ready to start entry %s('%s').\n",context->entries[i]->cm_ca_nickname, context->entries[i]->cm_busname, nickname);
+                       return FALSE;
+                   }
+               }
+               /* only when all phases are ready start an entry */
+               context->entry_events[i].next_event =
+                   cm_service_entry(context, NULL, i);
+               cm_log(3, "Started %s('%s').\n",
+                   context->entries[i]->cm_busname, nickname);
+               return TRUE;
+           }
+           cm_log(3, "CA ('%s') not found by nickname, not starting event %s('%s').\n",
+                   context->entries[i]->cm_ca_nickname,
+                   context->entries[i]->cm_busname, nickname);
+           return FALSE;
        } else {
            cm_log(3, "Error starting %s('%s'), please retry.\n",
                   context->entries[i]->cm_busname, nickname);
-- 
2.44.0

And here are some redacted logs:

2024-04-08 13:07:35 [43908] User ID 0 PID 44278 called /org/fedorahosted/certmonger:org.fedorahosted.certmonger.add_known_ca.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').identify starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44279] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').identify moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').identify on traffic from 15.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-identify.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').certs starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44280] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').certs moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').certs on traffic from 19.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-certs.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').profiles starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44281] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').profiles moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').profiles on traffic from 23.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-profiles.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').default_profile starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44282] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').default_profile moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').default_profile on traffic from 27.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-default_profile.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').enrollment_reqs starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44283] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').enrollment_reqs moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').enrollment_reqs on traffic from 31.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-enrollment_reqs.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').renewal_reqs starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44286] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').renewal_reqs moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').renewal_reqs on traffic from 35.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-renewal_reqs.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').capabilities starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44287] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').capabilities moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').capabilities on traffic from 39.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-capabilities.
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').encryption_certs starts (NEED_TO_REFRESH)
2024-04-08 13:07:35 [44288] Running enrollment/cadata helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:35 [43908] CA6('myca-cepces-ca').encryption_certs moved to state 'REFRESHING'
2024-04-08 13:07:35 [43908] Will revisit CA6('myca-cepces-ca').encryption_certs on traffic from 43.
2024-04-08 13:07:35 [43908] Started CA CA6('myca-cepces-ca')-encryption_certs.
#CA has been added and all phases are in state NEED_TO_REFRESH
...
# Now a certificate for "CA6('myca-cepces-ca')" is requested
2024-04-08 13:07:35 [43908] User ID 0 PID 44289 called /org/fedorahosted/certmonger:org.fedorahosted.certmonger.add_request.
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') starts in state 'NEWLY_ADDED'
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') taking writing lock
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'NEWLY_ADDED_START_READING_KEYINFO'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:35 [43908] Started Request1('dceu0895-cepces').
2024-04-08 13:07:35 [43908] Queuing FD 7 for Read for 0x55e1f007a440:0x55e1f0086210.
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'NEWLY_ADDED_READING_KEYINFO'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 41.
...
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'NEWLY_ADDED_START_READING_CERT'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'NEWLY_ADDED_READING_CERT'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 41.
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'NEWLY_ADDED_DECIDING'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') releasing writing lock
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') has no key or certificate, will generate keys and attempt enrollment
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'NEED_KEY_PAIR'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') taking writing lock
2024-04-08 13:07:35 [43908] Request1('dceu0895-cepces') moved to state 'GENERATING_KEY_PAIR'
2024-04-08 13:07:35 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 41.
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 41.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') releasing writing lock
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'HAVE_KEY_PAIR'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'NEED_KEYINFO'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'READING_KEYINFO'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 41.
2024-04-08 13:07:36 [44298] Key is an RSA key.
2024-04-08 13:07:36 [44298] Key size is 4096.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'HAVE_KEYINFO'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'NEED_CSR'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'GENERATING_CSR'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 41.
2024-04-08 13:07:36 [44299] Read value "0" from "/proc/sys/crypto/fips_enabled".
2024-04-08 13:07:36 [44299] Not attempting to set NSS FIPS mode.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'HAVE_CSR'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'NEED_TO_SUBMIT'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') now.
...
2024-04-08 13:07:36 [44301] Redirecting stdin to /dev/null, leaving stdout and stderr open for child "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:36 [44301] Running enrollment helper "/usr/lib/certmonger/cepces-submit".
2024-04-08 13:07:36 [43908] Request1('dceu0895-cepces') moved to state 'SUBMITTING'
2024-04-08 13:07:36 [43908] Will revisit Request1('dceu0895-cepces') on traffic from 47.
...
# Now "CA6('myca-cepces-ca')" phase "renewal_reqs" changes state to REFRESHING
2024-04-08 13:07:38 [43908] CA6('myca-cepces-ca').renewal_reqs moved to state 'REFRESHING'
2024-04-08 13:07:38 [43908] Will revisit CA6('myca-cepces-ca').renewal_reqs on traffic from 35.
# REFRESHING of "CA6('myca-cepces-ca')" phase "renewal_reqs" is done
2024-04-08 13:07:38 [43908] CA6('myca-cepces-ca').renewal_reqs data updated
# this logline is generated from function -> cm_iterate_ca() at code  "/src/iterate.c"
# this does trigger function -> cm_restart_entries_by_ca()
# this does call function -> cm_restart_entry()
# this does call function -> cm_stop_entry()
# function cm_stop_entry()() will execute -> talloc_free(context->entry_events[i].next_event)
# this talloc_free() does remove tracking of STDOUT for "Request1('dceu0895-cepces')"
# function cm_stop_entry() will produce "Stopped Request1('dceu0895-cepces')." which we see shortly
...
2024-04-08 13:07:40 [43908] Waited for 44301, got 44301.
# This is the "abnormal" ending for a subprocess we do not track STDOUT anymore.
2024-04-08 13:07:40 [43908] Request1('dceu0895-cepces') ends in state 'HAVE_CSR'
2024-04-08 13:07:40 [43908] Stopped Request1('dceu0895-cepces').
# Here we see the log entry that was prodiced by function -> cm_stop_entry()
...
# and the restart of "Request1('dceu0895-cepces')"
2024-04-08 13:07:40 [43908] Request1('dceu0895-cepces') starts in state 'HAVE_CSR'
2024-04-08 13:07:40 [43908] Request1('dceu0895-cepces') moved to state 'NEED_TO_SUBMIT'
2024-04-08 13:07:40 [43908] Will revisit Request1('dceu0895-cepces') now.
2024-04-08 13:07:40 [43908] Started Request1('dceu0895-cepces').
2024-04-08 13:07:40 [43908] CA6('myca-cepces-ca').renewal_reqs moved to state 'NEED_TO_ANALYZE'
2024-04-08 13:07:40 [43908] Will revisit CA6('myca-cepces-ca').renewal_reqs now.
2024-04-08 13:07:40 [43908] Will revisit CA6('myca-cepces-ca').renewal_reqs now.

Paolo De Pasquale (paolo-depa) did most of the analysis on this bug. I'm just posting his analysis and patch.

I haven't had a chance to look at this in detail yet and it may take more time. This is a pretty fundamental change to the certmonger interator so I want to be careful about any changes. This code has been stable for over a decade.

I have no sort of reproducer where a CA helper can not be fully configured but working well enough to pass along a cert request to a remote CA.

What is the CA helper doing during this time? That's what isn't clear to me.

From Paul Zirnik at SUSE:
I can not answer the question, as I do not have enough inside information what cepces-submit does on different requests.

However this did popup following question to myself.

What conditions would need to be changed to invalidate and abort a running CSR request ?

I can not think of anything that can cause this, if we have some missing information that still need to be collected, we should not have started the CSR request at all. If we have a fundamental change, new certificates needed anyway. So I see no reason to abort and restart a running CSR request.

Because of this I came up with a new idea to patch the issue -> do not restart running CSR requests at all (at least for external CAs)

diff -Naur certmonger-0.79.13-150400.3.3.1.x86_64/src/iterate.c certmonger-0.79.13-150400.3.3.1.x86_64.np/src/iterate.c
--- certmonger-0.79.13-150400.3.3.1.x86_64/src/iterate.c    2020-10-20 21:39:55.000000000 +0200
+++ certmonger-0.79.13-150400.3.3.1.x86_64.np/src/iterate.c 2024-05-08 10:48:22.851568239 +0200
@@ -2389,15 +2389,19 @@
                case cm_ca_phase_renew_reqs:
                case cm_ca_phase_capabilities:
                    if (emit_ca_changes != NULL) {
-                       cm_restart_entries_by_ca(context,
-                                    ca->cm_nickname);
+                       if(ca->cm_ca_type != cm_ca_external) {
+                           cm_restart_entries_by_ca(context,
+                                       ca->cm_nickname);
+                       }
                    }
                    ca->cm_ca_state[state->cm_phase] = CM_CA_NEED_TO_ANALYZE;
                    break;
                case cm_ca_phase_encryption_certs:
                    if (emit_ca_changes != NULL) {
-                       cm_restart_entries_by_ca(context,
-                                    ca->cm_nickname);
+                       if(ca->cm_ca_type != cm_ca_external) {
+                           cm_restart_entries_by_ca(context,
+                                        ca->cm_nickname);
+                       }
                    }
                    ca->cm_ca_state[state->cm_phase] = CM_CA_NEED_TO_ANALYZE;
                    break;

@rcritten we sent a patched certmonger to a customer and had them test this change, and it resolved the problem. Would you consider testing and accepting the patch upstream?

I really need a reproducer so I can see the failures, then watch what happens with the change. This is a pretty big change to the way requests are handled so I need to be sure regressions or corner cases aren't introduced.

Reproducer setup:

Needed:
- Windows Certificate Authority
- external helper cepces-certmonger
- linux client integrated into AD Domain via Samba
(cepces-certmonger will use kerberos auth to authenticate against AD)
- group policies

-- /etc/cepces/cepces.conf
[global]
server=AD-CA-host
shortname=ad-ca-host
SHORTNAME=AD-CA-HOST
fqdn=AD-CA-host.fqdn.net
type=Policy
auth=Kerberos
endpoint=https://${server}/ADPolicyProvider_CEP_${auth}/service.svc/CEP
poll_interval=3600
[kerberos]
keytab=/etc/krb5.keytab
ccache=True
principals=
${shortname}$$
${SHORTNAME}$$
host/${SHORTNAME}
host/${fqdn}
enctypes=
aes128-cts-hmac-sha1-96
aes256-cts-hmac-sha1-96

-- register CA
getcert add-ca -c my-cepces-ca -e '/usr/lib/certmonger/cepces-submit --server=AD-CA-host.fqdn.net --auth=Kerberos'

-- request certificate
getcert request -c my-cepces-ca \
-T ProdWebServerCEP \
-I {{ host }}-cepces \
-N 'CN = {{ fqdn }}, C = DE, ST = Bavaria, L = Europe, O = Company, OU = Linux, emailAddress = email@fqdn.net' \
-D {{ host }} -D {{ fqdn }} \
-g 4096 \
-o root:pki -m 660 -k /etc/pki/company/service-certs/{{ host }}.key \
-O root:pki -M 664 -f /etc/pki/comapny/service-certs/{{ host }}.crt

@rcritten ping!

I haven't forgotten.

This is a complicated reproducer with AD making it significantly more so.

In order to reproduce it to verify and understand the fix I need to block out a fair bit of time and simply haven't been able to yet.

Metadata