I have a little freeipa setup with two replicated servers, both running CentOS Stream 9. On of them, every so often named dies in bind-dyndb-ldap with
Oct 06 11:28:01 ipa named[20720]: ldap_helper.c:3773: INSIST(task == inst->task) failed, back trace Oct 06 11:28:01 ipa named[20720]: #0 0x558ec791e621 in ?? Oct 06 11:28:01 ipa named[20720]: #1 0x7f81f5016420 in ?? Oct 06 11:28:01 ipa named[20720]: #2 0x7f81f1662e83 in ?? Oct 06 11:28:01 ipa named[20720]: #3 0x7f81f50511bd in ?? Oct 06 11:28:01 ipa named[20720]: #4 0x7f81f503c2a9 in ?? Oct 06 11:28:01 ipa named[20720]: #5 0x7f81f503c425 in ?? Oct 06 11:28:01 ipa named[20720]: #6 0x7f81f503cc17 in ?? Oct 06 11:28:01 ipa named[20720]: #7 0x7f81f4dd5b3d in ?? Oct 06 11:28:01 ipa named[20720]: #8 0x7f81f4df185e in ?? Oct 06 11:28:01 ipa named[20720]: #9 0x7f81f4ddb5a8 in ?? Oct 06 11:28:01 ipa named[20720]: #10 0x7f81f503c4db in ?? Oct 06 11:28:01 ipa named[20720]: #11 0x7f81f504ef9a in ?? Oct 06 11:28:01 ipa named[20720]: #12 0x7f81f482f802 in ?? Oct 06 11:28:01 ipa named[20720]: #13 0x7f81f47cf450 in ?? Oct 06 11:28:01 ipa named[20720]: exiting (due to assertion failure) Oct 06 11:28:01 ipa systemd[1]: named.service: Main process exited, code=killed, status=6/ABRT Oct 06 11:28:01 ipa systemd[1]: named.service: Failed with result 'signal'.
Backtrace with symbols:
root@ipa:/var/named> gdb /usr/sbin/named [...] Reading symbols from /usr/sbin/named... Reading symbols from /usr/lib/debug/usr/sbin/named-9.16.23-13.el9.x86_64.debug... (gdb) core core [New LWP 20721] [New LWP 20720] [New LWP 20722] [New LWP 20723] [New LWP 20764] [New LWP 20939] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/sbin/named -u named -c /etc/named.conf -E pkcs11'. Program terminated with signal SIGABRT, Aborted. #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0; [Current thread is 1 (Thread 0x7f81f34f8640 (LWP 20721))] (gdb) bt #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 #1 0x00007f81f48315b3 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78 #2 0x00007f81f47e4d06 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #3 0x00007f81f47b87f3 in __GI_abort () at abort.c:79 #4 0x0000558ec791b5b5 in assertion_failed (file=<optimized out>, line=<optimized out>, type=<optimized out>, cond=<optimized out>) at ../../../bin/named/main.c:270 #5 0x00007f81f5016420 in isc_assertion_failed (file=file@entry=0x7f81f166dd75 "ldap_helper.c", line=line@entry=3773, type=type@entry=isc_assertiontype_insist, cond=cond@entry=0x7f81f16701bc "task == inst->task") at ../../../lib/isc/assertions.c:46 #6 0x00007f81f1662e83 in update_zone (task=<optimized out>, event=<optimized out>) at /usr/src/debug/bind-dyndb-ldap-11.9-8.el9.x86_64/src/ldap_helper.c:3773 #7 0x00007f81f50511bd in task_run (task=0x7f81f3654010) at ../../../lib/isc/task.c:857 #8 isc_task_run (task=0x7f81f3654010) at ../../../lib/isc/task.c:950 #9 0x00007f81f503c2a9 in isc__nm_async_task (worker=0x558ec8b21eb0, ev0=0x7f81f365dee8) at netmgr/../../../../lib/isc/netmgr/netmgr.c:873 #10 process_netievent (worker=worker@entry=0x558ec8b21eb0, ievent=0x7f81f365dee8) at netmgr/../../../../lib/isc/netmgr/netmgr.c:958 #11 0x00007f81f503c425 in process_queue (worker=worker@entry=0x558ec8b21eb0, type=type@entry=NETIEVENT_TASK) at netmgr/../../../../lib/isc/netmgr/netmgr.c:1027 #12 0x00007f81f503cc17 in process_all_queues (worker=0x558ec8b21eb0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:798 #13 async_cb (handle=0x558ec8b22210) at netmgr/../../../../lib/isc/netmgr/netmgr.c:827 #14 0x00007f81f4dd5b3d in uv__async_io (loop=0x558ec8b21ec0, w=<optimized out>, events=<optimized out>) at src/unix/async.c:163 --Type <RET> for more, q to quit, c to continue without paging-- #15 0x00007f81f4df185e in uv__io_poll (loop=0x558ec8b21ec0, timeout=<optimized out>) at src/unix/epoll.c:374 #16 0x00007f81f4ddb5a8 in uv__io_poll (timeout=<optimized out>, loop=0x558ec8b21ec0) at src/unix/udp.c:122 #17 uv_run (loop=loop@entry=0x558ec8b21ec0, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:389 #18 0x00007f81f503c4db in nm_thread (worker0=0x558ec8b21eb0) at netmgr/../../../../lib/isc/netmgr/netmgr.c:733 #19 0x00007f81f504ef9a in isc__trampoline_run (arg=0x558ec8b25e60) at ../../../lib/isc/trampoline.c:196 #20 0x00007f81f482f802 in start_thread (arg=<optimized out>) at pthread_create.c:443 #21 0x00007f81f47cf450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) info threads Id Target Id Frame * 1 Thread 0x7f81f34f8640 (LWP 20721) __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 2 Thread 0x7f81f4392dc0 (LWP 20720) 0x00007f81f47e5aca in __GI___sigtimedwait (set=set@entry=0x7ffc262b7440, info=info@entry=0x7ffc262b7370, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61 3 Thread 0x7f81f2cf7640 (LWP 20722) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f81f2cf6b00, op=393, expected=0, futex_word=0x7f81f364709c) at futex-internal.c:57 4 Thread 0x7f81f24cc640 (LWP 20723) 0x00007f81f48deace in epoll_wait (epfd=16, events=events@entry=0x558ec8b6fc20, maxevents=maxevents@entry=2048, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30 5 Thread 0x7f81e83ef640 (LWP 20764) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f81f4dfd0c8 <cond.lto_priv+40>) at futex-internal.c:57 6 Thread 0x7f81f14f7640 (LWP 20939) __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f81f14f6010, op=393, expected=0, futex_word=0x7f81f36591c0) at futex-internal.c:57 (gdb)
I have not found a way to trigger this behavior, other than waiting.
The whole OS runs as an unprivileged lxc container, which so far was never an issue and works the same way for the other replica (which is on another physical machine, though).
Let me know how to help, I have a core file from the crash as well.
core
Thanks for the report. It looks like
#5 0x00007f81f5016420 in isc_assertion_failed (file=file@entry=0x7f81f166dd75 "ldap_helper.c", line=line@entry=3773, type=type@entry=isc_assertiontype_insist, cond=cond@entry=0x7f81f16701bc "task == inst->task") at ../../../lib/isc/assertions.c:46
fails on this code in update_zone():
update_zone()
INSIST(task == inst->task); /* For task-exclusive mode */
E.g. we got an event to update zone but the task which was associated with this event is different from the task we expected to be. Typically, if we see this, it happens when LDAP thread task got shut down due to a pending bind daemon restart. This, in turn, happens, for example, if log rotation triggered a reinitialization of the bind.
The event is asynchronously emitted and processed later. The crash happens when update event is processed after an event that leads to recycle of the LDAP driver instance. inst->task at this point is basically an invalid pointer (memory got freed already), so all we can do is to stop processing the update event and exit. The downside is that we would lose this update, pretty much. We cannot do anything else, though. The data for this update is in LDAP, we just didn't process it on the bind side.
inst->task
In frame #6 we don't see task and event values, they were optimised out. But in frame #7 we do get it and event would be known from the lower frames:
#7 0x00007f81f50511bd in task_run (task=0x7f81f3654010) at ../../../lib/isc/task.c:857 #8 isc_task_run (task=0x7f81f3654010) at ../../../lib/isc/task.c:950 #9 0x00007f81f503c2a9 in isc__nm_async_task (worker=0x558ec8b21eb0, ev0=0x7f81f365dee8) at netmgr/../../../../lib/isc/netmgr/netmgr.c:873 #10 process_netievent (worker=worker@entry=0x558ec8b21eb0, ievent=0x7f81f365dee8) at netmgr/../../../../lib/isc/netmgr/netmgr.c:958 #11 0x00007f81f503c425 in process_queue (worker=worker@entry=0x558ec8b21eb0, type=type@entry=NETIEVENT_TASK) at netmgr/../../../../lib/isc/netmgr/netmgr.c:1027
@lorenzh, can you please try to print the content of the event 0x7f81f365dee8? Something like
0x7f81f365dee8
(gdb) print *(ldap_syncreplevent_t*)0x7f81f365dee8 (gdb) print *((ldap_syncreplevent_t*)0x7f81f365dee8)->inst
Here you go:
(gdb) print *(ldap_syncreplevent_t*)0x7f81f365dee8 $1 = {ev_size = 24, ev_attributes = 4083499024, ev_tag = 0x0, ev_type = 0, ev_action = 0x0, ev_arg = 0x0, ev_sender = 0x0, ev_destroy = 0x0, ev_destroy_arg = 0x0, ev_link = {prev = 0x0, next = 0x0}, ev_ratelink = { prev = 0x0, next = 0x0}, mctx = 0x0, inst = 0x0, prevdn = 0x0, chgtype = 0, entry = 0x0, seqid = 0}
(.inst is a null pointer)
.inst
Interesting. If .inst in the event structure were NULL, then we would have failed earlier, in
REQUIRE(inst != NULL);
But we haven't, as we passed to the next assert.
In any case, I think we should make the check for inst less damaging. I'll see what we can do here.
inst
Log in to comment on this ticket.