I work with Flink 1.18.1 Java 11. One of my pipelines after around 30 minutes start getting into crash loop. In the TaskManager i can see log:
Exception in thread "Thread-21" java.lang.IllegalArgumentException: classLoader cannot be null.
at com.esotericsoftware.kryo.Kryo.setClassLoader(Kryo.java:975)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.checkKryoInitialized(KryoSerializer.java:553)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:394)
at org.apache.flink.api.common.typeutils.CompositeSerializer.deserialize(CompositeSerializer.java:156)
at org.apache.flink.contrib.streaming.state.ttl.RocksDbTtlCompactFiltersManager$ListElementFilter.nextElementLastAccessTimestamp(RocksDbTtlCompactFiltersManager.java:205)
at org.apache.flink.contrib.streaming.state.ttl.RocksDbTtlCompactFiltersManager$ListElementFilter.nextUnexpiredOffset(RocksDbTtlCompactFiltersManager.java:191)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fc338f944bf, pid=1, tid=388
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.23+9 (11.0.23+9) (build 11.0.23+9)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.23+9 (11.0.23+9, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x7944bf] Exceptions::_throw_oop(Thread*, char const*, int, oopDesc*)+0x15f
#
# Core dump will be written. Default location: /opt/flink/core.1
#
# An error report file with more information is saved as:
# /opt/flink/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
# https://github.com/adoptium/adoptium-support/issues
#
Can't find a solution for this, any ideas?
@Update The relevant part of hs_err_pid1.log:
Current thread (0x00007f476002d800): JavaThread "Thread-21" [_thread_in_vm, id=407, stack(0x00007f46a48d4000,0x00007f46a50d4000)]
Stack: [0x00007f46a48d4000,0x00007f46a50d4000], sp=0x00007f46a50d0f10, free space=8179k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x7944bf] Exceptions::_throw_oop(Thread*, char const*, int, oopDesc*)+0x15f
V [libjvm.so+0x95a8ac] jni_Throw+0x8c
C [librocksdbjni-linux64.so+0x222ce1] JavaListElementFilter::NextUnexpiredOffset(rocksdb::Slice const&, long, long) const+0x121
C [librocksdbjni-linux64.so+0x648941] rocksdb::flink::FlinkCompactionFilter::ListDecide(rocksdb::Slice const&, std::string*) const+0x81
C [librocksdbjni-linux64.so+0x648de7] rocksdb::flink::FlinkCompactionFilter::FilterV2(int, rocksdb::Slice const&, rocksdb::CompactionFilter::ValueType, rocksdb::Slice const&, std::string*, std::string*) const+0xc7
C [librocksdbjni-linux64.so+0x2d0b74] rocksdb::CompactionIterator::InvokeFilterIfNeeded(bool*, rocksdb::Slice*)+0x914
C [librocksdbjni-linux64.so+0x2d24a5] rocksdb::CompactionIterator::NextFromInput()+0x3c5
C [librocksdbjni-linux64.so+0x2d4911] rocksdb::CompactionIterator::SeekToFirst()+0x11
C [librocksdbjni-linux64.so+0x2dd2be] rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x55e
C [librocksdbjni-linux64.so+0x2de668] rocksdb::CompactionJob::Run()+0x278
C [librocksdbjni-linux64.so+0x33cec4] rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x1084
C [librocksdbjni-linux64.so+0x3407e7] rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0xd7
C [librocksdbjni-linux64.so+0x340dda] rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a
C [librocksdbjni-linux64.so+0x5e5744] rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x254
C [librocksdbjni-linux64.so+0x5e58ed] rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x5d
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000
It seems that the problem was correlated to this Flink bug. When I switch TTL mechanism to simple timer-clearing for List State variables (I removed all the TTLs tbh), the pipeline starts working properly and no longer crashes. It's strange that the bug is marked as "Not a Priority " and present since Flink 1.8.0.