When I deploy a Flink job to Kubernetes with ZooKeeper HA, I get the error below.
Our setup is a job cluster with 1 JobManager (job) pod and 1 TaskManager (task) pod. The goal is that when the job pod is deleted, the task can still continue working.
job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint
Below is my configuration:
high-availability: zookeeper
high-availability.storageDir: file:///opt/flink/data/
high-availability.zookeeper.quorum: zk-0.zk-hs:2181,zk-1.zk-hs:2181,zk-2.zk-hs:2181
high-availability.zookeeper.client.acl: open
high-availability.zookeeper.path.root: /flinkha
high-availability.cluster-id: /flink-job-service-kpi-ofcwy
Below is the error log:
2020-06-19 12:56:02,254 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2020-06-19 12:56:02,293 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 0 checkpoints in ZooKeeper.
2020-06-19 12:56:02,293 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 0 checkpoints from storage.
2020-06-19 12:56:02,312 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
2020-06-19 12:56:02,454 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager runner for job KPI service job (00000000000000000000000000000000) was granted leadership with session id 9644799b-29cf-4ec5-9e68-5e45261aefb2 at akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0.
2020-06-19 12:56:02,532 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020-06-19 12:56:02,534 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job KPI service job (00000000000000000000000000000000) under job master id 9e685e45261aefb29644799b29cf4ec5.
2020-06-19 12:56:02,552 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job KPI service job (00000000000000000000000000000000) switched from state CREATED to RUNNING.
2020-06-19 12:56:02,575 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) (6aeaf74d5a4ee58579e79fa1d3026535) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,618 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4abf5ce93cd365168228b616bd80ed71}]
2020-06-19 12:56:02,634 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Process -> Flat Map (1/1) (4ac2344f71fb9b6beb4a42fe18cf77a2) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,636 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(60000), ProcessingTimeTrigger, DistinctCountAggregateFunction, PassThroughWindowFunction) -> Map (1/1) (1fbb13647621f5e48db6f7d750c32865) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,636 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Flat Map -> (Sink: Unnamed, Sink: Print to Std. Out) (1/1) (46396671fce9498171d03a31b1cee968) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,655 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/resourcemanager(82039211570997fc83bd52bafb394879)
2020-06-19 12:56:02,674 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2020-06-19 12:56:02,677 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2020-06-19 12:56:02,692 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/00000000000000000000000000000000/job_manager_lock.
2020-06-19 12:56:02,693 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9e685e45261aefb29644799b29cf4ec5@akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0 for job 00000000000000000000000000000000.
2020-06-19 12:56:02,753 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9e685e45261aefb29644799b29cf4ec5@akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0 for job 00000000000000000000000000000000.
2020-06-19 12:56:02,775 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 82039211570997fc83bd52bafb394879.
2020-06-19 12:56:02,775 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new slot [SlotRequestId{4abf5ce93cd365168228b616bd80ed71}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2020-06-19 12:56:02,777 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 00000000000000000000000000000000 with allocation id dcc3d3f3537cd3f1032fe47a0aafe577.
2020-06-19 12:56:40,983 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2020-06-19 12:57:40,982 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
I solved it by configuring the service. The following configuration was missing:
high-availability.jobmanager.port: 6070
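Why this likely helps: with HA enabled, the JobManager does not use jobmanager.rpc.port but binds to the port given by high-availability.jobmanager.port, which defaults to a random port (35817 in the log above). A Kubernetes Service cannot expose a port that changes on every start, so the TaskManager never reaches the JobManager and the tasks stay in SCHEDULED. With the port pinned to 6070, the Service can expose it. A minimal sketch of such a Service follows; the Service name is taken from the akka address in the log, while the selector labels and the port name are assumptions that must match your own deployment:

```yaml
# Sketch only: selector labels and the port name are hypothetical.
apiVersion: v1
kind: Service
metadata:
  # Hostname that appears in the akka address in the log above.
  name: flink-job-service-kpi-ofcwy
spec:
  selector:
    # Assumed labels; use the labels of your JobManager pod.
    app: flink
    component: jobmanager
  ports:
    # Must match high-availability.jobmanager.port in flink-conf.yaml.
    - name: ha-rpc
      port: 6070
      targetPort: 6070
```

Any other ports the Service already exposes (for example the REST or blob server port) stay as they are; the only addition is the fixed HA RPC port.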