
Flink ZooKeeper HA on Kubernetes: "job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint"


When I deploy a Flink job on Kubernetes with ZooKeeper HA, I get the error below.

Our setup is a job cluster with one job pod and one task pod. We want the task pod to keep working when the job pod is deleted.

job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint

Below is my configuration:

high-availability: zookeeper
high-availability.storageDir: file:///opt/flink/data/
high-availability.zookeeper.quorum: zk-0.zk-hs:2181,zk-1.zk-hs:2181,zk-2.zk-hs:2181
high-availability.zookeeper.client.acl: open
high-availability.zookeeper.path.root: /flinkha
high-availability.cluster-id: /flink-job-service-kpi-ofcwy

Below is the error log:

2020-06-19 12:56:02,254 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
2020-06-19 12:56:02,293 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 0 checkpoints in ZooKeeper.
2020-06-19 12:56:02,293 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 0 checkpoints from storage.
2020-06-19 12:56:02,312 INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/00000000000000000000000000000000/job_manager_lock'}.
2020-06-19 12:56:02,454 INFO  org.apache.flink.runtime.jobmaster.JobManagerRunner           - JobManager runner for job KPI service job (00000000000000000000000000000000) was granted leadership with session id 9644799b-29cf-4ec5-9e68-5e45261aefb2 at akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0.
2020-06-19 12:56:02,532 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2020-06-19 12:56:02,534 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Starting execution of job KPI service job (00000000000000000000000000000000) under job master id 9e685e45261aefb29644799b29cf4ec5.
2020-06-19 12:56:02,552 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job KPI service job (00000000000000000000000000000000) switched from state CREATED to RUNNING.
2020-06-19 12:56:02,575 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) (6aeaf74d5a4ee58579e79fa1d3026535) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,618 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4abf5ce93cd365168228b616bd80ed71}]
2020-06-19 12:56:02,634 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Process -> Flat Map (1/1) (4ac2344f71fb9b6beb4a42fe18cf77a2) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,636 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(60000), ProcessingTimeTrigger, DistinctCountAggregateFunction, PassThroughWindowFunction) -> Map (1/1) (1fbb13647621f5e48db6f7d750c32865) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,636 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat Map -> (Sink: Unnamed, Sink: Print to Std. Out) (1/1) (46396671fce9498171d03a31b1cee968) switched from CREATED to SCHEDULED.
2020-06-19 12:56:02,655 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting to ResourceManager akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/resourcemanager(82039211570997fc83bd52bafb394879)
2020-06-19 12:56:02,674 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Resolved ResourceManager address, beginning registration
2020-06-19 12:56:02,677 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Registration at ResourceManager attempt 1 (timeout=100ms)
2020-06-19 12:56:02,692 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/00000000000000000000000000000000/job_manager_lock.
2020-06-19 12:56:02,693 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering job manager 9e685e45261aefb29644799b29cf4ec5@akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0 for job 00000000000000000000000000000000.
2020-06-19 12:56:02,753 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered job manager 9e685e45261aefb29644799b29cf4ec5@akka.tcp://flink@flink-job-service-kpi-ofcwy:35817/user/jobmanager_0 for job 00000000000000000000000000000000.
2020-06-19 12:56:02,775 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - JobManager successfully registered at ResourceManager, leader id: 82039211570997fc83bd52bafb394879.
2020-06-19 12:56:02,775 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl      - Requesting new slot [SlotRequestId{4abf5ce93cd365168228b616bd80ed71}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2020-06-19 12:56:02,777 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 00000000000000000000000000000000 with allocation id dcc3d3f3537cd3f1032fe47a0aafe577.
2020-06-19 12:56:40,983 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint triggering task Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2020-06-19 12:57:40,982 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Checkpoint triggering task Source: KPI-Kafka-Consumer -> (Sink: Print to Std. Out, Filter -> KPI Query Map -> KPI Unwind -> KPI Custom Map -> KPI filter -> KPI Data Transformation -> Filter) (1/1) of job 00000000000000000000000000000000 is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.

Solution

  • Solved it by fixing the service configuration. The setting below was missing:

    high-availability.jobmanager.port: 6070
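
For context: in HA mode Flink takes the JobManager RPC port from high-availability.jobmanager.port instead of jobmanager.rpc.port, and the default value of 0 picks a random free port (35817 in the log above), which a Kubernetes Service cannot expose ahead of time. Once the port is pinned, the Service in front of the job pod has to expose it as well. A minimal sketch of such a Service might look like the following. The selector label and the port name are assumptions, not taken from the original setup, and the Service name is simply the host name that appears in the akka.tcp addresses in the log.

```yaml
# Hypothetical Kubernetes Service for the job (JobManager) pod.
# Sketch only: adapt names and labels to the actual deployment.
apiVersion: v1
kind: Service
metadata:
  name: flink-job-service-kpi-ofcwy   # host name seen in the akka.tcp:// addresses in the log
spec:
  selector:
    app: flink-jobmanager             # assumed label; use the labels the job pod actually carries
  ports:
    - name: ha-rpc                    # assumed port name
      port: 6070                      # must match high-availability.jobmanager.port
      targetPort: 6070
```

Other ports the deployment may already expose (for example the web UI or blob server) are left out of this sketch.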