I have two datacentres named site1 and site2, with 10 nodes at each site:
site1: 10.10.1.1 .. 10.10.1.10
site2: 10.10.2.1 .. 10.10.2.10
The network between site1 and site2 is fibre, with latency below 1 ms.
site1 /etc/hosts
10.10.1.1 node01.server
10.10.1.2 node02.server
10.10.1.3 node03.server
10.10.1.4 node04.server
10.10.1.5 node05.server
10.10.1.6 node06.server
10.10.1.7 node07.server
10.10.1.8 node08.server
10.10.1.9 node09.server
10.10.1.10 node10.server
site2 /etc/hosts
10.10.2.1 node01.server
10.10.2.2 node02.server
10.10.2.3 node03.server
10.10.2.4 node04.server
10.10.2.5 node05.server
10.10.2.6 node06.server
10.10.2.7 node07.server
10.10.2.8 node08.server
10.10.2.9 node09.server
10.10.2.10 node10.server
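(Since the same hostnames resolve to different IPs depending on the site, it is worth confirming what each node actually resolves at any given moment. A quick check, run on every node, using the hostnames from the files above:)
# print the address each peer name currently resolves to on this node
for h in node{01..10}.server; do
  getent hosts "$h"
done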
My migration process is to move the nodes one at a time, updating /etc/hosts on all nodes after each move. For example, after moving node10 to site2, the hosts files look like:
10.10.1.1 node01.server
10.10.1.2 node02.server
10.10.1.3 node03.server
10.10.1.4 node04.server
10.10.1.5 node05.server
10.10.1.6 node06.server
10.10.1.7 node07.server
10.10.1.8 node08.server
10.10.1.9 node09.server
10.10.2.10 node10.server
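(Roughly, each node is moved like the sketch below; the riak and riak-admin commands are standard, but the exact sequence here is a reconstruction:)
riak stop                                  # on the node being moved, e.g. node10
# update /etc/hosts on all 10 nodes with the node's new 10.10.2.x address
riak start                                 # bring the node back up in site2
riak-admin wait-for-service riak_kv riak@node10.server
riak-admin transfers                       # wait until any handoffs drain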
Now all the nodes of my Riak cluster are in site2. But we have encountered "not found" problems.
I use curl to fetch a key from the Riak cluster, like below:
curl -w'\n' -i http://node04.server:8098/buckets/spgs_gamelog_40_1968_0/keys/20049355203
The first time, the command above returns a not-found error like below:
HTTP/1.1 404 Object Not Found
Server: MochiWeb/2.20.0 WebMachine/1.11.1 (greased slide to failure)
Date: Wed, 22 May 2024 07:04:02 GMT
Content-Type: text/plain
Content-Length: 10
not found
OR
HTTP/1.1 503 Service Unavailable
Server: MochiWeb/2.20.0 WebMachine/1.11.1 (greased slide to failure)
Date: Wed, 22 May 2024 07:16:04 GMT
Content-Type: text/plain
Content-Length: 25
R-value unsatisfied: 1/2
Repeating the above command, the key returns a value like below:
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVI8th99Mm94rOFjWX6EJYMpkTGPlaFwTvR9viwA
x-riak-index-t_int: 1709645937049
Vary: Accept-Encoding
Server: MochiWeb/2.20.0 WebMachine/1.11.1 (greased slide to failure)
Link: </buckets/spgs_gamelog_40_1968_0>; rel="up"
Last-Modified: Tue, 05 Mar 2024 13:38:57 GMT
ETag: "47x7CEP3nrsLeJkswEXcTw"
Date: Wed, 22 May 2024 07:04:30 GMT
Content-Type: application/json
Content-Length: 1032
{"user_id":"240440215","wriak_t":1709645937050}
I have tried adding parameters like notfound_ok=false&r=3&pr=1, but it does not help!
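(One thing worth checking when passing those parameters on the command line: the URL must be quoted, otherwise the shell treats each & as a background operator and the extra options never reach Riak. A sketch with the same key as above:)
# quote the URL so the shell does not interpret the & separators
curl -w'\n' -i "http://node04.server:8098/buckets/spgs_gamelog_40_1968_0/keys/20049355203?notfound_ok=false&r=3&pr=1"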
After checking the console.log file, I discovered the warning messages below:
riak_kv_vnode:log_key_amnesia:4493 Inbound clock entry for <<157,70,93,96,209,34,165,36>> in <<"spgs_gamelog_40_1968_0">>/<<"20049355203">> greater than local.Epochs: {In:70316435 Local:0}. Counters: {In:1 Local:0}
Does the Riak client have any way to make the fetch succeed immediately after triggering read repair? Or, if the LevelDB partitions are corrupted, what should I do?
Added on 2024-05-22:
find . -name "LOG" -exec grep -l 'Compaction error' {} \;
It finds no errors in the LevelDB folders.
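(LevelDB also logs corruption messages to the same LOG files, so a broader version of the same check may be worthwhile; /var/lib/riak is an assumed default path, adjust to your platform_data_dir:)
# search current and rotated LevelDB logs for corruption messages as well
find /var/lib/riak/leveldb -name "LOG*" -exec grep -l -e 'Compaction error' -e 'Corruption' {} \;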
Added on 2024-05-23: Posting some new warning logs below. Does this mean my LAN has some latency?
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:join_mbox_replies:1226 soft-limit mailbox check timeout
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:check_mailboxes:1192 Mailbox soft-load poll timout 100
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:add_errors_to_mbox_data:1239 Mailbox for {633697975561446187189878970435575840553939501056,'riak@node03.server'} did not return in time
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:add_errors_to_mbox_data:1239 Mailbox for {630843480176034267427762398496676850281174007808,'riak@node02.server'} did not return in time
2024-05-23 02:58:04.915 [warning] <0.5795.5043>@riak_kv_put_fsm:add_errors_to_mbox_data:1239 Mailbox for {627988984790622347665645826557777860008408514560,'riak@node01.server'} did not return in time
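(Before concluding it is the LAN, it may be worth measuring the round-trip time between nodes directly; a minimal sketch using the hostnames from the hosts files above:)
# print the rtt summary line from this node to every peer
for h in node{01..10}.server; do
  echo -n "$h: "
  ping -c 10 -q "$h" | tail -1
done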
Added on 2024-05-27: Both old keys from before the migration and new keys written after the migration have this issue!
From what you said before you edited your question, it seems that as part of your migration you have the single Riak KV 2.9.10 cluster spanning two different datacentres.
I assume that you are updating the "/etc/hosts" files on all nodes such that the nodes in both datacentres resolve a given nodename to a single specific node (i.e. you update all 10 nodes at the same time to say that "riak@node1.local" has a new IP of 1.2.3.4).
This is generally a bad idea unless the connection between sites is very, very fast (we're talking dark fibre, 1 ms latency). Odds are that in your case the connection is too slow. The key-amnesia warning you posted, showing an inbound clock entry ahead of the local one, seems to back this up.
The recommended methods of migrating to a new datacentre avoid having the cluster span both sites during the move. Given the limited information, to potentially solve your problem, move all the remaining nodes at the same time. You might also then want to run partition repairs.
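(If you do run partition repairs, the documented way in Riak KV is to call riak_kv_vnode:repair/1 from an attached console for each partition a node owns. A sketch, run on each node in turn; the node name in the prompt is taken from your hosts files, and repair is I/O-heavy, so do one node at a time:)
riak attach
(riak@node01.server)1> {ok, Ring} = riak_core_ring_manager:get_my_ring().
(riak@node01.server)2> Partitions = [P || {P, N} <- riak_core_ring:all_owners(Ring), N =:= node()].
(riak@node01.server)3> [riak_kv_vnode:repair(P) || P <- Partitions].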