After a couple of years of running ArangoDB3 without issue, I am suddenly encountering an AQL IO error of the form:
[HTTP 500][ERR 1305] AQL: IO error: While open a file for random read: /ssd1/arangodb3/engine-rocksdb/22850496.sst: No file descriptors available (while finalizing)
This happens while performing an insert of the form:
insert { id: "foo", junk: [ 1, 2, 3 ] } in bar
This occurred after running a lengthy operation populating a new database.
Looking at syslog, I see the following (timestamps, etc., elided for readability):
ERROR [fae2c] {rocksdb} RocksDB encountered a background error during a compaction operation: IO error: While open a file for random read: /ssd1/arangodb3/engine-rocksdb/22850496.sst: No file descriptors available; The database will be put in read-only mode, and subsequent write errors are likely. It is advised to shut down this instance, resolve the error offline and then restart it.
ERROR [be9ea] {rocksdb} rocksdb: [db/db_impl/db_impl_compaction_flush.cc:2922] Waiting after background compaction error: IO error: While open a file for random read: /ssd1/arangodb3/engine-rocksdb/22850496.sst: No file descriptors available, Accumulated background error counts: 1
WARNING [afa17] {engines} could not sync metadata for collection 'OpenAlex_20240502/works'
WARNING [a3d0c] {engines} background settings sync failed: IO error: While open a file for random read: /ssd1/arangodb3/engine-rocksdb/22850496.sst: No file descriptors available
WARNING [afa17] {engines} could not sync metadata for collection 'OpenAlex_20240502/publishers'
The first message above seems indicative of something, but I'm not sure what.
The file in question, /ssd1/arangodb3/engine-rocksdb/22850496.sst, does not exist, which would be an obvious source of the problem, but I'm not sure how to cure it.
Restarting both ArangoDB and the system does not clear the problem.
There is more than enough space on the filesystem
/dev/nvme0n1p1 7.3T 4.6T 2.8T 63% /ssd1
so that's not an issue.
arangodb --version reports:
Arango DB Version 0.18.2, build 3518b68, Go go1.21.5
arangosh --version reports:
3.11.8
architecture: 64bit
arm: false
asan: false
assertions: false
avx: true
avx2: false
boost-version: 1.78.0
build-date: 2024-02-22 14:43:37
build-repository: refs/tags/v3.11.8 eb715d099fb
compiler: gcc [11.2.1 20220219]
coverage: false
cplusplus: 202002
curl-version: none
debug: false
endianness: little
failure-tests: false
fd-client-event-handler: poll
fd-setsize: 1024
full-version-string: ArangoDB 3.11.8 [linux] 64bit, using jemalloc, build refs/tags/v3.11.8 eb715d099fb, VPack 0.2.1, RocksDB 7.2.0, ICU 64.2, V8 7.9.317, OpenSSL 3.0.13 30 Jan 2024
icu-version: 64.2
ipo: true
iresearch-version: 1.3.0.0
jemalloc: true
libunwind: true
license: community
maintainer-mode: false
memory-profiler: true
ndebug: true
openssl-version-compile-time: OpenSSL 3.0.13 30 Jan 2024
openssl-version-run-time: OpenSSL 3.0.13 30 Jan 2024
optimization-flags: -mfxsr -mmmx -msse -msse2 -mcx16 -msahf -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3 -mpclmul -mavx -mxsave
pic: 2
pie: 2
platform: linux
reactor-type: epoll
replication2-enabled: false
rocksdb-version: 7.2.0
server-version: 3.11.8
sizeof int: 4
sizeof long: 8
sizeof void*: 8
sse42: true
tsan: false
unaligned-access: true
v8-version: 7.9.317
vpack-version: 0.2.1
zlib-version: 1.2.13
I'm running Ubuntu 23.10
Linux servername 6.5.0-28-generic #29-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 28 23:46:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
I have tried reinstalling ArangoDB, to no avail. I have restarted the application from a checkpoint, and it now immediately and consistently fails. Even a simple insert like the one above throws the same error.
The application, written in Python, is parallelized using the multiprocessing module, with 64 threads/processes all performing uploads.
I have the identical code running on another system and it happily runs to completion, so I'm puzzled as to what might be going sideways here.
First, I will mention that I managed to get the application to run through the expedient of deleting a couple of rather large but unused databases and rerunning the application from the start. Success! Though, as I point out below, that was due more to dumb luck than intelligence.
As the commenters indicated, I believe file descriptor limits are the culprit -- but, like the drunk under the lamp post, I was looking for my keys where the light was good rather than where I'd actually dropped them.
It finally struck me this morning to look carefully at the ArangoDB startup file, /lib/systemd/system/arangodb3.service, which contains the following:
# system limits
LimitNOFILE=131072
LimitNPROC=131072
TasksMax=131072
That produced a bit of an "Aha!" moment in my dim little brain.
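(As a sanity check, you can read the open-file limit actually in force for the running server straight out of /proc; this sketch assumes a single arangod process:)

# show the soft/hard limits on open files applied to the running arangod process
grep -i 'open files' /proc/"$(pidof arangod)"/limits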
That led me to take a look at a backup saved before I deleted the extraneous databases:
myobfuscatedhost:/ssd1/arangodb3/engine-rocksdb# ls -1 | grep sst | wc -l
130878
Lo and behold, 130,878 .sst files is uncomfortably close to the 131,072 file descriptor limit.
Using lsof, it appears that, indeed, every file in the engine-rocksdb directory is held open.
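A rough count along these lines (again assuming a single arangod process; the exact invocation is mine, not gospel) tells the story:

# count how many .sst files the arangod process currently holds open
lsof -p "$(pidof arangod)" 2>/dev/null | grep -c '\.sst'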
Ergo, raising the above limits should resolve the problem.
I say should because I have not yet fully tested my hypothesis: as I mentioned in my opening paragraph, I temporarily resolved the problem by removing excess datasets, leaving myself enough headroom to accomplish the task at hand.
I will increase the limits in /lib/systemd/system/arangodb3.service, cross my fingers, and hope for the best.
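For anyone in the same spot, a systemd drop-in override is one way to raise these limits without editing the packaged unit file directly (which a package upgrade may overwrite). The values below are only my guess at sufficient headroom, not a tested recommendation:

# /etc/systemd/system/arangodb3.service.d/override.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
TasksMax=1048576

# then pick up the change and restart the server
systemctl daemon-reload
systemctl restart arangodb3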
Thanks to all who eventually managed to pound the answer into my thick head.