I have a database table where each entry is a resource that can be used once by each user. Each user can only use resources that they have not already used, but the same resource can be used by different users, so if I'm not mistaken, the users and resources are in a M:N relationship, this is why I have an association table between them. The schema looks like this:
CREATE TABLE users (
id SERIAL NOT NULL,
PRIMARY KEY(id)
)
CREATE TABLE resources (
id VARCHAR NOT NULL,
created_at TIMESTAMP,
PRIMARY KEY(id)
)
CREATE TABLE resource_usage (
resource_id VARCHAR NOT NULL,
user_id SERIAL NOT NULL,
created_at TIMESTAMP,
PRIMARY KEY(resource_id, user_id),
FOREIGN KEY(resource_id) REFERENCES resources (id),
FOREIGN KEY(user_id) REFERENCES users (id)
)
I run the following SQL query (PostgreSQL 11.16) to select the first available resource for a specific user.
SELECT * FROM resources WHERE NOT EXISTS(
SELECT 1 FROM resource_usage
WHERE resources.id = resource_usage.resource_id
AND resource_usage.user_id = 1
)
LIMIT 1
To add some context, there should never be a scenario where there is no available resource for a user. I keep metrics of available resources for each user and add more accordingly. The resources
table contains a little over a million of rows, but it will keep growing. There is no explicit order in which the resources should be consumed.
As both the resources
and resource_usage
tables grow larger, the nested loop anti-join takes longer to complete (it takes longer to find the first available resource). My question is how do I write an index that would make this operation more efficient.
So far I have tried to create a sorting index where I sorted the resources by the time they were inserted. This made sense to me, because if the resource has only been added now, it's likely it wasn't used yet at all. This actually helped a little with the performance issue, but I don't add new resources very often and the performance starts to degrade quickly.
Even with index support this kind of anti-join grows expensive with very big numbers of used resources.
Your example makes it seem arbitrary which resource to use next. So you might as well enforce a deterministic order in which resources have to be used. (And you commented that you can.)
You already mentioned an insertion timestamp for resources. While multiple resources can be entered at the same time, we might use something like (created_at, id)
as deterministic sort order for resources
. But created_at
must be NOT NULL
and values monotonically nondecreasing.
And we need an equally deterministic order for resource_usage
. I see you already have a similar (distinct) column created_at
. (created_at, id)
can be the deterministic sort order per user_id
for resource_usage
.
Since new resources are always entered with a higher (created_at, id)
, we never miss a new resource, either.
Have these indexes:
CREATE INDEX foo ON resources (created_at, id);
CREATE INDEX bar ON resource_usage (user_id, created_at, id);
We also need the index that comes with the PK on resources(id)
.
Then this query is, and always will be, very fast:
SELECT id
FROM resources r
WHERE (created_at, id) > (
SELECT COALESCE(
(
SELECT (r1.created_at, r1.id)
FROM (
-- id of most recently used resource
SELECT id
FROM resource_usage u
WHERE u.user_id = 1 -- user_id here!
ORDER BY created_at, id
LIMIT 1
) u
JOIN resource r1 USING (id)
)
, ('-infinity', '') -- matching data types!
)
)
ORDER BY created_at, id
LIMIT 1;
Remember that resource_usage.created_at
is distinct from resources.created_at
. After sorting resource_usage
, we need to grab the corresponding resources.created_at
.
I threw in COALESCE
to cover the case where a user hasn't used anything, yet, defaulting to a row that's guaranteed to be smaller than any existing row. Using ROW values for the purpose. About ROW value comparison:
To rule out race conditions from concurrent write access, take a write lock on the user row in the same transaction before running this query. Like:
BEGIN;
SELECT FROM users WHERE id = 1 FOR NO KEY UPDATE;
-- above query
-- actually claim the resource by writing to resource_usage
COMMIT;
You have to consolidate existing data before you can start using this modus operandi. Make sure there is no older, unused resource for the same user.
This actually helped a little with the performance issue, but I don't add new resources very often and the performance starts to degrade quickly.
That's to be expected. This approach finds unused resources quickly at first. But, over time, it degrades to a worst-case scenario as the search always starts at the same point and ensures a maximum of rows to be skipped before finding the next free one.
PostgreSQL 11.16 is getting old (EOL Nov 2023). Upgrade to a current version at your earliest opportunity.
timestamptz
is typically a better choice than timestamp
if multiple time zone can be involved in any way.