My table in a PostgreSQL database stores a large number of integer arrays in the column called vector
. Since all the numbers fit into the range from -128 to 127 and to save space, I use the bytea
type. I need to use the dot product distance operator <#>
provided by the pgvector
extension, e.g.:
select id
from MyTable
order by vector<#>key
limit 10;
However, this operator does not support the bytea
type. I want to convert my vectors to int[]
, which can then be cast to the vector
type supported by pgvector
.
I think the general direction for converting bytea
to int[]
might be along the lines of:
SELECT string_to_array(encode(vector, 'escape'), '\\')::int[]
FROM MyTable;
How do I modify this to make the conversion work?
You can join with generate_series()
and use the get_byte()
function:
WITH b(b) AS (VALUES (BYTEA '\x55FF1200DEADBEEF'))
SELECT array_agg(get_byte(b.b, g.i) - 128)
FROM b
CROSS JOIN LATERAL generate_series(0, length(b) - 1) AS g(i);
array_agg
══════════════════════════════════
{-43,127,-110,-128,94,45,62,111}
(1 row)
This is going to be slow and inefficient, but that is to be expected if you try to save space at all costs.
I would recommend to use a smallint[]
, which uses 2 bytes per number.
If you are running PostgreSQL v15 or later, you could also use an array of the internal data type "char"
, which takes only one byte per element:
SELECT ARRAY[-43,127,-110,-128,94,45,62,111]::"char"[];
array
════════════════════════════════════════
{"\\325",\x7F,"\\222","\\200",^,-,>,o}
(1 row)
To get an individual element, you use
SELECT (ARRAY[-43,127,-110,-128,94,45,62,111]::"char"[])[3]::integer;
array
═══════
-110
(1 row)
A "char"[]
uses one byte per element plus an overhead of 21 bytes per array (a bytea
has an overhead of one or four bytes per value, depending on the size).