I have read the full documentation for gnu sort and searched online but I cannot find what the default for the --buffer-size option is (which determines how much system memory the program uses when it runs). I am guessing it is somehow determined based on total system memory? (or perhaps on memory available at the time the program is begins execution?). How can I determine this?
update: I've experimented a bit and it seems that when I don't specify a particular --buffer-size value, it ends up using very little ram and thus going very slowly. It would be nice though to better understand what exactly is determining this behavior.
I went digging through the coreutils sort source code and found these functions: default_sort_size
and sort_buffer_size
.
It turns out that --buffer-size
(sort_size
in the source code) isn't the target buffer size but rather the maximum buffer size. If no --buffer-size
value is specified, the default_sort_size
function is used to determine a safe maximum buffer size. It does this based on resource limits, available memory, and total memory. A summary of the function is as follows:
size = MIN(SIZE_MAX, resource_limit) / 2;
mem = MAX(available_memory, total_memory / 8);
if ( size > total_memory * 0.75 )
size = total * 0.75;
buffer_max = MIN(mem, size);
buffer_max = MAX(buffer, MIN_SORT_SIZE);
The other function, sort_buffer_size
, is used to determine exactly how much memory to allocate for the given input files. A summary of the function is as follows:
if (sort_size is set)
size_bound = sort_size;
else
size_bound = default_sort_size();
buffer_size = line_bytes + 2;
for each input_file
if (input_file is regular)
file_size = input_file_size;
else
if (sort_size is set)
return sort_size;
else
file_size = guess;
worst_case = file_size * worst_case_per_input_byte + 1;
if (worst_case overflows || size + worst_case >= size_bound)
return size_bound;
else
size += worst_case;
return size;
Possibly the most important point of the sort_buffer_size
function is that if you're sorting data from STDIN or a pipe, it will automatically default to sort_size
(i.e. --buffer-size
) if it was provided. Otherwise, for regular files it will make some rough calculations based on the file sizes and only use sort_size
as an upper limit.