I'm trying to implement a skip list that perform as good as a BST using a minimum additional memory overhead, at the moment even not considering any memory restriction the performance of my SkipList implementation is far away from even a very naive Balanced BST implementation -so to say, handcrafted BTS :) -
As reference I'm using the original paper from William Pugh PUG89 and the implementation i found in Algorithms in C from Sedgewick -13.5-. My code is a recursive implementation, here's the clue of the insert and find operation:
sl_node* create_node()
{
short lvl {1};
while((dist2(en)<p)&&(lvl<max_level))
++lvl;
return new sl_node(lvl);
}
void insert_impl(sl_node* cur_node,
sl_node* new_node,
short lvl)
{
if(cur_node->next_node[lvl]==nullptr || cur_node->next_node[lvl]->value > new_node->value){
if(lvl<new_node->lvl){
new_node->next_node[lvl] = cur_node->next_node[lvl];
cur_node->next_node[lvl] = new_node;
}
if(lvl==0) return;
insert_impl(cur_node,new_node,lvl-1);
return;
}
insert_impl(cur_node->next_node[lvl],new_node,lvl);
}
sl_node* insert(long p_val)
{
sl_node* new_node = create_node();
new_node->value = p_val;
insert_impl(head, new_node,max_level-1);
return new_node;
}
And this is the code for the find operation:
sl_node* find_impl(sl_node* cur_node,
long p_val,
int lvl)
{
if(cur_node==nullptr) return nullptr;
if(cur_node->value==p_val) return cur_node;
if(cur_node->next_node[lvl] == nullptr || cur_node->next_node[lvl]->value>p_val){
if(lvl==0) return nullptr;
return find_impl(cur_node,p_val,lvl-1);
}
return find_impl(cur_node->next_node[lvl],p_val,lvl);
}
sl_node* find(long p_val)
{
return find_impl(head,p_val,max_level-1);
}
A sl_node -skip list node- looks like this:
struct sl_node
{
long value;
short lvl;
sl_node** next_node;
sl_node(int l) : lvl(l)
{
next_node = new sl_node*[l];
for(short i{0};i<l;i++)
next_node[i]=nullptr;
}
~sl_node()
{
delete[] next_node;
}
};
As you can see this is implementation has nothing special nor advanced if compared to the book implementation, i will not share the Balaced BTS code since I don't think is needed here, but is a very basic BTS with a rebalance function triggered during the insertion when the new node height is greather than 16*lg(n) where n is the number of nodes.
So to say, i rebalance the three only if the maximum height is 16 times greather than the best theoretical one, as i said, this straighforward homemade BST performs much better than the homemade Skip List.
But first, let's have a look at some data, using p=.5 and n=262144 the level of the nodes in the SkipList has the following distribution:
1:141439, 53.9547%
2:65153, 24.8539%
3:30119, 11.4895%
4:13703, 5.22728%
5:6363, 2.42729%
6:2895, 1.10435%
7:1374, 0.524139%
8:581, 0.221634%
9:283, 0.107956%
10:117, 0.044632%
11:64, 0.0244141%
12:31, 0.0118256%
13:11, 0.00419617%
14:5, 0.00190735%
15:1, 0.00038147%
16:5, 0.00190735%
Which almost perfectly -oh, this is big one!- match the theory from the article, that is: 50% level 1, 25% level 2 and so on and so forth. The input data came from my best available pseudo-random number generator aka std::random_device with std::default_random_engine and an uniform int distribution. The input looks pretty random to me :) !
The time required to search for 'all' the 262144 elements in the SkipList -in random order- is 315ms on my machine, whereas for the same search operation on the naive BTS the required time is 134ms, so the BTS is almost twice faster than the SkipList. That is not exactly what i was expecting from "Skip list algoriths have the same asymptotic expected time bounds as balanced trees and are simple, faster and use less space" PUG89.
The time required for the 'insertion' of the nodes is 387ms for the SkipList and 143ms for the BTS, again a naive BST perform better.
The things get little more interesting if instead of using a random sequence of input number i use a sorted sequence, here my poor homemade BST become slow, and the insertion of 262144 sorted int's takes 2866ms whereas the SkipList require only 168ms.
But, when come to the search time, the BST is still faster! for the sorted input we have 234ms versus 77ms, this BST is 3x faster.
With different values of the p-factor i got slightly different performance results:
And last but not least, the memory usage graph, which as you may expect confirm that if we increase the amount of levels per node we impact significantly the memory fingerprint. The memory usage is calculated just as a sum of the space required to store the additional pointers for all the node.
After all this, does anybody of you can provide me a comment on how to implement a SkipList performing as good as a BTS -not counting the additional memory overhead-?
I know about the article from DrDobbs LINK about SkipList, and i went thru all the paper, the code for the search and insert operation match exactly the original implementation from PUG89 so should be as good as mine, and the article does not provide any performance analysis anyway, I did not found any other source. Can you help me?
Any comment is highly appreciated!
Thanks! :)
Times have changed a bit since William Pugh wrote his original paper. We see no mention in his paper about the memory hierarchy of the CPU and operating system which has become such a prevalent focus today (now often equally as important as algorithmic complexity).
His input case for benchmarking had a measly 2^16 elements, and hardware back then typically had, at most, 32-bit extended memory addressing available. This made the size of a pointer half the size or smaller than what we're used to today on 64-bit machines. Meanwhile a string field, e.g., could be just as large, making the ratio between the elements stored in the skip list and the pointers required by a skip node potentially a lot smaller, especially given that we often need a number of pointers per skip node.
C Compilers weren't so aggressive at optimization back then with respect to things like register allocation and instruction selection. Even average hand-written assembly could often provide a significant benefit in performance. Compiler hints like register
and inline
actually made a big deal during those times. While this might seem kind of moot since both a balanced BST and skip list implementation would be on equal footing here, optimization of even a basic loop was a more manual process. When optimization is an increasingly manual process, something that is easier to implement is often easier to optimize. Skip lists are often considered to be a lot easier to implement than a balancing tree.
So all of these factors probably had a part in Pugh's conclusions at the time. Yet times have changed: hardware has changed, operating systems have changed, compilers have changed, more research has been done into these topics, etc.
With that aside, let's have some fun and implement a basic skip list. I ended up adapting the implementation available here out of laziness. It's a run-of-the-mill kind of implementation, hardly different from the plethora of easily-accessible exemplary skip list implementations out there today.
We'll be comparing the performance of our implementation against std::set
which is almost always implemented as a red-black tree*.
* Some might wonder why I use 0
instead of nullptr
and things of that sort. It's a habit. At my workplace, we still have to write open libraries that target a wide range of compilers including those that only support C++03, so I'm still used to writing lower/mid-level implementation code that way, and sometimes even in C, so please forgive the old style in which I wrote this code.
#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <vector>
#include <cassert>
#include <cstring>
#include <set>
using namespace std;
static const int max_level = 32;
static const float probability = 0.5;
static double sys_time()
{
return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}
static int random_level()
{
int lvl = 1;
while ((static_cast<float>(rand()) / RAND_MAX) < probability && lvl < max_level)
++lvl;
return lvl;
}
template <class T>
class SkipSet
{
public:
SkipSet(): head(0)
{
head = create_node(max_level, T());
level = 0;
}
~SkipSet()
{
while (head)
{
Node* to_destroy = head;
head = head->next[0];
destroy_node(to_destroy);
}
}
bool contains(const T& value) const
{
const Node* node = head;
for (int i=level; i >= 0; --i)
{
while (node->next[i] && node->next[i]->value < value)
node = node->next[i];
}
node = node->next[0];
return node && node->value == value;
}
void insert(const T& value)
{
Node* node = head;
Node* update[max_level + 1];
memset(update, 0, sizeof(Node*)*(max_level + 1));
for (int i = level; i >= 0; i--)
{
while (node->next[i] && node->next[i]->value < value)
node = node->next[i];
update[i] = node;
}
node = node->next[0];
if (!node || node->value != value)
{
int lvl = random_level();
assert(lvl >= 0);
if (lvl > level)
{
for (int i = level + 1; i <= lvl; i++) {
update[i] = head;
}
level = lvl;
}
node = create_node(lvl, value);
for (int i = 0; i <= lvl; i++) {
node->next[i] = update[i]->next[i];
update[i]->next[i] = node;
}
}
}
bool erase(const T& value)
{
Node* node = head;
Node* update[max_level + 1];
memset(update, 0, sizeof(Node*)*(max_level + 1));
for (int i = level; i >= 0; i--)
{
while (node->next[i] && node->next[i]->value < value)
node = node->next[i];
update[i] = node;
}
node = node->next[0];
if (node->value == value)
{
for (int i = 0; i <= level; i++) {
if (update[i]->next[i] != node)
break;
update[i]->next[i] = node->next[i];
}
destroy_node(node);
while (level > 0 && !head->next[level])
--level;
return true;
}
return false;
}
private:
struct Node
{
T value;
struct Node** next;
};
Node* create_node(int level, const T& new_value)
{
void* node_mem = malloc(sizeof(Node));
Node* new_node = static_cast<Node*>(node_mem);
new (&new_node->value) T(new_value);
void* next_mem = calloc(level+1, sizeof(Node*));
new_node->next = static_cast<Node**>(next_mem);
return new_node;
}
void destroy_node(Node* node)
{
node->value.~T();
free(node->next);
free(node);
}
Node* head;
int level;
};
template <class T>
bool contains(const std::set<T>& cont, const T& val)
{
return cont.find(val) != cont.end();
}
template <class T>
bool contains(const SkipSet<T>& cont, const T& val)
{
return cont.contains(val);
}
template <class Set, class T>
void benchmark(int num, const T* elements, const T* search_elements)
{
const double start_insert = sys_time();
Set element_set;
for (int j=0; j < num; ++j)
element_set.insert(elements[j]);
cout << "-- Inserted " << num << " elements in " << (sys_time() - start_insert) << " secs" << endl;
const double start_search = sys_time();
int num_found = 0;
for (int j=0; j < num; ++j)
{
if (contains(element_set, search_elements[j]))
++num_found;
}
cout << "-- Found " << num_found << " elements in " << (sys_time() - start_search) << " secs" << endl;
const double start_erase = sys_time();
int num_erased = 0;
for (int j=0; j < num; ++j)
{
if (element_set.erase(search_elements[j]))
++num_erased;
}
cout << "-- Erased " << num_found << " elements in " << (sys_time() - start_erase) << " secs" << endl;
}
int main()
{
const int num_elements = 200000;
vector<int> elements(num_elements);
for (int j=0; j < num_elements; ++j)
elements[j] = j;
random_shuffle(elements.begin(), elements.end());
vector<int> search_elements = elements;
random_shuffle(search_elements.begin(), search_elements.end());
typedef std::set<int> Set1;
typedef SkipSet<int> Set2;
for (int j=0; j < 3; ++j)
{
cout << "std::set" << endl;
benchmark<Set1>(num_elements, &elements[0], &search_elements[0]);
cout << endl;
cout << "SkipSet" << endl;
benchmark<Set2>(num_elements, &elements[0], &search_elements[0]);
cout << endl;
}
}
On GCC 5.2, -O2, I get this:
std::set
-- Inserted 200000 elements in 0.104869 secs
-- Found 200000 elements in 0.078351 secs
-- Erased 200000 elements in 0.098208 secs
SkipSet
-- Inserted 200000 elements in 0.188765 secs
-- Found 200000 elements in 0.160895 secs
-- Erased 200000 elements in 0.162444 secs
... which is pretty awful. We're around twice as slow all across the board.
Yet there is a glaring optimization we can make. If we look at Node
, its current fields look like this:
struct Node
{
T value;
struct Node** next;
};
This implies that the memory for the Node fields and its list of next pointers are two separate blocks, possibly with a very distant stride between them like so:
[Node fields]-------------------->[next0,next1,...,null]
This does poorly for locality of reference. If we want to improve things here, we should merge these memory blocks into a single contiguous structure, like so:
[Node fields,next0,next1,...,null]
We can achieve this through the variable-length struct idiom common in C. It's a little bit awkward to implement in C++ which doesn't support it so directly, but we can emulate the effect like so:
struct Node
{
T value;
struct Node* next[1];
};
Node* create_node(int level, const T& new_value)
{
void* node_mem = malloc(sizeof(Node) + level * sizeof(Node*));
Node* new_node = static_cast<Node*>(node_mem);
new (&new_node->value) T(new_value);
for (int j=0; j < level+1; ++j)
new_node->next[j] = 0;
return new_node;
}
void destroy_node(Node* node)
{
node->value.~T();
free(node);
}
With this modest tweak, we now have these times:
SkipSet (Before)
-- Inserted 200000 elements in 0.188765 secs
-- Found 200000 elements in 0.160895 secs
-- Erased 200000 elements in 0.162444 secs
SkipSet (After)
-- Inserted 200000 elements in 0.132322 secs
-- Found 200000 elements in 0.127989 secs
-- Erased 200000 elements in 0.130889 secs
... which gets us considerably closer to rivaling the performance of std::set
.
A truly efficient skip list implementation will generally want a very fast RNG. Nevertheless, I found during a quick profiling session that only a very teeny portion of time is spent generating a random level/height, hardly enough to consider it much of a hotspot. It would also only impact the insertion times unless it provided a more uniform distribution, so I've decided to skip this optimization.
At this point, I'd say we have a pretty reasonable overview of what we can expect with a skip list implementation vs. a BST:
Insertion
-- std::set: 0.104869 secs
-- SkipList: 0.132322 secs
Search:
-- std::set: 0.078351 secs
-- SkipList: 0.127989 secs
Removal:
-- std::set: 0.098208 secs
-- SkipList: 0.130889 secs
However, if we want to soldier on a little further, we can utilize a fixed allocator. At this point, we're cheating a little bit as std::set
is designed to work with any general-purpose allocator which conforms to the interface requirements of a standard allocator. But let's have a look at using a fixed allocator:
#include <iostream>
#include <iomanip>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <vector>
#include <cassert>
#include <set>
using namespace std;
static const int max_level = 32;
class FixedAlloc
{
public:
FixedAlloc(): root_block(0), free_element(0), type_size(0), block_size(0), block_num(0)
{
}
FixedAlloc(int itype_size, int iblock_size): root_block(0), free_element(0), type_size(0), block_size(0), block_num(0)
{
init(itype_size, iblock_size);
}
~FixedAlloc()
{
purge();
}
void init(int new_type_size, int new_block_size)
{
purge();
block_size = max(new_block_size, type_size);
type_size = max(new_type_size, static_cast<int>(sizeof(FreeElement)));
block_num = block_size / type_size;
}
void purge()
{
while (root_block)
{
Block* block = root_block;
root_block = root_block->next;
free(block);
}
free_element = 0;
}
void* allocate()
{
assert(type_size > 0);
if (free_element)
{
void* mem = free_element;
free_element = free_element->next_element;
return mem;
}
// Create new block.
void* new_block_mem = malloc(sizeof(Block) - 1 + type_size * block_num);
Block* new_block = static_cast<Block*>(new_block_mem);
new_block->next = root_block;
root_block = new_block;
// Push all but one of the new block's elements to the free pool.
char* mem = new_block->mem;
for (int j=1; j < block_num; ++j)
{
FreeElement* element = reinterpret_cast<FreeElement*>(mem + j * type_size);
element->next_element = free_element;
free_element = element;
}
return mem;
}
void deallocate(void* mem)
{
FreeElement* element = static_cast<FreeElement*>(mem);
element->next_element = free_element;
free_element = element;
}
void swap(FixedAlloc& other)
{
std::swap(free_element, other.free_element);
std::swap(root_block, other.root_block);
std::swap(type_size, other.type_size);
std::swap(block_size, other.block_size);
std::swap(block_num, other.block_num);
}
private:
struct Block
{
Block* next;
char mem[1];
};
struct FreeElement
{
struct FreeElement* next_element;
};
// Disable copying.
FixedAlloc(const FixedAlloc&);
FixedAlloc& operator=(const FixedAlloc&);
struct Block* root_block;
struct FreeElement* free_element;
int type_size;
int block_size;
int block_num;
};
static double sys_time()
{
return static_cast<double>(clock()) / CLOCKS_PER_SEC;
}
static int random_level()
{
int lvl = 1;
while (rand()%2 == 0 && lvl < max_level)
++lvl;
return lvl;
}
template <class T>
class SkipSet
{
public:
SkipSet(): head(0)
{
for (int j=0; j < max_level; ++j)
allocs[j].init(sizeof(Node) + (j+1)*sizeof(Node*), 4096);
head = create_node(max_level, T());
level = 0;
}
~SkipSet()
{
while (head)
{
Node* to_destroy = head;
head = head->next[0];
destroy_node(to_destroy);
}
}
bool contains(const T& value) const
{
const Node* node = head;
for (int i=level; i >= 0; --i)
{
while (node->next[i] && node->next[i]->value < value)
node = node->next[i];
}
node = node->next[0];
return node && node->value == value;
}
void insert(const T& value)
{
Node* node = head;
Node* update[max_level + 1] = {0};
for (int i=level; i >= 0; --i)
{
while (node->next[i] && node->next[i]->value < value)
node = node->next[i];
update[i] = node;
}
node = node->next[0];
if (!node || node->value != value)
{
const int lvl = random_level();
assert(lvl >= 0);
if (lvl > level)
{
for (int i = level + 1; i <= lvl; ++i)
update[i] = head;
level = lvl;
}
node = create_node(lvl, value);
for (int i = 0; i <= lvl; ++i)
{
node->next[i] = update[i]->next[i];
update[i]->next[i] = node;
}
}
}
bool erase(const T& value)
{
Node* node = head;
Node* update[max_level + 1] = {0};
for (int i=level; i >= 0; --i)
{
while (node->next[i] && node->next[i]->value < value)
node = node->next[i];
update[i] = node;
}
node = node->next[0];
if (node->value == value)
{
for (int i=0; i <= level; ++i) {
if (update[i]->next[i] != node)
break;
update[i]->next[i] = node->next[i];
}
destroy_node(node);
while (level > 0 && !head->next[level])
--level;
return true;
}
return false;
}
void swap(SkipSet<T>& other)
{
for (int j=0; j < max_level; ++j)
allocs[j].swap(other.allocs[j]);
std::swap(head, other.head);
std::swap(level, other.level);
}
private:
struct Node
{
T value;
int num;
struct Node* next[1];
};
Node* create_node(int level, const T& new_value)
{
void* node_mem = allocs[level-1].allocate();
Node* new_node = static_cast<Node*>(node_mem);
new (&new_node->value) T(new_value);
new_node->num = level;
for (int j=0; j < level+1; ++j)
new_node->next[j] = 0;
return new_node;
}
void destroy_node(Node* node)
{
node->value.~T();
allocs[node->num-1].deallocate(node);
}
FixedAlloc allocs[max_level];
Node* head;
int level;
};
template <class T>
bool contains(const std::set<T>& cont, const T& val)
{
return cont.find(val) != cont.end();
}
template <class T>
bool contains(const SkipSet<T>& cont, const T& val)
{
return cont.contains(val);
}
template <class Set, class T>
void benchmark(int num, const T* elements, const T* search_elements)
{
const double start_insert = sys_time();
Set element_set;
for (int j=0; j < num; ++j)
element_set.insert(elements[j]);
cout << "-- Inserted " << num << " elements in " << (sys_time() - start_insert) << " secs" << endl;
const double start_search = sys_time();
int num_found = 0;
for (int j=0; j < num; ++j)
{
if (contains(element_set, search_elements[j]))
++num_found;
}
cout << "-- Found " << num_found << " elements in " << (sys_time() - start_search) << " secs" << endl;
const double start_erase = sys_time();
int num_erased = 0;
for (int j=0; j < num; ++j)
{
if (element_set.erase(search_elements[j]))
++num_erased;
}
cout << "-- Erased " << num_found << " elements in " << (sys_time() - start_erase) << " secs" << endl;
}
int main()
{
const int num_elements = 200000;
vector<int> elements(num_elements);
for (int j=0; j < num_elements; ++j)
elements[j] = j;
random_shuffle(elements.begin(), elements.end());
vector<int> search_elements = elements;
random_shuffle(search_elements.begin(), search_elements.end());
typedef std::set<int> Set1;
typedef SkipSet<int> Set2;
cout << fixed << setprecision(3);
for (int j=0; j < 2; ++j)
{
cout << "std::set" << endl;
benchmark<Set1>(num_elements, &elements[0], &search_elements[0]);
cout << endl;
cout << "SkipSet" << endl;
benchmark<Set2>(num_elements, &elements[0], &search_elements[0]);
cout << endl;
}
}
* I also made a minor tweak to random_level
to make it simply assume a probability of 50% after discovering that this seems to work quite well.
By using a fixed allocator, we can quickly allocate and deallocate elements using a very simple constant-time strategy, and more importantly, allocate nodes in a way such that nodes with the same height (the most frequently accessed together) are allocated more contiguously with respect to each other (though not in an ideal sequential order). This improves the times to:
Insertion
-- std::set: 0.104869 secs
-- SkipList: 0.103632 secs
Search:
-- std::set: 0.078351 secs
-- SkipList: 0.089910 secs
Removal:
-- std::set: 0.098208 secs
-- SkipList: 0.089224 secs
... which isn't bad for about 40 minutes of work against std::set
which has been poked and prodded and tuned by experts all over the industry. We also have faster removals than std::set
(insertion times fluctuate a bit on my machine, I'd say they are roughly on par).
Of course we cheated to apply this last step. Using a custom allocator tilts things in our favor, and also changes the characteristics of the container in a way such that its memory cannot be freed until it is cleared, destroyed, or compacted. It can mark the memory as unused and reclaim it upon subsequent insertions, but it cannot make the memory it uses available for those outside the skip list to use. I also didn't bother to pay attention to proper alignment for all possible types of T
which would be necessary to truly generalize it.
Let's try using this against sorted input. To do so, simply comment out the two random_shuffle
statements. On my end, I now get these times:
std::set
-- Inserted 200000 elements in 0.044 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.019 secs
SkipSet
-- Inserted 200000 elements in 0.027 secs
-- Found 200000 elements in 0.023 secs
-- Erased 200000 elements in 0.016 secs
... and now our SkipSet
outperforms std::set
in all areas, though just for this one kind of exceptional case.
So I don't know exactly what this says about skip lists. With hardly any time and effort, we got pretty close to rivaling std::set
*. Yet we didn't beat it, and we had to cheat with a memory allocator to get really close.
* Note that mileage may vary across machines, vendor implementations of std::set
, etc.
Times have changed quite a bit since the paper Pugh wrote in 1989.
Today the benefits of locality of reference dominates things to a point where even a linearithmic complexity algorithm can outperform a linear one provided that the former is considerably more cache or page-friendly. Paying close attention to the way the system grabs chunks of memory from upper levels of the memory hierarchy (ex: secondary stage) with slower but bigger memory and down to the little L1 cache line and teeny register is a bigger deal than ever before, and no longer "micro" if you ask me when the benefits can rival algorithmic improvements.
The skip list is potentially crippled here given the considerably larger size of nodes, and just as importantly, the variable size of nodes (which makes them difficult to allocate very efficiently).
One thing we didn't look at is the localized nature in which a skip list updates on insertion. This tends to impact much fewer areas than a balancing tree requires to rebalance by rotating parent nodes. As a result, a skip list can offer a potentially more efficient implementation of a concurrent set or map.
Another promising characteristic of a skip list is that it is simply a linked-list a the lowest level. As a result, it offers very simple linear-time sequential traversal without having to descend down the branches of the tree with linearithmic complexity, so it is potentially very good if we want to do a lot of linear-time iterations through all the elements contained.
But always remember:
A technique is only as good as it can be applied in the hands of the implementor.