I am working on Mainline DHT and I don't understand one nuance.
Here: https://www.bittorrent.org/beps/bep_0005.html writes: "A "distance metric" is used to compare two node IDs or a node ID and an info hash for "closeness."
Also writes: "Announce that the peer, controlling the querying node, is downloading a torrent on a port. announce_peer has four arguments: "id" containing the node ID of the querying node, "info_hash" containing the info hash of the torrent, "port" containing the port as an integer, and the "token" received in response to a previous get_peers query."
So, for example, we have a peer with ID 223456789zxcvbnmasdf which IP is 86.23.145.714 and port is: 7853 I know that this peer downloaded 2 torrent files with info hashes: 023456789zxcvbnmasdf and 123456789zxcvbnmasdf.
So how exactly my k-bucket record should look like? Like this:
{"id": "223456789zxcvbnmasdf", "ip": "86.23.145.714", "port": "7853", "torrents": ["023456789zxcvbnmasdg", "123456789zxcvbnmasdh"]} ?
Or should torrent files be like an "equivalent" record (with duplicated ips and ports) in k-buckets along with peer:
{"id": "223456789zxcvbnmasdf", "ip": "86.23.145.714", "port": "7853"},
{"id": "023456789zxcvbnmasdg", "ip": "86.23.145.714", "port": "7853"},
{"id": "123456789zxcvbnmasdh", "ip": "86.23.145.714", "port": "7853"} ?
I am asking because this is not just implementation nuance. Because "k" is normally 20 or some other integer in all clients. So if I would use k-buckets to store torrent files as full-right members, I would have less space to store real peers data.
Thanks for answers!
k-buckets datastructure is an implementation detail of bit-torrent protocol so that peers reply quickly enough to FIND_PEERS
and FIND_VALUE
.
What I do in my kademlia implementation is that keep the the routing table in a python dictionary and I compute the nearest peers under the 5 seconds by default which is the timeout I use to wait for a UDP reply. To achieve that I need to keep the routing table below 1 000 000 entries.
Like I said above, the routing table is a simple python dict
that maps peerid
to (address, port)
.
The routing table stores peers not values ie. not infohash
addresses.
When I receive a FIND_PEERS
message, the program replies with the following code:
async def peers(self, address, uid):
"""Remote procedure that returns peers that are near UID"""
log.debug("[%r] find peers uid=%r from %r", self._uid, uid, address)
# XXX: if this takes more than 5 seconds (see RPCProtocol) it
# will timeout in the other side.
uids = nearest(self._replication, self._peers.keys(), uid)
out = [self._peers[x] for x in uids]
return out
When I receive a FIND_VALUE
message, the program replies with the following code:
async def value(self, address, key):
"""Remote procedure that returns the associated value or peers that
are near KEY"""
log.debug("[%r] value key=%r from %r", self._uid, key, address)
out = await lookup(key)
if out is None:
# value is not found, reply with peers that are near KEY
out = nearest(self._replication, self._peers.keys(), uid)
return (b"PEERS", out)
else:
return (b"VALUE", out)
Here is the definition of nearest
:
def nearest(k, peers, uid):
"""Return K nearest to to UID peers in PEERS according to XOR"""
# XXX: It only works with len(peers) < 10^6 more than that count
# of peers and the time it takes to compute the nearest peers will
# timeout after 5 seconds on the other side. See RPCProtocol and
# Peer.peers.
return nsmallest(k, peers, key=functools.partial(operator.xor, uid))
That is it sorts the peers
according to their peerid
and return the k
smallest. nsmallest
is supposed to be an optimized version of sort(peers, key=functools.partial(operator.xor, uid))[:k]
where uid
is a peerid
or infohash
(respectively FIND_PEERS
and FIND_VALUE
).
Now back at your question(s):
Is hashinfo equivalent to peer ID in Mainline DHT?
hashinfo
is a hash-something that is the same kind of hash as peerid
ie. they are possible keys in the routing table. That is, torrent files are associated with a hash, peers are associated with the same kind of hash called peerid
. And peers have the "ownership" of keys that near their peerid
. BUT hashinfo
are not stored in the routing table or k-buckets if you prefer. hashinfo
are stored in another mapping that associate hashinfo
hash with their value(s).
I am asking because this is not just implementation nuance. Because "k" is normally 20 or some other integer in all clients. So if I would use k-buckets to store torrent files as full-right members, I would have less space to store real peers data.
There is mis-understanding here about the same thing I try explain above. hashinfo
are keys in storage dictionary. Whereas peerid
are keys in the routing table aka. k-buckets datastructure. They both have the same format because that is how the kademlia routing algorithm work. You must be able to compare hashinfo
with peerid
with xor
to be able to tell which peers "own" which hashinfo
value.
As you can see in the second snippet, when a another peer ask for a value associated with an hash, it calls lookup(key)
that is something like storage.get(key)
except in my case the code stores values in a database.
Maybe there is a mis-understanding about the fact that k-buckets are used to store DHT routing information. And that on top of that, torrent protocol use the DHT to store torrent routing information.
For what it is worth, qadom's peer.py file is where I implement a DHT inspired from kademlia (except I use 256 bits hash and forgo alpha
and k
parameters and use a single REPLICATION
parameter). The code works most of the time check the tests.
Also, there is another project I got inspiration from called simply kademlia that (try to?) implement k-buckets.
As far as I understand, torrent DHT routing looks like qadom bag
functions except the receiving peer must authenticate the announcement, whereas in qadom bags are free-for-all.
Also, check the original kademlia paper.