Bittorrent Client

From COEP Wiki

Bittorrent Tips, Tricks and FAQs.

Overview

Here is an overview of the process involved in creating a torrent client.

  • Parse torrent file.
  • Connect to tracker.
  • Ask for list of peers.
  • Connect to peer.
  • Request a piece from the peer (most peers send bitfield and/or have messages when you connect for the first time; more on this later).
  • Authenticate piece.
  • Save piece to file.

Note: If the piece is large, you are supposed to request it one block at a time (a block is basically a part of the piece). Once you receive all the blocks of a piece, you can authenticate it and save it to file.


Parsing torrent files

Note: Some sources refer to the torrent file as the "Metainfo File". They are the same thing.

All torrent files are bencoded. Read more about bencoding here. A Python module to help with this is bencode.py. If you are doing your project in JS, then this npm library will come in handy for parsing bencoded files.
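If you would rather see what a bencode library does under the hood, here is a minimal decoder sketch. It is only an illustration, not the code of bencode.py or of any other library; the names bdecode and _decode_at are made up for this example, and it returns strings and dictionary keys as bytes.

```python
def bdecode(data: bytes):
    """Decode one complete bencoded value into Python objects."""
    value, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


def _decode_at(data: bytes, i: int):
    """Decode the value starting at index i; return (value, next_index)."""
    lead = data[i:i + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = _decode_at(data, i)
            items.append(item)
        return items, i + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = _decode_at(data, i)
            value, i = _decode_at(data, i)
            result[key] = value
        return result, i + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", i)
        length = int(data[i:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {i}")
```

With a decoder like this, the keys of interest would be looked up as b'info', b'announce' and b'announce-list'.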

You need to get all the important information like list of trackers, file names, file paths, file sizes, piece length. So, the keys of interest are 'info', 'announce' and 'announce-list'. While the 'announce' key is a mandatory part of the format of the torrent file, the 'announce-list' is not part of the official specification, but an optional extension. Even so, I would recommend checking for its presence and including those additional tracker URLs in the tracker list also. Since some trackers turn out to be unresponsive, a list of alternatives comes in handy.

  • Info Dictionary structure

This is a dictionary that we use to understand the structure of the actual file that the torrent file is created for. The format is different for single-file torrents versus multi-file torrents, as described [[1]]. It is useful to understand how multi-file torrents are built when using the information gathered from the info dictionary of a multi-file torrent.

As described [[2]], the value of the 'path' key in each of the dictionaries in the 'files' list in the Info dictionary of a multi-file torrent is a bencoded list of strings specifying the path and file name for that file. The list does not include the name of the top level directory being shared; the paths start at the directories/files within it.
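As a rough sketch of how the 'path' lists can be turned into real paths on disk, the helper below joins the top level directory name with each path list. It assumes the info dictionary was decoded with bytes keys and bytes strings (as many decoders do); file_entries and download_dir are hypothetical names chosen for this example.

```python
import os


def file_entries(info: dict, download_dir: str = "."):
    """Return a list of (path, length) for every file described by `info`."""
    name = info[b"name"].decode("utf-8")
    if b"files" not in info:
        # Single-file torrent: 'name' is the file name itself.
        return [(os.path.join(download_dir, name), info[b"length"])]
    entries = []
    for entry in info[b"files"]:
        # Multi-file torrent: 'name' is the top level directory and each
        # 'path' list starts inside that directory.
        parts = [p.decode("utf-8") for p in entry[b"path"]]
        path = os.path.join(download_dir, name, *parts)
        entries.append((path, entry[b"length"]))
    return entries
```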

Connecting to trackers

There is a well defined process to connect to a tracker, and it is fairly simple. Here is a link that tells you all about UDP trackers (which are most commonly used) and another one for HTTP/HTTPS.

You are mainly interested in connect and announce requests, so you can skip the scrape request section.

Tracker gives a list of peers we can connect to. This list of peers can be either compact or non-compact. Compact version gives ip and port number of the peer in (4 + 2) bytes. Hence, for a list of n peers, there will be 6n bytes in the tracker response for peer list. If it gives a non-compact response, the peer id will also be included. The peer id given by the tracker is usually a character string of 20 byte length, like '-AB0707-0a1kf7aDTbLO'. However, I found in my testing that occasionally, the peer id is a UTF-8 encoded bytes object. Both cases need to be taken into account while authenticating the handshake later, as a type mismatch will cause a problem. For compact responses, the peer id of the peer can be obtained by slicing the handshake response of the peer.
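To make the (4 + 2) byte layout concrete, here is a small sketch that splits a compact peer list into (ip, port) pairs. The function name parse_compact_peers is just an illustrative choice, and it handles IPv4 entries only.

```python
import socket
import struct


def parse_compact_peers(blob: bytes):
    """Split a compact peer list into (ip, port) tuples, 6 bytes per peer."""
    if len(blob) % 6 != 0:
        raise ValueError("compact peer list length must be a multiple of 6")
    peers = []
    for offset in range(0, len(blob), 6):
        # 4 bytes of IPv4 address followed by a 2-byte big-endian port.
        ip_raw, port = struct.unpack_from(">4sH", blob, offset)
        peers.append((socket.inet_ntoa(ip_raw), port))
    return peers
```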

Important: In my testing, there were many cases where trackers did not respond (could be due to any reason), so do not fret over it. Try to connect to all the trackers given in the file and you will get one that responds, assuming your request format is correct.

Also, it is important to note that several fields in the announce request are dynamic, such as left, and keep changing while you are downloading. Make sure to update them in subsequent tracker announce requests! I figured this out quite late.
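Here is a sketch of how the connect and announce requests for a UDP tracker can be packed with struct. The field layout follows the UDP tracker protocol as I understand it, so double check it against the page linked above; the function names and default values are my own choices for this example. Note that downloaded, left and uploaded are plain arguments, so they can be refreshed on every announce.

```python
import struct

PROTOCOL_ID = 0x41727101980   # fixed magic constant for connect requests
ACTION_CONNECT = 0
ACTION_ANNOUNCE = 1


def build_connect_request(transaction_id: int) -> bytes:
    """16-byte connect request: protocol id, action, transaction id."""
    return struct.pack(">QII", PROTOCOL_ID, ACTION_CONNECT, transaction_id)


def build_announce_request(connection_id, transaction_id, info_hash, peer_id,
                           downloaded, left, uploaded,
                           event=0, key=0, num_want=-1, port=6881) -> bytes:
    """98-byte announce request; call again with updated counters later."""
    return struct.pack(
        ">QII20s20sQQQIIIiH",
        connection_id, ACTION_ANNOUNCE, transaction_id,
        info_hash, peer_id,
        downloaded, left, uploaded,   # the dynamic fields
        event,                        # 0 none, 1 completed, 2 started, 3 stopped
        0,                            # IP address, 0 lets the tracker decide
        key, num_want, port,
    )
```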

Peer States

According to the Peer Wire Message (PWM) Protocol, every peer maintains state that facilitates the exchange of pieces with other peers. The BitTorrent client maintains the states given below:

  • am_choking: this client is choking the peer
  • am_interested: this client is interested in the peer
  • peer_choking: peer is choking this client
  • peer_interested: peer is interested in this client

A block or piece can be downloaded by the client only if the client is interested and the peer is not choking the client. Similarly, a block or piece can be uploaded by the client only if the peer is interested in the client and the client is not choking the peer. Note that when the client is choked by the peer, the client's requests to the peer will be rejected and the client needs to wait until the peer sends an unchoke message.
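The four flags and the two rules above can be kept in a tiny class like the sketch below. The class and method names are made up for this example. The initial values assume that every connection starts out choked and not interested on both sides, which is what the protocol specifies.

```python
class PeerState:
    """Per-connection state flags as seen from this client."""

    def __init__(self):
        self.am_choking = True        # this client is choking the peer
        self.am_interested = False    # this client is interested in the peer
        self.peer_choking = True      # peer is choking this client
        self.peer_interested = False  # peer is interested in this client

    def can_download(self) -> bool:
        # We may request blocks only if we are interested and not choked.
        return self.am_interested and not self.peer_choking

    def can_upload(self) -> bool:
        # We may serve blocks only if the peer is interested and not choked.
        return self.peer_interested and not self.am_choking
```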

Important: In my testing, I sometimes received a choke message after receiving certain pieces, and even right after the initial handshake, since the BitTorrent protocol works on a tit-for-tat strategy. Peers also sometimes send keep-alive messages, which indicate that the connection should be kept open for some more time. To handle all such cases, try developing a Finite State Machine (FSM). By interpreting the client states, we can design an FSM for the client to download or upload. The downloading FSM given here can give you an idea, however it is not complete enough to handle all cases. The approach I found useful was to first read the length prefix (the first 4 bytes, an integer packed in network byte order) and then receive only that many more bytes. The first of those bytes, i.e. the 5th byte of the message, is the message id that identifies the message type (the exception is keep-alive, which has length 0 and no id). That lets you avoid the mess of finding the end of a received message in your buffer and of handling messages broken into parts. Refer to this [[3]] for the unique format of every type of message.
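The length-prefix approach described above can be sketched as a function that pulls every complete message out of a receive buffer and hands back whatever is left over for the next recv call. split_messages is a hypothetical name, and keep-alives are reported with a message id of None.

```python
import struct


def split_messages(buffer: bytes):
    """Return (messages, remainder), where messages is a list of
    (message_id, payload) tuples and remainder is the incomplete tail."""
    messages = []
    pos = 0
    while len(buffer) - pos >= 4:
        # First 4 bytes: big-endian length of everything that follows.
        (length,) = struct.unpack_from(">I", buffer, pos)
        if len(buffer) - pos - 4 < length:
            break                       # message not fully received yet
        body = buffer[pos + 4:pos + 4 + length]
        if length == 0:
            messages.append((None, b""))            # keep-alive has no id
        else:
            messages.append((body[0], body[1:]))    # 5th byte is the id
        pos += 4 + length
    return messages, buffer[pos:]
```

The idea is to append every chunk you receive to the remainder from the previous call and run the function again.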

Connecting to peers

Now we are entering the main part of the protocol. Once you have the list of peers from the tracker, you have to open a TCP connection with each of them. After you have done this, you will enter the handshake stage.

Handshake

The format for the handshake can be seen here. Please follow the conventions for the peer id; I did not do this and faced a lot of problems. Just like trackers, peers may also not respond to handshakes. This is fairly common, and all one can do is keep trying.

In addition to authenticating the infohash in the reply handshake, we also need to verify that the peer id received in the reply handshake is the same as the peer id given by the tracker for that IP and port. Note the possible types for the peer id as pointed out in this section.

A successful handshake response usually (According to my testing, 8/10 times) includes bitfield or have messages at the end of the handshake.
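A minimal sketch of building and checking the 68-byte handshake is given below. The function names are invented for this example. parse_handshake also returns any bytes that arrived after the handshake, since a bitfield or have messages are often stuck to the end of it.

```python
PSTR = b"BitTorrent protocol"


def build_handshake(info_hash: bytes, peer_id: bytes) -> bytes:
    """<pstrlen><pstr><8 reserved bytes><info_hash><peer_id>, 68 bytes."""
    return bytes([len(PSTR)]) + PSTR + bytes(8) + info_hash + peer_id


def parse_handshake(data: bytes, expected_info_hash: bytes):
    """Validate a reply handshake; return (peer_id, leftover_bytes)."""
    if len(data) < 68:
        raise ValueError("handshake is shorter than 68 bytes")
    if data[0] != len(PSTR) or data[1:20] != PSTR:
        raise ValueError("unexpected protocol string")
    if data[28:48] != expected_info_hash:
        raise ValueError("info hash mismatch")
    # Bytes 48..67 are the peer id; anything after that is the start of
    # the next message(s), e.g. a bitfield.
    return data[48:68], data[68:]
```

The returned peer id is a bytes object, so if the tracker gave you the peer id as a string, convert one of the two before comparing them.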

Bitfield

Peers which include a bitfield are being very helpful to us, as they are giving us all the information we need very conveniently. However, there is a huge disparity between peers sending bitfield and have messages.

According to this website, bitfields are sent immediately after the handshake, but I have noticed something entirely different in my testing. The peers were sending the bitfield as one of the below methods (this may not be an exhaustive list!):

  • Bitfield length, Bitfield id and bitfield all appended to the end of the handshake.
  • Bitfield length, Bitfield id appended at end of handshake and bitfield sent in the next message.
  • All 3 sent in the message right after the handshake.
  • One of the above, followed by a variable amount of have messages appended to the message.

A reasonable sequence of messages to expect would be:

  1. P1 (the initiator wishing to download) sends a handshake to P2.
  2. P2 sends a reply handshake.
  3. P2 sends a bitfield and/or have(s).
  4. P1 sends an interested message.
  5. P2 sends unchoke.
  6. P1 sends requests for one block at a time until choked by P2 or until the download is complete.
  7. P2 sends the requested blocks to P1 until it decides to choke P1, or until requests stop, or until P1 sends a 'not interested' message.

Note that this is not necessarily how every exchange would go, but simply a typical ideal scenario.

An important thing to note is that the bitfield length in bits is always a multiple of 8, since it is sent as whole bytes. Hence, if the number of pieces is 5 and the peer reports all the pieces it has in the bitfield, then the last 3 bits (8 - 5) are extra. These extra bits have to be zero because these pieces do not exist. If any of these extra bits is 1, the peer has sent an invalid bitfield.
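The spare-bit check can be sketched as follows. parse_bitfield is a hypothetical helper that takes the bitfield payload (without the length prefix and message id) and the number of pieces from the torrent file; the high bit of the first byte stands for piece 0.

```python
def parse_bitfield(payload: bytes, num_pieces: int):
    """Return a list of booleans, one per piece, or raise on a bad bitfield."""
    expected_len = (num_pieces + 7) // 8      # round up to whole bytes
    if len(payload) != expected_len:
        raise ValueError("bitfield has the wrong length")
    have = [
        bool(payload[i // 8] & (0x80 >> (i % 8)))   # most significant bit first
        for i in range(num_pieces)
    ]
    spare = expected_len * 8 - num_pieces
    # The spare bits sit in the low end of the last byte and must all be 0.
    if spare and payload[-1] & ((1 << spare) - 1):
        raise ValueError("spare bits set in bitfield")
    return have
```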

To observe this by yourself, run wireshark when you have reached this stage and analyse the responses of peers to your handshake.

Interested

Once you have completed the handshake, send an interested message, and usually (but not necessarily) the peer will unchoke you.

Piece Request

You have made it to the last part of the downloading process. You have to now request a piece from the peer. This topic is disputed, as observable on this page.

The format of the request is fixed; the dispute is about the acceptable block size for a request. Note that the piece length of a torrent can vary. A torrent file usually contains at most 1500 pieces, which means that for larger torrents, piece sizes can be pretty big. Thus, pieces are further broken down into blocks. When you are sending a "piece request", you are not exactly requesting a piece, rather you are requesting a block. The block size can be less than or equal to the piece size.

Since the page is not very clear about what size to follow, you can experiment by yourself and find out. However, the block size that worked for me in almost all cases is 2^14 bytes, which is 16384 bytes. This size was also recommended by this source.

In some cases, if the file length is not divisible by the piece length, the length of the last piece may be different from the rest. Including a check for the last piece is useful.

Since you are requesting a block that can be quite small compared to the piece length, a piece may consist of multiple blocks. Keep in mind that when you request the last block of a piece, its size may not be 16384 bytes, but smaller. This depends on whether your piece length is divisible by 16384.

Applying the same logic as above, the last piece also may not be the same size as the other pieces, so it is essential you address this in your program.
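Both special cases (a shorter last piece and a shorter last block) can be handled with a little arithmetic, as in the sketch below. The helper names are made up for this example. The request message is packed as length prefix 13, message id 6, then piece index, begin offset and block length.

```python
import struct

BLOCK_SIZE = 2 ** 14   # 16384 bytes


def piece_size(index: int, piece_length: int, total_length: int) -> int:
    """Size of piece `index`; only the last piece can be shorter."""
    num_pieces = (total_length + piece_length - 1) // piece_length
    if index == num_pieces - 1:
        return total_length - index * piece_length
    return piece_length


def block_requests(index: int, piece_length: int, total_length: int):
    """Return the request messages needed to fetch one whole piece."""
    size = piece_size(index, piece_length, total_length)
    messages = []
    for begin in range(0, size, BLOCK_SIZE):
        length = min(BLOCK_SIZE, size - begin)   # last block may be shorter
        messages.append(struct.pack(">IBIII", 13, 6, index, begin, length))
    return messages
```

For example, with a total length of 100000 bytes and a piece length of 40000, the last piece (index 2) is 20000 bytes and is fetched as one block of 16384 bytes and one of 3616 bytes.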

Saving to File

Once you have all the blocks of a piece, it is important to save that piece to the file. You cannot keep all the pieces of a file unsaved, as that would eat up a huge amount of your RAM.

It is recommended that you test out the entire process with a single peer first, and once this is done, you can use multithreading to connect to multiple peers at the same time. For this section, I will assume you have included multithreading in your program.

Now that you have the piece, you can calculate its SHA1 hash. Calculating the hash can be tricky. For each block request, the response starts with a 13-byte header (4 bytes of length prefix, 1 byte of message id, 4 bytes of piece index and 4 bytes of block offset). Check that the received block offset is the same as the requested block offset. If they match, exclude the first 13 bytes, as that is the header of the block response. The part from index 13 onwards is the actual payload. Append the payloads of all such block responses in order. Then, calculate the SHA1 hash of all these appended bytes.

Once you have done this, compare it to the hash value in your torrent file. The hash values of all pieces are saved as one long byte string in the file, so you need to go to the position of the piece that you currently have. The hash of a piece starts at byte offset 20 * piece index, as SHA1 hashes are 20 bytes long. If the hash values match, you can save this piece, or else you need to discard it.
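The two steps above can be sketched like this. block_payload and piece_is_valid are hypothetical names; pieces_field stands for the raw value of the 'pieces' key in the info dictionary, and message is one complete block response including its length prefix.

```python
import hashlib
import struct


def block_payload(message: bytes, expected_index: int, expected_begin: int) -> bytes:
    """Strip the 13-byte header from a block response after checking it."""
    _length, msg_id, index, begin = struct.unpack_from(">IBII", message, 0)
    if msg_id != 7:
        raise ValueError("not a piece message")
    if index != expected_index or begin != expected_begin:
        raise ValueError("block does not match the request")
    return message[13:]


def piece_is_valid(index: int, piece_data: bytes, pieces_field: bytes) -> bool:
    """Compare the SHA1 of the assembled piece with the torrent's hash."""
    expected = pieces_field[20 * index:20 * index + 20]
    return hashlib.sha1(piece_data).digest() == expected
```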

Now you need to think about how you will maintain the accuracy of the file. As the program is multithreaded, you can have multiple threads attempting to access this file. For this purpose, you may consider using locks.

Once you have acquired the lock, open the file in binary mode, seek to the part of the file you need (piece index * piece length) and write the piece. Voilà, you have successfully downloaded a piece of the file.
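A minimal sketch of the locked write for a single-file torrent is shown below. It assumes one lock shared by all threads and that the output file is created at its full size before the first piece arrives; preallocate and write_piece are names invented for this example.

```python
import threading

file_lock = threading.Lock()   # shared by every downloader thread


def preallocate(path: str, total_length: int) -> None:
    """Create the output file at its final size so any offset can be written."""
    with open(path, "wb") as f:
        if total_length > 0:
            f.seek(total_length - 1)
            f.write(b"\x00")


def write_piece(path: str, index: int, piece_length: int, piece_data: bytes) -> None:
    """Write one verified piece at offset index * piece_length."""
    with file_lock:
        # "r+b" opens the existing file for update without truncating it.
        with open(path, "r+b") as f:
            f.seek(index * piece_length)
            f.write(piece_data)
```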

  • Working with multi file torrents

Working with multi-file torrents can be a bit tricky. When the torrent file is built, the bytes of all the files are appended one after another, and only then are the pieces constructed and the hashes calculated. So a piece can be spread across multiple files. In that case, you have to be very careful to save the correct number of bytes to the correct files. This sounds trivial but can be tricky when the files are only a few bytes long and a single piece covers many files, because then you have to write to many files for a single piece.
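One way to keep this manageable is to first compute which byte ranges of which files a piece covers, and only then do the writes. The sketch below does just the computation; split_piece_across_files is a hypothetical helper, file_lengths is the list of 'length' values in the order they appear in the 'files' list, and piece_offset is piece index * piece length.

```python
def split_piece_across_files(piece_offset: int, piece_size: int, file_lengths):
    """Return a list of (file_index, offset_in_file, length) segments."""
    segments = []
    file_start = 0           # offset of the current file in the byte stream
    pos = piece_offset       # next byte of the piece to place
    remaining = piece_size
    for i, file_length in enumerate(file_lengths):
        file_end = file_start + file_length
        if remaining > 0 and pos < file_end:
            take = min(remaining, file_end - pos)
            segments.append((i, pos - file_start, take))
            pos += take
            remaining -= take
        file_start = file_end
    return segments
```

For files of 10, 4 and 30 bytes and a piece length of 16, the first piece covers all of the first two files plus 2 bytes of the third, while the second piece lies entirely inside the third file starting at offset 2.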

Making Torrent Files

Before making a torrent file, the structure of the file should be clear. There is a good reading resource here which explains the file structure of torrents quite well.

A torrent file is a bencoded dictionary. The dictionary has fields like info, announce, announce-list, creation date, created by etc. Beginning with announce, this field contains the URL of the primary tracker to which peers connect. The announce-list is the list of alternate trackers. Info is the most important key. It contains information about the various files: their names and sizes, the piece length, the pieces etc.

For a single-file torrent, the name field is the name of the file itself, while for multi-file torrents it is the name of the directory being shared. Similarly, the length field is only present in single-file torrents and gives the length of the file in bytes. For multi-file torrents, the files field has a list of files. Each entry of the list is in turn a dictionary with 3 fields: 'length', 'md5sum' (you can ignore this one) and 'path'.

  • Making pieces

Prior to this step, you have to decide the piece length. 256 KiB is commonly used, but there is a lot of controversy on this point. The piece length should be chosen so that the torrent file does not become too large (small pieces mean many hashes), while the pieces themselves do not become so large that each one takes a lot of blocks to download.

After deciding the piece length, if it is a single-file torrent, just read the file and divide it into segments, each of size piece length. The last piece may be smaller than the piece length. Now calculate the SHA1 hash of each piece, which is a 160-bit (20-byte) value. The hashes are appended one after another to make up the pieces field of info.

In the case of multiple files, just read and append the files one after another to make up a single large byte sequence and perform the same actions as for a single file. The total length in bytes is the sum of the values of the 'length' key for every file. It's interesting to note that the entire structure of the folder (for multi-file torrents) is depicted in the files list of the info value.

Note: For multi-file torrents, the complete byte sequence is treated as if it were a single file and pieces are made from the entire byte sequence AFTER the concatenation. So, piece boundaries are not at all sensitive to the original file boundaries and pieces may stretch over multiple files.
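Here is a sketch of building the pieces field this way without loading everything into memory. make_pieces is a made-up name; file_paths must be in the same order as the entries of the 'files' list (or contain just one path for a single-file torrent).

```python
import hashlib


def make_pieces(file_paths, piece_length: int) -> bytes:
    """Return the concatenated 20-byte SHA1 hashes of all pieces."""
    hashes = []
    buffer = b""
    for path in file_paths:
        with open(path, "rb") as f:
            while True:
                # Only read what is still missing from the current piece, so
                # a piece can continue seamlessly into the next file.
                chunk = f.read(piece_length - len(buffer))
                if not chunk:
                    break
                buffer += chunk
                if len(buffer) == piece_length:
                    hashes.append(hashlib.sha1(buffer).digest())
                    buffer = b""
    if buffer:                       # the last piece may be shorter
        hashes.append(hashlib.sha1(buffer).digest())
    return b"".join(hashes)
```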

Important Point: All strings must be UTF-8 encoded, except for pieces, which contains binary data.

Seeding across NAT

Since your client is deep inside a NAT, it is not possible for peers to connect to you directly. For this, we would need a STUN server to relay the traffic from our local machine to the outside world. There is a workaround for doing this without implementing a STUN server: TCP port forwarding for raw packets and SSH tunneling. You will need a remote machine with a public IP to listen to requests from the outside world on a port, say the external port. Set up NginX on the remote machine and edit the nginx.conf file in the /etc/nginx/ directory. Add the following lines at the top of the file, below the include statements:

stream {
    upstream backend {
        server 127.0.0.1:5000;
    }
    server {
        listen 6887;
        proxy_pass backend;
    }
}

Here 6887 is the port on which the NginX server will be listening for requests from the outside world. This is the port you need to advertise to the tracker in the announce request. Also, advertise the public IP of the remote server as your IP while connecting to the tracker. This will make sure that the peers connect to our server with the public IP, which will in turn relay packets to our client running on our local machine.

And 5000 is an arbitrarily chosen port to which NginX forwards the raw TCP packets from the outside world. Let's call this the relay port.

Now, save and close the file and restart nginx for the changes to take effect.

Now, move on to the local machine where your torrent client is running and run a command in the terminal to start an SSH tunnel between the port on which your torrent application is listening for connections (say 6887) and the relay port. The command goes like this:

ssh -R 5000:localhost:6887 -N username@remote_machine_ip 

Here the username and IP are those of the remote server, to which you must have access and on which an SSH server must be running.

Keep this terminal running. What we have achieved is this: the peers will now connect to you using the public IP of the remote machine and the external port advertised to the tracker. NginX will forward the packets to the relay port, from which the SSH tunnel we established forwards them to our application.

For a detailed guide, head on to this blog post.


Tips and Resources

  • In Python, you will be using struct a lot, so read its documentation carefully.
  • The os.path module [[4]] provides methods that will be useful whenever you need to work with file paths and directory structures.
  • For JavaScript (Node JS), use this for tcp and this for udp connections.
  • This article helped me a lot in my project.
  • Use Wireshark ALL the time. Use the bittorrent filter to see whether your code is working as intended.
  • Keep this page bookmarked. You will need it all the time.
  • Be smart while testing. Use a small torrent with a large number of seeders (at least 300). To do this, head over to your favourite torrent website (like ThePirateBay) and go to the ebooks section. Sort the list according to seeders. You will easily find ebooks which are small and have a large number of seeders. Use an online epub reader to see if your file has downloaded correctly. Note: This may be illegal and I am not encouraging you to do this for anything other than educational purposes.
  • Download any torrent on transmission or any other client first, before trying it with your own client. This will not only confirm it is working, it will also give you a completed file, which you can use for comparing your downloaded file.
  • First try to connect to one peer at a time and download your file completely, then move on to multiple peers. If you are confident, you can skip this.
  • Messages sent as one from the peer may not arrive as one to you, so keep this in mind. Make sure you are looking for this when interacting with any peer. This is especially true for piece requests.
  • Some peers unchoke right after sending the bitfield, so you need to skip the interested message if you want to download from them. Usually if they have unchoked you already, they may not unchoke you again when you send your interested message.
  • To test the client for seeding, I would recommend opening a free account on AWS Educate, getting an EC2 Linux virtual machine and configuring it to receive TCP traffic on specific ports. Start seeding from the virtual machine; however, be careful to use the public IP address of the virtual machine and not its private IP address when connecting to the seeder. The simple reason is that the virtual machine sits behind a gateway and is NATed.
  • From an implementation point of view, you can choose between asynchronous programming and multi-threaded programming. NodeJS can also be a good option. It is single threaded, everything is asynchronous by default, and it is recommended for network oriented tasks.
  • If you are doing your project in JavaScript, be cautious with 64 bit integers, as JS cannot handle them very well. You have to write them carefully in tracker requests. This link helped me a lot in doing so.
  • Since you are behind a NATed network and don't have a public IP, you cannot make your torrent client seed to the external world directly. For this you can explore TCP port forwarding for raw packets and SSH tunnelling by configuring NginX. This just tries to replicate STUN server functionality. A bit tricky but fun to do.
  • The bencodepy library may not directly decode the torrent file for you but there is another library called bcoding which worked for me. Link to the library: https://pypi.org/project/bcoding/
  • I noticed that if you try to connect to too many peers, or establish too many connections using multithreading, the client becomes unstable, so keep the maximum number of peers low.