Creating high-performance UDP servers on Windows and Linux

August 29, 2018 Allen Drennan

Little information is available on the Internet about building highly scalable UDP servers, and what does exist often falls short of best practices. UDP servers are the backbone of many video game servers and streaming services, but very few good examples or discussions exist on how to construct them on Windows and Linux. This article covers advanced topics related to UDP servers and assumes the reader already has some understanding of threads, sockets and the available APIs.

Most implementations revolve around the standard socket APIs RecvFrom() and SendTo(), or their Winsock counterparts WsaRecvFrom() and WsaSendTo(). They are relatively easy to understand and there are plenty of examples. RecvFrom() typically receives a datagram on a well-known listening port and gives you the socket address of the sender. SendTo() simply sends a datagram, usually to the socket address that RecvFrom() previously provided. Many UDP server implementations start with these basic APIs and build everything else around them.
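
As a point of reference, the address-based model can be sketched in a few lines. This is a minimal illustration in Python rather than a production server; the function names echo_once() and demo() and the loopback address are assumptions made for the example.

```python
import socket

def echo_once(server):
    """Receive one datagram and send it back to whoever sent it."""
    data, sender = server.recvfrom(2048)   # sender is the client's socket address
    server.sendto(data, sender)            # the reply goes to that same address
    return data, sender

def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    server.settimeout(2.0)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(2.0)
    try:
        client.sendto(b"ping", server.getsockname())
        echo_once(server)
        reply, _ = client.recvfrom(2048)
        return reply
    finally:
        client.close()
        server.close()
```

Every call carries a socket address, which is what makes this model easy to follow and also what forces the per-datagram lookups discussed next.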

If you are using a simple communication model, you might have an event loop in a thread to handle I/O that simply calls RecvFrom() and have multiple threads to handle parallel I/O. This is relatively efficient and from a pure communication-only perspective is the fastest approach, but also introduces issues as you start to build your application logic. The first issue you may encounter is that you may need to keep track of client sessions (or pseudo streams) and route each incoming datagram to the proper session object. This is required by most applications using UDP at some level in the application’s logic, and security libraries such as DTLS (datagram TLS) require you to maintain information about security state for each client session. If you are using socket addresses then you probably would create a hash table to map your socket addresses to your session object. This would require some lock mechanism to maintain integrity. Suddenly you are performing a lot of extra processing for each datagram you receive and performance begins to suffer.
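
The bookkeeping described above might look roughly like the following sketch. The Session and SessionTable classes are hypothetical names introduced for illustration; the point is only that every datagram pays for a lock and a hash lookup.

```python
import threading

class Session:
    """Per-client state, for example DTLS state in a real server."""
    def __init__(self, addr):
        self.addr = addr
        self.datagrams = 0

class SessionTable:
    """Maps a sender's socket address to its session object."""
    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def route(self, addr):
        """Return the session for addr, creating it on first contact."""
        with self._lock:                       # taken for every datagram
            session = self._sessions.get(addr)
            if session is None:
                session = Session(addr)
                self._sessions[addr] = session
            session.datagrams += 1
            return session
```

An I/O thread would call route(sender) with the address returned by RecvFrom() before handing the datagram to the application logic.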

You can use Windows APIs such as I/O completion ports (IOCP) and Registered I/O (RIO), and EPoll on Linux, to improve performance. They can be used asynchronously and in a non-blocking fashion. However, these APIs work with socket handles, not socket addresses, and since UDP is connection-less there is a widely held misunderstanding that UDP cannot or should not work with socket handles.

In fact it can work with socket handles. UDP socket handles work well for asynchronous communication with APIs such as IOCP and EPoll, and they perform substantially better internally (inside the kernel). They also help you avoid the complicated application logic of locks and hash tables needed to maintain state or look up session objects for things like DTLS. If you are using socket addresses with RecvFrom() and SendTo() then you are not leveraging the full performance benefits of these APIs for scalable UDP servers.

Overview

In order to use socket handles with UDP you need to use the Connect() socket API. This is also where developers usually abandon their effort. First off, we have all been taught that UDP is connection-less (and it is), so why would you want to Connect() it? Secondly, the steps required to properly set up a socket handle for UDP to both send and receive on a server are pretty confusing, and if you don’t do them correctly it will never work. I personally think this is a primary reason why so many implementations stick with RecvFrom() and socket addresses: they are easy to understand. There are also some upper limits on the number of socket handles that can be used at one time, but this is unlikely to be your bottleneck on any given server.

The asynchronously capable socket APIs on Windows such as IOCP and RIO, and Linux Epoll are designed to be very efficient using socket handles. If you could relate a client session to a socket handle, then these APIs can directly send and receive using the same approach you would use for a TCP session. Consider that last statement for a moment, because it is important. If you use socket handles for both TCP and UDP, then you would be able to unify a great deal of communication logic and client session objects for both protocols. This is also an important aspect of using socket handles instead of socket addresses. With handles you have a uniform architecture to your communication and application logic.

Besides having more consistent and straightforward code, socket handles perform better. The kernel processes datagrams more efficiently when they are related to a socket handle because of the structure of the internal routing tables. (see UDP – Performance p.255 Unix Network Programming by Richard Stevens) This is because when you use a socket address, the kernel will internally do a lookup and connect the socket handle, send the datagram and disconnect the socket handle. This overhead can substantially reduce performance of datagrams. Each underlying socket implementation handles this differently and performance can vary by OS revision, but fundamentally socket handles perform better. This is especially true for overlapped and event APIs that work directly with socket handles.
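
To make the idea of a connected UDP socket concrete, here is a small Python sketch. The function name connected_udp_demo() and the loopback addresses are assumptions for illustration. Connect() on a UDP socket sends nothing on the wire; it only records a default peer so that Send() and Recv() can be used without an address.

```python
import socket

def connected_udp_demo():
    a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    a.bind(("127.0.0.1", 0))
    b.bind(("127.0.0.1", 0))
    a.settimeout(2.0)
    b.settimeout(2.0)
    try:
        a.connect(b.getsockname())        # records the peer, no packet is sent
        a.send(b"hello")                  # no destination address needed
        data, sender = b.recvfrom(2048)   # b is unconnected and uses addresses
        b.sendto(data.upper(), sender)
        reply = a.recv(2048)              # only datagrams from the peer arrive here
        return a.getpeername() == b.getsockname(), reply
    finally:
        a.close()
        b.close()
```

Socket a behaves much like a TCP socket from the application's point of view, which is what allows session logic to be shared between the two protocols.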

Another major benefit is that IOCP/RIO on Windows and EPoll on Linux allow you to include extra data along with the overlapped operation or the event. Since the socket handle is related directly to a single client session, any stateful information and session object can be attached to the overlapped operation or event. This is an important distinction. If we can include session information with the operation, then we can avoid many locks and hash table lookups. A properly architected IOCP/RIO server can do this and avoid thread contention and race conditions. A discussion of this specific topic is beyond the scope of this article, but suffice it to say that as long as you only have a single pending overlapped read at a time, you are not going to have to lock your session object with IOCP regardless of how many I/O threads are running. This isn’t entirely true with EPoll servers, since EPoll’s oneshot behavior is inconsistent.

Back to the topic at hand though. If we could allocate a socket handle to the client session, we could leverage all of these aforementioned benefits.

Linux UDP Server

On Linux, the most scalable approach currently is to use the EPoll APIs. EPoll has evolved over the years and is quite stable and scalable for both UDP and TCP servers. Additionally, Linux does an excellent job of implementing scalable sockets for UDP in the kernel.

Linux I/O Model

A straightforward high-performance I/O model on Linux would involve the pre-allocation of a group of threads whose only purpose is to process I/O in parallel. Each of these threads would be set up to use the epoll_ctl() API with edge-triggered (EPOLLET) and one-shot (EPOLLONESHOT) delivery. This was the preferred model before Linux kernel 4.5.

Due to potential race conditions in the EPoll implementation, more recent versions of the kernel have introduced EPOLLEXCLUSIVE to avoid potential scaling issues. This is used in conjunction with level-triggered I/O, which is the default.

Either approach is good at creating highly scalable UDP servers on Linux. Each of these threads would call epoll_wait() in a loop.

This is the basic model of a scalable EPoll server and it is pretty much the same for UDP as it is for TCP.
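
A rough sketch of the edge-triggered, one-shot variant of this thread model is shown below, using Python's select.epoll wrapper (Linux only). The names io_worker(), run_demo() and FLAGS are hypothetical. Python's wrapper reports only the file descriptor and does not expose the epoll data pointer, so a small dict stands in for it here.

```python
import select
import socket
import threading
import time

FLAGS = select.EPOLLIN | select.EPOLLET | select.EPOLLONESHOT

def io_worker(ep, sockets, handle, stop):
    """One I/O thread: wait for events, drain the socket, then re-arm it."""
    while not stop.is_set():
        for fd, events in ep.poll(0.1):          # poll() wraps epoll_wait()
            if not events & select.EPOLLIN:
                continue
            sock = sockets[fd]
            try:
                while True:                      # edge-triggered: read until empty
                    data, addr = sock.recvfrom(2048)
                    handle(data, addr)
            except BlockingIOError:
                pass                             # nothing left to read
            ep.modify(fd, FLAGS)                 # one-shot: re-arm the descriptor

def run_demo(n_threads=2, n_datagrams=3):
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    server.setblocking(False)
    ep = select.epoll()
    ep.register(server.fileno(), FLAGS)
    received, lock, stop = [], threading.Lock(), threading.Event()

    def handle(data, addr):
        with lock:
            received.append(data)

    sockets = {server.fileno(): server}
    threads = [threading.Thread(target=io_worker, args=(ep, sockets, handle, stop))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(n_datagrams):
        client.sendto(b"msg%d" % i, server.getsockname())
    deadline = time.monotonic() + 2.0
    while time.monotonic() < deadline:           # wait for the workers to catch up
        with lock:
            if len(received) == n_datagrams:
                break
        time.sleep(0.01)
    stop.set()
    for t in threads:
        t.join()
    ep.close()
    client.close()
    server.close()
    return sorted(received)
```

Because the descriptor is disabled after each event until ep.modify() re-arms it, only one worker at a time handles a given socket, even though every worker waits on the same epoll instance.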

Using UDP socket handles on Linux

In order to take advantage of socket handles with UDP on Linux, there are numerous steps in the initial setup of the client session. Personally I like to think of this setup process in a similar manner as to how you would handle an initial accept for a TCP session. Once the UDP session is accepted, you can continue your processing in a highly efficient manner.

To make this all work in a Linux UDP server, you need to:

1. Create a UDP listening socket using the socket() API. This will be our well-known listening port.

2. Obtain a socket address for the UDP listening socket. There are various ways to do this. I typically use getaddrinfo(). We will need the listening socket address in step 7.

3. Use SetSockOpt() with SO_REUSEADDR against the listening socket. This is required to be able to bind() and connect() to the same socket.

4. Use epoll_ctl() with EPOLL_CTL_ADD and EPOLLIN with the listening socket. To initiate the listening process you need to start the process by adding the EPOLLIN event flag. Along with this event you should also include a pointer to a data object (EPoll_Data.ptr). Your data object should have a flag to indicate whether or not you have already allocated a session object. We examine this flag with every event we will receive.

5. Use epoll_wait() to wait for your events in a loop.

6. If you receive an EPOLLIN event, then examine the flag inside of the data object (EPoll_Data.ptr) to see if we need to allocate a session object.

7. If this is an EPOLLIN event without a session object, then:

   1. Use RecvFrom() to obtain the client socket address. We also need to keep this first data buffer we received, so we can pass it up to the application layer once we have set up the client session and socket handle.
   2. Create a new UDP socket using the socket() API. This new socket will be the socket we assign to the client session. It should match the listening socket’s family, socket type and protocol. This is our client socket.
   3. Use SetSockOpt() with SO_REUSEADDR against the client socket. This is also required.
   4. Bind() the client socket to the socket address of the listening socket. On Linux this essentially passes the responsibility for receiving data for the client session from the well-known listening socket to the newly allocated client socket. It is important to note that this behavior is not the same on other platforms, like Windows (unfortunately).
   5. Connect() the client socket to the client socket address. This is the socket address received from RecvFrom(), not the listening socket address. This sets up the socket so that data can be sent to the client session using the new client socket with the Send() API.
   6. Finally, use epoll_ctl() with EPOLL_CTL_ADD and EPOLLIN with the client socket. Along with this event you should also include a pointer to your session object (EPoll_Data.ptr).

8. If this is an EPOLLIN event with a session object, then:

   1. Use Recv() to read the data from the socket. We do not need to use RecvFrom() since the client socket is already allocated and the client session object is already created.

Note: From this point forward the application logic can be similar between UDP based sessions and TCP based sessions, provided you are trying to handle UDP sessions as such. Even if you intend to handle UDP sessions differently, you may still have the need to handle things like DTLS or other application stateful information that relates to each client.
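
The Linux steps above can be sketched as follows. This is a single-threaded Python illustration under several assumptions: the helper names make_listener(), accept_udp(), pump() and demo() are hypothetical, a dict keyed by file descriptor stands in for the epoll data pointer, and the listening socket is explicitly bound to its address, which the steps imply but do not list. It relies on Linux behavior and is not expected to work the same way on other platforms.

```python
import select
import socket

def make_listener(host="127.0.0.1", port=0):
    """Steps 1-3, plus the bind() that the well-known port needs."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.setblocking(False)
    return s

def accept_udp(listener, ep, sessions):
    """Step 7: turn the first datagram from a new peer into a client socket."""
    first, peer = listener.recvfrom(2048)                      # 7.1
    c = socket.socket(listener.family, socket.SOCK_DGRAM)      # 7.2
    c.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)    # 7.3
    c.bind(listener.getsockname())                             # 7.4
    c.connect(peer)                                            # 7.5
    c.setblocking(False)
    sessions[c.fileno()] = {"sock": c, "peer": peer,
                            "first": first, "inbox": []}
    ep.register(c.fileno(), select.EPOLLIN)                    # 7.6

def pump(listener, ep, sessions, timeout=1.0):
    """Steps 5, 6 and 8: handle one batch of epoll events."""
    for fd, events in ep.poll(timeout):
        if not events & select.EPOLLIN:
            continue
        if fd == listener.fileno():           # no session object yet
            accept_udp(listener, ep, sessions)
        else:                                 # session exists: plain recv()
            session = sessions[fd]
            session["inbox"].append(session["sock"].recv(2048))

def demo():
    listener = make_listener()
    ep = select.epoll()
    ep.register(listener.fileno(), select.EPOLLIN)             # step 4
    sessions = {}
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        client.sendto(b"first", listener.getsockname())
        pump(listener, ep, sessions)          # arrives on the listening socket
        client.sendto(b"second", listener.getsockname())
        pump(listener, ep, sessions)          # expected on the client socket
        return [(s["first"], s["inbox"]) for s in sessions.values()]
    finally:
        for s in sessions.values():
            s["sock"].close()
        client.close()
        listener.close()
        ep.close()
```

In demo() the client always sends to the well-known port. After the first datagram creates the session, later datagrams from that client are expected to arrive on the connected client socket, while datagrams from unknown peers continue to arrive on the listening socket.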

Windows UDP Server

On Windows we can use either I/O completion ports or Registered I/O, the current most scalable approach. The concepts are nearly identical between the apis, so we will discuss IOCP primarily.

Using IOCP for UDP servers seems like a dark art. There is a widely held belief that you must pre-allocate memory buffers to receive data. This is not true, and it is possible to perform a read-zero operation for UDP servers with IOCP.

For highly scalable UDP servers on Windows, memory can be precious so avoiding allocating memory buffers leads to greater scale. Additionally the pre-allocation of memory buffers requires a great deal of extra logic to manage these buffers as hash tables or queues with locking mechanisms. All of this slows down the processing of individual datagrams and is completely unnecessary.

Note: Unfortunately some aspects of how socket handles work under Unix and Linux, do not work properly on Windows. More on that topic later.

Windows I/O Model

A straightforward performance I/O model on Windows would involve the pre-allocation of a group of threads whose only purpose is to process I/O in parallel. Each of these threads would call GetQueuedCompletionStatus() in a loop.

To initiate a zero-byte read operation for UDP, you simply pass an overlapped IO event with an empty buffer. The key is to include the MSG_PEEK flag in the WsaRecv() overlapped api call. This signals the underlying completion logic to raise an overlapped event, but not to pass any data.

Within your thread I/O loop that is calling GetQueuedCompletionStatus() you will receive a new error condition ERROR_MORE_DATA indicating that data is ready to be read. You can now use WsaRecvFrom() to actually read the data.

It may seem counterintuitive to initiate an overlapped operation just to signal a read, but by using a zero-byte read you not only avoid managing memory buffers and the related overhead that slows your overall datagram processing, you also increase your scale.

Note: I prefer to pre-allocate a receive buffer for each I/O thread instead of allocating a buffer on demand in the method which handles the WsaRecvFrom(). This tends to be both faster and much more memory efficient.

This is the basic model of a scalable Windows UDP server.

Using UDP socket handles on Windows

To make this all work in a Windows UDP server, you need to:

1. Create an overlapped UDP listening socket using the WsaSocket() api. This will be our well-known listening port.

2. Obtain a socket address for the UDP listening socket. There are various ways to do this. I typically use getaddrinfo(). We will need the listening socket address in step 7.

3. Use SetSockOpt() with SO_REUSEADDR against the listening socket. This is required to be able to Bind() and Connect() to the same socket.

4. Use WsaRecv() to create a single zero-byte overlapped operation for each I/O thread to initiate communications. You will need to include the MSG_PEEK flag.

5. Use GetQueuedCompletionStatus() to wait for your completion events in a loop.

6. If you receive an ERROR_MORE_DATA error then use WsaRecvFrom() to obtain the client socket address. We also need to keep the data buffer we received.

7. Look up the socket address, and if the socket address is unknown:

   1. Create a new UDP socket using the socket() or WsaSocket() API. This new socket will be the socket we assign to the client session. It should match the listening socket’s family, socket type and protocol. This is our client socket.
   2. Use SetSockOpt() with SO_REUSEADDR against the client socket. This is also required.
   3. Bind() the client socket to the socket address of the listening socket. This does not work the same way as on Linux; more on this later.
   4. Connect() the client socket to the client socket address. This is the socket address received from WsaRecvFrom(), not the listening socket address. Also note that this is not the ConnectEx() API, which only works with connection-oriented sockets, even though this socket is intended for overlapped I/O. This sets up the socket so that data can be sent to the client session using the new client socket with the WsaSend() API.
   5. Finally, add the socket to your completion port. Now you can use the new handle to schedule overlapped I/O from your server to send data to specific client sessions using WsaSend(). Along with this event you could also include a pointer to a session object (WSAOVERLAPPED).

8. Look up the socket address, and if the socket address is known:

   1. Handle the received data and then use WsaRecv() to create a single zero-byte overlapped operation for the current thread again. You will need to include the MSG_PEEK flag.

Note: Unfortunately Windows doesn’t seem to implement socket handles correctly for UDP, limiting their use for I/O completion ports. In particular there appears to be no way to disassociate a client session from the well-known listening port when allocating a new socket for a given UDP client session. This limits client session socket handles to overlapped WsaSend() operations exclusively. Sending data to client sessions is an important part of most UDP servers, so it is still a worthwhile exercise, but it would be great if there was some way to make it work properly on Windows.

What this means is that you need to manually maintain a hash table and locking mechanism to cross reference socket addresses to client session objects when receiving UDP datagrams on Windows. If anyone knows of another way to work around the Bind() issue for UDP on Windows, please let me know.
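
One possible way to build the lookup key for such a table is to pack the sender's address and port into a single integer. The sketch below assumes that approach; the function name address_key() and the bit layout are illustrative choices, not a prescribed format.

```python
import ipaddress

def address_key(addr):
    """Pack an (ip, port, ...) socket address into one integer key."""
    ip = ipaddress.ip_address(addr[0])
    port = addr[1]
    # Layout: IP version, then the 32-bit or 128-bit address, then the 16-bit port.
    # The version field keeps IPv4 and IPv6 addresses with the same numeric
    # value apart. Link-local scope IDs are not handled in this sketch.
    return (ip.version << 144) | (int(ip) << 16) | port
```

The resulting key can be used directly in a dictionary that maps senders to session objects, guarded by whatever locking mechanism the server already uses.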

Conclusion

We hope this article was helpful in covering some of the more advanced topics that relate to high-performance UDP servers on Windows and Linux and their peculiarities. If you didn’t understand a thing, that is okay too. It wasn’t intended as an exhaustive primer on the topics, only to cover some of the more complicated aspects and to start a dialog on how to create the best possible UDP servers.

Please feel free to comment and share your thoughts.

13 thoughts on “Creating high-performance UDP servers on Windows and Linux”

Roberto Della Pasqua says:

August 29, 2018 at 3:40 pm

about a regular tcp scalable server, what kind of kernel API/ usermode data structure your will use? IOCompletionPorts/dynamic ThreadPool/Hash tables/Priority queues/Synchro primitives? Can perhaps you show a smallest and working example for a common scenario?
anyway thanks you guys for your great articles, you really rocks!

1. Allen Drennan says:
  
  August 30, 2018 at 5:05 am
  
  A good TCP server using IOCP on Windows with Delphi would have almost no hash tables, dictionaries or queues. It would also not use locks for send or receive operations (except for TLS), despite being multi-threaded. Perhaps only a single dictionary to keep track of connections. Most core objects would be interfaces, so that they are referenced counted and removed automatically when they go out of scope since knowing when to destroy things with IOCP is one of the trickiest aspects. A good implementation would take advantage of the asynchronous nature of IOCP and be non-blocking, use Delphi anonymous methods as callbacks for request/response related operations. It would avoid excessive memory allocations and perform zero-memory transfer/copy as much as possible. A fairly good example of this is a library called Delphi Cross Socket. It is years ahead of anything currently used in the community. https://github.com/winddriver/Delphi-Cross-Socket
  
themastermind1 says:

September 18, 2018 at 12:15 pm

Hi Allen, thanks for the great writeup.

According to MSDN (https://docs.microsoft.com/en-us/windows/desktop/api/winsock2/nf-winsock2-wsarecv), the MSG_PEEK flag is “valid only for nonoverlapped sockets”. Are the docs simply incorrect about this?

1. Allen Drennan says:
  
  September 18, 2018 at 12:44 pm
  
  Thank you for taking the time to read it!
  
  I can tell you that it works for the purpose of overlapped UDP sockets, provided you are performing a zero-byte read and not actually transferring any data. In other words, only as a signal to indicate there is overlapped data to be read.
  
  1. themastermind1 says:
    
    September 18, 2018 at 1:13 pm
    
    Thanks. In my experience it also works, but the MSDN comment confused me.
    
    One thing to note: if you use GetQueuedCompletionStatusEx() + WSAGetOverlappedResult() instead of GetQueuedCompletionStatus(), then you will not get ERROR_MORE_DATA but instead WSAEMSGSIZE
    
Griffon26 says:

November 28, 2018 at 12:12 pm

You write that on Windows “as long as you only have a single pending overlapped read at a time, you are not going to have to lock your session object with IOCP regardless of how many I/O threads are running.”

I don’t fully understand this. What’s the point of having multiple threads if there is only one pending overlapped read at a time? The reads are all on the same socket, so you cannot have one pending overlapped read per client, right? Or was the point that you could have those threads handling operations for other parts of your program (file writes etc)?

1. Allen Drennan says:
  
  November 28, 2018 at 12:58 pm
  
  Sorry, I was intentionally trying to be vague because the topic was beyond the scope of the article. I should have clarified that I was referring to the way it should work with UDP on Windows with IOCP, which is how it works with TCP on Windows with IOCP and UDP with EPoll. Because Windows lacks the ability to Bind() and Connect() a socket for UDP for receiving data, it doesn’t allow us to stage overlapped operations specific to a client without a locking mechanism of some sort. A typical approach is to use RecvFrom() to obtain the client socket address and then use the Integer representation of the socket address (either IPv4 or IPv6) as a hash table lookup key to find our session object.
  
  However, you still want multiple threads to handle multiple different client sessions in parallel. The majority of the time for each thread isn’t in reading from the socket, but instead handling logic/processing related to the read and any subsequent response as you surmised.
  
  1. Griffon26 says:
    
    November 30, 2018 at 11:53 am
    
    I’m writing a UDP forwarder that accepts UDP from external clients and forwards it to a server on localhost. Do you know of a way to assess the performance of different solutions? At the moment I only have a packet generator and echoer that I’ve written myself, because I couldn’t find anything on the net.
    
    I’m now trying to decide between having only one pending overlapped read for all clients or having multiple reads and using per-client queues with perhaps PostQueuedCompletionStatus to signal available data. But this seems very complex and I’m not sure if it will bring any performance benefit.
    
    How would you decide which to choose? Does it make sense to use asynchronous operations to avoid waiting on the kernel to transfer data into user space?
    
Allen Drennan says:

December 1, 2018 at 7:02 am

While it’s not clear to me exactly what you are doing, if you are just forwarding UDP packets you could simply use multiple threads with RecvFrom() and differentiate based upon the socket address related to the sender.

brjoha says:

January 11, 2019 at 7:53 pm

It’s worth noting that the bind() call for the new descriptor effectively takes over the port from the listening descriptor until the subsequent connect() call is made to complete the “connected” session with the client. After the connect() call is made, the port reverts to the listening descriptor. This window between calls, however small, allows for datagrams from other clients to get delivered to the new descriptor instead of the listening descriptor. This scenario messes up DTLS authentication for all the clients involved. You may also want to look at the proposed SO_REUSEPORT option, though its hash may still create problems for stateful sessions. What would be nice to see is accept() for UDP that creates a new “connected” descriptor and moves over the datagram from the listener descriptor (along with any others from the client).

1. Allen Drennan says:
  
  January 12, 2019 at 9:16 am
  
  Thanks for sharing your experiences. I have never personally experienced the issue with timing related to moving from the well known listening socket to the connected socket, but that makes sense to me. Although in DTLS, handshake retries are built into the protocol and that should be able to work around that. The other thing I have noticed is that how UDP packets are discarded is mostly dependent on the network driver stack implementation with some drivers on some OSes buffering packets while others discarding them. Some drivers even allow you to specify buffer sizing or parameters relating to when packets are discarded. Since one cannot rely on datagrams to be reliable or the behavior of the network driver, retries become required for secure handshaking. I agree though, it would be really nice to Accept() directly into our connected socket.
  
Dmitry says:

July 13, 2019 at 6:19 am

Good day!
I am not an English speaking person, so I do not understand you are proposing a concept or you have made an application that implement your algorithm of high-performance UDP-server for windows and linux?

And why after step 2 (2. Obtain a socket address for the UDP listening socket…),
both in windows and in linux version of algorithm
You do not make bind() on well-known listening port?
Otherwise RecvFrom() or WsaRecv() for to obtain the client socket address just won’t work.
Moreover, it will give an error if you do not make bind() before first recvfrom().

1. Allen Drennan says:
  
  July 24, 2019 at 8:56 am
  
  I agree with you. The article says the same thing you are saying, after Step 2… “this is required to be able to Bind() and Connect() to the same socket.” which is part of Step 3. Step 4 is where you actually call Recv(). Yes, this works in real applications.
  