STREAMS vs. Sockets Performance Comparison for SCTP
Distribution | Kernel |
RedHat 7.2 | 2.4.20-28.7 |
WhiteBox 3 | 2.4.27 |
CentOS 4 | 2.6.9-5.0.3.EL |
SuSE 10.0 OSS | 2.6.13-15-default |
Ubuntu 6.10 | 2.6.17-11-generic |
Ubuntu 7.04 | 2.6.20-15-server |
Fedora Core 6 | 2.6.20-1.2933.fc6 |
To remove the dependence of test results on a particular machine, various machines were used for testing as follows:
Hostname | Processor | Memory | Architecture |
porky | 2.57GHz PIV | 1Gb (333MHz) | i686 UP |
pumbah | 2.57GHz PIV | 1Gb (333MHz) | i686 UP |
daisy | 3.0GHz i630 HT | 1Gb (400MHz) | x86_64 SMP |
mspiggy | 1.7GHz PIV | 1Gb (333MHz) | i686 UP |
The results for the various distributions and machines are tabulated in Appendix B. The data is charted as follows:
Performance is charted by graphing the number of messages sent and received per second against the logarithm of the message send size.
Delay is charted by graphing the number of seconds per send and receive against the sent message size. The delay can be modelled as a fixed overhead per send or receive operation plus a fixed overhead per byte sent. This model results in a linear graph, with the intercept (at zero message size) representing the fixed per-message overhead and the slope of the line representing the per-byte cost. As all implementations use the same primary mechanism for copying bytes to and from user space, it is expected that the slope of each graph will be similar and that the intercept will reflect most implementation differences.
Throughput is charted by graphing the logarithm of the product of the number of messages per second and the message size against the logarithm of the message size. It is expected that these graphs will exhibit strong log-log-linear (power function) characteristics. Any curvature in these graphs represents throughput saturation.
Improvement is charted by graphing the quotient of the bytes per second of the implementation and the bytes per second of the Linux sockets implementation, expressed as a percentage, against the message size. Values over 0% represent an improvement over Linux sockets, whereas values under 0% represent a degradation relative to Linux sockets.
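The metric definitions above can be summarized with the following notational sketch (the symbols are introduced here for clarity and are assumptions rather than notation used by the test programs). For a sent message size s, with m(s) messages per second:

    d(s) = t_{\mathrm{msg}} + t_{\mathrm{byte}} \cdot s
    T(s) = m(s) \cdot s, \qquad \log T(s) \approx \alpha \log s + \beta
    I(s) = \left( \frac{B_{\mathrm{impl}}(s)}{B_{\mathrm{sockets}}(s)} - 1 \right) \times 100\%

where t_{msg} is the fixed per-message overhead (the intercept), t_{byte} is the per-byte cost (the slope), B(s) denotes bytes per second, and I(s) above 0% corresponds to an improvement over Linux sockets.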
The results are organized in the sections that follow in order of the machine tested.
Porky is a 2.57GHz Pentium IV (i686) uniprocessor machine with 1Gb of memory. Linux distributions tested on this machine are as follows:
Distribution | Kernel |
Fedora Core 6 | 2.6.20-1.2933.fc6 |
CentOS 4 | 2.6.9-5.0.3.EL |
SuSE 10.0 OSS | 2.6.13-15-default |
Ubuntu 6.10 | 2.6.17-11-generic |
Ubuntu 7.04 | 2.6.20-15-server |
Fedora Core 6 is the most recent full release Fedora distribution. This distribution sports a 2.6.20-1.2933.fc6 kernel with the latest patches. This is the x86 distribution with recent updates.
Figure 4 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
Figure 5 plots the average message delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
Figure 6 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
Figure 7 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
The results for Fedora Core 6 on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
Figure 8 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
Figure 9 plots the average message delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
Figure 10 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
Figure 11 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
The results for CentOS on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
SuSE 10.0 OSS is the public release version of the SuSE/Novell distribution. There have been two releases subsequent to this one: the 10.1 and recent 10.2 releases. The SuSE 10.0 release sports a 2.6.13 kernel; the 2.6.13-15-default kernel was tested.
Figure 12 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
Figure 13 plots the average message delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
Figure 14 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
Figure 15 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
The results for SuSE 10 OSS on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
Ubuntu 7.04 is the current release of the Ubuntu distribution. The Ubuntu 7.04 release sports a 2.6.20 kernel. The tested distribution had current updates applied.
Figure 16 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
Figure 17 plots the average message delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
Figure 18 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
Figure 19 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
The results for Ubuntu 7.04 on Porky are, for the most part, similar to the results from other distributions on the same host and also similar to the results for other distributions on other hosts.
Distribution | Kernel |
RedHat 7.2 | 2.4.20-28.7 |
Pumbah is a control machine and is used to rule out differences between recent 2.6 kernels and one of the oldest and most stable 2.4 kernels.
RedHat 7.2 is one of the oldest (and arguably the most stable) glibc2 based releases of the RedHat distribution. This distribution sports a 2.4.20-28.7 kernel. The distribution has all available updates applied.
Figure 20 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes, and comparable performance at larger message sizes.
A significant result is that the TCP XTI over Sockets approach indeed provided improvements over TCP Sockets itself at message sizes beneath 1024 bytes. This improvement can only be accounted for by buffering and scheduling differences. When the receiving process was given a lower scheduling priority than the sending process, TCP Sockets performed much better.
Figure 21 plots the average message delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
STREAMS demonstrates significant improvements at message sizes of less than 1024 bytes, and comparable performance at larger message sizes.
The slope of the delay curve is best for SCTP XTI, then TCP Sockets (for message sizes greater than or equal to 1024 bytes), then TCP XTI over Sockets, then TCP Sockets (with low priority receiver).
The slope of the delay curve indicates that SCTP XTI STREAMS has the best overall per-byte handling performance.
Figure 22 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
STREAMS demonstrates significant improvements at most message sizes.
As can be seen from Figure 22, all implementations exhibit strong power function characteristics, indicating structure and robustness for each implementation, except for TCP Sockets at regular scheduling priorities.
TCP Sockets at regular scheduling priorities exhibits a strong discontinuity between message sizes of 512 bytes and 1024 bytes. This non-linearity can be explained by the poor buffering, flow control and scheduling capabilities of Sockets when compared to STREAMS. Indeed, when the receiving process was artificially downgraded to a low priority (nice -n 19) to avoid the weaknesses inherent in the Sockets approach, TCP Sockets exhibited better characteristics. Perhaps surprisingly, by wrapping the internal socket with STREAMS, the TCP XTIoS approach does not exhibit the weaknesses of Sockets alone and in some way compensates for the deficiencies of Socket buffering, flow control and scheduling.
Figure 23 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
STREAMS demonstrates significant improvements (approx. 150-180%) at message sizes below 1024 bytes. That STREAMS SCTP and TCP sustain such gains over a wide range of message sizes is dramatic. Note that a dramatic improvement is also demonstrated for TCP Sockets when the receiver is artificially given the lowest possible scheduling priority, thus circumventing Sockets' poor scheduling and flow control characteristics.
Distribution | Kernel |
Fedora Core 6 | 2.6.20-1.2933.fc6 |
CentOS 5.0 | 2.6.18-8.1.3.el5 |
This machine is used as an SMP control machine. Most of the tests were performed on uniprocessor, non-hyper-threaded machines. This machine is hyper-threaded and runs full SMP kernels; it also supports EM64T and runs x86_64 kernels. It is used to rule out both SMP differences and 64-bit architecture differences.
Fedora Core 6 is the most recent full release Fedora distribution. This distribution sports a 2.6.20-1.2933.fc6 kernel with the latest patches. This is the x86_64 distribution with recent updates.
Figure 24 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
Figure 25 plots the average message delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
Figure 26 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
Figure 27 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
Mspiggy is a 1.7GHz Pentium IV (M processor) uniprocessor notebook (Toshiba Satellite 5100) with 1Gb of memory. Linux distributions tested on this machine are as follows:
Distribution | Kernel |
SuSE 10.0 OSS | 2.6.13-15-default |
Note that this is the same distribution that was also tested on Porky. The purpose of testing on this notebook is to rule out the differences between machine architectures on the test results. Tests performed on this machine are control tests.
Figure 28 plots the measured performance of the TCP Sockets (both normal and artificial scheduling priorities), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations. The higher performing TCP Sockets graph (with dashed lines and designated '(A)') is the artificial scheduling priority plot. The underperforming TCP Sockets graph (with solid lines and designated '(N)') is the normal scheduling plot.
TCP Sockets with normal scheduling shows dismal performance in comparison to both STREAMS approaches (TCP XTIoS and SCTP XTI) at all message sizes beneath 4096 bytes. It is necessary to artificially reduce the receiver priority to a minimum (nice -n 19) and artificially increase the sender priority to a maximum (nice -n -20) to achieve better results on TCP Sockets beneath 4096 bytes, at the cost of poorer performance above 4096 bytes.
The slightly different performance between TCP XTIoS and SCTP XTI can be explained by the significant overheads that the SCTP protocol introduces on small message sizes.
STREAMS demonstrates significant improvements at message sizes of less than 4096 bytes, and some improvement at all message sizes.
A significant result is that the TCP XTI over Sockets approach indeed provided improvements over TCP Sockets itself at message sizes beneath 4096 bytes. This improvement can only be accounted for by buffering differences. When the receiving process was given a lower scheduling priority (nice -n 19) than the sending process (nice -n -20), forcing the implementation into a tight corner, TCP Sockets performed better, but only for smaller message sizes.
Figure 29 plots the average delay of TCP Sockets (both normal and artificial), TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
The average delay of the TCP XTIoS STREAMS and SCTP XTI STREAMS approaches is similar, and comparable with TCP Sockets with artificial scheduling, at message sizes beneath 4096 bytes. With normal scheduling, however, TCP Sockets has poor per-message delay (intercept) but superior per-byte delay (slope).
Figure 30 plots the effective throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
All curves exhibit good power function characteristics beneath 1024 byte message sizes, indicating structure and robustness for each implementation, but each implementation exhibits saturation characteristics above 1024 bytes.
Figure 31 plots the relative percentage of throughput of TCP Sockets, TCP XTIoS STREAMS and SCTP XTI STREAMS implementations.
For the normal case, TCP XTIoS STREAMS and SCTP XTI STREAMS exhibit significant improvements over TCP Sockets for message sizes less than 4096 bytes and are superior or comparable at message sizes greater than 4096 bytes. Forcing TCP Sockets into a specific behaviour by artificially maximizing the sender priority and minimizing the receiver priority results in improved behaviour below 4096 bytes, but TCP Sockets then performs worse than with normal scheduling priorities at message sizes of 4096 bytes or more.
STREAMS demonstrates significant improvements (approx. 250%) at message sizes below 1024 bytes. That STREAMS SCTP gives a 200% improvement over a wide range of message sizes is dramatic.
With some caveats as described at the end of this section, the results are consistent enough across the various distributions and machines tested to draw some conclusions regarding the efficiency of the implementations tested. This section analyses the results and draws conclusions consistent with the experimental data.
The test results reveal that the maximum throughput performance, as tested by the netperf program, of the STREAMS implementation of SCTP is superior to that of the Linux Kernel Sockets implementation of TCP. In fact, the performance of the STREAMS TPI SCTP implementation at smaller message sizes is significantly greater (by as much as 200-300%) than that of Linux Kernel Sockets TCP. As the common belief is that STREAMS would exhibit poorer performance, this is perhaps a startling result to some.
Perhaps even more surprising is that the STREAMS implementation of TCP using XTI over Sockets is superior to TCP Sockets alone! And again, by as much as 200-300%.
Looking at both implementations, the performance differences can be explained by their implementation similarities and differences:
When Linux Sockets TCP receives a send request, the available send buffer space is checked. If the current data would cause the send buffer fill to exceed the send buffer maximum, either the calling process blocks awaiting available buffer, or the system call returns with an error (ENOBUFS). If the current send request will fit into the send buffer, a socket buffer (skbuff) is allocated, data is copied from user space to the buffer, and the socket buffer is dispatched to the IP layer for transmission.
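The send-side check described above can be sketched as follows. This is a minimal illustration in C with simplified accounting, not the actual Linux kernel code; the structure, the socket_send_sketch() helper and its parameters are assumptions for the purpose of the sketch.

#include <errno.h>
#include <stddef.h>

/* Simplified model of the send buffer accounting described above. */
struct send_buf {
        size_t fill;    /* bytes currently held in the send buffer */
        size_t max;     /* send buffer maximum (e.g. SO_SNDBUF)    */
};

/* Returns 0 on success; returns -ENOBUFS (or would block) when the
 * send buffer cannot accommodate the request. */
int socket_send_sketch(struct send_buf *sb, const void *user_data,
                       size_t len, int nonblocking)
{
        if (sb->fill + len > sb->max) {
                if (nonblocking)
                        return -ENOBUFS;        /* caller must retry later   */
                /* otherwise: sleep until the fill drops below max/2 ...     */
        }
        /* allocate an skbuff, copy user_data into it, account for it, and  */
        /* hand it to the IP layer for (loop-back) transmission.            */
        sb->fill += len;
        return 0;
}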
Linux 2.6 kernels have an amazing amount of special case code that gets executed for even a simple TCP send operation. Linux 2.4 kernels are more direct. The result is the same, even though they differ in the depths to which they must delve before discovering that a send is just a simple send. This might explain part of the rather striking differences in the performance comparison between STREAMS and Sockets on 2.6 and 2.4 kernels.
When the STREAMS Stream head receives a putmsg(2) request, it checks downstream flow control. If the Stream is flow controlled downstream, either the calling process blocks awaiting subsidence of flow control, or the putmsg(2) system call returns with an error (EAGAIN). If the Stream is not flow controlled on the write side, message blocks are allocated to hold the control and data portions of the request and the message blocks are passed downstream to the driver. When the driver receives an M_DATA or M_PROTO message block from the Stream head in its put procedure, it simply queues it to the driver write queue with putq(9). putq(9) will result in the enabling of the service procedure for the driver write queue under the proper circumstances. When the service procedure runs, messages will be dequeued from the driver write queue, transformed into IP datagrams and sent to the IP layer for transmission on the network interface.
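The corresponding STREAMS write-side path can be sketched using the documented STREAMS DDI calls putq(9), getq(9) and putbq(9). This is an illustrative driver fragment only; the driver name and the xxx_xmit() helper (standing in for the datagram transformation and hand-off to IP) are assumptions.

#include <sys/stream.h>

extern int xxx_xmit(queue_t *q, mblk_t *mp);    /* hypothetical: build IP datagram and send */

/* Write-side put procedure: defer work to the service procedure. */
static int xxx_wput(queue_t *q, mblk_t *mp)
{
        putq(q, mp);            /* queues the message and schedules xxx_wsrv() as needed */
        return (0);
}

/* Write-side service procedure: drain the write queue toward IP. */
static int xxx_wsrv(queue_t *q)
{
        mblk_t *mp;

        while ((mp = getq(q)) != NULL) {
                if (xxx_xmit(q, mp) != 0) {
                        putbq(q, mp);   /* could not send now: requeue and await back-enable */
                        break;
                }
        }
        return (0);
}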
Linux Fast-STREAMS has a feature whereby the driver can request that the Stream head allocate a Linux socket buffer (skbuff) to hold the data buffer associated with an allocated message block. The STREAMS SCTP driver utilizes this feature (but the STREAMS XTIoS TCP driver cannot). STREAMS also has the feature that a write offset can be applied to all data blocks allocated and passed downstream. However, neither the STREAMS TPI SCTP nor the XTIoS TCP driver uses this capability. It is currently only used by the second generation STREAMS UDP and RAWIP drivers.
Network processing (that is the bottom end under the transport protocol) for both implementations is effectively the same, with only minor differences. In the STREAMS SCTP implementation, no sock structure exists, so issuing socket buffers to the network layer is performed in a slightly more direct fashion.
Loop-back processing is identical as this is performed by the Linux NET4 IP layer in both cases.
For Linux Sockets TCP, when the IP layer frees or orphans the socket buffer, the amount of data associated with the socket buffer is subtracted from the current send buffer fill. If the current buffer fill is less than 1/2 of the maximum, all processes blocked on write or blocked on poll are awoken.
For STREAMS SCTP, when the IP layer frees or orphans the socket buffer, the amount of data associated with the socket buffer is subtracted from the current send buffer fill. If the current send buffer fill is less than the send buffer low water mark (SO_SNDLOWAT or XTI_SNDLOWAT), and the write queue is blocked on flow control, the write queue is enabled.
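A minimal sketch of the two release-side checks just described, assuming simplified accounting; the helper names and parameters are illustrative and are not kernel or Linux Fast-STREAMS code.

#include <stddef.h>
#include <sys/stream.h>

extern void wake_up_senders(void);      /* hypothetical wakeup helper */

/* Linux Sockets TCP: wake blocked writers once the send buffer fill
 * drops below one half of the send buffer maximum. */
static void sock_write_space_sketch(size_t fill, size_t sndbuf_max)
{
        if (fill < sndbuf_max / 2)
                wake_up_senders();
}

/* STREAMS SCTP: back-enable the flow-controlled write queue once the
 * fill drops below the send low water mark (SO_SNDLOWAT/XTI_SNDLOWAT). */
static void stream_write_space_sketch(size_t fill, size_t sndlowat,
                                      queue_t *wq, int flow_blocked)
{
        if (flow_blocked && fill < sndlowat)
                qenable(wq);            /* schedules the write service procedure */
}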
One disadvantage that would be expected to slow STREAMS SCTP performance is that, on the sending side, a STREAMS buffer is allocated along with an skbuff, and the skbuff is passed to Linux NET4 IP and the loop-back device. For Linux Sockets TCP, the same skbuff is reused on both sides of the interface. For STREAMS SCTP, there is (currently) no mechanism for passing through the original STREAMS message block, and a new message block must be allocated on the receiving side. This results in two message block allocations per skbuff.
Under Linux Sockets TCP, when a socket buffer is received from the network layer, a check is performed to determine whether the associated socket is locked by a user process. If the associated socket is locked, the socket buffer is placed on a backlog queue awaiting later processing by the user process when it goes to release the lock. A maximum number of socket buffers (approx. 300) is permitted to be queued against the backlog queue per socket.
If the socket is not locked, or if the user process is processing a backlog before releasing the lock, the message is processed: the receive socket buffer is checked and, if the received message would cause the buffer to exceed its maximum size, the message is discarded and the socket buffer freed. If the received message fits into the buffer, its size is added to the current receive buffer fill and the message is queued on the socket receive queue. If a process is sleeping on read or in poll, an immediate wakeup is generated.
In the STREAMS SCTP implementation on the receive side, again there is no sock structure, so the socket locking and backlog techniques performed by SCTP at the lower layer do not apply. When the STREAMS SCTP implementation receives a socket buffer from the network layer, it tests the receive side of the Stream for flow control and, when not flow controlled, allocates a STREAMS buffer using esballoc(9) and passes the buffer directly to the upstream queue using putnext(9). When flow control is in effect and the read queue of the driver is not full, a STREAMS message block is still allocated and placed on the driver read queue. When the driver read queue is full, the received socket buffer is freed and the contents discarded. While different in mechanism from the socket buffer and backlog approach taken by Linux Sockets TCP, this bottom end receive mechanism is similar in both complexity and behaviour.
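The receive-side behaviour described for the STREAMS SCTP driver can be sketched as follows. This is illustrative only: the calling convention and the xxx_free_skb() helper are assumptions, and a real driver must keep the frtn_t valid for the lifetime of the message (a local is used here only to keep the sketch short).

#include <sys/types.h>
#include <sys/stream.h>

extern void xxx_free_skb(char *arg);    /* hypothetical: releases the underlying skbuff */

static int xxx_rput_sketch(queue_t *rq, unsigned char *data, size_t len,
                           char *skb_arg)
{
        frtn_t frtn = { (void (*)()) xxx_free_skb, skb_arg };
        mblk_t *mp;

        if (!canputnext(rq) && !canput(rq)) {
                xxx_free_skb(skb_arg);  /* driver read queue full: discard */
                return (-1);
        }
        if ((mp = esballoc(data, len, BPRI_MED, &frtn)) == NULL) {
                xxx_free_skb(skb_arg);  /* allocation failure: discard */
                return (-1);
        }
        mp->b_wptr += len;              /* data already resides in the buffer */
        if (canputnext(rq))
                putnext(rq, mp);        /* not flow controlled: deliver directly upstream */
        else
                putq(rq, mp);           /* flow controlled: queue for later back-enabled delivery */
        return (0);
}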
For Linux Sockets, when a send side socket buffer is allocated, the true size of the socket buffer is added to the current send buffer fill. After the socket buffer has been passed to the IP layer, and the IP layer consumes (frees or orphans) the socket buffer, the true size of the socket buffer is subtracted from the current send buffer fill. When the resulting fill is less than 1/2 the send buffer maximum, sending processes blocked on send or poll are woken up. When a send will not fit within the maximum send buffer size considering the size of the transmission and the current send buffer fill, the calling process blocks or is returned an error (ENOBUFS). Processes that are blocked or subsequently block on poll(2) will not be woken up until the send buffer fill drops beneath 1/2 of the maximum; however, any process that subsequently attempts to send and has data that will fit in the buffer will be permitted to proceed.
STREAMS networking, on the other hand, performs queueing, flow control and scheduling on both the sender and the receiver. Sent messages are queued before delivery to the IP subsystem. Received messages from the IP subsystem are queued before delivery to the receiver. Both sides implement full hysteresis high and low water marks. Queues are deemed full when they reach the high water mark and do not enable feeding processes or subsystems until the queue subsides to the low water mark.
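The hysteresis described above can be expressed as two simple predicates over the documented queue_t fields q_count, q_hiwat and q_lowat. This is an illustrative sketch, not Linux Fast-STREAMS code.

#include <sys/stream.h>

/* The feeding side stops when the queued byte count reaches the high
 * water mark. */
static int queue_is_full(queue_t *q)
{
        return (q->q_count >= q->q_hiwat);
}

/* The feeding side is re-enabled only once the count subsides to the
 * low water mark, giving the full hysteresis described above. */
static int queue_can_backenable(queue_t *q)
{
        return (q->q_count <= q->q_lowat);
}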
Linux Sockets schedule by waking a receiving process whenever data is available in the receive buffer to be read, and waking a sending process whenever one half of the send buffer is available to be written. While buffering is accomplished on the receive side, full hysteresis flow control is performed only on the sending side. Due to the way that Linux handles the loop-back interface, the full hysteresis flow control on the sending side is defeated.
STREAMS networking, on the other hand, uses the queueing, flow control and scheduling mechanism of STREAMS. When messages are delivered from the IP layer to the receiving stream head and a receiving process is sleeping, the service procedure for the reading stream head's read queue is scheduled for later execution. When the STREAMS scheduler later runs, the receiving process is awoken. When messages are sent on the sending side they are queued in the driver's write queue and the service procedure for the driver's write queue is scheduled for later execution. When the STREAMS scheduler later runs, the messages are delivered to the IP layer. When sending processes are blocked on a full driver write queue, and the count drops to the low water mark defined for the queue, the service procedure of the sending stream head is scheduled for later execution. When the STREAMS scheduler later runs, the sending process is awoken.
Linux Fast-STREAMS is designed to run tasks queued to the STREAMS scheduler on the same processor as the queueing process or task. This avoids unnecessary context switches.
The STREAMS networking approach results in fewer blocking and wakeup events being generated on both the sending and receiving side. Because there are fewer blocking and wakeup events, there are fewer context switches. The receiving process is permitted to consume more messages before the sending process is awoken; and the sending process is permitted to generate more messages before the reading process is awoken.
The result of the differences between the Linux NET and the STREAMS approach is that better flow control is being exerted on the sending side because of intermediate queueing toward the IP layer. This intermediate queueing on the sending side, not present in BSD-style networking, is in fact responsible for reducing the number of blocking and wakeup events on the sender, and permits the sender, when running, to send more messages in a quantum.
On the receiving side, the STREAMS queueing, flow control and scheduling mechanisms are similar to the BSD-style software interrupt approach. However, Linux does not use software interrupts on loop-back (messages are passed directly to the socket with possible backlogging due to locking). The STREAMS approach is more sophisticated as it performs backlogging, queueing and flow control simultaneously on the read side (at the stream head).
The following limitations in the test results and analysis must be considered:
Tests compare performance on the loop-back interface only. Several characteristics of the loop-back interface make it somewhat different from regular network interfaces:
One of the major disadvantages of SCTP over TCP from a protocol performance perspective is the increased cost of the CRC-32C checksum used by SCTP over the simpler Internet checksum used by TCP. Using the loop-back interface avoids this checksum cost comparison, as neither TCP nor SCTP performs checksumming on loop-back.
This means that there is less difference between putting each data chunk in a single packet versus putting multiple data chunks in a packet.
This, in fact, provides an advantage to TCP. Even a light degree of packet loss impacts TCP's performance to a far greater degree than SCTP.
This also provides an advantage to Sockets TCP. Because STREAMS SCTP cannot pass a message block along with the socket buffer (socket buffers are orphaned before passing to the loop-back interface), a message block must also be allocated on the receiving side.
Tests compare performance of two rather different implementations of TCP against a single implementation of SCTP. TCP and SCTP have inherent differences in the protocol that affect the efficiency at various load points.
For example, whereas TCP can coalesce many small writes into a single contiguous segment for transmission in a single TCP packet, SCTP must normally create individual data chunks for each write. Some of the original iperf testing on the OpenSS7 Linux Native Sockets version of SCTP used a specialized SOCK_STREAM mode that ignored message boundaries and only supported one SCTP stream. This provides a far better comparison to TCP in this respect. The netperf package also takes this approach by setting the T_MORE bit on all calls to t_snd(3).
Also, when message sizes are small, SCTP normally has significant overheads in the protocol that consume available bandwidth and reduce efficiency. For example, for messages of N bytes, transmitted on the loop-back interface, for TCP this consists of the IP header, the TCP header, and the data; for SCTP, the IP header, the SCTP header, and one data chunk header (plus the padding to pad the data to the next 32-bit boundary) for each message bundled. Again, netperf sets the T_MORE bit on all calls to t_snd(3) in an attempt to behave more like TCP for comparison testing.
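As a rough illustration of the per-message overhead difference, using standard header sizes (20-byte IPv4 header, 20-byte TCP header without options, 12-byte SCTP common header and 16-byte data chunk header) and assuming a single unbundled data chunk per packet, the bytes carried for an N-byte message are approximately:

    \mathrm{bytes}_{TCP}(N) = 20 + 20 + N
    \mathrm{bytes}_{SCTP}(N) = 20 + 12 + 16 + N + \mathrm{pad}_4(N), \qquad \mathrm{pad}_4(N) = (4 - N \bmod 4) \bmod 4

For example, N = 64 gives 104 bytes for TCP and 112 bytes for SCTP, and the relative overhead grows as messages get smaller. This is a sketch under the stated assumptions; actual packets may carry options or bundle multiple chunks.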
These experiments have shown that the Linux Fast-STREAMS implementations of STREAMS SCTP, and of STREAMS TCP using XTIoS, outperform the Linux Sockets TCP implementation by a significant amount (approx. 200-300%).
The Linux Fast-STREAMS implementation of STREAMS SCTP and TCP networking is superior by a significant factor across all systems and kernels tested.
All of the conventional wisdom with regard to STREAMS and STREAMS networking is undermined by these test results for Linux Fast-STREAMS.
Contrary to the preconception that STREAMS must be slower because it is more general purpose, in fact the reverse has been shown to be true in these experiments for Linux Fast-STREAMS. The STREAMS flow control and scheduling mechanisms adapt well and increase code and data cache efficiency as well as scheduler efficiency.
The preconception that STREAMS trades efficiency for flexibility and general purpose architecture (that is, that STREAMS is somehow less efficient because it is more flexible and general purpose) has in fact been shown to be untrue. Linux Fast-STREAMS is both more flexible and more efficient. Indeed, the performance gains achieved by STREAMS appear to derive from its more sophisticated queueing, scheduling and flow control model.
Contrary to the preconception that STREAMS must be slower due to complex locking and synchronization mechanisms, Linux Fast-STREAMS performed better on SMP (hyper-threaded) machines than on UP machines and outperformed Linux Sockets TCP by an even more significant factor (about 40% improvement at most message sizes). Indeed, STREAMS appears able to exploit inherent parallelisms that Linux Sockets is unable to exploit.
Contrary to the preconception that STREAMS networking must be slower because STREAMS is more general purpose and has a rich set of features, the reverse has been shown in these experiments for Linux Fast-STREAMS. By utilizing STREAMS queueing, flow control and scheduling, STREAMS SCTP and TCP indeed perform better than Linux Sockets TCP.
The preconception that STREAMS networking must be poorer because it uses a complex yet general purpose framework has also been shown to be untrue in these experiments for Linux Fast-STREAMS. Also, the fact that STREAMS and Linux conform to the same standard (POSIX) means that they are no more cumbersome from a programming perspective. Indeed, a POSIX conforming application will not know the difference between the implementations (with the exception that superior performance will be experienced on STREAMS networking).
UNIX domain sockets are the advocated primary interprocess communications mechanism in the 4.4BSD system: 4.4BSD even implements pipes using UNIX domain sockets (MBKQ97). Linux also implements UNIX domain sockets, but uses the 4.1BSD/SVR3 legacy approach to pipes. XTI has an equivalent to the UNIX domain socket. This consists of connectionless, connection oriented, and connection oriented with orderly release loop-back transport providers. The netperf program has the ability to test UNIX domain sockets, but does not currently have the ability to test the XTI equivalents.
In 4.4BSD, pipes were implemented using sockets (UNIX domain sockets) instead of using the file system as they were in 4.1BSD (MBKQ97). One of the reasons cited for implementing pipes on UNIX domain sockets using the networking subsystem was performance. Another paper released by the OpenSS7 Project (SS7) shows that Linux file-system based pipes (using the SVR3 or 4.1BSD approaches) perform poorly when compared to STREAMS-based pipes. Because Linux uses a similar approach to file-system based pipes in its implementation of UNIX domain sockets, it can be expected that UNIX domain sockets under Linux will also perform poorly when compared to loop-back transport providers under STREAMS.
There are several mechanisms for providing BSD/POSIX Sockets interfaces to STREAMS networking (VS90) (Mar01). The experiments in this report indicate that it could be worthwhile to complete one of these implementations for Linux Fast-STREAMS (Soc) and test whether STREAMS networking using the Sockets interface is also superior to Linux Sockets, just as it has been shown to be with the XTI/TPI interface.
A separate paper comparing the STREAMS-based pipe implementation of Linux Fast-STREAMS to the legacy 4.1BSD/SVR3-style Linux pipe implementation has also been prepared. That paper also shows significant performance improvements for STREAMS attributable to similar causes.
A separate paper comparing a STREAMS-based UDP implementation of Linux Fast-STREAMS to the Linux NET4 Sockets approach has also been prepared. That paper also shows significant performance improvements for STREAMS attributable to similar causes.
One script was used to generate data at normal scheduling priorities for all implementations. Following is a listing of the netperf_benchmark script used to generate raw data points for analysis:
#!/bin/bash
set -x
(
  sudo killall netserver
  sudo netserver >/dev/null </dev/null 2>/dev/null &
  sleep 3
  netperf_sctp_range --mult=2 -x /dev/sctp_t \
      --testtime=10 --bufsizes=131071 --end=16384 ${1+"$@"}
  netperf_tcp_range --mult=2 \
      --testtime=10 --bufsizes=131071 --end=16384 ${1+"$@"}
  netperf_tcp_range --mult=2 -x /dev/tcp \
      --testtime=10 --bufsizes=131071 --end=16384 ${1+"$@"}
  sudo killall netserver
) 2>&1 | tee `hostname`.`date -uIminutes`.log
Another script was used to generate the TCP Sockets data at artificial process priorities. Following is a listing of the netperf_nice2 script used to generate raw data points for analysis:
#!/bin/bash
set -x
(
  sudo killall netserver
  sudo nice -n 19 netserver >/dev/null </dev/null 2>/dev/null &
  sleep 3
  sudo nice -n -20 netperf_tcp_range --mult=2 \
      --testtime=10 --bufsizes=131071 --end=16384 ${1+"$@"}
  sudo nice -n -20 netperf_tcp_range --mult=2 -x /dev/tcp \
      --testtime=10 --bufsizes=131071 --end=16384 ${1+"$@"}
  sudo nice -n -20 netperf_sctp_range --mult=2 -x /dev/sctp_t \
      --testtime=10 --bufsizes=131071 --end=16384 ${1+"$@"}
  sudo killall netserver
) 2>&1 | tee `hostname`.`date -uIminutes`.log
Following are the raw data points captured using the netperf_benchmark script:
Table 1 lists the raw data from the netperf program that was used in preparing graphs for Fedora Core 6 (i386) on Porky.
Table 2 lists the raw data from the netperf program that was used in preparing graphs for CentOS 4 on Porky.
Table 3 lists the raw data from the netperf program that was used in preparing graphs for SuSE OSS 10 on Porky.
Table 4 lists the raw data from the netperf program that was used in preparing graphs for Ubuntu 7.04 on Porky.
Table 5 lists the raw data from the netperf program that was used in preparing graphs for RedHat 7.2 on Pumbah.
Table 6 lists the raw data from the netperf program that was used in preparing graphs for Fedora Core 6 (x86_64) HT on Daisy.
Table 7 lists the raw data from the netperf program that was used in preparing graphs for SuSE 10.0 OSS on Mspiggy.