Re: [openss7] SCTP Proto
Chuck,
First: Thank you for performing this testing!
Yes, the results are very interesting. I have a number of questions and
comments:
Q: Would it be possible for me to post up your results on the
OpenSS7 site? (Our list server rejected the size of your
attachment as too large and I am sure that others would like to
look at the results.)
Q: Can you share the test code? At least the portion which
interfaces directly with the socket, sends and receives data and
makes the time measurements?
Q: How does the test application operate? Does it send one
forward message (timestamping it) and then poll for an
acknowledgment (timestamping it)? Or does it send a stream of
forward messages (timestamping each) and then correlate the
(timestamped) responses? (A sketch of the first pattern
follows these questions.)
Q: What were the settings of the various SCTP configuration options
and socket options for the test? What were the settings of the TCP
configuration options and socket options? (e.g. was TCP_NODELAY
set on the TCP socket, was CONFIG_SCTP_SLOW_VERIFICATION set,
etc.?)
Q: Is it possible to get Ethereal dumps generated by a third
box snooping between the two?
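To make the timing questions concrete, the sort of probe I have in mind
looks roughly like the following. This is a sketch only: the buffer size
is a placeholder, and I am not suggesting that this is how your test code
actually works.

    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Sketch of a round-trip probe: timestamp, send one forward message,
     * block for the reply, timestamp again.  TCP_NODELAY is shown being
     * set because that is one of the option settings asked about above;
     * for the SCTP run the equivalent SCTP options would matter instead. */
    static long rtt_probe(int sd, char *buf, size_t len)
    {
            struct timeval t0, t1;
            int one = 1;

            /* disable Nagle so a small reply is not held back (TCP case) */
            setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

            gettimeofday(&t0, NULL);                /* timestamp the send */
            if (send(sd, buf, len, 0) != (ssize_t) len)
                    return -1;
            if (recv(sd, buf, len, 0) <= 0)         /* wait for the reply */
                    return -1;
            gettimeofday(&t1, NULL);                /* timestamp the reply */

            return (t1.tv_sec - t0.tv_sec) * 1000000L
                 + (t1.tv_usec - t0.tv_usec);       /* RTT in microseconds */
    }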
C: Although using a single byte reply is quite applicable to TCP,
it is a rather unfair comparison for SCTP. TCP can place a 1-byte
acknowledgement into a 21-byte IP payload. SCTP, when
bundling a SACK with DATA chunks, requires a 12-byte message header,
a 12-byte SACK chunk, a 12-byte DATA chunk header and 4 bytes of
(padded) data, for a grand total of a 40-byte IP payload.
That is, it is far more complicated for SCTP to generate a one-byte
reply than it is for TCP. A fairer comparison might be an
echo test where the receiving side merely echoes the data back
to the originator.
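For illustration only, the echoing side would be little more than the
following loop (the buffer size is a placeholder and error handling is
elided; this is not meant as a description of your harness):

    #include <sys/types.h>
    #include <sys/socket.h>

    /* Echo sketch: whatever arrives is sent straight back, so both
     * transports carry the same payload in both directions and neither
     * is penalized for how it packages a 1-byte acknowledgement. */
    static void echo_loop(int sd)
    {
            char buf[8192];                 /* placeholder size */
            ssize_t n;

            while ((n = recv(sd, buf, sizeof(buf), 0)) > 0)
                    if (send(sd, buf, n, 0) != n)
                            break;          /* short write: give up */
    }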
C: In the test results for SCTP, it appears that 50% of the RTTs
were exactly 1000 usecs. This Dirac delta function in the
results makes me suspect a bug in the protocol code, the test
code or the method of generating the graphs. It is even more
suspicious that this spike at 1000 occurs at all frame sizes.
I seriously doubt that one could write a software clock that was
this accurate at 1 MHz.
C: The extremely large variances in the RTT make me wonder
whether SCTP is getting itself into a retransmission scenario.
Ethereal traces would be very helpful.
C: It is interesting that the SCTP minimums are consistently about
double those of TCP. SCTP SACKs only every second DATA chunk
received unless it is also sending data. SACKs are bundled ahead
of DATA in SCTP messages. The receiving stack may be introducing
a delay by processing the SACK before the DATA. Again, Ethereal
dumps and the testing code would help here. If you are sending
one DATA chunk and waiting for the one-byte reply, this might be
exactly what is happening.
C: The kernel crashes above 512 bytes are a good debugging lead. I
will chase that one down and release a patch. It would be very
interesting to see comparisons with TCP above 1024 bytes (TCP's
default MSS), when TCP is forced to fragment, or comparisons at
packet sizes greater than the MTU.
Overall, your testing indicates that there might be some problems in poll
handling, sleeping or waking processes, acknowledgement handling, etc.,
but some strong numbers down at the 300 usec side of the histograms
indicate that it is quite possible to get this SCTP stack performing as
well as TCP, and even outperforming it at larger message sizes. A little
more information (test code, Ethereal dumps) would make things quite
easy to chase down.
There are about 5 places in the code that I know of where significant
speed improvements can be made once these quirks are found. They are:
1) Rework the copy_and_checksum_from_user. I turned it off in
France because of problems with it generating incorrect
checksums. As it stands, data is copied from the user and
the checksum is recalculated on the data each time that it
is retransmitted. (A sketch of the intended change follows
this list.)
2) Rework cloning of sk_buffs when bundling DATA chunks. As
it stands, data is copied too many times.
3) Place stream data structures into kmem caches. Currently
the stream data is kmalloc'ed and kfree'd rather than being
placed in a hardware-aligned kmem cache. This data
structure is accessed on every DATA chunk transmission and
should really be cached. (Again, see the sketch after this
list.)
4) There is really no need to perform slow verification. The
option should be removed.
5) The module is compiled with -O2 and I'm not sure that the
compiler is inlining everything that needs to be inlined. I
can check this and rewrite as macros those things which are
not being inlined. I particularly suspect the
established fast path for receive data.
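To make (1) and (3) a little more concrete, here are rough sketches of
the sort of changes I have in mind. They are illustrations only: the
structure, cache and helper names below are made up rather than the
identifiers in the current code, and the slab calls assume the 2.4
interface.

For (1), the idea is to compute the checksum once, while the data is
copied in from user space, and cache it with the chunk so that a
retransmission does not have to walk the data again:

    #include <linux/skbuff.h>
    #include <linux/errno.h>
    #include <asm/uaccess.h>

    /* sctp_csum() and struct my_chunk are hypothetical names. */
    extern unsigned long sctp_csum(const unsigned char *data, int len);

    struct my_chunk {
            struct sk_buff *skb;
            unsigned long csum;             /* cached at copy time */
    };

    static int my_copy_data(struct my_chunk *c, const char *ubuf, int len)
    {
            unsigned char *p = skb_put(c->skb, len);

            if (copy_from_user(p, ubuf, len))
                    return -EFAULT;
            c->csum = sctp_csum(p, len);    /* one pass; reused on resends */
            return 0;
    }

For (3), the per-stream structure would come out of a hardware-aligned
slab cache rather than kmalloc()/kfree() on the data path:

    #include <linux/init.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct my_strm {                        /* placeholder for the stream struct */
            unsigned short sid;
            unsigned short ssn;
    };

    static kmem_cache_t *my_strm_cachep;

    static int __init my_strm_cache_init(void)
    {
            my_strm_cachep = kmem_cache_create("sctp_strm",
                                               sizeof(struct my_strm), 0,
                                               SLAB_HWCACHE_ALIGN, NULL, NULL);
            return my_strm_cachep ? 0 : -ENOMEM;
    }

    /* the DATA chunk transmission path then allocates and frees with */
    static inline struct my_strm *my_strm_alloc(int gfp)
    {
            return kmem_cache_alloc(my_strm_cachep, gfp);
    }

    static inline void my_strm_free(struct my_strm *s)
    {
            kmem_cache_free(my_strm_cachep, s);
    }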
In France we were hoping to get the conformance correct before
addressing performance. I'm sure that it will not take too much to get
this stack running as fast as you would like.
--Brian
On Tue, 29 May 2001 18:10:45, Chuck Winters wrote:
>
> Hey,
> I recompiled my kernel to use only one processor. I have been
> doing some preliminary testing of the protocol, and have found it to be
> quite slow. I am getting an average RTT of about 1900 microseconds. I am
> including two preliminary tests, one on TCP and one on SCTP.
> You will notice that the SCTP one only went to 512, but that is only
> because the kernel crashes every time at that point. These are only
> preliminary, but I thought they may be interesting.
>
> Thanks,
> Chuck
>
> --
> Chuck Winters | Email: cwinters@atl.lmco.com
> Distributed Processing Laboratory | Phone: 856-338-3987
> Lockheed Martin Advanced Technology Labs |
> 1 Federal St - A&E-3W |
> Camden, NJ 08102 |
--
Brian F. G. Bidulock ¦ The reasonable man adapts himself to the ¦
bidulock@openss7.org ¦ world; the unreasonable one persists in ¦
http://www.openss7.org/ ¦ trying to adapt the world to himself. ¦
¦ Therefore all progress depends on the ¦
¦ unreasonable man. -- George Bernard Shaw ¦