Discussion:
Early SPECWeb99 results on 2.5.33 with TSO on e1000
Troy Wilson
2002-09-05 18:30:22 UTC
I've got some early SPECWeb [*] results with 2.5.33 and TSO on e1000. I
get 2906 simultaneous connections, 99.2% conforming (i.e. faster than the
320 kbps cutoff), at 0% idle with TSO on. For comparison, with 2.5.25, I
got 2656, and with 2.5.29 I got 2662, (both 99+% conformance and 0% idle) so
TSO and 2.5.33 look like a Big Win.

I'm having trouble testing with TSO off (I changed the #define NETIF_F_TSO
to "0" in include/linux/netdevice.h to turn it off). I am getting errors.

NETDEV WATCHDOG: eth1: transmit timed out
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex

Those adapter resets pushed my SPECWeb result with TSO off down below 2500
connections (it is only that one adapter, BTW), so the results with TSO off
shouldn't be considered valid.

eth1 is the only adapter with errors, and they all look like RX overruns.
For comparison:

eth1 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:9E
inet addr:192.168.4.1 Bcast:192.168.4.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:48621378 errors:8890 dropped:8890 overruns:8890 frame:0
TX packets:64342993 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:3637004554 (3468.5 Mb) TX bytes:1377740556 (1313.9 Mb)
Interrupt:61 Base address:0x1200 Memory:fc020000-0

eth3 Link encap:Ethernet HWaddr 00:02:B3:A3:47:E7
inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:37130540 errors:0 dropped:0 overruns:0 frame:0
TX packets:49061277 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2774988658 (2646.4 Mb) TX bytes:3290541711 (3138.1 Mb)
Interrupt:44 Base address:0x2040 Memory:fe120000-0

I'm still working on getting a clean run with TSO off. If anyone has any
ideas for me about the timeout errors, I'd appreciate the clue.

Thanks,

- Troy


* SPEC(tm) and the benchmark name SPECweb(tm) are registered
trademarks of the Standard Performance Evaluation Corporation.
This benchmarking was performed for research purposes only,
and is non-compliant, with the following deviations from the
rules -

1 - It was run on hardware that does not meet the SPEC
availability-to-the-public criteria. The machine is
an engineering sample.

2 - access_log wasn't kept for full accounting. It was
being written, but deleted every 200 seconds.
Feldman, Scott
2002-09-05 20:47:36 UTC
Post by Troy Wilson
I've got some early SPECWeb [*] results with 2.5.33 and TSO
on e1000. I get 2906 simultaneous connections, 99.2%
conforming (i.e. faster than the 320 kbps cutoff), at 0% idle
with TSO on. For comparison, with 2.5.25, I
got 2656, and with 2.5.29 I got 2662, (both 99+% conformance
and 0% idle) so TSO and 2.5.33 look like a Big Win.
A 10% bump is good. Thanks for running the numbers.
Post by Troy Wilson
I'm having trouble testing with TSO off (I changed the
#define NETIF_F_TSO to "0" in include/linux/netdevice.h to
turn it off). I am getting errors.
Sorry, I should have made a CONFIG switch. Just hack the driver for now to
turn it off:

--- linux-2.5/drivers/net/e1000/e1000_main.c	Fri Aug 30 19:26:57 2002
+++ linux-2.5-no_tso/drivers/net/e1000/e1000_main.c	Thu Sep  5 13:38:44 2002
@@ -428,9 +428,11 @@ e1000_probe(struct pci_dev *pdev,
 	}
 
 #ifdef NETIF_F_TSO
+#if 0
 	if(adapter->hw.mac_type >= e1000_82544)
 		netdev->features |= NETIF_F_TSO;
 #endif
+#endif
 
 	if(pci_using_dac)
 		netdev->features |= NETIF_F_HIGHDMA;

-scott
jamal
2002-09-05 20:59:47 UTC
Hey, thanks for crossposting to netdev

So if I understood correctly (looking at the Intel site), the main value-add
of this feature is probably in having the CPU avoid reassembling and
retransmitting. I am willing to bet that the real value in your results is
in saving on retransmits; I would think shoving the data down to the NIC
and avoiding the fragmentation shouldn't give you that much significant CPU
savings. Do you have any stats from the hardware that could show
retransmits etc.? Have you tested this with zero copy as well (sendfile)?
Again, if I am right, you shouldn't see much benefit from that either.

I would think it probably works well with things like partial ACKs too?
(I am almost sure it does or someone needs to be spanked, so just
checking).

cheers,
jamal
Troy Wilson
2002-09-05 22:11:15 UTC
Post by jamal
So if I understood correctly (looking at the Intel site), the main value-add
of this feature is probably in having the CPU avoid reassembling and
retransmitting.
Quoting David S. Miller:

dsm> The performance improvement comes from the fact that the card
dsm> is given huge 64K packets, then the card (using the given ip/tcp
dsm> headers as a template) spits out 1500 byte mtu sized packets.
dsm>
dsm> Less data DMA'd to the device per normal-mtu packet and less
dsm> per-packet data structure work by the cpu is where the improvement
dsm> comes from.
Post by jamal
Do you have any stats from the hardware that could show
retransmits etc;
I'll gather netstat -s after runs with and without TSO enabled.
Anything else you'd like to see?
Post by jamal
have you tested this with zero copy as well (sendfile)
Yes. My webserver is Apache 2.0.36, which uses sendfile for anything
over 8 kB in size. But, IIRC, Apache sends the HTTP headers using writev.

Thanks,

- Troy
Nivedita Singhvi
2002-09-05 22:39:31 UTC
Post by Troy Wilson
Post by jamal
Do you have any stats from the hardware that could show
retransmits etc;
I'll gather netstat -s after runs with and without TSO enabled.
Anything else you'd like to see?
Troy, this is pointing out the obvious, but make sure
you have the before stats as well :)...
Post by Troy Wilson
Post by jamal
have you tested this with zero copy as well (sendfile)
Yes. My webserver is Apache 2.0.36, which uses sendfile for anything
over 8 kB in size. But, IIRC, Apache sends the HTTP headers using writev.
SpecWeb99 doesn't exercise the path that might benefit the most from this
patch: sendmsg() of large files, i.e. large writes going down the stack.

thanks,
Nivedita
Dave Hansen
2002-09-05 23:01:32 UTC
Post by Nivedita Singhvi
SpecWeb99 doesn't exercise the path that might benefit the most from this
patch: sendmsg() of large files, i.e. large writes going down the stack.
For those of you who don't know Specweb well, the average size of a request
is about 14.5 kB. The largest files are ~5 MB, but most top out at just
under a meg.
--
Dave Hansen
***@us.ibm.com
Nivedita Singhvi
2002-09-05 22:48:35 UTC
Post by jamal
So if i understood correctly (looking at the intel site) the main
value add of this feature is probably in having the CPU avoid
reassembling and retransmitting. I am willing to bet that the real
Er, even just assembling and transmitting? I'm thinking of the reduction
in things like separate memory allocation calls, looking up the route,
etc.
Post by jamal
value in your results is in saving on retransmits; I would think
shoving the data down to the NIC and avoiding the fragmentation shouldn't
give you that much significant CPU savings. Do you have any stats
Why do you say that? Wouldn't the fact that you're now reducing the
number of calls down the stack by a significant number provide
a significant saving?
Post by jamal
from the hardware that could show retransmits etc; have you tested
this with zero copy as well (sendfile) again, if i am right you
shouldnt see much benefit from that either?
thanks,
Nivedita
jamal
2002-09-06 01:47:35 UTC
Post by Nivedita Singhvi
Post by jamal
value in your results is in saving on retransmits; I would think
shoving the data down to the NIC and avoiding the fragmentation shouldn't
give you that much significant CPU savings. Do you have any stats
Why do you say that? Wouldn't the fact that you're now reducing the
number of calls down the stack by a significant number provide
a significant saving?
I am not sure; if he gets a busy system in a congested network, I can
see the offloading savings, i.e. I am not sure the amortization of the
calls away from the CPU is enough of a saving if it doesn't involve a lot
of retransmits. I am also wondering how smart this NIC is in doing the
retransmits; for example, I have doubts whether this idea is brilliant to
begin with; does it handle SACKs? What about the du-jour algorithm: would
you have to upgrade the NIC, or can it be taught some new tricks, etc.?
[Also, I can see why it makes sense to use this feature only with sendfile;
it's pretty much useless for interactive apps.]

Troy, I am not interested in the netstat -s data but rather in the TCP
stats this NIC has exposed, unless those somehow show up magically in
netstat.

cheers,
jamal
Nivedita Singhvi
2002-09-06 03:38:10 UTC
Post by jamal
I am not sure; if he gets a busy system in a congested network, I can
see the offloading savings, i.e. I am not sure the amortization of the
calls away from the CPU is enough of a saving if it doesn't involve a lot
of retransmits. I am also wondering how smart this NIC is in doing the
retransmits; for example, I have doubts whether this idea is brilliant to
begin with; does it handle SACKs?
Do you mean SACK data being sent as a TCP option?
Don't know; lots of other questions arise (like, would the timestamp
on all the segments be the same?).
Post by jamal
Troy, I am not interested in the netstat -s data but rather in the TCP
stats this NIC has exposed, unless those somehow show up magically
in netstat.
Most recent versions of netstat (don't know how far back) display
/proc/net/snmp and /proc/net/netstat (with the Linux TCP MIB),
so netstat -s should show you most of what's interesting.
Or were you referring to something else?

ifconfig -a and netstat -rn would also be nice to have..

thanks,
Nivedita
David S. Miller
2002-09-06 03:58:42 UTC
From: Nivedita Singhvi <***@us.ibm.com>
Date: Thu, 5 Sep 2002 20:38:10 -0700

Most recent versions of netstat (don't know how far back) display
/proc/net/snmp and /proc/net/netstat (with the Linux TCP MIB),
so netstat -s should show you most of what's interesting.
Or were you referring to something else?

ifconfig -a and netstat -rn would also be nice to have..

TSO gets turned off during retransmits/SACK and the card does not do
retransmits.

Can we move on in this conversation now? :-)
Nivedita Singhvi
2002-09-06 04:20:47 UTC
Post by David S. Miller
Post by Nivedita Singhvi
ifconfig -a and netstat -rn would also be nice to have..
TSO gets turned off during retransmits/SACK and the card does not do
retransmits.
Can we move on in this conversation now? :-)
Sure :). The motivation for seeing the stats, though, would
be to get an idea of how much retransmission/SACK etc.
activity _is_ occurring during Troy's SpecWeb runs, which
would give us an idea of how often we're actually doing
segmentation offload, and a better idea of how much gain
it's possible to further get from this (ahem) DMA coalescing :).
Some of Troy's early runs had a very large number of
packets dropped by the card.

thanks,
Nivedita
David S. Miller
2002-09-06 04:17:03 UTC
From: Nivedita Singhvi <***@us.ibm.com>
Date: Thu, 5 Sep 2002 21:20:47 -0700

Sure :). The motivation for seeing the stats, though, would
be to get an idea of how much retransmission/SACK etc.
activity _is_ occurring during Troy's SpecWeb runs, which
would give us an idea of how often we're actually doing
segmentation offload, and a better idea of how much gain
it's possible to further get from this (ahem) DMA coalescing :).
Some of Troy's early runs had a very large number of
packets dropped by the card.

One thing to do is make absolutely sure that flow control is
enabled and supported by all devices on the link from the
client to the test specweb server.

Troy, can you do that for us along with the statistic
dumps?

Thanks.
Troy Wilson
2002-09-07 00:05:29 UTC
Post by Nivedita Singhvi
ifconfig -a and netstat -rn would also be nice to have..
These counters may have wrapped over the course of the full-length
(3 x 20-minute runs + 20-minute warmup + rampup + rampdown) SPECWeb run.


*******************************
* ifconfig -a before workload *
*******************************

eth0 Link encap:Ethernet HWaddr 00:04:AC:23:5E:99
inet addr:9.3.192.209 Bcast:9.3.192.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:208 errors:0 dropped:0 overruns:0 frame:0
TX packets:104 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:22562 (22.0 Kb) TX bytes:14356 (14.0 Kb)
Interrupt:50 Base address:0x2000 Memory:fe180000-fe180038

eth1 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:9E
inet addr:192.168.4.1 Bcast:192.168.4.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:10 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:5940 (5.8 Kb) TX bytes:256 (256.0 b)
Interrupt:61 Base address:0x1200 Memory:fc020000-0

eth2 Link encap:Ethernet HWaddr 00:02:B3:A8:35:C1
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:54 Base address:0x1220 Memory:fc060000-0

eth3 Link encap:Ethernet HWaddr 00:02:B3:A3:47:E7
inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:44 Base address:0x2040 Memory:fe120000-0

eth4 Link encap:Ethernet HWaddr 00:02:B3:A3:46:F9
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:784 (784.0 b) TX bytes:256 (256.0 b)
Interrupt:36 Base address:0x2060 Memory:fe160000-0

eth5 Link encap:Ethernet HWaddr 00:02:B3:A3:47:88
inet addr:192.168.5.1 Bcast:192.168.5.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:32 Base address:0x3000 Memory:fe420000-0

eth6 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:A0
inet addr:192.168.6.1 Bcast:192.168.6.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:64 (64.0 b) TX bytes:256 (256.0 b)
Interrupt:28 Base address:0x3020 Memory:fe460000-0

eth7 Link encap:Ethernet HWaddr 00:02:B3:A3:47:39
inet addr:192.168.7.1 Bcast:192.168.7.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:24 Base address:0x4000 Memory:fe820000-0

eth8 Link encap:Ethernet HWaddr 00:02:B3:A3:47:87
inet addr:192.168.8.1 Bcast:192.168.8.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:20 Base address:0x4020 Memory:fe860000-0

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:56 errors:0 dropped:0 overruns:0 frame:0
TX packets:56 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:5100 (4.9 Kb) TX bytes:5100 (4.9 Kb)


******************************
* ifconfig -a after workload *
******************************

eth0 Link encap:Ethernet HWaddr 00:04:AC:23:5E:99
inet addr:9.3.192.209 Bcast:9.3.192.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3434 errors:0 dropped:0 overruns:0 frame:0
TX packets:1408 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:336578 (328.6 Kb) TX bytes:290474 (283.6 Kb)
Interrupt:50 Base address:0x2000 Memory:fe180000-fe180038

eth1 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:9E
inet addr:192.168.4.1 Bcast:192.168.4.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:74893662 errors:3 dropped:3 overruns:0 frame:0
TX packets:100464074 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:1286843881 (1227.2 Mb) TX bytes:2106085286 (2008.5 Mb)
Interrupt:61 Base address:0x1200 Memory:fc020000-0

eth2 Link encap:Ethernet HWaddr 00:02:B3:A8:35:C1
inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:54 Base address:0x1220 Memory:fc060000-0

eth3 Link encap:Ethernet HWaddr 00:02:B3:A3:47:E7
inet addr:192.168.3.1 Bcast:192.168.3.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:50054881 errors:0 dropped:0 overruns:0 frame:0
TX packets:67122955 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:3730406436 (3557.5 Mb) TX bytes:3034087396 (2893.5 Mb)
Interrupt:44 Base address:0x2040 Memory:fe120000-0

eth4 Link encap:Ethernet HWaddr 00:02:B3:A3:46:F9
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:48 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:7342 (7.1 Kb) TX bytes:256 (256.0 b)
Interrupt:36 Base address:0x2060 Memory:fe160000-0

eth5 Link encap:Ethernet HWaddr 00:02:B3:A3:47:88
inet addr:192.168.5.1 Bcast:192.168.5.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:149206960 errors:2861 dropped:2861 overruns:0 frame:0
TX packets:200247016 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2530107402 (2412.8 Mb) TX bytes:3331495154 (3177.1 Mb)
Interrupt:32 Base address:0x3000 Memory:fe420000-0

eth6 Link encap:Ethernet HWaddr 00:02:B3:9C:F5:A0
inet addr:192.168.6.1 Bcast:192.168.6.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:13 errors:0 dropped:0 overruns:0 frame:0
TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:832 (832.0 b) TX bytes:640 (640.0 b)
Interrupt:28 Base address:0x3020 Memory:fe460000-0

eth7 Link encap:Ethernet HWaddr 00:02:B3:A3:47:39
inet addr:192.168.7.1 Bcast:192.168.7.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:151162569 errors:2993 dropped:2993 overruns:0 frame:0
TX packets:202895482 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:2673954954 (2550.0 Mb) TX bytes:2456469394 (2342.6 Mb)
Interrupt:24 Base address:0x4000 Memory:fe820000-0

eth8 Link encap:Ethernet HWaddr 00:02:B3:A3:47:87
inet addr:192.168.8.1 Bcast:192.168.8.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
RX bytes:0 (0.0 b) TX bytes:256 (256.0 b)
Interrupt:20 Base address:0x4020 Memory:fe860000-0

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:100 errors:0 dropped:0 overruns:0 frame:0
TX packets:100 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:8696 (8.4 Kb) TX bytes:8696 (8.4 Kb)


***************
* netstat -rn *
***************

Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.7.0 0.0.0.0 255.255.255.0 U 40 0 0 eth7
192.168.6.0 0.0.0.0 255.255.255.0 U 40 0 0 eth6
192.168.5.0 0.0.0.0 255.255.255.0 U 40 0 0 eth5
192.168.4.0 0.0.0.0 255.255.255.0 U 40 0 0 eth1
192.168.3.0 0.0.0.0 255.255.255.0 U 40 0 0 eth3
192.168.2.0 0.0.0.0 255.255.255.0 U 40 0 0 eth2
192.168.1.0 0.0.0.0 255.255.255.0 U 40 0 0 eth4
9.3.192.0 0.0.0.0 255.255.255.0 U 40 0 0 eth0
192.168.8.0 0.0.0.0 255.255.255.0 U 40 0 0 eth8
127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
0.0.0.0 9.3.192.1 0.0.0.0 UG 40 0 0 eth0
David S. Miller
2002-09-06 03:56:26 UTC
From: jamal <***@cyberus.ca>
Date: Thu, 5 Sep 2002 21:47:35 -0400 (EDT)

I am not sure; if he gets a busy system in a congested network, I can
see the offloading savings, i.e. I am not sure the amortization of the
calls away from the CPU is enough of a saving if it doesn't involve a lot
of retransmits. I am also wondering how smart this NIC is in doing the
retransmits; for example, I have doubts whether this idea is brilliant to
begin with; does it handle SACKs? What about the du-jour algorithm: would
you have to upgrade the NIC, or can it be taught some new tricks, etc.?
[Also, I can see why it makes sense to use this feature only with sendfile;
it's pretty much useless for interactive apps.]

Troy, I am not interested in the netstat -s data but rather in the TCP
stats this NIC has exposed, unless those somehow show up magically in
netstat.

There are no retransmits happening; the card does not analyze
activity on the TCP connection to retransmit things itself.
It's just a simple header templating facility.

Read my other emails about where the benefits come from.

In fact, when the connection is sick (i.e. retransmits and SACKs occur),
we disable TSO completely for that socket.
David S. Miller
2002-09-06 03:47:21 UTC
From: jamal <***@cyberus.ca>
Date: Thu, 5 Sep 2002 16:59:47 -0400 (EDT)

I would think shoving the data down to the NIC
and avoiding the fragmentation shouldn't give you that much significant
CPU savings.

It's the DMA bandwidth saved; most of the specweb runs on x86 hardware
are limited by the DMA throughput of the PCI host controller. In
particular, some controllers are limited to smaller DMA bursts to
work around hardware bugs.

I.e., the headers that don't need to go across the bus are the critical
resource saved by TSO.

I think I've said this a million times, perhaps the next person who
tries to figure out where the gains come from can just reply with
a pointer to a URL of this email I'm typing right now :-)
Martin J. Bligh
2002-09-06 06:48:42 UTC
Post by jamal
I would think shoving the data down to the NIC
and avoiding the fragmentation shouldn't give you that much significant
CPU savings.
It's the DMA bandwidth saved; most of the specweb runs on x86 hardware
are limited by the DMA throughput of the PCI host controller. In
particular, some controllers are limited to smaller DMA bursts to
work around hardware bugs.
I.e., the headers that don't need to go across the bus are the critical
resource saved by TSO.
I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there are 4-8 gigabit ethernet cards in this beast spread around
different buses (Troy, are we still just using 4? And what's
the raw bandwidth of data we're pushing? It's not huge).

I think we're CPU limited (there's no idle time on this machine),
which is odd for an 8 CPU 900MHz P3 Xeon, but still, this is Apache,
not Tux. You mentioned CPU load as another advantage of TSO ...
anything we've done to reduce CPU load enables us to run more and
more connections (I think we started at about 260 or something, so
2900 ain't too bad ;-)).

Just to throw another firework into the fire whilst people are
awake: NAPI does not seem to scale to this sort of load, which
was disappointing, as we were hoping it would solve some of
our interrupt load problems. It seems that half the machine goes
idle, the number of simultaneous connections drops way down, and
everything's blocked on ... something ... not sure what ;-)
Any guesses at why, or ways to debug this?

M.

PS. Anyone else running NAPI on SMP? (ideally at least 4-way?)
David S. Miller
2002-09-06 06:51:59 UTC
From: "Martin J. Bligh" <***@us.ibm.com>
Date: Thu, 05 Sep 2002 23:48:42 -0700

Just to throw another firework into the fire whilst people are
awake, NAPI does not seem to scale to this sort of load, which
was disappointing, as we were hoping it would solve some of
our interrupt load problems ...

Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?

NAPI is also not the panacea to all problems in the world.

I bet your greatest gain would be obtained from going to Tux
and using appropriate IRQ affinity settings, making sure
Tux threads bind to the same CPU as the device where they accept
connections.

It is the standard method to obtain peak specweb performance.
Andrew Morton
2002-09-06 07:36:04 UTC
Post by David S. Miller
...
NAPI is also not the panacea to all problems in the world.
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.

http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
David S. Miller
2002-09-06 07:22:53 UTC
From: Andrew Morton <***@zip.com.au>
Date: Fri, 06 Sep 2002 00:36:04 -0700
Post by David S. Miller
NAPI is also not the panacea to all problems in the world.
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.

http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm

Unfortunately it is not listed what e1000 and core NAPI
patch was used. Also not listed are the RX/TX mitigation
and ring sizes given to the kernel module upon loading.

Robert can comment on optimal settings; Robert and Jamal can
make a more detailed analysis of Mala's graphs than I.
jamal
2002-09-06 09:54:09 UTC
Post by Andrew Morton
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.
http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
Unfortunately it is not listed what e1000 and core NAPI
patch was used. Also not listed are the RX/TX mitigation
and ring sizes given to the kernel module upon loading.
Robert can comment on optimal settings; Robert and Jamal can
make a more detailed analysis of Mala's graphs than I.
I looked at those graphs, but the lack of information makes them useless.
For example, there are too many variables to the tests: what is the
effect of the message size? And then look at the socket buffer size:
would you set it to 64K if you are trying to show performance numbers?
What other TCP settings are there?

Manfred Spraul complained about a year back about some performance issues
in low-load setups (which is what this IBM setup seems to be if you count
the pps to the server); it's one of those things that have been low in
the TODO deck.

The issue may be legit, not because NAPI is bad but because it is too good.
I don't have the e1000, but I have some D-Link gige cards still in boxes
and a two-CPU SMP machine; I'll set up the testing this weekend.
In the case of Manfred, we couldn't reproduce the tests because he had this
odd, weird NIC; in this case at least access to the e1000 doesn't require
a visit to the museum.

cheers,
jamal
Martin J. Bligh
2002-09-06 14:29:21 UTC
Post by David S. Miller
Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
NAPI is also not the panacea to all problems in the world.
No, but I didn't expect throughput to drop by 40% or so either,
which is (very roughly) what happened. Interrupts are a pain to
manage and do affinity with, so NAPI should (at least in theory)
be better for this kind of setup ... I think.
Post by David S. Miller
I bet your greatest gain would be obtained from going to Tux
and using appropriate IRQ affinity settings and making sure
Tux threads bind to same cpu as device where they accept
connections.
It is standard method to obtain peak specweb performance.
Ah, but that's not really our goal - what we're trying to do is
use specweb as a tool to simulate a semi-realistic customer
workload, and improve the Linux kernel performance, using that
as our yardstick for measuring ourselves. For that I like the
setup we have reasonably well, even though it won't get us the
best numbers.

To get the best benchmark numbers, you're absolutely right though.

M.
Dave Hansen
2002-09-06 15:38:30 UTC
Post by Martin J. Bligh
Post by David S. Miller
Stupid question, are you sure you have CONFIG_E1000_NAPI enabled?
NAPI is also not the panacea to all problems in the world.
No, but I didn't expect throughput to drop by 40% or so either,
which is (very roughly) what happened. Interrupts are a pain to
manage and do affinity with, so NAPI should (at least in theory)
be better for this kind of setup ... I think.
No, no. Bad Martin! Throughput didn't drop, "Specweb compliance" dropped.
Those are two very, very different things. I've found that the server
can produce a lot more throughput, although it doesn't have the
characteristics that Specweb considers compliant. Just have Troy enable
mod-status and look at the throughput that Apache tells you that it is
giving during a run. _That_ is real throughput, not number of compliant
connections.

_And_ NAPI is for receive only, right? Also, my compliance drop occurs
with the NAPI checkbox disabled. There is something else in the new driver
that causes our problems.
--
Dave Hansen
***@us.ibm.com
Martin J. Bligh
2002-09-06 16:11:05 UTC
Post by Dave Hansen
No, no. Bad Martin! Throughput didn't drop, "Specweb compliance"
dropped. Those are two very, very different things. I've found
that the server can produce a lot more throughput, although it
doesn't have the characteristics that Specweb considers compliant.
Just have Troy enable mod-status and look at the throughput that
Apache tells you that it is giving during a run. _That_ is real
throughput, not number of compliant connections.
By throughput I meant number of compliant connections, not bandwidth.
It may well be latency that's going out the window, rather than
bandwidth. Yes, I should use more precise terms ...
Post by Dave Hansen
_And_ NAPI is for receive only, right? Also, my compliance drop
occurs with the NAPI checkbox disabled. There is something else
in the new driver that causes our problems.
Not sure about that - I was told once that there were transmission
completion interrupts as well? What happens to those? Or am I
confused again ...

M.
Nivedita Singhvi
2002-09-06 16:21:23 UTC
Post by Dave Hansen
No, no. Bad Martin! Throughput didn't drop, "Specweb compliance"
dropped. Those are two very, very different things. I've found that
the server can produce a lot more throughput, although it doesn't
have the characteristics that Specweb considers compliant.
Just have Troy enable mod-status and look at the throughput that
Apache tells you that it is giving during a run.
_That_ is real throughput, not number of compliant connections.
_And_ NAPI is for receive only, right? Also, my compliance drop
occurs with the NAPI checkbox disabled. There is something else in
the new driver that causes our problems.
Thanks, Dave, you saved me a bunch of typing...

Just looking at a networking benchmark result is worse than
useless. You really need to look at the stats, settings,
and the profiles. eg, for most of the networking stuff:

ifconfig -a
netstat -s
netstat -rn
/proc/sys/net/ipv4/
/proc/sys/net/core/

before and after the run.

Dave, although in your setup the clients are maxed out,
I'm not sure that's the case for Mala's and Troy's clients
(don't know, of course). But I'm fairly sure they aren't
using single quad NUMAs, and they may not be seeing the same
effects.

thanks,
Nivedita
Dave Hansen
2002-09-06 15:29:01 UTC
Permalink
Post by Martin J. Bligh
Just to throw another firework into the fire whilst people are
awake, NAPI does not seem to scale to this sort of load, which
was disappointing, as we were hoping it would solve some of
our interrupt load problems ... seems that half the machine goes
idle, the number of simultaneous connections drops way down, and
everything's blocked on ... something ... not sure what ;-)
Any guesses at why, or ways to debug this?
I thought that I already tried to explain this to you. (although it could
have been on one of those too-much-coffee-days :)

Something strange happens to the clients when NAPI is enabled on the
Specweb clients. Somehow they start using a lot more CPU. The increased
idle time on the server is because the _clients_ are CPU maxed. I have
some preliminary oprofile data for the clients, but it appears that this is
another case of Specweb code just really sucking.

The real question is why NAPI causes so much more work for the client. I'm
not convinced that it is much, much greater, because I believe that I was
already at the edge of the cliff with my clients and NAPI just gave them a
little shove :). Specweb also takes a while to ramp up (even during the
real run), so sometimes it takes a few minutes to see the clients get
saturated.
--
Dave Hansen
***@us.ibm.com
Martin J. Bligh
2002-09-06 16:29:50 UTC
Permalink
Post by Dave Hansen
I thought that I already tried to explain this to you. (although
it could have been on one of those too-much-coffee-days :)
You told me, but I'm far from convinced this is the problem. I think
it's more likely this is a side-effect of a server issue - something
like a lot of dropped packets and retransmits, though not necessarily
that.
Post by Dave Hansen
Something strange happens to the clients when NAPI is enabled on
the Specweb clients. Somehow they start using a lot more CPU.
The increased idle time on the server is because the _clients_ are
CPU maxed. I have some preliminary oprofile data for the clients,
but it appears that this is another case of Specweb code just
really sucking.
Hmmm ... if you change something on the server, and all the clients
go wild, I'm suspicious of whatever you did to the server. You need
to have a lot more data before leaping to the conclusion that it's
because the specweb client code is crap.

Troy - I think your UP clients weren't anywhere near maxed out on
CPU power, right? Can you take a peek at the clients under NAPI load?

Dave - did you ever try running 4 specweb clients bound to each of
the 4 CPUs in an attempt to make the clients scale better? I'm
suspicious that you're maxing out 4 4-way machines, and Troy's
16 UPs are cruising along just fine.

M.
Dave Hansen
2002-09-06 17:36:37 UTC
Permalink
Post by Martin J. Bligh
Post by Dave Hansen
Something strange happens to the clients when NAPI is enabled on
the Specweb clients. Somehow they start using a lot more CPU.
The increased idle time on the server is because the _clients_ are
CPU maxed. I have some preliminary oprofile data for the clients,
but it appears that this is another case of Specweb code just
really sucking.
Hmmm ... if you change something on the server, and all the clients
go wild, I'm suspicious of whatever you did to the server.
Me too :) All that was changed was adding the new e1000 driver. NAPI was
disabled.
Post by Martin J. Bligh
You need
to have a lot more data before leaping to the conclusion that it's
because the specweb client code is crap.
I'll let the profile speak for itself...

oprofile summary:op_time -d

1 0.0000 0.0000 /bin/sleep
2 0.0001 0.0000 /lib/ld-2.2.5.so.dpkg-new (deleted)
2 0.0001 0.0000 /lib/libpthread-0.9.so
2 0.0001 0.0000 /usr/bin/expr
3 0.0001 0.0000 /sbin/init
4 0.0001 0.0000 /lib/libproc.so.2.0.7
12 0.0004 0.0000 /lib/libc-2.2.5.so.dpkg-new (deleted)
17 0.0005 0.0000 /usr/lib/libcrypto.so.0.9.6.dpkg-new (deleted)
20 0.0006 0.0000 /bin/bash
30 0.0010 0.0000 /usr/sbin/sshd
151 0.0048 0.0000 /usr/bin/vmstat
169 0.0054 0.0000 /lib/ld-2.2.5.so
300 0.0095 0.0000 /lib/modules/2.4.18+O1/oprofile/oprofile.o
1115 0.0354 0.0000 /usr/local/bin/oprofiled
3738 0.1186 0.0000 /lib/libnss_files-2.2.5.so
58181 1.8458 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
249186 7.9056 0.0000 /home/dave/specweb99/build/client
582281 18.4733 0.0000 /lib/libc-2.2.5.so
2256792 71.5986 0.0000 /usr/src/linux/vmlinux

top of oprofile from the client:
08051b3c 2260 0.948938 check_for_timeliness
08051cfc 2716 1.14041 ascii_cat
08050f24 4547 1.90921 HTTPGetReply
0804f138 4682 1.9659 workload_op
08050890 6111 2.56591 HTTPDoConnect
08049a30 7330 3.07775 SHMmalloc
08052244 7433 3.121 HTParse
08052628 8482 3.56146 HTSACopy
08051d88 10288 4.31977 get_some_line
08052150 13070 5.48788 scan
08051a10 65314 27.4243 assign_port_number
0804bd30 83789 35.1817 LOG
#define LOG(x) do {} while(0)
Voila! 35% more CPU!

Top of Kernel profile:
c022c850 33085 1.46602 number
c0106e59 42693 1.89176 restore_all
c01dfe68 42787 1.89592 sys_socketcall
c01df39c 54185 2.40097 sys_bind
c01de698 62740 2.78005 sockfd_lookup
c01372c8 97886 4.3374 fput
c022c110 125306 5.55239 __generic_copy_to_user
c01373b0 181922 8.06109 fget
c020958c 199054 8.82022 tcp_v4_get_port
c0106e10 199934 8.85921 system_call
c022c158 214014 9.48311 __generic_copy_from_user
c0216ecc 257768 11.4219 inet_bind

"oprofpp -k -dl -i /lib/libc-2.2.5.so"
just gives:
vma samples %-age symbol name linenr info image name
00000000 582281 100 (no symbol) (no location information)
/lib/libc-2.2.5.so

I've never really tried to profile anything but the kernel before. Any ideas?
Post by Martin J. Bligh
Troy - I think your UP clients weren't anywhere near maxed out on
CPU power, right? Can you take a peek at the clients under NAPI load?
Make sure you wait a minute or two. The client tends to ramp up.

"vmstat 2" after the client has told the master that it is running:
us sy id
----------
4 15 81
5 17 79
7 16 77
7 17 76
7 21 72
11 25 64
3 16 82
2 14 84
7 23 70
16 50 34
24 75 0
27 73 0
28 72 0
24 76 0
...
Post by Martin J. Bligh
Dave - did you ever try running 4 specweb clients bound to each of
the 4 CPUs in an attempt to make the clients scale better? I'm
suspicious that you're maxing out 4 4-way machines, and Troy's
16 UPs are cruising along just fine.
No, but I'm not sure it will do any good. They don't run often enough and
I have the feeling that there are very few cache locality benefits to be had.
--
Dave Hansen
***@us.ibm.com
Andi Kleen
2002-09-06 18:26:46 UTC
Permalink
Post by Dave Hansen
c0106e59 42693 1.89176 restore_all
c01dfe68 42787 1.89592 sys_socketcall
c01df39c 54185 2.40097 sys_bind
c01de698 62740 2.78005 sockfd_lookup
c01372c8 97886 4.3374 fput
c022c110 125306 5.55239 __generic_copy_to_user
c01373b0 181922 8.06109 fget
c020958c 199054 8.82022 tcp_v4_get_port
c0106e10 199934 8.85921 system_call
c022c158 214014 9.48311 __generic_copy_from_user
c0216ecc 257768 11.4219 inet_bind
The profile looks bogus. The NIC driver is nowhere in sight. Normally
its mmap IO for interrupts and device registers should show. I would
double check it (e.g. with normal profile)

In case it is not bogus:
Most of these are either atomic_inc/dec of reference counters or some
form of lock. The system_call could be the int 0x80 (using the SYSENTER
patches would help), which also does atomic operations implicitly.
restore_all is IRET, which could likewise be sped up by using SYSEXIT.

If NAPI hurts here then it is surely not because of eating CPU time.

-Andi
John Levon
2002-09-06 18:31:57 UTC
Permalink
Post by Andi Kleen
Post by Dave Hansen
c0216ecc 257768 11.4219 inet_bind
The profile looks bogus. The NIC driver is nowhere in sight. Normally
its mmap IO for interrupts and device registers should show. I would
double check it (e.g. with normal profile)
The system summary shows :

58181 1.8458 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o

so it won't show up in the monolithic kernel profile. You can probably
get a combined comparison with

op_time -dnl | grep -E 'vmlinux|acenic'

regards
john
--
"Are you willing to go out there and save the lives of our children, even if it means losing your own life ?
Yes I am.
I believe you, Jeru... you're ready."
Dave Hansen
2002-09-06 18:33:10 UTC
Permalink
Post by Andi Kleen
Post by Dave Hansen
c0106e59 42693 1.89176 restore_all
c01dfe68 42787 1.89592 sys_socketcall
c01df39c 54185 2.40097 sys_bind
c01de698 62740 2.78005 sockfd_lookup
c01372c8 97886 4.3374 fput
c022c110 125306 5.55239 __generic_copy_to_user
c01373b0 181922 8.06109 fget
c020958c 199054 8.82022 tcp_v4_get_port
c0106e10 199934 8.85921 system_call
c022c158 214014 9.48311 __generic_copy_from_user
c0216ecc 257768 11.4219 inet_bind
The profile looks bogus. The NIC driver is nowhere in sight. Normally
its mmap IO for interrupts and device registers should show. I would
double check it (e.g. with normal profile)
Actually, oprofile separated out the acenic module from the rest of the
kernel. I should have included that breakout as well, but it was only 1.3%
of CPU:
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
--
Dave Hansen
***@us.ibm.com
David S. Miller
2002-09-06 18:36:52 UTC
Permalink
From: Dave Hansen <***@us.ibm.com>
Date: Fri, 06 Sep 2002 11:33:10 -0700

Actually, oprofile separated out the acenic module from the rest of the
kernel. I should have included that breakout as well, but it was only 1.3%
of CPU:
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o

We thought you were using e1000 in these tests?
Martin J. Bligh
2002-09-06 18:45:17 UTC
Permalink
Post by Dave Hansen
Actually, oprofile separated out the acenic module from the rest of the
kernel. I should have included that breakout as well. but it was only 1.3
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
We thought you were using e1000 in these tests?
e1000 on the server, those profiles were client side.

M.
David S. Miller
2002-09-06 18:43:36 UTC
Permalink
From: "Martin J. Bligh" <***@us.ibm.com>
Date: Fri, 06 Sep 2002 11:45:17 -0700
Post by Dave Hansen
Actually, oprofile separated out the acenic module from the rest of the
kernel. I should have included that breakout as well. but it was only 1.3
1.3801 0.0000 /lib/modules/2.4.18+O1/kernel/drivers/net/acenic.o
We thought you were using e1000 in these tests?
e1000 on the server, those profiles were client side.

Ok. BTW acenic is packet rate limited by the speed of the
MIPS cpus on the card.

It might be instructive to disable HW checksumming in the
acenic driver and see what this does to your results.
Nivedita Singhvi
2002-09-06 19:19:14 UTC
Permalink
Post by Andi Kleen
Post by Dave Hansen
c0106e59 42693 1.89176 restore_all
c01dfe68 42787 1.89592 sys_socketcall
c01df39c 54185 2.40097 sys_bind
c01de698 62740 2.78005 sockfd_lookup
c01372c8 97886 4.3374 fput
c022c110 125306 5.55239 __generic_copy_to_user
c01373b0 181922 8.06109 fget
c020958c 199054 8.82022 tcp_v4_get_port
c0106e10 199934 8.85921 system_call
c022c158 214014 9.48311 __generic_copy_from_user
c0216ecc 257768 11.4219 inet_bind
The profile looks bogus. The NIC driver is nowhere in sight.
Normally its mmap IO for interrupts and device registers
should show. I would double check it (e.g. with normal profile)
Separately compiled acenic..

I'm surprised by this profile a bit too - on the client side,
since the requests are small, and the client is receiving
all those files, I would have thought that __generic_copy_to_user
would have been way higher than *from_user.

inet_bind() and tcp_v4_get_port() are up there because
we have to grab the socket lock, the tcp_portalloc_lock,
then the head chain lock and traverse the hash table
which has now many hundred entries. Also, because
of the varied length of the connections, the clients
get freed not in the same order they are allocated
a port, hence the fragmentation of the port space..
There is some cacheline thrashing hurting the NUMA
systems more than others here too..

If you just wanted to speed things up, you could get the
clients to specify ports instead of letting the kernel
cycle through for a free port..:)

thanks,
Nivedita
Post by Andi Kleen
Most of these are either atomic_inc/dec of reference counters or
some form of lock. The system_call could be the int 0x80 (using the
SYSENTER patches would help), which also does atomic operations
implicitely. restore_all is IRET, could also likely be speed up by
using SYSEXIT.
If NAPI hurts here then it surely not because of eating CPU time.
-Andi
David S. Miller
2002-09-06 19:24:05 UTC
Permalink
From: Andi Kleen <***@suse.de>
Date: Fri, 6 Sep 2002 21:26:19 +0200

I'm not entirely sure it is worth it in this case. The locks are
probably the majority of the cost.

You can more localize the lock accesses (since we use per-chain
locks) by applying a cpu salt to the port numbers you allocate.

See my other email.
Martin J. Bligh
2002-09-06 19:45:39 UTC
Permalink
Post by Nivedita Singhvi
There is some cacheline thrashing hurting the NUMA
systems more than others here too..
There is no NUMA here ... the clients are 4 single node SMP
systems. We're using the old quads to make them, but they're
all split up, not linked together into one system.
Sorry if we didn't make that clear.

M.
Andi Kleen
2002-09-06 19:26:19 UTC
Permalink
Post by Nivedita Singhvi
If you just wanted to speed things up, you could get the
clients to specify ports instead of letting the kernel
cycle through for a free port..:)
It would probably be better to change the kernel to keep a limited
list of free ports in a free list. Then grabbing a free port would
be an O(1) operation.

I'm not entirely sure it is worth it in this case. The locks are
probably the majority of the cost.

-Andi
David S. Miller
2002-09-06 19:21:18 UTC
Permalink
From: Nivedita Singhvi <***@us.ibm.com>
Date: Fri, 6 Sep 2002 12:19:14 -0700

inet_bind() and tcp_v4_get_port() are up there because
we have to grab the socket lock, the tcp_portalloc_lock,
then the head chain lock and traverse the hash table
which has now many hundred entries. Also, because
of the varied length of the connections, the clients
get freed not in the same order they are allocated
a port, hence the fragmentation of the port space..
There is some cacheline thrashing hurting the NUMA
systems more than others here too..

There are methods to eliminate the centrality of the
port allocation locking.

Basically, kill tcp_portalloc_lock and make the port rover be per-cpu.

The only tricky case is the "out of ports" situation. Because there
is no centralized locking being used to serialize port allocation,
it is difficult to be sure that the port space is truly exhausted.

Another idea, which doesn't eliminate the tcp_portalloc_lock but
has other good SMP properties, is to apply a "cpu salt" to the
port rover value. For example, shift the local cpu number into
the upper parts of a 'u16', then 'xor' that with tcp_port_rover.

Alexey and I have discussed this several times but never became
bored enough to experiment :-)
Nivedita Singhvi
2002-09-06 19:45:00 UTC
Permalink
Post by David S. Miller
There are methods to eliminate the centrality of the
port allocation locking.
Basically, kill tcp_portalloc_lock and make the port rover be
per-cpu.
Aha! Exactly what I started to do quite a while ago..
Post by David S. Miller
The only tricky case is the "out of ports" situation. Because
there is no centralized locking being used to serialize port
allocation, it is difficult to be sure that the port space is truly
exhausted.
I decided to use a stupid global flag to signal this.. It did become
messy and I didn't finalize everything. Then my day job
intervened :). Still hoping for spare time*5 to complete
this if no one comes up with something before then..
Post by David S. Miller
Another idea, which doesn't eliminate the tcp_portalloc_lock but
has other good SMP properties, is to apply a "cpu salt" to the
port rover value. For example, shift the local cpu number into
the upper parts of a 'u16', then 'xor' that with tcp_port_rover.
nice..any patch extant? :)

thanks,
Nivedita
Gerrit Huizenga
2002-09-06 17:26:04 UTC
Permalink
Post by Martin J. Bligh
Post by jamal
I would think shoving the data down the NIC
and avoid the fragmentation shouldnt give you that much significant
CPU savings.
It's the DMA bandwidth saved; most of the specweb runs on x86 hardware
are limited by the DMA throughput of the PCI host controller. In
particular some controllers are limited to smaller DMA bursts to
work around hardware bugs.
I think we're CPU limited (there's no idle time on this machine),
which is odd for an 8 CPU 900MHz P3 Xeon, but still, this is Apache,
not Tux. You mentioned CPU load as another advantage of TSO ...
anything we've done to reduce CPU load enables us to run more and
more connections (I think we started at about 260 or something, so
2900 ain't too bad ;-)).
Troy, is there any chance you could post an oprofile from any sort
of reasonably conformant run? I think that might help enlighten
people a bit as to what we are fighting with. The last numbers I
remember seemed to indicate that we were spending about 1.25 CPUs
in network/e1000 code with 100% CPU utilization and crappy SpecWeb
throughput.

One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines. For some reasons, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time. High ideals, I know, wanting all features *and* performance
from the same tool... Next thing you know they'll want reliability
or some such thing.

gerrit
David S. Miller
2002-09-06 17:37:17 UTC
Permalink
From: Gerrit Huizenga <***@us.ibm.com>
Date: Fri, 06 Sep 2002 10:26:04 -0700

One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines. For some reasons, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time. High ideals, I know, wanting all features *and* performance
from the same tool... Next thing you know they'll want reliability
or some such thing.

Why does Tux keep you from taking advantage of all the
features of Apache? Anything Tux doesn't handle in its
fast path is simply fed up to Apache.
Gerrit Huizenga
2002-09-06 18:19:11 UTC
Permalink
Post by David S. Miller
Date: Fri, 06 Sep 2002 10:26:04 -0700
One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines. For some reasons, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time. High ideals, I know, wanting all features *and* performance
from the same tool... Next thing you know they'll want reliability
or some such thing.
Why does Tux keep you from taking advantage of all the
features of Apache? Anything Tux doesn't handle in its
fast path is simply fed up to Apache.
You have to ask the hard questions... Some of this is rooted in
the past when Tux was emerging as a technology rather ubiquitously
available. And, combined with the fact that most customers tend to
lag the technology curve, Apache 1.X or, in our case, IBM HTTPD was
simply a customer drop-in with standard configuration support that
roughly matched that on all other platforms, e.g. AIX, Solaris, HPUX,
Linux, etc. So, doing a one-off for Linux at a very heterogeneous
large customer adds pain, and that pain becomes cost for the customer in
terms of consulting, training, sys admin, system management, etc.

We also had some bad starts with using Tux in terms of performance
and scalability on 4-CPU and 8-CPU machines, especially when combining
with things like squid or other caching products from various third
parties.

Then there is the problem that 90%+ of our customers seem to have
dynamic-only web servers. Static content is limited to a couple of
banners and images that need to be tied into some kind of caching
content server. So, Tux's benefits for static serving turned out to
be only additional overhead because there were no static pages to be
served up.

And, honestly, I'm a kernel guy much more than an applications guy, so
I'll admit that I'm not up to speed on what Tux2 can do with dynamic
content. The last I knew was that it could pass it off to another server.
So we are focused on making the most common case for our customer situations
scale well. As you are probably aware, there are no specweb results
posted using Apache, but web crawler stats suggest that Apache is the
most common server. The problem is that performance on Apache sucks
but people like the features. Hence we are working to make Apache
suck less, and finding that part of the problem is the way it uses the
kernel. Other parts are the interface for specweb in particular which
we have done a bunch of work on with Greg Ames. And we are feeding
data back to the Apache 2.0 team which should help Apache in general.

gerrit
Martin J. Bligh
2002-09-06 18:26:49 UTC
Permalink
Post by Gerrit Huizenga
Post by Gerrit Huizenga
One of our goals is to actually take the next generation of the most
common "large system" web server and get it to scale along the lines
of Tux or some of the other servers which are more common on the
small machines. For some reasons, big corporate customers want lots
of features that are in a web server like apache and would also like
the performance on their 8-CPU or 16-CPU machine to not suck at the
same time. High ideals, I know, wanting all features *and* performance
from the same tool... Next thing you know they'll want reliability
or some such thing.
Why does Tux keep you from taking advantage of all the
features of Apache? Anything Tux doesn't handle in its
fast path is simply fed up to Apache.
You have to ask the hard questions...
Ultimately, to me at least, the server doesn't really matter, and
neither do the absolute benchmark numbers. Linux should scale under
any reasonable workload. The point of this is to look at the Linux
kernel, not the webserver, or specweb ... they're just hammers to
beat on the kernel with.

The fact that we're doing something different from everyone else
and turning up a different set of kernel issues is a good thing,
to my mind. You're right, we could use Tux if we wanted to ... but
that doesn't stop Apache being interesting ;-)

M.
David S. Miller
2002-09-06 18:36:11 UTC
Permalink
From: "Martin J. Bligh" <***@us.ibm.com>
Date: Fri, 06 Sep 2002 11:26:49 -0700

The fact that we're doing something different from everyone else
and turning up a different set of kernel issues is a good thing,
to my mind. You're right, we could use Tux if we wanted to ... but
that doesn't stop Apache being interesting ;-)

Tux does not obviate Apache from the equation.
See my other emails.
Martin J. Bligh
2002-09-06 18:51:29 UTC
Permalink
Post by Martin J. Bligh
The fact that we're doing something different from everyone else
and turning up a different set of kernel issues is a good thing,
to my mind. You're right, we could use Tux if we wanted to ... but
that doesn't stop Apache being interesting ;-)
Tux does not obviate Apache from the equation.
See my other emails.
That's not the point ... we're getting sidetracked here. The
point is: "is this a realistic-ish stick to beat the kernel
with and expect it to behave" ... I feel the answer is yes.

The secondary point is "what are customers doing in the field?"
(not what *should* they be doing ;-)). Moreover, I think the
Apache + Tux combination has been fairly well beaten on already
by other people in the past, though I'm sure it could be done
again.

I see no reason why turning on NAPI should make the Apache setup
we have perform worse ... quite the opposite. Yes, we could use
Tux, yes we'd get better results. But that's not the point ;-)

M.
David S. Miller
2002-09-06 18:48:15 UTC
Permalink
From: "Martin J. Bligh" <***@us.ibm.com>
Date: Fri, 06 Sep 2002 11:51:29 -0700

I see no reason why turning on NAPI should make the Apache setup
we have perform worse ... quite the opposite. Yes, we could use
Tux, yes we'd get better results. But that's not the point ;-)

Of course.

I just don't want propaganda being spread that using Tux means you
lose any sort of web server functionality whatsoever.
Gerrit Huizenga
2002-09-06 19:05:27 UTC
Permalink
Post by David S. Miller
Date: Fri, 06 Sep 2002 11:51:29 -0700
I see no reason why turning on NAPI should make the Apache setup
we have perform worse ... quite the opposite. Yes, we could use
Tux, yes we'd get better results. But that's not the point ;-)
Of course.
I just don't want propaganda being spread that using Tux means you
lose any sort of web server functionality whatsoever.
Ah sorry - I never meant to imply that Tux was detrimental, other
than one case where it seemed to have no benefit and the performance
numbers while tuning for TPC-W *seemed* worse but were never analyzed
completely. That was the actual event that I meant when I said:

We also had some bad starts with using Tux in terms of performance
and scalability on 4-CPU and 8-CPU machines, especially when
combining with things like squid or other caching products from
various third parties.

Those results were never quantified but for various reasons we had a
team that decided to take Tux out of the picture. I think the problem
was more likely lack of knowledge and lack of time to do analysis on
the particular problems. Another combination of solutions was used.

So, any comments I made which might have implied that Tux/Tux2 made things
worse have no substantiated data to prove that and it is quite possible
that there is no such problem. Also, this was run nearly a year ago and
the state of Tux/Tux2 might have been a bit different at the time.

gerrit
David S. Miller
2002-09-06 19:01:04 UTC
Permalink
From: Gerrit Huizenga <***@us.ibm.com>
Date: Fri, 06 Sep 2002 12:05:27 -0700

So, any comments I made which might have implied that Tux/Tux2 made things
worse have no substantiated data to prove that and it is quite possible
that there is no such problem. Also, this was run nearly a year ago and
the state of Tux/Tux2 might have been a bit different at the time.

Thanks for clearing things up.
Alan Cox
2002-09-06 20:29:25 UTC
Permalink
Post by Martin J. Bligh
The secondary point is "what are customers doing in the field?"
(not what *should* they be doing ;-)). Moreover, I think the
Apache + Tux combination has been fairly well beaten on already
by other people in the past, though I'm sure it could be done
again.
Tux has been proven in the field. A glance at some of the interesting
porn domain names using it would show that 8)
David S. Miller
2002-09-06 18:34:48 UTC
Permalink
From: Gerrit Huizenga <***@us.ibm.com>
Date: Fri, 06 Sep 2002 11:19:11 -0700

And, honestly, I'm a kernel guy much more than an applications guy, so
I'll admit that I'm not up to speed on what Tux2 can do with dynamic
content.

TUX can optimize dynamic content just fine.

The last I knew was that it could pass it off to another server.

Not true.

The problem is that performance on Apache sucks
but people like the features.

Tux's design allows it to be a drop in acceleration method
which does not require you to relinquish Apache's feature set.
Gerrit Huizenga
2002-09-06 18:57:39 UTC
Permalink
Post by David S. Miller
Date: Fri, 06 Sep 2002 11:19:11 -0700
TUX can optimize dynamic content just fine.
The last I knew was that it could pass it off to another server.
Out of curiosity, and primarily for my own edification, what kind
of optimization does it do when everything is generated by a java/
perl/python/homebrew script and pasted together by something which
consults a content manager? In a few of the cases that I know of,
there isn't really any static content to cache... And why is this
something that Apache couldn't/shouldn't be doing?

gerrit
David S. Miller
2002-09-06 18:58:04 UTC
Permalink
From: Gerrit Huizenga <***@us.ibm.com>
Date: Fri, 06 Sep 2002 11:57:39 -0700

Out of curiosity, and primarily for my own edification, what kind
of optimization does it do when everything is generated by a java/
perl/python/homebrew script and pasted together by something which
consults a content manager. In a few of the cases that I know of,
there isn't really any static content to cache... And why is this
something that Apache couldn't/shouldn't be doing?

The kernel exec's the CGI process from the TUX server and pipes the
output directly into a networking socket.

Because it is cheaper to create a new fresh user thread from within
the kernel (i.e. we don't have to fork() apache and thus dup its
address space), it is faster.
David S. Miller
2002-09-06 19:49:36 UTC
Permalink
From: Gerrit Huizenga <***@us.ibm.com>
Date: Fri, 06 Sep 2002 12:52:15 -0700

So if apache were using a listen()/clone()/accept()/exec() combo rather than a
full listen()/fork()/exec() model it would see most of the same benefits?

Apache would need to do some more, such as do something about
cpu affinity and do the non-blocking VFS tricks Tux does too.

To be honest, I'm not going to sit here all day long and explain how
Tux works. I'm not even too knowledgeable about the precise details of
its implementation. Besides, the code is freely available and not
too complex, so you can go have a look for yourself :-)
Gerrit Huizenga
2002-09-06 20:03:42 UTC
Permalink
Post by David S. Miller
Date: Fri, 06 Sep 2002 12:52:15 -0700
So if apache were using a listen()/clone()/accept()/exec() combo rather than a
full listen()/fork()/exec() model it would see most of the same benefits?
Apache would need to do some more, such as do something about
cpu affinity and do the non-blocking VFS tricks Tux does too.
To be honest, I'm not going to sit here all day long and explain how
Tux works. I'm not even too knowledgeable about the precise details of
its implementation. Besides, the code is freely available and not
too complex, so you can go have a look for yourself :-)
Aw, and you are such a good tutor, too. :-) But thanks - my particular
goal isn't to fix apache since there is already a group of folks working
on that, but as we look at kernel traces, this should give us a good
idea if we are at the bottleneck of the apache architecture or if we
have other kernel bottlenecks. At the moment, the latter seems to be
true, and I think we have some good data from Troy and Dave to validate
that. I think we have already seen the affinity problem or at least
talked about it as that was somewhat visible and Apache 2.0 does seem
to have some solutions for helping with that. And when the kernel does
the best it can with Apache's architecture, we have more data to convince
them to fix the architecture problems.

thanks again!

gerrit
Gerrit Huizenga
2002-09-06 19:52:15 UTC
Permalink
Post by David S. Miller
Date: Fri, 06 Sep 2002 11:57:39 -0700
Out of curiosity, and primarily for my own edification, what kind
of optimization does it do when everything is generated by a java/
perl/python/homebrew script and pasted together by something which
consults a content manager. In a few of the cases that I know of,
there isn't really any static content to cache... And why is this
something that Apache couldn't/shouldn't be doing?
The kernel exec's the CGI process from the TUX server and pipes the
output directly into a networking socket.
Because it is cheaper to create a new fresh user thread from within
the kernel (ie. we don't have to fork() apache and thus dup its
address space), it is faster.
So if apache were using a listen()/clone()/accept()/exec() combo rather than a
full listen()/fork()/exec() model it would see most of the same benefits?
Some additional overhead for the user/kernel syscall path but probably
pretty minor, right?

Or did I miss a piece of data, like the time to call clone() as a function
from in kernel is 2x or 10x more than the same syscall?

gerrit
Troy Wilson
2002-09-06 23:48:39 UTC
Permalink
Post by Martin J. Bligh
Post by David S. Miller
It's the DMA bandwidth saved; most of the specweb runs on x86 hardware
are limited by the DMA throughput of the PCI host controller. In
particular some controllers are limited to smaller DMA bursts to
work around hardware bugs.
I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there's 4 - 8 gigabit ethernet cards in this beast spread around
different buses (Troy - are we still just using 4?
My machine is not exactly an 8500r. It's an Intel pre-release
engineering sample (8-way 900MHz PIII) box that is similar to an
8500r... there are some differences when going across the coherency
filter (the bus that ties the two 4-way "halves" of the machine
together). Bill Hartner has a test program that illustrates the
differences -- but more on that later.

I've got 4 PCI busses, two 33 MHz and two 66 MHz, all 64-bit.
I'm configured as follows:

PCI Bus 0 eth1 --- 3 clients
33 MHz eth2 --- Not in use


PCI Bus 1 eth3 --- 2 clients
33 MHz eth4 --- Not in use


PCI Bus 3 eth5 --- 6 clients
66 MHz eth6 --- Not in use


PCI Bus 4 eth7 --- 6 clients
66 MHz eth8 --- Not in use
Post by Martin J. Bligh
... and what's
the raw bandwidth of data we're pushing? ... it's not huge).
2900 simultaneous connections, each at ~320 kbps, translates to
928000 kbps, which is slightly less than the full bandwidth of a
single e1000. We're spreading that over 4 adapters and 4 busses.

- Troy
Eric W. Biederman
2002-09-11 09:11:49 UTC
Permalink
Post by Martin J. Bligh
Post by David S. Miller
Ie. the headers that don't need to go across the bus are the critical
resource saved by TSO.
I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there's 4 - 8 gigabit ethernet cards in this beast spread around
different buses (Troy - are we still just using 4? ... and what's
the raw bandwidth of data we're pushing? ... it's not huge).
I think we're CPU limited (there's no idle time on this machine),
which is odd for an 8 CPU 900MHz P3 Xeon,
Quite possibly. The P3 has roughly 800MB/s of FSB bandwidth, which must
be used for both I/O and memory accesses. So just driving a gige card at
wire speed takes a considerable portion of the CPU's capacity.

On analyzing this kind of thing I usually find it quite helpful to
compute what the hardware can theoretically do, to get a feel for where
the bottlenecks should be.

Eric
Martin J. Bligh
2002-09-11 14:10:57 UTC
Permalink
Post by Eric W. Biederman
Post by Martin J. Bligh
Post by David S. Miller
Ie. the headers that don't need to go across the bus are the critical
resource saved by TSO.
I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there's 4 - 8 gigabit ethernet cards in this beast spread around
different buses (Troy - are we still just using 4? ... and what's
the raw bandwidth of data we're pushing? ... it's not huge).
I think we're CPU limited (there's no idle time on this machine),
which is odd for an 8 CPU 900MHz P3 Xeon,
Quite possibly. The P3 has roughly 800MB/s of FSB bandwidth, which must
be used for both I/O and memory accesses. So just driving a gige card at
wire speed takes a considerable portion of the CPU's capacity.
On analyzing this kind of thing I usually find it quite helpful to
compute what the hardware can theoretically do, to get a feel for where
the bottlenecks should be.
We can push about 420MB/s of IO out of this thing (out of that
theoretical 800MB/s). Specweb is only pushing about 120MB/s of
total data through it, so it's not bus limited in this case.
Of course, I should have given you that data to start with,
but ... ;-)

M.

PS. This thing actually has 3 system buses, 1 for each of the two
sets of 4 CPUs, and 1 for all the PCI buses, and the three buses
are joined by an interconnect in the middle. But all the IO goes
through 1 of those buses, so for the purposes of this discussion,
it makes no difference whatsoever ;-)
Eric W. Biederman
2002-09-11 15:06:36 UTC
Permalink
Post by Martin J. Bligh
Post by Eric W. Biederman
Post by Martin J. Bligh
Post by David S. Miller
Ie. the headers that don't need to go across the bus are the critical
resource saved by TSO.
I'm not sure that's entirely true in this case - the Netfinity
8500R is slightly unusual in that it has 3 or 4 PCI buses, and
there's 4 - 8 gigabit ethernet cards in this beast spread around
different buses (Troy - are we still just using 4? ... and what's
the raw bandwidth of data we're pushing? ... it's not huge).
I think we're CPU limited (there's no idle time on this machine),
which is odd for an 8 CPU 900MHz P3 Xeon,
Quite possibly. The P3 has roughly 800MB/s of FSB bandwidth, which must
be used for both I/O and memory accesses. So just driving a gige card at
wire speed takes a considerable portion of the CPU's capacity.
On analyzing this kind of thing I usually find it quite helpful to
compute what the hardware can theoretically do, to get a feel for where
the bottlenecks should be.
We can push about 420MB/s of IO out of this thing (out of that
theoretical 800MB/s).
Sounds about average for a P3. I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop. Is
that 420MB/sec of IO on this test?
Post by Martin J. Bligh
Specweb is only pushing about 120MB/s of
total data through it, so it's not bus limited in this case.
Not quite. But you suck at least 240MB/s of your memory bandwidth with
DMA from disk and then DMA to the NIC, unless there is a highly
cached component. So I doubt you can effectively use more than 1 gige
card, maybe 2. And you have 8?
Post by Martin J. Bligh
Of course, I should have given you that data to start with,
but ... ;-)
PS. This thing actually has 3 system buses, 1 for each of the two
sets of 4 CPUs, and 1 for all the PCI buses, and the three buses
are joined by an interconnect in the middle. But all the IO goes
through 1 of those buses, so for the purposes of this discussion,
it makes no difference whatsoever ;-)
Wow, the hardware designers really believed in over-subscription.
If the busses are just running 64-bit/33MHz you are oversubscribed,
and at 64-bit/66MHz the PCI busses can easily swamp the system:
533*4 ~= 2132MB/s.

What kind of memory bandwidth does the system have, and on which
bus are the memory controllers? I'm just curious

Eric
David S. Miller
2002-09-11 15:15:21 UTC
Permalink
From: ***@xmission.com (Eric W. Biederman)
Date: 11 Sep 2002 09:06:36 -0600
Post by Martin J. Bligh
We can push about 420MB/s of IO out of this thing (out of that
theoretical 800MB/s).
Sounds about average for a P3. I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop.

You pushed that over the PCI bus of your P3? Just to RAM
doesn't count; lots of CPUs can do that.

That's what makes his number interesting.
Eric W. Biederman
2002-09-11 15:31:53 UTC
Permalink
Post by David S. Miller
Date: 11 Sep 2002 09:06:36 -0600
Post by Martin J. Bligh
We can push about 420MB/s of IO out of this thing (out of that
theoretical 800MB/s).
Sounds about average for a P3. I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop.
You pushed that over the PCI bus of your P3? Just to RAM
doesn't count, lots of cpu's can do that.
That's what makes his number interesting.
I agree. Getting 420MB/s to the pci bus is nice, especially with a P3.
The 800MB/s to memory was just the test I happened to conduct about 2 years
ago when I was still messing with slow P3 systems. It was a proof of
concept test to see if we could plug in an I/O card into a memory
slot.

On a current P4 system with the E7500 chipset this kind of thing is
easy. I have gotten roughly 450MB/s to a single myrinet card, and there
is enough theoretical bandwidth to do 4 times that, though I haven't had
a chance to get it working in practice. When I attempted to run two gige
cards simultaneously I had some weird problem (probably interrupt
related) where adding additional pci cards did not deliver any extra
performance.

On a P3, to get writes from the CPU to hit 800MB/s you use the special
CPU instructions that bypass the cache (non-temporal stores).

My point was that I have tested the P3 bus in question and I achieved
a real world 800MB/s over it. So I expect that on the system in
question unless another bottleneck is hit, it should be possible to
achieve a real world 800MB/s of I/O. There are enough pci busses
to support that kind of traffic.

Unless the memory controller is carefully placed on the system though
doing 400+MB/s could easily eat up most of the available memory
bandwidth and reduce the system to doing some very slow cache line fills.

Eric
Martin J. Bligh
2002-09-11 15:27:21 UTC
Permalink
Post by Eric W. Biederman
Sounds about average for a P3. I have pushed the full 800MiB/s out of
a P3 processor to memory but it was a very optimized loop. Is
that 420MB/sec of IO on this test?
Yup, Fibre channel disks. So we know we can push at least that.
Post by Eric W. Biederman
Note quite. But you suck at least 240MB/s of your memory bandwidth with
DMA from disk, and then DMA to the nic. Unless there is a highly
cached component. So I doubt you can effectively use more than 1 gige
card, maybe 2. And you have 8?
Nope, it's operating totally out of pagecache, there's no real disk
IO to speak of.
Post by Eric W. Biederman
Wow the hardware designers really believed in over-subscription.
If the busses are just running 64bit/33Mhz you are oversubscribed.
And at 64bit/66Mhz the pci busses can easily swamp the system
533*4 ~= 2128MB/s.
Two 32-bit buses (or maybe it was just one) and two 64-bit buses,
all at 66MHz. Yes, the PCI buses can push more than the backplane,
but things are never perfectly balanced in reality, so I'd prefer
it that way around ... it's not a perfect system, but hey, it's
Intel hardware - this is the high-volume market, not the real high end ;-)
Post by Eric W. Biederman
What kind of memory bandwidth does the system have, and on which
bus are the memory controllers? I'm just curious
Memory controllers are hung off the interconnect, which is slightly
difficult to describe. Look for docs on the Intel Profusion chipset, or
I can send you a powerpoint (yeah, yeah) presentation when I get into
work later today if you can't find it. Theoretical memory bandwidth
should be 1600MB/s if you're balanced across the CPUs; in practice I'd
expect to be able to push somewhat over 800MB/s.

M.
Todd Underwood
2002-09-12 07:28:34 UTC
Permalink
folx,

sorry for the late reply. catching up on kernel mail.

so all this TSO stuff looks v. v. similar to the IP-only fragmentation
that patricia gilfeather and i implemented on alteon acenics a couple of
years ago (see http://www.cs.unm.edu/~maccabe/SSL/frag/FragPaper1/ for a
general overview). it's exciting to see someone else take a stab at it
on different hardware and approach some of the tcp-specific issues.

the main difference, though, is that general-purpose kernel development
is still focused on improvements in *sending* speed. for real high
performance networking, the improvements are necessary in *receiving*
cpu utilization, in our estimation. (see our analysis of interrupt
overhead and the effect on receivers at gigabit speeds -- i hope that
this has become common understanding by now)

i guess i can't disagree with david miller that the improvements in TSO
are due entirely to header transmission savings on the sending side, but
that's only because sending wasn't CPU-intensive in the first place. we
were able to get a significant reduction in receiver cpu utilization by
reassembling IP fragments on the receiver side (sort of a
standards-based interrupt mitigation strategy that has the benefit of
not increasing latency the way interrupt coalescing does).

anyway, nice work,

t.
Post by David S. Miller
It's the DMA bandwidth saved; most of the specweb runs on x86 hardware
are limited by the DMA throughput of the PCI host controller. In
particular some controllers are limited to smaller DMA bursts to
work around hardware bugs.
Ie. the headers that don't need to go across the bus are the critical
resource saved by TSO.
I think I've said this a million times, perhaps the next person who
tries to figure out where the gains come from can just reply with
a pointer to a URL of this email I'm typing right now :-)
--
todd underwood, vp & cto
oso grande technologies, inc.
***@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
jamal
2002-09-12 12:30:44 UTC
Permalink
Good work. This is the first time i have seen someone say Linux's way
of reverse-order fragmentation is a GoodThing(tm). It was also great to
see the de-mything of some of the old assumptions of the world.

BTW, TSO is not as intelligent as what you are suggesting.
If i am not mistaken, you are not only suggesting fragmentation and
assembly at that level, you are also suggesting retransmits at the NIC.
This could be dangerous for practical reasons (changes in TCP congestion
control algorithms etc). TSO, as was pointed out in earlier emails, is
just a dumb sender of packets. I think even fragmentation is a misnomer.
Essentially you shove a huge buffer to the NIC and it breaks it into MTU
sized packets for you and sends them.

In regards to the receive side CPU utilization improvements: I think
that NAPI does a good job at least in getting rid of the biggest
offender -- interrupt overload. With NAPI also having got rid of
intermediate queues to the socket level, facilitating zero-copy receive
should be relatively easy to add, but there are no capable NICs in
existence (well, ok, not counting the TigonII/acenic that you can hack,
and the fact that the tigon 2 is EOL doesn't help other than just for
experiments). I don't think there's any NIC that can offload reassembly;
that might not be such a bad idea.

Are you still continuing work on this?

cheers,
jamal
Todd Underwood
2002-09-12 13:57:04 UTC
Permalink
jamal,
Post by jamal
Good work. This is the first time i have seen someone say Linux's way
of reverse-order fragmentation is a GoodThing(tm). It was also great to
see the de-mything of some of the old assumptions of the world.
thanks. although i'd love to take credit, i don't think that the
reverse-order fragmentation appreciation is all that original: who
wouldn't want their data structure size determined up-front? :-) (not to
mention getting header-overwriting for free as part of the single copy.)
Post by jamal
BTW, TSO is not as intelligent as what you are suggesting.
If i am not mistaken, you are not only suggesting fragmentation and
assembly at that level, you are also suggesting retransmits at the NIC.
This could be dangerous for practical reasons (changes in TCP congestion
control algorithms etc). TSO, as was pointed out in earlier emails, is
just a dumb sender of packets. I think even fragmentation is a misnomer.
Essentially you shove a huge buffer to the NIC and it breaks it into MTU
sized packets for you and sends them.
the biggest problem with our approach is that it is extremely difficult
to mix two very different kinds of workloads together: the regular
server-on-the-internet workload (SOI) and the large-cluster-member
workload (LCM). in the former case, SOI, you get dropped packets,
fragments, no fragments, out-of-order fragments, etc. in the LCM case
you basically never get any of that stuff -- you're on a closed network
with 1000-10000 of your closest cluster friends and that's just what
you're doing. no fragments (unless you put them there), no out-of-order
fragments (unless you send them) and basically no dropped packets ever.
obviously, if you can assume conditions like that, you can do things
like: only reassemble fragments in reverse order, since you know you'll
only send them that way, e.g.
Post by jamal
In regards to the receive side CPU utilization improvements: I think
that NAPI does a good job at least in getting rid of the biggest
offender -- interrupt overload. With NAPI also having got rid of
intermediate queues to the socket level, facilitating zero-copy receive
should be relatively easy to add, but there are no capable NICs in
existence (well, ok, not counting the TigonII/acenic that you can hack,
and the fact that the tigon 2 is EOL doesn't help other than just for
experiments). I don't think there's any NIC that can offload reassembly;
that might not be such a bad idea.
i've done some reading about NAPI just recently (somehow i missed the
splash when it came out). the two things i like about it are the
hardware-independent interrupt mitigation technique and using the DMA
buffers as a receive backlog. i'm concerned about the numbers posted by
ibm folx recently showing a slowdown under some conditions using NAPI
and need to read the rest of that discussion.

we are definitely aware of the fact that the more you want to put on the
NIC, the more the NIC will have to do (and the more expensive it will
have to be). right now the NICs that people are developing on are the
TigonII/III and, even more closed/proprietary, the Myrinet NICs. i would
love to have a <$200 NIC with open firmware and a CPU/memory so that we
could offload some more of this functionality (where it makes sense).
Post by jamal
Are you still continuing work on this?
definitely! we were just talking about some of these issues yesterday
(and trying to find hardware spec info on the web for the e1000 platform
to see what else they might be able to do). patricia gilfeather is
working on finding parts of TCP that are separable from the rest of TCP,
but the problems you raise are serious: it would have to be on an
application-specific and socket-specific basis, so that the app would
*know* that functionality (like acks for synchronization packets or
whatever) was being offloaded.

the biggest difference in our perspective, versus the common kernel
developers, is that we're still looking for ways to get the OS out of the
way of the applications. if we can do large data transfers (with
pre-posted receives and pre-posted memory allocation, obviously) directly
from the nic into application memory and have a clean, relatively simple
and standard api to do that, we avoid all of the interrupt mitigation
techniques and save hugely on context switching overhead.

this may now be off-topic for linux-kernel and i'd be happy to chat
further in private email if others are getting bored :-).
Post by jamal
cheers,
jamal
t.
--
todd underwood, vp & cto
oso grande technologies, inc.
***@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
Alan Cox
2002-09-12 14:11:23 UTC
Permalink
Post by Todd Underwood
thanks. although i'd love to take credit, i don't think that the
reverse-order fragmentation appreciation is all that original: who
wouldn't want their data structure size determined up-front? :-) (not to
mention getting header-overwriting for free as part of the single copy.)
As far as I am aware it was original when Linux first did it (and we
broke cisco pix, some boot proms, and some sco in the process). Credit
goes to Arnt Gulbrandsen, probably better known nowadays for his work
on Qt.
t***@osogrande.com
2002-09-12 14:41:25 UTC
Permalink
alan,

good to know. it's a nice piece of engineering. it's useful to note that
linux has such a long and rich history of breaking de-facto standards in
order to make things work better.

t.
Post by Alan Cox
Post by Todd Underwood
thanks. although i'd love to take credit, i don't think that the
reverse-order fragmentation appreciation is all that original: who
wouldn't want their data structure size determined up-front? :-) (not to
mention getting header-overwriting for free as part of the single copy.)
As far as I am aware it was original when Linux first did it (and we
broke cisco pix, some boot proms, and some sco in the process). Credit
goes to Arnt Gulbrandsen, probably better known nowadays for his work
on Qt.
--
todd underwood, vp & cto
oso grande technologies, inc.
***@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
David S. Miller
2002-09-12 23:12:25 UTC
Permalink
From: jamal <***@cyberus.ca>
Date: Thu, 12 Sep 2002 08:30:44 -0400 (EDT)

In regards to the receive side CPU utilization improvements: I think
that NAPI does a good job at least in getting rid of the biggest
offender -- interrupt overload.

I disagree, at least for bulk receivers. We have no way currently to
get rid of the data copy. We desperately need sys_receivefile() and
appropriate ops all the way into the networking, then the necessary
driver level support to handle the cards that can do this.

Once 10gbit cards start hitting the shelves this will convert from a
nice perf improvement into a must have.
t***@osogrande.com
2002-09-13 21:59:15 UTC
Permalink
dave, all,
Post by David S. Miller
I disagree, at least for bulk receivers. We have no way currently to
get rid of the data copy. We desperately need sys_receivefile() and
appropriate ops all the way into the networking, then the necessary
driver level support to handle the cards that can do this.
not sure i understand what you're proposing, but while we're at it, why
not also make an api for apps to allocate a buffer in userland that (for
nics that support it) the nic can dma directly into? it seems likely
that notification that the buffer was used would have to travel through
the kernel, but it would be nice to save the interrupts altogether.

this may be exactly what you were saying.
Post by David S. Miller
Once 10gbit cards start hitting the shelves this will convert from a
nice perf improvement into a must have.
totally agreed. this is a must for high-performance computing now (since
who wants to waste 80-100% of their CPU just running the network?)

t.
--
todd underwood, vp & cto
oso grande technologies, inc.
***@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
Nivedita Singhvi
2002-09-13 22:12:20 UTC
Permalink
Post by t***@osogrande.com
dave, all,
not sure i understand what you're proposing, but while we're at it,
why not also make the api for apps to allocate a buffer in userland
that (for nics that support it) the nic can dma directly into? it
I believe that's exactly what David was referring to - a
reverse-direction sendfile(), so to speak.
Post by t***@osogrande.com
seems likely notification that the buffer was used would have to
travel through the kernel, but it would be nice to save the
interrupts altogether.
However, I don't think what you're saving is interrupts so
much as the extra copy, but I could be wrong.

thanks,
Nivedita
David S. Miller
2002-09-13 22:04:39 UTC
Permalink
From: todd-***@osogrande.com
Date: Fri, 13 Sep 2002 15:59:15 -0600 (MDT)

not sure i understand what you're proposing

Cards in the future at 10gbit and faster are going to provide
facilities by which:

1) You register an IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
with the hardware when TCP connections are opened.

2) The card scans TCP packets arriving; if the cookie matches, it
accumulates received data to fill full pages and wakes up the
networking when either:

a) a full page has accumulated for a connection
b) connection cookie mismatch
c) a configurable timer has expired

3) TCP ends up getting receive packets with skb->shinfo() fraglist
containing the data portion in full struct page *'s
This can be placed directly into the page cache via sys_receivefile
generic code in mm/filemap.c or f.e. NFSD/NFS receive side
processing.

not also make the api for apps to allocate a buffer in userland that (for
nics that support it) the nic can dma directly into? it seems likely
notification that the buffer was used would have to travel through the
kernel, but it would be nice to save the interrupts altogether.

This is already doable with sys_sendfile() for send today. The user
just does the following:

1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_sendfile() to send the data over the socket from that file
3) uses socket write space monitoring to determine if the portions of
the shared area are reclaimable for new writes

BTW, Apache could make use of this; I doubt it does currently.

The corollary with sys_receivefile would be that the user:

1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_receivefile() to pull in the data from the socket to that file

There is no need to poll the receive socket space as the successful
return from sys_receivefile() is the "data got received successfully"
event.

totally agreed. this is a must for high-performance computing now (since
who wants to waste 80-100% of their CPU just running the network?)

If send side is your bottleneck and you think zerocopy sends of
user anonymous data might help, see the above since we can do it
today and you are free to experiment.

Franks a lot,
David S. Miller
***@redhat.com
jamal
2002-09-15 20:16:13 UTC
Permalink
10 gige becomes more of an interesting beast. Not sure if we would see
servers with 10gige real soon now. Your proposal does make sense,
although compute power would still be a player. I think the key would
be parallelization.
Now if it wasn't for the stupid way TCP options were designed
you could easily do remote DMA instead. It would be relatively easy to
add NIC support for that. Maybe SCTP would save us ;-> however, if
history could be used to predict the future, i think TCP will continue
to be "hacked" to fit the throughput requirements, so no chance for
SCTP to be a big player, i am afraid.

cheers,
jamal
Post by David S. Miller
Date: Fri, 13 Sep 2002 15:59:15 -0600 (MDT)
not sure i understand what you're proposing
Cards in the future at 10gbit and faster are going to provide
1) You register an IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
with the hardware when TCP connections are opened.
[..]
David S. Miller
2002-09-16 04:23:21 UTC
Permalink
From: jamal <***@cyberus.ca>
Date: Sun, 15 Sep 2002 16:16:13 -0400 (EDT)

Your proposal does make sense although compute power would still be
a player. I think the key would be parallelization;

Oh I forgot to mention that some of these cards also compute a cookie
for you on receive packets, and you're meant to point the input
processing for that packet to a cpu whose number is derived from that
cookie it gives you.

Lockless per-cpu packet input queues make this sort of hard for us
to implement currently.
t***@osogrande.com
2002-09-16 14:16:47 UTC
Permalink
david,

comments/questions below...
Post by David S. Miller
1) You register an IPV4 src_addr/dst_addr TCP src_port/dst_port cookie
with the hardware when TCP connections are opened.
intriguing architecture. are there any standards in progress to support
this? basically, people doing high performance computing have been
customizing non-commodity nics (acenic, myrinet, quadrics, etc.) to do
some of this cookie registration/scanning. it would be nice if there
were a standard API/hardware capability that took care of at least this
piece.

(frankly, it would also be nice if customizable, almost-commodity nics
based on processor/memory/firmware architecture rather than just asics
(like the acenic) continued to exist).
Post by David S. Miller
not also make the api for apps to allocate a buffer in userland that (for
nics that support it) the nic can dma directly into? it seems likely
notification that the buffer was used would have to travel through the
kernel, but it would be nice to save the interrupts altogether.
This is already doable with sys_sendfile() for send today. The user
1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_sendfile() to send the data over the socket from that file
3) uses socket write space monitoring to determine if the portions of
the shared area are reclaimable for new writes
BTW Apache could make this, I doubt it does currently.
1) mmap()'s a file with MAP_SHARED to write the data
2) uses sys_receivefile() to pull in the data from the socket to that file
There is no need to poll the receive socket space as the successful
return from sys_receivefile() is the "data got received successfully"
event.
the send case has been well described and seems to work well for the
people for whom that is the bottleneck. that has not been the case in
HPC, since sends are relatively cheaper (in terms of cpu) than receives.

who is working on this architecture for receives? i know quite a few
people who would be interested in working on it and willing to prototype
as well.
Post by David S. Miller
totally agreed. this is a must for high-performance computing now (since
who wants to waste 80-100% of their CPU just running the network)?
If send side is your bottleneck and you think zerocopy sends of
user anonymous data might help, see the above since we can do it
today and you are free to experiment.
for many of the applications that i care about, receive is the bottleneck,
so zerocopy sends are somewhat of a non-issue (not that they're not nice,
they just don't solve the primary waste of processor resources).

is there a beginning implementation yet of zerocopy receives as you
describe above, or would you be interested in entertaining
implementations that work on existing (1Gig-e) cards?

what i'm thinking is something that prototypes the api to the nic that
you are proposing and implements the NIC-side functionality in firmware
on the acenic-2's (which have available firmware in at least two
implementations -- the alteon version and pete wyckoff's version, which
may be less license-encumbered).

this is obviously only feasible if there already exists some consensus on
what the os-to-hardware API should look like (or there is willingness to
try to build a consensus around that now).

t.
--
todd underwood, vp & cto
oso grande technologies, inc.
***@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
David S. Miller
2002-09-16 19:52:11 UTC
Permalink
From: todd-***@osogrande.com
Date: Mon, 16 Sep 2002 08:16:47 -0600 (MDT)

are there any standards in progress to support this.

Your question makes no sense; it is a hardware optimization
of an existing standard. The chip merely is told what flows
exist and it concatenates TCP data from consecutive packets
for that flow if they arrive in sequence.

who is working on this architecture for receives?

Once cards with the feature exist, probably Alexey and myself
will work on it.

Basically, who ever isn't busy with something else once the technology
appears.

is there a beginning implementation yet of zerocopy receives

No.

Franks a lot,
David S. Miller
***@redhat.com
t***@osogrande.com
2002-09-16 21:32:56 UTC
Permalink
folx,

perhaps i was insufficiently clear.
Post by David S. Miller
are there any standards in progress to support this.
Your question makes no sense; it is a hardware optimization
of an existing standard. The chip merely is told what flows
exist and it concatenates TCP data from consecutive packets
for that flow if they arrive in sequence.
hardware optimizations can be standardized. in fact, when they are, it
is substantially easier to implement against them.

my assumption (perhaps incorrect) is that some core set of functionality
is necessary for a card to support zero-copy receives (in particular,
the ability to register cookies of expected data flows and the memory
location to which they are to be sent). what 'existing standard' is this
kernel<->NIC api a standardization of?
Post by David S. Miller
who is working on this architecture for receives?
Once cards with the feature exist, probably Alexey and myself
will work on it.
Basically, who ever isn't busy with something else once the technology
appears.
so if we wrote and distributed firmware for alteon acenics that supported
this today, you would be willing to incorporate the new system calls into
the networking code (along with the new firmware for the card, provided we
could talk jes into accepting the changes, assuming he's still the
maintainer of the driver)? that's great.
Post by David S. Miller
is there a beginning implementation yet of zerocopy receives
No.
thanks for your feedback.

t.
--
todd underwood, vp & cto
oso grande technologies, inc.
***@osogrande.com

"Those who give up essential liberties for temporary safety deserve
neither liberty nor safety." - Benjamin Franklin
David S. Miller
2002-09-16 21:29:31 UTC
Permalink
From: todd-***@osogrande.com
Date: Mon, 16 Sep 2002 15:32:56 -0600 (MDT)

new system calls into the networking code

The system calls would go into the VFS, sys_receivefile is not
networking specific in any way shape or form.

And to answer your question, if I had the time I'd work on it yes.

Right now the answer to "well do you have the time" is no, I am
working on something much more important wrt. Linux networking. I've
hinted at what this is in previous postings, and if people can't
figure out what it is I'm not going to mention this explicitly :-)
David Woodhouse
2002-09-16 22:53:00 UTC
Permalink
Post by David S. Miller
new system calls into the networking code
The system calls would go into the VFS, sys_receivefile is not
networking specific in any way shape or form.
Er, surely the same goes for sys_sendfile? Why have a new system call
rather than just swapping the 'in' and 'out' fds?

--
dwmw2
David S. Miller
2002-09-16 22:46:40 UTC
Permalink
From: David Woodhouse <***@infradead.org>
Date: Mon, 16 Sep 2002 23:53:00 +0100

Er, surely the same goes for sys_sendfile? Why have a new system call
rather than just swapping the 'in' and 'out' fds?

There is an assumption that one is a linear stream of output (in this
case a socket) and the other one is a page cache based file.

It would be nice to extend sys_sendfile to work properly in both
ways in a manner that Linus would accept, want to work on that?
David Woodhouse
2002-09-16 23:03:19 UTC
Permalink
Post by David S. Miller
Post by David Woodhouse
Er, surely the same goes for sys_sendfile? Why have a new system
call rather than just swapping the 'in' and 'out' fds?
There is an assumption that one is a linear stream of output (in this
case a socket) and the other one is a page cache based file.
That's an implementation detail and it's not clear we should be exposing it
to the user. It's not entirely insane to contemplate socket->socket or
file->file sendfile either -- would we invent new system calls for those
too? File descriptors are file descriptors.
Post by David S. Miller
It would be nice to extend sys_sendfile to work properly in both ways
in a manner that Linus would accept, want to work on that?
Yeah -- I'll add it to the TODO list. Scheduled for some time in 2007 :)

More seriously though, I'd hope that whoever implemented what you call
'sys_receivefile' would solve this issue, as 'sys_receivefile' isn't really
useful as anything more than a handy nomenclature for describing the
process in question.

--
dwmw2
Jeff Garzik
2002-09-16 23:08:15 UTC
Permalink
Post by David Woodhouse
Post by David S. Miller
Post by David Woodhouse
Er, surely the same goes for sys_sendfile? Why have a new system
call rather than just swapping the 'in' and 'out' fds?
There is an assumption that one is a linear stream of output (in this
case a socket) and the other one is a page cache based file.
That's an implementation detail and it's not clear we should be exposing it
to the user. It's not entirely insane to contemplate socket->socket or
file->file sendfile either -- would we invent new system calls for those
too? File descriptors are file descriptors.
I was rather disappointed when file->file sendfile was [purposefully?]
broken in 2.5.x...

Jeff
David S. Miller
2002-09-16 23:02:10 UTC
Permalink
From: Jeff Garzik <***@mandrakesoft.com>
Date: Mon, 16 Sep 2002 19:08:15 -0400

I was rather disappointed when file->file sendfile was [purposefully?]
broken in 2.5.x...

What change made this happen?
Jeff Garzik
2002-09-16 23:48:37 UTC
Permalink
Post by David S. Miller
Date: Mon, 16 Sep 2002 19:08:15 -0400
I was rather disappointed when file->file sendfile was [purposefully?]
broken in 2.5.x...
What change made this happen?
I dunno when it happened, but 2.5.x now returns EINVAL for all
file->file cases.

In 2.4.x, if sendpage is NULL, file_send_actor in mm/filemap.c faked a
call to fops->write().
In 2.5.x, if sendpage is NULL, EINVAL is unconditionally returned.
David S. Miller
2002-09-16 23:43:43 UTC
Permalink
From: Jeff Garzik <***@mandrakesoft.com>
Date: Mon, 16 Sep 2002 19:48:37 -0400

I dunno when it happened, but 2.5.x now returns EINVAL for all
file->file cases.

In 2.4.x, if sendpage is NULL, file_send_actor in mm/filemap.c faked a
call to fops->write().
In 2.5.x, if sendpage is NULL, EINVAL is unconditionally returned.


What if source and destination file and offsets match?
Sounds like 2.4.x might deadlock.

In fact it sounds similar to the "read() with buf pointed to same
page in MAP_WRITE mmap()'d area" deadlock we had ages ago.

Nivedita Singhvi
2002-09-12 17:18:37 UTC
Permalink
Post by Todd Underwood
sorry for the late reply. catching up on kernel mail.
the main difference, though, is that general purpose kernel
development is still focused on improvements in *sending* speed.
for real high performance networking, the improvements are necessary
in *receiving* cpu utilization, in our estimation.
(see our analysis of interrupt overhead and the effect on receivers
at gigabit speeds--i hope that this has become common understanding
by now)
Some of that may be a byproduct of the "all the world's a webserver"
mindset - we are primarily focused on the server side (aka the
money side ;)), and there is some amount of automatic thinking that
this means we're going to be sending data and receiving small packets
(mostly acks) in return. There is much less emphasis given to solving
the problems on the other side (active connection scalability, for
instance), or other issues that manifest themselves as
client-side bottlenecks for most applications.

thanks,
Nivedita
David S. Miller
2002-09-06 23:52:44 UTC
Permalink
From: Troy Wilson <***@tempest.prismnet.com>
Date: Fri, 6 Sep 2002 18:56:04 -0500 (CDT)

4241408 segments retransmited

Is hw flow control being negotiated and enabled properly on the
gigabit interfaces?

There should be no reason for these kinds of retransmits to
happen.
Troy Wilson
2002-09-06 23:56:04 UTC
Permalink
Post by jamal
Do you have any stats from the hardware that could show
retransmits etc;
**********************************
* netstat -s before the workload *
**********************************

Ip:
433 total packets received
0 forwarded
0 incoming packets discarded
409 incoming packets delivered
239 requests sent out
Icmp:
24 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 24
24 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 24
Tcp:
0 active connections openings
2 passive connection openings
0 failed connection attempts
0 connection resets received
2 connections established
300 segments received
183 segments send out
0 segments retransmited
0 bad segments received.
2 resets sent
Udp:
8 packets received
24 packets to unknown port received.
0 packet receive errors
32 packets sent
TcpExt:
ArpFilter: 0
5 delayed acks sent
4 packets directly queued to recvmsg prequeue.
35 packets header predicted
TCPPureAcks: 5
TCPHPAcks: 160
TCPRenoRecovery: 0
TCPSackRecovery: 0
TCPSACKReneging: 0
TCPFACKReorder: 0
TCPSACKReorder: 0
TCPRenoReorder: 0
TCPTSReorder: 0
TCPFullUndo: 0
TCPPartialUndo: 0
TCPDSACKUndo: 0
TCPLossUndo: 0
TCPLoss: 0
TCPLostRetransmit: 0
TCPRenoFailures: 0
TCPSackFailures: 0
TCPLossFailures: 0
TCPFastRetrans: 0
TCPForwardRetrans: 0
TCPSlowStartRetrans: 0
TCPTimeouts: 0
TCPRenoRecoveryFail: 0
TCPSackRecoveryFail: 0
TCPSchedulerFailed: 0
TCPRcvCollapsed: 0
TCPDSACKOldSent: 0
TCPDSACKOfoSent: 0
TCPDSACKRecv: 0
TCPDSACKOfoRecv: 0
TCPAbortOnSyn: 0
TCPAbortOnData: 0
TCPAbortOnClose: 0
TCPAbortOnMemory: 0
TCPAbortOnTimeout: 0
TCPAbortOnLinger: 0
TCPAbortFailed: 0
TCPMemoryPressures: 0

*********************************
* netstat -s after the workload *
*********************************

Ip:
425317106 total packets received
3648 forwarded
0 incoming packets discarded
425313332 incoming packets delivered
203629600 requests sent out
Icmp:
58 ICMP messages received
12 input ICMP message failed.
ICMP input histogram:
destination unreachable: 58
58 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 58
Tcp:
64 active connections openings
16690445 passive connection openings
56552 failed connection attempts
0 connection resets received
3 connections established
425311551 segments received
203629500 segments send out
4241408 segments retransmited
0 bad segments received.
298883 resets sent
Udp:
8 packets received
34 packets to unknown port received.
0 packet receive errors
42 packets sent
TcpExt:
ArpFilter: 0
8884840 TCP sockets finished time wait in fast timer
12913162 delayed acks sent
17292 delayed acks further delayed because of locked socket
Quick ack mode was activated 102351 times
54977 times the listen queue of a socket overflowed
54977 SYNs to LISTEN sockets ignored
157 packets directly queued to recvmsg prequeue.
51 packets directly received from prequeue
16925947 packets header predicted
51 packets header predicted and directly queued to user
TCPPureAcks: 169071816
TCPHPAcks: 176510836
TCPRenoRecovery: 30090
TCPSackRecovery: 0
TCPSACKReneging: 0
TCPFACKReorder: 0
TCPSACKReorder: 0
TCPRenoReorder: 464
TCPTSReorder: 5
TCPFullUndo: 6
TCPPartialUndo: 29
TCPDSACKUndo: 0
TCPLossUndo: 1
TCPLoss: 0
TCPLostRetransmit: 0
TCPRenoFailures: 218884
TCPSackFailures: 0
TCPLossFailures: 35561
TCPFastRetrans: 145529
TCPForwardRetrans: 0
TCPSlowStartRetrans: 3463096
TCPTimeouts: 373473
TCPRenoRecoveryFail: 1221
TCPSackRecoveryFail: 0
TCPSchedulerFailed: 0
TCPRcvCollapsed: 0
TCPDSACKOldSent: 0
TCPDSACKOfoSent: 0
TCPDSACKRecv: 1
TCPDSACKOfoRecv: 0
TCPAbortOnSyn: 0
TCPAbortOnData: 0
TCPAbortOnClose: 0
TCPAbortOnMemory: 0
TCPAbortOnTimeout: 0
TCPAbortOnLinger: 0
TCPAbortFailed: 0
TCPMemoryPressures: 0
Robert Olsson
2002-09-06 11:44:34 UTC
Permalink
Post by Andrew Morton
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.
http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
Robert can comment on optimal settings
Hopefully yes...

I see other numbers so we have to sort out the differences. Andrew Morton
pinged me about this test last week. So I've had a chance to run some tests.

Some comments:
Scaling to CPU can be a dangerous measure w. NAPI due to its adaptive behaviour,
where RX interrupts decrease in favour of successive polls.

And the NAPI scheme behaves differently since we can not assume that all network
traffic is well-behaved like TCP. The system has to be manageable and to "perform"
under any network load, not only for well-behaved TCP. So of course we will
see some differences -- there is no free lunch. Simply, we can not blindly
look at one test. IMO NAPI is the best overall performer. The numbers speak
for themselves.

Here is the most recent test...

The NAPI kernel path is included in 2.4.20-pre4. The comparison below is mainly
between the e1000 driver w and w/o NAPI, and the NAPI port to e1000 is still
evolving.

Linux 2.4.20-pre4/UP PIII @ 933 MHz w. Intel's e1000 2-port GIGE adapter.
e1000 4.3.2-k1 (current kernel version) and current NAPI patch. For NAPI the
e1000 driver uses RxIntDelay=1; RxIntDelay=0 caused problems. The non-NAPI
driver uses RxIntDelay=64 (default).

Three tests: TCP, UDP, packet forwarding.

Netperf. TCP socket size 131070, Single TCP stream. Test length 30 s.

M-size e1000 NAPI-e1000
============================
4 20.74 20.69 Mbit/s data received.
128 458.14 465.26
512 836.40 846.71
1024 936.11 937.93
2048 940.65 939.92
4096 940.86 937.59
8192 940.87 939.95
16384 940.88 937.61
32768 940.89 939.92
65536 940.90 939.48
131070 940.84 939.74

Netperf. UDP_STREAM. 1440 pkts. Single UDP stream. Test length 30 s.
e1000 NAPI-e1000
====================================
955.7 955.7 Mbit/s data received.

Forwarding test. 1 Mpkts at 970 kpps injected.
e1000 NAPI-e1000
=============================================
T-put 305 298580 pkts routed.

NOTE!
With the non-NAPI driver this system is "dead" and performs nothing.


Cheers.
--ro
Martin J. Bligh
2002-09-06 14:37:16 UTC
Permalink
Post by Robert Olsson
And the NAPI scheme behaves differently since we can not assume that all network
traffic is well-behaved like TCP. The system has to be manageable and to "perform"
under any network load, not only for well-behaved TCP. So of course we will
see some differences -- there is no free lunch. Simply, we can not blindly
look at one test. IMO NAPI is the best overall performer. The numbers speak
for themselves.
I don't doubt it's a win for most cases, we just want to reap the benefit
for the large SMP systems as well ... the fundamental mechanism seems
very scalable to me, we probably just need to do a little tuning?
Post by Robert Olsson
The NAPI kernel path is included in 2.4.20-pre4. The comparison below is mainly
between the e1000 driver w and w/o NAPI, and the NAPI port to e1000 is still
evolving.
We are running from 2.5.latest ... any updates needed for NAPI for the
driver in the current 2.5 tree, or is that OK?

Thanks,

Martin.
Robert Olsson
2002-09-06 15:38:19 UTC
Permalink
Post by Martin J. Bligh
We are running from 2.5.latest ... any updates needed for NAPI for the
driver in the current 2.5 tree, or is that OK?
Should be OK. Get the latest kernel e1000 to pick up Intel's and the maintainer's
latest work, and apply the e1000 NAPI patch. RH includes this patch?

And yes, there is plenty of room for improvement...


Cheers.
--ro
Manfred Spraul
2002-09-06 18:35:08 UTC
Permalink
Post by Dave Hansen
The real question is why NAPI causes so much more work for the client.
[Just a summary from my results from last year. All testing with a
simple NIC without hw interrupt mitigation, on a Cyrix P150]

My assumption was that NAPI increases the cost of receiving a single
packet: instead of one hw interrupt with one device access (ack
interrupt) and the softirq processing, the hw interrupt must ack &
disable the interrupt, then the processing occurs in softirq context,
and the interrupts are reenabled at softirq context.

The second point was that interrupt mitigation must remain enabled, even
with NAPI: the automatic mitigation doesn't work with process space
limited loads (e.g. TCP: backlog queue is drained quickly, but the
system is busy processing the prequeue or receive queue)

jamal, is it possible that a driver uses both napi and the normal
interface, or would that break fairness?
Use netif_rx, until it returns dropping. If that happens, disable the
interrupt, and call netif_rx_schedule().

Is it possible to determine the average number of packets that are
processed for each netif_rx_schedule()?

--
Manfred
David S. Miller
2002-09-06 18:38:29 UTC
Permalink
From: Manfred Spraul <***@colorfullife.com>
Date: Fri, 06 Sep 2002 20:35:08 +0200

The second point was that interrupt mitigation must remain enabled, even
with NAPI: the automatic mitigation doesn't work with process space
limited loads (e.g. TCP: backlog queue is drained quickly, but the
system is busy processing the prequeue or receive queue)

Not true. NAPI is in fact a 100% replacement for hw interrupt
mitigation strategies. The cpu usage elimination afforded by
hw interrupt mitigation is also afforded by NAPI, and then
some.

See Jamal's paper.

Franks a lot,
David S. Miller
***@redhat.com
David S. Miller
2002-09-06 19:34:28 UTC
Permalink
From: Manfred Spraul <***@colorfullife.com>
Date: Fri, 06 Sep 2002 21:40:09 +0200

Dave, do you have interrupt rates from the clients with and without NAPI?

Robert does.
Robert Olsson
2002-09-10 12:02:20 UTC
Permalink
Post by David S. Miller
Post by Manfred Spraul
But what if the backlog queue is empty all the time? Then NAPI thinks
that the system is idle, and reenables the interrupts after each packet :-(
Yes, and this happens even without NAPI. Just set RxIntDelay=X and send
pkts at >= X+1 intervals.
Post by David S. Miller
Post by Manfred Spraul
Dave, do you have interrupt rates from the clients with and without NAPI?
Robert does.
Yes, we get into this interesting discussion now... Since with NAPI we can
safely use RxIntDelay=0 (e1000 terminology). With the classical IRQ we simply
had to add latency (an RxIntDelay of 64-128 us is common for GIGE) just to
survive at higher speeds (GIGE max is 1.48 Mpps), and with the interrupt latency
also comes higher network latencies... IMO this delay was a "work-around"
for the old interrupt scheme.

So we now have the option of removing it... But we are trading less latency
for more interrupts. So yes, Manfred is correct...

So is there a decent setting/compromise?

Well, the first approximation is just to do what DaveM suggested:
RxIntDelay=0. This solved many problems with buggy hardware and complicated
tuning; RxIntDelay used to be combined with other mitigation parameters to
compensate for different packet sizes etc. This leads to very "fragile"
performance where a NIC could perform excellently w. a single TCP stream but
be seriously broken in many other tests. So tuning to just one "test"
can cause a lot of mis-tuning as well.

Anyway. A tulip NAPI variant added mitigation when we reached "some load" to
avoid the static interrupt delay. (Still keeping things pretty simple):

Load "Mode"
-------------------
Lo 1) RxIntDelay=0
Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
Hi 3) Consecutive polling. No RX interrupts.

Is it worth the effort?

For SMP w/o affinity the delay could eventually reduce the cache bouncing
since the packets become more "batched", at the cost of latency of course.
We use RxIntDelay=0 for production use. (IP-forwarding on UP)

Cheers.

--ro
Manfred Spraul
2002-09-10 16:55:35 UTC
Permalink
Post by Robert Olsson
Anyway. A tulip NAPI variant added mitigation when we reached "some
load" to avoid the static interrupt delay. (Still keeping things pretty simple):
Load "Mode"
-------------------
Lo 1) RxIntDelay=0
Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
Hi 3) Consecutive polling. No RX interrupts.
Sounds good.

The difficult part is when to go from Lo to Mid. Unfortunately my tulip
card is braindead (LC82C168), but I'll try to find something usable for
benchmarking.

In my tests with the winbond card, I've switched at a fixed packet rate:

< 2000 packets/sec: no delay
> 2000 packets/sec: poll rx at 0.5 ms
--
Manfred
Robert Olsson
2002-09-11 07:46:45 UTC
Permalink
Post by Manfred Spraul
Post by Robert Olsson
Load "Mode"
-------------------
Lo 1) RxIntDelay=0
Mid 2) RxIntDelay=fix (When we had X pkts on the RX ring)
Hi 3) Consecutive polling. No RX interrupts.
Sounds good.
The difficult part is when to go from Lo to Mid. Unfortunately my tulip
card is braindead (LC82C168), but I'll try to find something usable for
benchmarking
21143 for tulips. Well, any NIC with "RxIntDelay" should do.
Post by Manfred Spraul
< 2000 packets/sec: no delay
> 2000 packets/sec: poll rx at 0.5 ms
I was experimenting with all sorts of moving averages but never got a good
correlation with bursty network traffic at this level of resolution. The
only measure I found fast and simple enough for this was the number of
packets on the RX ring, as I mentioned.


Cheers.
--ro
Manfred Spraul
2002-09-06 19:40:09 UTC
Permalink
Post by David S. Miller
Date: Fri, 06 Sep 2002 20:35:08 +0200
The second point was that interrupt mitigation must remain enabled, even
with NAPI: the automatic mitigation doesn't work with process space
limited loads (e.g. TCP: backlog queue is drained quickly, but the
system is busy processing the prequeue or receive queue)
Not true. NAPI is in fact a 100% replacement for hw interrupt
mitigation strategies. The cpu usage elimination afforded by
hw interrupt mitigation is also afforded by NAPI, and then
some.
See Jamal's paper.
I've read his paper: it's about MLFFR. There is no alternative to NAPI
if packets arrive faster than they are processed by the backlog queue.

But what if the backlog queue is empty all the time? Then NAPI thinks
that the system is idle, and reenables the interrupts after each packet :-(

In my tests, I've used a pentium class system (I have no GigE cards -
that was the only system where I could saturate the cpu with 100MBit
ethernet). IIRC 30% cpu time was needed for the copy_to_user(). The
receive queue was filled, the backlog queue empty. With NAPI, I got 1
interrupt for each packet, with hw interrupt mitigation the throughput
was 30% higher for MTU 600.

Dave, do you have interrupt rates from the clients with and without NAPI?

--
Manfred
Nivedita Singhvi
2002-09-07 00:18:16 UTC
Permalink
Post by jamal
Do you have any stats from the hardware that could show
retransmits etc;
Troy,

Are tcp_sack, tcp_fack, tcp_dsack turned on?

thanks,
Nivedita
Troy Wilson
2002-09-07 00:27:33 UTC
Permalink
Post by Nivedita Singhvi
Are tcp_sack, tcp_fack, tcp_dsack turned on?
tcp_fack and tcp_dsack are on, tcp_sack is off.

- Troy
Mala Anand
2002-09-10 14:59:27 UTC
Permalink
I am resending this note with the subject heading, so that
it can be viewed through the subject category.
Post by Andrew Morton
Post by David S. Miller
NAPI is also not the panacea to all problems in the world.
Mala did some testing on this a couple of weeks back. It appears that
NAPI damaged performance significantly.
Post by Andrew Morton
http://www-124.ibm.com/developerworks/opensource/linuxperf/netperf/results/july_02/netperf2.5.25results.htm
Unfortunately it is not listed what e1000 and core NAPI
patch was used. Also, not listed, are the RX/TX mitigation
and ring sizes given to the kernel module upon loading.
The default driver that is included in 2.5.25 kernel for Intel
gigabit adapter was used for the baseline test and the NAPI driver
was downloaded from Robert Olsson's website. I have updated my web
page to include Robert's patch. However, it is given there for reference
purposes only. Except for the ones mentioned explicitly, the rest of
the configurable values used are default. The default for RX/TX mitigation
is 64 microseconds and the default ring size is 80.

I have added statistics collected during the test to my web site. I do
want to analyze and understand how NAPI can be improved in my tcp_stream
test. Last year around November, when I first tested NAPI, I did find NAPI
results better than the baseline using udp_stream. However I am
concentrating on tcp_stream since that is where NAPI can be improved in
my setup. I will update the website as I do more work on this.
Post by Andrew Morton
Robert can comment on optimal settings
I saw Robert's postings. Looks like he may have a more recent version of the
NAPI driver than the one I used. I also see 2.5.33 has NAPI; I will move to
2.5.33 and continue my work on that.
Post by Andrew Morton
Robert and Jamal can make a more detailed analysis of Mala's
graphs than I.
Jamal asked about the socket buffer size that I used. I have tried a 132k
socket buffer size in the past and I didn't see much difference in my
tests. I will add that to my list again.


Regards,
Mala


Mala Anand
IBM Linux Technology Center - Kernel Performance
E-mail:***@us.ibm.com
http://www-124.ibm.com/developerworks/opensource/linuxperf
http://www-124.ibm.com/developerworks/projects/linuxperf
Phone:838-8088; Tie-line:678-8088