Discussion:
maximum buffer size for splice(2) tcp->pipe?
Volker Lendecke
2009-01-08 10:13:51 UTC
Hi!

While implementing splice support in Samba for better
performance I found it blocking when trying to pull data off
tcp into a pipe when the recvq was full. Attached find a
test program that shows this behaviour, on another host I
started

netcat 192.168.19.10 4711 < /dev/zero

***@lenny:~$ uname -a
Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 = GNU/Linux
***@lenny:~$ gcc -o splicetest /host/home/vlendec/splicetest.c -O3 -Wall
***@lenny:~$ ./splicetest out 65536 &
[1] 697
***@lenny:~$ strace -p 697
Process 697 attached - interrupt to quit
splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
Process 697 detached
***@lenny:~$ netstat -nt | grep 4711
tcp 69272 0 192.168.19.10:4711 192.168.19.1:33773 ESTABLISHED
***@lenny:~$

Interestingly, whenever I start the strace, it gets another
chunk of data and then blocks in the next splice call.

If I start splicetest with a buffer size of 16384 instead of
65536, it does not block. I could not find a way to ask the
kernel for the tipping point below which it does not block.

What is a safe buffer size to use with splice?

BTW, this kernel is from Steve French's linux-cifs.git repo.

Thanks,

Volker Lendecke

Samba Team

P.S: I'm not subscribed to linux-kernel, so if possible
please CC me directly. If this is inappropriate behaviour,
please give me a quick hint :-)
--
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
Andrew Morton
2009-01-13 20:37:02 UTC
(cc's added)

On Thu, 8 Jan 2009 11:13:51 +0100
Post by Volker Lendecke
Hi!
While implementing splice support in Samba for better
performance I found it blocking when trying to pull data off
tcp into a pipe when the recvq was full. Attached find a
test program that shows this behaviour, on another host I
started
netcat 192.168.19.10 4711 < /dev/zero
Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 = GNU/Linux
[1] 697
Process 697 attached - interrupt to quit
splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
Process 697 detached
tcp 69272 0 192.168.19.10:4711 192.168.19.1:33773 ESTABLISHED
Interestingly, whenever I start the strace, it gets another
chunk of data and then blocks in the next splice call.
If I start splicetest with a buffer size of 16384 instead of
65536, it does not block. I could not find a way to ask the
kernel for the tipping point below which it does not block.
What is a safe buffer size to use with splice?
BTW, this kernel is from Steve French's linux-cifs.git repo.
Thanks,
Volker Lendecke
Samba Team
P.S: I'm not subscribed to linux-kernel, so if possible
please CC me directly. If this is inappropriate behaviour,
please give me a quick hint :-)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric Dumazet
2009-01-13 23:15:04 UTC
Post by Andrew Morton
(cc's added)
On Thu, 8 Jan 2009 11:13:51 +0100
Post by Volker Lendecke
Hi!
While implementing splice support in Samba for better
performance I found it blocking when trying to pull data off
tcp into a pipe when the recvq was full. Attached find a
test program that shows this behaviour, on another host I
started
netcat 192.168.19.10 4711 < /dev/zero
Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 = GNU/Linux
[1] 697
Process 697 attached - interrupt to quit
splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
Volker, your splice() is a blocking one, from a tcp socket to a pipe?

If no other thread is reading the pipe, then you might block forever
in splice_to_pipe() as soon as the pipe is full (16 pages).

As pages are not necessarily full (each skb will use at least one page,
even if its length is small), it is not really possible to use splice()
like this.

In your case, the only safe way with the current kernel would be to call
splice() asking for no more than 16 bytes, which would be really insane
for your needs.

You may prefer a non-blocking mode, at least when calling splice_to_pipe().

Maybe the SPLICE_F_NONBLOCK splice() flag should only apply on the pipe side.
tcp_splice_read() should not use this flag to select a blocking/nonblocking
mode on the source socket, but the underlying file flag.

This way, your program could leave the socket in blocking mode, yet call
splice() with the SPLICE_F_NONBLOCK flag to not block on the pipe.

Eric Dumazet
2009-01-13 23:38:32 UTC
Post by Eric Dumazet
Post by Andrew Morton
(cc's added)
On Thu, 8 Jan 2009 11:13:51 +0100
Post by Volker Lendecke
Hi!
While implementing splice support in Samba for better
performance I found it blocking when trying to pull data off
tcp into a pipe when the recvq was full. Attached find a
test program that shows this behaviour, on another host I
started
netcat 192.168.19.10 4711 < /dev/zero
Linux lenny 2.6.28-06857-g5cbd04a #7 Wed Jan 7 10:10:42 CET 2009 x86_64 = GNU/Linux
[1] 697
Process 697 attached - interrupt to quit
splice(0x3, 0, 0x5, 0, 0x56a0, 0x1) = 22176
splice(0x7, 0, 0x4, 0, 0x10000, 0x1^C <unfinished ...>
Post by Eric Dumazet
Volker, your splice() is a blocking one, from tcp socket to a pipe ?
If no other thread is reading the pipe, then you might block forever
in splice_to_pipe() as soon pipe is full (16 pages).
As pages are not necessarily full (each skb will use at least one page, even if
its length is small), it is not really possible to use splice() like this.
In your case, only safe way with current kernel would be to call splice()
asking for no more than 16 bytes, that would be really insane for your needs.
You may prefer a non blocking mode, at least when calling splice_to_pipe()
Maybe SPLICE_F_NONBLOCK splice() flag should only apply on pipe side.
tcp_splice_read() should not use this flag to select a blocking/nonblocking
mode on the source socket, but underlying file flag.
This way, your program could let socket in blocking mode, yet call splice()
with SPLICE_F_NONBLOCK flag to not block on pipe.
This patch, coupled with the previous one from Willy Tarreau
(tcp: splice as many packets as possible at once)
gives the expected result.

[PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK

Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
for selecting a non blocking socket.

Signed-off-by: Eric Dumazet <***@cosmosbay.com>

diff --git a/include/linux/net.h b/include/linux/net.h
index 4515efa..10e38d1 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -185,7 +185,7 @@ struct proto_ops {
struct vm_area_struct * vma);
ssize_t (*sendpage) (struct socket *sock, struct page *page,
int offset, size_t size, int flags);
- ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
+ ssize_t (*splice_read)(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
};

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 218235d..e8e7f80 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -309,7 +309,7 @@ extern int tcp_twsk_unique(struct sock *sk,

extern void tcp_twsk_destructor(struct sock *sk);

-extern ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
+extern ssize_t tcp_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);

static inline void tcp_dec_quickack_mode(struct sock *sk,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ce572f9..c777d88 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -548,10 +548,11 @@ static int __tcp_splice_read(struct sock *sk, struct tcp_splice_state *tss)
* Will read pages from given socket and fill them into a pipe.
*
**/
-ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
+ssize_t tcp_splice_read(struct file *file, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len,
unsigned int flags)
{
+ struct socket *sock = file->private_data;
struct sock *sk = sock->sk;
struct tcp_splice_state tss = {
.pipe = pipe,
@@ -572,7 +573,7 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,

lock_sock(sk);

- timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+ timeo = sock_rcvtimeo(sk, file->f_flags & O_NONBLOCK);
while (tss.len) {
ret = __tcp_splice_read(sk, &tss);
if (ret < 0)
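
To make the effect of the last hunk concrete, here is a simplified user-space model, not kernel code. It assumes that sock_rcvtimeo() yields a zero timeout for a non-blocking caller and the socket's receive timeout otherwise; model_rcvtimeo(), timeo_before() and timeo_after() are hypothetical names used only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>

/*
 * Simplified stand-in for the kernel's sock_rcvtimeo() (an assumption of
 * this sketch): a non-blocking caller gets a zero timeout, anyone else
 * gets the socket's configured receive timeout.
 */
static long model_rcvtimeo(long sk_rcvtimeo, int noblock)
{
    return noblock ? 0 : sk_rcvtimeo;
}

/* Before the patch: the splice() flag decides whether the socket read may sleep. */
static long timeo_before(long sk_rcvtimeo, unsigned int splice_flags, int f_flags)
{
    (void)f_flags;      /* the file status flags are ignored */
    return model_rcvtimeo(sk_rcvtimeo, (splice_flags & SPLICE_F_NONBLOCK) != 0);
}

/* After the patch: the socket's own O_NONBLOCK file flag decides. */
static long timeo_after(long sk_rcvtimeo, unsigned int splice_flags, int f_flags)
{
    (void)splice_flags; /* SPLICE_F_NONBLOCK now only concerns the pipe */
    return model_rcvtimeo(sk_rcvtimeo, (f_flags & O_NONBLOCK) != 0);
}
```

In this model, a blocking socket spliced with SPLICE_F_NONBLOCK gets a zero timeout before the patch and its normal receive timeout after it, which is the behaviour change the patch description asks for.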

David Miller
2009-01-15 04:58:39 UTC
From: Eric Dumazet <***@cosmosbay.com>
Date: Wed, 14 Jan 2009 00:38:32 +0100
[PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
for selecting a non blocking socket.
This needs at least some more thought.

It seems, for one thing, that this change will interfere with the
intentions of the code in splice_direct_to_actor which goes:

/*
* Don't block on output, we have to drain the direct pipe.
*/
sd->flags &= ~SPLICE_F_NONBLOCK;
Eric Dumazet
2009-01-15 11:47:22 UTC
Post by David Miller
Date: Wed, 14 Jan 2009 00:38:32 +0100
[PATCH] net: splice() from tcp to socket should take into account O_NONBLOCK
Instead of using SPLICE_F_NONBLOCK to select a non blocking mode both on
source tcp socket and pipe destination, we use the underlying file flag (O_NONBLOCK)
for selecting a non blocking socket.
This needs at least some more thought.
It seems, for one thing, that this change will interfere with the
intentions of the code in splice_direct_to_actor which goes:
/*
* Don't block on output, we have to drain the direct pipe.
*/
sd->flags &= ~SPLICE_F_NONBLOCK;
Reading splice_direct_to_actor(), I see nothing wrong with the patch.

(The patch is about splice from socket to pipe, while the sd->flags you mention
in splice_direct_to_actor() only applies to the splice from the internal pipe to
something else, as splice_direct_to_actor() allocates an internal pipe to perform
its work.)

Also, the meaning of SPLICE_F_NONBLOCK, as explained in include/linux/splice.h, is:

#define SPLICE_F_NONBLOCK (0x02) /* don't block on the pipe splicing (but */
/* we may still block on the fd we splice */
/* from/to, of course */

If the comment is still correct, SPLICE_F_NONBLOCK only applies to the pipe
involved in the splice() syscall.

For the other file, it is either:
- A regular file: nonblocking mode is not available, like a normal
  read()/write() syscall

- A socket: we should be able to specify whether it is blocking or not,
  independently of the SPLICE_F_NONBLOCK flag that only applies to the pipe.
  The normal way is using ioctl(FIONBIO) or an fcntl() call to change
  file->f_flags O_NONBLOCK
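
For reference, toggling O_NONBLOCK on a descriptor with fcntl() can be sketched as follows. set_nonblocking() is a hypothetical helper name, not something taken from Samba or the kernel.

```c
#include <assert.h>
#include <fcntl.h>

/*
 * Hypothetical helper: switch O_NONBLOCK on (on != 0) or off (on == 0)
 * for an open descriptor, leaving the other file status flags untouched.
 * Returns 0 on success, -1 on error with errno set by fcntl().
 */
static int set_nonblocking(int fd, int on)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    flags = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

With the patch applied, calling set_nonblocking(sock_fd, 0) would leave the socket blocking no matter which flags are later passed to splice().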


In order to be able to efficiently use splice() from a socket to a file, we need
to do a loop of:

{
splice(from blocking tcp socket to non blocking pipe, SPLICE_F_NONBLOCK); /* nonblocking pipe or risk deadlock */
splice(from pipe to file)
}
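
The loop above could be fleshed out roughly as follows. This is a sketch under assumptions, not the Samba implementation: splice_relay() and its parameters are hypothetical, the pipe is created by the caller, and the code relies on the semantics proposed in the patch (socket blocking mode taken from O_NONBLOCK, SPLICE_F_NONBLOCK applying only to the pipe).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy from in_fd (e.g. a blocking TCP socket) to
 * out_fd (e.g. a regular file) through the caller's pipe until end of input.
 * Returns the total number of bytes copied, or -1 on error with errno set.
 */
static long long splice_relay(int in_fd, int out_fd, int pipefd[2], size_t chunk)
{
    long long total = 0;

    assert(chunk > 0);
    for (;;) {
        /* Step 1: source -> pipe; never sleep waiting for pipe space. */
        ssize_t in = splice(in_fd, NULL, pipefd[1], NULL, chunk,
                            SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (in == 0)
            return total;               /* end of input */
        if (in < 0) {
            if (errno == EINTR)
                continue;
            return -1;                  /* EAGAIN included: poll() and retry */
        }

        /* Step 2: pipe -> destination; drain everything step 1 queued. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, NULL, (size_t)in,
                                 SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            if (out == 0) {             /* unexpected: nothing was moved */
                errno = EIO;
                return -1;
            }
            in -= out;
            total += out;
        }
    }
}
```

Because each iteration drains exactly what it queued, the pipe is empty whenever the first splice() runs, so that call does not have to wait for pipe space. On a kernel without the patch, the first splice() can return EAGAIN when the socket has no data; the function then returns -1 and the caller would have to poll() and call it again.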


Volker Lendecke
2009-01-14 07:40:08 UTC
Post by Eric Dumazet
Volker, your splice() is a blocking one, from tcp socket to a pipe ?
Yes, it is.
Post by Eric Dumazet
If no other thread is reading the pipe, then you might block forever
in splice_to_pipe() as soon pipe is full (16 pages).
Why does it block when the pipe is full? Why doesn't it
return a short read, just like the read(2) call does? We
need to cope with that behaviour anyway.
Post by Eric Dumazet
As pages are not necessarily full (each skb will use at least one page, even if
its length is small), it is not really possible to use splice() like this.
In your case, only safe way with current kernel would be to call splice()
asking for no more than 16 bytes, that would be really insane for your needs.
You may prefer a non blocking mode, at least when calling splice_to_pipe()
Which fd do I have to set the nonblocking flag on? The TCP
socket I read from, or the pipe I write to?

Thanks for the hint anyway :-)

Volker
Eric Dumazet
2009-01-14 09:13:34 UTC
Post by Volker Lendecke
Post by Eric Dumazet
Volker, your splice() is a blocking one, from tcp socket to a pipe ?
Yes, it is.
Post by Eric Dumazet
If no other thread is reading the pipe, then you might block forever
in splice_to_pipe() as soon pipe is full (16 pages).
Why does it block when the pipe is full? Why doesn't it
return a short read, just like the read(2) call does? We
need to cope with that behaviour anyway.
Well, check the code in fs/splice.c, function splice_to_pipe().

If SPLICE_F_NONBLOCK is not set, it is *expected* to block on the pipe.

In this mode, only another thread is able to drain the pipe and wake up the blocked thread.

Code review:

When all pages are used, "if (pipe->nrbufs == PIPE_BUFFERS)":

if (spd->flags & SPLICE_F_NONBLOCK) {
    if (!ret)
        ret = -EAGAIN;
    break;
}

if (signal_pending(current)) {
    if (!ret)
        ret = -ERESTARTSYS;
    break;
}

if (do_wakeup) {
    smp_mb();
    if (waitqueue_active(&pipe->wait))
        wake_up_interruptible_sync(&pipe->wait);
    kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
    do_wakeup = 0;
}

pipe->waiting_writers++;
HERE >> pipe_wait(pipe);
pipe->waiting_writers--;
Post by Volker Lendecke
Post by Eric Dumazet
As pages are not necessarily full (each skb will use at least one page, even if
its length is small), it is not really possible to use splice() like this.
In your case, only safe way with current kernel would be to call splice()
asking for no more than 16 bytes, that would be really insane for your needs.
You may prefer a non blocking mode, at least when calling splice_to_pipe()
Which fd do I have to set the nonblocking flag on? The TCP
socket I read from, or the pipe I write to?
I would say, use the SPLICE_F_NONBLOCK flag on the splice() system call,
but leave the tcp socket in blocking mode... But with the current kernel it
won't work. In order to avoid busy looping, you might add a poll()/select()
to call splice(SPLICE_F_NONBLOCK) only when the socket has data
in its receive queue.

for (;;) {
    struct pollfd pfd;
    pfd.fd = socket;
    pfd.events = POLLIN;
    if (poll(&pfd, 1, -1) != 1)
        continue;
    res = splice(socket, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
    if (res > 0)
        nwritten = splice(pipefds[0], NULL, file_fd, NULL, res, SPLICE_F_MOVE|SPLICE_F_MORE);
}

Unfortunately, splice() from a tcp socket to a pipe does not work as is without
SPLICE_F_NONBLOCK when the same thread both writes and reads the pipe: it risks
a deadlock.


Volker Lendecke
2009-01-14 10:03:30 UTC
Post by Eric Dumazet
for (;;) {
struct pollfd pfd;
pfd.fd = socket;
pfd.events = POLLIN;
if (poll(&pfd, 1, -1) != 1)
continue;
res = splice(socket, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
if (res > 0)
nwritten = splice(pipefds[0], NULL, file_fd, NULL, res, SPLICE_F_MOVE|SPLICE_F_MORE);
}
Doesn't this reduce performance again? I thought the whole
point of splice() was to increase performance by avoiding
memory copies. If I have to do a poll syscall for each call
to splice, doesn't the context switch eat that performance
advantage again?

Or was splice designed only for multi-threaded applications
(which at least Samba is not)?

Volker
Eric Dumazet
2009-01-14 10:17:57 UTC
Post by Volker Lendecke
Post by Eric Dumazet
for (;;) {
struct pollfd pfd;
pfd.fd = socket;
pfd.events = POLLIN;
if (poll(&pfd, 1, -1) != 1)
continue;
res = splice(socket, NULL, pipefds[1], NULL, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
if (res > 0)
nwritten = splice(pipefds[0], NULL, file_fd, NULL, res, SPLICE_F_MOVE|SPLICE_F_MORE);
}
Doesn't this reduce performance again? I thought the whole
point of splice() was to increase performance by avoiding
memory copies. If I have to do a poll syscall for each call
to splice, doesn't the context switch eat that performance
advantage again?
Or was splice designed only for multi-threaded applications
(which at least Samba is not)?
Volker
splice() avoids memory copies, yes, but on typical 1460-byte
frames it's a small gain.

But if no data is available on the socket,
you still have to wait (and have a context switch later).

Waiting in poll() or in splice() has the same context switch cost.

The only cost is the extra syscall, of course, but it is mandatory
if you want to avoid a possible deadlock in the current splice()
implementation.
