Discussion:
epoll and closed file descriptors
Gilad Benjamini
2009-09-16 23:22:55 UTC
I am running repeatedly into a scenario where epoll notifies userland of
events on a closed file descriptor.
I am running a single-threaded application on a single-CPU machine, so
multiple threads aren't the issue.

A sample sequence of events that I have seen:
- File descriptor (13) for a socket is closed.
- epoll_wait returns with no events.
- Several epoll-related calls happen.
- More than 20 seconds after the "close", epoll_wait finds an event on fd 13
with EPOLLIN|EPOLLERR|EPOLLHUP.
- epoll_wait continues to report this event.

Running kernel 2.6.24. Some technical problems are preventing me from trying
a newer kernel at the moment.

One more thing worth mentioning: the application uses libcurl, which leads to
a situation where the socket is closed before its descriptor has been removed
from the epoll set. The code should be able to handle that, AFAIK.

Any ideas would be appreciated.
Gilad
Davide Libenzi
2009-09-17 00:07:56 UTC
Post by Gilad Benjamini
I am running repeatedly into a scenario where epoll notifies userland of
events on a closed file descriptor.
I am running a single-threaded application on a single-CPU machine, so
multiple threads aren't the issue.
A sample sequence of events that I have seen:
- File descriptor (13) for a socket is closed.
- epoll_wait returns with no events.
- Several epoll-related calls happen.
- More than 20 seconds after the "close", epoll_wait finds an event on fd 13
with EPOLLIN|EPOLLERR|EPOLLHUP.
- epoll_wait continues to report this event.
Epoll removes the fd from its container when the last instance of the
underlying kernel file pointer is released (or when you explicitly
remove it with epoll_ctl(EPOLL_CTL_DEL)).
If you continue to get the event, it means that someone else has an
instance of the socket (that, looking at the events, saw a shutdown) open,
thereby keeping the kernel object alive.
If you don't want to see the events, just remove the socket from the epoll
set before closing.
Or remove the socket the first time you see an EPOLLHUP.



- Davide
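The two remedies suggested above (deregister before closing, or deregister the first time a hang-up is seen) might look roughly like the following. This is only a minimal sketch: close_tracked, handle_event and demo_remove_on_hup are hypothetical names, and a Unix socket pair stands in for the real network socket.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Remedy 1: deregister first, then close, so no stale entry can remain. */
int close_tracked(int epfd, int fd)
{
    /* The event argument is ignored for EPOLL_CTL_DEL, but kernels before
       2.6.9 required it to be non-NULL, so pass a dummy. */
    struct epoll_event dummy = {0};
    int rc = epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &dummy);
    close(fd);
    return rc;
}

/* Remedy 2: drop a descriptor the first time it reports a hang-up or error. */
void handle_event(int epfd, const struct epoll_event *ev)
{
    if (ev->events & (EPOLLHUP | EPOLLERR))
        close_tracked(epfd, ev->data.fd);
}

/* Hypothetical demo: returns the number of events still reported after the
   hang-up has been handled, or -1 if setup fails. */
int demo_remove_on_hup(void)
{
    int sv[2], epfd, n;
    struct epoll_event ev = {0}, out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        return -1;

    close(sv[1]);                      /* peer goes away: hang-up on sv[0] */
    n = epoll_wait(epfd, &out, 1, 0);
    if (n == 1)
        handle_event(epfd, &out);      /* removes and closes sv[0] */

    n = epoll_wait(epfd, &out, 1, 0);  /* nothing is registered any more */
    close(epfd);
    return n;
}
```

Both remedies assume the application still holds a valid descriptor number for the socket at the moment it calls epoll_ctl.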
Gilad Benjamini
2009-09-17 00:23:09 UTC
Post by Davide Libenzi
Epoll removes the fd from its container when the last instance of the
underlying kernel file pointer is released (or when you explicitly
remove it with epoll_ctl(EPOLL_CTL_DEL)).
If you continue to get the event, it means that someone else has an
instance of the socket (that, looking at the events, saw a shutdown) open,
thereby keeping the kernel object alive.
If you don't want to see the events, just remove the socket from the epoll
set before closing.
Or remove the socket the first time you see an EPOLLHUP.
- Davide
I would, but epoll is preventing me from doing so.
Early in sys_epoll_ctl there are these lines:

    file = fget(epfd);
    if (!file)
        goto error_return;

leaving me in a kind of deadlock.

Gilad
Bryan Donlan
2009-09-17 00:28:00 UTC
On Wed, Sep 16, 2009 at 8:23 PM, Gilad Benjamini
Post by Gilad Benjamini
I would, but epoll is preventing me from doing so.
Early in sys_epoll_ctl there are these lines:
    file = fget(epfd);
    if (!file)
        goto error_return;
leaving me in a kind of deadlock.
You could dup() the file descriptor before handing it to curl, and use
your own copy for epoll operations.
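A minimal sketch of the dup() approach follows. It assumes the application can obtain the socket descriptor before the library takes ownership of it. demo_dup_workaround is a hypothetical name, and closing sv[0] merely stands in for the library closing its descriptor; no libcurl calls are shown.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the result of EPOLL_CTL_DEL on the private copy (0 on success),
   or -1 if setup fails. */
int demo_dup_workaround(void)
{
    int sv[2], mine, epfd, rc;
    struct epoll_event ev = {0};

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    mine = dup(sv[0]);      /* private copy, same underlying file object */
    if (mine < 0)
        return -1;

    /* Register the copy, not the descriptor the library will close. */
    ev.events = EPOLLIN;
    ev.data.fd = mine;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, mine, &ev) < 0)
        return -1;

    close(sv[0]);           /* stands in for the library closing its fd */

    /* The copy is still a valid name for the file, so removal works. */
    rc = epoll_ctl(epfd, EPOLL_CTL_DEL, mine, &ev);
    close(mine);
    close(sv[1]);
    close(epfd);
    return rc;
}
```

The copy has to be the descriptor that is registered, because epoll identifies an entry by the descriptor number used at registration together with the file object. The application also has to remember to delete and close its copy, otherwise the socket stays open.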
Davide Libenzi
2009-09-17 00:30:18 UTC
Post by Gilad Benjamini
I would, but epoll is preventing me from doing so.
Early in sys_epoll_ctl there are these lines:
    file = fget(epfd);
    if (!file)
        goto error_return;
leaving me in a kind of deadlock.
The 'epfd' in there is the _epoll fd_; if fget() fails on it, it means you
closed the epoll fd itself.
What you are likely seeing fail is the 'tfile = fget(fd)' (of course, since
you closed it). So if someone else keeps the socket open and you have no way
of telling it to drop it (really?), you need to remove the socket from the set
before closing it.



- Davide
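The failure described above can be reproduced from userspace with a short sketch. It assumes a Unix socket pair in place of the real socket, and del_after_close_errno is a hypothetical name. A duplicate descriptor plays the role of the "someone else" who keeps the socket open.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the errno from EPOLL_CTL_DEL on an already-closed target fd,
   0 if the call unexpectedly succeeds, or -1 if setup fails. */
int del_after_close_errno(void)
{
    int sv[2], keeper, epfd, rc, err;
    struct epoll_event ev = {0};

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        return -1;

    keeper = dup(sv[0]);    /* a second reference keeps the file alive */
    if (keeper < 0)
        return -1;
    close(sv[0]);           /* the registered number is now invalid */

    /* The lookup of the target descriptor fails before epoll ever
       searches its own set. */
    rc = epoll_ctl(epfd, EPOLL_CTL_DEL, sv[0], &ev);
    err = (rc < 0) ? errno : 0;

    close(keeper);
    close(sv[1]);
    close(epfd);
    return err;
}
```

The call is expected to fail with EBADF, which is the error epoll_ctl documents for an invalid target descriptor.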
Gilad Benjamini
2009-09-17 00:40:49 UTC
Post by Davide Libenzi
Post by Gilad Benjamini
I would, but epoll is preventing me from doing so.
Early in sys_epoll_ctl there are these lines:
    file = fget(epfd);
    if (!file)
        goto error_return;
leaving me in a kind of deadlock.
The 'epfd' in there is the _epoll fd_; if fget() fails on it, it means you
closed the epoll fd itself.
What you are likely seeing fail is the 'tfile = fget(fd)' (of course, since
you closed it). So if someone else keeps the socket open and you have no way
of telling it to drop it (really?), you need to remove the socket from the set
before closing it.
- Davide
My bad. I meant to quote the line that you mentioned.
I agree that the right thing to do is to remove the fd from epoll before
closing it.
However, due to the way curl works, I cannot do that. Changing the curl code
doesn't seem trivial.

Regardless, I still don't see how the kernel got into this situation, and if
this situation is valid, why it doesn't bail out of it.
Bryan Donlan
2009-09-17 00:45:18 UTC
On Wed, Sep 16, 2009 at 8:40 PM, Gilad Benjamini
Post by Gilad Benjamini
My bad. I meant to quote the line that you mentioned.
I agree that the right thing to do is to remove the fd from epoll before
closing it.
However, due to the way curl works, I cannot do that. Changing the curl code
doesn't seem trivial.
Regardless, I still don't see how the kernel got into this situation, and if
this situation is valid, why it doesn't bail out of it.
epoll references the underlying file object; the fd is used _only_ to
obtain this file object, and then never used again. Determining when
the fd goes away then requires iterating over all fds, and since epoll
was designed to avoid doing exactly that, it isn't an acceptable
solution.

As I mentioned in the other email, by dup()ing the file descriptor,
you can get control over when the copy is closed without changing
curl. Then just make sure to remove it from the epoll set before
closing your copy.
Gilad Benjamini
2009-09-17 00:53:05 UTC
Post by Bryan Donlan
epoll references the underlying file object; the fd is used _only_ to
obtain this file object, and then never used again. Determining when
the fd goes away then requires iterating over all fds, and since epoll
was designed to avoid doing exactly that, it isn't an acceptable
solution.
Regarding bailing out of the situation, I see the logic in your answer.
What about the first part? Any ideas how the kernel actually got into that
tight spot?
Looking into the code, I can't find a path that can lead to this situation.
Bryan Donlan
2009-09-17 00:57:18 UTC
Post by Gilad Benjamini
Regarding bailing out of the situation, I see the logic in your answer.
What about the first part? Any ideas how the kernel actually got into that
tight spot?
Looking into the code, I can't find a path that can lead to this situation.

Userspace passes in a fd that is unused (closed). Kernel can't find
the file object corresponding to the fd (because the fd went away when
it wasn't looking), so it says to userspace, "Sorry, no such file!".
And it's completely correct.

So, the basic problem is userspace becomes unable to refer to the file
object. The kernel's doing just fine; it's not in a "tight spot" at
all. It's perfectly happy to refer to the file object by a direct
pointer to it - it's just that userspace is unable to tell the kernel to
remove it from the epoll, because the only name userspace had for it
has been removed.
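The behaviour described here, where epoll follows the file object rather than the descriptor number, can be observed with a small sketch. events_after_close is a hypothetical helper, and a Unix socket pair is assumed in place of the real socket.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Registers one end of a socket pair, closes it, then closes the peer.
   If keep_dup is nonzero, a duplicate descriptor keeps the file object
   alive. Returns the number of events epoll_wait reports afterwards,
   or -1 if setup fails. */
int events_after_close(int keep_dup)
{
    int sv[2], keeper = -1, epfd, n;
    struct epoll_event ev = {0}, out;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        return -1;

    if (keep_dup) {
        keeper = dup(sv[0]);    /* second name for the same file object */
        if (keeper < 0)
            return -1;
    }

    close(sv[0]);               /* the registered descriptor is closed */
    close(sv[1]);               /* the peer hangs up */

    /* With a duplicate still open, the entry survives and the hang-up is
       reported, tagged with the old descriptor number in out.data.fd. */
    n = epoll_wait(epfd, &out, 1, 0);

    if (keeper >= 0)
        close(keeper);
    close(epfd);
    return n;
}
```

With keep_dup set, epoll_wait reports an event for a descriptor number the caller has already closed, which matches the symptom in the original report. Without the duplicate, closing the descriptor releases the file object and the entry disappears with it.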

Davide Libenzi
2009-09-17 00:55:22 UTC
Post by Gilad Benjamini
My bad. I meant to quote the line that you mentioned.
I agree that the right thing to do is to remove the fd from epoll before
closing it.
However, due to the way curl works, I cannot do that. Changing the curl code
doesn't seem trivial.
You probably need to explain how you use libcurl together with epoll, even
though this is not a help-userspace mailing list.
Post by Gilad Benjamini
Regardless, I still don't see how the kernel got into this situation, and if
this situation is valid, why it doesn't bail out of it.
I think this was already explained.


- Davide