Discussion:
PROBLEM: pthread-safety bug in write(2) on Linux 2.6.x
Dan Bonachea
2006-04-13 01:45:28 UTC
Permalink
Hi - I believe I've discovered a thread-safety bug in the Linux 2.6.x kernel
implementation of write(2).

The small C program below the problem - in a nutshell, if multiple threads
write() to STDOUT_FILENO, and stdout has been redirected to a file, then some
of the output lines get lost. The actual result is non-deterministic (even in
a "correct" run) - however the expected correct behavior is 10 lines of output
(in some non-deterministic order). However, the test program reproducibly
generates some lost output (less than 10 lines of total output) on every run
where the output is redirected to a new file. This appears to be a violation
of the POSIX spec (POSIX 1003.1-2001:2.9.1 requires thread-safety of write()).


The problem does not appear to occur if output goes to the console, or is
redirected to append to an existing file, only when stdout is redirected to a
new file.

---------------------------------------------------------------------------
/* Instructions:
* compile as: gcc write-bug.c -lpthread
* run with: ./a.out
* note that lines from all 10 threads are visible
* run with: ./a.out > output
* note that lines from several threads are missing from the output file
*/
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
volatile int flag = 0;
void *start(void *arg) {
char msg[255];
int res;
sprintf(msg,"hi from %i\n",(int)arg);
if ((int)arg == 9) flag = 1;
while (!flag) ; /* thread barrier */
#if 1
res = write(STDOUT_FILENO,msg,strlen(msg));
if (res != strlen(msg)) fprintf(stderr,"Failure: %i
%s\n",res,strerror(res));
#else
fputs(msg,stdout);
#endif
#if 1 /* work extra hard to flush output (makes no difference) */
fflush(NULL);
fsync(STDOUT_FILENO);
sync();
#endif
return NULL;
}
int main() {
int i;
/* create 10 pthreads */
pthread_t id[10];
for (i =0 ; i < 10; i++) {
int ret = pthread_create(&(id[i]),NULL,&start,(void*)i);
if (ret) printf("pthread_create: %s\n",strerror(ret));
}
/* join 10 pthreads */
for (i =0 ; i < 10; i++) {
pthread_join(id[i], NULL);
}
sleep(1);
return 0;
}
---------------------------------------------------------------------------

The problem occurs on every Linux 2.6 machine I've tried, including ones with
Itanium-2, Athlon, Opteron and Pentium 4 chips, both SMP's and uniprocessors,
using a variety of recent gcc versions. The problem does *not* appear to occur
on any Linux 2.4 machine I've tried, even if I use the same executable which
fails on the Linux 2.6 machine. Finally, replacing write(STDOUT_FILENO) with
fputs(stdout) makes the problem disappear (presumably due to locking in libc).

I don't have administrative access to upgrade the kernel on these machines,
however below is the full machine info for the most recent installed kernel
version I have access to.

Any suggestions are appreciated.
Thanks,
Dan

# gcc --version
gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# cat /proc/version
Linux version 2.6.12-1.1376_FC3smp (***@tweety.build.redhat.com) (gcc
version 3.4.4 20050721 (Red Hat 3.4.4-2)) #1 SMP Fri Aug 26 23:50:33 EDT 2005
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 9
cpu MHz : 2395.041
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 4734.97

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 9
cpu MHz : 2395.041
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 4784.12
# cat /proc/modules
ipt_multiport 6465 2 - Live 0xe0a1d000
ipt_ULOG 11973 6 - Live 0xe0aff000
ipt_REJECT 9537 5 - Live 0xe0b03000
ipt_limit 6465 7 - Live 0xe0a2c000
ipt_state 5825 9 - Live 0xe0a20000
ip_conntrack 46249 1 ipt_state, Live 0xe0b54000
iptable_filter 6849 1 - Live 0xe09e3000
ip_tables 25025 6
ipt_multiport,ipt_ULOG,ipt_REJECT,ipt_limit,ipt_state,iptable_filter, Live
0xe0b3d000
parport_pc 30981 1 - Live 0xe0b0b000
lp 16713 0 - Live 0xe0a23000
parport 38793 2 parport_pc,lp, Live 0xe0af4000
autofs4 22725 11 - Live 0xe09ed000
nfs 196745 7 - Live 0xe0b77000
lockd 64745 2 nfs, Live 0xe09fa000
sunrpc 138629 3 nfs,lockd, Live 0xe0b1a000
dm_mod 59109 0 - Live 0xe0a0d000
video 19653 0 - Live 0xe09f4000
button 8001 0 - Live 0xe09aa000
battery 13253 0 - Live 0xe09e8000
ac 8773 0 - Live 0xe09a6000
md5 8001 1 - Live 0xe099d000
ipv6 262913 20 - Live 0xe0a30000
uhci_hcd 35665 0 - Live 0xe09d9000
ehci_hcd 37709 0 - Live 0xe0965000
hw_random 9429 0 - Live 0xe0961000
i2c_i801 12493 0 - Live 0xe089b000
i2c_core 25025 1 i2c_i801, Live 0xe0991000
snd_intel8x0 35585 0 - Live 0xe0935000
snd_ac97_codec 77497 1 snd_intel8x0, Live 0xe09c5000
snd_pcm_oss 53745 0 - Live 0xe090a000
snd_mixer_oss 21697 1 snd_pcm_oss, Live 0xe0894000
snd_pcm 91845 3 snd_intel8x0,snd_ac97_codec,snd_pcm_oss, Live 0xe09ad000
snd_timer 28357 1 snd_pcm, Live 0xe0861000
snd 58405 6
snd_intel8x0,snd_ac97_codec,snd_pcm_oss,snd_mixer_oss,snd_pcm,snd_timer, Live
0xe08fa000
soundcore 13217 1 snd, Live 0xe087a000
snd_page_alloc 13765 2 snd_intel8x0,snd_pcm, Live 0xe0875000
e1000 105517 0 - Live 0xe091a000
ext3 131529 2 - Live 0xe08d8000
jbd 61913 1 ext3, Live 0xe08a0000
ata_piix 13381 0 - Live 0xe0870000
libata 49093 1 ata_piix, Live 0xe087f000
sd_mod 22721 0 - Live 0xe0869000
scsi_mod 135561 2 libata,sd_mod, Live 0xe081a000
# cat /proc/ioports
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
00c0-00df : dma2
00f0-00ff : fpu
0170-0177 : ide1
01f0-01f7 : ide0
02f8-02ff : serial
0376-0376 : ide1
0378-037a : parport0
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial
0800-087f : 0000:00:1f.0
0800-0803 : PM1a_EVT_BLK
0804-0805 : PM1a_CNT_BLK
0808-080b : PM_TMR
0810-0815 : ACPI CPU throttle
0828-082f : GPE0_BLK
0880-08bf : 0000:00:1f.0
0c00-0c7f : pnp 00:0a
0cf8-0cff : PCI conf1
df40-df7f : 0000:02:0c.0
df40-df7f : e1000
eda0-edbf : 0000:00:1f.3
eda0-edaf : i801-smbus
edc0-edff : 0000:00:1f.5
edc0-edff : Intel ICH5
ee00-eeff : 0000:00:1f.5
ee00-eeff : Intel ICH5
fe00-fe07 : 0000:00:1f.2
fe00-fe07 : libata
fe10-fe13 : 0000:00:1f.2
fe10-fe13 : libata
fe20-fe27 : 0000:00:1f.2
fe20-fe27 : libata
fe30-fe33 : 0000:00:1f.2
fe30-fe33 : libata
fea0-feaf : 0000:00:1f.2
fea0-feaf : libata
ff20-ff3f : 0000:00:1d.3
ff20-ff3f : uhci_hcd
ff40-ff5f : 0000:00:1d.2
ff40-ff5f : uhci_hcd
ff60-ff7f : 0000:00:1d.1
ff60-ff7f : uhci_hcd
ff80-ff9f : 0000:00:1d.0
ff80-ff9f : uhci_hcd
ffa0-ffaf : 0000:00:1f.1
ffa0-ffa7 : ide0
ffa8-ffaf : ide1
# cat /proc/iomem
00000000-0009ffff : System RAM
000a0000-000bffff : Video RAM area
000c0000-000cf7ff : Video ROM
000cf800-000cffff : Adapter ROM
000f0000-000fffff : System ROM
00100000-1ff73fff : System RAM
00100000-003027ae : Kernel code
003027af-003c2b7f : Kernel data
1ff74000-1ff75fff : ACPI Non-volatile Storage
1ff76000-1ff96fff : ACPI Tables
1ff97000-1fffffff : reserved
e8000000-efffffff : 0000:00:00.0
f0000000-f7ffffff : PCI Bus #01
f0000000-f7ffffff : 0000:01:00.0
fcfe0000-fcffffff : 0000:02:0c.0
fcfe0000-fcffffff : e1000
fd000000-feafffff : PCI Bus #01
fd000000-fdffffff : 0000:01:00.0
febff900-febff9ff : 0000:00:1f.5
febff900-febff9ff : Intel ICH5
febffa00-febffbff : 0000:00:1f.5
febffa00-febffbff : Intel ICH5
febffc00-febfffff : 0000:00:1f.1
fec00000-fec0ffff : reserved
fecf0000-fecf0fff : reserved
fed20000-fed8ffff : reserved
fee00000-fee0ffff : reserved
ffa80800-ffa80bff : 0000:00:1d.7
ffa80800-ffa80bff : ehci_hcd
ffb00000-ffffffff : reserved
# cat /proc/scsi/scsi
Attached devices:
# lspci -vvv
00:00.0 Host bridge: Intel Corporation 82875P/E7210 Memory Controller Hub (rev
02)
Subsystem: Dell: Unknown device 0156
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ >SERR- <PERR-
Latency: 0
Region 0: Memory at e8000000 (32-bit, prefetchable) [size=128M]
Capabilities: <available only to root>

00:01.0 PCI bridge: Intel Corporation 82875P Processor to AGP Controller (rev
02) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64
Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: fd000000-feafffff
Prefetchable memory behind bridge: f0000000-f7ffffff
Secondary status: 66Mhz+ FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA+ MAbort- >Reset- FastB2B-

00:1d.0 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #1 (rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 193
Region 4: I/O ports at ff80 [size=32]

00:1d.1 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #2 (rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 201
Region 4: I/O ports at ff60 [size=32]

00:1d.2 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI #3
(rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin C routed to IRQ 169
Region 4: I/O ports at ff40 [size=32]

00:1d.3 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI
Controller #4 (rev 02) (prog-if 00 [UHCI])
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 193
Region 4: I/O ports at ff20 [size=32]

00:1d.7 USB Controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI
Controller (rev 02) (prog-if 20 [EHCI])
Subsystem: Dell: Unknown device 0156
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin D routed to IRQ 185
Region 0: Memory at ffa80800 (32-bit, non-prefetchable) [size=1K]
Capabilities: <available only to root>

00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2) (prog-if 00
[Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Bus: primary=00, secondary=02, subordinate=02, sec-latency=32
I/O behind bridge: 0000d000-0000dfff
Memory behind bridge: fcf00000-fcffffff
Prefetchable memory behind bridge: fff00000-000fffff
Secondary status: 66Mhz- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort+ <SERR- <PERR-
BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-

00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface
Bridge (rev 02)
Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0

00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE
Controller (rev 02) (prog-if 8a [Master SecP PriP])
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 169
Region 0: I/O ports at <ignored>
Region 1: I/O ports at <ignored>
Region 2: I/O ports at <ignored>
Region 3: I/O ports at <ignored>
Region 4: I/O ports at ffa0 [size=16]
Region 5: Memory at febffc00 (32-bit, non-prefetchable) [size=1K]

00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev
02) (prog-if 8f [Master SecP SecO PriP PriO])
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 169
Region 0: I/O ports at fe00 [size=8]
Region 1: I/O ports at fe10 [size=4]
Region 2: I/O ports at fe20 [size=8]
Region 3: I/O ports at fe30 [size=4]
Region 4: I/O ports at fea0 [size=16]

00:1f.3 SMBus: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller (rev
02)
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Interrupt: pin B routed to IRQ 10
Region 4: I/O ports at eda0 [size=32]

00:1f.5 Multimedia audio controller: Intel Corporation 82801EB/ER (ICH5/ICH5R)
AC'97 Audio Controller (rev 02)
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin B routed to IRQ 177
Region 0: I/O ports at ee00 [size=256]
Region 1: I/O ports at edc0 [size=64]
Region 2: Memory at febffa00 (32-bit, non-prefetchable) [size=512]
Region 3: Memory at febff900 (32-bit, non-prefetchable) [size=256]
Capabilities: <available only to root>

01:00.0 VGA compatible controller: nVidia Corporation NV34GL [Quadro FX
500/600 PCI] (rev a1) (prog-if 00 [VGA])
Subsystem: nVidia Corporation: Unknown device 01ba
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (1250ns min, 250ns max)
Interrupt: pin A routed to IRQ 11
Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at f0000000 (32-bit, prefetchable) [size=128M]
Expansion ROM at fea00000 [disabled] [size=128K]
Capabilities: <available only to root>

02:0c.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet
Controller (rev 02)
Subsystem: Dell: Unknown device 0156
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (63750ns min), Cache Line Size 10
Interrupt: pin A routed to IRQ 169
Region 0: Memory at fcfe0000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at df40 [size=64]
Capabilities: <available only to root>

#
Alistair John Strachan
2006-04-13 02:10:27 UTC
Permalink
Post by Dan Bonachea
Hi - I believe I've discovered a thread-safety bug in the Linux 2.6.x
kernel implementation of write(2).
Hi, let's get the obvious out of the way first.

Is this reproducible on a current -stable, non-vendor kernel?
--
Cheers,
Alistair.

'No sense being pessimistic, it probably wouldn't work anyway.'
Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
Andrew Morton
2006-04-13 04:46:13 UTC
Permalink
Post by Dan Bonachea
Hi - I believe I've discovered a thread-safety bug in the Linux 2.6.x kernel
implementation of write(2).
The small C program below the problem - in a nutshell, if multiple threads
write() to STDOUT_FILENO, and stdout has been redirected to a file, then some
of the output lines get lost. The actual result is non-deterministic (even in
a "correct" run) - however the expected correct behavior is 10 lines of output
(in some non-deterministic order). However, the test program reproducibly
generates some lost output (less than 10 lines of total output) on every run
where the output is redirected to a new file. This appears to be a violation
of the POSIX spec (POSIX 1003.1-2001:2.9.1 requires thread-safety of write()).
The problem does not appear to occur if output goes to the console, or is
redirected to append to an existing file, only when stdout is redirected to a
new file.
This comes up occasionally. Is very simple:

asmlinkage ssize_t sys_write(unsigned int fd, const char __user * buf, size_t count)
{
struct file *file;
ssize_t ret = -EBADF;
int fput_needed;

file = fget_light(fd, &fput_needed);
if (file) {
loff_t pos = file_pos_read(file);
ret = vfs_write(file, buf, count, &pos);
file_pos_write(file, pos);
fput_light(file, fput_needed);
}

return ret;
}

we can have multiple threads in vfs_write() using the same `pos'.

Locking for file.f_pos is generally file->f_dentry->d_inode->i_mutex. We
could use that if we were to restructure the code a lot. Or we could add a
new lock to `struct file'.

Or we could do nothing, because a) the application is going to produce
inderterminate output anyway and b) because it only affects silly testcases
and not real-world apps.

OK, there _might_ be a real-world case: threads appending logging
information to a flat file. Trivially workable-around with a userspace
lock, or by switching to stdio (same thing).

Yes, really we should fix it. But it's not worth adding more overhead to
do so. So the fix would involve widespread (but simple) change, to draw
that f_pos update inside i_mutex.

(We could pseudo-fix it by updating f_pos _before_ entering vfs_write(), too).
Nick Piggin
2006-04-13 05:33:49 UTC
Permalink
Post by Andrew Morton
OK, there _might_ be a real-world case: threads appending logging
information to a flat file. Trivially workable-around with a userspace
lock, or by switching to stdio (same thing).
Yes, really we should fix it. But it's not worth adding more overhead to
do so. So the fix would involve widespread (but simple) change, to draw
that f_pos update inside i_mutex.
Didn't Linus explicitly made the decision not to add synchronisation for
writes with the same file?

http://groups.google.com/group/fa.linux.kernel/msg/ef66c762e737bab7?hl=en&
Is the closest I could find, but I'm sure he said something similar,
specifically about write(2) vs write(2).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Linus Torvalds
2006-04-13 15:01:25 UTC
Permalink
Post by Nick Piggin
Didn't Linus explicitly made the decision not to add synchronisation for
writes with the same file?
I would be _very_ nervous to do it, yes. In particular, I do not consider
it at all unlikely to have something database-like that does concurrent
writes to the same file at different offsets, and serializing them on the
VFS layer seems pretty broken.

Also, doing it unconditionally in the VFS layer is actually pretty
seriously broken: the VFS layer doesn't even know what kind of file it is,
and locking on writes can be literally the wrong thing for some file
descriptors (think pipes or sockets: if we have one blocking write holding
the lock, that should _not_ imply that other - possibly nonblocking -
writes shouldn't be able to call in to the low-level write handler).

That said, I wouldn't be 100% against it, especially under certain
circumstances. However, the circumstances when I think it might be
acceptable are fairly specific:

- when we use f_pos (ie we'd synchronize write against write, but only
per "struct file", not on an inode basis)
- only for files that are marked seekable with FMODE_LSEEK (thus avoiding
the stream-like objects like pipes and sockets)

Under those two circumstances, I'd certainly be ok with it, and I think we
could argue that it is a "good thing". It would be a "f_pos" lock (so we
migt do it for reads too), not a "data lock".

Comments?

Linus
Linus Torvalds
2006-04-13 15:28:27 UTC
Permalink
Post by Linus Torvalds
That said, I wouldn't be 100% against it, especially under certain
circumstances. However, the circumstances when I think it might be
- when we use f_pos (ie we'd synchronize write against write, but only
per "struct file", not on an inode basis)
- only for files that are marked seekable with FMODE_LSEEK (thus avoiding
the stream-like objects like pipes and sockets)
Under those two circumstances, I'd certainly be ok with it, and I think we
could argue that it is a "good thing". It would be a "f_pos" lock (so we
migt do it for reads too), not a "data lock".
Comments?
Something like this.

NOTE NOTE NOTE! Untested!

Linus
---
diff --git a/fs/file_table.c b/fs/file_table.c
index bcea199..48728cc 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -120,6 +120,7 @@ struct file *get_empty_filp(void)
f->f_uid = tsk->fsuid;
f->f_gid = tsk->fsgid;
eventpoll_init_file(f);
+ mutex_init(&f->f_pos_lock);
/* f->f_version: 0 */
return f;

diff --git a/fs/read_write.c b/fs/read_write.c
index 5bc0e92..ad9db26 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -329,14 +329,23 @@ ssize_t vfs_write(struct file *file, con

EXPORT_SYMBOL(vfs_write);

-static inline loff_t file_pos_read(struct file *file)
+/*
+ * FMODE_LSEEK will never change for a file during its lifetime,
+ * so it's ok to test it independently on the lock/unlock path
+ * rather than explicitly remembering whether we locked it.
+ */
+static inline loff_t file_pos_read_lock(struct file *file)
{
+ if (file->f_mode & FMODE_LSEEK)
+ mutex_lock(&file->f_pos_lock);
return file->f_pos;
}

-static inline void file_pos_write(struct file *file, loff_t pos)
+static inline void file_pos_write_unlock(struct file *file, loff_t pos)
{
file->f_pos = pos;
+ if (file->f_mode & FMODE_LSEEK)
+ mutex_unlock(&file->f_pos_lock);
}

asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
@@ -347,9 +356,9 @@ asmlinkage ssize_t sys_read(unsigned int

file = fget_light(fd, &fput_needed);
if (file) {
- loff_t pos = file_pos_read(file);
+ loff_t pos = file_pos_read_lock(file);
ret = vfs_read(file, buf, count, &pos);
- file_pos_write(file, pos);
+ file_pos_write_unlock(file, pos);
fput_light(file, fput_needed);
}

@@ -365,9 +374,9 @@ asmlinkage ssize_t sys_write(unsigned in

file = fget_light(fd, &fput_needed);
if (file) {
- loff_t pos = file_pos_read(file);
+ loff_t pos = file_pos_read_lock(file);
ret = vfs_write(file, buf, count, &pos);
- file_pos_write(file, pos);
+ file_pos_write_unlock(file, pos);
fput_light(file, fput_needed);
}

@@ -603,9 +612,9 @@ sys_readv(unsigned long fd, const struct

file = fget_light(fd, &fput_needed);
if (file) {
- loff_t pos = file_pos_read(file);
+ loff_t pos = file_pos_read_lock(file);
ret = vfs_readv(file, vec, vlen, &pos);
- file_pos_write(file, pos);
+ file_pos_write_unlock(file, pos);
fput_light(file, fput_needed);
}

@@ -624,9 +633,9 @@ sys_writev(unsigned long fd, const struc

file = fget_light(fd, &fput_needed);
if (file) {
- loff_t pos = file_pos_read(file);
+ loff_t pos = file_pos_read_lock(file);
ret = vfs_writev(file, vec, vlen, &pos);
- file_pos_write(file, pos);
+ file_pos_write_unlock(file, pos);
fput_light(file, fput_needed);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 162c6e5..1d58cd0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -643,6 +643,7 @@ struct file {
loff_t f_pos;
struct fown_struct f_owner;
unsigned int f_uid, f_gid;
+ struct mutex f_pos_lock;
struct file_ra_state f_ra;

unsigned long f_version;
Nikita Danilov
2006-04-14 10:20:36 UTC
Permalink
Linus Torvalds writes:
[...]
Post by Linus Torvalds
+ * so it's ok to test it independently on the lock/unlock path
+ * rather than explicitly remembering whether we locked it.
+ */
+static inline loff_t file_pos_read_lock(struct file *file)
{
+ if (file->f_mode & FMODE_LSEEK)
+ mutex_lock(&file->f_pos_lock);
return file->f_pos;
}
-static inline void file_pos_write(struct file *file, loff_t pos)
+static inline void file_pos_write_unlock(struct file *file, loff_t pos)
{
file->f_pos = pos;
+ if (file->f_mode & FMODE_LSEEK)
+ mutex_unlock(&file->f_pos_lock);
}
Naming of two functions above is confusing: it looks like read/write
lock is taken. Maybe file_pos_lock_and_get() and
file_pos_set_and_unlock()?

Nikita.
Alan Cox
2006-04-13 21:50:22 UTC
Permalink
Post by Linus Torvalds
That said, I wouldn't be 100% against it, especially under certain
circumstances. However, the circumstances when I think it might be
- when we use f_pos (ie we'd synchronize write against write, but only
per "struct file", not on an inode basis)
- only for files that are marked seekable with FMODE_LSEEK (thus avoiding
the stream-like objects like pipes and sockets)
Under those two circumstances, I'd certainly be ok with it, and I think we
could argue that it is a "good thing". It would be a "f_pos" lock (so we
migt do it for reads too), not a "data lock".
The only serious case historically has been O_APPEND which does have
pretty precise semantics. Nowdays we also have pread/pwrite which have
pretty clear semantics and deal with threading. The O_APPEND case is
very important to get correct and 2.4 certainly did so.


Looking at the spec it says the following

"If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing the
file offset and the write operation."

This is what 2.4 took great paints to guarantee for file writes. Now in
actual fact no OS I know of implements this to the extreme (mmap) but
does for the other cases.

Outside of O_APPEND the specification says only that
- The write starts at the file position
- The file position is updated before the syscall returns

It makes no other guarantee I can see.


As such I belive that the O_APPEND case must be kept locked properly and
the non O_APPEND cases are already correctly handled by the kernel. That
seems to argue for f_pos serialization on O_APPEND only.

Alan
Linus Torvalds
2006-04-13 22:40:10 UTC
Permalink
Post by Alan Cox
The only serious case historically has been O_APPEND which does have
pretty precise semantics. Nowdays we also have pread/pwrite which have
pretty clear semantics and deal with threading. The O_APPEND case is
very important to get correct and 2.4 certainly did so.
pread/pwrite automatically is safe.

Our O_APPEND handling should be safe - although since we do it at a FS
level it actually depends on the filesystem itself. Most (all that use the
generic routines at least) filesystems will get the inode semaphore for
writing, and do the position handling within that semaphore.

So we follow the specs, but..
Post by Alan Cox
Outside of O_APPEND the specification says only that
- The write starts at the file position
- The file position is updated before the syscall returns
It makes no other guarantee I can see.
Right. I think this is purely a "quality of implementation" issue. We
already follow the spec, the question is whether we want to be better than
that.
Post by Alan Cox
As such I belive that the O_APPEND case must be kept locked properly and
the non O_APPEND cases are already correctly handled by the kernel. That
seems to argue for f_pos serialization on O_APPEND only.
f_pos doesn't really matter for O_APPEND, since we'll ignore it, and use
the file size as the position. Which is why the patch I sent out doesn't
matter (and which is why we already get O_APPEND right - we check the file
size within the inode semaphore/mutex).

Linus
Alan Cox
2006-04-13 23:05:49 UTC
Permalink
Post by Linus Torvalds
Post by Alan Cox
Outside of O_APPEND the specification says only that
- The write starts at the file position
- The file position is updated before the syscall returns
It makes no other guarantee I can see.
Right. I think this is purely a "quality of implementation" issue. We
already follow the spec, the question is whether we want to be better than
that.
Quality for whom ? There is a measurable cost to all that extra locking
which will hurt everyone. Given existing kernels don't make the
guarantee and SuS v3 does not make the guarantee the apps that need it
will continue to do the extra work themselves anyway.

I'd say the existing approach is the best quality of implementation for
those needing performance and that the cost for those needing ordering
guarantees in Linux is already astoundingly low thanks to the excellent
work done on futex based posix locking in glibc. I can choose to pay the
costs today, if we do extra locking I cannot opt out.

And of course I too would like to know if anyone is hitting O_APPEND
examples of this problem and if so on what fs ....

Alan
Linus Torvalds
2006-04-13 23:06:46 UTC
Permalink
Post by Alan Cox
Quality for whom ? There is a measurable cost to all that extra locking
which will hurt everyone. Given existing kernels don't make the
guarantee and SuS v3 does not make the guarantee the apps that need it
will continue to do the extra work themselves anyway.
True.
Post by Alan Cox
And of course I too would like to know if anyone is hitting O_APPEND
examples of this problem and if so on what fs ....
Well, Dan already said that he doesn't see it with O_APPEND (aka ">>"),
but yes, other filesystems may not be that lucky.

I'd actually not be surprised if the O_APPEND thing is tested by some of
the FS stress tools, so I suspect all the regular filesystems get it
right. Of course, since most of them - at least the local ones - use the
same generic framework, that's an even safer bet.

NFS etc is very special, of course, especially in the presense of clients
on different machines. At that point, I wouldn't guarantee a whole lot.

Linus
Linus Torvalds
2006-04-13 23:11:30 UTC
Permalink
Post by Alan Cox
Quality for whom ? There is a measurable cost to all that extra locking
which will hurt everyone. Given existing kernels don't make the
guarantee and SuS v3 does not make the guarantee the apps that need it
will continue to do the extra work themselves anyway.
True.
Side note: if you want the strange POSIX guarantees that Dan quoted
(atomicity between write and fstat), you'd almost have to do it with
user-space crapola magic locks. Doing it in the kernel would just be
insane, you'd have to have some per-process "IO semaphore".

(Doing it in user space _also_ sounds insane, but then you could have a
switch like "POSIX me harder, and do all the really stupid things" at
compile time or something)

Linus
Dan Bonachea
2006-04-13 22:06:03 UTC
Permalink
Post by Alan Cox
The only serious case historically has been O_APPEND which does have
pretty precise semantics. Nowdays we also have pread/pwrite which have
pretty clear semantics and deal with threading. The O_APPEND case is
very important to get correct and 2.4 certainly did so.
...
Post by Alan Cox
As such I belive that the O_APPEND case must be kept locked properly and
the non O_APPEND cases are already correctly handled by the kernel. That
seems to argue for f_pos serialization on O_APPEND only.
I agree that O_APPEND is important to get correct. However, would that also
handle the specific case of stdout redirection to a file? (ie "a.out >
result", not just "a.out >> result" - the latter incidentally does not
currently seem to fail, at least with my tests)
Post by Alan Cox
Looking at the spec it says the following
"If the O_APPEND flag of the file status flags is set, the file offset
shall be set to the end of the file prior to each write and no
intervening file modification operation shall occur between changing the
file offset and the write operation."
This is what 2.4 took great paints to guarantee for file writes. Now in
actual fact no OS I know of implements this to the extreme (mmap) but
does for the other cases.
Outside of O_APPEND the specification says only that
- The write starts at the file position
- The file position is updated before the syscall returns
It makes no other guarantee I can see.
The POSIX 1003.1-2001 spec seems to provide a very clear guarantee of
thread-safety behavior wrt threads:

2.9.1 Thread-Safety
All functions defined by this volume of IEEE Std 1003.1-2001 shall be
thread-safe, except that the following functions need not be thread-safe.
<omitted list of functions, which does not include write>
Implementations shall provide internal synchronization as necessary in order
to satisfy this
requirement.
...
2.9.7 Thread Interactions with Regular File Operations
All of the functions chmod( ), close( ), fchmod( ), fcntl( ), fstat( ),
ftruncate( ), lseek( ), open( ), read( ), readlink( ), stat( ), symlink( ),
and write( ) shall be atomic with respect to each other in the effects
specified in IEEE Std 1003.1-2001 when they operate on regular files. If two
threads each call one of these functions, each call shall either see all of
the specified effects of the other call, or none of them.

Unless I'm missing something, that doesn't leave much ambiguity regarding
what's required for POSIX compliance on this issue (although I'm not sure
POSIX compliance is the right metric).

Dan
Alan Cox
2006-04-13 23:11:47 UTC
Permalink
Post by Dan Bonachea
Unless I'm missing something, that doesn't leave much ambiguity regarding
what's required for POSIX compliance on this issue (although I'm not sure
POSIX compliance is the right metric).
Interesting. That pretty much conflicts with what write(2) itself is
defined as in the same specification, and means that the locking is
specific to posix thread groups not to processes. Well we've always
known that pthreads was a brain dead screw-up of a specification so I
guess that should be no suprise.

If the locking is thread group specific it may actually be best to
handle that one in glibc with a futex lock, as only glibc really knows
what is a posix pthread app, and it would avoid the idiocy escaping into
normal applications (which are 95+% of cases)

What does Ulrich think ?


Alan
Linus Torvalds
2006-04-13 23:19:16 UTC
Permalink
Post by Alan Cox
Interesting. That pretty much conflicts with what write(2) itself is
defined as in the same specification
They may mean that writes get "as much atomicity" between threads as
specified in other places (which is not a whole lot). In which case we're
certainly totally according to spec (since Linux has exactly the same
guarantees for threads as for anything else - since we just don't even
_care_ if it's a thread or not).

It may be that the extra POSIX wording comes from user-space thread
libraries that did "magic things" with select loops etc for IO using
non-blocking file descriptors (which, together with some latency
guarantees, could turn a single write into a series of smaller blocked
writes).

That would explain the POSIX wording - that they are supposed to be "as
thread safe" as a native write, even when they are wrapped inside a magic
threaded IO library. Maybe the "in the effects specified in IEEE Std
1003.1-2001" part is exactly about the fact that write is _not_ actually
specified to be totally atomic by the _normal_ POSIX stuff, but that they
wanted to make it clear that it's supposed to be "as atomic" as it's
supposed to be.

Hmm? Trying to be a language lawyer over a spec is always painful. I'd
suspect that the people who wrote that part didn't even really think about
it a lot, they just meant that they were "thread safe" in the sense that
you can call them concurrently without the system blowing up.

Linus
Linus Torvalds
2006-04-13 23:03:06 UTC
Permalink
Post by Dan Bonachea
2.9.7 Thread Interactions with Regular File Operations
All of the functions chmod( ), close( ), fchmod( ), fcntl( ), fstat( ),
ftruncate( ), lseek( ), open( ), read( ), readlink( ), stat( ), symlink( ),
and write( ) shall be atomic with respect to each other in the effects
specified in IEEE Std 1003.1-2001 when they operate on regular files. If two
threads each call one of these functions, each call shall either see all of
the specified effects of the other call, or none of them.
Unless I'm missing something, that doesn't leave much ambiguity regarding
what's required for POSIX compliance on this issue (although I'm not sure
POSIX compliance is the right metric).
Interesting. That's certainly not something we've guaranteed. My suggested
patch makes read/write on the same file descriptor atomic wrt each other,
but does not serialize them wrt different file descriptors (even if it's
the same file). So you could see "half a write" from another read, for
example, or fstat() could easily return the half-way file size for a write
that hasn't completed.

Quite frankly, being totally atomic wrt each other seems unreasonable.
Having fstat take the inode lock between threads is insane. It's
definitely not even something you'd _want_ between processes (ie if
somebody is doing a 2GB write, you want to be able to do an "ls" to see
the file grow!), doing it between threads would just be insane.

Or maybe the "effects specified in IEEE Std 1003.1-2001" is really about
the _other_ effects (ie things like write clearing SUID bits, atime/mtime
etc).

So it smells like POSIX is broken here, in that it expects bad behaviour
that isn't sane. But a subset of it would potentially be sane (ie exactly
that write/write or write/read atomicity on the same file _descriptor_,
exactly because they share f_pos).

Linus
Dan Bonachea
2006-04-13 09:18:55 UTC
Permalink
Post by Andrew Morton
Locking for file.f_pos is generally file->f_dentry->d_inode->i_mutex. We
could use that if we were to restructure the code a lot. Or we could add a
new lock to `struct file'.
Or we could do nothing, because a) the application is going to produce
inderterminate output anyway and b) because it only affects silly testcases
and not real-world apps.
OK, there _might_ be a real-world case: threads appending logging
information to a flat file. Trivially workable-around with a userspace
lock, or by switching to stdio (same thing).
Yes, really we should fix it. But it's not worth adding more overhead to
do so. So the fix would involve widespread (but simple) change, to draw
that f_pos update inside i_mutex.
Hi Andrew - thanks for the detailed response.

I don't know enough about the kernel implementation to comment on your
proposed fixes.

However, I should clarify that this problem definitely affects more than just
"silly testcases", and the fact that a program generates non-deterministically
ordered output does not necessarily make it erroneous, invalid or unuseful.

This problem arose in the parallel runtime system for a scientific language
compiler (nearly a million lines of code total - definitely a "real-world"
program) - the example code is merely a pared-down demonstration of the
problem. In parallel scientific computing, it's very common for many threads
to be writing to stdout (usually for monitoring purposes) and it's expected
and normal for output from separate threads to be arbitrarily interleaved, but
it's *not* ok for output to be lost entirely. This is essentially equivalent
to the real-world example you gave of many threads logging to a file.

We've worked around the problem in Linux 2.6 by adding locking at user-level
around our writes, as you suggest, although this of course penalizes our
performance on kernels that already correctly implement the thread-safety
required by the POSIX spec. In any case it seemed like a problem that we
should report, to be good open-source citizens - especially given that it
appears to be a regression with respect to the Linux 2.4 kernel. How you
choose to handle the report is of course your decision.

Thanks for your time.
Dan
Andrew Morton
2006-04-13 09:56:08 UTC
Permalink
Post by Dan Bonachea
This problem arose in the parallel runtime system for a scientific language
compiler (nearly a million lines of code total - definitely a "real-world"
program) - the example code is merely a pared-down demonstration of the
problem. In parallel scientific computing, it's very common for many threads
to be writing to stdout (usually for monitoring purposes) and it's expected
and normal for output from separate threads to be arbitrarily interleaved, but
it's *not* ok for output to be lost entirely. This is essentially equivalent
to the real-world example you gave of many threads logging to a file.
Interesting - afaik that's the first time this has been hit in a real
application.
Post by Dan Bonachea
We've worked around the problem in Linux 2.6 by adding locking at user-level
around our writes, as you suggest, although this of course penalizes our
performance on kernels that already correctly implement the thread-safety
required by the POSIX spec. In any case it seemed like a problem that we
should report, to be good open-source citizens - especially given that it
appears to be a regression with respect to the Linux 2.4 kernel. How you
choose to handle the report is of course your decision.
yup, thanks.
Kyle Moffett
2006-04-13 10:28:12 UTC
Permalink
Post by Andrew Morton
Post by Dan Bonachea
This problem arose in the parallel runtime system for a scientific
language compiler (nearly a million lines of code total -
definitely a "real-world" program) - the example code is merely a
pared-down demonstration of the problem. In parallel scientific
computing, it's very common for many threads to be writing to
stdout (usually for monitoring purposes) and it's expected and
normal for output from separate threads to be arbitrarily
interleaved, but it's *not* ok for output to be lost entirely.
This is essentially equivalent to the real-world example you gave
of many threads logging to a file.
Interesting - afaik that's the first time this has been hit in a
real application.
I would guess that it could also be a problem with a wide variety of
Perl CGIs running under Apache2+mod_perl with the worker threading
model. In fact this may actually explain some odd behavior I got
from one such module that I "fixed" by switching to logging via
syslog. I don't remember which module it was or what the exact
problem was, but I think it seemed similar in nature. That is
somewhat of a surprising and nonintuitive failure mode for logging,
IMHO it would be nice to get fixed. I would imagine that this has
probably been hit a number of times before, mysteriously fixed by
changed userspace locking or subtle thread ordering, and written off
as the mysterious effects of cosmic rays on RAM.

Cheers,
Kyle Moffett
Jan Engelhardt
2006-04-13 14:14:36 UTC
Permalink
I don't know enough about the kernel implementation to comment on your proposed
fixes.
However, I should clarify that this problem definitely affects more than just
"silly testcases", and the fact that a program generates non-deterministically
ordered output does not necessarily make it erroneous, invalid or unuseful.
For example syslogd (in the rare event that it does not use stdio). Or any
kind of logfile daemon that just dumps incoming events to a file that needs
user analysis afterwards anyway.


Jan Engelhardt
--
Kai Henningsen
2006-04-14 09:14:00 UTC
Permalink
Post by Dan Bonachea
Post by Andrew Morton
Locking for file.f_pos is generally file->f_dentry->d_inode->i_mutex. We
could use that if we were to restructure the code a lot. Or we could add a
new lock to `struct file'.
Or we could do nothing, because a) the application is going to produce
inderterminate output anyway and b) because it only affects silly testcases
and not real-world apps.
OK, there _might_ be a real-world case: threads appending logging
information to a flat file. Trivially workable-around with a userspace
lock, or by switching to stdio (same thing).
Yes, really we should fix it. But it's not worth adding more overhead to
do so. So the fix would involve widespread (but simple) change, to draw
that f_pos update inside i_mutex.
Hi Andrew - thanks for the detailed response.
I don't know enough about the kernel implementation to comment on your
proposed fixes.
However, I should clarify that this problem definitely affects more than
just "silly testcases", and the fact that a program generates
non-deterministically ordered output does not necessarily make it erroneous,
invalid or unuseful.
This problem arose in the parallel runtime system for a scientific language
compiler (nearly a million lines of code total - definitely a "real-world"
program) - the example code is merely a pared-down demonstration of the
problem. In parallel scientific computing, it's very common for many threads
to be writing to stdout (usually for monitoring purposes) and it's expected
and normal for output from separate threads to be arbitrarily interleaved,
but it's *not* ok for output to be lost entirely. This is essentially
equivalent to the real-world example you gave of many threads logging to a
file.
We've worked around the problem in Linux 2.6 by adding locking at user-level
around our writes, as you suggest, although this of course penalizes our
performance on kernels that already correctly implement the thread-safety
required by the POSIX spec. In any case it seemed like a problem that we
should report, to be good open-source citizens - especially given that it
appears to be a regression with respect to the Linux 2.4 kernel. How you
choose to handle the report is of course your decision.
So, considering the rest of the thread, shouldn't it be enough to simply
use O_APPEND for that use case - and, for that matter, isn't using
O_APPEND actually completely natural there, and isn't that (modulo the
process-thread difference) exactly what O_APPEND was invented for?

MfG Kai
Jens Moser
2010-12-19 22:45:43 UTC
Permalink
Post by Dan Bonachea
Hi - I believe I've discovered a thread-safety bug in the Linux 2.6.x kernel
implementation of write(2).
The small C program below the problem - in a nutshell, if multiple threads
write() to STDOUT_FILENO, and stdout has been redirected to a file, then some
of the output lines get lost. The actual result is non-deterministic (even in
a "correct" run) - however the expected correct behavior is 10 lines of output
(in some non-deterministic order). However, the test program reproducibly
generates some lost output (less than 10 lines of total output) on every run
where the output is redirected to a new file. This appears to be a violation
of the POSIX spec (POSIX 1003.1-2001:2.9.1 requires thread-safety of write()).
The problem does not appear to occur if output goes to the console, or is
redirected to append to an existing file, only when stdout is redirected to a
new file.
---------------------------------------------------------------------------
* compile as: gcc write-bug.c -lpthread
* run with: ./a.out
* note that lines from all 10 threads are visible
* run with: ./a.out > output
* note that lines from several threads are missing from the output file
*/
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
volatile int flag = 0;
void *start(void *arg) {
char msg[255];
int res;
sprintf(msg,"hi from %i\n",(int)arg);
if ((int)arg == 9) flag = 1;
while (!flag) ; /* thread barrier */
#if 1
res = write(STDOUT_FILENO,msg,strlen(msg));
if (res != strlen(msg)) fprintf(stderr,"Failure: %i
%s\n",res,strerror(res));
#else
fputs(msg,stdout);
#endif
#if 1 /* work extra hard to flush output (makes no difference) */
fflush(NULL);
fsync(STDOUT_FILENO);
sync();
#endif
return NULL;
}
int main() {
int i;
/* create 10 pthreads */
pthread_t id[10];
for (i =0 ; i < 10; i++) {
int ret = pthread_create(&(id[i]),NULL,&start,(void*)i);
if (ret) printf("pthread_create: %s\n",strerror(ret));
}
/* join 10 pthreads */
for (i =0 ; i < 10; i++) {
pthread_join(id[i], NULL);
}
sleep(1);
return 0;
}
---------------------------------------------------------------------------
The problem occurs on every Linux 2.6 machine I've tried, including ones with
Itanium-2, Athlon, Opteron and Pentium 4 chips, both SMP's and uniprocessors,
using a variety of recent gcc versions. The problem does *not* appear to occur
on any Linux 2.4 machine I've tried, even if I use the same executable which
fails on the Linux 2.6 machine. Finally, replacing write(STDOUT_FILENO) with
fputs(stdout) makes the problem disappear (presumably due to locking in libc).
I don't have administrative access to upgrade the kernel on these machines,
however below is the full machine info for the most recent installed kernel
version I have access to.
Any suggestions are appreciated.
Thanks,
Dan
Hello all,

it's a long time since above message was posted and I wondered if Linus' patch
found it's way into the mainline kernel. A quick check with Dan Bonachea's test
program below showed that this is probably not the case.

We are currently porting a program to Linux (x86 and x86-64) which suffers from
the exact same problem. In contrast to Dan's scientific program our application
could live with occasional losses of messages as long as the output of different
concurrent write(2)s won't be interleaved.

So if we had a write(2) of "abcdef" and a second concurrent write(2) of "uvwxyz"
it would be tolerable if the output of one of them was lost, but it would be
catastrophic for our application if the output was interleaved, e.g. in the form
"abcuvwdefxyz".

I checked for that case with a modified version of Dan's test program (with
write(2)s of up to 100 MB
per thread) and under no circumstances did I get interleaved output (but only
losses of output).

So my question is, does Linux do any synchronisation to prevent interleaving of
output in conjuction with concurrent write(2)s to the same file descriptor
(regular files only). If not, the test results would be a surprise to me, but
maybe someone with more knowledge in regards to the Linux kernel could shed some
light on that.

Any answers are higly appreciated!

Many thanks in advance,

Jens Moser

Loading...