Discussion:
[PATCH] Do not silently discard WRITE_SAME requests
Petr Vandrovec
2014-10-11 06:30:53 UTC
Permalink
Hi,
it was brought to my attention that there are claims of data corruption
caused by VMware's SCSI implementation. After investigating, problem
seems to be in a way completion handler for WRITE_SAME handles EOPNOTSUPP
error, causing all-but-first WRITE_SAME request on the LVM device to be
silently ignored - command is never issued, but success is returned to
higher layers. Problem affects all disks without WRITE_SAME support -
and I guess VMware's SCSI emulation is one of few that do not support
this command ATM.

Please apply patch below.
Thanks,
Petr Vandrovec

From: Petr Vandrovec <***@vmware.com>
Subject: [PATCH] Do not silently discard WRITE_SAME requests


When device does not support WRITE_SAME, after first failure
block layer starts throwing away WRITE_SAME requests without
warning anybody, leading to the data corruption.

Let's do something about it - do not use EOPNOTSUPP error,
as apparently that error code is special (use EREMOTEIO, AKA
target failure, like when request hits hardware), and propagate
inabiity to do WRITE_SAME to the top of stack, so we do not
try to issue WRITE_SAME again and again.

It also reverts 4089b71cc820a426d601283c92fcd4ffeb5139c2, as
there is nothing wrong with VMware's WRITE_SAME emulation.
Only problem was that block layer did not issue WRITE_SAME
request at all, but reported success, and it affected all
disks that do not support WRITE_SAME.

Signed-off-by: Petr Vandrovec <***@vmware.com>
Cc: Arvind Kumar <***@vmware.com>
Cc: Chris J Arges <***@canonical.com>
Cc: Martin K. Petersen <***@oracle.com>
Cc: Christoph Hellwig <***@lst.de>
Cc: ***@vger.kernel.org
---
block/blk-core.c | 2 +-
block/blk-lib.c | 10 ++++++++++
drivers/message/fusion/mptspi.c | 5 -----
3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9c888bd..b070782 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1822,7 +1822,7 @@ generic_make_request_checks(struct bio *bio)
}

if (bio->bi_rw & REQ_WRITE_SAME && !bdev_write_same(bio->bi_bdev)) {
- err = -EOPNOTSUPP;
+ err = -EREMOTEIO;
goto end_io;
}

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 8411be3..abad72d 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -298,6 +298,16 @@ int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
ZERO_PAGE(0)))
return 0;

+ /*
+ * If WRITE_SAME failed, inability to perform WRITE_SAME was
+ * possibly recorded in device's queue by sd.c. But in case
+ * of LVM we are issuing request here on LVM device. So
+ * we should mark device as ineligible for WRITE_SAME here too,
+ * as otherwise we keep trying to submit WRITE_SAME again and
+ * again to LVM where they get promptly rejected by underlying
+ * disk queue.
+ */
+ blk_queue_max_write_same_sectors(bdev_get_queue(bdev), 0);
bdevname(bdev, bdn);
pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
}
diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 613231c..787933d 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1419,11 +1419,6 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto out_mptspi_probe;
}

- /* VMWare emulation doesn't properly implement WRITE_SAME
- */
- if (pdev->subsystem_vendor == 0x15AD)
- sh->no_write_same = 1;
-
spin_lock_irqsave(&ioc->FreeQlock, flags);

/* Attach the SCSI Host to the IOC structure
--
2.1.1
Martin K. Petersen
2014-10-11 12:51:30 UTC
Permalink
Petr> After investigating, problem seems to be in a way completion
Petr> handler for WRITE_SAME handles EOPNOTSUPP error, causing
Petr> all-but-first WRITE_SAME request on the LVM device to be silently
Petr> ignored - command is never issued, but success is returned to
Petr> higher layers.

Commit 7eee4ae2dbb2 was meant to address this issue. Does it still
happen with that in place?
--
Martin K. Petersen Oracle Linux Engineering
Petr Vandrovec
2014-10-12 06:18:02 UTC
Permalink
Post by Martin K. Petersen
Petr> After investigating, problem seems to be in a way completion
Petr> handler for WRITE_SAME handles EOPNOTSUPP error, causing
Petr> all-but-first WRITE_SAME request on the LVM device to be silently
Petr> ignored - command is never issued, but success is returned to
Petr> higher layers.
Commit 7eee4ae2dbb2 was meant to address this issue. Does it still
happen with that in place?
Hi,
that commit alleviates need for change to blk-lib.c. But I believe
that change to blk-core.c that changes return value from EOPNOTSUPP to
EREMOTEIO is still necessary - unless I'm missing some locking
somewhere, there is a race in blkdev_issue_write_same() wrt. updating
max_write_same_sectors:

blkdev_issue_write_same() checks whether max_write_same_sectors is
non-zero at the beginning, and if it is non-zero it proceeds with
generating BIOs. While it generates them, other thread seems to be able
to complete previously issued write_same, find it is not supported, and
clear max_write_same_sectors. Which means that BIOs that are now being
generated will fast-fail in blk-core.c with EOPNOTSUPP, and
blkdev_issue_write_same() will then return success, rather than failure.

It is true that now WRITE_SAME is failing only if second WRITE_SAME is
issued to the device while first ever issued WRITE_SAME on the device is
being completed, but I see no reason why to not close this race.

Logic (from 2011, commit 8af1954d172a46a63e5e79dae523a6d74715e458) says
that EOPNOTSUPP is returned when DISCARD request failed, as discarding
is optional, and failures can be safely ignored. That is definitely not
true for WRITE_SAME failures, and so unsupported WRITE_SAME should
return different error code than unsupported DISCARD.

Which is what patch does. I've removed part that propagates disabling
WRITE_SAME from the diff, keeping only EOPNOTSUPP => EREMOTEIO change,
and revert of blacklisting VMware's LSI (if anything, blacklist should
be for current firmware version of 'VMware Virtual SCSI Disk', as f.e.
passed-through (RDM) SCSI disks do support WRITE_SAME under VMware) --
see attached updated diff.
Petr Vandrovec
Martin K. Petersen
2014-10-15 00:57:58 UTC
Permalink
Petr,

Petr> Logic (from 2011, commit 8af1954d172a46a63e5e79dae523a6d74715e458)
Petr> says that EOPNOTSUPP is returned when DISCARD request failed, as
Petr> discarding is optional, and failures can be safely ignored. That
Petr> is definitely not true for WRITE_SAME failures, and so unsupported
Petr> WRITE_SAME should return different error code than unsupported
Petr> DISCARD.

I agree. With Lukas' change the bio batch error handling is too
permissive for the write same case. It also made the regular zeroout
error handling case a bit fishy but it is unlikely that we'd get
EOPNOTSUPP on a regular write.

So for the block portion:

Acked-by: Martin K. Petersen <***@oracle.com>
Cc: ***@vger.kernel.org # 3.7+

Petr> revert of blacklisting VMware's LSI (if anything, blacklist should
Petr> be for current firmware version of 'VMware Virtual SCSI Disk', as
Petr> f.e. passed-through (RDM) SCSI disks do support WRITE_SAME under
Petr> VMware) -- see attached updated diff.

If you have a better heuristic then let's by all means use it. Please
adjust the pattern matching strings to match the actual values returned
by the target.


SCSI: Only blacklist WRITE SAME for VMware virtual disks

Commit 4089b71cc820 blacklisted WRITE SAME for all VMware disks.
However, the WRITE SAME commands are supported for passthrough
disks. Change the heuristic so we only blacklist virtual disks.

Signed-off-by: Martin K. Petersen <***@oracle.com>
Reported-by: Petr Vandrovec <***@vmware.com>

diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 613231c16194..787933d43d32 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1419,11 +1419,6 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto out_mptspi_probe;
}

- /* VMWare emulation doesn't properly implement WRITE_SAME
- */
- if (pdev->subsystem_vendor == 0x15AD)
- sh->no_write_same = 1;
-
spin_lock_irqsave(&ioc->FreeQlock, flags);

/* Attach the SCSI Host to the IOC structure
diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..402aebfb97dd 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -249,6 +249,7 @@ static struct {
{"TOSHIBA", "CD-ROM", NULL, BLIST_ISROM},
{"Traxdata", "CDR4120", NULL, BLIST_NOLUN}, /* locks up */
{"USB2.0", "SMARTMEDIA/XD", NULL, BLIST_FORCELUN | BLIST_INQUIRY_36},
+ {"VMware", "Virtual SCSI Disk", NULL, BLIST_NO_WRITE_SAME},
{"WangDAT", "Model 2600", "01.7", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 3200", "02.2", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 1300", "02.4", BLIST_SELECT_NO_ATN},
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index ba3f1e8d0d57..c57daffe57af 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -955,6 +955,9 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
if (*bflags & BLIST_NO_DIF)
sdev->no_dif = 1;

+ if (*bflags & BLIST_NO_WRITE_SAME)
+ sdev->no_write_same = 1;
+
sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;

if (*bflags & BLIST_TRY_VPD_PAGES)
diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
index 183eaab7c380..1a24efb4b1d6 100644
--- a/include/scsi/scsi_devinfo.h
+++ b/include/scsi/scsi_devinfo.h
@@ -36,5 +36,6 @@
for sequential scan */
#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */
#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */
+#define BLIST_NO_WRITE_SAME 0x40000000 /* don't try to issue WRITE SAME */

#endif
--
Martin K. Petersen Oracle Linux Engineering
Petr Vandrovec
2014-10-15 06:28:23 UTC
Permalink
Post by Martin K. Petersen
Petr,
Petr> Logic (from 2011, commit 8af1954d172a46a63e5e79dae523a6d74715e458)
Petr> says that EOPNOTSUPP is returned when DISCARD request failed, as
Petr> discarding is optional, and failures can be safely ignored. That
Petr> is definitely not true for WRITE_SAME failures, and so unsupported
Petr> WRITE_SAME should return different error code than unsupported
Petr> DISCARD.
I agree. With Lukas' change the bio batch error handling is too
permissive for the write same case. It also made the regular zeroout
error handling case a bit fishy but it is unlikely that we'd get
EOPNOTSUPP on a regular write.
Petr> revert of blacklisting VMware's LSI (if anything, blacklist should
Petr> be for current firmware version of 'VMware Virtual SCSI Disk', as
Petr> f.e. passed-through (RDM) SCSI disks do support WRITE_SAME under
Petr> VMware) -- see attached updated diff.
If you have a better heuristic then let's by all means use it. Please
adjust the pattern matching strings to match the actual values returned
by the target.
SCSI: Only blacklist WRITE SAME for VMware virtual disks
Commit 4089b71cc820 blacklisted WRITE SAME for all VMware disks.
However, the WRITE SAME commands are supported for passthrough
disks. Change the heuristic so we only blacklist virtual disks.
Hi,
is there any reason to do blacklisting? Why not let first request
fail, and switch to non-write-same code path once that happens?

If you want to go ahead with blacklisting, then entries should be:

{ "VMware", "Virtual disk", "1.0", BLIST_NO_WRITE_SAME }, //ESX
{ "VMware,", "VMware Virtual S", "1.0", BLIST_NO_WRITE_SAME }, //WS,Fusion

(yes, on Workstation and Fusion there is really comma in the vendor name
and VMware is repeated in the device ID - first is 'VMware, Inc.'
truncated to 8 characters, and second is 'VMware Virtual SCSI Hard
Drive' truncated to 16; we got revision right...)
Thanks,
Petr
Post by Martin K. Petersen
diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 613231c16194..787933d43d32 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1419,11 +1419,6 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto out_mptspi_probe;
}
- /* VMWare emulation doesn't properly implement WRITE_SAME
- */
- if (pdev->subsystem_vendor == 0x15AD)
- sh->no_write_same = 1;
-
spin_lock_irqsave(&ioc->FreeQlock, flags);
/* Attach the SCSI Host to the IOC structure
diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..402aebfb97dd 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -249,6 +249,7 @@ static struct {
{"TOSHIBA", "CD-ROM", NULL, BLIST_ISROM},
{"Traxdata", "CDR4120", NULL, BLIST_NOLUN}, /* locks up */
{"USB2.0", "SMARTMEDIA/XD", NULL, BLIST_FORCELUN | BLIST_INQUIRY_36},
+ {"VMware", "Virtual SCSI Disk", NULL, BLIST_NO_WRITE_SAME},
{"WangDAT", "Model 2600", "01.7", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 3200", "02.2", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 1300", "02.4", BLIST_SELECT_NO_ATN},
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index ba3f1e8d0d57..c57daffe57af 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -955,6 +955,9 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
if (*bflags & BLIST_NO_DIF)
sdev->no_dif = 1;
+ if (*bflags & BLIST_NO_WRITE_SAME)
+ sdev->no_write_same = 1;
+
sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
if (*bflags & BLIST_TRY_VPD_PAGES)
diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
index 183eaab7c380..1a24efb4b1d6 100644
--- a/include/scsi/scsi_devinfo.h
+++ b/include/scsi/scsi_devinfo.h
@@ -36,5 +36,6 @@
for sequential scan */
#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */
#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */
+#define BLIST_NO_WRITE_SAME 0x40000000 /* don't try to issue WRITE SAME */
#endif
Martin K. Petersen
2014-10-17 23:46:38 UTC
Permalink
Petr> is there any reason to do blacklisting? Why not let first request
Petr> fail, and switch to non-write-same code path once that happens?

Well, we do. But we try to avoid confusing users with error messages if
we can avoid it. And if we know it's not going to work we might as avoid
trying.


SCSI: Only blacklist WRITE SAME for VMware virtual disks

Commit 4089b71cc820 blacklisted WRITE SAME for all VMware disks.
However, the WRITE SAME commands are supported for passthrough
disks. Change the heuristic so we only blacklist virtual disks.

Signed-off-by: Martin K. Petersen <***@oracle.com>
Reported-by: Petr Vandrovec <***@vmware.com>


diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 613231c16194..787933d43d32 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1419,11 +1419,6 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto out_mptspi_probe;
}

- /* VMWare emulation doesn't properly implement WRITE_SAME
- */
- if (pdev->subsystem_vendor == 0x15AD)
- sh->no_write_same = 1;
-
spin_lock_irqsave(&ioc->FreeQlock, flags);

/* Attach the SCSI Host to the IOC structure
diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..8c228e049bb6 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -249,6 +249,8 @@ static struct {
{"TOSHIBA", "CD-ROM", NULL, BLIST_ISROM},
{"Traxdata", "CDR4120", NULL, BLIST_NOLUN}, /* locks up */
{"USB2.0", "SMARTMEDIA/XD", NULL, BLIST_FORCELUN | BLIST_INQUIRY_36},
+ {"VMware", "Virtual disk", "1.0", BLIST_NO_WRITE_SAME }, /* ESX */
+ {"VMware,", "VMware Virtual S", "1.0", BLIST_NO_WRITE_SAME }, /* WS,Fusion */
{"WangDAT", "Model 2600", "01.7", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 3200", "02.2", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 1300", "02.4", BLIST_SELECT_NO_ATN},
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index ba3f1e8d0d57..c57daffe57af 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -955,6 +955,9 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
if (*bflags & BLIST_NO_DIF)
sdev->no_dif = 1;

+ if (*bflags & BLIST_NO_WRITE_SAME)
+ sdev->no_write_same = 1;
+
sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;

if (*bflags & BLIST_TRY_VPD_PAGES)
diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
index 183eaab7c380..1a24efb4b1d6 100644
--- a/include/scsi/scsi_devinfo.h
+++ b/include/scsi/scsi_devinfo.h
@@ -36,5 +36,6 @@
for sequential scan */
#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */
#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */
+#define BLIST_NO_WRITE_SAME 0x40000000 /* don't try to issue WRITE SAME */

#endif
Chris J Arges
2014-10-20 15:10:14 UTC
Permalink
Post by Martin K. Petersen
Petr> is there any reason to do blacklisting? Why not let first request
Petr> fail, and switch to non-write-same code path once that happens?
Well, we do. But we try to avoid confusing users with error messages if
we can avoid it. And if we know it's not going to work we might as avoid
trying.
In addition it seems that after blkdev_issue_write_same fails and we try
to zero with __blkdev_issue_zeroout, we have issues where data is not
being zeroed properly as seem by the test case.
Post by Martin K. Petersen
SCSI: Only blacklist WRITE SAME for VMware virtual disks
Commit 4089b71cc820 blacklisted WRITE SAME for all VMware disks.
However, the WRITE SAME commands are supported for passthrough
disks. Change the heuristic so we only blacklist virtual disks.
I've tested with this patch applied and the original test case for the
bug does not fail. I only verified this with VMWare workstation. Feel
free to add:

Tested-by: Chris J Arges <***@canonical.com>

Thanks,
--chris j arges
Post by Martin K. Petersen
diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index 613231c16194..787933d43d32 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1419,11 +1419,6 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto out_mptspi_probe;
}
- /* VMWare emulation doesn't properly implement WRITE_SAME
- */
- if (pdev->subsystem_vendor == 0x15AD)
- sh->no_write_same = 1;
-
spin_lock_irqsave(&ioc->FreeQlock, flags);
/* Attach the SCSI Host to the IOC structure
diff --git a/drivers/scsi/scsi_devinfo.c b/drivers/scsi/scsi_devinfo.c
index 49014a143c6a..8c228e049bb6 100644
--- a/drivers/scsi/scsi_devinfo.c
+++ b/drivers/scsi/scsi_devinfo.c
@@ -249,6 +249,8 @@ static struct {
{"TOSHIBA", "CD-ROM", NULL, BLIST_ISROM},
{"Traxdata", "CDR4120", NULL, BLIST_NOLUN}, /* locks up */
{"USB2.0", "SMARTMEDIA/XD", NULL, BLIST_FORCELUN | BLIST_INQUIRY_36},
+ {"VMware", "Virtual disk", "1.0", BLIST_NO_WRITE_SAME }, /* ESX */
+ {"VMware,", "VMware Virtual S", "1.0", BLIST_NO_WRITE_SAME }, /* WS,Fusion */
{"WangDAT", "Model 2600", "01.7", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 3200", "02.2", BLIST_SELECT_NO_ATN},
{"WangDAT", "Model 1300", "02.4", BLIST_SELECT_NO_ATN},
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index ba3f1e8d0d57..c57daffe57af 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -955,6 +955,9 @@ static int scsi_add_lun(struct scsi_device *sdev, unsigned char *inq_result,
if (*bflags & BLIST_NO_DIF)
sdev->no_dif = 1;
+ if (*bflags & BLIST_NO_WRITE_SAME)
+ sdev->no_write_same = 1;
+
sdev->eh_timeout = SCSI_DEFAULT_EH_TIMEOUT;
if (*bflags & BLIST_TRY_VPD_PAGES)
diff --git a/include/scsi/scsi_devinfo.h b/include/scsi/scsi_devinfo.h
index 183eaab7c380..1a24efb4b1d6 100644
--- a/include/scsi/scsi_devinfo.h
+++ b/include/scsi/scsi_devinfo.h
@@ -36,5 +36,6 @@
for sequential scan */
#define BLIST_TRY_VPD_PAGES 0x10000000 /* Attempt to read VPD pages */
#define BLIST_NO_RSOC 0x20000000 /* don't try to issue RSOC */
+#define BLIST_NO_WRITE_SAME 0x40000000 /* don't try to issue WRITE SAME */
#endif
Continue reading on narkive:
Loading...