Discussion:
Integration of SCST in the mainstream Linux kernel
Bart Van Assche
2008-01-23 14:22:00 UTC
Permalink
As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
pieces of software are necessary to make networked storage possible:
initiator software and target software. As far as I know there exist
three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
InfiniBand network. See also the following thread for the details:
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.

About the design of the SCST software: while one of the goals of the
STGT project was to keep the in-kernel code minimal, the SCST project
implements the whole SCSI target in kernel space. SCST is implemented
as a set of new kernel modules; only minimal changes to the existing
kernel are necessary before the SCST kernel modules can be used. This
is the same approach that will be followed in the very near future in
the OpenSolaris kernel (see also
http://opensolaris.org/os/project/comstar/). More information about
the design of SCST can be found here:
http://scst.sourceforge.net/doc/scst_pg.html.

My impression is that both the STGT and SCST projects are well
designed, well maintained and have a considerable user base. According
to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to
STGT with respect to features, performance, maturity, stability, and
number of existing target drivers. Unfortunately the SCST kernel code
lives outside the kernel tree, which makes SCST harder to use than
STGT.

As an SCST user, I would like to see the SCST kernel code integrated
in the mainstream kernel because of its excellent performance on an
InfiniBand network. Since the SCST project comprises about 14 KLOC,
reviewing the SCST code will take considerable time. Who will do this
reviewing work? And with regard to the comments made by the
reviewers: Vladislav, do you have the time to carry out the
modifications requested by the reviewers? I expect, among other things,
that reviewers will ask to move SCST's configuration pseudofiles from
procfs to sysfs.

Bart Van Assche.
Vladislav Bolkhovitin
2008-01-23 17:11:34 UTC
Permalink
Post by Bart Van Assche
As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
initiator software and target software. As far as I know there exist
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.
About the design of the SCST software: while one of the goals of the
STGT project was to keep the in-kernel code minimal, the SCST project
implements the whole SCSI target in kernel space. SCST is implemented
as a set of new kernel modules, only minimal changes to the existing
kernel are necessary before the SCST kernel modules can be used. This
is the same approach that will be followed in the very near future in
the OpenSolaris kernel (see also
http://opensolaris.org/os/project/comstar/). More information about
http://scst.sourceforge.net/doc/scst_pg.html.
My impression is that both the STGT and SCST projects are well
designed, well maintained and have a considerable user base. According
to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to
STGT with respect to features, performance, maturity, stability, and
number of existing target drivers. Unfortunately the SCST kernel code
lives outside the kernel tree, which makes SCST harder to use than
STGT.
As an SCST user, I would like to see the SCST kernel code integrated
in the mainstream kernel because of its excellent performance on an
InfiniBand network. Since the SCST project comprises about 14 KLOC,
reviewing the SCST code will take considerable time. Who will do this
reviewing work ? And with regard to the comments made by the
reviewers: Vladislav, do you have the time to carry out the
modifications requested by the reviewers ? I expect a.o. that
reviewers will ask to move SCST's configuration pseudofiles from
procfs to sysfs.
Sure, I do, although I personally don't see much sense in such a move.
Post by Bart Van Assche
Bart Van Assche.
James Bottomley
2008-01-29 20:42:11 UTC
Permalink
Post by Bart Van Assche
As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
initiator software and target software. As far as I know there exist
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.
That doesn't seem to pull up a thread. However, I assume it's these
figures:

.........................................................................................
.                            .  STGT read     SCST read    .  STGT read     SCST read   .
.                            .  performance   performance  .  performance   performance .
.                            .  (0.5K, MB/s)  (0.5K, MB/s) .  (1 MB, MB/s)  (1 MB, MB/s).
.........................................................................................
. Ethernet (1 Gb/s network)  .       77            78      .       77            89     .
. IPoIB (8 Gb/s network)     .      163           185      .      201           239     .
. iSER (8 Gb/s network)      .      250           N/A      .      360           N/A     .
. SRP (8 Gb/s network)       .      N/A           421      .      N/A           683     .
.........................................................................................

On the comparable figures, which only seem to be IPoIB, they're showing a
13-18% variance, aren't they? That isn't an incredible difference.
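
For reference, those percentages follow directly from the IPoIB row in
the table above; nothing beyond the table is assumed here:

  echo "scale=4; (185 - 163) / 163" | bc   # 0.5K transfers: ~13.5 %
  echo "scale=4; (239 - 201) / 201" | bc   # 1 MB transfers:  ~18.9 %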
Post by Bart Van Assche
About the design of the SCST software: while one of the goals of the
STGT project was to keep the in-kernel code minimal, the SCST project
implements the whole SCSI target in kernel space. SCST is implemented
as a set of new kernel modules, only minimal changes to the existing
kernel are necessary before the SCST kernel modules can be used. This
is the same approach that will be followed in the very near future in
the OpenSolaris kernel (see also
http://opensolaris.org/os/project/comstar/). More information about
http://scst.sourceforge.net/doc/scst_pg.html.
My impression is that both the STGT and SCST projects are well
designed, well maintained and have a considerable user base. According
to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to
STGT with respect to features, performance, maturity, stability, and
number of existing target drivers. Unfortunately the SCST kernel code
lives outside the kernel tree, which makes SCST harder to use than
STGT.
As an SCST user, I would like to see the SCST kernel code integrated
in the mainstream kernel because of its excellent performance on an
InfiniBand network. Since the SCST project comprises about 14 KLOC,
reviewing the SCST code will take considerable time. Who will do this
reviewing work ? And with regard to the comments made by the
reviewers: Vladislav, do you have the time to carry out the
modifications requested by the reviewers ? I expect a.o. that
reviewers will ask to move SCST's configuration pseudofiles from
procfs to sysfs.
The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel. Right at the moment,
it's STGT. Problems in STGT come from the user<->kernel boundary, which
can be mitigated in a variety of ways. The fact that the figures are
pretty much comparable on non-IB networks shows this.

I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another. Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.

James


Roland Dreier
2008-01-29 21:31:52 UTC
Permalink
Post by James Bottomley
. . STGT read SCST read . STGT read SCST read .
. . performance performance . performance performance .
. . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) .
. iSER (8 Gb/s network) . 250 N/A . 360 N/A .
. SRP (8 Gb/s network) . N/A 421 . N/A 683 .
On the comparable figures, which only seem to be IPoIB they're showing a
13-18% variance, aren't they? Which isn't an incredible difference.
Maybe I'm all wet, but I think iSER vs. SRP should be roughly
comparable. The exact formatting of various messages etc. is
different but the data path using RDMA is pretty much identical. So
the big difference between STGT iSER and SCST SRP hints at some big
difference in the efficiency of the two implementations.

- R.
FUJITA Tomonori
2008-01-29 23:32:39 UTC
Permalink
On Tue, 29 Jan 2008 13:31:52 -0800
Post by Roland Dreier
Post by James Bottomley
. . STGT read SCST read . STGT read SCST read .
. . performance performance . performance performance .
. . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) .
. iSER (8 Gb/s network) . 250 N/A . 360 N/A .
. SRP (8 Gb/s network) . N/A 421 . N/A 683 .
On the comparable figures, which only seem to be IPoIB they're showing a
13-18% variance, aren't they? Which isn't an incredible difference.
Maybe I'm all wet, but I think iSER vs. SRP should be roughly
comparable. The exact formatting of various messages etc. is
different but the data path using RDMA is pretty much identical. So
the big difference between STGT iSER and SCST SRP hints at some big
difference in the efficiency of the two implementations.
iSER has parameters that limit the maximum size of an RDMA transfer (with
a poor configuration it needs to repeat the RDMA)?


Anyway, here are the results from Robin Humble:

iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd,
initiator end booted with mem=512M, target with 8G ram

direct i/o dd
write/read 800/751 MB/s
dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct

http://www.mail-archive.com/linux-***@vger.kernel.org/msg13502.html

I think that STGT is pretty fast with fast backing storage.


I don't think that there is a notable performance difference between
kernel-space and user-space SRP (or iSER) implementations when moving
data between hosts. IB is expected to enable user-space applications
to move data between hosts quickly (if not, what can IB provide us?).

I think that the question is how fast user-space applications can do
I/Os compared with I/Os in kernel space. STGT is eager for the advent
of good asynchronous I/O and event notification interfaces.


One more possible optimization for STGT is zero-copy data
transfer. STGT uses pre-registered buffers, moves data between the page
cache and these buffers, and then does the RDMA transfer. If we implement
our own caching mechanism that uses the pre-registered buffers directly
(with AIO and O_DIRECT), then STGT can move data without extra copies.
Vu Pham
2008-01-30 01:15:09 UTC
Permalink
Post by FUJITA Tomonori
On Tue, 29 Jan 2008 13:31:52 -0800
Post by Roland Dreier
Post by James Bottomley
. . STGT read SCST read . STGT read SCST read .
. . performance performance . performance performance .
. . (0.5K, MB/s) (0.5K, MB/s) . (1 MB, MB/s) (1 MB, MB/s) .
. iSER (8 Gb/s network) . 250 N/A . 360 N/A .
. SRP (8 Gb/s network) . N/A 421 . N/A 683 .
On the comparable figures, which only seem to be IPoIB they're showing a
13-18% variance, aren't they? Which isn't an incredible difference.
Maybe I'm all wet, but I think iSER vs. SRP should be roughly
comparable. The exact formatting of various messages etc. is
different but the data path using RDMA is pretty much identical. So
the big difference between STGT iSER and SCST SRP hints at some big
difference in the efficiency of the two implementations.
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd,
initiator end booted with mem=512M, target with 8G ram
direct i/o dd
write/read 800/751 MB/s
dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct
Both Robin (iSER/STGT) and Bart (SCST/SRP) are using ramfs.

Robin's numbers come from DDR IB HCAs

Bart's numbers come from SDR IB HCAs:
Results with /dev/ram0 configured as backing store on the
target (buffered I/O):

                  Read           Write          Read           Write
                  performance    performance    performance    performance
                  (0.5K, MB/s)   (0.5K, MB/s)   (1 MB, MB/s)   (1 MB, MB/s)
STGT + iSER           250             48            349            781
SCST + SRP            411             66            659            746

Results with /dev/ram0 configured as backing store on the
target (direct I/O):

                  Read           Write          Read           Write
                  performance    performance    performance    performance
                  (0.5K, MB/s)   (0.5K, MB/s)   (1 MB, MB/s)   (1 MB, MB/s)
STGT + iSER           7.9            9.8            589            647
SCST + SRP           12.3            9.7            811            794

http://www.mail-archive.com/linux-***@vger.kernel.org/msg13514.html

Here are my numbers with DDR IB HCAs, SCST/SRP 5G /dev/ram0
block_io mode, RHEL5 2.6.18-8.el5

direct i/o dd
write/read 1100/895 MB/s
dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct

buffered i/o dd
write/read 950/770 MB/s
dd if=/dev/zero of=/dev/sdc bs=1M count=5000
dd of=/dev/null if=/dev/sdc bs=1M count=5000

So when using DDR IB HCAs (write/read, MB/s):

                  stgt/iser     scst/srp
direct I/O        800/751       1100/895
buffered I/O      1109/350      950/770


-vu
Post by FUJITA Tomonori
I think that STGT is pretty fast with the fast backing storage.
I don't think that there is the notable perfornace difference between
kernel-space and user-space SRP (or ISER) implementations about moving
data between hosts. IB is expected to enable user-space applications
to move data between hosts quickly (if not, what can IB provide us?).
I think that the question is how fast user-space applications can do
I/Os ccompared with I/Os in kernel space. STGT is eager for the advent
of good asynchronous I/O and event notification interfances.
One more possible optimization for STGT is zero-copy data
transfer. STGT uses pre-registered buffers and move data between page
cache and thsse buffers, and then does RDMA transfer. If we implement
own caching mechanism to use pre-registered buffers directly with (AIO
and O_DIRECT), then STGT can move data without data copies.
Bart Van Assche
2008-01-30 08:38:04 UTC
Permalink
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
had already repeated my tests with ridiculously high values for the
following iSER parameters: FirstBurstLength, MaxBurstLength and
MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block
size specified to dd).
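
(For readers who want to try the same thing: with an open-iscsi based
iSER initiator -- used here purely as an illustration, not necessarily
the stack used in these tests -- such parameters can be adjusted in the
node database; <target-iqn> and <portal> are placeholders, and the
negotiated values are still capped by whatever the target offers.)

  iscsiadm -m node -T <target-iqn> -p <portal> -o update \
    -n node.session.iscsi.FirstBurstLength -v 16776192
  iscsiadm -m node -T <target-iqn> -p <portal> -o update \
    -n node.session.iscsi.MaxBurstLength -v 16776192
  iscsiadm -m node -T <target-iqn> -p <portal> -o update \
    -n node.conn[0].iscsi.MaxRecvDataSegmentLength -v 16776192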

Bart Van Assche.
FUJITA Tomonori
2008-01-30 10:56:37 UTC
Permalink
On Wed, 30 Jan 2008 09:38:04 +0100
Post by Bart Van Assche
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
Sorry, I can't say. I don't know much about iSER. But it seems that Pete
and Robin can get a better I/O performance to line speed ratio with
STGT.

The version of OpenIB might matter too. For example, Pete said that
STGT reads lose about 100 MB/s for some transfer sizes due to the
OpenIB version difference or other unclear reasons.

http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135

It's fair to say that it takes a long time and lots of knowledge to
get the maximum performance out of a SAN, I think.

I think that it would be easier to convince James with a detailed
analysis (e.g. where the time goes, like Pete did), not just
'dd' performance results.

Pushing iSCSI target code into mainline failed four times: IET, SCST,
STGT (doing I/Os in kernel in the past), and PyX's one (*1). iSCSI
target code is huge. You said SCST comprises 14,000 lines, but that's
not the iSCSI target code. The SCSI engine code comprises 14,000
lines. You need another 10,000 lines for the iSCSI driver. Note that
SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI
target code implements more iSCSI features (like MC/S, ERL2, etc.)
and comprises about 60,000 lines, and it still lacks some features like
iSER, bidi, etc.

I think that it's reasonable to say that we need more than 'dd'
results before pushing possibly more than 60,000 lines to
mainline.

(*1) http://linux-iscsi.org/
Vladislav Bolkhovitin
2008-01-30 11:40:08 UTC
Permalink
Post by FUJITA Tomonori
On Wed, 30 Jan 2008 09:38:04 +0100
Post by Bart Van Assche
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
Sorry, I can't say. I don't know much about iSER. But seems that Pete
and Robin can get the better I/O performance - line speed ratio with
STGT.
The version of OpenIB might matters too. For example, Pete said that
STGT reads loses about 100 MB/s for some transfer sizes for some
transfer sizes due to the OpenIB version difference or other unclear
reasons.
http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135
It's fair to say that it takes long time and need lots of knowledge to
get the maximum performance of SAN, I think.
I think that it would be easier to convince James with the detailed
analysis (e.g. where does it take so long, like Pete did), not just
'dd' performance results.
Pushing iSCSI target code into mainline failed four times: IET, SCST,
STGT (doing I/Os in kernel in the past), and PyX's one (*1). iSCSI
target code is huge. You said SCST comprises 14,000 lines, but it's
not iSCSI target code. The SCSI engine code comprises 14,000
lines. You need another 10,000 lines for the iSCSI driver. Note that
SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI
target code implemenents more iSCSI features (like MC/S, ERL2, etc)
and comprises about 60,000 lines and it still lacks some features like
iSER, bidi, etc.
I think that it's reasonable to say that we need more than 'dd'
results before pushing about possible more than 60,000 lines to
mainline.
Tomo, please stop counting in-kernel lines only (see
http://lkml.org/lkml/2007/4/24/364). The overall number of project
lines for the same feature set is a lot more important.
Post by FUJITA Tomonori
(*1) http://linux-iscsi.org/
Bart Van Assche
2008-01-30 13:10:47 UTC
Permalink
Post by FUJITA Tomonori
On Wed, 30 Jan 2008 09:38:04 +0100
Post by Bart Van Assche
Please specify which parameters you are referring to. As you know I
Sorry, I can't say. I don't know much about iSER. But seems that Pete
and Robin can get the better I/O performance - line speed ratio with
STGT.
Robin Humble was using a DDR InfiniBand network, while my tests were
performed with an SDR InfiniBand network. Robin's results can't be
directly compared to my results.

Pete Wyckoff's results
(http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf) are hard
to interpret. I have asked Pete which of the numbers in his test can
be compared with what I measured, but Pete did not reply.
Post by FUJITA Tomonori
The version of OpenIB might matters too. For example, Pete said that
STGT reads loses about 100 MB/s for some transfer sizes for some
transfer sizes due to the OpenIB version difference or other unclear
reasons.
http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135
Pete wrote about a degradation from 600 MB/s to 500 MB/s for reads
with STGT+iSER. In my tests I measured 589 MB/s for reads (direct
I/O), which matches the better result obtained by Pete.

Note: the InfiniBand kernel modules I used were those from the
2.6.22.9 kernel, not from the OFED distribution.

Bart.
FUJITA Tomonori
2008-01-30 13:54:31 UTC
Permalink
On Wed, 30 Jan 2008 14:10:47 +0100
Post by Bart Van Assche
Post by FUJITA Tomonori
On Wed, 30 Jan 2008 09:38:04 +0100
Post by Bart Van Assche
Please specify which parameters you are referring to. As you know I
Sorry, I can't say. I don't know much about iSER. But seems that Pete
and Robin can get the better I/O performance - line speed ratio with
STGT.
Robin Humble was using a DDR InfiniBand network, while my tests were
performed with an SDR InfiniBand network. Robin's results can't be
directly compared to my results.
I know that you use different hardware. That's why I used the word 'ratio'.


BTW, you said the performance difference of dio READ is 38%, but I
think it's 27.3%, though it's still large.
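
(The two numbers are the same comparison with different baselines: the
1 MB direct-I/O read figures quoted above are 589 MB/s for STGT+iSER and
811 MB/s for SCST+SRP.)

  echo "scale=4; (811 - 589) / 811" | bc   # ~27 % relative to SCST
  echo "scale=4; (811 - 589) / 589" | bc   # ~38 % relative to STGT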
Post by Bart Van Assche
Pete Wyckoff's results
(http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf) are hard
to interpret. I have asked Pete which of the numbers in his test can
be compared with what I measured, but Pete did not reply.
Post by FUJITA Tomonori
The version of OpenIB might matters too. For example, Pete said that
STGT reads loses about 100 MB/s for some transfer sizes for some
transfer sizes due to the OpenIB version difference or other unclear
reasons.
http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135
Pete wrote about a degradation from 600 MB/s to 500 MB/s for reads
with STGT+iSER. In my tests I measured 589 MB/s for reads (direct
I/O), which matches with the better result obtained by Pete.
I don't know whether he used the same benchmark software, so I don't
think that we can compare them.

All I tried to say is that the OFED version might have a big effect on
the performance. So you might need to find the best one.
Post by Bart Van Assche
Note: the InfiniBand kernel modules I used were those from the
2.6.22.9 kernel, not from the OFED distribution.
I'm talking about the target machine (I think that Pete was also talking
about OFED on his target machine). STGT uses OFED libraries, I think.
Bart Van Assche
2008-01-31 07:48:26 UTC
Permalink
Post by FUJITA Tomonori
On Wed, 30 Jan 2008 14:10:47 +0100
Post by Bart Van Assche
Post by FUJITA Tomonori
Sorry, I can't say. I don't know much about iSER. But seems that Pete
and Robin can get the better I/O performance - line speed ratio with
STGT.
Robin Humble was using a DDR InfiniBand network, while my tests were
performed with an SDR InfiniBand network. Robin's results can't be
directly compared to my results.
I know that you use different hardware. I used 'ratio' word.
Let's start by summarizing the relevant numbers from Robin's
measurements and my own measurements.

Maximum bandwidth of the underlying physical medium: 2000 MB/s for a
DDR 4x InfiniBand network and 1000 MB/s for an SDR 4x InfiniBand
network.

Maximum bandwidth reported by the OFED ib_write_bw test program: 1473
MB/s for Robin's setup and 933 MB/s for my setup. These numbers match
published ib_write_bw results (see e.g. figure 11 in
http://www.usenix.org/events/usenix06/tech/full_papers/liu/liu_html/index.html
or chapter 7 in
http://www.voltaire.com/ftp/rocks/HCA-4X0_Linux_GridStack_4.3_Release_Notes_DOC-00171-A00.pdf)

Throughput measured for communication via STGT + iSER to a remote RAM
disk via direct I/O with dd: 800 MB/s for writing and 751 MB/s for
reading in Robin's setup, and 647 MB/s for writing and 589 MB/s for
reading in my setup.
Ratio of this dd throughput to the ib_write_bw bandwidth: 54% for
writing and 51% for reading in Robin's setup, and 69% for writing and
63% for reading in my setup. That is a slightly better
utilization of the bandwidth in my setup than in Robin's setup. This
is no surprise -- the faster a communication link is, the harder it is
to use all of the available bandwidth.
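
The arithmetic behind those percentages (dd throughput divided by the
ib_write_bw result, using only the numbers quoted above):

  echo "scale=4; 800/1473" | bc   # Robin, write:    ~54 %
  echo "scale=4; 751/1473" | bc   # Robin, read:     ~51 %
  echo "scale=4; 647/933" | bc    # my setup, write: ~69 %
  echo "scale=4; 589/933" | bc    # my setup, read:  ~63 %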

So why did you state that in Robin's tests the I/O performance to line
speed ratio was better than in my tests?

Bart Van Assche.
Nicholas A. Bellinger
2008-01-31 13:25:38 UTC
Permalink
Greetings all,
Post by FUJITA Tomonori
On Wed, 30 Jan 2008 09:38:04 +0100
Post by Bart Van Assche
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
Sorry, I can't say. I don't know much about iSER. But seems that Pete
and Robin can get the better I/O performance - line speed ratio with
STGT.
The version of OpenIB might matters too. For example, Pete said that
STGT reads loses about 100 MB/s for some transfer sizes for some
transfer sizes due to the OpenIB version difference or other unclear
reasons.
http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135
It's fair to say that it takes long time and need lots of knowledge to
get the maximum performance of SAN, I think.
I think that it would be easier to convince James with the detailed
analysis (e.g. where does it take so long, like Pete did), not just
'dd' performance results.
Pushing iSCSI target code into mainline failed four times: IET, SCST,
STGT (doing I/Os in kernel in the past), and PyX's one (*1). iSCSI
target code is huge. You said SCST comprises 14,000 lines, but it's
not iSCSI target code. The SCSI engine code comprises 14,000
lines. You need another 10,000 lines for the iSCSI driver. Note that
SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI
target code implemenents more iSCSI features (like MC/S, ERL2, etc)
and comprises about 60,000 lines and it still lacks some features like
iSER, bidi, etc.
The PyX storage engine supports a scatterlist linked list algorithm that
maps any sector count + sector size combination down to contiguous
struct scatterlist arrays across (potentially) multiple Linux storage
subsystems from a single CDB received on an initiator port. This design
was a consequence of a requirement for running said engine on Linux v2.2
and v2.4 across non-cache-coherent systems (MIPS R5900-EE) using a
single contiguous memory block mapped into struct buffer_head for PATA
access, and struct scsi_cmnd access on USB storage. Note that this was
before struct bio and struct scsi_request existed.

The PyX storage engine as it exists at Linux-iSCSI.org today can be
thought of as a hybrid OSD processing engine, as it maps storage object
memory across a number of tasks from a received command CDB. The
ability to pass in pre-allocated memory from an RDMA-capable adapter, as
well as memory allocated internally (i.e. traditional iSCSI without
open_iscsi's struct skbuff rx zero-copy), is inherent in the design of
the storage engine. The lack of bidi support can be attributed to the
lack of wider support (and hence user interest) in bidi, but I am really
glad to see this getting into the SCSI ML and STGT, and it is certainly
of interest in the long term. Another feature that is missing in the
current engine is >16 byte CDBs, which I would imagine a lot of vendor
folks would like to see in Linux as well. This is pretty easy to add in
iSCSI with an AHS and in the engine and storage subsystems.
Post by FUJITA Tomonori
I think that it's reasonable to say that we need more than 'dd'
results before pushing about possible more than 60,000 lines to
mainline.
(*1) http://linux-iscsi.org/
The 60k lines of code also include functionality (the SE mirroring
comes to mind) that I do not plan to push towards mainline, along with
other legacy bits so we can build on earlier v2.6 embedded platforms.
The existing Target mode LIO-SE provides a linked-list scatterlist
mapping algorithm that is similar to what Jens and Rusty have been
working on, and is under 14k lines including the switch(cdb[0]) +
function pointer assignment to per-CDB specific structures that are
called potentially out-of-order in the RX-side context of the CmdSN
state machine in RFC-3720. The current SE is also lacking the very SCSI
specific task management state machines that not a whole lot of iSCSI
implementations implement properly, and which seem to be of minimal
interest to users and of moderate interest to vendors. Getting this
implemented generically in SCSI, as opposed to a transport-specific
mechanism, would benefit the Linux SCSI target engine.

The pSCSI (struct scsi_cmnd), iBlock (struct bio) and FILE (struct file)
plugins together are a grand total of 3.5k lines using the v2.9 LIO-SE
interface. Assuming we have a single preferred data and control path
for underlying physical and virtual block devices, this could also get
smaller. A quick check of the code puts the traditional kernel-level
iSCSI state machine at roughly 16k lines, which is pretty good for the
complete state machine. Also, having iSER and traditional iSCSI share
MC/S and ERL=2 common code will be of interest, as well as the iSCSI
login state machines, which are identical minus the extra iSER specific
keys and the requirement to transition from byte stream mode to RDMA
accelerated mode.

Since this particular code is located in a non-data-path critical
section, the kernel vs. user discussion is a wash. If we are talking
about the data path, yes, the relevance of dd tests for kernel designs is
suspect :p. For those IB testers who are interested, perhaps having a
look with disktest from the Linux Test Project would give a better
comparison between the two implementations on an RDMA-capable fabric
like IB for best case performance. I think everyone is interested in
seeing just how much data path overhead exists between userspace and
kernel space in typical and heavy workloads, and whether this overhead
can be minimized to make userspace a better option for some of this very
complex code.

--nab
Bart Van Assche
2008-01-31 14:34:17 UTC
Permalink
Post by Nicholas A. Bellinger
Since this particular code is located in a non-data path critical
section, the kernel vs. user discussion is a wash. If we are talking
about data path, yes, the relevance of DD tests in kernel designs are
suspect :p. For those IB testers who are interested, perhaps having a
look with disktest from the Linux Test Project would give a better
comparision between the two implementations on a RDMA capable fabric
like IB for best case performance. I think everyone is interested in
seeing just how much data path overhead exists between userspace and
kernel space in typical and heavy workloads, if if this overhead can be
minimized to make userspace a better option for some of this very
complex code.
I can run disktest on the same setups I ran dd on. This will take some
time, however.

Disktest is new to me -- any hints with regard to suitable
combinations of command line parameters are welcome. The most recent
version I could find on http://ltp.sourceforge.net/ is ltp-20071231.

Bart Van Assche.
Nicholas A. Bellinger
2008-01-31 14:44:23 UTC
Permalink
Hi Bart,
Post by Bart Van Assche
Post by Nicholas A. Bellinger
Since this particular code is located in a non-data path critical
section, the kernel vs. user discussion is a wash. If we are talking
about data path, yes, the relevance of DD tests in kernel designs are
suspect :p. For those IB testers who are interested, perhaps having a
look with disktest from the Linux Test Project would give a better
comparision between the two implementations on a RDMA capable fabric
like IB for best case performance. I think everyone is interested in
seeing just how much data path overhead exists between userspace and
kernel space in typical and heavy workloads, if if this overhead can be
minimized to make userspace a better option for some of this very
complex code.
I can run disktest on the same setups I ran dd on. This will take some
time however.
Disktest is new to me -- any hints with regard to suitable
combinations of command line parameters are welcome. The most recent
version I could find on http://ltp.sourceforge.net/ is ltp-20071231.
I posted some numbers with traditional iSCSI on Neterion Xframe I 10
Gb/sec with LRO back in 2005 with disktest on the 1st generation x86_64
hardware available at the time. These tests were designed to show the
performance advantages of internexus multiplexing that is available
within traditional iSCSI, as well as iSER.

The disktest parameters that I used are listed in the following thread:

https://www.redhat.com/archives/dm-devel/2005-April/msg00013.html

--nab
Post by Bart Van Assche
Bart Van Assche.
Vladislav Bolkhovitin
2008-01-31 15:50:44 UTC
Permalink
Post by Bart Van Assche
Post by Nicholas A. Bellinger
Since this particular code is located in a non-data path critical
section, the kernel vs. user discussion is a wash. If we are talking
about data path, yes, the relevance of DD tests in kernel designs are
suspect :p. For those IB testers who are interested, perhaps having a
look with disktest from the Linux Test Project would give a better
comparision between the two implementations on a RDMA capable fabric
like IB for best case performance. I think everyone is interested in
seeing just how much data path overhead exists between userspace and
kernel space in typical and heavy workloads, if if this overhead can be
minimized to make userspace a better option for some of this very
complex code.
I can run disktest on the same setups I ran dd on. This will take some
time however.
Disktest was already referenced in the beginning of the performance
comparison thread, but its results are not very interesting if we are
trying to find out which implementation is more effective, because in
the modes in which people usually run this utility, it produces a
latency-insensitive workload (multiple threads working in parallel). So
such multithreaded disktest results will differ between STGT and SCST
only if STGT's implementation becomes CPU bound on the target. If the
CPU on the target is powerful enough, even extra busy loops in the STGT
or SCST hot path code will change nothing.

Additionally, multithreaded disktest over a RAM disk is a good example of
a synthetic benchmark, which has almost no relation to real-life
workloads. But people like it, because it produces nice-looking results.

Actually, I don't know what kind of conclusions it is possible to draw
from disktest's results (maybe only how throughput scales with an
increasing number of threads?); it's a good stress test tool, but
not more.
Post by Bart Van Assche
Disktest is new to me -- any hints with regard to suitable
combinations of command line parameters are welcome. The most recent
version I could find on http://ltp.sourceforge.net/ is ltp-20071231.
Bart Van Assche.
Joe Landman
2008-01-31 16:25:32 UTC
Permalink
[...]
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
I can run disktest on the same setups I ran dd on. This will take some
time however.
Disktest was already referenced in the beginning of the performance
comparison thread, but its results are not very interesting if we are
going to find out, which implementation is more effective, because in
the modes, in which usually people run this utility, it produces latency
insensitive workload (multiple threads working in parallel). So, such
There are other issues with disktest, in that you can easily specify
option combinations that generate apparently 5+ GB/s of IO, though
actual traffic over the link to storage is very low. Caveat disktest
emptor.
Post by Vladislav Bolkhovitin
multithreaded disktests results will be different between STGT and SCST
only if STGT's implementation will get target CPU bound. If CPU on the
target is powerful enough, even extra busy loops in the STGT or SCST hot
path code will change nothing.
Additionally, multithreaded disktest over RAM disk is a good example of
a synthetic benchmark, which has almost no relation with real life
workloads. But people like it, because it produces nice looking results.
I agree. The backing store should be a disk for it to have meaning,
though please note my caveat above.
Post by Vladislav Bolkhovitin
Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or slower
with increasing number of threads?), it's a good stress test tool, but
not more.
Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to
come far closer to "real world" tests than disktest and iozone, the
latter of which does more to test the speed of RAM cache and system call
performance than actual IO.
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
Disktest is new to me -- any hints with regard to suitable
combinations of command line parameters are welcome. The most recent
version I could find on http://ltp.sourceforge.net/ is ltp-20071231.
Bart Van Assche.
Here is what I have run:

disktest -K 8 -B 256k -I F -N 20000000 -P A -w /big/file
disktest -K 8 -B 64k -I F -N 20000000 -P A -w /big/file
disktest -K 8 -B 1k -I B -N 2000000 -P A /dev/sdb2

and many others.
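
For what it's worth, this is how I read the flags of the first invocation
(my understanding of the LTP disktest options -- please double-check
against 'disktest -h' on the version you have):

  # -K 8          eight concurrent worker threads
  # -B 256k       256 KiB transfer (block) size
  # -I F          file I/O ('B' selects raw block-device I/O)
  # -N 20000000   size of the area exercised, in blocks
  # -P A          print all available performance statistics
  # -w            write test (use -r for reads)
  disktest -K 8 -B 256k -I F -N 20000000 -P A -w /big/file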



Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
Bart Van Assche
2008-01-31 17:08:48 UTC
Permalink
Post by Joe Landman
Post by Vladislav Bolkhovitin
Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or slower
with increasing number of threads?), it's a good stress test tool, but
not more.
Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to
bear far closer to "real world" tests than disktest and iozone, the
latter of which does more to test the speed of RAM cache and system call
performance than actual IO.
I have run some tests with Bonnie++, but found out that on a fast
network like IB the filesystem used for the test has a really big
impact on the test results.

If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let me know.

Bart Van Assche.
Joe Landman
2008-01-31 17:13:12 UTC
Permalink
Post by Bart Van Assche
I have ran some tests with Bonnie++, but found out that on a fast
network like IB the filesystem used for the test has a really big
impact on the test results.
This is true of the file systems when physically directly connected to
the unit as well. Some file systems are designed with high performance
in mind, some are not.
Post by Bart Van Assche
If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.
Hmmm... if you care about the protocol side, I can't help. Our users
are more concerned with the file system side, so this is where we focus
our tuning attention.
Post by Bart Van Assche
Bart Van Assche.
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
David Dillow
2008-01-31 18:12:15 UTC
Permalink
Post by Bart Van Assche
If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.
xdd on /dev/sda, sdb, etc. using -dio to do direct IO seems to work
decently, though it is hard (i.e., impossible) to get a repeatable
sequence of IO when using higher queue depths, as it uses threads to
generate multiple requests.

You may also look at sgpdd_survey from Lustre's iokit, but I've not done
much with that -- it uses the sg devices to send lowlevel SCSI commands.

I've been playing around with some benchmark code using libaio, but it's
not in generally usable shape.

xdd:
http://www.ioperformance.com/products.htm

Lustre IO Kit:
http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-20-1.html
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
Vladislav Bolkhovitin
2008-02-01 11:50:36 UTC
Permalink
Post by David Dillow
Post by Bart Van Assche
If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.
xdd on /dev/sda, sdb, etc. using -dio to do direct IO seems to work
decently, though it is hard (ie, impossible) to get a repeatable
sequence of IO when using higher queue depths, as it uses threads to
generate multiple requests.
This utility seems to be a good one, but it's basically the same as
disktest, although much more advanced.
Post by David Dillow
You may also look at sgpdd_survey from Lustre's iokit, but I've not done
much with that -- it uses the sg devices to send lowlevel SCSI commands.
Yes, it might be worth a try. Since fundamentally it's the same as
O_DIRECT dd, but with a bit less overhead on the initiator side (hence
less initiator-side latency), most likely it will show an even bigger
difference than dd does.
Post by David Dillow
I've been playing around with some benchmark code using libaio, but it's
not in generally usable shape.
http://www.ioperformance.com/products.htm
http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-20-1.html
Vladislav Bolkhovitin
2008-02-01 11:50:40 UTC
Permalink
Post by Bart Van Assche
Post by Joe Landman
Post by Vladislav Bolkhovitin
Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or slower
with increasing number of threads?), it's a good stress test tool, but
not more.
Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to
bear far closer to "real world" tests than disktest and iozone, the
latter of which does more to test the speed of RAM cache and system call
performance than actual IO.
I have ran some tests with Bonnie++, but found out that on a fast
network like IB the filesystem used for the test has a really big
impact on the test results.
If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.
I would suggest trying something from real life (a rough sketch of the
first option follows the list below), like:

- Copying a large file tree over a single or multiple IB links

- Measuring some DB engine's TPC score

- etc.
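
As mentioned above, a rough sketch of the first option (the device name
and source tree are placeholders; this assumes the imported LUN shows up
as /dev/sdc on the initiator):

  mkfs.ext3 /dev/sdc                          # the SCST- or STGT-exported disk
  mount /dev/sdc /mnt/target
  time cp -a /usr/src/linux /mnt/target/      # push a large file tree over the IB link
  sync
  time cp -a /mnt/target/linux /tmp/readback  # and pull it back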
Post by Bart Van Assche
Bart Van Assche.
Vladislav Bolkhovitin
2008-02-01 12:25:34 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
Post by Joe Landman
Post by Vladislav Bolkhovitin
Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or slower
with increasing number of threads?), it's a good stress test tool, but
not more.
Unfortunately, I agree. Bonnie++, dd tests, and a few others seem to
bear far closer to "real world" tests than disktest and iozone, the
latter of which does more to test the speed of RAM cache and system call
performance than actual IO.
I have ran some tests with Bonnie++, but found out that on a fast
network like IB the filesystem used for the test has a really big
impact on the test results.
If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.
- Copying large file tree over a single or multiple IB links
- Measure of some DB engine's TPC
- etc.
Forgot to mention: during those tests, make sure that the imported devices
from both SCST and STGT report the same write cache and FUA capabilities
in the kernel log, since these significantly affect the initiator's
behavior. Like:

sd 4:0:0:5: [sdf] Write cache: enabled, read cache: enabled, supports
DPO and FUA

For SCST the fastest mode is NV_CACHE, refer to its README file for details.
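
A quick way to check this on the initiator (the device name below is just
an example):

  dmesg | grep -iE 'write cache|FUA'
  # both the SCST- and the STGT-exported disks should show the same line, e.g.:
  #   sd 4:0:0:5: [sdf] Write cache: enabled, read cache: enabled, supports DPO and FUA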
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
Bart Van Assche.
Nicholas A. Bellinger
2008-01-31 17:14:20 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
Post by Nicholas A. Bellinger
Since this particular code is located in a non-data path critical
section, the kernel vs. user discussion is a wash. If we are talking
about data path, yes, the relevance of DD tests in kernel designs are
suspect :p. For those IB testers who are interested, perhaps having a
look with disktest from the Linux Test Project would give a better
comparision between the two implementations on a RDMA capable fabric
like IB for best case performance. I think everyone is interested in
seeing just how much data path overhead exists between userspace and
kernel space in typical and heavy workloads, if if this overhead can be
minimized to make userspace a better option for some of this very
complex code.
I can run disktest on the same setups I ran dd on. This will take some
time however.
Disktest was already referenced in the beginning of the performance
comparison thread, but its results are not very interesting if we are
going to find out, which implementation is more effective, because in
the modes, in which usually people run this utility, it produces latency
insensitive workload (multiple threads working in parallel). So, such
multithreaded disktests results will be different between STGT and SCST
only if STGT's implementation will get target CPU bound. If CPU on the
target is powerful enough, even extra busy loops in the STGT or SCST hot
path code will change nothing.
I think the really interesting numbers are the difference for bulk I/O
between kernel and userspace on both traditional iSCSI and the RDMA
enabled flavours. I have not been able to determine anything earth
shattering from the current run of kernel vs. userspace tests, nor which
method of implementation for iSER, SRP, and the generic Storage Engine is
'more effective' for that case. Performance and latency to real storage
would make a lot more sense for the kernel vs. user case. Also, workloads
against software LVM and Linux MD block devices would be of interest, as
these would be some of the more typical deployments in the
field, and are what Linux-iSCSI.org uses for our production cluster
storage today.

Having implemented my own iSCSI and SCSI Target mode Storage Engine
leads me to believe that putting logic in userspace is probably a good
idea in the long term. If this means putting the entire data IO path
into userspace for Linux/iSCSI, then there needs to be a good reason why
this will not scale to multi-port 10 Gb/sec engines in traditional
and RDMA mode if we need to take this codepath back into the kernel.
The end goal is to have the most polished and complete storage engine
and iSCSI stack designs go upstream, which is something I think we can
all agree on.

Also, with STGT being a pretty new design which has not undergone a lot
of optimization, perhaps profiling both pieces of code against similar
tests would give us a better idea of where the userspace bottlenecks
reside. Also, the overhead involved with traditional iSCSI for bulk IO
from kernel / userspace would be a key concern for a much larger set of
users, as iSER and SRP on IB have a pretty small userbase that will
probably remain small for the near future.
Post by Vladislav Bolkhovitin
Additionally, multithreaded disktest over RAM disk is a good example of
a synthetic benchmark, which has almost no relation with real life
workloads. But people like it, because it produces nice looking results.
Yes, people like to claim their stacks are the fastest with RAM disk
benchmarks. But hooking up their fast network silicon to existing
storage hardware and OS storage subsystems and software is where the
real game is.
Post by Vladislav Bolkhovitin
Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or slower
with increasing number of threads?), it's a good stress test tool, but
not more.
Being able to have a best-case baseline with disktest for kernel vs.
user would be of interest for both transport protocol and SCSI Target
mode Storage Engine profiling. The first run of tests looked pretty
bandwidth oriented, so disktest works well to determine maximum bandwidth.
Disktest is also nice for getting reads from cache on hardware RAID
controllers, because disktest only generates requests with LBAs from 0 ->
disktest BLOCKSIZE.

--nab
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
Disktest is new to me -- any hints with regard to suitable
combinations of command line parameters are welcome. The most recent
version I could find on http://ltp.sourceforge.net/ is ltp-20071231.
Bart Van Assche.
Bart Van Assche
2008-01-31 17:40:59 UTC
Permalink
Post by Nicholas A. Bellinger
Also, with STGT being a pretty new design which has not undergone alot
of optimization, perhaps profiling both pieces of code against similar
tests would give us a better idea of where userspace bottlenecks reside.
Also, the overhead involved with traditional iSCSI for bulk IO from
kernel / userspace would also be a key concern for a much larger set of
users, as iSER and SRP on IB is a pretty small userbase and will
probably remain small for the near future.
Two important trends in data center technology are server
consolidation and storage consolidation. Among others, every web hosting
company can profit from a fast storage solution. I wouldn't call this
a small user base.

Regarding iSER and SRP on IB: InfiniBand is today the most economic
solution for a fast storage network. I do not know which technology
will be the most popular for storage consolidation within a few years
-- this can be SRP, iSER, IPoIB + SDP, FCoE (Fibre Channel over
Ethernet) or maybe yet another technology. No matter which technology
becomes the most popular for storage applications, there will be a
need for high-performance storage software.

References:
* Michael Feldman, Battle of the Network Fabrics, HPCwire, December
2006, http://www.hpcwire.com/hpc/1145060.html
* NetApp, Reducing Data Center Power Consumption Through Efficient
Storage, February 2007,
http://www.netapp.com/ftp/wp-reducing-datacenter-power-consumption.pdf

Bart Van Assche.
Nicholas A. Bellinger
2008-01-31 18:15:39 UTC
Permalink
Post by Bart Van Assche
Post by Nicholas A. Bellinger
Also, with STGT being a pretty new design that has not undergone a lot
of optimization, perhaps profiling both pieces of code against similar
tests would give us a better idea of where the userspace bottlenecks reside.
Also, the overhead involved with traditional iSCSI for bulk I/O from
kernel / userspace would be a key concern for a much larger set of
users, as iSER and SRP on IB have a pretty small userbase and will
probably remain small for the near future.
Two important trends in data center technology are server
consolidation and storage consolidation. Among others, every web hosting
company can profit from a fast storage solution. I wouldn't call this
a small user base.
Regarding iSER and SRP on IB: InfiniBand is today the most economic
solution for a fast storage network. I do not know which technology
will be the most popular for storage consolidation within a few years
-- this can be SRP, iSER, IPoIB + SDP, FCoE (Fibre Channel over
Ethernet) or maybe yet another technology. No matter which technology
becomes the most popular for storage applications, there will be a
need for high-performance storage software.
I meant small in reference to storage on IB fabrics, which has usually been
found in research and national lab settings, with some other vendors
offering IB as an alternative storage fabric for those who [w,c]ould not
wait for 10 Gb/sec copper Ethernet and Direct Data Placement to come
online. These are small numbers compared to, say, traditional iSCSI, which
is getting used all over the place these days in areas I won't bother
listing here.

As for the future, I am obviously cheering for IP storage fabrics, in
particular 10 Gb/sec Ethernet and Direct Data Placement in concert with
iSCSI Extensions for RDMA, to give the data center a high-performance,
low-latency transport that can do OS-independent storage multiplexing
and recovery across multiple independently developed implementations.
Also, avoiding lock-in from non-interoperable storage transports (especially
on the high end) that had plagued so many vendors in years past has
become a real option in the past few years with an IETF-defined block
level storage protocol. We are actually going on four years since
RFC-3720 was ratified (April 2004).

Making the 'enterprise' Ethernet switching equipment go from millisecond
to nanosecond latency is a whole different story that goes beyond my
area of expertise. I know there is one startup (Fulcrum Micro) that is
working on this problem and seems to be making some good progress.

--nab
Post by Bart Van Assche
* Michael Feldman, Battle of the Network Fabrics, HPCwire, December
2006, http://www.hpcwire.com/hpc/1145060.html
* NetApp, Reducing Data Center Power Consumption Through Efficient
Storage, February 2007,
http://www.netapp.com/ftp/wp-reducing-datacenter-power-consumption.pdf
Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-01 09:08:28 UTC
Permalink
Post by Nicholas A. Bellinger
I meant small in reference to storage on IB fabrics, which has usually been
found in research and national lab settings, with some other vendors
offering IB as an alternative storage fabric for those who [w,c]ould not
wait for 10 Gb/sec copper Ethernet and Direct Data Placement to come
online. These are small numbers compared to, say, traditional iSCSI, which
is getting used all over the place these days in areas I won't bother
listing here.
InfiniBand has several advantages over 10 Gbit/s Ethernet (the list
below probably isn't complete):
- Lower latency. Communication latency is not only determined by the
latency of a switch. The whole InfiniBand protocol stack was designed
with low latency in mind. Low latency is really important for database
software that accesses storage over a network.
- High availability is implemented at the network layer. Suppose that
a group of servers has dual-port network interfaces and is
interconnected via a so-called dual-star topology. With an InfiniBand
network, failover in case of a single failure (link or switch) is
handled without any operating system or application intervention. With
Ethernet, failover in case of a single failure must be handled either
by the operating system or by the application.
- You do not have to use iSER or SRP to use the bandwidth of an
InfiniBand network effectively. SDP (the Sockets Direct Protocol)
makes it possible for applications to benefit from RDMA while using
the classic IPv4 Berkeley sockets interface (see the small sketch
below). A software SDP implementation is already available today via
OFED. iperf reports 470 MB/s on single-threaded tests and 975 MB/s for
a performance test with two threads on an SDR 4x InfiniBand network.
These tests were performed with the OFED 1.2.5.4 SDP implementation.
It is possible that future SDP implementations will perform even
better. (Note: I could not yet get iSCSI over SDP working.)
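
To illustrate why that matters: the sketch below is just an ordinary
Berkeley-sockets client, with nothing SDP- or InfiniBand-specific in it.
As far as I understand the OFED software SDP implementation, an unmodified
program like this can be carried over SDP (for example by preloading
libsdp), which is exactly what makes SDP attractive. The address and port
below are placeholders only.

/* plain sockets client -- nothing here knows about SDP or RDMA */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in addr;
        char buf[4096];
        ssize_t n;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5001);                    /* placeholder port */
        inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr); /* placeholder peer */

        if (fd >= 0 && connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
                while ((n = read(fd, buf, sizeof(buf))) > 0)
                        ;                               /* data path unchanged */
        close(fd);
        return 0;
}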

We should leave the choice of networking technology open -- both
Ethernet and InfiniBand have specific advantages.

See also:
InfiniBand Trade Association, InfiniBand Architecture Specification
Release 1.2.1, http://www.infinibandta.org/specs/register/publicspec/

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-01 08:11:24 UTC
Permalink
Post by Nicholas A. Bellinger
The PyX storage engine supports a scatterlist linked list algorithm that
...
Which parts of the PyX source code are licensed under the GPL and
which parts are closed source ? A Google query for PyX + iSCSI showed
information about licensing deals. Licensing deals can only be closed
for software that is not entirely licensed under the GPL.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-01 10:39:24 UTC
Permalink
Post by Bart Van Assche
Post by Nicholas A. Bellinger
The PyX storage engine supports a scatterlist linked list algorithm that
...
Which parts of the PyX source code are licensed under the GPL and
which parts are closed source ? A Google query for PyX + iSCSI showed
information about licensing deals. Licensing deals can only be closed
for software that is not entirely licensed under the GPL.
I was using the name PyX to give an historical context to the
discussion. :-) In more recent times, I have been using the names "LIO
Target Stack" and "LIO Storage Engine" to refer to the traditional RFC-3720
target state machines and the SCSI processing engine implementation,
respectively. The codebase has matured significantly from the original:
as the Linux SCSI, ATA and block subsystems evolved from v2.2 through
v2.4, v2.5 and modern v2.6, the LIO stack has grown (and sometimes
shrunk) along with the following requirement: to support all possible
storage devices on all subsystems on any hardware platform that Linux
could be made to boot. Interoperability with other, non-Linux SCSI subsystems
was also an issue early in development. If you can imagine a Solaris
SCSI subsystem asking for T10 EVPD WWN information from a Linux/iSCSI
target with pre-libata SATA drivers, you can probably guess just how much
time was spent looking at packet captures to figure out how to make OS
dependent (ie: outernexus) multipath play nice.

Note that the PyX Target Code for Linux v2.6 has been available in source
and binary form for a diverse array of Linux devices and environments
since September 2007. Right around this time, the Linux-iSCSI.org
storage and virtualization stack went online for the first time using
OCFS2, PVM, HVM, LVM, RAID6 and, of course, traditional RFC-3720 on 10
Gb/sec and 1 Gb/sec fabric. There has also been world-first storage
research work, and prototypes have been developed with the LIO code.
Information on these topics is available from the homepage, and a few
links deep there are older projects and information about features
inherent to the LIO Target and Storage Engine. One of my items for the
v2.9 codebase in 2008 is to start picking apart the current code and
determining which pieces should be sent upstream for review. I have
also been spending a lot of time recently looking at the other available
open source storage transport and processing stacks and seeing how
Linux/iSCSI and other projects can benefit from our large pool of
people, knowledge, and code.

Speaking of the LIO Target and SE code, today it runs the production
services for Linux-iSCSI.org and its storage and virtualization
clusters on x86_64. It also provides a base for next-generation
and forward-looking projects that exist (or will soon exist :-) within the
Linux/iSCSI ecosystem. A lot of time and resources have been put
into the codebase, and the result is a real, live, working RFC-3720 stack
that supports the optional features giving iSCSI (and, by design, iSER)
the flexibility and transparency to operate as the original
designers intended.

Many thanks for your most valuable of time,

--nab
Post by Bart Van Assche
Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-01 11:04:14 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Bart Van Assche
Post by Nicholas A. Bellinger
The PyX storage engine supports a scatterlist linked list algorithm that
...
Which parts of the PyX source code are licensed under the GPL and
which parts are closed source ? A Google query for PyX + iSCSI showed
information about licensing deals. Licensing deals can only be closed
for software that is not entirely licensed under the GPL.
I was using the name PyX to give an historical context to the
discussion. ...
Regarding the PyX Target Code: I have found a link via which I can
download a free 30-day demo. This means that a company is earning
money via this target code and that the source code is not licensed
under the GPL. This is fine, but it also means that today the PyX
target code is not a candidate for inclusion in the Linux kernel, and
that it is unlikely that all of the PyX target code (kernelspace +
userspace) will be made available under GPL soon.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-01 12:05:17 UTC
Permalink
Post by Bart Van Assche
Post by Nicholas A. Bellinger
Post by Bart Van Assche
Post by Nicholas A. Bellinger
The PyX storage engine supports a scatterlist linked list algorithm that
...
Which parts of the PyX source code are licensed under the GPL and
which parts are closed source ? A Google query for PyX + iSCSI showed
information about licensing deals. Licensing deals can only be closed
for software that is not entirely licensed under the GPL.
I was using the name PyX to give an historical context to the
discussion. ...
Regarding the PyX Target Code: I have found a link via which I can
download a free 30-day demo. This means that a company is earning
money via this target code and that the source code is not licensed
under the GPL. This is fine, but it also means that today the PyX
target code is not a candidate for inclusion in the Linux kernel, and
that it is unlikely that all of the PyX target code (kernelspace +
userspace) will be made available under GPL soon.
All of the kernel and C userspace code is open source and available from
linux-iscsi.org and licensed under the GPL. There is also BSD-licensed
code in userspace (iSNS), as well as iSCSI and SCSI MIBs. As for which
pieces of code will be going upstream (for kernel and/or userspace), the LIO
Target state machines and SE algorithms are definitely some of the best
examples of GPL code for a production IP storage fabric, and they have gained
maturity from the people and resources applied to them in a number of
respects.

The LIO stack presents a number of possible options to get a diverse
range of hardware and software to work. Completely dismissing the
available code is certainly a waste, and there are still significant
amounts of functionality related to real-time administration, RFC-3720
MC/S and ERL=2, and generic SE functionality for OS storage subsystems that
only exist in LIO and our associated projects. One obvious example is
the LIO-VM project, which brings LIO active-active transport recovery
and other Linux storage functionality to VMware and QEMU images that can
provide a target mode IP storage fabric on x86 non-Linux based hosts. A
first of its kind in the Linux/iSCSI universe.

Anyways, lets get back to the technical discussion.

--nab
Post by Bart Van Assche
Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-01 13:25:34 UTC
Permalink
Post by Nicholas A. Bellinger
All of the kernel and C userspace code is open source and available from
linux-iscsi.org and licensed under the GPL.
I found a statement on a web page that the ERL2 implementation is not
included in the GPL version (http://zaal.org/iscsi/index.html). The
above implies that this statement is incorrect. Tomo, are you the
maintainer of this web page ?

I'll try to measure the performance of the LIO Target Stack on the
same setup on which I ran the other performance tests.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-01 14:36:51 UTC
Permalink
Post by Bart Van Assche
Post by Nicholas A. Bellinger
All of the kernel and C userspace code is open source and available from
linux-iscsi.org and licensed under the GPL.
I found a statement on a web page that the ERL2 implementation is not
included in the GPL version (http://zaal.org/iscsi/index.html). The
above implies that this statement is incorrect. Tomo, are you the
maintainer of this web page ?
This was mentioned in the context of the Core-iSCSI Initiator module.
Post by Bart Van Assche
I'll try to measure the performance of the LIO Target Stack on the
same setup on which I ran the other performance tests.
Great!

--nab


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
James Bottomley
2008-01-30 16:34:31 UTC
Permalink
Post by Bart Van Assche
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
had already repeated my tests with ridiculously high values for the
following iSER parameters: FirstBurstLength, MaxBurstLength and
MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block
size specified to dd).
The 1 MB block size is a bit of a red herring. Unless you've
specifically increased the max_sector_size and are using an sg_chain
converted driver, on x86 the maximum possible transfer accumulation is
0.5 MB.

I certainly don't rule out that increasing the transfer size up from
0.5 MB might be the way to improve STGT efficiency, since at a 1 GB/s
theoretical peak that's roughly 2000 context switches per second; however,
it doesn't look like you've done anything that will overcome the block
layer limitations.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-01-30 16:50:52 UTC
Permalink
On Jan 30, 2008 5:34 PM, James Bottomley
Post by James Bottomley
Post by Bart Van Assche
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
had already repeated my tests with ridiculously high values for the
following iSER parameters: FirstBurstLength, MaxBurstLength and
MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block
size specified to dd).
The 1 MB block size is a bit of a red herring. Unless you've
specifically increased the max_sector_size and are using an sg_chain
converted driver, on x86 the maximum possible transfer accumulation is
0.5 MB.
I did not publish the results, but I have also done tests with other
block sizes. The other sizes I tested were between 0.1 MB and 10 MB. The
performance difference for these other sizes compared to a block size
of 1 MB was small (smaller than the variance between individual test
results).

Bart.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Pete Wyckoff
2008-02-02 15:32:31 UTC
Permalink
Post by James Bottomley
Post by Bart Van Assche
Post by FUJITA Tomonori
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
Please specify which parameters you are referring to. As you know I
had already repeated my tests with ridiculously high values for the
following iSER parameters: FirstBurstLength, MaxBurstLength and
MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block
size specified to dd).
The 1 MB block size is a bit of a red herring. Unless you've
specifically increased the max_sector_size and are using an sg_chain
converted driver, on x86 the maximum possible transfer accumulation is
0.5 MB.
I certainly don't rule out that increasing the transfer size up from
0.5 MB might be the way to improve STGT efficiency, since at a 1 GB/s
theoretical peak that's roughly 2000 context switches per second; however,
it doesn't look like you've done anything that will overcome the block
layer limitations.
The MRDSL parameter has no effect on iSER, as the RFC describes.
How to transfer data to satisfy a command is solely up to the
target. So you would need both big requests from the client and a
look at how the target will send the data.

I've only used 512 kB for the RDMA transfer size from the target, as
it matches the default client size and was enough to get good
performance out of my IB gear and minimizes resource consumption on
the target. It's currently hard-coded as a #define. There is no
provision in the protocol for the client to dictate the value.

If others want to spend some effort trying to tune stgt for iSER,
there are a fair number of comments in the code, including a big one
that explains this RDMA transfer size issue. And I'll answer
informed questions as I can. But I'm not particularly interested in
arguing about which implementation is best, or trying to interpret
bandwidth comparison numbers from poorly designed tests. It takes
work to understand these issues.

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-06 12:16:43 UTC
Permalink
Using such large values for FirstBurstLength will give you poor
performance numbers for WRITE commands (with iSER). FirstBurstLength
means how much data should you send as unsolicited data (i.e. without
RDMA). It means that your WRITE commands were sent without RDMA.
Sorry, but I'm afraid you got this wrong. When the iSER transport is
used instead of TCP, all data is sent via RDMA, including unsolicited
data. If you have look at the iSER implementation in the Linux kernel
(source files under drivers/infiniband/ulp/iser), you will see that
all data is transferred via RDMA and not via TCP/IP.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Benny Halevy
2008-02-06 16:45:47 UTC
Permalink
Post by Bart Van Assche
Using such large values for FirstBurstLength will give you poor
performance numbers for WRITE commands (with iSER). FirstBurstLength
means how much data should you send as unsolicited data (i.e. without
RDMA). It means that your WRITE commands were sent without RDMA.
Sorry, but I'm afraid you got this wrong. When the iSER transport is
used instead of TCP, all data is sent via RDMA, including unsolicited
data. If you have look at the iSER implementation in the Linux kernel
(source files under drivers/infiniband/ulp/iser), you will see that
all data is transferred via RDMA and not via TCP/IP.
Regardless of what the current implementation is, the behavior you (Bart)
describe seems to disagree with http://www.ietf.org/rfc/rfc5046.txt.

Benny
Post by Bart Van Assche
Bart Van Assche.
-
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Roland Dreier
2008-02-06 17:06:37 UTC
Permalink
Post by Bart Van Assche
Sorry, but I'm afraid you got this wrong. When the iSER transport is
used instead of TCP, all data is sent via RDMA, including unsolicited
data. If you have look at the iSER implementation in the Linux kernel
(source files under drivers/infiniband/ulp/iser), you will see that
all data is transferred via RDMA and not via TCP/IP.
I think the confusion here is caused by a slight misuse of the term
"RDMA". It is true that all data is always transported over an
InfiniBand connection when iSER is used, but not all such transfers
are one-sided RDMA operations; some data can be transferred using
send/receive operations.

- R.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Erez Zilber
2008-02-18 09:43:31 UTC
Permalink
Post by Bart Van Assche
Using such large values for FirstBurstLength will give you poor
performance numbers for WRITE commands (with iSER). FirstBurstLength
means how much data should you send as unsolicited data (i.e. without
RDMA). It means that your WRITE commands were sent without RDMA.
Sorry, but I'm afraid you got this wrong. When the iSER transport is
used instead of TCP, all data is sent via RDMA, including unsolicited
data. If you have look at the iSER implementation in the Linux kernel
(source files under drivers/infiniband/ulp/iser), you will see that
all data is transferred via RDMA and not via TCP/IP.
When you execute WRITE commands with iSCSI, it works like this:

EDTL (Expected data length) - the data length of your command

FirstBurstLength - the length of data that will be sent as unsolicited
data (i.e. as immediate data with the SCSI command and as unsolicited
data-out PDUs)

If you use a high value for FirstBurstLength, all (or most) of your data
will be sent as unsolicited data-out PDUs. These PDUs don't use the RDMA
engine, so you miss the advantage of IB.

If you use a lower value for FirstBurstLength, EDTL - FirstBurstLength
bytes will be sent as solicited data-out PDUs. With iSER, solicited
data-out PDUs are RDMA operations.
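
In other words (a rough sketch of the arithmetic only -- the helper and
the sample values below are made up for illustration and are not taken
from any of the iSCSI stacks):

/* Split of a WRITE of edtl bytes for a given FirstBurstLength: the first
 * part goes as unsolicited data (immediate data + unsolicited data-out
 * PDUs, no RDMA under iSER), the remainder as solicited data-out, which
 * under iSER is done with RDMA. */
#include <stdio.h>

static void split_write(unsigned long edtl, unsigned long first_burst_length)
{
        unsigned long unsolicited =
                edtl < first_burst_length ? edtl : first_burst_length;
        unsigned long solicited = edtl - unsolicited;

        printf("EDTL %lu: %lu bytes unsolicited, %lu bytes solicited (RDMA)\n",
               edtl, unsolicited, solicited);
}

int main(void)
{
        split_write(1UL << 20, 64UL * 1024); /* 1 MB write, 64 KB FirstBurstLength */
        split_write(1UL << 20, 16UL << 20);  /* 1 MB write, huge FirstBurstLength  */
        return 0;
}

With the second call everything ends up unsolicited, which is exactly the
"large FirstBurstLength" case described above.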

I hope that I'm more clear now.

Erez
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-01-30 11:18:49 UTC
Permalink
Post by FUJITA Tomonori
On Tue, 29 Jan 2008 13:31:52 -0800
Post by Roland Dreier
Post by James Bottomley
                          STGT read      SCST read      STGT read      SCST read
                          performance    performance    performance    performance
                          (0.5K, MB/s)   (0.5K, MB/s)   (1 MB, MB/s)   (1 MB, MB/s)
 iSER (8 Gb/s network)    250            N/A            360            N/A
 SRP  (8 Gb/s network)    N/A            421            N/A            683
On the comparable figures, which only seem to be IPoIB, they're showing a
13-18% variance, aren't they? Which isn't an incredible difference.
Maybe I'm all wet, but I think iSER vs. SRP should be roughly
comparable. The exact formatting of various messages etc. is
different but the data path using RDMA is pretty much identical. So
the big difference between STGT iSER and SCST SRP hints at some big
difference in the efficiency of the two implementations.
iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?
iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd,
initiator end booted with mem=512M, target with 8G ram
direct i/o dd
write/read 800/751 MB/s
dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct
I think that STGT is pretty fast with fast backing storage.
How fast would SCST be on the same hardware?
Post by FUJITA Tomonori
I don't think that there is a notable performance difference between
kernel-space and user-space SRP (or iSER) implementations when moving
data between hosts. IB is expected to enable user-space applications
to move data between hosts quickly (if not, what can IB provide us?).
I think that the question is how fast user-space applications can do
I/Os compared with I/Os in kernel space. STGT is eager for the advent
of good asynchronous I/O and event notification interfaces.
One more possible optimization for STGT is zero-copy data
transfer. STGT uses pre-registered buffers and moves data between the page
cache and these buffers, and then does the RDMA transfer. If we implement
our own caching mechanism to use pre-registered buffers directly (with AIO
and O_DIRECT), then STGT can move data without data copies.
Great! So, you are going to duplicate the Linux page cache in user
space. You will continue keeping the in-kernel code as small as possible
and its maintainership effort as low as possible, at the cost that the
user space part's code size and complexity (and, hence, its
maintainership effort) will skyrocket. Apparently, this doesn't
look like a good design decision.
Post by FUJITA Tomonori
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-01-30 08:29:27 UTC
Permalink
On Jan 29, 2008 9:42 PM, James Bottomley
Post by James Bottomley
Post by Bart Van Assche
As an SCST user, I would like to see the SCST kernel code integrated
in the mainstream kernel because of its excellent performance on an
InfiniBand network. Since the SCST project comprises about 14 KLOC,
reviewing the SCST code will take considerable time. Who will do this
reviewing work ? And with regard to the comments made by the
reviewers: Vladislav, do you have the time to carry out the
modifications requested by the reviewers ? I expect a.o. that
reviewers will ask to move SCST's configuration pseudofiles from
procfs to sysfs.
The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel. Right at the moment,
it's STGT. Problems in STGT come from the user<->kernel boundary which
can be mitigated in a variety of ways. The fact that the figures are
pretty much comparable on non IB networks shows this.
Are you saying that users who need an efficient iSCSI implementation
should switch to OpenSolaris ? The OpenSolaris COMSTAR project involves
the migration of the existing OpenSolaris iSCSI target daemon from
userspace to their kernel. The OpenSolaris developers are
spending time on this because they expect a significant performance
improvement.
Post by James Bottomley
I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another. Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.
My measurements on a 1 GB/s InfiniBand network have shown that the current
SCST implementation is able to read data via direct I/O at a rate of 811 MB/s
(via SRP) and that the current STGT implementation is able to transfer data at a
rate of 589 MB/s (via iSER). That's a performance difference of 38%.

And even more important, the I/O latency of SCST is significantly
lower than that
of STGT. This is very important for database workloads -- the I/O pattern caused
by database software is close to random I/O, and database software needs low
latency I/O in order to run efficiently.

In the thread with the title "Performance of SCST versus STGT" on the
SCST-devel / STGT-devel mailing lists, not only the raw performance numbers
were discussed but also which further performance improvements are possible.
It became clear that the SCST performance can be improved further by
implementing a well-known optimization (zero-copy I/O). Fujita Tomonori
explained in the same thread that it is possible to improve the performance
of STGT further, but that this would require a lot of effort (implementing
asynchronous I/O in the kernel and also implementing a new caching mechanism
using pre-registered buffers).

See also:
http://sourceforge.net/mailarchive/forum.php?forum_name=scst-devel&viewmonth=200801&viewday=17

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
James Bottomley
2008-01-30 16:22:54 UTC
Permalink
Post by Bart Van Assche
On Jan 29, 2008 9:42 PM, James Bottomley
Post by James Bottomley
Post by Bart Van Assche
As an SCST user, I would like to see the SCST kernel code integrated
in the mainstream kernel because of its excellent performance on an
InfiniBand network. Since the SCST project comprises about 14 KLOC,
reviewing the SCST code will take considerable time. Who will do this
reviewing work ? And with regard to the comments made by the
reviewers: Vladislav, do you have the time to carry out the
modifications requested by the reviewers ? I expect a.o. that
reviewers will ask to move SCST's configuration pseudofiles from
procfs to sysfs.
The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel. Right at the moment,
it's STGT. Problems in STGT come from the user<->kernel boundary which
can be mitigated in a variety of ways. The fact that the figures are
pretty much comparable on non IB networks shows this.
Are you saying that users who need an efficient iSCSI implementation
should switch to OpenSolaris ?
I'd certainly say that's a totally unsupported conclusion.
Post by Bart Van Assche
The OpenSolaris COMSTAR project involves
the migration of the existing OpenSolaris iSCSI target daemon from
userspace to their kernel. The OpenSolaris developers are
spending time on this because they expect a significant performance
improvement.
Just because Solaris takes a particular design decision doesn't
automatically make it the right course of action.

Microsoft once pulled huge gobs of the C library and their windowing
system into the kernel in the name of efficiency. It proved not only to
be less efficient, but also to degrade their security model.

Deciding what lives in userspace and what should be in the kernel lies
at the very heart of architectural decisions. However, the argument
that "it should be in the kernel because that would make it faster" is
pretty much a discredited one. To prevail on that argument, you have to
demonstrate that there's no way to enable user space to do the same
thing at the same speed. Further, it was the same argument used the
last time around when the STGT vs SCST investigation was done. Your own
results on non-IB networks show that both architectures perform at the
same speed. That tends to support the conclusion that there's something
specific about IB that needs to be tweaked or improved for STGT to get
it to perform correctly.

Furthermore, if you have already decided before testing that SCST is
right and that STGT is wrong based on the architectures, it isn't
exactly going to increase my confidence in your measurement methodology
claiming to show this, now is it?
Post by Bart Van Assche
Post by James Bottomley
I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another. Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.
My measurements on a 1 GB/s InfiniBand network have shown that the current
SCST implementation is able to read data via direct I/O at a rate of 811 MB/s
(via SRP) and that the current STGT implementation is able to transfer data at a
rate of 589 MB/s (via iSER). That's a performance difference of 38%.
And even more important, the I/O latency of SCST is significantly
lower than that
of STGT. This is very important for database workloads -- the I/O pattern caused
by database software is close to random I/O, and database software needs low
latency I/O in order to run efficiently.
In the thread with the title "Performance of SCST versus STGT" on the
SCST-devel / STGT-devel mailing lists, not only the raw performance numbers
were discussed but also which further performance improvements are possible.
It became clear that the SCST performance can be improved further by
implementing a well-known optimization (zero-copy I/O). Fujita Tomonori
explained in the same thread that it is possible to improve the performance
of STGT further, but that this would require a lot of effort (implementing
asynchronous I/O in the kernel and also implementing a new caching mechanism
using pre-registered buffers).
These are both features being independently worked on, are they not?
Even if they weren't, the combination of the size of SCST in kernel plus
the problem of having to find a migration path for the current STGT
users still looks to me to involve the greater amount of work.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-01-30 17:03:10 UTC
Permalink
On Jan 30, 2008 5:22 PM, James Bottomley
Post by James Bottomley
...
Deciding what lives in userspace and what should be in the kernel lies
at the very heart of architectural decisions. However, the argument
that "it should be in the kernel because that would make it faster" is
pretty much a discredited one. To prevail on that argument, you have to
demonstrate that there's no way to enable user space to do the same
thing at the same speed. Further, it was the same argument used the
last time around when the STGT vs SCST investigation was done. Your own
results on non-IB networks show that both architectures perform at the
same speed. That tends to support the conclusion that there's something
specific about IB that needs to be tweaked or improved for STGT to get
it to perform correctly.
You should know that, given two different software implementations of the
same communication protocol, differences in latency and throughput become
more visible as the network latency gets lower and the throughput gets higher.
That's why conclusions can only be drawn from the InfiniBand numbers, and
not from the 1 Gbit/s Ethernet numbers. Assuming that there is something
specific in STGT with regard to InfiniBand is speculation.
Post by James Bottomley
Furthermore, if you have already decided before testing that SCST is
right and that STGT is wrong based on the architectures, it isn't
exactly going to increase my confidence in your measurement methodology
claiming to show this, now is it?
I did not draw any conclusions from the architecture -- the only data I based my
conclusions on were my own performance measurements.
Post by James Bottomley
...
These are both features being independently worked on, are they not?
Even if they weren't, the combination of the size of SCST in kernel plus
the problem of having to find a migration path for the current STGT
users still looks to me to involve the greater amount of work.
My proposal was to have both the SCST kernel code and the STGT kernel
code in the mainstream Linux kernel. This would make it easier for current
STGT users to evaluate SCST. It's too early to choose one of the two
projects -- this choice can be made later on.

Bart.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-01-30 11:17:17 UTC
Permalink
Post by James Bottomley
The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel. Right at the moment,
it's STGT. Problems in STGT come from the user<->kernel boundary which
can be mitigated in a variety of ways. The fact that the figures are
pretty much comparable on non IB networks shows this.
I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another. Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.
James,

Although the performance difference between STGT and SCST is apparent,
this isn't the only point why SCST is better. I've already written about
it many times in various mailing lists, but let me summarize it one more
time here.

As you know, almost all kernel parts can be done in user space,
including all the drivers, networking, I/O management with the block/SCSI
initiator subsystem, and the disk cache manager. But does that mean that
the current Linux kernel is bad and all of the above should be (re)done in
user space instead? I believe not. Linux isn't a microkernel for very
pragmatic reasons: simplicity and performance. So, an additional important
point why SCST is better is simplicity.

For a SCSI target, especially with a hardware target card, data come
from the kernel and are eventually served by the kernel, which does the
actual I/O or gets/puts data from/to the cache. Dividing request processing
between user and kernel space creates unnecessary interface layer(s) and
effectively makes the request processing job distributed, with all its
complexity and reliability problems. From my point of view, such a
distribution, where user space is the master side and the kernel is the
slave, is rather wrong, because:

1. It makes the kernel depend on a user program, which services it and
provides routines for it, while the regular paradigm is the
opposite: the kernel services user space applications. A direct
consequence is that there is no real protection for the kernel from
faults in the STGT core code without excessive effort, which, no
surprise, hasn't been made and, it seems, is never going to be made.
So, in practice, debugging and developing under STGT isn't easier than
if the whole code were in kernel space, but actually harder (see
below why).

2. It requires a new, complicated interface between kernel and user space
that creates additional maintenance and debugging headaches, which don't
exist for kernel-only code. Linus Torvalds some time ago described
perfectly why this is bad; see http://lkml.org/lkml/2007/4/24/451,
http://lkml.org/lkml/2006/7/1/41 and http://lkml.org/lkml/2007/4/24/364.

3. It makes it impossible for the SCSI target to use (at least, in a simple
and sane way) many effective optimizations: zero-copy cached I/O, more
control over read-ahead, device queue plugging/unplugging, etc. One
example of such an already implemented feature is zero-copy network data
transmission, done in a simple 260-line put_page_callback patch. This
optimization is especially important for the user space gate (the scst_user
module); see below for details.

The whole notion that development for the kernel is harder than for user
space is total nonsense nowadays. It's different, yes; in some ways
more limited, yes; but not harder. For those who need gdb (I haven't for
many years), the kernel has kgdb, plus it also has many debug facilities
that are not available for user space, or are more limited there, like
lockdep, lockup detection, oprofile, etc. (not to mention a wider choice
of more efficiently implemented synchronization primitives, and not only them).

For people who need complicated target device emulation, as, e.g., in the
case of a VTL (Virtual Tape Library), where there is a need to operate
on large mmap'ed memory areas, SCST provides a gateway to user space
(the scst_user module), but, in contrast with STGT, it's done in the regular
"kernel - master, user application - slave" paradigm, so it's reliable
and no fault in a user space device emulator can break the kernel or other
user space applications. Plus, since the SCSI target state machine and
memory management are in the kernel, it's very efficient and requires only
one kernel-user space switch per SCSI command.

Also, I should note here that in its current state STGT in many respects
doesn't fully conform to the SCSI specifications, especially in the area of
management events, like Unit Attention generation and processing, and
it doesn't look like anybody cares about it. At the same time, SCST
pays close attention to fully conforming to the SCSI specifications, because
the price of non-conformance is possible corruption of the user's data.

Returning to performance: modern SCSI transports, e.g. InfiniBand, have
a link latency as low as 1(!) microsecond. For comparison, the
inter-thread context switch time on a modern system is about the same, and
the syscall time is about 0.1 microsecond. So, only ten empty syscalls or one
context switch add the same latency as the link. Even 1 Gbps Ethernet has
less than 100 microseconds of round-trip latency.

You probably know that the QLogic Fibre Channel target driver for SCST
allows commands to be executed either directly from soft IRQ or from
the corresponding thread. There is a steady 5-7% difference in IOPS
between those modes on 512-byte reads on nullio using a 4 Gbps link. So, a
single additional inter-kernel-thread context switch costs 5-7% of IOPS.

Another source of latency, unavoidable with the user space approach,
is the data copy to/from the cache. With the fully kernel space
approach, the cache can be used directly, so no extra copy is needed.
We can estimate how much latency the data copying adds. On modern
systems memory copy throughput is less than 2 GB/s, so on a 20 Gbps
InfiniBand link it almost doubles the data transfer latency.
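
A quick back-of-the-envelope check, using only the rough figures cited
above (20 Gbps on the wire, less than 2 GB/s memory copy; these are
illustrative numbers, not measurements):

#include <stdio.h>

int main(void)
{
        double wire_mb_s = 20000.0 / 8.0; /* 20 Gbps link, ~2500 MB/s */
        double copy_mb_s = 2000.0;        /* memory copy throughput   */

        printf("wire time : %.2f ms per MB\n", 1000.0 / wire_mb_s);
        printf("copy time : %.2f ms per MB\n", 1000.0 / copy_mb_s);
        printf("wire+copy : %.2f ms per MB\n",
               1000.0 / wire_mb_s + 1000.0 / copy_mb_s);
        return 0;
}

That is roughly 0.4 ms per MB on the wire plus 0.5 ms per MB for the extra
copy, which is where the "almost doubles" comes from.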

So, by putting code in user space you have to accept the extra latency
it adds. Many, if not most, real-life workloads are more or less latency,
not throughput, bound, so there shouldn't be any surprise that a single-stream
"dd if=/dev/sdX of=/dev/null" on the initiator gives such low values. Such a
"benchmark" isn't less important or less practical than all the
multithreaded, latency-insensitive benchmarks which people like running,
because it does essentially the same thing as most Linux processes do when
they read data from files.

You may object that the latency of the target's backstorage device(s) is a
lot more than 1 microsecond, but that is relevant only if data are
read/written from/to the actual backstorage media, not from the cache, or
even from the backstorage device's cache. Nothing prevents the target from
having 8 or even 64 GB of cache, so most accesses, even random ones, could be
served by it. This is especially important for sync writes.

Thus, why SCST is better:

1. It is simpler, because it's monolithic, so all its components are
in one place and communicate using direct function calls. Hence, it is
smaller, faster, more reliable and more maintainable. Currently it's bigger
than STGT only because it supports more features; see (2).

2. It supports more features: 1-to-many pass-through support with all the
functionality necessary for it, including support for non-disk SCSI
devices like tapes; the SGV cache; BLOCKIO, where requests are converted to
bio's and sent directly to the block level (this mode is effective for
mostly random workloads with data set size >> memory size on the
target); etc.

3. It has better performance and is going to get even better. SCST is only
now entering the phase where it starts exploiting all the advantages of
being in the kernel. In particular, zero-copy cached I/O is currently
being implemented.

4. It provides a safer and more effective interface for emulating target
devices in user space, via the scst_user module.

5. It conforms much more closely to the SCSI specifications (see above).

Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-02-04 12:27:06 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by James Bottomley
The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel. Right at the moment,
it's STGT. Problems in STGT come from the user<->kernel boundary which
can be mitigated in a variety of ways. The fact that the figures are
pretty much comparable on non IB networks shows this.
I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another. Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.
James,
Although the performance difference between STGT and SCST is apparent,
this isn't the only point why SCST is better. ...
So, James, what is your opinion on the above? Or does the overall simplicity
of the SCSI target project not matter much to you, and do you think it's fine
to duplicate the Linux page cache in user space to keep the in-kernel
part of the project as small as possible?

Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-04 13:53:46 UTC
Permalink
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or does the overall simplicity
of the SCSI target project not matter much to you, and do you think it's fine
to duplicate the Linux page cache in user space to keep the in-kernel
part of the project as small as possible?
It's too early to draw conclusions about performance. I'm currently
performing more measurements, and the results are not easy to
interpret. My plan is to measure the following:
* Setup: target with RAM disk of 2 GB as backing storage.
* Throughput reported by dd and xdd (direct I/O).
* Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest
transfer size that can be specified to xdd is 1 KB).
* Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI
via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via
IPoIB.

The reason I chose dd/xdd for these tests is that I want to measure
the performance of the communication protocols, and that I am assuming
that this performance can be modeled by the following formula:
(transfer time in s) = (transfer setup latency in s) + (transfer size
in MB) / (bandwidth in MB/s). Measuring the time needed for transfers
with varying block size makes it possible to compute the constants in the above
formula via linear regression.
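
For instance, with a handful of (size, time) measurements the fit could be
done as sketched below; the sample points are made up purely to show the
fitting step I have in mind, they are not measured values:

#include <stdio.h>

int main(void)
{
        /* hypothetical (transfer size in MB, transfer time in s) samples */
        double size[] = { 1, 4, 16, 64, 256 };
        double secs[] = { 0.0030, 0.0082, 0.029, 0.111, 0.440 };
        double sx = 0, sy = 0, sxx = 0, sxy = 0, slope, intercept;
        int i, n = 5;

        for (i = 0; i < n; i++) {
                sx  += size[i];
                sy  += secs[i];
                sxx += size[i] * size[i];
                sxy += size[i] * secs[i];
        }

        /* least-squares fit of: time = intercept + slope * size */
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        intercept = (sy - slope * sx) / n;

        printf("setup latency ~ %.4f s, bandwidth ~ %.0f MB/s\n",
               intercept, 1.0 / slope);
        return 0;
}

The intercept is then the transfer setup latency, and the inverse of the
slope is the bandwidth.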

One difficulty I already encountered is that the performance of the
Linux IPoIB implementation varies a lot under high load
(http://bugzilla.kernel.org/show_bug.cgi?id=9883).

Another issue I have to look further into is that dd and xdd report
different results for very large block sizes (> 1 MB).

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Dillow
2008-02-04 17:00:01 UTC
Permalink
Post by Bart Van Assche
Another issue I have to look further into is that dd and xdd report
different results for very large block sizes (> 1 MB).
Be aware that xdd reports 1 MB as 1000000, not 1048576. Though, it looks
like dd is the same, so that's probably not helpful. Also, make sure
you're passing {i,o}flag=direct to dd if you're using -dio in xdd to be
sure you are comparing apples to apples.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-02-04 17:08:55 UTC
Permalink
Post by Bart Van Assche
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or does the overall simplicity
of the SCSI target project not matter much to you, and do you think it's fine
to duplicate the Linux page cache in user space to keep the in-kernel
part of the project as small as possible?
It's too early to draw conclusions about performance. I'm currently
performing more measurements, and the results are not easy to
* Setup: target with RAM disk of 2 GB as backing storage.
* Throughput reported by dd and xdd (direct I/O).
* Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest
transfer size that can be specified to xdd is 1 KB).
* Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI
via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via
IPoIB.
The reason I chose dd/xdd for these tests is that I want to measure
the performance of the communication protocols, and that I am assuming
(transfer time in s) = (transfer setup latency in s) + (transfer size
in MB) / (bandwidth in MB/s).
That isn't fully correct; you forgot about link latency. A more correct model is:

(transfer time) = (transfer setup latency on both initiator and target,
consisting of software processing time, including a memory copy if
necessary, and PCI setup/transfer time) + (transfer size)/(bandwidth) +
(link latency to deliver the request for READs or the status for WRITEs) +
(2*(link latency) to deliver the R2T/XFER_READY request in case of WRITEs,
if necessary (e.g. iSER might not need it for small transfers, but SRP
most likely always needs it)). Also note that this is correct
only for single-threaded workloads with one outstanding command at a
time. For other workloads it depends on how well they manage to keep
the "link" full in the interval from (transfer size)/(transfer time) to
the bandwidth.
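
To make the effect of the extra terms concrete, here is a small worked
calculation in C; the latency, bandwidth and R2T numbers are invented purely
for illustration and are not measurements of any of the targets discussed here.

/* Compare the simple model (setup latency + size/bandwidth) with the
 * fuller one that adds link latency and an R2T round trip for writes.
 * All numbers below are invented for illustration. */
#include <stdio.h>

int main(void)
{
        double setup_lat = 50e-6;      /* software + PCI setup, seconds   */
        double link_lat  = 5e-6;       /* one-way link latency, seconds   */
        double bandwidth = 900e6;      /* bytes per second                */
        double size      = 4096.0;     /* bytes in a single WRITE         */

        double simple = setup_lat + size / bandwidth;
        /* WRITE with an R2T: status delivery + two extra link crossings */
        double full = setup_lat + size / bandwidth
                      + link_lat + 2.0 * link_lat;

        printf("simple model: %.1f us, with link latency and R2T: %.1f us\n",
               simple * 1e6, full * 1e6);
        return 0;
}
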
Post by Bart Van Assche
Measuring the time needed for transfers
with varying block size allows to compute the constants in the above
formula via linear regression.
Unfortunately, it isn't that easy; see above.
Post by Bart Van Assche
One difficulty I already encountered is that the performance of the
Linux IPoIB implementation varies a lot under high load
(http://bugzilla.kernel.org/show_bug.cgi?id=9883).
Another issue I have to look further into is that dd and xdd report
different results for very large block sizes (> 1 MB).
Look at /proc/scsi_tgt/sgv (for SCST) and you will see which transfer
sizes are actually used. Initiators don't like sending big requests and
often split them into smaller ones.

Look at this message as well, it might be helpful:
http://lkml.org/lkml/2007/5/16/223
Post by Bart Van Assche
Bart Van Assche.
James Bottomley
2008-02-04 15:30:15 UTC
Permalink
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?

James


Vladislav Bolkhovitin
2008-02-04 16:25:42 UTC
Permalink
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look that way to me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. In particular, what do you think about
duplicating the Linux page cache in user space to get zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
within STGT's approach?

Vlad
James Bottomley
2008-02-04 17:06:06 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.

James


Vladislav Bolkhovitin
2008-02-04 17:16:59 UTC
Permalink
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.
No, please reread this thread, especially this message:
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of
the advantages of the kernel space implementation. The user space
implementation has to copy data between the cache and the user space
buffer, but the kernel space one can use pages in the cache directly,
without an extra copy.

Vlad
James Bottomley
2008-02-04 17:25:01 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of
the advantages of the kernel space implementation. The user space
implementation has to have data copied between the cache and user space
buffer, but the kernel space one can use pages in the cache directly,
without extra copy.
Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O. For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, you actually have to do an msync on the region to get the
data secured to the backing store. You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store. However, none of this involves data copies.
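
For readers following along, a user-space sketch of the read/write path
described here might look like the following; the file name, sizes and error
handling are placeholders, and this only illustrates mmap plus msync, it is
not code from STGT or any of the targets discussed.

/* Illustrative only: map a region of the backing store, let the
 * transport write into it, then msync it back.  Names and sizes are
 * made up; real target code would manage regions per command. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/tmp/backing_store.img";   /* placeholder */
        size_t len = 1 << 20;                           /* 1 MiB region */
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* READ path: touching the pages faults them in from the file,
         * so the transport can send them out without an extra copy.    */
        /* WRITE path: the transport deposits incoming data here ...    */
        memset(buf, 0xab, len);           /* stand-in for received data */
        /* ... and msync() pushes the dirty pages to the backing store. */
        if (msync(buf, len, MS_SYNC) < 0)
                perror("msync");

        munmap(buf, len);
        close(fd);
        return 0;
}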

James


Vladislav Bolkhovitin
2008-02-04 17:56:21 UTC
Permalink
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of
the advantages of the kernel space implementation. The user space
implementation has to have data copied between the cache and user space
buffer, but the kernel space one can use pages in the cache directly,
without extra copy.
Well, you've said it thrice (the bellman cried) but that doesn't make it
true.
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O. For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store.
James, have you checked how fast mmapped I/O is when the working set is
bigger than RAM? It's several times slower compared to buffered I/O.
This has been discussed many times on LKML and, it seems, the VM people
consider it unavoidable. So, using mmapped I/O isn't an option for high
performance. Plus, mmapped I/O isn't an option where there are high
reliability requirements, since it doesn't provide a practical way to
handle I/O errors.
Post by James Bottomley
You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.
Can you be more exact and specify what kind of tricks should be done for
that?
Post by James Bottomley
However, none of this involves data copies.
James
James Bottomley
2008-02-04 18:22:02 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of
the advantages of the kernel space implementation. The user space
implementation has to have data copied between the cache and user space
buffer, but the kernel space one can use pages in the cache directly,
without extra copy.
Well, you've said it thrice (the bellman cried) but that doesn't make it
true.
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O. For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store.
James, have you checked how fast is mmaped I/O if work size > size of
RAM? It's several times slower comparing to buffered I/O. It was many
times discussed in LKML and, seems, VM people consider it unavoidable.
Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.
Post by Vladislav Bolkhovitin
So, using mmaped IO isn't an option for high performance. Plus, mmaped
IO isn't an option for high reliability requirements, since it doesn't
provide a practical way to handle I/O errors.
I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. msync does something
similar if there's a write failure.
Post by Vladislav Bolkhovitin
Post by James Bottomley
You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.
Can you be more exact and specify what kind of tricks should be done for
that?
Actually, just avoid touching it seems to do the trick with a recent
kernel.

James


Vladislav Bolkhovitin
2008-02-04 18:38:02 UTC
Permalink
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of
the advantages of the kernel space implementation. The user space
implementation has to have data copied between the cache and user space
buffer, but the kernel space one can use pages in the cache directly,
without extra copy.
Well, you've said it thrice (the bellman cried) but that doesn't make it
true.
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O. For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store.
James, have you checked how fast is mmaped I/O if work size > size of
RAM? It's several times slower comparing to buffered I/O. It was many
times discussed in LKML and, seems, VM people consider it unavoidable.
Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.
James, just check and you will see, buffered I/O is a lot faster.
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, using mmaped IO isn't an option for high performance. Plus, mmaped
IO isn't an option for high reliability requirements, since it doesn't
provide a practical way to handle I/O errors.
I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region.
Err, returned to whom? If you try to read from an mmapped page which can't
be populated due to an I/O error, you will get SIGBUS or SIGSEGV, I don't
remember exactly which. It's quite tricky to get back to the faulted
command from the signal handler.

Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you
think that such mapping/unmapping is good for performance?
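
To illustrate why recovering from the fault is awkward, here is one common
user-space pattern: sigsetjmp plus a SIGBUS handler wrapped around the
access. This is a generic sketch of the technique with invented names, not
something taken from STGT, SCST or LIO.

/* Generic sketch: trap SIGBUS raised while reading an mmapped page and
 * turn it into an error return instead of killing the process. */
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf fault_env;

static void bus_handler(int sig)
{
        (void)sig;
        siglongjmp(fault_env, 1);       /* jump back to the access site */
}

/* Returns 0 on success, -1 if touching the buffer faulted. */
int read_first_byte(const volatile char *buf, char *out)
{
        struct sigaction sa = { .sa_handler = bus_handler };
        struct sigaction old;
        int ret;

        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, &old);

        if (sigsetjmp(fault_env, 1) == 0) {
                *out = buf[0];          /* may fault on an I/O error */
                ret = 0;
        } else {
                ret = -1;               /* fault: report an I/O error */
        }
        sigaction(SIGBUS, &old, NULL);  /* restore the previous handler */
        return ret;
}
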
Post by James Bottomley
msync does something
similar if there's a write failure.
Post by Vladislav Bolkhovitin
Post by James Bottomley
You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.
Can you be more exact and specify what kind of tricks should be done for
that?
Actually, just avoid touching it seems to do the trick with a recent
kernel.
Hmm, how can one write to an mmaped page and don't touch it?
Post by James Bottomley
James
James Bottomley
2008-02-04 18:54:53 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?
The answers were pretty much contained here
http://marc.info/?l=linux-scsi&m=120164008302435
http://marc.info/?l=linux-scsi&m=120171067107293
Weren't they?
No, sorry, it doesn't look so for me. They are about performance, but
I'm asking about the overall project's architecture, namely about one
part of it: simplicity. Particularly, what do you think about
duplicating Linux page cache in the user space to have zero-copy cached
I/O? Or can you suggest another architectural solution for that problem
in the STGT's approach?
Isn't that an advantage of a user space solution? It simply uses the
backing store of whatever device supplies the data. That means it takes
advantage of the existing mechanisms for caching.
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of
the advantages of the kernel space implementation. The user space
implementation has to have data copied between the cache and user space
buffer, but the kernel space one can use pages in the cache directly,
without extra copy.
Well, you've said it thrice (the bellman cried) but that doesn't make it
true.
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O. For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store.
James, have you checked how fast is mmaped I/O if work size > size of
RAM? It's several times slower comparing to buffered I/O. It was many
times discussed in LKML and, seems, VM people consider it unavoidable.
Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.
James, just check and you will see, buffered I/O is a lot faster.
So in an out of memory situation the buffers you don't have are a lot
faster than the pages I don't have?
Post by Vladislav Bolkhovitin
Post by James Bottomley
Post by Vladislav Bolkhovitin
So, using mmaped IO isn't an option for high performance. Plus, mmaped
IO isn't an option for high reliability requirements, since it doesn't
provide a practical way to handle I/O errors.
I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region.
Err, to whom return? If you try to read from a mmaped page, which can't
be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't
remember exactly. It's quite tricky to get back to the faulted command
from the signal handler.
Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you
think that such mapping/unmapping is good for performance?
Post by James Bottomley
msync does something
similar if there's a write failure.
Post by Vladislav Bolkhovitin
Post by James Bottomley
You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.
Can you be more exact and specify what kind of tricks should be done for
that?
Actually, just avoid touching it seems to do the trick with a recent
kernel.
Hmm, how can one write to an mmapped page and not touch it?
I meant from user space ... the writes are done inside the kernel.

However, as Linus has pointed out, this discussion is getting a bit off
topic. There's no actual evidence that copy problems are causing any
performance issues for STGT. In fact, there's evidence that
they're not for everything except IB networks.

James


Vladislav Bolkhovitin
2008-02-06 18:07:35 UTC
Permalink
Post by James Bottomley
Post by Vladislav Bolkhovitin
Hmm, how can one write to an mmaped page and don't touch it?
I meant from user space ... the writes are done inside the kernel.
Sure, we agreed that the mmap() approach is impractical, but could you
elaborate on this anyway, please? I'm just curious. Are you thinking
about implementing a new syscall that would put pages with data into the
mmap'ed area?
No, it has to do with the way invalidation occurs. When you mmap a
region from a device or file, the kernel places page translations for
that region into your vm_area. The regions themselves aren't backed
until faulted. For write (i.e. incoming command to target) you specify
the write flag and send the area off to receive the data. The gather,
expecting the pages to be overwritten, backs them with pages marked
dirty but doesn't fault in the contents (unless it already exists in the
page cache). The kernel writes the data to the pages and the dirty
pages go back to the user. msync() flushes them to the device.
The disadvantage of all this is that the handle for the I/O if you will
is a virtual address in a user process that doesn't actually care to see
the data. non-x86 architectures will do flushes/invalidates on this
address space as the I/O occurs.
I more or less see, thanks. But (1) pages still need to be mmapped into
the user space process before the data transmission, i.e. they must be
zeroed before being mmapped, which isn't much faster than a data copy, and
(2) I suspect it would be hard to make it race free, e.g. if another
process wanted to write to the same area simultaneously.
Post by James Bottomley
However, as Linus has pointed out, this discussion is getting a bit off
topic.
No, that isn't off topic. We've just shown that there is no good way to
implement zero-copy cached I/O for STGT. I see only one practical way to
do that, proposed by FUJITA Tomonori some time ago: duplicating the Linux
page cache in user space. But would you like that?
Well, there's no real evidence that zero copy or lack of it is a problem
yet.
The performance improvement from zero copy can be easily estimated,
knowing the link throughput and the data copy throughput, which are about
the same for 20 Gbps links (I did that a few e-mails ago).
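
A back-of-the-envelope version of that estimate, with invented throughput
numbers, is sketched below; when a store-and-forward copy runs at about the
same speed as the link, the effective throughput of the copying path is
roughly halved.

/* Rough estimate of the cost of one extra data copy in the I/O path.
 * Throughput numbers are invented for illustration only. */
#include <stdio.h>

int main(void)
{
        double link_gbps = 20.0;   /* wire throughput                      */
        double copy_gbps = 20.0;   /* memcpy throughput of the target CPU  */

        /* If the copy is serialized with the transfer, the times add up:
         * 1/effective = 1/link + 1/copy. */
        double with_copy = 1.0 / (1.0 / link_gbps + 1.0 / copy_gbps);

        printf("zero-copy: %.1f Gbps, with one copy: %.1f Gbps (%.0f%% loss)\n",
               link_gbps, with_copy,
               100.0 * (1.0 - with_copy / link_gbps));
        return 0;
}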

Vlad
Bart Van Assche
2008-02-07 13:13:20 UTC
Permalink
Since the focus of this thread shifted somewhat in the last few
messages, I'll try to summarize what has been discussed so far:
- A number of participants joined this discussion spontaneously. This
suggests that there is considerable interest in networked storage and
iSCSI.
- Reasons have been given why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have, among other things,
shown that SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this?

Bart Van Assche.
Vladislav Bolkhovitin
2008-02-07 13:45:07 UTC
Permalink
Post by Bart Van Assche
Since the focus of this thread shifted somewhat in the last few
- There was a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been motivated why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have a.o. shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
I tend to agree, with some important notes:

1. IET should be excluded from this list; iSCSI-SCST is IET updated for
the SCST framework, with a lot of bugfixes and improvements.

2. I think everybody will agree that a Linux iSCSI target should work
over some standard SCSI target framework. Hence the choice gets
narrower: SCST vs STGT. I don't think there's a place for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code
duplication. Nicholas could decide to move to either existing framework
(although, frankly, I don't think combining an in-kernel iSCSI target
with a user space SCSI target framework is possible), and if he decides to
go with SCST, I'll be glad to offer my help and support, and I wouldn't
care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win.

Vlad
d***@lang.hm
2008-02-07 22:51:48 UTC
Permalink
Post by Bart Van Assche
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST
framework with a lot of bugfixes and improvements.
2. I think, everybody will agree that Linux iSCSI target should work over
some standard SCSI target framework. Hence the choice gets narrower: SCST vs
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO)
in the mainline, because of a lot of code duplication. Nicholas could decide
to move to either existing framework (although, frankly, I don't think
there's a possibility for in-kernel iSCSI target and user space SCSI target
framework) and if he decide to go with SCST, I'll be glad to offer my help
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The
better one should win.
Why should Linux as an iSCSI target be limited to passthrough to a SCSI
device?

the most common use of this sort of thing that I would see is to load up a
bunch of 1TB SATA drives in a commodity PC, run software RAID, and then
export the resulting volume to other servers via iSCSI. not a 'real' SCSI
device in sight.

As for how good a standard iSCSI is, at this point I don't think it
really matters. There are too many devices and manufacturers out there
that implement iSCSI as their storage protocol (from both sides, offering
storage to other systems, and using external storage). Sometimes the best
technology doesn't win, but Linux should be interoperable with as much as
possible and be ready to support the winners and the losers in technology
options, for as long as anyone chooses to use the old equipment (after
all, we support things like Arcnet networking, which lost to Ethernet many
years ago)

David Lang
Vladislav Bolkhovitin
2008-02-08 10:37:28 UTC
Permalink
Post by d***@lang.hm
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
1. IET should be excluded from this list, iSCSI-SCST is IET updated
for SCST framework with a lot of bugfixes and improvements.
2. I think, everybody will agree that Linux iSCSI target should work
over some standard SCSI target framework. Hence the choice gets
narrower: SCST vs STGT. I don't think there's a way for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code
duplication. Nicholas could decide to move to either existing
framework (although, frankly, I don't think there's a possibility for
in-kernel iSCSI target and user space SCSI target framework) and if he
decide to go with SCST, I'll be glad to offer my help and support and
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better
one should win.
why should linux as an iSCSI target be limited to passthrough to a SCSI
device.
the most common use of this sort of thing that I would see is to load up
a bunch of 1TB SATA drives in a commodity PC, run software RAID, and
then export the resulting volume to other servers via iSCSI. not a
'real' SCSI device in sight.
As far as how good a standard iSCSI is, at this point I don't think it
really matters. There are too many devices and manufacturers out there
that implement iSCSI as their storage protocol (from both sides,
offering storage to other systems, and using external storage).
Sometimes the best technology doesn't win, but Linux should be
interoperable with as much as possible and be ready to support the
winners and the loosers in technology options, for as long as anyone
chooses to use the old equipment (after all, we support things like
Arcnet networking, which lost to Ethernet many years ago)
David, your question surprises me a lot. Where did you get the idea
that SCST supports only pass-through backstorage? Does the RAM disk
which Bart has been using for performance tests look like a SCSI device?

SCST supports all backstorage types you can imagine and that the Linux
kernel supports.
Post by d***@lang.hm
David Lang
d***@lang.hm
2008-02-09 07:40:45 UTC
Permalink
Post by d***@lang.hm
2. I think, everybody will agree that Linux iSCSI target should work over
some standard SCSI target framework. Hence the choice gets narrower: SCST
vs STGT. I don't think there's a way for a dedicated iSCSI target (i.e.
PyX/LIO) in the mainline, because of a lot of code duplication. Nicholas
could decide to move to either existing framework (although, frankly, I
don't think there's a possibility for in-kernel iSCSI target and user
space SCSI target framework) and if he decide to go with SCST, I'll be
glad to offer my help and support and wouldn't care if LIO-SCST eventually
replaced iSCSI-SCST. The better one should win.
why should linux as an iSCSI target be limited to passthrough to a SCSI
device.
the most common use of this sort of thing that I would see is to load up a
bunch of 1TB SATA drives in a commodity PC, run software RAID, and then
export the resulting volume to other servers via iSCSI. not a 'real' SCSI
device in sight.
David, your question surprises me a lot. From where have you decided that
SCST supports only pass-through backstorage? Does the RAM disk, which Bart
has been using for performance tests, look like a SCSI device?
I was responding to the start of item #2 that I left in the quote above.
It wasn't saying that SCST didn't support that, but was stating that any
implementation of an iSCSI target should use the SCSI framework. I read
this to mean that it would only be able to access things that the SCSI
framework can access, and that would not include things like ramdisks, raid
arrays, etc.

David Lang

Nicholas A. Bellinger
2008-02-08 11:33:15 UTC
Permalink
Post by d***@lang.hm
Post by Bart Van Assche
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST
framework with a lot of bugfixes and improvements.
2. I think, everybody will agree that Linux iSCSI target should work over
some standard SCSI target framework. Hence the choice gets narrower: SCST vs
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO)
in the mainline, because of a lot of code duplication. Nicholas could decide
to move to either existing framework (although, frankly, I don't think
there's a possibility for in-kernel iSCSI target and user space SCSI target
framework) and if he decide to go with SCST, I'll be glad to offer my help
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The
better one should win.
why should linux as an iSCSI target be limited to passthrough to a SCSI
device.
<nod>

I don't think anyone is saying it should be. It makes sense that the
more mature SCSI engines that have working code will be providing a lot
of the foundation as we talk about options.
Post by d***@lang.hm
From comparing the designs of SCST and LIO-SE, we know that SCST
supports very SCSI-specific target mode hardware, including software
target mode forks of other kernel code. This is code for the target mode
pSCSI, FC and SAS control paths (more for the state machines than CDB
emulation) that will most likely never need to be emulated on a non-SCSI
target engine. SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER), while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6). The
design goal of LIO-SE was to let every iSCSI initiator that sends SCSI CDBs
and data talk to every potential device in the Linux storage stack on
the largest number of hardware architectures possible.

Most of the iSCSI Initiators I know (including non-Linux ones) do not rely on
heavy SCSI task management, and I think getting real SCSI-specific recovery
into the traditional iSCSI target would be a lower priority item
for users. Especially things like SCSI target mode queue locking
(affectionately called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because the CmdSN rules are doing this for us.
Post by d***@lang.hm
the most common use of this sort of thing that I would see is to load up a
bunch of 1TB SATA drives in a commodity PC, run software RAID, and then
export the resulting volume to other servers via iSCSI. not a 'real' SCSI
device in sight.
I recently moved the last core LIO target machine from a hardware RAID5
to MD RAID6 with struct block_device exported LVM objects via
Linux/iSCSI to PVM and HVM domains, and I have been very happy with the
results. Being able to export any physical or virtual storage object
from whatever layer makes sense for your particular case. This applies
to both block and file level access. For example, making an iSCSI
Initiator and Target run in the most limited environments, places
where NAS (especially userspace server side) would have a really hard
time fitting, has always been a requirement. You can imagine a system
with a small amount of memory (say 32MB) having a difficult time doing
I/O to any number of NAS clients.

If we are talking about the memory required to get the best performance,
using kernel level DMA ring allocation and submission to a generic target
engine uses a significantly smaller amount of memory than, say,
traditional buffered FILEIO. Going further up the storage stack with
buffered file IO, regardless of whether it's block or file level, will always
start to add overhead. I think that kernel level FILEIO with O_DIRECT
and asyncio would probably help a lot in this case for general target
mode usage of MD and LVM block devices.
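
As a rough user-space illustration of the O_DIRECT point (not LIO or SCST
code): O_DIRECT requires the buffer, offset and length to be suitably
aligned, which is why an aligned buffer is allocated below; the path and
sizes are placeholders.

/* Minimal O_DIRECT read sketch: the buffer, file offset and length
 * must all be aligned (here to 4096 bytes).  Path is a placeholder. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/tmp/backing_store.img";   /* placeholder */
        size_t len = 4096;
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, len)) {
                perror("posix_memalign");
                return 1;
        }
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* The read bypasses the page cache entirely, so completion means
         * the data really came from (or, for writes, went to) the device. */
        ssize_t n = read(fd, buf, len);
        if (n < 0)
                perror("read");
        else
                printf("read %zd bytes with O_DIRECT\n", n);

        close(fd);
        free(buf);
        return 0;
}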

This is because we are using PSCSI or IBLOCK to queue I/Os which
may need to be different from the original I/O from the initiator/client due
to OS storage subsystem differences and/or physical HBA limitations for
the layers below block. The current LIO-SE API expects the storage
object to present these physical limitations to the engine if they exist.
This is called iscsi_transport_t in iscsi_target_transport.h currently,
but really should be called something like target_subsystem_api_t, with
plugins called target_pscsi_t, target_bio_t, target_file_t, etc.
Post by d***@lang.hm
As far as how good a standard iSCSI is, at this point I don't think it
really matters. There are too many devices and manufacturers out there
that implement iSCSI as their storage protocol (from both sides, offering
storage to other systems, and using external storage). Sometimes the best
technology doesn't win, but Linux should be interoperable with as much as
possible and be ready to support the winners and the loosers in technology
options, for as long as anyone chooses to use the old equipment (after
all, we support things like Arcnet networking, which lost to Ethernet many
years ago)
The RFC-3720 standard has been stable for going on four years in 2008,
and as the implementations continue to mature, having Linux lead the way
with iSCSI Target, Initiator and Target/Initiator stacks that can potentially
run on anything that can boot Linux, on the many, many types of system and
storage around these days, is the goal. I can't personally comment on
how many of these types of systems target mode or iSCSI stacks have
run on in other people's environments, but I have personally been involved
in getting LIO/SE and Core/iSCSI running on i386 and x86_64, along with
Alpha, ia64, MIPS, PPC and POWER, and lots of ARM. I believe the LIO
Target and Initiator stacks have been able to run on the smallest
systems so far, including a uclinux 2.6 sub-100 MHz board with ~4 MB of
usable system memory. This is still true today of the LIO target stack,
which has been successfully run on the OpenMoko device with memory and
FILEIO. :-)

--nab

Btw, I definitely agree that being able to export the large number of
legacy drivers will continue to be an important part.

Vladislav Bolkhovitin
2008-02-08 14:36:56 UTC
Permalink
Post by Nicholas A. Bellinger
Post by d***@lang.hm
Post by Bart Van Assche
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST
framework with a lot of bugfixes and improvements.
2. I think, everybody will agree that Linux iSCSI target should work over
some standard SCSI target framework. Hence the choice gets narrower: SCST vs
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO)
in the mainline, because of a lot of code duplication. Nicholas could decide
to move to either existing framework (although, frankly, I don't think
there's a possibility for in-kernel iSCSI target and user space SCSI target
framework) and if he decide to go with SCST, I'll be glad to offer my help
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The
better one should win.
why should linux as an iSCSI target be limited to passthrough to a SCSI
device.
<nod>
I don't think anyone is saying it should be. It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..
Post by d***@lang.hm
From comparing the designs of SCST and LIO-SE, we know that SCST has
supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code. This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.
...but it is required for SCSI, so it must be there anyway.
Post by Nicholas A. Bellinger
SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6). The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.
Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users. Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.
Sorry, that isn't correct. ACA provides the possibility to lock the command
queue in case of a CHECK CONDITION, so it allows command execution order to
be kept in case of errors. CmdSN keeps command execution order only in the
case of success; in case of an error the next queued command will be executed
immediately after the failed one, although the application might require that
all commands subsequent to the failed one be aborted. Think about
journaled file systems, for instance. ACA also allows the failed command
to be retried and the queue to then be resumed.

Vlad
Nicholas A. Bellinger
2008-02-08 23:53:03 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by Nicholas A. Bellinger
Post by d***@lang.hm
Post by Bart Van Assche
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST
framework with a lot of bugfixes and improvements.
2. I think, everybody will agree that Linux iSCSI target should work over
some standard SCSI target framework. Hence the choice gets narrower: SCST vs
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO)
in the mainline, because of a lot of code duplication. Nicholas could decide
to move to either existing framework (although, frankly, I don't think
there's a possibility for in-kernel iSCSI target and user space SCSI target
framework) and if he decide to go with SCST, I'll be glad to offer my help
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The
better one should win.
why should linux as an iSCSI target be limited to passthrough to a SCSI
device.
<nod>
I don't think anyone is saying it should be. It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..
Post by d***@lang.hm
From comparing the designs of SCST and LIO-SE, we know that SCST has
supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code. This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.
...but required for SCSI. So, it must be, anyway.
Post by Nicholas A. Bellinger
SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6). The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.
Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users. Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.
Sorry, it isn't correct. ACA provides possibility to lock commands queue
in case of CHECK CONDITION, so allows to keep commands execution order
in case of errors. CmdSN keeps commands execution order only in case of
success, in case of error the next queued command will be executed
immediately after the failed one, although application might require to
have all subsequent after the failed one commands aborted. Think about
journaled file systems, for instance. Also ACA allows to retry the
failed command and then resume the queue.
Fair enough. The point I was making is that I have never actually seen
an iSCSI Initiator use ACA functionality (I don't believe that the Linux
SCSI ML implements this), or actually generate a CLEAR_ACA task
management request.

--nab
Post by Vladislav Bolkhovitin
Vlad
Bart Van Assche
2008-02-15 15:02:47 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by Bart Van Assche
Since the focus of this thread shifted somewhat in the last few
- There was a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been motivated why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have a.o. shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
1. IET should be excluded from this list, iSCSI-SCST is IET updated for
SCST framework with a lot of bugfixes and improvements.
2. I think, everybody will agree that Linux iSCSI target should work
over some standard SCSI target framework. Hence the choice gets
narrower: SCST vs STGT. I don't think there's a way for a dedicated
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code
duplication. Nicholas could decide to move to either existing framework
(although, frankly, I don't think there's a possibility for in-kernel
iSCSI target and user space SCSI target framework) and if he decide to
go with SCST, I'll be glad to offer my help and support and wouldn't
care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win.
If I understood the above correctly, regarding a kernel space iSCSI
target implementation, only LIO-SE and SCST should be considered. What
I know today about these Linux iSCSI target implementations is as
follows:
* SCST performs slightly better than LIO-SE, and LIO-SE performs
slightly better than STGT (both with regard to latency and with regard
to bandwidth).
* The coding style of SCST is closer to the Linux kernel coding style
than the coding style of the LIO-SE project.
* The structure of SCST is closer to what Linus expects than the
structure of LIO-SE (i.e., authentication handled in userspace, data
transfer handled by the kernel -- LIO-SE handles both in kernel
space).
* Until now I did not encounter any strange behavior in SCST. The
issues I encountered with LIO-SE are being resolved via the LIO-SE
mailing list (http://groups.google.com/group/linux-iscsi-target-dev).

It would take too much effort to develop a new kernel space iSCSI
target from scratch -- we should start from either LIO-SE or SCST. My
opinion is that the best approach is to start with integrating SCST in
the mainstream kernel, and that the more advanced features from LIO-SE
that are not yet in SCST can be ported from LIO-SE to the SCST
framework.

Nicholas, do you think the structure of SCST is flexible enough to be
extended with LIO-SE's more advanced features like ERL=2 ?

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-07 15:38:15 UTC
Permalink
Post by Bart Van Assche
Since the focus of this thread shifted somewhat in the last few
- There was a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been motivated why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have a.o. shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).
As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?
I think the other data point here would be that the final target design
needs to be as generic as possible. Generic in the sense that the
engine eventually needs to be able to accept NBD and other Ethernet
based target mode storage configurations to an abstracted device object
(struct scsi_device, struct block_device, or struct file) just as it
would for an IP Storage based request.

We know that NBD and *oE will have their own naming and discovery, and
the first set of IO tasks to be completed would be those using
(iscsi_cmd_t->cmd_flags & ICF_SCSI_DATA_SG_IO_CDB) in
iscsi_target_transport.c in the current code. These are single READ_*
and WRITE_* codepaths that perform DMA memory pre-processing in v2.9
LIO-SE.

Also, by telling the engine to accelerate to DMA ring operation
(say to an underlying struct scsi_device or struct block_device) instead of
fileio, in some cases you will see better performance when using hardware
(i.e. without an underlying kernel thread queueing IO into block). But I have
found FILEIO with sendpage on top of MD to be faster in single-threaded tests
than struct block_device. I am currently using IBLOCK on LVM for core
LIO operation (which actually sits on software MD raid6). I do this
because using submit_bio() with se_mem_t mapped arrays of struct
scatterlist -> struct bio_vec can handle power failures properly, and
does not send back StatSN Acks to the Initiator, who would otherwise think
that everything has already made it to disk. That problem does exist when
doing IO to a struct file in the kernel today, because there is no
kernel-level O_DIRECT.
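
To make the submit_bio() point concrete, here is a minimal sketch of
submitting a write built from preallocated pages and waiting for its
completion. This is illustrative only, not LIO-SE code, and it assumes a
2.6.24-era block layer (bio field names and the end_io signature differ
in other kernel versions); example_write_pages() and example_end_io()
are made-up names:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

/* Completion callback: runs when the device reports the bio done. */
static void example_end_io(struct bio *bio, int error)
{
        complete((struct completion *)bio->bi_private);
        bio_put(bio);
}

static int example_write_pages(struct block_device *bdev, sector_t lba,
                               struct page **pages, int nr_pages)
{
        struct completion done;
        struct bio *bio = bio_alloc(GFP_KERNEL, nr_pages);
        int i;

        if (!bio)
                return -ENOMEM;

        init_completion(&done);
        bio->bi_bdev    = bdev;
        bio->bi_sector  = lba;                  /* in 512-byte units */
        bio->bi_end_io  = example_end_io;
        bio->bi_private = &done;

        for (i = 0; i < nr_pages; i++)
                if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
                        break;                  /* hit a queue limit */

        submit_bio(WRITE, bio);                 /* goes to the request queue */
        wait_for_completion(&done);
        return 0;
}

Note the completion only tells you the block layer finished the bio;
whether the data is actually on the platter still depends on barriers and
cache flushing, which is exactly the StatSN concern above.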

Also, for proper kernel-level target mode support, using struct file with
O_DIRECT for storage blocks and emulating control path CDBs is one of
the work items. This can be made generic, or obtained from the
underlying storage object (anything that can be exported from an LIO
Subsystem TPI) for real hardware (struct scsi_device in just about all
the cases these days). Last time I looked, kernel-level O_DIRECT was
blocked by fs/direct-io.c:dio_refill_pages() using get_user_pages()...

Then there is the really transport-specific CDB and control code, which
in a good number of cases we are eventually going to be expected to
emulate in software. I really like how STGT breaks this up into per
device type code segments; spc.c sbc.c mmc.c ssc.c smc.c etc. Having all
of these split out properly is one strong point of STGT IMHO, and really
makes learning things much easier. The same goes for being able to queue
these IOs into userspace and receive an asynchronous response back up
the storage stack. I think this is actually a pretty interesting
possibility for passing storage protocol packets into userspace apps
while leaving the protocol state machines and recovery paths in the
kernel with a generic target engine.

Also, I know that the SCST folks have put a lot of time into getting the
very SCSI-hardware-specific target mode control modes to work. I
personally own a bunch of these adapters, and would really like to see
better support for target mode on non-iSCSI adapters with a single
target mode storage engine that abstracts storage subsystems and wire
protocol fabrics.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Luben Tuikov
2008-02-07 20:37:22 UTC
Permalink
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-02-08 10:32:07 UTC
Permalink
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
What do you mean? To call the low-level backstorage SCSI drivers'
queuecommand() routine directly? What are the advantages of that?
Post by Luben Tuikov
Luben
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Luben Tuikov
2008-02-09 07:32:55 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
What do you mean? To call the low-level backstorage SCSI drivers'
queuecommand() routine directly? What are the advantages of that?
Yes, that's what I meant. Just curious.

Thanks,
Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-02-11 10:02:47 UTC
Permalink
Post by Luben Tuikov
Post by Vladislav Bolkhovitin
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
What do you mean? To call the low-level backstorage SCSI drivers'
queuecommand() routine directly? What are the advantages of that?
Yes, that's what I meant. Just curious.
What's the advantage of it?
Post by Luben Tuikov
Thanks,
Luben
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-08 11:53:59 UTC
Permalink
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
Luben
Hi Luben,

I am guessing you mean further down the stack, which I don't know to
be the case. Going further up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that, of the target mode fabrics on the left side of the layout, only
the traditional iSCSI target is currently implemented in the v2.9 LIO-SE
codebase. The API between the protocol headers that does
encoding/decoding of target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :). I don't know who has the more mature API between the
storage engine and target storage protocol for this, SCST or STGT; I am
guessing SCST because of the difference in age of the projects. Could
someone be so kind as to fill me in on this..?

Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented. Userspace passthrough in
this context is a target engine I/O that enforces max_sector and
sector_size limitations, and encodes/decodes target storage protocol
packets all out of view of userspace. The addressing will be completely
different if we are pointing SE target packets at non-SCSI target ports
in userspace.

--nab
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vladislav Bolkhovitin
2008-02-08 14:42:50 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
Luben
Hi Luben,
I am guessing you mean futher down the stack, which I don't know this to
be the case. Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.
http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf
Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout. The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :). I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects. Could someone be so kind to fill me in on this..?
SCST uses the scsi_execute_async_fifo() function to submit commands to SCSI
devices in pass-through mode. This function is a slightly modified
version of scsi_execute_async(); it submits requests in FIFO order
instead of the LIFO order that scsi_execute_async() uses (so with
scsi_execute_async() they are executed in the reverse order).
scsi_execute_async_fifo() is added as a separate patch to the kernel.
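
For reference, the LIFO behaviour described above is what you get when
each new request is inserted at the head of the device's request queue;
inserting at the tail gives FIFO order. A sketch of the difference at
the call site, assuming the 2.6.24-era block API where the fourth
argument of blk_execute_rq_nowait() selects head vs. tail insertion
(example_submit() is a made-up name, not the actual SCST patch):

#include <linux/blkdev.h>

static void example_submit(struct request_queue *q, struct request *rq,
                           rq_end_io_fn *done, int fifo)
{
        /*
         * at_head = 1: newest request is dispatched first -> LIFO
         * at_head = 0: requests are dispatched in arrival order -> FIFO
         */
        blk_execute_rq_nowait(q, NULL, rq, fifo ? 0 : 1, done);
}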
Post by Nicholas A. Bellinger
Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented. Userspace passthrough in
this context is an target engine I/O that is enforcing max_sector and
sector_size limitiations, and encodes/decodes target storage protocol
packets all out of view of userspace. The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.
--nab
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-09 00:00:27 UTC
Permalink
Post by Vladislav Bolkhovitin
Post by Nicholas A. Bellinger
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
Luben
Hi Luben,
I am guessing you mean futher down the stack, which I don't know this to
be the case. Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.
http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf
Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout. The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :). I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects. Could someone be so kind to fill me in on this..?
SCST uses scsi_execute_async_fifo() function to submit commands to SCSI
devices in the pass-through mode. This function is slightly modified
version of scsi_execute_async(), which submits requests in FIFO order
instead of LIFO as scsi_execute_async() does (so with
scsi_execute_async() they are executed in the reverse order).
Scsi_execute_async_fifo() added as a separate patch to the kernel.
The LIO-SE PSCSI Plugin also depends on scsi_execute_async() for builds
on >= 2.6.18. Note that in the core LIO storage engine code (which would
be iscsi_target_transport.c) there is no subsystem dependence logic. The
LIO-SE API is what allows the SE plugins to remain simple and small:

-rw-r--r-- 1 root root 35008 2008-02-02 03:25 iscsi_target_pscsi.c
-rw-r--r-- 1 root root 7537 2008-02-02 17:27 iscsi_target_pscsi.h
-rw-r--r-- 1 root root 18269 2008-02-04 02:23 iscsi_target_iblock.c
-rw-r--r-- 1 root root 6834 2008-02-04 02:25 iscsi_target_iblock.h
-rw-r--r-- 1 root root 30611 2008-02-02 03:25 iscsi_target_file.c
-rw-r--r-- 1 root root 7833 2008-02-02 17:27 iscsi_target_file.h
-rw-r--r-- 1 root root 35154 2008-02-02 04:01 iscsi_target_rd.c
-rw-r--r-- 1 root root 9900 2008-02-02 17:27 iscsi_target_rd.h

It also means that the core LIO-SE code does not have to change when the
subsystem APIs change. This has been important in the past for the
project, but for upstream code, probably would not make a huge
difference.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds
2008-02-04 18:29:38 UTC
Permalink
Post by James Bottomley
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.
mmap'ing may avoid the copy, but the overhead of a mmap operation is
quite often much *bigger* than the overhead of a copy operation.

Please do not advocate the use of mmap() as a way to avoid memory copies.
It's not realistic. Even if you can do it with a single "mmap()" system
call (which is not at all a given, considering that block devices can
easily be much larger than the available virtual memory space), the fact
is that page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner.

Yes, memory is "slow", but dammit, so is mmap().
Post by James Bottomley
You also have to pull tricks with the mmap region in the case of writes
to prevent useless data being read in from the backing store. However,
none of this involves data copies.
"data copies" is irrelevant. The only thing that matters is performance.
And if avoiding data copies is more costly (or even of a similar cost)
than the copies themselves would have been, there is absolutely no upside,
and only downsides due to extra complexity.

If you want good performance for a service like this, you really generally
*do* need to be in kernel space. You can play games in user space, but you're
fooling yourself if you think you can do as well as doing it in the
kernel. And you're *definitely* fooling yourself if you think mmap()
solves performance issues. "Zero-copy" does not equate to "fast". Memory
speeds may be slower than core CPU speeds, but not infinitely so!

(That said: there *are* alternatives to mmap, like "splice()", that really
do potentially solve some issues without the page table and TLB overheads.
But while splice() avoids the costs of paging, I strongly suspect it would
still have easily measurable latency issues. Switching between user and
kernel space multiple times is definitely not going to be free, although
it's probably not a huge issue if you have big enough requests).
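
For the curious, a minimal userspace sketch of the splice() idea: moving
data from a connected socket into a file through a pipe, so the payload
never has to be copied into a userspace buffer (error handling and
partial-write handling trimmed; the fds are assumed to be set up already):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Returns bytes moved, or -1 on error. */
ssize_t splice_to_file(int sock_fd, int file_fd, size_t len)
{
        int pipefd[2];
        ssize_t n, total = 0;

        if (pipe(pipefd) < 0)
                return -1;

        while ((size_t)total < len) {
                /* Pull data from the socket into the pipe... */
                n = splice(sock_fd, NULL, pipefd[1], NULL, len - total,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                        break;
                /* ...then push the same pages from the pipe into the file. */
                n = splice(pipefd[0], NULL, file_fd, NULL, n,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                        break;
                total += n;
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return total;
}

Even here the two system calls per chunk are exactly the user/kernel
switching cost mentioned above, so the win depends on request size.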

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
James Bottomley
2008-02-04 18:49:52 UTC
Permalink
Post by Linus Torvalds
Post by James Bottomley
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.
mmap'ing may avoid the copy, but the overhead of a mmap operation is
quite often much *bigger* than the overhead of a copy operation.
Please do not advocate the use of mmap() as a way to avoid memory copies.
It's not realistic. Even if you can do it with a single "mmap()" system
call (which is not at all a given, considering that block devices can
easily be much larger than the available virtual memory space), the fact
is that page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner.
Yes, memory is "slow", but dammit, so is mmap().
Post by James Bottomley
You also have to pull tricks with the mmap region in the case of writes
to prevent useless data being read in from the backing store. However,
none of this involves data copies.
"data copies" is irrelevant. The only thing that matters is performance.
And if avoiding data copies is more costly (or even of a similar cost)
than the copies themselves would have been, there is absolutely no upside,
and only downsides due to extra complexity.
If you want good performance for a service like this, you really generally
*do* need to in kernel space. You can play games in user space, but you're
fooling yourself if you think you can do as well as doing it in the
kernel. And you're *definitely* fooling yourself if you think mmap()
solves performance issues. "Zero-copy" does not equate to "fast". Memory
speeds may be slower that core CPU speeds, but not infinitely so!
(That said: there *are* alternatives to mmap, like "splice()", that really
do potentially solve some issues without the page table and TLB overheads.
But while splice() avoids the costs of paging, I strongly suspect it would
still have easily measurable latency issues. Switching between user and
kernel space multiple times is definitely not going to be free, although
it's probably not a huge issue if you have big enough requests).
Sorry ... this is really just a discussion of how something (zero copy)
could be done, rather than an implementation proposal. (I'm not
actually planning to make the STGT people do anything ... although
investigating splice does sound interesting).

Right at the moment, STGT seems to be performing just fine on
measurements up to gigabit networks. There are suggestions that there
may be a problem on 8G IB networks, but it's not definitive yet.

I'm already on record as saying I think the best fix for IB networks is
just to reduce the context switches by increasing the transfer size, but
the infrastructure to allow that only just went into git head.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-04 19:06:29 UTC
Permalink
Post by Linus Torvalds
Post by James Bottomley
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.
mmap'ing may avoid the copy, but the overhead of a mmap operation is
quite often much *bigger* than the overhead of a copy operation.
Please do not advocate the use of mmap() as a way to avoid memory copies.
It's not realistic. Even if you can do it with a single "mmap()" system
call (which is not at all a given, considering that block devices can
easily be much larger than the available virtual memory space), the fact
is that page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner.
Yes, memory is "slow", but dammit, so is mmap().
Post by James Bottomley
You also have to pull tricks with the mmap region in the case of writes
to prevent useless data being read in from the backing store. However,
none of this involves data copies.
"data copies" is irrelevant. The only thing that matters is performance.
And if avoiding data copies is more costly (or even of a similar cost)
than the copies themselves would have been, there is absolutely no upside,
and only downsides due to extra complexity.
The iSER spec (RFC-5046) quotes the following in the TCP case for direct
data placement:

" Out-of-order TCP segments in the Traditional iSCSI model have to be
stored and reassembled before the iSCSI protocol layer within an end
node can place the data in the iSCSI buffers. This reassembly is
required because not every TCP segment is likely to contain an iSCSI
header to enable its placement, and TCP itself does not have a
built-in mechanism for signaling Upper Level Protocol (ULP) message
boundaries to aid placement of out-of-order segments. This TCP
reassembly at high network speeds is quite counter-productive for the
following reasons: wasted memory bandwidth in data copying, the need
for reassembly memory, wasted CPU cycles in data copying, and the
general store-and-forward latency from an application perspective."

While this does not have anything to do directly with the kernel vs. user discussion
for target mode storage engine, the scaling and latency case is easy enough
to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics.
Post by Linus Torvalds
If you want good performance for a service like this, you really generally
*do* need to in kernel space. You can play games in user space, but you're
fooling yourself if you think you can do as well as doing it in the
kernel. And you're *definitely* fooling yourself if you think mmap()
solves performance issues. "Zero-copy" does not equate to "fast". Memory
speeds may be slower that core CPU speeds, but not infinitely so!
From looking at this problem from a kernel space perspective for a
number of years, I would be inclined to believe this is true for
software and hardware data-path cases. The benefit of moving various
control statemachines for something like, say, traditional iSCSI to
userspace has always been debatable. The most obvious candidates are
things like authentication, especially anything more complex than CHAP.
However, I have always thought that recovery from failures of a
communication path (an iSCSI connection) or of an entire nexus (an iSCSI
session) is very problematic to handle if IO state potentially has to be
pushed down to userspace.

Protocol and/or fabric specific statemachines (CSM-E and CSM-I from
connection recovery in iSCSI and iSER are the obvious ones) are the best
candidates for residing in kernel space.
Post by Linus Torvalds
(That said: there *are* alternatives to mmap, like "splice()", that really
do potentially solve some issues without the page table and TLB overheads.
But while splice() avoids the costs of paging, I strongly suspect it would
still have easily measurable latency issues. Switching between user and
kernel space multiple times is definitely not going to be free, although
it's probably not a huge issue if you have big enough requests).
Most of the SCSI OS storage subsystems that I have worked with in the
context of iSCSI have used 256 * 512 byte sector requests, with the
default traditional iSCSI PDU data payload (MRDSL) being 64k, which hits
the sweet spot with crc32c checksum calculations. I am assuming this is
going to be the case for other fabrics as well.

--nab


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-04 19:19:16 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Linus Torvalds
Post by James Bottomley
The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.
mmap'ing may avoid the copy, but the overhead of a mmap operation is
quite often much *bigger* than the overhead of a copy operation.
Please do not advocate the use of mmap() as a way to avoid memory copies.
It's not realistic. Even if you can do it with a single "mmap()" system
call (which is not at all a given, considering that block devices can
easily be much larger than the available virtual memory space), the fact
is that page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner.
Yes, memory is "slow", but dammit, so is mmap().
Post by James Bottomley
You also have to pull tricks with the mmap region in the case of writes
to prevent useless data being read in from the backing store. However,
none of this involves data copies.
"data copies" is irrelevant. The only thing that matters is performance.
And if avoiding data copies is more costly (or even of a similar cost)
than the copies themselves would have been, there is absolutely no upside,
and only downsides due to extra complexity.
The iSER spec (RFC-5046) quotes the following in the TCP case for direct
" Out-of-order TCP segments in the Traditional iSCSI model have to be
stored and reassembled before the iSCSI protocol layer within an end
node can place the data in the iSCSI buffers. This reassembly is
required because not every TCP segment is likely to contain an iSCSI
header to enable its placement, and TCP itself does not have a
built-in mechanism for signaling Upper Level Protocol (ULP) message
boundaries to aid placement of out-of-order segments. This TCP
reassembly at high network speeds is quite counter-productive for the
following reasons: wasted memory bandwidth in data copying, the need
for reassembly memory, wasted CPU cycles in data copying, and the
general store-and-forward latency from an application perspective."
While this does not have anything to do directly with the kernel vs. user discussion
for target mode storage engine, the scaling and latency case is easy enough
to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics.
Post by Linus Torvalds
If you want good performance for a service like this, you really generally
*do* need to in kernel space. You can play games in user space, but you're
fooling yourself if you think you can do as well as doing it in the
kernel. And you're *definitely* fooling yourself if you think mmap()
solves performance issues. "Zero-copy" does not equate to "fast". Memory
speeds may be slower that core CPU speeds, but not infinitely so!
From looking at this problem from a kernel space perspective for a
number of years, I would be inclined to believe this is true for
software and hardware data-path cases. The benefits of moving various
control statemachines for something like say traditional iSCSI to
userspace has always been debateable. The most obvious ones are things
like authentication, espically if something more complex than CHAP are
the obvious case for userspace. However, I have thought recovery for
failures caused from communication path (iSCSI connections) or entire
nexuses (iSCSI sessions) failures was very problematic to expect to have
to potentially push down IOs state to userspace.
Keeping statemachines for protocol and/or fabric specific statemachines
(CSM-E and CSM-I from connection recovery in iSCSI and iSER are the
obvious ones) are the best canidates for residing in kernel space.
Post by Linus Torvalds
(That said: there *are* alternatives to mmap, like "splice()", that really
do potentially solve some issues without the page table and TLB overheads.
But while splice() avoids the costs of paging, I strongly suspect it would
still have easily measurable latency issues. Switching between user and
kernel space multiple times is definitely not going to be free, although
it's probably not a huge issue if you have big enough requests).
Then again, having some of the data-path for software and hardware bulk IO
operation of a storage fabric protocol / statemachine in userspace would
be really interesting for something like an SPU-enabled engine for the
Cell Broadband Architecture.

--nab



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds
2008-02-04 19:44:31 UTC
Permalink
Post by Nicholas A. Bellinger
While this does not have anything to do directly with the kernel vs.
user discussion for target mode storage engine, the scaling and latency
case is easy enough to make if we are talking about scaling TCP for 10
Gb/sec storage fabrics.
I would like to point out that while I think there is no question that the
basic data transfer engine would perform better in kernel space, there
still *are* questions whether

- iSCSI is relevant enough for us to even care ...

- ... and the complexity is actually worth it.

That said, I also tend to believe that trying to split things up between
kernel and user space is often more complex than just keeping things in
one place, because the trade-offs of which part goes where will inevitably
be wrong in *some* area, and then you're really screwed.

So from a purely personal standpoint, I'd like to say that I'm not really
interested in iSCSI (and I don't quite know why I've been cc'd on this
whole discussion) and think that other approaches are potentially *much*
better. So for example, I personally suspect that ATA-over-ethernet is way
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
low-level, and against those crazy SCSI people to begin with.

So take any utterances of mine with a big pinch of salt.

Historically, the only split that has worked pretty well is "connection
initiation/setup in user space, actual data transfers in kernel space".

Pure user-space solutions work, but tend to eventually be turned into
kernel-space if they are simple enough and really do have throughput and
latency considerations (eg nfsd), and aren't quite complex and crazy
enough to have a large impedance-matching problem even for basic IO stuff
(eg samba).

And totally pure kernel solutions work only if there are very stable
standards and no major authentication or connection setup issues (eg local
disks).

So just going by what has happened in the past, I'd assume that iSCSI
would eventually turn into "connecting/authentication in user space" with
"data transfers in kernel space". But only if it really does end up
mattering enough. We had a totally user-space NFS daemon for a long time,
and it was perfectly fine until people really started caring.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
4***@retesintesi.it
2008-02-04 20:06:10 UTC
Permalink
So from a purely personal standpoint, I'd like to say that I'm not really
interested in iSCSI (and I don't quite know why I've been cc'd on this
whole discussion) and think that other approaches are potentially *much*
better. So for example, I personally suspect that ATA-over-ethernet is way
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
low-level, and against those crazy SCSI people to begin with.
Surely AoE beats iSCSI on performance, mostly because of the smaller
protocol stack:
iscsi -> scsi - ip - eth
aoe -> ata - eth

but surely iSCSI is more of a standard than AoE and is more actively used
in the real world.

Other really useful features are that:
- iSCSI can move SCSI devices to an IP-based SAN by routing them (I have
some tape changers routed by SCST to systems that have no other way to
see a tape).
- because it works on the IP layer it can be routed over long distances, so
given the needed bandwidth you can have a really remote block device
speaking a standard protocol between heterogeneous systems.
- iSCSI is now the cheapest SAN available.

bye,
marco.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-04 20:24:59 UTC
Permalink
Post by Linus Torvalds
Post by Nicholas A. Bellinger
While this does not have anything to do directly with the kernel vs.
user discussion for target mode storage engine, the scaling and latency
case is easy enough to make if we are talking about scaling TCP for 10
Gb/sec storage fabrics.
I would like to point out that while I think there is no question that the
basic data transfer engine would perform better in kernel space, there
stll *are* questions whether
- iSCSI is relevant enough for us to even care ...
- ... and the complexity is actually worth it.
That said, I also tend to believe that trying to split things up between
kernel and user space is often more complex than just keeping things in
one place, because the trade-offs of which part goes where wll inevitably
be wrong in *some* area, and then you're really screwed.
So from a purely personal standpoint, I'd like to say that I'm not really
interested in iSCSI (and I don't quite know why I've been cc'd on this
whole discussion)
The generic target mode storage engine discussion quickly goes to
transport specific scenarios. With so much interest in the SCSI
transports, particularly iSCSI, there are lots of devs, users, and
vendors who would like to see Linux improve in this respect.
Post by Linus Torvalds
and think that other approaches are potentially *much*
better. So for example, I personally suspect that ATA-over-ethernet is way
better than some crazy SCSI-over-TCP crap,
Having the non-SCSI target mode transports use the same data IO path as
the SCSI ones to the SCSI, BIO, and FILE subsystems is something that can
easily be agreed on. Also, having to emulate the non-SCSI control paths
in a non-generic manner for a target mode engine has to suck (I don't
know what AoE does for that now, considering that this is going down to
libata or real SCSI hardware in some cases). There is some of the more
arcane task management functionality in SCSI (ACA anyone?) that even
generic SCSI target mode engines do not use, and it only seems to make
things endlessly complex to implement and emulate.

But aside from those very SCSI hardware specific cases, having a generic
method to use something like ABORT_TASK or LUN_RESET for a target mode
engine (along with the data path to all of the subsystems) would be
beneficial for any fabric.
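
To make the "generic TMR" idea concrete, a minimal sketch of what a
fabric-neutral task management entry point could look like (all names
here are hypothetical, for illustration only, and are not taken from LIO
or SCST):

#include <linux/types.h>

/* Hypothetical fabric-neutral task management interface. */
enum tmr_function {
        TMR_ABORT_TASK,
        TMR_LUN_RESET,
};

struct target_engine_ops {
        /*
         * Implemented once by the engine, callable from any fabric driver
         * (iSCSI, FC, SRP, AoE, ...) without transport-specific knowledge.
         */
        int (*handle_tmr)(void *se_device, enum tmr_function fn, u64 tag);
};

/* A fabric driver just translates its wire-level TMR request: */
static int fabric_abort_task(struct target_engine_ops *ops,
                             void *se_device, u64 referenced_tag)
{
        return ops->handle_tmr(se_device, TMR_ABORT_TASK, referenced_tag);
}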
Post by Linus Torvalds
but I'm biased for simple and
low-level, and against those crazy SCSI people to begin with.
Well, having no obvious preconception (well, aside from the email
address), I am of the mindset that the iSCSI people are the LEAST crazy
of said crazy SCSI people. Some people (usually the least crazy iSCSI
standards folks) say that the FCoE people are crazy. Being one of the
iSCSI people I am kinda obligated to agree, but the technical points are
really solid, and have been so for over a decade. They are listed here
for those who are interested:

http://www.ietf.org/mail-archive/web/ips/current/msg02325.html
Post by Linus Torvalds
So take any utterances of mine with a big pinch of salt.
Historically, the only split that has worked pretty well is "connection
initiation/setup in user space, actual data transfers in kernel space".
Pure user-space solutions work, but tend to eventually be turned into
kernel-space if they are simple enough and really do have throughput and
latency considerations (eg nfsd), and aren't quite complex and crazy
enough to have a large impedance-matching problem even for basic IO stuff
(eg samba).
And totally pure kernel solutions work only if there are very stable
standards and no major authentication or connection setup issues (eg local
disks).
So just going by what has happened in the past, I'd assume that iSCSI
would eventually turn into "connecting/authentication in user space" with
"data transfers in kernel space". But only if it really does end up
mattering enough. We had a totally user-space NFS daemon for a long time,
and it was perfectly fine until people really started caring.
Thanks for putting this into a historical perspective. Also it is
interesting to note that the iSCSI spec (RFC-3720) was ratified in April
2004, so it will be going on 4 years soon, with pre-RFC products first
going out in 2001 (yikes!). In my experience, iSCSI interop
amongst implementations (especially between different OSes) has been
stable since about late 2004, early 2005, with interop between OS SCSI
subsystems (especially when talking to non-SCSI hardware) being the
slower of the two.

--nab


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
J. Bruce Fields
2008-02-04 21:01:21 UTC
Permalink
On Mon, Feb 04, 2008 at 11:44:31AM -0800, Linus Torvalds wrote:
...
Post by Linus Torvalds
Pure user-space solutions work, but tend to eventually be turned into
kernel-space if they are simple enough and really do have throughput and
latency considerations (eg nfsd), and aren't quite complex and crazy
enough to have a large impedance-matching problem even for basic IO stuff
(eg samba).
...
Post by Linus Torvalds
So just going by what has happened in the past, I'd assume that iSCSI
would eventually turn into "connecting/authentication in user space" with
"data transfers in kernel space". But only if it really does end up
mattering enough. We had a totally user-space NFS daemon for a long time,
and it was perfectly fine until people really started caring.
I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem--if you're content with
NFS-only access to your filesystem, then you can probably do everything
in userspace, but once you start worrying about getting stable
filehandles, consistent file locking, etc., from a real disk filesystem
with local users, then you require much closer cooperation from the
kernel.

And I seem to recall being told that sort of thing was the motivation
more than performance, but I wasn't there (and I haven't seen
performance comparisons).

--b.
Linus Torvalds
2008-02-04 21:24:53 UTC
Permalink
Post by J. Bruce Fields
I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem
.. not even shared. It was hard to get correct semantics full stop.

Which is a traditional problem. The thing is, the kernel always has some
internal state, and it's hard to expose all the semantics that the kernel
knows about to user space.

So no, performance is not the only reason to move to kernel space. It can
easily be things like needing direct access to internal data queues (for an
iSCSI target, this could be things like barriers or just tagged commands -
yes, you can probably emulate things like that without access to the
actual IO queues, but are you sure the semantics will be entirely right?)

The kernel/userland boundary is not just a performance boundary, it's an
abstraction boundary too, and these kinds of protocols tend to break
abstractions. NFS broke it by having "file handles" (which is not
something that really exists in user space, and is almost impossible to
emulate correctly), and I bet the same thing happens when emulating a SCSI
target in user space.

Maybe not. I _really_ haven't looked into iSCSI, I'm just guessing there
would be things like ordering issues.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-04 22:00:31 UTC
Permalink
Post by Linus Torvalds
Post by J. Bruce Fields
I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem
.. not even shared. It was hard to get correct semantics full stop.
Which is a traditional problem. The thing is, the kernel always has some
internal state, and it's hard to expose all the semantics that the kernel
knows about to user space.
So no, performance is not the only reason to move to kernel space. It can
easily be things like needing direct access to internal data queues (for a
iSCSI target, this could be things like barriers or just tagged commands -
yes, you can probably emulate things like that without access to the
actual IO queues, but are you sure the semantics will be entirely right?
The kernel/userland boundary is not just a performance boundary, it's an
abstraction boundary too, and these kinds of protocols tend to break
abstractions. NFS broke it by having "file handles" (which is not
something that really exists in user space, and is almost impossible to
emulate correctly), and I bet the same thing happens when emulating a SCSI
target in user space.
Maybe not. I _rally_ haven't looked into iSCSI, I'm just guessing there
would be things like ordering issues.
<nod>.

The iSCSI CDBs and write immediate, unsolicited, or solicited data
payloads may be received out of order across communication paths (which
may be going over different subnets) within the nexus, but the execution
of the CDBs by the SCSI Target Port must be in the same order as they came
down from the SCSI subsystem on the initiator port. In iSCSI and iSER terms,
this is called Command Sequence Number (CmdSN) ordering, and is enforced
within each nexus. The initiator node will be assigning the CmdSNs as
the CDBs come down, and when communication paths fail, unacknowledged
CmdSNs will be retried on a different communication path when using
iSCSI/iSER connection recovery. Already acknowledged CmdSNs will be
explicitly retried using an iSCSI-specific task management function called
TASK_REASSIGN. This, along with the CSM-I and CSM-E statemachines, is
collectively known as ErrorRecoveryLevel=2 in iSCSI.
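
To illustrate the CmdSN ordering rule in code: on receive, a target
compares the PDU's CmdSN against the session-wide ExpCmdSN/MaxCmdSN
window and either executes the command immediately, holds it until the
missing predecessors arrive, or rejects it. A simplified sketch of that
decision for a non-immediate command (illustrative only, not the LIO
statemachine; serial number wraparound per RFC 1982 is ignored for
brevity):

enum cmdsn_verdict { CMDSN_EXECUTE, CMDSN_HOLD, CMDSN_REJECT };

struct session_window {
        unsigned int exp_cmdsn;         /* next CmdSN we expect to execute   */
        unsigned int max_cmdsn;         /* highest CmdSN we currently accept */
};

static enum cmdsn_verdict check_cmdsn(struct session_window *sess,
                                      unsigned int cmdsn)
{
        if (cmdsn == sess->exp_cmdsn) {
                /* In order: execute now and open the window by one. */
                sess->exp_cmdsn++;
                return CMDSN_EXECUTE;
        }
        if (cmdsn > sess->exp_cmdsn && cmdsn <= sess->max_cmdsn)
                /*
                 * Arrived early, e.g. on another connection of the nexus:
                 * hold it until the missing CmdSNs show up.
                 */
                return CMDSN_HOLD;

        /* Outside the window: protocol error. */
        return CMDSN_REJECT;
}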

Anyways, here is a great visual of a modern iSCSI Target processor and
SCSI Target Engine. The CmdSN ordering is represented by the oval
across iSCSI connections going to various network portal groups on the
left side of the diagram. Thanks Eddy Q!

http://www.haifa.il.ibm.com/satran/ips/EddyQuicksall-iSCSI-in-diagrams/portal_groups.pdf

--nab


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-06 10:22:00 UTC
Permalink
For remotely accessing data, iSCSI+fs is quite simply more overhead than
a networked fs. With iSCSI you are doing
local VFS -> local blkdev -> network
whereas a networked filesystem is
local VFS -> network
There are use cases that can be solved better via iSCSI and a
filesystem than via a network filesystem. One such use case is when
deploying a virtual machine whose data is stored on a network server:
in that case there is only one user of the data (so there are no
locking issues) and the filesystem and the block device each run in a
different operating system: the filesystem runs inside the virtual
machine and iSCSI runs either in the hypervisor or in the native OS.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Jeff Garzik
2008-02-06 14:21:02 UTC
Permalink
Post by Bart Van Assche
For remotely accessing data, iSCSI+fs is quite simply more overhead than
a networked fs. With iSCSI you are doing
local VFS -> local blkdev -> network
whereas a networked filesystem is
local VFS -> network
There are use cases than can be solved better via iSCSI and a
filesystem than via a network filesystem. One such use case is when
in that case there is only one user of the data (so there are no
locking issues) and filesystem and block device each run in another
operating system: the filesystem runs inside the virtual machine and
iSCSI either runs in the hypervisor or in the native OS.
Hence the diskless root fs configuration I referred to in multiple
emails... whoopee, you reinvented NFS root with quotas :)

Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-06 00:11:07 UTC
Permalink
iSCSI is way, way too complicated.
I fully agree. From one side, all that complexity is unavoidable for
the case of multiple connections per session, but for the regular case of
one connection per session it must be a lot simpler.
Actually, think about those multiple connections... we already had to
implement fast-failover (and load bal) SCSI multi-pathing at a higher
level. IMO that portion of the protocol is redundant: You need the
same capability elsewhere in the OS _anyway_, if you are to support
multi-pathing.
I'm thinking about MC/S as a way to improve performance using
several physical links. There's no other way, except MC/S, to keep
command processing order in that case. So it's a really valuable
property of iSCSI, although with limited application.
Vlad
Greetings,

I have always observed with LIO SE/iSCSI target mode (as well as with
other software initiators we can leave out of the discussion for now,
and congrats to the open-iscsi folks on the recent release :-) that
execution core hardware thread and inter-nexus performance per 1 Gb/sec
Ethernet port scales up very well on 4x and 2x core x86_64 with MC/S. I
have been seeing 450 MB/sec using 2x socket 4x core x86_64 for a number
of years with MC/S. I have also used MC/S on 10 Gb/sec (on PCI-X v2.0
266 MHz as well, which was the first transport on which LIO Target was
able to handle duplex ~1200 MB/sec with 3 initiators and MC/S). In the
point-to-point 10 Gb/sec tests on IBM p404 machines, the initiators were
able to reach ~910 MB/sec with MC/S. Open-iSCSI was able to go a bit
faster (~950 MB/sec) because it uses struct sk_buff directly.

A good rule to keep in mind here while considering performance is that,
because of context switching overhead and pipeline <-> bus stalling
(along with other legacy OS specific storage stack limitations with
BLOCK and VFS with O_DIRECT, et al, which I will leave out of the
discussion for iSCSI and SE engine target mode), an initiator will scale
roughly 1/2 as well as a target, given comparable hardware and virsh
output. The software target case also depends, in great regard in many
cases, on whether we are talking about something as simple as doing
contiguous DMA memory allocations from a SINGLE kernel thread, and
handling direct execution to a storage hardware DMA ring that may
not have been allocated in the current kernel thread. In MC/S mode this
breaks down to:

1) Sorting logic that handles the pre-execution statemachine for transport
from local RDMA memory and OS specific data buffers: a TCP application
data buffer, struct sk_buff, or RDMA struct page or SG. This should be
generic between iSCSI and iSER.

2) Allocation of said memory buffers to OS subsystem dependent code that
can be queued up to those drivers. It breaks down to what you can get
driver and OS subsystem folks to agree to implement, and can be made
generic in a Transport / BLOCK / VFS layered storage stack. In the
"allocate a thread DMA ring and use OS supported software and vendor
available hardware" case, I don't think the kernel space requirement will
ever completely be able to go away.

Without diving into RFC-3720 specifics, the MC/S side statemachine covers
memory allocation, login and logout (generic to iSCSI and iSER), and
ERL=2 recovery. My plan is to post the locations in the LIO code where
this has been implemented, and where we can make this easier, etc.
Early in the development of what eventually became the LIO Target
code, ERL was broken into separate files and separate function
prefixes:

iscsi_target_erl0, iscsi_target_erl1 and iscsi_target_erl2.

The statemachine for ERL=0 and ERL=2 is pretty simple in RFC-3720 (have
a look for those interested in the discussion)

7.1.1. State Descriptions for Initiators and Targets

The LIO target code is also pretty simple for this:

[***@ps3-cell target]# wc -l iscsi_target_erl*
1115 iscsi_target_erl0.c
45 iscsi_target_erl0.h
526 iscsi_target_erl0.o
1426 iscsi_target_erl1.c
51 iscsi_target_erl1.h
1253 iscsi_target_erl1.o
605 iscsi_target_erl2.c
45 iscsi_target_erl2.h
447 iscsi_target_erl2.o
5513 total

erl1.c is a bit larger than the others because it contains the MC/S
statemachine functions. iscsi_target_erl1.c:iscsi_execute_cmd() and
iscsi_target_util.c:iscsi_check_received_cmdsn() do most of the work for
the LIO MC/S state machine. It would probably benefit from being broken
up into, say, iscsi_target_mcs.c. Note that all of this code is MC/S
safe, with the exception of the specific SCSI TMR functions. For the
SCSI TMR pieces, I have always hoped to use SCST code for doing this...

Most of the login/logout code is done in iscsi_target.c, which could
probably also benefit from getting broken out...

--nab


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-06 01:43:20 UTC
Permalink
Post by Nicholas A. Bellinger
iSCSI is way, way too complicated.
I fully agree. From one side, all that complexity is unavoidable for
case of multiple connections per session, but for the regular case of
one connection per session it must be a lot simpler.
Actually, think about those multiple connections... we already had to
implement fast-failover (and load bal) SCSI multi-pathing at a higher
level. IMO that portion of the protocol is redundant: You need the
same capability elsewhere in the OS _anyway_, if you are to support
multi-pathing.
I'm thinking about MC/S as about a way to improve performance using
several physical links. There's no other way, except MC/S, to keep
commands processing order in that case. So, it's really valuable
property of iSCSI, although with a limited application.
Vlad
Greetings,
I have always observed the case with LIO SE/iSCSI target mode (as well
as with other software initiators we can leave out of the discussion for
now, and congrats to the open/iscsi on folks recent release. :-) that
execution core hardware thread and inter-nexus per 1 Gb/sec ethernet
port performance scales up to 4x and 2x core x86_64 very well with
MC/S). I have been seeing 450 MB/sec using 2x socket 4x core x86_64 for
a number of years with MC/S. Using MC/S on 10 Gb/sec (on PCI-X v2.0
266mhz as well, which was the first transport that LIO Target ran on
that was able to reach handle duplex ~1200 MB/sec with 3 initiators and
MC/S. In the point to point 10 GB/sec tests on IBM p404 machines, the
initiators where able to reach ~910 MB/sec with MC/S. Open/iSCSI was
able to go a bit faster (~950 MB/sec) because it uses struct sk_buff
directly.
Sorry, these were IBM p505 Express (not p404, duh), which had a 2x
socket 2x core POWER5 setup. These (along with an IBM X-series machine)
were the only ones available for PCI-X v2.0, and this probably is still
the case. :-)

Also, these numbers were with a ~9000 MTU (I don't recall what the
hardware limit on the 10 Gb/sec switch was) doing direct struct iovec
to preallocated struct page mapping for payload on the target side.
This is known as the RAMDISK_DR plugin in the LIO-SE. On the initiator,
LTP disktest and O_DIRECT were used for direct SCSI block device access.
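
For anyone reproducing such a run without LTP disktest, the O_DIRECT
side on the initiator looks roughly like this (userspace sketch; 512-byte
alignment assumed, and /dev/sdb is only an example device name):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 64 * 1024;           /* one 64k transfer */
        void *buf;
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);

        if (fd < 0)
                return 1;
        /* O_DIRECT needs sector-aligned buffers, offsets and lengths. */
        if (posix_memalign(&buf, 512, len))
                return 1;
        if (read(fd, buf, len) < 0)
                return 1;
        free(buf);
        close(fd);
        return 0;
}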

I can dig up this paper if anyone is interested.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Bart Van Assche
2008-02-12 16:05:20 UTC
Permalink
I have always observed the case with LIO SE/iSCSI target mode ...
Hello Nicholas,

Are you sure that the LIO-SE kernel module source code is ready for
inclusion in the mainstream Linux kernel ? As you know I tried to test
the LIO-SE iSCSI target. Already while configuring the target I
encountered a kernel crash that froze the whole system. I can
reproduce this kernel crash easily, and I reported it 11 days ago on
the LIO-SE mailing list (February 4, 2008). One of the call stacks I
posted shows a crash in mempool_alloc() called from jbd. In other words,
the crash is most likely the result of memory corruption caused by LIO-SE.

Because I was curious to know why it took so long to fix such a severe
crash, I started browsing through the LIO-SE source code. Analysis of
the LIO-SE kernel module source code taught me that this crash is not
a coincidence. Dynamic memory allocation (kmalloc()/kfree()) in the
LIO-SE kernel module is complex and hard to verify. There are 412
memory allocation/deallocation calls in the current version of the
LIO-SE kernel module source code, which is a lot. Additionally,
because of the complexity of the memory handling in LIO-SE, it is not
possible to verify the correctness of the memory handling by analyzing
a single function at a time. In my opinion this makes the LIO-SE
source code hard to maintain.
Furthermore, the LIO-SE kernel module source code does not follow
conventions that have proven their value in the past like grouping all
error handling at the end of a function. As could be expected, the
consequence is that error handling is not correct in several
functions, resulting in memory leaks in case of an error. Some
examples of functions in which error handling is clearly incorrect:
* transport_allocate_passthrough().
* iscsi_do_build_list().

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nicholas A. Bellinger
2008-02-13 03:44:35 UTC
Permalink
Greetings all,
Post by Bart Van Assche
I have always observed the case with LIO SE/iSCSI target mode ...
Hello Nicholas,
Are you sure that the LIO-SE kernel module source code is ready for
inclusion in the mainstream Linux kernel ? As you know I tried to test
the LIO-SE iSCSI target. Already while configuring the target I
encountered a kernel crash that froze the whole system. I can
reproduce this kernel crash easily, and I reported it 11 days ago on
the LIO-SE mailing list (February 4, 2008). One of the call stacks I
posted shows a crash in mempool_alloc() called from jbd. Or: the crash
is most likely the result of memory corruption caused by LIO-SE.
So I was able to FINALLY track this down to:

-# CONFIG_SLUB_DEBUG is not set
-# CONFIG_SLAB is not set
-CONFIG_SLUB=y
+CONFIG_SLAB=y

in both your and Chris Weiss's configs that was causing the
reproducible general protection faults. I also disabled
CONFIG_RELOCATABLE and crash dump because I was debugging using kdb in
an x86_64 VM on 2.6.24 with your config. I am pretty sure you can leave
this (crash dump) in your config for testing.

This can take a while to compile and take up a lot of space, esp. with
all of the kernel debug options enabled, which on 2.6.24 really amounts
to a lot of CPU time when building. Also, with your original config, I
was seeing some strange undefined module objects after the Stage 2 Link
of iscsi_target_mod with modpost under SLUB, as well as the lockups
(which are not random btw, and are tracked back to __kmalloc()). Also,
at module load time with the original config, there were some warnings
about symbol objects (I believe they were SCSI related, same as the ones
with modpost).

In any event, the dozen 1000-loop discovery tests are now working fine (as
is IPoIB) with the above config change, and you should be ready to
go for your testing.

Tomo, Vlad, Andrew and Co:

Do you have any idea why this would be the case with LIO-Target..? Is
anyone else seeing something similar to this with their target mode
(maybe it's all out-of-tree code..?) that is having an issue..? I am
using Debian x86_64, Bart and Chris are using Ubuntu x86_64, and we
all have this problem with CONFIG_SLUB on >= 2.6.22 kernel.org
kernels.

Also, I will recompile some of my non-x86 machines with the above
enabled and see if I can reproduce.. Here is Bart's config again:

http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/30835aede1028188
Post by Bart Van Assche
Because I was curious to know why it took so long to fix such a severe
crash, I started browsing through the LIO-SE source code. Analysis of
the LIO-SE kernel module source code showed me that this crash is not
a coincidence. Dynamic memory allocation (kmalloc()/kfree()) in the
LIO-SE kernel module is complex and hard to verify.
What the LIO-SE Target module does is complex. :P Sorry for taking so
long, I had to start tracking this down by CONFIG_ option with your
config on an x86_64 VM.
Post by Bart Van Assche
There are 412
memory allocation/deallocation calls in the current version of the
LIO-SE kernel module source code, which is a lot. Additionally,
because of the complexity of the memory handling in LIO-SE, it is not
possible to verify the correctness of the memory handling by analyzing
a single function at a time. In my opinion this makes the LIO-SE
source code hard to maintain.
Furthermore, the LIO-SE kernel module source code does not follow
conventions that have proven their value in the past, such as grouping
all error handling at the end of a function. As could be expected, the
consequence is that error handling is incorrect in several functions,
resulting in memory leaks when an error occurs.
I would be more than happy to point out the release paths for the iSCSI
Target and LIO-SE to show that they are not actual memory leaks (as I
mentioned, this code has been stable for a number of years) for any
particular SE or iSCSI Target logic, if you are interested..

Also, if we are talking about a target-mode storage engine that should
be going upstream, it needs an API to the current stable and future
storage systems, and of course Mem->SG and SG->Mem mapping that handles
all possible cases of max_sectors and sector_size: past, present, and
future. I am really glad that you have been taking a look at this,
because some of the code (as you mention) can get very complex to make
this a reality, as it has been with LIO-Target since v2.2.
Post by Bart Van Assche
Some examples of functions in which the error handling is clearly incorrect:
* transport_allocate_passthrough().
* iscsi_do_build_list().
You did find the one in transport_allocate_passthrough() and the
strncpy() + strlen() in userspace. Also, thanks for pointing me to the
missing sg_init_table() and sg_mark_end() usage for 2.6.24. I will post
an update to my thread about how to do this for other drivers..
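
[Editorial note: for context, a minimal sketch of the 2.6.24-style
scatterlist setup being referred to. The array sizes and page sources
are made up for illustration, and it assumes the single-argument
sg_mark_end() of the 2.6.24 scatterlist API.]

#include <linux/scatterlist.h>

static void example_fill_sg(struct scatterlist *sg, unsigned int nents,
			    struct page **pages, unsigned int *lens,
			    unsigned int *offs, unsigned int used)
{
	unsigned int i;

	/* Zeroes the table and marks sg[nents - 1] as the end entry. */
	sg_init_table(sg, nents);

	for (i = 0; i < used; i++)
		sg_set_page(&sg[i], pages[i], lens[i], offs[i]);

	/* If fewer entries were filled than were allocated, terminate
	 * the list at the last entry actually used. */
	if (used && used < nents)
		sg_mark_end(&sg[used - 1]);
}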

I will have a look at your new changes and post them on LIO-Target-Dev
for your review. Please feel free to Ack them when I post.

(Thanks Bart !!)

PS: Sometimes it takes a while when you are on the bleeding edge of
development to track these types of issues down. :-)

--nab

Nicholas A. Bellinger
2008-02-13 06:18:27 UTC
Permalink
Post by Nicholas A. Bellinger
Greetings all,
Post by Bart Van Assche
I have always observed the case with LIO SE/iSCSI target mode ...
Hello Nicholas,
Are you sure that the LIO-SE kernel module source code is ready for
inclusion in the mainstream Linux kernel ? As you know I tried to test
the LIO-SE iSCSI target. Already while configuring the target I
encountered a kernel crash that froze the whole system. I can
reproduce this kernel crash easily, and I reported it 11 days ago on
the LIO-SE mailing list (February 4, 2008). One of the call stacks I
posted shows a crash in mempool_alloc() called from jbd. Or: the crash
is most likely the result of memory corruption caused by LIO-SE.
-# CONFIG_SLUB_DEBUG is not set
-# CONFIG_SLAB is not set
-CONFIG_SLUB=y
+CONFIG_SLAB=y
in both your and Chris Weiss's configs that was causing the
reproducible general protection faults. I also disabled
CONFIG_RELOCATABLE and crash dump because I was debugging using kdb in
an x86_64 VM on 2.6.24 with your config. I am pretty sure you can leave
this (crash dump) in your config for testing.
This can take a while to compile and takes up a lot of space,
especially with all of the kernel debug options enabled, which on
2.6.24 really amounts to a lot of CPU time when building. Also, with
your original config, I was seeing some strange undefined module
objects after the Stage 2 link of iscsi_target_mod with modpost under
SLUB; the lockups (which are not random, btw) are tracked back to
__kmalloc(). Also, at module load time with the original config, there
were some warnings about symbol objects (I believe they were SCSI
related, the same as the ones from modpost).
In any event, the dozen 1000-loop discovery tests are now working fine
(as well as IPoIB) with the above config change, and you should be
ready to go for your testing.
Do you have any ideas why this would be the case with LIO-Target..? Is
anyone else seeing something similar to this with their target mode
(maybe it's all out-of-tree code..?) that is having an issue..? I am
using Debian x86_64, Bart and Chris are using Ubuntu x86_64, and we
both have this problem with CONFIG_SLUB on >= 2.6.22 kernel.org
kernels.
Also, I will recompile some of my non-x86 machines with the above
http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/30835aede1028188
This is also failing with CONFIG_SLUB on 2.6.24 ppc64. Since the rest
of the system seems to work fine, my only guess is that it may be
related to the fact that the module is being compiled out of tree. I
took a quick glance at what kbuild was using for compiler and linker
parameters, but nothing looked out of the ordinary.

I will take a look with kdb and SLUB re-enabled on x86_64 and see if this
helps shed any light on the issue. Is anyone else seeing an issue with CONFIG_SLUB..?

Also, I wonder who else aside from Ubuntu is using this by default in
their .config..?


--nab

Nicholas A. Bellinger
2008-02-13 16:37:36 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Nicholas A. Bellinger
Greetings all,
Post by Bart Van Assche
I have always observed the case with LIO SE/iSCSI target mode ...
Hello Nicholas,
Are you sure that the LIO-SE kernel module source code is ready for
inclusion in the mainstream Linux kernel ? As you know I tried to test
the LIO-SE iSCSI target. Already while configuring the target I
encountered a kernel crash that froze the whole system. I can
reproduce this kernel crash easily, and I reported it 11 days ago on
the LIO-SE mailing list (February 4, 2008). One of the call stacks I
posted shows a crash in mempool_alloc() called from jbd. Or: the crash
is most likely the result of memory corruption caused by LIO-SE.
-# CONFIG_SLUB_DEBUG is not set
-# CONFIG_SLAB is not set
-CONFIG_SLUB=y
+CONFIG_SLAB=y
in both your and Chris Weiss's configs that was causing the
reproducible general protection faults. I also disabled
CONFIG_RELOCATABLE and crash dump because I was debugging using kdb in
an x86_64 VM on 2.6.24 with your config. I am pretty sure you can leave
this (crash dump) in your config for testing.
This can take a while to compile and takes up a lot of space,
especially with all of the kernel debug options enabled, which on
2.6.24 really amounts to a lot of CPU time when building. Also, with
your original config, I was seeing some strange undefined module
objects after the Stage 2 link of iscsi_target_mod with modpost under
SLUB; the lockups (which are not random, btw) are tracked back to
__kmalloc(). Also, at module load time with the original config, there
were some warnings about symbol objects (I believe they were SCSI
related, the same as the ones from modpost).
In any event, the dozen 1000-loop discovery tests are now working fine
(as well as IPoIB) with the above config change, and you should be
ready to go for your testing.
Do you have any ideas why this would be the case with LIO-Target..? Is
anyone else seeing something similar to this with their target mode
(maybe it's all out-of-tree code..?) that is having an issue..? I am
using Debian x86_64, Bart and Chris are using Ubuntu x86_64, and we
both have this problem with CONFIG_SLUB on >= 2.6.22 kernel.org
kernels.
Also, I will recompile some of my non-x86 machines with the above
http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/30835aede1028188
This is also failing with CONFIG_SLUB on 2.6.24 ppc64. Since the rest
of the system seems to work fine, my only guess is that it may be
related to the fact that the module is being compiled out of tree. I
took a quick glance at what kbuild was using for compiler and linker
parameters, but nothing looked out of the ordinary.
I will take a look with kdb and SLUB re-enabled on x86_64 and see if this
helps shed any light on the issue. Is anyone else seeing an issue with CONFIG_SLUB..?
I was able to track this down to a memory corruption issue in the
in-band iSCSI discovery path. I just made the change, and the diff can
be located at:

http://groups.google.com/group/linux-iscsi-target-dev/browse_thread/thread/a70d4835c55be392

In any event, this is now fixed, and I will be generating some new
builds of LIO-Target shortly for Debian, CentOS and Ubuntu, especially
for the Ubuntu folks, where this is going to be an issue with their
default kernel config.

A big thanks to Bart Van Assche for helping me locate the actual issue,
and giving me a clue with slub_debug=FZPU. Thanks again,

--nab

Nicholas A. Bellinger
2008-02-06 00:17:10 UTC
Permalink
iSCSI is way, way too complicated.
I fully agree. From one side, all that complexity is unavoidable for
case of multiple connections per session, but for the regular case of
one connection per session it must be a lot simpler.
Actually, think about those multiple connections... we already had to
implement fast-failover (and load bal) SCSI multi-pathing at a higher
level. IMO that portion of the protocol is redundant: You need the
same capability elsewhere in the OS _anyway_, if you are to support
multi-pathing.
Jeff
Hey Jeff,

I put a whitepaper on the LIO cluster recently about this topic.. It is
from a few years ago, but the data points are very relevant.

http://linux-iscsi.org/builds/user/nab/Inter.vs.OuterNexus.Multiplexing.pdf

The key advantage to MC/S and ERL=2 has always been that they are
completely OS independent. They are designed to work together and
actually benefit from one another.

They are also protocol independent between traditional iSCSI and
iSER.

--nab

PS: A great thanks to my former colleague Edward Cheng for putting this
together.
Nicholas A. Bellinger
2008-02-06 00:48:22 UTC
Permalink
Post by Linus Torvalds
better. So for example, I personally suspect that ATA-over-ethernet is way
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
low-level, and against those crazy SCSI people to begin with.
Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.
AoE is truly a thing of beauty. It has a two/three page RFC (say no more!).
But quite so... AoE is limited to MTU size, which really hurts. Can't
really do tagged queueing, etc.
iSCSI is way, way too complicated.
I fully agree. From one side, all that complexity is unavoidable for
case of multiple connections per session, but for the regular case of
one connection per session it must be a lot simpler.
And now think about iSER, which brings iSCSI on the whole new complexity
level ;)
Actually, the iSER wire protocol itself is quite simple,
because it builds on iSCSI and IPS fundamentals, and because traditional
iSCSI's recovery logic for CRC failures (and hence a lot of
acknowledgement sequence PDUs that go missing, etc.) and the RDMA Capable
Protocol (RCaP).

The logic that iSER collectively disables is known as within-connection
and within-command recovery (negotiated as ErrorRecoveryLevel=1 on the
wire); RFC-5046 requires that the iSCSI layer on which iSER is enabled
disable CRC32C checksums and any associated timeouts for ERL=1.

Also, have a look at Appendix A. in the iSER spec.

A.1. iWARP Message Format for iSER Hello Message ...............73
A.2. iWARP Message Format for iSER HelloReply Message ..........74
A.3. iWARP Message Format for SCSI Read Command PDU ............75
A.4. iWARP Message Format for SCSI Read Data ...................76
A.5. iWARP Message Format for SCSI Write Command PDU ...........77
A.6. iWARP Message Format for RDMA Read Request ................78
A.7. iWARP Message Format for Solicited SCSI Write Data ........79
A.8. iWARP Message Format for SCSI Response PDU ................80

This is about half as many as the traditional iSCSI PDUs that iSER
encapsulates.

--nab

Nicholas A. Bellinger
2008-02-06 00:51:28 UTC
Permalink
Post by Nicholas A. Bellinger
Post by Linus Torvalds
better. So for example, I personally suspect that ATA-over-ethernet is way
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
low-level, and against those crazy SCSI people to begin with.
Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.
AoE is truly a thing of beauty. It has a two/three page RFC (say no more!).
But quite so... AoE is limited to MTU size, which really hurts. Can't
really do tagged queueing, etc.
iSCSI is way, way too complicated.
I fully agree. From one side, all that complexity is unavoidable for
case of multiple connections per session, but for the regular case of
one connection per session it must be a lot simpler.
And now think about iSER, which brings iSCSI on the whole new complexity
level ;)
Actually, the iSER wire protocol itself is quite simple,
because it builds on iSCSI and IPS fundamentals, and because traditional
iSCSI's recovery logic for CRC failures (and hence a lot of
acknowledgement sequence PDUs that go missing, etc.) and the RDMA Capable
Protocol (RCaP).
this should be:

.. and instead the RDMA Capable Protocol (RCaP) provides the 32-bit or
greater data integrity.

--nab

FUJITA Tomonori
2008-02-06 01:29:31 UTC
Permalink
On Tue, 05 Feb 2008 18:09:15 +0100
On Tue, 05 Feb 2008 08:14:01 +0100
Post by James Bottomley
These are both features being independently worked on, are they not?
Even if they weren't, the combination of the size of SCST in kernel plus
the problem of having to find a migration path for the current STGT
users still looks to me to involve the greater amount of work.
I don't want to be mean, but does anyone actually use STGT in
production? Seriously?
In the latest development version of STGT, it's only possible to stop
the tgtd target daemon using the KILL / 9 signal - which also means all
iSCSI initiator connections are corrupted when the tgtd target daemon is
started again (kernel upgrade, target daemon upgrade, server reboot, etc.).
I don't know what "iSCSI initiator connections are corrupted"
mean. But if you reboot a server, how can an iSCSI target
implementation keep iSCSI tcp connections?
Imagine you have to reboot all your NFS clients when you reboot your NFS
server. Not only that - your data is probably corrupted, or at least the
filesystem deserves checking...
Don't know if it matters, but in my setup (iSCSI on top of drbd+heartbeat)
rebooting the primary server doesn't affect my iSCSI traffic; SCST correctly
manages a stop/crash by sending a unit attention to clients on reconnect.
Drbd+heartbeat correctly manages those things too.
Still, from an end-user POV, I was able to reboot/survive a crash only with
SCST; IETD still has reconnect problems and STGT is even worse.
Please tell us on the stgt-devel mailing list if you see problems. We will
try to fix them.

Thanks,
Nicholas A. Bellinger
2008-02-06 02:01:36 UTC
Permalink
Post by FUJITA Tomonori
On Tue, 05 Feb 2008 18:09:15 +0100
On Tue, 05 Feb 2008 08:14:01 +0100
Post by James Bottomley
These are both features being independently worked on, are they not?
Even if they weren't, the combination of the size of SCST in kernel plus
the problem of having to find a migration path for the current STGT
users still looks to me to involve the greater amount of work.
I don't want to be mean, but does anyone actually use STGT in
production? Seriously?
In the latest development version of STGT, it's only possible to stop
the tgtd target daemon using the KILL / 9 signal - which also means all
iSCSI initiator connections are corrupted when the tgtd target daemon is
started again (kernel upgrade, target daemon upgrade, server reboot, etc.).
I don't know what "iSCSI initiator connections are corrupted"
mean. But if you reboot a server, how can an iSCSI target
implementation keep iSCSI tcp connections?
Imagine you have to reboot all your NFS clients when you reboot your NFS
server. Not only that - your data is probably corrupted, or at least the
filesystem deserves checking...
The TCP connection will drop; remember that the TCP connection state for
one side has completely vanished. Depending on the iSCSI/iSER
ErrorRecoveryLevel that is set, this will mean one of the following (a
rough sketch follows the two cases below):

1) Session Recovery, ERL=0 - Restarting the entire nexus and all
connections across all of the possible subnets or comm-links. All
outstanding commands not yet acknowledged by StatSN will be returned to
the SCSI subsystem with RETRY status. Once a single connection has been
reestablished to restart the nexus, the CDBs will be resent.

2) Connection Recovery, ERL=2 - CDBs from the failed connection(s) will
be retried (nothing changes in the PDU) to fill the iSCSI CmdSN ordering
gap, or be explicitly retried with TMR TASK_REASSIGN for ones already
acknowledged by the ExpCmdSN that are returned to the initiator in
response packets or by way of unsolicited NopINs.
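
[Editorial note: a purely illustrative sketch (not LIO, SCST or
open-iscsi code) of how a target might dispatch on the negotiated
ErrorRecoveryLevel when a connection fails; every identifier here is
hypothetical.]

enum example_erl {
	EXAMPLE_ERL_SESSION_RECOVERY	= 0,	/* ERL=0 */
	EXAMPLE_ERL_DIGEST_RECOVERY	= 1,	/* ERL=1 */
	EXAMPLE_ERL_CONNECTION_RECOVERY	= 2,	/* ERL=2 */
};

struct example_conn;

struct example_session {
	enum example_erl erl;
};

/* Hypothetical helpers standing in for the real recovery paths. */
static void example_restart_nexus(struct example_session *sess);
static void example_reassign_tasks(struct example_conn *conn);

static void example_handle_conn_failure(struct example_session *sess,
					struct example_conn *conn)
{
	switch (sess->erl) {
	case EXAMPLE_ERL_CONNECTION_RECOVERY:
		/* ERL=2: keep the nexus alive; outstanding CDBs are
		 * retried unchanged or reassigned via TMR TASK_REASSIGN
		 * once a connection is reestablished. */
		example_reassign_tasks(conn);
		break;
	case EXAMPLE_ERL_SESSION_RECOVERY:
	default:
		/* ERL=0: tear the whole nexus down; unacknowledged
		 * commands go back to the SCSI layer for retry after a
		 * new session login. */
		example_restart_nexus(sess);
		break;
	}
}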
Post by FUJITA Tomonori
Don't know if it matters, but in my setup (iSCSI on top of drbd+heartbeat)
rebooting the primary server doesn't affect my iSCSI traffic; SCST correctly
manages a stop/crash by sending a unit attention to clients on reconnect.
Drbd+heartbeat correctly manages those things too.
Still, from an end-user POV, I was able to reboot/survive a crash only with
SCST; IETD still has reconnect problems and STGT is even worse.
Please tell us on the stgt-devel mailing list if you see problems. We will
try to fix them.
FYI, the LIO code also supports rmmod'ing iscsi_target_mod while at full
10 Gb/sec speed. I think it should be a requirement to be able to
control per initiator, per portal group, per LUN, per device, and per HBA
in the design without restarting any other objects.

--nab
Post by FUJITA Tomonori
Thanks,
Luben Tuikov
2008-02-09 07:44:10 UTC
Permalink
Post by Luben Tuikov
Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?
Luben
Hi Luben,
I am guessing you mean further down the stack, which I don't know to
be the case. Going further up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.
http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf
Yes, that's what I meant.

Thanks!

Luben