[CFT][PATCH] 2.5.47 Athlon/Druon, much faster copy

You don't seem to save/restore the FPU state, so it will be likely
corrupted after your copy runs.

Also I'm pretty sure that using movntq (= forcing destination out of
cache) is not a good strategy for generic copy_from_user(). It may
be a win for the copies in write ( user space -> page cache ), but
will hurt for all the ioctls and other things that actually need the
data in cache afterwards. I am afraid it is not enough to do micro benchmarks
here.

-Andi

Akira Tsukamoto

2002-11-16 18:22:51 UTC

On Sat, 16 Nov 2002 11:56:52 +0100

Post by Andi Kleen
You don't seem to save/restore the FPU state, so it will be likely
corrupted after your copy runs.

This is the main question for me that I was wondering for all week.
My first version was using fsave and frstore, so
just changing three lines will accomplish this.
Is it all I need? Any thing elase needed to consider using fpu register?

Post by Andi Kleen
Also I'm pretty sure that using movntq (= forcing destination out of
cache) is not a good strategy for generic copy_from_user(). It may
be a win for the copies in write ( user space -> page cache ),

Yes, that why I included postfetch in the code because movntq does not leave
them in the L2 cache.
Anygood idea to

Post by Andi Kleen
but
will hurt for all the ioctls and other things that actually need the
data in cache afterwards. I am afraid it is not enough to do micro benchmarks
here.

check above?

Post by Andi Kleen
-Andi

Andi Kleen

2002-11-16 18:30:03 UTC

Post by Akira Tsukamoto
On Sat, 16 Nov 2002 11:56:52 +0100

Post by Andi Kleen
You don't seem to save/restore the FPU state, so it will be likely
corrupted after your copy runs.

You are currently corrupting the user's FPU state.

The proper way to save it is to use kernel_fpu_begin()

Yes, that why I included postfetch in the code because movntq does not leave
them in the L2 cache.

That looks rather wasteful - first force it out and then trying to get it in
again. I have my doubts on it being a good strategy for speed.

Post by Andi Kleen
but
will hurt for all the ioctls and other things that actually need the
data in cache afterwards. I am afraid it is not enough to do micro benchmarks
here.

check above?

Use special function calls for them, don't put it into generic
copy_*_user

Also you should really check for small copy and not use the FPU based
copy for them. Best is probably to use a relatively simply copy_*_user
(no FPU tricks, just an unrolled integer core) and change the VFS
and the file systems to call a special function from write(), but only
when the write is big.

-Andi

Akira Tsukamoto

2002-11-16 18:50:17 UTC

Hi,

On Sat, 16 Nov 2002 19:30:03 +0100

Post by Andi Kleen
The proper way to save it is to use kernel_fpu_begin()

Thanks! I will look into it. This is what I was looking for.

I been running this kernel with my copy for three days, and never had
oops, but I was really worried.

Yes, that why I included postfetch in the code because movntq does not leave
them in the L2 cache.

That looks rather wasteful - first force it out and then trying to get it in
again. I have my doubts on it being a good strategy for speed.

It tried both, use just normal mov or movq <-> use movntq + postfetch, and the later
was much much faster, because postfetch needs to read only every 64 bytes.

I will ckeck kernel_fpu_begin() fisrt and if using fpu register is too much
overhead than I will remove them.

Akira

Hirokazu Takahashi

2002-11-16 22:23:37 UTC

Hello,

Post by Andi Kleen
The proper way to save it is to use kernel_fpu_begin()

Thanks! I will look into it. This is what I was looking for.
I been running this kernel with my copy for three days, and never had
oops, but I was really worried.

kernel_fpu_begin() is not enough for this purpose as user_*_copy may
cause pagefault and sleep in it. You should know lazy FPU swtich mechanism.
The mechanism will be confued by your your copy routine and may save
broken FPU context or another process may use broken FPU registers.

You should also care about do_page_fault like this:

do_page_fault()
{
if (fault happens in user_*_copy) {
save FPU context on stack;
recover FPU context to the previous state;
}
.........
.........
.........
if (fault happens in user_*_copy) {
recover FPU context from stack;
}
}

Yes, that why I included postfetch in the code because movntq does not leave
them in the L2 cache.

That looks rather wasteful - first force it out and then trying to get it in
again. I have my doubts on it being a good strategy for speed.

It tried both, use just normal mov or movq <-> use movntq + postfetch, and the later
was much much faster, because postfetch needs to read only every 64 bytes.
I will ckeck kernel_fpu_begin() fisrt and if using fpu register is too much
overhead than I will remove them.

I guess kernel_fpu_begin() might not be heavy as many processes don't use
fpu registers so much.

Thank you,
Hirokazu Takahashi.

Akira Tsukamoto

2002-11-16 21:55:31 UTC

On Sat, 16 Nov 2002 19:30:03 +0100

Post by Akira Tsukamoto
This is the main question for me that I was wondering for all week.
My first version was using fsave and frstore, so
just changing three lines will accomplish this.
Is it all I need? Any thing elase needed to consider using fpu register?

You are currently corrupting the user's FPU state.

fsave and frstor should solve this problem, doesn't it?

Post by Andi Kleen
The proper way to save it is to use kernel_fpu_begin()

I looked into it. kernel_fpu_begin/end are basically doing:
1)preempt enable/disable
2)fsave and frstor
It does not look a lot of overhead.

So what is missing in my patch is:
1)Surround with kernel_fpu_begin/end.
2)Change the threshold of the size from 256 to somewhere around 512.
I removed the fsave/frstor, which was in my first version, to lower the
threshold because they had some overhead and if the copying size
was smaller than 512, the org_copy became faster.
I just need to reverse it.

Please let me know if anything esle is missing.