Discussion:
[RFC PATCH] crypto: crc32c-pclmul - Use pmovzxdq to shrink K_table
George Spelvin
2014-05-28 14:40:00 UTC
While following a number of tangents in the code (I was figuring out
how to edit lib/Kconfig; don't ask), I came across a table of 256 64-bit
words, all of which had the high half set to zero.

Since the code depends on both pclmulq and crc32, SSE 4.1 is obviously
present, so it could use pmovzxdq and save 1K of kernel data.

The following patch obviously lacks the kludges for old binutils,
but should convey the general idea.

Jan: Is support for SLE10's pre-2.18 binutils still required?
Your PEXTRD fix was only a year ago, so I expect so, but I wanted to ask.

Two other minor additional changes:

1. The current code unnecessarily puts the table in the read-write
.data section. Moved to .text.
2. I'm also not sure why it's necessary to force such large alignment
on K_table. Comments on reducing it?

Signed-off-by: George Spelvin <***@horizon.com>


diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..9f885ee4 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -216,15 +216,11 @@ LABEL crc_ %i
## 4) Combine three results:
################################################################

- lea (K_table-16)(%rip), bufp # first entry is for idx 1
+ lea (K_table-8)(%rip), bufp # first entry is for idx 1
shlq $3, %rax # rax *= 8
- subq %rax, tmp # tmp -= rax*8
- shlq $1, %rax
- subq %rax, tmp # tmp -= rax*16
- # (total tmp -= rax*24)
- addq %rax, bufp
-
- movdqa (bufp), %xmm0 # 2 consts: K1:K2
+ pmovzxdq (bufp,%rax), %xmm0 # 2 consts: K1:K2
+ leal (%eax,%eax,2), %eax # rax *= 3 (total *24)
+ subq %rax, tmp # tmp -= rax*24

movq crc_init, %xmm1 # CRC for block 1
PCLMULQDQ 0x00,%xmm0,%xmm1 # Multiply by K2
@@ -331,136 +327,135 @@ ENDPROC(crc_pcl)

################################################################
## PCLMULQDQ tables
- ## Table is 128 entries x 2 quad words each
+ ## Table is 128 entries x 2 words (8 bytes) each
################################################################
-.data
-.align 64
+.align 8
K_table:
- .quad 0x14cd00bd6,0x105ec76f0
+ .long 0x14cd00bd6,0x105ec76f0
- .quad 0x0ba4fc28e,0x14cd00bd6
+ .long 0x0ba4fc28e,0x14cd00bd6
- .quad 0x1d82c63da,0x0f20c0dfe
+ .long 0x1d82c63da,0x0f20c0dfe
- .quad 0x09e4addf8,0x0ba4fc28e
+ .long 0x09e4addf8,0x0ba4fc28e
- .quad 0x039d3b296,0x1384aa63a
+ .long 0x039d3b296,0x1384aa63a
- .quad 0x102f9b8a2,0x1d82c63da
+ .long 0x102f9b8a2,0x1d82c63da
- .quad 0x14237f5e6,0x01c291d04
+ .long 0x14237f5e6,0x01c291d04
- .quad 0x00d3b6092,0x09e4addf8
+ .long 0x00d3b6092,0x09e4addf8

(Remaining boring bits of this hunk elided.)
George Spelvin
2014-05-28 15:32:49 UTC
Um, yeah, I just noticed the problem with that patch: half of the numbers
in that table are 33 bits, and cause a pile of warnings (not errors,
unfortunately!) from gas that scrolled by when I wasn't looking.

Logically, there should be no need for 33-bit values; they should all be
reducible modulo the polynomial. But that is going to take a slightly
larger change.
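
As a rough illustration of that reduction, here is a minimal C sketch. The helper name `reduce33` is hypothetical, and the constant 0x105EC76F1 (the bit-reflected CRC32C polynomial written out as 33 bits) is an inference from comparing the old and new table values posted in this thread, not something quoted from the kernel source. Under that assumption, a single conditional XOR reproduces the 32-bit entries of the revised table.

```c
#include <stdint.h>

/* Bit-reflected CRC32C polynomial written out as 33 bits.
 * Assumption: inferred from the table values in this thread. */
#define CRC32C_POLY33 0x105EC76F1ULL

/* Hypothetical helper: fold a 33-bit table constant down to 32 bits. */
static uint32_t reduce33(uint64_t k)
{
    if (k >> 32)                /* bit 32 set: XOR in the polynomial */
        k ^= CRC32C_POLY33;
    return (uint32_t)k;         /* values below 2^32 pass through unchanged */
}
```

For example, the old first entry 0x14cd00bd6,0x105ec76f0 maps to 0x493c7d27,0x00000001 in the revised table, while 0x0ba4fc28e is already 32 bits and stays as it is.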
George Spelvin
2014-05-28 22:15:05 UTC
crypto: crc32c-pclmul - Shrink K_table to 32-bit words

There's no need for the K_table to be made of 64-bit words. For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit numbers in there. They
can all be reduced to 32 bits.

Doing that cuts the table size in half. Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.
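
For readers unfamiliar with pmovzxdq, the following is a small software model in C of what the load does: it reads two adjacent 32-bit values and zero-extends each into its own 64-bit lane. The `xmm_model` struct and `pmovzxdq_model` function are illustrative names only; this is a sketch of the instruction's effect, not code from the patch.

```c
#include <stdint.h>
#include <string.h>

/* Software stand-in for the two 64-bit lanes of an XMM register. */
struct xmm_model {
    uint64_t lo;
    uint64_t hi;
};

/* Model of "pmovzxdq mem, xmm": load 8 bytes as two 32-bit values and
 * zero-extend each one into a 64-bit lane. */
static struct xmm_model pmovzxdq_model(const void *mem)
{
    uint32_t d[2];
    struct xmm_model r;

    memcpy(d, mem, sizeof d);   /* 8-byte load with no alignment requirement */
    r.lo = d[0];                /* high 32 bits of each lane become zero */
    r.hi = d[1];
    return r;
}
```

With 8-byte entries, the pair for index n sits at byte offset (n-1)*8 from K_table, which is what the `K_table-8` base plus `rax*8` in the patch computes.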

Two other related fixes:
* K_table is read-only, so belongs in .text, not .data, and
* There's no need for more than 8-byte alignment

Signed-off-by: George Spelvin <***@horizon.com>
---
Fixed properly and tested with an exhaustive user-space test harness.

I filled a 4K byte buffer with pseudorandom bytes and computed CRCs
from i to j and from j to k for all 0 <= i < j < k < 4096, comparing
both the intermediate and final results against a basic bit-at-a-time
software algorithm.
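
A bit-at-a-time reference of the kind mentioned above could look roughly like the sketch below. It is not the harness actually used for this patch; the function name is made up, and the assumption is that the raw CRC state is passed in and out without any initial or final inversion, leaving that to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw CRC32C update, one bit at a time, using the reflected
 * polynomial 0x82F63B78. No inversion is applied here. */
static uint32_t crc32c_bitwise(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc;
}
```

Because the state is simply carried forward, computing the CRC from i to j and then continuing from j to k gives the same result as one pass from i to k, which is the property the exhaustive comparison relies on.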

There's still room for improvement. Some additional areas that
could use tweaking:
- If the SMALL_SIZE is set right, that should also be the size where
we fall out of the 3-part algorithm. As it is, a buffer of size
3096 will do a 3072-byte chunk and then do 3 8-byte CRCs and
mess around a lot combining them.
- Does it really warrant all the unrolling? Surely any processor
new enough to have a fully pipelined crc32 instruction can
also handle some loop overhead instructions as well?
- Some reassignment of the registers would put 32-bit variables
(like crc_init_dw) in low registers so that they can be addressed
without REX prefixes and shrink the code. But 64-bit pointers like
block_0 and block_1 are only ever used with 64-bit operands and thus
REX prefixes.

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..dcc50752 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -216,15 +216,11 @@ LABEL crc_ %i
## 4) Combine three results:
################################################################

- lea (K_table-16)(%rip), bufp # first entry is for idx 1
+ lea (K_table-8)(%rip), bufp # first entry is for idx 1
shlq $3, %rax # rax *= 8
- subq %rax, tmp # tmp -= rax*8
- shlq $1, %rax
- subq %rax, tmp # tmp -= rax*16
- # (total tmp -= rax*24)
- addq %rax, bufp
-
- movdqa (bufp), %xmm0 # 2 consts: K1:K2
+ pmovzxdq (bufp,%rax), %xmm0 # 2 consts: K1:K2
+ leal (%eax,%eax,2), %eax # rax *= 3 (total *24)
+ subq %rax, tmp # tmp -= rax*24

movq crc_init, %xmm1 # CRC for block 1
PCLMULQDQ 0x00,%xmm0,%xmm1 # Multiply by K2
@@ -238,9 +234,9 @@ LABEL crc_ %i
mov crc2, crc_init
crc32 %rax, crc_init

-################################################################
-## 5) Check for end:
-################################################################
+ ################################################################
+ ## 5) Check for end:
+ ################################################################

LABEL crc_ 0
mov tmp, len
@@ -331,136 +327,135 @@ ENDPROC(crc_pcl)

################################################################
## PCLMULQDQ tables
- ## Table is 128 entries x 2 quad words each
+ ## Table is 128 entries x 2 words (8 bytes) each
################################################################
-.data
-.align 64
+.align 8
K_table:
- .quad 0x14cd00bd6,0x105ec76f0
- .quad 0x0ba4fc28e,0x14cd00bd6
- .quad 0x1d82c63da,0x0f20c0dfe
- .quad 0x09e4addf8,0x0ba4fc28e
- .quad 0x039d3b296,0x1384aa63a
- .quad 0x102f9b8a2,0x1d82c63da
- .quad 0x14237f5e6,0x01c291d04
- .quad 0x00d3b6092,0x09e4addf8
- .quad 0x0c96cfdc0,0x0740eef02
- .quad 0x18266e456,0x039d3b296
- .quad 0x0daece73e,0x0083a6eec
- .quad 0x0ab7aff2a,0x102f9b8a2
- .quad 0x1248ea574,0x1c1733996
- .quad 0x083348832,0x14237f5e6
- .quad 0x12c743124,0x02ad91c30
- .quad 0x0b9e02b86,0x00d3b6092
- .quad 0x018b33a4e,0x06992cea2
- .quad 0x1b331e26a,0x0c96cfdc0
- .quad 0x17d35ba46,0x07e908048
- .quad 0x1bf2e8b8a,0x18266e456
- .quad 0x1a3e0968a,0x11ed1f9d8
- .quad 0x0ce7f39f4,0x0daece73e
- .quad 0x061d82e56,0x0f1d0f55e
- .quad 0x0d270f1a2,0x0ab7aff2a
- .quad 0x1c3f5f66c,0x0a87ab8a8
- .quad 0x12ed0daac,0x1248ea574
- .quad 0x065863b64,0x08462d800
- .quad 0x11eef4f8e,0x083348832
- .quad 0x1ee54f54c,0x071d111a8
- .quad 0x0b3e32c28,0x12c743124
- .quad 0x0064f7f26,0x0ffd852c6
- .quad 0x0dd7e3b0c,0x0b9e02b86
- .quad 0x0f285651c,0x0dcb17aa4
- .quad 0x010746f3c,0x018b33a4e
- .quad 0x1c24afea4,0x0f37c5aee
- .quad 0x0271d9844,0x1b331e26a
- .quad 0x08e766a0c,0x06051d5a2
- .quad 0x093a5f730,0x17d35ba46
- .quad 0x06cb08e5c,0x11d5ca20e
- .quad 0x06b749fb2,0x1bf2e8b8a
- .quad 0x1167f94f2,0x021f3d99c
- .quad 0x0cec3662e,0x1a3e0968a
- .quad 0x19329634a,0x08f158014
- .quad 0x0e6fc4e6a,0x0ce7f39f4
- .quad 0x08227bb8a,0x1a5e82106
- .quad 0x0b0cd4768,0x061d82e56
- .quad 0x13c2b89c4,0x188815ab2
- .quad 0x0d7a4825c,0x0d270f1a2
- .quad 0x10f5ff2ba,0x105405f3e
- .quad 0x00167d312,0x1c3f5f66c
- .quad 0x0f6076544,0x0e9adf796
- .quad 0x026f6a60a,0x12ed0daac
- .quad 0x1a2adb74e,0x096638b34
- .quad 0x19d34af3a,0x065863b64
- .quad 0x049c3cc9c,0x1e50585a0
- .quad 0x068bce87a,0x11eef4f8e
- .quad 0x1524fa6c6,0x19f1c69dc
- .quad 0x16cba8aca,0x1ee54f54c
- .quad 0x042d98888,0x12913343e
- .quad 0x1329d9f7e,0x0b3e32c28
- .quad 0x1b1c69528,0x088f25a3a
- .quad 0x02178513a,0x0064f7f26
- .quad 0x0e0ac139e,0x04e36f0b0
- .quad 0x0170076fa,0x0dd7e3b0c
- .quad 0x141a1a2e2,0x0bd6f81f8
- .quad 0x16ad828b4,0x0f285651c
- .quad 0x041d17b64,0x19425cbba
- .quad 0x1fae1cc66,0x010746f3c
- .quad 0x1a75b4b00,0x18db37e8a
- .quad 0x0f872e54c,0x1c24afea4
- .quad 0x01e41e9fc,0x04c144932
- .quad 0x086d8e4d2,0x0271d9844
- .quad 0x160f7af7a,0x052148f02
- .quad 0x05bb8f1bc,0x08e766a0c
- .quad 0x0a90fd27a,0x0a3c6f37a
- .quad 0x0b3af077a,0x093a5f730
- .quad 0x04984d782,0x1d22c238e
- .quad 0x0ca6ef3ac,0x06cb08e5c
- .quad 0x0234e0b26,0x063ded06a
- .quad 0x1d88abd4a,0x06b749fb2
- .quad 0x04597456a,0x04d56973c
- .quad 0x0e9e28eb4,0x1167f94f2
- .quad 0x07b3ff57a,0x19385bf2e
- .quad 0x0c9c8b782,0x0cec3662e
- .quad 0x13a9cba9e,0x0e417f38a
- .quad 0x093e106a4,0x19329634a
- .quad 0x167001a9c,0x14e727980
- .quad 0x1ddffc5d4,0x0e6fc4e6a
- .quad 0x00df04680,0x0d104b8fc
- .quad 0x02342001e,0x08227bb8a
- .quad 0x00a2a8d7e,0x05b397730
- .quad 0x168763fa6,0x0b0cd4768
- .quad 0x1ed5a407a,0x0e78eb416
- .quad 0x0d2c3ed1a,0x13c2b89c4
- .quad 0x0995a5724,0x1641378f0
- .quad 0x19b1afbc4,0x0d7a4825c
- .quad 0x109ffedc0,0x08d96551c
- .quad 0x0f2271e60,0x10f5ff2ba
- .quad 0x00b0bf8ca,0x00bf80dd2
- .quad 0x123888b7a,0x00167d312
- .quad 0x1e888f7dc,0x18dcddd1c
- .quad 0x002ee03b2,0x0f6076544
- .quad 0x183e8d8fe,0x06a45d2b2
- .quad 0x133d7a042,0x026f6a60a
- .quad 0x116b0f50c,0x1dd3e10e8
- .quad 0x05fabe670,0x1a2adb74e
- .quad 0x130004488,0x0de87806c
- .quad 0x000bcf5f6,0x19d34af3a
- .quad 0x18f0c7078,0x014338754
- .quad 0x017f27698,0x049c3cc9c
- .quad 0x058ca5f00,0x15e3e77ee
- .quad 0x1af900c24,0x068bce87a
- .quad 0x0b5cfca28,0x0dd07448e
- .quad 0x0ded288f8,0x1524fa6c6
- .quad 0x059f229bc,0x1d8048348
- .quad 0x06d390dec,0x16cba8aca
- .quad 0x037170390,0x0a3e3e02c
- .quad 0x06353c1cc,0x042d98888
- .quad 0x0c4584f5c,0x0d73c7bea
- .quad 0x1f16a3418,0x1329d9f7e
- .quad 0x0531377e2,0x185137662
- .quad 0x1d8d9ca7c,0x1b1c69528
- .quad 0x0b25b29f2,0x18a08b5bc
- .quad 0x19fb2a8b0,0x02178513a
- .quad 0x1a08fe6ac,0x1da758ae0
- .quad 0x045cddf4e,0x0e0ac139e
- .quad 0x1a91647f2,0x169cf9eb0
- .quad 0x1a0f717c4,0x0170076fa
+ .long 0x493c7d27, 0x00000001
+ .long 0xba4fc28e, 0x493c7d27
+ .long 0xddc0152b, 0xf20c0dfe
+ .long 0x9e4addf8, 0xba4fc28e
+ .long 0x39d3b296, 0x3da6d0cb
+ .long 0x0715ce53, 0xddc0152b
+ .long 0x47db8317, 0x1c291d04
+ .long 0x0d3b6092, 0x9e4addf8
+ .long 0xc96cfdc0, 0x740eef02
+ .long 0x878a92a7, 0x39d3b296
+ .long 0xdaece73e, 0x083a6eec
+ .long 0xab7aff2a, 0x0715ce53
+ .long 0x2162d385, 0xc49f4f67
+ .long 0x83348832, 0x47db8317
+ .long 0x299847d5, 0x2ad91c30
+ .long 0xb9e02b86, 0x0d3b6092
+ .long 0x18b33a4e, 0x6992cea2
+ .long 0xb6dd949b, 0xc96cfdc0
+ .long 0x78d9ccb7, 0x7e908048
+ .long 0xbac2fd7b, 0x878a92a7
+ .long 0xa60ce07b, 0x1b3d8f29
+ .long 0xce7f39f4, 0xdaece73e
+ .long 0x61d82e56, 0xf1d0f55e
+ .long 0xd270f1a2, 0xab7aff2a
+ .long 0xc619809d, 0xa87ab8a8
+ .long 0x2b3cac5d, 0x2162d385
+ .long 0x65863b64, 0x8462d800
+ .long 0x1b03397f, 0x83348832
+ .long 0xebb883bd, 0x71d111a8
+ .long 0xb3e32c28, 0x299847d5
+ .long 0x064f7f26, 0xffd852c6
+ .long 0xdd7e3b0c, 0xb9e02b86
+ .long 0xf285651c, 0xdcb17aa4
+ .long 0x10746f3c, 0x18b33a4e
+ .long 0xc7a68855, 0xf37c5aee
+ .long 0x271d9844, 0xb6dd949b
+ .long 0x8e766a0c, 0x6051d5a2
+ .long 0x93a5f730, 0x78d9ccb7
+ .long 0x6cb08e5c, 0x18b0d4ff
+ .long 0x6b749fb2, 0xbac2fd7b
+ .long 0x1393e203, 0x21f3d99c
+ .long 0xcec3662e, 0xa60ce07b
+ .long 0x96c515bb, 0x8f158014
+ .long 0xe6fc4e6a, 0xce7f39f4
+ .long 0x8227bb8a, 0xa00457f7
+ .long 0xb0cd4768, 0x61d82e56
+ .long 0x39c7ff35, 0x8d6d2c43
+ .long 0xd7a4825c, 0xd270f1a2
+ .long 0x0ab3844b, 0x00ac29cf
+ .long 0x0167d312, 0xc619809d
+ .long 0xf6076544, 0xe9adf796
+ .long 0x26f6a60a, 0x2b3cac5d
+ .long 0xa741c1bf, 0x96638b34
+ .long 0x98d8d9cb, 0x65863b64
+ .long 0x49c3cc9c, 0xe0e9f351
+ .long 0x68bce87a, 0x1b03397f
+ .long 0x57a3d037, 0x9af01f2d
+ .long 0x6956fc3b, 0xebb883bd
+ .long 0x42d98888, 0x2cff42cf
+ .long 0x3771e98f, 0xb3e32c28
+ .long 0xb42ae3d9, 0x88f25a3a
+ .long 0x2178513a, 0x064f7f26
+ .long 0xe0ac139e, 0x4e36f0b0
+ .long 0x170076fa, 0xdd7e3b0c
+ .long 0x444dd413, 0xbd6f81f8
+ .long 0x6f345e45, 0xf285651c
+ .long 0x41d17b64, 0x91c9bd4b
+ .long 0xff0dba97, 0x10746f3c
+ .long 0xa2b73df1, 0x885f087b
+ .long 0xf872e54c, 0xc7a68855
+ .long 0x1e41e9fc, 0x4c144932
+ .long 0x86d8e4d2, 0x271d9844
+ .long 0x651bd98b, 0x52148f02
+ .long 0x5bb8f1bc, 0x8e766a0c
+ .long 0xa90fd27a, 0xa3c6f37a
+ .long 0xb3af077a, 0x93a5f730
+ .long 0x4984d782, 0xd7c0557f
+ .long 0xca6ef3ac, 0x6cb08e5c
+ .long 0x234e0b26, 0x63ded06a
+ .long 0xdd66cbbb, 0x6b749fb2
+ .long 0x4597456a, 0x4d56973c
+ .long 0xe9e28eb4, 0x1393e203
+ .long 0x7b3ff57a, 0x9669c9df
+ .long 0xc9c8b782, 0xcec3662e
+ .long 0x3f70cc6f, 0xe417f38a
+ .long 0x93e106a4, 0x96c515bb
+ .long 0x62ec6c6d, 0x4b9e0f71
+ .long 0xd813b325, 0xe6fc4e6a
+ .long 0x0df04680, 0xd104b8fc
+ .long 0x2342001e, 0x8227bb8a
+ .long 0x0a2a8d7e, 0x5b397730
+ .long 0x6d9a4957, 0xb0cd4768
+ .long 0xe8b6368b, 0xe78eb416
+ .long 0xd2c3ed1a, 0x39c7ff35
+ .long 0x995a5724, 0x61ff0e01
+ .long 0x9ef68d35, 0xd7a4825c
+ .long 0x0c139b31, 0x8d96551c
+ .long 0xf2271e60, 0x0ab3844b
+ .long 0x0b0bf8ca, 0x0bf80dd2
+ .long 0x2664fd8b, 0x0167d312
+ .long 0xed64812d, 0x8821abed
+ .long 0x02ee03b2, 0xf6076544
+ .long 0x8604ae0f, 0x6a45d2b2
+ .long 0x363bd6b3, 0x26f6a60a
+ .long 0x135c83fd, 0xd8d26619
+ .long 0x5fabe670, 0xa741c1bf
+ .long 0x35ec3279, 0xde87806c
+ .long 0x00bcf5f6, 0x98d8d9cb
+ .long 0x8ae00689, 0x14338754
+ .long 0x17f27698, 0x49c3cc9c
+ .long 0x58ca5f00, 0x5bd2011f
+ .long 0xaa7c7ad5, 0x68bce87a
+ .long 0xb5cfca28, 0xdd07448e
+ .long 0xded288f8, 0x57a3d037
+ .long 0x59f229bc, 0xdde8f5b9
+ .long 0x6d390dec, 0x6956fc3b
+ .long 0x37170390, 0xa3e3e02c
+ .long 0x6353c1cc, 0x42d98888
+ .long 0xc4584f5c, 0xd73c7bea
+ .long 0xf48642e9, 0x3771e98f
+ .long 0x531377e2, 0x80ff0093
+ .long 0xdd35bc8d, 0xb42ae3d9
+ .long 0xb25b29f2, 0x8fe4c34d
+ .long 0x9a5ede41, 0x2178513a
+ .long 0xa563905d, 0xdf99fc11
+ .long 0x45cddf4e, 0xe0ac139e
+ .long 0xacfa3103, 0x6c23e841
+ .long 0xa51b6135, 0x170076fa
Tim Chen
2014-05-28 23:02:15 UTC
Post by George Spelvin
crypto: crc32c-pclmul - Shrink K_table to 32-bit words
There's no need for the K_table to be made of 64-bit words. For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit number in there. They
can all be reduced to 32 bits.
Doing that cuts the table size in half. Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.
* K_table is read-only, so belongs in .text, not .data, and
* There's no need for more than 8-byte alignment
George,

Can you do a tcrypt speed measurement with and without your changes?
Check to see if there's any slowdown. Please make sure you pin
the frequency of your cpu when running the test.

e.g.
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Thanks.

Tim
George Spelvin
2014-05-28 23:55:16 UTC
Post by Tim Chen
Can you do a tcrypt speed measurement with and without your changes?
Check to see if there's any slowdown. Please make sure you pin
the frequency of your cpu when running the test.
Sure thing; I was already inspired to do that based on your concerns.
Do you have any particular buffer sizes or alignments you'd suggest?

Since I'm changing only the three-part core, I was going to
avoid unaligned or short buffers, stick with a single buffer so
it stays in L1 D-cache, but vary the length so we use lots of
the K_table.

It's not the RAM I was worried about, but the D-cache wasted on
the K table. That doesn't affect the CRC code itself, but the
surrounding kernel code.


I'm also thinking of some ideas for handling even larger buffer sizes
without having to interrupt the 3-way main loop. Pclmulqdq can
multiply up to 4 32-bit values to produce a 128-bit result, which
crc32 can efficiently reduce. So if we have three tables, of
x^(64*n) x^(4096*n), and x^(262144*n), each for n=0..63, we can
multiply them all together to handle up to a 16 MiB chunk.
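
To make the multiplication step concrete, here is a plain C model of a carry-less (polynomial over GF(2)) multiply of two 32-bit values. It only illustrates the arithmetic that pclmulqdq performs in hardware on 64-bit operands; the function name `clmul32` is hypothetical and nothing here is taken from the kernel code.

```c
#include <stdint.h>

/* Carry-less multiply: treat a and b as polynomials over GF(2) and
 * return their product. Partial products are combined with XOR
 * instead of addition, so no carries propagate. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;

    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u)
            r ^= (uint64_t)a << i;
    }
    return r;
}
```

Multiplying powers of x simply adds exponents, e.g. `clmul32(1u << 5, 1u << 7)` is `1 << 12`, which is why the product of several table entries of the form x^(k*n) yields x raised to the sum of the exponents before the final reduction.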

The other option is to schedule the pclmulqdq in parallel with
the crc32q iterations and, after arranging a staggered start,
have a 4-part main loop, where 3 parts are performing crc32q
iterations and the fourth is using SSE to shift itself
forward (at which point it gets XORed into the data stream
that one other part is working on).

I haven't got all the details of that idea worked out in my head, but
it seems possible. I have to study the optimization guide in detail to
see how many micro-ops the crc32q instruction from memory is (and thus
how much of the decoder it requires).

As of Nehalem, a small inner loop that fits in the decoded uop cache
has the potential to be faster than a hugely unrolled one.
George Spelvin
2014-05-29 03:26:55 UTC
Post by Tim Chen
Can you do a tcrypt speed measurement with and without your changes?
Check to see if there's any slowdown. Please make sure you pin
the frequency of your cpu when running the test.
e.g.
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
I just now re-read your e-mail and noticed you suggested a specific tool.
Oops, I haven't run that yet. I just made up my own in user space.
As I mentioned, since the changes are to the main loop that operates on
aligned buffers in multiples of 24 bytes, I focused my benchmarking there:

#define BUFFER 6114
static unsigned char buf[BUFFER] __attribute__ ((aligned(8)));
#define ITER 24 /* Number of test iterations */

uint32_t
do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
{
    int i, j;

    for (i = 0; i < BUFFER; i += 8)
        for (j = i+24; j <= BUFFER; j += 24)
            crc = f(buf+i, j-i, crc);
    return crc;
}

uint32_t
time_test(uint64_t *time, uint32_t crc,
          uint32_t (*f)(void const *, unsigned, uint32_t))
{
    uint64_t start = rdtsc();

    crc = do_test(crc, f);
    *time = rdtsc() - start;
    return crc;
}

The actual test goes in ABBA order to reduce bias:

for (i = 0; i < ITER; i += 2) {
    crc1 = time_test(times[i]+0, crc1, crc_pcl_1);
    crc2 = time_test(times[i]+1, crc2, crc_pcl_2);
    crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2);
    crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1);
}

crc_pcl_1 is the old code, crc_pcl_2 is my revised version.


The results are as follows (the last line is a total):

Old code New code
0: 85009953 71812457 (-13197496)
1: 57408829 63361572 (+5952743)
2: 52552399 49195266 (-3357133)
3: 43595130 45988364 (+2393234)
4: 41541760 39714198 (-1827562)
5: 36576082 38021344 (+1445262)
6: 35307854 34150656 (-1157198)
7: 32182230 33134236 (+952006)
8: 31341596 31307004 (-34592)
9: 31340900 31329408 (-11492)
10: 31344884 31329144 (-15740)
11: 31334144 31312492 (-21652)
12: 31338992 31330356 (-8636)
13: 31343744 31311344 (-32400)
14: 31339000 31340196 (+1196)
15: 31337492 31313988 (-23504)
16: 31341688 31334040 (-7648)
17: 31341804 31308936 (-32868)
18: 31339936 31332020 (-7916)
19: 31323228 31324240 (+1012)
20: 31339744 31331768 (-7976)
21: 31321536 31332688 (+11152)
22: 31340280 31335212 (-5068)
23: 31332056 31335768 (+3712)
24: 885575261 876586697 (-8988564)

I swapped the link order of the two .o files in case cache
placement made a difference:

0: 84305981 71483150 (-12822831)
1: 57341376 63129024 (+5787648)
2: 52361618 49240069 (-3121549)
3: 43520576 45822670 (+2302094)
4: 41500104 39684116 (-1815988)
5: 36542864 37940196 (+1397332)
6: 35281570 34144348 (-1137222)
7: 32149420 33088652 (+939232)
8: 31342368 31329056 (-13312)
9: 31338788 31313212 (-25576)
10: 31336324 31335612 (-712)
11: 31341892 31319576 (-22316)
12: 31336224 31322808 (-13416)
13: 31338560 31315084 (-23476)
14: 31338332 31332976 (-5356)
15: 31337300 31315088 (-22212)
16: 31334300 31330884 (-3416)
17: 31318660 31329916 (+11256)
18: 31334984 31327740 (-7244)
19: 31315084 31327768 (+12684)
20: 31334708 31345872 (+11164)
21: 31325988 31330948 (+4960)
22: 31333956 31339800 (+5844)
23: 31322880 31327316 (+4436)
24: 884333857 875775881 (-8557976)

It doesn't look like a slowdown; more like a 1% speedup.

I'll figure out tcrypt in a bit.
Tim Chen
2014-05-29 16:33:21 UTC
Post by George Spelvin
Post by Tim Chen
Can you do a tcrypt speed measurement with and without your changes?
Check to see if there's any slowdown. Please make sure you pin
the frequency of your cpu when running the test.
e.g.
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
I just now re-read your e-mail and noticed you suggested a specific tool.
Try to run the standard kernel crypto test with tcrypt. For speed test
of crc32c, use test 319:

modprobe tcrypt mode=319

Then you will see the output in dmesg (or tail of /var/log/messages).
It will give you the cycles you spent for various block sizes.

For consistent test numbers, before test,
disable turbo mode of cpu in BIOS and pin
frequency of all your cpus to max with something like

i=0
num_cpus=`cat /proc/cpuinfo| grep "^processor"| wc -l `
while [ $i -lt $num_cpus ]
do
    echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
    i=`expr $i + 1`
done
Post by George Spelvin
Oops, I haven't run that yet. I just made up my own in user space.
As I mentioned, since the changes are to the main loop that operates on
#define BUFFER 6114
static unsigned char buf[BUFFER] __attribute__ ((aligned(8)));
#define ITER 24 /* Number of test iterations */
uint32_t
do_test(uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
{
int i, j;
for (i = 0; i < BUFFER; i += 8)
for (j = i+24; j <= BUFFER; j += 24)
crc = f(buf+i, j-i, crc);
return crc;
}
uint32_t
time_test(uint64_t *time, uint32_t crc, uint32_t (*f)(void const *, unsigned, uint32_t))
{
uint64_t start = rdtsc();
crc = do_test(crc, f);
*time = rdtsc() - start;
return crc;
}
for (i = 0; i < ITER; i += 2) {
crc1 = time_test(times[i]+0, crc1, crc_pcl_1);
crc2 = time_test(times[i]+1, crc2, crc_pcl_2);
crc2 = time_test(times[i+1]+1, crc2, crc_pcl_2);
crc1 = time_test(times[i+1]+0, crc1, crc_pcl_1);
}
crc_pcl_1 is the old code, crc_pcl_2 is my revised version.
Old code New code
0: 85009953 71812457 (-13197496)
1: 57408829 63361572 (+5952743)
Maybe your cpu has not been pinned to a constant frequency?
The cycle counts are much higher in the first few iterations.
Likely the cpu frequency is going up when the governor detects
the load on the cpu. Please also check that turbo is
turned off, as it can introduce a lot of variation
in your testing.
Post by George Spelvin
2: 52552399 49195266 (-3357133)
3: 43595130 45988364 (+2393234)
4: 41541760 39714198 (-1827562)
5: 36576082 38021344 (+1445262)
6: 35307854 34150656 (-1157198)
7: 32182230 33134236 (+952006)
8: 31341596 31307004 (-34592)
9: 31340900 31329408 (-11492)
10: 31344884 31329144 (-15740)
11: 31334144 31312492 (-21652)
12: 31338992 31330356 (-8636)
13: 31343744 31311344 (-32400)
14: 31339000 31340196 (+1196)
15: 31337492 31313988 (-23504)
16: 31341688 31334040 (-7648)
17: 31341804 31308936 (-32868)
18: 31339936 31332020 (-7916)
19: 31323228 31324240 (+1012)
20: 31339744 31331768 (-7976)
21: 31321536 31332688 (+11152)
22: 31340280 31335212 (-5068)
23: 31332056 31335768 (+3712)
Looks encouraging that the time difference is fairly
small between the two algorithms.
Post by George Spelvin
24: 885575261 876586697 (-8988564)
It doesn't look like a slowdown; more like a 1% speedup.
You will need to throw away the first few iterations of
the test to account for cache warming effects.
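
One way to apply that advice to the numbers quoted above is sketched below: sum the per-iteration difference only after a warm-up cutoff. The cycle counts are copied from the first results table in this thread; the function name and the choice of skipping the first 8 iterations (where the counts level off at about 31.3M cycles) are assumptions made for illustration.

```c
#include <stdint.h>

#define NITER 24

/* Cycle counts copied from the first results table in this thread. */
static const int64_t old_cycles[NITER] = {
    85009953, 57408829, 52552399, 43595130, 41541760, 36576082,
    35307854, 32182230, 31341596, 31340900, 31344884, 31334144,
    31338992, 31343744, 31339000, 31337492, 31341688, 31341804,
    31339936, 31323228, 31339744, 31321536, 31340280, 31332056
};
static const int64_t new_cycles[NITER] = {
    71812457, 63361572, 49195266, 45988364, 39714198, 38021344,
    34150656, 33134236, 31307004, 31329408, 31329144, 31312492,
    31330356, 31311344, 31340196, 31313988, 31334040, 31308936,
    31332020, 31324240, 31331768, 31332688, 31335212, 31335768
};

/* Sum of (new - old) over iterations skip..n-1; negative means the
 * new code used fewer cycles. The skip count is a judgment call. */
static int64_t delta_after_warmup(const int64_t *old_c, const int64_t *new_c,
                                  int n, int skip)
{
    int64_t sum = 0;

    for (int i = skip; i < n; i++)
        sum += new_c[i] - old_c[i];
    return sum;
}
```

With no iterations skipped this reproduces the -8988564 total from the table. Skipping the first 8 leaves a difference of -192420 cycles over 16 runs of roughly 31.3M cycles each, which is well under 0.1% and consistent with "no noticeable slowdown" rather than a measurable speedup.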

Thanks.

Tim
Jan Beulich
2014-05-28 20:47:16 UTC
Post by George Spelvin
Jan: Is support for SLE10's pre-2.18 binutils still required?
Your PEXTRD fix was only a year ago, so I expect, but I wanted to ask.
I'd much appreciate if I would be able to build the kernel that way for another while.
Post by George Spelvin
1. The current code unnecessarily puts the table in the read-write
.data section. Moved to .text.
Putting data into .text seems wrong - it should go into .rodata.

Jan
George Spelvin
2014-05-28 21:47:03 UTC
Post by Jan Beulich
Post by George Spelvin
Jan: Is support for SLE10's pre-2.18 binutils still required?
Your PEXTRD fix was only a year ago, so I expect, but I wanted to ask.
I'd much appreciate if I would be able to build the kernel that way for another while.
Does it matter that the code I'm working on is 64-bit only? It already
uses crc32q instruction (added with SSE4.2) with no assembler workarounds,
so I figure pmovzxdq (part of SSE 4.1) doesn't make it any worse.

The annoying thing about doing it with macros is that it would be a
PITA to support a memory operand; I'd probably have to punt to .byte.
Post by Jan Beulich
Putting data into .text seems wrong - it should go into .rodata.
I don't really care, but it's being accessed PC-relative the same as
a jump table that's already in .text, so I just figured I'd be lazy.
Jan Beulich
2014-05-29 06:44:31 UTC
Post by George Spelvin
Post by Jan Beulich
Post by George Spelvin
Jan: Is support for SLE10's pre-2.18 binutils still required?
Your PEXTRD fix was only a year ago, so I expect, but I wanted to ask.
I'd much appreciate if I would be able to build the kernel that way for another while.
Does it matter that the code I'm working on is 64-bit only?
No.
Post by George Spelvin
It already
uses crc32q instruction (added with SSE4.2) with no assembler workarounds,
so I figure pmovzxdq (part of SSE 4.1) doesn't make it any worse.
If that's the case, then adding another (earlier) one shouldn't be an issue.

Jan
Tim Chen
2014-05-28 22:32:59 UTC
Post by George Spelvin
While following a number of tangents in the code (I was figuring out
how to edit lib/Kconfig; don't ask), I came across a table of 256 64-bit
words, all of which had the high half set to zero.
Since the code depends on both pclmulq and crc32, SSE 4.1 is obviously
present, so it could use pmovzxdq and save 1K of kernel data.
The following patch obviously lacks the kludges for old binutils,
but should convey the general idea.
Jan: Is support for SLE10's pre-2.18 binutils still required?
Your PEXTRD fix was only a year ago, so I expect, but I wanted to ask.
1. The current code unnecessarily puts the table in the read-write
.data section. Moved to .text.
2. I'm also not sure why it's necessary to force such large alignment
on K_table. Comments on reducing it?
diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..9f885ee4 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -216,15 +216,11 @@ LABEL crc_ %i
################################################################
- lea (K_table-16)(%rip), bufp # first entry is for idx 1
+ lea (K_table-8)(%rip), bufp # first entry is for idx 1
shlq $3, %rax # rax *= 8
- subq %rax, tmp # tmp -= rax*8
- shlq $1, %rax
- subq %rax, tmp # tmp -= rax*16
- # (total tmp -= rax*24)
- addq %rax, bufp
-
- movdqa (bufp), %xmm0 # 2 consts: K1:K2
+ pmovzxdq (bufp,%rax), %xmm0 # 2 consts: K1:K2
Changing from the aligned move (movdqa) to unaligned move and zeroing
(pmovzxdq), is going to make things slower. If the table is aligned
on 8 byte boundary, some of the table can span 2 cache lines, which
can slow things further.

We are trading speed for a saving of only 4096 bytes of memory,
which is likely not a good trade for most systems except for
those really constrained for memory. Such a non-performance-critical
system may as well use the generic crc32c algorithm and
compile out this module.

Thanks.

Tim
George Spelvin
2014-05-28 23:01:47 UTC
Thanks for the reply!
Post by Tim Chen
Changing from the aligned move (movdqa) to unaligned move and zeroing
(pmovzxdq), is going to make things slower. If the table is aligned
on 8 byte boundary, some of the table can span 2 cache lines, which
can slow things further.
Um, two notes:
1) This load is performed once per 3072-byte block, which
is a minimum of 128 cycles just for the crc32q instructions,
never mind all the pclmulqdq folderol.

Is it really more than 2 cycles? Heck, is it *any* overall
time given that it's preceded by a stretch of 384 instructions
that it's not data-dependent on?

I'll do some benchmarking to find out.

2) The shrunk table entries are 8 bytes long, and so can't
span a cache line. Is there any benefit to using a
larger alignment, other than the very small issue of the
full table needing 1 more cache line to be fully cached?
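
The alignment argument in point 2 can be checked with a few lines of C. This is only a sketch: the 64-byte line size is an assumption about the target CPUs, and the helper names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* True if the byte range [offset, offset + size) crosses a line boundary. */
static bool straddles_line(size_t offset, size_t size)
{
    return offset / CACHE_LINE != (offset + size - 1) / CACHE_LINE;
}

/* True if any of n consecutive entries starting at base crosses a line. */
static bool table_has_straddler(size_t base, size_t entry_size, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (straddles_line(base + i * entry_size, entry_size))
            return true;
    }
    return false;
}
```

With 8-byte entries and an 8-byte-aligned base, no entry ever crosses a line; the same base with the old 16-byte entries would put every fourth entry across a boundary, which is presumably what the concern about 8-byte alignment had in mind.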
Post by Tim Chen
We are trading speed for only 4096 bytes of memory save,
which is likely not a good trade for most systems except for
those really constrained of memory. For this kind of non-performance
critical system, it may as well use the generic crc32c algorithm and
compile out this module.
I hadn't intended to cause any speed penalty at all.
Do you really think there will be one?
Tim Chen
2014-05-28 23:28:14 UTC
Post by George Spelvin
Thanks for the reply!
Post by Tim Chen
Changing from the aligned move (movdqa) to unaligned move and zeroing
(pmovzxdq), is going to make things slower. If the table is aligned
on 8 byte boundary, some of the table can span 2 cache lines, which
can slow things further.
1) This load is performed once per 3072-byte block, which
is a minimum of 128 cycles just for the crc32q instructions,
never mind all the pclmulqdq folderol.
Is it really more than 2 cycles? Heck, is it *any* overall
time given that it's preceded by a stretch of 384 instructions
that it's not data-dependent on?
I'll do some benchmarking to find out.
2) The shrunk table entries are 8 bytes long, and so can't
span a cache line. Is there any benefit to using a
larger alignment, other than the very small issue of the
full table needing 1 more cache line to be fully cached?
I think you are fine. Each entry should fit in a cache line
entirely. With the reduced entry size, we will be fitting
twice as many entries per cache line, so it may help to reduce
cache misses.
Post by George Spelvin
Post by Tim Chen
We are trading speed for only 4096 bytes of memory save,
which is likely not a good trade for most systems except for
those really constrained of memory. For this kind of non-performance
critical system, it may as well use the generic crc32c algorithm and
compile out this module.
I hadn't intended to cause any speed penalty at all.
Do you really think there will be one?
If you can do some benchmarking to find out the change's
speed impact, that will help to eliminate concerns about
speed penalty.

Thanks.

Tim

George Spelvin
2014-05-29 23:54:32 UTC
Sorry for the delay; my Ivy Bridge test machine isn't in my
office and getting to the console to tweak the BIOS is a
bit of a bother.

Anyway, i7-4930K, turbo boost & hyperthreading disabled,
$ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
performance
performance
performance
performance
performance
performance

Oddly, though, CPU speed still seems to be fluctuating:
$ grep MHz /proc/cpuinfo
cpu MHz : 1255.875
cpu MHz : 3168.375
cpu MHz : 3062.125
cpu MHz : 1468.375
cpu MHz : 1309.000
cpu MHz : 2212.125
$ grep MHz /proc/cpuinfo
cpu MHz : 1255.875
cpu MHz : 2690.250
cpu MHz : 1255.875
cpu MHz : 2530.875
cpu MHz : 2212.125
cpu MHz : 1521.500

It does this even if I set scaling_min_freq to 3400000.
Very annoying. Should I be using a different
scaling driver than intel_pstate?
Post by Tim Chen
Post by George Spelvin
It doesn't look like a slowdown; more like a 1% speedup.
You will need to throw away the first few iterations of
the test to account for cache warming effects.
You're absolutely right; that's exactly *why* I ran it 24 times and
listed them all separately. The "1%" number was B.S. and I was not
thinking when I quoted it.

What I had legitimately noticed was that the code with the patch took
slightly fewer cycles most of the time, even after discounting the
first few. Not statistically significant, but enough to argue that it
didn't cause a noticeable slowdown.


Anyway, two iterations each of "modprobe tcrypt mode=319".

Old code:
[ 1530.513529]
[ 1530.513529] testing speed of crc32c
[ 1530.513535] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 75 cycles/operation, 4 cycles/byte
[ 1530.513537] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 413 cycles/operation, 6 cycles/byte
[ 1530.513540] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 88 cycles/operation, 1 cycles/byte
[ 1530.513542] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 1327 cycles/operation, 5 cycles/byte
[ 1530.513548] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 503 cycles/operation, 1 cycles/byte
[ 1530.513551] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 178 cycles/operation, 0 cycles/byte
[ 1530.513553] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 4972 cycles/operation, 4 cycles/byte
[ 1530.513572] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 806 cycles/operation, 0 cycles/byte
[ 1530.513576] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 370 cycles/operation, 0 cycles/byte
[ 1530.513579] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 9835 cycles/operation, 4 cycles/byte
[ 1530.513615] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 1461 cycles/operation, 0 cycles/byte
[ 1530.513622] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 847 cycles/operation, 0 cycles/byte
[ 1530.513626] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 495 cycles/operation, 0 cycles/byte
[ 1530.513630] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 19571 cycles/operation, 4 cycles/byte
[ 1530.513700] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 2758 cycles/operation, 0 cycles/byte
[ 1530.513711] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 1676 cycles/operation, 0 cycles/byte
[ 1530.513718] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 859 cycles/operation, 0 cycles/byte
[ 1530.513722] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 39012 cycles/operation, 4 cycles/byte
[ 1530.513861] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 5417 cycles/operation, 0 cycles/byte
[ 1530.513882] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 3162 cycles/operation, 0 cycles/byte
[ 1530.513894] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 1678 cycles/operation, 0 cycles/byte
[ 1530.513901] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 1653 cycles/operation, 0 cycles/byte

[ 1662.359717]
[ 1662.359717] testing speed of crc32c
[ 1662.359723] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 80 cycles/operation, 5 cycles/byte
[ 1662.359725] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 430 cycles/operation, 6 cycles/byte
[ 1662.359729] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 81 cycles/operation, 1 cycles/byte
[ 1662.359730] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 1324 cycles/operation, 5 cycles/byte
[ 1662.359736] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 503 cycles/operation, 1 cycles/byte
[ 1662.359740] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 171 cycles/operation, 0 cycles/byte
[ 1662.359741] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 4983 cycles/operation, 4 cycles/byte
[ 1662.359760] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 832 cycles/operation, 0 cycles/byte
[ 1662.359764] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 366 cycles/operation, 0 cycles/byte
[ 1662.359768] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 9839 cycles/operation, 4 cycles/byte
[ 1662.359804] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 1437 cycles/operation, 0 cycles/byte
[ 1662.359810] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 862 cycles/operation, 0 cycles/byte
[ 1662.359815] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 494 cycles/operation, 0 cycles/byte
[ 1662.359818] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 19553 cycles/operation, 4 cycles/byte
[ 1662.359901] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 2761 cycles/operation, 0 cycles/byte
[ 1662.359912] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 1715 cycles/operation, 0 cycles/byte
[ 1662.359919] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 852 cycles/operation, 0 cycles/byte
[ 1662.359928] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 39016 cycles/operation, 4 cycles/byte
[ 1662.360069] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 5538 cycles/operation, 0 cycles/byte
[ 1662.360090] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 3280 cycles/operation, 0 cycles/byte
[ 1662.360102] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 1695 cycles/operation, 0 cycles/byte
[ 1662.360110] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 1639 cycles/operation, 0 cycles/byte

New code:
[ 710.814463]
[ 710.814463] testing speed of crc32c
[ 710.814469] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 80 cycles/operation, 5 cycles/byte
[ 710.814472] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 410 cycles/operation, 6 cycles/byte
[ 710.814476] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 94 cycles/operation, 1 cycles/byte
[ 710.814477] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 1327 cycles/operation, 5 cycles/byte
[ 710.814483] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 492 cycles/operation, 1 cycles/byte
[ 710.814486] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 175 cycles/operation, 0 cycles/byte
[ 710.814488] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 4970 cycles/operation, 4 cycles/byte
[ 710.814507] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 797 cycles/operation, 0 cycles/byte
[ 710.814511] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 370 cycles/operation, 0 cycles/byte
[ 710.814514] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 9846 cycles/operation, 4 cycles/byte
[ 710.814551] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 1452 cycles/operation, 0 cycles/byte
[ 710.814557] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 840 cycles/operation, 0 cycles/byte
[ 710.814561] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 497 cycles/operation, 0 cycles/byte
[ 710.814564] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 19563 cycles/operation, 4 cycles/byte
[ 710.814635] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 2764 cycles/operation, 0 cycles/byte
[ 710.814646] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 1646 cycles/operation, 0 cycles/byte
[ 710.814653] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 858 cycles/operation, 0 cycles/byte
[ 710.814657] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 39020 cycles/operation, 4 cycles/byte
[ 710.814796] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 5422 cycles/operation, 0 cycles/byte
[ 710.814816] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 3182 cycles/operation, 0 cycles/byte
[ 710.814829] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 1669 cycles/operation, 0 cycles/byte
[ 710.814836] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 1636 cycles/operation, 0 cycles/byte

[ 1751.451733]
[ 1751.451733] testing speed of crc32c
[ 1751.451739] test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 75 cycles/operation, 4 cycles/byte
[ 1751.451741] test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 414 cycles/operation, 6 cycles/byte
[ 1751.451745] test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 87 cycles/operation, 1 cycles/byte
[ 1751.451746] test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 1329 cycles/operation, 5 cycles/byte
[ 1751.451752] test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 499 cycles/operation, 1 cycles/byte
[ 1751.451756] test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 170 cycles/operation, 0 cycles/byte
[ 1751.451757] test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 4964 cycles/operation, 4 cycles/byte
[ 1751.451776] test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 836 cycles/operation, 0 cycles/byte
[ 1751.451780] test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 370 cycles/operation, 0 cycles/byte
[ 1751.451784] test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 9844 cycles/operation, 4 cycles/byte
[ 1751.451820] test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 1468 cycles/operation, 0 cycles/byte
[ 1751.451826] test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 835 cycles/operation, 0 cycles/byte
[ 1751.451830] test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 493 cycles/operation, 0 cycles/byte
[ 1751.451834] test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 19564 cycles/operation, 4 cycles/byte
[ 1751.451904] test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 2776 cycles/operation, 0 cycles/byte
[ 1751.451915] test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 1662 cycles/operation, 0 cycles/byte
[ 1751.451922] test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 858 cycles/operation, 0 cycles/byte
[ 1751.451927] test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 39531 cycles/operation, 4 cycles/byte
[ 1751.452067] test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 5427 cycles/operation, 0 cycles/byte
[ 1751.452088] test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 3175 cycles/operation, 0 cycles/byte
[ 1751.452100] test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 1666 cycles/operation, 0 cycles/byte
[ 1751.452107] test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 1634 cycles/operation, 0 cycles/byte

The tests are pretty short, but there's no obvious slowdown, particularly
on the tests with > 200 bytes per update, where the modified code paths
are found.

Of course, whether the timing is valid is an interesting question.
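
When comparing runs like the ones above, it can help to reduce each dmesg line to its test number and cycles/operation figure. The following is a minimal sketch that assumes the line format shown in this thread; the function name tcrypt_cycles is hypothetical and not part of tcrypt.

```shell
#!/bin/sh
# Hypothetical helper: read tcrypt speed-test lines on stdin and print
# "<test number> <cycles per operation>" for each one.
# Assumes the "test N (...): C cycles/operation, ..." format shown above.
tcrypt_cycles() {
    awk '
        / cycles\/operation/ {
            num = ""; cycles = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "test")              num = $(i + 1)
                if ($i == "cycles/operation,") cycles = $(i - 1)
            }
            if (num != "" && cycles != "") print num, cycles
        }
    '
}
```

Something like "dmesg | tcrypt_cycles" after each run then gives two short columns that are easier to compare side by side than the raw log.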
Tim Chen
2014-05-30 01:07:16 UTC
Post by George Spelvin
Sorry for the delay; my Ivy Bridge test machine isn't in my
office and getting to the console to tweak the BIOS is a
bit of a bother.
Anyway, i7-4930K, turbo boost & hyperthreading disabled,
$ cat /sys/devices/system/cpu/cpu?/cpufreq/scaling_governor
performance
performance
performance
performance
performance
performance
$ grep MHz /proc/cpuinfo
cpu MHz : 1255.875
cpu MHz : 3168.375
cpu MHz : 3062.125
cpu MHz : 1468.375
cpu MHz : 1309.000
cpu MHz : 2212.125
$ grep MHz /proc/cpuinfo
cpu MHz : 1255.875
cpu MHz : 2690.250
cpu MHz : 1255.875
cpu MHz : 2530.875
cpu MHz : 2212.125
cpu MHz : 1521.500
This is odd. On my Ivy Bridge system the CPU speed from /proc/cpuinfo
is at max freq once I set the performance governor.
The numbers above almost look like
the cpu frequency is fluctuating and an average is taken.
What version of the kernel are you running? Is
CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?

Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
also change?

Can you check which governors and frequencies are available
on your system?

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies

If the userspace governor is available, you can try setting the governor
to userspace, then pin the frequency to 3400 MHz (assuming that's your
max) with a command like:

i=0
num_cpus=$(grep -c "^processor" /proc/cpuinfo)
while [ "$i" -lt "$num_cpus" ]
do
    echo userspace > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
    echo 3400000 > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_setspeed
    i=$((i + 1))
done
Post by George Spelvin
It does this even if I set scaling_min_freq to 3400000.
Very annoying. Should I be using a different
scaling_governor than intel_pstate?
Post by Tim Chen
Post by George Spelvin
It doesn't look like a slowdown; more like a 1% speedup.
You will need to throw away the first few iterations of
the test to account for cache warming effects.
You're absolutely right; that's exactly *why* I ran it 24 times and
listed them all separately. The "1%" number was B.S. and I was not
thinking when I quoted it.
What I had legitimately noticed was that the code with the patch took
slightly fewer cycles most of the time, even after discounting the
first few. Not statistically significant, but enough to argue that it
didn't cause a noticeable slowdown.
Anyway, two iterations each of "modprobe tcrypt mode=319".
The tests are pretty short, but there's no obvious slowdown, particularly
on the tests with > 200 bytes per update, where the modified code paths
are found.
So far, the numbers look good.

BTW, why do you place the K table in .text, instead of .rodata?

Thanks.

Tim
Dave Jones
2014-05-30 01:16:36 UTC
Post by Tim Chen
This is odd. On my Ivy Bridge system the CPU speed from /proc/cpuinfo
is at max freq once I set the performance governor.
The numbers above almost look like
the cpu frequency is fluctuating and an average is taken.
What version of the kernel are you running? Is
CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?
Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
also change?
Can you check which governors and frequencies are available
on your system?
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
If the userspace governor is available, you can try setting the governor
to userspace, then pin frequency to 3400 MHz (assuming that's your
intel_pstate overrides any governor choice you make through sysfs.

Dave
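
One way to see whether this applies to a given machine is to look at which cpufreq driver is loaded and whether the userspace governor is offered at all. The sketch below reads the standard cpufreq sysfs attributes scaling_driver and scaling_available_governors; the function name cpufreq_summary and its optional sysfs-root argument are hypothetical, added only so it can be tried on a scratch directory.

```shell
#!/bin/sh
# Sketch: report which cpufreq driver is in charge of cpu0 and whether the
# "userspace" governor is listed.  The optional argument (a sysfs root,
# default /sys) is an assumption made for testing, not a kernel interface.
cpufreq_summary() {
    dir="${1:-/sys}/devices/system/cpu/cpu0/cpufreq"
    driver=$(cat "$dir/scaling_driver" 2>/dev/null) || driver=unknown
    governors=$(cat "$dir/scaling_available_governors" 2>/dev/null)
    case " $governors " in
        *" userspace "*) userspace=yes ;;
        *)               userspace=no ;;
    esac
    echo "driver=$driver userspace_governor=$userspace"
}
```

If this prints driver=intel_pstate, the userspace/scaling_setspeed approach quoted above is unlikely to be available, and the intel_pstate knobs are the place to look instead.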
Tim Chen
2014-05-30 17:56:32 UTC
Post by Dave Jones
intel_pstate overrides any governor choice you make through sysfs.
Dave
Dirk,

I wonder if this is the right behavior for intel_pstate: when I set the
governor to performance, the intel_pstate driver still moves
the CPU frequencies around?

Turbostat also confirms that the frequencies are not at max,
even though max_perf_pct and min_perf_pct are both set to 100.

I ran on my HSW system with a 3.15-rc7 kernel and see an issue
similar to the one George reported.

It is really a pain when we need to do performance benchmarking and
need to have a constant cpu frequency.

Thanks.

Tim
Dirk Brandewie
2014-05-30 18:45:19 UTC
Post by Tim Chen
Dirk,
I wonder if this is the right behavior for intel_pstate: when I set the
governor to performance, the intel_pstate driver still moves
the CPU frequencies around?
No, the value returned is a measured/delivered frequency rather than the
requested P-state, which is what the other governors return.
Post by Tim Chen
Turbostat also confirms that the frequencies are not at max,
even though max_perf_pct and min_perf_pct are both set to 100.
I calculate frequency the same way turbostat does but my samples are a *lot*
shorter.
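
For readers unfamiliar with the calculation: as far as I know, turbostat derives its busy frequency from the ratio of the APERF and MPERF counter deltas over a sampling window, scaled by a reference frequency. The sketch below shows only that arithmetic. The function name delivered_khz is hypothetical, the base frequency is assumed to be supplied by the caller, and reading the MSRs themselves is left out.

```shell
#!/bin/sh
# Sketch: average delivered frequency over one sampling window, in kHz.
#   $1 = base (non-turbo) frequency in kHz, assumed known by the caller
#   $2 = APERF delta over the window
#   $3 = MPERF delta over the window
# Relies on the shell having 64-bit arithmetic (true for bash and dash on
# 64-bit systems); very large deltas could still overflow.
delivered_khz() {
    base_khz=$1 d_aperf=$2 d_mperf=$3
    [ "$d_mperf" -gt 0 ] || { echo "MPERF delta must be positive" >&2; return 1; }
    echo $(( base_khz * d_aperf / d_mperf ))
}
```

For example, "delivered_khz 3500000 1000 2000" prints 1750000: APERF advanced half as fast as MPERF, so the core averaged half the base frequency while it was running.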
Post by Tim Chen
I ran on my HSW system with a 3.15-rc7 kernel and see an issue
similar to the one George reported.
It is really a pain when we need to do performance benchmarking and
need to have a constant cpu frequency.
With turbostat from rc7.
[***@echolake turbostat]# ./turbostat
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 1 0.08 1178 3492 0 0.12 0.08 0.01 99.71 29 29 99.23 0.00 0.00 0.00 2.18 0.00 0.00
0 0 2 0.19 1189 3492 0 0.22 0.30 0.00 99.29 29 29 99.24 0.00 0.00 0.00 2.18 0.00 0.00
0 4 1 0.12 1253 3492 0 0.29
1 1 0 0.03 1065 3492 0 0.03 0.00 0.00 99.93 23
1 5 0 0.01 1104 3492 0 0.05
2 2 0 0.02 1275 3492 0 0.22 0.00 0.03 99.73 24
2 6 2 0.18 1220 3492 0 0.06
3 3 0 0.01 992 3492 0 0.07 0.00 0.01 99.90 23
3 7 0 0.05 915 3492 0 0.04
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 1 0.06 1034 3492 0 0.09 5.15 0.00 94.70 28 28 99.49 0.00 0.00 0.00 2.48 0.01 0.00
0 0 1 0.09 1066 3492 0 0.17 0.01 0.00 99.73 28 28 99.49 0.00 0.00 0.00 2.48 0.01 0.00
0 4 1 0.12 1036 3492 0 0.14
1 1 0 0.04 1009 3492 0 0.05 20.59 0.00 79.32 24
1 5 0 0.02 922 3492 0 0.07
2 2 0 0.03 924 3492 0 0.15 0.00 0.00 99.82 25
2 6 1 0.12 1117 3492 0 0.06
3 3 0 0.01 911 3492 0 0.04 0.01 0.00 99.94 22
3 7 0 0.03 856 3492 0 0.02
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 1 0.08 889 3492 0 0.12 0.03 0.06 99.71 29 29 99.32 0.00 0.00 0.00 2.21 0.00 0.00
0 0 1 0.11 867 3492 0 0.20 0.02 0.22 99.44 29 29 99.32 0.00 0.00 0.00 2.21 0.00 0.00
0 4 1 0.14 907 3492 0 0.17
1 1 1 0.12 809 3492 0 0.04 0.11 0.01 99.73 24
1 5 0 0.01 798 3492 0 0.14
2 2 0 0.03 863 3492 0 0.18 0.00 0.01 99.78 24
2 6 1 0.14 1013 3492 0 0.07
3 3 0 0.02 853 3492 0 0.09 0.00 0.00 99.89 23
3 7 1 0.06 815 3492 0 0.05
^C
[***@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
[***@echolake turbostat]# ./turbostat
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 1 0.03 3489 3492 0 2.43 0.01 0.00 97.53 30 30 90.20 0.00 0.00 0.00 2.85 0.06 0.00
0 0 1 0.04 3470 3492 0 0.09 0.00 0.00 99.88 30 30 90.20 0.00 0.00 0.00 2.85 0.06 0.00
0 4 2 0.06 3492 3492 0 0.07
1 1 1 0.02 3495 3492 0 0.05 0.03 0.00 99.90 25
1 5 0 0.00 3494 3492 0 0.07
2 2 0 0.01 3492 3492 0 9.53 0.00 0.01 90.45 25
2 6 1 0.04 3492 3492 0 9.50
3 3 1 0.03 3492 3492 0 0.05 0.01 0.00 99.91 23
3 7 1 0.02 3493 3492 0 0.06
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 1 0.02 3492 3492 0 4.93 0.00 0.00 95.04 30 30 80.19 0.00 0.00 0.00 3.54 0.10 0.00
0 0 1 0.02 3491 3492 0 0.08 0.01 0.00 99.89 30 30 80.19 0.00 0.00 0.00 3.54 0.10 0.00
0 4 2 0.05 3492 3492 0 0.05
1 1 0 0.01 3492 3492 0 0.02 0.00 0.00 99.97 24
1 5 0 0.01 3493 3492 0 0.02
2 2 0 0.01 3493 3492 0 19.65 0.01 0.00 80.34 24
2 6 2 0.05 3493 3492 0 19.61
3 3 1 0.01 3492 3492 0 0.02 0.00 0.00 99.97 23
3 7 0 0.01 3494 3492 0 0.02
Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt CorWatt GFXWatt
- - 2 0.05 3493 3492 0 1.64 0.01 0.00 98.29 30 30 93.25 0.00 0.00 0.00 2.64 0.04 0.00
0 0 4 0.12 3492 3492 0 0.13 0.01 0.00 99.74 30 30 93.25 0.00 0.00 0.00 2.64 0.04 0.00
0 4 2 0.06 3493 3492 0 0.19
1 1 1 0.02 3492 3492 0 0.03 0.04 0.00 99.91 23
1 5 0 0.01 3494 3492 0 0.04
2 2 0 0.01 3492 3492 0 6.42 0.00 0.00 93.57 25
2 6 6 0.16 3492 3492 0 6.27
3 3 0 0.01 3501 3492 0 0.05 0.01 0.00 99.93 22
3 7 1 0.03 3492 3492 0 0.03
[***@echolake turbostat]# grep MH /proc/cpuinfo
cpu MHz : 997.089
cpu MHz : 797.480
cpu MHz : 998.320
cpu MHz : 800.078
cpu MHz : 845.878
cpu MHz : 801.445
cpu MHz : 800.078
cpu MHz : 800.351
[***@echolake turbostat]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
[***@echolake turbostat]# grep MH /proc/cpuinfo
cpu MHz : 3497.128
cpu MHz : 3506.699
cpu MHz : 3500.273
cpu MHz : 3500.273
cpu MHz : 3500.000
cpu MHz : 3500.000
cpu MHz : 3500.000
cpu MHz : 3495.898
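For anyone comparing the two /proc/cpuinfo snapshots above, here is a minimal Python sketch that pulls out the "cpu MHz" lines and reports their spread. The helper name cpu_mhz_range is hypothetical and not part of any tool used in this thread; the sample strings are two of the readings quoted above.

```python
def cpu_mhz_range(cpuinfo_text):
    """Return (min, max) of the 'cpu MHz' values in /proc/cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    if not values:
        raise ValueError("no 'cpu MHz' lines found")
    return min(values), max(values)


# Two of the readings quoted above, before and after raising min_perf_pct.
before = "cpu MHz : 997.089\ncpu MHz : 797.480\n"
after = "cpu MHz : 3497.128\ncpu MHz : 3506.699\n"
print(cpu_mhz_range(before))
print(cpu_mhz_range(after))
```

On a live system the text would come from reading /proc/cpuinfo instead of the hard-coded strings.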
Post by Tim Chen
Thanks.
Tim
Tim Chen
2014-05-30 19:32:06 UTC
Post by Dirk Brandewie
With turbostat from rc7.
Dirk,

Thanks for checking things out.

I tested on a Haswell system, and I see that the frequency
can dip below the max even when I set min_perf_pct to 100.
Let me know if you want to log on to my system and check whether
there's something I missed. It is odd that package 1's cores
run at a much higher frequency, closer to max, than package 0's
once min_perf_pct is set to 100.

Tim

[***@otc-grantly-02 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
3600000
[***@otc-grantly-02 ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
1200000
[***@otc-grantly-02 ~]# echo 100 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
[***@otc-grantly-02 ~]# cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
100
[***@otc-grantly-02 ~]# uname -a
Linux otc-grantly-02 3.15.0-rc7+ #3 SMP Thu May 29 11:34:39 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
[***@otc-grantly-02 ~]# cpupower -c 0-1 frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 0.97 ms.
hardware limits: 1.20 GHz - 3.60 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 1.20 GHz and 3.60 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency is 1.20 GHz (asserted by call to hardware).
boost state support:
Supported: yes
Active: yes
analyzing CPU 1:
driver: intel_pstate
CPUs which run at the same hardware frequency: 1
CPUs which need to have their frequency coordinated by software: 1
maximum transition latency: 0.97 ms.
hardware limits: 1.20 GHz - 3.60 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 1.20 GHz and 3.60 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency is 2.02 GHz (asserted by call to hardware).
boost state support:
Supported: yes
Active: yes
[***@otc-grantly-02 ~]# turbostat
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.02 1964 2594 0 0.13 0.00 99.85 0.00 33 41 4.92 0.00 93.99 0.00 23.04 3.60 0.18 0.00
0 0 0 1 0.07 2154 2594 0 0.21 0.00 99.72 0.00 32 41 4.42 0.00 94.00 0.00 17.16 1.73 0.10 0.00
0 0 28 0 0.01 1465 2594 0 0.26
0 1 1 1 0.04 1941 2594 0 0.18 0.00 99.78 0.00 33
0 1 29 0 0.02 1587 2594 0 0.20
0 2 2 1 0.04 1586 2594 0 0.15 0.00 99.81 0.00 28
0 2 30 0 0.01 1539 2594 0 0.17
0 3 3 1 0.04 1656 2594 0 0.17 0.00 99.79 0.00 31
0 3 31 0 0.01 1723 2594 0 0.19
0 4 4 1 0.06 1800 2594 0 0.21 0.00 99.74 0.00 33
0 4 32 0 0.02 1725 2594 0 0.24
0 5 5 1 0.04 1917 2594 0 0.15 0.00 99.81 0.00 29
0 5 33 0 0.02 1707 2594 0 0.17
0 6 6 1 0.04 1820 2594 0 0.17 0.00 99.79 0.00 33
0 6 34 0 0.01 1564 2594 0 0.20
0 8 7 0 0.02 1655 2594 0 0.11 0.00 99.86 0.00 29
0 8 35 0 0.01 1687 2594 0 0.12
0 9 8 0 0.03 1748 2594 0 0.15 0.00 99.83 0.00 32
0 9 36 0 0.02 2001 2594 0 0.15
0 10 9 1 0.06 1604 2594 0 0.20 0.00 99.74 0.00 32
0 10 37 0 0.02 1679 2594 0 0.24
0 11 10 1 0.04 1644 2594 0 0.12 0.00 99.84 0.00 30
0 11 38 0 0.01 1509 2594 0 0.14
0 12 11 1 0.04 1773 2594 0 0.13 0.00 99.83 0.00 30
0 12 39 0 0.01 1529 2594 0 0.16
0 13 12 0 0.02 1907 2594 0 0.11 0.00 99.87 0.00 30
0 13 40 0 0.01 1574 2594 0 0.12
0 14 13 1 0.04 1831 2594 0 0.19 0.00 99.77 0.00 31
0 14 41 0 0.01 1735 2594 0 0.22
1 0 14 1 0.04 1831 2594 0 0.11 0.00 99.85 0.00 28 37 5.43 0.00 93.98 0.00 5.88 1.87 0.08 0.00
1 0 42 0 0.01 2238 2594 0 0.14
1 1 15 1 0.04 1869 2594 0 0.15 0.00 99.81 0.00 31
1 1 43 0 0.01 2407 2594 0 0.18
1 2 16 0 0.02 2164 2594 0 0.10 0.00 99.88 0.00 28
1 2 44 0 0.01 2326 2594 0 0.11
1 3 17 1 0.04 2101 2594 0 0.10 0.00 99.86 0.00 30
1 3 45 0 0.01 2355 2594 0 0.13
1 4 18 0 0.01 2429 2594 0 0.08 0.00 99.90 0.00 29
1 4 46 0 0.01 2545 2594 0 0.08
1 5 19 0 0.01 2412 2594 0 0.08 0.00 99.91 0.00 29
1 5 47 0 0.01 2392 2594 0 0.08
1 6 20 0 0.01 2448 2594 0 0.08 0.00 99.90 0.00 29
1 6 48 0 0.01 2430 2594 0 0.08
1 8 21 0 0.01 2574 2594 0 0.08 0.00 99.90 0.00 29
1 8 49 0 0.01 2450 2594 0 0.09
1 9 22 0 0.02 2470 2594 0 0.08 0.00 99.90 0.00 31
1 9 50 0 0.01 2555 2594 0 0.08
1 10 23 0 0.01 2540 2594 0 0.07 0.00 99.92 0.00 26
1 10 51 0 0.01 2672 2594 0 0.07
1 11 24 0 0.01 2472 2594 0 0.08 0.00 99.91 0.00 28
1 11 52 0 0.01 2461 2594 0 0.08
1 12 25 0 0.01 2438 2594 0 0.07 0.00 99.92 0.00 29
1 12 53 0 0.01 2316 2594 0 0.07
1 13 26 0 0.01 2363 2594 0 0.08 0.00 99.90 0.00 28
1 13 54 0 0.01 2586 2594 0 0.09
1 14 27 0 0.01 2459 2594 0 0.09 0.00 99.90 0.00 27
1 14 55 1 0.02 2939 2594 0 0.08
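
To put a number on the package 0 versus package 1 difference, here is a minimal Python sketch that averages the Bzy_MHz column per package from captured turbostat text. The function name mean_bzy_mhz_by_package is hypothetical, and the sample holds only the leading columns of a few rows from the output above. It assumes the whitespace-separated layout shown in this thread.

```python
from collections import defaultdict


def mean_bzy_mhz_by_package(turbostat_text):
    """Average the Bzy_MHz column per package over all per-CPU rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    pkg_col = None
    bzy_col = None
    for line in turbostat_text.splitlines():
        fields = line.split()
        if "Bzy_MHz" in fields:
            # Header line (repeated for every sample): find columns by name.
            bzy_col = fields.index("Bzy_MHz")
            pkg_col = fields.index("Package") if "Package" in fields else None
            continue
        if bzy_col is None or len(fields) <= bzy_col or fields[0] == "-":
            continue  # not a data row, or the all-CPU summary row
        try:
            mhz = float(fields[bzy_col])
        except ValueError:
            continue  # unrelated text mixed into the capture
        # Single-package output has no Package column; file it under "0".
        pkg = fields[pkg_col] if pkg_col is not None else "0"
        totals[pkg] += mhz
        counts[pkg] += 1
    return {pkg: totals[pkg] / counts[pkg] for pkg in totals}


# Leading columns of a few rows from the sample above.
sample = """\
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz
- - - 0 0.02 1964 2594
0 0 0 1 0.07 2154 2594
0 0 28 0 0.01 1465 2594
1 0 14 1 0.04 1831 2594
1 0 42 0 0.01 2238 2594
"""
print(mean_bzy_mhz_by_package(sample))
```

This is an unweighted mean over CPUs. With every CPU well under 1% busy, each Bzy_MHz value rests on very little busy time, so treat the result as a rough indication only.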

Tim
Dirk Brandewie
2014-05-30 19:38:34 UTC
Post by Tim Chen
Dirk,
Thanks for checking things out.
I tested on a Haswell system, and I see that the frequency
can dip below the max even when I set min_perf_pct to 100.
Let me know if you want to log on to my system and check whether
there's something I missed. It is odd that package 1's cores
run at a much higher frequency, closer to max, than package 0's
once min_perf_pct is set to 100.
Can you run turbostat for a few samples? It reports an average over the sample
time.
Tim Chen
2014-05-30 20:07:19 UTC
Post by Dirk Brandewie
Post by Tim Chen
Dirk,
Thanks for checking things out.
I tested on a Haswell system, and I see that the frequency
can dip below the max even when I set min_perf_pct to 100.
Let me know if you want to log on to my system and check whether
there's something I missed. It is odd that package 1's cores
run at a much higher frequency, closer to max, than package 0's
once min_perf_pct is set to 100.
Can you run turbostat for a few samples? It reports an average over the sample
time.
Here it is.

Tim

Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.02 2048 2594 0 0.23 0.00 99.75 0.00 33 42 5.93 0.00 91.52 0.00 23.22 4.15 0.12 0.00
0 0 0 1 0.06 1997 2594 0 0.16 0.00 99.78 0.00 32 42 7.92 0.00 91.55 0.00 16.88 1.95 0.06 0.00
0 0 28 0 0.01 1338 2594 0 0.21
0 1 1 0 0.02 1696 2594 0 0.11 0.00 99.87 0.00 33
0 1 29 0 0.01 1455 2594 0 0.11
0 2 2 0 0.01 1618 2594 0 0.07 0.00 99.91 0.00 30
0 2 30 0 0.01 1513 2594 0 0.07
0 3 3 0 0.01 1724 2594 0 0.08 0.00 99.91 0.00 31
0 3 31 0 0.01 1447 2594 0 0.08
0 4 4 0 0.01 1769 2594 0 0.06 0.00 99.92 0.00 32
0 4 32 0 0.01 1483 2594 0 0.06
0 5 5 0 0.01 1670 2594 0 0.07 0.00 99.92 0.00 29
0 5 33 0 0.01 1515 2594 0 0.07
0 6 6 0 0.01 1600 2594 0 0.07 0.00 99.92 0.00 33
0 6 34 0 0.01 1412 2594 0 0.07
0 8 7 0 0.01 1588 2594 0 0.07 0.00 99.92 0.00 30
0 8 35 0 0.01 1432 2594 0 0.07
0 9 8 0 0.01 1662 2594 0 0.11 0.00 99.88 0.00 32
0 9 36 0 0.02 1658 2594 0 0.10
0 10 9 0 0.01 1570 2594 0 0.07 0.00 99.91 0.00 32
0 10 37 0 0.01 1468 2594 0 0.07
0 11 10 0 0.01 1680 2594 0 0.07 0.00 99.92 0.00 31
0 11 38 0 0.01 1511 2594 0 0.07
0 12 11 0 0.01 1690 2594 0 0.08 0.00 99.91 0.00 30
0 12 39 0 0.01 1560 2594 0 0.08
0 13 12 0 0.02 1604 2594 0 0.11 0.00 99.87 0.00 29
0 13 40 0 0.02 1436 2594 0 0.11
0 14 13 0 0.02 1620 2594 0 0.09 0.00 99.89 0.00 29
0 14 41 0 0.02 1440 2594 0 0.09
1 0 14 0 0.03 1666 2594 0 0.16 0.00 99.82 0.00 28 36 3.94 0.00 91.50 0.00 6.34 2.20 0.06 0.00
1 0 42 3 0.08 3263 2594 0 0.11
1 1 15 0 0.01 2194 2594 0 0.09 0.00 99.90 0.00 30
1 1 43 0 0.01 2358 2594 0 0.09
1 2 16 0 0.01 2650 2594 0 0.08 0.00 99.91 0.00 28
1 2 44 0 0.01 2032 2594 0 0.08
1 3 17 1 0.03 2305 2594 0 4.11 0.00 95.86 0.00 30
1 3 45 0 0.01 2290 2594 0 4.13
1 4 18 0 0.01 2362 2594 0 0.09 0.00 99.90 0.00 28
1 4 46 0 0.01 2325 2594 0 0.09
1 5 19 0 0.01 2374 2594 0 0.07 0.00 99.92 0.00 30
1 5 47 0 0.01 2442 2594 0 0.07
1 6 20 0 0.01 2476 2594 0 0.08 0.00 99.91 0.00 30
1 6 48 0 0.01 2382 2594 0 0.07
1 8 21 0 0.01 2669 2594 0 0.09 0.00 99.90 0.00 29
1 8 49 0 0.02 1953 2594 0 0.09
1 9 22 0 0.01 2537 2594 0 0.10 0.00 99.89 0.00 31
1 9 50 0 0.01 2117 2594 0 0.10
1 10 23 0 0.01 2531 2594 0 0.07 0.00 99.92 0.00 27
1 10 51 0 0.01 2404 2594 0 0.08
1 11 24 0 0.01 2315 2594 0 0.08 0.00 99.91 0.00 28
1 11 52 0 0.01 2210 2594 0 0.08
1 12 25 0 0.01 2434 2594 0 0.07 0.00 99.91 0.00 28
1 12 53 0 0.01 2113 2594 0 0.08
1 13 26 0 0.01 2070 2594 0 0.07 0.00 99.91 0.00 27
1 13 54 0 0.01 2114 2594 0 0.08
1 14 27 0 0.01 2324 2594 0 0.10 0.00 99.89 0.00 27
1 14 55 1 0.03 2991 2594 0 0.08
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.01 2138 2594 0 0.10 0.01 99.88 0.00 33 42 4.32 0.09 94.88 0.00 22.45 3.56 0.12 0.00
0 0 0 1 0.07 2106 2594 0 0.25 0.00 99.68 0.00 31 42 4.20 0.00 95.00 0.00 16.72 1.73 0.06 0.00
0 0 28 0 0.01 2163 2594 0 0.31
0 1 1 0 0.02 2005 2594 0 0.11 0.00 99.87 0.00 33
0 1 29 0 0.01 1823 2594 0 0.12
0 2 2 0 0.02 2008 2594 0 0.10 0.00 99.88 0.00 30
0 2 30 0 0.01 1903 2594 0 0.10
0 3 3 0 0.02 1953 2594 0 0.10 0.00 99.88 0.00 31
0 3 31 0 0.01 1840 2594 0 0.11
0 4 4 0 0.02 2220 2594 0 0.09 0.01 99.89 0.00 33
0 4 32 0 0.01 1806 2594 0 0.09
0 5 5 0 0.01 1723 2594 0 0.09 0.00 99.89 0.00 28
0 5 33 0 0.01 1904 2594 0 0.09
0 6 6 0 0.01 1806 2594 0 0.08 0.00 99.91 0.00 33
0 6 34 0 0.01 1824 2594 0 0.08
0 8 7 0 0.01 1910 2594 0 0.10 0.00 99.89 0.00 30
0 8 35 0 0.01 1847 2594 0 0.10
0 9 8 0 0.02 2204 2594 0 0.11 0.00 99.88 0.00 30
0 9 36 0 0.02 1899 2594 0 0.11
0 10 9 0 0.01 1967 2594 0 0.09 0.00 99.90 0.00 33
0 10 37 0 0.01 1838 2594 0 0.09
0 11 10 0 0.01 1696 2594 0 0.08 0.00 99.90 0.00 31
0 11 38 0 0.01 1728 2594 0 0.08
0 12 11 0 0.02 1863 2594 0 0.09 0.00 99.90 0.00 30
0 12 39 0 0.01 1838 2594 0 0.09
0 13 12 0 0.02 1856 2594 0 0.11 0.00 99.87 0.00 29
0 13 40 0 0.01 1741 2594 0 0.12
0 14 13 0 0.02 1887 2594 0 0.10 0.00 99.88 0.00 30
0 14 41 0 0.01 1860 2594 0 0.11
1 0 14 0 0.03 1875 2594 0 0.09 0.00 99.88 0.00 28 38 4.44 0.18 94.75 0.00 5.72 1.82 0.06 0.00
1 0 42 0 0.01 2363 2594 0 0.11
1 1 15 0 0.01 2368 2594 0 0.09 0.00 99.90 0.00 31
1 1 43 0 0.01 2403 2594 0 0.09
1 2 16 0 0.01 2501 2594 0 0.07 0.00 99.91 0.00 27
1 2 44 0 0.01 2469 2594 0 0.07
1 3 17 1 0.04 2674 2594 0 0.10 0.19 99.66 0.00 30
1 3 45 0 0.01 2374 2594 0 0.13
1 4 18 0 0.01 2446 2594 0 0.08 0.00 99.91 0.00 28
1 4 46 0 0.01 2372 2594 0 0.08
1 5 19 0 0.01 2479 2594 0 0.08 0.00 99.91 0.00 29
1 5 47 0 0.01 2352 2594 0 0.08
1 6 20 0 0.01 2436 2594 0 0.07 0.00 99.91 0.00 30
1 6 48 0 0.01 2381 2594 0 0.08
1 8 21 0 0.01 2377 2594 0 0.08 0.00 99.91 0.00 29
1 8 49 0 0.01 2629 2594 0 0.08
1 9 22 0 0.01 2407 2594 0 0.09 0.00 99.90 0.00 30
1 9 50 0 0.01 2547 2594 0 0.09
1 10 23 0 0.01 2254 2594 0 0.09 0.00 99.90 0.00 28
1 10 51 0 0.01 2514 2594 0 0.09
1 11 24 0 0.01 2204 2594 0 0.10 0.00 99.89 0.00 29
1 11 52 0 0.01 2187 2594 0 0.09
1 12 25 0 0.01 2310 2594 0 0.09 0.00 99.90 0.00 27
1 12 53 0 0.01 2636 2594 0 0.09
1 13 26 0 0.01 2325 2594 0 0.09 0.00 99.89 0.00 29
1 13 54 0 0.02 1959 2594 0 0.09
1 14 27 0 0.01 2273 2594 0 0.11 0.00 99.88 0.00 28
1 14 55 1 0.02 2678 2594 0 0.10
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.02 2223 2594 0 0.12 0.00 99.86 0.00 34 41 4.47 0.00 94.54 0.00 22.70 3.64 0.14 0.00
0 0 0 1 0.05 2251 2594 0 0.18 0.00 99.77 0.00 32 41 4.81 0.00 94.56 0.00 16.89 1.78 0.06 0.00
0 0 28 0 0.01 1846 2594 0 0.22
0 1 1 0 0.02 1758 2594 0 0.12 0.00 99.86 0.00 33
0 1 29 0 0.02 1945 2594 0 0.12
0 2 2 0 0.02 1635 2594 0 0.09 0.00 99.89 0.00 29
0 2 30 0 0.01 1939 2594 0 0.10
0 3 3 0 0.01 1834 2594 0 0.08 0.00 99.90 0.00 31
0 3 31 0 0.01 1554 2594 0 0.09
0 4 4 0 0.02 1827 2594 0 0.08 0.00 99.91 0.00 33
0 4 32 0 0.01 1824 2594 0 0.08
0 5 5 0 0.02 1925 2594 0 0.08 0.00 99.90 0.00 29
0 5 33 0 0.01 1796 2594 0 0.08
0 6 6 0 0.02 1801 2594 0 0.07 0.00 99.91 0.00 34
0 6 34 0 0.01 1874 2594 0 0.08
0 8 7 0 0.02 1930 2594 0 0.08 0.00 99.91 0.00 30
0 8 35 0 0.01 1901 2594 0 0.08
0 9 8 0 0.02 1874 2594 0 0.10 0.00 99.88 0.00 30
0 9 36 0 0.02 1915 2594 0 0.10
0 10 9 0 0.02 1779 2594 0 0.08 0.00 99.90 0.00 32
0 10 37 0 0.01 1983 2594 0 0.09
0 11 10 0 0.02 1754 2594 0 0.08 0.00 99.90 0.00 31
0 11 38 0 0.01 1722 2594 0 0.09
0 12 11 0 0.02 1730 2594 0 0.08 0.00 99.90 0.00 29
0 12 39 0 0.01 1892 2594 0 0.09
0 13 12 0 0.02 1943 2594 0 0.10 0.00 99.88 0.00 30
0 13 40 0 0.02 2016 2594 0 0.10
0 14 13 0 0.02 1893 2594 0 0.10 0.00 99.87 0.00 31
0 14 41 0 0.01 1790 2594 0 0.11
1 0 14 1 0.03 1998 2594 0 0.16 0.00 99.81 0.00 28 37 4.13 0.00 94.52 0.00 5.81 1.86 0.08 0.00
1 0 42 3 0.08 3493 2594 0 0.11
1 1 15 0 0.01 2483 2594 0 0.08 0.00 99.90 0.00 31
1 1 43 0 0.01 2279 2594 0 0.09
1 2 16 0 0.01 2454 2594 0 0.07 0.00 99.92 0.00 27
1 2 44 0 0.01 2405 2594 0 0.07
1 3 17 1 0.03 3069 2594 0 0.29 0.00 99.68 0.00 31
1 3 45 0 0.01 2298 2594 0 0.31
1 4 18 0 0.01 2515 2594 0 0.08 0.00 99.91 0.00 28
1 4 46 0 0.01 2193 2594 0 0.08
1 5 19 0 0.01 2547 2594 0 0.06 0.00 99.93 0.00 28
1 5 47 0 0.01 2327 2594 0 0.06
1 6 20 0 0.01 2315 2594 0 0.07 0.00 99.92 0.00 29
1 6 48 0 0.01 2120 2594 0 0.07
1 8 21 0 0.01 2482 2594 0 0.07 0.00 99.92 0.00 29
1 8 49 0 0.01 2311 2594 0 0.07
1 9 22 0 0.01 2372 2594 0 0.09 0.00 99.90 0.00 30
1 9 50 0 0.01 2509 2594 0 0.09
1 10 23 0 0.02 2147 2594 0 0.08 0.00 99.91 0.00 27
1 10 51 0 0.01 2477 2594 0 0.08
1 11 24 0 0.01 2138 2594 0 0.08 0.00 99.90 0.00 29
1 11 52 0 0.01 2365 2594 0 0.09
1 12 25 0 0.01 1965 2594 0 0.07 0.00 99.91 0.00 28
1 12 53 0 0.01 2447 2594 0 0.08
1 13 26 0 0.01 2476 2594 0 0.08 0.00 99.91 0.00 28
1 13 54 0 0.01 2282 2594 0 0.08
1 14 27 0 0.01 2386 2594 0 0.76 0.00 99.22 0.00 28
1 14 55 1 0.02 3065 2594 0 0.75
Dirk Brandewie
2014-05-30 20:15:46 UTC
Post by Tim Chen
Post by Dirk Brandewie
Post by Tim Chen
Dirk,
Thanks for checking things out.
I tested on a Haswell system, and I see that the frequency
can dip below the max even when I set min_perf_pct to 100.
Let me know if you want to log on to my system and check whether
there's something I missed. It is odd that package 1's cores
run at a much higher frequency, closer to max, than package 0's
once min_perf_pct is set to 100.
Can you run turbostat for a few samples? It reports an average over the sample
time.
Here it is.
You have me at a loss here. I can come in on Monday if you are around, and
we can try to figure out what is happening.

--Dirk
Post by Tim Chen
Tim
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.02 2048 2594 0 0.23 0.00 99.75 0.00 33 42 5.93 0.00 91.52 0.00 23.22 4.15 0.12 0.00
0 0 0 1 0.06 1997 2594 0 0.16 0.00 99.78 0.00 32 42 7.92 0.00 91.55 0.00 16.88 1.95 0.06 0.00
0 0 28 0 0.01 1338 2594 0 0.21
0 1 1 0 0.02 1696 2594 0 0.11 0.00 99.87 0.00 33
0 1 29 0 0.01 1455 2594 0 0.11
0 2 2 0 0.01 1618 2594 0 0.07 0.00 99.91 0.00 30
0 2 30 0 0.01 1513 2594 0 0.07
0 3 3 0 0.01 1724 2594 0 0.08 0.00 99.91 0.00 31
0 3 31 0 0.01 1447 2594 0 0.08
0 4 4 0 0.01 1769 2594 0 0.06 0.00 99.92 0.00 32
0 4 32 0 0.01 1483 2594 0 0.06
0 5 5 0 0.01 1670 2594 0 0.07 0.00 99.92 0.00 29
0 5 33 0 0.01 1515 2594 0 0.07
0 6 6 0 0.01 1600 2594 0 0.07 0.00 99.92 0.00 33
0 6 34 0 0.01 1412 2594 0 0.07
0 8 7 0 0.01 1588 2594 0 0.07 0.00 99.92 0.00 30
0 8 35 0 0.01 1432 2594 0 0.07
0 9 8 0 0.01 1662 2594 0 0.11 0.00 99.88 0.00 32
0 9 36 0 0.02 1658 2594 0 0.10
0 10 9 0 0.01 1570 2594 0 0.07 0.00 99.91 0.00 32
0 10 37 0 0.01 1468 2594 0 0.07
0 11 10 0 0.01 1680 2594 0 0.07 0.00 99.92 0.00 31
0 11 38 0 0.01 1511 2594 0 0.07
0 12 11 0 0.01 1690 2594 0 0.08 0.00 99.91 0.00 30
0 12 39 0 0.01 1560 2594 0 0.08
0 13 12 0 0.02 1604 2594 0 0.11 0.00 99.87 0.00 29
0 13 40 0 0.02 1436 2594 0 0.11
0 14 13 0 0.02 1620 2594 0 0.09 0.00 99.89 0.00 29
0 14 41 0 0.02 1440 2594 0 0.09
1 0 14 0 0.03 1666 2594 0 0.16 0.00 99.82 0.00 28 36 3.94 0.00 91.50 0.00 6.34 2.20 0.06 0.00
1 0 42 3 0.08 3263 2594 0 0.11
1 1 15 0 0.01 2194 2594 0 0.09 0.00 99.90 0.00 30
1 1 43 0 0.01 2358 2594 0 0.09
1 2 16 0 0.01 2650 2594 0 0.08 0.00 99.91 0.00 28
1 2 44 0 0.01 2032 2594 0 0.08
1 3 17 1 0.03 2305 2594 0 4.11 0.00 95.86 0.00 30
1 3 45 0 0.01 2290 2594 0 4.13
1 4 18 0 0.01 2362 2594 0 0.09 0.00 99.90 0.00 28
1 4 46 0 0.01 2325 2594 0 0.09
1 5 19 0 0.01 2374 2594 0 0.07 0.00 99.92 0.00 30
1 5 47 0 0.01 2442 2594 0 0.07
1 6 20 0 0.01 2476 2594 0 0.08 0.00 99.91 0.00 30
1 6 48 0 0.01 2382 2594 0 0.07
1 8 21 0 0.01 2669 2594 0 0.09 0.00 99.90 0.00 29
1 8 49 0 0.02 1953 2594 0 0.09
1 9 22 0 0.01 2537 2594 0 0.10 0.00 99.89 0.00 31
1 9 50 0 0.01 2117 2594 0 0.10
1 10 23 0 0.01 2531 2594 0 0.07 0.00 99.92 0.00 27
1 10 51 0 0.01 2404 2594 0 0.08
1 11 24 0 0.01 2315 2594 0 0.08 0.00 99.91 0.00 28
1 11 52 0 0.01 2210 2594 0 0.08
1 12 25 0 0.01 2434 2594 0 0.07 0.00 99.91 0.00 28
1 12 53 0 0.01 2113 2594 0 0.08
1 13 26 0 0.01 2070 2594 0 0.07 0.00 99.91 0.00 27
1 13 54 0 0.01 2114 2594 0 0.08
1 14 27 0 0.01 2324 2594 0 0.10 0.00 99.89 0.00 27
1 14 55 1 0.03 2991 2594 0 0.08
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.01 2138 2594 0 0.10 0.01 99.88 0.00 33 42 4.32 0.09 94.88 0.00 22.45 3.56 0.12 0.00
0 0 0 1 0.07 2106 2594 0 0.25 0.00 99.68 0.00 31 42 4.20 0.00 95.00 0.00 16.72 1.73 0.06 0.00
0 0 28 0 0.01 2163 2594 0 0.31
0 1 1 0 0.02 2005 2594 0 0.11 0.00 99.87 0.00 33
0 1 29 0 0.01 1823 2594 0 0.12
0 2 2 0 0.02 2008 2594 0 0.10 0.00 99.88 0.00 30
0 2 30 0 0.01 1903 2594 0 0.10
0 3 3 0 0.02 1953 2594 0 0.10 0.00 99.88 0.00 31
0 3 31 0 0.01 1840 2594 0 0.11
0 4 4 0 0.02 2220 2594 0 0.09 0.01 99.89 0.00 33
0 4 32 0 0.01 1806 2594 0 0.09
0 5 5 0 0.01 1723 2594 0 0.09 0.00 99.89 0.00 28
0 5 33 0 0.01 1904 2594 0 0.09
0 6 6 0 0.01 1806 2594 0 0.08 0.00 99.91 0.00 33
0 6 34 0 0.01 1824 2594 0 0.08
0 8 7 0 0.01 1910 2594 0 0.10 0.00 99.89 0.00 30
0 8 35 0 0.01 1847 2594 0 0.10
0 9 8 0 0.02 2204 2594 0 0.11 0.00 99.88 0.00 30
0 9 36 0 0.02 1899 2594 0 0.11
0 10 9 0 0.01 1967 2594 0 0.09 0.00 99.90 0.00 33
0 10 37 0 0.01 1838 2594 0 0.09
0 11 10 0 0.01 1696 2594 0 0.08 0.00 99.90 0.00 31
0 11 38 0 0.01 1728 2594 0 0.08
0 12 11 0 0.02 1863 2594 0 0.09 0.00 99.90 0.00 30
0 12 39 0 0.01 1838 2594 0 0.09
0 13 12 0 0.02 1856 2594 0 0.11 0.00 99.87 0.00 29
0 13 40 0 0.01 1741 2594 0 0.12
0 14 13 0 0.02 1887 2594 0 0.10 0.00 99.88 0.00 30
0 14 41 0 0.01 1860 2594 0 0.11
1 0 14 0 0.03 1875 2594 0 0.09 0.00 99.88 0.00 28 38 4.44 0.18 94.75 0.00 5.72 1.82 0.06 0.00
1 0 42 0 0.01 2363 2594 0 0.11
1 1 15 0 0.01 2368 2594 0 0.09 0.00 99.90 0.00 31
1 1 43 0 0.01 2403 2594 0 0.09
1 2 16 0 0.01 2501 2594 0 0.07 0.00 99.91 0.00 27
1 2 44 0 0.01 2469 2594 0 0.07
1 3 17 1 0.04 2674 2594 0 0.10 0.19 99.66 0.00 30
1 3 45 0 0.01 2374 2594 0 0.13
1 4 18 0 0.01 2446 2594 0 0.08 0.00 99.91 0.00 28
1 4 46 0 0.01 2372 2594 0 0.08
1 5 19 0 0.01 2479 2594 0 0.08 0.00 99.91 0.00 29
1 5 47 0 0.01 2352 2594 0 0.08
1 6 20 0 0.01 2436 2594 0 0.07 0.00 99.91 0.00 30
1 6 48 0 0.01 2381 2594 0 0.08
1 8 21 0 0.01 2377 2594 0 0.08 0.00 99.91 0.00 29
1 8 49 0 0.01 2629 2594 0 0.08
1 9 22 0 0.01 2407 2594 0 0.09 0.00 99.90 0.00 30
1 9 50 0 0.01 2547 2594 0 0.09
1 10 23 0 0.01 2254 2594 0 0.09 0.00 99.90 0.00 28
1 10 51 0 0.01 2514 2594 0 0.09
1 11 24 0 0.01 2204 2594 0 0.10 0.00 99.89 0.00 29
1 11 52 0 0.01 2187 2594 0 0.09
1 12 25 0 0.01 2310 2594 0 0.09 0.00 99.90 0.00 27
1 12 53 0 0.01 2636 2594 0 0.09
1 13 26 0 0.01 2325 2594 0 0.09 0.00 99.89 0.00 29
1 13 54 0 0.02 1959 2594 0 0.09
1 14 27 0 0.01 2273 2594 0 0.11 0.00 99.88 0.00 28
1 14 55 1 0.02 2678 2594 0 0.10
Package Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 PkgWatt RAMWatt PKG_% RAM_%
- - - 0 0.02 2223 2594 0 0.12 0.00 99.86 0.00 34 41 4.47 0.00 94.54 0.00 22.70 3.64 0.14 0.00
0 0 0 1 0.05 2251 2594 0 0.18 0.00 99.77 0.00 32 41 4.81 0.00 94.56 0.00 16.89 1.78 0.06 0.00
0 0 28 0 0.01 1846 2594 0 0.22
0 1 1 0 0.02 1758 2594 0 0.12 0.00 99.86 0.00 33
0 1 29 0 0.02 1945 2594 0 0.12
0 2 2 0 0.02 1635 2594 0 0.09 0.00 99.89 0.00 29
0 2 30 0 0.01 1939 2594 0 0.10
0 3 3 0 0.01 1834 2594 0 0.08 0.00 99.90 0.00 31
0 3 31 0 0.01 1554 2594 0 0.09
0 4 4 0 0.02 1827 2594 0 0.08 0.00 99.91 0.00 33
0 4 32 0 0.01 1824 2594 0 0.08
0 5 5 0 0.02 1925 2594 0 0.08 0.00 99.90 0.00 29
0 5 33 0 0.01 1796 2594 0 0.08
0 6 6 0 0.02 1801 2594 0 0.07 0.00 99.91 0.00 34
0 6 34 0 0.01 1874 2594 0 0.08
0 8 7 0 0.02 1930 2594 0 0.08 0.00 99.91 0.00 30
0 8 35 0 0.01 1901 2594 0 0.08
0 9 8 0 0.02 1874 2594 0 0.10 0.00 99.88 0.00 30
0 9 36 0 0.02 1915 2594 0 0.10
0 10 9 0 0.02 1779 2594 0 0.08 0.00 99.90 0.00 32
0 10 37 0 0.01 1983 2594 0 0.09
0 11 10 0 0.02 1754 2594 0 0.08 0.00 99.90 0.00 31
0 11 38 0 0.01 1722 2594 0 0.09
0 12 11 0 0.02 1730 2594 0 0.08 0.00 99.90 0.00 29
0 12 39 0 0.01 1892 2594 0 0.09
0 13 12 0 0.02 1943 2594 0 0.10 0.00 99.88 0.00 30
0 13 40 0 0.02 2016 2594 0 0.10
0 14 13 0 0.02 1893 2594 0 0.10 0.00 99.87 0.00 31
0 14 41 0 0.01 1790 2594 0 0.11
1 0 14 1 0.03 1998 2594 0 0.16 0.00 99.81 0.00 28 37 4.13 0.00 94.52 0.00 5.81 1.86 0.08 0.00
1 0 42 3 0.08 3493 2594 0 0.11
1 1 15 0 0.01 2483 2594 0 0.08 0.00 99.90 0.00 31
1 1 43 0 0.01 2279 2594 0 0.09
1 2 16 0 0.01 2454 2594 0 0.07 0.00 99.92 0.00 27
1 2 44 0 0.01 2405 2594 0 0.07
1 3 17 1 0.03 3069 2594 0 0.29 0.00 99.68 0.00 31
1 3 45 0 0.01 2298 2594 0 0.31
1 4 18 0 0.01 2515 2594 0 0.08 0.00 99.91 0.00 28
1 4 46 0 0.01 2193 2594 0 0.08
1 5 19 0 0.01 2547 2594 0 0.06 0.00 99.93 0.00 28
1 5 47 0 0.01 2327 2594 0 0.06
1 6 20 0 0.01 2315 2594 0 0.07 0.00 99.92 0.00 29
1 6 48 0 0.01 2120 2594 0 0.07
1 8 21 0 0.01 2482 2594 0 0.07 0.00 99.92 0.00 29
1 8 49 0 0.01 2311 2594 0 0.07
1 9 22 0 0.01 2372 2594 0 0.09 0.00 99.90 0.00 30
1 9 50 0 0.01 2509 2594 0 0.09
1 10 23 0 0.02 2147 2594 0 0.08 0.00 99.91 0.00 27
1 10 51 0 0.01 2477 2594 0 0.08
1 11 24 0 0.01 2138 2594 0 0.08 0.00 99.90 0.00 29
1 11 52 0 0.01 2365 2594 0 0.09
1 12 25 0 0.01 1965 2594 0 0.07 0.00 99.91 0.00 28
1 12 53 0 0.01 2447 2594 0 0.08
1 13 26 0 0.01 2476 2594 0 0.08 0.00 99.91 0.00 28
1 13 54 0 0.01 2282 2594 0 0.08
1 14 27 0 0.01 2386 2594 0 0.76 0.00 99.22 0.00 28
1 14 55 1 0.02 3065 2594 0 0.75
George Spelvin
2014-05-30 01:37:34 UTC
Post by Tim Chen
This is odd. On my Ivy Bridge system the CPU speed from /proc/cpuinfo
is at max freq once I set the performance governor.
The numbers above almost look like
the cpu frequency is fluctuating and an average is taken.
What version of the kernel are you running? Is
CONFIG_CPU_FREQ_GOV_PERFORMANCE compiled in?
Yes; I have

CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
# CONFIG_CPU_FREQ_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

However, scaling_available_governors lists only "performance powersave".
Post by Tim Chen
Does /sys/devices/system/cpu/cpu?/cpufreq/scaling_cur_freq
also change?
That file does not exist. However,
/sys/devices/system/cpu/cpu?/cpufreq/cpuinfo_cur_freq
exists and changes. Several snapshots:

        Snap1    Snap2    Snap3    Snap4
cpu0  1255875  1255875  1255875  1255875
cpu1  1202750  1202750  1202750  1415250
cpu2  1680875  1255875  1468375  1468375
cpu3  1202750  1255875  1521500  1521500
cpu4  1946500  1255875  1255875  1255875
cpu5  2690250  2371500  1946500  1734000
Post by Tim Chen
Can you check what are the available governors in your system
and available frequencies?
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
performance powersave
Post by Tim Chen
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies: No such file or directory
$ ls /sys/devices/system/cpu/cpu0/cpufreq/
affected_cpus cpuinfo_transition_latency scaling_governor
cpuinfo_cur_freq related_cpus scaling_max_freq
cpuinfo_max_freq scaling_available_governors scaling_min_freq
cpuinfo_min_freq scaling_driver scaling_setspeed
$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
<unsupported>
Post by Tim Chen
If userspace governor is available, you can try set the governor
to userspace, then pin frequency to 3400 MHz (assuming that's your
I'll have to recompile and reboot, but sure.

Do you want me to change from the intel_pstate driver while I'm at it?
Post by Tim Chen
BTW, why do you place the K table in .text, instead of .rodata?
Because the jump table before it was in .text, and if I try to move
*that* to .rodata I get a linker error. So I just put the K_table
right next to it.

However, it's all moot: my current v3 does move K_table to .rodata.
Tim Chen
2014-05-30 17:01:11 UTC
That's very small (less than 0.2%) so I think it's acceptable.
Thank you! May I take this as an Acked-by: ?
Yes, with the caveat that you still have a v3 of this patch
that moves the K table to .rodata.

Tim
I'll work on some performance improvements, but they probably
won't be ready for the 3.16 merge window.
George Spelvin
2014-06-07 03:08:58 UTC
There's no need for the K_table to be made of 64-bit words. For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit values in there. They can
all be reduced to 32 bits.
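
A rough illustration of that reduction (not part of the patch): judging from the old and new table values in the diff, the constants appear to be stored bit-reflected and shifted left by one, so the full 33-bit CRC32C polynomial reads 0x105ec76f1, and an entry with bit 32 set is reduced by XORing that value in. The helper name reduce33 is made up for this sketch, and the layout is an inference from the table, not something the original authors documented.

```python
# CRC32C (Castagnoli) polynomial, bit-reflected, written as a 33-bit value.
# Assumption: this layout is inferred from the table values in the patch.
CRC32C_POLY_33 = (0x82F63B78 << 1) | 1   # 0x105ec76f1

def reduce33(k):
    """Reduce a 33-bit table constant to 32 bits modulo the polynomial."""
    if k >> 32:
        k ^= CRC32C_POLY_33
    return k

# (old, new) pairs copied from the first rows of the diff.
pairs = [
    (0x14cd00bd6, 0x493c7d27),
    (0x105ec76f0, 0x00000001),
    (0x0ba4fc28e, 0xba4fc28e),
    (0x1d82c63da, 0xddc0152b),
    (0x0f20c0dfe, 0xf20c0dfe),
]
for old, new in pairs:
    assert reduce33(old) == new
```

Entries whose top bit was already clear are unchanged, which is why roughly half of the table keeps its old low 32 bits.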

Doing that cuts the table size in half. Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.
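
To make the size and addressing change concrete, here is a small model of it. This is a sketch, not kernel code: table_bytes, entry_offset and the Python pmovzxdq function are stand-ins invented for illustration. The model assumes the instruction's memory form reads 8 bytes as two little-endian 32-bit words and zero-extends each into a 64-bit lane.

```python
import struct

ENTRIES = 128  # "Table is 128 entries"

def table_bytes(entry_size):
    """Total table size for a given per-entry size in bytes."""
    return ENTRIES * entry_size

def entry_offset(idx, entry_size):
    """Byte offset of entry idx; the first entry is for idx 1,
    which is why the code uses K_table-16 (old) or K_table-8 (new)."""
    return (idx - 1) * entry_size

def pmovzxdq(mem, offset):
    """Model of a pmovzxdq load: two 32-bit words, each zero-extended
    into its own 64-bit lane (returned as a (low, high) tuple)."""
    lo, hi = struct.unpack_from("<II", mem, offset)
    return lo, hi

# First two rows of the new table from the patch.
k_table = struct.pack("<IIII", 0x493c7d27, 0x00000001,
                               0xba4fc28e, 0x493c7d27)

print(table_bytes(16), table_bytes(8))   # old size, new size in bytes
print([hex(x) for x in pmovzxdq(k_table, entry_offset(2, 8))])
```

With 16-byte entries the table occupies 2048 bytes; with 8-byte entries it occupies 1024, which is the 1K saving.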

This adds (measured on Ivy Bridge) 1 cycle per main loop iteration
(CRC of up to 3K bytes), less than 0.2%. The hope is that the reduced
D-cache footprint will make up the loss in other code.

Two other related fixes:
* K_table is read-only, so belongs in .rodata, and
* There's no need for more than 8-byte alignment

Acked-by: Tim Chen <***@linux.intel.com>
Signed-off-by: George Spelvin <***@horizon.com>
---
Having been tweaked, benchmarked and acked, I think this is ready to
be merged.

My initial attempts at additional speedups resulted in slowdowns;
apparently Intel coders are fairly good at optimization. :-)

arch/x86/crypto/crc32c-pcl-intel-asm_64.S | 281 +++++++++++++++---------------
1 file changed, 139 insertions(+), 142 deletions(-)

diff --git a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
index dbc4339b..26d49eba 100644
--- a/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
+++ b/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
@@ -72,6 +72,7 @@

# unsigned int crc_pcl(u8 *buffer, int len, unsigned int crc_init);

+.text
ENTRY(crc_pcl)
#define bufp %rdi
#define bufp_dw %edi
@@ -216,15 +217,11 @@ LABEL crc_ %i
## 4) Combine three results:
################################################################

- lea (K_table-16)(%rip), bufp # first entry is for idx 1
+ lea (K_table-8)(%rip), bufp # first entry is for idx 1
shlq $3, %rax # rax *= 8
- subq %rax, tmp # tmp -= rax*8
- shlq $1, %rax
- subq %rax, tmp # tmp -= rax*16
- # (total tmp -= rax*24)
- addq %rax, bufp
-
- movdqa (bufp), %xmm0 # 2 consts: K1:K2
+ pmovzxdq (bufp,%rax), %xmm0 # 2 consts: K1:K2
+ leal (%eax,%eax,2), %eax # rax *= 3 (total *24)
+ subq %rax, tmp # tmp -= rax*24

movq crc_init, %xmm1 # CRC for block 1
PCLMULQDQ 0x00,%xmm0,%xmm1 # Multiply by K2
@@ -238,9 +235,9 @@ LABEL crc_ %i
mov crc2, crc_init
crc32 %rax, crc_init

-################################################################
-## 5) Check for end:
-################################################################
+ ################################################################
+ ## 5) Check for end:
+ ################################################################

LABEL crc_ 0
mov tmp, len
@@ -331,136 +328,136 @@ ENDPROC(crc_pcl)

################################################################
## PCLMULQDQ tables
- ## Table is 128 entries x 2 quad words each
+ ## Table is 128 entries x 2 words (8 bytes) each
################################################################
-.data
-.align 64
+.section .rodata, "a", %progbits
+.align 8
K_table:
- .quad 0x14cd00bd6,0x105ec76f0
- .quad 0x0ba4fc28e,0x14cd00bd6
- .quad 0x1d82c63da,0x0f20c0dfe
- .quad 0x09e4addf8,0x0ba4fc28e
- .quad 0x039d3b296,0x1384aa63a
- .quad 0x102f9b8a2,0x1d82c63da
- .quad 0x14237f5e6,0x01c291d04
- .quad 0x00d3b6092,0x09e4addf8
- .quad 0x0c96cfdc0,0x0740eef02
- .quad 0x18266e456,0x039d3b296
- .quad 0x0daece73e,0x0083a6eec
- .quad 0x0ab7aff2a,0x102f9b8a2
- .quad 0x1248ea574,0x1c1733996
- .quad 0x083348832,0x14237f5e6
- .quad 0x12c743124,0x02ad91c30
- .quad 0x0b9e02b86,0x00d3b6092
- .quad 0x018b33a4e,0x06992cea2
- .quad 0x1b331e26a,0x0c96cfdc0
- .quad 0x17d35ba46,0x07e908048
- .quad 0x1bf2e8b8a,0x18266e456
- .quad 0x1a3e0968a,0x11ed1f9d8
- .quad 0x0ce7f39f4,0x0daece73e
- .quad 0x061d82e56,0x0f1d0f55e
- .quad 0x0d270f1a2,0x0ab7aff2a
- .quad 0x1c3f5f66c,0x0a87ab8a8
- .quad 0x12ed0daac,0x1248ea574
- .quad 0x065863b64,0x08462d800
- .quad 0x11eef4f8e,0x083348832
- .quad 0x1ee54f54c,0x071d111a8
- .quad 0x0b3e32c28,0x12c743124
- .quad 0x0064f7f26,0x0ffd852c6
- .quad 0x0dd7e3b0c,0x0b9e02b86
- .quad 0x0f285651c,0x0dcb17aa4
- .quad 0x010746f3c,0x018b33a4e
- .quad 0x1c24afea4,0x0f37c5aee
- .quad 0x0271d9844,0x1b331e26a
- .quad 0x08e766a0c,0x06051d5a2
- .quad 0x093a5f730,0x17d35ba46
- .quad 0x06cb08e5c,0x11d5ca20e
- .quad 0x06b749fb2,0x1bf2e8b8a
- .quad 0x1167f94f2,0x021f3d99c
- .quad 0x0cec3662e,0x1a3e0968a
- .quad 0x19329634a,0x08f158014
- .quad 0x0e6fc4e6a,0x0ce7f39f4
- .quad 0x08227bb8a,0x1a5e82106
- .quad 0x0b0cd4768,0x061d82e56
- .quad 0x13c2b89c4,0x188815ab2
- .quad 0x0d7a4825c,0x0d270f1a2
- .quad 0x10f5ff2ba,0x105405f3e
- .quad 0x00167d312,0x1c3f5f66c
- .quad 0x0f6076544,0x0e9adf796
- .quad 0x026f6a60a,0x12ed0daac
- .quad 0x1a2adb74e,0x096638b34
- .quad 0x19d34af3a,0x065863b64
- .quad 0x049c3cc9c,0x1e50585a0
- .quad 0x068bce87a,0x11eef4f8e
- .quad 0x1524fa6c6,0x19f1c69dc
- .quad 0x16cba8aca,0x1ee54f54c
- .quad 0x042d98888,0x12913343e
- .quad 0x1329d9f7e,0x0b3e32c28
- .quad 0x1b1c69528,0x088f25a3a
- .quad 0x02178513a,0x0064f7f26
- .quad 0x0e0ac139e,0x04e36f0b0
- .quad 0x0170076fa,0x0dd7e3b0c
- .quad 0x141a1a2e2,0x0bd6f81f8
- .quad 0x16ad828b4,0x0f285651c
- .quad 0x041d17b64,0x19425cbba
- .quad 0x1fae1cc66,0x010746f3c
- .quad 0x1a75b4b00,0x18db37e8a
- .quad 0x0f872e54c,0x1c24afea4
- .quad 0x01e41e9fc,0x04c144932
- .quad 0x086d8e4d2,0x0271d9844
- .quad 0x160f7af7a,0x052148f02
- .quad 0x05bb8f1bc,0x08e766a0c
- .quad 0x0a90fd27a,0x0a3c6f37a
- .quad 0x0b3af077a,0x093a5f730
- .quad 0x04984d782,0x1d22c238e
- .quad 0x0ca6ef3ac,0x06cb08e5c
- .quad 0x0234e0b26,0x063ded06a
- .quad 0x1d88abd4a,0x06b749fb2
- .quad 0x04597456a,0x04d56973c
- .quad 0x0e9e28eb4,0x1167f94f2
- .quad 0x07b3ff57a,0x19385bf2e
- .quad 0x0c9c8b782,0x0cec3662e
- .quad 0x13a9cba9e,0x0e417f38a
- .quad 0x093e106a4,0x19329634a
- .quad 0x167001a9c,0x14e727980
- .quad 0x1ddffc5d4,0x0e6fc4e6a
- .quad 0x00df04680,0x0d104b8fc
- .quad 0x02342001e,0x08227bb8a
- .quad 0x00a2a8d7e,0x05b397730
- .quad 0x168763fa6,0x0b0cd4768
- .quad 0x1ed5a407a,0x0e78eb416
- .quad 0x0d2c3ed1a,0x13c2b89c4
- .quad 0x0995a5724,0x1641378f0
- .quad 0x19b1afbc4,0x0d7a4825c
- .quad 0x109ffedc0,0x08d96551c
- .quad 0x0f2271e60,0x10f5ff2ba
- .quad 0x00b0bf8ca,0x00bf80dd2
- .quad 0x123888b7a,0x00167d312
- .quad 0x1e888f7dc,0x18dcddd1c
- .quad 0x002ee03b2,0x0f6076544
- .quad 0x183e8d8fe,0x06a45d2b2
- .quad 0x133d7a042,0x026f6a60a
- .quad 0x116b0f50c,0x1dd3e10e8
- .quad 0x05fabe670,0x1a2adb74e
- .quad 0x130004488,0x0de87806c
- .quad 0x000bcf5f6,0x19d34af3a
- .quad 0x18f0c7078,0x014338754
- .quad 0x017f27698,0x049c3cc9c
- .quad 0x058ca5f00,0x15e3e77ee
- .quad 0x1af900c24,0x068bce87a
- .quad 0x0b5cfca28,0x0dd07448e
- .quad 0x0ded288f8,0x1524fa6c6
- .quad 0x059f229bc,0x1d8048348
- .quad 0x06d390dec,0x16cba8aca
- .quad 0x037170390,0x0a3e3e02c
- .quad 0x06353c1cc,0x042d98888
- .quad 0x0c4584f5c,0x0d73c7bea
- .quad 0x1f16a3418,0x1329d9f7e
- .quad 0x0531377e2,0x185137662
- .quad 0x1d8d9ca7c,0x1b1c69528
- .quad 0x0b25b29f2,0x18a08b5bc
- .quad 0x19fb2a8b0,0x02178513a
- .quad 0x1a08fe6ac,0x1da758ae0
- .quad 0x045cddf4e,0x0e0ac139e
- .quad 0x1a91647f2,0x169cf9eb0
- .quad 0x1a0f717c4,0x0170076fa
+ .long 0x493c7d27, 0x00000001
+ .long 0xba4fc28e, 0x493c7d27
+ .long 0xddc0152b, 0xf20c0dfe
+ .long 0x9e4addf8, 0xba4fc28e
+ .long 0x39d3b296, 0x3da6d0cb
+ .long 0x0715ce53, 0xddc0152b
+ .long 0x47db8317, 0x1c291d04
+ .long 0x0d3b6092, 0x9e4addf8
+ .long 0xc96cfdc0, 0x740eef02
+ .long 0x878a92a7, 0x39d3b296
+ .long 0xdaece73e, 0x083a6eec
+ .long 0xab7aff2a, 0x0715ce53
+ .long 0x2162d385, 0xc49f4f67
+ .long 0x83348832, 0x47db8317
+ .long 0x299847d5, 0x2ad91c30
+ .long 0xb9e02b86, 0x0d3b6092
+ .long 0x18b33a4e, 0x6992cea2
+ .long 0xb6dd949b, 0xc96cfdc0
+ .long 0x78d9ccb7, 0x7e908048
+ .long 0xbac2fd7b, 0x878a92a7
+ .long 0xa60ce07b, 0x1b3d8f29
+ .long 0xce7f39f4, 0xdaece73e
+ .long 0x61d82e56, 0xf1d0f55e
+ .long 0xd270f1a2, 0xab7aff2a
+ .long 0xc619809d, 0xa87ab8a8
+ .long 0x2b3cac5d, 0x2162d385
+ .long 0x65863b64, 0x8462d800
+ .long 0x1b03397f, 0x83348832
+ .long 0xebb883bd, 0x71d111a8
+ .long 0xb3e32c28, 0x299847d5
+ .long 0x064f7f26, 0xffd852c6
+ .long 0xdd7e3b0c, 0xb9e02b86
+ .long 0xf285651c, 0xdcb17aa4
+ .long 0x10746f3c, 0x18b33a4e
+ .long 0xc7a68855, 0xf37c5aee
+ .long 0x271d9844, 0xb6dd949b
+ .long 0x8e766a0c, 0x6051d5a2
+ .long 0x93a5f730, 0x78d9ccb7
+ .long 0x6cb08e5c, 0x18b0d4ff
+ .long 0x6b749fb2, 0xbac2fd7b
+ .long 0x1393e203, 0x21f3d99c
+ .long 0xcec3662e, 0xa60ce07b
+ .long 0x96c515bb, 0x8f158014
+ .long 0xe6fc4e6a, 0xce7f39f4
+ .long 0x8227bb8a, 0xa00457f7
+ .long 0xb0cd4768, 0x61d82e56
+ .long 0x39c7ff35, 0x8d6d2c43
+ .long 0xd7a4825c, 0xd270f1a2
+ .long 0x0ab3844b, 0x00ac29cf
+ .long 0x0167d312, 0xc619809d
+ .long 0xf6076544, 0xe9adf796
+ .long 0x26f6a60a, 0x2b3cac5d
+ .long 0xa741c1bf, 0x96638b34
+ .long 0x98d8d9cb, 0x65863b64
+ .long 0x49c3cc9c, 0xe0e9f351
+ .long 0x68bce87a, 0x1b03397f
+ .long 0x57a3d037, 0x9af01f2d
+ .long 0x6956fc3b, 0xebb883bd
+ .long 0x42d98888, 0x2cff42cf
+ .long 0x3771e98f, 0xb3e32c28
+ .long 0xb42ae3d9, 0x88f25a3a
+ .long 0x2178513a, 0x064f7f26
+ .long 0xe0ac139e, 0x4e36f0b0
+ .long 0x170076fa, 0xdd7e3b0c
+ .long 0x444dd413, 0xbd6f81f8
+ .long 0x6f345e45, 0xf285651c
+ .long 0x41d17b64, 0x91c9bd4b
+ .long 0xff0dba97, 0x10746f3c
+ .long 0xa2b73df1, 0x885f087b
+ .long 0xf872e54c, 0xc7a68855
+ .long 0x1e41e9fc, 0x4c144932
+ .long 0x86d8e4d2, 0x271d9844
+ .long 0x651bd98b, 0x52148f02
+ .long 0x5bb8f1bc, 0x8e766a0c
+ .long 0xa90fd27a, 0xa3c6f37a
+ .long 0xb3af077a, 0x93a5f730
+ .long 0x4984d782, 0xd7c0557f
+ .long 0xca6ef3ac, 0x6cb08e5c
+ .long 0x234e0b26, 0x63ded06a
+ .long 0xdd66cbbb, 0x6b749fb2
+ .long 0x4597456a, 0x4d56973c
+ .long 0xe9e28eb4, 0x1393e203
+ .long 0x7b3ff57a, 0x9669c9df
+ .long 0xc9c8b782, 0xcec3662e
+ .long 0x3f70cc6f, 0xe417f38a
+ .long 0x93e106a4, 0x96c515bb
+ .long 0x62ec6c6d, 0x4b9e0f71
+ .long 0xd813b325, 0xe6fc4e6a
+ .long 0x0df04680, 0xd104b8fc
+ .long 0x2342001e, 0x8227bb8a
+ .long 0x0a2a8d7e, 0x5b397730
+ .long 0x6d9a4957, 0xb0cd4768
+ .long 0xe8b6368b, 0xe78eb416
+ .long 0xd2c3ed1a, 0x39c7ff35
+ .long 0x995a5724, 0x61ff0e01
+ .long 0x9ef68d35, 0xd7a4825c
+ .long 0x0c139b31, 0x8d96551c
+ .long 0xf2271e60, 0x0ab3844b
+ .long 0x0b0bf8ca, 0x0bf80dd2
+ .long 0x2664fd8b, 0x0167d312
+ .long 0xed64812d, 0x8821abed
+ .long 0x02ee03b2, 0xf6076544
+ .long 0x8604ae0f, 0x6a45d2b2
+ .long 0x363bd6b3, 0x26f6a60a
+ .long 0x135c83fd, 0xd8d26619
+ .long 0x5fabe670, 0xa741c1bf
+ .long 0x35ec3279, 0xde87806c
+ .long 0x00bcf5f6, 0x98d8d9cb
+ .long 0x8ae00689, 0x14338754
+ .long 0x17f27698, 0x49c3cc9c
+ .long 0x58ca5f00, 0x5bd2011f
+ .long 0xaa7c7ad5, 0x68bce87a
+ .long 0xb5cfca28, 0xdd07448e
+ .long 0xded288f8, 0x57a3d037
+ .long 0x59f229bc, 0xdde8f5b9
+ .long 0x6d390dec, 0x6956fc3b
+ .long 0x37170390, 0xa3e3e02c
+ .long 0x6353c1cc, 0x42d98888
+ .long 0xc4584f5c, 0xd73c7bea
+ .long 0xf48642e9, 0x3771e98f
+ .long 0x531377e2, 0x80ff0093
+ .long 0xdd35bc8d, 0xb42ae3d9
+ .long 0xb25b29f2, 0x8fe4c34d
+ .long 0x9a5ede41, 0x2178513a
+ .long 0xa563905d, 0xdf99fc11
+ .long 0x45cddf4e, 0xe0ac139e
+ .long 0xacfa3103, 0x6c23e841
+ .long 0xa51b6135, 0x170076fa
--
2.0.0