Discussion:
[RFC] x86: xsave/xrstor support, ucontext_t extensions
(too old to reply)
Suresh Siddha
2008-05-13 01:10:30 UTC
Permalink
hi,

Appended patch adds the support for xsave/xrstor infrastructure for x86.
xsave/xrstor manages the existing and future processor extended states in x86
architecutre.

More info on xsave/xrstor can be found in the Intel SDM's located at
http://www.intel.com/products/processor/manuals/index.htm

Please let me know your feedback and comments. Specifically, I am not sure
if I break anything or make anyone's life harder with the ucontext_t extensions
that are proposed in the patch.

Similar to fpstate, xsave state need to be saved/restored across signals.
x86 sigcontext doesn't seem to have any unused space, while x86_64 has
some unused space(reserved1[8]) in sigcontext.

To keep it consistent across 32bit and 64bit, ucontext is extended with
the new state context. Please review and let me know if you foresee any
compatibility or other issues with these extensions.

BTW, Traditionally glibc has this definition for struct ucontext.

/* Userlevel context. */
typedef struct ucontext
{
unsigned long int uc_flags;
struct ucontext *uc_link;
stack_t uc_stack;
mcontext_t uc_mcontext;
__sigset_t uc_sigmask;
struct _libc_fpstate __fpregs_mem;
} ucontext_t;

And application uses the same structure for get/setcontext() routines and
to refer process context in signal handler routines.

Kernel which sets up the signal handling context has this definition:

struct ucontext {
unsigned long uc_flags;
struct ucontext *uc_link;
stack_t uc_stack;
struct sigcontext uc_mcontext;
sigset_t uc_sigmask;
};

Though the kernel Vs user ucontext look somewhat similar, kernel's ucontext
struct is different from glibc's ucontext struct because of sigset_t size
differences between user Vs kernel. So fpstate in signal handlers must always
be referred through the pointer in sigcontext, and not directly through
__fpregs_mem in userlevels ucontext struct.

glibc perhaps need to use different context structures, one for
get/setcontext() and another for signal handling? Signal handling
context will be governed by the kernel and context info
referred by get/setcontext() will be governed by glibc. This is specfically
needed if glibc's get/setcontext() want to play with xsave info aswell.

This kernel patch is adding a pointer to ucontext for representing
xsave context (size of this area will be determined by the processor
and kernel capabilities). If at some point, this state need to be saved/restored
by get/setcontext() glibc routines and if they want to support application usage
like:

ucontext_t context;

void save()
{
getcontext(&context);
}

void restore()
{
setcontext(&context);
}

then, context information used by get/setcontext() need to evolve independently
from the signal handler context information provided by the kernel.

Comments?

thanks,
suresh

---
[RFC] x86: xsave/xrstor support

The layout of the xsave/xrstor area extends from the 512-byte FXSAVE/FXRSTOR
layout. xsave/xrstor area layout consists of:

- fxsave/fxrstor area (512 bytes)
- xsave header area (64 bytes)
- set of save areas, each corresponding to a processor extended state

The number of save areas, the offset and the size of each save area is
enumerated by CPUID leaf function 0xd.

This patch includes the basic xsave/xrstor infrastructure, which includes:
- context switch support, extending traditional lazy restore mechanism
- signal handling support, extending ucontext_t

Signed-off-by: Suresh Siddha <***@intel.com>
---

Index: linux-2.6-x86/arch/x86/kernel/xsave.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-x86/arch/x86/kernel/xsave.c 2008-05-12 13:09:56.000000000 -0700
@@ -0,0 +1,272 @@
+/*
+ * xsave/xrstor support.
+ *
+ * Suresh Siddha <***@intel.com>
+ */
+#include <linux/bootmem.h>
+#include <linux/compat.h>
+#include <asm/i387.h>
+
+#ifdef CONFIG_X86_64
+#include <asm/sigcontext32.h>
+#endif
+
+unsigned int pcntxt_hmask, pcntxt_lmask;
+struct xsave_struct *init_xstate_buf;
+
+#ifdef CONFIG_X86_64
+/*
+ * Signal frame handlers.
+ */
+int save_i387_xstate(void __user *buf)
+{
+ struct task_struct *tsk = current;
+ int err = 0;
+
+ if ((unsigned long)buf % 64)
+ printk("save_i387_xstate: bad xstate %p\n", buf);
+
+ clear_used_math(); /* trigger finit */
+ if (task_thread_info(tsk)->status & TS_USEDFPU) {
+ if (cpu_has_xsave)
+ err = save_xstate_checking(buf);
+ else
+ err = save_i387_checking(buf);
+
+ if (err)
+ return err;
+
+ task_thread_info(tsk)->status &= ~TS_USEDFPU;
+ stts();
+ } else {
+ if (__copy_to_user(buf, &tsk->thread.xstate->fxsave,
+ xstate_size))
+ return -1;
+ }
+
+ return 0;
+}
+
+/*
+ * This restores directly out of user space. Exceptions are handled.
+ */
+int restore_i387_xstate(struct _fpstate __user *buf,
+ struct xstate_cntxt __user *buf1)
+{
+ int err = 0;
+ int size = 0;
+ unsigned int lmask = 0, hmask = 0;
+ struct _xstate __user *xstate = 0;
+
+ if (cpu_has_xsave) {
+ __get_user(size, &buf1->size);
+ __get_user(lmask, &buf1->lmask);
+ __get_user(hmask, &buf1->hmask);
+ __get_user(xstate, &buf1->xstate);
+ }
+
+ if (!buf && !xstate)
+ goto init_state;
+
+ set_used_math();
+ if (!(task_thread_info(current)->status & TS_USEDFPU)) {
+ clts();
+ task_thread_info(current)->status |= TS_USEDFPU;
+ }
+
+ if (xstate) {
+ err = xrestore_checking((struct xsave_struct *) xstate, size,
+ lmask, hmask);
+ if (err)
+ goto init_state;
+ /*
+ * initialize the other extended state that the kernel
+ * knows and not specifed in the user restore masks.
+ */
+ init_xstate(pcntxt_lmask & ~(XSTATE_FPSSE | lmask),
+ pcntxt_hmask & ~hmask);
+ } else
+ init_xstate(pcntxt_lmask & ~XSTATE_FPSSE, pcntxt_hmask);
+
+ if (buf) {
+ if (!access_ok(VERIFY_READ, buf, sizeof(*buf))) {
+ err = -1;
+ goto init_state;
+ }
+
+ err = restore_fpu_checking(buf);
+ if (err)
+ goto init_state;
+ } else
+ init_xstate(XSTATE_FPSSE, 0);
+
+ if (!err)
+ return 0;
+
+init_state:
+ if (used_math()) {
+ clear_fpu(current);
+ clear_used_math();
+ }
+
+ return err;
+}
+#endif
+
+#ifndef CONFIG_X86_64
+# define _fpstate_ia32 _fpstate
+# define xstate_cntxt_ia32 xstate_cntxt
+#endif
+
+/*
+ * FP and extended context restore during signal return. Extended state is
+ * restored directly from user space. Exceptions are handled.
+ */
+int restore_user_xstate(struct _fpstate_ia32 __user *buf,
+ struct xstate_cntxt_ia32 __user *buf1)
+{
+ int err = 0;
+ struct task_struct *tsk = current;
+ int size = 0, lmask = 0, hmask = 0;
+ struct _xstate *xstate;
+
+ if (!buf && !buf1)
+ goto init_state;
+
+ set_used_math();
+ if (!(task_thread_info(current)->status & TS_USEDFPU)) {
+ clts();
+ task_thread_info(current)->status |= TS_USEDFPU;
+ }
+
+ __get_user(size, &buf1->size);
+ __get_user(lmask, &buf1->lmask);
+ __get_user(hmask, &buf1->hmask);
+
+#ifdef CONFIG_IA32_EMULATION
+ {
+ u32 tmp;
+ __get_user(tmp, &buf1->xstate);
+ xstate = compat_ptr(tmp);
+ }
+#else
+ __get_user(xstate, &buf1->xstate);
+#endif
+
+ if (xstate) {
+ /*
+ * Restore directly from the user space and handle the possible
+ * exception. This way, we don't have to do manual error
+ * checking on the user buffer contents.
+ */
+ err = xrestore_checking((__force struct xsave_struct *) xstate,
+ size, lmask, hmask);
+ if (err)
+ return err;
+ /*
+ * initialize the other extended state that the kernel
+ * knows and not specifed in the user restore masks.
+ */
+ init_xstate(pcntxt_lmask & ~(XSTATE_FPSSE | lmask),
+ pcntxt_hmask & ~hmask);
+ } else
+ init_xstate(pcntxt_lmask & ~XSTATE_FPSSE, pcntxt_hmask);
+
+ /*
+ * FP and SSE state can't be restored directly from the userspace
+ * because of legacy reasons. Lets restore it to the fpstate
+ * in the task struct.
+ */
+ unlazy_fpu(tsk);
+
+ if (buf) {
+ /*
+ * legacy FP and SSE restore.
+ */
+ err = restore_i387_fxsave(buf);
+ if (err)
+ return err;
+ } else
+ /*
+ * initialize tasks fpstate in the memory.
+ */
+ init_task_fpstate(tsk);
+
+ return err;
+
+init_state:
+ if (used_math()) {
+ clear_fpu(current);
+ clear_used_math();
+ }
+
+ return err;
+}
+
+/*
+ * Enable the extended processor state save/restore feature
+ */
+void __cpuinit xsave_init(void)
+{
+ if (!cpu_has_xsave)
+ return;
+
+ set_in_cr4(X86_CR4_OSXSAVE);
+
+ /*
+ * Enable all the features that the HW is capable of
+ * and the Linux kernel is aware of.
+ *
+ * xsetbv();
+ */
+ asm volatile(".byte 0x0f,0x01,0xd1"::"c" (0),
+ "a" (pcntxt_lmask), "d" (pcntxt_hmask));
+}
+
+/*
+ * setup the xstate image representing the init state
+ */
+void setup_xstate_init(void)
+{
+ init_xstate_buf = alloc_bootmem(xstate_size);
+ init_xstate_buf->i387.mxcsr = MXCSR_DEFAULT;
+}
+
+/*
+ * Enable and initialize the xsave feature.
+ */
+void __init xsave_cntxt_init(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+
+ cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
+
+ pcntxt_lmask = eax;
+ pcntxt_hmask = edx;
+
+ if ((pcntxt_lmask & XSTATE_FPSSE) != XSTATE_FPSSE) {
+ printk("FP/SSE not shown under xsave features %x\n",
+ pcntxt_lmask);
+ BUG();
+ }
+
+ /*
+ * for now OS knows only about FP/SSE
+ */
+ pcntxt_lmask = pcntxt_lmask & XCNTXT_LMASK;
+ pcntxt_hmask = pcntxt_hmask & XCNTXT_HMASK;
+
+ xsave_init();
+
+ /*
+ * Recompute the context size for enabled features
+ */
+ cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
+
+ xstate_size = ebx;
+
+ setup_xstate_init();
+
+ printk("xsave/xrstor: cntxt size %x, supported lmask %x, hmask %x\n",
+ xstate_size, pcntxt_lmask, pcntxt_hmask);
+}
Index: linux-2.6-x86/include/asm-x86/xsave.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-x86/include/asm-x86/xsave.h 2008-05-12 14:43:07.000000000 -0700
@@ -0,0 +1,120 @@
+#ifndef __ASM_X86_64_XSAVE_H
+#define __ASM_X86_64_XSAVE_H
+
+#include <asm/processor.h>
+#include <asm/i387.h>
+
+#define XSTATE_FP 0x1
+#define XSTATE_SSE 0x2
+
+#define XSTATE_FPSSE (XSTATE_FP | XSTATE_SSE)
+
+#define FXSAVE_SIZE 512
+
+/*
+ * These are the masks that OS can handle currently.
+ */
+#define XCNTXT_LMASK (XSTATE_FP | XSTATE_SSE)
+#define XCNTXT_HMASK 0x0
+
+#ifdef CONFIG_X86_64
+#define REX_PREFIX "0x48, "
+#else
+#define REX_PREFIX
+#endif
+
+extern unsigned int xstate_size, pcntxt_hmask, pcntxt_lmask;
+extern struct xsave_struct *init_xstate_buf;
+
+extern void xsave_cntxt_init(void);
+extern void xsave_init(void);
+
+static inline void xrstor(struct xsave_struct *fx)
+{
+ asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t"
+ :: "D" (fx), "m" (*fx), "a" (-1), "d" (-1) : "memory");
+}
+
+static inline int xsave(struct task_struct *tsk)
+{
+ /* This, however, we can work around by forcing the compiler to select
+ an addressing mode that doesn't require extended registers. */
+ __asm__ __volatile__(".byte " REX_PREFIX "0x0f,0xae,0x27"
+ ::"D" (&(tsk->thread.xstate->xsave)),
+ "a" (-1), "d"(-1) : "memory");
+
+ return 0;
+}
+
+static inline int save_xstate_checking(struct xsave_struct __user *buf)
+{
+ int err;
+ __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x27\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ ".section __ex_table,\"a\"\n"
+ _ASM_ALIGN "\n"
+ _ASM_PTR "1b,3b\n"
+ ".previous"
+ : [err] "=r" (err)
+ : "D" (buf), "a" (-1), "d" (-1), "0" (0)
+ : "memory");
+ if (unlikely(err) && __clear_user(buf, xstate_size))
+ err = -EFAULT;
+ /* No need to clear here because the caller clears USED_MATH */
+ return err;
+}
+
+static inline int xrestore_checking(struct xsave_struct __user *buf,
+ int size, unsigned int lmask,
+ unsigned int hmask)
+{
+ int err;
+ struct xsave_struct *xstate = ((__force struct xsave_struct *)buf);
+ int eax = lmask & ~0x3;
+ int edx = hmask;
+
+ if (!access_ok(VERIFY_READ, buf, size))
+ return -1;
+
+ __asm__ __volatile__("1: .byte " REX_PREFIX "0x0f,0xae,0x2f\n"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ ".section __ex_table,\"a\"\n"
+ _ASM_ALIGN "\n"
+ _ASM_PTR "1b,3b\n"
+ ".previous"
+ : [err] "=r" (err)
+ : "D" (xstate), "a" (eax), "d" (edx), "0" (0)
+ : "memory"); //memory required?
+ return err;
+}
+
+static inline void xrstor_state(struct xsave_struct *fx, int lmask, int hmask)
+{
+ asm volatile(".byte " REX_PREFIX "0x0f,0xae,0x2f\n\t"
+ :: "D" (fx), "m" (*fx), "a" (lmask), "d" (hmask)
+ : "memory");
+}
+
+static inline void init_xstate(int lmask, int hmask)
+{
+ if (cpu_has_xsave)
+ xrstor_state(init_xstate_buf, lmask, hmask);
+}
+
+static inline void init_task_fpstate(struct task_struct *tsk)
+{
+ struct xsave_struct *xstate = &tsk->thread.xstate->xsave;
+ if (cpu_has_xsave) {
+ xstate->xsave_hdr.xstate_bv &= ~XSTATE_FPSSE;
+ xstate->i387.mxcsr = MXCSR_DEFAULT;
+ }
+}
+#endif
Index: linux-2.6-x86/include/asm-x86/processor-flags.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/processor-flags.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/processor-flags.h 2008-05-12 13:09:56.000000000 -0700
@@ -59,6 +59,7 @@
#define X86_CR4_OSFXSR 0x00000200 /* enable fast FPU save and restore */
#define X86_CR4_OSXMMEXCPT 0x00000400 /* enable unmasked SSE exceptions */
#define X86_CR4_VMXE 0x00002000 /* enable VMX virtualization */
+#define X86_CR4_OSXSAVE 0x00040000 /* enable xsave and xrestore */

/*
* x86-64 Task Priority Register, CR8
Index: linux-2.6-x86/arch/x86/kernel/traps_64.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/traps_64.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/traps_64.c 2008-05-12 13:09:56.000000000 -0700
@@ -1155,7 +1155,7 @@
}

clts(); /* Allow maths ops (or we recurse) */
- restore_fpu_checking(&me->thread.xstate->fxsave);
+ restore_fpu_xstate(me);
task_thread_info(me)->status |= TS_USEDFPU;
me->fpu_counter++;
}
@@ -1191,10 +1191,6 @@
#endif

/*
- * initialize the per thread extended state:
- */
- init_thread_xstate();
- /*
* Should be a barrier for any external CPU state.
*/
cpu_init();
Index: linux-2.6-x86/arch/x86/kernel/signal_64.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/signal_64.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/signal_64.c 2008-05-12 13:09:56.000000000 -0700
@@ -59,7 +59,7 @@
*/
static int
restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc,
- unsigned long *pax)
+ struct xstate_cntxt __user *uc_xstate, unsigned long *pax)
{
unsigned int err = 0;

@@ -99,24 +99,11 @@
struct _fpstate __user * buf;
err |= __get_user(buf, &sc->fpstate);

- if (buf) {
- if (!access_ok(VERIFY_READ, buf, sizeof(*buf)))
- goto badframe;
- err |= restore_i387(buf);
- } else {
- struct task_struct *me = current;
- if (used_math()) {
- clear_fpu(me);
- clear_used_math();
- }
- }
+ err |= restore_i387_xstate(buf, uc_xstate);
}

err |= __get_user(*pax, &sc->ax);
return err;
-
-badframe:
- return 1;
}

asmlinkage long sys_rt_sigreturn(struct pt_regs *regs)
@@ -137,7 +124,8 @@
recalc_sigpending();
spin_unlock_irq(&current->sighand->siglock);

- if (restore_sigcontext(regs, &frame->uc.uc_mcontext, &ax))
+ if (restore_sigcontext(regs, &frame->uc.uc_mcontext,
+ &frame->uc.uc_xstate, &ax))
goto badframe;

if (do_sigaltstack(&frame->uc.uc_stack, NULL, regs->sp) == -EFAULT)
@@ -155,7 +143,8 @@
*/

static inline int
-setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs, unsigned long mask, struct task_struct *me)
+setup_sigcontext(struct sigcontext __user *sc, struct pt_regs *regs,
+ unsigned long mask, struct task_struct *me)
{
int err = 0;

@@ -207,7 +196,7 @@
sp = current->sas_ss_sp + current->sas_ss_size;
}

- return (void __user *)round_down(sp - size, 16);
+ return (void __user *)round_down(sp - size, 64);
}

static int setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
@@ -219,14 +208,14 @@
struct task_struct *me = current;

if (used_math()) {
- fp = get_stack(ka, regs, sizeof(struct _fpstate));
+ fp = get_stack(ka, regs, xstate_size);
frame = (void __user *)round_down(
(unsigned long)fp - sizeof(struct rt_sigframe), 16) - 8;

- if (!access_ok(VERIFY_WRITE, fp, sizeof(struct _fpstate)))
+ if (!access_ok(VERIFY_WRITE, fp, xstate_size))
goto give_sigsegv;

- if (save_i387(fp) < 0)
+ if (save_i387_xstate(fp) < 0)
err |= -1;
} else
frame = get_stack(ka, regs, sizeof(struct rt_sigframe)) - 8;
@@ -249,6 +238,19 @@
err |= __put_user(me->sas_ss_size, &frame->uc.uc_stack.ss_size);
err |= setup_sigcontext(&frame->uc.uc_mcontext, regs, set->sig[0], me);
err |= __put_user(fp, &frame->uc.uc_mcontext.fpstate);
+
+ if (cpu_has_xsave) {
+ err |= __put_user(fp, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(xstate_size, &frame->uc.uc_xstate.size);
+ err |= __put_user(pcntxt_lmask, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->uc.uc_xstate.hmask);
+ } else {
+ err |= __put_user(0, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(0, &frame->uc.uc_xstate.size);
+ err |= __put_user(0, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(0, &frame->uc.uc_xstate.hmask);
+ }
+
if (sizeof(*set) == 16) {
__put_user(set->sig[0], &frame->uc.uc_sigmask.sig[0]);
__put_user(set->sig[1], &frame->uc.uc_sigmask.sig[1]);
Index: linux-2.6-x86/include/asm-x86/sigcontext.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/sigcontext.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/sigcontext.h 2008-05-12 14:54:21.000000000 -0700
@@ -202,4 +202,26 @@

#endif /* !__i386__ */

+struct _xsave_hdr_struct {
+ u64 xstate_bv;
+ u64 reserved1[2];
+ u64 reserved2[5];
+} __attribute__((packed));
+
+struct _xstate {
+ /*
+ * Applications need to refer to fpstate through fpstate pointer
+ * in sigcontext. Not here directly.
+ */
+ struct _fpstate fpstate;
+ struct _xsave_hdr_struct xsave_hdr;
+ /* new processor state extensions will go here */
+} __attribute__ ((aligned (64)));
+
+struct xstate_cntxt {
+ struct _xstate __user *xstate;
+ u32 size;
+ u32 lmask;
+ u32 hmask;
+};
#endif
Index: linux-2.6-x86/arch/x86/ia32/ia32_signal.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/ia32/ia32_signal.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/ia32/ia32_signal.c 2008-05-12 13:09:56.000000000 -0700
@@ -174,9 +174,10 @@
u32 pretcode;
int sig;
struct sigcontext_ia32 sc;
- struct _fpstate_ia32 fpstate;
+ struct xstate_cntxt_ia32 xst_cnxt;
unsigned int extramask[_COMPAT_NSIG_WORDS-1];
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};

struct rt_sigframe
@@ -187,8 +188,8 @@
u32 puc;
compat_siginfo_t info;
struct ucontext_ia32 uc;
- struct _fpstate_ia32 fpstate;
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};

#define COPY(x) { \
@@ -207,7 +208,8 @@

static int ia32_restore_sigcontext(struct pt_regs *regs,
struct sigcontext_ia32 __user *sc,
- unsigned int *peax)
+ unsigned int *peax,
+ struct xstate_cntxt_ia32 __user *xst_cntxt)
{
unsigned int tmpflags, gs, oldgs, err = 0;
struct _fpstate_ia32 __user *buf;
@@ -254,26 +256,13 @@

err |= __get_user(tmp, &sc->fpstate);
buf = compat_ptr(tmp);
- if (buf) {
- if (!access_ok(VERIFY_READ, buf, sizeof(*buf)))
- goto badframe;
- err |= restore_i387_ia32(buf);
- } else {
- struct task_struct *me = current;

- if (used_math()) {
- clear_fpu(me);
- clear_used_math();
- }
- }
+ err |= restore_i387_xstate_ia32(buf, xst_cntxt);

err |= __get_user(tmp, &sc->ax);
*peax = tmp;

return err;
-
-badframe:
- return 1;
}

asmlinkage long sys32_sigreturn(struct pt_regs *regs)
@@ -281,6 +270,7 @@
struct sigframe __user *frame = (struct sigframe __user *)(regs->sp-8);
sigset_t set;
unsigned int ax;
+ struct xstate_cntxt_ia32 __user *xst_cnxt = &frame->xst_cnxt;

if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
goto badframe;
@@ -297,7 +287,7 @@
recalc_sigpending();
spin_unlock_irq(&current->sighand->siglock);

- if (ia32_restore_sigcontext(regs, &frame->sc, &ax))
+ if (ia32_restore_sigcontext(regs, &frame->sc, &ax, xst_cnxt))
goto badframe;
return ax;

@@ -326,7 +316,8 @@
recalc_sigpending();
spin_unlock_irq(&current->sighand->siglock);

- if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext, &ax))
+ if (ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext, &ax,
+ &frame->uc.uc_xstate))
goto badframe;

tregs = *regs;
@@ -345,8 +336,9 @@
*/

static int ia32_setup_sigcontext(struct sigcontext_ia32 __user *sc,
+ struct pt_regs *regs, unsigned int mask,
struct _fpstate_ia32 __user *fpstate,
- struct pt_regs *regs, unsigned int mask)
+ struct xsave_struct __user *xstate)
{
int tmp, err = 0;

@@ -376,7 +368,7 @@
err |= __put_user((u32)regs->flags, &sc->flags);
err |= __put_user((u32)regs->sp, &sc->sp_at_signal);

- tmp = save_i387_ia32(fpstate);
+ tmp = save_i387_xstate_ia32(fpstate, xstate);
if (tmp < 0)
err = -EFAULT;
else {
@@ -397,7 +389,9 @@
* Determine which stack to use..
*/
static void __user *get_sigframe(struct k_sigaction *ka, struct pt_regs *regs,
- size_t frame_size)
+ int frame_size,
+ struct _fpstate_ia32 **fpstate,
+ struct xsave_struct **xstate)
{
unsigned long sp;

@@ -416,7 +410,19 @@
ka->sa.sa_restorer)
sp = (unsigned long) ka->sa.sa_restorer;

- sp -= frame_size;
+ if (used_math()) {
+ sp = round_down(sp - xstate_size, 64);
+ if (cpu_has_xsave)
+ *xstate = (struct xsave_struct *) sp;
+
+ sp = sp - (sizeof(struct _fpstate_ia32) - FXSAVE_SIZE);
+
+ *fpstate = (struct _fpstate_ia32 *) sp;
+
+ sp = sp - frame_size;
+ } else
+ sp -= frame_size;
+
/* Align the stack pointer according to the i386 ABI,
* i.e. so that on function entry ((sp + 4) & 15) == 0. */
sp = ((sp + 4) & -16ul) - 4;
@@ -429,6 +435,8 @@
struct sigframe __user *frame;
void __user *restorer;
int err = 0;
+ struct _fpstate_ia32 __user *fpstate = 0;
+ struct xsave_struct __user *xstate = 0;

/* copy_to_user optimizes that into a single 8 byte store */
static const struct {
@@ -443,7 +451,7 @@
0,
};

- frame = get_sigframe(ka, regs, sizeof(*frame));
+ frame = get_sigframe(ka, regs, sizeof(*frame), &fpstate, &xstate);

if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
goto give_sigsegv;
@@ -452,11 +460,24 @@
if (err)
goto give_sigsegv;

- err |= ia32_setup_sigcontext(&frame->sc, &frame->fpstate, regs,
- set->sig[0]);
+ err |= ia32_setup_sigcontext(&frame->sc, regs, set->sig[0],
+ fpstate, xstate);
if (err)
goto give_sigsegv;

+ if (cpu_has_xsave) {
+ err |= __put_user(ptr_to_compat(xstate),
+ &frame->xst_cnxt.xstate);
+ err |= __put_user(xstate_size, &frame->xst_cnxt.size);
+ err |= __put_user(pcntxt_lmask, &frame->xst_cnxt.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->xst_cnxt.hmask);
+ } else {
+ err |= __put_user(0, &frame->xst_cnxt.xstate);
+ err |= __put_user(0, &frame->xst_cnxt.size);
+ err |= __put_user(0, &frame->xst_cnxt.lmask);
+ err |= __put_user(0, &frame->xst_cnxt.hmask);
+ }
+
if (_COMPAT_NSIG_WORDS > 1) {
err |= __copy_to_user(frame->extramask, &set->sig[1],
sizeof(frame->extramask));
@@ -518,6 +539,8 @@
struct exec_domain *ed = current_thread_info()->exec_domain;
void __user *restorer;
int err = 0;
+ struct _fpstate_ia32 __user *fpstate = 0;
+ struct xsave_struct __user *xstate = 0;

/* __copy_to_user optimizes that into a single 8 byte store */
static const struct {
@@ -533,7 +556,7 @@
0,
};

- frame = get_sigframe(ka, regs, sizeof(*frame));
+ frame = get_sigframe(ka, regs, sizeof(*frame), &fpstate, &xstate);

if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
goto give_sigsegv;
@@ -553,8 +576,21 @@
err |= __put_user(sas_ss_flags(regs->sp),
&frame->uc.uc_stack.ss_flags);
err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size);
- err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, &frame->fpstate,
- regs, set->sig[0]);
+ err |= ia32_setup_sigcontext(&frame->uc.uc_mcontext, regs, set->sig[0],
+ fpstate, xstate);
+
+ if (cpu_has_xsave) {
+ err |= __put_user(ptr_to_compat(xstate), &frame->uc.uc_xstate.xstate);
+ err |= __put_user(xstate_size, &frame->uc.uc_xstate.size);
+ err |= __put_user(pcntxt_lmask, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->uc.uc_xstate.hmask);
+ } else {
+ err |= __put_user(0, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(0, &frame->uc.uc_xstate.size);
+ err |= __put_user(0, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(0, &frame->uc.uc_xstate.hmask);
+ }
+
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
if (err)
goto give_sigsegv;
Index: linux-2.6-x86/arch/x86/kernel/Makefile
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/Makefile 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/Makefile 2008-05-12 13:09:56.000000000 -0700
@@ -31,7 +31,7 @@

obj-$(CONFIG_X86_TRAMPOLINE) += trampoline.o
obj-y += process.o
-obj-y += i387.o
+obj-y += i387.o xsave.o
obj-y += ptrace.o
obj-y += ds.o
obj-$(CONFIG_X86_32) += tls.o
Index: linux-2.6-x86/arch/x86/kernel/i387.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/i387.c 2008-05-12 13:09:47.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/i387.c 2008-05-12 13:23:10.000000000 -0700
@@ -21,9 +21,10 @@
# include <asm/sigcontext32.h>
# include <asm/user32.h>
#else
-# define save_i387_ia32 save_i387
-# define restore_i387_ia32 restore_i387
+# define save_i387_xstate_ia32 save_i387_xstate
+# define restore_i387_xstate_ia32 restore_i387_xstate
# define _fpstate_ia32 _fpstate
+# define xstate_cntxt_ia32 xstate_cntxt
# define user_i387_ia32_struct user_i387_struct
# define user32_fxsr_struct user_fxsr_struct
#endif
@@ -38,6 +39,21 @@
unsigned int xstate_size;
static struct i387_fxsave_struct fx_scratch __cpuinitdata;

+void __cpuinit init_thread_xstate(void)
+{
+ if (cpu_has_xsave) {
+ xsave_cntxt_init();
+ return;
+ }
+
+ if (cpu_has_fxsr)
+ xstate_size = sizeof(struct i387_fxsave_struct);
+#ifdef CONFIG_X86_32
+ else
+ xstate_size = sizeof(struct i387_fsave_struct);
+#endif
+}
+
void __cpuinit mxcsr_feature_mask_init(void)
{
unsigned long mask = 0;
@@ -54,16 +70,6 @@
stts();
}

-void __init init_thread_xstate(void)
-{
- if (cpu_has_fxsr)
- xstate_size = sizeof(struct i387_fxsave_struct);
-#ifdef CONFIG_X86_32
- else
- xstate_size = sizeof(struct i387_fsave_struct);
-#endif
-}
-
#ifdef CONFIG_X86_64
/*
* Called at bootup to set up the initial FPU state that is later cloned
@@ -78,7 +84,12 @@

write_cr0(oldcr0 & ~(X86_CR0_TS|X86_CR0_EM)); /* clear TS and EM */

+ if (!smp_processor_id())
+ init_thread_xstate();
+ xsave_init();
+
mxcsr_feature_mask_init();
+
/* clean state in init */
current_thread_info()->status = 0;
clear_used_math();
@@ -181,6 +192,9 @@
*/
target->thread.xstate->fxsave.mxcsr &= mxcsr_feature_mask;

+ if (cpu_has_xsave)
+ target->thread.xstate->xsave.xsave_hdr.xstate_bv |= XSTATE_FPSSE;
+
return ret;
}

@@ -381,6 +395,9 @@
if (!ret)
convert_to_fxsr(target, &env);

+ if (cpu_has_xsave)
+ target->thread.xstate->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;
+
return ret;
}

@@ -393,7 +410,6 @@
struct task_struct *tsk = current;
struct i387_fsave_struct *fp = &tsk->thread.xstate->fsave;

- unlazy_fpu(tsk);
fp->status = fp->swd;
if (__copy_to_user(buf, fp, sizeof(struct i387_fsave_struct)))
return -1;
@@ -407,8 +423,6 @@
struct user_i387_ia32_struct env;
int err = 0;

- unlazy_fpu(tsk);
-
convert_from_fxsr(&env, tsk);
if (__copy_to_user(buf, &env, sizeof(env)))
return -1;
@@ -418,14 +432,15 @@
if (err)
return -1;

- if (__copy_to_user(&buf->_fxsr_env[0], fx,
- sizeof(struct i387_fxsave_struct)))
+ if (__copy_to_user(&buf->_fxsr_env[0], fx, xstate_size))
return -1;
return 1;
}

-int save_i387_ia32(struct _fpstate_ia32 __user *buf)
+int save_i387_xstate_ia32(struct _fpstate_ia32 __user *buf,
+ struct xsave_struct __user *buf1)
{
+ struct task_struct *tsk = current;
if (!used_math())
return 0;
/*
@@ -440,7 +455,12 @@
NULL, buf) ? -1 : 1;
}

+ unlazy_fpu(tsk);
+
if (cpu_has_fxsr)
+ /*
+ * saves the extended state including legacy fxsave.
+ */
return save_i387_fxsave(buf);
else
return save_i387_fsave(buf);
@@ -450,18 +470,16 @@
{
struct task_struct *tsk = current;

- clear_fpu(tsk);
return __copy_from_user(&tsk->thread.xstate->fsave, buf,
sizeof(struct i387_fsave_struct));
}

-static int restore_i387_fxsave(struct _fpstate_ia32 __user *buf)
+int restore_i387_fxsave(struct _fpstate_ia32 __user *buf)
{
struct task_struct *tsk = current;
struct user_i387_ia32_struct env;
int err;

- clear_fpu(tsk);
err = __copy_from_user(&tsk->thread.xstate->fxsave, &buf->_fxsr_env[0],
sizeof(struct i387_fxsave_struct));
/* mxcsr reserved bits must be masked to zero for security reasons */
@@ -473,22 +491,56 @@
return 0;
}

-int restore_i387_ia32(struct _fpstate_ia32 __user *buf)
+int restore_i387_xstate_ia32(struct _fpstate_ia32 __user *buf,
+ struct xstate_cntxt_ia32 __user *buf1)
{
- int err;
+ int err = 0;
+ struct task_struct *tsk = current;
+
+ if (buf && !access_ok(VERIFY_READ, buf, sizeof(*buf))) {
+ err = -1;
+ goto init_state;
+ }

if (HAVE_HWFP) {
- if (cpu_has_fxsr)
- err = restore_i387_fxsave(buf);
- else
- err = restore_i387_fsave(buf);
+ clear_fpu(tsk);
+
+ if (!used_math()) {
+ err = init_fpu(tsk);
+ if (err)
+ return err;
+ }
+
+ if (cpu_has_xsave)
+ err = restore_user_xstate(buf, buf1);
+ else {
+ if (!buf)
+ goto init_state;
+
+ if (cpu_has_fxsr)
+ err = restore_i387_fxsave(buf);
+ else
+ err = restore_i387_fsave(buf);
+ }
} else {
err = fpregs_soft_set(current, NULL,
0, sizeof(struct user_i387_ia32_struct),
NULL, buf) != 0;
}
+
+ if (err)
+ goto init_state;
+
set_used_math();

+ return 0;
+
+init_state:
+ if (used_math()) {
+ clear_fpu(current);
+ clear_used_math();
+ }
+
return err;
}

Index: linux-2.6-x86/arch/x86/kernel/signal_32.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/signal_32.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/signal_32.c 2008-05-12 13:09:56.000000000 -0700
@@ -26,6 +26,7 @@
#include <asm/uaccess.h>
#include <asm/i387.h>
#include <asm/vdso.h>
+#include <asm/proto.h>

#include "sigframe.h"

@@ -116,7 +117,7 @@
*/
static int
restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc,
- unsigned long *pax)
+ unsigned long *pax, struct xstate_cntxt __user *xst_cntxt)
{
unsigned int err = 0;

@@ -162,25 +163,12 @@
struct _fpstate __user *buf;

err |= __get_user(buf, &sc->fpstate);
- if (buf) {
- if (!access_ok(VERIFY_READ, buf, sizeof(*buf)))
- goto badframe;
- err |= restore_i387(buf);
- } else {
- struct task_struct *me = current;
-
- if (used_math()) {
- clear_fpu(me);
- clear_used_math();
- }
- }
+
+ err |= restore_i387_xstate(buf, xst_cntxt);
}

err |= __get_user(*pax, &sc->ax);
return err;
-
-badframe:
- return 1;
}

asmlinkage unsigned long sys_sigreturn(unsigned long __unused)
@@ -189,9 +177,11 @@
struct pt_regs *regs;
unsigned long ax;
sigset_t set;
+ struct xstate_cntxt __user *xst_cnxt;

regs = (struct pt_regs *) &__unused;
frame = (struct sigframe __user *)(regs->sp - 8);
+ xst_cnxt = &frame->xst_cnxt;

if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
goto badframe;
@@ -206,7 +196,7 @@
recalc_sigpending();
spin_unlock_irq(&current->sighand->siglock);

- if (restore_sigcontext(regs, &frame->sc, &ax))
+ if (restore_sigcontext(regs, &frame->sc, &ax, xst_cnxt))
goto badframe;
return ax;

@@ -245,7 +235,8 @@
recalc_sigpending();
spin_unlock_irq(&current->sighand->siglock);

- if (restore_sigcontext(regs, &frame->uc.uc_mcontext, &ax))
+ if (restore_sigcontext(regs, &frame->uc.uc_mcontext, &ax,
+ &frame->uc.uc_xstate))
goto badframe;

if (do_sigaltstack(&frame->uc.uc_stack, NULL, regs->sp) == -EFAULT)
@@ -262,8 +253,10 @@
* Set up a signal frame.
*/
static int
-setup_sigcontext(struct sigcontext __user *sc, struct _fpstate __user *fpstate,
- struct pt_regs *regs, unsigned long mask)
+setup_sigcontext(struct sigcontext __user *sc,
+ struct pt_regs *regs, unsigned long mask,
+ struct _fpstate __user *fpstate,
+ struct xsave_struct __user *xstate)
{
int tmp, err = 0;

@@ -289,7 +282,7 @@
err |= __put_user(regs->sp, &sc->sp_at_signal);
err |= __put_user(regs->ss, (unsigned int __user *)&sc->ss);

- tmp = save_i387(fpstate);
+ tmp = save_i387_xstate(fpstate, xstate);
if (tmp < 0)
err = 1;
else
@@ -306,7 +299,8 @@
* Determine which stack to use..
*/
static inline void __user *
-get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size)
+get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
+ struct _fpstate **fpstate, struct xsave_struct **xstate)
{
unsigned long sp;

@@ -332,7 +326,18 @@
sp = (unsigned long) ka->sa.sa_restorer;
}

- sp -= frame_size;
+ if (used_math()) {
+ sp = round_down(sp - xstate_size, 64);
+ if (cpu_has_xsave)
+ *xstate = (struct xsave_struct *) sp;
+
+ sp = sp - (sizeof(struct _fpstate) - FXSAVE_SIZE);
+
+ *fpstate = (struct _fpstate *) sp;
+
+ sp -= frame_size;
+ } else
+ sp -= frame_size;
/*
* Align the stack pointer according to the i386 ABI,
* i.e. so that on function entry ((sp + 4) & 15) == 0.
@@ -350,10 +355,12 @@
void __user *restorer;
int err = 0;
int usig;
+ struct _fpstate __user *fpstate = 0;
+ struct xsave_struct __user *xstate = 0;

- frame = get_sigframe(ka, regs, sizeof(*frame));
+ frame = get_sigframe(ka, regs, sizeof(*frame), &fpstate, &xstate);

- if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
+ if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
goto give_sigsegv;

usig = current_thread_info()->exec_domain
@@ -366,9 +373,21 @@
if (err)
goto give_sigsegv;

- err = setup_sigcontext(&frame->sc, &frame->fpstate, regs, set->sig[0]);
+ err = setup_sigcontext(&frame->sc, regs, set->sig[0],
+ fpstate, xstate);
if (err)
goto give_sigsegv;
+ if (cpu_has_xsave) {
+ err |= __put_user(xstate, &frame->xst_cnxt.xstate);
+ err |= __put_user(xstate_size, &frame->xst_cnxt.size);
+ err |= __put_user(pcntxt_lmask, &frame->xst_cnxt.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->xst_cnxt.hmask);
+ } else {
+ err |= __put_user(0, &frame->xst_cnxt.xstate);
+ err |= __put_user(0, &frame->xst_cnxt.size);
+ err |= __put_user(0, &frame->xst_cnxt.lmask);
+ err |= __put_user(0, &frame->xst_cnxt.hmask);
+ }

if (_NSIG_WORDS > 1) {
err = __copy_to_user(&frame->extramask, &set->sig[1],
@@ -427,8 +446,10 @@
void __user *restorer;
int err = 0;
int usig;
+ struct _fpstate __user *fpstate = 0;
+ struct xsave_struct __user *xstate = 0;

- frame = get_sigframe(ka, regs, sizeof(*frame));
+ frame = get_sigframe(ka, regs, sizeof(*frame), &fpstate, &xstate);

if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
goto give_sigsegv;
@@ -453,8 +474,20 @@
err |= __put_user(sas_ss_flags(regs->sp),
&frame->uc.uc_stack.ss_flags);
err |= __put_user(current->sas_ss_size, &frame->uc.uc_stack.ss_size);
- err |= setup_sigcontext(&frame->uc.uc_mcontext, &frame->fpstate,
- regs, set->sig[0]);
+ err |= setup_sigcontext(&frame->uc.uc_mcontext, regs, set->sig[0],
+ fpstate, xstate);
+ if (cpu_has_xsave) {
+ err |= __put_user(xstate, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(xstate_size, &frame->uc.uc_xstate.size);
+ err |= __put_user(pcntxt_lmask, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->uc.uc_xstate.hmask);
+ } else {
+ err |= __put_user(0, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(0, &frame->uc.uc_xstate.size);
+ err |= __put_user(0, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(0, &frame->uc.uc_xstate.hmask);
+ }
+
err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
if (err)
goto give_sigsegv;
Index: linux-2.6-x86/include/asm-x86/cpufeature.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/cpufeature.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/cpufeature.h 2008-05-12 13:09:56.000000000 -0700
@@ -90,6 +90,7 @@
#define X86_FEATURE_CX16 (4*32+13) /* CMPXCHG16B */
#define X86_FEATURE_XTPR (4*32+14) /* Send Task Priority Messages */
#define X86_FEATURE_DCA (4*32+18) /* Direct Cache Access */
+#define X86_FEATURE_XSAVE (4*32+26) /* XSAVE */

/* VIA/Cyrix/Centaur-defined CPU features, CPUID level 0xC0000001, word 5 */
#define X86_FEATURE_XSTORE (5*32+ 2) /* on-CPU RNG present (xstore insn) */
@@ -187,6 +188,7 @@
#define cpu_has_gbpages boot_cpu_has(X86_FEATURE_GBPAGES)
#define cpu_has_arch_perfmon boot_cpu_has(X86_FEATURE_ARCH_PERFMON)
#define cpu_has_pat boot_cpu_has(X86_FEATURE_PAT)
+#define cpu_has_xsave boot_cpu_has(X86_FEATURE_XSAVE)

#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg 1
Index: linux-2.6-x86/include/asm-x86/i387.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/i387.h 2008-05-12 13:09:47.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/i387.h 2008-05-12 14:46:43.000000000 -0700
@@ -18,6 +18,8 @@
#include <asm/sigcontext.h>
#include <asm/user.h>
#include <asm/uaccess.h>
+#include <asm/xsave.h>
+#include <asm/percpu.h>

extern void fpu_init(void);
extern void mxcsr_feature_mask_init(void);
@@ -31,10 +33,19 @@

#ifdef CONFIG_IA32_EMULATION
struct _fpstate_ia32;
-extern int save_i387_ia32(struct _fpstate_ia32 __user *buf);
-extern int restore_i387_ia32(struct _fpstate_ia32 __user *buf);
+struct xstate_cntxt_ia32;
+extern int save_i387_xstate_ia32(struct _fpstate_ia32 __user *buf,
+ struct xsave_struct __user *buf1);
+extern int restore_i387_xstate_ia32(struct _fpstate_ia32 __user *buf,
+ struct xstate_cntxt_ia32 __user *buf1);
+extern int restore_user_xstate(struct _fpstate_ia32 __user *buf,
+ struct xstate_cntxt_ia32 __user *buf1);
+
+extern int restore_i387_fxsave(struct _fpstate_ia32 __user *buf);
#endif

+#define X87_FSW_ES (1 << 7) /* Exception Summary */
+
#ifdef CONFIG_X86_64

/* Ignore delayed exceptions from user space */
@@ -45,9 +56,13 @@
_ASM_EXTABLE(1b, 2b));
}

-static inline int restore_fpu_checking(struct i387_fxsave_struct *fx)
+static inline int restore_fpu_checking(struct _fpstate __user *buf)
{
int err;
+ struct i387_fxsave_struct *fx = ((__force struct i387_fxsave_struct *) buf);
+
+ if (!access_ok(VERIFY_READ, buf, sizeof(*buf)))
+ return -1;

asm volatile("1: rex64/fxrstor (%[fx])\n\t"
"2:\n"
@@ -62,20 +77,42 @@
#else
: [fx] "cdaSDb" (fx), "m" (*fx), "0" (0));
#endif
- if (unlikely(err))
- init_fpu(current);
return err;
}

-#define X87_FSW_ES (1 << 7) /* Exception Summary */
+static inline void __fxrstor(struct i387_fxsave_struct *fx)
+{
+ asm volatile("1: rex64/fxrstor (%[fx])\n\t"
+#if 0 /* See comment in __fxsave_clear() below. */
+ :: [fx] "r" (fx), "m" (*fx));
+#else
+ :: [fx] "cdaSDb" (fx), "m" (*fx));
+#endif
+}
+
+static inline void restore_fpu_xstate(struct task_struct *tsk)
+{
+ if (cpu_has_xsave)
+ xrstor(&tsk->thread.xstate->xsave);
+ else
+ __fxrstor(&tsk->thread.xstate->fxsave);
+}

/* AMD CPUs don't save/restore FDP/FIP/FOP unless an exception
is pending. Clear the x87 state here by setting it to fixed
values. The kernel data segment can be sometimes 0 and sometimes
new user value. Both should be ok.
Use the PDA as safe address because it should be already in L1. */
-static inline void clear_fpu_state(struct i387_fxsave_struct *fx)
+static inline void clear_fpu_state(struct xsave_struct *xstate)
{
+ struct i387_fxsave_struct *fx = (struct i387_fxsave_struct *) xstate;
+
+ /*
+ * Header may indicate the init state of the FP.
+ */
+ if (cpu_has_xsave && !(xstate->xsave_hdr.xstate_bv & XSTATE_FP))
+ return;
+
if (unlikely(fx->swd & X87_FSW_ES))
asm volatile("fnclex");
alternative_input(ASM_NOP8 ASM_NOP2,
@@ -108,7 +145,7 @@
return err;
}

-static inline void __save_init_fpu(struct task_struct *tsk)
+static inline void __fxsave(struct task_struct *tsk)
{
/* Using "rex64; fxsave %0" is broken because, if the memory operand
uses any extended registers for addressing, a second REX prefix
@@ -133,55 +170,23 @@
: "=m" (tsk->thread.xstate->fxsave)
: "cdaSDb" (&tsk->thread.xstate->fxsave));
#endif
- clear_fpu_state(&tsk->thread.xstate->fxsave);
- task_thread_info(tsk)->status &= ~TS_USEDFPU;
}

-/*
- * Signal frame handlers.
- */
-
-static inline int save_i387(struct _fpstate __user *buf)
+static inline void __save_init_fpu(struct task_struct *tsk)
{
- struct task_struct *tsk = current;
- int err = 0;
-
- BUILD_BUG_ON(sizeof(struct user_i387_struct) !=
- sizeof(tsk->thread.xstate->fxsave));
-
- if ((unsigned long)buf % 16)
- printk("save_i387: bad fpstate %p\n", buf);
+ if (cpu_has_xsave)
+ xsave(tsk);
+ else
+ __fxsave(tsk);

- if (!used_math())
- return 0;
- clear_used_math(); /* trigger finit */
- if (task_thread_info(tsk)->status & TS_USEDFPU) {
- err = save_i387_checking((struct i387_fxsave_struct __user *)
- buf);
- if (err)
- return err;
- task_thread_info(tsk)->status &= ~TS_USEDFPU;
- stts();
- } else {
- if (__copy_to_user(buf, &tsk->thread.xstate->fxsave,
- sizeof(struct i387_fxsave_struct)))
- return -1;
- }
- return 1;
+ clear_fpu_state(&tsk->thread.xstate->xsave);
+ task_thread_info(tsk)->status &= ~TS_USEDFPU;
}
-
/*
- * This restores directly out of user space. Exceptions are handled.
+ * Signal frame handlers...
*/
-static inline int restore_i387(struct _fpstate __user *buf)
-{
- set_used_math();
- if (!(task_thread_info(current)->status & TS_USEDFPU)) {
- clts();
- task_thread_info(current)->status |= TS_USEDFPU;
- }
- return restore_fpu_checking((__force struct i387_fxsave_struct *)buf);
-}
+extern int save_i387_xstate(void __user *buf);
+extern int restore_i387_xstate(struct _fpstate *buf, struct xstate_cntxt *buf1);

#else /* CONFIG_X86_32 */

@@ -190,8 +195,12 @@
asm volatile("fnclex ; fwait");
}

-static inline void restore_fpu(struct task_struct *tsk)
+static inline void restore_fpu_xstate(struct task_struct *tsk)
{
+ if (cpu_has_xsave) {
+ xrstor(&tsk->thread.xstate->xsave);
+ return;
+ }
/*
* The "nop" is needed to make the instructions the same
* length.
@@ -217,6 +226,21 @@
*/
static inline void __save_init_fpu(struct task_struct *tsk)
{
+ if (cpu_has_xsave) {
+ struct xsave_struct *xstate = &tsk->thread.xstate->xsave;
+ struct i387_fxsave_struct *fx = &tsk->thread.xstate->fxsave;
+
+ xsave(tsk);
+
+ /*
+ * Header may indicate the init state of the FP.
+ */
+ if (!(xstate->xsave_hdr.xstate_bv & XSTATE_FP))
+ goto end;
+
+ if (unlikely(fx->swd & X87_FSW_ES))
+ asm volatile("fnclex");
+ } else {
/* Use more nops than strictly needed in case the compiler
varies code */
alternative_input(
@@ -226,6 +250,7 @@
X86_FEATURE_FXSR,
[fx] "m" (tsk->thread.xstate->fxsave),
[fsw] "m" (tsk->thread.xstate->fxsave.swd) : "memory");
+ }
/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
is pending. Clear the x87 state here by setting it to fixed
values. safe_address is a random variable that should be in L1 */
@@ -235,15 +260,18 @@
"fildl %[addr]", /* set F?P to defined value */
X86_FEATURE_FXSAVE_LEAK,
[addr] "m" (safe_address));
+end:
task_thread_info(tsk)->status &= ~TS_USEDFPU;
}

-/*
- * Signal frame handlers...
- */
-extern int save_i387(struct _fpstate __user *buf);
-extern int restore_i387(struct _fpstate __user *buf);
+extern int save_i387_xstate(struct _fpstate __user *buf,
+ struct xsave_struct __user *buf1);
+extern int restore_i387_xstate(struct _fpstate __user *buf,
+ struct xstate_cntxt __user *buf1);
+extern int restore_user_xstate(struct _fpstate __user *buf,
+ struct xstate_cntxt __user *buf1);

+extern int restore_i387_fxsave(struct _fpstate __user *buf);
#endif /* CONFIG_X86_64 */

static inline void __unlazy_fpu(struct task_struct *tsk)
Index: linux-2.6-x86/include/asm-x86/processor.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/processor.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/processor.h 2008-05-12 14:53:38.000000000 -0700
@@ -351,10 +351,23 @@
u32 entry_eip;
};

+struct xsave_hdr_struct {
+ u64 xstate_bv;
+ u64 reserved1[2];
+ u64 reserved2[5];
+} __attribute__((packed));
+
+struct xsave_struct {
+ struct i387_fxsave_struct i387;
+ struct xsave_hdr_struct xsave_hdr;
+ /* new processor state extensions will go here */
+} __attribute__ ((packed, aligned (64)));
+
union thread_xstate {
struct i387_fsave_struct fsave;
struct i387_fxsave_struct fxsave;
struct i387_soft_struct soft;
+ struct xsave_struct xsave;
};

#ifdef CONFIG_X86_64
Index: linux-2.6-x86/arch/x86/kernel/cpu/feature_names.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/cpu/feature_names.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/cpu/feature_names.c 2008-05-12 13:09:56.000000000 -0700
@@ -46,7 +46,7 @@
"pni", NULL, NULL, "monitor", "ds_cpl", "vmx", "smx", "est",
"tm2", "ssse3", "cid", NULL, NULL, "cx16", "xtpr", NULL,
NULL, NULL, "dca", "sse4_1", "sse4_2", NULL, NULL, "popcnt",
- NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
+ NULL, NULL, "xsave", NULL, NULL, NULL, NULL, NULL,

/* VIA/Cyrix/Centaur-defined */
NULL, NULL, "rng", "rng_en", NULL, NULL, "ace", "ace_en",
Index: linux-2.6-x86/include/asm-x86/ucontext.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/ucontext.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/ucontext.h 2008-05-12 15:15:12.000000000 -0700
@@ -6,7 +6,10 @@
struct ucontext *uc_link;
stack_t uc_stack;
struct sigcontext uc_mcontext;
- sigset_t uc_sigmask; /* mask last for extensibility */
+ sigset_t uc_sigmask;
+ /* Allow for uc_sigmask growth. Glibc uses a 1024-bit sigset_t. */
+ int __unused[32 - (sizeof (sigset_t) / sizeof (int))];
+ struct xstate_cntxt uc_xstate;
};

#endif /* _ASM_X86_UCONTEXT_H */
Index: linux-2.6-x86/include/asm-x86/ia32.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/ia32.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/ia32.h 2008-05-12 13:09:56.000000000 -0700
@@ -41,6 +41,7 @@
stack_ia32_t uc_stack;
struct sigcontext_ia32 uc_mcontext;
compat_sigset_t uc_sigmask; /* mask last for extensibility */
+ struct xstate_cntxt_ia32 uc_xstate;
};

/* This matches struct stat64 in glibc2.2, hence the absolutely
Index: linux-2.6-x86/include/asm-x86/sigcontext32.h
===================================================================
--- linux-2.6-x86.orig/include/asm-x86/sigcontext32.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/sigcontext32.h 2008-05-12 13:09:56.000000000 -0700
@@ -43,6 +43,13 @@
__u32 padding[56];
};

+struct xstate_cntxt_ia32 {
+ u32 xstate; /* really (struct _xstate *) */
+ u32 size;
+ u32 lmask;
+ u32 hmask;
+};
+
struct sigcontext_ia32 {
unsigned short gs, __gsh;
unsigned short fs, __fsh;
Index: linux-2.6-x86/arch/x86/kernel/sigframe.h
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/sigframe.h 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/sigframe.h 2008-05-12 13:09:56.000000000 -0700
@@ -3,9 +3,10 @@
char __user *pretcode;
int sig;
struct sigcontext sc;
- struct _fpstate fpstate;
+ struct xstate_cntxt xst_cnxt;
unsigned long extramask[_NSIG_WORDS-1];
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};

struct rt_sigframe {
@@ -15,8 +16,8 @@
void __user *puc;
struct siginfo info;
struct ucontext uc;
- struct _fpstate fpstate;
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};
#else
struct rt_sigframe {
Index: linux-2.6-x86/arch/x86/kernel/traps_32.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/traps_32.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/traps_32.c 2008-05-12 13:09:56.000000000 -0700
@@ -1178,7 +1178,7 @@
}

clts(); /* Allow maths ops (or we recurse) */
- restore_fpu(tsk);
+ restore_fpu_xstate(tsk);
thread->status |= TS_USEDFPU; /* So we fnsave on switch_to() */
tsk->fpu_counter++;
}
@@ -1255,7 +1255,6 @@

set_bit(SYSCALL_VECTOR, used_vectors);

- init_thread_xstate();
/*
* Should be a barrier for any external CPU state:
*/
Index: linux-2.6-x86/arch/x86/kernel/cpu/common.c
===================================================================
--- linux-2.6-x86.orig/arch/x86/kernel/cpu/common.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/cpu/common.c 2008-05-12 13:27:21.000000000 -0700
@@ -742,6 +742,10 @@
current_thread_info()->status = 0;
clear_used_math();
mxcsr_feature_mask_init();
+
+ if (!smp_processor_id())
+ init_thread_xstate();
+ xsave_init();
}

#ifdef CONFIG_HOTPLUG_CPU
Mikael Pettersson
2008-05-16 13:26:15 UTC
Permalink
Post by Suresh Siddha
hi,
Appended patch adds the support for xsave/xrstor infrastructure for x86.
xsave/xrstor manages the existing and future processor extended states in x86
architecutre.
More info on xsave/xrstor can be found in the Intel SDM's located at
http://www.intel.com/products/processor/manuals/index.htm
Please let me know your feedback and comments. Specifically, I am not sure
if I break anything or make anyone's life harder with the ucontext_t extensions
that are proposed in the patch.
This is a large patch, and somewhat difficult to review since it mixes
kernel-private and user-visible changes. I'm going to focus on the
user-visible changes.
Post by Suresh Siddha
BTW, Traditionally glibc has this definition for struct ucontext.
glibc's definition is irrelevant, in part because glibc can and does lie
about kernel types to applications, and in part because glibc is not the
only user-space consumer of kernel types: there are other libcs, and there
are user-space virtualisation tools (my interest in this matter) that care
deeply about kernel types and sigframe layouts. In particular, user-space
needs to be able to copy and assemble sigframes.

So however the xsave support ends up looking, user-space must have a
sensible way of detecting and handling the layout changes.
Post by Suresh Siddha
--- linux-2.6-x86.orig/include/asm-x86/ucontext.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/ucontext.h 2008-05-12 15:15:12.000000000 -0700
@@ -6,7 +6,10 @@
struct ucontext *uc_link;
stack_t uc_stack;
struct sigcontext uc_mcontext;
- sigset_t uc_sigmask; /* mask last for extensibility */
+ sigset_t uc_sigmask;
+ /* Allow for uc_sigmask growth. Glibc uses a 1024-bit sigset_t. */
+ int __unused[32 - (sizeof (sigset_t) / sizeof (int))];
+ struct xstate_cntxt uc_xstate;
};
#endif /* _ASM_X86_UCONTEXT_H */
You're changing the layout of struct ucontext in two ways: uc_mcontext
changes elsewhere, and you're adding __unused and uc_xstate.

How is user-space supposed to know whether it's looking at a current
layout ucontext or an xsave-layout ucontext?

It seems that uc_flags is unused and always zero. Could you define a
flag bit (e.g. 1) for uc_flags to indicate the xsave layout?
Post by Suresh Siddha
--- linux-2.6-x86.orig/arch/x86/kernel/sigframe.h 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/sigframe.h 2008-05-12 13:09:56.000000000 -0700
@@ -3,9 +3,10 @@
char __user *pretcode;
int sig;
struct sigcontext sc;
- struct _fpstate fpstate;
+ struct xstate_cntxt xst_cnxt;
unsigned long extramask[_NSIG_WORDS-1];
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};
Offset to extramask[] and retcode changes, as well as the size of the structure.

Why contract xstate_cntxt to xst_cnxt? That just obscures things.
Post by Suresh Siddha
--- linux-2.6-x86.orig/include/asm-x86/sigcontext.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/sigcontext.h 2008-05-12 14:54:21.000000000 -0700
@@ -202,4 +202,26 @@
#endif /* !__i386__ */
+struct _xsave_hdr_struct {
+ u64 xstate_bv;
+ u64 reserved1[2];
+ u64 reserved2[5];
+} __attribute__((packed));
+
+struct _xstate {
+ /*
+ * Applications need to refer to fpstate through fpstate pointer
+ * in sigcontext. Not here directly.
+ */
+ struct _fpstate fpstate;
+ struct _xsave_hdr_struct xsave_hdr;
+ /* new processor state extensions will go here */
+} __attribute__ ((aligned (64)));
+
+struct xstate_cntxt {
+ struct _xstate __user *xstate;
+ u32 size;
+ u32 lmask;
+ u32 hmask;
+};
#endif
What is the purpose of the xstate pointer in xstate_cntxt?
As far as I can tell, it's redundant and can alway be derived
from the ucontext's uc_mcontext.fpstate pointer.
Post by Suresh Siddha
--- linux-2.6-x86.orig/arch/x86/kernel/signal_64.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/signal_64.c 2008-05-12 13:09:56.000000000 -0700
...
Post by Suresh Siddha
err |= __put_user(fp, &frame->uc.uc_mcontext.fpstate);
+
+ if (cpu_has_xsave) {
+ err |= __put_user(fp, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(xstate_size, &frame->uc.uc_xstate.size);
+ err |= __put_user(pcntxt_lmask, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->uc.uc_xstate.hmask);
+ } else {
+ err |= __put_user(0, &frame->uc.uc_xstate.xstate);
--- linux-2.6-x86.orig/arch/x86/kernel/signal_32.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/signal_32.c 2008-05-12 13:09:56.000000000 -0700
...
Post by Suresh Siddha
-get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size)
+get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
+ struct _fpstate **fpstate, struct xsave_struct **xstate)
...
Post by Suresh Siddha
- sp -= frame_size;
+ if (used_math()) {
+ sp = round_down(sp - xstate_size, 64);
+ if (cpu_has_xsave)
+ *xstate = (struct xsave_struct *) sp;
+
+ sp = sp - (sizeof(struct _fpstate) - FXSAVE_SIZE);
+
+ *fpstate = (struct _fpstate *) sp;
+
+ sp -= frame_size;
+ } else
+ sp -= frame_size;
...(in setup_frame())
Post by Suresh Siddha
- err = setup_sigcontext(&frame->sc, &frame->fpstate, regs, set->sig[0]);
+ err = setup_sigcontext(&frame->sc, regs, set->sig[0],
+ fpstate, xstate);
if (err)
goto give_sigsegv;
+ if (cpu_has_xsave) {
+ err |= __put_user(xstate, &frame->xst_cnxt.xstate);
+ err |= __put_user(xstate_size, &frame->xst_cnxt.size);
+ err |= __put_user(pcntxt_lmask, &frame->xst_cnxt.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->xst_cnxt.hmask);
+ } else {
+ err |= __put_user(0, &frame->xst_cnxt.xstate);
...(in setup_rt_frame())
Post by Suresh Siddha
- err |= setup_sigcontext(&frame->uc.uc_mcontext, &frame->fpstate,
- regs, set->sig[0]);
+ err |= setup_sigcontext(&frame->uc.uc_mcontext, regs, set->sig[0],
+ fpstate, xstate);
+ if (cpu_has_xsave) {
+ err |= __put_user(xstate, &frame->uc.uc_xstate.xstate);
+ err |= __put_user(xstate_size, &frame->uc.uc_xstate.size);
+ err |= __put_user(pcntxt_lmask, &frame->uc.uc_xstate.lmask);
+ err |= __put_user(pcntxt_hmask, &frame->uc.uc_xstate.hmask);
+ } else {
+ err |= __put_user(0, &frame->uc.uc_xstate.xstate);
In all cases .xstate seems to be a function of fpstate and cpu_has_xsave.

And why does x86-64 make xstate == fpstate while on x86-32 they're at an
offset from each other?

The fpstate pointer may be pointing into a struct _xstate.
How is user-space supposed to know if this is the case?

struct _fpstate has a 'magic' field which distinguishes x87-only
from x87+FXSR structs. Could that field also be used to indicate XSAVE?

How is user-space supposed to know how large the _xstate struct is?
Is that the size field in the struct xstate_cntxt?
Post by Suresh Siddha
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-x86/arch/x86/kernel/xsave.c 2008-05-12 13:09:56.000000000 -0700
...
Post by Suresh Siddha
+ /*
+ * FP and SSE state can't be restored directly from the userspace
+ * because of legacy reasons. Lets restore it to the fpstate
+ * in the task struct.
+ */
Can you please explain what those 'legacy reasons' are?
Post by Suresh Siddha
--- linux-2.6-x86.orig/arch/x86/kernel/signal_32.c 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/signal_32.c 2008-05-12 13:09:56.000000000 -0700
...(in setup_frame)
Post by Suresh Siddha
- if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
+ if (!access_ok(VERIFY_WRITE, frame, sizeof (*frame)))
goto give_sigsegv;
Gratuitous whitespace change.

I know xsave will be needed once AVX is released to the masses.
Any idea on when that will happen?

/Mikael
Suresh Siddha
2008-05-18 01:34:16 UTC
Permalink
Post by Mikael Pettersson
This is a large patch, and somewhat difficult to review since it mixes
kernel-private and user-visible changes.
Sorry. I wanted to give the complete picture. Will see if I can split it up
while posting for upstream.
Post by Mikael Pettersson
I'm going to focus on the user-visible changes.
Post by Suresh Siddha
BTW, Traditionally glibc has this definition for struct ucontext.
glibc's definition is irrelevant, in part because glibc can and does lie
about kernel types to applications, and in part because glibc is not the
only user-space consumer of kernel types: there are other libcs, and there
are user-space virtualisation tools (my interest in this matter) that care
deeply about kernel types and sigframe layouts. In particular, user-space
needs to be able to copy and assemble sigframes.
Ok. I hope user-space is doing this copy and assemble in a (sane) way that
allows the kernel to modify the sigframe with out breaking old apps.

Do you know how the user-space is determining how much to copy today?

I have to probably use some magic values(or other flags), indicating the
presence of extended state context information in the ucontext.
Post by Mikael Pettersson
So however the xsave support ends up looking, user-space must have a
sensible way of detecting and handling the layout changes.
cpuid information (cpuid.0x1.ecx.osxsave) indicates whether the
OS supports xsave or not. This can be one way, that signal handlers
can use to interpret the ucontext.

Given that it is more than signal handlers, I can add a flag also,
representing ucontext extensions.
Post by Mikael Pettersson
Post by Suresh Siddha
--- linux-2.6-x86.orig/include/asm-x86/ucontext.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/ucontext.h 2008-05-12 15:15:12.000000000 -0700
@@ -6,7 +6,10 @@
struct ucontext *uc_link;
stack_t uc_stack;
struct sigcontext uc_mcontext;
- sigset_t uc_sigmask; /* mask last for extensibility */
+ sigset_t uc_sigmask;
+ /* Allow for uc_sigmask growth. Glibc uses a 1024-bit sigset_t. */
+ int __unused[32 - (sizeof (sigset_t) / sizeof (int))];
+ struct xstate_cntxt uc_xstate;
};
#endif /* _ASM_X86_UCONTEXT_H */
You're changing the layout of struct ucontext in two ways: uc_mcontext
changes elsewhere, and you're adding __unused and uc_xstate.
I am not changing the uc_mcontext (struct sigcontext). Just extending
the ucontext.
Post by Mikael Pettersson
How is user-space supposed to know whether it's looking at a current
layout ucontext or an xsave-layout ucontext?
It seems that uc_flags is unused and always zero. Could you define a
flag bit (e.g. 1) for uc_flags to indicate the xsave layout?
Sure. I can even add a magic word to indicate the xsave presence,
so that sigreturn can be double sure and not break older apps.
Post by Mikael Pettersson
Post by Suresh Siddha
--- linux-2.6-x86.orig/arch/x86/kernel/sigframe.h 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/sigframe.h 2008-05-12 13:09:56.000000000 -0700
@@ -3,9 +3,10 @@
char __user *pretcode;
int sig;
struct sigcontext sc;
- struct _fpstate fpstate;
+ struct xstate_cntxt xst_cnxt;
unsigned long extramask[_NSIG_WORDS-1];
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};
Offset to extramask[] and retcode changes, as well as the size of the structure.
hm yes. this will break the restore of extramask[], for apps which set their
own sig return frames.

I can leave _fpstate as it is(and unused) and move xstate context (which
will contain fpstate + the extended state) after extramask[].
Post by Mikael Pettersson
Why contract xstate_cntxt to xst_cnxt? That just obscures things.
ok.
Post by Mikael Pettersson
Post by Suresh Siddha
--- linux-2.6-x86.orig/include/asm-x86/sigcontext.h 2008-05-12 13:09:03.000000000 -0700
+++ linux-2.6-x86/include/asm-x86/sigcontext.h 2008-05-12 14:54:21.000000000 -0700
@@ -202,4 +202,26 @@
#endif /* !__i386__ */
+struct _xsave_hdr_struct {
+ u64 xstate_bv;
+ u64 reserved1[2];
+ u64 reserved2[5];
+} __attribute__((packed));
+
+struct _xstate {
+ /*
+ * Applications need to refer to fpstate through fpstate pointer
+ * in sigcontext. Not here directly.
+ */
+ struct _fpstate fpstate;
+ struct _xsave_hdr_struct xsave_hdr;
+ /* new processor state extensions will go here */
+} __attribute__ ((aligned (64)));
+
+struct xstate_cntxt {
+ struct _xstate __user *xstate;
+ u32 size;
+ u32 lmask;
+ u32 hmask;
+};
#endif
What is the purpose of the xstate pointer in xstate_cntxt?
As far as I can tell, it's redundant and can alway be derived
from the ucontext's uc_mcontext.fpstate pointer.
This is true in 64bit.

While doing sigreturn, there is no way, kernel can know if the
uc_mcontext.fpstate is pointing to just the legacy fpstate(512 bytes)
or the extended state pointer (unless we incorporate
some magic word at the end of fpstate image, see below)

during sigreturn, fpstate will be restored from the uc_mcontext.fpstate
and extended state will be restored from xstate pointer.
Post by Mikael Pettersson
And why does x86-64 make xstate == fpstate while on x86-32 they're at an
offset from each other?
because sigcontext's 32bit fpstate is different from 64bit fpstate.
32bit fp state is fsave frame followed by fxsave frame. Whereas 64bit
fp state is just fxsave frame. To maintain 32bit legacy compatibility,
32bit xstate is different from 32bit fpstate.
Post by Mikael Pettersson
The fpstate pointer may be pointing into a struct _xstate.
How is user-space supposed to know if this is the case?
user-space doesn't have to know. user has to refer fpstate through
fpstate pointer, and all the extended state through xstate pointer.
during the sigreturn, kernel will restore fp state through
fpstate pointer and the rest (exclusing FP/SSE) will be restored
from xstate pointer.
Post by Mikael Pettersson
struct _fpstate has a 'magic' field which distinguishes x87-only
from x87+FXSR structs. Could that field also be used to indicate XSAVE?
I don't think we can use the existing 'magic' field. But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
Post by Mikael Pettersson
How is user-space supposed to know how large the _xstate struct is?
Is that the size field in the struct xstate_cntxt?
yes. I will make the names more descriptive :)
Post by Mikael Pettersson
Post by Suresh Siddha
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-x86/arch/x86/kernel/xsave.c 2008-05-12 13:09:56.000000000 -0700
...
Post by Suresh Siddha
+ /*
+ * FP and SSE state can't be restored directly from the userspace
+ * because of legacy reasons. Lets restore it to the fpstate
+ * in the task struct.
+ */
Can you please explain what those 'legacy reasons' are?
As I mentioned earlier, fpstate contains both fsave frame followed by fxsave
frame. So the kernel can't directly restore the information.
Post by Mikael Pettersson
I know xsave will be needed once AVX is released to the masses.
Any idea on when that will happen?
AVX is currently planned for Sandybridge time frame.

thanks,
suresh
Mikael Pettersson
2008-05-19 14:52:01 UTC
Permalink
Post by Suresh Siddha
Post by Mikael Pettersson
This is a large patch, and somewhat difficult to review since it mixes
kernel-private and user-visible changes.
Sorry. I wanted to give the complete picture. Will see if I can split it up
while posting for upstream.
Ok.
Post by Suresh Siddha
Post by Mikael Pettersson
Post by Suresh Siddha
BTW, Traditionally glibc has this definition for struct ucontext.
glibc's definition is irrelevant, in part because glibc can and does lie
about kernel types to applications, and in part because glibc is not the
only user-space consumer of kernel types: there are other libcs, and there
are user-space virtualisation tools (my interest in this matter) that care
deeply about kernel types and sigframe layouts. In particular, user-space
needs to be able to copy and assemble sigframes.
Ok. I hope user-space is doing this copy and assemble in a (sane) way that
allows the kernel to modify the sigframe with out breaking old apps.
Do you know how the user-space is determining how much to copy today?
The de-facto ABI for signal delivery and sigreturn is unfortunately
based on fairly fixed-layout structs on the stack (sigframes). The
only flexibility there that I've found is the sigcontext's fpstate
pointer which allows the fpstate to be located elsewhere.

User-space pretty much has to know the kernel's sigframe layout, for
instance, non-siginfo sigframes on x86-32 hide additional sigmask words
above the fpstate.

User-space could in theory compare SP with the sigcontext's SP and
the altstack (if any) and deduce the sigframe size from that, but
that gives incomplete information; for instance, user-space must
still be able to locate and adjust embedded pointers in the sigframe.

In my case we "know" the sigframe struct layout based on OS and CPU
combination. I can't comment on how others deal with this.

It's pretty much guaranteed that any extension of these structs will
break some applications. I think the best we can hope for is that
1) applications that only inspect sigframes without copying them
don't break, and
2) applications get a mechanism for detecting the new layout of
sigframes and ucontexts, allowing them to be updated to handle
those changes.
Post by Suresh Siddha
I have to probably use some magic values(or other flags), indicating the
presence of extended state context information in the ucontext.
I believe so too.
Post by Suresh Siddha
cpuid information (cpuid.0x1.ecx.osxsave) indicates whether the
OS supports xsave or not. This can be one way, that signal handlers
can use to interpret the ucontext.
Given that it is more than signal handlers, I can add a flag also,
representing ucontext extensions.
My problem with the OSXAVE flag is that it's a very indirect way of
communicating the layout of sigframes and sigcontexts. These structures
should, if at all possible, be self-describing. A single flag bit in
the sigcontext could handle both structures (since a sigframe always
includes a sigcontext).

[But see my comment below about uc_flags.]
Post by Suresh Siddha
Post by Mikael Pettersson
You're changing the layout of struct ucontext in two ways: uc_mcontext
changes elsewhere, and you're adding __unused and uc_xstate.
I am not changing the uc_mcontext (struct sigcontext). Just extending
the ucontext.
Correct, my mistake.
Post by Suresh Siddha
Post by Mikael Pettersson
How is user-space supposed to know whether it's looking at a current
layout ucontext or an xsave-layout ucontext?
It seems that uc_flags is unused and always zero. Could you define a
flag bit (e.g. 1) for uc_flags to indicate the xsave layout?
Sure. I can even add a magic word to indicate the xsave presence,
so that sigreturn can be double sure and not break older apps.
The uc_flags field is not specified by SUSv3. Linux stores a zero
in it on signal delivery but doesn't use it on signal return.
Solaris describes it as implementation-dependent, and stores a bit
vector in it describing which parts of the ucontext are valid and
should be restored by setcontext (i.e. sigreturn in Linux). Thus
uc_flags would be a perfect place to store an XSAVE flag or cookie.

Using uc_flags takes care of x86-64 and x86-32 with "rt" sigframes
(SA_SIGINFO signal dispositions). The remaining problem is x86-32
with non-SA_SIGINFO sigframes, since they store plain sigcontexts
on the stack not ucontexts, and so don't have a uc_flags field.
To handle those we have to augment either sigcontexts or the plain
non-rt sigframes with some kind of marker.
Post by Suresh Siddha
Post by Mikael Pettersson
Post by Suresh Siddha
--- linux-2.6-x86.orig/arch/x86/kernel/sigframe.h 2008-05-12 13:09:02.000000000 -0700
+++ linux-2.6-x86/arch/x86/kernel/sigframe.h 2008-05-12 13:09:56.000000000 -0700
@@ -3,9 +3,10 @@
char __user *pretcode;
int sig;
struct sigcontext sc;
- struct _fpstate fpstate;
+ struct xstate_cntxt xst_cnxt;
unsigned long extramask[_NSIG_WORDS-1];
char retcode[8];
+ /* fp and rest of the extended context state follows here */
};
Offset to extramask[] and retcode changes, as well as the size of the structure.
hm yes. this will break the restore of extramask[], for apps which set their
own sig return frames.
Updating extramask[] is useful for combining sigprocmask(SIG_SETMASK,_,_) and
sigreturn into an atomic operation. AFAIK, there's no way to reach extramask[]
without knowing the sigframe layout.
Post by Suresh Siddha
I can leave _fpstate as it is(and unused) and move xstate context (which
will contain fpstate + the extended state) after extramask[].
I think that would be best.
Post by Suresh Siddha
Post by Mikael Pettersson
Why contract xstate_cntxt to xst_cnxt? That just obscures things.
ok.
Thanks.
Post by Suresh Siddha
Post by Mikael Pettersson
What is the purpose of the xstate pointer in xstate_cntxt?
As far as I can tell, it's redundant and can alway be derived
from the ucontext's uc_mcontext.fpstate pointer.
This is true in 64bit.
While doing sigreturn, there is no way, kernel can know if the
uc_mcontext.fpstate is pointing to just the legacy fpstate(512 bytes)
or the extended state pointer (unless we incorporate
some magic word at the end of fpstate image, see below)
during sigreturn, fpstate will be restored from the uc_mcontext.fpstate
and extended state will be restored from xstate pointer.
Post by Mikael Pettersson
And why does x86-64 make xstate == fpstate while on x86-32 they're at an
offset from each other?
because sigcontext's 32bit fpstate is different from 64bit fpstate.
32bit fp state is fsave frame followed by fxsave frame. Whereas 64bit
fp state is just fxsave frame. To maintain 32bit legacy compatibility,
32bit xstate is different from 32bit fpstate.
Ok, thanks for clarifying this.
Post by Suresh Siddha
Post by Mikael Pettersson
struct _fpstate has a 'magic' field which distinguishes x87-only
from x87+FXSR structs. Could that field also be used to indicate XSAVE?
I don't think we can use the existing 'magic' field.
Hmm, right now it seems this field has a de-facto ABI of being
either 0xffff (plain) or 0x0000 (fxsr). Using other values would
confuse at least one application I know of. Sad.
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.

Another idea I have at the moment is that it seems that struct
_fpstate reserves storage for x87 in both the legacy part and
the fxsr part. Looking through arch/x86/kernel/i387.c it seems
that the x87 space in the fxsr part isn't actually used (see
convert_to_fxsr()). If so, it might be possible to reuse it for
an XSAVE indicator.

Another idea is that marking non-siginfo sigframes could be done
by adding a sys_xsave_sigreturn() and letting those sigframes
have a return address pointing to a vsyscall that invokes this
new syscall.

/Mikael
Andi Kleen
2008-05-19 15:04:24 UTC
Permalink
Post by Mikael Pettersson
The de-facto ABI for signal delivery and sigreturn is unfortunately
based on fairly fixed-layout structs on the stack (sigframes). The
only flexibility there that I've found is the sigcontext's fpstate
pointer which allows the fpstate to be located elsewhere.
One sure 100% compatible way would be to only change the signal layout once
the application used anything that needs XSAVE/XRSTOR. But implementing
that would be likely complicated and I'm not sure it's worth it.

I don't remember that much breakage when the FXSAVE support was
introduced on i386. That already changed these data structures.
Post by Mikael Pettersson
Another idea is that marking non-siginfo sigframes could be done
by adding a sys_xsave_sigreturn() and letting those sigframes
have a return address pointing to a vsyscall that invokes this
new syscall.
That sounds easy enough. x86-64 right now doesn't use signal vsyscalls but
there is no principle reason it can't.

-Andi
H. Peter Anvin
2008-05-19 16:29:01 UTC
Permalink
Post by Mikael Pettersson
My problem with the OSXAVE flag is that it's a very indirect way of
communicating the layout of sigframes and sigcontexts. These structures
should, if at all possible, be self-describing. A single flag bit in
the sigcontext could handle both structures (since a sigframe always
includes a sigcontext).
It's also wrong, since OSXSAVE indicates that the CPU can do it, not
that the kernel can.
Post by Mikael Pettersson
Post by Suresh Siddha
Post by Mikael Pettersson
struct _fpstate has a 'magic' field which distinguishes x87-only
from x87+FXSR structs. Could that field also be used to indicate XSAVE?
I don't think we can use the existing 'magic' field.
Hmm, right now it seems this field has a de-facto ABI of being
either 0xffff (plain) or 0x0000 (fxsr). Using other values would
confuse at least one application I know of. Sad.
Well, arguably it is the right thing to use since we're talking about a
new format. The difference is that the new format *does* extend
backwards to match the old format.
Post by Mikael Pettersson
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.
All we need is a single field -- a single byte -- reserved indefinitely
for software use. Existing FXSAVE kernels will have set it to zero.

There might be fields the existing FXSAVE format which can be equally
abused, even. I will do some looking.

-hpa
Suresh Siddha
2008-05-19 16:57:35 UTC
Permalink
Post by H. Peter Anvin
Post by Mikael Pettersson
My problem with the OSXAVE flag is that it's a very indirect way of
communicating the layout of sigframes and sigcontexts. These structures
should, if at all possible, be self-describing. A single flag bit in
the sigcontext could handle both structures (since a sigframe always
includes a sigcontext).
It's also wrong, since OSXSAVE indicates that the CPU can do it, not
that the kernel can.
OSXSAVE indicates the OS support and XSAVE indicates the cpu support.
Post by H. Peter Anvin
Post by Mikael Pettersson
Post by Suresh Siddha
Post by Mikael Pettersson
struct _fpstate has a 'magic' field which distinguishes x87-only
from x87+FXSR structs. Could that field also be used to indicate XSAVE?
I don't think we can use the existing 'magic' field.
Hmm, right now it seems this field has a de-facto ABI of being
either 0xffff (plain) or 0x0000 (fxsr). Using other values would
confuse at least one application I know of. Sad.
Well, arguably it is the right thing to use since we're talking about a
new format. The difference is that the new format *does* extend
backwards to match the old format.
There might be some old applications, which just care about FP/SSE and
just check for 0xffff (plain) or 0x0000 (fxsr). We should extend this
in a backward compatible manner.
Post by H. Peter Anvin
Post by Mikael Pettersson
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.
All we need is a single field -- a single byte -- reserved indefinitely
for software use. Existing FXSAVE kernels will have set it to zero.
There might be fields the existing FXSAVE format which can be equally
abused, even. I will do some looking.
All the reserved fields at the end of fxsave format are zeroed and
presented as such to the user. If HW makes some of these fields SW available,
then we can use those (will check). If there is any scope with the
existing format it self, that will be much better.

thanks,
suresh
H. Peter Anvin
2008-05-19 17:45:37 UTC
Permalink
Post by Suresh Siddha
Post by H. Peter Anvin
It's also wrong, since OSXSAVE indicates that the CPU can do it, not
that the kernel can.
OSXSAVE indicates the OS support and XSAVE indicates the cpu support.
Sorry, brainfart. Don't post so early in the morning.
Post by Suresh Siddha
Post by H. Peter Anvin
All we need is a single field -- a single byte -- reserved indefinitely
for software use. Existing FXSAVE kernels will have set it to zero.
There might be fields the existing FXSAVE format which can be equally
abused, even. I will do some looking.
All the reserved fields at the end of fxsave format are zeroed and
presented as such to the user. If HW makes some of these fields SW available,
then we can use those (will check). If there is any scope with the
existing format it self, that will be much better.
I was thinking about what we'd really like earlier, and given a clean
slate I'd like to see a structure looking like:

struct state_ptrs {
size_t len;
struct state_foo *foo;
struct state_bar *bar;
...
};

... where len is sizeof(struct state_ptrs). This is not merely
extensible, but it's easy for userspace to massage it into whatever
format -- longer or shorter -- that it happens to know about, and it
gives a natural way for the kernel to communicate "none of this state"
by feeding a NULL pointer. So pretty much we're looking for a way to
backwards-compatible way to stash a pointer to this structure, I figure.

-hpa
Suresh Siddha
2008-05-20 01:57:24 UTC
Permalink
Post by Mikael Pettersson
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.
Ok. CPU folks are planning to make some of the bytes at the end of fxsave
image, SW usable.

We can use some of these fields, to represent the extended state
presence with a cookie, save area size, mask of the state
stored. If needed, we can include the start address of the fpstate pointer
(also as part of the cookie), so that we can detect the situation,
where apps are just memcopying sizeof(struct _fpstate) from the fpstate
pointer (but not aware of the extended state).

we don't need any ucontext_t extensions any more and just
use the fpstate pointer to indicate the extended state aswell, right?

In addition, we need to make sure that for 32bit non-rt sigframes, we
don't modify the extramask[] offset.

thanks,
suresh
Mikael Pettersson
2008-05-20 08:58:20 UTC
Permalink
Post by Suresh Siddha
Post by Mikael Pettersson
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.
Ok. CPU folks are planning to make some of the bytes at the end of fxsave
image, SW usable.
Nice.
Post by Suresh Siddha
We can use some of these fields, to represent the extended state
presence with a cookie, save area size, mask of the state
stored. If needed, we can include the start address of the fpstate pointer
(also as part of the cookie), so that we can detect the situation,
where apps are just memcopying sizeof(struct _fpstate) from the fpstate
pointer (but not aware of the extended state).
I use a similar technique to detect user-space mangling
of ucontexts on Solaris.
Post by Suresh Siddha
we don't need any ucontext_t extensions any more and just
use the fpstate pointer to indicate the extended state aswell, right?
Yes, the old magic distinguishes x87-only from x87+fxsr, the new magic
distinguishes fxsr from xsave.
Post by Suresh Siddha
In addition, we need to make sure that for 32bit non-rt sigframes, we
don't modify the extramask[] offset.
Thanks,

/Mikael
Andi Kleen
2008-05-20 10:01:23 UTC
Permalink
Post by Suresh Siddha
Post by Mikael Pettersson
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.
Ok. CPU folks are planning to make some of the bytes at the end of fxsave
image, SW usable.
Are they always zeroed in earlier CPUs though? If not that wouldn't
work 100% reliably because whatever cookie you put in could have been
there before by chance.

I don't see anything in the SDM guaranteeing zeroing.

-Andi
Mikael Pettersson
2008-05-20 13:19:58 UTC
Permalink
Post by Andi Kleen
Post by Suresh Siddha
Post by Mikael Pettersson
Post by Suresh Siddha
But we can
use some what similar magic, if the fxsave/fxrstor give away
some of the fields at the end of fxsave image (today it is reserved
and ignored during fxsave/fxrstor) for software use.
We can then use these fields at the end of fpstate, to indicate the presence of
xstate. But this requires some architecture changes like giving
away this space for SW use. We can take this to architects and
see what they think.
If the HW doesn't store anything valuable there, we could store
SW flags/cookies there on signal delivery, and clear them before
fxrstor (unless the HW is known to ignore those fields).
But it depends on how forgiving the HW is.
Ok. CPU folks are planning to make some of the bytes at the end of fxsave
image, SW usable.
Are they always zeroed in earlier CPUs though? If not that wouldn't
work 100% reliably because whatever cookie you put in could have been
there before by chance.
I wrote a test program (fill an area with zeroes, fxsave, inspect
reserved fields, then fill it with ones, fxsave, inspect again),
and all processors appear to just not write anything to the reserved
fields after the last xmm register. (Tested on an old Mobile Athlon64,
Opteron 280, P4 Xeon, Pentium-D, and C2 Xeon E5345.)

So the question now is what if anything has the Linux kernel written
to those reserved fields. (Looking..) Hmm, signal delivery on x86-64
seems to do fxsave directly to the fxsave area in the user's sigframe,
which would imply that the reserved fields have unpredictable values.

/Mikael
H. Peter Anvin
2008-05-20 14:58:13 UTC
Permalink
Post by Mikael Pettersson
Post by Andi Kleen
Are they always zeroed in earlier CPUs though? If not that wouldn't
work 100% reliably because whatever cookie you put in could have been
there before by chance.
I wrote a test program (fill an area with zeroes, fxsave, inspect
reserved fields, then fill it with ones, fxsave, inspect again),
and all processors appear to just not write anything to the reserved
fields after the last xmm register. (Tested on an old Mobile Athlon64,
Opteron 280, P4 Xeon, Pentium-D, and C2 Xeon E5345.)
So the question now is what if anything has the Linux kernel written
to those reserved fields. (Looking..) Hmm, signal delivery on x86-64
seems to do fxsave directly to the fxsave area in the user's sigframe,
which would imply that the reserved fields have unpredictable values.
OK, so that's not a usable path unless we can find some area in the
existing data set to put a flag. Groan.

-hpa
Mikael Pettersson
2008-05-20 15:20:43 UTC
Permalink
Post by H. Peter Anvin
Post by Mikael Pettersson
Post by Andi Kleen
Are they always zeroed in earlier CPUs though? If not that wouldn't
work 100% reliably because whatever cookie you put in could have been
there before by chance.
I wrote a test program (fill an area with zeroes, fxsave, inspect
reserved fields, then fill it with ones, fxsave, inspect again),
and all processors appear to just not write anything to the reserved
fields after the last xmm register. (Tested on an old Mobile Athlon64,
Opteron 280, P4 Xeon, Pentium-D, and C2 Xeon E5345.)
So the question now is what if anything has the Linux kernel written
to those reserved fields. (Looking..) Hmm, signal delivery on x86-64
seems to do fxsave directly to the fxsave area in the user's sigframe,
which would imply that the reserved fields have unpredictable values.
OK, so that's not a usable path unless we can find some area in the
existing data set to put a flag. Groan.
An ugly workaround could be to start clearing one of these fields,
and say that the data there is only valid for kernels >= 2.6.26.
(I said it was ugly...)

Or we go back to stashing a flag in uc_flags (which is kosher),
and try to figure out how to mark non-rt sigframes.

/Mikael
Suresh Siddha
2008-05-20 17:53:25 UTC
Permalink
Post by Mikael Pettersson
Post by H. Peter Anvin
Post by Mikael Pettersson
So the question now is what if anything has the Linux kernel written
to those reserved fields. (Looking..) Hmm, signal delivery on x86-64
seems to do fxsave directly to the fxsave area in the user's sigframe,
which would imply that the reserved fields have unpredictable values.
OK, so that's not a usable path unless we can find some area in the
existing data set to put a flag. Groan.
An ugly workaround could be to start clearing one of these fields,
and say that the data there is only valid for kernels >= 2.6.26.
(I said it was ugly...)
Or we go back to stashing a flag in uc_flags (which is kosher),
and try to figure out how to mark non-rt sigframes.
This issue of not-zeroing, is present in only 64bit kernels and for 64bit apps,
right?

64bit app signal handling uses only rt_frame, so we can add an uc_flag for
them and for 32bit apps, kernel was always zero'ing the reserved bits
at the end of _fpstate.

In short, for non-rt frames, they can check the reserved bits at the end
of fpstate frame and for rt-frames (perhaps even for 32bit rt frame handling)
apps can check for uc_flag aswell, for extended state presence. Is this
good enough?

thanks,
suresh
H. Peter Anvin
2008-05-20 17:59:41 UTC
Permalink
Post by Suresh Siddha
This issue of not-zeroing, is present in only 64bit kernels and for 64bit apps,
right?
64bit app signal handling uses only rt_frame, so we can add an uc_flag for
them and for 32bit apps, kernel was always zero'ing the reserved bits
at the end of _fpstate.
In short, for non-rt frames, they can check the reserved bits at the end
of fpstate frame and for rt-frames (perhaps even for 32bit rt frame handling)
apps can check for uc_flag aswell, for extended state presence. Is this
good enough?
Are we sure about the "always" bit here (I suspect yes, but we may want
to double-check at least back to 2.2 or 2.4.)

Anyway, seems reasonable enough to me.

-hpa
H. Peter Anvin
2008-05-22 00:28:41 UTC
Permalink
Post by Suresh Siddha
Post by Mikael Pettersson
An ugly workaround could be to start clearing one of these fields,
and say that the data there is only valid for kernels >= 2.6.26.
(I said it was ugly...)
Or we go back to stashing a flag in uc_flags (which is kosher),
and try to figure out how to mark non-rt sigframes.
This issue of not-zeroing, is present in only 64bit kernels and for 64bit apps,
right?
64bit app signal handling uses only rt_frame, so we can add an uc_flag for
them and for 32bit apps, kernel was always zero'ing the reserved bits
at the end of _fpstate.
In short, for non-rt frames, they can check the reserved bits at the end
of fpstate frame and for rt-frames (perhaps even for 32bit rt frame handling)
apps can check for uc_flag aswell, for extended state presence. Is this
good enough?
Okay, trying to close on this :)

I would suggest using the uc_flag for the rt frames, and simply rely on
the OSXSAVE flag for non-rt signal frames. It a rather sucky approach
(as previously discussed), but since any sane user of these fields (as
opposed to just relying on the kernel to save/restore) should use the
SIGINFO frames, I don't see a problem *as long as it's possible to get
the information* -- any solution which demands performance should just
turn on SIGINFO and be happy.

The biggest potential problem with this that I see is that relying on
CPUID can mess with certain virtualization solutions. Another option to
accomplish the same thing would be to have a system call (preferrably a
prctl, since it is at least in theory personality-dependent) to query
what information is included in the fpstate data - since it will always
be the same for any particular kernel.

Thoughts?

-hpa
Roland McGrath
2008-05-22 00:53:45 UTC
Permalink
Post by H. Peter Anvin
The biggest potential problem with this that I see is that relying on
CPUID can mess with certain virtualization solutions. Another option to
accomplish the same thing would be to have a system call (preferrably a
prctl, since it is at least in theory personality-dependent) to query
what information is included in the fpstate data - since it will always
be the same for any particular kernel.
It's easy and cheap to add an indicator to the vDSO for this.


Thanks,
Roland
H. Peter Anvin
2008-05-22 01:38:31 UTC
Permalink
Post by Roland McGrath
Post by H. Peter Anvin
The biggest potential problem with this that I see is that relying on
CPUID can mess with certain virtualization solutions. Another option to
accomplish the same thing would be to have a system call (preferrably a
prctl, since it is at least in theory personality-dependent) to query
what information is included in the fpstate data - since it will always
be the same for any particular kernel.
It's easy and cheap to add an indicator to the vDSO for this.
Yes, but I suspect for legacy apps running without vDSO might matter.

-hpa
Roland McGrath
2008-05-22 06:40:26 UTC
Permalink
Post by H. Peter Anvin
Yes, but I suspect for legacy apps running without vDSO might matter.
"Legacy" apps want to be changed to make a new kind of kernel query
and care about new details of signal frame layout for new features
they never used before, but don't want to handle the vDSO?


Thanks,
Roland
H. Peter Anvin
2008-05-22 07:18:33 UTC
Permalink
Post by Roland McGrath
Post by H. Peter Anvin
Yes, but I suspect for legacy apps running without vDSO might matter.
"Legacy" apps want to be changed to make a new kind of kernel query
and care about new details of signal frame layout for new features
they never used before, but don't want to handle the vDSO?
I'm talking mostly about semi-embedded ISVs that have managed to get
themselves funny ideas about what they don't want to change. The vDSO
definitely involves more machinery to get to.

-hpa
Mikael Pettersson
2008-05-22 08:49:34 UTC
Permalink
Post by Roland McGrath
Post by H. Peter Anvin
The biggest potential problem with this that I see is that relying on
CPUID can mess with certain virtualization solutions. Another option to
accomplish the same thing would be to have a system call (preferrably a
prctl, since it is at least in theory personality-dependent) to query
what information is included in the fpstate data - since it will always
be the same for any particular kernel.
It's easy and cheap to add an indicator to the vDSO for this.
For our user-space virtualisation code a plain system call,
like prctl(), would be strongly preferable. Right now I
can't say whether a vDSO solution would work at all for us,
but a system call solution would be no problem at all.

/Mikael
Mikael Pettersson
2008-05-22 08:57:03 UTC
Permalink
Post by H. Peter Anvin
Post by Suresh Siddha
Post by Mikael Pettersson
An ugly workaround could be to start clearing one of these fields,
and say that the data there is only valid for kernels >= 2.6.26.
(I said it was ugly...)
Or we go back to stashing a flag in uc_flags (which is kosher),
and try to figure out how to mark non-rt sigframes.
This issue of not-zeroing, is present in only 64bit kernels and for 64bit apps,
right?
64bit app signal handling uses only rt_frame, so we can add an uc_flag for
them and for 32bit apps, kernel was always zero'ing the reserved bits
at the end of _fpstate.
In short, for non-rt frames, they can check the reserved bits at the end
of fpstate frame and for rt-frames (perhaps even for 32bit rt frame handling)
apps can check for uc_flag aswell, for extended state presence. Is this
good enough?
Okay, trying to close on this :)
I would suggest using the uc_flag for the rt frames, and simply rely on
the OSXSAVE flag for non-rt signal frames. It a rather sucky approach
(as previously discussed), but since any sane user of these fields (as
opposed to just relying on the kernel to save/restore) should use the
SIGINFO frames, I don't see a problem *as long as it's possible to get
the information* -- any solution which demands performance should just
turn on SIGINFO and be happy.
I don't have the luxury to unconditionally change non-rt signal delivery
to rt signal delivery, but using uc_flags plus OSXSAVE or prctl() to
announce these layout changes would work for us. Of course, any existing
sigframe mangling (as opposed to just reading it) code must be updated
to avoid breakage, but that's unavoidable.
Post by H. Peter Anvin
The biggest potential problem with this that I see is that relying on
CPUID can mess with certain virtualization solutions. Another option to
accomplish the same thing would be to have a system call (preferrably a
prctl, since it is at least in theory personality-dependent) to query
what information is included in the fpstate data - since it will always
be the same for any particular kernel.
Thoughts?
Ok for me.

/Mikael
Suresh Siddha
2008-05-22 20:56:19 UTC
Permalink
Post by Mikael Pettersson
Post by H. Peter Anvin
Post by Suresh Siddha
In short, for non-rt frames, they can check the reserved bits at the end
of fpstate frame and for rt-frames (perhaps even for 32bit rt frame handling)
apps can check for uc_flag aswell, for extended state presence. Is this
good enough?
Okay, trying to close on this :)
I would suggest using the uc_flag for the rt frames, and simply rely on
the OSXSAVE flag for non-rt signal frames. It a rather sucky approach
(as previously discussed), but since any sane user of these fields (as
opposed to just relying on the kernel to save/restore) should use the
SIGINFO frames, I don't see a problem *as long as it's possible to get
the information* -- any solution which demands performance should just
turn on SIGINFO and be happy.
I don't have the luxury to unconditionally change non-rt signal delivery
to rt signal delivery, but using uc_flags plus OSXSAVE or prctl() to
announce these layout changes would work for us. Of course, any existing
sigframe mangling (as opposed to just reading it) code must be updated
to avoid breakage, but that's unavoidable.
Post by H. Peter Anvin
The biggest potential problem with this that I see is that relying on
CPUID can mess with certain virtualization solutions. Another option to
hpa, What is the virtualization problem? Are you referring to perf problem?
As you noted, regular non-rt signal handlers won't need this cpuid check. It's
needed only for those who manually look at non-rt signal frames and interpret it.
And also, they can do this check only once and not everytime.

To me, prtcl() just seems to be an overkill.

While restoring from the user, kernel also need to find out what layout
the user is passing. So it's bi-directional. I prefer the same mechanism
(using cookies/magic numbers etc inaddition to uc_flags or cpuid checks) to
interpret the fpstate for both user/kernel.

ARM also seem to be using similar things while extending their ucontext_t,
with out other kernel interfaces to indicate the layout.

BTW, how come 32bit kernel doesn't have the X86_FXSR_MAGIC checks, while restoring
the extended fxsave data from _fpstate?

thanks,
suresh
Post by Mikael Pettersson
Post by H. Peter Anvin
accomplish the same thing would be to have a system call (preferrably a
prctl, since it is at least in theory personality-dependent) to query
what information is included in the fpstate data - since it will always
be the same for any particular kernel.
Thoughts?
Ok for me.
/Mikael
H. Peter Anvin
2008-05-22 21:02:28 UTC
Permalink
Post by Suresh Siddha
hpa, What is the virtualization problem? Are you referring to perf problem?
As you noted, regular non-rt signal handlers won't need this cpuid check. It's
needed only for those who manually look at non-rt signal frames and interpret it.
And also, they can do this check only once and not everytime.
No, relying on CPUID and vdso both have implications for virtualization.
Post by Suresh Siddha
To me, prtcl() just seems to be an overkill.
I don't think it is ... it's not overkill but rather "underkill"... it's
a low-performance solution but it's guaranteed to be safe in the
presence of virtualization of all its various ilk. Note that you don't
need to be able to *set* the format via prctl(), just *query* (get) it.

Using prctl() allows us to make this personality-dependent if we ever
need to.
Post by Suresh Siddha
While restoring from the user, kernel also need to find out what layout
the user is passing. So it's bi-directional. I prefer the same mechanism
(using cookies/magic numbers etc inaddition to uc_flags or cpuid checks) to
interpret the fpstate for both user/kernel.
No, it really doesn't: the kernel only needs to be able to read the same
format as it itself wrote.
Post by Suresh Siddha
ARM also seem to be using similar things while extending their ucontext_t,
with out other kernel interfaces to indicate the layout.
BTW, how come 32bit kernel doesn't have the X86_FXSR_MAGIC checks, while restoring
the extended fxsave data from _fpstate?
Again, the kernel already knows the format, so it doesn't need to check.

-hpa
Mikael Pettersson
2008-05-23 11:46:25 UTC
Permalink
Post by Mikael Pettersson
Hmm, right now it seems this field has a de-facto ABI of being
either 0xffff (plain) or 0x0000 (fxsr). Using other values would
confuse at least one application I know of. Sad.
... could you share what application that is? This otherwise would be
ideal.
It's the virtual machine for the Erlang/OTP concurrent functional
language. It's open-source and used extensively for telecom and
internet server applications.

The language doesn't allow non-finite FP values, so FP ops must
be checked. As an optimisation, the virtual machine groups FP ops
into blocks, enables FP exceptions, and has a SIGFPE handler that
just records that an FP exception has occurred in a per-thread flag.
After each FP block there's a single check for this flag, which if
set triggers a language-level "badarith" exception.

In order to resume from an SSE2 exception the SIGFPE handler must
update some parts of the sigcontext. But first the handler must detect
if the exception came from x87 or SSE2, which requires looking at the
mxcsr field, which requires looking at the magic field to detect the
fpstate's layout.
struct _fpstate {
...
unsigned short magic; /* 0xffff = regular FPU data only */
...
};
#define X86_FXSR_MAGIC 0x0000
While this in theory allows for other values than -1 and 0 in magic,
any such value would be != FXSR_MAGIC and therefore would not trigger
the application's "inspect the fxsr layout" logic.

/Mikael
Andi Kleen
2008-05-23 12:11:54 UTC
Permalink
Post by Mikael Pettersson
Post by Mikael Pettersson
Hmm, right now it seems this field has a de-facto ABI of being
either 0xffff (plain) or 0x0000 (fxsr). Using other values would
confuse at least one application I know of. Sad.
... could you share what application that is? This otherwise would be
ideal.
It's the virtual machine for the Erlang/OTP concurrent functional
language. It's open-source and used extensively for telecom and
internet server applications.
I had a few reports over the years of other applications ddoing very weird
specific things with the sigframe. So I agree it's fairly important
to be as compatible here as possible.

-Andi
H. Peter Anvin
2008-05-23 16:57:32 UTC
Permalink
What I was doing in the RFC is: restore the state what ever that was
present and init the state that was not present in the stack frame.
That is consistent in spirit with the existing treatment of FPU data.
That is, if the sigcontext.fpstate pointer is NULL, the thread's FPU
state is reset to default. (And despite what hpa said about being
"supported", the facts in the code are that sigreturn just follows the
sigcontext.fpstate pointer, whatever it is. On 32-bit, the pointer is
NULL in the context saved when the thread had not used the FPU, so
modifying the sigcontext to include FP state when it didn't before
requires putting in some user-chosen pointer. There in fact may well be
existing code that does user-level coroutine switching using sigreturn
and relies on this, for all we know.)
Okay. Pretty much what it comes down to is that there is no ideal
solution. Thus, we're trying to explore the potential tradeoffs. The
scenario you describe above will crash horribly for a non-FXSAVE aware
application running on an FXSAVE kernel.

Either way, there has been a long time since, and new bad applications
have obviously emerged, partially "assisted" by our propensity to not
document, and the deep gulf between our kernel and userspace developers.

Let's try another strawman on for size:

- It is clear it is desirable not just for the frame itself but for the
fpstate to be self-describing.

- Thus, let's put a magic cookie in one of the reserved fields at the
end of the FXSAVE region, and make sure it is long enough to be unlikely
to pop up randomly; as well as another magic cookie outside the FXSAVE
region.

- The signal delivery code will write the cookie (or zero, for !XSAVE)
regardless of any crap ptrace might have written into it.

- We will ALSO set bit 0 in uc_flags for RT sigframes as an additional
assurance.

- We will introduce at least a 32-bit field for future use, to be
written unconditionally zero for now. We don't want to have to go
through this particular torture yet again.

- The XSAVE state beyond the FXSAVE region needs to be self-describing.
This may mean adding information not provided by the hardware.
Furthermore, it must be possible for userspace to know the length of the
frame, even if it doesn't understand its detailed contents.


None of this is foolproof on older kernels -- there simply *IS* no
option for older kernels that is 100% guaranteed, thanks to various
assumptions made and design decisions taken over the years. There are a
couple of failure scenarious here:

- XSAVE-aware application running on pre-XSAVE kernel:

Such an application will be aware that the XSAVE information may not
exist, but needs to know (with high probability) that it isn't
present. We have CPUID.OSXSAVE, the uc_flags bit, and the magic
cookie to help here. ptrace can introduce the magic cookie falsely
into the state, but ptrace can introduce all kinds of failures;
either way they would (probablistically) not see the *second*
cookie.

The fact that 64-bit kernels don't clear the unused fields is of
less concern, since 64-bit kernels get the uc_flags field.

- Pre-XSAVE application running on XSAVE kernel with XSAVE enabled:

Here we have the potential for all kinds of corrupt state, including
userspace trying to save away the state and load it later, not
knowing the proper size of it. Worse, some sick person might try to
save and restore state from different hosts, with potential for
all kinds of mayhem.

The saved state, if copied from the original, would contain the first
cookie but not the second cookie.

Again, the use of two cookies here adds some amount of assurance;
but that again amounts to probabilistic failure detection. However,
I personally don't see any way to avoid that scenario at all.

-hpa
Suresh Siddha
2008-05-23 17:50:14 UTC
Permalink
Post by H. Peter Anvin
What I was doing in the RFC is: restore the state what ever that was
present and init the state that was not present in the stack frame.
That is consistent in spirit with the existing treatment of FPU data.
That is, if the sigcontext.fpstate pointer is NULL, the thread's FPU
state is reset to default. (And despite what hpa said about being
"supported", the facts in the code are that sigreturn just follows the
sigcontext.fpstate pointer, whatever it is. On 32-bit, the pointer is
NULL in the context saved when the thread had not used the FPU, so
modifying the sigcontext to include FP state when it didn't before
requires putting in some user-chosen pointer. There in fact may well be
existing code that does user-level coroutine switching using sigreturn
and relies on this, for all we know.)
Okay. Pretty much what it comes down to is that there is no ideal
solution. Thus, we're trying to explore the potential tradeoffs. The
scenario you describe above will crash horribly for a non-FXSAVE aware
application running on an FXSAVE kernel.
yes, who ever added fxsave extensions missed this and we def want to handle
these kind of scenarios in xsave context.
Post by H. Peter Anvin
Either way, there has been a long time since, and new bad applications
have obviously emerged, partially "assisted" by our propensity to not
document, and the deep gulf between our kernel and userspace developers.
- It is clear it is desirable not just for the frame itself but for the
fpstate to be self-describing.
Yes.
Post by H. Peter Anvin
- Thus, let's put a magic cookie in one of the reserved fields at the
end of the FXSAVE region, and make sure it is long enough to be unlikely
to pop up randomly; as well as another magic cookie outside the FXSAVE
region.
- The signal delivery code will write the cookie (or zero, for !XSAVE)
regardless of any crap ptrace might have written into it.
- We will ALSO set bit 0 in uc_flags for RT sigframes as an additional
assurance.
- We will introduce at least a 32-bit field for future use, to be
written unconditionally zero for now. We don't want to have to go
through this particular torture yet again.
This is exactly what I have been trying to say. CPU folks are going
to document that bytes[464..511] of fxsave frame are going to be
SW usable. We can use some of these to represent the cookie, size
of the xsave frame, state that is saved in the frame etc. And the
remaining unused bytes can be cleared to zero for future use.
Post by H. Peter Anvin
as well as another magic cookie outside the FXSAVE region.
Is this to avoid issues with memcpy, who end up copying only fxsave
info but not the extended state? We don't need another magic
cookie outside the fxsave region, we can include the
address of the _fpstate pointer along side of the first cookie
in the SW usable portion of the fxsave state.
Post by H. Peter Anvin
- The XSAVE state beyond the FXSAVE region needs to be self-describing.
This may mean adding information not provided by the hardware.
In my RFC, I have added size of the xsave, state mask saved in xsave
frame. I will move this to the sw usable portion of fxsave. We can
add any thing more if desired.
Post by H. Peter Anvin
Furthermore, it must be possible for userspace to know the length of the
frame, even if it doesn't understand its detailed contents.
With the above, we are just changing the _fpstate layout (which
includes the length). But if we need frame size in the
sigframe layout, we can add it (and it can be done outside the scope of xsave
patches).
Post by H. Peter Anvin
None of this is foolproof on older kernels -- there simply *IS* no
option for older kernels that is 100% guaranteed, thanks to various
assumptions made and design decisions taken over the years. There are a
Such an application will be aware that the XSAVE information may not
exist, but needs to know (with high probability) that it isn't
present. We have CPUID.OSXSAVE, the uc_flags bit, and the magic
cookie to help here. ptrace can introduce the magic cookie falsely
into the state, but ptrace can introduce all kinds of failures;
either way they would (probablistically) not see the *second*
cookie.
Application can use cpuid.OSXSAVE (even in
paravirtualization, hypervisors will prevent the leak of
cpuid.osxsave for kernels which don't use it) reliably. And also,
we have uc_flags (for rt-frame) and other cookies in the SW usable
portion of fxsave frame.

In all the cases, kernel should be as graceful as it can be. xsave
aware kernel should try to find and restore xsave state and if there
is any failure, try to restore just the fpstate and init the rest.
Post by H. Peter Anvin
The fact that 64-bit kernels don't clear the unused fields is of
less concern, since 64-bit kernels get the uc_flags field.
Here we have the potential for all kinds of corrupt state, including
userspace trying to save away the state and load it later, not
knowing the proper size of it. Worse, some sick person might try to
save and restore state from different hosts, with potential for
all kinds of mayhem.
The saved state, if copied from the original, would contain the first
cookie but not the second cookie.
Again, the use of two cookies here adds some amount of assurance;
I think the pointer along with the first cookie is enough right? There
is no SW usable portion in the xsave layout outside the fxsave area.
And for compatibility reasons, its best to keep the xsave layout
similar to what the cpu uses.
Post by H. Peter Anvin
but that again amounts to probabilistic failure detection. However,
I personally don't see any way to avoid that scenario at all.
In this case, kernel will atleast not corrupt the fxsave state, which
the application cares about.

thanks,
suresh
Suresh Siddha
2008-05-23 18:09:28 UTC
Permalink
hmm.. so the kernel needs to export all the cpuid info (that the kernel
enables and supports) to the user through some mechanism then?
For a cheap interface, we use AT_HWCAP for this. Unfortunately that only
covers the first 32 bits of CPUID info. (For another cheap interface,
giving all the CPUID bits in the vDSO would be easy.)
For a clunky interface that already exists and is "simple" to use,
there is /dev/cpu/0/cpuid now. I wonder if having a device node and
opening it too much for applications that consider the vDSO too complex.
I doubt it.
Ok. If really needed, they can use this interface aswell. But I don't
see a need for a new system call / other mechanism, just for xsave
purpose. They can rely on cpuid or any other equivalent infrastructure the
kernel provides.

thanks,
suresh
H. Peter Anvin
2008-06-06 00:28:15 UTC
Permalink
Sorry for not getting back on this for so long.

I have looked at the XSAVE architecture, and it is pretty darn hideous,
mostly because it doesn't describe itself in the absence of CPUID
information. Given that, it would have been much better if there had
been separate invocations of XSAVE for each substate region. On the
other hand, normalizing to the current CPU format is probably desirable
anyway.

I would like to make this proposal for the signal frames (again, flagged
with a uc_flags bit for RT frames):

- The SW-reserved areas at the bottom of the FXSAVE region will be used
as follows:
- A magic number (M1)
- A length pointer (L1), giving the length of the entire XSAVE region.

- At the end of the XSAVE region, i.e. at the offset given by the length
pointer, we create a secondary structure looking something like this:

- Magic number (M2)
- Descriptor count (DC)
- DC * <EBX, EAX> from CPUID leaf 0Dh
- Possibly a checksum or CRC of this structure

Note that this tail structure will always be the same on a given kernel,
so it can be pre-canned at boot time. This tail structure serves two
purposes:

- It can be used to verify against truncation of the state.
(I.e. if an XSAVE-unaware application tries to copy and save away
a state and later restore it, but only copies the first 512 bytes
and later just puts a pointer to it.)
- It can be used to verify against an alien state (saved and restored
from another CPU, or even just another kernel version with different
support.)

If there is a mismatch, we can then take appropriate action:

- No M1 or M2 signature, or L1 or DC are insane:
-> Reinitialize any non-FXSAVE state.

- M1, M2, L1, and DC make sense, but mismatch on DC or descriptor
offsets:
-> Do a region-by-region copy in of the state; reinitialize any
regions not provided.

- Mismatch on descriptor sizes:
-> Consider that region corrupt and reinitialize?

The region-by-region copy could of course be used even in the same-CPU
case, if there turns out to be a negible performance difference over
whole-block copy.

Thoughts?

-hpa
Suresh Siddha
2008-06-06 20:14:21 UTC
Permalink
Post by H. Peter Anvin
Sorry for not getting back on this for so long.
I thought we had a closure on the previous thread. But no problem.
It's better late than never.
Post by H. Peter Anvin
I have looked at the XSAVE architecture, and it is pretty darn hideous,
mostly because it doesn't describe itself in the absence of CPUID
information.
xsave,xrstor are performance senstive instructions as they are used
in process context switches. It doesn't have to describe itself and
at any time, one can get all the xsave relevant layout information using
cpuid. And when needed, SW can always pass extra information with the
xsave image.
Post by H. Peter Anvin
Given that, it would have been much better if there had
been separate invocations of XSAVE for each substate region.
It's primarily designed for context switch handling, with
fxsave's interoperability in mind. Even though the substates are
laidout one after another, xsave/xrstor can operate on individual
states based on the edx:eax mask.

Each substate will probably also have different instructions to
save/restore/init their substate.
Post by H. Peter Anvin
On the
other hand, normalizing to the current CPU format is probably desirable
anyway.
yes.
Post by H. Peter Anvin
I would like to make this proposal for the signal frames (again, flagged
- The SW-reserved areas at the bottom of the FXSAVE region will be used
- A magic number (M1)
- A length pointer (L1), giving the length of the entire XSAVE region.
As previously agreed.
Post by H. Peter Anvin
- At the end of the XSAVE region, i.e. at the offset given by the length
- Magic number (M2)
As I mentioned earlier, we can avoid this magic number, by including
a pointer (which points to start of the fp and xstate on stack) along with M1.

This will catch any one copying the FP state of the frame but not aware of
Xstate.
Post by H. Peter Anvin
- Descriptor count (DC)
- DC * <EBX, EAX> from CPUID leaf 0Dh
As you mentioned, this doesn't change after a kernel boot. So do we really
need to save this static information on every signal? (also please see below
about the compaction).
Post by H. Peter Anvin
- Possibly a checksum or CRC of this structure
Note that this tail structure will always be the same on a given kernel,
so it can be pre-canned at boot time. This tail structure serves two
- It can be used to verify against truncation of the state.
(I.e. if an XSAVE-unaware application tries to copy and save away
a state and later restore it, but only copies the first 512 bytes
and later just puts a pointer to it.)
As I mentioned above, pointer along with M1 should be enough to catch this?
Post by H. Peter Anvin
- It can be used to verify against an alien state (saved and restored
from another CPU, or even just another kernel version with different
support.)
Though the xsave layout is extendable, save area is not
compacted if some features are not supported by processor and/or
system software. This is documented in Vol 2b under "xsave"
instruction.

In my RFC, we had the bit mask also saved in the SW-reserved area,
which represents the extended state saved in the signal frame.
This is in addition to the bit mask represeted by the xstate_bv
in the header. xstate_bv indicates the current status (init/non-init)
of the sub-states for the bit mask saved in the SW-reserved area.

While restoring, kernel can also use the same bit mask in SW area to restore
the state and init the other state not referred by the bit mask.
Post by H. Peter Anvin
-> Reinitialize any non-FXSAVE state.
- M1, M2, L1, and DC make sense, but mismatch on DC or descriptor
-> Do a region-by-region copy in of the state; reinitialize any
regions not provided.
Given that the descriptor offsets don't change, we can
achieve the same thing with a bit mask representing the state in
the xsave layout. xrstor with the approriate bit masks will automatically
restore/init the state.
Post by H. Peter Anvin
-> Consider that region corrupt and reinitialize?
The region-by-region copy could of course be used even in the same-CPU
case, if there turns out to be a negible performance difference over
whole-block copy.
Today in 64bit, we directly do fxsave/fxrstor in and out of user-space
for signal handlers. I would like to retain this behavior as much as possible
with xsave/xrstor aswell (and at the same time, provide as much information
as possible for the user to interpret the signal frame). Bit mask representing
the state saved in the xsave image, M1, length and some cookie (pointer along
with M1) to detect the image truncation can achieve this. Isn't it?

thanks,
suresh
H. Peter Anvin
2008-06-06 23:03:51 UTC
Permalink
Post by Suresh Siddha
I thought we had a closure on the previous thread. But no problem.
It's better late than never.
I apologize. The last two months have been exceptionally tough.
Post by Suresh Siddha
xsave,xrstor are performance senstive instructions as they are used
in process context switches. It doesn't have to describe itself and
at any time, one can get all the xsave relevant layout information using
cpuid. And when needed, SW can always pass extra information with the
xsave image.
Yes, the big problem with it is its monolithic nature (with no realistic
alternate instruction.)
Post by Suresh Siddha
Post by H. Peter Anvin
- Magic number (M2)
As I mentioned earlier, we can avoid this magic number, by including
a pointer (which points to start of the fp and xstate on stack) along with M1.
As I mentioned before, this introduces a very different constraint,
which is a really bad precedent; data shouldn't be dependent on its own
location.
Post by Suresh Siddha
This will catch any one copying the FP state of the frame but not aware of
Xstate.
Post by H. Peter Anvin
- Descriptor count (DC)
- DC * <EBX, EAX> from CPUID leaf 0Dh
As you mentioned, this doesn't change after a kernel boot. So do we really
need to save this static information on every signal? (also please see below
about the compaction).
I think given the compaction constraint we're okay with the bitmap plus
length of the area.
Post by Suresh Siddha
Post by H. Peter Anvin
- Possibly a checksum or CRC of this structure
Note that this tail structure will always be the same on a given kernel,
so it can be pre-canned at boot time. This tail structure serves two
- It can be used to verify against truncation of the state.
(I.e. if an XSAVE-unaware application tries to copy and save away
a state and later restore it, but only copies the first 512 bytes
and later just puts a pointer to it.)
As I mentioned above, pointer along with M1 should be enough to catch this?
I think the pointer is a really really bad idea. Even if we don't need
the structure I think having a tail magic is the better alternative, and
it's also really cheap to do since you have to have the length pointer
anyway.
Post by Suresh Siddha
Post by H. Peter Anvin
- It can be used to verify against an alien state (saved and restored
from another CPU, or even just another kernel version with different
support.)
Though the xsave layout is extendable, save area is not
compacted if some features are not supported by processor and/or
system software. This is documented in Vol 2b under "xsave"
instruction.
Ah, you're right, my bad. That does make the problem substantially
simpler (I somehow read only the second half of the and/or clause, but
it's all there.) So, OK, no need for descriptors (he says, as he waits
for the architectural shoe to drop, especially in a multivendor
environment.)
Post by Suresh Siddha
Given that the descriptor offsets don't change, we can
achieve the same thing with a bit mask representing the state in
the xsave layout. xrstor with the approriate bit masks will automatically
restore/init the state.
Agreed.
Post by Suresh Siddha
Post by H. Peter Anvin
-> Consider that region corrupt and reinitialize?
The region-by-region copy could of course be used even in the same-CPU
case, if there turns out to be a negible performance difference over
whole-block copy.
Today in 64bit, we directly do fxsave/fxrstor in and out of user-space
for signal handlers. I would like to retain this behavior as much as possible
with xsave/xrstor aswell (and at the same time, provide as much information
as possible for the user to interpret the signal frame). Bit mask representing
the state saved in the xsave image, M1, length and some cookie (pointer along
with M1) to detect the image truncation can achieve this. Isn't it?
If the state is complete, which it of course will be something like
99.9999% of the time, then doing XRSTOR from user space should work just
fine. The case of having to stitch up state is clearly an exceptional
case, which is not at all performance critical in any way.

-hpa

H. Peter Anvin
2008-05-20 17:57:04 UTC
Permalink
This post might be inappropriate. Click to display it.
H. Peter Anvin
2008-05-20 14:55:46 UTC
Permalink
Post by Andi Kleen
Post by Suresh Siddha
Ok. CPU folks are planning to make some of the bytes at the end of fxsave
image, SW usable.
Are they always zeroed in earlier CPUs though? If not that wouldn't
work 100% reliably because whatever cookie you put in could have been
there before by chance.
I don't see anything in the SDM guaranteeing zeroing.
I'm pretty sure they weren't zeroed by the CPUs. If they weren't zeroed
*by the kernel*, there might have been an information leak.

-hpa
Andi Kleen
2008-05-20 15:03:55 UTC
Permalink
Post by H. Peter Anvin
Post by Andi Kleen
Post by Suresh Siddha
Ok. CPU folks are planning to make some of the bytes at the end of fxsave
image, SW usable.
Are they always zeroed in earlier CPUs though? If not that wouldn't
work 100% reliably because whatever cookie you put in could have been
there before by chance.
I don't see anything in the SDM guaranteeing zeroing.
I'm pretty sure they weren't zeroed by the CPUs. If they weren't zeroed
*by the kernel*, there might have been an information leak.
I don't think there is one. We never copy fxsave completely out of the
kernel. x86-64 does FXSAVE directly in/out user space, but the
only leak is what there was before.

-Andi
Roland McGrath
2008-05-20 20:10:44 UTC
Permalink
Post by Andi Kleen
I don't think there is one. We never copy fxsave completely out of the
kernel. x86-64 does FXSAVE directly in/out user space, but the
only leak is what there was before.
ptrace/user_regset copies out and in the whole fxsave block from the ptrace
caller. (Only the mxcsr word is constrained after copy-in.)


Thanks,
Roland
H. Peter Anvin
2008-05-22 00:05:47 UTC
Permalink
Post by Roland McGrath
Post by Andi Kleen
I don't think there is one. We never copy fxsave completely out of the
kernel. x86-64 does FXSAVE directly in/out user space, but the
only leak is what there was before.
ptrace/user_regset copies out and in the whole fxsave block from the ptrace
caller. (Only the mxcsr word is constrained after copy-in.)
I see two problems with that:

1. potential information leak out of the kernel if the memory area isn't
zeroed before the first FXSAVE - I haven't verified if so is the case.
This would be a (potentially very serious) security hole.

2. Hidden state in the kernel - this means user space can set
nonarchitectural state in the kernel. There are a few risks with that:

a. Malware might use it to hide state.
b. The possibility of using the stability or lack thereof of this
state to extract information about kernel internals and/or
provide a covert channel in the presence of hardware changes.
c. It is not certain that future architectures will not have
off-limit fields here, like the equivalent of MXCSR. This is
somewhat of a tricky judgement, of course, but it seems safer
to me if we would explicitly list the modifiable fields.

Thoughts?

-hpa
Roland McGrath
2008-05-22 00:47:52 UTC
Permalink
This post might be inappropriate. Click to display it.
Andi Kleen
2008-05-22 08:14:21 UTC
Permalink
Post by Roland McGrath
AFAIK the only reason we control the MXCSR value is because we don't
want to handle a #GP in context switch.
x86-64 actually handles it in context switch, just i386 doesn't.

-Andi
Continue reading on narkive:
Loading...