Discussion:
[rfc 00/45] [RFC] CPU ops and a rework of per cpu data handling on x86_64
c***@sgi.com
2007-11-20 01:11:32 UTC
This is a pretty early draft stage of the patch. It works on
x86_64 only. It's a bit massive, so I'd like to have some feedback
before proceeding (and maybe some help).

The support for other arches was not tested yet.

The patch establishes a new set of cpu operations that exploit
single-instruction atomicity to allow per cpu variable modifications
without disabling/enabling preemption or interrupts and without
an offset calculation to determine the location of the variable
on the current processor.
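
At the C level the difference looks roughly like this (the counter
structure is only for illustration; a more detailed version of this
example accompanies the cpu ops patch later in the series):

	struct stat_struct { unsigned long counter; };
	struct stat_struct *stat;	/* per cpu object, e.g. from cpu_alloc() */
	unsigned long flags;

	/* Current style: disable interrupts, find this cpu's instance, modify. */
	local_irq_save(flags);
	CPU_PTR(stat, smp_processor_id())->counter++;
	local_irq_restore(flags);

	/* With the new cpu ops: a single call that the arch can map to
	 * one interrupt-safe instruction, no irq/preempt fiddling needed. */
	CPU_INC(stat->counter);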

It then implements these operations on x86_64 after consolidating
per cpu access for allocpercpu, percpu and the pda. All per
cpu data is then accessible via a gs segment override.

This results in a reduction in kernel code size and in more efficient
per cpu access.

Before:
text data bss dec hex filename
4041907 512371 1302360 5856638 595d7e vmlinux

After (this includes the code added for the cpu allocator!):

text data bss dec hex filename
3861532 527715 1298072 5687319 56c817 vmlinux


On x86_64 the segment override results in the following change for a simple
vm counter increment:

Before:

mov %gs:0x8,%rdx Get smp_processor_id
mov tableoffset,%rax Get table base
incq varoffset(%rax,%rdx,1) Perform the operation with a complex lookup
adding the var offset

An interrupt or a reschedule can move the execution thread to another
processor if interrupts or preemption are not disabled. Then the variable of
the wrong processor may be updated in a racy way.

After:

incq %gs:varoffset(%rip)

A single instruction that is safe against interrupts and against migration of
the execution thread. It will reliably operate on the current processor's data area.

Other platforms can also perform address relocation plus atomic ops on
a memory location. Exploiting the atomicity of instructions vs. interrupts
is therefore possible there as well and will reduce the cpu op processing overhead.

F.e. on IA64 we have a per cpu virtual mapping of the per cpu area. If
we add an offset to the per cpu area variable address then we can guarantee
that we always hit the per cpu area local to a processor.

Other platforms (SPARC?) have registers that can be used to form addresses.
If the cpu area address is in one of those then atomic per cpu modifications
can be generated for those platforms in the same way.

SLUB's best fastpath performance goes from 47 cycles to 41 cycles
through the use of the segment override.


--
c***@sgi.com
2007-11-20 01:11:33 UTC
ACPI uses NR_CPUS in various loops, and in some it accesses per cpu
data of processors that are not present (!) and that will never be present.
The pointers to per cpu data are typically not initialized for processors
that are not present, so we seem to be reading something from offset 0
in memory.

Make ACPI use nr_cpu_ids instead. That stops at the end of the possible
processors.

Convert the one loop over NR_CPUS to use the cpu_possible map instead.

Signed-off-by: Christoph Lameter <***@sgi.com>

---
drivers/acpi/processor_core.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux-2.6/drivers/acpi/processor_core.c
===================================================================
--- linux-2.6.orig/drivers/acpi/processor_core.c 2007-11-19 15:45:05.041140492 -0800
+++ linux-2.6/drivers/acpi/processor_core.c 2007-11-19 15:48:22.513639920 -0800
@@ -494,7 +494,7 @@ static int get_cpu_id(acpi_handle handle
if (apic_id == -1)
return apic_id;

- for (i = 0; i < NR_CPUS; ++i) {
+ for_each_possible_cpu(i) {
if (cpu_physical_id(i) == apic_id)
return i;
}
@@ -638,7 +638,7 @@ static int __cpuinit acpi_processor_star
return 0;
}

- BUG_ON((pr->id >= NR_CPUS) || (pr->id < 0));
+ BUG_ON((pr->id >= nr_cpu_ids) || (pr->id < 0));

/*
* Buggy BIOS check
@@ -771,7 +771,7 @@ static int acpi_processor_remove(struct

pr = acpi_driver_data(device);

- if (pr->id >= NR_CPUS) {
+ if (pr->id >= nr_cpu_ids) {
kfree(pr);
return 0;
}
@@ -842,7 +842,7 @@ int acpi_processor_device_add(acpi_handl
if (!pr)
return -ENODEV;

- if ((pr->id >= 0) && (pr->id < NR_CPUS)) {
+ if ((pr->id >= 0) && (pr->id < nr_cpu_ids)) {
kobject_uevent(&(*device)->dev.kobj, KOBJ_ONLINE);
}
return 0;
@@ -880,13 +880,13 @@ acpi_processor_hotplug_notify(acpi_handl
break;
}

- if (pr->id >= 0 && (pr->id < NR_CPUS)) {
+ if (pr->id >= 0 && (pr->id < nr_cpu_ids)) {
kobject_uevent(&device->dev.kobj, KOBJ_OFFLINE);
break;
}

result = acpi_processor_start(device);
- if ((!result) && ((pr->id >= 0) && (pr->id < NR_CPUS))) {
+ if ((!result) && ((pr->id >= 0) && (pr->id < nr_cpu_ids))) {
kobject_uevent(&device->dev.kobj, KOBJ_ONLINE);
} else {
printk(KERN_ERR PREFIX "Device [%s] failed to start\n",
@@ -909,7 +909,7 @@ acpi_processor_hotplug_notify(acpi_handl
return;
}

- if ((pr->id < NR_CPUS) && (cpu_present(pr->id)))
+ if ((pr->id < nr_cpu_ids) && (cpu_present(pr->id)))
kobject_uevent(&device->dev.kobj, KOBJ_OFFLINE);
break;
default:

--
Mathieu Desnoyers
2007-11-20 12:47:54 UTC
Post by c***@sgi.com
ACPI uses NR_CPUS in various loops and in some it accesses per cpu
data of processors that are not present(!) and that will never be present.
The pointers to per cpu data are typically not initialized for processors
that are not present. So we seem to be reading something here from offset 0
in memory.
Make ACPI use nr_cpu_ids instead. That stops at the end of the possible
processors.
Convert one loop to NR_CPUS to use the cpu_possible map instead.
I'm just wondering how broken this is. Is there any assumption in this code
that there are no holes in the online cpu map?

We can very well have :

0 off
1 on
2 on
3 on
...
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:16:40 UTC
Post by Mathieu Desnoyers
Post by c***@sgi.com
Convert one loop to NR_CPUS to use the cpu_possible map instead.
I'm just wondering how broken this is. Is there any assumption that
there is no holes in the online cpu map in this code ?
Yeah, I saw only one loop in there, so I think this covers it. The
debug code that I added to the cpu alloc patches no longer throws any
warnings or bugs. So I am sure that this does not occur on the
configurations where I tested it. If someone is running with the latest cpu
area patches then they will get BUG()s for accessing an impossible
processor's cpu area and warnings for accessing an offline processor's cpu
area.
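
The check in question, roughly as it appears in the cpu allocator patch
later in this series (CONFIG_DEBUG_VM only):

	if (system_state == SYSTEM_RUNNING) {
		/* impossible processor: hard error */
		BUG_ON(!cpu_isset(cpu, cpu_possible_map));
		/* offline processor: warn only */
		WARN_ON(!cpu_isset(cpu, cpu_online_map));
	}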
Andi Kleen
2007-11-20 15:29:14 UTC
Post by c***@sgi.com
ACPI uses NR_CPUS in various loops and in some it accesses per cpu
data of processors that are not present(!) and that will never be present.
The pointers to per cpu data are typically not initialized for processors
that are not present. So we seem to be reading something here from offset 0
in memory.
Make ACPI use nr_cpu_ids instead. That stops at the end of the possible
processors.
This is a needed bug fix even without your patches, since
the per cpu data of non-possible CPUs is already uninitialized
in the current code. It might not be unmapped, but accessing
it is still not healthy.

I would suggest separating that one out and fast-tracking it, possibly even for
2.6.24.

[cc'ed Len]

-Andi
Christoph Lameter
2007-11-20 20:18:53 UTC
Post by Andi Kleen
I would suggest to separate that one and fast-track it, possibly even for
2.6.24
I already did. Andrew has the patch in his tree.
c***@sgi.com
2007-11-20 01:11:34 UTC
The core portion of the cpu allocator.

The per cpu allocator allows dynamic allocation of memory on all
processors simultaneously. A bitmap is used to track used areas.
The allocator implements tight packing to reduce the cache footprint
and increase speed, since cacheline contention is typically not a concern
for memory mainly used by a single cpu. Small objects will fill up gaps
left by larger allocations that required alignment.

This is a limited version of the cpu allocator that only performs a
static allocation of a single page for each processor. This is enough
for the use of the cpu allocator in the slab and page allocators for most
of the common configurations. The configuration will be useful for
embedded systems to reduce memory requirements. However, there is a hard limit
on the size of the per cpu structures, so the default configuration of an
order 0 allocation can only support up to 150 slab caches (most systems that
I have use about 70) and probably not more than 16 or so NUMA nodes. The size of the
statically configured area can be changed via make menuconfig etc.

The cpu allocator virtualization patch is needed in order to support dynamically
extendable per cpu areas.
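
For orientation, a minimal usage sketch of the interface introduced below
(the structure and function names here are made up for illustration):

	#include <linux/percpu.h>

	struct my_stat {
		unsigned long events;
	};

	static struct my_stat *stats;	/* cpu 0 based pointer */

	static int my_init(void)
	{
		stats = CPU_ALLOC(struct my_stat, GFP_KERNEL | __GFP_ZERO);
		return stats ? 0 : -ENOMEM;
	}

	static void my_account(void)
	{
		/* caller must keep the cpu stable, e.g. via preempt_disable() */
		THIS_CPU(stats)->events++;
	}

	static unsigned long my_total(void)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += CPU_PTR(stats, cpu)->events;
		return sum;
	}

	static void my_exit(void)
	{
		CPU_FREE(stats);
	}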

V1->V2:
- Split off the dynamically extendable cpu area feature to make it clear that it exists.
- Remove useless variables.
- Add boot_cpu_alloc for boot time cpu area reservations (allows the folding in of
per cpu areas and other arch specific per cpu stuff during boot).

Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/linux/percpu.h | 78 +++++++++++++++++++
include/linux/vmstat.h | 2
mm/Kconfig | 7 +
mm/Makefile | 2
mm/cpu_alloc.c | 192 +++++++++++++++++++++++++++++++++++++++++++++++++
mm/vmstat.c | 1
6 files changed, 280 insertions(+), 2 deletions(-)
create mode 100644 include/linux/cpu_alloc.h
create mode 100644 mm/cpu_alloc.c

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h 2007-11-18 22:07:35.588274285 -0800
+++ linux-2.6/include/linux/vmstat.h 2007-11-18 22:07:49.864273686 -0800
@@ -36,7 +36,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
- PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ PAGEOUTRUN, ALLOCSTALL, PGROTATED, CPU_BYTES,
NR_VM_EVENT_ITEMS
};

Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2007-11-18 22:07:35.600273725 -0800
+++ linux-2.6/mm/Kconfig 2007-11-18 22:13:51.405773802 -0800
@@ -194,3 +194,10 @@ config NR_QUICK
config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
+
+config CPU_AREA_ORDER
+ int "Maximum size (order) of CPU area"
+ default "3"
+ help
+ Sets the maximum amount of memory that can be allocated via cpu_alloc
+ The size is set in page order, so 0 = PAGE_SIZE, 1 = PAGE_SIZE << 1 etc.
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2007-11-18 22:07:35.608273792 -0800
+++ linux-2.6/mm/Makefile 2007-11-18 22:13:44.924523941 -0800
@@ -11,7 +11,7 @@ obj-y := bootmem.o filemap.o mempool.o
page_alloc.o page-writeback.o pdflush.o \
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
- page_isolation.o $(mmu-y)
+ page_isolation.o cpu_alloc.o $(mmu-y)

obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/cpu_alloc.c 2007-11-18 22:15:04.453743317 -0800
@@ -0,0 +1,192 @@
+/*
+ * Cpu allocator - Manage objects allocated for each processor
+ *
+ * (C) 2007 SGI, Christoph Lameter <***@sgi.com>
+ * Basic implementation with allocation and free from a dedicated per
+ * cpu area.
+ *
+ * The per cpu allocator allows dynamic allocation of memory on all
+ * processors simultaneously. A bitmap is used to track used areas.
+ * The allocator implements tight packing to reduce the cache footprint
+ * and increase speed since cacheline contention is typically not a concern
+ * for memory mainly used by a single cpu. Small objects will fill up gaps
+ * left by larger allocations that required alignments.
+ */
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/bitmap.h>
+
+/*
+ * Basic allocation unit. A bit map is created to track the use of each
+ * UNIT_SIZE element in the cpu area.
+ */
+
+#define UNIT_SIZE sizeof(int)
+#define UNITS (ALLOC_SIZE / UNIT_SIZE)
+
+/*
+ * How many units are needed for an object of a given size
+ */
+static int size_to_units(unsigned long size)
+{
+ return DIV_ROUND_UP(size, UNIT_SIZE);
+}
+
+/*
+ * Lock to protect the bitmap and the meta data for the cpu allocator.
+ */
+static DEFINE_SPINLOCK(cpu_alloc_map_lock);
+static unsigned long units_reserved; /* Units reserved by boot allocations */
+
+/*
+ * Static configuration. The cpu areas are of a fixed size and
+ * cannot be extended. Such configurations are mainly useful on
+ * machines that do not have MMU support. Note that we have to use
+ * bss space for the static declarations. The combination of a large number
+ * of processors and a large cpu area may cause problems with the size
+ * of the bss segment.
+ */
+#define ALLOC_SIZE (1UL << (CONFIG_CPU_AREA_ORDER + PAGE_SHIFT))
+
+char cpu_area[NR_CPUS * ALLOC_SIZE];
+static DECLARE_BITMAP(cpu_alloc_map, UNITS);
+
+void * __init boot_cpu_alloc(unsigned long size)
+{
+ unsigned long x = units_reserved;
+
+ units_reserved += size_to_units(size);
+ BUG_ON(units_reserved > UNITS);
+ return (void *)(x * UNIT_SIZE);
+}
+
+static int first_free; /* First known free unit */
+EXPORT_SYMBOL(cpu_area);
+
+/*
+ * Mark an object as used in the cpu_alloc_map
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void set_map(int start, int length)
+{
+ while (length-- > 0)
+ __set_bit(start++, cpu_alloc_map);
+}
+
+/*
+ * Mark an area as freed.
+ *
+ * Must hold cpu_alloc_map_lock
+ */
+static void clear_map(int start, int length)
+{
+ while (length-- > 0)
+ __clear_bit(start++, cpu_alloc_map);
+}
+
+/*
+ * Allocate an object of a certain size
+ *
+ * Returns a special pointer that can be used with CPU_PTR to find the
+ * address of the object for a certain cpu.
+ */
+void *cpu_alloc(unsigned long size, gfp_t gfpflags, unsigned long align)
+{
+ unsigned long start;
+ int units = size_to_units(size);
+ void *ptr;
+ int first;
+ unsigned long flags;
+
+ BUG_ON(gfpflags & ~(GFP_RECLAIM_MASK | __GFP_ZERO));
+
+ spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+
+ if (!units_reserved)
+ /*
+ * No boot time allocations. Must have at least one
+ * reserved unit to avoid returning a NULL pointer
+ */
+ units_reserved = 1;
+
+ first = 1;
+ start = first_free;
+
+ for ( ; ; ) {
+
+ start = find_next_zero_bit(cpu_alloc_map, ALLOC_SIZE, start);
+ if (start >= UNITS - units_reserved)
+ goto out_of_memory;
+
+ if (first)
+ first_free = start;
+
+ /*
+ * Check alignment and that there is enough space after
+ * the starting unit.
+ */
+ if ((start + units_reserved) % (align / UNIT_SIZE) == 0 &&
+ find_next_bit(cpu_alloc_map, ALLOC_SIZE, start + 1)
+ >= start + units)
+ break;
+ start++;
+ first = 0;
+ }
+
+ if (first)
+ first_free = start + units;
+
+ if (start + units > UNITS - units_reserved)
+ goto out_of_memory;
+
+ set_map(start, units);
+ __count_vm_events(CPU_BYTES, units * UNIT_SIZE);
+
+ spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+
+ ptr = (void *)((start + units_reserved) * UNIT_SIZE);
+
+ if (gfpflags & __GFP_ZERO) {
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ memset(CPU_PTR(ptr, cpu), 0, size);
+ }
+
+ return ptr;
+
+out_of_memory:
+ spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+ return NULL;
+}
+EXPORT_SYMBOL(cpu_alloc);
+
+/*
+ * Free an object. The pointer must be a cpu pointer allocated
+ * via cpu_alloc.
+ */
+void cpu_free(void *start, unsigned long size)
+{
+ int units = size_to_units(size);
+ int index;
+ unsigned long p = (unsigned long)start;
+ unsigned long flags;
+
+ BUG_ON(p < units_reserved * UNIT_SIZE);
+ index = p / UNIT_SIZE - units_reserved;
+ BUG_ON(!test_bit(index, cpu_alloc_map) ||
+ index >= UNITS - units_reserved);
+
+ spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+
+ clear_map(index, units);
+ __count_vm_events(CPU_BYTES, -units * UNIT_SIZE);
+ if (index < first_free)
+ first_free = index;
+
+ spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+}
+EXPORT_SYMBOL(cpu_free);
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2007-11-18 22:07:49.784273594 -0800
+++ linux-2.6/mm/vmstat.c 2007-11-18 22:13:51.538023840 -0800
@@ -639,6 +639,7 @@ static const char * const vmstat_text[]
"allocstall",

"pgrotated",
+ "cpu_bytes",
#endif
};

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-11-18 22:07:49.729023738 -0800
+++ linux-2.6/include/linux/percpu.h 2007-11-18 22:13:51.773274119 -0800
@@ -112,4 +112,82 @@ static inline void percpu_free(void *__p
#define free_percpu(ptr) percpu_free((ptr))
#define per_cpu_ptr(ptr, cpu) percpu_ptr((ptr), (cpu))

+
+/*
+ * cpu allocator definitions
+ *
+ * The cpu allocator allows allocating an array of objects on all processors.
+ * A single pointer can then be used to access the instance of the object
+ * on a particular processor.
+ *
+ * Cpu objects are typically small. The allocator packs them tightly
+ * to increase the chance on each access that a per cpu object is already
+ * cached. Alignments may be specified but the intent is to align the data
+ * properly due to cpu alignment constraints and not to avoid cacheline
+ * contention. Any holes left by aligning objects are filled up with smaller
+ * objects that are allocated later.
+ *
+ * Cpu data can be allocated using CPU_ALLOC. The resulting pointer is
+ * pointing to the instance of the variable on cpu 0. It is generally an
+ * error to use the pointer directly unless we are running on cpu 0. So
+ * the use is valid during boot for example.
+ *
+ * The GFP flags have their usual function: __GFP_ZERO zeroes the object
+ * and other flags may be used to control reclaim behavior if the cpu
+ * areas have to be extended. However, zones cannot be selected nor
+ * can locality constraint flags be used.
+ *
+ * CPU_PTR() may be used to calculate the pointer for a specific processor.
+ * CPU_PTR is highly scalable since it simply adds the shifted value of
+ * smp_processor_id() to the base.
+ *
+ * Note: Synchronization is up to caller. If preemption is disabled then
+ * it is generally safe to access cpu variables (unless they are also
+ * handled from an interrupt context).
+ */
+
+#define SHIFT_PTR(__p, __offset) ((__typeof__(__p))((void *)(__p) \
+ + (__offset)))
+extern char cpu_area[];
+
+static inline unsigned long __cpu_offset(unsigned long cpu)
+{
+ int shift = CONFIG_CPU_AREA_ORDER + PAGE_SHIFT;
+
+ return (unsigned long)cpu_area + (cpu << shift);
+}
+
+static inline unsigned long cpu_offset(unsigned long cpu)
+{
+#ifdef CONFIG_DEBUG_VM
+ if (system_state == SYSTEM_RUNNING) {
+ BUG_ON(!cpu_isset(cpu, cpu_possible_map));
+ WARN_ON(!cpu_isset(cpu, cpu_online_map));
+ }
+#endif
+ return __cpu_offset(cpu);
+}
+
+#define CPU_PTR(__p, __cpu) SHIFT_PTR(__p, cpu_offset(__cpu))
+#define __CPU_PTR(__p, __cpu) SHIFT_PTR(__p, __cpu_offset(__cpu))
+
+#define CPU_ALLOC(type, flags) cpu_alloc(sizeof(type), flags, \
+ __alignof__(type))
+#define CPU_FREE(pointer) cpu_free(pointer, sizeof(*(pointer)))
+
+#define THIS_CPU(__p) CPU_PTR(__p, smp_processor_id())
+#define __THIS_CPU(__p) CPU_PTR(__p, raw_smp_processor_id())
+
+/*
+ * Raw calls
+ */
+void *cpu_alloc(unsigned long size, gfp_t gfp, unsigned long align);
+void cpu_free(void *cpu_pointer, unsigned long size);
+
+/*
+ * Early boot allocator for per_cpu variables and special per cpu areas.
+ * Allocations are not tracked and cannot be freed.
+ */
+void *boot_cpu_alloc(unsigned long size);
+
#endif /* __LINUX_PERCPU_H */

--
c***@sgi.com
2007-11-20 01:11:36 UTC
Using cpu alloc removes the need for the per cpu arrays in the kmem_cache struct.
These could get quite big if we have to support systems with up to thousands of cpus.
The use of alloc_percpu means that:

1. The size of kmem_cache for SMP configuration shrinks since we will only
need 1 pointer instead of NR_CPUS. The same pointer can be used by all
processors. Reduces cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
system meaning less memory overhead for configurations that may potentially
support up to 1k NUMA nodes.

3. We can remove the fiddling with allocating and releasing kmem_cache_cpu
structures when bringing up and shutting down cpus. The allocpercpu
logic will do it all for us. Removes some portions of the cpu hotplug
functionality.

4. Fastpath performance increases by another 20% vs. the earlier improvements.
Instead of having fastpath with 45-50 cycles it is now possible to get
below 40.

Remove the CONFIG_FAST_CMPXCHG_LOCAL version since this patch makes SLUB use the
new CPU ops instead.
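
For readers skimming the diff below, a simplified sketch of the resulting
allocation fastpath (illustrative only; the real code also handles node
matching, the NULL case and the debug hooks):

	static void *fastpath_sketch(struct kmem_cache *s, gfp_t gfpflags,
					int node, void *addr)
	{
		struct kmem_cache_cpu *c = s->cpu_slab;	/* cpu 0 based pointer */
		void **object;

		do {
			object = __CPU_READ(c->freelist);
			if (unlikely(is_end(object)))
				/* empty per cpu freelist: take the slow path */
				return __slab_alloc(s, gfpflags, node, addr);
			/*
			 * The cmpxchg only succeeds if we are still on the
			 * same cpu and the freelist did not change under us;
			 * otherwise we simply retry.
			 */
		} while (CPU_CMPXCHG(c->freelist, object,
				object[__CPU_READ(c->offset)]) != object);

		return object;
	}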

Signed-off-by: Christoph Lameter <***@sgi.com>
---
arch/x86/Kconfig | 4
include/linux/slub_def.h | 6 -
mm/slub.c | 229 ++++++++++-------------------------------------
3 files changed, 52 insertions(+), 187 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-11-19 15:45:08.270140279 -0800
+++ linux-2.6/include/linux/slub_def.h 2007-11-19 15:53:25.869890760 -0800
@@ -34,6 +34,7 @@ struct kmem_cache_node {
* Slab cache management.
*/
struct kmem_cache {
+ struct kmem_cache_cpu *cpu_slab;
/* Used for retriving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
@@ -63,11 +64,6 @@ struct kmem_cache {
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
-#ifdef CONFIG_SMP
- struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
- struct kmem_cache_cpu cpu_slab;
-#endif
};

/*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-19 15:45:08.278140252 -0800
+++ linux-2.6/mm/slub.c 2007-11-19 15:54:10.513640214 -0800
@@ -239,15 +239,6 @@ static inline struct kmem_cache_node *ge
#endif
}

-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
- return s->cpu_slab[cpu];
-#else
- return &s->cpu_slab;
-#endif
-}
-
/*
* The end pointer in a slab is special. It points to the first object in the
* slab but has bit 0 set to mark it.
@@ -1472,7 +1463,7 @@ static inline void flush_slab(struct kme
*/
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);

if (likely(c && c->page))
flush_slab(s, c);
@@ -1487,15 +1478,7 @@ static void flush_cpu_slab(void *d)

static void flush_all(struct kmem_cache *s)
{
-#ifdef CONFIG_SMP
on_each_cpu(flush_cpu_slab, s, 1, 1);
-#else
- unsigned long flags;
-
- local_irq_save(flags);
- flush_cpu_slab(s);
- local_irq_restore(flags);
-#endif
}

/*
@@ -1511,6 +1494,15 @@ static inline int node_match(struct kmem
return 1;
}

+static inline int cpu_node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+ if (node != -1 && __CPU_READ(c->node) != node)
+ return 0;
+#endif
+ return 1;
+}
+
/* Allocate a new slab and make it the current cpu slab */
static noinline unsigned long get_new_slab(struct kmem_cache *s,
struct kmem_cache_cpu **pc, gfp_t gfpflags, int node)
@@ -1529,7 +1521,7 @@ static noinline unsigned long get_new_sl
if (!page)
return 0;

- *pc = c = get_cpu_slab(s, smp_processor_id());
+ *pc = c = THIS_CPU(s->cpu_slab);
if (c->page)
flush_slab(s, c);
c->page = page;
@@ -1554,16 +1546,18 @@ static noinline unsigned long get_new_sl
* we need to allocate a new slab. This is slowest path since we may sleep.
*/
static void *__slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node, void *addr, struct kmem_cache_cpu *c)
+ gfp_t gfpflags, int node, void *addr)
{
void **object;
unsigned long state;
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+ struct kmem_cache_cpu *c;
+#ifdef CONFIG_FAST_CPU_OPS
unsigned long flags;

local_irq_save(flags);
preempt_enable_no_resched();
#endif
+ c = THIS_CPU(s->cpu_slab);
if (likely(c->page)) {
state = slab_lock(c->page);

@@ -1597,7 +1591,7 @@ load_freelist:
unlock_out:
slab_unlock(c->page, state);
out:
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
preempt_disable();
local_irq_restore(flags);
#endif
@@ -1640,26 +1634,24 @@ static void __always_inline *slab_alloc(
void **object;
struct kmem_cache_cpu *c;

-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
- c = get_cpu_slab(s, get_cpu());
+#ifdef CONFIG_FAST_CPU_OPS
+ c = s->cpu_slab;
do {
- object = c->freelist;
- if (unlikely(is_end(object) || !node_match(c, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c);
- if (unlikely(!object)) {
- put_cpu();
+ object = __CPU_READ(c->freelist);
+ if (unlikely(is_end(object) ||
+ !cpu_node_match(c, node))) {
+ object = __slab_alloc(s, gfpflags, node, addr);
+ if (unlikely(!object))
goto out;
- }
break;
}
- } while (cmpxchg_local(&c->freelist, object, object[c->offset])
- != object);
- put_cpu();
+ } while (CPU_CMPXCHG(c->freelist, object,
+ object[__CPU_READ(c->offset)]) != object);
#else
unsigned long flags;

local_irq_save(flags);
- c = get_cpu_slab(s, smp_processor_id());
+ c = THIS_CPU(s->cpu_slab);
if (unlikely((is_end(c->freelist)) || !node_match(c, node))) {

object = __slab_alloc(s, gfpflags, node, addr, c);
@@ -1709,7 +1701,7 @@ static void __slab_free(struct kmem_cach
void **object = (void *)x;
unsigned long state;

-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
unsigned long flags;

local_irq_save(flags);
@@ -1739,7 +1731,7 @@ checks_ok:

out_unlock:
slab_unlock(page, state);
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
local_irq_restore(flags);
#endif
return;
@@ -1752,7 +1744,7 @@ slab_empty:
remove_partial(s, page);

slab_unlock(page, state);
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
local_irq_restore(flags);
#endif
discard_slab(s, page);
@@ -1781,13 +1773,13 @@ static void __always_inline slab_free(st
void **object = (void *)x;
struct kmem_cache_cpu *c;

-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
void **freelist;

- c = get_cpu_slab(s, get_cpu());
+ c = s->cpu_slab;
debug_check_no_locks_freed(object, s->objsize);
do {
- freelist = c->freelist;
+ freelist = __CPU_READ(c->freelist);
barrier();
/*
* If the compiler would reorder the retrieval of c->page to
@@ -1800,19 +1792,19 @@ static void __always_inline slab_free(st
* then any change of cpu_slab will cause the cmpxchg to fail
* since the freelist pointers are unique per slab.
*/
- if (unlikely(page != c->page || c->node < 0)) {
- __slab_free(s, page, x, addr, c->offset);
+ if (unlikely(page != __CPU_READ(c->page) ||
+ __CPU_READ(c->node) < 0)) {
+ __slab_free(s, page, x, addr, __CPU_READ(c->offset));
break;
}
- object[c->offset] = freelist;
- } while (cmpxchg_local(&c->freelist, freelist, object) != freelist);
- put_cpu();
+ object[__CPU_READ(c->offset)] = freelist;
+ } while (CPU_CMPXCHG(c->freelist, freelist, object) != freelist);
#else
unsigned long flags;

local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
- c = get_cpu_slab(s, smp_processor_id());
+ c = THIS_CPU(s->cpu_slab);
if (likely(page == c->page && c->node >= 0)) {
object[c->offset] = c->freelist;
c->freelist = object;
@@ -2015,130 +2007,19 @@ static void init_kmem_cache_node(struct
#endif
}

-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
- kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
- int cpu, gfp_t flags)
-{
- struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
- if (c)
- per_cpu(kmem_cache_cpu_free, cpu) =
- (void *)c->freelist;
- else {
- /* Table overflow: So allocate ourselves */
- c = kmalloc_node(
- ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
- flags, cpu_to_node(cpu));
- if (!c)
- return NULL;
- }
-
- init_kmem_cache_cpu(s, c);
- return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
- if (c < per_cpu(kmem_cache_cpu, cpu) ||
- c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
- kfree(c);
- return;
- }
- c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
- per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
- int cpu;
-
- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
- if (c) {
- s->cpu_slab[cpu] = NULL;
- free_kmem_cache_cpu(c, cpu);
- }
- }
-}
-
static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
{
int cpu;

- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, flags);

- if (c)
- continue;
-
- c = alloc_kmem_cache_cpu(s, cpu, flags);
- if (!c) {
- free_kmem_cache_cpus(s);
- return 0;
- }
- s->cpu_slab[cpu] = c;
- }
- return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
- int i;
-
- if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
- return;
-
- for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
- free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
- cpu_set(cpu, kmem_cach_cpu_free_init_once);
-}
-
-static void __init init_alloc_cpu(void)
-{
- int cpu;
+ if (!s->cpu_slab)
+ return 0;

for_each_online_cpu(cpu)
- init_alloc_cpu_cpu(cpu);
- }
-
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
-
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
- init_kmem_cache_cpu(s, &s->cpu_slab);
+ init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
return 1;
}
-#endif

#ifdef CONFIG_NUMA
/*
@@ -2452,9 +2333,8 @@ static inline int kmem_cache_close(struc
int node;

flush_all(s);
-
+ CPU_FREE(s->cpu_slab);
/* Attempt to free all objects */
- free_kmem_cache_cpus(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);

@@ -2958,8 +2838,6 @@ void __init kmem_cache_init(void)
int i;
int caches = 0;

- init_alloc_cpu();
-
#ifdef CONFIG_NUMA
/*
* Must first have the slab cache available for the allocations of the
@@ -3019,11 +2897,12 @@ void __init kmem_cache_init(void)
for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++)
kmalloc_caches[i]. name =
kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
-
#ifdef CONFIG_SMP
register_cpu_notifier(&slab_notifier);
- kmem_size = offsetof(struct kmem_cache, cpu_slab) +
- nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+ kmem_size = offsetof(struct kmem_cache, node) +
+ nr_node_ids * sizeof(struct kmem_cache_node *);
#else
kmem_size = sizeof(struct kmem_cache);
#endif
@@ -3120,7 +2999,7 @@ struct kmem_cache *kmem_cache_create(con
* per cpu structures
*/
for_each_online_cpu(cpu)
- get_cpu_slab(s, cpu)->objsize = s->objsize;
+ CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
up_write(&slub_lock);
if (sysfs_slab_alias(s, name))
@@ -3165,11 +3044,9 @@ static int __cpuinit slab_cpuup_callback
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- init_alloc_cpu_cpu(cpu);
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list)
- s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
- GFP_KERNEL);
+ init_kmem_cache_cpu(s, __CPU_PTR(s->cpu_slab, cpu));
up_read(&slub_lock);
break;

@@ -3179,13 +3056,9 @@ static int __cpuinit slab_cpuup_callback
case CPU_DEAD_FROZEN:
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
local_irq_save(flags);
__flush_cpu_slab(s, cpu);
local_irq_restore(flags);
- free_kmem_cache_cpu(c, cpu);
- s->cpu_slab[cpu] = NULL;
}
up_read(&slub_lock);
break;
@@ -3657,7 +3530,7 @@ static unsigned long slab_objects(struct
for_each_possible_cpu(cpu) {
struct page *page;
int node;
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);

if (!c)
continue;
@@ -3724,7 +3597,7 @@ static int any_slab_objects(struct kmem_
int cpu;

for_each_possible_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);

if (c && c->page)
return 1;
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig 2007-11-19 15:53:55.529390403 -0800
+++ linux-2.6/arch/x86/Kconfig 2007-11-19 15:54:10.509139813 -0800
@@ -112,10 +112,6 @@ config GENERIC_TIME_VSYSCALL
bool
default X86_64

-config FAST_CMPXCHG_LOCAL
- bool
- default y
-
config ZONE_DMA32
bool
default X86_64

--
Mathieu Desnoyers
2007-11-20 12:42:03 UTC
Post by c***@sgi.com
Using cpu alloc removes the needs for the per cpu arrays in the kmem_cache struct.
These could get quite big if we have to support system of up to thousands of cpus.
1. The size of kmem_cache for SMP configuration shrinks since we will only
need 1 pointer instead of NR_CPUS. The same pointer can be used by all
processors. Reduces cache footprint of the allocator.
2. We can dynamically size kmem_cache according to the actual nodes in the
system meaning less memory overhead for configurations that may potentially
support up to 1k NUMA nodes.
3. We can remove the diddle widdle with allocating and releasing kmem_cache_cpu
structures when bringing up and shuttting down cpus. The allocpercpu
logic will do it all for us. Removes some portions of the cpu hotplug
functionality.
4. Fastpath performance increases by another 20% vs. the earlier improvements.
Instead of having fastpath with 45-50 cycles it is now possible to get
below 40.
Remove the CONFIG_FAST_CMPXCHG version since this patch makes SLUB use CPU ops
instead.
---
arch/x86/Kconfig | 4
include/linux/slub_def.h | 6 -
mm/slub.c | 229 ++++++++++-------------------------------------
3 files changed, 52 insertions(+), 187 deletions(-)
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-11-19 15:45:08.270140279 -0800
+++ linux-2.6/include/linux/slub_def.h 2007-11-19 15:53:25.869890760 -0800
@@ -34,6 +34,7 @@ struct kmem_cache_node {
* Slab cache management.
*/
struct kmem_cache {
+ struct kmem_cache_cpu *cpu_slab;
/* Used for retriving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
@@ -63,11 +64,6 @@ struct kmem_cache {
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
#endif
-#ifdef CONFIG_SMP
- struct kmem_cache_cpu *cpu_slab[NR_CPUS];
-#else
- struct kmem_cache_cpu cpu_slab;
-#endif
};
/*
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-19 15:45:08.278140252 -0800
+++ linux-2.6/mm/slub.c 2007-11-19 15:54:10.513640214 -0800
@@ -239,15 +239,6 @@ static inline struct kmem_cache_node *ge
#endif
}
-static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu)
-{
-#ifdef CONFIG_SMP
- return s->cpu_slab[cpu];
-#else
- return &s->cpu_slab;
-#endif
-}
-
/*
* The end pointer in a slab is special. It points to the first object in the
* slab but has bit 0 set to mark it.
@@ -1472,7 +1463,7 @@ static inline void flush_slab(struct kme
*/
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
if (likely(c && c->page))
flush_slab(s, c);
@@ -1487,15 +1478,7 @@ static void flush_cpu_slab(void *d)
static void flush_all(struct kmem_cache *s)
{
-#ifdef CONFIG_SMP
on_each_cpu(flush_cpu_slab, s, 1, 1);
-#else
- unsigned long flags;
-
- local_irq_save(flags);
- flush_cpu_slab(s);
- local_irq_restore(flags);
-#endif
}
Normally :

You can't use on_each_cpu if interrupts are already disabled. Therefore,
the implementation using "local_irq_disable/enable" in smp.h for the UP
case is semantically correct and there is no need to use a save/restore.
So just using on_each_cpu should be enough here.

I also wonder about flush_cpu_slab execution on _other_ cpus. I am not
convinced interrupts are disabled when it executes.. am I missing
something?
Post by c***@sgi.com
/*
@@ -1511,6 +1494,15 @@ static inline int node_match(struct kmem
return 1;
}
+static inline int cpu_node_match(struct kmem_cache_cpu *c, int node)
+{
+#ifdef CONFIG_NUMA
+ if (node != -1 && __CPU_READ(c->node) != node)
+ return 0;
+#endif
+ return 1;
+}
+
/* Allocate a new slab and make it the current cpu slab */
static noinline unsigned long get_new_slab(struct kmem_cache *s,
struct kmem_cache_cpu **pc, gfp_t gfpflags, int node)
@@ -1529,7 +1521,7 @@ static noinline unsigned long get_new_sl
if (!page)
return 0;
- *pc = c = get_cpu_slab(s, smp_processor_id());
+ *pc = c = THIS_CPU(s->cpu_slab);
I think the preferred coding style is :

c = THIS_CPU(s->cpu_slab);
*pc = c;
Post by c***@sgi.com
if (c->page)
flush_slab(s, c);
c->page = page;
@@ -1554,16 +1546,18 @@ static noinline unsigned long get_new_sl
* we need to allocate a new slab. This is slowest path since we may sleep.
*/
static void *__slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node, void *addr, struct kmem_cache_cpu *c)
+ gfp_t gfpflags, int node, void *addr)
{
void **object;
unsigned long state;
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+ struct kmem_cache_cpu *c;
+#ifdef CONFIG_FAST_CPU_OPS
unsigned long flags;
local_irq_save(flags);
preempt_enable_no_resched();
#endif
+ c = THIS_CPU(s->cpu_slab);
if (likely(c->page)) {
state = slab_lock(c->page);
slab_unlock(c->page, state);
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
preempt_disable();
local_irq_restore(flags);
#endif
@@ -1640,26 +1634,24 @@ static void __always_inline *slab_alloc(
void **object;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
- c = get_cpu_slab(s, get_cpu());
+#ifdef CONFIG_FAST_CPU_OPS
I wonder.. are there some architectures that would provide fast
cmpxchg_local but not fast cpu ops ?
Post by c***@sgi.com
+ c = s->cpu_slab;
do {
- object = c->freelist;
- if (unlikely(is_end(object) || !node_match(c, node))) {
- object = __slab_alloc(s, gfpflags, node, addr, c);
- if (unlikely(!object)) {
- put_cpu();
+ object = __CPU_READ(c->freelist);
+ if (unlikely(is_end(object) ||
+ !cpu_node_match(c, node))) {
+ object = __slab_alloc(s, gfpflags, node, addr);
+ if (unlikely(!object))
goto out;
- }
break;
}
- } while (cmpxchg_local(&c->freelist, object, object[c->offset])
- != object);
- put_cpu();
+ } while (CPU_CMPXCHG(c->freelist, object,
+ object[__CPU_READ(c->offset)]) != object);
Hrm. What happens here if we call __slab_alloc, get a valid object, then
have a CPU_CMPXCHG that fails and restart the loop.. is this case taken
care of, or do we end up with an unreferenced object? Maybe there is
some logic in the freelist that takes care of it?

Also, we have to be aware that we can now change CPU between the

__CPU_READ and the CPU_CMPXCHG. (also : should it be a __CPU_CMPXCHG ?)

But since "object" contains information specific to the local CPU, the
cmpxchg should fail if we are migrated and everything should be ok.

Hrm, actually, the

c = s->cpu_slab;

should probably be after the object = __CPU_READ(c->freelist); ?

The cpu_read acts as a safeguard checking that we do not change CPU
between the read and the cmpxchg. If we are preempted between the "c"
read and the cpu_read, we could do a !cpu_node_match(c, node) check that
would apply to the wrong cpu.
Post by c***@sgi.com
#else
unsigned long flags;
local_irq_save(flags);
- c = get_cpu_slab(s, smp_processor_id());
+ c = THIS_CPU(s->cpu_slab);
if (unlikely((is_end(c->freelist)) || !node_match(c, node))) {
object = __slab_alloc(s, gfpflags, node, addr, c);
@@ -1709,7 +1701,7 @@ static void __slab_free(struct kmem_cach
void **object = (void *)x;
unsigned long state;
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
unsigned long flags;
local_irq_save(flags);
slab_unlock(page, state);
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
local_irq_restore(flags);
#endif
return;
remove_partial(s, page);
slab_unlock(page, state);
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
local_irq_restore(flags);
#endif
discard_slab(s, page);
@@ -1781,13 +1773,13 @@ static void __always_inline slab_free(st
void **object = (void *)x;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#ifdef CONFIG_FAST_CPU_OPS
void **freelist;
- c = get_cpu_slab(s, get_cpu());
+ c = s->cpu_slab;
debug_check_no_locks_freed(object, s->objsize);
do {
- freelist = c->freelist;
+ freelist = __CPU_READ(c->freelist);
Same here, c = s->cpu_slab; should probably be read after.
Post by c***@sgi.com
barrier();
/*
* If the compiler would reorder the retrieval of c->page to
@@ -1800,19 +1792,19 @@ static void __always_inline slab_free(st
* then any change of cpu_slab will cause the cmpxchg to fail
* since the freelist pointers are unique per slab.
*/
- if (unlikely(page != c->page || c->node < 0)) {
- __slab_free(s, page, x, addr, c->offset);
+ if (unlikely(page != __CPU_READ(c->page) ||
+ __CPU_READ(c->node) < 0)) {
+ __slab_free(s, page, x, addr, __CPU_READ(c->offset));
And same question as above : what happens if we fail after executing the
__slab_free.. is it valid to do it twice ?
Post by c***@sgi.com
break;
}
- object[c->offset] = freelist;
- } while (cmpxchg_local(&c->freelist, freelist, object) != freelist);
- put_cpu();
+ object[__CPU_READ(c->offset)] = freelist;
+ } while (CPU_CMPXCHG(c->freelist, freelist, object) != freelist);
#else
unsigned long flags;
local_irq_save(flags);
debug_check_no_locks_freed(object, s->objsize);
- c = get_cpu_slab(s, smp_processor_id());
+ c = THIS_CPU(s->cpu_slab);
if (likely(page == c->page && c->node >= 0)) {
object[c->offset] = c->freelist;
c->freelist = object;
@@ -2015,130 +2007,19 @@ static void init_kmem_cache_node(struct
#endif
}
-#ifdef CONFIG_SMP
-/*
- * Per cpu array for per cpu structures.
- *
- * The per cpu array places all kmem_cache_cpu structures from one processor
- * close together meaning that it becomes possible that multiple per cpu
- * structures are contained in one cacheline. This may be particularly
- * beneficial for the kmalloc caches.
- *
- * A desktop system typically has around 60-80 slabs. With 100 here we are
- * likely able to get per cpu structures for all caches from the array defined
- * here. We must be able to cover all kmalloc caches during bootstrap.
- *
- * If the per cpu array is exhausted then fall back to kmalloc
- * of individual cachelines. No sharing is possible then.
- */
-#define NR_KMEM_CACHE_CPU 100
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu,
- kmem_cache_cpu)[NR_KMEM_CACHE_CPU];
-
-static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free);
-static cpumask_t kmem_cach_cpu_free_init_once = CPU_MASK_NONE;
-
-static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s,
- int cpu, gfp_t flags)
-{
- struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu);
-
- if (c)
- per_cpu(kmem_cache_cpu_free, cpu) =
- (void *)c->freelist;
- else {
- /* Table overflow: So allocate ourselves */
- c = kmalloc_node(
- ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()),
- flags, cpu_to_node(cpu));
- if (!c)
- return NULL;
- }
-
- init_kmem_cache_cpu(s, c);
- return c;
-}
-
-static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu)
-{
- if (c < per_cpu(kmem_cache_cpu, cpu) ||
- c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) {
- kfree(c);
- return;
- }
- c->freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu);
- per_cpu(kmem_cache_cpu_free, cpu) = c;
-}
-
-static void free_kmem_cache_cpus(struct kmem_cache *s)
-{
- int cpu;
-
- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
- if (c) {
- s->cpu_slab[cpu] = NULL;
- free_kmem_cache_cpu(c, cpu);
- }
- }
-}
-
static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
{
int cpu;
- for_each_online_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ s->cpu_slab = CPU_ALLOC(struct kmem_cache_cpu, flags);
- if (c)
- continue;
-
- c = alloc_kmem_cache_cpu(s, cpu, flags);
- if (!c) {
- free_kmem_cache_cpus(s);
- return 0;
- }
- s->cpu_slab[cpu] = c;
- }
- return 1;
-}
-
-/*
- * Initialize the per cpu array.
- */
-static void init_alloc_cpu_cpu(int cpu)
-{
- int i;
-
- if (cpu_isset(cpu, kmem_cach_cpu_free_init_once))
- return;
-
- for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--)
- free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], cpu);
-
- cpu_set(cpu, kmem_cach_cpu_free_init_once);
-}
-
-static void __init init_alloc_cpu(void)
-{
- int cpu;
+ if (!s->cpu_slab)
+ return 0;
for_each_online_cpu(cpu)
- init_alloc_cpu_cpu(cpu);
- }
-
-#else
-static inline void free_kmem_cache_cpus(struct kmem_cache *s) {}
-static inline void init_alloc_cpu(void) {}
-
-static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
-{
- init_kmem_cache_cpu(s, &s->cpu_slab);
+ init_kmem_cache_cpu(s, CPU_PTR(s->cpu_slab, cpu));
return 1;
}
-#endif
#ifdef CONFIG_NUMA
/*
@@ -2452,9 +2333,8 @@ static inline int kmem_cache_close(struc
int node;
flush_all(s);
-
+ CPU_FREE(s->cpu_slab);
/* Attempt to free all objects */
- free_kmem_cache_cpus(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
struct kmem_cache_node *n = get_node(s, node);
@@ -2958,8 +2838,6 @@ void __init kmem_cache_init(void)
int i;
int caches = 0;
- init_alloc_cpu();
-
#ifdef CONFIG_NUMA
/*
* Must first have the slab cache available for the allocations of the
@@ -3019,11 +2897,12 @@ void __init kmem_cache_init(void)
for (i = KMALLOC_SHIFT_LOW; i < PAGE_SHIFT; i++)
kmalloc_caches[i]. name =
kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);
-
#ifdef CONFIG_SMP
register_cpu_notifier(&slab_notifier);
- kmem_size = offsetof(struct kmem_cache, cpu_slab) +
- nr_cpu_ids * sizeof(struct kmem_cache_cpu *);
+#endif
+#ifdef CONFIG_NUMA
+ kmem_size = offsetof(struct kmem_cache, node) +
+ nr_node_ids * sizeof(struct kmem_cache_node *);
#else
kmem_size = sizeof(struct kmem_cache);
#endif
@@ -3120,7 +2999,7 @@ struct kmem_cache *kmem_cache_create(con
* per cpu structures
*/
for_each_online_cpu(cpu)
- get_cpu_slab(s, cpu)->objsize = s->objsize;
+ CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
up_write(&slub_lock);
if (sysfs_slab_alias(s, name))
@@ -3165,11 +3044,9 @@ static int __cpuinit slab_cpuup_callback
switch (action) {
- init_alloc_cpu_cpu(cpu);
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list)
- s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
- GFP_KERNEL);
+ init_kmem_cache_cpu(s, __CPU_PTR(s->cpu_slab, cpu));
up_read(&slub_lock);
break;
@@ -3179,13 +3056,9 @@ static int __cpuinit slab_cpuup_callback
down_read(&slub_lock);
list_for_each_entry(s, &slab_caches, list) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
-
local_irq_save(flags);
__flush_cpu_slab(s, cpu);
local_irq_restore(flags);
- free_kmem_cache_cpu(c, cpu);
- s->cpu_slab[cpu] = NULL;
}
up_read(&slub_lock);
break;
@@ -3657,7 +3530,7 @@ static unsigned long slab_objects(struct
for_each_possible_cpu(cpu) {
struct page *page;
int node;
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
if (!c)
continue;
@@ -3724,7 +3597,7 @@ static int any_slab_objects(struct kmem_
int cpu;
for_each_possible_cpu(cpu) {
- struct kmem_cache_cpu *c = get_cpu_slab(s, cpu);
+ struct kmem_cache_cpu *c = CPU_PTR(s->cpu_slab, cpu);
if (c && c->page)
return 1;
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig 2007-11-19 15:53:55.529390403 -0800
+++ linux-2.6/arch/x86/Kconfig 2007-11-19 15:54:10.509139813 -0800
@@ -112,10 +112,6 @@ config GENERIC_TIME_VSYSCALL
bool
default X86_64
-config FAST_CMPXCHG_LOCAL
- bool
- default y
-
config ZONE_DMA32
bool
default X86_64
--
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:44:56 UTC
Post by Mathieu Desnoyers
Post by c***@sgi.com
{
-#ifdef CONFIG_SMP
on_each_cpu(flush_cpu_slab, s, 1, 1);
-#else
- unsigned long flags;
-
- local_irq_save(flags);
- flush_cpu_slab(s);
- local_irq_restore(flags);
-#endif
}
You can't use on_each_cpu if interrupts are already disabled. Therefore,
the implementation using "local_irq_disable/enable" in smp.h for the UP
case is semantically correct and there is no need to use a save/restore.
So just using on_each_cpu should be enough here.
The on_each_cpu is only used in the smp case.
Post by Mathieu Desnoyers
I also wonder about flush_cpu_slab execution on _other_ cpus. I am not
convinced interrupts are disabled when it executes.. have I missing
something ?
flush_cpu_slab only flushes the local cpu slab. keventd threads are pinned
to each processor so it should be safe.
Post by Mathieu Desnoyers
Post by c***@sgi.com
-#ifdef CONFIG_FAST_CMPXCHG_LOCAL
- c = get_cpu_slab(s, get_cpu());
+#ifdef CONFIG_FAST_CPU_OPS
I wonder.. are there some architectures that would provide fast
cmpxchg_local but not fast cpu ops ?
If you have a fast cmpxchg_local then you can use that to implement fast
cpu ops.
Post by Mathieu Desnoyers
Post by c***@sgi.com
+ } while (CPU_CMPXCHG(c->freelist, object,
+ object[__CPU_READ(c->offset)]) != object);
Hrm. What happens here if we call __slab_alloc, get a valid object, then
have a CPU_CMPXCHG that fails, restart the loop.. is this case taken
care of or do we end up having an unreferenced object ? Maybe there is
some logic in freelist that takes care of it ?
If we fail then we have done nothing....
Post by Mathieu Desnoyers
Also, we have to be aware that we can now change CPU between the
__CPU_READ and the CPU_CMPXCHG. (also : should it be a __CPU_CMPXCHG ?)
But since "object" contains information specific to the local CPU, the
cmpxchg should fail if we are migrated and everything should be ok.
Right.
Post by Mathieu Desnoyers
Hrm, actually, the
c = s->cpu_slab;
should probably be after the object = __CPU_READ(c->freelist); ?
No. c is invariant with respect to cpus.
Post by Mathieu Desnoyers
The cpu_read acts as a safeguard checking that we do not change CPU
between the read and the cmpxchg. If we are preempted between the "c"
read and the cpu_read, we could do a !cpu_node_match(c, node) check that
would apply to the wrong cpu.
C is not pointing to a specific cpu. It can only be used in CPU_xx ops to
address the current cpu.
Post by Mathieu Desnoyers
Post by c***@sgi.com
@@ -1800,19 +1792,19 @@ static void __always_inline slab_free(st
* then any change of cpu_slab will cause the cmpxchg to fail
* since the freelist pointers are unique per slab.
*/
- if (unlikely(page != c->page || c->node < 0)) {
- __slab_free(s, page, x, addr, c->offset);
+ if (unlikely(page != __CPU_READ(c->page) ||
+ __CPU_READ(c->node) < 0)) {
+ __slab_free(s, page, x, addr, __CPU_READ(c->offset));
And same question as above : what happens if we fail after executing the
__slab_free.. is it valid to do it twice ?
__slab_free is always successful and will never cause a repeat of the
loop.
c***@sgi.com
2007-11-20 01:11:35 UTC
Currently the per cpu subsystem is not able to use the atomic capabilities
of the processors we have.

This adds new functionality that allows the optimizing of per cpu variable
handling. In particular it provides a simple way to exploit atomic operations
to avoid having to disable interrupts or add a per cpu offset.

F.e. current implementations may do

unsigned long flags;
struct stat_struct *p;

local_irq_save(flags);
/* Calculate address of per processor area */
p = CPU_PTR(stat, smp_processor_id());
p->counter++;
local_irq_restore(flags);

This whole segment can be replaced by a single CPU operation

CPU_INC(stat->counter);

And on most processors it is possible to perform the increment with
a single processor instruction. Processors have segment registers,
global registers and per cpu mappings of per cpu areas for that purpose.

The problem is that the current schemes cannot utilize those features.
local_t is not really addressing the issue since the offset calculation
is not solved. local_t is x86 processor specific. This solution here
can utilize other methods than just the x86 instruction set.

On x86 the above CPU_INC translates into a single instruction:

inc %%gs:(&stat->counter)

This instruction is interrupt safe since it can either be completed
or not.

The determination of the correct per cpu area for the current processor
does not require access to smp_processor_id() (expensive...). The gs
register is used to provide a processor specific offset to the respective
per cpu area where the per cpu variable resides.

Note that the counter offset into the struct is added *before* the segment
selector is applied. This is necessary to avoid extra calculation: in the past
we first determined the address of the stats structure on the respective
processor and then added the field offset. However, the offset may as
well be added earlier.

If stat was declared via DECLARE_PER_CPU then this patchset is capable of
convincing the linker to provide the proper base address. In that case
no calculations are necessary.

Should the stats structure be reachable via a register then the address
calculation capabilities can be leveraged to avoid calculations.

On IA64 the same will result in another single instruction using the
fact that we have a virtual address that always maps to the local per cpu
area.

fetchadd &stat->counter + (VCPU_BASE - __per_cpu_base)

The access is forced into the per cpu address reachable via the virtualized
address. Again the counter field offset is added to the offset. The access
then becomes a single instruction, just as on x86.

In order to be able to exploit the atomicity of these instructions we
introduce a series of new functions that take a BASE pointer (a pointer
into the area of cpu 0 which is the canonical base). A usage sketch
follows the list below.

CPU_READ()
CPU_WRITE()
CPU_INC
CPU_DEC
CPU_ADD
CPU_SUB
CPU_XCHG
CPU_CMPXCHG
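
A usage sketch (the cpu_alloc() signature with an alignment argument is an
assumption based on the allocator patches earlier in this series):

	struct stat_struct {
		unsigned long counter;
	};

	static struct stat_struct *stat;	/* base pointer into the cpu area */

	static int __init stat_init(void)
	{
		stat = cpu_alloc(sizeof(struct stat_struct),
				GFP_KERNEL | __GFP_ZERO,
				__alignof__(struct stat_struct));
		return stat ? 0 : -ENOMEM;
	}

	static void count_event(void)
	{
		CPU_INC(stat->counter);		/* single instruction on x86 */
	}

	static unsigned long total_events(void)
	{
		unsigned long n = 0;
		int cpu;

		for_each_online_cpu(cpu)
			n += CPU_PTR(stat, cpu)->counter;
		return n;
	}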






Signed-off-by: Christoph Lameter <***@sgi.com>

---
include/linux/percpu.h | 156 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-11-18 22:13:51.773274119 -0800
+++ linux-2.6/include/linux/percpu.h 2007-11-18 22:15:10.396773779 -0800
@@ -190,4 +190,160 @@ void cpu_free(void *cpu_pointer, unsigne
*/
void *boot_cpu_alloc(unsigned long size);

+/*
+ * Fast Atomic per cpu operations.
+ *
+ * The following operations can be overridden by arches to implement fast
+ * and efficient operations. The operations are atomic meaning that the
+ * determination of the processor, the calculation of the address and the
+ * operation on the data is an atomic operation.
+ */
+
+#ifndef CONFIG_FAST_CPU_OPS
+
+/*
+ * The fallbacks are rather slow but they are safe
+ *
+ * The first group of macros is used when it is
+ * safe to update the per cpu variable because
+ * preemption is off (per cpu variables that are not
+ * updated from interrupt context) or because
+ * interrupts are already off.
+ */
+
+#define __CPU_READ(obj) \
+({ \
+ typeof(obj) x; \
+ x = *THIS_CPU(&(obj)); \
+ (x); \
+})
+
+#define __CPU_WRITE(obj, value) \
+({ \
+	*THIS_CPU(&(obj)) = value; \
+})
+
+#define __CPU_ADD(obj, value) \
+({ \
+ *THIS_CPU(&(obj)) += value; \
+})
+
+
+#define __CPU_INC(addr) __CPU_ADD(addr, 1)
+#define __CPU_DEC(addr) __CPU_ADD(addr, -1)
+#define __CPU_SUB(addr, value) __CPU_ADD(addr, -(value))
+
+#define __CPU_CMPXCHG(obj, old, new) \
+({ \
+ typeof(obj) x; \
+ typeof(obj) *p = THIS_CPU(&(obj)); \
+ x = *p; \
+ if (x == old) \
+ *p = new; \
+ (x); \
+})
+
+#define __CPU_XCHG(obj, new) \
+({ \
+ typeof(obj) x; \
+ typeof(obj) *p = THIS_CPU(&(obj)); \
+ x = *p; \
+ *p = new; \
+ (x); \
+})
+
+/*
+ * Second group used for per cpu variables that
+ * are not updated from an interrupt context.
+ * In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without preemption.
+ */
+
+#define _CPU_READ(addr) \
+({ \
+ (__CPU_READ(addr)); \
+})
+
+#define _CPU_WRITE(addr, value) \
+({ \
+ __CPU_WRITE(addr, value); \
+})
+
+#define _CPU_ADD(addr, value) \
+({ \
+ preempt_disable(); \
+ __CPU_ADD(addr, value); \
+ preempt_enable(); \
+})
+
+#define _CPU_INC(addr) _CPU_ADD(addr, 1)
+#define _CPU_DEC(addr) _CPU_ADD(addr, -1)
+#define _CPU_SUB(addr, value) _CPU_ADD(addr, -(value))
+
+#define _CPU_CMPXCHG(addr, old, new) \
+({ \
+ typeof(addr) x; \
+ preempt_disable(); \
+ x = __CPU_CMPXCHG(addr, old, new); \
+ preempt_enable(); \
+ (x); \
+})
+
+#define _CPU_XCHG(addr, new) \
+({ \
+ typeof(addr) x; \
+ preempt_disable(); \
+ x = __CPU_XCHG(addr, new); \
+ preempt_enable(); \
+ (x); \
+})
+
+/*
+ * Interrupt safe CPU functions
+ */
+
+#define CPU_READ(addr) \
+({ \
+ (__CPU_READ(addr)); \
+})
+
+#define CPU_WRITE(addr, value) \
+({ \
+ __CPU_WRITE(addr, value); \
+})
+
+#define CPU_ADD(addr, value) \
+({ \
+ unsigned long flags; \
+ local_irq_save(flags); \
+ __CPU_ADD(addr, value); \
+ local_irq_restore(flags); \
+})
+
+#define CPU_INC(addr) CPU_ADD(addr, 1)
+#define CPU_DEC(addr) CPU_ADD(addr, -1)
+#define CPU_SUB(addr, value) CPU_ADD(addr, -(value))
+
+#define CPU_CMPXCHG(addr, old, new) \
+({ \
+ unsigned long flags; \
+ typeof(*addr) x; \
+ local_irq_save(flags); \
+ x = __CPU_CMPXCHG(addr, old, new); \
+ local_irq_restore(flags); \
+ (x); \
+})
+
+#define CPU_XCHG(addr, new) \
+({ \
+ unsigned long flags; \
+ typeof(*addr) x; \
+ local_irq_save(flags); \
+ x = __CPU_XCHG(addr, new); \
+ local_irq_restore(flags); \
+ (x); \
+})
+
+#endif /* CONFIG_FAST_CPU_OPS */
+
#endif /* __LINUX_PERCPU_H */

--
Mathieu Desnoyers
2007-11-20 03:17:52 UTC
Permalink
Very interesting patch! I did not expect we could mix local atomic ops
with per CPU offsets in an atomic manner.. brilliant :)

Some nitpicking follows...
Post by c***@sgi.com
Currently the per cpu subsystem is not able to use the atomic capabilities
of the processors we have.
This adds new functionality that allows the optimizing of per cpu variable
handliong. It in particular provides a simple way to exploit atomic operations
handling
Post by c***@sgi.com
to avoid having to disable itnerrupts or add an per cpu offset.
interrupts
Post by c***@sgi.com
F.e. current implementations may do
unsigned long flags;
struct stat_struct *p;
local_irq_save(flags);
/* Calculate address of per processor area */
p = CPU_PTR(stat, smp_processor_id());
p->counter++;
local_irq_restore(flags);
This whole segment can be replaced by a single CPU operation
CPU_INC(stat->counter);
And on most processors it is possible to perform the increment with
a single processor instruction. Processors have segment registers,
global registers and per cpu mappings of per cpu areas for that purpose.
The problem is that the current schemes cannot utilize those features.
local_t is not really addressing the issue since the offset calculation
is not solved. local_t is x86 processor specific. This solution here
can utilize other methods than just the x86 instruction set.
inc %%gs:(&stat->counter)
This instruction is interrupt safe since it can either be completed
or not.
The determination of the correct per cpu area for the current processor
does not require access to smp_processor_id() (expensive...). The gs
register is used to provide a processor specific offset to the respective
per cpu area where the per cpu variabvle resides.
variable
Post by c***@sgi.com
Note tha the counter offset into the struct was added *before* the segment
that
Post by c***@sgi.com
selector was added. This is necessary to avoid calculation, In the past
we first determine the address of the stats structure on the respective
processor and then added the field offset. However, the offset may as
well be added earlier.
If stat was declared via DECLARE_PER_CPU then this patchset is capoable of
capable
Post by c***@sgi.com
convincing the linker to provide the proper base address. In that case
no calculations are necessary.
Should the stats structure be reachable via a register then the address
calculation capabilities can be leverages to avoid calculations.
On IA64 the same will result in another single instruction using the
factor that we have a virtual address that always maps to the local per cpu
area.
fetchadd &stat->counter + (VCPU_BASE - __per_cpu_base)
The access is forced into the per cpu address reachable via the virtualized
address. Again the counter field offset is eadded to the offset. The access
added
Post by c***@sgi.com
is then similarly a singular instruction thing as on x86.
In order to be able to exploit the atomicity of this instructions we
introduce a series of new functions that take a BASE pointer (a pointer
into the area of cpu 0 which is the canonical base).
CPU_READ()
CPU_WRITE()
CPU_INC
CPU_DEC
CPU_ADD
CPU_SUB
CPU_XCHG
CPU_CMPXCHG
---
include/linux/percpu.h | 156 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-11-18 22:13:51.773274119 -0800
+++ linux-2.6/include/linux/percpu.h 2007-11-18 22:15:10.396773779 -0800
@@ -190,4 +190,160 @@ void cpu_free(void *cpu_pointer, unsigne
*/
void *boot_cpu_alloc(unsigned long size);
+/*
+ * Fast Atomic per cpu operations.
+ *
+ * The following operations can be overridden by arches to implement fast
+ * and efficient operations. The operations are atomic meaning that the
+ * determination of the processor, the calculation of the address and the
+ * operation on the data is an atomic operation.
+ */
+
+#ifndef CONFIG_FAST_CPU_OPS
+
+/*
+ * The fallbacks are rather slow but they are safe
+ *
+ * The first group of macros is used when we it is
+ * safe to update the per cpu variable because
+ * preemption is off (per cpu variables that are not
+ * updated from interrupt cointext) or because
context
Post by c***@sgi.com
+ * interrupts are already off.
+ */
+
+#define __CPU_READ(obj) \
+({ \
+ typeof(obj) x; \
+ x = *THIS_CPU(&(obj)); \
+ (x); \
+})
+
+#define __CPU_WRITE(obj, value) \
+({ \
+ *THIS_CPU((&(obj)) = value; \
+})
+
+#define __CPU_ADD(obj, value) \
+({ \
+ *THIS_CPU(&(obj)) += value; \
+})
+
+
+#define __CPU_INC(addr) __CPU_ADD(addr, 1)
+#define __CPU_DEC(addr) __CPU_ADD(addr, -1)
+#define __CPU_SUB(addr, value) __CPU_ADD(addr, -(value))
+
+#define __CPU_CMPXCHG(obj, old, new) \
+({ \
+ typeof(obj) x; \
+ typeof(obj) *p = THIS_CPU(&(obj)); \
+ x = *p; \
+ if (x == old) \
+ *p = new; \
I think you could use extra () around old, new etc.. ?
Post by c***@sgi.com
+ (x); \
+})
+
+#define __CPU_XCHG(obj, new) \
+({ \
+ typeof(obj) x; \
+ typeof(obj) *p = THIS_CPU(&(obj)); \
+ x = *p; \
+ *p = new; \
Same here.
Post by c***@sgi.com
+ (x); \
() seems unneeded here, since x is local.
Post by c***@sgi.com
+})
+
+/*
+ * Second group used for per cpu variables that
+ * are not updated from an interrupt context.
+ * In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without preemption.
+ */
+
+#define _CPU_READ(addr) \
+({ \
+ (__CPU_READ(addr)); \
+})
({ }) seems to be unneeded here.
Post by c***@sgi.com
+
+#define _CPU_WRITE(addr, value) \
+({ \
+ __CPU_WRITE(addr, value); \
+})
and here..
Post by c***@sgi.com
+
+#define _CPU_ADD(addr, value) \
+({ \
+ preempt_disable(); \
+ __CPU_ADD(addr, value); \
+ preempt_enable(); \
+})
+
Add ()
Post by c***@sgi.com
+#define _CPU_INC(addr) _CPU_ADD(addr, 1)
+#define _CPU_DEC(addr) _CPU_ADD(addr, -1)
+#define _CPU_SUB(addr, value) _CPU_ADD(addr, -(value))
+
+#define _CPU_CMPXCHG(addr, old, new) \
+({ \
+ typeof(addr) x; \
+ preempt_disable(); \
+ x = __CPU_CMPXCHG(addr, old, new); \
add ()
Post by c***@sgi.com
+ preempt_enable(); \
+ (x); \
+})
+
+#define _CPU_XCHG(addr, new) \
+({ \
+ typeof(addr) x; \
+ preempt_disable(); \
+ x = __CPU_XCHG(addr, new); \
()
Post by c***@sgi.com
+ preempt_enable(); \
+ (x); \
() seems unneeded here, since x is local.
Post by c***@sgi.com
+})
+
+/*
+ * Interrupt safe CPU functions
+ */
+
+#define CPU_READ(addr) \
+({ \
+ (__CPU_READ(addr)); \
+})
+
Unnecessary ({ })
Post by c***@sgi.com
+#define CPU_WRITE(addr, value) \
+({ \
+ __CPU_WRITE(addr, value); \
+})
+
+#define CPU_ADD(addr, value) \
+({ \
+ unsigned long flags; \
+ local_irq_save(flags); \
+ __CPU_ADD(addr, value); \
+ local_irq_restore(flags); \
+})
+
+#define CPU_INC(addr) CPU_ADD(addr, 1)
+#define CPU_DEC(addr) CPU_ADD(addr, -1)
+#define CPU_SUB(addr, value) CPU_ADD(addr, -(value))
+
+#define CPU_CMPXCHG(addr, old, new) \
+({ \
+ unsigned long flags; \
+ typeof(*addr) x; \
+ local_irq_save(flags); \
+ x = __CPU_CMPXCHG(addr, old, new); \
()
Post by c***@sgi.com
+ local_irq_restore(flags); \
+ (x); \
() seems unneeded here, since x is local.
Post by c***@sgi.com
+})
+
+#define CPU_XCHG(addr, new) \
+({ \
+ unsigned long flags; \
+ typeof(*addr) x; \
+ local_irq_save(flags); \
+ x = __CPU_XCHG(addr, new); \
()
Post by c***@sgi.com
+ local_irq_restore(flags); \
+ (x); \
() seems unneeded here, since x is local.
Post by c***@sgi.com
+})
+
+#endif /* CONFIG_FAST_CPU_OPS */
+
#endif /* __LINUX_PERCPU_H */
--
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 03:30:55 UTC
Permalink
Post by Mathieu Desnoyers
Very interesting patch! I did not expect we could mix local atomic ops
with per CPU offsets in an atomic manner.. brilliant :)
Some nitpicking follows...
Well this is a draft so I was not that thorough. The beast is getting too
big. It would be good if I could get the first patches merged that just
deal with the two allocators and then gradually work the rest.
Post by Mathieu Desnoyers
I think you could use extra () around old, new etc.. ?
Right.
Post by Mathieu Desnoyers
Same here.
Post by c***@sgi.com
+ (x); \
() seems unneeded here, since x is local.
But (x) is returned to the "caller" of the macro so it should be specially
marked.
Post by Mathieu Desnoyers
Post by c***@sgi.com
+ * In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without preemption.
+ */
+
+#define _CPU_READ(addr) \
+({ \
+ (__CPU_READ(addr)); \
+})
({ }) seems to be unneeded here.
Hmmm.... I wanted a consistent style.
Mathieu Desnoyers
2007-11-20 04:07:30 UTC
Permalink
Post by Christoph Lameter
Post by Mathieu Desnoyers
Very interesting patch! I did not expect we could mix local atomic ops
with per CPU offsets in an atomic manner.. brilliant :)
Some nitpicking follows...
Well this is a draft so I was not that thorough. The beast is getting too
big. It would be good if I could get the first patches merged that just
deal with the two allocators and then gradually work the rest.
Post by Mathieu Desnoyers
I think you could use extra () around old, new etc.. ?
Right.
Post by Mathieu Desnoyers
Same here.
Post by c***@sgi.com
+ (x); \
() seems unneeded here, since x is local.
But (x) is returned to the "caller" of the macro so it should be specially
marged.
I don't think that it really matters.. the preprocessor already wraps
all the ({ }) in a single statement, doesn't it ?


Grepping for usage of ({ in include/linux shows that the return value is
never surrounded by supplementary ().
Post by Christoph Lameter
Post by Mathieu Desnoyers
Post by c***@sgi.com
+ * In that case we can simply disable preemption which
+ * may be free if the kernel is compiled without preemption.
+ */
+
+#define _CPU_READ(addr) \
+({ \
+ (__CPU_READ(addr)); \
+})
({ }) seems to be unneeded here.
Hmmm.... I wanted a consistent style.
Since checkpatch.pl emits a warning when a one liner if() uses brackets,
I guess compactness of code is preferred to a consistent style.

Just my 2 cents though :)

Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:36:57 UTC
Permalink
Post by Mathieu Desnoyers
Post by Christoph Lameter
But (x) is returned to the "caller" of the macro so it should be specially
marged.
I don't think that it really matters.. the preprocessor already wraps
all the ({ }) in a single statement, doesn't it ?
No it does not matter for the preprocessor. It matters for readability
because I want to see that this is the return value.
Post by Mathieu Desnoyers
Since checkpatch.pl emits a warning when a one liner if() uses brackets,
I guess compactness of code is preferred to a consistent style.
I wish someone would fix it. It's giving so many false positives that it's
useless for me.
c***@sgi.com
2007-11-20 01:11:37 UTC
Permalink
Remove the fields in kmem_cache_cpu that were used to cache data from
kmem_cache when they were in different cachelines. The cacheline that holds
the per cpu array pointer now also holds these values. We can cut down the
kmem_cache_cpu size to almost half.

The get_freepointer() and set_freepointer() functions that used to be
intended only for the slow path are now also useful for the hot path since
access to the field no longer requires an additional cacheline. This results
in consistent use of setting the freepointer for objects throughout SLUB.
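
For reference, the two helpers in mm/slub.c are roughly (shown for context,
not part of the diff below):

	static inline void *get_freepointer(struct kmem_cache *s, void *object)
	{
		return *(void **)(object + s->offset);
	}

	static inline void set_freepointer(struct kmem_cache *s, void *object,
							void *fp)
	{
		*(void **)(object + s->offset) = fp;
	}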

Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/linux/slub_def.h | 3 --
mm/slub.c | 48 +++++++++++++++--------------------------------
2 files changed, 16 insertions(+), 35 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2007-11-19 15:53:25.869890760 -0800
+++ linux-2.6/include/linux/slub_def.h 2007-11-19 15:55:42.125640354 -0800
@@ -15,9 +15,6 @@ struct kmem_cache_cpu {
void **freelist;
struct page *page;
int node;
- unsigned int offset;
- unsigned int objsize;
- unsigned int objects;
};

struct kmem_cache_node {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2007-11-19 15:54:10.513640214 -0800
+++ linux-2.6/mm/slub.c 2007-11-19 15:55:42.125640354 -0800
@@ -273,13 +273,6 @@ static inline int check_valid_pointer(st
return 1;
}

-/*
- * Slow version of get and set free pointer.
- *
- * This version requires touching the cache lines of kmem_cache which
- * we avoid to do in the fast alloc free paths. There we obtain the offset
- * from the page struct.
- */
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
return *(void **)(object + s->offset);
@@ -1438,10 +1431,10 @@ static void deactivate_slab(struct kmem_

/* Retrieve object from cpu_freelist */
object = c->freelist;
- c->freelist = c->freelist[c->offset];
+ c->freelist = get_freepointer(s, c->freelist);

/* And put onto the regular freelist */
- object[c->offset] = page->freelist;
+ set_freepointer(s, object, page->freelist);
page->freelist = object;
page->inuse--;
}
@@ -1584,8 +1577,8 @@ load_freelist:
goto debug;

object = c->page->freelist;
- c->freelist = object[c->offset];
- c->page->inuse = c->objects;
+ c->freelist = get_freepointer(s, object);
+ c->page->inuse = s->objects;
c->page->freelist = c->page->end;
c->node = page_to_nid(c->page);
unlock_out:
@@ -1613,7 +1606,7 @@ debug:
goto another_slab;

c->page->inuse++;
- c->page->freelist = object[c->offset];
+ c->page->freelist = get_freepointer(s, object);
c->node = -1;
goto unlock_out;
}
@@ -1646,7 +1639,7 @@ static void __always_inline *slab_alloc(
break;
}
} while (CPU_CMPXCHG(c->freelist, object,
- object[__CPU_READ(c->offset)]) != object);
+ get_freepointer(s, object)) != object);
#else
unsigned long flags;

@@ -1661,13 +1654,13 @@ static void __always_inline *slab_alloc(
}
} else {
object = c->freelist;
- c->freelist = object[c->offset];
+ c->freelist = get_freepointer(s, object);
}
local_irq_restore(flags);
#endif

if (unlikely((gfpflags & __GFP_ZERO)))
- memset(object, 0, c->objsize);
+ memset(object, 0, s->objsize);
out:
return object;
}
@@ -1695,7 +1688,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
* handling required then we can return immediately.
*/
static void __slab_free(struct kmem_cache *s, struct page *page,
- void *x, void *addr, unsigned int offset)
+ void *x, void *addr)
{
void *prior;
void **object = (void *)x;
@@ -1711,7 +1704,8 @@ static void __slab_free(struct kmem_cach
if (unlikely(state & SLABDEBUG))
goto debug;
checks_ok:
- prior = object[offset] = page->freelist;
+ prior = page->freelist;
+ set_freepointer(s, object, prior);
page->freelist = object;
page->inuse--;

@@ -1794,10 +1788,10 @@ static void __always_inline slab_free(st
*/
if (unlikely(page != __CPU_READ(c->page) ||
__CPU_READ(c->node) < 0)) {
- __slab_free(s, page, x, addr, __CPU_READ(c->offset));
+ __slab_free(s, page, x, addr);
break;
}
- object[__CPU_READ(c->offset)] = freelist;
+ set_freepointer(s, object, freelist);
} while (CPU_CMPXCHG(c->freelist, freelist, object) != freelist);
#else
unsigned long flags;
@@ -1806,10 +1800,10 @@ static void __always_inline slab_free(st
debug_check_no_locks_freed(object, s->objsize);
c = THIS_CPU(s->cpu_slab);
if (likely(page == c->page && c->node >= 0)) {
- object[c->offset] = c->freelist;
+ set_freepointer(s, object, c->freelist);
c->freelist = object;
} else
- __slab_free(s, page, x, addr, c->offset);
+ __slab_free(s, page, x, addr);

local_irq_restore(flags);
#endif
@@ -1991,9 +1985,6 @@ static void init_kmem_cache_cpu(struct k
c->page = NULL;
c->freelist = (void *)PAGE_MAPPING_ANON;
c->node = 0;
- c->offset = s->offset / sizeof(void *);
- c->objsize = s->objsize;
- c->objects = s->objects;
}

static void init_kmem_cache_node(struct kmem_cache_node *n)
@@ -2985,21 +2976,14 @@ struct kmem_cache *kmem_cache_create(con
down_write(&slub_lock);
s = find_mergeable(size, align, flags, name, ctor);
if (s) {
- int cpu;
-
s->refcount++;
+
/*
* Adjust the object sizes so that we clear
* the complete object on kzalloc.
*/
s->objsize = max(s->objsize, (int)size);

- /*
- * And then we need to update the object size in the
- * per cpu structures
- */
- for_each_online_cpu(cpu)
- CPU_PTR(s->cpu_slab, cpu)->objsize = s->objsize;
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
up_write(&slub_lock);
if (sysfs_slab_alias(s, name))

--
c***@sgi.com
2007-11-20 01:11:40 UTC
Permalink
64 bit:

Set up a cpu area that allows the use of up to 16MB for each processor.

Cpu memory use can grow a bit. F.e. if we assume that a pageset
occupies 64 bytes of memory and we have 3 zones in each of 1024 nodes
then we need 3 * 1k * 16k = 50 million pagesets, or 3072 pagesets per
processor. This results in a total of about 3.2 GB of pagesets.
Each cpu needs around 200k of cpu storage for the page allocator alone.
So it is worth it to use a 2M huge mapping here.

For the UP and SMP case map the area using 4k ptes. Typical use of per cpu
data is around 16k for UP and SMP configurations. It goes up to 45k when the
per cpu area is managed by cpu_alloc (see special x86_64 patchset).
Allocating in 2M segments would be overkill.

For NUMA map the area using 2M PMDs. A large NUMA system may use
lots of cpu data for the page allocator data alone. We typically
have large amounts of memory around on systems of that size. Using a 2M page size
reduces TLB pressure for that case.

Some numbers for envisioned maximum configurations of NUMA systems:

4k cpu configurations with 1k nodes:

4096 * 16MB = 64TB of virtual space.

Maximum theoretical configuration 16384 processors 1k nodes:

16384 * 16MB = 256TB of virtual space.

Both fit within the established limits.

32 bit:

Set up a 256 kB area for the cpu areas below the FIXADDR area.

The use of the cpu alloc area is pretty minimal on i386. An 8p system
with no extras uses only ~8kb. So 256kb should be plenty. A configuration
that supports up to 8 processors takes up 2MB of the scarce
virtual address space.

Signed-off-by: Christoph Lameter <***@sgi.com>
---
arch/x86/Kconfig | 13 +++++++++++++
arch/x86/kernel/vmlinux_32.lds.S | 1 +
arch/x86/kernel/vmlinux_64.lds.S | 3 +++
arch/x86/mm/init_32.c | 3 +++
arch/x86/mm/init_64.c | 38 ++++++++++++++++++++++++++++++++++++++
include/asm-x86/pgtable_32.h | 7 +++++--
include/asm-x86/pgtable_64.h | 1 +
7 files changed, 64 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c 2007-11-19 15:45:07.602390533 -0800
+++ linux-2.6/arch/x86/mm/init_64.c 2007-11-19 15:55:53.165640248 -0800
@@ -781,3 +781,41 @@ int __meminit vmemmap_populate(struct pa
return 0;
}
#endif
+
+#ifdef CONFIG_NUMA
+int __meminit cpu_area_populate(void *start, unsigned long size,
+ gfp_t flags, int node)
+{
+ unsigned long addr = (unsigned long)start;
+ unsigned long end = addr + size;
+ unsigned long next;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ for (; addr < end; addr = next) {
+ next = pmd_addr_end(addr, end);
+
+ pgd = cpu_area_pgd_populate(addr, flags, node);
+ if (!pgd)
+ return -ENOMEM;
+ pud = cpu_area_pud_populate(pgd, addr, flags, node);
+ if (!pud)
+ return -ENOMEM;
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd)) {
+ pte_t entry;
+ void *p = cpu_area_alloc_block(PMD_SIZE, flags, node);
+ if (!p)
+ return -ENOMEM;
+
+ entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+ mk_pte_huge(entry);
+ set_pmd(pmd, __pmd(pte_val(entry)));
+ }
+ }
+
+ return 0;
+}
+#endif
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h 2007-11-19 15:45:07.638390147 -0800
+++ linux-2.6/include/asm-x86/pgtable_64.h 2007-11-19 15:55:53.165640248 -0800
@@ -138,6 +138,7 @@ static inline pte_t ptep_get_and_clear_f
#define VMALLOC_START _AC(0xffffc20000000000, UL)
#define VMALLOC_END _AC(0xffffe1ffffffffff, UL)
#define VMEMMAP_START _AC(0xffffe20000000000, UL)
+#define CPU_AREA_BASE _AC(0xffffffff84000000, UL)
#define MODULES_VADDR _AC(0xffffffff88000000, UL)
#define MODULES_END _AC(0xfffffffffff00000, UL)
#define MODULES_LEN (MODULES_END - MODULES_VADDR)
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig 2007-11-19 15:54:10.509139813 -0800
+++ linux-2.6/arch/x86/Kconfig 2007-11-19 15:55:53.165640248 -0800
@@ -159,6 +159,19 @@ config X86_TRAMPOLINE

config KTIME_SCALAR
def_bool X86_32
+
+config CPU_AREA_VIRTUAL
+ bool
+ default y
+
+config CPU_AREA_ORDER
+ int
+ default "6"
+
+config CPU_AREA_ALLOC_ORDER
+ int
+ default "0"
+
source "init/Kconfig"

menu "Processor type and features"
Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c 2007-11-19 15:45:07.610390367 -0800
+++ linux-2.6/arch/x86/mm/init_32.c 2007-11-19 15:55:53.165640248 -0800
@@ -674,6 +674,7 @@ void __init mem_init(void)
#if 1 /* double-sanity-check paranoia */
printk("virtual kernel memory layout:\n"
" fixmap : 0x%08lx - 0x%08lx (%4ld kB)\n"
+ " cpu area: 0x%08lx - 0x%08lx (%4ld kb)\n"
#ifdef CONFIG_HIGHMEM
" pkmap : 0x%08lx - 0x%08lx (%4ld kB)\n"
#endif
@@ -684,6 +685,8 @@ void __init mem_init(void)
" .text : 0x%08lx - 0x%08lx (%4ld kB)\n",
FIXADDR_START, FIXADDR_TOP,
(FIXADDR_TOP - FIXADDR_START) >> 10,
+ CPU_AREA_BASE, FIXADDR_START,
+ (FIXADDR_START - CPU_AREA_BASE) >> 10,

#ifdef CONFIG_HIGHMEM
PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE,
Index: linux-2.6/include/asm-x86/pgtable_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_32.h 2007-11-19 15:45:07.646390299 -0800
+++ linux-2.6/include/asm-x86/pgtable_32.h 2007-11-19 15:55:53.165640248 -0800
@@ -79,11 +79,14 @@ void paging_init(void);
#define VMALLOC_START (((unsigned long) high_memory + \
2*VMALLOC_OFFSET-1) & ~(VMALLOC_OFFSET-1))
#ifdef CONFIG_HIGHMEM
-# define VMALLOC_END (PKMAP_BASE-2*PAGE_SIZE)
+# define CPU_AREA_BASE (PKMAP_BASE - NR_CPUS * \
+ (1 << (CONFIG_CPU_AREA_ORDER + PAGE_SHIFT)))
#else
-# define VMALLOC_END (FIXADDR_START-2*PAGE_SIZE)
+# define CPU_AREA_BASE (FIXADDR_START - NR_CPUS * \
+ (1 << (CONFIG_CPU_AREA_ORDER + PAGE_SHIFT)))
#endif

+#define VMALLOC_END (CPU_AREA_BASE - 2 * PAGE_SIZE)
/*
* _PAGE_PSE set in the page directory entry just means that
* the page directory entry points directly to a 4MB-aligned block of
Index: linux-2.6/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/vmlinux_32.lds.S 2007-11-19 15:45:07.622390531 -0800
+++ linux-2.6/arch/x86/kernel/vmlinux_32.lds.S 2007-11-19 15:55:53.165640248 -0800
@@ -26,6 +26,7 @@ OUTPUT_FORMAT("elf32-i386", "elf32-i386"
OUTPUT_ARCH(i386)
ENTRY(phys_startup_32)
jiffies = jiffies_64;
+cpu_area = CPU_AREA_BASE;

PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
Index: linux-2.6/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/vmlinux_64.lds.S 2007-11-19 15:45:07.630390395 -0800
+++ linux-2.6/arch/x86/kernel/vmlinux_64.lds.S 2007-11-19 15:55:53.165640248 -0800
@@ -6,6 +6,7 @@

#include <asm-generic/vmlinux.lds.h>
#include <asm/page.h>
+#include <asm/pgtable_64.h>

#undef i386 /* in case the preprocessor is a 32bit one */

@@ -14,6 +15,8 @@ OUTPUT_ARCH(i386:x86-64)
ENTRY(phys_startup_64)
jiffies_64 = jiffies;
_proxy_pda = 1;
+cpu_area = CPU_AREA_BASE;
+
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
data PT_LOAD FLAGS(7); /* RWE */

--
H. Peter Anvin
2007-11-20 01:35:06 UTC
Permalink
Post by c***@sgi.com
For the UP and SMP case map the area using 4k ptes. Typical use of per cpu
data is around 16k for UP and SMP configurations. It goes up to 45k when the
per cpu area is managed by cpu_alloc (see special x86_64 patchset).
Allocating in 2M segments would be overkill.
For NUMA map the area using 2M PMDs. A large NUMA system may use
lots of cpu data for the page allocator data alone. We typically
have large amounts of memory around on those size. Using a 2M page size
reduces TLB pressure for that case.
4096 * 16MB = 64TB of virtual space.
16384 * 16MB = 256TB of virtual space.
Both fit within the established limits established.
You're making the assumption here that NUMA = large number of CPUs.
This assumption is flat-out wrong.

On x86-64, most two-socket systems are still NUMA, and I would expect
that most distro kernels probably compile in NUMA. However,
burning megabytes of memory on a two-socket dual-core system when we're
talking about tens of kilobytes used would be more than a wee bit insane.

I do like the concept, overall, but the above distinction needs to be fixed.

-hpa
Christoph Lameter
2007-11-20 02:02:11 UTC
Permalink
You're making the assumption here that NUMA = large number of CPUs. This
assumption is flat-out wrong.
Well maybe. Usually one gets to NUMA because the hardware gets too big to
be handled the UMA way.
On x86-64, most two-socket systems are still NUMA, and I would expect that
most distro kernels probably compile in NUMA. However,
burning megabytes of memory on a two-socket dual-core system when we're
talking about tens of kilobytes used would be more than a wee bit insane.
Yeah yea but the latencies are minimal making the NUMA logic too expensive
for most loads ... If you put a NUMA kernel onto those then performance
drops (I think someone measures 15-30%?)
H. Peter Anvin
2007-11-20 02:18:42 UTC
Permalink
Post by Christoph Lameter
You're making the assumption here that NUMA = large number of CPUs. This
assumption is flat-out wrong.
Well maybe. Usually one gets to NUMA because the hardware gets too big to
be handleed the UMA way.
On x86-64, most two-socket systems are still NUMA, and I would expect that
most distro kernels probably compile in NUMA. However,
burning megabytes of memory on a two-socket dual-core system when we're
talking about tens of kilobytes used would be more than a wee bit insane.
Yeah yea but the latencies are minimal making the NUMA logic too expensive
for most loads ... If you put a NUMA kernel onto those then performance
drops (I think someone measures 15-30%?)
How do you handle this memory, in the first place? Do you allocate the
whole 2 MB for the particular CPU, or do you reclaim the upper part of
the large page? (I haven't dug far enough into the source to tell.)

-hpa
Nick Piggin
2007-11-20 03:37:37 UTC
Permalink
Post by Christoph Lameter
You're making the assumption here that NUMA = large number of CPUs. This
assumption is flat-out wrong.
Well maybe. Usually one gets to NUMA because the hardware gets too big to
be handleed the UMA way.
Not the way things are going with multicore and multithread, though
(that is, the hardware can be one socket and still have many cpus).

The chip might have several memory controllers on it, but they could
well be connected to the caches with a crossbar, so it needn't be
NUMA at all. Future scalability work shouldn't rely on many cores
~= many nodes, IMO.
Nick Piggin
2007-11-20 03:59:12 UTC
Permalink
Post by Christoph Lameter
You're making the assumption here that NUMA = large number of CPUs. This
assumption is flat-out wrong.
Well maybe. Usually one gets to NUMA because the hardware gets too big to
be handleed the UMA way.
On x86-64, most two-socket systems are still NUMA, and I would expect
that most distro kernels probably compile in NUMA. However,
burning megabytes of memory on a two-socket dual-core system when we're
talking about tens of kilobytes used would be more than a wee bit insane.
Yeah yea but the latencies are minimal making the NUMA logic too expensive
for most loads ... If you put a NUMA kernel onto those then performance
drops (I think someone measures 15-30%?)
Small socket count systems are going to increasingly be NUMA in future.
If CONFIG_NUMA hurts performance by that much on those systems, then the
kernel is broken IMO.
Andi Kleen
2007-11-20 12:05:24 UTC
Permalink
Post by Nick Piggin
Post by Christoph Lameter
Yeah yea but the latencies are minimal making the NUMA logic too
expensive for most loads ... If you put a NUMA kernel onto those then
performance drops (I think someone measures 15-30%?)
Small socket count systems are going to increasingly be NUMA in future.
If CONFIG_NUMA hurts performance by that much on those systems, then the
kernel is broken IMO.
Not sure where that number came from.

In my tests some time ago NUMA overhead on SMP was minimal.

This was admittedly with old 2.4 kernels. There have been some doubts about
some of the newer NUMA features added; in particular about NUMA slab;
don't think there was much trouble with anything else -- in fact the trouble
was that it apparently sometimes made moderate NUMA factor NUMA systems
slower too. But I assume SLUB will address this anyways.

-Andi
Andi Kleen
2007-11-20 03:16:32 UTC
Permalink
Post by c***@sgi.com
4096 * 16MB = 64TB of virtual space.
16384 * 16MB = 256TB of virtual space.
Both fit within the established limits established.
I might be pointing out the obvious, but on x86-64 there is definitely not
256TB of VM available for this.

Not even 64TB, as long as you want to have any
other mappings in kernel (total kernel memory 128TB, but it is split in
half for the direct mapping)

BTW if you allocate any VM you should also update
Documentation/x86_64/mm.txt which describes the mapping
Post by c***@sgi.com
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h 2007-11-19
15:45:07.638390147 -0800 +++
linux-2.6/include/asm-x86/pgtable_64.h 2007-11-19 15:55:53.165640248 -0800
@@ -138,6 +138,7 @@ static inline pte_t ptep_get_and_clear_f
#define VMALLOC_START _AC(0xffffc20000000000, UL)
#define VMALLOC_END _AC(0xffffe1ffffffffff, UL)
#define VMEMMAP_START _AC(0xffffe20000000000, UL)
+#define CPU_AREA_BASE _AC(0xffffffff84000000, UL)
That's slightly less than 1GB before you bump into the maximum.
But you'll bump into the module mapping even before that.

For 16MB/CPU and the full 1GB that's ~123 CPUs if my calculations are correct.
Even for a non-Altix that's quite tight.

I suppose 16MB/CPU are too large.

-Andi
Christoph Lameter
2007-11-20 03:50:01 UTC
Permalink
Post by Andi Kleen
I might be pointing out the obvious, but on x86-64 there is definitely not
256TB of VM available for this.
Well maybe in the future.

One of the issues that I ran into is that I had to place the cpu area
in between to make the offsets link right.

However, it would be best if the cpu area came *after* the modules area. We
only need linking that covers the per cpu area of processor 0.

So I think we have a 2GB area right?

1GB kernel
1GB - 1x per cpu area (128M?) modules?
cpu aree 0
---- 2GB limit
cpu area 1
cpu area 2
....

For that we would need to move the kernel down a bit. Can we do that?
Andi Kleen
2007-11-20 12:01:24 UTC
Permalink
Post by Christoph Lameter
Post by Andi Kleen
I might be pointing out the obvious, but on x86-64 there is definitely
not 256TB of VM available for this.
Well maybe in the future.
That would either require more than 4 levels or larger pages
in page tables.
Post by Christoph Lameter
One of the issues that I ran into is that I had to place the cpu area
in between to make the offsets link right.
Above -2GB, otherwise you cannot address them

If you can move all the other CPUs somewhere else it might work.

But even then 16MB/cpu max is unrealistic. Perhaps 1M/CPU
max -- then 16k CPUs would be 16GB which could still fit into the existing
vmalloc area.
Post by Christoph Lameter
However, it would be best if the cpuarea came *after* the modules area. We
only need linking that covers the per cpu area of processor 0.
So I think we have a 2GB area right?
For everything that needs the -31bit offsets; that is everything linked
Post by Christoph Lameter
1GB kernel
1GB - 1x per cpu area (128M?) modules?
cpu aree 0
---- 2GB limit
cpu area 1
cpu area 2
....
For that we would need to move the kernel down a bit. Can we do that?
The kernel model requires kernel and modules and everything else that is
linked to be in the negative 31bit space. That is how the kernel code model is
defined.

You could in theory move the modules, but then you would need to implement
a full PIC dynamic linker for them first and also increase runtime overhead
for them because they would need to use a GOT/PLT.

Or you could switch kernel over to the large model, which is very costly
and has toolkit problems.

Or use the UML trick and run the kernel PIC but again that causes
overhead.

I suspect all of this would cause far more overhead all over the kernel than
you could ever save with the per cpu data in your fast paths.

-Andi
Christoph Lameter
2007-11-20 20:35:22 UTC
Permalink
Post by Andi Kleen
Post by Christoph Lameter
So I think we have a 2GB area right?
For everything that needs the -31bit offsets; that is everything linked
Of course.
Post by Andi Kleen
Post by Christoph Lameter
1GB kernel
1GB - 1x per cpu area (128M?) modules?
cpu aree 0
---- 2GB limit
cpu area 1
cpu area 2
....
For that we would need to move the kernel down a bit. Can we do that?
The kernel model requires kernel and modules and everything else
linked be in negative -31bit space. That is how the kernel code model is
defined.
Right so I could move the kernel to

#define __PAGE_OFFSET _AC(0xffff810000000000, UL)
#define __START_KERNEL_map _AC(0xfffffff800000000, UL)
#define KERNEL_TEXT_START _AC(0xfffffff800000000, UL) 30 bits = 1GB for kernel text
#define MODULES_VADDR _AC(0xfffffff880000000, UL) 30 bits = 1GB for modules
#define MODULES_END _AC(0xfffffff8f0000000, UL)
#define CPU_AREA_BASE _AC(0xfffffff8f0000000, UL) 31 bits 256MB for cpu area 0
#define CPU_AREA_BASE1 _AC(0xfffffff900000000, UL) More cpu areas for higher numbered processors
#define CPU_AREA_END _AC(0xffffffffffff0000, UL)
Post by Andi Kleen
You could in theory move the modules, but then you would need to implement
a full PIC dynamic linker for them first and also increase runtime overhead
for them because they would need to use a GOT/PLT.
Why is it not possible to move the kernel lower while keeping bit 31 1?
Post by Andi Kleen
I suspect all of this would cause far more overhead all over the kernel than
you could ever save with the per cpu data in your fast paths.
Moving the kernel down a bit seems to be trivial without any of the weird
solutions.
Andi Kleen
2007-11-20 20:59:36 UTC
Permalink
Post by Christoph Lameter
Right so I could move the kernel to
#define __PAGE_OFFSET _AC(0xffff810000000000, UL)
#define __START_KERNEL_map _AC(0xfffffff800000000, UL)
That is -31GB unless I'm miscounting. But it needs to be >= -2GB
(31bits)

Right now it is at -2GB + 2MB, because it is loaded at physical +2MB
so it's convenient to identity map there. In theory you could avoid that
with some effort, but that would only buy you 2MB and would also
break some early code and earlyprintk I believe.
Post by Christoph Lameter
Post by Andi Kleen
You could in theory move the modules, but then you would need to implement
a full PIC dynamic linker for them first and also increase runtime overhead
for them because they would need to use a GOT/PLT.
Why is it not possible to move the kernel lower while keeping bit 31 1?
The kernel model relies on 32bit sign extension. This means bits [31;63] have
to be all 1
Post by Christoph Lameter
Post by Andi Kleen
I suspect all of this would cause far more overhead all over the kernel than
you could ever save with the per cpu data in your fast paths.
Moving the kernel down a bit seems to be trivial without any of the weird
solutions.
Another one I came up with in the previous mail would be to do the linker reference
variable allocation in [0;2GB] positive space; but do all real references only
%gs relative. And keep the real data copy on some other address. That would
be a similar trick to the old style x86-64 vsyscalls. It gets fairly
messy in the linker map file though.

-Andi
H. Peter Anvin
2007-11-20 20:43:10 UTC
Permalink
Post by Andi Kleen
Post by Christoph Lameter
I might be pointing out the obvious, but on x86-64 there is definitely
not 256TB of VM available for this.
Well maybe in the future.
That would either require more than 4 levels or larger pages
in page tables.
Post by Christoph Lameter
One of the issues that I ran into is that I had to place the cpu area
in between to make the offsets link right.
Above -2GB, otherwise you cannot address them
This limitation shouldn't apply to the percpu area, since gs_base can be
pointed anywhere in the address space -- in effect we're always indirect.

Obviously the offsets *within* the percpu area have to be in range (±2GB
per cpu for absolute offsets, slightly smaller for %rip-based addressing
-- obviously judicious use of an offset for gs_base is essential in the
latter case).

Thus you want the percpu areas below -2 GB where they don't interfere
with modules or any other precious address space.

-hpa
Andi Kleen
2007-11-20 20:51:58 UTC
Permalink
This limitation shouldn't apply to the percpu area, since gs_base can be
pointed anywhere in the address space -- in effect we're always indirect.
The initial reference copy of the percpu area has to be addressed by
the linker.

Hmm, in theory since it is not actually used by itself I suppose you could
move it into positive space.

-Andi
Christoph Lameter
2007-11-20 20:58:34 UTC
Permalink
Post by Andi Kleen
This limitation shouldn't apply to the percpu area, since gs_base can be
pointed anywhere in the address space -- in effect we're always indirect.
The initial reference copy of the percpu area has to be addressed by
the linker.
Right that is important for the percpu references that can be folded by
the linker in order to avoid address calculations.
Post by Andi Kleen
Hmm, in theory since it is not actually used by itself I suppose you could
move it into positive space.
But the positive space is reserved for a process's memory.
H. Peter Anvin
2007-11-20 21:06:14 UTC
Permalink
Post by Christoph Lameter
Post by Andi Kleen
This limitation shouldn't apply to the percpu area, since gs_base can be
pointed anywhere in the address space -- in effect we're always indirect.
The initial reference copy of the percpu area has to be addressed by
the linker.
Right that is important for the percpu references that can be folded by
the linker in order to avoid address calculations.
Post by Andi Kleen
Hmm, in theory since it is not actually used by itself I suppose you could
move it into positive space.
But the positive space is reserved for a processes memory.
But you wouldn't actually *use* this address space. It's just for the
linker to know what address to tag the references with; it gets
relocated by gs_base down into proper kernel space. The linker can
stash the initialized reference copy at any address (LMA) which can be
different from what it will be used at (VMA); that is not an issue.

To use %rip references, though, which are more efficient, you probably
want to use offsets that are just below .text (at -2 GB); presumably
-2 GB-[max size of percpu section]. Again, however, no CPU actually
needs to have its data stashed in that particular location; it's just an
offset.

-hpa

H. Peter Anvin
2007-11-20 21:01:28 UTC
Permalink
Post by Andi Kleen
This limitation shouldn't apply to the percpu area, since gs_base can be
pointed anywhere in the address space -- in effect we're always indirect.
The initial reference copy of the percpu area has to be addressed by
the linker.
Hmm, in theory since it is not actually used by itself I suppose you could
move it into positive space.
Positive space for absolute references, or just below -2 GB for %rip
references; either should work.

-hpa
c***@sgi.com
2007-11-20 01:11:39 UTC
Permalink
Virtually map the cpu areas. This allows bigger maximum sizes and makes it
possible to populate the virtual mappings only on demand.

In order to use the virtual mapping capability the arch must set up some
configuration variables in arch/xxx/Kconfig:

CONFIG_CPU_AREA_VIRTUAL to y

CONFIG_CPU_AREA_ORDER
to the largest allowed size that the per cpu area can grow to.

CONFIG_CPU_AREA_ALLOC_ORDER
to the allocation size when the cpu area needs to grow. Use 0
here to guarantee order 0 allocations.

The address to use must be defined in CPU_AREA_BASE. This is typically done
in include/asm-xxx/pgtable.h.

The maximum space used by the cpu area is

NR_CPUS * (PAGE_SIZE << CONFIG_CPU_AREA_ORDER)

An arch may provide its own population function for the virtual mappings
(in order to exploit huge page mappings and other frills of the MMU of an
architecture). The default populate function uses single page mappings.


int cpu_area_populate(void *start, unsigned long size, gfp_t flags, int node)

The list of cpu_area_xx functions exported in include/linux/mm.h may be used
as helpers to generate the mapping that the arch needs.

In the simplest form the arch code calls:

cpu_area_populate_basepages(start, size, flags, node);

The arch code must call

cpu_area_alloc_block(unsigned long size, gfp_t flags, int node)

for all its memory needs during the construction of the custom page table.
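
Putting the above together, the minimal arch glue looks roughly like this
(a sketch; the concrete values are placeholders, see the x86 patch elsewhere
in this series for a real example):

	/*
	 * arch/xxx/Kconfig: set CPU_AREA_VIRTUAL=y and pick
	 * CPU_AREA_ORDER / CPU_AREA_ALLOC_ORDER as described above.
	 * include/asm-xxx/pgtable.h: #define CPU_AREA_BASE to a free
	 * kernel virtual address range of
	 * NR_CPUS * (PAGE_SIZE << CONFIG_CPU_AREA_ORDER) bytes.
	 */

	/* Optional override; without it the weak default maps base pages. */
	int __meminit cpu_area_populate(void *start, unsigned long size,
					gfp_t flags, int node)
	{
		/* An arch could install huge page mappings here instead. */
		return cpu_area_populate_basepages(start, size, flags, node);
	}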

Signed-off-by: Christoph Lameter <***@sgi.com>

---
include/linux/mm.h | 13 ++
mm/Kconfig | 10 +
mm/cpu_alloc.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 300 insertions(+), 11 deletions(-)

Index: linux-2.6/mm/cpu_alloc.c
===================================================================
--- linux-2.6.orig/mm/cpu_alloc.c 2007-11-19 15:53:21.569140430 -0800
+++ linux-2.6/mm/cpu_alloc.c 2007-11-19 15:55:48.805640240 -0800
@@ -17,6 +17,12 @@
#include <linux/module.h>
#include <linux/percpu.h>
#include <linux/bitmap.h>
+#include <linux/vmalloc.h>
+#include <linux/bootmem.h>
+#include <linux/sched.h> /* i386 definition of init_mm */
+#include <linux/highmem.h> /* i386 dependency on highmem config */
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>

/*
* Basic allocation unit. A bit map is created to track the use of each
@@ -24,7 +30,7 @@
*/

#define UNIT_SIZE sizeof(int)
-#define UNITS (ALLOC_SIZE / UNIT_SIZE)
+#define UNITS_PER_BLOCK (ALLOC_SIZE / UNIT_SIZE)

/*
* How many units are needed for an object of a given size
@@ -40,6 +46,246 @@ static int size_to_units(unsigned long s
static DEFINE_SPINLOCK(cpu_alloc_map_lock);
static unsigned long units_reserved; /* Units reserved by boot allocations */

+#ifdef CONFIG_CPU_AREA_VIRTUAL
+
+/*
+ * Virtualized cpu area. The cpu area can be extended if more space is needed.
+ */
+
+#define ALLOC_SIZE (1UL << (CONFIG_CPU_AREA_ALLOC_ORDER + PAGE_SHIFT))
+#define BOOT_ALLOC (1 << __GFP_BITS_SHIFT)
+
+/*
+ * The maximum number of blocks is the maximum size of the
+ * cpu area for one processor divided by the size of an allocation
+ * block.
+ */
+#define MAX_BLOCKS (1UL << (CONFIG_CPU_AREA_ORDER - \
+ CONFIG_CPU_AREA_ALLOC_ORDER))
+
+
+static unsigned long *cpu_alloc_map = NULL;
+static int cpu_alloc_map_order = -1; /* Size of the bitmap in page order */
+static unsigned long active_blocks; /* Number of block allocated on each cpu */
+static unsigned long units_total; /* Total units that are managed */
+/*
+ * Allocate a block of memory to be used to provide cpu area memory
+ * or to extend the bitmap for the cpu map.
+ */
+void *cpu_area_alloc_block(unsigned long size, gfp_t flags, int node)
+{
+ if (!(flags & BOOT_ALLOC)) {
+ struct page *page = alloc_pages_node(node,
+ flags, get_order(size));
+
+ if (page)
+ return page_address(page);
+ return NULL;
+ } else
+ return __alloc_bootmem_node(NODE_DATA(node), size, size,
+ __pa(MAX_DMA_ADDRESS));
+}
+
+pte_t *cpu_area_pte_populate(pmd_t *pmd, unsigned long addr,
+ gfp_t flags, int node)
+{
+ pte_t *pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte)) {
+ pte_t entry;
+ void *p = cpu_area_alloc_block(PAGE_SIZE, flags, node);
+ if (!p)
+ return 0;
+ entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+ set_pte_at(&init_mm, addr, pte, entry);
+ }
+ return pte;
+}
+
+pmd_t *cpu_area_pmd_populate(pud_t *pud, unsigned long addr,
+ gfp_t flags, int node)
+{
+ pmd_t *pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd)) {
+ void *p = cpu_area_alloc_block(PAGE_SIZE, flags, node);
+ if (!p)
+ return 0;
+ pmd_populate_kernel(&init_mm, pmd, p);
+ }
+ return pmd;
+}
+
+pud_t *cpu_area_pud_populate(pgd_t *pgd, unsigned long addr,
+ gfp_t flags, int node)
+{
+ pud_t *pud = pud_offset(pgd, addr);
+ if (pud_none(*pud)) {
+ void *p = cpu_area_alloc_block(PAGE_SIZE, flags, node);
+ if (!p)
+ return 0;
+ pud_populate(&init_mm, pud, p);
+ }
+ return pud;
+}
+
+pgd_t *cpu_area_pgd_populate(unsigned long addr, gfp_t flags, int node)
+{
+ pgd_t *pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd)) {
+ void *p = cpu_area_alloc_block(PAGE_SIZE, flags, node);
+ if (!p)
+ return 0;
+ pgd_populate(&init_mm, pgd, p);
+ }
+ return pgd;
+}
+
+int cpu_area_populate_basepages(void *start, unsigned long size,
+ gfp_t flags, int node)
+{
+ unsigned long addr = (unsigned long)start;
+ unsigned long end = addr + size;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+
+ for (; addr < end; addr += PAGE_SIZE) {
+ pgd = cpu_area_pgd_populate(addr, flags, node);
+ if (!pgd)
+ return -ENOMEM;
+ pud = cpu_area_pud_populate(pgd, addr, flags, node);
+ if (!pud)
+ return -ENOMEM;
+ pmd = cpu_area_pmd_populate(pud, addr, flags, node);
+ if (!pmd)
+ return -ENOMEM;
+ pte = cpu_area_pte_populate(pmd, addr, flags, node);
+ if (!pte)
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+/*
+ * If no other population function is defined then this function will stand
+ * in and provide the capability to map PAGE_SIZE pages into the cpu area.
+ */
+int __attribute__((weak)) cpu_area_populate(void *start, unsigned long size,
+ gfp_t flags, int node)
+{
+ return cpu_area_populate_basepages(start, size, flags, node);
+}
+
+/*
+ * Extend the areas on all processors. This function may be called repeatedly
+ * until we have enough space to accommodate a newly allocated object.
+ *
+ * Must hold the cpu_alloc_map_lock on entry. Will drop the lock and then
+ * regain it.
+ */
+static int expand_cpu_area(gfp_t flags)
+{
+ unsigned long blocks = active_blocks;
+ unsigned long bits;
+ int cpu;
+ int err = -ENOMEM;
+ int map_order;
+ unsigned long *new_map = NULL;
+ void *start;
+
+ if (active_blocks == MAX_BLOCKS)
+ goto out;
+
+ spin_unlock(&cpu_alloc_map_lock);
+ if (flags & __GFP_WAIT)
+ local_irq_enable();
+
+ /*
+ * Determine the size of the bit map needed
+ */
+ bits = (blocks + 1) * UNITS_PER_BLOCK - units_reserved;
+
+ map_order = get_order(DIV_ROUND_UP(bits, 8));
+ BUG_ON(map_order >= MAX_ORDER);
+ start = (void *)(blocks << (PAGE_SHIFT + CONFIG_CPU_AREA_ALLOC_ORDER));
+
+ for_each_possible_cpu(cpu) {
+ err = cpu_area_populate(CPU_PTR(start, cpu), ALLOC_SIZE,
+ flags, cpu_to_node(cpu));
+
+ if (err) {
+ spin_lock(&cpu_alloc_map_lock);
+ goto out;
+ }
+ }
+
+ if (map_order > cpu_alloc_map_order) {
+ new_map = cpu_area_alloc_block(PAGE_SIZE << map_order,
+ flags | __GFP_ZERO, 0);
+ if (!new_map)
+ goto out;
+ }
+
+ if (flags & __GFP_WAIT)
+ local_irq_disable();
+ spin_lock(&cpu_alloc_map_lock);
+
+ /*
+ * We dropped the lock. Another processor may have already extended
+ * the cpu area size as needed.
+ */
+ if (blocks != active_blocks) {
+ if (new_map)
+ free_pages((unsigned long)new_map,
+ map_order);
+ err = 0;
+ goto out;
+ }
+
+ if (new_map) {
+ /*
+ * Need to extend the bitmap
+ */
+ if (cpu_alloc_map)
+ memcpy(new_map, cpu_alloc_map,
+ PAGE_SIZE << cpu_alloc_map_order);
+ cpu_alloc_map = new_map;
+ cpu_alloc_map_order = map_order;
+ }
+
+ active_blocks++;
+ units_total += UNITS_PER_BLOCK;
+ err = 0;
+out:
+ return err;
+}
+
+void * __init boot_cpu_alloc(unsigned long size)
+{
+ unsigned long flags;
+ unsigned long x = units_reserved;
+ unsigned long units = size_to_units(size);
+
+ /*
+ * Locking is really not necessary during boot
+ * but expand_cpu_area() unlocks and relocks.
+ * If we do not perform locking here then
+ *
+ * 1. The cpu_alloc_map_lock is locked when
+ * we exit boot causing a hang on the next cpu_alloc().
+ * 2. lockdep will get upset if we do not consistently
+ * handle things.
+ */
+ spin_lock_irqsave(&cpu_alloc_map_lock, flags);
+ while (units_reserved + units > units_total)
+ expand_cpu_area(BOOT_ALLOC);
+ units_reserved += units;
+ spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);
+ return (void *)(x * UNIT_SIZE);
+}
+#else
+
/*
* Static configuration. The cpu areas are of a fixed size and
* cannot be extended. Such configurations are mainly useful on
@@ -51,7 +297,14 @@ static unsigned long units_reserved; /*
#define ALLOC_SIZE (1UL << (CONFIG_CPU_AREA_ORDER + PAGE_SHIFT))

char cpu_area[NR_CPUS * ALLOC_SIZE];
-static DECLARE_BITMAP(cpu_alloc_map, UNITS);
+static DECLARE_BITMAP(cpu_alloc_map, UNITS_PER_BLOCK);
+#define cpu_alloc_map_order CONFIG_CPU_AREA_ORDER
+#define units_total UNITS_PER_BLOCK
+
+static inline int expand_cpu_area(gfp_t flags)
+{
+ return -ENOSYS;
+}

void * __init boot_cpu_alloc(unsigned long size)
{
@@ -61,6 +314,7 @@ void * __init boot_cpu_alloc(unsigned lo
BUG_ON(units_reserved > UNITS);
return (void *)(x * UNIT_SIZE);
}
+#endif

static int first_free; /* First known free unit */
EXPORT_SYMBOL(cpu_area);
@@ -99,6 +353,7 @@ void *cpu_alloc(unsigned long size, gfp_
int units = size_to_units(size);
void *ptr;
int first;
+ unsigned long map_size;
unsigned long flags;

BUG_ON(gfpflags & ~(GFP_RECLAIM_MASK | __GFP_ZERO));
@@ -110,16 +365,26 @@ void *cpu_alloc(unsigned long size, gfp_
* No boot time allocations. Must have at least one
* reserved unit to avoid returning a NULL pointer
*/
- units_reserved = 1;
+ units++;
+
+
+restart:
+ if (cpu_alloc_map_order >= 0)
+ map_size = PAGE_SIZE << cpu_alloc_map_order;
+ else
+ map_size = 0;

first = 1;
start = first_free;

for ( ; ; ) {

- start = find_next_zero_bit(cpu_alloc_map, ALLOC_SIZE, start);
- if (start >= UNITS - units_reserved)
+ start = find_next_zero_bit(cpu_alloc_map, map_size, start);
+ if (start >= units_total - units_reserved) {
+ if (!expand_cpu_area(gfpflags))
+ goto restart;
goto out_of_memory;
+ }

if (first)
first_free = start;
@@ -129,7 +394,7 @@ void *cpu_alloc(unsigned long size, gfp_
* the starting unit.
*/
if ((start + units_reserved) % (align / UNIT_SIZE) == 0 &&
- find_next_bit(cpu_alloc_map, ALLOC_SIZE, start + 1)
+ find_next_bit(cpu_alloc_map, map_size, start + 1)
 				>= start + units)
break;
start++;
@@ -139,14 +404,19 @@ void *cpu_alloc(unsigned long size, gfp_
if (first)
first_free = start + units;

- if (start + units > UNITS - units_reserved)
- goto out_of_memory;
+ while (start + units > units_total - units_reserved) {
+ if (expand_cpu_area(gfpflags))
+ goto out_of_memory;
+ }

set_map(start, units);
__count_vm_events(CPU_BYTES, units * UNIT_SIZE);

spin_unlock_irqrestore(&cpu_alloc_map_lock, flags);

+ if (!units_reserved)
+ units_reserved = 1;
+
ptr = (void *)((start + units_reserved) * UNIT_SIZE);

if (gfpflags & __GFP_ZERO) {
@@ -178,7 +448,7 @@ void cpu_free(void *start, unsigned long
BUG_ON(p < units_reserved * UNIT_SIZE);
index = p / UNIT_SIZE - units_reserved;
BUG_ON(!test_bit(index, cpu_alloc_map) ||
- index >= UNITS - units_reserved);
+ index >= units_total - units_reserved);

spin_lock_irqsave(&cpu_alloc_map_lock, flags);

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-11-19 15:55:47.293389911 -0800
+++ linux-2.6/include/linux/mm.h 2007-11-19 15:55:48.805640240 -0800
@@ -1137,5 +1137,18 @@ int vmemmap_populate_basepages(struct pa
unsigned long pages, int node);
int vmemmap_populate(struct page *start_page, unsigned long pages, int node);

+pgd_t *cpu_area_pgd_populate(unsigned long addr, gfp_t flags, int node);
+pud_t *cpu_area_pud_populate(pgd_t *pgd, unsigned long addr,
+ gfp_t flags, int node);
+pmd_t *cpu_area_pmd_populate(pud_t *pud, unsigned long addr,
+ gfp_t flags, int node);
+pte_t *cpu_area_pte_populate(pmd_t *pmd, unsigned long addr,
+ gfp_t flags, int node);
+void *cpu_area_alloc_block(unsigned long size, gfp_t flags, int node);
+int cpu_area_populate_basepages(void *start, unsigned long size,
+ gfp_t flags, int node);
+int cpu_area_populate(void *start, unsigned long size,
+ gfp_t flags, int node);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2007-11-19 15:53:21.569140430 -0800
+++ linux-2.6/mm/Kconfig 2007-11-19 15:55:48.805640240 -0800
@@ -197,7 +197,13 @@ config VIRT_TO_BUS

config CPU_AREA_ORDER
int "Maximum size (order) of CPU area"
- default "3"
+ default "10" if CPU_AREA_VIRTUAL
+ default "3" if !CPU_AREA_VIRTUAL
help
Sets the maximum amount of memory that can be allocated via cpu_alloc
- The size is set in page order, so 0 = PAGE_SIZE, 1 = PAGE_SIZE << 1 etc.
+	  The size is set in page order. The configured size (times the
+	  maximum number of processors) determines the amount of virtual
+	  memory set aside for the cpu areas if they are virtually mapped,
+	  or the amount of memory allocated in the bss segment if they
+	  are not.
+

--
c***@sgi.com
2007-11-20 01:11:42 UTC
Permalink
Enable a simple virtual configuration with 32MB available per cpu so that
we do not use a static area on sparc64.

[Not tested. I have no sparc64]
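
For reference, the per cpu area size follows PAGE_SIZE << CONFIG_CPU_AREA_ORDER
(see the ALLOC_SIZE definition in the allocator), so the defaults below can be
sanity checked as follows. Illustrative sketch only; the variable names are
made up and this is not part of the patch:

	#define KB	(1UL << 10)

	/* 32MB of per cpu virtual space for the larger sparc64 page sizes */
	unsigned long cpu_area_64k_pages  = (64 * KB)   << 9;	/* 64KB pages,  order 9 */
	unsigned long cpu_area_512k_pages = (512 * KB)  << 6;	/* 512KB pages, order 6 */
	unsigned long cpu_area_4m_pages   = (4096 * KB) << 3;	/* 4MB pages,   order 3 */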

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/sparc64/Kconfig | 15 +++++++++++++++
arch/sparc64/kernel/vmlinux.lds.S | 3 +++
include/asm-sparc64/pgtable.h | 1 +
3 files changed, 19 insertions(+)

Index: linux-2.6/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.orig/arch/sparc64/Kconfig 2007-11-18 14:38:24.601033354 -0800
+++ linux-2.6/arch/sparc64/Kconfig 2007-11-18 21:14:11.476343425 -0800
@@ -103,6 +103,21 @@ config SPARC64_PAGE_SIZE_4MB

endchoice

+config CPU_AREA_VIRTUAL
+ bool
+ default y
+
+config CPU_AREA_ORDER
+ int
+ default "11" if SPARC64_PAGE_SIZE_8KB
+ default "9" if SPARC64_PAGE_SIZE_64KB
+ default "6" if SPARC64_PAGE_SIZE_512KB
+ default "3" if SPARC64_PAGE_SIZE_4MB
+
+config CPU_AREA_ALLOC_ORDER
+ int
+ default "0"
+
config SECCOMP
bool "Enable seccomp to safely compute untrusted bytecode"
depends on PROC_FS
Index: linux-2.6/include/asm-sparc64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/pgtable.h 2007-11-18 14:38:24.609034022 -0800
+++ linux-2.6/include/asm-sparc64/pgtable.h 2007-11-18 21:14:11.504343895 -0800
@@ -43,6 +43,7 @@
#define VMALLOC_START _AC(0x0000000100000000,UL)
#define VMALLOC_END _AC(0x0000000200000000,UL)
#define VMEMMAP_BASE _AC(0x0000000200000000,UL)
+#define CPU_AREA_BASE _AC(0x0000000300000000,UL)

#define vmemmap ((struct page *)VMEMMAP_BASE)

Index: linux-2.6/arch/sparc64/kernel/vmlinux.lds.S
===================================================================
--- linux-2.6.orig/arch/sparc64/kernel/vmlinux.lds.S 2007-11-18 21:14:31.068844088 -0800
+++ linux-2.6/arch/sparc64/kernel/vmlinux.lds.S 2007-11-18 21:14:50.469421513 -0800
@@ -2,12 +2,15 @@

#include <asm/page.h>
#include <asm-generic/vmlinux.lds.h>
+#include <asm/pgtable.h>

OUTPUT_FORMAT("elf64-sparc", "elf64-sparc", "elf64-sparc")
OUTPUT_ARCH(sparc:v9a)
ENTRY(_start)

jiffies = jiffies_64;
+cpu_area = CPU_AREA_BASE;
+
SECTIONS
{
swapper_low_pmd_dir = 0x0000000000402000;

--
c***@sgi.com
2007-11-20 01:11:41 UTC
Permalink
Typical per cpu memory use on a small system (8G, 8 processors, 4 nodes) is
less than 64k per cpu. This increases rapidly on larger systems, where up to
512k or 1M of memory may be used for per cpu storage.

The maximum allowed size of the cpu area is 128MB per cpu.

The cpu area is placed in region 5 with the kernel, vmemmap and vmalloc areas.
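
A quick sanity check of the 128MB figure (illustrative only; the constants
assume the default 16KB page size and are not part of the patch):

	#define IA64_PAGE_SHIFT		14	/* 16KB pages */
	#define IA64_CPU_AREA_ORDER	13	/* CONFIG_CPU_AREA_ORDER below */

	/* 1UL << (14 + 13) = 128MB of cpu_alloc space per cpu */
	unsigned long ia64_max_cpu_area =
			1UL << (IA64_PAGE_SHIFT + IA64_CPU_AREA_ORDER);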

Signed-off-by: Christoph Lameter <***@sgi.com>
---
arch/ia64/Kconfig | 13 +++++++++++++
arch/ia64/kernel/vmlinux.lds.S | 2 ++
include/asm-ia64/pgtable.h | 32 ++++++++++++++++++++++++++------
3 files changed, 41 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig 2007-11-18 14:38:24.661283318 -0800
+++ linux-2.6/arch/ia64/Kconfig 2007-11-18 21:13:23.281093698 -0800
@@ -99,6 +99,19 @@ config AUDIT_ARCH
bool
default y

+config CPU_AREA_VIRTUAL
+ bool
+ default y
+
+# Maximum of 128 MB cpu_alloc space per cpu
+config CPU_AREA_ORDER
+ int
+ default "13"
+
+config CPU_AREA_ALLOC_ORDER
+ int
+ default "0"
+
choice
prompt "System type"
default IA64_GENERIC
Index: linux-2.6/include/asm-ia64/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/pgtable.h 2007-11-18 14:38:24.669283083 -0800
+++ linux-2.6/include/asm-ia64/pgtable.h 2007-11-18 21:13:23.296343624 -0800
@@ -224,21 +224,41 @@ ia64_phys_addr_valid (unsigned long addr
*/


+/*
+ * Layout of RGN_GATE
+ *
+ * 47 bits wide (16kb pages)
+ *
+ * 0xa000000000000000-0xa000000200000000	8G	Kernel data area
+ * 0xa000000200000000-0xa000400000000000	64T	vmalloc
+ * 0xa000400000000000-0xa000600000000000	32T	vmemmap
+ * 0xa000600000000000-0xa000800000000000	32T	cpu area
+ *
+ * 55 bits wide (64kb pages)
+ *
+ * 0xa000000000000000-0xa000000200000000	8G	Kernel data area
+ * 0xa000000200000000-0xa040000000000000	16P	vmalloc
+ * 0xa040000000000000-0xa060000000000000	8P	vmemmap
+ * 0xa060000000000000-0xa080000000000000	8P	cpu area
+ */
+
#define VMALLOC_START (RGN_BASE(RGN_GATE) + 0x200000000UL)
+#define VMALLOC_END_INIT (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 10)))
+
#ifdef CONFIG_VIRTUAL_MEM_MAP
-# define VMALLOC_END_INIT (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 9)))
# define VMALLOC_END vmalloc_end
extern unsigned long vmalloc_end;
#else
+# define VMALLOC_END VMALLOC_END_INIT
+#endif
+
#if defined(CONFIG_SPARSEMEM) && defined(CONFIG_SPARSEMEM_VMEMMAP)
/* SPARSEMEM_VMEMMAP uses half of vmalloc... */
-# define VMALLOC_END (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 10)))
-# define vmemmap ((struct page *)VMALLOC_END)
-#else
-# define VMALLOC_END (RGN_BASE(RGN_GATE) + (1UL << (4*PAGE_SHIFT - 9)))
-#endif
+# define vmemmap ((struct page *)VMALLOC_END_INIT)
#endif

+#define CPU_AREA_BASE (RGN_BASE(RGN_GATE) + (3UL << (4*PAGE_SHIFT - 11)))
+
/* fs/proc/kcore.c */
#define kc_vaddr_to_offset(v) ((v) - RGN_BASE(RGN_GATE))
#define kc_offset_to_vaddr(o) ((o) + RGN_BASE(RGN_GATE))
Index: linux-2.6/arch/ia64/kernel/vmlinux.lds.S
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/vmlinux.lds.S 2007-11-18 21:13:46.505344120 -0800
+++ linux-2.6/arch/ia64/kernel/vmlinux.lds.S 2007-11-18 21:14:03.996593749 -0800
@@ -16,6 +16,8 @@ OUTPUT_FORMAT("elf64-ia64-little")
OUTPUT_ARCH(ia64)
ENTRY(phys_start)
jiffies = jiffies_64;
+cpu_area = CPU_AREA_BASE;
+
PHDRS {
code PT_LOAD;
percpu PT_LOAD;

--
c***@sgi.com
2007-11-20 01:11:43 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
lib/percpu_counter.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c 2007-11-15 21:24:46.878154362 -0800
+++ linux-2.6/lib/percpu_counter.c 2007-11-15 21:25:28.963154085 -0800
@@ -20,7 +20,7 @@ void percpu_counter_set(struct percpu_co

spin_lock(&fbc->lock);
for_each_possible_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ s32 *pcount = CPU_PTR(fbc->counters, cpu);
*pcount = 0;
}
fbc->count = amount;
@@ -34,7 +34,7 @@ void __percpu_counter_add(struct percpu_
s32 *pcount;
int cpu = get_cpu();

- pcount = per_cpu_ptr(fbc->counters, cpu);
+ pcount = CPU_PTR(fbc->counters, cpu);
count = *pcount + amount;
if (count >= batch || count <= -batch) {
spin_lock(&fbc->lock);
@@ -60,7 +60,7 @@ s64 __percpu_counter_sum(struct percpu_c
spin_lock(&fbc->lock);
ret = fbc->count;
for_each_online_cpu(cpu) {
- s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+ s32 *pcount = CPU_PTR(fbc->counters, cpu);
ret += *pcount;
}
spin_unlock(&fbc->lock);
@@ -74,7 +74,7 @@ int percpu_counter_init(struct percpu_co
{
spin_lock_init(&fbc->lock);
fbc->count = amount;
- fbc->counters = alloc_percpu(s32);
+ fbc->counters = CPU_ALLOC(s32, GFP_KERNEL|__GFP_ZERO);
if (!fbc->counters)
return -ENOMEM;
#ifdef CONFIG_HOTPLUG_CPU
@@ -101,7 +101,7 @@ void percpu_counter_destroy(struct percp
if (!fbc->counters)
return;

- free_percpu(fbc->counters);
+ CPU_FREE(fbc->counters);
#ifdef CONFIG_HOTPLUG_CPU
mutex_lock(&percpu_counters_lock);
list_del(&fbc->list);
@@ -127,7 +127,7 @@ static int __cpuinit percpu_counter_hotc
unsigned long flags;

spin_lock_irqsave(&fbc->lock, flags);
- pcount = per_cpu_ptr(fbc->counters, cpu);
+ pcount = CPU_PTR(fbc->counters, cpu);
fbc->count += *pcount;
*pcount = 0;
spin_unlock_irqrestore(&fbc->lock, flags);

--
c***@sgi.com
2007-11-20 01:11:38 UTC
Permalink
Use the new cpu_alloc functionality to avoid per cpu arrays in struct zone.
This drastically reduces the size of struct zone for systems with large
numbers of processors and allows placement of the critical variables of
struct zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Surprisingly this clears up much of the painful NUMA bringup. Bootstrap
becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs are
reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.
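
As a rough sketch of the resulting access pattern (the helper names here are
invented for illustration; the diff below open codes these lookups):

	/* zone->pageset is a single cpu_alloc pointer shared by all cpus. */
	static struct per_cpu_pages *zone_pcp_this_cpu(struct zone *zone)
	{
		return &THIS_CPU(zone->pageset)->pcp;		/* executing cpu */
	}

	static struct per_cpu_pages *zone_pcp_of(struct zone *zone, int cpu)
	{
		return &CPU_PTR(zone->pageset, cpu)->pcp;	/* arbitrary cpu */
	}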

Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/linux/mm.h | 4 -
include/linux/mmzone.h | 12 ---
mm/page_alloc.c | 162 +++++++++++++++++++------------------------------
mm/vmstat.c | 14 ++--
4 files changed, 73 insertions(+), 119 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-11-18 20:35:35.944298207 -0800
+++ linux-2.6/include/linux/mm.h 2007-11-18 20:35:38.297048298 -0800
@@ -931,11 +931,7 @@ extern void show_mem(void);
extern void si_meminfo(struct sysinfo * val);
extern void si_meminfo_node(struct sysinfo *val, int nid);

-#ifdef CONFIG_NUMA
extern void setup_per_cpu_pageset(void);
-#else
-static inline void setup_per_cpu_pageset(void) {}
-#endif

/* prio_tree.c */
void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
Index: linux-2.6/include/linux/mmzone.h
===================================================================
--- linux-2.6.orig/include/linux/mmzone.h 2007-11-18 20:35:35.952298159 -0800
+++ linux-2.6/include/linux/mmzone.h 2007-11-18 20:35:38.297048298 -0800
@@ -121,13 +121,7 @@ struct per_cpu_pageset {
s8 stat_threshold;
s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
-} ____cacheline_aligned_in_smp;
-
-#ifdef CONFIG_NUMA
-#define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
-#else
-#define zone_pcp(__z, __cpu) (&(__z)->pageset[(__cpu)])
-#endif
+};

enum zone_type {
#ifdef CONFIG_ZONE_DMA
@@ -231,10 +225,8 @@ struct zone {
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
- struct per_cpu_pageset *pageset[NR_CPUS];
-#else
- struct per_cpu_pageset pageset[NR_CPUS];
#endif
+ struct per_cpu_pageset *pageset;
/*
* free areas of different sizes
*/
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-11-18 20:35:35.960298523 -0800
+++ linux-2.6/mm/page_alloc.c 2007-11-18 20:43:06.720048123 -0800
@@ -892,7 +892,7 @@ static void __drain_pages(unsigned int c
if (!populated_zone(zone))
continue;

- pset = zone_pcp(zone, cpu);
+ pset = CPU_PTR(zone->pageset, cpu);

pcp = &pset->pcp;
local_irq_save(flags);
@@ -988,8 +988,8 @@ static void fastcall free_hot_cold_page(
arch_free_page(page, 0);
kernel_map_pages(page, 1, 0);

- pcp = &zone_pcp(zone, get_cpu())->pcp;
local_irq_save(flags);
+ pcp = &THIS_CPU(zone->pageset)->pcp;
__count_vm_event(PGFREE);
if (cold)
list_add_tail(&page->lru, &pcp->list);
@@ -1002,7 +1002,6 @@ static void fastcall free_hot_cold_page(
pcp->count -= pcp->batch;
}
local_irq_restore(flags);
- put_cpu();
}

void fastcall free_hot_page(struct page *page)
@@ -1044,16 +1043,14 @@ static struct page *buffered_rmqueue(str
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
- int cpu;
int migratetype = allocflags_to_migratetype(gfp_flags);

again:
- cpu = get_cpu();
if (likely(order == 0)) {
struct per_cpu_pages *pcp;

- pcp = &zone_pcp(zone, cpu)->pcp;
local_irq_save(flags);
+ pcp = &THIS_CPU(zone->pageset)->pcp;
if (!pcp->count) {
pcp->count = rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list, migratetype);
@@ -1092,7 +1089,6 @@ again:
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(zonelist, zone);
local_irq_restore(flags);
- put_cpu();

VM_BUG_ON(bad_range(zone, page));
if (prep_new_page(page, order, gfp_flags))
@@ -1101,7 +1097,6 @@ again:

failed:
local_irq_restore(flags);
- put_cpu();
return NULL;
}

@@ -1795,7 +1790,7 @@ void show_free_areas(void)
for_each_online_cpu(cpu) {
struct per_cpu_pageset *pageset;

- pageset = zone_pcp(zone, cpu);
+ pageset = CPU_PTR(zone->pageset, cpu);

printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
cpu, pageset->pcp.high,
@@ -2621,82 +2616,33 @@ static void setup_pagelist_highmark(stru
pcp->batch = PAGE_SHIFT * 8;
}

-
-#ifdef CONFIG_NUMA
/*
- * Boot pageset table. One per cpu which is going to be used for all
- * zones and all nodes. The parameters will be set in such a way
- * that an item put on a list will immediately be handed over to
- * the buddy list. This is safe since pageset manipulation is done
- * with interrupts disabled.
- *
- * Some NUMA counter updates may also be caught by the boot pagesets.
- *
- * The boot_pagesets must be kept even after bootup is complete for
- * unused processors and/or zones. They do play a role for bootstrapping
- * hotplugged processors.
- *
- * zoneinfo_show() and maybe other functions do
- * not check if the processor is online before following the pageset pointer.
- * Other parts of the kernel may not check if the zone is available.
+ * Dynamically allocate memory for the per cpu pageset array in struct zone.
*/
-static struct per_cpu_pageset boot_pageset[NR_CPUS];
-
-/*
- * Dynamically allocate memory for the
- * per cpu pageset array in struct zone.
- */
-static int __cpuinit process_zones(int cpu)
+static void __cpuinit process_zones(int cpu)
{
- struct zone *zone, *dzone;
+ struct zone *zone;
int node = cpu_to_node(cpu);

node_set_state(node, N_CPU); /* this node has a cpu */

for_each_zone(zone) {
+ struct per_cpu_pageset *pcp =
+ CPU_PTR(zone->pageset, cpu);

if (!populated_zone(zone))
continue;

- zone_pcp(zone, cpu) = kmalloc_node(sizeof(struct per_cpu_pageset),
- GFP_KERNEL, node);
- if (!zone_pcp(zone, cpu))
- goto bad;
-
- setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+ setup_pageset(pcp, zone_batchsize(zone));

if (percpu_pagelist_fraction)
- setup_pagelist_highmark(zone_pcp(zone, cpu),
- (zone->present_pages / percpu_pagelist_fraction));
- }
-
- return 0;
-bad:
- for_each_zone(dzone) {
- if (!populated_zone(dzone))
- continue;
- if (dzone == zone)
- break;
- kfree(zone_pcp(dzone, cpu));
- zone_pcp(dzone, cpu) = NULL;
- }
- return -ENOMEM;
-}
+ setup_pagelist_highmark(pcp, zone->present_pages /
+ percpu_pagelist_fraction);

-static inline void free_zone_pagesets(int cpu)
-{
- struct zone *zone;
-
- for_each_zone(zone) {
- struct per_cpu_pageset *pset = zone_pcp(zone, cpu);
-
- /* Free per_cpu_pageset if it is slab allocated */
- if (pset != &boot_pageset[cpu])
- kfree(pset);
- zone_pcp(zone, cpu) = NULL;
}
}

+#ifdef CONFIG_SMP
static int __cpuinit pageset_cpuup_callback(struct notifier_block *nfb,
unsigned long action,
void *hcpu)
@@ -2707,14 +2653,7 @@ static int __cpuinit pageset_cpuup_callb
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
- if (process_zones(cpu))
- ret = NOTIFY_BAD;
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- free_zone_pagesets(cpu);
+ process_zones(cpu);
break;
default:
break;
@@ -2724,21 +2663,34 @@ static int __cpuinit pageset_cpuup_callb

static struct notifier_block __cpuinitdata pageset_notifier =
{ &pageset_cpuup_callback, NULL, 0 };
+#endif

void __init setup_per_cpu_pageset(void)
{
- int err;
-
- /* Initialize per_cpu_pageset for cpu 0.
+ /*
+ * Initialize per_cpu settings for the boot cpu.
* A cpuup callback will do this for every cpu
- * as it comes online
+ * as it comes online.
+ *
+ * This is also initializing the cpu areas for the
+ * pagesets.
*/
- err = process_zones(smp_processor_id());
- BUG_ON(err);
- register_cpu_notifier(&pageset_notifier);
-}
+ struct zone *zone;

+ for_each_zone(zone) {
+
+ if (!populated_zone(zone))
+ continue;
+
+ zone->pageset = CPU_ALLOC(struct per_cpu_pageset,
+ GFP_KERNEL|__GFP_ZERO);
+ BUG_ON(!zone->pageset);
+ }
+ process_zones(smp_processor_id());
+#ifdef CONFIG_SMP
+ register_cpu_notifier(&pageset_notifier);
#endif
+}

static noinline __init_refok
int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
@@ -2785,21 +2737,31 @@ int zone_wait_table_init(struct zone *zo

static __meminit void zone_pcp_init(struct zone *zone)
{
- int cpu;
- unsigned long batch = zone_batchsize(zone);
+ static struct per_cpu_pageset boot_pageset;

- for (cpu = 0; cpu < NR_CPUS; cpu++) {
-#ifdef CONFIG_NUMA
- /* Early boot. Slab allocator not functional yet */
- zone_pcp(zone, cpu) = &boot_pageset[cpu];
- setup_pageset(&boot_pageset[cpu],0);
-#else
- setup_pageset(zone_pcp(zone,cpu), batch);
-#endif
- }
+ /*
+ * Fake a cpu_alloc pointer that can take the required
+ * offset to get to the boot pageset. This is only
+ * needed for the boot pageset while bootstrapping
+ * the new zone. In the course of zone bootstrap
+ * setup_cpu_pagesets() will do the proper CPU_ALLOC and
+ * set things up the right way.
+ *
+ * Deferral allows CPU_ALLOC() to use the boot pageset
+ * to allocate the initial memory to get going and then provide
+ * the proper memory when called from setup_cpu_pagesets() to
+ * install the proper pagesets.
+ *
+ * Deferral also allows slab allocators to perform their
+ * initialization without resorting to bootmem.
+ */
+ zone->pageset = SHIFT_PTR(&boot_pageset,
+ -cpu_offset(smp_processor_id()));
+ setup_pageset(&boot_pageset, 0);
if (zone->present_pages)
- printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
- zone->name, zone->present_pages, batch);
+ printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
+ zone->name, zone->present_pages,
+ zone_batchsize(zone));
}

__meminit int init_currently_empty_zone(struct zone *zone,
@@ -4214,11 +4176,13 @@ int percpu_pagelist_fraction_sysctl_hand
ret = proc_dointvec_minmax(table, write, file, buffer, length, ppos);
if (!write || (ret == -EINVAL))
return ret;
- for_each_zone(zone) {
- for_each_online_cpu(cpu) {
+ for_each_online_cpu(cpu) {
+ for_each_zone(zone) {
unsigned long high;
+
high = zone->present_pages / percpu_pagelist_fraction;
- setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+ setup_pagelist_highmark(CPU_PTR(zone->pageset, cpu),
+ high);
}
}
return 0;
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2007-11-18 20:35:35.968297960 -0800
+++ linux-2.6/mm/vmstat.c 2007-11-18 20:35:38.297048298 -0800
@@ -147,7 +147,8 @@ static void refresh_zone_stat_thresholds
threshold = calculate_threshold(zone);

for_each_online_cpu(cpu)
- zone_pcp(zone, cpu)->stat_threshold = threshold;
+ CPU_PTR(zone->pageset, cpu)->stat_threshold
+ = threshold;
}
}

@@ -157,7 +158,8 @@ static void refresh_zone_stat_thresholds
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
int delta)
{
- struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
+
s8 *p = pcp->vm_stat_diff + item;
long x;

@@ -210,7 +212,7 @@ EXPORT_SYMBOL(mod_zone_page_state);
*/
void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
s8 *p = pcp->vm_stat_diff + item;

(*p)++;
@@ -231,7 +233,7 @@ EXPORT_SYMBOL(__inc_zone_page_state);

void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
- struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ struct per_cpu_pageset *pcp = THIS_CPU(zone->pageset);
s8 *p = pcp->vm_stat_diff + item;

(*p)--;
@@ -307,7 +309,7 @@ void refresh_cpu_vm_stats(int cpu)
if (!populated_zone(zone))
continue;

- p = zone_pcp(zone, cpu);
+ p = CPU_PTR(zone->pageset, cpu);

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (p->vm_stat_diff[i]) {
@@ -680,7 +682,7 @@ static void zoneinfo_show_print(struct s
for_each_online_cpu(i) {
struct per_cpu_pageset *pageset;

- pageset = zone_pcp(zone, i);
+ pageset = CPU_PTR(zone->pageset, i);
seq_printf(m,
"\n cpu: %i"
"\n count: %i"

--
c***@sgi.com
2007-11-20 01:11:44 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
arch/ia64/kernel/crash.c | 2 +-
drivers/base/cpu.c | 2 +-
kernel/kexec.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/ia64/kernel/crash.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/crash.c 2007-11-15 21:18:10.647904573 -0800
+++ linux-2.6/arch/ia64/kernel/crash.c 2007-11-15 21:25:29.423155123 -0800
@@ -71,7 +71,7 @@ crash_save_this_cpu(void)
dst[46] = (unsigned long)ia64_rse_skip_regs((unsigned long *)dst[46],
sof - sol);

- buf = (u64 *) per_cpu_ptr(crash_notes, cpu);
+ buf = (u64 *) CPU_PTR(crash_notes, cpu);
if (!buf)
return;
buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS, prstatus,
Index: linux-2.6/drivers/base/cpu.c
===================================================================
--- linux-2.6.orig/drivers/base/cpu.c 2007-11-15 21:18:10.655904442 -0800
+++ linux-2.6/drivers/base/cpu.c 2007-11-15 21:25:29.423155123 -0800
@@ -95,7 +95,7 @@ static ssize_t show_crash_notes(struct s
* boot up and this data does not change there after. Hence this
* operation should be safe. No locking required.
*/
- addr = __pa(per_cpu_ptr(crash_notes, cpunum));
+ addr = __pa(CPU_PTR(crash_notes, cpunum));
rc = sprintf(buf, "%Lx\n", addr);
return rc;
}
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c 2007-11-15 21:18:10.663904549 -0800
+++ linux-2.6/kernel/kexec.c 2007-11-15 21:25:29.423155123 -0800
@@ -1122,7 +1122,7 @@ void crash_save_cpu(struct pt_regs *regs
* squirrelled away. ELF notes happen to provide
* all of that, so there is no need to invent something new.
*/
- buf = (u32*)per_cpu_ptr(crash_notes, cpu);
+ buf = (u32*)CPU_PTR(crash_notes, cpu);
if (!buf)
return;
memset(&prstatus, 0, sizeof(prstatus));
@@ -1136,7 +1136,7 @@ void crash_save_cpu(struct pt_regs *regs
static int __init crash_notes_memory_init(void)
{
/* Allocate memory for saving cpu registers. */
- crash_notes = alloc_percpu(note_buf_t);
+ crash_notes = CPU_ALLOC(note_buf_t, GFP_KERNEL|__GFP_ZERO);
if (!crash_notes) {
printk("Kexec: Memory allocation for saving cpu register"
" states failed\n");

--
Mathieu Desnoyers
2007-11-20 13:03:15 UTC
Permalink
Post by c***@sgi.com
---
arch/ia64/kernel/crash.c | 2 +-
drivers/base/cpu.c | 2 +-
kernel/kexec.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
Index: linux-2.6/arch/ia64/kernel/crash.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/crash.c 2007-11-15 21:18:10.647904573 -0800
+++ linux-2.6/arch/ia64/kernel/crash.c 2007-11-15 21:25:29.423155123 -0800
@@ -71,7 +71,7 @@ crash_save_this_cpu(void)
dst[46] = (unsigned long)ia64_rse_skip_regs((unsigned long *)dst[46],
sof - sol);
- buf = (u64 *) per_cpu_ptr(crash_notes, cpu);
+ buf = (u64 *) CPU_PTR(crash_notes, cpu);
if (!buf)
return;
buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS, prstatus,
Index: linux-2.6/drivers/base/cpu.c
===================================================================
--- linux-2.6.orig/drivers/base/cpu.c 2007-11-15 21:18:10.655904442 -0800
+++ linux-2.6/drivers/base/cpu.c 2007-11-15 21:25:29.423155123 -0800
@@ -95,7 +95,7 @@ static ssize_t show_crash_notes(struct s
* boot up and this data does not change there after. Hence this
* operation should be safe. No locking required.
*/
- addr = __pa(per_cpu_ptr(crash_notes, cpunum));
+ addr = __pa(CPU_PTR(crash_notes, cpunum));
rc = sprintf(buf, "%Lx\n", addr);
return rc;
}
Index: linux-2.6/kernel/kexec.c
===================================================================
--- linux-2.6.orig/kernel/kexec.c 2007-11-15 21:18:10.663904549 -0800
+++ linux-2.6/kernel/kexec.c 2007-11-15 21:25:29.423155123 -0800
@@ -1122,7 +1122,7 @@ void crash_save_cpu(struct pt_regs *regs
* squirrelled away. ELF notes happen to provide
* all of that, so there is no need to invent something new.
*/
- buf = (u32*)per_cpu_ptr(crash_notes, cpu);
+ buf = (u32*)CPU_PTR(crash_notes, cpu);
Nitpick : (u32 *)
Post by c***@sgi.com
if (!buf)
return;
memset(&prstatus, 0, sizeof(prstatus));
@@ -1136,7 +1136,7 @@ void crash_save_cpu(struct pt_regs *regs
static int __init crash_notes_memory_init(void)
{
/* Allocate memory for saving cpu registers. */
- crash_notes = alloc_percpu(note_buf_t);
+ crash_notes = CPU_ALLOC(note_buf_t, GFP_KERNEL|__GFP_ZERO);
if (!crash_notes) {
printk("Kexec: Memory allocation for saving cpu register"
" states failed\n");
--
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:50:14 UTC
Permalink
Post by Mathieu Desnoyers
Post by c***@sgi.com
- buf = (u32*)per_cpu_ptr(crash_notes, cpu);
+ buf = (u32*)CPU_PTR(crash_notes, cpu);
Nitpick : (u32 *)
Yeah. I tend to leave the things as they were...
c***@sgi.com
2007-11-20 01:11:45 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
kernel/workqueue.c | 27 ++++++++++++++-------------
1 file changed, 14 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/workqueue.c
===================================================================
--- linux-2.6.orig/kernel/workqueue.c 2007-11-15 21:18:11.726153923 -0800
+++ linux-2.6/kernel/workqueue.c 2007-11-15 21:25:29.966154099 -0800
@@ -100,7 +100,7 @@ struct cpu_workqueue_struct *wq_per_cpu(
{
if (unlikely(is_single_threaded(wq)))
cpu = singlethread_cpu;
- return per_cpu_ptr(wq->cpu_wq, cpu);
+ return CPU_PTR(wq->cpu_wq, cpu);
}

/*
@@ -398,7 +398,7 @@ void fastcall flush_workqueue(struct wor
lock_acquire(&wq->lockdep_map, 0, 0, 0, 2, _THIS_IP_);
lock_release(&wq->lockdep_map, 1, _THIS_IP_);
for_each_cpu_mask(cpu, *cpu_map)
- flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
+ flush_cpu_workqueue(CPU_PTR(wq->cpu_wq, cpu));
}
EXPORT_SYMBOL_GPL(flush_workqueue);

@@ -478,7 +478,7 @@ static void wait_on_work(struct work_str
cpu_map = wq_cpu_map(wq);

for_each_cpu_mask(cpu, *cpu_map)
- wait_on_cpu_work(per_cpu_ptr(wq->cpu_wq, cpu), work);
+ wait_on_cpu_work(CPU_PTR(wq->cpu_wq, cpu), work);
}

static int __cancel_work_timer(struct work_struct *work,
@@ -601,21 +601,21 @@ int schedule_on_each_cpu(work_func_t fun
int cpu;
struct work_struct *works;

- works = alloc_percpu(struct work_struct);
+ works = CPU_ALLOC(struct work_struct, GFP_KERNEL);
if (!works)
return -ENOMEM;

preempt_disable(); /* CPU hotplug */
for_each_online_cpu(cpu) {
- struct work_struct *work = per_cpu_ptr(works, cpu);
+ struct work_struct *work = CPU_PTR(works, cpu);

INIT_WORK(work, func);
set_bit(WORK_STRUCT_PENDING, work_data_bits(work));
- __queue_work(per_cpu_ptr(keventd_wq->cpu_wq, cpu), work);
+ __queue_work(CPU_PTR(keventd_wq->cpu_wq, cpu), work);
}
preempt_enable();
flush_workqueue(keventd_wq);
- free_percpu(works);
+ CPU_FREE(works);
return 0;
}

@@ -664,7 +664,7 @@ int current_is_keventd(void)

BUG_ON(!keventd_wq);

- cwq = per_cpu_ptr(keventd_wq->cpu_wq, cpu);
+ cwq = CPU_PTR(keventd_wq->cpu_wq, cpu);
if (current == cwq->thread)
ret = 1;

@@ -675,7 +675,7 @@ int current_is_keventd(void)
static struct cpu_workqueue_struct *
init_cpu_workqueue(struct workqueue_struct *wq, int cpu)
{
- struct cpu_workqueue_struct *cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+ struct cpu_workqueue_struct *cwq = CPU_PTR(wq->cpu_wq, cpu);

cwq->wq = wq;
spin_lock_init(&cwq->lock);
@@ -732,7 +732,8 @@ struct workqueue_struct *__create_workqu
if (!wq)
return NULL;

- wq->cpu_wq = alloc_percpu(struct cpu_workqueue_struct);
+ wq->cpu_wq = CPU_ALLOC(struct cpu_workqueue_struct,
+ GFP_KERNEL|__GFP_ZERO);
if (!wq->cpu_wq) {
kfree(wq);
return NULL;
@@ -814,11 +815,11 @@ void destroy_workqueue(struct workqueue_
mutex_unlock(&workqueue_mutex);

for_each_cpu_mask(cpu, *cpu_map) {
- cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+ cwq = CPU_PTR(wq->cpu_wq, cpu);
cleanup_workqueue_thread(cwq, cpu);
}

- free_percpu(wq->cpu_wq);
+ CPU_FREE(wq->cpu_wq);
kfree(wq);
}
EXPORT_SYMBOL_GPL(destroy_workqueue);
@@ -847,7 +848,7 @@ static int __devinit workqueue_cpu_callb
}

list_for_each_entry(wq, &workqueues, list) {
- cwq = per_cpu_ptr(wq->cpu_wq, cpu);
+ cwq = CPU_PTR(wq->cpu_wq, cpu);

switch (action) {
case CPU_UP_PREPARE:

--
c***@sgi.com
2007-11-20 01:11:46 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
arch/x86/kernel/acpi/cstate.c | 9 +++++----
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 7 ++++---
drivers/acpi/processor_perflib.c | 4 ++--
3 files changed, 11 insertions(+), 9 deletions(-)

Index: linux-2.6/arch/x86/kernel/acpi/cstate.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/acpi/cstate.c 2007-11-15 21:18:09.238904115 -0800
+++ linux-2.6/arch/x86/kernel/acpi/cstate.c 2007-11-15 21:25:30.499154221 -0800
@@ -87,7 +87,7 @@ int acpi_processor_ffh_cstate_probe(unsi
if (reg->bit_offset != NATIVE_CSTATE_BEYOND_HALT)
return -1;

- percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+ percpu_entry = CPU_PTR(cpu_cstate_entry, cpu);
percpu_entry->states[cx->index].eax = 0;
percpu_entry->states[cx->index].ecx = 0;

@@ -138,7 +138,7 @@ void acpi_processor_ffh_cstate_enter(str
unsigned int cpu = smp_processor_id();
struct cstate_entry *percpu_entry;

- percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+ percpu_entry = CPU_PTR(cpu_cstate_entry, cpu);
mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
percpu_entry->states[cx->index].ecx);
}
@@ -150,13 +150,14 @@ static int __init ffh_cstate_init(void)
if (c->x86_vendor != X86_VENDOR_INTEL)
return -1;

- cpu_cstate_entry = alloc_percpu(struct cstate_entry);
+ cpu_cstate_entry = CPU_ALLOC(struct cstate_entry,
+ GFP_KERNEL|__GFP_ZERO);
return 0;
}

static void __exit ffh_cstate_exit(void)
{
- free_percpu(cpu_cstate_entry);
+ CPU_FREE(cpu_cstate_entry);
cpu_cstate_entry = NULL;
}

Index: linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2007-11-15 21:18:09.246904080 -0800
+++ linux-2.6/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2007-11-15 21:25:30.499154221 -0800
@@ -513,7 +513,8 @@ static int __init acpi_cpufreq_early_ini
{
dprintk("acpi_cpufreq_early_init\n");

- acpi_perf_data = alloc_percpu(struct acpi_processor_performance);
+ acpi_perf_data = CPU_ALLOC(struct acpi_processor_performance,
+ GFP_KERNEL|__GFP_ZERO);
if (!acpi_perf_data) {
dprintk("Memory allocation error for acpi_perf_data.\n");
return -ENOMEM;
@@ -569,7 +570,7 @@ static int acpi_cpufreq_cpu_init(struct
if (!data)
return -ENOMEM;

- data->acpi_data = percpu_ptr(acpi_perf_data, cpu);
+ data->acpi_data = CPU_PTR(acpi_perf_data, cpu);
drv_data[cpu] = data;

if (cpu_has(c, X86_FEATURE_CONSTANT_TSC))
@@ -782,7 +783,7 @@ static void __exit acpi_cpufreq_exit(voi

cpufreq_unregister_driver(&acpi_cpufreq_driver);

- free_percpu(acpi_perf_data);
+ CPU_FREE(acpi_perf_data);

return;
}
Index: linux-2.6/drivers/acpi/processor_perflib.c
===================================================================
--- linux-2.6.orig/drivers/acpi/processor_perflib.c 2007-11-15 21:18:09.254904773 -0800
+++ linux-2.6/drivers/acpi/processor_perflib.c 2007-11-15 21:25:30.499154221 -0800
@@ -567,12 +567,12 @@ int acpi_processor_preregister_performan
continue;
}

- if (!performance || !percpu_ptr(performance, i)) {
+ if (!performance || !CPU_PTR(performance, i)) {
retval = -EINVAL;
continue;
}

- pr->performance = percpu_ptr(performance, i);
+ pr->performance = CPU_PTR(performance, i);
cpu_set(i, pr->performance->shared_cpu_map);
if (acpi_processor_get_psd(pr)) {
retval = -EINVAL;

--
c***@sgi.com
2007-11-20 01:11:47 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/linux/genhd.h | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

Index: linux-2.6/include/linux/genhd.h
===================================================================
--- linux-2.6.orig/include/linux/genhd.h 2007-11-18 14:38:24.285456612 -0800
+++ linux-2.6/include/linux/genhd.h 2007-11-18 22:04:47.428523761 -0800
@@ -158,21 +158,21 @@ struct disk_attribute {
*/
#ifdef CONFIG_SMP
#define __disk_stat_add(gendiskp, field, addnd) \
- (per_cpu_ptr(gendiskp->dkstats, smp_processor_id())->field += addnd)
+ __CPU_ADD(gendiskp->dkstats->field, addnd)

#define disk_stat_read(gendiskp, field) \
({ \
typeof(gendiskp->dkstats->field) res = 0; \
int i; \
for_each_possible_cpu(i) \
- res += per_cpu_ptr(gendiskp->dkstats, i)->field; \
+ res += CPU_PTR(gendiskp->dkstats, i)->field; \
res; \
})

static inline void disk_stat_set_all(struct gendisk *gendiskp, int value) {
int i;
for_each_possible_cpu(i)
- memset(per_cpu_ptr(gendiskp->dkstats, i), value,
+ memset(CPU_PTR(gendiskp->dkstats, i), value,
sizeof (struct disk_stats));
}

@@ -187,11 +187,7 @@ static inline void disk_stat_set_all(str
#endif

#define disk_stat_add(gendiskp, field, addnd) \
- do { \
- preempt_disable(); \
- __disk_stat_add(gendiskp, field, addnd); \
- preempt_enable(); \
- } while (0)
+	_CPU_ADD(gendiskp->dkstats->field, addnd)

#define __disk_stat_dec(gendiskp, field) __disk_stat_add(gendiskp, field, -1)
#define disk_stat_dec(gendiskp, field) disk_stat_add(gendiskp, field, -1)
@@ -209,7 +205,7 @@ static inline void disk_stat_set_all(str
#ifdef CONFIG_SMP
static inline int init_disk_stats(struct gendisk *disk)
{
- disk->dkstats = alloc_percpu(struct disk_stats);
+ disk->dkstats = CPU_ALLOC(struct disk_stats, GFP_KERNEL | __GFP_ZERO);
if (!disk->dkstats)
return 0;
return 1;
@@ -217,7 +213,7 @@ static inline int init_disk_stats(struct

static inline void free_disk_stats(struct gendisk *disk)
{
- free_percpu(disk->dkstats);
+ CPU_FREE(disk->dkstats);
}
#else /* CONFIG_SMP */
static inline int init_disk_stats(struct gendisk *disk)

--
c***@sgi.com
2007-11-20 01:11:49 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
kernel/rcutorture.c | 4 ++--
kernel/srcu.c | 20 ++++++++------------
2 files changed, 10 insertions(+), 14 deletions(-)

Index: linux-2.6/kernel/rcutorture.c
===================================================================
--- linux-2.6.orig/kernel/rcutorture.c 2007-11-18 14:38:24.149783392 -0800
+++ linux-2.6/kernel/rcutorture.c 2007-11-18 21:55:20.028547162 -0800
@@ -441,8 +441,8 @@ static int srcu_torture_stats(char *page
torture_type, TORTURE_FLAG, idx);
for_each_possible_cpu(cpu) {
cnt += sprintf(&page[cnt], " %d(%d,%d)", cpu,
- per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
- per_cpu_ptr(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
+ CPU_PTR(srcu_ctl.per_cpu_ref, cpu)->c[!idx],
+ CPU_PTR(srcu_ctl.per_cpu_ref, cpu)->c[idx]);
}
cnt += sprintf(&page[cnt], "\n");
return cnt;
Index: linux-2.6/kernel/srcu.c
===================================================================
--- linux-2.6.orig/kernel/srcu.c 2007-11-18 14:38:24.157783685 -0800
+++ linux-2.6/kernel/srcu.c 2007-11-18 22:04:47.332273074 -0800
@@ -46,7 +46,8 @@ int init_srcu_struct(struct srcu_struct
{
sp->completed = 0;
mutex_init(&sp->mutex);
- sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
+ sp->per_cpu_ref = CPU_ALLOC(struct srcu_struct_array,
+ GFP_KERNEL|__GFP_ZERO);
return (sp->per_cpu_ref ? 0 : -ENOMEM);
}

@@ -62,7 +63,7 @@ static int srcu_readers_active_idx(struc

sum = 0;
for_each_possible_cpu(cpu)
- sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
+ sum += CPU_PTR(sp->per_cpu_ref, cpu)->c[idx];
return sum;
}

@@ -94,7 +95,7 @@ void cleanup_srcu_struct(struct srcu_str
WARN_ON(sum); /* Leakage unless caller handles error. */
if (sum != 0)
return;
- free_percpu(sp->per_cpu_ref);
+ CPU_FREE(sp->per_cpu_ref);
sp->per_cpu_ref = NULL;
}

@@ -110,12 +111,9 @@ int srcu_read_lock(struct srcu_struct *s
{
int idx;

- preempt_disable();
idx = sp->completed & 0x1;
- barrier(); /* ensure compiler looks -once- at sp->completed. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]++;
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- preempt_enable();
+ srcu_barrier();
+ _CPU_INC(sp->per_cpu_ref->c[idx]);
return idx;
}

@@ -131,10 +129,8 @@ int srcu_read_lock(struct srcu_struct *s
*/
void srcu_read_unlock(struct srcu_struct *sp, int idx)
{
- preempt_disable();
- srcu_barrier(); /* ensure compiler won't misorder critical section. */
- per_cpu_ptr(sp->per_cpu_ref, smp_processor_id())->c[idx]--;
- preempt_enable();
+ srcu_barrier();
+ _CPU_DEC(sp->per_cpu_ref->c[idx]);
}

/**

--
c***@sgi.com
2007-11-20 01:11:51 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
fs/nfs/iostat.h | 8 ++++----
fs/nfs/super.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h 2007-11-15 21:17:24.391404458 -0800
+++ linux-2.6/fs/nfs/iostat.h 2007-11-15 21:25:33.167654066 -0800
@@ -123,7 +123,7 @@ static inline void nfs_inc_server_stats(
int cpu;

cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats = CPU_PTR(server->io_stats, cpu);
iostats->events[stat] ++;
put_cpu_no_resched();
}
@@ -139,7 +139,7 @@ static inline void nfs_add_server_stats(
int cpu;

cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats = CPU_PTR(server->io_stats, cpu);
iostats->bytes[stat] += addend;
put_cpu_no_resched();
}
@@ -151,13 +151,13 @@ static inline void nfs_add_stats(struct

static inline struct nfs_iostats *nfs_alloc_iostats(void)
{
- return alloc_percpu(struct nfs_iostats);
+ return CPU_ALLOC(struct nfs_iostats, GFP_KERNEL | __GFP_ZERO);
}

static inline void nfs_free_iostats(struct nfs_iostats *stats)
{
if (stats != NULL)
- free_percpu(stats);
+ CPU_FREE(stats);
}

#endif
Index: linux-2.6/fs/nfs/super.c
===================================================================
--- linux-2.6.orig/fs/nfs/super.c 2007-11-15 21:17:24.399404478 -0800
+++ linux-2.6/fs/nfs/super.c 2007-11-15 21:25:33.171654143 -0800
@@ -529,7 +529,7 @@ static int nfs_show_stats(struct seq_fil
struct nfs_iostats *stats;

preempt_disable();
- stats = per_cpu_ptr(nfss->io_stats, cpu);
+ stats = CPU_PTR(nfss->io_stats, cpu);

for (i = 0; i < __NFSIOS_COUNTSMAX; i++)
totals.events[i] += stats->events[i];

--
Mathieu Desnoyers
2007-11-20 13:02:01 UTC
Permalink
Post by c***@sgi.com
---
fs/nfs/iostat.h | 8 ++++----
fs/nfs/super.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)
Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h 2007-11-15 21:17:24.391404458 -0800
+++ linux-2.6/fs/nfs/iostat.h 2007-11-15 21:25:33.167654066 -0800
@@ -123,7 +123,7 @@ static inline void nfs_inc_server_stats(
int cpu;
cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats = CPU_PTR(server->io_stats, cpu);
iostats->events[stat] ++;
Is there a way to change this into a CPU_ADD ?
Post by c***@sgi.com
put_cpu_no_resched();
Why put_cpu_no_resched here ?
Post by c***@sgi.com
}
@@ -139,7 +139,7 @@ static inline void nfs_add_server_stats(
int cpu;
cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats = CPU_PTR(server->io_stats, cpu);
iostats->bytes[stat] += addend;
put_cpu_no_resched();
Why put_cpu_no_resched here ?
Post by c***@sgi.com
}
@@ -151,13 +151,13 @@ static inline void nfs_add_stats(struct
static inline struct nfs_iostats *nfs_alloc_iostats(void)
{
- return alloc_percpu(struct nfs_iostats);
+ return CPU_ALLOC(struct nfs_iostats, GFP_KERNEL | __GFP_ZERO);
}
static inline void nfs_free_iostats(struct nfs_iostats *stats)
{
if (stats != NULL)
- free_percpu(stats);
+ CPU_FREE(stats);
}
#endif
Index: linux-2.6/fs/nfs/super.c
===================================================================
--- linux-2.6.orig/fs/nfs/super.c 2007-11-15 21:17:24.399404478 -0800
+++ linux-2.6/fs/nfs/super.c 2007-11-15 21:25:33.171654143 -0800
@@ -529,7 +529,7 @@ static int nfs_show_stats(struct seq_fil
struct nfs_iostats *stats;
preempt_disable();
- stats = per_cpu_ptr(nfss->io_stats, cpu);
+ stats = CPU_PTR(nfss->io_stats, cpu);
for (i = 0; i < __NFSIOS_COUNTSMAX; i++)
totals.events[i] += stats->events[i];
--
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:49:37 UTC
Permalink
Post by Mathieu Desnoyers
Post by c***@sgi.com
Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h 2007-11-15 21:17:24.391404458 -0800
+++ linux-2.6/fs/nfs/iostat.h 2007-11-15 21:25:33.167654066 -0800
@@ -123,7 +123,7 @@ static inline void nfs_inc_server_stats(
int cpu;
cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats = CPU_PTR(server->io_stats, cpu);
iostats->events[stat] ++;
Is there a way to change this into a CPU_ADD ?
Yes I must have missed that.

Could be

CPU_INC(server->io_stats->events[stat]);
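
Presumably the whole helper then collapses to something like the following
(untested sketch; the parameter types are assumed since they are not visible
in the quoted hunk):

	static inline void nfs_inc_server_stats(struct nfs_server *server,
						enum nfs_stat_eventcounters stat)
	{
		/* single cpu op, no get_cpu()/put_cpu_no_resched() pair */
		CPU_INC(server->io_stats->events[stat]);
	}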
Post by Mathieu Desnoyers
Post by c***@sgi.com
put_cpu_no_resched();
Why put_cpu_no_resched here ?
We do not want to reschedule here? We may have already disabled interrupts
or some such thing.
Trond Myklebust
2007-11-20 20:56:38 UTC
Permalink
Post by Christoph Lameter
Post by Mathieu Desnoyers
Post by c***@sgi.com
Index: linux-2.6/fs/nfs/iostat.h
===================================================================
--- linux-2.6.orig/fs/nfs/iostat.h 2007-11-15 21:17:24.391404458 -0800
+++ linux-2.6/fs/nfs/iostat.h 2007-11-15 21:25:33.167654066 -0800
@@ -123,7 +123,7 @@ static inline void nfs_inc_server_stats(
int cpu;
cpu = get_cpu();
- iostats = per_cpu_ptr(server->io_stats, cpu);
+ iostats = CPU_PTR(server->io_stats, cpu);
iostats->events[stat] ++;
Is there a way to change this into a CPU_ADD ?
Yes I must have missed that.
Could be
CPU_INC(server->io_stats->events[stat]);
Post by Mathieu Desnoyers
Post by c***@sgi.com
put_cpu_no_resched();
Why put_cpu_no_resched here ?
We do not want to reschedule here? We may have already disabled interrupts
or some such thing.
Some of these statistics are updated from inside a spinlocked
environment, hence the put_no_resched().

Trond
c***@sgi.com
2007-11-20 01:11:50 UTC
Permalink
Also remove the useless zeroing after allocation: alloc_percpu already zeroed
the objects, and the CPU_ALLOC call with __GFP_ZERO does so as well.

Signed-off-by: Christoph Lameter <***@sgi.com>
---
fs/xfs/xfs_mount.c | 24 ++++++++----------------
1 file changed, 8 insertions(+), 16 deletions(-)

Index: linux-2.6/fs/xfs/xfs_mount.c
===================================================================
--- linux-2.6.orig/fs/xfs/xfs_mount.c 2007-11-15 21:17:24.467654585 -0800
+++ linux-2.6/fs/xfs/xfs_mount.c 2007-11-15 21:25:32.643904117 -0800
@@ -1924,7 +1924,7 @@ xfs_icsb_cpu_notify(

mp = (xfs_mount_t *)container_of(nfb, xfs_mount_t, m_icsb_notifier);
cntp = (xfs_icsb_cnts_t *)
- per_cpu_ptr(mp->m_sb_cnts, (unsigned long)hcpu);
+ CPU_PTR(mp->m_sb_cnts, (unsigned long)hcpu);
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
@@ -1976,10 +1976,7 @@ int
xfs_icsb_init_counters(
xfs_mount_t *mp)
{
- xfs_icsb_cnts_t *cntp;
- int i;
-
- mp->m_sb_cnts = alloc_percpu(xfs_icsb_cnts_t);
+ mp->m_sb_cnts = CPU_ALLOC(xfs_icsb_cnts_t, GFP_KERNEL | __GFP_ZERO);
if (mp->m_sb_cnts == NULL)
return -ENOMEM;

@@ -1989,11 +1986,6 @@ xfs_icsb_init_counters(
register_hotcpu_notifier(&mp->m_icsb_notifier);
#endif /* CONFIG_HOTPLUG_CPU */

- for_each_online_cpu(i) {
- cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
- memset(cntp, 0, sizeof(xfs_icsb_cnts_t));
- }
-
mutex_init(&mp->m_icsb_mutex);

/*
@@ -2026,7 +2018,7 @@ xfs_icsb_destroy_counters(
{
if (mp->m_sb_cnts) {
unregister_hotcpu_notifier(&mp->m_icsb_notifier);
- free_percpu(mp->m_sb_cnts);
+ CPU_FREE(mp->m_sb_cnts);
}
mutex_destroy(&mp->m_icsb_mutex);
}
@@ -2056,7 +2048,7 @@ xfs_icsb_lock_all_counters(
int i;

for_each_online_cpu(i) {
- cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+ cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
xfs_icsb_lock_cntr(cntp);
}
}
@@ -2069,7 +2061,7 @@ xfs_icsb_unlock_all_counters(
int i;

for_each_online_cpu(i) {
- cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+ cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
xfs_icsb_unlock_cntr(cntp);
}
}
@@ -2089,7 +2081,7 @@ xfs_icsb_count(
xfs_icsb_lock_all_counters(mp);

for_each_online_cpu(i) {
- cntp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, i);
+ cntp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, i);
cnt->icsb_icount += cntp->icsb_icount;
cnt->icsb_ifree += cntp->icsb_ifree;
cnt->icsb_fdblocks += cntp->icsb_fdblocks;
@@ -2167,7 +2159,7 @@ xfs_icsb_enable_counter(

xfs_icsb_lock_all_counters(mp);
for_each_online_cpu(i) {
- cntp = per_cpu_ptr(mp->m_sb_cnts, i);
+ cntp = CPU_PTR(mp->m_sb_cnts, i);
switch (field) {
case XFS_SBS_ICOUNT:
cntp->icsb_icount = count + resid;
@@ -2307,7 +2299,7 @@ xfs_icsb_modify_counters(
might_sleep();
again:
cpu = get_cpu();
- icsbp = (xfs_icsb_cnts_t *)per_cpu_ptr(mp->m_sb_cnts, cpu);
+ icsbp = (xfs_icsb_cnts_t *)CPU_PTR(mp->m_sb_cnts, cpu);

/*
* if the counter is disabled, go to slow path

--
Christoph Hellwig
2007-11-20 08:12:30 UTC
Permalink
Post by c***@sgi.com
Also remove the useless zeroing after allocation. Allocpercpu already
zeroed the objects.
You still haven't answered my comment to the last iteration.
Christoph Lameter
2007-11-20 20:38:29 UTC
Permalink
Post by Christoph Hellwig
Post by c***@sgi.com
Also remove the useless zeroing after allocation. Allocpercpu already
zeroed the objects.
You still haven't answered my comment to the last iteration.
And you have not read the discussion on that subject in the prior
iteration between Peter Zilkstra and me.
c***@sgi.com
2007-11-20 01:11:48 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
block/blktrace.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/block/blktrace.c
===================================================================
--- linux-2.6.orig/block/blktrace.c 2007-11-15 21:17:24.586154116 -0800
+++ linux-2.6/block/blktrace.c 2007-11-15 21:25:31.591154091 -0800
@@ -155,7 +155,7 @@ void __blk_add_trace(struct blk_trace *b
t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
if (t) {
cpu = smp_processor_id();
- sequence = per_cpu_ptr(bt->sequence, cpu);
+ sequence = CPU_PTR(bt->sequence, cpu);

t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
t->sequence = ++(*sequence);
@@ -227,7 +227,7 @@ static void blk_trace_cleanup(struct blk
relay_close(bt->rchan);
debugfs_remove(bt->dropped_file);
blk_remove_tree(bt->dir);
- free_percpu(bt->sequence);
+ CPU_FREE(bt->sequence);
kfree(bt);
}

@@ -338,7 +338,7 @@ int do_blk_trace_setup(struct request_qu
if (!bt)
goto err;

- bt->sequence = alloc_percpu(unsigned long);
+ bt->sequence = CPU_ALLOC(unsigned long, GFP_KERNEL | __GFP_ZERO);
if (!bt->sequence)
goto err;

@@ -387,7 +387,7 @@ err:
if (bt) {
if (bt->dropped_file)
debugfs_remove(bt->dropped_file);
- free_percpu(bt->sequence);
+ CPU_FREE(bt->sequence);
if (bt->rchan)
relay_close(bt->rchan);
kfree(bt);

--
c***@sgi.com
2007-11-20 01:11:52 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/net/neighbour.h | 6 +-----
net/core/neighbour.c | 11 ++++++-----
2 files changed, 7 insertions(+), 10 deletions(-)

Index: linux-2.6/include/net/neighbour.h
===================================================================
--- linux-2.6.orig/include/net/neighbour.h 2007-11-18 14:38:23.910033621 -0800
+++ linux-2.6/include/net/neighbour.h 2007-11-18 22:04:47.465297422 -0800
@@ -81,11 +81,7 @@ struct neigh_statistics
};

#define NEIGH_CACHE_STAT_INC(tbl, field) \
- do { \
- preempt_disable(); \
- (per_cpu_ptr((tbl)->stats, smp_processor_id())->field)++; \
- preempt_enable(); \
- } while (0)
+ _CPU_INC(tbl->stats->field)

struct neighbour
{
Index: linux-2.6/net/core/neighbour.c
===================================================================
--- linux-2.6.orig/net/core/neighbour.c 2007-11-18 14:38:23.918033788 -0800
+++ linux-2.6/net/core/neighbour.c 2007-11-18 21:55:23.064297499 -0800
@@ -1348,7 +1348,8 @@ void neigh_table_init_no_netlink(struct
kmem_cache_create(tbl->id, tbl->entry_size, 0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC,
NULL);
- tbl->stats = alloc_percpu(struct neigh_statistics);
+ tbl->stats = CPU_ALLOC(struct neigh_statistics,
+ GFP_KERNEL | __GFP_ZERO);
if (!tbl->stats)
panic("cannot create neighbour cache statistics");

@@ -1437,7 +1438,7 @@ int neigh_table_clear(struct neigh_table

remove_proc_entry(tbl->id, init_net.proc_net_stat);

- free_percpu(tbl->stats);
+ CPU_FREE(tbl->stats);
tbl->stats = NULL;

kmem_cache_destroy(tbl->kmem_cachep);
@@ -1694,7 +1695,7 @@ static int neightbl_fill_info(struct sk_
for_each_possible_cpu(cpu) {
struct neigh_statistics *st;

- st = per_cpu_ptr(tbl->stats, cpu);
+ st = CPU_PTR(tbl->stats, cpu);
ndst.ndts_allocs += st->allocs;
ndst.ndts_destroys += st->destroys;
ndst.ndts_hash_grows += st->hash_grows;
@@ -2343,7 +2344,7 @@ static void *neigh_stat_seq_start(struct
if (!cpu_possible(cpu))
continue;
*pos = cpu+1;
- return per_cpu_ptr(tbl->stats, cpu);
+ return CPU_PTR(tbl->stats, cpu);
}
return NULL;
}
@@ -2358,7 +2359,7 @@ static void *neigh_stat_seq_next(struct
if (!cpu_possible(cpu))
continue;
*pos = cpu+1;
- return per_cpu_ptr(tbl->stats, cpu);
+ return CPU_PTR(tbl->stats, cpu);
}
return NULL;
}

--
c***@sgi.com
2007-11-20 01:11:58 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
drivers/net/chelsio/sge.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/net/chelsio/sge.c
===================================================================
--- linux-2.6.orig/drivers/net/chelsio/sge.c 2007-11-15 21:17:23.927654318 -0800
+++ linux-2.6/drivers/net/chelsio/sge.c 2007-11-15 21:25:37.015154316 -0800
@@ -805,7 +805,7 @@ void t1_sge_destroy(struct sge *sge)
int i;

for_each_port(sge->adapter, i)
- free_percpu(sge->port_stats[i]);
+ CPU_FREE(sge->port_stats[i]);

kfree(sge->tx_sched);
free_tx_resources(sge);
@@ -984,7 +984,7 @@ void t1_sge_get_port_stats(const struct

memset(ss, 0, sizeof(*ss));
for_each_possible_cpu(cpu) {
- struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[port], cpu);
+ struct sge_port_stats *st = CPU_PTR(sge->port_stats[port], cpu);

ss->rx_packets += st->rx_packets;
ss->rx_cso_good += st->rx_cso_good;
@@ -1379,7 +1379,7 @@ static void sge_rx(struct sge *sge, stru
}
__skb_pull(skb, sizeof(*p));

- st = per_cpu_ptr(sge->port_stats[p->iff], smp_processor_id());
+ st = THIS_CPU(sge->port_stats[p->iff]);
st->rx_packets++;

skb->protocol = eth_type_trans(skb, adapter->port[p->iff].dev);
@@ -1848,7 +1848,7 @@ int t1_start_xmit(struct sk_buff *skb, s
{
struct adapter *adapter = dev->priv;
struct sge *sge = adapter->sge;
- struct sge_port_stats *st = per_cpu_ptr(sge->port_stats[dev->if_port], smp_processor_id());
+ struct sge_port_stats *st = THIS_CPU(sge->port_stats[dev->if_port]);
struct cpl_tx_pkt *cpl;
struct sk_buff *orig_skb = skb;
int ret;
@@ -2165,7 +2165,8 @@ struct sge * __devinit t1_sge_create(str
sge->jumbo_fl = t1_is_T1B(adapter) ? 1 : 0;

for_each_port(adapter, i) {
- sge->port_stats[i] = alloc_percpu(struct sge_port_stats);
+ sge->port_stats[i] = CPU_ALLOC(struct sge_port_stats,
+ GFP_KERNEL | __GFP_ZERO);
if (!sge->port_stats[i])
goto nomem_port;
}
@@ -2209,7 +2210,7 @@ struct sge * __devinit t1_sge_create(str
return sge;
nomem_port:
while (i >= 0) {
- free_percpu(sge->port_stats[i]);
+ CPU_FREE(sge->port_stats[i]);
--i;
}
kfree(sge);

--
c***@sgi.com
2007-11-20 01:11:57 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
drivers/net/veth.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/drivers/net/veth.c
===================================================================
--- linux-2.6.orig/drivers/net/veth.c 2007-11-15 21:17:24.010404318 -0800
+++ linux-2.6/drivers/net/veth.c 2007-11-15 21:25:36.483154219 -0800
@@ -162,7 +162,7 @@ static int veth_xmit(struct sk_buff *skb
rcv_priv = netdev_priv(rcv);

cpu = smp_processor_id();
- stats = per_cpu_ptr(priv->stats, cpu);
+ stats = CPU_PTR(priv->stats, cpu);

if (!(rcv->flags & IFF_UP))
goto outf;
@@ -183,7 +183,7 @@ static int veth_xmit(struct sk_buff *skb
stats->tx_bytes += length;
stats->tx_packets++;

- stats = per_cpu_ptr(rcv_priv->stats, cpu);
+ stats = CPU_PTR(rcv_priv->stats, cpu);
stats->rx_bytes += length;
stats->rx_packets++;

@@ -217,7 +217,7 @@ static struct net_device_stats *veth_get
dev_stats->tx_dropped = 0;

for_each_online_cpu(cpu) {
- stats = per_cpu_ptr(priv->stats, cpu);
+ stats = CPU_PTR(priv->stats, cpu);

dev_stats->rx_packets += stats->rx_packets;
dev_stats->tx_packets += stats->tx_packets;
@@ -261,7 +261,7 @@ static int veth_dev_init(struct net_devi
struct veth_net_stats *stats;
struct veth_priv *priv;

- stats = alloc_percpu(struct veth_net_stats);
+	stats = CPU_ALLOC(struct veth_net_stats, GFP_KERNEL | __GFP_ZERO);
if (stats == NULL)
return -ENOMEM;

@@ -275,7 +275,7 @@ static void veth_dev_free(struct net_dev
struct veth_priv *priv;

priv = netdev_priv(dev);
- free_percpu(priv->stats);
+ CPU_FREE(priv->stats);
free_netdev(dev);
}


--
c***@sgi.com
2007-11-20 01:11:56 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
drivers/net/loopback.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

Index: linux-2.6/drivers/net/loopback.c
===================================================================
--- linux-2.6.orig/drivers/net/loopback.c 2007-11-18 14:38:23.621283530 -0800
+++ linux-2.6/drivers/net/loopback.c 2007-11-18 22:04:47.399082691 -0800
@@ -134,7 +134,7 @@ static void emulate_large_send_offload(s
*/
static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
{
- struct pcpu_lstats *pcpu_lstats, *lb_stats;
+ struct pcpu_lstats *pcpu_lstats;

skb_orphan(skb);

@@ -154,11 +154,9 @@ static int loopback_xmit(struct sk_buff
#endif
dev->last_rx = jiffies;

- /* it's OK to use per_cpu_ptr() because BHs are off */
pcpu_lstats = netdev_priv(dev);
- lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
- lb_stats->bytes += skb->len;
- lb_stats->packets++;
+ __CPU_ADD(pcpu_lstats->bytes, skb->len);
+ __CPU_INC(pcpu_lstats->packets);

netif_rx(skb);

@@ -177,7 +175,7 @@ static struct net_device_stats *get_stat
for_each_possible_cpu(i) {
const struct pcpu_lstats *lb_stats;

- lb_stats = per_cpu_ptr(pcpu_lstats, i);
+ lb_stats = CPU_PTR(pcpu_lstats, i);
bytes += lb_stats->bytes;
packets += lb_stats->packets;
}
@@ -205,7 +203,7 @@ static int loopback_dev_init(struct net_
{
struct pcpu_lstats *lstats;

- lstats = alloc_percpu(struct pcpu_lstats);
+ lstats = CPU_ALLOC(struct pcpu_lstats, GFP_KERNEL | __GFP_ZERO);
if (!lstats)
return -ENOMEM;

@@ -217,7 +215,7 @@ static void loopback_dev_free(struct net
{
struct pcpu_lstats *lstats = netdev_priv(dev);

- free_percpu(lstats);
+ CPU_FREE(lstats);
free_netdev(dev);
}


--
c***@sgi.com
2007-11-20 01:11:54 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
net/ipv4/ipcomp.c | 26 +++++++++++++-------------
net/ipv6/ipcomp6.c | 26 +++++++++++++-------------
2 files changed, 26 insertions(+), 26 deletions(-)

Index: linux-2.6/net/ipv4/ipcomp.c
===================================================================
--- linux-2.6.orig/net/ipv4/ipcomp.c 2007-11-15 21:17:24.199404507 -0800
+++ linux-2.6/net/ipv4/ipcomp.c 2007-11-15 21:25:34.771154012 -0800
@@ -48,8 +48,8 @@ static int ipcomp_decompress(struct xfrm
int dlen = IPCOMP_SCRATCH_SIZE;
const u8 *start = skb->data;
const int cpu = get_cpu();
- u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
- struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+ u8 *scratch = *CPU_PTR(ipcomp_scratches, cpu);
+ struct crypto_comp *tfm = *CPU_PTR(ipcd->tfms, cpu);
int err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);

if (err)
@@ -103,8 +103,8 @@ static int ipcomp_compress(struct xfrm_s
int dlen = IPCOMP_SCRATCH_SIZE;
u8 *start = skb->data;
const int cpu = get_cpu();
- u8 *scratch = *per_cpu_ptr(ipcomp_scratches, cpu);
- struct crypto_comp *tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+ u8 *scratch = *CPU_PTR(ipcomp_scratches, cpu);
+ struct crypto_comp *tfm = *CPU_PTR(ipcd->tfms, cpu);
int err = crypto_comp_compress(tfm, start, plen, scratch, &dlen);

if (err)
@@ -252,9 +252,9 @@ static void ipcomp_free_scratches(void)
return;

for_each_possible_cpu(i)
- vfree(*per_cpu_ptr(scratches, i));
+ vfree(*CPU_PTR(scratches, i));

- free_percpu(scratches);
+ CPU_FREE(scratches);
}

static void **ipcomp_alloc_scratches(void)
@@ -265,7 +265,7 @@ static void **ipcomp_alloc_scratches(voi
if (ipcomp_scratch_users++)
return ipcomp_scratches;

- scratches = alloc_percpu(void *);
+ scratches = CPU_ALLOC(void *, GFP_KERNEL);
if (!scratches)
return NULL;

@@ -275,7 +275,7 @@ static void **ipcomp_alloc_scratches(voi
void *scratch = vmalloc(IPCOMP_SCRATCH_SIZE);
if (!scratch)
return NULL;
- *per_cpu_ptr(scratches, i) = scratch;
+ *CPU_PTR(scratches, i) = scratch;
}

return scratches;
@@ -303,10 +303,10 @@ static void ipcomp_free_tfms(struct cryp
return;

for_each_possible_cpu(cpu) {
- struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
+ struct crypto_comp *tfm = *CPU_PTR(tfms, cpu);
crypto_free_comp(tfm);
}
- free_percpu(tfms);
+ CPU_FREE(tfms);
}

static struct crypto_comp **ipcomp_alloc_tfms(const char *alg_name)
@@ -322,7 +322,7 @@ static struct crypto_comp **ipcomp_alloc
struct crypto_comp *tfm;

tfms = pos->tfms;
- tfm = *per_cpu_ptr(tfms, cpu);
+ tfm = *CPU_PTR(tfms, cpu);

if (!strcmp(crypto_comp_name(tfm), alg_name)) {
pos->users++;
@@ -338,7 +338,7 @@ static struct crypto_comp **ipcomp_alloc
INIT_LIST_HEAD(&pos->list);
list_add(&pos->list, &ipcomp_tfms_list);

- pos->tfms = tfms = alloc_percpu(struct crypto_comp *);
+ pos->tfms = tfms = CPU_ALLOC(struct crypto_comp *, GFP_KERNEL);
if (!tfms)
goto error;

@@ -347,7 +347,7 @@ static struct crypto_comp **ipcomp_alloc
CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm))
goto error;
- *per_cpu_ptr(tfms, cpu) = tfm;
+ *CPU_PTR(tfms, cpu) = tfm;
}

return tfms;
Index: linux-2.6/net/ipv6/ipcomp6.c
===================================================================
--- linux-2.6.orig/net/ipv6/ipcomp6.c 2007-11-15 21:17:24.207404544 -0800
+++ linux-2.6/net/ipv6/ipcomp6.c 2007-11-15 21:25:34.774656957 -0800
@@ -88,8 +88,8 @@ static int ipcomp6_input(struct xfrm_sta
start = skb->data;

cpu = get_cpu();
- scratch = *per_cpu_ptr(ipcomp6_scratches, cpu);
- tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+ scratch = *CPU_PTR(ipcomp6_scratches, cpu);
+ tfm = *CPU_PTR(ipcd->tfms, cpu);

err = crypto_comp_decompress(tfm, start, plen, scratch, &dlen);
if (err)
@@ -140,8 +140,8 @@ static int ipcomp6_output(struct xfrm_st
start = skb->data;

cpu = get_cpu();
- scratch = *per_cpu_ptr(ipcomp6_scratches, cpu);
- tfm = *per_cpu_ptr(ipcd->tfms, cpu);
+ scratch = *CPU_PTR(ipcomp6_scratches, cpu);
+ tfm = *CPU_PTR(ipcd->tfms, cpu);

err = crypto_comp_compress(tfm, start, plen, scratch, &dlen);
if (err || (dlen + sizeof(*ipch)) >= plen) {
@@ -263,12 +263,12 @@ static void ipcomp6_free_scratches(void)
return;

for_each_possible_cpu(i) {
- void *scratch = *per_cpu_ptr(scratches, i);
+ void *scratch = *CPU_PTR(scratches, i);

vfree(scratch);
}

- free_percpu(scratches);
+ CPU_FREE(scratches);
}

static void **ipcomp6_alloc_scratches(void)
@@ -279,7 +279,7 @@ static void **ipcomp6_alloc_scratches(vo
if (ipcomp6_scratch_users++)
return ipcomp6_scratches;

- scratches = alloc_percpu(void *);
+ scratches = CPU_ALLOC(void *, GFP_KERNEL);
if (!scratches)
return NULL;

@@ -289,7 +289,7 @@ static void **ipcomp6_alloc_scratches(vo
void *scratch = vmalloc(IPCOMP_SCRATCH_SIZE);
if (!scratch)
return NULL;
- *per_cpu_ptr(scratches, i) = scratch;
+ *CPU_PTR(scratches, i) = scratch;
}

return scratches;
@@ -317,10 +317,10 @@ static void ipcomp6_free_tfms(struct cry
return;

for_each_possible_cpu(cpu) {
- struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
+ struct crypto_comp *tfm = *CPU_PTR(tfms, cpu);
crypto_free_comp(tfm);
}
- free_percpu(tfms);
+ CPU_FREE(tfms);
}

static struct crypto_comp **ipcomp6_alloc_tfms(const char *alg_name)
@@ -336,7 +336,7 @@ static struct crypto_comp **ipcomp6_allo
struct crypto_comp *tfm;

tfms = pos->tfms;
- tfm = *per_cpu_ptr(tfms, cpu);
+ tfm = *CPU_PTR(tfms, cpu);

if (!strcmp(crypto_comp_name(tfm), alg_name)) {
pos->users++;
@@ -352,7 +352,7 @@ static struct crypto_comp **ipcomp6_allo
INIT_LIST_HEAD(&pos->list);
list_add(&pos->list, &ipcomp6_tfms_list);

- pos->tfms = tfms = alloc_percpu(struct crypto_comp *);
+ pos->tfms = tfms = CPU_ALLOC(struct crypto_comp *, GFP_KERNEL);
if (!tfms)
goto error;

@@ -361,7 +361,7 @@ static struct crypto_comp **ipcomp6_allo
CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm))
goto error;
- *per_cpu_ptr(tfms, cpu) = tfm;
+ *CPU_PTR(tfms, cpu) = tfm;
}

return tfms;

--
c***@sgi.com
2007-11-20 01:11:55 UTC
Permalink
Convert DMA engine to use CPU_xx operations. This also removes the use of local_t
from the dmaengine.

Signed-off-by: Christoph Lameter <***@sgi.com>
---
drivers/dma/dmaengine.c | 38 ++++++++++++++------------------------
include/linux/dmaengine.h | 16 ++++++----------
2 files changed, 20 insertions(+), 34 deletions(-)

Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c 2007-11-19 15:45:06.009390961 -0800
+++ linux-2.6/drivers/dma/dmaengine.c 2007-11-19 15:59:59.894744662 -0800
@@ -84,7 +84,7 @@ static ssize_t show_memcpy_count(struct
int i;

for_each_possible_cpu(i)
- count += per_cpu_ptr(chan->local, i)->memcpy_count;
+ count += CPU_PTR(chan->local, i)->memcpy_count;

return sprintf(buf, "%lu\n", count);
}
@@ -96,7 +96,7 @@ static ssize_t show_bytes_transferred(st
int i;

for_each_possible_cpu(i)
- count += per_cpu_ptr(chan->local, i)->bytes_transferred;
+ count += CPU_PTR(chan->local, i)->bytes_transferred;

return sprintf(buf, "%lu\n", count);
}
@@ -110,10 +110,8 @@ static ssize_t show_in_use(struct class_
atomic_read(&chan->refcount.refcount) > 1)
in_use = 1;
else {
- if (local_read(&(per_cpu_ptr(chan->local,
- get_cpu())->refcount)) > 0)
+ if (_CPU_READ(chan->local->refcount) > 0)
in_use = 1;
- put_cpu();
}

return sprintf(buf, "%d\n", in_use);
@@ -226,7 +224,7 @@ static void dma_chan_free_rcu(struct rcu
int bias = 0x7FFFFFFF;
int i;
for_each_possible_cpu(i)
- bias -= local_read(&per_cpu_ptr(chan->local, i)->refcount);
+ bias -= _CPU_READ(chan->local->refcount);
atomic_sub(bias, &chan->refcount.refcount);
kref_put(&chan->refcount, dma_chan_cleanup);
}
@@ -372,7 +370,8 @@ int dma_async_device_register(struct dma

/* represent channels in sysfs. Probably want devs too */
list_for_each_entry(chan, &device->channels, device_node) {
- chan->local = alloc_percpu(typeof(*chan->local));
+ chan->local = CPU_ALLOC(typeof(*chan->local),
+ GFP_KERNEL | __GFP_ZERO);
if (chan->local == NULL)
continue;

@@ -385,7 +384,7 @@ int dma_async_device_register(struct dma
rc = class_device_register(&chan->class_dev);
if (rc) {
chancnt--;
- free_percpu(chan->local);
+ CPU_FREE(chan->local);
chan->local = NULL;
goto err_out;
}
@@ -413,7 +412,7 @@ err_out:
kref_put(&device->refcount, dma_async_device_cleanup);
class_device_unregister(&chan->class_dev);
chancnt--;
- free_percpu(chan->local);
+ CPU_FREE(chan->local);
}
return rc;
}
@@ -488,11 +487,8 @@ dma_async_memcpy_buf_to_buf(struct dma_c
tx->tx_set_dest(addr, tx, 0);
cookie = tx->tx_submit(tx);

- cpu = get_cpu();
- per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
- per_cpu_ptr(chan->local, cpu)->memcpy_count++;
- put_cpu();
-
+ __CPU_ADD(chan->local->bytes_transferred, len);
+ __CPU_INC(chan->local->memcpy_count);
return cookie;
}
EXPORT_SYMBOL(dma_async_memcpy_buf_to_buf);
@@ -532,11 +528,8 @@ dma_async_memcpy_buf_to_pg(struct dma_ch
tx->tx_set_dest(addr, tx, 0);
cookie = tx->tx_submit(tx);

- cpu = get_cpu();
- per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
- per_cpu_ptr(chan->local, cpu)->memcpy_count++;
- put_cpu();
-
+ _CPU_ADD(chan->local->bytes_transferred, len);
+ _CPU_INC(chan->local->memcpy_count);
return cookie;
}
EXPORT_SYMBOL(dma_async_memcpy_buf_to_pg);
@@ -578,11 +571,8 @@ dma_async_memcpy_pg_to_pg(struct dma_cha
tx->tx_set_dest(addr, tx, 0);
cookie = tx->tx_submit(tx);

- cpu = get_cpu();
- per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
- per_cpu_ptr(chan->local, cpu)->memcpy_count++;
- put_cpu();
-
+ _CPU_ADD(chan->local->bytes_transferred, len);
+ _CPU_INC(chan->local->memcpy_count);
return cookie;
}
EXPORT_SYMBOL(dma_async_memcpy_pg_to_pg);
Index: linux-2.6/include/linux/dmaengine.h
===================================================================
--- linux-2.6.orig/include/linux/dmaengine.h 2007-11-19 15:45:06.017390185 -0800
+++ linux-2.6/include/linux/dmaengine.h 2007-11-19 15:56:26.814390333 -0800
@@ -102,13 +102,13 @@ typedef struct { DECLARE_BITMAP(bits, DM

/**
* struct dma_chan_percpu - the per-CPU part of struct dma_chan
- * @refcount: local_t used for open-coded "bigref" counting
+ * @refcount: int used for open-coded "bigref" counting
* @memcpy_count: transaction counter
* @bytes_transferred: byte counter
*/

struct dma_chan_percpu {
- local_t refcount;
+ int refcount;
/* stats */
unsigned long memcpy_count;
unsigned long bytes_transferred;
@@ -149,20 +149,16 @@ static inline void dma_chan_get(struct d
{
if (unlikely(chan->slow_ref))
kref_get(&chan->refcount);
- else {
- local_inc(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
- put_cpu();
- }
+ else
+ _CPU_INC(chan->local->refcount);
}

static inline void dma_chan_put(struct dma_chan *chan)
{
if (unlikely(chan->slow_ref))
kref_put(&chan->refcount, dma_chan_cleanup);
- else {
- local_dec(&(per_cpu_ptr(chan->local, get_cpu())->refcount));
- put_cpu();
- }
+ else
+ _CPU_DEC(chan->local->refcount);
}

/*

--
Mathieu Desnoyers
2007-11-20 12:50:47 UTC
Permalink
Post by c***@sgi.com
Convert DMA engine to use CPU_xx operations. This also removes the use of local_t
from the dmaengine.
---
drivers/dma/dmaengine.c | 38 ++++++++++++++------------------------
include/linux/dmaengine.h | 16 ++++++----------
2 files changed, 20 insertions(+), 34 deletions(-)
Index: linux-2.6/drivers/dma/dmaengine.c
===================================================================
--- linux-2.6.orig/drivers/dma/dmaengine.c 2007-11-19 15:45:06.009390961 -0800
+++ linux-2.6/drivers/dma/dmaengine.c 2007-11-19 15:59:59.894744662 -0800
@@ -84,7 +84,7 @@ static ssize_t show_memcpy_count(struct
int i;
for_each_possible_cpu(i)
- count += per_cpu_ptr(chan->local, i)->memcpy_count;
+ count += CPU_PTR(chan->local, i)->memcpy_count;
return sprintf(buf, "%lu\n", count);
}
@@ -96,7 +96,7 @@ static ssize_t show_bytes_transferred(st
int i;
for_each_possible_cpu(i)
- count += per_cpu_ptr(chan->local, i)->bytes_transferred;
+ count += CPU_PTR(chan->local, i)->bytes_transferred;
return sprintf(buf, "%lu\n", count);
}
@@ -110,10 +110,8 @@ static ssize_t show_in_use(struct class_
atomic_read(&chan->refcount.refcount) > 1)
in_use = 1;
else {
- if (local_read(&(per_cpu_ptr(chan->local,
- get_cpu())->refcount)) > 0)
+ if (_CPU_READ(chan->local->refcount) > 0)
in_use = 1;
- put_cpu();
}
return sprintf(buf, "%d\n", in_use);
@@ -226,7 +224,7 @@ static void dma_chan_free_rcu(struct rcu
int bias = 0x7FFFFFFF;
int i;
for_each_possible_cpu(i)
- bias -= local_read(&per_cpu_ptr(chan->local, i)->refcount);
+ bias -= _CPU_READ(chan->local->refcount);
atomic_sub(bias, &chan->refcount.refcount);
kref_put(&chan->refcount, dma_chan_cleanup);
}
@@ -372,7 +370,8 @@ int dma_async_device_register(struct dma
/* represent channels in sysfs. Probably want devs too */
list_for_each_entry(chan, &device->channels, device_node) {
- chan->local = alloc_percpu(typeof(*chan->local));
+ chan->local = CPU_ALLOC(typeof(*chan->local),
+ GFP_KERNEL | __GFP_ZERO);
if (chan->local == NULL)
continue;
@@ -385,7 +384,7 @@ int dma_async_device_register(struct dma
rc = class_device_register(&chan->class_dev);
if (rc) {
chancnt--;
- free_percpu(chan->local);
+ CPU_FREE(chan->local);
chan->local = NULL;
goto err_out;
}
kref_put(&device->refcount, dma_async_device_cleanup);
class_device_unregister(&chan->class_dev);
chancnt--;
- free_percpu(chan->local);
+ CPU_FREE(chan->local);
}
return rc;
}
@@ -488,11 +487,8 @@ dma_async_memcpy_buf_to_buf(struct dma_c
tx->tx_set_dest(addr, tx, 0);
cookie = tx->tx_submit(tx);
- cpu = get_cpu();
- per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
- per_cpu_ptr(chan->local, cpu)->memcpy_count++;
- put_cpu();
-
+ __CPU_ADD(chan->local->bytes_transferred, len);
+ __CPU_INC(chan->local->memcpy_count);
I am wondering about the impact of the preempt disable removal here. It
means that there is a statistically low probability that we will be
moved to a different CPU between the bytes_transferred and the
memcpy_count increments. I hope nobody relies on the fact that the
per-cpu counts should match perfectly...
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:46:43 UTC
Permalink
Post by Mathieu Desnoyers
Post by c***@sgi.com
@@ -488,11 +487,8 @@ dma_async_memcpy_buf_to_buf(struct dma_c
tx->tx_set_dest(addr, tx, 0);
cookie = tx->tx_submit(tx);
- cpu = get_cpu();
- per_cpu_ptr(chan->local, cpu)->bytes_transferred += len;
- per_cpu_ptr(chan->local, cpu)->memcpy_count++;
- put_cpu();
-
+ __CPU_ADD(chan->local->bytes_transferred, len);
+ __CPU_INC(chan->local->memcpy_count);
I am wondering about the impact of the preempt disable removal here. It
means that there is a statistically low probability that we will be
moved to a different CPU between the bytes_transferred and the
memcpy_count increments. I hope nobody relies on the fact that the
per-cpu counts should match perfectly...
True. But as far as I can tell this is just statistics and technically the
operation is not complete until this function has terminated. So one half
was run on one cpu and the other half on another. Where the statistics
should be kept is ambiguous.

Having said that: We could leave the code unchanged if this is a concern.
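
For reference, a rough sketch of that preempt safe variant (the helper name
is made up here; it simply keeps the get_cpu()/put_cpu() pattern of the
current code while using the new accessors):

static inline void dma_chan_account(struct dma_chan *chan, size_t len)
{
	/* get_cpu() disables preemption so both counters hit the same cpu */
	int cpu = get_cpu();

	CPU_PTR(chan->local, cpu)->bytes_transferred += len;
	CPU_PTR(chan->local, cpu)->memcpy_count++;
	put_cpu();
}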
c***@sgi.com
2007-11-20 01:12:02 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
crypto/async_tx/async_tx.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

Index: linux-2.6/crypto/async_tx/async_tx.c
===================================================================
--- linux-2.6.orig/crypto/async_tx/async_tx.c 2007-11-15 21:17:23.610404668 -0800
+++ linux-2.6/crypto/async_tx/async_tx.c 2007-11-15 21:25:39.834904080 -0800
@@ -207,10 +207,10 @@ static void async_tx_rebalance(void)
for_each_dma_cap_mask(cap, dma_cap_mask_all)
for_each_possible_cpu(cpu) {
struct dma_chan_ref *ref =
- per_cpu_ptr(channel_table[cap], cpu)->ref;
+ CPU_PTR(channel_table[cap], cpu)->ref;
if (ref) {
atomic_set(&ref->count, 0);
- per_cpu_ptr(channel_table[cap], cpu)->ref =
+ CPU_PTR(channel_table[cap], cpu)->ref =
NULL;
}
}
@@ -223,7 +223,7 @@ static void async_tx_rebalance(void)
else
new = get_chan_ref_by_cap(cap, -1);

- per_cpu_ptr(channel_table[cap], cpu)->ref = new;
+ CPU_PTR(channel_table[cap], cpu)->ref = new;
}

spin_unlock_irqrestore(&async_tx_lock, flags);
@@ -327,7 +327,8 @@ async_tx_init(void)
clear_bit(DMA_INTERRUPT, dma_cap_mask_all.bits);

for_each_dma_cap_mask(cap, dma_cap_mask_all) {
- channel_table[cap] = alloc_percpu(struct chan_ref_percpu);
+ channel_table[cap] = CPU_ALLOC(struct chan_ref_percpu,
+ GFP_KERNEL | __GFP_ZERO);
if (!channel_table[cap])
goto err;
}
@@ -343,7 +344,7 @@ err:
printk(KERN_ERR "async_tx: initialization failure\n");

while (--cap >= 0)
- free_percpu(channel_table[cap]);
+ CPU_FREE(channel_table[cap]);

return 1;
}
@@ -356,7 +357,7 @@ static void __exit async_tx_exit(void)

for_each_dma_cap_mask(cap, dma_cap_mask_all)
if (channel_table[cap])
- free_percpu(channel_table[cap]);
+ CPU_FREE(channel_table[cap]);

dma_async_client_unregister(&async_tx_dma);
}
@@ -378,7 +379,7 @@ async_tx_find_channel(struct dma_async_t
else if (likely(channel_table_initialized)) {
struct dma_chan_ref *ref;
int cpu = get_cpu();
- ref = per_cpu_ptr(channel_table[tx_type], cpu)->ref;
+ ref = CPU_PTR(channel_table[tx_type], cpu)->ref;
put_cpu();
return ref ? ref->chan : NULL;
} else

--
c***@sgi.com
2007-11-20 01:11:53 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
net/ipv4/tcp.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/net/ipv4/tcp.c
===================================================================
--- linux-2.6.orig/net/ipv4/tcp.c 2007-11-15 21:17:24.267654551 -0800
+++ linux-2.6/net/ipv4/tcp.c 2007-11-15 21:25:34.214404334 -0800
@@ -2273,7 +2273,7 @@ static void __tcp_free_md5sig_pool(struc
{
int cpu;
for_each_possible_cpu(cpu) {
- struct tcp_md5sig_pool *p = *per_cpu_ptr(pool, cpu);
+ struct tcp_md5sig_pool *p = *CPU_PTR(pool, cpu);
if (p) {
if (p->md5_desc.tfm)
crypto_free_hash(p->md5_desc.tfm);
@@ -2281,7 +2281,7 @@ static void __tcp_free_md5sig_pool(struc
p = NULL;
}
}
- free_percpu(pool);
+ CPU_FREE(pool);
}

void tcp_free_md5sig_pool(void)
@@ -2305,7 +2305,7 @@ static struct tcp_md5sig_pool **__tcp_al
int cpu;
struct tcp_md5sig_pool **pool;

- pool = alloc_percpu(struct tcp_md5sig_pool *);
+ pool = CPU_ALLOC(struct tcp_md5sig_pool *, GFP_KERNEL);
if (!pool)
return NULL;

@@ -2316,7 +2316,7 @@ static struct tcp_md5sig_pool **__tcp_al
p = kzalloc(sizeof(*p), GFP_KERNEL);
if (!p)
goto out_free;
- *per_cpu_ptr(pool, cpu) = p;
+ *CPU_PTR(pool, cpu) = p;

hash = crypto_alloc_hash("md5", 0, CRYPTO_ALG_ASYNC);
if (!hash || IS_ERR(hash))
@@ -2381,7 +2381,7 @@ struct tcp_md5sig_pool *__tcp_get_md5sig
if (p)
tcp_md5sig_users++;
spin_unlock_bh(&tcp_md5sig_pool_lock);
- return (p ? *per_cpu_ptr(p, cpu) : NULL);
+ return (p ? *CPU_PTR(p, cpu) : NULL);
}

EXPORT_SYMBOL(__tcp_get_md5sig_pool);

--
c***@sgi.com
2007-11-20 01:11:59 UTC
Permalink
Use the cpu alloc functions for the mib handling in the net layer. The
API for snmp_mib_free() is changed to take a size parameter since
cpu_free() requires the size of the allocation.
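
For clarity, the caller side change looks like this (example taken from the
hunks below):

	/* before: the per cpu object size was implicit */
	snmp_mib_free((void **)udp_statistics);

	/* after: the caller passes the size so cpu_free() knows it */
	snmp_mib_free((void **)udp_statistics, sizeof(struct udp_mib));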

Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/net/ip.h | 2 +-
include/net/snmp.h | 15 +++++++--------
net/dccp/proto.c | 12 +++++++-----
net/ipv4/af_inet.c | 31 +++++++++++++++++--------------
net/ipv6/addrconf.c | 10 +++++-----
net/ipv6/af_inet6.c | 18 +++++++++---------
net/sctp/proc.c | 4 ++--
net/sctp/protocol.c | 12 +++++++-----
8 files changed, 55 insertions(+), 49 deletions(-)

Index: linux-2.6/include/net/ip.h
===================================================================
--- linux-2.6.orig/include/net/ip.h 2007-11-18 14:38:23.385283253 -0800
+++ linux-2.6/include/net/ip.h 2007-11-18 21:55:27.684797168 -0800
@@ -170,7 +170,7 @@ DECLARE_SNMP_STAT(struct linux_mib, net_

extern unsigned long snmp_fold_field(void *mib[], int offt);
extern int snmp_mib_init(void *ptr[2], size_t mibsize, size_t mibalign);
-extern void snmp_mib_free(void *ptr[2]);
+extern void snmp_mib_free(void *ptr[2], size_t mibsize);

extern void inet_get_local_port_range(int *low, int *high);

Index: linux-2.6/include/net/snmp.h
===================================================================
--- linux-2.6.orig/include/net/snmp.h 2007-11-18 14:38:23.393283116 -0800
+++ linux-2.6/include/net/snmp.h 2007-11-18 22:04:47.477773921 -0800
@@ -132,19 +132,18 @@ struct linux_mib {
#define SNMP_STAT_BHPTR(name) (name[0])
#define SNMP_STAT_USRPTR(name) (name[1])

-#define SNMP_INC_STATS_BH(mib, field) \
- (per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field]++)
+#define SNMP_INC_STATS_BH(mib, field) __CPU_INC(mib[0]->mibs[field])
#define SNMP_INC_STATS_OFFSET_BH(mib, field, offset) \
- (per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field + (offset)]++)
+ __CPU_INC(mib[0]->mibs[field + (offset)])
#define SNMP_INC_STATS_USER(mib, field) \
- (per_cpu_ptr(mib[1], raw_smp_processor_id())->mibs[field]++)
+ __CPU_INC(mib[1]->mibs[field])
#define SNMP_INC_STATS(mib, field) \
- (per_cpu_ptr(mib[!in_softirq()], raw_smp_processor_id())->mibs[field]++)
+ __CPU_INC(mib[!in_softirq()]->mibs[field])
#define SNMP_DEC_STATS(mib, field) \
- (per_cpu_ptr(mib[!in_softirq()], raw_smp_processor_id())->mibs[field]--)
+ __CPU_DEC(mib[!in_softirq()]->mibs[field])
#define SNMP_ADD_STATS_BH(mib, field, addend) \
- (per_cpu_ptr(mib[0], raw_smp_processor_id())->mibs[field] += addend)
+ __CPU_ADD(mib[0]->mibs[field], addend)
#define SNMP_ADD_STATS_USER(mib, field, addend) \
- (per_cpu_ptr(mib[1], raw_smp_processor_id())->mibs[field] += addend)
+ __CPU_ADD(mib[1]->mibs[field], addend)

#endif
Index: linux-2.6/net/dccp/proto.c
===================================================================
--- linux-2.6.orig/net/dccp/proto.c 2007-11-18 14:38:23.397283534 -0800
+++ linux-2.6/net/dccp/proto.c 2007-11-18 21:55:27.704797139 -0800
@@ -990,11 +990,13 @@ static int __init dccp_mib_init(void)
{
int rc = -ENOMEM;

- dccp_statistics[0] = alloc_percpu(struct dccp_mib);
+ dccp_statistics[0] = CPU_ALLOC(struct dccp_mib,
+ GFP_KERNEL | __GFP_ZERO);
if (dccp_statistics[0] == NULL)
goto out;

- dccp_statistics[1] = alloc_percpu(struct dccp_mib);
+ dccp_statistics[1] = CPU_ALLOC(struct dccp_mib,
+ GFP_KERNEL | __GFP_ZERO);
if (dccp_statistics[1] == NULL)
goto out_free_one;

@@ -1002,7 +1004,7 @@ static int __init dccp_mib_init(void)
out:
return rc;
out_free_one:
- free_percpu(dccp_statistics[0]);
+ CPU_FREE(dccp_statistics[0]);
dccp_statistics[0] = NULL;
goto out;

@@ -1010,8 +1012,8 @@ out_free_one:

static void dccp_mib_exit(void)
{
- free_percpu(dccp_statistics[0]);
- free_percpu(dccp_statistics[1]);
+ CPU_FREE(dccp_statistics[0]);
+ CPU_FREE(dccp_statistics[1]);
dccp_statistics[0] = dccp_statistics[1] = NULL;
}

Index: linux-2.6/net/ipv4/af_inet.c
===================================================================
--- linux-2.6.orig/net/ipv4/af_inet.c 2007-11-18 14:38:23.405283096 -0800
+++ linux-2.6/net/ipv4/af_inet.c 2007-11-18 21:55:27.725047497 -0800
@@ -1230,8 +1230,8 @@ unsigned long snmp_fold_field(void *mib[
int i;

for_each_possible_cpu(i) {
- res += *(((unsigned long *) per_cpu_ptr(mib[0], i)) + offt);
- res += *(((unsigned long *) per_cpu_ptr(mib[1], i)) + offt);
+ res += *(((unsigned long *) CPU_PTR(mib[0], i)) + offt);
+ res += *(((unsigned long *) CPU_PTR(mib[1], i)) + offt);
}
return res;
}
@@ -1240,26 +1240,28 @@ EXPORT_SYMBOL_GPL(snmp_fold_field);
int snmp_mib_init(void *ptr[2], size_t mibsize, size_t mibalign)
{
BUG_ON(ptr == NULL);
- ptr[0] = __alloc_percpu(mibsize);
+ ptr[0] = cpu_alloc(mibsize, GFP_KERNEL | __GFP_ZERO,
+ mibalign);
if (!ptr[0])
goto err0;
- ptr[1] = __alloc_percpu(mibsize);
+ ptr[1] = cpu_alloc(mibsize, GFP_KERNEL | __GFP_ZERO,
+ mibalign);
if (!ptr[1])
goto err1;
return 0;
err1:
- free_percpu(ptr[0]);
+ cpu_free(ptr[0], mibsize);
ptr[0] = NULL;
err0:
return -ENOMEM;
}
EXPORT_SYMBOL_GPL(snmp_mib_init);

-void snmp_mib_free(void *ptr[2])
+void snmp_mib_free(void *ptr[2], size_t mibsize)
{
BUG_ON(ptr == NULL);
- free_percpu(ptr[0]);
- free_percpu(ptr[1]);
+ cpu_free(ptr[0], mibsize);
+ cpu_free(ptr[1], mibsize);
ptr[0] = ptr[1] = NULL;
}
EXPORT_SYMBOL_GPL(snmp_mib_free);
@@ -1324,17 +1326,18 @@ static int __init init_ipv4_mibs(void)
return 0;

err_udplite_mib:
- snmp_mib_free((void **)udp_statistics);
+ snmp_mib_free((void **)udp_statistics, sizeof(struct udp_mib));
err_udp_mib:
- snmp_mib_free((void **)tcp_statistics);
+ snmp_mib_free((void **)tcp_statistics, sizeof(struct tcp_mib));
err_tcp_mib:
- snmp_mib_free((void **)icmpmsg_statistics);
+ snmp_mib_free((void **)icmpmsg_statistics,
+ sizeof(struct icmpmsg_mib));
err_icmpmsg_mib:
- snmp_mib_free((void **)icmp_statistics);
+ snmp_mib_free((void **)icmp_statistics, sizeof(struct icmp_mib));
err_icmp_mib:
- snmp_mib_free((void **)ip_statistics);
+ snmp_mib_free((void **)ip_statistics, sizeof(struct ipstats_mib));
err_ip_mib:
- snmp_mib_free((void **)net_statistics);
+ snmp_mib_free((void **)net_statistics, sizeof(struct linux_mib));
err_net_mib:
return -ENOMEM;
}
Index: linux-2.6/net/ipv6/addrconf.c
===================================================================
--- linux-2.6.orig/net/ipv6/addrconf.c 2007-11-18 14:38:23.413283686 -0800
+++ linux-2.6/net/ipv6/addrconf.c 2007-11-18 21:55:27.749296774 -0800
@@ -271,18 +271,18 @@ static int snmp6_alloc_dev(struct inet6_
return 0;

err_icmpmsg:
- snmp_mib_free((void **)idev->stats.icmpv6);
+ snmp_mib_free((void **)idev->stats.icmpv6, sizeof(struct icmpv6_mib));
err_icmp:
- snmp_mib_free((void **)idev->stats.ipv6);
+ snmp_mib_free((void **)idev->stats.ipv6, sizeof(struct ipstats_mib));
err_ip:
return -ENOMEM;
}

static void snmp6_free_dev(struct inet6_dev *idev)
{
- snmp_mib_free((void **)idev->stats.icmpv6msg);
- snmp_mib_free((void **)idev->stats.icmpv6);
- snmp_mib_free((void **)idev->stats.ipv6);
+ snmp_mib_free((void **)idev->stats.icmpv6msg, sizeof(struct icmpv6_mib));
+ snmp_mib_free((void **)idev->stats.icmpv6, sizeof(struct icmpv6_mib));
+ snmp_mib_free((void **)idev->stats.ipv6, sizeof(struct ipstats_mib));
}

/* Nobody refers to this device, we may destroy it. */
Index: linux-2.6/net/ipv6/af_inet6.c
===================================================================
--- linux-2.6.orig/net/ipv6/af_inet6.c 2007-11-18 14:38:23.417283064 -0800
+++ linux-2.6/net/ipv6/af_inet6.c 2007-11-18 21:55:27.756797042 -0800
@@ -731,13 +731,13 @@ static int __init init_ipv6_mibs(void)
return 0;

err_udplite_mib:
- snmp_mib_free((void **)udp_stats_in6);
+ snmp_mib_free((void **)udp_stats_in6, sizeof(struct udp_mib));
err_udp_mib:
- snmp_mib_free((void **)icmpv6msg_statistics);
+ snmp_mib_free((void **)icmpv6msg_statistics, sizeof(struct icmpv6_mib));
err_icmpmsg_mib:
- snmp_mib_free((void **)icmpv6_statistics);
+ snmp_mib_free((void **)icmpv6_statistics, sizeof(struct icmpv6_mib));
err_icmp_mib:
- snmp_mib_free((void **)ipv6_statistics);
+ snmp_mib_free((void **)ipv6_statistics, sizeof(struct ipstats_mib));
err_ip_mib:
return -ENOMEM;

@@ -745,11 +745,11 @@ err_ip_mib:

static void cleanup_ipv6_mibs(void)
{
- snmp_mib_free((void **)ipv6_statistics);
- snmp_mib_free((void **)icmpv6_statistics);
- snmp_mib_free((void **)icmpv6msg_statistics);
- snmp_mib_free((void **)udp_stats_in6);
- snmp_mib_free((void **)udplite_stats_in6);
+ snmp_mib_free((void **)ipv6_statistics, sizeof(struct ipstats_mib));
+ snmp_mib_free((void **)icmpv6_statistics, sizeof(struct icmpv6_mib));
+ snmp_mib_free((void **)icmpv6msg_statistics, sizeof(struct icmpv6_mib));
+ snmp_mib_free((void **)udp_stats_in6, sizeof(struct udp_mib));
+ snmp_mib_free((void **)udplite_stats_in6, sizeof(struct udp_mib));
}

static int __init inet6_init(void)
Index: linux-2.6/net/sctp/proc.c
===================================================================
--- linux-2.6.orig/net/sctp/proc.c 2007-11-18 14:38:23.425283197 -0800
+++ linux-2.6/net/sctp/proc.c 2007-11-18 21:55:27.772797109 -0800
@@ -86,10 +86,10 @@ fold_field(void *mib[], int nr)

for_each_possible_cpu(i) {
res +=
- *((unsigned long *) (((void *) per_cpu_ptr(mib[0], i)) +
+ *((unsigned long *) (((void *)CPU_PTR(mib[0], i)) +
sizeof (unsigned long) * nr));
res +=
- *((unsigned long *) (((void *) per_cpu_ptr(mib[1], i)) +
+ *((unsigned long *) (((void *)CPU_PTR(mib[1], i)) +
sizeof (unsigned long) * nr));
}
return res;
Index: linux-2.6/net/sctp/protocol.c
===================================================================
--- linux-2.6.orig/net/sctp/protocol.c 2007-11-18 14:38:23.433283093 -0800
+++ linux-2.6/net/sctp/protocol.c 2007-11-18 21:55:27.784297095 -0800
@@ -970,12 +970,14 @@ int sctp_register_pf(struct sctp_pf *pf,

static int __init init_sctp_mibs(void)
{
- sctp_statistics[0] = alloc_percpu(struct sctp_mib);
+ sctp_statistics[0] = CPU_ALLOC(struct sctp_mib,
+ GFP_KERNEL | __GFP_ZERO);
if (!sctp_statistics[0])
return -ENOMEM;
- sctp_statistics[1] = alloc_percpu(struct sctp_mib);
+ sctp_statistics[1] = CPU_ALLOC(struct sctp_mib,
+ GFP_KERNEL | __GFP_ZERO);
if (!sctp_statistics[1]) {
- free_percpu(sctp_statistics[0]);
+ CPU_FREE(sctp_statistics[0]);
return -ENOMEM;
}
return 0;
@@ -984,8 +986,8 @@ static int __init init_sctp_mibs(void)

static void cleanup_sctp_mibs(void)
{
- free_percpu(sctp_statistics[0]);
- free_percpu(sctp_statistics[1]);
+ CPU_FREE(sctp_statistics[0]);
+ CPU_FREE(sctp_statistics[1]);
}

/* Initialize the universe into something sensible. */

--
c***@sgi.com
2007-11-20 01:12:00 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
net/core/sock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c 2007-11-18 14:38:23.290283038 -0800
+++ linux-2.6/net/core/sock.c 2007-11-18 22:04:47.561605283 -0800
@@ -1809,21 +1809,21 @@ static LIST_HEAD(proto_list);
*/
static void inuse_add(struct proto *prot, int inc)
{
- per_cpu_ptr(prot->inuse_ptr, smp_processor_id())[0] += inc;
+ __CPU_ADD(prot->inuse_ptr[0], inc);
}

static int inuse_get(const struct proto *prot)
{
int res = 0, cpu;
for_each_possible_cpu(cpu)
- res += per_cpu_ptr(prot->inuse_ptr, cpu)[0];
+ res += CPU_PTR(prot->inuse_ptr, cpu)[0];
return res;
}

static int inuse_init(struct proto *prot)
{
if (!prot->inuse_getval || !prot->inuse_add) {
- prot->inuse_ptr = alloc_percpu(int);
+ prot->inuse_ptr = CPU_ALLOC(int, GFP_KERNEL);
if (prot->inuse_ptr == NULL)
return -ENOBUFS;

@@ -1836,7 +1836,7 @@ static int inuse_init(struct proto *prot
static void inuse_fini(struct proto *prot)
{
if (prot->inuse_ptr != NULL) {
- free_percpu(prot->inuse_ptr);
+ CPU_FREE(prot->inuse_ptr);
prot->inuse_ptr = NULL;
prot->inuse_getval = NULL;
prot->inuse_add = NULL;

--
c***@sgi.com
2007-11-20 01:12:01 UTC
Permalink
Signed-off-by: Christoph Lameter <***@sgi.com>
---
drivers/infiniband/hw/ehca/ehca_irq.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

Index: linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_irq.c 2007-11-15 21:17:23.663404239 -0800
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_irq.c 2007-11-15 21:25:39.310404188 -0800
@@ -646,7 +646,7 @@ static void queue_comp_task(struct ehca_
cpu_id = find_next_online_cpu(pool);
BUG_ON(!cpu_online(cpu_id));

- cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
+ cct = CPU_PTR(pool->cpu_comp_tasks, cpu_id);
BUG_ON(!cct);

spin_lock_irqsave(&cct->task_lock, flags);
@@ -654,7 +654,7 @@ static void queue_comp_task(struct ehca_
spin_unlock_irqrestore(&cct->task_lock, flags);
if (cq_jobs > 0) {
cpu_id = find_next_online_cpu(pool);
- cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
+ cct = CPU_PTR(pool->cpu_comp_tasks, cpu_id);
BUG_ON(!cct);
}

@@ -727,7 +727,7 @@ static struct task_struct *create_comp_t
{
struct ehca_cpu_comp_task *cct;

- cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+ cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
spin_lock_init(&cct->task_lock);
INIT_LIST_HEAD(&cct->cq_list);
init_waitqueue_head(&cct->wait_queue);
@@ -743,7 +743,7 @@ static void destroy_comp_task(struct ehc
struct task_struct *task;
unsigned long flags_cct;

- cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+ cct = CPU_PTR(pool->cpu_comp_tasks, cpu);

spin_lock_irqsave(&cct->task_lock, flags_cct);

@@ -759,7 +759,7 @@ static void destroy_comp_task(struct ehc

static void __cpuinit take_over_work(struct ehca_comp_pool *pool, int cpu)
{
- struct ehca_cpu_comp_task *cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+ struct ehca_cpu_comp_task *cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
LIST_HEAD(list);
struct ehca_cq *cq;
unsigned long flags_cct;
@@ -772,8 +772,7 @@ static void __cpuinit take_over_work(str
cq = list_entry(cct->cq_list.next, struct ehca_cq, entry);

list_del(&cq->entry);
- __queue_comp_task(cq, per_cpu_ptr(pool->cpu_comp_tasks,
- smp_processor_id()));
+ __queue_comp_task(cq, THIS_CPU(pool->cpu_comp_tasks));
}

spin_unlock_irqrestore(&cct->task_lock, flags_cct);
@@ -799,14 +798,14 @@ static int __cpuinit comp_pool_callback(
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
ehca_gen_dbg("CPU: %x (CPU_CANCELED)", cpu);
- cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+ cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
kthread_bind(cct->task, any_online_cpu(cpu_online_map));
destroy_comp_task(pool, cpu);
break;
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
ehca_gen_dbg("CPU: %x (CPU_ONLINE)", cpu);
- cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu);
+ cct = CPU_PTR(pool->cpu_comp_tasks, cpu);
kthread_bind(cct->task, cpu);
wake_up_process(cct->task);
break;
@@ -849,7 +848,8 @@ int ehca_create_comp_pool(void)
spin_lock_init(&pool->last_cpu_lock);
pool->last_cpu = any_online_cpu(cpu_online_map);

- pool->cpu_comp_tasks = alloc_percpu(struct ehca_cpu_comp_task);
+ pool->cpu_comp_tasks = CPU_ALLOC(struct ehca_cpu_comp_task,
+ GFP_KERNEL | __GFP_ZERO);
if (pool->cpu_comp_tasks == NULL) {
kfree(pool);
return -EINVAL;
@@ -883,6 +883,6 @@ void ehca_destroy_comp_pool(void)
if (cpu_online(i))
destroy_comp_task(pool, i);
}
- free_percpu(pool->cpu_comp_tasks);
+ CPU_FREE(pool->cpu_comp_tasks);
kfree(pool);
}

--
c***@sgi.com
2007-11-20 01:12:05 UTC
Permalink
These are critical fast paths. Using a segment override instead of an address
calculation reduces overhead.

Signed-off-by: Christoph Lameter <***@sgi.com>
---
arch/x86/kernel/nmi_64.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/kernel/nmi_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/nmi_64.c 2007-11-19 15:38:59.522056618 -0800
+++ linux-2.6/arch/x86/kernel/nmi_64.c 2007-11-19 15:40:04.481556226 -0800
@@ -291,7 +291,7 @@ void stop_apic_nmi_watchdog(void *unused
*/

static DEFINE_PER_CPU(unsigned, last_irq_sum);
-static DEFINE_PER_CPU(local_t, alert_counter);
+static DEFINE_PER_CPU(int, alert_counter);
static DEFINE_PER_CPU(int, nmi_touch);

void touch_nmi_watchdog(void)
@@ -355,13 +355,13 @@ int __kprobes nmi_watchdog_tick(struct p
* Ayiee, looks like this CPU is stuck ...
* wait a few IRQs (5 seconds) before doing the oops ...
*/
- local_inc(&__get_cpu_var(alert_counter));
- if (local_read(&__get_cpu_var(alert_counter)) == 5*nmi_hz)
+ CPU_INC(get_cpu_var(alert_counter));
+ if (CPU_READ(get_cpu_var(alert_counter)) == 5*nmi_hz)
die_nmi("NMI Watchdog detected LOCKUP on CPU %d\n", regs,
panic_on_timeout);
} else {
__get_cpu_var(last_irq_sum) = sum;
- local_set(&__get_cpu_var(alert_counter), 0);
+ CPU_WRITE(get_cpu_var(alert_counter), 0);
}

/* see if the nmi watchdog went off */

--
c***@sgi.com
2007-11-20 01:12:06 UTC
Permalink
Use boot_cpu_alloc to allocate a cpu area chunk that is needed to store the
statically declared per cpu data and then point the per_cpu_offset pointers
to the cpu area.

The per cpu area is moved to a ZERO offset using some linker scripting.
Every per cpu variable's address then becomes a true offset of that
variable within a cpu area. The addresses of per cpu variables can
therefore be treated like the offsets that are returned by CPU_ALLOC.
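
A rough sketch of what this means for code using static per cpu variables
(the variable name and the accessor use below are only for illustration,
they are not part of this patch):

DEFINE_PER_CPU(unsigned long, example_counter);

static unsigned long read_example(int cpu)
{
	/*
	 * With the section based at zero, &per_cpu__example_counter is a
	 * plain offset into each cpu area, so it can be handed to the same
	 * accessors that work on cpu_alloc offsets.
	 */
	return *CPU_PTR(&per_cpu__example_counter, cpu);
}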

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/x86/kernel/setup64.c | 29 ++++++++++++-----------------
arch/x86/kernel/vmlinux_64.lds.S | 3 ++-
include/asm-generic/sections.h | 1 +
include/asm-generic/vmlinux.lds.h | 17 +++++++++++++++++
4 files changed, 32 insertions(+), 18 deletions(-)

Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c 2007-11-18 22:39:24.706247819 -0800
+++ linux-2.6/arch/x86/kernel/setup64.c 2007-11-19 10:31:49.088824106 -0800
@@ -87,35 +87,30 @@ __setup("noexec32=", nonx32_setup);
void __init setup_per_cpu_areas(void)
{
int i;
- unsigned long size;
+ char *base;

#ifdef CONFIG_HOTPLUG_CPU
prefill_possible_map();
#endif

/* Copy section for each CPU (we discard the original) */
- size = PERCPU_ENOUGH_ROOM;
+ base = boot_cpu_alloc(PERCPU_ENOUGH_ROOM);
+ if (base)
+ panic("Cannot allocate per cpu data at 0\n");

- printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n", size);
+ printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n",
+ PERCPU_ENOUGH_ROOM);
for_each_cpu_mask (i, cpu_possible_map) {
- char *ptr;
+ cpu_pda(i)->data_offset = cpu_offset(i);

- if (!NODE_DATA(cpu_to_node(i))) {
- printk("cpu with no node %d, num_online_nodes %d\n",
- i, num_online_nodes());
- ptr = alloc_bootmem_pages(size);
- } else {
- ptr = alloc_bootmem_pages_node(NODE_DATA(cpu_to_node(i)), size);
- }
- if (!ptr)
- panic("Cannot allocate cpu data for CPU %d\n", i);
- cpu_pda(i)->data_offset = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+ memcpy(CPU_PTR(base, i), __load_per_cpu_start,
+ __per_cpu_end - __per_cpu_start);
}
-}
+ count_vm_events(CPU_BYTES, PERCPU_ENOUGH_ROOM);
+}

void pda_init(int cpu)
-{
+{
struct x8664_pda *pda = cpu_pda(cpu);

/* Setup up data that may be needed in __get_free_pages early */
Index: linux-2.6/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6.orig/include/asm-generic/vmlinux.lds.h 2007-11-18 22:39:23.370497641 -0800
+++ linux-2.6/include/asm-generic/vmlinux.lds.h 2007-11-19 10:30:20.095586336 -0800
@@ -258,8 +258,25 @@
#define PERCPU(align) \
. = ALIGN(align); \
__per_cpu_start = .; \
+ __load_per_cpu_start = .; \
.data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \
*(.data.percpu) \
*(.data.percpu.shared_aligned) \
} \
+ __load_per_cpu_end = .; \
__per_cpu_end = .;
+
+#define ZERO_BASED_PERCPU \
+ percpu : { } :percpu \
+ __load_per_cpu_start = .; \
+ .data.percpu 0 : AT(__load_per_cpu_start - LOAD_OFFSET) { \
+ __per_cpu_start = .; \
+ *(.data.percpu) \
+ *(.data.percpu.shared_aligned) \
+ __per_cpu_end = .; \
+ } \
+ . = __load_per_cpu_start + __per_cpu_end - __per_cpu_start; \
+ __load_per_cpu_end = .; \
+ data : { } :data
+
+
Index: linux-2.6/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/vmlinux_64.lds.S 2007-11-18 22:41:39.261997209 -0800
+++ linux-2.6/arch/x86/kernel/vmlinux_64.lds.S 2007-11-18 22:42:33.119247430 -0800
@@ -19,6 +19,7 @@ cpu_area = CPU_AREA_BASE;

PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
+ percpu PT_LOAD FLAGS(4); /* R__ */
data PT_LOAD FLAGS(7); /* RWE */
user PT_LOAD FLAGS(7); /* RWE */
data.init PT_LOAD FLAGS(7); /* RWE */
@@ -206,7 +207,7 @@ SECTIONS
__initramfs_end = .;
#endif

- PERCPU(4096)
+ ZERO_BASED_PERCPU

. = ALIGN(4096);
__init_end = .;
Index: linux-2.6/include/asm-generic/sections.h
===================================================================
--- linux-2.6.orig/include/asm-generic/sections.h 2007-11-18 22:39:23.374497627 -0800
+++ linux-2.6/include/asm-generic/sections.h 2007-11-18 22:45:33.073997351 -0800
@@ -12,6 +12,7 @@ extern char _sextratext[] __attribute__(
extern char _eextratext[] __attribute__((weak));
extern char _end[];
extern char __per_cpu_start[], __per_cpu_end[];
+extern char __load_per_cpu_start[], __load_per_cpu_end[];
extern char __kprobes_text_start[], __kprobes_text_end[];
extern char __initdata_begin[], __initdata_end[];
extern char __start_rodata[], __end_rodata[];

--
c***@sgi.com
2007-11-20 01:12:03 UTC
Permalink
There is no user of allocpercpu left after all the earlier patches were
applied. Remove the code that implements allocpercpu.

Signed-off-by: Christoph Lameter <***@sgi.com>
---
include/linux/percpu.h | 80 ------------------------------
mm/Makefile | 1
mm/allocpercpu.c | 127 -------------------------------------------------
3 files changed, 208 deletions(-)
delete mode 100644 mm/allocpercpu.c

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2007-11-19 16:28:42.577389871 -0800
+++ linux-2.6/include/linux/percpu.h 2007-11-19 16:29:00.029140648 -0800
@@ -33,86 +33,6 @@ DECLARE_PER_CPU(cpumask_t, cpu_mask);
&__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()

-#ifdef CONFIG_SMP
-
-struct percpu_data {
- void *ptrs[NR_CPUS];
-};
-
-#define __percpu_disguise(pdata) (struct percpu_data *)~(unsigned long)(pdata)
-/*
- * Use this to get to a cpu's version of the per-cpu object dynamically
- * allocated. Non-atomic access to the current CPU's version should
- * probably be combined with get_cpu()/put_cpu().
- */
-#define percpu_ptr(ptr, cpu) \
-({ \
- struct percpu_data *__p = __percpu_disguise(ptr); \
- (__typeof__(ptr))__p->ptrs[(cpu)]; \
-})
-
-extern void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu);
-extern void percpu_depopulate(void *__pdata, int cpu);
-extern int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask);
-extern void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask);
-extern void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask);
-extern void percpu_free(void *__pdata);
-
-#else /* CONFIG_SMP */
-
-#define percpu_ptr(ptr, cpu) ({ (void)(cpu); (ptr); })
-
-static inline void percpu_depopulate(void *__pdata, int cpu)
-{
-}
-
-static inline void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
-}
-
-static inline void *percpu_populate(void *__pdata, size_t size, gfp_t gfp,
- int cpu)
-{
- return percpu_ptr(__pdata, cpu);
-}
-
-static inline int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- return 0;
-}
-
-static __always_inline void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
- return kzalloc(size, gfp);
-}
-
-static inline void percpu_free(void *__pdata)
-{
- kfree(__pdata);
-}
-
-#endif /* CONFIG_SMP */
-
-#define percpu_populate_mask(__pdata, size, gfp, mask) \
- __percpu_populate_mask((__pdata), (size), (gfp), &(mask))
-#define percpu_depopulate_mask(__pdata, mask) \
- __percpu_depopulate_mask((__pdata), &(mask))
-#define percpu_alloc_mask(size, gfp, mask) \
- __percpu_alloc_mask((size), (gfp), &(mask))
-
-#define percpu_alloc(size, gfp) percpu_alloc_mask((size), (gfp), cpu_online_map)
-
-/* (legacy) interface for use without CPU hotplug handling */
-
-#define __alloc_percpu(size) percpu_alloc_mask((size), GFP_KERNEL, \
- cpu_possible_map)
-#define alloc_percpu(type) (type *)__alloc_percpu(sizeof(type))
-#define free_percpu(ptr) percpu_free((ptr))
-#define per_cpu_ptr(ptr, cpu) percpu_ptr((ptr), (cpu))
-
-
/*
* cpu allocator definitions
*
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2007-11-19 16:28:42.185140751 -0800
+++ linux-2.6/mm/Makefile 2007-11-19 16:29:00.029140648 -0800
@@ -28,6 +28,5 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
-obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o

Index: linux-2.6/mm/allocpercpu.c
===================================================================
--- linux-2.6.orig/mm/allocpercpu.c 2007-11-19 16:28:01.113890069 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,127 +0,0 @@
-/*
- * linux/mm/allocpercpu.c
- *
- * Separated from slab.c August 11, 2006 Christoph Lameter <***@sgi.com>
- */
-#include <linux/mm.h>
-#include <linux/module.h>
-
-/**
- * percpu_depopulate - depopulate per-cpu data for given cpu
- * @__pdata: per-cpu data to depopulate
- * @cpu: depopulate per-cpu data for this cpu
- *
- * Depopulating per-cpu data for a cpu going offline would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- */
-void percpu_depopulate(void *__pdata, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
-
- kfree(pdata->ptrs[cpu]);
- pdata->ptrs[cpu] = NULL;
-}
-EXPORT_SYMBOL_GPL(percpu_depopulate);
-
-/**
- * percpu_depopulate_mask - depopulate per-cpu data for some cpu's
- * @__pdata: per-cpu data to depopulate
- * @mask: depopulate per-cpu data for cpu's selected through mask bits
- */
-void __percpu_depopulate_mask(void *__pdata, cpumask_t *mask)
-{
- int cpu;
- for_each_cpu_mask(cpu, *mask)
- percpu_depopulate(__pdata, cpu);
-}
-EXPORT_SYMBOL_GPL(__percpu_depopulate_mask);
-
-/**
- * percpu_populate - populate per-cpu data for given cpu
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @cpu: populate per-data for this cpu
- *
- * Populating per-cpu data for a cpu coming online would be a typical
- * use case. You need to register a cpu hotplug handler for that purpose.
- * Per-cpu object is populated with zeroed buffer.
- */
-void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu)
-{
- struct percpu_data *pdata = __percpu_disguise(__pdata);
- int node = cpu_to_node(cpu);
-
- BUG_ON(pdata->ptrs[cpu]);
- if (node_online(node))
- pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node);
- else
- pdata->ptrs[cpu] = kzalloc(size, gfp);
- return pdata->ptrs[cpu];
-}
-EXPORT_SYMBOL_GPL(percpu_populate);
-
-/**
- * percpu_populate_mask - populate per-cpu data for more cpu's
- * @__pdata: per-cpu data to populate further
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-cpu data for cpu's selected through mask bits
- *
- * Per-cpu objects are populated with zeroed buffers.
- */
-int __percpu_populate_mask(void *__pdata, size_t size, gfp_t gfp,
- cpumask_t *mask)
-{
- cpumask_t populated = CPU_MASK_NONE;
- int cpu;
-
- for_each_cpu_mask(cpu, *mask)
- if (unlikely(!percpu_populate(__pdata, size, gfp, cpu))) {
- __percpu_depopulate_mask(__pdata, &populated);
- return -ENOMEM;
- } else
- cpu_set(cpu, populated);
- return 0;
-}
-EXPORT_SYMBOL_GPL(__percpu_populate_mask);
-
-/**
- * percpu_alloc_mask - initial setup of per-cpu data
- * @size: size of per-cpu object
- * @gfp: may sleep or not etc.
- * @mask: populate per-data for cpu's selected through mask bits
- *
- * Populating per-cpu data for all online cpu's would be a typical use case,
- * which is simplified by the percpu_alloc() wrapper.
- * Per-cpu objects are populated with zeroed buffers.
- */
-void *__percpu_alloc_mask(size_t size, gfp_t gfp, cpumask_t *mask)
-{
- void *pdata = kzalloc(sizeof(struct percpu_data), gfp);
- void *__pdata = __percpu_disguise(pdata);
-
- if (unlikely(!pdata))
- return NULL;
- if (likely(!__percpu_populate_mask(__pdata, size, gfp, mask)))
- return __pdata;
- kfree(pdata);
- return NULL;
-}
-EXPORT_SYMBOL_GPL(__percpu_alloc_mask);
-
-/**
- * percpu_free - final cleanup of per-cpu data
- * @__pdata: object to clean up
- *
- * We simply clean up any per-cpu object left. No need for the client to
- * track and specify through a bis mask which per-cpu objects are to free.
- */
-void percpu_free(void *__pdata)
-{
- if (unlikely(!__pdata))
- return;
- __percpu_depopulate_mask(__pdata, &cpu_possible_map);
- kfree(__percpu_disguise(__pdata));
-}
-EXPORT_SYMBOL_GPL(percpu_free);

--
c***@sgi.com
2007-11-20 01:12:08 UTC
Permalink
If we move the pda to the beginning of the cpu area then the gs segment will
also point to the beginning of the cpu area. After this patch we can use gs
on any percpu variable or cpu_alloc pointer from cpu 0 to get to the active
processor's variable. There is no need to add a per cpu offset anymore.

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/x86/kernel/setup64.c | 2 +-
include/asm-generic/vmlinux.lds.h | 2 ++
include/asm-x86/percpu_64.h | 4 ++++
3 files changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c 2007-11-19 14:55:18.546422138 -0800
+++ linux-2.6/arch/x86/kernel/setup64.c 2007-11-19 14:55:41.937923200 -0800
@@ -31,7 +31,7 @@ cpumask_t cpu_initialized __cpuinitdata
struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(_cpu_pda);

-DEFINE_PER_CPU(struct x8664_pda, pda);
+DEFINE_PER_CPU_FIRST(struct x8664_pda, pda);
EXPORT_PER_CPU_SYMBOL(pda);

struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };
Index: linux-2.6/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux-2.6.orig/include/asm-generic/vmlinux.lds.h 2007-11-19 14:53:45.510673454 -0800
+++ linux-2.6/include/asm-generic/vmlinux.lds.h 2007-11-19 14:55:24.554173129 -0800
@@ -260,6 +260,7 @@
__per_cpu_start = .; \
__load_per_cpu_start = .; \
.data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \
+ *(.data.percpu.first) \
*(.data.percpu) \
*(.data.percpu.shared_aligned) \
} \
@@ -271,6 +272,7 @@
__load_per_cpu_start = .; \
.data.percpu 0 : AT(__load_per_cpu_start - LOAD_OFFSET) { \
__per_cpu_start = .; \
+ *(.data.percpu.first) \
*(.data.percpu) \
*(.data.percpu.shared_aligned) \
__per_cpu_end = .; \
Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-11-19 14:53:45.518673338 -0800
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-11-19 14:55:24.554173129 -0800
@@ -25,6 +25,10 @@
__typeof__(type) per_cpu__##name \
____cacheline_internodealigned_in_smp

+#define DEFINE_PER_CPU_FIRST(type, name) \
+ __attribute__((__section__(".data.percpu.first"))) \
+ __typeof__(type) per_cpu__##name
+
/* var is in discarded region: offset to particular copy we want */
#define per_cpu(var, cpu) (*({ \
extern int simple_identifier_##var(void); \

--
c***@sgi.com
2007-11-20 01:12:07 UTC
Permalink
Declare the pda as a per cpu variable. This will have the effect of moving
the pda data into the cpu area managed by cpu alloc.

The boot_pdas are only needed in head64.c so move the declaration
over there and make it static.

Remove the code that allocates special pda data structures.

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/x86/kernel/head64.c | 6 ++++++
arch/x86/kernel/setup64.c | 11 ++++++++++-
arch/x86/kernel/smpboot_64.c | 16 ----------------
include/asm-x86/pda.h | 1 -
include/asm-x86/percpu_64.h | 1 +
5 files changed, 17 insertions(+), 18 deletions(-)

Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c 2007-11-19 16:29:09.045139782 -0800
+++ linux-2.6/arch/x86/kernel/setup64.c 2007-11-19 16:29:15.693140270 -0800
@@ -30,7 +30,9 @@ cpumask_t cpu_initialized __cpuinitdata

struct x8664_pda *_cpu_pda[NR_CPUS] __read_mostly;
EXPORT_SYMBOL(_cpu_pda);
-struct x8664_pda boot_cpu_pda[NR_CPUS] __cacheline_aligned;
+
+DEFINE_PER_CPU(struct x8664_pda, pda);
+EXPORT_PER_CPU_SYMBOL(pda);

struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };

@@ -105,7 +107,13 @@ void __init setup_per_cpu_areas(void)

memcpy(CPU_PTR(base, i), __load_per_cpu_start,
__per_cpu_end - __per_cpu_start);
+
+ /* Relocate the pda */
+ memcpy(&per_cpu(pda, i), cpu_pda(i), sizeof(struct x8664_pda));
+ cpu_pda(i) = &per_cpu(pda, i);
}
+ /* Fix up pda for this processor .... */
+ pda_init(0);
count_vm_events(CPU_BYTES, PERCPU_ENOUGH_ROOM);
}

@@ -120,6 +128,7 @@ void pda_init(int cpu)
wrmsrl(MSR_GS_BASE, pda);
mb();

+ printk(KERN_INFO "Processor #%d: GS for cpu variable access set to %p\n", cpu, pda);
pda->cpunumber = cpu;
pda->irqcount = -1;
pda->kernelstack =
Index: linux-2.6/arch/x86/kernel/smpboot_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot_64.c 2007-11-19 16:28:00.781640288 -0800
+++ linux-2.6/arch/x86/kernel/smpboot_64.c 2007-11-19 16:29:15.693140270 -0800
@@ -556,22 +556,6 @@ static int __cpuinit do_boot_cpu(int cpu
return -1;
}

- /* Allocate node local memory for AP pdas */
- if (cpu_pda(cpu) == &boot_cpu_pda[cpu]) {
- struct x8664_pda *newpda, *pda;
- int node = cpu_to_node(cpu);
- pda = cpu_pda(cpu);
- newpda = kmalloc_node(sizeof (struct x8664_pda), GFP_ATOMIC,
- node);
- if (newpda) {
- memcpy(newpda, pda, sizeof (struct x8664_pda));
- cpu_pda(cpu) = newpda;
- } else
- printk(KERN_ERR
- "Could not allocate node local PDA for CPU %d on node %d\n",
- cpu, node);
- }
-
alternatives_smp_switch(1);

c_idle.idle = get_idle_for_cpu(cpu);
Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c 2007-11-19 16:28:00.789640087 -0800
+++ linux-2.6/arch/x86/kernel/head64.c 2007-11-19 16:29:15.693140270 -0800
@@ -20,6 +20,12 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>

+/*
+ * Only used before the per cpu areas are setup. The use for the non possible
+ * cpus continues after boot
+ */
+static struct x8664_pda boot_cpu_pda[NR_CPUS];
+
static void __init zap_identity_mappings(void)
{
pgd_t *pgd = pgd_offset_k(0UL);
Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h 2007-11-19 16:28:00.801640345 -0800
+++ linux-2.6/include/asm-x86/pda.h 2007-11-19 16:29:15.693140270 -0800
@@ -39,7 +39,6 @@ struct x8664_pda {
} ____cacheline_aligned_in_smp;

extern struct x8664_pda *_cpu_pda[];
-extern struct x8664_pda boot_cpu_pda[];

#define cpu_pda(i) (_cpu_pda[i])

Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-11-19 16:28:33.611190669 -0800
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-11-19 16:29:39.569139671 -0800
@@ -61,6 +61,7 @@ extern void setup_per_cpu_areas(void);
#endif /* SMP */

#define DECLARE_PER_CPU(type, name) extern __typeof__(type) per_cpu__##name
+DECLARE_PER_CPU(struct x8664_pda, pda);

#define EXPORT_PER_CPU_SYMBOL(var) EXPORT_SYMBOL(per_cpu__##var)
#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)

--
c***@sgi.com
2007-11-20 01:12:09 UTC
Support fast cpu ops in x86_64 by providing a series of functions that
generate the proper instructions. Define CONFIG_FAST_CPU_OPS so that core code
can exploit the availability of fast per cpu operations.
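
For illustration only (the counter and helper below are made up, not part
of the patch), core code can then do things like:

	#include <linux/percpu.h>

	DEFINE_PER_CPU(unsigned long, nr_example_events);

	static inline void count_example_event(void)
	{
		/* compiles to a single "incq %gs:..." on x86_64, so no
		 * get_cpu()/put_cpu() bracketing is needed */
		_CPU_INC(per_cpu__nr_example_events);
	}

Because the instruction is atomic with respect to interrupts on the local
cpu, the same helper can be used from process, softirq and irq context.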

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/x86/Kconfig | 4
include/asm-x86/percpu_64.h | 262 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 266 insertions(+)

Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig 2007-11-19 16:16:03.458140098 -0800
+++ linux-2.6/arch/x86/Kconfig 2007-11-19 16:17:17.473389874 -0800
@@ -137,6 +137,10 @@ config GENERIC_PENDING_IRQ
depends on GENERIC_HARDIRQS && SMP
default y

+config FAST_CPU_OPS
+ bool
+ default y
+
config X86_SMP
bool
depends on X86_32 && SMP && !X86_VOYAGER
Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-11-19 16:17:16.953139798 -0800
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-11-19 16:17:17.473389874 -0800
@@ -71,4 +71,266 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
#define EXPORT_PER_CPU_SYMBOL_GPL(var) EXPORT_SYMBOL_GPL(per_cpu__##var)


+#define __xp(x) ((volatile unsigned long *)(x))
+
+static inline unsigned long __cpu_read_gs(volatile void *ptr, int size)
+{
+ unsigned long result;
+ switch (size) {
+ case 1:
+ __asm__ ("mov %%gs:%1, %b0"
+ : "=r"(result)
+ : "m"(*__xp(ptr)));
+ return result;
+ case 2:
+ __asm__ ("movw %%gs:%1, %w0"
+ : "=r"(result)
+ : "m"(*__xp(ptr)));
+ return result;
+ case 4:
+ __asm__ ("movl %%gs:%1, %k0"
+ : "=r"(result)
+ : "m"(*__xp(ptr)));
+ return result;
+ case 8:
+ __asm__ ("movq %%gs:%1, %0"
+ : "=r"(result)
+ : "m"(*__xp(ptr)));
+ return result;
+ }
+ BUG();
+}
+
+#define cpu_read_gs(obj)\
+ ((__typeof__(obj))__cpu_read_gs(&(obj), sizeof(obj)))
+
+static inline void __cpu_write_gs(volatile void *ptr,
+ unsigned long data, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("mov %b0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("movw %w0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("movl %k0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("movq %0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_write_gs(obj, value)\
+ __cpu_write_gs(&(obj), (unsigned long)value, sizeof(obj))
+
+static inline void __cpu_add_gs(volatile void *ptr,
+ long data, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("add %b0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("addw %w0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("addl %k0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("addq %0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_add_gs(obj, value)\
+ __cpu_add_gs(&(obj), (unsigned long)value, sizeof(obj))
+
+static inline void __cpu_sub_gs(volatile void *ptr,
+ long data, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("sub %b0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("subw %w0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("subl %k0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("subq %0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_sub_gs(obj, value)\
+ __cpu_sub_gs(&(obj), (unsigned long)value, sizeof(obj))
+
+static inline void __cpu_xchg_gs(volatile void *ptr,
+ long data, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("xchg %b0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("xchgw %w0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("xchgl %k0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("xchgq %0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_xchg_gs(obj, value)\
+ __cpu_xchg_gs(&(obj), (unsigned long)value, sizeof(obj))
+
+static inline void __cpu_inc_gs(volatile void *ptr, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("incb %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("incw %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("incl %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("incq %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_inc_gs(obj)\
+ __cpu_inc_gs(&(obj), sizeof(obj))
+
+static inline void __cpu_dec_gs(volatile void *ptr, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("decb %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("decw %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("decl %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("decq %%gs:%0"
+ : : "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_dec_gs(obj)\
+ __cpu_dec_gs(&(obj), sizeof(obj))
+
+static inline unsigned long __cmpxchg_local_gs(volatile void *ptr,
+ unsigned long old, unsigned long new, int size)
+{
+ unsigned long prev;
+ switch (size) {
+ case 1:
+ __asm__ ("cmpxchgb %b1, %%gs:%2"
+ : "=a"(prev)
+ : "q"(new), "m"(*__xp(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ case 2:
+ __asm__ ("cmpxchgw %w1, %%gs:%2"
+ : "=a"(prev)
+ : "r"(new), "m"(*__xp(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ case 4:
+ __asm__ ("cmpxchgl %k1, %%gs:%2"
+ : "=a"(prev)
+ : "r"(new), "m"(*__xp(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ case 8:
+ __asm__ ("cmpxchgq %1, %%gs:%2"
+ : "=a"(prev)
+ : "r"(new), "m"(*__xp(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ }
+ return old;
+}
+
+#define cmpxchg_local_gs(obj, o, n)\
+ ((__typeof__(obj))__cmpxchg_local_gs(&(obj),(unsigned long)(o),\
+ (unsigned long)(n),sizeof(obj)))
+
+#define CPU_READ(obj) cpu_read_gs(obj)
+#define CPU_WRITE(obj,val) cpu_write_gs(obj, val)
+#define CPU_ADD(obj,val) cpu_add_gs(obj, val)
+#define CPU_SUB(obj,val) cpu_sub_gs(obj, val)
+#define CPU_INC(obj) cpu_inc_gs(obj)
+#define CPU_DEC(obj) cpu_dec_gs(obj)
+
+#define CPU_XCHG(obj,val) cpu_xchg_gs(obj, val)
+#define CPU_CMPXCHG(obj, old, new) cmpxchg_local_gs(obj, old, new)
+
+/*
+ * All cpu operations are interrupt safe and do not need to disable
+ * preempt. So the other variants all reduce to the same instruction.
+ */
+#define _CPU_READ CPU_READ
+#define _CPU_WRITE CPU_WRITE
+#define _CPU_ADD CPU_ADD
+#define _CPU_SUB CPU_SUB
+#define _CPU_INC CPU_INC
+#define _CPU_DEC CPU_DEC
+#define _CPU_XCHG CPU_XCHG
+#define _CPU_CMPXCHG CPU_CMPXCHG
+
+#define __CPU_READ CPU_READ
+#define __CPU_WRITE CPU_WRITE
+#define __CPU_ADD CPU_ADD
+#define __CPU_SUB CPU_SUB
+#define __CPU_INC CPU_INC
+#define __CPU_DEC CPU_DEC
+#define __CPU_XCHG CPU_XCHG
+#define __CPU_CMPXCHG CPU_CMPXCHG
+
#endif /* _ASM_X8664_PERCPU_H_ */

--
H. Peter Anvin
2007-11-20 02:00:23 UTC
Permalink
Post by c***@sgi.com
Support fast cpu ops in x86_64 by providing a series of functions that
generate the proper instructions. Define CONFIG_FAST_CPU_OPS so that core code
can exploit the availability of fast per cpu operations.
There was, at some point, discussion about using the gcc TLS mechanism,
which should permit even better code to be generated. Unfortunately, it
would require gcc to be able to reference %gs instead of %fs (and vice
versa for i386), which I don't think is available in anything except
maybe the most cutting-edge version of gcc.

However, if we're doing a massive revamp it would be good to get an
idea of how to migrate to that model eventually, or why it doesn't make
sense at all.

-hpa
Christoph Lameter
2007-11-20 02:03:56 UTC
Post by H. Peter Anvin
There was, at some point, discussion about using the gcc TLS mechanism, which
should permit even better code to be generated. Unfortunately, it would
How would that be possible? Oh. You mean the discussion where I mentioned
using the thread attribute?
Post by H. Peter Anvin
require gcc to be able to reference %gs instead of %fs (and vice versa for
i386), which I don't think is available in anything except maybe the most
cutting-edge version of gcc.
Right. That is why we do it in ASM here.
Post by H. Peter Anvin
However, if we're doing a massive revamp it would be good to get an idea of
how to migrate to that model eventually, or why it doesn't make sense at all.
If you can tell me what the difference would be then we can discuss it.
AFAICT there is no difference. Both use a segment register.
H. Peter Anvin
2007-11-20 02:15:59 UTC
Post by Christoph Lameter
There was, at some point, discussion about using the gcc TLS mechanism, which
should permit even better code to be generated. Unfortunately, it would
How would that be possible? Oh. You mean the discussion where I mentioned
using the thread attribute?
require gcc to be able to reference %gs instead of %fs (and vice versa for
i386), which I don't think is available in anything except maybe the most
cutting-edge version of gcc.
Right. That is why we do it in ASM here.
However, if we're doing a massive revamp it would be good to get an idea of
how to migrate to that model eventually, or why it doesn't make sense at all.
If you can tell me what the difference would be then we can discuss it.
AFAICT there is no difference. Both use a segment register.
As far as I can tell from a *very* brief look at your code (which means
I might have misread it), these are the differences:

- gcc uses %fs:0 to contain a pointer to itself.
- gcc uses absolute offsets from the thread pointer, rather than adding
%rip. The %rip-based form is actually more efficient, but it does
affect the usable range off the base pointer.

-hpa
David Miller
2007-11-20 02:17:20 UTC
From: "H. Peter Anvin" <***@zytor.com>
Date: Mon, 19 Nov 2007 18:00:23 -0800
Post by H. Peter Anvin
Post by c***@sgi.com
Support fast cpu ops in x86_64 by providing a series of functions that
generate the proper instructions. Define CONFIG_FAST_CPU_OPS so that core code
can exploit the availability of fast per cpu operations.
There was, at some point, discussion about using the gcc TLS mechanism,
which should permit even better code to be generated. Unfortunately, it
would require gcc to be able to reference %gs instead of %fs (and vice
versa for i386), which I don't think is available in anything except
maybe the most cutting-edge version of gcc.
You can't use __thread because GCC will cache __thread computed
addresses across context switches and cpu changes.

It's been tried before on powerpc, it doesn't work.
H. Peter Anvin
2007-11-20 02:19:33 UTC
Post by David Miller
Post by H. Peter Anvin
There was, at some point, discussion about using the gcc TLS mechanism,
which should permit even better code to be generated. Unfortunately, it
would require gcc to be able to reference %gs instead of %fs (and vice
versa for i386), which I don't think is available in anything except
maybe the most cutting-edge version of gcc.
You can't use __thread because GCC will cache __thread computed
addresses across context switches and cpu changes.
It's been tried before on powerpc, it doesn't work.
OK, that pretty much answers that question.

-hpa
Andi Kleen
2007-11-20 03:23:07 UTC
Post by H. Peter Anvin
Post by David Miller
Post by H. Peter Anvin
There was, at some point, discussion about using the gcc TLS mechanism,
which should permit even better code to be generated. Unfortunately, it
would require gcc to be able to reference %gs instead of %fs (and vice
versa for i386), which I don't think is available in anything except
maybe the most cutting-edge version of gcc.
You can't use __thread because GCC will cache __thread computed
addresses across context switches and cpu changes.
It's been tried before on powerpc, it doesn't work.
OK, that pretty much answers that question.
I investigated that some time ago.

There are other obstacles too on x86-64, e.g. the relocations
are wrong for kernel mode. You would need to extend the linker first.

-Andi
Paul Mackerras
2007-11-20 02:45:44 UTC
Post by H. Peter Anvin
There was, at some point, discussion about using the gcc TLS mechanism,
which should permit even better code to be generated. Unfortunately, it
would require gcc to be able to reference %gs instead of %fs (and vice
versa for i386), which I don't think is available in anything except
maybe the most cutting-edge version of gcc.
However, if we're doing a masssive revampt it would be good to get an
idea of how to migrate to that model eventually, or why it doesn't make
sense at all.
The problem I found when I tried to do that on powerpc is that gcc
believes it can cache addresses of TLS variables. If you try and use
TLS accesses for per-cpu variables then you end up accessing the wrong
cpu's variables due to that, since our "TLS" pointer can change at any
point where preemption is enabled.
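
A minimal sketch of that failure mode, using a hypothetical __thread
variable (this is not kernel code, just an illustration of the address
caching problem):

	__thread unsigned long hits;		/* hypothetical per cpu variable */

	void touch(void)
	{
		unsigned long *p = &hits;	/* gcc may compute and cache this
						 * address once ... */
		/* ... preemption here migrates the task to another cpu ... */
		(*p)++;				/* ... so this hits the old cpu's copy */
	}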

If we wanted to do per-task variables then TLS would be perfect for
that.

Paul.
c***@sgi.com
2007-11-20 01:12:10 UTC
Replace all uses of __per_cpu_offset with CPU_PTR. This will avoid a lot
of lookups for per cpu offset calculations.

Keep per_cpu_offset() itself because lockdep uses it.
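
A small sketch (struct and helpers made up) of the resulting access
patterns; CPU_PTR() and THIS_CPU() come from the cpu alloc patches earlier
in this series:

	struct stats { unsigned long packets; };
	DEFINE_PER_CPU(struct stats, stats);

	static void clear_stats(int cpu)
	{
		/* remote cpu: relocate the linker address explicitly */
		memset(CPU_PTR(&per_cpu__stats, cpu), 0, sizeof(struct stats));
	}

	static void count_packet(void)
	{
		/* current cpu: relocation through the per cpu base; the caller
		 * must keep preemption off, as with __get_cpu_var() */
		THIS_CPU(&per_cpu__stats)->packets++;
	}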

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/x86/kernel/smpboot_64.c | 8 +++-----
include/asm-x86/percpu_64.h | 14 ++++++--------
2 files changed, 9 insertions(+), 13 deletions(-)

Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-11-19 16:07:43.153640561 -0800
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-11-19 16:12:01.977139696 -0800
@@ -11,10 +11,8 @@

#include <asm/pda.h>

-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
-#define __my_cpu_offset() read_pda(data_offset)
-
-#define per_cpu_offset(x) (__per_cpu_offset(x))
+/* Legacy: lockdep in the core kernel uses this */
+#define per_cpu_offset(cpu) CPU_OFFSET(cpu)

/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
@@ -32,20 +30,20 @@
/* var is in discarded region: offset to particular copy we want */
#define per_cpu(var, cpu) (*({ \
extern int simple_identifier_##var(void); \
- RELOC_HIDE(&per_cpu__##var, __per_cpu_offset(cpu)); }))
+ CPU_PTR(&per_cpu__##var, (cpu)); }))
#define __get_cpu_var(var) (*({ \
extern int simple_identifier_##var(void); \
- RELOC_HIDE(&per_cpu__##var, __my_cpu_offset()); }))
+ THIS_CPU(&per_cpu__##var); }))
#define __raw_get_cpu_var(var) (*({ \
extern int simple_identifier_##var(void); \
- RELOC_HIDE(&per_cpu__##var, __my_cpu_offset()); }))
+ __THIS_CPU(&per_cpu__##var); }))

/* A macro to avoid #include hell... */
#define percpu_modcopy(pcpudst, src, size) \
do { \
unsigned int __i; \
for_each_possible_cpu(__i) \
- memcpy((pcpudst)+__per_cpu_offset(__i), \
+ memcpy(CPU_PTR(pcpudst, __i), \
(src), (size)); \
} while (0)

Index: linux-2.6/arch/x86/kernel/smpboot_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/smpboot_64.c 2007-11-19 16:06:15.301390091 -0800
+++ linux-2.6/arch/x86/kernel/smpboot_64.c 2007-11-19 16:10:23.998889835 -0800
@@ -835,11 +835,9 @@ void __init smp_set_apicids(void)
{
int cpu;

- for_each_cpu_mask(cpu, cpu_possible_map) {
- if (per_cpu_offset(cpu))
- per_cpu(x86_cpu_to_apicid, cpu) =
- x86_cpu_to_apicid_init[cpu];
- }
+ for_each_cpu_mask(cpu, cpu_possible_map)
+ per_cpu(x86_cpu_to_apicid, cpu) =
+ x86_cpu_to_apicid_init[cpu];

/* indicate the static array will be going away soon */
x86_cpu_to_apicid_ptr = NULL;

--
c***@sgi.com
2007-11-20 01:12:11 UTC
Permalink
The data_offset field of the pda is useless now since %gs can always
stand in for it.

Move active_mm into the available slot in order to not upset the
established offsets.

Signed-off-by: Christoph Lameter <***@sgi.com>

---
arch/x86/kernel/asm-offsets_64.c | 1 -
arch/x86/kernel/entry_64.S | 7 ++-----
arch/x86/kernel/setup64.c | 2 --
include/asm-x86/pda.h | 4 +---
4 files changed, 3 insertions(+), 11 deletions(-)

Index: linux-2.6/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/asm-offsets_64.c 2007-11-19 15:45:03.902390058 -0800
+++ linux-2.6/arch/x86/kernel/asm-offsets_64.c 2007-11-19 16:13:29.241640104 -0800
@@ -56,7 +56,6 @@ int main(void)
ENTRY(irqcount);
ENTRY(cpunumber);
ENTRY(irqstackptr);
- ENTRY(data_offset);
BLANK();
#undef ENTRY
#ifdef CONFIG_IA32_EMULATION
Index: linux-2.6/arch/x86/kernel/entry_64.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/entry_64.S 2007-11-19 15:45:03.910390570 -0800
+++ linux-2.6/arch/x86/kernel/entry_64.S 2007-11-19 16:13:29.241640104 -0800
@@ -734,18 +734,15 @@ END(spurious_interrupt)
swapgs
xorl %ebx,%ebx
1:
- .if \ist
- movq %gs:pda_data_offset, %rbp
- .endif
movq %rsp,%rdi
movq ORIG_RAX(%rsp),%rsi
movq $-1,ORIG_RAX(%rsp)
.if \ist
- subq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
+ subq $EXCEPTION_STKSZ, %gs: per_cpu__init_tss + TSS_ist
.endif
call \sym
.if \ist
- addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
+ addq $EXCEPTION_STKSZ, %gs: per_cpu__init_tss + TSS_ist
.endif
cli
.if \irqtrace
Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c 2007-11-19 16:06:50.162390389 -0800
+++ linux-2.6/arch/x86/kernel/setup64.c 2007-11-19 16:13:29.245640006 -0800
@@ -103,8 +103,6 @@ void __init setup_per_cpu_areas(void)
printk(KERN_INFO "PERCPU: Allocating %lu bytes of per cpu data\n",
PERCPU_ENOUGH_ROOM);
for_each_cpu_mask (i, cpu_possible_map) {
- cpu_pda(i)->data_offset = cpu_offset(i);
-
memcpy(CPU_PTR(base, i), __load_per_cpu_start,
__per_cpu_end - __per_cpu_start);

Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h 2007-11-19 16:06:15.301390091 -0800
+++ linux-2.6/include/asm-x86/pda.h 2007-11-19 16:13:29.245640006 -0800
@@ -10,8 +10,7 @@
/* Per processor datastructure. %gs points to it while the kernel runs */
struct x8664_pda {
struct task_struct *pcurrent; /* 0 Current process */
- unsigned long data_offset; /* 8 Per cpu data offset from linker
- address */
+ struct mm_struct *active_mm;
unsigned long kernelstack; /* 16 top of kernel stack for current */
unsigned long oldrsp; /* 24 user rsp for system call */
int irqcount; /* 32 Irq nesting counter. Starts with -1 */
@@ -27,7 +26,6 @@ struct x8664_pda {
unsigned int __nmi_count; /* number of NMI on this CPUs */
short mmu_state;
short isidle;
- struct mm_struct *active_mm;
unsigned apic_timer_irqs;
unsigned irq0_irqs;
unsigned irq_resched_count;

--
c***@sgi.com
2007-11-20 01:12:12 UTC
The CPU ops need a way to determine the offset of a per cpu variable.
That offset is simply the address of the variable. But we do not
want to code ugly things like

CPU_READ(per_cpu__statistics)

in the core. So define a new helper per_cpu_var(var) that simply adds
the per_cpu__ prefix.
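
A two-line sketch (the counter is made up) of what the helper buys:

	DECLARE_PER_CPU(unsigned long, nr_widgets);

	static void count_widget(void)
	{
		/* without the helper the mangled name would leak into core
		 * code as CPU_INC(per_cpu__nr_widgets); with it we write: */
		CPU_INC(per_cpu_var(nr_widgets));
	}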

Signed-off-by: Christoph Lameter <***@sgi.com>

---
include/asm-x86/percpu_64.h | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-11-19 16:17:21.477639855 -0800
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-11-19 16:17:23.942140438 -0800
@@ -14,29 +14,35 @@
/* Legacy: lockdep in the core kernel uses this */
#define per_cpu_offset(cpu) CPU_OFFSET(cpu)

+/*
+ * Needed in order to be able to pass per cpu variables to CPU_xx
+ * macros. Another solution may be to simply drop the prefix?
+ */
+#define per_cpu_var(var) per_cpu__##var
+
/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
- __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name
+ __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu_var(name)

#define DEFINE_PER_CPU_SHARED_ALIGNED(type, name) \
__attribute__((__section__(".data.percpu.shared_aligned"))) \
- __typeof__(type) per_cpu__##name \
+ __typeof__(type) per_cpu_var(name) \
____cacheline_internodealigned_in_smp

#define DEFINE_PER_CPU_FIRST(type, name) \
__attribute__((__section__(".data.percpu.first"))) \
- __typeof__(type) per_cpu__##name
+ __typeof__(type) per_cpu_var(name)

/* var is in discarded region: offset to particular copy we want */
#define per_cpu(var, cpu) (*({ \
extern int simple_identifier_##var(void); \
- CPU_PTR(&per_cpu__##var, (cpu)); }))
+ CPU_PTR(&per_cpu_var(var), (cpu)); }))
#define __get_cpu_var(var) (*({ \
extern int simple_identifier_##var(void); \
- THIS_CPU(&per_cpu__##var); }))
+ THIS_CPU(&per_cpu_var(var)); }))
#define __raw_get_cpu_var(var) (*({ \
extern int simple_identifier_##var(void); \
- __THIS_CPU(&per_cpu__##var); }))
+ __THIS_CPU(&per_cpu_var(var)); }))

/* A macro to avoid #include hell... */
#define percpu_modcopy(pcpudst, src, size) \

--
c***@sgi.com
2007-11-20 01:12:13 UTC
The use of CPU ops here avoids the offset calculations that the old per cpu
accessors had to do. The result of this patch is that event counters are
coded with a single instruction the following way:

incq %gs:offset(%rip)

Without these patches this was:

mov %gs:0x8,%rdx
mov %eax,0x38(%rsp)
mov xxx(%rip),%eax
mov %eax,0x48(%rsp)
mov varoffset,%rax
incq 0x110(%rax,%rdx,1)

Signed-off-by: Christoph Lameter <***@sgi.com>

---
include/linux/vmstat.h | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h 2007-11-18 09:29:53.177783417 -0800
+++ linux-2.6/include/linux/vmstat.h 2007-11-18 09:29:54.741283496 -0800
@@ -59,24 +59,22 @@ DECLARE_PER_CPU(struct vm_event_state, v

static inline void __count_vm_event(enum vm_event_item item)
{
- __get_cpu_var(vm_event_states).event[item]++;
+ __CPU_INC(per_cpu_var(vm_event_states).event[item]);
}

static inline void count_vm_event(enum vm_event_item item)
{
- get_cpu_var(vm_event_states).event[item]++;
- put_cpu();
+ _CPU_INC(per_cpu_var(vm_event_states).event[item]);
}

static inline void __count_vm_events(enum vm_event_item item, long delta)
{
- __get_cpu_var(vm_event_states).event[item] += delta;
+ __CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
}

static inline void count_vm_events(enum vm_event_item item, long delta)
{
- get_cpu_var(vm_event_states).event[item] += delta;
- put_cpu();
+ _CPU_ADD(per_cpu_var(vm_event_states).event[item], delta);
}

extern void all_vm_events(unsigned long *);

--
c***@sgi.com
2007-11-20 01:12:04 UTC
Use the CPU_xx operations to deal with the per cpu data.

Avoid a loop to NR_CPUS here. Use the possible map instead.
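
For reference, a minimal sketch (names made up) of the CPU_ALLOC /
CPU_PTR / CPU_FREE pattern the refcounts are switched to; the allocator
itself comes from the cpu alloc patches earlier in this series:

	struct counter { int count; };
	static struct counter *c;

	static int counter_init(void)
	{
		c = CPU_ALLOC(struct counter, GFP_KERNEL | __GFP_ZERO);
		return c ? 0 : -ENOMEM;
	}

	static void counter_hit(void)
	{
		_CPU_INC(c->count);		/* bump this cpu's instance */
	}

	static int counter_total(void)
	{
		int cpu, total = 0;

		for_each_possible_cpu(cpu)
			total += CPU_PTR(c, cpu)->count;
		return total;
	}

	static void counter_exit(void)
	{
		CPU_FREE(c);
	}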

Signed-off-by: Christoph Lameter <***@sgi.com>

---
include/linux/module.h | 13 +++++--------
kernel/module.c | 17 +++++++----------
2 files changed, 12 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h 2007-11-19 14:44:06.233923291 -0800
+++ linux-2.6/include/linux/module.h 2007-11-19 15:10:50.982356094 -0800
@@ -219,8 +219,8 @@ void *__symbol_get_gpl(const char *symbo

struct module_ref
{
- local_t count;
-} ____cacheline_aligned;
+ int count;
+};

enum module_state
{
@@ -324,7 +324,7 @@ struct module

#ifdef CONFIG_MODULE_UNLOAD
/* Reference counts */
- struct module_ref ref[NR_CPUS];
+ struct module_ref *ref;

/* What modules depend on me? */
struct list_head modules_which_use_me;
@@ -401,8 +401,7 @@ static inline void __module_get(struct m
{
if (module) {
BUG_ON(module_refcount(module) == 0);
- local_inc(&module->ref[get_cpu()].count);
- put_cpu();
+ _CPU_INC(module->ref->count);
}
}

@@ -411,12 +410,10 @@ static inline int try_module_get(struct
int ret = 1;

if (module) {
- unsigned int cpu = get_cpu();
if (likely(module_is_live(module)))
- local_inc(&module->ref[cpu].count);
+ _CPU_INC(module->ref->count);
else
ret = 0;
- put_cpu();
}
return ret;
}
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c 2007-11-19 14:44:06.241923622 -0800
+++ linux-2.6/kernel/module.c 2007-11-19 15:20:28.491897584 -0800
@@ -501,13 +501,11 @@ MODINFO_ATTR(srcversion);
/* Init the unload section of the module. */
static void module_unload_init(struct module *mod)
{
- unsigned int i;
-
INIT_LIST_HEAD(&mod->modules_which_use_me);
- for (i = 0; i < NR_CPUS; i++)
- local_set(&mod->ref[i].count, 0);
+ mod->ref = CPU_ALLOC(struct module_ref, GFP_KERNEL | __GFP_ZERO);
+
/* Hold reference count during initialization. */
- local_set(&mod->ref[raw_smp_processor_id()].count, 1);
+ __CPU_WRITE(mod->ref->count, 1);
/* Backwards compatibility macros put refcount during init. */
mod->waiter = current;
}
@@ -575,6 +573,7 @@ static void module_unload_free(struct mo
kfree(use);
sysfs_remove_link(i->holders_dir, mod->name);
/* There can be at most one match. */
+ CPU_FREE(i->ref);
break;
}
}
@@ -630,8 +629,8 @@ unsigned int module_refcount(struct modu
{
unsigned int i, total = 0;

- for (i = 0; i < NR_CPUS; i++)
- total += local_read(&mod->ref[i].count);
+ for_each_online_cpu(i)
+ total += CPU_PTR(mod->ref, i)->count;
return total;
}
EXPORT_SYMBOL(module_refcount);
@@ -790,12 +789,10 @@ static struct module_attribute refcnt =
void module_put(struct module *module)
{
if (module) {
- unsigned int cpu = get_cpu();
- local_dec(&module->ref[cpu].count);
+ _CPU_DEC(module->ref->count);
/* Maybe they're waiting for us to drop reference? */
if (unlikely(!module_is_live(module)))
wake_up_process(module->waiter);
- put_cpu();
}
}
EXPORT_SYMBOL(module_put);

--
c***@sgi.com
2007-11-20 01:12:17 UTC
The module subsystem cannot handle symbols that are zero. It prints out
a message that these symbols are unresolved. Define a constant

UNRESOLVED

that holds the value used for unresolved symbols. Set it to 1 (it is
hopefully unlikely that a symbol will have the value 1). This is necessary
so that the pda variable, which is placed at offset 0 of the per cpu
segment, is handled correctly.
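
In other words (sketch only; the lookup below is just for illustration), a
successful lookup of the pda, which sits at offset 0 of the per cpu
segment, yields the address 0, and the old code could not tell that apart
from a failed lookup:

	struct module *owner;
	const unsigned long *crc;
	unsigned long addr;

	addr = __find_symbol("per_cpu__pda", &owner, &crc, 1);
	/* old convention: !addr meant "unresolved", but 0 is now the valid
	 * address of the first object in the per cpu segment */
	if (addr == UNRESOLVED)
		printk(KERN_ERR "symbol not found\n");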

Signed-off-by: Christoph Lameter <***@sgi.com>

---
kernel/module.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c 2007-11-19 17:08:40.563625526 -0800
+++ linux-2.6/kernel/module.c 2007-11-19 17:08:40.855625732 -0800
@@ -62,6 +62,7 @@ extern int module_sysfs_initialized;
/* If this is set, the section belongs in the init part of the module */
#define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1))

+#define UNRESOLVED 1
/* List of modules, protected by module_mutex or preempt_disable
* (add/delete uses stop_machine). */
static DEFINE_MUTEX(module_mutex);
@@ -285,7 +286,7 @@ static unsigned long __find_symbol(const
}
}
DEBUGP("Failed to find symbol %s\n", name);
- return 0;
+ return UNRESOLVED;
}

/* Search for module by name: must hold module_mutex. */
@@ -755,7 +756,7 @@ void __symbol_put(const char *symbol)
const unsigned long *crc;

preempt_disable();
- if (!__find_symbol(symbol, &owner, &crc, 1))
+ if (__find_symbol(symbol, &owner, &crc, 1) == UNRESOLVED)
BUG();
module_put(owner);
preempt_enable();
@@ -899,7 +900,7 @@ static inline int check_modstruct_versio
const unsigned long *crc;
struct module *owner;

- if (!__find_symbol("struct_module", &owner, &crc, 1))
+ if (__find_symbol("struct_module", &owner, &crc, 1) == UNRESOLVED)
BUG();
return check_version(sechdrs, versindex, "struct_module", mod,
crc);
@@ -952,7 +953,7 @@ static unsigned long resolve_symbol(Elf_
/* use_module can fail due to OOM, or module unloading */
if (!check_version(sechdrs, versindex, name, mod, crc) ||
!use_module(mod, owner))
- ret = 0;
+ ret = UNRESOLVED;
}
return ret;
}
@@ -1345,14 +1346,14 @@ static int verify_export_symbols(struct
const unsigned long *crc;

for (i = 0; i < mod->num_syms; i++)
- if (__find_symbol(mod->syms[i].name, &owner, &crc, 1)) {
+ if (__find_symbol(mod->syms[i].name, &owner, &crc, 1) != UNRESOLVED) {
name = mod->syms[i].name;
ret = -ENOEXEC;
goto dup;
}

for (i = 0; i < mod->num_gpl_syms; i++)
- if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)) {
+ if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)!= UNRESOLVED) {
name = mod->gpl_syms[i].name;
ret = -ENOEXEC;
goto dup;
@@ -1402,7 +1403,7 @@ static int simplify_symbols(Elf_Shdr *se
strtab + sym[i].st_name, mod);

/* Ok if resolved. */
- if (sym[i].st_value != 0)
+ if (sym[i].st_value != UNRESOLVED)
break;
/* Ok if weak. */
if (ELF_ST_BIND(sym[i].st_info) == STB_WEAK)

--
Mathieu Desnoyers
2007-11-20 02:20:07 UTC
Post by c***@sgi.com
The module subsystem cannot handle symbols that are zero. It prints out
a message that these symbols are unresolved. Define a constant
UNRESOLVED
that is used to hold the value used for unresolved symbols. Set it to 1
(its hopefully unlikely that a symbol will have the value 1). This is necessary
so that the pda variable which is placed at offset 0 of the per cpu
segment is handled correctly.
Wouldn't it be better to simply return a standard ERR_PTR(-ENO....) ?
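
A sketch of that alternative (not what the patch above does; the errno and
the lookup helper are purely illustrative):

	#include <linux/err.h>

	/* lookup failure encoded as an errno in pointer space */
	static unsigned long find_or_err(const char *name)
	{
		unsigned long addr = lookup_symbol_address(name);	/* hypothetical */

		return addr ? addr : (unsigned long)ERR_PTR(-ENOENT);
	}

	/* callers would then check with IS_ERR_VALUE(addr) instead of !addr */
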
Post by c***@sgi.com
---
kernel/module.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
Index: linux-2.6/kernel/module.c
===================================================================
--- linux-2.6.orig/kernel/module.c 2007-11-19 17:08:40.563625526 -0800
+++ linux-2.6/kernel/module.c 2007-11-19 17:08:40.855625732 -0800
@@ -62,6 +62,7 @@ extern int module_sysfs_initialized;
/* If this is set, the section belongs in the init part of the module */
#define INIT_OFFSET_MASK (1UL << (BITS_PER_LONG-1))
+#define UNRESOLVED 1
/* List of modules, protected by module_mutex or preempt_disable
* (add/delete uses stop_machine). */
static DEFINE_MUTEX(module_mutex);
@@ -285,7 +286,7 @@ static unsigned long __find_symbol(const
}
}
DEBUGP("Failed to find symbol %s\n", name);
- return 0;
+ return UNRESOLVED;
}
/* Search for module by name: must hold module_mutex. */
@@ -755,7 +756,7 @@ void __symbol_put(const char *symbol)
const unsigned long *crc;
preempt_disable();
- if (!__find_symbol(symbol, &owner, &crc, 1))
+ if (__find_symbol(symbol, &owner, &crc, 1) == UNRESOLVED)
BUG();
module_put(owner);
preempt_enable();
@@ -899,7 +900,7 @@ static inline int check_modstruct_versio
const unsigned long *crc;
struct module *owner;
- if (!__find_symbol("struct_module", &owner, &crc, 1))
+ if (__find_symbol("struct_module", &owner, &crc, 1) == UNRESOLVED)
BUG();
return check_version(sechdrs, versindex, "struct_module", mod,
crc);
@@ -952,7 +953,7 @@ static unsigned long resolve_symbol(Elf_
/* use_module can fail due to OOM, or module unloading */
if (!check_version(sechdrs, versindex, name, mod, crc) ||
!use_module(mod, owner))
- ret = 0;
+ ret = UNRESOLVED;
}
return ret;
}
@@ -1345,14 +1346,14 @@ static int verify_export_symbols(struct
const unsigned long *crc;
for (i = 0; i < mod->num_syms; i++)
- if (__find_symbol(mod->syms[i].name, &owner, &crc, 1)) {
+ if (__find_symbol(mod->syms[i].name, &owner, &crc, 1) != UNRESOLVED) {
name = mod->syms[i].name;
ret = -ENOEXEC;
goto dup;
}
for (i = 0; i < mod->num_gpl_syms; i++)
- if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)) {
+ if (__find_symbol(mod->gpl_syms[i].name, &owner, &crc, 1)!= UNRESOLVED) {
name = mod->gpl_syms[i].name;
ret = -ENOEXEC;
goto dup;
@@ -1402,7 +1403,7 @@ static int simplify_symbols(Elf_Shdr *se
strtab + sym[i].st_name, mod);
/* Ok if resolved. */
- if (sym[i].st_value != 0)
+ if (sym[i].st_value != UNRESOLVED)
break;
/* Ok if weak. */
if (ELF_ST_BIND(sym[i].st_info) == STB_WEAK)
--
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 02:49:56 UTC
Post by Mathieu Desnoyers
Post by c***@sgi.com
The module subsystem cannot handle symbols that are zero. It prints out
a message that these symbols are unresolved. Define a constant
UNRESOLVED
that is used to hold the value used for unresolved symbols. Set it to 1
(its hopefully unlikely that a symbol will have the value 1). This is necessary
so that the pda variable which is placed at offset 0 of the per cpu
segment is handled correctly.
Wouldn't it be better to simply return a standard ERR_PTR(-ENO....) ?
Good idea. But can you guarantee that this wont clash with an address?
Mathieu Desnoyers
2007-11-20 03:29:04 UTC
Post by Christoph Lameter
Post by Mathieu Desnoyers
Post by c***@sgi.com
The module subsystem cannot handle symbols that are zero. It prints out
a message that these symbols are unresolved. Define a constant
UNRESOLVED
that is used to hold the value used for unresolved symbols. Set it to 1
(its hopefully unlikely that a symbol will have the value 1). This is necessary
so that the pda variable which is placed at offset 0 of the per cpu
segment is handled correctly.
Wouldn't it be better to simply return a standard ERR_PTR(-ENO....) ?
Good idea. But can you guarantee that this wont clash with an address?
linux/err.h assumes that the last page of the address space is never
ever used. Is that a correct assumption ?
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
c***@sgi.com
2007-11-20 01:12:15 UTC
Get rid of one of the leftover pda accessors and cut out some more of pda.h.

Signed-off-by: Christoph Lameter <***@sgi.com>


---
include/asm-x86/pda.h | 35 +----------------------------------
include/asm-x86/percpu_64.h | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+), 34 deletions(-)

Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h 2007-11-19 16:24:13.569640223 -0800
+++ linux-2.6/include/asm-x86/pda.h 2007-11-19 16:24:33.001389877 -0800
@@ -40,12 +40,6 @@ extern struct x8664_pda *_cpu_pda[];

#define cpu_pda(i) (_cpu_pda[i])

-/*
- * There is no fast way to get the base address of the PDA, all the accesses
- * have to mention %fs/%gs. So it needs to be done this Torvaldian way.
- */
-extern void __bad_pda_field(void) __attribute__((noreturn));
-
/*
* proxy_pda doesn't actually exist, but tell gcc it is accessed for
* all PDA accesses so it gets read/write dependencies right.
@@ -54,38 +48,11 @@ extern struct x8664_pda _proxy_pda;

#define pda_offset(field) offsetof(struct x8664_pda, field)

-#define pda_to_op(op,field,val) do { \
- typedef typeof(_proxy_pda.field) T__; \
- if (0) { T__ tmp__; tmp__ = (val); } /* type checking */ \
- switch (sizeof(_proxy_pda.field)) { \
- case 2: \
- asm(op "w %1,%%gs:%c2" : \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i"(pda_offset(field))); \
- break; \
- case 4: \
- asm(op "l %1,%%gs:%c2" : \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i" (pda_offset(field))); \
- break; \
- case 8: \
- asm(op "q %1,%%gs:%c2": \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i"(pda_offset(field))); \
- break; \
- default: \
- __bad_pda_field(); \
- } \
- } while (0)
-
#define read_pda(field) CPU_READ(per_cpu_var(pda).field)
#define write_pda(field,val) CPU_WRITE(per_cpu_var(pda).field, val)
#define add_pda(field,val) CPU_ADD(per_cpu_var(pda).field, val)
#define sub_pda(field,val) CPU_ADD(per_cpu_var(pda).field, val)
-#define or_pda(field,val) pda_to_op("or",field,val)
+#define or_pda(field,val) CPU_OR(per_cpu_var(pda).field, val)

/* This is not atomic against other CPUs -- CPU preemption needs to be off */
#define test_and_clear_bit_pda(bit,field) ({ \
Index: linux-2.6/include/asm-x86/percpu_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/percpu_64.h 2007-11-19 16:17:46.854889825 -0800
+++ linux-2.6/include/asm-x86/percpu_64.h 2007-11-19 16:24:33.001389877 -0800
@@ -305,12 +305,40 @@ static inline unsigned long __cmpxchg_lo
((__typeof__(obj))__cmpxchg_local_gs(&(obj),(unsigned long)(o),\
(unsigned long)(n),sizeof(obj)))

+static inline void __cpu_or_gs(volatile void *ptr,
+ long data, int size)
+{
+ switch (size) {
+ case 1:
+ __asm__ ("orb %b0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 2:
+ __asm__ ("orw %w0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 4:
+ __asm__ ("orl %k0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ case 8:
+ __asm__ ("orq %0, %%gs:%1"
+ : : "ri"(data), "m"(*__xp(ptr)));
+ return;
+ }
+ BUG();
+}
+
+#define cpu_or_gs(obj, value)\
+ __cpu_or_gs(&(obj), (unsigned long)value, sizeof(obj))
+
#define CPU_READ(obj) cpu_read_gs(obj)
#define CPU_WRITE(obj,val) cpu_write_gs(obj, val)
#define CPU_ADD(obj,val) cpu_add_gs(obj, val)
#define CPU_SUB(obj,val) cpu_sub_gs(obj, val)
#define CPU_INC(obj) cpu_inc_gs(obj)
#define CPU_DEC(obj) cpu_dec_gs(obj)
+#define CPU_OR(obj, val) cpu_or_gs(obj, val)

#define CPU_XCHG(obj,val) cpu_xchg_gs(obj, val)
#define CPU_CMPXCHG(obj, old, new) cmpxchg_local_gs(obj, old, new)
@@ -327,6 +355,7 @@ static inline unsigned long __cmpxchg_lo
#define _CPU_DEC CPU_DEC
#define _CPU_XCHG CPU_XCHG
#define _CPU_CMPXCHG CPU_CMPXCHG
+#define _CPU_OR CPU_OR

#define __CPU_READ CPU_READ
#define __CPU_WRITE CPU_WRITE
@@ -336,5 +365,6 @@ static inline unsigned long __cmpxchg_lo
#define __CPU_DEC CPU_DEC
#define __CPU_XCHG CPU_XCHG
#define __CPU_CMPXCHG CPU_CMPXCHG
+#define __CPU_OR CPU_OR

#endif /* _ASM_X8664_PERCPU_H_ */

--
c***@sgi.com
2007-11-20 01:12:16 UTC
There is no user of local_t remaining after the cpu ops patchset. local_t
always suffered from the problem that the operations it generated were not
able to perform the relocation of a pointer to the target processor and the
atomic update at the same time. Preemption and/or interrupts therefore had
to be disabled, which made local_t awkward to use.
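
As a sketch of the contrast (counter names made up):

	static DEFINE_PER_CPU(local_t, old_counter);		/* the old way */
	static DEFINE_PER_CPU(unsigned long, new_counter);	/* the new way */

	static void count_old(void)
	{
		/* address relocation and increment are separate steps, so
		 * preemption has to stay off in between */
		local_inc(&get_cpu_var(old_counter));
		put_cpu_var(old_counter);
	}

	static void count_new(void)
	{
		/* relocation and increment in one %gs-relative instruction */
		_CPU_INC(per_cpu_var(new_counter));
	}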

Signed-off-by: Christoph Lameter <***@sgi.com>

---
Documentation/local_ops.txt | 209 -------------------------------------
arch/frv/kernel/local.h | 59 ----------
include/asm-alpha/local.h | 118 ---------------------
include/asm-arm/local.h | 1
include/asm-avr32/local.h | 6 -
include/asm-blackfin/local.h | 6 -
include/asm-cris/local.h | 1
include/asm-frv/local.h | 6 -
include/asm-generic/local.h | 75 -------------
include/asm-h8300/local.h | 6 -
include/asm-ia64/local.h | 1
include/asm-m32r/local.h | 6 -
include/asm-m68k/local.h | 6 -
include/asm-m68knommu/local.h | 6 -
include/asm-mips/local.h | 221 ---------------------------------------
include/asm-parisc/local.h | 1
include/asm-powerpc/local.h | 200 ------------------------------------
include/asm-s390/local.h | 1
include/asm-sh/local.h | 7 -
include/asm-sh64/local.h | 7 -
include/asm-sparc/local.h | 6 -
include/asm-sparc64/local.h | 1
include/asm-um/local.h | 6 -
include/asm-v850/local.h | 6 -
include/asm-x86/local.h | 5
include/asm-x86/local_32.h | 233 ------------------------------------------
include/asm-x86/local_64.h | 222 ----------------------------------------
include/asm-xtensa/local.h | 16 --
include/linux/module.h | 2
29 files changed, 1 insertion(+), 1439 deletions(-)

Index: linux-2.6/Documentation/local_ops.txt
===================================================================
--- linux-2.6.orig/Documentation/local_ops.txt 2007-11-19 15:45:01.989139706 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,209 +0,0 @@
- Semantics and Behavior of Local Atomic Operations
-
- Mathieu Desnoyers
-
-
- This document explains the purpose of the local atomic operations, how
-to implement them for any given architecture and shows how they can be used
-properly. It also stresses on the precautions that must be taken when reading
-those local variables across CPUs when the order of memory writes matters.
-
-
-
-* Purpose of local atomic operations
-
-Local atomic operations are meant to provide fast and highly reentrant per CPU
-counters. They minimize the performance cost of standard atomic operations by
-removing the LOCK prefix and memory barriers normally required to synchronize
-across CPUs.
-
-Having fast per CPU atomic counters is interesting in many cases : it does not
-require disabling interrupts to protect from interrupt handlers and it permits
-coherent counters in NMI handlers. It is especially useful for tracing purposes
-and for various performance monitoring counters.
-
-Local atomic operations only guarantee variable modification atomicity wrt the
-CPU which owns the data. Therefore, care must taken to make sure that only one
-CPU writes to the local_t data. This is done by using per cpu data and making
-sure that we modify it from within a preemption safe context. It is however
-permitted to read local_t data from any CPU : it will then appear to be written
-out of order wrt other memory writes by the owner CPU.
-
-
-* Implementation for a given architecture
-
-It can be done by slightly modifying the standard atomic operations : only
-their UP variant must be kept. It typically means removing LOCK prefix (on
-i386 and x86_64) and any SMP sychronization barrier. If the architecture does
-not have a different behavior between SMP and UP, including asm-generic/local.h
-in your archtecture's local.h is sufficient.
-
-The local_t type is defined as an opaque signed long by embedding an
-atomic_long_t inside a structure. This is made so a cast from this type to a
-long fails. The definition looks like :
-
-typedef struct { atomic_long_t a; } local_t;
-
-
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
- to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
- process context to make sure the process won't be migrated to a
- different CPU between getting the per-cpu variable and doing the
- actual local op.
-- When using local ops in interrupt context, no special care must be
- taken on a mainline kernel, since they will run on the local CPU with
- preemption already disabled. I suggest, however, to explicitly
- disable preemption anyway to make sure it will still work correctly on
- -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
- variable.
-- Reads of these variables can be done from any CPU, because updates to
- "long", aligned, variables are always atomic. Since no memory
- synchronization is done by the writer CPU, an outdated copy of the
- variable can be read when reading some _other_ cpu's variables.
-
-
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
- to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
- process context to make sure the process won't be migrated to a
- different CPU between getting the per-cpu variable and doing the
- actual local op.
-- When using local ops in interrupt context, no special care must be
- taken on a mainline kernel, since they will run on the local CPU with
- preemption already disabled. I suggest, however, to explicitly
- disable preemption anyway to make sure it will still work correctly on
- -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
- variable.
-- Reads of these variables can be done from any CPU, because updates to
- "long", aligned, variables are always atomic. Since no memory
- synchronization is done by the writer CPU, an outdated copy of the
- variable can be read when reading some _other_ cpu's variables.
-
-
-* How to use local atomic operations
-
-#include <linux/percpu.h>
-#include <asm/local.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-
-* Counting
-
-Counting is done on all the bits of a signed long.
-
-In preemptible context, use get_cpu_var() and put_cpu_var() around local atomic
-operations : it makes sure that preemption is disabled around write access to
-the per cpu variable. For instance :
-
- local_inc(&get_cpu_var(counters));
- put_cpu_var(counters);
-
-If you are already in a preemption-safe context, you can directly use
-__get_cpu_var() instead.
-
- local_inc(&__get_cpu_var(counters));
-
-
-
-* Reading the counters
-
-Those local counters can be read from foreign CPUs to sum the count. Note that
-the data seen by local_read across CPUs must be considered to be out of order
-relatively to other memory writes happening on the CPU that owns the data.
-
- long sum = 0;
- for_each_online_cpu(cpu)
- sum += local_read(&per_cpu(counters, cpu));
-
-If you want to use a remote local_read to synchronize access to a resource
-between CPUs, explicit smp_wmb() and smp_rmb() memory barriers must be used
-respectively on the writer and the reader CPUs. It would be the case if you use
-the local_t variable as a counter of bytes written in a buffer : there should
-be a smp_wmb() between the buffer write and the counter increment and also a
-smp_rmb() between the counter read and the buffer read.
-
-
-Here is a sample module which implements a basic per cpu counter using local.h.
-
---- BEGIN ---
-/* test-local.c
- *
- * Sample module for local.h usage.
- */
-
-
-#include <asm/local.h>
-#include <linux/module.h>
-#include <linux/timer.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-static struct timer_list test_timer;
-
-/* IPI called on each CPU. */
-static void test_each(void *info)
-{
- /* Increment the counter from a non preemptible context */
- printk("Increment on cpu %d\n", smp_processor_id());
- local_inc(&__get_cpu_var(counters));
-
- /* This is what incrementing the variable would look like within a
- * preemptible context (it disables preemption) :
- *
- * local_inc(&get_cpu_var(counters));
- * put_cpu_var(counters);
- */
-}
-
-static void do_test_timer(unsigned long data)
-{
- int cpu;
-
- /* Increment the counters */
- on_each_cpu(test_each, NULL, 0, 1);
- /* Read all the counters */
- printk("Counters read from CPU %d\n", smp_processor_id());
- for_each_online_cpu(cpu) {
- printk("Read : CPU %d, count %ld\n", cpu,
- local_read(&per_cpu(counters, cpu)));
- }
- del_timer(&test_timer);
- test_timer.expires = jiffies + 1000;
- add_timer(&test_timer);
-}
-
-static int __init test_init(void)
-{
- /* initialize the timer that will increment the counter */
- init_timer(&test_timer);
- test_timer.function = do_test_timer;
- test_timer.expires = jiffies + 1;
- add_timer(&test_timer);
-
- return 0;
-}
-
-static void __exit test_exit(void)
-{
- del_timer_sync(&test_timer);
-}
-
-module_init(test_init);
-module_exit(test_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Mathieu Desnoyers");
-MODULE_DESCRIPTION("Local Atomic Ops");
---- END ---
Index: linux-2.6/include/asm-x86/local.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local.h 2007-11-19 15:45:02.002639906 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,5 +0,0 @@
-#ifdef CONFIG_X86_32
-# include "local_32.h"
-#else
-# include "local_64.h"
-#endif
Index: linux-2.6/include/asm-x86/local_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local_32.h 2007-11-19 15:45:02.006640289 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,233 +0,0 @@
-#ifndef _ARCH_I386_LOCAL_H
-#define _ARCH_I386_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/system.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-
-static __inline__ void local_inc(local_t *l)
-{
- __asm__ __volatile__(
- "incl %0"
- :"+m" (l->a.counter));
-}
-
-static __inline__ void local_dec(local_t *l)
-{
- __asm__ __volatile__(
- "decl %0"
- :"+m" (l->a.counter));
-}
-
-static __inline__ void local_add(long i, local_t *l)
-{
- __asm__ __volatile__(
- "addl %1,%0"
- :"+m" (l->a.counter)
- :"ir" (i));
-}
-
-static __inline__ void local_sub(long i, local_t *l)
-{
- __asm__ __volatile__(
- "subl %1,%0"
- :"+m" (l->a.counter)
- :"ir" (i));
-}
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_sub_and_test(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "subl %2,%0; sete %1"
- :"+m" (l->a.counter), "=qm" (c)
- :"ir" (i) : "memory");
- return c;
-}
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static __inline__ int local_dec_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "decl %0; sete %1"
- :"+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
-}
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_inc_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "incl %0; sete %1"
- :"+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
-}
-
-/**
- * local_add_negative - add and test if negative
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static __inline__ int local_add_negative(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "addl %2,%0; sets %1"
- :"+m" (l->a.counter), "=qm" (c)
- :"ir" (i) : "memory");
- return c;
-}
-
-/**
- * local_add_return - add and return
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns @i + @l
- */
-static __inline__ long local_add_return(long i, local_t *l)
-{
- long __i;
-#ifdef CONFIG_M386
- unsigned long flags;
- if(unlikely(boot_cpu_data.x86 <= 3))
- goto no_xadd;
-#endif
- /* Modern 486+ processor */
- __i = i;
- __asm__ __volatile__(
- "xaddl %0, %1;"
- :"+r" (i), "+m" (l->a.counter)
- : : "memory");
- return i + __i;
-
-#ifdef CONFIG_M386
-no_xadd: /* Legacy 386 processor */
- local_irq_save(flags);
- __i = local_read(l);
- local_set(l, i + __i);
- local_irq_restore(flags);
- return i + __i;
-#endif
-}
-
-static __inline__ long local_sub_return(long i, local_t *l)
-{
- return local_add_return(-i,l);
-}
-
-#define local_inc_return(l) (local_add_return(1,l))
-#define local_dec_return(l) (local_sub_return(1,l))
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-/* Always has a lock prefix */
-#define local_xchg(l, n) (xchg(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- for (;;) { \
- if (unlikely(c == (u))) \
- break; \
- old = local_cmpxchg((l), c, c + (a)); \
- if (likely(old == c)) \
- break; \
- c = old; \
- } \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-/* On x86, these are no better than the atomic variants. */
-#define __local_inc(l) local_inc(l)
-#define __local_dec(l) local_dec(l)
-#define __local_add(i,l) local_add((i),(l))
-#define __local_sub(i,l) local_sub((i),(l))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_I386_LOCAL_H */
Index: linux-2.6/include/asm-x86/local_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local_64.h 2007-11-19 15:45:02.026640148 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,222 +0,0 @@
-#ifndef _ARCH_X8664_LOCAL_H
-#define _ARCH_X8664_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-
-static inline void local_inc(local_t *l)
-{
- __asm__ __volatile__(
- "incq %0"
- :"=m" (l->a.counter)
- :"m" (l->a.counter));
-}
-
-static inline void local_dec(local_t *l)
-{
- __asm__ __volatile__(
- "decq %0"
- :"=m" (l->a.counter)
- :"m" (l->a.counter));
-}
-
-static inline void local_add(long i, local_t *l)
-{
- __asm__ __volatile__(
- "addq %1,%0"
- :"=m" (l->a.counter)
- :"ir" (i), "m" (l->a.counter));
-}
-
-static inline void local_sub(long i, local_t *l)
-{
- __asm__ __volatile__(
- "subq %1,%0"
- :"=m" (l->a.counter)
- :"ir" (i), "m" (l->a.counter));
-}
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer to type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_sub_and_test(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "subq %2,%0; sete %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"ir" (i), "m" (l->a.counter) : "memory");
- return c;
-}
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer to type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static __inline__ int local_dec_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "decq %0; sete %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"m" (l->a.counter) : "memory");
- return c != 0;
-}
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer to type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_inc_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "incq %0; sete %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"m" (l->a.counter) : "memory");
- return c != 0;
-}
-
-/**
- * local_add_negative - add and test if negative
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static __inline__ int local_add_negative(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "addq %2,%0; sets %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"ir" (i), "m" (l->a.counter) : "memory");
- return c;
-}
-
-/**
- * local_add_return - add and return
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns @i + @l
- */
-static __inline__ long local_add_return(long i, local_t *l)
-{
- long __i = i;
- __asm__ __volatile__(
- "xaddq %0, %1;"
- :"+r" (i), "+m" (l->a.counter)
- : : "memory");
- return i + __i;
-}
-
-static __inline__ long local_sub_return(long i, local_t *l)
-{
- return local_add_return(-i,l);
-}
-
-#define local_inc_return(l) (local_add_return(1,l))
-#define local_dec_return(l) (local_sub_return(1,l))
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-/* Always has a lock prefix */
-#define local_xchg(l, n) (xchg(&((l)->a.counter), (n)))
-
-/**
- * atomic_up_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- for (;;) { \
- if (unlikely(c == (u))) \
- break; \
- old = local_cmpxchg((l), c, c + (a)); \
- if (likely(old == c)) \
- break; \
- c = old; \
- } \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-/* On x86-64 these are better than the atomic variants on SMP kernels
- because they dont use a lock prefix. */
-#define __local_inc(l) local_inc(l)
-#define __local_dec(l) local_dec(l)
-#define __local_add(i,l) local_add((i),(l))
-#define __local_sub(i,l) local_sub((i),(l))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- *
- * This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_X8664_LOCAL_H */
Index: linux-2.6/arch/frv/kernel/local.h
===================================================================
--- linux-2.6.orig/arch/frv/kernel/local.h 2007-11-19 15:45:02.509640199 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,59 +0,0 @@
-/* local.h: local definitions
- *
- * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (***@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#ifndef _FRV_LOCAL_H
-#define _FRV_LOCAL_H
-
-#include <asm/sections.h>
-
-#ifndef __ASSEMBLY__
-
-/* dma.c */
-extern unsigned long frv_dma_inprogress;
-
-extern void frv_dma_pause_all(void);
-extern void frv_dma_resume_all(void);
-
-/* sleep.S */
-extern asmlinkage void frv_cpu_suspend(unsigned long);
-extern asmlinkage void frv_cpu_core_sleep(void);
-
-/* setup.c */
-extern unsigned long __nongprelbss pdm_suspend_mode;
-extern void determine_clocks(int verbose);
-extern int __nongprelbss clock_p0_current;
-extern int __nongprelbss clock_cm_current;
-extern int __nongprelbss clock_cmode_current;
-
-#ifdef CONFIG_PM
-extern int __nongprelbss clock_cmodes_permitted;
-extern unsigned long __nongprelbss clock_bits_settable;
-#define CLOCK_BIT_CM 0x0000000f
-#define CLOCK_BIT_CM_H 0x00000001 /* CLKC.CM can be set to 0 */
-#define CLOCK_BIT_CM_M 0x00000002 /* CLKC.CM can be set to 1 */
-#define CLOCK_BIT_CM_L 0x00000004 /* CLKC.CM can be set to 2 */
-#define CLOCK_BIT_P0 0x00000010 /* CLKC.P0 can be changed */
-#define CLOCK_BIT_CMODE 0x00000020 /* CLKC.CMODE can be changed */
-
-extern void (*__power_switch_wake_setup)(void);
-extern int (*__power_switch_wake_check)(void);
-extern void (*__power_switch_wake_cleanup)(void);
-#endif
-
-/* time.c */
-extern void time_divisor_init(void);
-
-/* cmode.S */
-extern asmlinkage void frv_change_cmode(int);
-
-
-#endif /* __ASSEMBLY__ */
-#endif /* _FRV_LOCAL_H */
Index: linux-2.6/include/asm-alpha/local.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/local.h 2007-11-19 15:45:02.062094005 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-#ifndef _ALPHA_LOCAL_H
-#define _ALPHA_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-#define local_add(i,l) atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l) atomic_long_sub((i),(&(l)->a))
-
-static __inline__ long local_add_return(long i, local_t * l)
-{
- long temp, result;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " addq %0,%3,%0\n"
- " stq_c %0,%1\n"
- " beq %0,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
- :"Ir" (i), "m" (l->a.counter) : "memory");
- return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
- long temp, result;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " subq %0,%3,%2\n"
- " subq %0,%3,%0\n"
- " stq_c %0,%1\n"
- " beq %0,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
- :"Ir" (i), "m" (l->a.counter) : "memory");
- return result;
-}
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- for (;;) { \
- if (unlikely(c == (u))) \
- break; \
- old = local_cmpxchg((l), c, c + (a)); \
- if (likely(old == c)) \
- break; \
- c = old; \
- } \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
-
-#define local_dec_return(l) local_sub_return(1,(l))
-
-#define local_inc_return(l) local_add_return(1,(l))
-
-#define local_sub_and_test(i,l) (local_sub_return((i), (l)) == 0)
-
-#define local_inc_and_test(l) (local_add_return(1, (l)) == 0)
-
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/* Verify if faster than atomic ops */
-#define __local_inc(l) ((l)->a.counter++)
-#define __local_dec(l) ((l)->a.counter++)
-#define __local_add(i,l) ((l)->a.counter+=(i))
-#define __local_sub(i,l) ((l)->a.counter-=(i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l) local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i) local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l) local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l) local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l) local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l) local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l) __local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l) __local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l) __local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l) __local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/include/asm-arm/local.h
===================================================================
--- linux-2.6.orig/include/asm-arm/local.h 2007-11-19 15:45:02.102329901 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-avr32/local.h
===================================================================
--- linux-2.6.orig/include/asm-avr32/local.h 2007-11-19 15:45:02.126639967 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __ASM_AVR32_LOCAL_H
-#define __ASM_AVR32_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_AVR32_LOCAL_H */
Index: linux-2.6/include/asm-blackfin/local.h
===================================================================
--- linux-2.6.orig/include/asm-blackfin/local.h 2007-11-19 15:45:02.161234863 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __BLACKFIN_LOCAL_H
-#define __BLACKFIN_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __BLACKFIN_LOCAL_H */
Index: linux-2.6/include/asm-cris/local.h
===================================================================
--- linux-2.6.orig/include/asm-cris/local.h 2007-11-19 15:45:02.182639948 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-frv/local.h
===================================================================
--- linux-2.6.orig/include/asm-frv/local.h 2007-11-19 15:45:02.190640206 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_LOCAL_H
-#define _ASM_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h 2007-11-19 15:45:02.198640216 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,75 +0,0 @@
-#ifndef _ASM_GENERIC_LOCAL_H
-#define _ASM_GENERIC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/hardirq.h>
-#include <asm/atomic.h>
-#include <asm/types.h>
-
-/*
- * A signed long type for operations which are atomic for a single CPU.
- * Usually used in combination with per-cpu variables.
- *
- * This is the default implementation, which uses atomic_long_t. Which is
- * rather pointless. The whole point behind local_t is that some processors
- * can perform atomic adds and subtracts in a manner which is atomic wrt IRQs
- * running on this CPU. local_t allows exploitation of such capabilities.
- */
-
-/* Implement in terms of atomics. */
-
-/* Don't use typedef: don't want them to be mixed with atomic_t's. */
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set((&(l)->a),(i))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-#define local_add(i,l) atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l) atomic_long_sub((i),(&(l)->a))
-
-#define local_sub_and_test(i, l) atomic_long_sub_and_test((i), (&(l)->a))
-#define local_dec_and_test(l) atomic_long_dec_and_test(&(l)->a)
-#define local_inc_and_test(l) atomic_long_inc_and_test(&(l)->a)
-#define local_add_negative(i, l) atomic_long_add_negative((i), (&(l)->a))
-#define local_add_return(i, l) atomic_long_add_return((i), (&(l)->a))
-#define local_sub_return(i, l) atomic_long_sub_return((i), (&(l)->a))
-#define local_inc_return(l) atomic_long_inc_return(&(l)->a)
-
-#define local_cmpxchg(l, o, n) atomic_long_cmpxchg((&(l)->a), (o), (n))
-#define local_xchg(l, n) atomic_long_xchg((&(l)->a), (n))
-#define local_add_unless(l, a, u) atomic_long_add_unless((&(l)->a), (a), (u))
-#define local_inc_not_zero(l) atomic_long_inc_not_zero(&(l)->a)
-
-/* Non-atomic variants, ie. preemption disabled and won't be touched
- * in interrupt, etc. Some archs can optimize this case well. */
-#define __local_inc(l) local_set((l), local_read(l) + 1)
-#define __local_dec(l) local_set((l), local_read(l) - 1)
-#define __local_add(i,l) local_set((l), local_read(l) + (i))
-#define __local_sub(i,l) local_set((l), local_read(l) - (i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l) local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i) local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l) local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l) local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l) local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l) local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc. Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l) __local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l) __local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l) __local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l) __local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ASM_GENERIC_LOCAL_H */
Index: linux-2.6/include/asm-h8300/local.h
===================================================================
--- linux-2.6.orig/include/asm-h8300/local.h 2007-11-19 15:45:02.245140408 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _H8300_LOCAL_H_
-#define _H8300_LOCAL_H_
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-ia64/local.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/local.h 2007-11-19 15:45:02.277139840 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-m32r/local.h
===================================================================
--- linux-2.6.orig/include/asm-m32r/local.h 2007-11-19 15:45:02.285140737 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __M32R_LOCAL_H
-#define __M32R_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __M32R_LOCAL_H */
Index: linux-2.6/include/asm-m68k/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/local.h 2007-11-19 15:45:02.305140224 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_M68K_LOCAL_H
-#define _ASM_M68K_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_M68K_LOCAL_H */
Index: linux-2.6/include/asm-m68knommu/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68knommu/local.h 2007-11-19 15:45:02.321139891 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __M68KNOMMU_LOCAL_H
-#define __M68KNOMMU_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __M68KNOMMU_LOCAL_H */
Index: linux-2.6/include/asm-mips/local.h
===================================================================
--- linux-2.6.orig/include/asm-mips/local.h 2007-11-19 15:45:02.333140816 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,221 +0,0 @@
-#ifndef _ARCH_MIPS_LOCAL_H
-#define _ARCH_MIPS_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/bitops.h>
-#include <asm/atomic.h>
-#include <asm/cmpxchg.h>
-#include <asm/war.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l, i) atomic_long_set(&(l)->a, (i))
-
-#define local_add(i, l) atomic_long_add((i), (&(l)->a))
-#define local_sub(i, l) atomic_long_sub((i), (&(l)->a))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-
-/*
- * Same as above, but return the result value
- */
-static __inline__ long local_add_return(long i, local_t * l)
-{
- unsigned long result;
-
- if (cpu_has_llsc && R10000_LLSC_WAR) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_add_return \n"
- " addu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqzl %0, 1b \n"
- " addu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else if (cpu_has_llsc) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_add_return \n"
- " addu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqz %0, 1b \n"
- " addu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else {
- unsigned long flags;
-
- local_irq_save(flags);
- result = l->a.counter;
- result += i;
- l->a.counter = result;
- local_irq_restore(flags);
- }
-
- return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
- unsigned long result;
-
- if (cpu_has_llsc && R10000_LLSC_WAR) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_sub_return \n"
- " subu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqzl %0, 1b \n"
- " subu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else if (cpu_has_llsc) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_sub_return \n"
- " subu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqz %0, 1b \n"
- " subu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else {
- unsigned long flags;
-
- local_irq_save(flags);
- result = l->a.counter;
- result -= i;
- l->a.counter = result;
- local_irq_restore(flags);
- }
-
- return result;
-}
-
-#define local_cmpxchg(l, o, n) \
- ((long)cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- while (c != (u) && (old = local_cmpxchg((l), c, c + (a))) != c) \
- c = old; \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_dec_return(l) local_sub_return(1, (l))
-#define local_inc_return(l) local_add_return(1, (l))
-
-/*
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-#define local_sub_and_test(i, l) (local_sub_return((i), (l)) == 0)
-
-/*
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-/*
- * local_dec_and_test - decrement by 1 and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/*
- * local_add_negative - add and test if negative
- * @l: pointer of type local_t
- * @i: integer value to add
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-#define local_add_negative(i, l) (local_add_return(i, (l)) < 0)
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l) ((l)->a.counter++)
-#define __local_dec(l) ((l)->a.counter++)
-#define __local_add(i, l) ((l)->a.counter+=(i))
-#define __local_sub(i, l) ((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/include/asm-parisc/local.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/local.h 2007-11-19 15:45:02.341140171 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-powerpc/local.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/local.h 2007-11-19 15:45:02.365140002 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,200 +0,0 @@
-#ifndef _ARCH_POWERPC_LOCAL_H
-#define _ARCH_POWERPC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-
-#define local_add(i,l) atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l) atomic_long_sub((i),(&(l)->a))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-
-static __inline__ long local_add_return(long a, local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%2 # local_add_return\n\
- add %0,%1,%0\n"
- PPC405_ERR77(0,%2)
- PPC_STLCX "%0,0,%2 \n\
- bne- 1b"
- : "=&r" (t)
- : "r" (a), "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
-
-static __inline__ long local_sub_return(long a, local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%2 # local_sub_return\n\
- subf %0,%1,%0\n"
- PPC405_ERR77(0,%2)
- PPC_STLCX "%0,0,%2 \n\
- bne- 1b"
- : "=&r" (t)
- : "r" (a), "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-static __inline__ long local_inc_return(local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%1 # local_inc_return\n\
- addic %0,%0,1\n"
- PPC405_ERR77(0,%1)
- PPC_STLCX "%0,0,%1 \n\
- bne- 1b"
- : "=&r" (t)
- : "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-/*
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-static __inline__ long local_dec_return(local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%1 # local_dec_return\n\
- addic %0,%0,-1\n"
- PPC405_ERR77(0,%1)
- PPC_STLCX "%0,0,%1\n\
- bne- 1b"
- : "=&r" (t)
- : "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-static __inline__ int local_add_unless(local_t *l, long a, long u)
-{
- long t;
-
- __asm__ __volatile__ (
-"1:" PPC_LLARX "%0,0,%1 # local_add_unless\n\
- cmpw 0,%0,%3 \n\
- beq- 2f \n\
- add %0,%2,%0 \n"
- PPC405_ERR77(0,%2)
- PPC_STLCX "%0,0,%1 \n\
- bne- 1b \n"
-" subf %0,%2,%0 \n\
-2:"
- : "=&r" (t)
- : "r" (&(l->a.counter)), "r" (a), "r" (u)
- : "cc", "memory");
-
- return t != u;
-}
-
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_sub_and_test(a, l) (local_sub_return((a), (l)) == 0)
-#define local_dec_and_test(l) (local_dec_return((l)) == 0)
-
-/*
- * Atomically test *l and decrement if it is greater than 0.
- * The function returns the old value of *l minus 1.
- */
-static __inline__ long local_dec_if_positive(local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%1 # local_dec_if_positive\n\
- cmpwi %0,1\n\
- addi %0,%0,-1\n\
- blt- 2f\n"
- PPC405_ERR77(0,%1)
- PPC_STLCX "%0,0,%1\n\
- bne- 1b"
- "\n\
-2:" : "=&b" (t)
- : "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l) ((l)->a.counter++)
-#define __local_dec(l) ((l)->a.counter++)
-#define __local_add(i,l) ((l)->a.counter+=(i))
-#define __local_sub(i,l) ((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/include/asm-s390/local.h
===================================================================
--- linux-2.6.orig/include/asm-s390/local.h 2007-11-19 15:45:02.373140085 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-sh/local.h
===================================================================
--- linux-2.6.orig/include/asm-sh/local.h 2007-11-19 15:45:02.405389823 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,7 +0,0 @@
-#ifndef __ASM_SH_LOCAL_H
-#define __ASM_SH_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_SH_LOCAL_H */
-
Index: linux-2.6/include/asm-sh64/local.h
===================================================================
--- linux-2.6.orig/include/asm-sh64/local.h 2007-11-19 15:45:02.413640013 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,7 +0,0 @@
-#ifndef __ASM_SH64_LOCAL_H
-#define __ASM_SH64_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_SH64_LOCAL_H */
-
Index: linux-2.6/include/asm-sparc/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc/local.h 2007-11-19 15:45:02.429640001 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _SPARC_LOCAL_H
-#define _SPARC_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-sparc64/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/local.h 2007-11-19 15:45:02.437640328 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-um/local.h
===================================================================
--- linux-2.6.orig/include/asm-um/local.h 2007-11-19 15:45:02.457639838 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __UM_LOCAL_H
-#define __UM_LOCAL_H
-
-#include "asm/arch/local.h"
-
-#endif
Index: linux-2.6/include/asm-v850/local.h
===================================================================
--- linux-2.6.orig/include/asm-v850/local.h 2007-11-19 15:45:02.465640304 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __V850_LOCAL_H__
-#define __V850_LOCAL_H__
-
-#include <asm-generic/local.h>
-
-#endif /* __V850_LOCAL_H__ */
Index: linux-2.6/include/asm-xtensa/local.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/local.h 2007-11-19 15:45:02.469640160 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,16 +0,0 @@
-/*
- * include/asm-xtensa/local.h
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (C) 2001 - 2005 Tensilica Inc.
- */
-
-#ifndef _XTENSA_LOCAL_H
-#define _XTENSA_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _XTENSA_LOCAL_H */
Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h 2007-11-19 16:00:49.421639813 -0800
+++ linux-2.6/include/linux/module.h 2007-11-19 16:25:42.314191640 -0800
@@ -16,7 +16,7 @@
#include <linux/kobject.h>
#include <linux/moduleparam.h>
#include <linux/marker.h>
-#include <asm/local.h>
+#include <linux/percpu.h>

#include <asm/module.h>


--
Mathieu Desnoyers
2007-11-20 12:59:44 UTC
Permalink
Post by c***@sgi.com
There is no user of local_t remaining after the cpu ops patchset. local_t
always suffered from the problem that the operations it generated could not
perform the relocation of a pointer to the target processor and the atomic
update at the same time. Preemption and/or interrupts therefore had to be
disabled, which made it awkward to use.
The question that arises is: are there architectures that do not provide
fast PER_CPU ops but do provide fast local atomic ops?
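
To make the limitation concrete, here is a minimal sketch of the usage
pattern that local_t imposes today (kernel-module style C; the counter name
and function are illustrative, not taken from the patch):

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <asm/local.h>

static DEFINE_PER_CPU(local_t, nr_events) = LOCAL_INIT(0);

static void account_event(void)
{
	/*
	 * Two separate steps: __get_cpu_var() computes the address of
	 * this CPU's copy, then local_inc() updates it atomically with
	 * respect to interrupts. Preemption must stay disabled across
	 * both, otherwise the task could migrate after the address
	 * lookup and increment another CPU's counter.
	 */
	preempt_disable();
	local_inc(&__get_cpu_var(nr_events));
	preempt_enable();
}

(get_cpu_var()/put_cpu_var() are the usual shorthand for the same
disable/lookup/enable sequence.) The cpu ops mentioned above fold the
"which CPU" relocation into the atomic update itself, so the explicit
preemption guard disappears.
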
Post by c***@sgi.com
---
Documentation/local_ops.txt | 209 -------------------------------------
arch/frv/kernel/local.h | 59 ----------
include/asm-alpha/local.h | 118 ---------------------
include/asm-arm/local.h | 1
include/asm-avr32/local.h | 6 -
include/asm-blackfin/local.h | 6 -
include/asm-cris/local.h | 1
include/asm-frv/local.h | 6 -
include/asm-generic/local.h | 75 -------------
include/asm-h8300/local.h | 6 -
include/asm-ia64/local.h | 1
include/asm-m32r/local.h | 6 -
include/asm-m68k/local.h | 6 -
include/asm-m68knommu/local.h | 6 -
include/asm-mips/local.h | 221 ---------------------------------------
include/asm-parisc/local.h | 1
include/asm-powerpc/local.h | 200 ------------------------------------
include/asm-s390/local.h | 1
include/asm-sh/local.h | 7 -
include/asm-sh64/local.h | 7 -
include/asm-sparc/local.h | 6 -
include/asm-sparc64/local.h | 1
include/asm-um/local.h | 6 -
include/asm-v850/local.h | 6 -
include/asm-x86/local.h | 5
include/asm-x86/local_32.h | 233 ------------------------------------------
include/asm-x86/local_64.h | 222 ----------------------------------------
include/asm-xtensa/local.h | 16 --
include/linux/module.h | 2
29 files changed, 1 insertion(+), 1439 deletions(-)
Index: linux-2.6/Documentation/local_ops.txt
===================================================================
--- linux-2.6.orig/Documentation/local_ops.txt 2007-11-19 15:45:01.989139706 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,209 +0,0 @@
- Semantics and Behavior of Local Atomic Operations
-
- Mathieu Desnoyers
-
-
- This document explains the purpose of the local atomic operations, how
-to implement them for any given architecture and shows how they can be used
-properly. It also stresses on the precautions that must be taken when reading
-those local variables across CPUs when the order of memory writes matters.
-
-
-
-* Purpose of local atomic operations
-
-Local atomic operations are meant to provide fast and highly reentrant per CPU
-counters. They minimize the performance cost of standard atomic operations by
-removing the LOCK prefix and memory barriers normally required to synchronize
-across CPUs.
-
-Having fast per CPU atomic counters is interesting in many cases : it does not
-require disabling interrupts to protect from interrupt handlers and it permits
-coherent counters in NMI handlers. It is especially useful for tracing purposes
-and for various performance monitoring counters.
-
-Local atomic operations only guarantee variable modification atomicity wrt the
-CPU which owns the data. Therefore, care must taken to make sure that only one
-CPU writes to the local_t data. This is done by using per cpu data and making
-sure that we modify it from within a preemption safe context. It is however
-permitted to read local_t data from any CPU : it will then appear to be written
-out of order wrt other memory writes by the owner CPU.
-
-
-* Implementation for a given architecture
-
-It can be done by slightly modifying the standard atomic operations : only
-their UP variant must be kept. It typically means removing LOCK prefix (on
-i386 and x86_64) and any SMP sychronization barrier. If the architecture does
-not have a different behavior between SMP and UP, including asm-generic/local.h
-in your archtecture's local.h is sufficient.
-
-The local_t type is defined as an opaque signed long by embedding an
-atomic_long_t inside a structure. This is made so a cast from this type to a
-long fails. The definition looks like :
-
-typedef struct { atomic_long_t a; } local_t;
-
-
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
- to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
- process context to make sure the process won't be migrated to a
- different CPU between getting the per-cpu variable and doing the
- actual local op.
-- When using local ops in interrupt context, no special care must be
- taken on a mainline kernel, since they will run on the local CPU with
- preemption already disabled. I suggest, however, to explicitly
- disable preemption anyway to make sure it will still work correctly on
- -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
- variable.
-- Reads of these variables can be done from any CPU, because updates to
- "long", aligned, variables are always atomic. Since no memory
- synchronization is done by the writer CPU, an outdated copy of the
- variable can be read when reading some _other_ cpu's variables.
-
-
-* Rules to follow when using local atomic operations
-
-- Variables touched by local ops must be per cpu variables.
-- _Only_ the CPU owner of these variables must write to them.
-- This CPU can use local ops from any context (process, irq, softirq, nmi, ...)
- to update its local_t variables.
-- Preemption (or interrupts) must be disabled when using local ops in
- process context to make sure the process won't be migrated to a
- different CPU between getting the per-cpu variable and doing the
- actual local op.
-- When using local ops in interrupt context, no special care must be
- taken on a mainline kernel, since they will run on the local CPU with
- preemption already disabled. I suggest, however, to explicitly
- disable preemption anyway to make sure it will still work correctly on
- -rt kernels.
-- Reading the local cpu variable will provide the current copy of the
- variable.
-- Reads of these variables can be done from any CPU, because updates to
- "long", aligned, variables are always atomic. Since no memory
- synchronization is done by the writer CPU, an outdated copy of the
- variable can be read when reading some _other_ cpu's variables.
-
-
-* How to use local atomic operations
-
-#include <linux/percpu.h>
-#include <asm/local.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-
-* Counting
-
-Counting is done on all the bits of a signed long.
-
-In preemptible context, use get_cpu_var() and put_cpu_var() around local atomic
-operations : it makes sure that preemption is disabled around write access to
-that per cpu variable. For example :
-
- local_inc(&get_cpu_var(counters));
- put_cpu_var(counters);
-
-If you are already in a preemption-safe context, you can directly use
-__get_cpu_var() instead.
-
- local_inc(&__get_cpu_var(counters));
-
-
-
-* Reading the counters
-
-Those local counters can be read from foreign CPUs to sum the count. Note that
-the data seen by local_read across CPUs must be considered to be out of order
-relatively to other memory writes happening on the CPU that owns the data.
-
- long sum = 0;
- for_each_online_cpu(cpu)
- sum += local_read(&per_cpu(counters, cpu));
-
-If you want to use a remote local_read to synchronize access to a resource
-between CPUs, explicit smp_wmb() and smp_rmb() memory barriers must be used
-respectively on the writer and the reader CPUs. It would be the case if you use
-the local_t variable as a counter of bytes written in a buffer : there should
-be a smp_wmb() between the buffer write and the counter increment and also a
-smp_rmb() between the counter read and the buffer read.
-
-
-Here is a sample module which implements a basic per cpu counter using local.h.
-
---- BEGIN ---
-/* test-local.c
- *
- * Sample module for local.h usage.
- */
-
-
-#include <asm/local.h>
-#include <linux/module.h>
-#include <linux/timer.h>
-
-static DEFINE_PER_CPU(local_t, counters) = LOCAL_INIT(0);
-
-static struct timer_list test_timer;
-
-/* IPI called on each CPU. */
-static void test_each(void *info)
-{
- /* Increment the counter from a non preemptible context */
- printk("Increment on cpu %d\n", smp_processor_id());
- local_inc(&__get_cpu_var(counters));
-
- /* This is what incrementing the variable would look like within a
-	 * preemptible context (it disables preemption) :
-	 *
- * local_inc(&get_cpu_var(counters));
- * put_cpu_var(counters);
- */
-}
-
-static void do_test_timer(unsigned long data)
-{
- int cpu;
-
- /* Increment the counters */
- on_each_cpu(test_each, NULL, 0, 1);
- /* Read all the counters */
- printk("Counters read from CPU %d\n", smp_processor_id());
- for_each_online_cpu(cpu) {
- printk("Read : CPU %d, count %ld\n", cpu,
- local_read(&per_cpu(counters, cpu)));
- }
- del_timer(&test_timer);
- test_timer.expires = jiffies + 1000;
- add_timer(&test_timer);
-}
-
-static int __init test_init(void)
-{
- /* initialize the timer that will increment the counter */
- init_timer(&test_timer);
- test_timer.function = do_test_timer;
- test_timer.expires = jiffies + 1;
- add_timer(&test_timer);
-
- return 0;
-}
-
-static void __exit test_exit(void)
-{
- del_timer_sync(&test_timer);
-}
-
-module_init(test_init);
-module_exit(test_exit);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Mathieu Desnoyers");
-MODULE_DESCRIPTION("Local Atomic Ops");
---- END ---
Index: linux-2.6/include/asm-x86/local.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local.h 2007-11-19 15:45:02.002639906 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,5 +0,0 @@
-#ifdef CONFIG_X86_32
-# include "local_32.h"
-#else
-# include "local_64.h"
-#endif
Index: linux-2.6/include/asm-x86/local_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local_32.h 2007-11-19 15:45:02.006640289 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,233 +0,0 @@
-#ifndef _ARCH_I386_LOCAL_H
-#define _ARCH_I386_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/system.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-
-static __inline__ void local_inc(local_t *l)
-{
- __asm__ __volatile__(
- "incl %0"
- :"+m" (l->a.counter));
-}
-
-static __inline__ void local_dec(local_t *l)
-{
- __asm__ __volatile__(
- "decl %0"
- :"+m" (l->a.counter));
-}
-
-static __inline__ void local_add(long i, local_t *l)
-{
- __asm__ __volatile__(
- "addl %1,%0"
- :"+m" (l->a.counter)
- :"ir" (i));
-}
-
-static __inline__ void local_sub(long i, local_t *l)
-{
- __asm__ __volatile__(
- "subl %1,%0"
- :"+m" (l->a.counter)
- :"ir" (i));
-}
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer of type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_sub_and_test(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "subl %2,%0; sete %1"
- :"+m" (l->a.counter), "=qm" (c)
- :"ir" (i) : "memory");
- return c;
-}
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer of type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static __inline__ int local_dec_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "decl %0; sete %1"
- :"+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
-}
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer of type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_inc_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "incl %0; sete %1"
- :"+m" (l->a.counter), "=qm" (c)
- : : "memory");
- return c != 0;
-}
-
-/**
- * local_add_negative - add and test if negative
- * @i: integer value to add
- * @l: pointer of type local_t
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static __inline__ int local_add_negative(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "addl %2,%0; sets %1"
- :"+m" (l->a.counter), "=qm" (c)
- :"ir" (i) : "memory");
- return c;
-}
-
-/**
- * local_add_return - add and return
- * @i: integer value to add
- * @l: pointer of type local_t
- *
- * Atomically adds @i to @l and returns @i + @l
- */
-static __inline__ long local_add_return(long i, local_t *l)
-{
- long __i;
-#ifdef CONFIG_M386
- unsigned long flags;
- if(unlikely(boot_cpu_data.x86 <= 3))
- goto no_xadd;
-#endif
- /* Modern 486+ processor */
- __i = i;
- __asm__ __volatile__(
- "xaddl %0, %1;"
- :"+r" (i), "+m" (l->a.counter)
- : : "memory");
- return i + __i;
-
-#ifdef CONFIG_M386
-no_xadd: /* Legacy 386 processor */
- local_irq_save(flags);
- __i = local_read(l);
- local_set(l, i + __i);
- local_irq_restore(flags);
- return i + __i;
-#endif
-}
-
-static __inline__ long local_sub_return(long i, local_t *l)
-{
- return local_add_return(-i,l);
-}
-
-#define local_inc_return(l) (local_add_return(1,l))
-#define local_dec_return(l) (local_sub_return(1,l))
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-/* Always has a lock prefix */
-#define local_xchg(l, n) (xchg(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- for (;;) { \
- if (unlikely(c == (u))) \
- break; \
- old = local_cmpxchg((l), c, c + (a)); \
- if (likely(old == c)) \
- break; \
- c = old; \
- } \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-/* On x86, these are no better than the atomic variants. */
-#define __local_inc(l) local_inc(l)
-#define __local_dec(l) local_dec(l)
-#define __local_add(i,l) local_add((i),(l))
-#define __local_sub(i,l) local_sub((i),(l))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_I386_LOCAL_H */
Index: linux-2.6/include/asm-x86/local_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/local_64.h 2007-11-19 15:45:02.026640148 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,222 +0,0 @@
-#ifndef _ARCH_X8664_LOCAL_H
-#define _ARCH_X8664_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-
-static inline void local_inc(local_t *l)
-{
- __asm__ __volatile__(
- "incq %0"
- :"=m" (l->a.counter)
- :"m" (l->a.counter));
-}
-
-static inline void local_dec(local_t *l)
-{
- __asm__ __volatile__(
- "decq %0"
- :"=m" (l->a.counter)
- :"m" (l->a.counter));
-}
-
-static inline void local_add(long i, local_t *l)
-{
- __asm__ __volatile__(
- "addq %1,%0"
- :"=m" (l->a.counter)
- :"ir" (i), "m" (l->a.counter));
-}
-
-static inline void local_sub(long i, local_t *l)
-{
- __asm__ __volatile__(
- "subq %1,%0"
- :"=m" (l->a.counter)
- :"ir" (i), "m" (l->a.counter));
-}
-
-/**
- * local_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @l: pointer to type local_t
- *
- * Atomically subtracts @i from @l and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_sub_and_test(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "subq %2,%0; sete %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"ir" (i), "m" (l->a.counter) : "memory");
- return c;
-}
-
-/**
- * local_dec_and_test - decrement and test
- * @l: pointer to type local_t
- *
- * Atomically decrements @l by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static __inline__ int local_dec_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "decq %0; sete %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"m" (l->a.counter) : "memory");
- return c != 0;
-}
-
-/**
- * local_inc_and_test - increment and test
- * @l: pointer to type local_t
- *
- * Atomically increments @l by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static __inline__ int local_inc_and_test(local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "incq %0; sete %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"m" (l->a.counter) : "memory");
- return c != 0;
-}
-
-/**
- * local_add_negative - add and test if negative
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static __inline__ int local_add_negative(long i, local_t *l)
-{
- unsigned char c;
-
- __asm__ __volatile__(
- "addq %2,%0; sets %1"
- :"=m" (l->a.counter), "=qm" (c)
- :"ir" (i), "m" (l->a.counter) : "memory");
- return c;
-}
-
-/**
- * local_add_return - add and return
- * @i: integer value to add
- * @l: pointer to type local_t
- *
- * Atomically adds @i to @l and returns @i + @l
- */
-static __inline__ long local_add_return(long i, local_t *l)
-{
- long __i = i;
- __asm__ __volatile__(
- "xaddq %0, %1;"
- :"+r" (i), "+m" (l->a.counter)
- : : "memory");
- return i + __i;
-}
-
-static __inline__ long local_sub_return(long i, local_t *l)
-{
- return local_add_return(-i,l);
-}
-
-#define local_inc_return(l) (local_add_return(1,l))
-#define local_dec_return(l) (local_sub_return(1,l))
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-/* Always has a lock prefix */
-#define local_xchg(l, n) (xchg(&((l)->a.counter), (n)))
-
-/**
- * atomic_up_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- for (;;) { \
- if (unlikely(c == (u))) \
- break; \
- old = local_cmpxchg((l), c, c + (a)); \
- if (likely(old == c)) \
- break; \
- c = old; \
- } \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-/* On x86-64 these are better than the atomic variants on SMP kernels
- because they dont use a lock prefix. */
-#define __local_inc(l) local_inc(l)
-#define __local_dec(l) local_dec(l)
-#define __local_add(i,l) local_add((i),(l))
-#define __local_sub(i,l) local_sub((i),(l))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- *
- * This could be done better if we moved the per cpu data directly
- * after GS.
- */
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_X8664_LOCAL_H */
Index: linux-2.6/arch/frv/kernel/local.h
===================================================================
--- linux-2.6.orig/arch/frv/kernel/local.h 2007-11-19 15:45:02.509640199 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,59 +0,0 @@
-/* local.h: local definitions
- *
- * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (***@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-
-#ifndef _FRV_LOCAL_H
-#define _FRV_LOCAL_H
-
-#include <asm/sections.h>
-
-#ifndef __ASSEMBLY__
-
-/* dma.c */
-extern unsigned long frv_dma_inprogress;
-
-extern void frv_dma_pause_all(void);
-extern void frv_dma_resume_all(void);
-
-/* sleep.S */
-extern asmlinkage void frv_cpu_suspend(unsigned long);
-extern asmlinkage void frv_cpu_core_sleep(void);
-
-/* setup.c */
-extern unsigned long __nongprelbss pdm_suspend_mode;
-extern void determine_clocks(int verbose);
-extern int __nongprelbss clock_p0_current;
-extern int __nongprelbss clock_cm_current;
-extern int __nongprelbss clock_cmode_current;
-
-#ifdef CONFIG_PM
-extern int __nongprelbss clock_cmodes_permitted;
-extern unsigned long __nongprelbss clock_bits_settable;
-#define CLOCK_BIT_CM 0x0000000f
-#define CLOCK_BIT_CM_H 0x00000001 /* CLKC.CM can be set to 0 */
-#define CLOCK_BIT_CM_M 0x00000002 /* CLKC.CM can be set to 1 */
-#define CLOCK_BIT_CM_L 0x00000004 /* CLKC.CM can be set to 2 */
-#define CLOCK_BIT_P0 0x00000010 /* CLKC.P0 can be changed */
-#define CLOCK_BIT_CMODE 0x00000020 /* CLKC.CMODE can be changed */
-
-extern void (*__power_switch_wake_setup)(void);
-extern int (*__power_switch_wake_check)(void);
-extern void (*__power_switch_wake_cleanup)(void);
-#endif
-
-/* time.c */
-extern void time_divisor_init(void);
-
-/* cmode.S */
-extern asmlinkage void frv_change_cmode(int);
-
-
-#endif /* __ASSEMBLY__ */
-#endif /* _FRV_LOCAL_H */
Index: linux-2.6/include/asm-alpha/local.h
===================================================================
--- linux-2.6.orig/include/asm-alpha/local.h 2007-11-19 15:45:02.062094005 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,118 +0,0 @@
-#ifndef _ALPHA_LOCAL_H
-#define _ALPHA_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-#define local_add(i,l) atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l) atomic_long_sub((i),(&(l)->a))
-
-static __inline__ long local_add_return(long i, local_t * l)
-{
- long temp, result;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " addq %0,%3,%0\n"
- " stq_c %0,%1\n"
- " beq %0,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
- :"Ir" (i), "m" (l->a.counter) : "memory");
- return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
- long temp, result;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " subq %0,%3,%2\n"
- " subq %0,%3,%0\n"
- " stq_c %0,%1\n"
- " beq %0,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (temp), "=m" (l->a.counter), "=&r" (result)
- :"Ir" (i), "m" (l->a.counter) : "memory");
- return result;
-}
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- * @l: pointer of type local_t
- * @a: the amount to add to l...
- * @u: ...unless l is equal to u.
- *
- * Atomically adds @a to @l, so long as it was not @u.
- * Returns non-zero if @l was not @u, and zero otherwise.
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- for (;;) { \
- if (unlikely(c == (u))) \
- break; \
- old = local_cmpxchg((l), c, c + (a)); \
- if (likely(old == c)) \
- break; \
- c = old; \
- } \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
-
-#define local_dec_return(l) local_sub_return(1,(l))
-
-#define local_inc_return(l) local_add_return(1,(l))
-
-#define local_sub_and_test(i,l) (local_sub_return((i), (l)) == 0)
-
-#define local_inc_and_test(l) (local_add_return(1, (l)) == 0)
-
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/* Verify if faster than atomic ops */
-#define __local_inc(l) ((l)->a.counter++)
-#define __local_dec(l) ((l)->a.counter++)
-#define __local_add(i,l) ((l)->a.counter+=(i))
-#define __local_sub(i,l) ((l)->a.counter-=(i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-#define cpu_local_read(l) local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i) local_set(&__get_cpu_var(l), (i))
-
-#define cpu_local_inc(l) local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l) local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l) local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l) local_sub((i), &__get_cpu_var(l))
-
-#define __cpu_local_inc(l) __local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l) __local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l) __local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l) __local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ALPHA_LOCAL_H */
Index: linux-2.6/include/asm-arm/local.h
===================================================================
--- linux-2.6.orig/include/asm-arm/local.h 2007-11-19 15:45:02.102329901 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-avr32/local.h
===================================================================
--- linux-2.6.orig/include/asm-avr32/local.h 2007-11-19 15:45:02.126639967 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __ASM_AVR32_LOCAL_H
-#define __ASM_AVR32_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_AVR32_LOCAL_H */
Index: linux-2.6/include/asm-blackfin/local.h
===================================================================
--- linux-2.6.orig/include/asm-blackfin/local.h 2007-11-19 15:45:02.161234863 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __BLACKFIN_LOCAL_H
-#define __BLACKFIN_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __BLACKFIN_LOCAL_H */
Index: linux-2.6/include/asm-cris/local.h
===================================================================
--- linux-2.6.orig/include/asm-cris/local.h 2007-11-19 15:45:02.182639948 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-frv/local.h
===================================================================
--- linux-2.6.orig/include/asm-frv/local.h 2007-11-19 15:45:02.190640206 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_LOCAL_H
-#define _ASM_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_LOCAL_H */
Index: linux-2.6/include/asm-generic/local.h
===================================================================
--- linux-2.6.orig/include/asm-generic/local.h 2007-11-19 15:45:02.198640216 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,75 +0,0 @@
-#ifndef _ASM_GENERIC_LOCAL_H
-#define _ASM_GENERIC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/hardirq.h>
-#include <asm/atomic.h>
-#include <asm/types.h>
-
-/*
- * A signed long type for operations which are atomic for a single CPU.
- * Usually used in combination with per-cpu variables.
- *
- * This is the default implementation, which uses atomic_long_t. Which is
- * rather pointless. The whole point behind local_t is that some processors
- * can perform atomic adds and subtracts in a manner which is atomic wrt IRQs
- * running on this CPU. local_t allows exploitation of such capabilities.
- */
-
-/* Implement in terms of atomics. */
-
-/* Don't use typedef: don't want them to be mixed with atomic_t's. */
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set((&(l)->a),(i))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-#define local_add(i,l) atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l) atomic_long_sub((i),(&(l)->a))
-
-#define local_sub_and_test(i, l) atomic_long_sub_and_test((i), (&(l)->a))
-#define local_dec_and_test(l) atomic_long_dec_and_test(&(l)->a)
-#define local_inc_and_test(l) atomic_long_inc_and_test(&(l)->a)
-#define local_add_negative(i, l) atomic_long_add_negative((i), (&(l)->a))
-#define local_add_return(i, l) atomic_long_add_return((i), (&(l)->a))
-#define local_sub_return(i, l) atomic_long_sub_return((i), (&(l)->a))
-#define local_inc_return(l) atomic_long_inc_return(&(l)->a)
-
-#define local_cmpxchg(l, o, n) atomic_long_cmpxchg((&(l)->a), (o), (n))
-#define local_xchg(l, n) atomic_long_xchg((&(l)->a), (n))
-#define local_add_unless(l, a, u) atomic_long_add_unless((&(l)->a), (a), (u))
-#define local_inc_not_zero(l) atomic_long_inc_not_zero(&(l)->a)
-
-/* Non-atomic variants, ie. preemption disabled and won't be touched
- * in interrupt, etc. Some archs can optimize this case well. */
-#define __local_inc(l) local_set((l), local_read(l) + 1)
-#define __local_dec(l) local_set((l), local_read(l) - 1)
-#define __local_add(i,l) local_set((l), local_read(l) + (i))
-#define __local_sub(i,l) local_set((l), local_read(l) - (i))
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable (eg. mystruct.foo), not an address.
- */
-#define cpu_local_read(l) local_read(&__get_cpu_var(l))
-#define cpu_local_set(l, i) local_set(&__get_cpu_var(l), (i))
-#define cpu_local_inc(l) local_inc(&__get_cpu_var(l))
-#define cpu_local_dec(l) local_dec(&__get_cpu_var(l))
-#define cpu_local_add(i, l) local_add((i), &__get_cpu_var(l))
-#define cpu_local_sub(i, l) local_sub((i), &__get_cpu_var(l))
-
-/* Non-atomic increments, ie. preemption disabled and won't be touched
- * in interrupt, etc. Some archs can optimize this case well.
- */
-#define __cpu_local_inc(l) __local_inc(&__get_cpu_var(l))
-#define __cpu_local_dec(l) __local_dec(&__get_cpu_var(l))
-#define __cpu_local_add(i, l) __local_add((i), &__get_cpu_var(l))
-#define __cpu_local_sub(i, l) __local_sub((i), &__get_cpu_var(l))
-
-#endif /* _ASM_GENERIC_LOCAL_H */
Index: linux-2.6/include/asm-h8300/local.h
===================================================================
--- linux-2.6.orig/include/asm-h8300/local.h 2007-11-19 15:45:02.245140408 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _H8300_LOCAL_H_
-#define _H8300_LOCAL_H_
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-ia64/local.h
===================================================================
--- linux-2.6.orig/include/asm-ia64/local.h 2007-11-19 15:45:02.277139840 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-m32r/local.h
===================================================================
--- linux-2.6.orig/include/asm-m32r/local.h 2007-11-19 15:45:02.285140737 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __M32R_LOCAL_H
-#define __M32R_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __M32R_LOCAL_H */
Index: linux-2.6/include/asm-m68k/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68k/local.h 2007-11-19 15:45:02.305140224 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _ASM_M68K_LOCAL_H
-#define _ASM_M68K_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _ASM_M68K_LOCAL_H */
Index: linux-2.6/include/asm-m68knommu/local.h
===================================================================
--- linux-2.6.orig/include/asm-m68knommu/local.h 2007-11-19 15:45:02.321139891 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __M68KNOMMU_LOCAL_H
-#define __M68KNOMMU_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __M68KNOMMU_LOCAL_H */
Index: linux-2.6/include/asm-mips/local.h
===================================================================
--- linux-2.6.orig/include/asm-mips/local.h 2007-11-19 15:45:02.333140816 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,221 +0,0 @@
-#ifndef _ARCH_MIPS_LOCAL_H
-#define _ARCH_MIPS_LOCAL_H
-
-#include <linux/percpu.h>
-#include <linux/bitops.h>
-#include <asm/atomic.h>
-#include <asm/cmpxchg.h>
-#include <asm/war.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l, i) atomic_long_set(&(l)->a, (i))
-
-#define local_add(i, l) atomic_long_add((i), (&(l)->a))
-#define local_sub(i, l) atomic_long_sub((i), (&(l)->a))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-
-/*
- * Same as above, but return the result value
- */
-static __inline__ long local_add_return(long i, local_t * l)
-{
- unsigned long result;
-
- if (cpu_has_llsc && R10000_LLSC_WAR) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_add_return \n"
- " addu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqzl %0, 1b \n"
- " addu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else if (cpu_has_llsc) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_add_return \n"
- " addu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqz %0, 1b \n"
- " addu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else {
- unsigned long flags;
-
- local_irq_save(flags);
- result = l->a.counter;
- result += i;
- l->a.counter = result;
- local_irq_restore(flags);
- }
-
- return result;
-}
-
-static __inline__ long local_sub_return(long i, local_t * l)
-{
- unsigned long result;
-
- if (cpu_has_llsc && R10000_LLSC_WAR) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_sub_return \n"
- " subu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqzl %0, 1b \n"
- " subu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else if (cpu_has_llsc) {
- unsigned long temp;
-
- __asm__ __volatile__(
- " .set mips3 \n"
- "1:" __LL "%1, %2 # local_sub_return \n"
- " subu %0, %1, %3 \n"
- __SC "%0, %2 \n"
- " beqz %0, 1b \n"
- " subu %0, %1, %3 \n"
- " .set mips0 \n"
- : "=&r" (result), "=&r" (temp), "=m" (l->a.counter)
- : "Ir" (i), "m" (l->a.counter)
- : "memory");
- } else {
- unsigned long flags;
-
- local_irq_save(flags);
- result = l->a.counter;
- result -= i;
- l->a.counter = result;
- local_irq_restore(flags);
- }
-
- return result;
-}
-
-#define local_cmpxchg(l, o, n) \
- ((long)cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- *
- */
-#define local_add_unless(l, a, u) \
-({ \
- long c, old; \
- c = local_read(l); \
- while (c != (u) && (old = local_cmpxchg((l), c, c + (a))) != c) \
- c = old; \
- c != (u); \
-})
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_dec_return(l) local_sub_return(1, (l))
-#define local_inc_return(l) local_add_return(1, (l))
-
-/*
- * local_sub_and_test - subtract value from variable and test result
- *
- * true if the result is zero, or false for all
- * other cases.
- */
-#define local_sub_and_test(i, l) (local_sub_return((i), (l)) == 0)
-
-/*
- * local_inc_and_test - increment and test
- *
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-/*
- * local_dec_and_test - decrement by 1 and test
- *
- * returns true if the result is 0, or false for all other
- * cases.
- */
-#define local_dec_and_test(l) (local_sub_return(1, (l)) == 0)
-
-/*
- * local_add_negative - add and test if negative
- *
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-#define local_add_negative(i, l) (local_add_return(i, (l)) < 0)
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l) ((l)->a.counter++)
-#define __local_dec(l) ((l)->a.counter++)
-#define __local_add(i, l) ((l)->a.counter+=(i))
-#define __local_sub(i, l) ((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_MIPS_LOCAL_H */
Index: linux-2.6/include/asm-parisc/local.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/local.h 2007-11-19 15:45:02.341140171 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-powerpc/local.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/local.h 2007-11-19 15:45:02.365140002 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,200 +0,0 @@
-#ifndef _ARCH_POWERPC_LOCAL_H
-#define _ARCH_POWERPC_LOCAL_H
-
-#include <linux/percpu.h>
-#include <asm/atomic.h>
-
-typedef struct
-{
- atomic_long_t a;
-} local_t;
-
-#define LOCAL_INIT(i) { ATOMIC_LONG_INIT(i) }
-
-#define local_read(l) atomic_long_read(&(l)->a)
-#define local_set(l,i) atomic_long_set(&(l)->a, (i))
-
-#define local_add(i,l) atomic_long_add((i),(&(l)->a))
-#define local_sub(i,l) atomic_long_sub((i),(&(l)->a))
-#define local_inc(l) atomic_long_inc(&(l)->a)
-#define local_dec(l) atomic_long_dec(&(l)->a)
-
-static __inline__ long local_add_return(long a, local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%2 # local_add_return\n\
- add %0,%1,%0\n"
- PPC405_ERR77(0,%2)
- PPC_STLCX "%0,0,%2 \n\
- bne- 1b"
- : "=&r" (t)
- : "r" (a), "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-#define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
-
-static __inline__ long local_sub_return(long a, local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%2 # local_sub_return\n\
- subf %0,%1,%0\n"
- PPC405_ERR77(0,%2)
- PPC_STLCX "%0,0,%2 \n\
- bne- 1b"
- : "=&r" (t)
- : "r" (a), "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-static __inline__ long local_inc_return(local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%1 # local_inc_return\n\
- addic %0,%0,1\n"
- PPC405_ERR77(0,%1)
- PPC_STLCX "%0,0,%1 \n\
- bne- 1b"
- : "=&r" (t)
- : "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-/*
- * local_inc_and_test - increment and test
- *
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-#define local_inc_and_test(l) (local_inc_return(l) == 0)
-
-static __inline__ long local_dec_return(local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%1 # local_dec_return\n\
- addic %0,%0,-1\n"
- PPC405_ERR77(0,%1)
- PPC_STLCX "%0,0,%1\n\
- bne- 1b"
- : "=&r" (t)
- : "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-#define local_cmpxchg(l, o, n) \
- (cmpxchg_local(&((l)->a.counter), (o), (n)))
-#define local_xchg(l, n) (xchg_local(&((l)->a.counter), (n)))
-
-/**
- * local_add_unless - add unless the number is a given value
- *
- */
-static __inline__ int local_add_unless(local_t *l, long a, long u)
-{
- long t;
-
- __asm__ __volatile__ (
-"1:" PPC_LLARX "%0,0,%1 # local_add_unless\n\
- cmpw 0,%0,%3 \n\
- beq- 2f \n\
- add %0,%2,%0 \n"
- PPC405_ERR77(0,%2)
- PPC_STLCX "%0,0,%1 \n\
- bne- 1b \n"
-" subf %0,%2,%0 \n\
-2:"
- : "=&r" (t)
- : "r" (&(l->a.counter)), "r" (a), "r" (u)
- : "cc", "memory");
-
- return t != u;
-}
-
-#define local_inc_not_zero(l) local_add_unless((l), 1, 0)
-
-#define local_sub_and_test(a, l) (local_sub_return((a), (l)) == 0)
-#define local_dec_and_test(l) (local_dec_return((l)) == 0)
-
-/*
- * Atomically test *l and decrement if it is greater than 0.
- * The function returns the old value of *l minus 1.
- */
-static __inline__ long local_dec_if_positive(local_t *l)
-{
- long t;
-
- __asm__ __volatile__(
-"1:" PPC_LLARX "%0,0,%1 # local_dec_if_positive\n\
- cmpwi %0,1\n\
- addi %0,%0,-1\n\
- blt- 2f\n"
- PPC405_ERR77(0,%1)
- PPC_STLCX "%0,0,%1\n\
- bne- 1b"
- "\n\
-2:" : "=&b" (t)
- : "r" (&(l->a.counter))
- : "cc", "memory");
-
- return t;
-}
-
-/* Use these for per-cpu local_t variables: on some archs they are
- * much more efficient than these naive implementations. Note they take
- * a variable, not an address.
- */
-
-#define __local_inc(l) ((l)->a.counter++)
-#define __local_dec(l) ((l)->a.counter++)
-#define __local_add(i,l) ((l)->a.counter+=(i))
-#define __local_sub(i,l) ((l)->a.counter-=(i))
-
-/* Need to disable preemption for the cpu local counters otherwise we could
- still access a variable of a previous CPU in a non atomic way. */
-#define cpu_local_wrap_v(l) \
- ({ local_t res__; \
- preempt_disable(); \
- res__ = (l); \
- preempt_enable(); \
- res__; })
-#define cpu_local_wrap(l) \
- ({ preempt_disable(); \
- l; \
- preempt_enable(); }) \
-
-#define cpu_local_read(l) cpu_local_wrap_v(local_read(&__get_cpu_var(l)))
-#define cpu_local_set(l, i) cpu_local_wrap(local_set(&__get_cpu_var(l), (i)))
-#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var(l)))
-#define cpu_local_dec(l) cpu_local_wrap(local_dec(&__get_cpu_var(l)))
-#define cpu_local_add(i, l) cpu_local_wrap(local_add((i), &__get_cpu_var(l)))
-#define cpu_local_sub(i, l) cpu_local_wrap(local_sub((i), &__get_cpu_var(l)))
-
-#define __cpu_local_inc(l) cpu_local_inc(l)
-#define __cpu_local_dec(l) cpu_local_dec(l)
-#define __cpu_local_add(i, l) cpu_local_add((i), (l))
-#define __cpu_local_sub(i, l) cpu_local_sub((i), (l))
-
-#endif /* _ARCH_POWERPC_LOCAL_H */
Index: linux-2.6/include/asm-s390/local.h
===================================================================
--- linux-2.6.orig/include/asm-s390/local.h 2007-11-19 15:45:02.373140085 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-sh/local.h
===================================================================
--- linux-2.6.orig/include/asm-sh/local.h 2007-11-19 15:45:02.405389823 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,7 +0,0 @@
-#ifndef __ASM_SH_LOCAL_H
-#define __ASM_SH_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_SH_LOCAL_H */
-
Index: linux-2.6/include/asm-sh64/local.h
===================================================================
--- linux-2.6.orig/include/asm-sh64/local.h 2007-11-19 15:45:02.413640013 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,7 +0,0 @@
-#ifndef __ASM_SH64_LOCAL_H
-#define __ASM_SH64_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* __ASM_SH64_LOCAL_H */
-
Index: linux-2.6/include/asm-sparc/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc/local.h 2007-11-19 15:45:02.429640001 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef _SPARC_LOCAL_H
-#define _SPARC_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif
Index: linux-2.6/include/asm-sparc64/local.h
===================================================================
--- linux-2.6.orig/include/asm-sparc64/local.h 2007-11-19 15:45:02.437640328 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1 +0,0 @@
-#include <asm-generic/local.h>
Index: linux-2.6/include/asm-um/local.h
===================================================================
--- linux-2.6.orig/include/asm-um/local.h 2007-11-19 15:45:02.457639838 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __UM_LOCAL_H
-#define __UM_LOCAL_H
-
-#include "asm/arch/local.h"
-
-#endif
Index: linux-2.6/include/asm-v850/local.h
===================================================================
--- linux-2.6.orig/include/asm-v850/local.h 2007-11-19 15:45:02.465640304 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,6 +0,0 @@
-#ifndef __V850_LOCAL_H__
-#define __V850_LOCAL_H__
-
-#include <asm-generic/local.h>
-
-#endif /* __V850_LOCAL_H__ */
Index: linux-2.6/include/asm-xtensa/local.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/local.h 2007-11-19 15:45:02.469640160 -0800
+++ /dev/null 1970-01-01 00:00:00.000000000 +0000
@@ -1,16 +0,0 @@
-/*
- * include/asm-xtensa/local.h
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (C) 2001 - 2005 Tensilica Inc.
- */
-
-#ifndef _XTENSA_LOCAL_H
-#define _XTENSA_LOCAL_H
-
-#include <asm-generic/local.h>
-
-#endif /* _XTENSA_LOCAL_H */
Index: linux-2.6/include/linux/module.h
===================================================================
--- linux-2.6.orig/include/linux/module.h 2007-11-19 16:00:49.421639813 -0800
+++ linux-2.6/include/linux/module.h 2007-11-19 16:25:42.314191640 -0800
@@ -16,7 +16,7 @@
#include <linux/kobject.h>
#include <linux/moduleparam.h>
#include <linux/marker.h>
-#include <asm/local.h>
+#include <linux/percpu.h>
#include <asm/module.h>
--
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Christoph Lameter
2007-11-20 20:48:26 UTC
Permalink
Post by Mathieu Desnoyers
The question that arises is: are there some architectures that do not
provide fast PER_CPU ops but do provide fast local atomic ops?
There were only a few non-critical users of local_t before this patch. So I
doubt that it matters.
Christoph Lameter
2007-11-20 01:18:27 UTC
Permalink
Duh, a patch did not make it due to the XXX in the header.

I will also put this on the git.kernel.org slab tree, branch cpudata


x86_64: Strip down PDA operations through the use of CPU_XXX operations.

The *_pda operations behave in the same way as the CPU_XXX ops: both access
data relative to a segment register. So strip out as much of the pda
interface as we can.
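
As a minimal illustration (not part of the patch itself), and assuming the
CPU_READ op and the per_cpu_var() helper introduced earlier in this series,
an accessor like read_pda() on the existing cpunumber field maps over as
follows:

	int cpu;

	/* old style: dedicated pda accessor */
	cpu = read_pda(cpunumber);

	/* what the accessor expands to after this patch; both forms
	 * should compile down to a single %gs relative load
	 */
	cpu = CPU_READ(per_cpu_var(pda).cpunumber);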

What is left after this patchset is a handful of special pda ops for x86_64:

or_pda()
test_and_clear_bit_pda()

Signed-off-by: Christoph Lameter <***@sgi.com>

---
include/asm-x86/current_64.h | 2 +-
include/asm-x86/pda.h | 34 ++++------------------------------
include/asm-x86/thread_info_64.h | 2 ++
3 files changed, 7 insertions(+), 31 deletions(-)

Index: linux-2.6/include/asm-x86/pda.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pda.h 2007-11-19 16:17:22.325639807 -0800
+++ linux-2.6/include/asm-x86/pda.h 2007-11-19 16:24:13.569640223 -0800
@@ -81,36 +81,10 @@ extern struct x8664_pda _proxy_pda;
} \
} while (0)

-#define pda_from_op(op,field) ({ \
- typeof(_proxy_pda.field) ret__; \
- switch (sizeof(_proxy_pda.field)) { \
- case 2: \
- asm(op "w %%gs:%c1,%0" : \
- "=r" (ret__) : \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- case 4: \
- asm(op "l %%gs:%c1,%0": \
- "=r" (ret__): \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- case 8: \
- asm(op "q %%gs:%c1,%0": \
- "=r" (ret__) : \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- default: \
- __bad_pda_field(); \
- } \
- ret__; })
-
-#define read_pda(field) pda_from_op("mov",field)
-#define write_pda(field,val) pda_to_op("mov",field,val)
-#define add_pda(field,val) pda_to_op("add",field,val)
-#define sub_pda(field,val) pda_to_op("sub",field,val)
+#define read_pda(field) CPU_READ(per_cpu_var(pda).field)
+#define write_pda(field,val) CPU_WRITE(per_cpu_var(pda).field, val)
+#define add_pda(field,val) CPU_ADD(per_cpu_var(pda).field, val)
+#define sub_pda(field,val) CPU_SUB(per_cpu_var(pda).field, val)
#define or_pda(field,val) pda_to_op("or",field,val)

/* This is not atomic against other CPUs -- CPU preemption needs to be off */
Index: linux-2.6/include/asm-x86/current_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/current_64.h 2007-11-19 15:45:03.470390243 -0800
+++ linux-2.6/include/asm-x86/current_64.h 2007-11-19 16:24:13.569640223 -0800
@@ -4,7 +4,7 @@
#if !defined(__ASSEMBLY__)
struct task_struct;

-#include <asm/pda.h>
+#include <asm/percpu.h>

static inline struct task_struct *get_current(void)
{
Index: linux-2.6/include/asm-x86/thread_info_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/thread_info_64.h 2007-11-19 15:45:03.482390495 -0800
+++ linux-2.6/include/asm-x86/thread_info_64.h 2007-11-19 16:24:13.569640223 -0800
@@ -41,6 +41,8 @@ struct thread_info {
* preempt_count needs to be 1 initially, until the scheduler is functional.
*/
#ifndef __ASSEMBLY__
+#include <asm/percpu_64.h>
+
#define INIT_THREAD_INFO(tsk) \
{ \
.task = &tsk, \
David Miller
2007-11-20 01:51:16 UTC
Permalink
From: ***@sgi.com
Date: Mon, 19 Nov 2007 17:11:32 -0800
Post by c***@sgi.com
mov %gs:0x8,%rdx Get smp_processor_id
mov tableoffset,%rax Get table base
incq varoffset(%rax,%rdx,1) Perform the operation with a complex lookup
adding the var offset
An interrupt or a reschedule action can move the execution thread to another
processor if interrupt or preempt is not disabled. Then the variable of
the wrong processor may be updated in a racy way.
incq %gs:varoffset(%rip)
Single instruction that is safe from interrupts or moving of the execution
thread. It will reliably operate on the current processors data area.
Other platforms can also perform address relocation plus atomic ops on
a memory location. Exploiting of the atomicity of instructions vs interrupts
is therefore possible and will reduce the cpu op processing overhead.
F.e on IA64 we have per cpu virtual mapping of the per cpu area. If
we add an offset to the per cpu area variable address then we can guarantee
that we always hit the per cpu areas local to a processor.
Other platforms (SPARC?) have registers that can be used to form addresses.
If the cpu area address is in one of those then atomic per cpu modifications
can be generated for those platforms in the same way.
Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because
the atomic is much slower than doing a:

local_irq_disable();
nonatomic_percpu_memory_op();
local_irq_enable();

local_irq_{disable,enable}() together is about 18 cycles.
Just the cmpxchg() part of the atomic sequence is at least
32 cycles and requires a loop:

while (1) {
	x = ld();
	if (cmpxchg(x, op(x)))
		break;
}

which bloats up the atomic version even more.
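
Spelled out with the kernel's three-argument cmpxchg() and a hypothetical
per-cpu counter (the variable and function names below are made up for
illustration), the two alternatives look roughly like this:

	/* assumes something like: DEFINE_PER_CPU(long, counter); */

	static void counter_inc_irq(void)
	{
		/* irq-protected, non-atomic update: ~18 cycles of overhead */
		local_irq_disable();
		__get_cpu_var(counter)++;
		local_irq_enable();
	}

	static void counter_inc_atomic(void)
	{
		/* atomic update built on cmpxchg(): 32+ cycles for the
		 * cmpxchg alone, plus the retry loop
		 */
		long *p = &__get_cpu_var(counter);
		long old;

		do {
			old = *p;
		} while (cmpxchg(p, old, old + 1) != old);
	}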
Christoph Lameter
2007-11-20 01:59:33 UTC
Permalink
Post by David Miller
Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because the atomic is much slower than doing a:
local_irq_disable();
nonatomic_percpu_memory_op();
local_irq_enable();
local_irq_{disable,enable}() together is about 18 cycles.
Just the cmpxchg() part of the atomic sequence is at least 32 cycles and requires a loop:
while (1) {
x = ld();
if (cmpxchg(x, op(x)))
break;
}
which bloats up the atomic version even more.
In that case the generic fallbacks can just provide what you already have.
David Miller
2007-11-20 02:10:16 UTC
Permalink
From: Christoph Lameter <***@sgi.com>
Date: Mon, 19 Nov 2007 17:59:33 -0800 (PST)
Post by Christoph Lameter
In that case the generic fallbacks can just provide what you already have.
I understand, I was just letting you know why we probably won't
take advantage of this new stuff :-)
Christoph Lameter
2007-11-20 02:12:24 UTC
Permalink
Post by David Miller
Date: Mon, 19 Nov 2007 17:59:33 -0800 (PST)
Post by Christoph Lameter
In that case the generic fallbacks can just provide what you already have.
I understand, I was just letting you know why we probably won't
take advantage of this new stuff :-)
On the other hand, the pointer array removal and the allocation
density improvements of the cpu_alloc should also help sparc by
increasing the information density in cachelines and thus the overall
speed of your 64p box.
Andi Kleen
2007-11-20 03:25:34 UTC
Permalink
Post by David Miller
Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because the atomic is much slower than doing a:
local_irq_disable();
nonatomic_percpu_memory_op();
local_irq_enable();
Again, I might be pointing out the obvious, but you of course need
save_flags()/restore_flags(), not disable/enable().
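
In other words, the update needs the flags-saving variants so that it also
works in contexts that already run with interrupts disabled; a sketch,
again with a hypothetical per-cpu counter:

	unsigned long flags;

	local_irq_save(flags);		/* safe even if irqs are already off */
	__get_cpu_var(counter)++;	/* counter: made-up per-cpu variable */
	local_irq_restore(flags);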

If it were just disable/enable, x86 could do it much faster too
and Christoph probably would never have felt the need to approach
this project for his SLUB fast path.

-Andi
Christoph Lameter
2007-11-20 03:33:25 UTC
Permalink
Post by Andi Kleen
Post by David Miller
Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because the atomic is much slower than doing a:
local_irq_disable();
nonatomic_percpu_memory_op();
local_irq_enable();
Again might be pointing out the obvious, but you
need of course save_flags()/restore_flags(), not disable/enable().
If it was just disable/enable x86 could do it much faster too
and Christoph probably would never felt the need to approach
this project for his SLUB fast path.
I no longer have a need for that with the material now in Andrew's
tree. However, this cuts out another 6 cycles from the fastpath, and I
found that the same principles reduce overhead all over the kernel.
David Miller
2007-11-20 04:04:41 UTC
Permalink
From: Andi Kleen <***@suse.de>
Date: Tue, 20 Nov 2007 04:25:34 +0100
Post by Andi Kleen
Post by David Miller
Although we have a per-cpu area base in a fixed global register
for addressing, the above isn't beneficial on sparc64 because the atomic is much slower than doing a:
local_irq_disable();
nonatomic_percpu_memory_op();
local_irq_enable();
Again might be pointing out the obvious, but you
need of course save_flags()/restore_flags(), not disable/enable().
Right, but on sparc64 the cost of that is the same, unlike on
x86 et al.