Discussion:
[PATCH 04/43] Document the slow work thread pool [ver #46]
(too old to reply)
David Howells
2009-04-01 23:03:42 UTC
Permalink
Document the slow work thread pool.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/slow-work.txt | 174 +++++++++++++++++++++++++++++++++++++++++++
include/linux/slow-work.h | 2
kernel/slow-work.c | 2
3 files changed, 178 insertions(+), 0 deletions(-)
create mode 100644 Documentation/slow-work.txt


diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt
new file mode 100644
index 0000000..ebc50f8
--- /dev/null
+++ b/Documentation/slow-work.txt
@@ -0,0 +1,174 @@
+ ====================================
+ SLOW WORK ITEM EXECUTION THREAD POOL
+ ====================================
+
+By: David Howells <***@redhat.com>
+
+The slow work item execution thread pool is a pool of threads for performing
+things that take a relatively long time, such as making mkdir calls.
+Typically, when processing something, these items will spend a lot of time
+blocking a thread on I/O, thus making that thread unavailable for doing other
+work.
+
+The standard workqueue model is unsuitable for this class of work item as that
+limits the owner to a single thread or a single thread per CPU. For some
+tasks, however, more threads - or fewer - are required.
+
+There is just one pool per system. It contains no threads unless something
+wants to use it - and that something must register its interest first. When
+the pool is active, the number of threads it contains is dynamic, varying
+between a maximum and minimum setting, depending on the load.
+
+
+====================
+CLASSES OF WORK ITEM
+====================
+
+This pool support two classes of work items:
+
+ (*) Slow work items.
+
+ (*) Very slow work items.
+
+The former are expected to finish much quicker than the latter.
+
+An operation of the very slow class may do a batch combination of several
+lookups, mkdirs, and a create for instance.
+
+An operation of the ordinarily slow class may, for example, write stuff or
+expand files, provided the time taken to do so isn't too long.
+
+Operations of both types may sleep during execution, thus tying up the thread
+loaned to it.
+
+
+THREAD-TO-CLASS ALLOCATION
+--------------------------
+
+Not all the threads in the pool are available to work on very slow work items.
+The number will be between one and one fewer than the number of active threads.
+This is configurable (see the "Pool Configuration" section).
+
+All the threads are available to work on ordinarily slow work items, but a
+percentage of the threads will prefer to work on very slow work items.
+
+The configuration ensures that at least one thread will be available to work on
+very slow work items, and at least one thread will be available that won't work
+on very slow work items at all.
+
+
+=====================
+USING SLOW WORK ITEMS
+=====================
+
+Firstly, a module or subsystem wanting to make use of slow work items must
+register its interest:
+
+ int ret = slow_work_register_user();
+
+This will return 0 if successful, or a -ve error upon failure.
+
+
+Slow work items may then be set up by:
+
+ (1) Declaring a slow_work struct type variable:
+
+ #include <linux/slow-work.h>
+
+ struct slow_work myitem;
+
+ (2) Declaring the operations to be used for this item:
+
+ struct slow_work_ops myitem_ops = {
+ .get_ref = myitem_get_ref,
+ .put_ref = myitem_put_ref,
+ .execute = myitem_execute,
+ };
+
+ [*] For a description of the ops, see section "Item Operations".
+
+ (3) Initialising the item:
+
+ slow_work_init(&myitem, &myitem_ops);
+
+ or:
+
+ vslow_work_init(&myitem, &myitem_ops);
+
+ depending on its class.
+
+A suitably set up work item can then be enqueued for processing:
+
+ int ret = slow_work_enqueue(&myitem);
+
+This will return a -ve error if the thread pool is unable to gain a reference
+on the item, 0 otherwise.
+
+
+The items are reference counted, so there ought to be no need for a flush
+operation. When all a module's slow work items have been processed, and the
+module has no further interest in the facility, it should unregister its
+interest:
+
+ slow_work_unregister_user();
+
+
+===============
+ITEM OPERATIONS
+===============
+
+Each work item requires a table of operations of type struct slow_work_ops.
+All members are required:
+
+ (*) Get a reference on an item:
+
+ int (*get_ref)(struct slow_work *work);
+
+ This allows the thread pool to attempt to pin an item by getting a
+ reference on it. This function should return 0 if the reference was
+ granted, or a -ve error otherwise. If an error is returned,
+ slow_work_enqueue() will fail.
+
+ The reference is held whilst the item is queued and whilst it is being
+ executed. The item may then be requeued with the same reference held, or
+ the reference will be released.
+
+ (*) Release a reference on an item:
+
+ void (*put_ref)(struct slow_work *work);
+
+ This allows the thread pool to unpin an item by releasing the reference on
+ it. The thread pool will not touch the item again once this has been
+ called.
+
+ (*) Execute an item:
+
+ void (*execute)(struct slow_work *work);
+
+ This should perform the work required of the item. It may sleep, it may
+ perform disk I/O and it may wait for locks.
+
+
+==================
+POOL CONFIGURATION
+==================
+
+The slow-work thread pool has a number of configurables:
+
+ (*) /proc/sys/kernel/slow-work/min-threads
+
+ The minimum number of threads that should be in the pool whilst it is in
+ use. This may be anywhere between 2 and max-threads.
+
+ (*) /proc/sys/kernel/slow-work/max-threads
+
+ The maximum number of threads that should in the pool. This may be
+ anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater.
+
+ (*) /proc/sys/kernel/slow-work/vslow-percentage
+
+ The percentage of active threads in the pool that may be used to execute
+ very slow work items. This may be between 1 and 99. The resultant number
+ is bounded to between 1 and one fewer than the number of active threads.
+ This ensures there is always at least one thread that can process very
+ slow work items, and always at least one thread that won't.
diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index 8262809..8595827 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -7,6 +7,8 @@
* modify it under the terms of the GNU General Public Licence
* as published by the Free Software Foundation; either version
* 2 of the Licence, or (at your option) any later version.
+ *
+ * See Documentation/slow-work.txt
*/

#ifndef _LINUX_SLOW_WORK_H
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index 3f65900..cf2bc01 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -7,6 +7,8 @@
* modify it under the terms of the GNU General Public Licence
* as published by the Free Software Foundation; either version
* 2 of the Licence, or (at your option) any later version.
+ *
+ * See Documentation/slow-work.txt
*/

#include <linux/module.h>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:03:26 UTC
Permalink
Create a dynamically sized pool of threads for doing very slow work items, such
as invoking mkdir() or rmdir() - things that may take a long time and may
sleep, holding mutexes/semaphores and hogging a thread, and are thus unsuitable
for workqueues.

The number of threads is always at least a settable minimum, but more are
started when there's more work to do, up to a limit. Because of the nature of
the load, it's not suitable for a 1-thread-per-CPU type pool. A system with
one CPU may well want several threads.

This is used by FS-Cache to do slow caching operations in the background, such
as looking up, creating or deleting cache objects.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Serge Hallyn <***@us.ibm.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

include/linux/slow-work.h | 88 ++++++++++
init/Kconfig | 12 +
kernel/Makefile | 1
kernel/slow-work.c | 388 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 489 insertions(+), 0 deletions(-)
create mode 100644 include/linux/slow-work.h
create mode 100644 kernel/slow-work.c


diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
new file mode 100644
index 0000000..4dd754a
--- /dev/null
+++ b/include/linux/slow-work.h
@@ -0,0 +1,88 @@
+/* Worker thread pool for slow items, such as filesystem lookups or mkdirs
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _LINUX_SLOW_WORK_H
+#define _LINUX_SLOW_WORK_H
+
+#ifdef CONFIG_SLOW_WORK
+
+struct slow_work;
+
+/*
+ * The operations used to support slow work items
+ */
+struct slow_work_ops {
+ /* get a ref on a work item
+ * - return 0 if successful, -ve if not
+ */
+ int (*get_ref)(struct slow_work *work);
+
+ /* discard a ref to a work item */
+ void (*put_ref)(struct slow_work *work);
+
+ /* execute a work item */
+ void (*execute)(struct slow_work *work);
+};
+
+/*
+ * A slow work item
+ * - A reference is held on the parent object by the thread pool when it is
+ * queued
+ */
+struct slow_work {
+ unsigned long flags;
+#define SLOW_WORK_PENDING 0 /* item pending (further) execution */
+#define SLOW_WORK_EXECUTING 1 /* item currently executing */
+#define SLOW_WORK_ENQ_DEFERRED 2 /* item enqueue deferred */
+#define SLOW_WORK_VERY_SLOW 3 /* item is very slow */
+ const struct slow_work_ops *ops; /* operations table for this item */
+ struct list_head link; /* link in queue */
+};
+
+/**
+ * slow_work_init - Initialise a slow work item
+ * @work: The work item to initialise
+ * @ops: The operations to use to handle the slow work item
+ *
+ * Initialise a slow work item.
+ */
+static inline void slow_work_init(struct slow_work *work,
+ const struct slow_work_ops *ops)
+{
+ work->flags = 0;
+ work->ops = ops;
+ INIT_LIST_HEAD(&work->link);
+}
+
+/**
+ * slow_work_init - Initialise a very slow work item
+ * @work: The work item to initialise
+ * @ops: The operations to use to handle the slow work item
+ *
+ * Initialise a very slow work item. This item will be restricted such that
+ * only a certain number of the pool threads will be able to execute items of
+ * this type.
+ */
+static inline void vslow_work_init(struct slow_work *work,
+ const struct slow_work_ops *ops)
+{
+ work->flags = 1 << SLOW_WORK_VERY_SLOW;
+ work->ops = ops;
+ INIT_LIST_HEAD(&work->link);
+}
+
+extern int slow_work_enqueue(struct slow_work *work);
+extern int slow_work_register_user(void);
+extern void slow_work_unregister_user(void);
+
+
+#endif /* CONFIG_SLOW_WORK */
+#endif /* _LINUX_SLOW_WORK_H */
diff --git a/init/Kconfig b/init/Kconfig
index 14c483d..a864688 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1012,6 +1012,18 @@ config MARKERS

source "arch/Kconfig"

+config SLOW_WORK
+ default n
+ bool "Enable slow work thread pool"
+ help
+ The slow work thread pool provides a number of dynamically allocated
+ threads that can be used by the kernel to perform operations that
+ take a relatively long time.
+
+ An example of this would be CacheFiles doing a path lookup followed
+ by a series of mkdirs and a create call, all of which have to touch
+ disk.
+
endmenu # General setup

config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/Makefile b/kernel/Makefile
index e4791b3..bab1dff 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_SLOW_WORK) += slow-work.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <***@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
new file mode 100644
index 0000000..5a73927
--- /dev/null
+++ b/kernel/slow-work.c
@@ -0,0 +1,388 @@
+/* Worker thread pool for slow items, such as filesystem lookups or mkdirs
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/slow-work.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include <linux/wait.h>
+#include <asm/system.h>
+
+/*
+ * The pool of threads has at least min threads in it as long as someone is
+ * using the facility, and may have as many as max.
+ *
+ * A portion of the pool may be processing very slow operations.
+ */
+static unsigned slow_work_min_threads = 2;
+static unsigned slow_work_max_threads = 4;
+static unsigned vslow_work_proportion = 50; /* % of threads that may process
+ * very slow work */
+static atomic_t slow_work_thread_count;
+static atomic_t vslow_work_executing_count;
+
+/*
+ * The queues of work items and the lock governing access to them. These are
+ * shared between all the CPUs. It doesn't make sense to have per-CPU queues
+ * as the number of threads bears no relation to the number of CPUs.
+ *
+ * There are two queues of work items: one for slow work items, and one for
+ * very slow work items.
+ */
+static LIST_HEAD(slow_work_queue);
+static LIST_HEAD(vslow_work_queue);
+static DEFINE_SPINLOCK(slow_work_queue_lock);
+
+/*
+ * The thread controls. A variable used to signal to the threads that they
+ * should exit when the queue is empty, a waitqueue used by the threads to wait
+ * for signals, and a completion set by the last thread to exit.
+ */
+static bool slow_work_threads_should_exit;
+static DECLARE_WAIT_QUEUE_HEAD(slow_work_thread_wq);
+static DECLARE_COMPLETION(slow_work_last_thread_exited);
+
+/*
+ * The number of users of the thread pool and its lock. Whilst this is zero we
+ * have no threads hanging around, and when this reaches zero, we wait for all
+ * active or queued work items to complete and kill all the threads we do have.
+ */
+static int slow_work_user_count;
+static DEFINE_MUTEX(slow_work_user_lock);
+
+/*
+ * Calculate the maximum number of active threads in the pool that are
+ * permitted to process very slow work items.
+ *
+ * The answer is rounded up to at least 1, but may not equal or exceed the
+ * maximum number of the threads in the pool. This means we always have at
+ * least one thread that can process slow work items, and we always have at
+ * least one thread that won't get tied up doing so.
+ */
+static unsigned slow_work_calc_vsmax(void)
+{
+ unsigned vsmax;
+
+ vsmax = atomic_read(&slow_work_thread_count) * vslow_work_proportion;
+ vsmax /= 100;
+ vsmax = max(vsmax, 1U);
+ return min(vsmax, slow_work_max_threads - 1);
+}
+
+/*
+ * Attempt to execute stuff queued on a slow thread. Return true if we managed
+ * it, false if there was nothing to do.
+ */
+static bool slow_work_execute(void)
+{
+ struct slow_work *work = NULL;
+ unsigned vsmax;
+ bool very_slow;
+
+ vsmax = slow_work_calc_vsmax();
+
+ /* find something to execute */
+ spin_lock_irq(&slow_work_queue_lock);
+ if (!list_empty(&vslow_work_queue) &&
+ atomic_read(&vslow_work_executing_count) < vsmax) {
+ work = list_entry(vslow_work_queue.next,
+ struct slow_work, link);
+ if (test_and_set_bit_lock(SLOW_WORK_EXECUTING, &work->flags))
+ BUG();
+ list_del_init(&work->link);
+ atomic_inc(&vslow_work_executing_count);
+ very_slow = true;
+ } else if (!list_empty(&slow_work_queue)) {
+ work = list_entry(slow_work_queue.next,
+ struct slow_work, link);
+ if (test_and_set_bit_lock(SLOW_WORK_EXECUTING, &work->flags))
+ BUG();
+ list_del_init(&work->link);
+ very_slow = false;
+ } else {
+ very_slow = false; /* avoid the compiler warning */
+ }
+ spin_unlock_irq(&slow_work_queue_lock);
+
+ if (!work)
+ return false;
+
+ if (!test_and_clear_bit(SLOW_WORK_PENDING, &work->flags))
+ BUG();
+
+ work->ops->execute(work);
+
+ if (very_slow)
+ atomic_dec(&vslow_work_executing_count);
+ clear_bit_unlock(SLOW_WORK_EXECUTING, &work->flags);
+
+ /* if someone tried to enqueue the item whilst we were executing it,
+ * then it'll be left unenqueued to avoid multiple threads trying to
+ * execute it simultaneously
+ *
+ * there is, however, a race between us testing the pending flag and
+ * getting the spinlock, and between the enqueuer setting the pending
+ * flag and getting the spinlock, so we use a deferral bit to tell us
+ * if the enqueuer got there first
+ */
+ if (test_bit(SLOW_WORK_PENDING, &work->flags)) {
+ spin_lock_irq(&slow_work_queue_lock);
+
+ if (!test_bit(SLOW_WORK_EXECUTING, &work->flags) &&
+ test_and_clear_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags))
+ goto auto_requeue;
+
+ spin_unlock_irq(&slow_work_queue_lock);
+ }
+
+ work->ops->put_ref(work);
+ return true;
+
+auto_requeue:
+ /* we must complete the enqueue operation
+ * - we transfer our ref on the item back to the appropriate queue
+ * - don't wake another thread up as we're awake already
+ */
+ if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
+ list_add_tail(&work->link, &vslow_work_queue);
+ else
+ list_add_tail(&work->link, &slow_work_queue);
+ spin_unlock_irq(&slow_work_queue_lock);
+ return true;
+}
+
+/**
+ * slow_work_enqueue - Schedule a slow work item for processing
+ * @work: The work item to queue
+ *
+ * Schedule a slow work item for processing. If the item is already undergoing
+ * execution, this guarantees not to re-enter the execution routine until the
+ * first execution finishes.
+ *
+ * The item is pinned by this function as it retains a reference to it, managed
+ * through the item operations. The item is unpinned once it has been
+ * executed.
+ *
+ * An item may hog the thread that is running it for a relatively large amount
+ * of time, sufficient, for example, to perform several lookup, mkdir, create
+ * and setxattr operations. It may sleep on I/O and may sleep to obtain locks.
+ *
+ * Conversely, if a number of items are awaiting processing, it may take some
+ * time before any given item is given attention. The number of threads in the
+ * pool may be increased to deal with demand, but only up to a limit.
+ *
+ * If SLOW_WORK_VERY_SLOW is set on the work item, then it will be placed in
+ * the very slow queue, from which only a portion of the threads will be
+ * allowed to pick items to execute. This ensures that very slow items won't
+ * overly block ones that are just ordinarily slow.
+ *
+ * Returns 0 if successful, -EAGAIN if not.
+ */
+int slow_work_enqueue(struct slow_work *work)
+{
+ unsigned long flags;
+
+ BUG_ON(slow_work_user_count <= 0);
+ BUG_ON(!work);
+ BUG_ON(!work->ops);
+ BUG_ON(!work->ops->get_ref);
+
+ /* when honouring an enqueue request, we only promise that we will run
+ * the work function in the future; we do not promise to run it once
+ * per enqueue request
+ *
+ * we use the PENDING bit to merge together repeat requests without
+ * having to disable IRQs and take the spinlock, whilst still
+ * maintaining our promise
+ */
+ if (!test_and_set_bit_lock(SLOW_WORK_PENDING, &work->flags)) {
+ spin_lock_irqsave(&slow_work_queue_lock, flags);
+
+ /* we promise that we will not attempt to execute the work
+ * function in more than one thread simultaneously
+ *
+ * this, however, leaves us with a problem if we're asked to
+ * enqueue the work whilst someone is executing the work
+ * function as simply queueing the work immediately means that
+ * another thread may try executing it whilst it is already
+ * under execution
+ *
+ * to deal with this, we set the ENQ_DEFERRED bit instead of
+ * enqueueing, and the thread currently executing the work
+ * function will enqueue the work item when the work function
+ * returns and it has cleared the EXECUTING bit
+ */
+ if (test_bit(SLOW_WORK_EXECUTING, &work->flags)) {
+ set_bit(SLOW_WORK_ENQ_DEFERRED, &work->flags);
+ } else {
+ if (work->ops->get_ref(work) < 0)
+ goto cant_get_ref;
+ if (test_bit(SLOW_WORK_VERY_SLOW, &work->flags))
+ list_add_tail(&work->link, &vslow_work_queue);
+ else
+ list_add_tail(&work->link, &slow_work_queue);
+ wake_up(&slow_work_thread_wq);
+ }
+
+ spin_unlock_irqrestore(&slow_work_queue_lock, flags);
+ }
+ return 0;
+
+cant_get_ref:
+ spin_unlock_irqrestore(&slow_work_queue_lock, flags);
+ return -EAGAIN;
+}
+EXPORT_SYMBOL(slow_work_enqueue);
+
+/*
+ * Determine if there is slow work available for dispatch
+ */
+static inline bool slow_work_available(int vsmax)
+{
+ return !list_empty(&slow_work_queue) ||
+ (!list_empty(&vslow_work_queue) &&
+ atomic_read(&vslow_work_executing_count) < vsmax);
+}
+
+/*
+ * Worker thread dispatcher
+ */
+static int slow_work_thread(void *_data)
+{
+ int vsmax;
+
+ DEFINE_WAIT(wait);
+
+ set_freezable();
+ set_user_nice(current, -5);
+
+ for (;;) {
+ vsmax = vslow_work_proportion;
+ vsmax *= atomic_read(&slow_work_thread_count);
+ vsmax /= 100;
+
+ prepare_to_wait(&slow_work_thread_wq, &wait,
+ TASK_INTERRUPTIBLE);
+ if (!freezing(current) &&
+ !slow_work_threads_should_exit &&
+ !slow_work_available(vsmax))
+ schedule();
+ finish_wait(&slow_work_thread_wq, &wait);
+
+ try_to_freeze();
+
+ vsmax = vslow_work_proportion;
+ vsmax *= atomic_read(&slow_work_thread_count);
+ vsmax /= 100;
+
+ if (slow_work_available(vsmax) && slow_work_execute()) {
+ cond_resched();
+ continue;
+ }
+
+ if (slow_work_threads_should_exit)
+ break;
+ }
+
+ if (atomic_dec_and_test(&slow_work_thread_count))
+ complete_and_exit(&slow_work_last_thread_exited, 0);
+ return 0;
+}
+
+/**
+ * slow_work_register_user - Register a user of the facility
+ *
+ * Register a user of the facility, starting up the initial threads if there
+ * aren't any other users at this point. This will return 0 if successful, or
+ * an error if not.
+ */
+int slow_work_register_user(void)
+{
+ struct task_struct *p;
+ int loop;
+
+ mutex_lock(&slow_work_user_lock);
+
+ if (slow_work_user_count == 0) {
+ printk(KERN_NOTICE "Slow work thread pool: Starting up\n");
+ init_completion(&slow_work_last_thread_exited);
+
+ slow_work_threads_should_exit = false;
+
+ /* start the minimum number of threads */
+ for (loop = 0; loop < slow_work_min_threads; loop++) {
+ atomic_inc(&slow_work_thread_count);
+ p = kthread_run(slow_work_thread, NULL, "kslowd");
+ if (IS_ERR(p))
+ goto error;
+ }
+ printk(KERN_NOTICE "Slow work thread pool: Ready\n");
+ }
+
+ slow_work_user_count++;
+ mutex_unlock(&slow_work_user_lock);
+ return 0;
+
+error:
+ if (atomic_dec_and_test(&slow_work_thread_count))
+ complete(&slow_work_last_thread_exited);
+ if (loop > 0) {
+ printk(KERN_ERR "Slow work thread pool:"
+ " Aborting startup on ENOMEM\n");
+ slow_work_threads_should_exit = true;
+ wake_up_all(&slow_work_thread_wq);
+ wait_for_completion(&slow_work_last_thread_exited);
+ printk(KERN_ERR "Slow work thread pool: Aborted\n");
+ }
+ mutex_unlock(&slow_work_user_lock);
+ return PTR_ERR(p);
+}
+EXPORT_SYMBOL(slow_work_register_user);
+
+/**
+ * slow_work_unregister_user - Unregister a user of the facility
+ *
+ * Unregister a user of the facility, killing all the threads if this was the
+ * last one.
+ */
+void slow_work_unregister_user(void)
+{
+ mutex_lock(&slow_work_user_lock);
+
+ BUG_ON(slow_work_user_count <= 0);
+
+ slow_work_user_count--;
+ if (slow_work_user_count == 0) {
+ printk(KERN_NOTICE "Slow work thread pool: Shutting down\n");
+ slow_work_threads_should_exit = true;
+ wake_up_all(&slow_work_thread_wq);
+ wait_for_completion(&slow_work_last_thread_exited);
+ printk(KERN_NOTICE "Slow work thread pool:"
+ " Shut down complete\n");
+ }
+
+ mutex_unlock(&slow_work_user_lock);
+}
+EXPORT_SYMBOL(slow_work_unregister_user);
+
+/*
+ * Initialise the slow work facility
+ */
+static int __init init_slow_work(void)
+{
+ unsigned nr_cpus = num_possible_cpus();
+
+ if (nr_cpus > slow_work_max_threads)
+ slow_work_max_threads = nr_cpus;
+ return 0;
+}
+
+subsys_initcall(init_slow_work);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:03:32 UTC
Permalink
Make the slow-work thread pool actually dynamic in the number of threads it
contains. With this patch, it will both create additional threads when it has
extra work to do, and cull excess threads that aren't doing anything.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Serge Hallyn <***@us.ibm.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

kernel/slow-work.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 137 insertions(+), 1 deletions(-)


diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index 5a73927..454abb2 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -16,6 +16,14 @@
#include <linux/wait.h>
#include <asm/system.h>

+#define SLOW_WORK_CULL_TIMEOUT (5 * HZ) /* cull threads 5s after running out of
+ * things to do */
+#define SLOW_WORK_OOM_TIMEOUT (5 * HZ) /* can't start new threads for 5s after
+ * OOM */
+
+static void slow_work_cull_timeout(unsigned long);
+static void slow_work_oom_timeout(unsigned long);
+
/*
* The pool of threads has at least min threads in it as long as someone is
* using the facility, and may have as many as max.
@@ -29,6 +37,12 @@ static unsigned vslow_work_proportion = 50; /* % of threads that may process
static atomic_t slow_work_thread_count;
static atomic_t vslow_work_executing_count;

+static bool slow_work_may_not_start_new_thread;
+static bool slow_work_cull; /* cull a thread due to lack of activity */
+static DEFINE_TIMER(slow_work_cull_timer, slow_work_cull_timeout, 0, 0);
+static DEFINE_TIMER(slow_work_oom_timer, slow_work_oom_timeout, 0, 0);
+static struct slow_work slow_work_new_thread; /* new thread starter */
+
/*
* The queues of work items and the lock governing access to them. These are
* shared between all the CPUs. It doesn't make sense to have per-CPU queues
@@ -89,6 +103,14 @@ static bool slow_work_execute(void)

vsmax = slow_work_calc_vsmax();

+ /* see if we can schedule a new thread to be started if we're not
+ * keeping up with the work */
+ if (!waitqueue_active(&slow_work_thread_wq) &&
+ (!list_empty(&slow_work_queue) || !list_empty(&vslow_work_queue)) &&
+ atomic_read(&slow_work_thread_count) < slow_work_max_threads &&
+ !slow_work_may_not_start_new_thread)
+ slow_work_enqueue(&slow_work_new_thread);
+
/* find something to execute */
spin_lock_irq(&slow_work_queue_lock);
if (!list_empty(&vslow_work_queue) &&
@@ -243,6 +265,33 @@ cant_get_ref:
EXPORT_SYMBOL(slow_work_enqueue);

/*
+ * Worker thread culling algorithm
+ */
+static bool slow_work_cull_thread(void)
+{
+ unsigned long flags;
+ bool do_cull = false;
+
+ spin_lock_irqsave(&slow_work_queue_lock, flags);
+
+ if (slow_work_cull) {
+ slow_work_cull = false;
+
+ if (list_empty(&slow_work_queue) &&
+ list_empty(&vslow_work_queue) &&
+ atomic_read(&slow_work_thread_count) >
+ slow_work_min_threads) {
+ mod_timer(&slow_work_cull_timer,
+ jiffies + SLOW_WORK_CULL_TIMEOUT);
+ do_cull = true;
+ }
+ }
+
+ spin_unlock_irqrestore(&slow_work_queue_lock, flags);
+ return do_cull;
+}
+
+/*
* Determine if there is slow work available for dispatch
*/
static inline bool slow_work_available(int vsmax)
@@ -273,7 +322,8 @@ static int slow_work_thread(void *_data)
TASK_INTERRUPTIBLE);
if (!freezing(current) &&
!slow_work_threads_should_exit &&
- !slow_work_available(vsmax))
+ !slow_work_available(vsmax) &&
+ !slow_work_cull)
schedule();
finish_wait(&slow_work_thread_wq, &wait);

@@ -285,11 +335,20 @@ static int slow_work_thread(void *_data)

if (slow_work_available(vsmax) && slow_work_execute()) {
cond_resched();
+ if (list_empty(&slow_work_queue) &&
+ list_empty(&vslow_work_queue) &&
+ atomic_read(&slow_work_thread_count) >
+ slow_work_min_threads)
+ mod_timer(&slow_work_cull_timer,
+ jiffies + SLOW_WORK_CULL_TIMEOUT);
continue;
}

if (slow_work_threads_should_exit)
break;
+
+ if (slow_work_cull && slow_work_cull_thread())
+ break;
}

if (atomic_dec_and_test(&slow_work_thread_count))
@@ -297,6 +356,77 @@ static int slow_work_thread(void *_data)
return 0;
}

+/*
+ * Handle thread cull timer expiration
+ */
+static void slow_work_cull_timeout(unsigned long data)
+{
+ slow_work_cull = true;
+ wake_up(&slow_work_thread_wq);
+}
+
+/*
+ * Get a reference on slow work thread starter
+ */
+static int slow_work_new_thread_get_ref(struct slow_work *work)
+{
+ return 0;
+}
+
+/*
+ * Drop a reference on slow work thread starter
+ */
+static void slow_work_new_thread_put_ref(struct slow_work *work)
+{
+}
+
+/*
+ * Start a new slow work thread
+ */
+static void slow_work_new_thread_execute(struct slow_work *work)
+{
+ struct task_struct *p;
+
+ if (slow_work_threads_should_exit)
+ return;
+
+ if (atomic_read(&slow_work_thread_count) >= slow_work_max_threads)
+ return;
+
+ if (!mutex_trylock(&slow_work_user_lock))
+ return;
+
+ slow_work_may_not_start_new_thread = true;
+ atomic_inc(&slow_work_thread_count);
+ p = kthread_run(slow_work_thread, NULL, "kslowd");
+ if (IS_ERR(p)) {
+ printk(KERN_DEBUG "Slow work thread pool: OOM\n");
+ if (atomic_dec_and_test(&slow_work_thread_count))
+ BUG(); /* we're running on a slow work thread... */
+ mod_timer(&slow_work_oom_timer,
+ jiffies + SLOW_WORK_OOM_TIMEOUT);
+ } else {
+ /* ratelimit the starting of new threads */
+ mod_timer(&slow_work_oom_timer, jiffies + 1);
+ }
+
+ mutex_unlock(&slow_work_user_lock);
+}
+
+static const struct slow_work_ops slow_work_new_thread_ops = {
+ .get_ref = slow_work_new_thread_get_ref,
+ .put_ref = slow_work_new_thread_put_ref,
+ .execute = slow_work_new_thread_execute,
+};
+
+/*
+ * post-OOM new thread start suppression expiration
+ */
+static void slow_work_oom_timeout(unsigned long data)
+{
+ slow_work_may_not_start_new_thread = false;
+}
+
/**
* slow_work_register_user - Register a user of the facility
*
@@ -316,6 +446,10 @@ int slow_work_register_user(void)
init_completion(&slow_work_last_thread_exited);

slow_work_threads_should_exit = false;
+ slow_work_init(&slow_work_new_thread,
+ &slow_work_new_thread_ops);
+ slow_work_may_not_start_new_thread = false;
+ slow_work_cull = false;

/* start the minimum number of threads */
for (loop = 0; loop < slow_work_min_threads; loop++) {
@@ -369,6 +503,8 @@ void slow_work_unregister_user(void)
" Shut down complete\n");
}

+ del_timer_sync(&slow_work_cull_timer);
+
mutex_unlock(&slow_work_user_lock);
}
EXPORT_SYMBOL(slow_work_unregister_user);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:03:37 UTC
Permalink
Make the slow work pool configurable through /proc/sys/kernel/slow-work.

(*) /proc/sys/kernel/slow-work/min-threads

The minimum number of threads that should be in the pool as long as it is
in use. This may be anywhere between 2 and max-threads.

(*) /proc/sys/kernel/slow-work/max-threads

The maximum number of threads that should in the pool. This may be
anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater.

(*) /proc/sys/kernel/slow-work/vslow-percentage

The percentage of active threads in the pool that may be used to execute
very slow work items. This may be between 1 and 99. The resultant number
is bounded to between 1 and one fewer than the number of active threads.
This ensures there is always at least one thread that can process very
slow work items, and always at least one thread that won't.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Serge Hallyn <***@us.ibm.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

include/linux/slow-work.h | 5 ++
kernel/slow-work.c | 118 ++++++++++++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 9 +++
3 files changed, 130 insertions(+), 2 deletions(-)


diff --git a/include/linux/slow-work.h b/include/linux/slow-work.h
index 4dd754a..8262809 100644
--- a/include/linux/slow-work.h
+++ b/include/linux/slow-work.h
@@ -14,6 +14,8 @@

#ifdef CONFIG_SLOW_WORK

+#include <linux/sysctl.h>
+
struct slow_work;

/*
@@ -83,6 +85,9 @@ extern int slow_work_enqueue(struct slow_work *work);
extern int slow_work_register_user(void);
extern void slow_work_unregister_user(void);

+#ifdef CONFIG_SYSCTL
+extern ctl_table slow_work_sysctls[];
+#endif

#endif /* CONFIG_SLOW_WORK */
#endif /* _LINUX_SLOW_WORK_H */
diff --git a/kernel/slow-work.c b/kernel/slow-work.c
index 454abb2..3f65900 100644
--- a/kernel/slow-work.c
+++ b/kernel/slow-work.c
@@ -14,7 +14,6 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/wait.h>
-#include <asm/system.h>

#define SLOW_WORK_CULL_TIMEOUT (5 * HZ) /* cull threads 5s after running out of
* things to do */
@@ -24,6 +23,14 @@
static void slow_work_cull_timeout(unsigned long);
static void slow_work_oom_timeout(unsigned long);

+#ifdef CONFIG_SYSCTL
+static int slow_work_min_threads_sysctl(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+
+static int slow_work_max_threads_sysctl(struct ctl_table *, int , struct file *,
+ void __user *, size_t *, loff_t *);
+#endif
+
/*
* The pool of threads has at least min threads in it as long as someone is
* using the facility, and may have as many as max.
@@ -34,6 +41,51 @@ static unsigned slow_work_min_threads = 2;
static unsigned slow_work_max_threads = 4;
static unsigned vslow_work_proportion = 50; /* % of threads that may process
* very slow work */
+
+#ifdef CONFIG_SYSCTL
+static const int slow_work_min_min_threads = 2;
+static int slow_work_max_max_threads = 255;
+static const int slow_work_min_vslow = 1;
+static const int slow_work_max_vslow = 99;
+
+ctl_table slow_work_sysctls[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "min-threads",
+ .data = &slow_work_min_threads,
+ .maxlen = sizeof(unsigned),
+ .mode = 0644,
+ .proc_handler = slow_work_min_threads_sysctl,
+ .extra1 = (void *) &slow_work_min_min_threads,
+ .extra2 = &slow_work_max_threads,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "max-threads",
+ .data = &slow_work_max_threads,
+ .maxlen = sizeof(unsigned),
+ .mode = 0644,
+ .proc_handler = slow_work_max_threads_sysctl,
+ .extra1 = &slow_work_min_threads,
+ .extra2 = (void *) &slow_work_max_max_threads,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "vslow-percentage",
+ .data = &vslow_work_proportion,
+ .maxlen = sizeof(unsigned),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = (void *) &slow_work_min_vslow,
+ .extra2 = (void *) &slow_work_max_vslow,
+ },
+ { .ctl_name = 0 }
+};
+#endif
+
+/*
+ * The active state of the thread pool
+ */
static atomic_t slow_work_thread_count;
static atomic_t vslow_work_executing_count;

@@ -427,6 +479,64 @@ static void slow_work_oom_timeout(unsigned long data)
slow_work_may_not_start_new_thread = false;
}

+#ifdef CONFIG_SYSCTL
+/*
+ * Handle adjustment of the minimum number of threads
+ */
+static int slow_work_min_threads_sysctl(struct ctl_table *table, int write,
+ struct file *filp, void __user *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+ int n;
+
+ if (ret == 0) {
+ mutex_lock(&slow_work_user_lock);
+ if (slow_work_user_count > 0) {
+ /* see if we need to start or stop threads */
+ n = atomic_read(&slow_work_thread_count) -
+ slow_work_min_threads;
+
+ if (n < 0 && !slow_work_may_not_start_new_thread)
+ slow_work_enqueue(&slow_work_new_thread);
+ else if (n > 0)
+ mod_timer(&slow_work_cull_timer,
+ jiffies + SLOW_WORK_CULL_TIMEOUT);
+ }
+ mutex_unlock(&slow_work_user_lock);
+ }
+
+ return ret;
+}
+
+/*
+ * Handle adjustment of the maximum number of threads
+ */
+static int slow_work_max_threads_sysctl(struct ctl_table *table, int write,
+ struct file *filp, void __user *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+ int n;
+
+ if (ret == 0) {
+ mutex_lock(&slow_work_user_lock);
+ if (slow_work_user_count > 0) {
+ /* see if we need to stop threads */
+ n = slow_work_max_threads -
+ atomic_read(&slow_work_thread_count);
+
+ if (n < 0)
+ mod_timer(&slow_work_cull_timer,
+ jiffies + SLOW_WORK_CULL_TIMEOUT);
+ }
+ mutex_unlock(&slow_work_user_lock);
+ }
+
+ return ret;
+}
+#endif /* CONFIG_SYSCTL */
+
/**
* slow_work_register_user - Register a user of the facility
*
@@ -516,8 +626,12 @@ static int __init init_slow_work(void)
{
unsigned nr_cpus = num_possible_cpus();

- if (nr_cpus > slow_work_max_threads)
+ if (slow_work_max_threads < nr_cpus)
slow_work_max_threads = nr_cpus;
+#ifdef CONFIG_SYSCTL
+ if (slow_work_max_max_threads < nr_cpus * 2)
+ slow_work_max_max_threads = nr_cpus * 2;
+#endif
return 0;
}

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 2e490a3..8a9b696 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -48,6 +48,7 @@
#include <linux/acpi.h>
#include <linux/reboot.h>
#include <linux/ftrace.h>
+#include <linux/slow-work.h>

#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -900,6 +901,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &scan_unevictable_handler,
},
#endif
+#ifdef CONFIG_SLOW_WORK
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "slow-work",
+ .mode = 0555,
+ .child = slow_work_sysctls,
+ },
+#endif
/*
* NOTE: do not add new entries to this table unless you have read
* Documentation/sysctl/ctl_unnumbered.txt

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:03:47 UTC
Permalink
The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the honours.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

include/linux/page-flags.h | 2 +-
mm/readahead.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 38 insertions(+), 3 deletions(-)


diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 61df177..9d99e74 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -182,7 +182,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }

struct page; /* forward declaration */

-TESTPAGEFLAG(Locked, locked)
+TESTPAGEFLAG(Locked, locked) TESTSETFLAG(Locked, locked)
PAGEFLAG(Error, error)
PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
diff --git a/mm/readahead.c b/mm/readahead.c
index 9ce303d..6be9275 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -31,6 +31,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);

#define list_to_page(head) (list_entry((head)->prev, struct page, lru))

+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before calling,
+ * such as the NFS fs marking pages that are cached locally on disk, thus we
+ * need to give the fs a chance to clean up in the event of an error
+ */
+static void read_cache_pages_invalidate_page(struct address_space *mapping,
+ struct page *page)
+{
+ if (PagePrivate(page)) {
+ if (!trylock_page(page))
+ BUG();
+ page->mapping = mapping;
+ do_invalidatepage(page, 0);
+ page->mapping = NULL;
+ unlock_page(page);
+ }
+ page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space *mapping,
+ struct list_head *pages)
+{
+ struct page *victim;
+
+ while (!list_empty(pages)) {
+ victim = list_to_page(pages);
+ list_del(&victim->lru);
+ read_cache_pages_invalidate_page(mapping, victim);
+ }
+}
+
/**
* read_cache_pages - populate an address space with some pages & start reads against them
* @mapping: the address_space
@@ -52,14 +87,14 @@ int read_cache_pages(struct address_space *mapping, struct list_head *pages,
list_del(&page->lru);
if (add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL)) {
- page_cache_release(page);
+ read_cache_pages_invalidate_page(mapping, page);
continue;
}
page_cache_release(page);

ret = filler(data, page);
if (unlikely(ret)) {
- put_pages_list(pages);
+ read_cache_pages_invalidate_pages(mapping, pages);
break;
}
task_io_account_read(PAGE_CACHE_SIZE);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nick Piggin
2009-04-02 13:51:09 UTC
Permalink
Post by David Howells
The attached patch causes read_cache_pages() to release page-private data
on a page for which add_to_page_cache() fails or the filler function
fails.
Post by David Howells
This permits pages with caching references associated with them to be
cleaned up.
The invalidatepage() address space op is called (indirectly) to do the honours.
This does not release the page for which the filler function fails.
It releases for all pages that do not get put on the LRU.

I guess that's just a changelog bug.

I don't have a problem with this if it is significantly easier to do
here than the caller -- it should be a slowpath.
Post by David Howells
---
include/linux/page-flags.h | 2 +-
mm/readahead.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 61df177..9d99e74 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -182,7 +182,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
struct page; /* forward declaration */
-TESTPAGEFLAG(Locked, locked)
+TESTPAGEFLAG(Locked, locked) TESTSETFLAG(Locked, locked)
PAGEFLAG(Error, error)
PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty,
dirty) diff --git a/mm/readahead.c b/mm/readahead.c
index 9ce303d..6be9275 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -31,6 +31,41 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);
#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+/*
+ * see if a page needs releasing upon read_cache_pages() failure
+ * - the caller of read_cache_pages() may have set PG_private before
calling, + * such as the NFS fs marking pages that are cached locally on
disk, thus we + * need to give the fs a chance to clean up in the event
of an error + */
+static void read_cache_pages_invalidate_page(struct address_space
*mapping, + struct page
*page)
Post by David Howells
+{
+ if (PagePrivate(page)) {
+ if (!trylock_page(page))
+ BUG();
+ page->mapping = mapping;
+ do_invalidatepage(page, 0);
+ page->mapping = NULL;
+ unlock_page(page);
+ }
+ page_cache_release(page);
+}
+
+/*
+ * release a list of pages, invalidating them first if need be
+ */
+static void read_cache_pages_invalidate_pages(struct address_space
*mapping, + struct
list_head *pages)
Post by David Howells
+{
+ struct page *victim;
+
+ while (!list_empty(pages)) {
+ victim = list_to_page(pages);
+ list_del(&victim->lru);
+ read_cache_pages_invalidate_page(mapping, victim);
+ }
+}
+
/**
* read_cache_pages - populate an address space with some pages & start
@@ -52,14 +87,14 @@ int read_cache_pages(struct address_space *mapping,
struct list_head *pages, list_del(&page->lru);
if (add_to_page_cache_lru(page, mapping,
page->index, GFP_KERNEL))
{
Post by David Howells
- page_cache_release(page);
+ read_cache_pages_invalidate_page(mapping,
page);
Post by David Howells
continue;
}
page_cache_release(page);
ret = filler(data, page);
if (unlikely(ret)) {
David Howells
2009-04-02 14:21:07 UTC
Permalink
Post by Nick Piggin
This does not release the page for which the filler function fails.
That's a good point. That page is attached to the pagecache and the LRU by
that point. I suppose the page could be removed (but someone else may be
trying to reference it by this point), or I could amend the patch text to:

The attached patch causes read_cache_pages() to release page-private
data on a page for which add_to_page_cache() fails. If the filler
function fails, then the problematic page is left attached to the
pagecache (with appropriate flags set, one presumes) and the remaining
to-be-attached pages are invalidated and discarded. This permits
pages with caching references associated with them to be cleaned up.

The invalidatepage() address space op is called (indirectly) to do the
honours.

How about that?
Post by Nick Piggin
I don't have a problem with this if it is significantly easier to do here
than the caller -- it should be a slowpath.
It could be done in the caller - every caller of read_cache_pages() that might
have the pages pre-annotated. But the changes are all in the error handling
path. If read_cache_pages() doesn't see an error, these changes are not
activated.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rik van Riel
2009-04-02 17:17:35 UTC
Permalink
Post by David Howells
The attached patch causes read_cache_pages() to release page-private data on a
page for which add_to_page_cache() fails or the filler function fails. This
permits pages with caching references associated with them to be cleaned up.
The invalidatepage() address space op is called (indirectly) to do the honours.
Acked-by: Rik van Riel <***@redhat.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:02 UTC
Permalink
Add the API for a generic facility (FS-Cache) by which caches may declare them
selves open for business, and may obtain work to be done from network
filesystems. The header file is included by:

#include <linux/fscache-cache.h>

Documentation for the API is also added to:

Documentation/filesystems/caching/backend-api.txt

This API is not usable without the implementation of the utility functions
which will be added in further patches.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/filesystems/caching/backend-api.txt | 664 +++++++++++++++++++++
include/linux/fscache-cache.h | 508 ++++++++++++++++
2 files changed, 1172 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/caching/backend-api.txt
create mode 100644 include/linux/fscache-cache.h


diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
new file mode 100644
index 0000000..1772305
--- /dev/null
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -0,0 +1,664 @@
+ ==========================
+ FS-CACHE CACHE BACKEND API
+ ==========================
+
+The FS-Cache system provides an API by which actual caches can be supplied to
+FS-Cache for it to then serve out to network filesystems and other interested
+parties.
+
+This API is declared in <linux/fscache-cache.h>.
+
+
+====================================
+INITIALISING AND REGISTERING A CACHE
+====================================
+
+To start off, a cache definition must be initialised and registered for each
+cache the backend wants to make available. For instance, CacheFS does this in
+the fill_super() operation on mounting.
+
+The cache definition (struct fscache_cache) should be initialised by calling:
+
+ void fscache_init_cache(struct fscache_cache *cache,
+ struct fscache_cache_ops *ops,
+ const char *idfmt,
+ ...);
+
+Where:
+
+ (*) "cache" is a pointer to the cache definition;
+
+ (*) "ops" is a pointer to the table of operations that the backend supports on
+ this cache; and
+
+ (*) "idfmt" is a format and printf-style arguments for constructing a label
+ for the cache.
+
+
+The cache should then be registered with FS-Cache by passing a pointer to the
+previously initialised cache definition to:
+
+ int fscache_add_cache(struct fscache_cache *cache,
+ struct fscache_object *fsdef,
+ const char *tagname);
+
+Two extra arguments should also be supplied:
+
+ (*) "fsdef" which should point to the object representation for the FS-Cache
+ master index in this cache. Netfs primary index entries will be created
+ here. FS-Cache keeps the caller's reference to the index object if
+ successful and will release it upon withdrawal of the cache.
+
+ (*) "tagname" which, if given, should be a text string naming this cache. If
+ this is NULL, the identifier will be used instead. For CacheFS, the
+ identifier is set to name the underlying block device and the tag can be
+ supplied by mount.
+
+This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
+is already in use. 0 will be returned on success.
+
+
+=====================
+UNREGISTERING A CACHE
+=====================
+
+A cache can be withdrawn from the system by calling this function with a
+pointer to the cache definition:
+
+ void fscache_withdraw_cache(struct fscache_cache *cache);
+
+In CacheFS's case, this is called by put_super().
+
+
+========
+SECURITY
+========
+
+The cache methods are executed one of two contexts:
+
+ (1) that of the userspace process that issued the netfs operation that caused
+ the cache method to be invoked, or
+
+ (2) that of one of the processes in the FS-Cache thread pool.
+
+In either case, this may not be an appropriate context in which to access the
+cache.
+
+The calling process's fsuid, fsgid and SELinux security identities may need to
+be masqueraded for the duration of the cache driver's access to the cache.
+This is left to the cache to handle; FS-Cache makes no effort in this regard.
+
+
+===================================
+CONTROL AND STATISTICS PRESENTATION
+===================================
+
+The cache may present data to the outside world through FS-Cache's interfaces
+in sysfs and procfs - the former for control and the latter for statistics.
+
+A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
+is enabled. This is accessible through the kobject struct fscache_cache::kobj
+and is for use by the cache as it sees fit.
+
+The cache driver may create itself a directory named for the cache type in the
+/proc/fs/fscache/ directory. This is available if CONFIG_FSCACHE_PROC is
+enabled and is accessible through:
+
+ struct proc_dir_entry *proc_fscache;
+
+
+========================
+RELEVANT DATA STRUCTURES
+========================
+
+ (*) Index/Data file FS-Cache representation cookie:
+
+ struct fscache_cookie {
+ struct fscache_object_def *def;
+ struct fscache_netfs *netfs;
+ void *netfs_data;
+ ...
+ };
+
+ The fields that might be of use to the backend describe the object
+ definition, the netfs definition and the netfs's data for this cookie.
+ The object definition contain functions supplied by the netfs for loading
+ and matching index entries; these are required to provide some of the
+ cache operations.
+
+
+ (*) In-cache object representation:
+
+ struct fscache_object {
+ int debug_id;
+ enum {
+ FSCACHE_OBJECT_RECYCLING,
+ ...
+ } state;
+ spinlock_t lock
+ struct fscache_cache *cache;
+ struct fscache_cookie *cookie;
+ ...
+ };
+
+ Structures of this type should be allocated by the cache backend and
+ passed to FS-Cache when requested by the appropriate cache operation. In
+ the case of CacheFS, they're embedded in CacheFS's internal object
+ structures.
+
+ The debug_id is a simple integer that can be used in debugging messages
+ that refer to a particular object. In such a case it should be printed
+ using "OBJ%x" to be consistent with FS-Cache.
+
+ Each object contains a pointer to the cookie that represents the object it
+ is backing. An object should retired when put_object() is called if it is
+ in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be
+ initialised by calling fscache_object_init(object).
+
+
+ (*) FS-Cache operation record:
+
+ struct fscache_operation {
+ atomic_t usage;
+ struct fscache_object *object;
+ unsigned long flags;
+ #define FSCACHE_OP_EXCLUSIVE
+ void (*processor)(struct fscache_operation *op);
+ void (*release)(struct fscache_operation *op);
+ ...
+ };
+
+ FS-Cache has a pool of threads that it uses to give CPU time to the
+ various asynchronous operations that need to be done as part of driving
+ the cache. These are represented by the above structure. The processor
+ method is called to give the op CPU time, and the release method to get
+ rid of it when its usage count reaches 0.
+
+ An operation can be made exclusive upon an object by setting the
+ appropriate flag before enqueuing it with fscache_enqueue_operation(). If
+ an operation needs more processing time, it should be enqueued again.
+
+
+ (*) FS-Cache retrieval operation record:
+
+ struct fscache_retrieval {
+ struct fscache_operation op;
+ struct address_space *mapping;
+ struct list_head *to_do;
+ ...
+ };
+
+ A structure of this type is allocated by FS-Cache to record retrieval and
+ allocation requests made by the netfs. This struct is then passed to the
+ backend to do the operation. The backend may get extra refs to it by
+ calling fscache_get_retrieval() and refs may be discarded by calling
+ fscache_put_retrieval().
+
+ A retrieval operation can be used by the backend to do retrieval work. To
+ do this, the retrieval->op.processor method pointer should be set
+ appropriately by the backend and fscache_enqueue_retrieval() called to
+ submit it to the thread pool. CacheFiles, for example, uses this to queue
+ page examination when it detects PG_lock being cleared.
+
+ The to_do field is an empty list available for the cache backend to use as
+ it sees fit.
+
+
+ (*) FS-Cache storage operation record:
+
+ struct fscache_storage {
+ struct fscache_operation op;
+ pgoff_t store_limit;
+ ...
+ };
+
+ A structure of this type is allocated by FS-Cache to record outstanding
+ writes to be made. FS-Cache itself enqueues this operation and invokes
+ the write_page() method on the object at appropriate times to effect
+ storage.
+
+
+================
+CACHE OPERATIONS
+================
+
+The cache backend provides FS-Cache with a table of operations that can be
+performed on the denizens of the cache. These are held in a structure of type:
+
+ struct fscache_cache_ops
+
+ (*) Name of cache provider [mandatory]:
+
+ const char *name
+
+ This isn't strictly an operation, but should be pointed at a string naming
+ the backend.
+
+
+ (*) Allocate a new object [mandatory]:
+
+ struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
+ struct fscache_cookie *cookie)
+
+ This method is used to allocate a cache object representation to back a
+ cookie in a particular cache. fscache_object_init() should be called on
+ the object to initialise it prior to returning.
+
+ This function may also be used to parse the index key to be used for
+ multiple lookup calls to turn it into a more convenient form. FS-Cache
+ will call the lookup_complete() method to allow the cache to release the
+ form once lookup is complete or aborted.
+
+
+ (*) Look up and create object [mandatory]:
+
+ void (*lookup_object)(struct fscache_object *object)
+
+ This method is used to look up an object, given that the object is already
+ allocated and attached to the cookie. This should instantiate that object
+ in the cache if it can.
+
+ The method should call fscache_object_lookup_negative() as soon as
+ possible if it determines the object doesn't exist in the cache. If the
+ object is found to exist and the netfs indicates that it is valid then
+ fscache_obtained_object() should be called once the object is in a
+ position to have data stored in it. Similarly, fscache_obtained_object()
+ should also be called once a non-present object has been created.
+
+ If a lookup error occurs, fscache_object_lookup_error() should be called
+ to abort the lookup of that object.
+
+
+ (*) Release lookup data [mandatory]:
+
+ void (*lookup_complete)(struct fscache_object *object)
+
+ This method is called to ask the cache to release any resources it was
+ using to perform a lookup.
+
+
+ (*) Increment object refcount [mandatory]:
+
+ struct fscache_object *(*grab_object)(struct fscache_object *object)
+
+ This method is called to increment the reference count on an object. It
+ may fail (for instance if the cache is being withdrawn) by returning NULL.
+ It should return the object pointer if successful.
+
+
+ (*) Lock/Unlock object [mandatory]:
+
+ void (*lock_object)(struct fscache_object *object)
+ void (*unlock_object)(struct fscache_object *object)
+
+ These methods are used to exclusively lock an object. It must be possible
+ to schedule with the lock held, so a spinlock isn't sufficient.
+
+
+ (*) Pin/Unpin object [optional]:
+
+ int (*pin_object)(struct fscache_object *object)
+ void (*unpin_object)(struct fscache_object *object)
+
+ These methods are used to pin an object into the cache. Once pinned an
+ object cannot be reclaimed to make space. Return -ENOSPC if there's not
+ enough space in the cache to permit this.
+
+
+ (*) Update object [mandatory]:
+
+ int (*update_object)(struct fscache_object *object)
+
+ This is called to update the index entry for the specified object. The
+ new information should be in object->cookie->netfs_data. This can be
+ obtained by calling object->cookie->def->get_aux()/get_attr().
+
+
+ (*) Discard object [mandatory]:
+
+ void (*drop_object)(struct fscache_object *object)
+
+ This method is called to indicate that an object has been unbound from its
+ cookie, and that the cache should release the object's resources and
+ retire it if it's in state FSCACHE_OBJECT_RECYCLING.
+
+ This method should not attempt to release any references held by the
+ caller. The caller will invoke the put_object() method as appropriate.
+
+
+ (*) Release object reference [mandatory]:
+
+ void (*put_object)(struct fscache_object *object)
+
+ This method is used to discard a reference to an object. The object may
+ be freed when all the references to it are released.
+
+
+ (*) Synchronise a cache [mandatory]:
+
+ void (*sync)(struct fscache_cache *cache)
+
+ This is called to ask the backend to synchronise a cache with its backing
+ device.
+
+
+ (*) Dissociate a cache [mandatory]:
+
+ void (*dissociate_pages)(struct fscache_cache *cache)
+
+ This is called to ask a cache to perform any page dissociations as part of
+ cache withdrawal.
+
+
+ (*) Notification that the attributes on a netfs file changed [mandatory]:
+
+ int (*attr_changed)(struct fscache_object *object);
+
+ This is called to indicate to the cache that certain attributes on a netfs
+ file have changed (for example the maximum size a file may reach). The
+ cache can read these from the netfs by calling the cookie's get_attr()
+ method.
+
+ The cache may use the file size information to reserve space on the cache.
+ It should also call fscache_set_store_limit() to indicate to FS-Cache the
+ highest byte it's willing to store for an object.
+
+ This method may return -ve if an error occurred or the cache object cannot
+ be expanded. In such a case, the object will be withdrawn from service.
+
+ This operation is run asynchronously from FS-Cache's thread pool, and
+ storage and retrieval operations from the netfs are excluded during the
+ execution of this operation.
+
+
+ (*) Reserve cache space for an object's data [optional]:
+
+ int (*reserve_space)(struct fscache_object *object, loff_t size);
+
+ This is called to request that cache space be reserved to hold the data
+ for an object and the metadata used to track it. Zero size should be
+ taken as request to cancel a reservation.
+
+ This should return 0 if successful, -ENOSPC if there isn't enough space
+ available, or -ENOMEM or -EIO on other errors.
+
+ The reservation may exceed the current size of the object, thus permitting
+ future expansion. If the amount of space consumed by an object would
+ exceed the reservation, it's permitted to refuse requests to allocate
+ pages, but not required. An object may be pruned down to its reservation
+ size if larger than that already.
+
+
+ (*) Request page be read from cache [mandatory]:
+
+ int (*read_or_alloc_page)(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp)
+
+ This is called to attempt to read a netfs page from the cache, or to
+ reserve a backing block if not. FS-Cache will have done as much checking
+ as it can before calling, but most of the work belongs to the backend.
+
+ If there's no page in the cache, then -ENODATA should be returned if the
+ backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
+ didn't.
+
+ If there is suitable data in the cache, then a read operation should be
+ queued and 0 returned. When the read finishes, fscache_end_io() should be
+ called.
+
+ The fscache_mark_pages_cached() should be called for the page if any cache
+ metadata is retained. This will indicate to the netfs that the page needs
+ explicit uncaching. This operation takes a pagevec, thus allowing several
+ pages to be marked at once.
+
+ The retrieval record pointed to by op should be retained for each page
+ queued and released when I/O on the page has been formally ended.
+ fscache_get/put_retrieval() are available for this purpose.
+
+ The retrieval record may be used to get CPU time via the FS-Cache thread
+ pool. If this is desired, the op->op.processor should be set to point to
+ the appropriate processing routine, and fscache_enqueue_retrieval() should
+ be called at an appropriate point to request CPU time. For instance, the
+ retrieval routine could be enqueued upon the completion of a disk read.
+ The to_do field in the retrieval record is provided to aid in this.
+
+ If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
+ returned if possible or fscache_end_io() called with a suitable error
+ code..
+
+
+ (*) Request pages be read from cache [mandatory]:
+
+ int (*read_or_alloc_pages)(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp)
+
+ This is like the read_or_alloc_page() method, except it is handed a list
+ of pages instead of one page. Any pages on which a read operation is
+ started must be added to the page cache for the specified mapping and also
+ to the LRU. Such pages must also be removed from the pages list and
+ *nr_pages decremented per page.
+
+ If there was an error such as -ENOMEM, then that should be returned; else
+ if one or more pages couldn't be read or allocated, then -ENOBUFS should
+ be returned; else if one or more pages couldn't be read, then -ENODATA
+ should be returned. If all the pages are dispatched then 0 should be
+ returned.
+
+
+ (*) Request page be allocated in the cache [mandatory]:
+
+ int (*allocate_page)(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp)
+
+ This is like the read_or_alloc_page() method, except that it shouldn't
+ read from the cache, even if there's data there that could be retrieved.
+ It should, however, set up any internal metadata required such that
+ the write_page() method can write to the cache.
+
+ If there's no backing block available, then -ENOBUFS should be returned
+ (or -ENOMEM if there were other problems). If a block is successfully
+ allocated, then the netfs page should be marked and 0 returned.
+
+
+ (*) Request pages be allocated in the cache [mandatory]:
+
+ int (*allocate_pages)(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp)
+
+ This is an multiple page version of the allocate_page() method. pages and
+ nr_pages should be treated as for the read_or_alloc_pages() method.
+
+
+ (*) Request page be written to cache [mandatory]:
+
+ int (*write_page)(struct fscache_storage *op,
+ struct page *page);
+
+ This is called to write from a page on which there was a previously
+ successful read_or_alloc_page() call or similar. FS-Cache filters out
+ pages that don't have mappings.
+
+ This method is called asynchronously from the FS-Cache thread pool. It is
+ not required to actually store anything, provided -ENODATA is then
+ returned to the next read of this page.
+
+ If an error occurred, then a negative error code should be returned,
+ otherwise zero should be returned. FS-Cache will take appropriate action
+ in response to an error, such as withdrawing this object.
+
+ If this method returns success then FS-Cache will inform the netfs
+ appropriately.
+
+
+ (*) Discard retained per-page metadata [mandatory]:
+
+ void (*uncache_page)(struct fscache_object *object, struct page *page)
+
+ This is called when a netfs page is being evicted from the pagecache. The
+ cache backend should tear down any internal representation or tracking it
+ maintains for this page.
+
+
+==================
+FS-CACHE UTILITIES
+==================
+
+FS-Cache provides some utilities that a cache backend may make use of:
+
+ (*) Note occurrence of an I/O error in a cache:
+
+ void fscache_io_error(struct fscache_cache *cache)
+
+ This tells FS-Cache that an I/O error occurred in the cache. After this
+ has been called, only resource dissociation operations (object and page
+ release) will be passed from the netfs to the cache backend for the
+ specified cache.
+
+ This does not actually withdraw the cache. That must be done separately.
+
+
+ (*) Invoke the retrieval I/O completion function:
+
+ void fscache_end_io(struct fscache_retrieval *op, struct page *page,
+ int error);
+
+ This is called to note the end of an attempt to retrieve a page. The
+ error value should be 0 if successful and an error otherwise.
+
+
+ (*) Set highest store limit:
+
+ void fscache_set_store_limit(struct fscache_object *object,
+ loff_t i_size);
+
+ This sets the limit FS-Cache imposes on the highest byte it's willing to
+ try and store for a netfs. Any page over this limit is automatically
+ rejected by fscache_read_alloc_page() and co with -ENOBUFS.
+
+
+ (*) Mark pages as being cached:
+
+ void fscache_mark_pages_cached(struct fscache_retrieval *op,
+ struct pagevec *pagevec);
+
+ This marks a set of pages as being cached. After this has been called,
+ the netfs must call fscache_uncache_page() to unmark the pages.
+
+
+ (*) Perform coherency check on an object:
+
+ enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
+ const void *data,
+ uint16_t datalen);
+
+ This asks the netfs to perform a coherency check on an object that has
+ just been looked up. The cookie attached to the object will determine the
+ netfs to use. data and datalen should specify where the auxiliary data
+ retrieved from the cache can be found.
+
+ One of three values will be returned:
+
+ (*) FSCACHE_CHECKAUX_OKAY
+
+ The coherency data indicates the object is valid as is.
+
+ (*) FSCACHE_CHECKAUX_NEEDS_UPDATE
+
+ The coherency data needs updating, but otherwise the object is
+ valid.
+
+ (*) FSCACHE_CHECKAUX_OBSOLETE
+
+ The coherency data indicates that the object is obsolete and should
+ be discarded.
+
+
+ (*) Initialise a freshly allocated object:
+
+ void fscache_object_init(struct fscache_object *object);
+
+ This initialises all the fields in an object representation.
+
+
+ (*) Indicate the destruction of an object:
+
+ void fscache_object_destroyed(struct fscache_cache *cache);
+
+ This must be called to inform FS-Cache that an object that belonged to a
+ cache has been destroyed and deallocated. This will allow continuation
+ of the cache withdrawal process when it is stopped pending destruction of
+ all the objects.
+
+
+ (*) Indicate negative lookup on an object:
+
+ void fscache_object_lookup_negative(struct fscache_object *object);
+
+ This is called to indicate to FS-Cache that a lookup process for an object
+ found a negative result.
+
+ This changes the state of an object to permit reads pending on lookup
+ completion to go off and start fetching data from the netfs server as it's
+ known at this point that there can't be any data in the cache.
+
+ This may be called multiple times on an object. Only the first call is
+ significant - all subsequent calls are ignored.
+
+
+ (*) Indicate an object has been obtained:
+
+ void fscache_obtained_object(struct fscache_object *object);
+
+ This is called to indicate to FS-Cache that a lookup process for an object
+ produced a positive result, or that an object was created. This should
+ only be called once for any particular object.
+
+ This changes the state of an object to indicate:
+
+ (1) if no call to fscache_object_lookup_negative() has been made on
+ this object, that there may be data available, and that reads can
+ now go and look for it; and
+
+ (2) that writes may now proceed against this object.
+
+
+ (*) Indicate that object lookup failed:
+
+ void fscache_object_lookup_error(struct fscache_object *object);
+
+ This marks an object as having encountered a fatal error (usually EIO)
+ and causes it to move into a state whereby it will be withdrawn as soon
+ as possible.
+
+
+ (*) Get and release references on a retrieval record:
+
+ void fscache_get_retrieval(struct fscache_retrieval *op);
+ void fscache_put_retrieval(struct fscache_retrieval *op);
+
+ These two functions are used to retain a retrieval record whilst doing
+ asynchronous data retrieval and block allocation.
+
+
+ (*) Enqueue a retrieval record for processing.
+
+ void fscache_enqueue_retrieval(struct fscache_retrieval *op);
+
+ This enqueues a retrieval record for processing by the FS-Cache thread
+ pool. One of the threads in the pool will invoke the retrieval record's
+ op->op.processor callback function. This function may be called from
+ within the callback function.
+
+
+ (*) List of object state names:
+
+ const char *fscache_object_states[];
+
+ For debugging purposes, this may be used to turn the state that an object
+ is in into a text string for display purposes.
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
new file mode 100644
index 0000000..a31f4c1
--- /dev/null
+++ b/include/linux/fscache-cache.h
@@ -0,0 +1,508 @@
+/* General filesystem caching backing cache interface
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * NOTE!!! See:
+ *
+ * Documentation/filesystems/caching/backend-api.txt
+ *
+ * for a description of the cache backend interface declared here.
+ */
+
+#ifndef _LINUX_FSCACHE_CACHE_H
+#define _LINUX_FSCACHE_CACHE_H
+
+#include <linux/fscache.h>
+#include <linux/sched.h>
+#include <linux/slow-work.h>
+
+#define NR_MAXCACHES BITS_PER_LONG
+
+struct fscache_cache;
+struct fscache_cache_ops;
+struct fscache_object;
+struct fscache_operation;
+
+#ifdef CONFIG_FSCACHE_PROC
+extern struct proc_dir_entry *proc_fscache;
+#endif
+
+/*
+ * cache tag definition
+ */
+struct fscache_cache_tag {
+ struct list_head link;
+ struct fscache_cache *cache; /* cache referred to by this tag */
+ unsigned long flags;
+#define FSCACHE_TAG_RESERVED 0 /* T if tag is reserved for a cache */
+ atomic_t usage;
+ char name[0]; /* tag name */
+};
+
+/*
+ * cache definition
+ */
+struct fscache_cache {
+ const struct fscache_cache_ops *ops;
+ struct fscache_cache_tag *tag; /* tag representing this cache */
+ struct kobject *kobj; /* system representation of this cache */
+ struct list_head link; /* link in list of caches */
+ size_t max_index_size; /* maximum size of index data */
+ char identifier[36]; /* cache label */
+
+ /* node management */
+ struct work_struct op_gc; /* operation garbage collector */
+ struct list_head object_list; /* list of data/index objects */
+ struct list_head op_gc_list; /* list of ops to be deleted */
+ spinlock_t object_list_lock;
+ spinlock_t op_gc_list_lock;
+ atomic_t object_count; /* no. of live objects in this cache */
+ struct fscache_object *fsdef; /* object for the fsdef index */
+ unsigned long flags;
+#define FSCACHE_IOERROR 0 /* cache stopped on I/O error */
+#define FSCACHE_CACHE_WITHDRAWN 1 /* cache has been withdrawn */
+};
+
+extern wait_queue_head_t fscache_cache_cleared_wq;
+
+/*
+ * operation to be applied to a cache object
+ * - retrieval initiation operations are done in the context of the process
+ * that issued them, and not in an async thread pool
+ */
+typedef void (*fscache_operation_release_t)(struct fscache_operation *op);
+typedef void (*fscache_operation_processor_t)(struct fscache_operation *op);
+
+struct fscache_operation {
+ union {
+ struct work_struct fast_work; /* record for fast ops */
+ struct slow_work slow_work; /* record for (very) slow ops */
+ };
+ struct list_head pend_link; /* link in object->pending_ops */
+ struct fscache_object *object; /* object to be operated upon */
+
+ unsigned long flags;
+#define FSCACHE_OP_TYPE 0x000f /* operation type */
+#define FSCACHE_OP_FAST 0x0001 /* - fast op, processor may not sleep for disk */
+#define FSCACHE_OP_SLOW 0x0002 /* - (very) slow op, processor may sleep for disk */
+#define FSCACHE_OP_MYTHREAD 0x0003 /* - processing is done be issuing thread, not pool */
+#define FSCACHE_OP_WAITING 4 /* cleared when op is woken */
+#define FSCACHE_OP_EXCLUSIVE 5 /* exclusive op, other ops must wait */
+#define FSCACHE_OP_DEAD 6 /* op is now dead */
+
+ atomic_t usage;
+ unsigned debug_id; /* debugging ID */
+
+ /* operation processor callback
+ * - can be NULL if FSCACHE_OP_WAITING is going to be used to perform
+ * the op in a non-pool thread */
+ fscache_operation_processor_t processor;
+
+ /* operation releaser */
+ fscache_operation_release_t release;
+};
+
+extern atomic_t fscache_op_debug_id;
+extern const struct slow_work_ops fscache_op_slow_work_ops;
+
+extern void fscache_enqueue_operation(struct fscache_operation *);
+extern void fscache_put_operation(struct fscache_operation *);
+
+/**
+ * fscache_operation_init - Do basic initialisation of an operation
+ * @op: The operation to initialise
+ * @release: The release function to assign
+ *
+ * Do basic initialisation of an operation. The caller must still set flags,
+ * object, either fast_work or slow_work if necessary, and processor if needed.
+ */
+static inline void fscache_operation_init(struct fscache_operation *op,
+ fscache_operation_release_t release)
+{
+ atomic_set(&op->usage, 1);
+ op->debug_id = atomic_inc_return(&fscache_op_debug_id);
+ op->release = release;
+ INIT_LIST_HEAD(&op->pend_link);
+}
+
+/**
+ * fscache_operation_init_slow - Do additional initialisation of a slow op
+ * @op: The operation to initialise
+ * @processor: The processor function to assign
+ *
+ * Do additional initialisation of an operation as required for slow work.
+ */
+static inline
+void fscache_operation_init_slow(struct fscache_operation *op,
+ fscache_operation_processor_t processor)
+{
+ op->processor = processor;
+ slow_work_init(&op->slow_work, &fscache_op_slow_work_ops);
+}
+
+/*
+ * data read operation
+ */
+struct fscache_retrieval {
+ struct fscache_operation op;
+ struct address_space *mapping; /* netfs pages */
+ fscache_rw_complete_t end_io_func; /* function to call on I/O completion */
+ void *context; /* netfs read context (pinned) */
+ struct list_head to_do; /* list of things to be done by the backend */
+ unsigned long start_time; /* time at which retrieval started */
+};
+
+typedef int (*fscache_page_retrieval_func_t)(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp);
+
+typedef int (*fscache_pages_retrieval_func_t)(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp);
+
+/**
+ * fscache_get_retrieval - Get an extra reference on a retrieval operation
+ * @op: The retrieval operation to get a reference on
+ *
+ * Get an extra reference on a retrieval operation.
+ */
+static inline
+struct fscache_retrieval *fscache_get_retrieval(struct fscache_retrieval *op)
+{
+ atomic_inc(&op->op.usage);
+ return op;
+}
+
+/**
+ * fscache_enqueue_retrieval - Enqueue a retrieval operation for processing
+ * @op: The retrieval operation affected
+ *
+ * Enqueue a retrieval operation for processing by the FS-Cache thread pool.
+ */
+static inline void fscache_enqueue_retrieval(struct fscache_retrieval *op)
+{
+ fscache_enqueue_operation(&op->op);
+}
+
+/**
+ * fscache_put_retrieval - Drop a reference to a retrieval operation
+ * @op: The retrieval operation affected
+ *
+ * Drop a reference to a retrieval operation.
+ */
+static inline void fscache_put_retrieval(struct fscache_retrieval *op)
+{
+ fscache_put_operation(&op->op);
+}
+
+/*
+ * cached page storage work item
+ * - used to do three things:
+ * - batch writes to the cache
+ * - do cache writes asynchronously
+ * - defer writes until cache object lookup completion
+ */
+struct fscache_storage {
+ struct fscache_operation op;
+ pgoff_t store_limit; /* don't write more than this */
+};
+
+/*
+ * cache operations
+ */
+struct fscache_cache_ops {
+ /* name of cache provider */
+ const char *name;
+
+ /* allocate an object record for a cookie */
+ struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
+ struct fscache_cookie *cookie);
+
+ /* look up the object for a cookie */
+ void (*lookup_object)(struct fscache_object *object);
+
+ /* finished looking up */
+ void (*lookup_complete)(struct fscache_object *object);
+
+ /* increment the usage count on this object (may fail if unmounting) */
+ struct fscache_object *(*grab_object)(struct fscache_object *object);
+
+ /* pin an object in the cache */
+ int (*pin_object)(struct fscache_object *object);
+
+ /* unpin an object in the cache */
+ void (*unpin_object)(struct fscache_object *object);
+
+ /* store the updated auxilliary data on an object */
+ void (*update_object)(struct fscache_object *object);
+
+ /* discard the resources pinned by an object and effect retirement if
+ * necessary */
+ void (*drop_object)(struct fscache_object *object);
+
+ /* dispose of a reference to an object */
+ void (*put_object)(struct fscache_object *object);
+
+ /* sync a cache */
+ void (*sync_cache)(struct fscache_cache *cache);
+
+ /* notification that the attributes of a non-index object (such as
+ * i_size) have changed */
+ int (*attr_changed)(struct fscache_object *object);
+
+ /* reserve space for an object's data and associated metadata */
+ int (*reserve_space)(struct fscache_object *object, loff_t i_size);
+
+ /* request a backing block for a page be read or allocated in the
+ * cache */
+ fscache_page_retrieval_func_t read_or_alloc_page;
+
+ /* request backing blocks for a list of pages be read or allocated in
+ * the cache */
+ fscache_pages_retrieval_func_t read_or_alloc_pages;
+
+ /* request a backing block for a page be allocated in the cache so that
+ * it can be written directly */
+ fscache_page_retrieval_func_t allocate_page;
+
+ /* request backing blocks for pages be allocated in the cache so that
+ * they can be written directly */
+ fscache_pages_retrieval_func_t allocate_pages;
+
+ /* write a page to its backing block in the cache */
+ int (*write_page)(struct fscache_storage *op, struct page *page);
+
+ /* detach backing block from a page (optional)
+ * - must release the cookie lock before returning
+ * - may sleep
+ */
+ void (*uncache_page)(struct fscache_object *object,
+ struct page *page);
+
+ /* dissociate a cache from all the pages it was backing */
+ void (*dissociate_pages)(struct fscache_cache *cache);
+};
+
+/*
+ * data file or index object cookie
+ * - a file will only appear in one cache
+ * - a request to cache a file may or may not be honoured, subject to
+ * constraints such as disk space
+ * - indices are created on disk just-in-time
+ */
+struct fscache_cookie {
+ atomic_t usage; /* number of users of this cookie */
+ atomic_t n_children; /* number of children of this cookie */
+ spinlock_t lock;
+ struct hlist_head backing_objects; /* object(s) backing this file/index */
+ const struct fscache_cookie_def *def; /* definition */
+ struct fscache_cookie *parent; /* parent of this entry */
+ void *netfs_data; /* back pointer to netfs */
+ unsigned long flags;
+#define FSCACHE_COOKIE_LOOKING_UP 0 /* T if non-index cookie being looked up still */
+#define FSCACHE_COOKIE_CREATING 1 /* T if non-index object being created still */
+#define FSCACHE_COOKIE_NO_DATA_YET 2 /* T if new object with no cached data yet */
+#define FSCACHE_COOKIE_PENDING_FILL 3 /* T if pending initial fill on object */
+#define FSCACHE_COOKIE_FILLING 4 /* T if filling object incrementally */
+#define FSCACHE_COOKIE_UNAVAILABLE 5 /* T if cookie is unavailable (error, etc) */
+};
+
+extern struct fscache_cookie fscache_fsdef_index;
+
+/*
+ * on-disk cache file or index handle
+ */
+struct fscache_object {
+ enum fscache_object_state {
+ FSCACHE_OBJECT_INIT, /* object in initial unbound state */
+ FSCACHE_OBJECT_LOOKING_UP, /* looking up object */
+ FSCACHE_OBJECT_CREATING, /* creating object */
+
+ /* active states */
+ FSCACHE_OBJECT_AVAILABLE, /* cleaning up object after creation */
+ FSCACHE_OBJECT_ACTIVE, /* object is usable */
+ FSCACHE_OBJECT_UPDATING, /* object is updating */
+
+ /* terminal states */
+ FSCACHE_OBJECT_DYING, /* object waiting for accessors to finish */
+ FSCACHE_OBJECT_LC_DYING, /* object cleaning up after lookup/create */
+ FSCACHE_OBJECT_ABORT_INIT, /* abort the init state */
+ FSCACHE_OBJECT_RELEASING, /* releasing object */
+ FSCACHE_OBJECT_RECYCLING, /* retiring object */
+ FSCACHE_OBJECT_WITHDRAWING, /* withdrawing object */
+ FSCACHE_OBJECT_DEAD, /* object is now dead */
+ } state;
+
+ int debug_id; /* debugging ID */
+ int n_children; /* number of child objects */
+ int n_ops; /* number of ops outstanding on object */
+ int n_obj_ops; /* number of object ops outstanding on object */
+ int n_in_progress; /* number of ops in progress */
+ int n_exclusive; /* number of exclusive ops queued */
+ spinlock_t lock; /* state and operations lock */
+
+ unsigned long lookup_jif; /* time at which lookup started */
+ unsigned long event_mask; /* events this object is interested in */
+ unsigned long events; /* events to be processed by this object
+ * (order is important - using fls) */
+#define FSCACHE_OBJECT_EV_REQUEUE 0 /* T if object should be requeued */
+#define FSCACHE_OBJECT_EV_UPDATE 1 /* T if object should be updated */
+#define FSCACHE_OBJECT_EV_CLEARED 2 /* T if accessors all gone */
+#define FSCACHE_OBJECT_EV_ERROR 3 /* T if fatal error occurred during processing */
+#define FSCACHE_OBJECT_EV_RELEASE 4 /* T if netfs requested object release */
+#define FSCACHE_OBJECT_EV_RETIRE 5 /* T if netfs requested object retirement */
+#define FSCACHE_OBJECT_EV_WITHDRAW 6 /* T if cache requested object withdrawal */
+
+ unsigned long flags;
+#define FSCACHE_OBJECT_LOCK 0 /* T if object is busy being processed */
+#define FSCACHE_OBJECT_PENDING_WRITE 1 /* T if object has pending write */
+#define FSCACHE_OBJECT_WAITING 2 /* T if object is waiting on its parent */
+
+ struct list_head cache_link; /* link in cache->object_list */
+ struct hlist_node cookie_link; /* link in cookie->backing_objects */
+ struct fscache_cache *cache; /* cache that supplied this object */
+ struct fscache_cookie *cookie; /* netfs's file/index object */
+ struct fscache_object *parent; /* parent object */
+ struct slow_work work; /* attention scheduling record */
+ struct list_head dependents; /* FIFO of dependent objects */
+ struct list_head dep_link; /* link in parent's dependents list */
+ struct list_head pending_ops; /* unstarted operations on this object */
+ struct radix_tree_root stores; /* data to be stored */
+ pgoff_t store_limit; /* current storage limit */
+};
+
+extern const char *fscache_object_states[];
+
+#define fscache_object_is_active(obj) \
+ (!test_bit(FSCACHE_IOERROR, &(obj)->cache->flags) && \
+ (obj)->state >= FSCACHE_OBJECT_AVAILABLE && \
+ (obj)->state < FSCACHE_OBJECT_DYING)
+
+extern const struct slow_work_ops fscache_object_slow_work_ops;
+
+/**
+ * fscache_object_init - Initialise a cache object description
+ * @object: Object description
+ *
+ * Initialise a cache object description to its basic values.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_object_init(struct fscache_object *object,
+ struct fscache_cookie *cookie,
+ struct fscache_cache *cache)
+{
+ atomic_inc(&cache->object_count);
+
+ object->state = FSCACHE_OBJECT_INIT;
+ spin_lock_init(&object->lock);
+ INIT_LIST_HEAD(&object->cache_link);
+ INIT_HLIST_NODE(&object->cookie_link);
+ vslow_work_init(&object->work, &fscache_object_slow_work_ops);
+ INIT_LIST_HEAD(&object->dependents);
+ INIT_LIST_HEAD(&object->dep_link);
+ INIT_LIST_HEAD(&object->pending_ops);
+ INIT_RADIX_TREE(&object->stores, GFP_NOFS);
+ object->n_children = 0;
+ object->n_ops = object->n_in_progress = object->n_exclusive = 0;
+ object->events = object->event_mask = 0;
+ object->flags = 0;
+ object->store_limit = 0;
+ object->cache = cache;
+ object->cookie = cookie;
+ object->parent = NULL;
+}
+
+extern void fscache_object_lookup_negative(struct fscache_object *object);
+extern void fscache_obtained_object(struct fscache_object *object);
+
+/**
+ * fscache_object_destroyed - Note destruction of an object in a cache
+ * @cache: The cache from which the object came
+ *
+ * Note the destruction and deallocation of an object record in a cache.
+ */
+static inline void fscache_object_destroyed(struct fscache_cache *cache)
+{
+ if (atomic_dec_and_test(&cache->object_count))
+ wake_up_all(&fscache_cache_cleared_wq);
+}
+
+/**
+ * fscache_object_lookup_error - Note an object encountered an error
+ * @object: The object on which the error was encountered
+ *
+ * Note that an object encountered a fatal error (usually an I/O error) and
+ * that it should be withdrawn as soon as possible.
+ */
+static inline void fscache_object_lookup_error(struct fscache_object *object)
+{
+ set_bit(FSCACHE_OBJECT_EV_ERROR, &object->events);
+}
+
+/**
+ * fscache_set_store_limit - Set the maximum size to be stored in an object
+ * @object: The object to set the maximum on
+ * @i_size: The limit to set in bytes
+ *
+ * Set the maximum size an object is permitted to reach, implying the highest
+ * byte that may be written. Intended to be called by the attr_changed() op.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_set_store_limit(struct fscache_object *object, loff_t i_size)
+{
+ object->store_limit = i_size >> PAGE_SHIFT;
+ if (i_size & ~PAGE_MASK)
+ object->store_limit++;
+}
+
+/**
+ * fscache_end_io - End a retrieval operation on a page
+ * @op: The FS-Cache operation covering the retrieval
+ * @page: The page that was to be fetched
+ * @error: The error code (0 if successful)
+ *
+ * Note the end of an operation to retrieve a page, as covered by a particular
+ * operation record.
+ */
+static inline void fscache_end_io(struct fscache_retrieval *op,
+ struct page *page, int error)
+{
+ op->end_io_func(page, op->context, error);
+}
+
+/*
+ * out-of-line cache backend functions
+ */
+extern void fscache_init_cache(struct fscache_cache *cache,
+ const struct fscache_cache_ops *ops,
+ const char *idfmt,
+ ...) __attribute__ ((format (printf, 3, 4)));
+
+extern int fscache_add_cache(struct fscache_cache *cache,
+ struct fscache_object *fsdef,
+ const char *tagname);
+extern void fscache_withdraw_cache(struct fscache_cache *cache);
+
+extern void fscache_io_error(struct fscache_cache *cache);
+
+extern void fscache_mark_pages_cached(struct fscache_retrieval *op,
+ struct pagevec *pagevec);
+
+extern enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
+ const void *data,
+ uint16_t datalen);
+
+#endif /* _LINUX_FSCACHE_CACHE_H */

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:13 UTC
Permalink
Make FS-Cache create its /proc interface and present various statistical
information through it. Also provide the functions for updating this
information.

These features are enabled by:

CONFIG_FSCACHE_PROC
CONFIG_FSCACHE_STATS
CONFIG_FSCACHE_HISTOGRAM

The /proc directory for FS-Cache is also exported so that caching modules can
add their own statistics there too.

The FS-Cache module is loadable at this point, and the statistics files can be
examined by userspace:

cat /proc/fs/fscache/stats
cat /proc/fs/fscache/histogram

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/filesystems/caching/backend-api.txt | 6 -
Documentation/filesystems/caching/fscache.txt | 12 -
fs/fscache/Kconfig | 34 +++
fs/fscache/Makefile | 4
fs/fscache/histogram.c | 109 +++++++++++
fs/fscache/internal.h | 127 +++++++++++++
fs/fscache/main.c | 7 +
fs/fscache/proc.c | 68 +++++++
fs/fscache/stats.c | 212 +++++++++++++++++++++
include/linux/fscache-cache.h | 4
10 files changed, 566 insertions(+), 17 deletions(-)
create mode 100644 fs/fscache/histogram.c
create mode 100644 fs/fscache/proc.c
create mode 100644 fs/fscache/stats.c


diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt
index 1772305..382d52c 100644
--- a/Documentation/filesystems/caching/backend-api.txt
+++ b/Documentation/filesystems/caching/backend-api.txt
@@ -100,12 +100,6 @@ A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
is enabled. This is accessible through the kobject struct fscache_cache::kobj
and is for use by the cache as it sees fit.

-The cache driver may create itself a directory named for the cache type in the
-/proc/fs/fscache/ directory. This is available if CONFIG_FSCACHE_PROC is
-enabled and is accessible through:
-
- struct proc_dir_entry *proc_fscache;
-

========================
RELEVANT DATA STRUCTURES
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index a759d91..0a751f3 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -195,7 +195,6 @@ STATISTICAL INFORMATION

If FS-Cache is compiled with the following options enabled:

- CONFIG_FSCACHE_PROC=y (implied by the following two)
CONFIG_FSCACHE_STATS=y
CONFIG_FSCACHE_HISTOGRAM=y

@@ -275,7 +274,7 @@ proc files.
(*) /proc/fs/fscache/histogram

cat /proc/fs/fscache/histogram
- +HZ +TIME OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
+ JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
===== ===== ========= ========= ========= ========= =========

This shows the breakdown of the number of times each amount of time
@@ -291,16 +290,16 @@ proc files.
RETRIEVLS Time between beginning and end of a retrieval

Each row shows the number of events that took a particular range of times.
- Each step is 1 jiffy in size. The +HZ column indicates the particular
- jiffy range covered, and the +TIME field the equivalent number of seconds.
+ Each step is 1 jiffy in size. The JIFS column indicates the particular
+ jiffy range covered, and the SECS field the equivalent number of seconds.


=========
DEBUGGING
=========

-The FS-Cache facility can have runtime debugging enabled by adjusting the value
-in:
+If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
+debugging enabled by adjusting the value in:

/sys/module/fscache/parameters/debug

@@ -327,4 +326,3 @@ the control file. For example:
echo $((1|8|64)) >/sys/module/fscache/parameters/debug

will turn on all function entry debugging.
-
diff --git a/fs/fscache/Kconfig b/fs/fscache/Kconfig
index 7c7bccd..9bbb8ce 100644
--- a/fs/fscache/Kconfig
+++ b/fs/fscache/Kconfig
@@ -11,6 +11,40 @@ config FSCACHE

See Documentation/filesystems/caching/fscache.txt for more information.

+config FSCACHE_STATS
+ bool "Gather statistical information on local caching"
+ depends on FSCACHE && PROC_FS
+ help
+ This option causes statistical information to be gathered on local
+ caching and exported through file:
+
+ /proc/fs/fscache/stats
+
+ The gathering of statistics adds a certain amount of overhead to
+ execution as there are a quite a few stats gathered, and on a
+ multi-CPU system these may be on cachelines that keep bouncing
+ between CPUs. On the other hand, the stats are very useful for
+ debugging purposes. Saying 'Y' here is recommended.
+
+ See Documentation/filesystems/caching/fscache.txt for more information.
+
+config FSCACHE_HISTOGRAM
+ bool "Gather latency information on local caching"
+ depends on FSCACHE && PROC_FS
+ help
+ This option causes latency information to be gathered on local
+ caching and exported through file:
+
+ /proc/fs/fscache/histogram
+
+ The generation of this histogram adds a certain amount of overhead to
+ execution as there are a number of points at which data is gathered,
+ and on a multi-CPU system these may be on cachelines that keep
+ bouncing between CPUs. On the other hand, the histogram may be
+ useful for debugging purposes. Saying 'N' here is recommended.
+
+ See Documentation/filesystems/caching/fscache.txt for more information.
+
config FSCACHE_DEBUG
bool "Debug FS-Cache"
depends on FSCACHE
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index f8038b8..1384823 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -5,4 +5,8 @@
fscache-y := \
main.o

+fscache-$(CONFIG_PROC_FS) += proc.o
+fscache-$(CONFIG_FSCACHE_STATS) += stats.o
+fscache-$(CONFIG_FSCACHE_HISTOGRAM) += histogram.o
+
obj-$(CONFIG_FSCACHE) := fscache.o
diff --git a/fs/fscache/histogram.c b/fs/fscache/histogram.c
new file mode 100644
index 0000000..bad4967
--- /dev/null
+++ b/fs/fscache/histogram.c
@@ -0,0 +1,109 @@
+/* FS-Cache latency histogram
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL THREAD
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "internal.h"
+
+atomic_t fscache_obj_instantiate_histogram[HZ];
+atomic_t fscache_objs_histogram[HZ];
+atomic_t fscache_ops_histogram[HZ];
+atomic_t fscache_retrieval_delay_histogram[HZ];
+atomic_t fscache_retrieval_histogram[HZ];
+
+/*
+ * display the time-taken histogram
+ */
+static int fscache_histogram_show(struct seq_file *m, void *v)
+{
+ unsigned long index;
+ unsigned n[5], t;
+
+ switch ((unsigned long) v) {
+ case 1:
+ seq_puts(m, "JIFS SECS OBJ INST OP RUNS OBJ RUNS "
+ " RETRV DLY RETRIEVLS\n");
+ return 0;
+ case 2:
+ seq_puts(m, "===== ===== ========= ========= ========="
+ " ========= =========\n");
+ return 0;
+ default:
+ index = (unsigned long) v - 3;
+ n[0] = atomic_read(&fscache_obj_instantiate_histogram[index]);
+ n[1] = atomic_read(&fscache_ops_histogram[index]);
+ n[2] = atomic_read(&fscache_objs_histogram[index]);
+ n[3] = atomic_read(&fscache_retrieval_delay_histogram[index]);
+ n[4] = atomic_read(&fscache_retrieval_histogram[index]);
+ if (!(n[0] | n[1] | n[2] | n[3] | n[4]))
+ return 0;
+
+ t = (index * 1000) / HZ;
+
+ seq_printf(m, "%4lu 0.%03u %9u %9u %9u %9u %9u\n",
+ index, t, n[0], n[1], n[2], n[3], n[4]);
+ return 0;
+ }
+}
+
+/*
+ * set up the iterator to start reading from the first line
+ */
+static void *fscache_histogram_start(struct seq_file *m, loff_t *_pos)
+{
+ if ((unsigned long long)*_pos >= HZ + 2)
+ return NULL;
+ if (*_pos == 0)
+ *_pos = 1;
+ return (void *)(unsigned long) *_pos;
+}
+
+/*
+ * move to the next line
+ */
+static void *fscache_histogram_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ (*pos)++;
+ return (unsigned long long)*pos > HZ + 2 ?
+ NULL : (void *)(unsigned long) *pos;
+}
+
+/*
+ * clean up after reading
+ */
+static void fscache_histogram_stop(struct seq_file *m, void *v)
+{
+}
+
+static const struct seq_operations fscache_histogram_ops = {
+ .start = fscache_histogram_start,
+ .stop = fscache_histogram_stop,
+ .next = fscache_histogram_next,
+ .show = fscache_histogram_show,
+};
+
+/*
+ * open "/proc/fs/fscache/histogram" to provide latency data
+ */
+static int fscache_histogram_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &fscache_histogram_ops);
+}
+
+const struct file_operations fscache_histogram_fops = {
+ .owner = THIS_MODULE,
+ .open = fscache_histogram_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 95dc92d..16f9f1f 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -28,6 +28,30 @@
#define FSCACHE_MAX_THREADS 32

/*
+ * fsc-histogram.c
+ */
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+extern atomic_t fscache_obj_instantiate_histogram[HZ];
+extern atomic_t fscache_objs_histogram[HZ];
+extern atomic_t fscache_ops_histogram[HZ];
+extern atomic_t fscache_retrieval_delay_histogram[HZ];
+extern atomic_t fscache_retrieval_histogram[HZ];
+
+static inline void fscache_hist(atomic_t histogram[], unsigned long start_jif)
+{
+ unsigned long jif = jiffies - start_jif;
+ if (jif >= HZ)
+ jif = HZ - 1;
+ atomic_inc(&histogram[jif]);
+}
+
+extern const struct file_operations fscache_histogram_fops;
+
+#else
+#define fscache_hist(hist, start_jif) do {} while (0)
+#endif
+
+/*
* fsc-main.c
*/
extern unsigned fscache_defer_lookup;
@@ -35,6 +59,109 @@ extern unsigned fscache_defer_create;
extern unsigned fscache_debug;
extern struct kobject *fscache_root;

+/*
+ * fsc-proc.c
+ */
+#ifdef CONFIG_PROC_FS
+extern int __init fscache_proc_init(void);
+extern void fscache_proc_cleanup(void);
+#else
+#define fscache_proc_init() (0)
+#define fscache_proc_cleanup() do {} while (0)
+#endif
+
+/*
+ * fsc-stats.c
+ */
+#ifdef CONFIG_FSCACHE_STATS
+extern atomic_t fscache_n_ops_processed[FSCACHE_MAX_THREADS];
+extern atomic_t fscache_n_objs_processed[FSCACHE_MAX_THREADS];
+
+extern atomic_t fscache_n_op_pend;
+extern atomic_t fscache_n_op_run;
+extern atomic_t fscache_n_op_enqueue;
+extern atomic_t fscache_n_op_deferred_release;
+extern atomic_t fscache_n_op_release;
+extern atomic_t fscache_n_op_gc;
+
+extern atomic_t fscache_n_attr_changed;
+extern atomic_t fscache_n_attr_changed_ok;
+extern atomic_t fscache_n_attr_changed_nobufs;
+extern atomic_t fscache_n_attr_changed_nomem;
+extern atomic_t fscache_n_attr_changed_calls;
+
+extern atomic_t fscache_n_allocs;
+extern atomic_t fscache_n_allocs_ok;
+extern atomic_t fscache_n_allocs_wait;
+extern atomic_t fscache_n_allocs_nobufs;
+extern atomic_t fscache_n_alloc_ops;
+extern atomic_t fscache_n_alloc_op_waits;
+
+extern atomic_t fscache_n_retrievals;
+extern atomic_t fscache_n_retrievals_ok;
+extern atomic_t fscache_n_retrievals_wait;
+extern atomic_t fscache_n_retrievals_nodata;
+extern atomic_t fscache_n_retrievals_nobufs;
+extern atomic_t fscache_n_retrievals_intr;
+extern atomic_t fscache_n_retrievals_nomem;
+extern atomic_t fscache_n_retrieval_ops;
+extern atomic_t fscache_n_retrieval_op_waits;
+
+extern atomic_t fscache_n_stores;
+extern atomic_t fscache_n_stores_ok;
+extern atomic_t fscache_n_stores_again;
+extern atomic_t fscache_n_stores_nobufs;
+extern atomic_t fscache_n_stores_oom;
+extern atomic_t fscache_n_store_ops;
+extern atomic_t fscache_n_store_calls;
+
+extern atomic_t fscache_n_marks;
+extern atomic_t fscache_n_uncaches;
+
+extern atomic_t fscache_n_acquires;
+extern atomic_t fscache_n_acquires_null;
+extern atomic_t fscache_n_acquires_no_cache;
+extern atomic_t fscache_n_acquires_ok;
+extern atomic_t fscache_n_acquires_nobufs;
+extern atomic_t fscache_n_acquires_oom;
+
+extern atomic_t fscache_n_updates;
+extern atomic_t fscache_n_updates_null;
+extern atomic_t fscache_n_updates_run;
+
+extern atomic_t fscache_n_relinquishes;
+extern atomic_t fscache_n_relinquishes_null;
+extern atomic_t fscache_n_relinquishes_waitcrt;
+
+extern atomic_t fscache_n_cookie_index;
+extern atomic_t fscache_n_cookie_data;
+extern atomic_t fscache_n_cookie_special;
+
+extern atomic_t fscache_n_object_alloc;
+extern atomic_t fscache_n_object_no_alloc;
+extern atomic_t fscache_n_object_lookups;
+extern atomic_t fscache_n_object_lookups_negative;
+extern atomic_t fscache_n_object_lookups_positive;
+extern atomic_t fscache_n_object_created;
+extern atomic_t fscache_n_object_avail;
+extern atomic_t fscache_n_object_dead;
+
+extern atomic_t fscache_n_checkaux_none;
+extern atomic_t fscache_n_checkaux_okay;
+extern atomic_t fscache_n_checkaux_update;
+extern atomic_t fscache_n_checkaux_obsolete;
+
+static inline void fscache_stat(atomic_t *stat)
+{
+ atomic_inc(stat);
+}
+
+extern const struct file_operations fscache_stats_fops;
+#else
+
+#define fscache_stat(stat) do {} while (0)
+#endif
+
/*****************************************************************************/
/*
* debug tracing
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 76f7c69..7c734b7 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -52,9 +52,15 @@ static int __init fscache_init(void)
if (ret < 0)
goto error_slow_work;

+ ret = fscache_proc_init();
+ if (ret < 0)
+ goto error_proc;
+
printk(KERN_NOTICE "FS-Cache: Loaded\n");
return 0;

+error_proc:
+ slow_work_unregister_user();
error_slow_work:
return ret;
}
@@ -68,6 +74,7 @@ static void __exit fscache_exit(void)
{
_enter("");

+ fscache_proc_cleanup();
slow_work_unregister_user();
printk(KERN_NOTICE "FS-Cache: Unloaded\n");
}
diff --git a/fs/fscache/proc.c b/fs/fscache/proc.c
new file mode 100644
index 0000000..beeab44
--- /dev/null
+++ b/fs/fscache/proc.c
@@ -0,0 +1,68 @@
+/* FS-Cache statistics viewing interface
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL OPERATION
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "internal.h"
+
+/*
+ * initialise the /proc/fs/fscache/ directory
+ */
+int __init fscache_proc_init(void)
+{
+ _enter("");
+
+ if (!proc_mkdir("fs/fscache", NULL))
+ goto error_dir;
+
+#ifdef CONFIG_FSCACHE_STATS
+ if (!proc_create("fs/fscache/stats", S_IFREG | 0444, NULL,
+ &fscache_stats_fops))
+ goto error_stats;
+#endif
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+ if (!proc_create("fs/fscache/histogram", S_IFREG | 0444, NULL,
+ &fscache_histogram_fops))
+ goto error_histogram;
+#endif
+
+ _leave(" = 0");
+ return 0;
+
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+error_histogram:
+#endif
+#ifdef CONFIG_FSCACHE_STATS
+ remove_proc_entry("fs/fscache/stats", NULL);
+error_stats:
+#endif
+ remove_proc_entry("fs/fscache", NULL);
+error_dir:
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+}
+
+/*
+ * clean up the /proc/fs/fscache/ directory
+ */
+void fscache_proc_cleanup(void)
+{
+#ifdef CONFIG_FSCACHE_HISTOGRAM
+ remove_proc_entry("fs/fscache/histogram", NULL);
+#endif
+#ifdef CONFIG_FSCACHE_STATS
+ remove_proc_entry("fs/fscache/stats", NULL);
+#endif
+ remove_proc_entry("fs/fscache", NULL);
+}
diff --git a/fs/fscache/stats.c b/fs/fscache/stats.c
new file mode 100644
index 0000000..65deb99
--- /dev/null
+++ b/fs/fscache/stats.c
@@ -0,0 +1,212 @@
+/* FS-Cache statistics
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL THREAD
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "internal.h"
+
+/*
+ * operation counters
+ */
+atomic_t fscache_n_op_pend;
+atomic_t fscache_n_op_run;
+atomic_t fscache_n_op_enqueue;
+atomic_t fscache_n_op_requeue;
+atomic_t fscache_n_op_deferred_release;
+atomic_t fscache_n_op_release;
+atomic_t fscache_n_op_gc;
+
+atomic_t fscache_n_attr_changed;
+atomic_t fscache_n_attr_changed_ok;
+atomic_t fscache_n_attr_changed_nobufs;
+atomic_t fscache_n_attr_changed_nomem;
+atomic_t fscache_n_attr_changed_calls;
+
+atomic_t fscache_n_allocs;
+atomic_t fscache_n_allocs_ok;
+atomic_t fscache_n_allocs_wait;
+atomic_t fscache_n_allocs_nobufs;
+atomic_t fscache_n_alloc_ops;
+atomic_t fscache_n_alloc_op_waits;
+
+atomic_t fscache_n_retrievals;
+atomic_t fscache_n_retrievals_ok;
+atomic_t fscache_n_retrievals_wait;
+atomic_t fscache_n_retrievals_nodata;
+atomic_t fscache_n_retrievals_nobufs;
+atomic_t fscache_n_retrievals_intr;
+atomic_t fscache_n_retrievals_nomem;
+atomic_t fscache_n_retrieval_ops;
+atomic_t fscache_n_retrieval_op_waits;
+
+atomic_t fscache_n_stores;
+atomic_t fscache_n_stores_ok;
+atomic_t fscache_n_stores_again;
+atomic_t fscache_n_stores_nobufs;
+atomic_t fscache_n_stores_oom;
+atomic_t fscache_n_store_ops;
+atomic_t fscache_n_store_calls;
+
+atomic_t fscache_n_marks;
+atomic_t fscache_n_uncaches;
+
+atomic_t fscache_n_acquires;
+atomic_t fscache_n_acquires_null;
+atomic_t fscache_n_acquires_no_cache;
+atomic_t fscache_n_acquires_ok;
+atomic_t fscache_n_acquires_nobufs;
+atomic_t fscache_n_acquires_oom;
+
+atomic_t fscache_n_updates;
+atomic_t fscache_n_updates_null;
+atomic_t fscache_n_updates_run;
+
+atomic_t fscache_n_relinquishes;
+atomic_t fscache_n_relinquishes_null;
+atomic_t fscache_n_relinquishes_waitcrt;
+
+atomic_t fscache_n_cookie_index;
+atomic_t fscache_n_cookie_data;
+atomic_t fscache_n_cookie_special;
+
+atomic_t fscache_n_object_alloc;
+atomic_t fscache_n_object_no_alloc;
+atomic_t fscache_n_object_lookups;
+atomic_t fscache_n_object_lookups_negative;
+atomic_t fscache_n_object_lookups_positive;
+atomic_t fscache_n_object_created;
+atomic_t fscache_n_object_avail;
+atomic_t fscache_n_object_dead;
+
+atomic_t fscache_n_checkaux_none;
+atomic_t fscache_n_checkaux_okay;
+atomic_t fscache_n_checkaux_update;
+atomic_t fscache_n_checkaux_obsolete;
+
+/*
+ * display the general statistics
+ */
+static int fscache_stats_show(struct seq_file *m, void *v)
+{
+ seq_puts(m, "FS-Cache statistics\n");
+
+ seq_printf(m, "Cookies: idx=%u dat=%u spc=%u\n",
+ atomic_read(&fscache_n_cookie_index),
+ atomic_read(&fscache_n_cookie_data),
+ atomic_read(&fscache_n_cookie_special));
+
+ seq_printf(m, "Objects: alc=%u nal=%u avl=%u ded=%u\n",
+ atomic_read(&fscache_n_object_alloc),
+ atomic_read(&fscache_n_object_no_alloc),
+ atomic_read(&fscache_n_object_avail),
+ atomic_read(&fscache_n_object_dead));
+ seq_printf(m, "ChkAux : non=%u ok=%u upd=%u obs=%u\n",
+ atomic_read(&fscache_n_checkaux_none),
+ atomic_read(&fscache_n_checkaux_okay),
+ atomic_read(&fscache_n_checkaux_update),
+ atomic_read(&fscache_n_checkaux_obsolete));
+
+ seq_printf(m, "Pages : mrk=%u unc=%u\n",
+ atomic_read(&fscache_n_marks),
+ atomic_read(&fscache_n_uncaches));
+
+ seq_printf(m, "Acquire: n=%u nul=%u noc=%u ok=%u nbf=%u"
+ " oom=%u\n",
+ atomic_read(&fscache_n_acquires),
+ atomic_read(&fscache_n_acquires_null),
+ atomic_read(&fscache_n_acquires_no_cache),
+ atomic_read(&fscache_n_acquires_ok),
+ atomic_read(&fscache_n_acquires_nobufs),
+ atomic_read(&fscache_n_acquires_oom));
+
+ seq_printf(m, "Lookups: n=%u neg=%u pos=%u crt=%u\n",
+ atomic_read(&fscache_n_object_lookups),
+ atomic_read(&fscache_n_object_lookups_negative),
+ atomic_read(&fscache_n_object_lookups_positive),
+ atomic_read(&fscache_n_object_created));
+
+ seq_printf(m, "Updates: n=%u nul=%u run=%u\n",
+ atomic_read(&fscache_n_updates),
+ atomic_read(&fscache_n_updates_null),
+ atomic_read(&fscache_n_updates_run));
+
+ seq_printf(m, "Relinqs: n=%u nul=%u wcr=%u\n",
+ atomic_read(&fscache_n_relinquishes),
+ atomic_read(&fscache_n_relinquishes_null),
+ atomic_read(&fscache_n_relinquishes_waitcrt));
+
+ seq_printf(m, "AttrChg: n=%u ok=%u nbf=%u oom=%u run=%u\n",
+ atomic_read(&fscache_n_attr_changed),
+ atomic_read(&fscache_n_attr_changed_ok),
+ atomic_read(&fscache_n_attr_changed_nobufs),
+ atomic_read(&fscache_n_attr_changed_nomem),
+ atomic_read(&fscache_n_attr_changed_calls));
+
+ seq_printf(m, "Allocs : n=%u ok=%u wt=%u nbf=%u\n",
+ atomic_read(&fscache_n_allocs),
+ atomic_read(&fscache_n_allocs_ok),
+ atomic_read(&fscache_n_allocs_wait),
+ atomic_read(&fscache_n_allocs_nobufs));
+ seq_printf(m, "Allocs : ops=%u owt=%u\n",
+ atomic_read(&fscache_n_alloc_ops),
+ atomic_read(&fscache_n_alloc_op_waits));
+
+ seq_printf(m, "Retrvls: n=%u ok=%u wt=%u nod=%u nbf=%u"
+ " int=%u oom=%u\n",
+ atomic_read(&fscache_n_retrievals),
+ atomic_read(&fscache_n_retrievals_ok),
+ atomic_read(&fscache_n_retrievals_wait),
+ atomic_read(&fscache_n_retrievals_nodata),
+ atomic_read(&fscache_n_retrievals_nobufs),
+ atomic_read(&fscache_n_retrievals_intr),
+ atomic_read(&fscache_n_retrievals_nomem));
+ seq_printf(m, "Retrvls: ops=%u owt=%u\n",
+ atomic_read(&fscache_n_retrieval_ops),
+ atomic_read(&fscache_n_retrieval_op_waits));
+
+ seq_printf(m, "Stores : n=%u ok=%u agn=%u nbf=%u oom=%u\n",
+ atomic_read(&fscache_n_stores),
+ atomic_read(&fscache_n_stores_ok),
+ atomic_read(&fscache_n_stores_again),
+ atomic_read(&fscache_n_stores_nobufs),
+ atomic_read(&fscache_n_stores_oom));
+ seq_printf(m, "Stores : ops=%u run=%u\n",
+ atomic_read(&fscache_n_store_ops),
+ atomic_read(&fscache_n_store_calls));
+
+ seq_printf(m, "Ops : pend=%u run=%u enq=%u\n",
+ atomic_read(&fscache_n_op_pend),
+ atomic_read(&fscache_n_op_run),
+ atomic_read(&fscache_n_op_enqueue));
+ seq_printf(m, "Ops : dfr=%u rel=%u gc=%u\n",
+ atomic_read(&fscache_n_op_deferred_release),
+ atomic_read(&fscache_n_op_release),
+ atomic_read(&fscache_n_op_gc));
+ return 0;
+}
+
+/*
+ * open "/proc/fs/fscache/stats" allowing provision of a statistical summary
+ */
+static int fscache_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, fscache_stats_show, NULL);
+}
+
+const struct file_operations fscache_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = fscache_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index a31f4c1..0410bd9 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -29,10 +29,6 @@ struct fscache_cache_ops;
struct fscache_object;
struct fscache_operation;

-#ifdef CONFIG_FSCACHE_PROC
-extern struct proc_dir_entry *proc_fscache;
-#endif
-
/*
* cache tag definition
*/

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:18 UTC
Permalink
Add a description of the root index of the cache for later patches to make use
of.

The root index is owned by FS-Cache itself. When a netfs requests caching
facilities, FS-Cache will, if one doesn't already exist, create an entry in
the root index with the key being the name of the netfs ("AFS" for example),
and the auxiliary data holding the index structure version supplied by the
netfs:

FSDEF
|
+-----------+
| |
NFS AFS
[v=1] [v=1]

If an entry with the appropriate name does already exist, the version is
compared. If the version is different, the entire subtree from that entry
will be discarded and a new entry created.

The new entry will be an index, and a cookie referring to it will be passed to
the netfs. This is then the root handle by which the netfs accesses the
cache. It can create whatever objects it likes in that index, including
further indices.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/Makefile | 1
fs/fscache/fsdef.c | 144 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/fscache/internal.h | 6 ++
3 files changed, 151 insertions(+), 0 deletions(-)
create mode 100644 fs/fscache/fsdef.c


diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index 1384823..bc1f3b9 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -3,6 +3,7 @@
#

fscache-y := \
+ fsdef.o \
main.o

fscache-$(CONFIG_PROC_FS) += proc.o
diff --git a/fs/fscache/fsdef.c b/fs/fscache/fsdef.c
new file mode 100644
index 0000000..f5b4bae
--- /dev/null
+++ b/fs/fscache/fsdef.c
@@ -0,0 +1,144 @@
+/* Filesystem index definition
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL CACHE
+#include <linux/module.h>
+#include "internal.h"
+
+static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax);
+
+static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax);
+
+static
+enum fscache_checkaux fscache_fsdef_netfs_check_aux(void *cookie_netfs_data,
+ const void *data,
+ uint16_t datalen);
+
+/*
+ * The root index is owned by FS-Cache itself.
+ *
+ * When a netfs requests caching facilities, FS-Cache will, if one doesn't
+ * already exist, create an entry in the root index with the key being the name
+ * of the netfs ("AFS" for example), and the auxiliary data holding the index
+ * structure version supplied by the netfs:
+ *
+ * FSDEF
+ * |
+ * +-----------+
+ * | |
+ * NFS AFS
+ * [v=1] [v=1]
+ *
+ * If an entry with the appropriate name does already exist, the version is
+ * compared. If the version is different, the entire subtree from that entry
+ * will be discarded and a new entry created.
+ *
+ * The new entry will be an index, and a cookie referring to it will be passed
+ * to the netfs. This is then the root handle by which the netfs accesses the
+ * cache. It can create whatever objects it likes in that index, including
+ * further indices.
+ */
+static struct fscache_cookie_def fscache_fsdef_index_def = {
+ .name = ".FS-Cache",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+};
+
+struct fscache_cookie fscache_fsdef_index = {
+ .usage = ATOMIC_INIT(1),
+ .lock = __SPIN_LOCK_UNLOCKED(fscache_fsdef_index.lock),
+ .backing_objects = HLIST_HEAD_INIT,
+ .def = &fscache_fsdef_index_def,
+};
+EXPORT_SYMBOL(fscache_fsdef_index);
+
+/*
+ * Definition of an entry in the root index. Each entry is an index, keyed to
+ * a specific netfs and only applicable to a particular version of the index
+ * structure used by that netfs.
+ */
+struct fscache_cookie_def fscache_fsdef_netfs_def = {
+ .name = "FSDEF.netfs",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+ .get_key = fscache_fsdef_netfs_get_key,
+ .get_aux = fscache_fsdef_netfs_get_aux,
+ .check_aux = fscache_fsdef_netfs_check_aux,
+};
+
+/*
+ * get the key data for an FSDEF index record - this is the name of the netfs
+ * for which this entry is created
+ */
+static uint16_t fscache_fsdef_netfs_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct fscache_netfs *netfs = cookie_netfs_data;
+ unsigned klen;
+
+ _enter("{%s.%u},", netfs->name, netfs->version);
+
+ klen = strlen(netfs->name);
+ if (klen > bufmax)
+ return 0;
+
+ memcpy(buffer, netfs->name, klen);
+ return klen;
+}
+
+/*
+ * get the auxiliary data for an FSDEF index record - this is the index
+ * structure version number of the netfs for which this version is created
+ */
+static uint16_t fscache_fsdef_netfs_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct fscache_netfs *netfs = cookie_netfs_data;
+ unsigned dlen;
+
+ _enter("{%s.%u},", netfs->name, netfs->version);
+
+ dlen = sizeof(uint32_t);
+ if (dlen > bufmax)
+ return 0;
+
+ memcpy(buffer, &netfs->version, dlen);
+ return dlen;
+}
+
+/*
+ * check that the index structure version number stored in the auxiliary data
+ * matches the one the netfs gave us
+ */
+static enum fscache_checkaux fscache_fsdef_netfs_check_aux(
+ void *cookie_netfs_data,
+ const void *data,
+ uint16_t datalen)
+{
+ struct fscache_netfs *netfs = cookie_netfs_data;
+ uint32_t version;
+
+ _enter("{%s},,%hu", netfs->name, datalen);
+
+ if (datalen != sizeof(version)) {
+ _leave(" = OBSOLETE [dl=%d v=%zu]", datalen, sizeof(version));
+ return FSCACHE_CHECKAUX_OBSOLETE;
+ }
+
+ memcpy(&version, data, sizeof(version));
+ if (version != netfs->version) {
+ _leave(" = OBSOLETE [ver=%x net=%x]", version, netfs->version);
+ return FSCACHE_CHECKAUX_OBSOLETE;
+ }
+
+ _leave(" = OKAY");
+ return FSCACHE_CHECKAUX_OKAY;
+}
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 16f9f1f..4113af8 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -28,6 +28,12 @@
#define FSCACHE_MAX_THREADS 32

/*
+ * fsc-fsdef.c
+ */
+extern struct fscache_cookie fscache_fsdef_index;
+extern struct fscache_cookie_def fscache_fsdef_netfs_def;
+
+/*
* fsc-histogram.c
*/
#ifdef CONFIG_FSCACHE_HISTOGRAM

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:03:52 UTC
Permalink
Recruit a couple of page flags to aid in cache management. The following extra
flags are defined:

(1) PG_fscache (PG_private_2)

The marked page is backed by a local cache and is pinning resources in the
cache driver.

(2) PG_fscache_write (PG_owner_priv_2)

The marked page is being written to the local cache. The page may not be
modified whilst this is in progress.

If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/splice.c | 3 ++-
include/linux/page-flags.h | 41 ++++++++++++++++++++++++++++++++++++-----
include/linux/pagemap.h | 14 ++++++++++++++
mm/filemap.c | 18 ++++++++++++++++++
mm/migrate.c | 10 +++++-----
mm/readahead.c | 9 +++++----
mm/swap.c | 4 ++--
mm/truncate.c | 10 +++++-----
mm/vmscan.c | 6 +++---
9 files changed, 90 insertions(+), 25 deletions(-)


diff --git a/fs/splice.c b/fs/splice.c
index 4ed0ba4..dd727d4 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -59,7 +59,8 @@ static int page_cache_pipe_buf_steal(struct pipe_inode_info *pipe,
*/
wait_on_page_writeback(page);

- if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
+ if (page_has_private(page) &&
+ !try_to_release_page(page, GFP_KERNEL))
goto out_unlock;

/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9d99e74..0633e06 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -79,9 +79,11 @@ enum pageflags {
PG_active,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
+ PG_owner_priv_2, /* Owner use. fs may use in pagecache */
PG_arch_1,
PG_reserved,
PG_private, /* If pagecache, has fs-private data */
+ PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
#ifdef CONFIG_PAGEFLAGS_EXTENDED
PG_head, /* A head page */
@@ -108,6 +110,13 @@ enum pageflags {
/* Filesystems */
PG_checked = PG_owner_priv_1,

+ /* Two page bits are conscripted by FS-Cache to maintain local caching
+ * state. These bits are set on pages belonging to the netfs's inodes
+ * when those inodes are being locally cached.
+ */
+ PG_fscache = PG_private_2, /* page backed by cache */
+ PG_fscache_write = PG_owner_priv_2, /* page being written to cache */
+
/* XEN */
PG_pinned = PG_owner_priv_1,
PG_savepinned = PG_dirty,
@@ -194,8 +203,6 @@ PAGEFLAG(Checked, checked) /* Used by some filesystems */
PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned) /* Xen */
PAGEFLAG(SavePinned, savepinned); /* Xen */
PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
-PAGEFLAG(Private, private) __CLEARPAGEFLAG(Private, private)
- __SETPAGEFLAG(Private, private)
PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)

__PAGEFLAG(SlobPage, slob_page)
@@ -205,6 +212,17 @@ __PAGEFLAG(SlubFrozen, slub_frozen)
__PAGEFLAG(SlubDebug, slub_debug)

/*
+ * Private page markings that may be used by the filesystem that owns the page
+ * for its own purposes.
+ * - PG_private and PG_private_2 cause releasepage() and co to be invoked
+ */
+PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private)
+ __CLEARPAGEFLAG(Private, private)
+PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2)
+PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
+PAGEFLAG(OwnerPriv2, owner_priv_2) TESTSCFLAG(OwnerPriv2, owner_priv_2)
+
+/*
* Only test-and-set exist for PG_writeback. The unconditional operators are
* risky: they bypass page accounting.
*/
@@ -384,9 +402,10 @@ static inline void __ClearPageTail(struct page *page)
* these flags set. It they are, there is a problem.
*/
#define PAGE_FLAGS_CHECK_AT_FREE \
- (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
- 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
- 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
+ (1 << PG_lru | 1 << PG_locked | \
+ 1 << PG_private | 1 << PG_private_2 | \
+ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
+ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
__PG_UNEVICTABLE | __PG_MLOCKED)

/*
@@ -397,4 +416,16 @@ static inline void __ClearPageTail(struct page *page)
#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)

#endif /* !__GENERATING_BOUNDS_H */
+
+/**
+ * page_has_private - Determine if page has private stuff
+ * @page: The page to be checked
+ *
+ * Determine if a page has private stuff, indicating that release routines
+ * should be invoked upon it.
+ */
+#define page_has_private(page) \
+ ((page)->flags & ((1 << PG_private) | \
+ (1 << PG_private_2)))
+
#endif /* PAGE_FLAGS_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 01ca085..d15e21e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -379,6 +379,20 @@ static inline void wait_on_page_writeback(struct page *page)

extern void end_page_writeback(struct page *page);

+/**
+ * wait_on_page_owner_priv_2 - Wait for PG_owner_priv_2 to become clear
+ * @page: The page to monitor
+ *
+ * Wait for a PG_owner_priv_2 to become clear on the specified page.
+ */
+static inline void wait_on_page_owner_priv_2(struct page *page)
+{
+ if (PageOwnerPriv2(page))
+ wait_on_page_bit(page, PG_owner_priv_2);
+}
+
+extern void end_page_owner_priv_2(struct page *page);
+
/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
*
diff --git a/mm/filemap.c b/mm/filemap.c
index 126d397..f8ae118 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -603,6 +603,21 @@ void end_page_writeback(struct page *page)
EXPORT_SYMBOL(end_page_writeback);

/**
+ * end_page_owner_priv_2 - Clear PG_owner_priv_2 and wake up any waiters
+ * @page: the page
+ *
+ * Clear PG_owner_priv_2 and wake up any processes waiting for that event.
+ */
+void end_page_owner_priv_2(struct page *page)
+{
+ if (!TestClearPageOwnerPriv2(page))
+ BUG();
+ smp_mb__after_clear_bit();
+ wake_up_page(page, PG_owner_priv_2);
+}
+EXPORT_SYMBOL(end_page_owner_priv_2);
+
+/**
* __lock_page - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
*
@@ -2463,6 +2478,9 @@ EXPORT_SYMBOL(generic_file_aio_write);
* (presumably at page->private). If the release was successful, return `1'.
* Otherwise return zero.
*
+ * This may also be called if PG_fscache is set on a page, indicating that the
+ * page is known to the local caching routines.
+ *
* The @gfp_mask argument specifies whether I/O may be performed to release
* this page (__GFP_IO), and whether the call may block (__GFP_WAIT & __GFP_FS).
*
diff --git a/mm/migrate.c b/mm/migrate.c
index a9eff3f..068655d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -250,7 +250,7 @@ out:
* The number of remaining references must be:
* 1 for anonymous pages without a mapping
* 2 for pages with a mapping
- * 3 for pages with a mapping and PagePrivate set.
+ * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
*/
static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
@@ -270,7 +270,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

- expected_count = 2 + !!PagePrivate(page);
+ expected_count = 2 + !!page_has_private(page);
if (page_count(page) != expected_count ||
(struct page *)radix_tree_deref_slot(pslot) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -386,7 +386,7 @@ EXPORT_SYMBOL(fail_migrate_page);

/*
* Common logic to directly migrate a single page suitable for
- * pages that do not use PagePrivate.
+ * pages that do not use PagePrivate/PagePrivate2.
*
* Pages are locked upon entry and exit.
*/
@@ -522,7 +522,7 @@ static int fallback_migrate_page(struct address_space *mapping,
* Buffers may be managed in a filesystem specific way.
* We must have no buffers or drop them.
*/
- if (PagePrivate(page) &&
+ if (page_has_private(page) &&
!try_to_release_page(page, GFP_KERNEL))
return -EAGAIN;

@@ -655,7 +655,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
* free the metadata, so the page can be freed.
*/
if (!page->mapping) {
- if (!PageAnon(page) && PagePrivate(page)) {
+ if (!PageAnon(page) && page_has_private(page)) {
/*
* Go direct to try_to_free_buffers() here because
* a) that's what try_to_release_page() would do anyway
diff --git a/mm/readahead.c b/mm/readahead.c
index 6be9275..133b6d5 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -33,14 +33,15 @@ EXPORT_SYMBOL_GPL(file_ra_state_init);

/*
* see if a page needs releasing upon read_cache_pages() failure
- * - the caller of read_cache_pages() may have set PG_private before calling,
- * such as the NFS fs marking pages that are cached locally on disk, thus we
- * need to give the fs a chance to clean up in the event of an error
+ * - the caller of read_cache_pages() may have set PG_private or PG_fscache
+ * before calling, such as the NFS fs marking pages that are cached locally
+ * on disk, thus we need to give the fs a chance to clean up in the event of
+ * an error
*/
static void read_cache_pages_invalidate_page(struct address_space *mapping,
struct page *page)
{
- if (PagePrivate(page)) {
+ if (page_has_private(page)) {
if (!trylock_page(page))
BUG();
page->mapping = mapping;
diff --git a/mm/swap.c b/mm/swap.c
index 6e83084..bede23c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -448,8 +448,8 @@ void pagevec_strip(struct pagevec *pvec)
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];

- if (PagePrivate(page) && trylock_page(page)) {
- if (PagePrivate(page))
+ if (page_has_private(page) && trylock_page(page)) {
+ if (page_has_private(page))
try_to_release_page(page, 0);
unlock_page(page);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 1229211..55206fa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -50,7 +50,7 @@ void do_invalidatepage(struct page *page, unsigned long offset)
static inline void truncate_partial_page(struct page *page, unsigned partial)
{
zero_user_segment(page, partial, PAGE_CACHE_SIZE);
- if (PagePrivate(page))
+ if (page_has_private(page))
do_invalidatepage(page, partial);
}

@@ -99,7 +99,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
if (page->mapping != mapping)
return;

- if (PagePrivate(page))
+ if (page_has_private(page))
do_invalidatepage(page, 0);

cancel_dirty_page(page, PAGE_CACHE_SIZE);
@@ -126,7 +126,7 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
if (page->mapping != mapping)
return 0;

- if (PagePrivate(page) && !try_to_release_page(page, 0))
+ if (page_has_private(page) && !try_to_release_page(page, 0))
return 0;

clear_page_mlock(page);
@@ -348,7 +348,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
if (page->mapping != mapping)
return 0;

- if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
+ if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;

spin_lock_irq(&mapping->tree_lock);
@@ -356,7 +356,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
goto failed;

clear_page_mlock(page);
- BUG_ON(PagePrivate(page));
+ BUG_ON(page_has_private(page));
__remove_from_page_cache(page);
spin_unlock_irq(&mapping->tree_lock);
page_cache_release(page); /* pagecache ref */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06e7269..4252449 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -283,7 +283,7 @@ static inline int page_mapping_inuse(struct page *page)

static inline int is_page_cache_freeable(struct page *page)
{
- return page_count(page) - !!PagePrivate(page) == 2;
+ return page_count(page) - !!page_has_private(page) == 2;
}

static int may_write_to_queue(struct backing_dev_info *bdi)
@@ -367,7 +367,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* Some data journaling orphaned pages can have
* page->mapping == NULL while being dirty with clean buffers.
*/
- if (PagePrivate(page)) {
+ if (page_has_private(page)) {
if (try_to_free_buffers(page)) {
ClearPageDirty(page);
printk("%s: orphaned page\n", __func__);
@@ -727,7 +727,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* process address space (page_count == 1) it can be freed.
* Otherwise, leave the page on the LRU so it is swappable.
*/
- if (PagePrivate(page)) {
+ if (page_has_private(page)) {
if (!try_to_release_page(page, sc->gfp_mask))
goto activate_locked;
if (!mapping && page_count(page) == 1) {

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nick Piggin
2009-04-02 13:55:25 UTC
Permalink
Post by David Howells
Recruit a couple of page flags to aid in cache management. The following
(1) PG_fscache (PG_private_2)
The marked page is backed by a local cache and is pinning resources
in
Post by David Howells
the cache driver.
(2) PG_fscache_write (PG_owner_priv_2)
The marked page is being written to the local cache. The page may
not
Post by David Howells
be modified whilst this is in progress.
If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page
invalidation. The function page_has_private() had been added to make the
checks for both PG_private and PG_private_2 at the same time.
2 new page flags? Ouch.

1) PG_mappedtodisk is basically PG_owner_priv_2. Please alias that and
use it? Then at least we're down to 1 extra flag.

2) Why do you need another PG_private? PG_private for pagecache means
that it should call into the filesystem when it needs to handle fs data
attached to the page, right? So PG_private_2 doesn't really make sense
in that respect. Can't you just keep another flag privately in the fs
and distinguish what to do based on that?
Post by David Howells
---
fs/splice.c | 3 ++-
include/linux/page-flags.h | 41
++++++++++++++++++++++++++++++++++++----- include/linux/pagemap.h |
14
Post by David Howells
++++++++++++++
mm/filemap.c | 18 ++++++++++++++++++
mm/migrate.c | 10 +++++-----
mm/readahead.c | 9 +++++----
mm/swap.c | 4 ++--
mm/truncate.c | 10 +++++-----
mm/vmscan.c | 6 +++---
9 files changed, 90 insertions(+), 25 deletions(-)
David Howells
2009-04-02 14:36:12 UTC
Permalink
Post by Nick Piggin
1) PG_mappedtodisk is basically PG_owner_priv_2. Please alias that and
use it? Then at least we're down to 1 extra flag.
2) Why do you need another PG_private? PG_private for pagecache means
that it should call into the filesystem when it needs to handle fs data
attached to the page, right? So PG_private_2 doesn't really make sense
in that respect.
Won't that either break fs/buffer.c and fs/mpage.c or preclude the use of
FS-Cache with block-based filesystems that use the standard buffer wangling
routines?

As I've previously stated, I want to be able to make ISO9660 use FS-Cache.
That rules out use of PG_mappedtodisk and PG_private for anything FS-Cache
related.

We can actually reclaim PG_private, I think. There are patches to do that.
At the very least, we can probably reclaim the std buffering code's use of it.

If anything, avoiding the need for PG_fscache_write is probably easier - just
more memory intensive and slower. I could build a second radix tree for each
inode that kept track of which pages from that inode FS-Cache knows about, and
use the status bits in that node to keep track of what pages are being written
out to the cache.

We still need a way of triggering the page invalidation callbacks for in-use
pages, however. PG_private, as I've said, is not currently a viable option.

David
Nick Piggin
2009-04-02 15:25:15 UTC
Permalink
[sorry for that html crap :(]
Post by David Howells
Post by Nick Piggin
1) PG_mappedtodisk is basically PG_owner_priv_2. Please alias that and
use it? Then at least we're down to 1 extra flag.
2) Why do you need another PG_private? PG_private for pagecache means
that it should call into the filesystem when it needs to handle fs data
attached to the page, right? So PG_private_2 doesn't really make sense
in that respect.
Won't that either break fs/buffer.c and fs/mpage.c or preclude the use of
FS-Cache with block-based filesystems that use the standard buffer wangling
routines?
Haven't looked closely at how fscache works. Possibly you can't reuse
mappedtodisk....

But PG_private. PG_private from the vm/vfs side means to call into the
filesystem. So from that point, the filesystem should handle it. It just
doesn't seem to make sense to have 2 flags for this.

I mean, there is only the single aop that can be called, so having 2 VM
visible flags doesn't help the VM do anything, and presumably your aop
knows how to handle this, so it should be an fs private bit.

Just give me a situation of why it won't work.
Post by David Howells
As I've previously stated, I want to be able to make ISO9660 use FS-Cache.
That rules out use of PG_mappedtodisk and PG_private for anything FS-Cache
related.
We can actually reclaim PG_private, I think. There are patches to do that.
At the very least, we can probably reclaim the std buffering code's use of it.
But that's peripheral issue.
Post by David Howells
If anything, avoiding the need for PG_fscache_write is probably easier - just
more memory intensive and slower. I could build a second radix tree for each
inode that kept track of which pages from that inode FS-Cache knows about, and
use the status bits in that node to keep track of what pages are being written
out to the cache.
If it's not much slower, that would be nice.
Post by David Howells
We still need a way of triggering the page invalidation callbacks for in-use
pages, however. PG_private, as I've said, is not currently a viable option.
Can you say exactly why not?
David Howells
2009-04-02 15:51:02 UTC
Permalink
Post by Nick Piggin
Haven't looked closely at how fscache works.
It's fairly simple. FS-Cache sets PG_private_2 (PageFsCache) on pages that the
netfs tells it about, if it retains an interest in the page. This causes
invalidatepage() and suchlike to be invoked on that page when the page is
discarded.

The netfs can check for the page being in use by calling PageFsCache() and then
uncache the page if it is in use:

#ifdef CONFIG_AFS_FSCACHE
if (PageFsCache(page)) {
struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
wait_on_page_fscache_write(page);
fscache_uncache_page(vnode->cache, page);
}
#endif

which clears the bit.

Furthermore, when FS-Cache is asked to store a page to the cache, it
immediately marks it with PG_owner_priv_2 (PageFsCacheWrite). This is cleared
when FS-Cache no longer needs the data in the page for writing to the cache.

This allows (1) invalidatepage() to wait until the page is written before it is
returned to the memory allocator, and (2) releasepage() to indicate that the
page is busy if __GFP_WAIT is not given.
Post by Nick Piggin
Possibly you can't reuse mappedtodisk....
PG_mappedtodisk has a very specific meaning to fs/buffer.c and fs/mpage.c. I
can't also easily make it mean that a page is backed by the cache. A page can
be cached and not mapped to disk.
Post by Nick Piggin
Post by David Howells
We still need a way of triggering the page invalidation callbacks for in-use
pages, however. PG_private, as I've said, is not currently a viable option.
Can you say exactly why not?
fs/buffer.c owns PG_private in filesystems that use standard buffering. It
sets it, clears it and tests it at its own behest without recourse to the
filesystem using it.

ISO9660 uses fs/buffer.c.

I want to get ISO9660 to use FS-Cache.

Also NFS uses PG_private for its own nefarious purposes. Making PG_private be
the conjunction of both purposes entailed some fairly messy patching.

David
Nick Piggin
2009-04-02 16:15:02 UTC
Permalink
Post by David Howells
Post by Nick Piggin
Haven't looked closely at how fscache works.
It's fairly simple. FS-Cache sets PG_private_2 (PageFsCache) on pages that the
netfs tells it about, if it retains an interest in the page. This causes
invalidatepage() and suchlike to be invoked on that page when the page is
discarded.
The netfs can check for the page being in use by calling PageFsCache() and then
#ifdef CONFIG_AFS_FSCACHE
if (PageFsCache(page)) {
struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
wait_on_page_fscache_write(page);
fscache_uncache_page(vnode->cache, page);
}
#endif
which clears the bit.
OK, then you just use PG_private for that, and have the netfs
use a PG_owner_private or some such bit to tell that it is an
fscache page.
Post by David Howells
Furthermore, when FS-Cache is asked to store a page to the cache, it
immediately marks it with PG_owner_priv_2 (PageFsCacheWrite). This is cleared
when FS-Cache no longer needs the data in the page for writing to the cache.
This allows (1) invalidatepage() to wait until the page is written before it is
returned to the memory allocator, and (2) releasepage() to indicate that the
page is busy if __GFP_WAIT is not given.
So it isn't written synchronously at invalidatepage-time? OK.
Post by David Howells
Post by Nick Piggin
Possibly you can't reuse mappedtodisk....
PG_mappedtodisk has a very specific meaning to fs/buffer.c and fs/mpage.c. I
can't also easily make it mean that a page is backed by the cache. A page can
be cached and not mapped to disk.
You have 2 types of pagecache pages you are dealing with here, right? The
netfs and the backingfs pages. From what you write above, am I to take it
that you need to know whether a backingfs page is "backed by the cache"?
WTF for? And what cache is it backed by if it is the backing store?
Post by David Howells
Post by Nick Piggin
Post by David Howells
We still need a way of triggering the page invalidation callbacks for in-use
pages, however. PG_private, as I've said, is not currently a viable option.
Can you say exactly why not?
fs/buffer.c owns PG_private in filesystems that use standard buffering. It
sets it, clears it and tests it at its own behest without recourse to the
filesystem using it.
One of us is confused about how this works. Firstly, from your description
above, you're needing the invalidatepage call from the *netfs* page. This is
not using fs/buffer.c presumably, that is the backing store fs.

Secondly, I repeat again, PG_private is only to tell the VM to call the fs aop,
so if you just think you can override this without changing the aop then you
are mistaken: buffer.c will blow up in serveral places if its page aops are
called without buffers being attached to the page (try_to_release_page being one
of them).

Thirdly, buffer layer is just a library for the filesystem to use. Of course it
has recourse to override things just by giving different aops (which could then
call into buffer.c if PG_fscache is not set or whatever you require).
Post by David Howells
Also NFS uses PG_private for its own nefarious purposes. Making PG_private be
the conjunction of both purposes entailed some fairly messy patching.
This is basically an NFS mess, so that's where it belongs. But anyway I don't
see how it could be less messy to add this in the VM because NFS aops *still*
need to distinguish between cases anyway.
Nick Piggin
2009-04-02 16:53:17 UTC
Permalink
Post by Nick Piggin
Post by David Howells
Post by Nick Piggin
Possibly you can't reuse mappedtodisk....
PG_mappedtodisk has a very specific meaning to fs/buffer.c and fs/mpage.c. I
can't also easily make it mean that a page is backed by the cache. A page can
be cached and not mapped to disk.
You have 2 types of pagecache pages you are dealing with here, right? The
netfs and the backingfs pages. From what you write above, am I to take it
that you need to know whether a backingfs page is "backed by the cache"?
WTF for? And what cache is it backed by if it is the backing store?
Ah, sorry you want isofs to be cached by another backing store. All
this talk of netfs confused me. Yes that's fair enough and could
put PG_mappedtodisk out of reach. OTOH, is isofs important? Enough
that it can't be done with fuse or something?

Comments about PG_private still stand.
David Howells
2009-04-02 17:09:15 UTC
Permalink
Post by Nick Piggin
Comments about PG_private still stand.
Replies about PG_private still stand.

It's owned by fs/buffer.c, and if the 'netfs' uses that, then PG_private is
not available. Though, as I said, fs/buffer.c could be broken of its control
of the bit.

David
Nick Piggin
2009-04-02 17:31:03 UTC
Permalink
Post by David Howells
Post by Nick Piggin
Comments about PG_private still stand.
Replies about PG_private still stand.
It's owned by fs/buffer.c, and if the 'netfs' uses that, then PG_private is
not available.
Well in theory I still think it would be cleanest to modify buffer to
play more nicely with it. But maybe that ends up being harder to
distinguish the 3 cases of attached metadata on the page. I don't know,
you haven't posted any isofs code so either way it is inappropriate to
use up this extra page flag here.

Is isofs cache worth a page flag?
David Howells
2009-04-02 17:40:57 UTC
Permalink
Post by Nick Piggin
Well in theory I still think it would be cleanest to modify buffer to
play more nicely with it. But maybe that ends up being harder to
distinguish the 3 cases of attached metadata on the page. I don't know,
you haven't posted any isofs code so either way it is inappropriate to
use up this extra page flag here.
Is isofs cache worth a page flag?
Well, isofs was something I wanted at the time.

Besides, as I said NFS uses PG_private for its own purposes, and entangling
the two wasn't the most fun I've had. Trond didn't like it either.

David
Nick Piggin
2009-04-02 18:14:42 UTC
Permalink
Post by David Howells
Post by Nick Piggin
Well in theory I still think it would be cleanest to modify buffer to
play more nicely with it. But maybe that ends up being harder to
distinguish the 3 cases of attached metadata on the page. I don't know,
you haven't posted any isofs code so either way it is inappropriate to
use up this extra page flag here.
Is isofs cache worth a page flag?
Well, isofs was something I wanted at the time.
It can't be done with FUSE?
Post by David Howells
Besides, as I said NFS uses PG_private for its own purposes, and entangling
the two wasn't the most fun I've had. Trond didn't like it either.
IMO that is quite OK to make them go through the pain of that if it
avoids resulting in a new page flag that is probably unusable to most
other filesystems.

But... I guess I won't get hung up on it. If you avoid at least one of
these flags it would be a good start.
Trond Myklebust
2009-04-02 19:19:47 UTC
Permalink
Post by David Howells
Post by Nick Piggin
Well in theory I still think it would be cleanest to modify buffer to
play more nicely with it. But maybe that ends up being harder to
distinguish the 3 cases of attached metadata on the page. I don't know,
you haven't posted any isofs code so either way it is inappropriate to
use up this extra page flag here.
Is isofs cache worth a page flag?
Well, isofs was something I wanted at the time.
Besides, as I said NFS uses PG_private for its own purposes, and entangling
the two wasn't the most fun I've had. Trond didn't like it either.
We use it for exactly the same purpose as everyone else: to signal that
we need a callback to ->releasepage() if the VM wants to truncate the
page.

Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-03 01:11:10 UTC
Permalink
Post by Nick Piggin
Post by David Howells
Besides, as I said NFS uses PG_private for its own purposes, and entangling
the two wasn't the most fun I've had. Trond didn't like it either.
IMO that is quite OK to make them go through the pain of that if it
avoids resulting in a new page flag that is probably unusable to most
other filesystems.
But... I guess I won't get hung up on it. If you avoid at least one of
these flags it would be a good start.
Okay, the attached patch allows PageFsCacheWrite to be dispensed with. I have
a radix tree already to track pages that need writing, so I just extend the
use of that.

It does make waiting for a page to be written more tricky, though, as I can't
now just wait on a page flag. It's also theoretically slower as I now have to
take extra spinlocks.

I've benchmarked it three times:

Fri Apr 3 01:42:16 BST 2009
Fri Apr 3 01:44:25 BST 2009 2m 9s

Fri Apr 3 01:46:45 BST 2009
Fri Apr 3 01:48:24 BST 2009 1m 39s

Fri Apr 3 01:50:28 BST 2009
Fri Apr 3 01:52:40 BST 2009 2m 12s

using the same benchmark as for the patch to use the write() file op.

David
---
commit 9a10592512b01b42eb65f32c1028c57a1a6c864e
Author: David Howells <***@redhat.com>
Date: Fri Apr 3 02:04:15 2009 +0100

FS-Cache: Use a radix tree to track pages being written rather than a page flag

Use a radix tree attached to struct fscache_cookie to track what pages are
undergoing write, rather than using a page flag.

The radix tree that was resident in struct fscache_object to track pages that
need writing is moved to fscache_cookie. Pages that need writing and pages
that are being written are both held in there. The difference being that the
former are tagged with FSCACHE_COOKIE_PENDING_TAG.

Signed-off-by: David Howells <***@redhat.com>

diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index da8f92f..4db125b 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -640,7 +640,18 @@ Note that pages can't be explicitly deleted from the a data file. The whole
data file must be retired (see the relinquish cookie function below).

Furthermore, note that this does not cancel the asynchronous read or write
-operation started by the read/alloc and write functions.
+operation started by the read/alloc and write functions, so the page
+invalidation and release functions must use:
+
+ bool fscache_check_page_write(struct fscache_cookie *cookie,
+ struct page *page);
+
+to see if a page is being written to the cache, and:
+
+ void fscache_wait_on_page_write(struct fscache_cookie *cookie,
+ struct page *page);
+
+to wait for it to finish if it is.


==========================
@@ -730,52 +741,32 @@ this, the caller should relinquish and retire the cookie they have, and then
acquire a new one.


-============================
-FS-CACHE SPECIFIC PAGE FLAGS
-============================
-
-FS-Cache makes use of two page flags, PG_private_2 and PG_owner_priv_2, for
-its own purpose. The first is given the alternative name PG_fscache and the
-second PG_fscache_write.
-
-FS-Cache uses these flags to keep track of two bits of information per cached
-netfs page:
+===========================
+FS-CACHE SPECIFIC PAGE FLAG
+===========================

- (1) PG_fscache.
+FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
+given the alternative name PG_fscache.

- This indicates that the page is known by the cache, and that the cache
- must be informed if the page is going to go away. It's an indication to
- the netfs that the cache has an interest in this page, where an interest
- may be a pointer to it, resources allocated or reserved for it, or I/O in
- progress upon it.
+PG_fscache is used to indicate that the page is known by the cache, and that
+the cache must be informed if the page is going to go away. It's an indication
+to the netfs that the cache has an interest in this page, where an interest may
+be a pointer to it, resources allocated or reserved for it, or I/O in progress
+upon it.

- The netfs can use this information in methods such as releasepage() to
- determine whether it needs to uncache a page or update it.
+The netfs can use this information in methods such as releasepage() to
+determine whether it needs to uncache a page or update it.

- Furthermore, if this bit is set, releasepage() and invalidatepage()
- operations will be called on a page to get rid of it, even if PG_private
- is not set. This allows caching to attempted on a page before
- read_cache_pages() to be called after fscache_read_or_alloc_pages() as
- the former will try and release pages it was given under certain
- circumstances.
+Furthermore, if this bit is set, releasepage() and invalidatepage() operations
+will be called on a page to get rid of it, even if PG_private is not set. This
+allows caching to attempted on a page before read_cache_pages() to be called
+after fscache_read_or_alloc_pages() as the former will try and release pages it
+was given under certain circumstances.

- (2) PG_fscache_write.
+This bit does not overlap with such as PG_private. This means that FS-Cache
+can be used with a filesystem that uses the block buffering code.

- This indicates that the page is being written to disk by the cache, and
- that it cannot be released until completion. Ideally it shouldn't be
- changed until completion either so as to maintain the known state of the
- cache. This cannot be unified with PG_writeback as the page may be being
- written to both the server and the cache at the same time or at different
- times.
-
- This can be used by the netfs to wait for a page to be written out to the
- cache before, say, releasing or invalidating it, or before allowing
- someone to modify it in page_mkwrite(), say.
-
-Neither of these two bits overlaps with such as PG_private. This means that
-FS-Cache can be used with a filesystem that uses the block buffering code.
-
-There are a number of operations defined on these two bits:
+There are a number of operations defined on this flag:

int PageFsCache(struct page *page);
void SetPageFsCache(struct page *page)
@@ -783,18 +774,5 @@ There are a number of operations defined on these two bits:
int TestSetPageFsCache(struct page *page)
int TestClearPageFsCache(struct page *page)

- int PageFsCacheWrite(struct page *page)
- void SetPageFsCacheWrite(struct page *page)
- void ClearPageFsCacheWrite(struct page *page)
- int TestSetPageFsCacheWrite(struct page *page)
- int TestClearPageFsCacheWrite(struct page *page)
-
These functions are bit test, bit set, bit clear, bit test and set and bit
-test and clear operations on PG_fscache and PG_fscache_write.
-
- void wait_on_page_fscache_write(struct page *page)
- void end_page_fscache_write(struct page *page)
-
-The first of these two functions waits uninterruptibly for PG_fscache_write to
-become clear, if it isn't already so. The second clears PG_fscache_write and
-wakes up anyone waiting for it.
+test and clear operations on PG_fscache.
diff --git a/fs/afs/file.c b/fs/afs/file.c
index aeb6cdd..7a1d942 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -299,7 +299,7 @@ static void afs_invalidatepage(struct page *page, unsigned long offset)
#ifdef CONFIG_AFS_FSCACHE
if (PageFsCache(page)) {
struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
- wait_on_page_fscache_write(page);
+ fscache_wait_on_page_write(vnode->cache, page);
fscache_uncache_page(vnode->cache, page);
ClearPageFsCache(page);
}
@@ -336,12 +336,12 @@ static int afs_releasepage(struct page *page, gfp_t gfp_flags)
* elected to wait */
#ifdef CONFIG_AFS_FSCACHE
if (PageFsCache(page)) {
- if (PageFsCacheWrite(page)) {
+ if (fscache_check_page_write(vnode->cache, page)) {
if (!(gfp_flags & __GFP_WAIT)) {
_leave(" = F [cache busy]");
return 0;
}
- wait_on_page_fscache_write(page);
+ fscache_wait_on_page_write(vnode->cache, page);
}

fscache_uncache_page(vnode->cache, page);
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 7884518..c2e7a7f 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -795,7 +795,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
/* wait for the page to be written to the cache before we allow it to
* be modified */
#ifdef CONFIG_AFS_FSCACHE
- wait_on_page_fscache_write(page);
+ fscache_wait_on_page_write(vnode->cache, page);
#endif

_leave(" = 0");
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index cd9d065..72fd18f 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -102,6 +102,8 @@ struct fscache_cookie *__fscache_acquire_cookie(
cookie->netfs_data = netfs_data;
cookie->flags = 0;

+ INIT_RADIX_TREE(&cookie->stores, GFP_NOFS);
+
switch (cookie->def->type) {
case FSCACHE_COOKIE_TYPE_INDEX:
fscache_stat(&fscache_n_cookie_index);
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 512ec2c..2568e0e 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -17,6 +17,47 @@
#include "internal.h"

/*
+ * check to see if a page is being written to the cache
+ */
+bool __fscache_check_page_write(struct fscache_cookie *cookie, struct page *page)
+{
+ void *val;
+
+ rcu_read_lock();
+ val = radix_tree_lookup(&cookie->stores, page->index);
+ rcu_read_unlock();
+
+ return val != NULL;
+}
+EXPORT_SYMBOL(__fscache_check_page_write);
+
+/*
+ * wait for a page to finish being written to the cache
+ */
+void __fscache_wait_on_page_write(struct fscache_cookie *cookie, struct page *page)
+{
+ wait_queue_head_t *wq = bit_waitqueue(&cookie->flags, 0);
+
+ wait_event(*wq, !__fscache_check_page_write(cookie, page));
+}
+EXPORT_SYMBOL(__fscache_wait_on_page_write);
+
+/*
+ * note that a page has finished being written to the cache
+ */
+static void fscache_end_page_write(struct fscache_cookie *cookie, struct page *page)
+{
+ struct page *xpage;
+
+ spin_lock(&cookie->lock);
+ xpage = radix_tree_delete(&cookie->stores, page->index);
+ spin_unlock(&cookie->lock);
+ ASSERT(xpage != NULL);
+
+ wake_up_bit(&cookie->flags, 0);
+}
+
+/*
* actually apply the changed attributes to a cache object
*/
static void fscache_attr_changed_op(struct fscache_operation *op)
@@ -480,6 +521,7 @@ static void fscache_write_op(struct fscache_operation *_op)
struct fscache_storage *op =
container_of(_op, struct fscache_storage, op);
struct fscache_object *object = op->op.object;
+ struct fscache_cookie *cookie = object->cookie;
struct page *page;
unsigned n;
void *results[1];
@@ -487,10 +529,12 @@ static void fscache_write_op(struct fscache_operation *_op)

_enter("{OP%x,%d}", op->op.debug_id, atomic_read(&op->op.usage));

+ spin_lock(&cookie->lock);
spin_lock(&object->lock);

if (!fscache_object_is_active(object)) {
spin_unlock(&object->lock);
+ spin_unlock(&cookie->lock);
_leave("");
return;
}
@@ -499,23 +543,24 @@ static void fscache_write_op(struct fscache_operation *_op)

/* find a page to store */
page = NULL;
- n = radix_tree_gang_lookup(&object->stores, results, 0, 1);
- if (n == 1) {
- page = results[0];
- _debug("gang %d [%lx]", n, page->index);
- if (page->index <= op->store_limit)
- radix_tree_delete(&object->stores, page->index);
- else
- goto superseded;
- } else {
+ n = radix_tree_gang_lookup_tag(&cookie->stores, results, 0, 1,
+ FSCACHE_COOKIE_PENDING_TAG);
+ if (n != 1)
+ goto superseded;
+ page = results[0];
+ _debug("gang %d [%lx]", n, page->index);
+ if (page->index > op->store_limit)
goto superseded;
- }
+
+ radix_tree_tag_clear(&cookie->stores, page->index,
+ FSCACHE_COOKIE_PENDING_TAG);

spin_unlock(&object->lock);
+ spin_unlock(&cookie->lock);

if (page) {
ret = object->cache->ops->write_page(op, page);
- end_page_fscache_write(page);
+ fscache_end_page_write(cookie, page);
page_cache_release(page);
if (ret < 0)
fscache_abort_object(object);
@@ -532,6 +577,7 @@ superseded:
_debug("cease");
clear_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags);
spin_unlock(&object->lock);
+ spin_unlock(&cookie->lock);
_leave("");
}

@@ -609,7 +655,7 @@ int __fscache_write_page(struct fscache_cookie *cookie,

_debug("store limit %llx", (unsigned long long) object->store_limit);

- ret = radix_tree_insert(&object->stores, page->index, page);
+ ret = radix_tree_insert(&cookie->stores, page->index, page);
if (ret < 0) {
if (ret == -EEXIST)
goto already_queued;
@@ -617,9 +663,9 @@ int __fscache_write_page(struct fscache_cookie *cookie,
goto nobufs_unlock_obj;
}

+ radix_tree_tag_set(&cookie->stores, page->index,
+ FSCACHE_COOKIE_PENDING_TAG);
page_cache_get(page);
- if (TestSetPageFsCacheWrite(page))
- BUG();

/* we only want one writer at a time, but we do need to queue new
* writers after exclusive ops */
@@ -656,8 +702,7 @@ already_pending:
return 0;

submit_failed:
- radix_tree_delete(&object->stores, page->index);
- end_page_fscache_write(page);
+ radix_tree_delete(&cookie->stores, page->index);
page_cache_release(page);
ret = -ENOBUFS;
goto nobufs;
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d3060c4..3523b89 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -456,11 +456,12 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
static int nfs_launder_page(struct page *page)
{
struct inode *inode = page->mapping->host;
+ struct nfs_inode *nfsi = NFS_I(inode);

dfprintk(PAGECACHE, "NFS: launder_page(%ld, %llu)\n",
inode->i_ino, (long long)page_offset(page));

- wait_on_page_fscache_write(page);
+ nfs_fscache_wait_on_page_write(nfsi, page);
return nfs_wb_page(inode, page);
}

@@ -498,7 +499,7 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
(long long)page_offset(page));

/* make sure the cache has finished storing the page */
- wait_on_page_fscache_write(page);
+ nfs_fscache_wait_on_page_write(NFS_I(dentry->d_inode), page);

lock_page(page);
mapping = page->mapping;
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index 968cf5d..379be67 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -337,21 +337,22 @@ void nfs_fscache_reset_inode_cookie(struct inode *inode)
*/
int nfs_fscache_release_page(struct page *page, gfp_t gfp)
{
- if (PageFsCacheWrite(page)) {
+ struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+ struct fscache_cookie *cookie = nfsi->fscache;
+
+ BUG_ON(!cookie);
+
+ if (fscache_check_page_write(cookie, page)) {
if (!(gfp & __GFP_WAIT))
return 0;
- wait_on_page_fscache_write(page);
+ fscache_wait_on_page_write(cookie, page);
}

if (PageFsCache(page)) {
- struct nfs_inode *nfsi = NFS_I(page->mapping->host);
-
- BUG_ON(!nfsi->fscache);
-
dfprintk(FSCACHE, "NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
- nfsi->fscache, page, nfsi);
+ cookie, page, nfsi);

- fscache_uncache_page(nfsi->fscache, page);
+ fscache_uncache_page(cookie, page);
nfs_add_fscache_stats(page->mapping->host,
NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
}
@@ -366,16 +367,17 @@ int nfs_fscache_release_page(struct page *page, gfp_t gfp)
void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
{
struct nfs_inode *nfsi = NFS_I(inode);
+ struct fscache_cookie *cookie = nfsi->fscache;

- BUG_ON(!nfsi->fscache);
+ BUG_ON(!cookie);

dfprintk(FSCACHE, "NFS: fscache invalidatepage (0x%p/0x%p/0x%p)\n",
- nfsi->fscache, page, nfsi);
+ cookie, page, nfsi);

- wait_on_page_fscache_write(page);
+ fscache_wait_on_page_write(cookie, page);

BUG_ON(!PageLocked(page));
- fscache_uncache_page(nfsi->fscache, page);
+ fscache_uncache_page(cookie, page);
nfs_add_fscache_stats(page->mapping->host,
NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 2d43b67..6e809bb 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -94,6 +94,16 @@ extern int __nfs_readpages_from_fscache(struct nfs_open_context *,
extern void __nfs_readpage_to_fscache(struct inode *, struct page *, int);

/*
+ * wait for a page to complete writing to the cache
+ */
+static inline void nfs_fscache_wait_on_page_write(struct nfs_inode *nfsi,
+ struct page *page)
+{
+ if (PageFsCache(page))
+ fscache_wait_on_page_write(nfsi->fscache, page);
+}
+
+/*
* release the caching state associated with a page if undergoing complete page
* invalidation
*/
@@ -181,6 +191,8 @@ static inline int nfs_fscache_release_page(struct page *page, gfp_t gfp)
}
static inline void nfs_fscache_invalidate_page(struct page *page,
struct inode *inode) {}
+static inline void nfs_fscache_wait_on_page_write(struct nfs_inode *nfsi,
+ struct page *page) {}

static inline int nfs_readpage_from_fscache(struct nfs_open_context *ctx,
struct inode *inode,
diff --git a/include/linux/fscache-cache.h b/include/linux/fscache-cache.h
index 0410bd9..84d3532 100644
--- a/include/linux/fscache-cache.h
+++ b/include/linux/fscache-cache.h
@@ -301,6 +301,9 @@ struct fscache_cookie {
const struct fscache_cookie_def *def; /* definition */
struct fscache_cookie *parent; /* parent of this entry */
void *netfs_data; /* back pointer to netfs */
+ struct radix_tree_root stores; /* pages to be stored on this cookie */
+#define FSCACHE_COOKIE_PENDING_TAG 0 /* pages tag: pending write to cache */
+
unsigned long flags;
#define FSCACHE_COOKIE_LOOKING_UP 0 /* T if non-index cookie being looked up still */
#define FSCACHE_COOKIE_CREATING 1 /* T if non-index object being created still */
@@ -370,7 +373,6 @@ struct fscache_object {
struct list_head dependents; /* FIFO of dependent objects */
struct list_head dep_link; /* link in parent's dependents list */
struct list_head pending_ops; /* unstarted operations on this object */
- struct radix_tree_root stores; /* data to be stored */
pgoff_t store_limit; /* current storage limit */
};

@@ -407,7 +409,6 @@ void fscache_object_init(struct fscache_object *object,
INIT_LIST_HEAD(&object->dependents);
INIT_LIST_HEAD(&object->dep_link);
INIT_LIST_HEAD(&object->pending_ops);
- INIT_RADIX_TREE(&object->stores, GFP_NOFS);
object->n_children = 0;
object->n_ops = object->n_in_progress = object->n_exclusive = 0;
object->events = object->event_mask = 0;
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index 006c919..6d8ee46 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -42,20 +42,6 @@
#define TestSetPageFsCache(page) TestSetPagePrivate2((page))
#define TestClearPageFsCache(page) TestClearPagePrivate2((page))

-/*
- * overload PG_owner_priv_2 to give us PG_fscache_write - this is used to
- * indicate that a page is currently being written to a local disk cache
- */
-#define PageFsCacheWrite(page) PageOwnerPriv2((page))
-#define SetPageFsCacheWrite(page) SetPageOwnerPriv2((page))
-#define ClearPageFsCacheWrite(page) ClearPageOwnerPriv2((page))
-#define TestSetPageFsCacheWrite(page) TestSetPageOwnerPriv2((page))
-#define TestClearPageFsCacheWrite(page) TestClearPageOwnerPriv2((page))
-
-#define wait_on_page_fscache_write(page) wait_on_page_owner_priv_2((page))
-#define end_page_fscache_write(page) end_page_owner_priv_2((page))
-
-
/* pattern used to fill dead space in an index entry */
#define FSCACHE_INDEX_DEADFILL_PATTERN 0x79

@@ -214,6 +200,8 @@ extern int __fscache_read_or_alloc_pages(struct fscache_cookie *,
extern int __fscache_alloc_page(struct fscache_cookie *, struct page *, gfp_t);
extern int __fscache_write_page(struct fscache_cookie *, struct page *, gfp_t);
extern void __fscache_uncache_page(struct fscache_cookie *, struct page *);
+extern bool __fscache_check_page_write(struct fscache_cookie *, struct page *);
+extern void __fscache_wait_on_page_write(struct fscache_cookie *, struct page *);

/**
* fscache_register_netfs - Register a filesystem as desiring caching services
@@ -589,4 +577,42 @@ void fscache_uncache_page(struct fscache_cookie *cookie,
__fscache_uncache_page(cookie, page);
}

+/**
+ * fscache_check_page_write - Ask if a page is being writing to the cache
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page that is being cached.
+ *
+ * Ask the cache if a page is being written to the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+bool fscache_check_page_write(struct fscache_cookie *cookie,
+ struct page *page)
+{
+ if (fscache_cookie_valid(cookie))
+ return __fscache_check_page_write(cookie, page);
+ return false;
+}
+
+/**
+ * fscache_wait_on_page_write - Wait for a page to complete writing to the cache
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page that is being cached.
+ *
+ * Ask the cache to wake us up when a page is no longer being written to the
+ * cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_wait_on_page_write(struct fscache_cookie *cookie,
+ struct page *page)
+{
+ if (fscache_cookie_valid(cookie))
+ __fscache_wait_on_page_write(cookie, page);
+}
+
#endif /* _LINUX_FSCACHE_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 0633e06..62214c7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -79,7 +79,6 @@ enum pageflags {
PG_active,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
- PG_owner_priv_2, /* Owner use. fs may use in pagecache */
PG_arch_1,
PG_reserved,
PG_private, /* If pagecache, has fs-private data */
@@ -115,7 +114,6 @@ enum pageflags {
* when those inodes are being locally cached.
*/
PG_fscache = PG_private_2, /* page backed by cache */
- PG_fscache_write = PG_owner_priv_2, /* page being written to cache */

/* XEN */
PG_pinned = PG_owner_priv_1,
@@ -220,7 +218,6 @@ PAGEFLAG(Private, private) __SETPAGEFLAG(Private, private)
__CLEARPAGEFLAG(Private, private)
PAGEFLAG(Private2, private_2) TESTSCFLAG(Private2, private_2)
PAGEFLAG(OwnerPriv1, owner_priv_1) TESTCLEARFLAG(OwnerPriv1, owner_priv_1)
-PAGEFLAG(OwnerPriv2, owner_priv_2) TESTSCFLAG(OwnerPriv2, owner_priv_2)

/*
* Only test-and-set exist for PG_writeback. The unconditional operators are
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 135028e..fc8a67f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -379,22 +379,6 @@ static inline void wait_on_page_writeback(struct page *page)

extern void end_page_writeback(struct page *page);

-/**
- * wait_on_page_owner_priv_2 - Wait for PG_owner_priv_2 to become clear
- * @page: The page to monitor
- *
- * Wait for a PG_owner_priv_2 to become clear on the specified page. This is
- * also used to monitor PG_fscache_write (which is an alternate name for the
- * same bit).
- */
-static inline void wait_on_page_owner_priv_2(struct page *page)
-{
- if (PageOwnerPriv2(page))
- wait_on_page_bit(page, PG_owner_priv_2);
-}
-
-extern void end_page_owner_priv_2(struct page *page);
-
/*
* Add an arbitrary waiter to a page's wait queue
*/
diff --git a/mm/filemap.c b/mm/filemap.c
index 15dda3c..9eb70e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -621,23 +621,6 @@ void end_page_writeback(struct page *page)
EXPORT_SYMBOL(end_page_writeback);

/**
- * end_page_owner_priv_2 - Clear PG_owner_priv_2 and wake up any waiters
- * @page: the page
- *
- * Clear PG_owner_priv_2 and wake up any processes waiting for that event.
- * This is used to indicate - using PG_fscache_write (an alternate name for the
- * same bit) - that a page has finished being written to the local disk cache.
- */
-void end_page_owner_priv_2(struct page *page)
-{
- if (!TestClearPageOwnerPriv2(page))
- BUG();
- smp_mb__after_clear_bit();
- wake_up_page(page, PG_owner_priv_2);
-}
-EXPORT_SYMBOL(end_page_owner_priv_2);
-
-/**
* __lock_page - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
*
David Howells
2009-04-02 16:48:38 UTC
Permalink
I think I see the problem. I think you're misunderstanding what I mean by
netfs.

Note that when I say 'netfs' I don't necessarily _mean_ it has to be a network
fs. It could be NFS, AFS, CIFS, but it might also be ISO9660. Look at these
two diagrams (taken from the docs added by patch 07):

+---------+
| | +--------------+
| NFS |--+ | |
| | | +-->| CacheFS |
+---------+ | +----------+ | | /dev/hda5 |
| | | | +--------------+
+---------+ +-->| | |
| | | |--+
| AFS |----->| FS-Cache |
| | | |--+
+---------+ +-->| | |
| | | | +--------------+
+---------+ | +----------+ | | |
| | | +-->| CacheFiles |
| ISOFS |--+ | /var/cache |
| | +--------------+
+---------+

NFS, AFS and ISOFS in this diagram all fill the roll of 'netfs'. CacheFS
(cache on blockdev) and CacheFiles (cache on filesystem) fill the roll of
'backingfs'.

And:

+---------+
| |
| Server |
| |
+---------+
| NETWORK
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| +----------+
V | |
+---------+ | |
| | | |
| NFS |----->| FS-Cache |
| | | |--+
+---------+ | | | +--------------+ +--------------+
| | | | | | | |
V +----------+ +-->| CacheFiles |-->| Ext3 |
+---------+ | /var/cache | | /dev/sda6 |
| | +--------------+ +--------------+
| VFS | ^ ^
| | | |
+---------+ +--------------+ |
| KERNEL SPACE | |
~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
| USER SPACE | |
V | |
+---------+ +--------------+
| | | |
| Process | | cachefilesd |
| | | |
+---------+ +--------------+


FS-Cache tracks pages from the 'netfs' because it may want to keep metadata on
them (such as where in the backingfs the pages are to be stored). In many
ways, PG_fscache is equivalent of PG_mappedtodisk. You could think of it as
PG_mappedtocache if you like.

The procedure for the netfs to use the cache on a page is that it calls one
of:

fscache_read_or_alloc_page()
fscache_read_or_alloc_pages()
fscache_alloc_page()

to either read pages from the cache or reserve space in the cache into which
the page can be written.

Then, the netfs calls:

fscache_write_page()

to write pages back to the cache as often as it likes. This may only be
called if read/alloc has been called first, and is permitted, inside the
caching backend (eg CacheFiles), to cache metadata about the read/alloc.

Finally, to polish off, the netfs calls:

fscache_uncache_page()

to tell FS-Cache to release any metadata it might hold on that page.

As I've been saying, I want to be able to make iso9660 be a client of FS-Cache
- ie: fill the roll of a netfs.

David
David Howells
2009-04-02 17:06:40 UTC
Permalink
Post by David Howells
NFS, AFS and ISOFS in this diagram all fill the roll of 'netfs'.
Make that 'role', not 'roll'. Or 'r=F4le' if you prefer.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel=
" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rik van Riel
2009-04-02 18:23:34 UTC
Permalink
Post by David Howells
If PG_fscache is set, then things that checked for PG_private will now also
check for that. This includes things like truncation and page invalidation.
The function page_has_private() had been added to make the checks for both
PG_private and PG_private_2 at the same time.
Acked-by: Rik van Riel <***@redhat.com>
David Howells
2009-04-01 23:03:57 UTC
Permalink
Add the API for a generic facility (FS-Cache) by which filesystems (such as AFS
or NFS) may call on local caching capabilities without having to know anything
about how the cache works, or even if there is a cache:

+---------+
| | +--------------+
| NFS |--+ | |
| | | +-->| CacheFS |
+---------+ | +----------+ | | /dev/hda5 |
| | | | +--------------+
+---------+ +-->| | |
| | | |--+
| AFS |----->| FS-Cache |
| | | |--+
+---------+ +-->| | |
| | | | +--------------+
+---------+ | +----------+ | | |
| | | +-->| CacheFiles |
| ISOFS |--+ | /var/cache |
| | +--------------+
+---------+

General documentation and documentation of the netfs specific API are provided
in addition to the header files.

As this patch stands, it is possible to build a filesystem against the facility
and attempt to use it. All that will happen is that all requests will be
immediately denied as if no cache is present.

Further patches will implement the core of the facility. The facility will
transfer requests from networking filesystems to appropriate caches if
possible, or else gracefully deny them.

If this facility is disabled in the kernel configuration, then all its
operations will trivially reduce to nothing during compilation.


WHY NOT I_MAPPING?
==================

I have added my own API to implement caching rather than using i_mapping to do
this for a number of reasons. These have been discussed a lot on the LKML and
CacheFS mailing lists, but to summarise the basics:

(1) Most filesystems don't do hole reportage. Holes in files are treated as
blocks of zeros and can't be distinguished otherwise, making it difficult
to distinguish blocks that have been read from the network and cached from
those that haven't.

(2) The backing inode must be fully populated before being exposed to
userspace through the main inode because the VM/VFS goes directly to the
backing inode and does not interrogate the front inode's VM ops.

Therefore:

(a) The backing inode must fit entirely within the cache.

(b) All backed files currently open must fit entirely within the cache at
the same time.

(c) A working set of files in total larger than the cache may not be
cached.

(d) A file may not grow larger than the available space in the cache.

(e) A file that's open and cached, and remotely grows larger than the
cache is potentially stuffed.

(3) Writes go to the backing filesystem, and can only be transferred to the
network when the file is closed.

(4) There's no record of what changes have been made, so the whole file must
be written back.

(5) The pages belong to the backing filesystem, and all metadata associated
with that page are relevant only to the backing filesystem, and not
anything stacked atop it.



OVERVIEW
========

FS-Cache provides (or will provide) the following facilities:

(1) Caches can be added / removed at any time, even whilst in use.

(2) Adds a facility by which tags can be used to refer to caches, even if
they're not available yet.

(3) More than one cache can be used at once. Caches can be selected
explicitly by use of tags.

(4) The netfs is provided with an interface that allows either party to
withdraw caching facilities from a file (required for (1)).

(5) A netfs may annotate cache objects that belongs to it. This permits the
storage of coherency maintenance data.

(6) Cache objects will be pinnable and space reservations will be possible.

(7) The interface to the netfs returns as few errors as possible, preferring
rather to let the netfs remain oblivious.

(8) Cookies are used to represent indices, files and other objects to the
netfs. The simplest cookie is just a NULL pointer - indicating nothing
cached there.

(9) The netfs is allowed to propose - dynamically - any index hierarchy it
desires, though it must be aware that the index search function is
recursive, stack space is limited, and indices can only be children of
indices.

(10) Indices can be used to group files together to reduce key size and to make
group invalidation easier. The use of indices may make lookup quicker,
but that's cache dependent.

(11) Data I/O is effectively done directly to and from the netfs's pages. The
netfs indicates that page A is at index B of the data-file represented by
cookie C, and that it should be read or written. The cache backend may or
may not start I/O on that page, but if it does, a netfs callback will be
invoked to indicate completion. The I/O may be either synchronous or
asynchronous.

(12) Cookies can be "retired" upon release. At this point FS-Cache will mark
them as obsolete and the index hierarchy rooted at that point will get
recycled.

(13) The netfs provides a "match" function for index searches. In addition to
saying whether a match was made or not, this can also specify that an
entry should be updated or deleted.


FS-Cache maintains a virtual index tree in which all indices, files, objects
and pages are kept. Bits of this tree may actually reside in one or more
caches.

FSDEF
|
+------------------------------------+
| |
NFS AFS
| |
+--------------------------+ +-----------+
| | | |
homedir mirror afs.org redhat.com
| | |
+------------+ +---------------+ +----------+
| | | | | |
00001 00002 00007 00125 vol00001 vol00002
| | | | |
+---+---+ +-----+ +---+ +------+------+ +-----+----+
| | | | | | | | | | | | |
PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
| |
PG0 +-------+
| |
00001 00003
|
+---+---+
| | |
PG0 PG1 PG2

In the example above, two netfs's can be seen to be backed: NFS and AFS. These
have different index hierarchies:

(*) The NFS primary index will probably contain per-server indices. Each
server index is indexed by NFS file handles to get data file objects.
Each data file objects can have an array of pages, but may also have
further child objects, such as extended attributes and directory entries.
Extended attribute objects themselves have page-array contents.

(*) The AFS primary index contains per-cell indices. Each cell index contains
per-logical-volume indices. Each of volume index contains up to three
indices for the read-write, read-only and backup mirrors of those volumes.
Each of these contains vnode data file objects, each of which contains an
array of pages.

The very top index is the FS-Cache master index in which individual netfs's
have entries.

Any index object may reside in more than one cache, provided it only has index
children. Any index with non-index object children will be assumed to only
reside in one cache.


The FS-Cache overview can be found in:

Documentation/filesystems/caching/fscache.txt

The netfs API to FS-Cache can be found in:

Documentation/filesystems/caching/netfs-api.txt

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/filesystems/caching/fscache.txt | 330 +++++++++
Documentation/filesystems/caching/netfs-api.txt | 800 +++++++++++++++++++++++
include/linux/fscache.h | 528 +++++++++++++++
include/linux/pagemap.h | 4
mm/filemap.c | 2
5 files changed, 1663 insertions(+), 1 deletions(-)
create mode 100644 Documentation/filesystems/caching/fscache.txt
create mode 100644 Documentation/filesystems/caching/netfs-api.txt
create mode 100644 include/linux/fscache.h


diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
new file mode 100644
index 0000000..a759d91
--- /dev/null
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -0,0 +1,330 @@
+ ==========================
+ General Filesystem Caching
+ ==========================
+
+========
+OVERVIEW
+========
+
+This facility is a general purpose cache for network filesystems, though it
+could be used for caching other things such as ISO9660 filesystems too.
+
+FS-Cache mediates between cache backends (such as CacheFS) and network
+filesystems:
+
+ +---------+
+ | | +--------------+
+ | NFS |--+ | |
+ | | | +-->| CacheFS |
+ +---------+ | +----------+ | | /dev/hda5 |
+ | | | | +--------------+
+ +---------+ +-->| | |
+ | | | |--+
+ | AFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ +-->| | |
+ | | | | +--------------+
+ +---------+ | +----------+ | | |
+ | | | +-->| CacheFiles |
+ | ISOFS |--+ | /var/cache |
+ | | +--------------+
+ +---------+
+
+Or to look at it another way, FS-Cache is a module that provides a caching
+facility to a network filesystem such that the cache is transparent to the
+user:
+
+ +---------+
+ | |
+ | Server |
+ | |
+ +---------+
+ | NETWORK
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ |
+ | +----------+
+ V | |
+ +---------+ | |
+ | | | |
+ | NFS |----->| FS-Cache |
+ | | | |--+
+ +---------+ | | | +--------------+ +--------------+
+ | | | | | | | |
+ V +----------+ +-->| CacheFiles |-->| Ext3 |
+ +---------+ | /var/cache | | /dev/sda6 |
+ | | +--------------+ +--------------+
+ | VFS | ^ ^
+ | | | |
+ +---------+ +--------------+ |
+ | KERNEL SPACE | |
+ ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~
+ | USER SPACE | |
+ V | |
+ +---------+ +--------------+
+ | | | |
+ | Process | | cachefilesd |
+ | | | |
+ +---------+ +--------------+
+
+
+FS-Cache does not follow the idea of completely loading every netfs file
+opened in its entirety into a cache before permitting it to be accessed and
+then serving the pages out of that cache rather than the netfs inode because:
+
+ (1) It must be practical to operate without a cache.
+
+ (2) The size of any accessible file must not be limited to the size of the
+ cache.
+
+ (3) The combined size of all opened files (this includes mapped libraries)
+ must not be limited to the size of the cache.
+
+ (4) The user should not be forced to download an entire file just to do a
+ one-off access of a small portion of it (such as might be done with the
+ "file" program).
+
+It instead serves the cache out in PAGE_SIZE chunks as and when requested by
+the netfs('s) using it.
+
+
+FS-Cache provides the following facilities:
+
+ (1) More than one cache can be used at once. Caches can be selected
+ explicitly by use of tags.
+
+ (2) Caches can be added / removed at any time.
+
+ (3) The netfs is provided with an interface that allows either party to
+ withdraw caching facilities from a file (required for (2)).
+
+ (4) The interface to the netfs returns as few errors as possible, preferring
+ rather to let the netfs remain oblivious.
+
+ (5) Cookies are used to represent indices, files and other objects to the
+ netfs. The simplest cookie is just a NULL pointer - indicating nothing
+ cached there.
+
+ (6) The netfs is allowed to propose - dynamically - any index hierarchy it
+ desires, though it must be aware that the index search function is
+ recursive, stack space is limited, and indices can only be children of
+ indices.
+
+ (7) Data I/O is done direct to and from the netfs's pages. The netfs
+ indicates that page A is at index B of the data-file represented by cookie
+ C, and that it should be read or written. The cache backend may or may
+ not start I/O on that page, but if it does, a netfs callback will be
+ invoked to indicate completion. The I/O may be either synchronous or
+ asynchronous.
+
+ (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
+ them as obsolete and the index hierarchy rooted at that point will get
+ recycled.
+
+ (9) The netfs provides a "match" function for index searches. In addition to
+ saying whether a match was made or not, this can also specify that an
+ entry should be updated or deleted.
+
+(10) As much as possible is done asynchronously.
+
+
+FS-Cache maintains a virtual indexing tree in which all indices, files, objects
+and pages are kept. Bits of this tree may actually reside in one or more
+caches.
+
+ FSDEF
+ |
+ +------------------------------------+
+ | |
+ NFS AFS
+ | |
+ +--------------------------+ +-----------+
+ | | | |
+ homedir mirror afs.org redhat.com
+ | | |
+ +------------+ +---------------+ +----------+
+ | | | | | |
+ 00001 00002 00007 00125 vol00001 vol00002
+ | | | | |
+ +---+---+ +-----+ +---+ +------+------+ +-----+----+
+ | | | | | | | | | | | | |
+PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
+ | |
+ PG0 +-------+
+ | |
+ 00001 00003
+ |
+ +---+---+
+ | | |
+ PG0 PG1 PG2
+
+In the example above, you can see two netfs's being backed: NFS and AFS. These
+have different index hierarchies:
+
+ (*) The NFS primary index contains per-server indices. Each server index is
+ indexed by NFS file handles to get data file objects. Each data file
+ objects can have an array of pages, but may also have further child
+ objects, such as extended attributes and directory entries. Extended
+ attribute objects themselves have page-array contents.
+
+ (*) The AFS primary index contains per-cell indices. Each cell index contains
+ per-logical-volume indices. Each of volume index contains up to three
+ indices for the read-write, read-only and backup mirrors of those volumes.
+ Each of these contains vnode data file objects, each of which contains an
+ array of pages.
+
+The very top index is the FS-Cache master index in which individual netfs's
+have entries.
+
+Any index object may reside in more than one cache, provided it only has index
+children. Any index with non-index object children will be assumed to only
+reside in one cache.
+
+
+The netfs API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/netfs-api.txt
+
+The cache backend API to FS-Cache can be found in:
+
+ Documentation/filesystems/caching/backend-api.txt
+
+
+=======================
+STATISTICAL INFORMATION
+=======================
+
+If FS-Cache is compiled with the following options enabled:
+
+ CONFIG_FSCACHE_PROC=y (implied by the following two)
+ CONFIG_FSCACHE_STATS=y
+ CONFIG_FSCACHE_HISTOGRAM=y
+
+then it will gather certain statistics and display them through a number of
+proc files.
+
+ (*) /proc/fs/fscache/stats
+
+ This shows counts of a number of events that can happen in FS-Cache:
+
+ CLASS EVENT MEANING
+ ======= ======= =======================================================
+ Cookies idx=N Number of index cookies allocated
+ dat=N Number of data storage cookies allocated
+ spc=N Number of special cookies allocated
+ Objects alc=N Number of objects allocated
+ nal=N Number of object allocation failures
+ avl=N Number of objects that reached the available state
+ ded=N Number of objects that reached the dead state
+ ChkAux non=N Number of objects that didn't have a coherency check
+ ok=N Number of objects that passed a coherency check
+ upd=N Number of objects that needed a coherency data update
+ obs=N Number of objects that were declared obsolete
+ Pages mrk=N Number of pages marked as being cached
+ unc=N Number of uncache page requests seen
+ Acquire n=N Number of acquire cookie requests seen
+ nul=N Number of acq reqs given a NULL parent
+ noc=N Number of acq reqs rejected due to no cache available
+ ok=N Number of acq reqs succeeded
+ nbf=N Number of acq reqs rejected due to error
+ oom=N Number of acq reqs failed on ENOMEM
+ Lookups n=N Number of lookup calls made on cache backends
+ neg=N Number of negative lookups made
+ pos=N Number of positive lookups made
+ crt=N Number of objects created by lookup
+ Updates n=N Number of update cookie requests seen
+ nul=N Number of upd reqs given a NULL parent
+ run=N Number of upd reqs granted CPU time
+ Relinqs n=N Number of relinquish cookie requests seen
+ nul=N Number of rlq reqs given a NULL parent
+ wcr=N Number of rlq reqs waited on completion of creation
+ AttrChg n=N Number of attribute changed requests seen
+ ok=N Number of attr changed requests queued
+ nbf=N Number of attr changed rejected -ENOBUFS
+ oom=N Number of attr changed failed -ENOMEM
+ run=N Number of attr changed ops given CPU time
+ Allocs n=N Number of allocation requests seen
+ ok=N Number of successful alloc reqs
+ wt=N Number of alloc reqs that waited on lookup completion
+ nbf=N Number of alloc reqs rejected -ENOBUFS
+ ops=N Number of alloc reqs submitted
+ owt=N Number of alloc reqs waited for CPU time
+ Retrvls n=N Number of retrieval (read) requests seen
+ ok=N Number of successful retr reqs
+ wt=N Number of retr reqs that waited on lookup completion
+ nod=N Number of retr reqs returned -ENODATA
+ nbf=N Number of retr reqs rejected -ENOBUFS
+ int=N Number of retr reqs aborted -ERESTARTSYS
+ oom=N Number of retr reqs failed -ENOMEM
+ ops=N Number of retr reqs submitted
+ owt=N Number of retr reqs waited for CPU time
+ Stores n=N Number of storage (write) requests seen
+ ok=N Number of successful store reqs
+ agn=N Number of store reqs on a page already pending storage
+ nbf=N Number of store reqs rejected -ENOBUFS
+ oom=N Number of store reqs failed -ENOMEM
+ ops=N Number of store reqs submitted
+ run=N Number of store reqs granted CPU time
+ Ops pend=N Number of times async ops added to pending queues
+ run=N Number of times async ops given CPU time
+ enq=N Number of times async ops queued for processing
+ dfr=N Number of async ops queued for deferred release
+ rel=N Number of async ops released
+ gc=N Number of deferred-release async ops garbage collected
+
+
+ (*) /proc/fs/fscache/histogram
+
+ cat /proc/fs/fscache/histogram
+ +HZ +TIME OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
+ ===== ===== ========= ========= ========= ========= =========
+
+ This shows the breakdown of the number of times each amount of time
+ between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
+ columns are as follows:
+
+ COLUMN TIME MEASUREMENT
+ ======= =======================================================
+ OBJ INST Length of time to instantiate an object
+ OP RUNS Length of time a call to process an operation took
+ OBJ RUNS Length of time a call to process an object event took
+ RETRV DLY Time between an requesting a read and lookup completing
+ RETRIEVLS Time between beginning and end of a retrieval
+
+ Each row shows the number of events that took a particular range of times.
+ Each step is 1 jiffy in size. The +HZ column indicates the particular
+ jiffy range covered, and the +TIME field the equivalent number of seconds.
+
+
+=========
+DEBUGGING
+=========
+
+The FS-Cache facility can have runtime debugging enabled by adjusting the value
+in:
+
+ /sys/module/fscache/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+ BIT VALUE STREAM POINT
+ ======= ======= =============================== =======================
+ 0 1 Cache management Function entry trace
+ 1 2 Function exit trace
+ 2 4 General
+ 3 8 Cookie management Function entry trace
+ 4 16 Function exit trace
+ 5 32 General
+ 6 64 Page handling Function entry trace
+ 7 128 Function exit trace
+ 8 256 General
+ 9 512 Operation management Function entry trace
+ 10 1024 Function exit trace
+ 11 2048 General
+
+The appropriate set of values should be OR'd together and the result written to
+the control file. For example:
+
+ echo $((1|8|64)) >/sys/module/fscache/parameters/debug
+
+will turn on all function entry debugging.
+
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
new file mode 100644
index 0000000..da8f92f
--- /dev/null
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -0,0 +1,800 @@
+ ===============================
+ FS-CACHE NETWORK FILESYSTEM API
+ ===============================
+
+There's an API by which a network filesystem can make use of the FS-Cache
+facilities. This is based around a number of principles:
+
+ (1) Caches can store a number of different object types. There are two main
+ object types: indices and files. The first is a special type used by
+ FS-Cache to make finding objects faster and to make retiring of groups of
+ objects easier.
+
+ (2) Every index, file or other object is represented by a cookie. This cookie
+ may or may not have anything associated with it, but the netfs doesn't
+ need to care.
+
+ (3) Barring the top-level index (one entry per cached netfs), the index
+ hierarchy for each netfs is structured according the whim of the netfs.
+
+This API is declared in <linux/fscache.h>.
+
+This document contains the following sections:
+
+ (1) Network filesystem definition
+ (2) Index definition
+ (3) Object definition
+ (4) Network filesystem (un)registration
+ (5) Cache tag lookup
+ (6) Index registration
+ (7) Data file registration
+ (8) Miscellaneous object registration
+ (9) Setting the data file size
+ (10) Page alloc/read/write
+ (11) Page uncaching
+ (12) Index and data file update
+ (13) Miscellaneous cookie operations
+ (14) Cookie unregistration
+ (15) Index and data file invalidation
+ (16) FS-Cache specific page flags.
+
+
+=============================
+NETWORK FILESYSTEM DEFINITION
+=============================
+
+FS-Cache needs a description of the network filesystem. This is specified
+using a record of the following structure:
+
+ struct fscache_netfs {
+ uint32_t version;
+ const char *name;
+ struct fscache_cookie *primary_index;
+ ...
+ };
+
+This first two fields should be filled in before registration, and the third
+will be filled in by the registration function; any other fields should just be
+ignored and are for internal use only.
+
+The fields are:
+
+ (1) The name of the netfs (used as the key in the toplevel index).
+
+ (2) The version of the netfs (if the name matches but the version doesn't, the
+ entire in-cache hierarchy for this netfs will be scrapped and begun
+ afresh).
+
+ (3) The cookie representing the primary index will be allocated according to
+ another parameter passed into the registration function.
+
+For example, kAFS (linux/fs/afs/) uses the following definitions to describe
+itself:
+
+ struct fscache_netfs afs_cache_netfs = {
+ .version = 0,
+ .name = "afs",
+ };
+
+
+================
+INDEX DEFINITION
+================
+
+Indices are used for two purposes:
+
+ (1) To aid the finding of a file based on a series of keys (such as AFS's
+ "cell", "volume ID", "vnode ID").
+
+ (2) To make it easier to discard a subset of all the files cached based around
+ a particular key - for instance to mirror the removal of an AFS volume.
+
+However, since it's unlikely that any two netfs's are going to want to define
+their index hierarchies in quite the same way, FS-Cache tries to impose as few
+restraints as possible on how an index is structured and where it is placed in
+the tree. The netfs can even mix indices and data files at the same level, but
+it's not recommended.
+
+Each index entry consists of a key of indeterminate length plus some auxilliary
+data, also of indeterminate length.
+
+There are some limits on indices:
+
+ (1) Any index containing non-index objects should be restricted to a single
+ cache. Any such objects created within an index will be created in the
+ first cache only. The cache in which an index is created can be
+ controlled by cache tags (see below).
+
+ (2) The entry data must be atomically journallable, so it is limited to about
+ 400 bytes at present. At least 400 bytes will be available.
+
+ (3) The depth of the index tree should be judged with care as the search
+ function is recursive. Too many layers will run the kernel out of stack.
+
+
+=================
+OBJECT DEFINITION
+=================
+
+To define an object, a structure of the following type should be filled out:
+
+ struct fscache_cookie_def
+ {
+ uint8_t name[16];
+ uint8_t type;
+
+ struct fscache_cache_tag *(*select_cache)(
+ const void *parent_netfs_data,
+ const void *cookie_netfs_data);
+
+ uint16_t (*get_key)(const void *cookie_netfs_data,
+ void *buffer,
+ uint16_t bufmax);
+
+ void (*get_attr)(const void *cookie_netfs_data,
+ uint64_t *size);
+
+ uint16_t (*get_aux)(const void *cookie_netfs_data,
+ void *buffer,
+ uint16_t bufmax);
+
+ enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
+ const void *data,
+ uint16_t datalen);
+
+ void (*get_context)(void *cookie_netfs_data, void *context);
+
+ void (*put_context)(void *cookie_netfs_data, void *context);
+
+ void (*mark_pages_cached)(void *cookie_netfs_data,
+ struct address_space *mapping,
+ struct pagevec *cached_pvec);
+
+ void (*now_uncached)(void *cookie_netfs_data);
+ };
+
+This has the following fields:
+
+ (1) The type of the object [mandatory].
+
+ This is one of the following values:
+
+ (*) FSCACHE_COOKIE_TYPE_INDEX
+
+ This defines an index, which is a special FS-Cache type.
+
+ (*) FSCACHE_COOKIE_TYPE_DATAFILE
+
+ This defines an ordinary data file.
+
+ (*) Any other value between 2 and 255
+
+ This defines an extraordinary object such as an XATTR.
+
+ (2) The name of the object type (NUL terminated unless all 16 chars are used)
+ [optional].
+
+ (3) A function to select the cache in which to store an index [optional].
+
+ This function is invoked when an index needs to be instantiated in a cache
+ during the instantiation of a non-index object. Only the immediate index
+ parent for the non-index object will be queried. Any indices above that
+ in the hierarchy may be stored in multiple caches. This function does not
+ need to be supplied for any non-index object or any index that will only
+ have index children.
+
+ If this function is not supplied or if it returns NULL then the first
+ cache in the parent's list will be chosed, or failing that, the first
+ cache in the master list.
+
+ (4) A function to retrieve an object's key from the netfs [mandatory].
+
+ This function will be called with the netfs data that was passed to the
+ cookie acquisition function and the maximum length of key data that it may
+ provide. It should write the required key data into the given buffer and
+ return the quantity it wrote.
+
+ (5) A function to retrieve attribute data from the netfs [optional].
+
+ This function will be called with the netfs data that was passed to the
+ cookie acquisition function. It should return the size of the file if
+ this is a data file. The size may be used to govern how much cache must
+ be reserved for this file in the cache.
+
+ If the function is absent, a file size of 0 is assumed.
+
+ (6) A function to retrieve auxilliary data from the netfs [optional].
+
+ This function will be called with the netfs data that was passed to the
+ cookie acquisition function and the maximum length of auxilliary data that
+ it may provide. It should write the auxilliary data into the given buffer
+ and return the quantity it wrote.
+
+ If this function is absent, the auxilliary data length will be set to 0.
+
+ The length of the auxilliary data buffer may be dependent on the key
+ length. A netfs mustn't rely on being able to provide more than 400 bytes
+ for both.
+
+ (7) A function to check the auxilliary data [optional].
+
+ This function will be called to check that a match found in the cache for
+ this object is valid. For instance with AFS it could check the auxilliary
+ data against the data version number returned by the server to determine
+ whether the index entry in a cache is still valid.
+
+ If this function is absent, it will be assumed that matching objects in a
+ cache are always valid.
+
+ If present, the function should return one of the following values:
+
+ (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is
+ (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update
+ (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted
+
+ This function can also be used to extract data from the auxilliary data in
+ the cache and copy it into the netfs's structures.
+
+ (8) A pair of functions to manage contexts for the completion callback
+ [optional].
+
+ The cache read/write functions are passed a context which is then passed
+ to the I/O completion callback function. To ensure this context remains
+ valid until after the I/O completion is called, two functions may be
+ provided: one to get an extra reference on the context, and one to drop a
+ reference to it.
+
+ If the context is not used or is a type of object that won't go out of
+ scope, then these functions are not required. These functions are not
+ required for indices as indices may not contain data. These functions may
+ be called in interrupt context and so may not sleep.
+
+ (9) A function to mark a page as retaining cache metadata [optional].
+
+ This is called by the cache to indicate that it is retaining in-memory
+ information for this page and that the netfs should uncache the page when
+ it has finished. This does not indicate whether there's data on the disk
+ or not. Note that several pages at once may be presented for marking.
+
+ The PG_fscache bit is set on the pages before this function would be
+ called, so the function need not be provided if this is sufficient.
+
+ This function is not required for indices as they're not permitted data.
+
+(10) A function to unmark all the pages retaining cache metadata [mandatory].
+
+ This is called by FS-Cache to indicate that a backing store is being
+ unbound from a cookie and that all the marks on the pages should be
+ cleared to prevent confusion. Note that the cache will have torn down all
+ its tracking information so that the pages don't need to be explicitly
+ uncached.
+
+ This function is not required for indices as they're not permitted data.
+
+
+===================================
+NETWORK FILESYSTEM (UN)REGISTRATION
+===================================
+
+The first step is to declare the network filesystem to the cache. This also
+involves specifying the layout of the primary index (for AFS, this would be the
+"cell" level).
+
+The registration function is:
+
+ int fscache_register_netfs(struct fscache_netfs *netfs);
+
+It just takes a pointer to the netfs definition. It returns 0 or an error as
+appropriate.
+
+For kAFS, registration is done as follows:
+
+ ret = fscache_register_netfs(&afs_cache_netfs);
+
+The last step is, of course, unregistration:
+
+ void fscache_unregister_netfs(struct fscache_netfs *netfs);
+
+
+================
+CACHE TAG LOOKUP
+================
+
+FS-Cache permits the use of more than one cache. To permit particular index
+subtrees to be bound to particular caches, the second step is to look up cache
+representation tags. This step is optional; it can be left entirely up to
+FS-Cache as to which cache should be used. The problem with doing that is that
+FS-Cache will always pick the first cache that was registered.
+
+To get the representation for a named tag:
+
+ struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
+
+This takes a text string as the name and returns a representation of a tag. It
+will never return an error. It may return a dummy tag, however, if it runs out
+of memory; this will inhibit caching with this tag.
+
+Any representation so obtained must be released by passing it to this function:
+
+ void fscache_release_cache_tag(struct fscache_cache_tag *tag);
+
+The tag will be retrieved by FS-Cache when it calls the object definition
+operation select_cache().
+
+
+==================
+INDEX REGISTRATION
+==================
+
+The third step is to inform FS-Cache about part of an index hierarchy that can
+be used to locate files. This is done by requesting a cookie for each index in
+the path to the file:
+
+ struct fscache_cookie *
+ fscache_acquire_cookie(struct fscache_cookie *parent,
+ const struct fscache_object_def *def,
+ void *netfs_data);
+
+This function creates an index entry in the index represented by parent,
+filling in the index entry by calling the operations pointed to by def.
+
+Note that this function never returns an error - all errors are handled
+internally. It may, however, return NULL to indicate no cookie. It is quite
+acceptable to pass this token back to this function as the parent to another
+acquisition (or even to the relinquish cookie, read page and write page
+functions - see below).
+
+Note also that no indices are actually created in a cache until a non-index
+object needs to be created somewhere down the hierarchy. Furthermore, an index
+may be created in several different caches independently at different times.
+This is all handled transparently, and the netfs doesn't see any of it.
+
+For example, with AFS, a cell would be added to the primary index. This index
+entry would have a dependent inode containing a volume location index for the
+volume mappings within this cell:
+
+ cell->cache =
+ fscache_acquire_cookie(afs_cache_netfs.primary_index,
+ &afs_cell_cache_index_def,
+ cell);
+
+Then when a volume location was accessed, it would be entered into the cell's
+index and an inode would be allocated that acts as a volume type and hash chain
+combination:
+
+ vlocation->cache =
+ fscache_acquire_cookie(cell->cache,
+ &afs_vlocation_cache_index_def,
+ vlocation);
+
+And then a particular flavour of volume (R/O for example) could be added to
+that index, creating another index for vnodes (AFS inode equivalents):
+
+ volume->cache =
+ fscache_acquire_cookie(vlocation->cache,
+ &afs_volume_cache_index_def,
+ volume);
+
+
+======================
+DATA FILE REGISTRATION
+======================
+
+The fourth step is to request a data file be created in the cache. This is
+identical to index cookie acquisition. The only difference is that the type in
+the object definition should be something other than index type.
+
+ vnode->cache =
+ fscache_acquire_cookie(volume->cache,
+ &afs_vnode_cache_object_def,
+ vnode);
+
+
+=================================
+MISCELLANEOUS OBJECT REGISTRATION
+=================================
+
+An optional step is to request an object of miscellaneous type be created in
+the cache. This is almost identical to index cookie acquisition. The only
+difference is that the type in the object definition should be something other
+than index type. Whilst the parent object could be an index, it's more likely
+it would be some other type of object such as a data file.
+
+ xattr->cache =
+ fscache_acquire_cookie(vnode->cache,
+ &afs_xattr_cache_object_def,
+ xattr);
+
+Miscellaneous objects might be used to store extended attributes or directory
+entries for example.
+
+
+==========================
+SETTING THE DATA FILE SIZE
+==========================
+
+The fifth step is to set the physical attributes of the file, such as its size.
+This doesn't automatically reserve any space in the cache, but permits the
+cache to adjust its metadata for data tracking appropriately:
+
+ int fscache_attr_changed(struct fscache_cookie *cookie);
+
+The cache will return -ENOBUFS if there is no backing cache or if there is no
+space to allocate any extra metadata required in the cache. The attributes
+will be accessed with the get_attr() cookie definition operation.
+
+Note that attempts to read or write data pages in the cache over this size may
+be rebuffed with -ENOBUFS.
+
+This operation schedules an attribute adjustment to happen asynchronously at
+some point in the future, and as such, it may happen after the function returns
+to the caller. The attribute adjustment excludes read and write operations.
+
+
+=====================
+PAGE READ/ALLOC/WRITE
+=====================
+
+And the sixth step is to store and retrieve pages in the cache. There are
+three functions that are used to do this.
+
+Note:
+
+ (1) A page should not be re-read or re-allocated without uncaching it first.
+
+ (2) A read or allocated page must be uncached when the netfs page is released
+ from the pagecache.
+
+ (3) A page should only be written to the cache if previous read or allocated.
+
+This permits the cache to maintain its page tracking in proper order.
+
+
+PAGE READ
+---------
+
+Firstly, the netfs should ask FS-Cache to examine the caches and read the
+contents cached for a particular page of a particular file if present, or else
+allocate space to store the contents if not:
+
+ typedef
+ void (*fscache_rw_complete_t)(struct page *page,
+ void *context,
+ int error);
+
+ int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp);
+
+The cookie argument must specify a cookie for an object that isn't an index,
+the page specified will have the data loaded into it (and is also used to
+specify the page number), and the gfp argument is used to control how any
+memory allocations made are satisfied.
+
+If the cookie indicates the inode is not cached:
+
+ (1) The function will return -ENOBUFS.
+
+Else if there's a copy of the page resident in the cache:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) The function will submit a request to read the data from the cache's
+ backing device directly into the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the read is complete, end_io_func() will be invoked with:
+
+ (*) The netfs data supplied when the cookie was created.
+
+ (*) The page descriptor.
+
+ (*) The context argument passed to the above function. This will be
+ maintained with the get_context/put_context functions mentioned above.
+
+ (*) An argument that's 0 on success or negative for an error code.
+
+ If an error occurs, it should be assumed that the page contains no usable
+ data.
+
+ end_io_func() will be called in process context if the read is results in
+ an error, but it might be called in interrupt context if the read is
+ successful.
+
+Otherwise, if there's not a copy available in cache, but the cache may be able
+to store the page:
+
+ (1) The mark_pages_cached() cookie operation will be called on that page.
+
+ (2) A block may be reserved in the cache and attached to the object at the
+ appropriate place.
+
+ (3) The function will return -ENODATA.
+
+This function may also return -ENOMEM or -EINTR, in which case it won't have
+read any data from the cache.
+
+
+PAGE ALLOCATE
+-------------
+
+Alternatively, if there's not expected to be any data in the cache for a page
+because the file has been extended, a block can simply be allocated instead:
+
+ int fscache_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp);
+
+This is similar to the fscache_read_or_alloc_page() function, except that it
+never reads from the cache. It will return 0 if a block has been allocated,
+rather than -ENODATA as the other would. One or the other must be performed
+before writing to the cache.
+
+The mark_pages_cached() cookie operation will be called on the page if
+successful.
+
+
+PAGE WRITE
+----------
+
+Secondly, if the netfs changes the contents of the page (either due to an
+initial download or if a user performs a write), then the page should be
+written back to the cache:
+
+ int fscache_write_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp);
+
+The cookie argument must specify a data file cookie, the page specified should
+contain the data to be written (and is also used to specify the page number),
+and the gfp argument is used to control how any memory allocations made are
+satisfied.
+
+The page must have first been read or allocated successfully and must not have
+been uncached before writing is performed.
+
+If the cookie indicates the inode is not cached then:
+
+ (1) The function will return -ENOBUFS.
+
+Else if space can be allocated in the cache to hold this page:
+
+ (1) PG_fscache_write will be set on the page.
+
+ (2) The function will submit a request to write the data to cache's backing
+ device directly from the page specified.
+
+ (3) The function will return 0.
+
+ (4) When the write is complete PG_fscache_write is cleared on the page and
+ anyone waiting for that bit will be woken up.
+
+Else if there's no space available in the cache, -ENOBUFS will be returned. It
+is also possible for the PG_fscache_write bit to be cleared when no write took
+place if unforeseen circumstances arose (such as a disk error).
+
+Writing takes place asynchronously.
+
+
+MULTIPLE PAGE READ
+------------------
+
+A facility is provided to read several pages at once, as requested by the
+readpages() address space operation:
+
+ int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ struct list_head *pages,
+ int *nr_pages,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp);
+
+This works in a similar way to fscache_read_or_alloc_page(), except:
+
+ (1) Any page it can retrieve data for is removed from pages and nr_pages and
+ dispatched for reading to the disk. Reads of adjacent pages on disk may
+ be merged for greater efficiency.
+
+ (2) The mark_pages_cached() cookie operation will be called on several pages
+ at once if they're being read or allocated.
+
+ (3) If there was an general error, then that error will be returned.
+
+ Else if some pages couldn't be allocated or read, then -ENOBUFS will be
+ returned.
+
+ Else if some pages couldn't be read but were allocated, then -ENODATA will
+ be returned.
+
+ Otherwise, if all pages had reads dispatched, then 0 will be returned, the
+ list will be empty and *nr_pages will be 0.
+
+ (4) end_io_func will be called once for each page being read as the reads
+ complete. It will be called in process context if error != 0, but it may
+ be called in interrupt context if there is no error.
+
+Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
+some of the pages being read and some being allocated. Those pages will have
+been marked appropriately and will need uncaching.
+
+
+==============
+PAGE UNCACHING
+==============
+
+To uncache a page, this function should be called:
+
+ void fscache_uncache_page(struct fscache_cookie *cookie,
+ struct page *page);
+
+This function permits the cache to release any in-memory representation it
+might be holding for this netfs page. This function must be called once for
+each page on which the read or write page functions above have been called to
+make sure the cache's in-memory tracking information gets torn down.
+
+Note that pages can't be explicitly deleted from the a data file. The whole
+data file must be retired (see the relinquish cookie function below).
+
+Furthermore, note that this does not cancel the asynchronous read or write
+operation started by the read/alloc and write functions.
+
+
+==========================
+INDEX AND DATA FILE UPDATE
+==========================
+
+To request an update of the index data for an index or other object, the
+following function should be called:
+
+ void fscache_update_cookie(struct fscache_cookie *cookie);
+
+This function will refer back to the netfs_data pointer stored in the cookie by
+the acquisition function to obtain the data to write into each revised index
+entry. The update method in the parent index definition will be called to
+transfer the data.
+
+Note that partial updates may happen automatically at other times, such as when
+data blocks are added to a data file object.
+
+
+===============================
+MISCELLANEOUS COOKIE OPERATIONS
+===============================
+
+There are a number of operations that can be used to control cookies:
+
+ (*) Cookie pinning:
+
+ int fscache_pin_cookie(struct fscache_cookie *cookie);
+ void fscache_unpin_cookie(struct fscache_cookie *cookie);
+
+ These operations permit data cookies to be pinned into the cache and to
+ have the pinning removed. They are not permitted on index cookies.
+
+ The pinning function will return 0 if successful, -ENOBUFS in the cookie
+ isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
+ -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+ -EIO if there's any other problem.
+
+ (*) Data space reservation:
+
+ int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
+
+ This permits a netfs to request cache space be reserved to store up to the
+ given amount of a file. It is permitted to ask for more than the current
+ size of the file to allow for future file expansion.
+
+ If size is given as zero then the reservation will be cancelled.
+
+ The function will return 0 if successful, -ENOBUFS in the cookie isn't
+ backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
+ -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
+ -EIO if there's any other problem.
+
+ Note that this doesn't pin an object in a cache; it can still be culled to
+ make space if it's not in use.
+
+
+=====================
+COOKIE UNREGISTRATION
+=====================
+
+To get rid of a cookie, this function should be called.
+
+ void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+ int retire);
+
+If retire is non-zero, then the object will be marked for recycling, and all
+copies of it will be removed from all active caches in which it is present.
+Not only that but all child objects will also be retired.
+
+If retire is zero, then the object may be available again when next the
+acquisition function is called. Retirement here will overrule the pinning on a
+cookie.
+
+One very important note - relinquish must NOT be called for a cookie unless all
+the cookies for "child" indices, objects and pages have been relinquished
+first.
+
+
+================================
+INDEX AND DATA FILE INVALIDATION
+================================
+
+There is no direct way to invalidate an index subtree or a data file. To do
+this, the caller should relinquish and retire the cookie they have, and then
+acquire a new one.
+
+
+============================
+FS-CACHE SPECIFIC PAGE FLAGS
+============================
+
+FS-Cache makes use of two page flags, PG_private_2 and PG_owner_priv_2, for
+its own purpose. The first is given the alternative name PG_fscache and the
+second PG_fscache_write.
+
+FS-Cache uses these flags to keep track of two bits of information per cached
+netfs page:
+
+ (1) PG_fscache.
+
+ This indicates that the page is known by the cache, and that the cache
+ must be informed if the page is going to go away. It's an indication to
+ the netfs that the cache has an interest in this page, where an interest
+ may be a pointer to it, resources allocated or reserved for it, or I/O in
+ progress upon it.
+
+ The netfs can use this information in methods such as releasepage() to
+ determine whether it needs to uncache a page or update it.
+
+ Furthermore, if this bit is set, releasepage() and invalidatepage()
+ operations will be called on a page to get rid of it, even if PG_private
+ is not set. This allows caching to attempted on a page before
+ read_cache_pages() to be called after fscache_read_or_alloc_pages() as
+ the former will try and release pages it was given under certain
+ circumstances.
+
+ (2) PG_fscache_write.
+
+ This indicates that the page is being written to disk by the cache, and
+ that it cannot be released until completion. Ideally it shouldn't be
+ changed until completion either so as to maintain the known state of the
+ cache. This cannot be unified with PG_writeback as the page may be being
+ written to both the server and the cache at the same time or at different
+ times.
+
+ This can be used by the netfs to wait for a page to be written out to the
+ cache before, say, releasing or invalidating it, or before allowing
+ someone to modify it in page_mkwrite(), say.
+
+Neither of these two bits overlaps with such as PG_private. This means that
+FS-Cache can be used with a filesystem that uses the block buffering code.
+
+There are a number of operations defined on these two bits:
+
+ int PageFsCache(struct page *page);
+ void SetPageFsCache(struct page *page)
+ void ClearPageFsCache(struct page *page)
+ int TestSetPageFsCache(struct page *page)
+ int TestClearPageFsCache(struct page *page)
+
+ int PageFsCacheWrite(struct page *page)
+ void SetPageFsCacheWrite(struct page *page)
+ void ClearPageFsCacheWrite(struct page *page)
+ int TestSetPageFsCacheWrite(struct page *page)
+ int TestClearPageFsCacheWrite(struct page *page)
+
+These functions are bit test, bit set, bit clear, bit test and set and bit
+test and clear operations on PG_fscache and PG_fscache_write.
+
+ void wait_on_page_fscache_write(struct page *page)
+ void end_page_fscache_write(struct page *page)
+
+The first of these two functions waits uninterruptibly for PG_fscache_write to
+become clear, if it isn't already so. The second clears PG_fscache_write and
+wakes up anyone waiting for it.
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
new file mode 100644
index 0000000..ea0fd0f
--- /dev/null
+++ b/include/linux/fscache.h
@@ -0,0 +1,528 @@
+/* General filesystem caching interface
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * NOTE!!! See:
+ *
+ * Documentation/filesystems/caching/netfs-api.txt
+ *
+ * for a description of the network filesystem interface declared here.
+ */
+
+#ifndef _LINUX_FSCACHE_H
+#define _LINUX_FSCACHE_H
+
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/pagemap.h>
+#include <linux/pagevec.h>
+
+#if defined(CONFIG_FSCACHE) || defined(CONFIG_FSCACHE_MODULE)
+#define fscache_available() (1)
+#define fscache_cookie_valid(cookie) (cookie)
+#else
+#define fscache_available() (0)
+#define fscache_cookie_valid(cookie) (0)
+#endif
+
+
+/*
+ * overload PG_private_2 to give us PG_fscache - this is used to indicate that
+ * a page is currently backed by a local disk cache
+ */
+#define PageFsCache(page) PagePrivate2((page))
+#define SetPageFsCache(page) SetPagePrivate2((page))
+#define ClearPageFsCache(page) ClearPagePrivate2((page))
+#define TestSetPageFsCache(page) TestSetPagePrivate2((page))
+#define TestClearPageFsCache(page) TestClearPagePrivate2((page))
+
+/*
+ * overload PG_owner_priv_2 to give us PG_fscache_write - this is used to
+ * indicate that a page is currently being written to a local disk cache
+ */
+#define PageFsCacheWrite(page) PageOwnerPriv2((page))
+#define SetPageFsCacheWrite(page) SetPageOwnerPriv2((page))
+#define ClearPageFsCacheWrite(page) ClearPageOwnerPriv2((page))
+#define TestSetPageFsCacheWrite(page) TestSetPageOwnerPriv2((page))
+#define TestClearPageFsCacheWrite(page) TestClearPageOwnerPriv2((page))
+
+#define wait_on_page_fscache_write(page) wait_on_page_owner_priv_2((page))
+#define end_page_fscache_write(page) end_page_owner_priv_2((page))
+
+
+/* pattern used to fill dead space in an index entry */
+#define FSCACHE_INDEX_DEADFILL_PATTERN 0x79
+
+struct pagevec;
+struct fscache_cache_tag;
+struct fscache_cookie;
+struct fscache_netfs;
+
+typedef void (*fscache_rw_complete_t)(struct page *page,
+ void *context,
+ int error);
+
+/* result of index entry consultation */
+enum fscache_checkaux {
+ FSCACHE_CHECKAUX_OKAY, /* entry okay as is */
+ FSCACHE_CHECKAUX_NEEDS_UPDATE, /* entry requires update */
+ FSCACHE_CHECKAUX_OBSOLETE, /* entry requires deletion */
+};
+
+/*
+ * fscache cookie definition
+ */
+struct fscache_cookie_def {
+ /* name of cookie type */
+ char name[16];
+
+ /* cookie type */
+ uint8_t type;
+#define FSCACHE_COOKIE_TYPE_INDEX 0
+#define FSCACHE_COOKIE_TYPE_DATAFILE 1
+
+ /* select the cache into which to insert an entry in this index
+ * - optional
+ * - should return a cache identifier or NULL to cause the cache to be
+ * inherited from the parent if possible or the first cache picked
+ * for a non-index file if not
+ */
+ struct fscache_cache_tag *(*select_cache)(
+ const void *parent_netfs_data,
+ const void *cookie_netfs_data);
+
+ /* get an index key
+ * - should store the key data in the buffer
+ * - should return the amount of amount stored
+ * - not permitted to return an error
+ * - the netfs data from the cookie being used as the source is
+ * presented
+ */
+ uint16_t (*get_key)(const void *cookie_netfs_data,
+ void *buffer,
+ uint16_t bufmax);
+
+ /* get certain file attributes from the netfs data
+ * - this function can be absent for an index
+ * - not permitted to return an error
+ * - the netfs data from the cookie being used as the source is
+ * presented
+ */
+ void (*get_attr)(const void *cookie_netfs_data, uint64_t *size);
+
+ /* get the auxilliary data from netfs data
+ * - this function can be absent if the index carries no state data
+ * - should store the auxilliary data in the buffer
+ * - should return the amount of amount stored
+ * - not permitted to return an error
+ * - the netfs data from the cookie being used as the source is
+ * presented
+ */
+ uint16_t (*get_aux)(const void *cookie_netfs_data,
+ void *buffer,
+ uint16_t bufmax);
+
+ /* consult the netfs about the state of an object
+ * - this function can be absent if the index carries no state data
+ * - the netfs data from the cookie being used as the target is
+ * presented, as is the auxilliary data
+ */
+ enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
+ const void *data,
+ uint16_t datalen);
+
+ /* get an extra reference on a read context
+ * - this function can be absent if the completion function doesn't
+ * require a context
+ */
+ void (*get_context)(void *cookie_netfs_data, void *context);
+
+ /* release an extra reference on a read context
+ * - this function can be absent if the completion function doesn't
+ * require a context
+ */
+ void (*put_context)(void *cookie_netfs_data, void *context);
+
+ /* indicate pages that now have cache metadata retained
+ * - this function should mark the specified pages as now being cached
+ * - the pages will have been marked with PG_fscache before this is
+ * called, so this is optional
+ */
+ void (*mark_pages_cached)(void *cookie_netfs_data,
+ struct address_space *mapping,
+ struct pagevec *cached_pvec);
+
+ /* indicate the cookie is no longer cached
+ * - this function is called when the backing store currently caching
+ * a cookie is removed
+ * - the netfs should use this to clean up any markers indicating
+ * cached pages
+ * - this is mandatory for any object that may have data
+ */
+ void (*now_uncached)(void *cookie_netfs_data);
+};
+
+/*
+ * fscache cached network filesystem type
+ * - name, version and ops must be filled in before registration
+ * - all other fields will be set during registration
+ */
+struct fscache_netfs {
+ uint32_t version; /* indexing version */
+ const char *name; /* filesystem name */
+ struct fscache_cookie *primary_index;
+ struct list_head link; /* internal link */
+};
+
+/*
+ * slow-path functions for when there is actually caching available, and the
+ * netfs does actually have a valid token
+ * - these are not to be called directly
+ * - these are undefined symbols when FS-Cache is not configured and the
+ * optimiser takes care of not using them
+ */
+
+/**
+ * fscache_register_netfs - Register a filesystem as desiring caching services
+ * @netfs: The description of the filesystem
+ *
+ * Register a filesystem as desiring caching services if they're available.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_register_netfs(struct fscache_netfs *netfs)
+{
+ return 0;
+}
+
+/**
+ * fscache_unregister_netfs - Indicate that a filesystem no longer desires
+ * caching services
+ * @netfs: The description of the filesystem
+ *
+ * Indicate that a filesystem no longer desires caching services for the
+ * moment.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_unregister_netfs(struct fscache_netfs *netfs)
+{
+}
+
+/**
+ * fscache_lookup_cache_tag - Look up a cache tag
+ * @name: The name of the tag to search for
+ *
+ * Acquire a specific cache referral tag that can be used to select a specific
+ * cache in which to cache an index.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name)
+{
+ return NULL;
+}
+
+/**
+ * fscache_release_cache_tag - Release a cache tag
+ * @tag: The tag to release
+ *
+ * Release a reference to a cache referral tag previously looked up.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_release_cache_tag(struct fscache_cache_tag *tag)
+{
+}
+
+/**
+ * fscache_acquire_cookie - Acquire a cookie to represent a cache object
+ * @parent: The cookie that's to be the parent of this one
+ * @def: A description of the cache object, including callback operations
+ * @netfs_data: An arbitrary piece of data to be kept in the cookie to
+ * represent the cache object to the netfs
+ *
+ * This function is used to inform FS-Cache about part of an index hierarchy
+ * that can be used to locate files. This is done by requesting a cookie for
+ * each index in the path to the file.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+struct fscache_cookie *fscache_acquire_cookie(
+ struct fscache_cookie *parent,
+ const struct fscache_cookie_def *def,
+ void *netfs_data)
+{
+ return NULL;
+}
+
+/**
+ * fscache_relinquish_cookie - Return the cookie to the cache, maybe discarding
+ * it
+ * @cookie: The cookie being returned
+ * @retire: True if the cache object the cookie represents is to be discarded
+ *
+ * This function returns a cookie to the cache, forcibly discarding the
+ * associated cache object if retire is set to true.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
+{
+}
+
+/**
+ * fscache_update_cookie - Request that a cache object be updated
+ * @cookie: The cookie representing the cache object
+ *
+ * Request an update of the index data for the cache object associated with the
+ * cookie.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_update_cookie(struct fscache_cookie *cookie)
+{
+}
+
+/**
+ * fscache_pin_cookie - Pin a data-storage cache object in its cache
+ * @cookie: The cookie representing the cache object
+ *
+ * Permit data-storage cache objects to be pinned in the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_pin_cookie(struct fscache_cookie *cookie)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_pin_cookie - Unpin a data-storage cache object in its cache
+ * @cookie: The cookie representing the cache object
+ *
+ * Permit data-storage cache objects to be unpinned from the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_unpin_cookie(struct fscache_cookie *cookie)
+{
+}
+
+/**
+ * fscache_attr_changed - Notify cache that an object's attributes changed
+ * @cookie: The cookie representing the cache object
+ *
+ * Send a notification to the cache indicating that an object's attributes have
+ * changed. This includes the data size. These attributes will be obtained
+ * through the get_attr() cookie definition op.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_attr_changed(struct fscache_cookie *cookie)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_reserve_space - Reserve data space for a cached object
+ * @cookie: The cookie representing the cache object
+ * @i_size: The amount of space to be reserved
+ *
+ * Reserve an amount of space in the cache for the cache object attached to a
+ * cookie so that a write to that object within the space can always be
+ * honoured.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_read_or_alloc_page - Read a page from the cache or allocate a block
+ * in which to store it
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page to fill if possible
+ * @end_io_func: The callback to invoke when and if the page is filled
+ * @context: An arbitrary piece of data to pass on to end_io_func()
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Read a page from the cache, or if that's not possible make a potential
+ * one-block reservation in the cache into which the page may be stored once
+ * fetched from the server.
+ *
+ * If the page is not backed by the cache object, or if it there's some reason
+ * it can't be, -ENOBUFS will be returned and nothing more will be done for
+ * that page.
+ *
+ * Else, if that page is backed by the cache, a read will be initiated directly
+ * to the netfs's page and 0 will be returned by this function. The
+ * end_io_func() callback will be invoked when the operation terminates on a
+ * completion or failure. Note that the callback may be invoked before the
+ * return.
+ *
+ * Else, if the page is unbacked, -ENODATA is returned and a block may have
+ * been allocated in the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_read_or_alloc_pages - Read pages from the cache and/or allocate
+ * blocks in which to store them
+ * @cookie: The cookie representing the cache object
+ * @mapping: The netfs inode mapping to which the pages will be attached
+ * @pages: A list of potential netfs pages to be filled
+ * @end_io_func: The callback to invoke when and if each page is filled
+ * @context: An arbitrary piece of data to pass on to end_io_func()
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Read a set of pages from the cache, or if that's not possible, attempt to
+ * make a potential one-block reservation for each page in the cache into which
+ * that page may be stored once fetched from the server.
+ *
+ * If some pages are not backed by the cache object, or if it there's some
+ * reason they can't be, -ENOBUFS will be returned and nothing more will be
+ * done for that pages.
+ *
+ * Else, if some of the pages are backed by the cache, a read will be initiated
+ * directly to the netfs's page and 0 will be returned by this function. The
+ * end_io_func() callback will be invoked when the operation terminates on a
+ * completion or failure. Note that the callback may be invoked before the
+ * return.
+ *
+ * Else, if a page is unbacked, -ENODATA is returned and a block may have
+ * been allocated in the cache.
+ *
+ * Because the function may want to return all of -ENOBUFS, -ENODATA and 0 in
+ * regard to different pages, the return values are prioritised in that order.
+ * Any pages submitted for reading are removed from the pages list.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_alloc_page - Allocate a block in which to store a page
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page to allocate a page for
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Request Allocation a block in the cache in which to store a netfs page
+ * without retrieving any contents from the cache.
+ *
+ * If the page is not backed by a file then -ENOBUFS will be returned and
+ * nothing more will be done, and no reservation will be made.
+ *
+ * Else, a block will be allocated if one wasn't already, and 0 will be
+ * returned
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_write_page - Request storage of a page in the cache
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page to store
+ * @gfp: The conditions under which memory allocation should be made
+ *
+ * Request the contents of the netfs page be written into the cache. This
+ * request may be ignored if no cache block is currently allocated, in which
+ * case it will return -ENOBUFS.
+ *
+ * If a cache block was already allocated, a write will be initiated and 0 will
+ * be returned. The PG_fscache_write page bit is set immediately and will then
+ * be cleared at the completion of the write to indicate the success or failure
+ * of the operation. Note that the completion may happen before the return.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+int fscache_write_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp)
+{
+ return -ENOBUFS;
+}
+
+/**
+ * fscache_uncache_page - Indicate that caching is no longer required on a page
+ * @cookie: The cookie representing the cache object
+ * @page: The netfs page that was being cached.
+ *
+ * Tell the cache that we no longer want a page to be cached and that it should
+ * remove any knowledge of the netfs page it may have.
+ *
+ * Note that this cannot cancel any outstanding I/O operations between this
+ * page and the cache.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for a complete
+ * description.
+ */
+static inline
+void fscache_uncache_page(struct fscache_cookie *cookie,
+ struct page *page)
+{
+}
+
+#endif /* _LINUX_FSCACHE_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d15e21e..1ae38aa 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -383,7 +383,9 @@ extern void end_page_writeback(struct page *page);
* wait_on_page_owner_priv_2 - Wait for PG_owner_priv_2 to become clear
* @page: The page to monitor
*
- * Wait for a PG_owner_priv_2 to become clear on the specified page.
+ * Wait for a PG_owner_priv_2 to become clear on the specified page. This is
+ * also used to monitor PG_fscache_write (which is an alternate name for the
+ * same bit).
*/
static inline void wait_on_page_owner_priv_2(struct page *page)
{
diff --git a/mm/filemap.c b/mm/filemap.c
index f8ae118..3c55c1a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -607,6 +607,8 @@ EXPORT_SYMBOL(end_page_writeback);
* @page: the page
*
* Clear PG_owner_priv_2 and wake up any processes waiting for that event.
+ * This is used to indicate - using PG_fscache_write (an alternate name for the
+ * same bit) - that a page has finished being written to the local disk cache.
*/
void end_page_owner_priv_2(struct page *page)
{

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:08 UTC
Permalink
Add the main configuration option, allowing FS-Cache to be selected; the
module entry and exit functions and the debugging stuff used by these patches.

The two configuration options added are:

CONFIG_FSCACHE
CONFIG_FSCACHE_DEBUG

The first enables the facility, and the second makes the debugging statements
enableable through the "debug" module parameter. The value of this parameter
is a bitmask as described in:

Documentation/filesystems/caching/fscache.txt

The module can be loaded at this point, but all it will do at this point in
the patch series is to start up the slow work facility and shut it down again.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/Kconfig | 6 ++
fs/Makefile | 1
fs/fscache/Kconfig | 22 +++++++
fs/fscache/Makefile | 8 ++
fs/fscache/internal.h | 165 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/fscache/main.c | 75 ++++++++++++++++++++++
6 files changed, 277 insertions(+), 0 deletions(-)
create mode 100644 fs/fscache/Kconfig
create mode 100644 fs/fscache/Makefile
create mode 100644 fs/fscache/internal.h
create mode 100644 fs/fscache/main.c


diff --git a/fs/Kconfig b/fs/Kconfig
index cef8b18..3942df6 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -66,6 +66,12 @@ config GENERIC_ACL
bool
select FS_POSIX_ACL

+menu "Caches"
+
+source "fs/fscache/Kconfig"
+
+endmenu
+
if BLOCK
menu "CD-ROM/DVD Filesystems"

diff --git a/fs/Makefile b/fs/Makefile
index 6e82a30..24ce1fe 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -63,6 +63,7 @@ obj-$(CONFIG_PROFILING) += dcookies.o
obj-$(CONFIG_DLM) += dlm/

# Do not add any filesystems before this line
+obj-$(CONFIG_FSCACHE) += fscache/
obj-$(CONFIG_REISERFS_FS) += reiserfs/
obj-$(CONFIG_EXT3_FS) += ext3/ # Before ext2 so root fs can be ext3
obj-$(CONFIG_EXT2_FS) += ext2/
diff --git a/fs/fscache/Kconfig b/fs/fscache/Kconfig
new file mode 100644
index 0000000..7c7bccd
--- /dev/null
+++ b/fs/fscache/Kconfig
@@ -0,0 +1,22 @@
+
+config FSCACHE
+ tristate "General filesystem local caching manager"
+ depends on EXPERIMENTAL
+ select SLOW_WORK
+ help
+ This option enables a generic filesystem caching manager that can be
+ used by various network and other filesystems to cache data locally.
+ Different sorts of caches can be plugged in, depending on the
+ resources available.
+
+ See Documentation/filesystems/caching/fscache.txt for more information.
+
+config FSCACHE_DEBUG
+ bool "Debug FS-Cache"
+ depends on FSCACHE
+ help
+ This permits debugging to be dynamically enabled in the local caching
+ management module. If this is set, the debugging output may be
+ enabled by setting bits in /sys/modules/fscache/parameter/debug.
+
+ See Documentation/filesystems/caching/fscache.txt for more information.
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
new file mode 100644
index 0000000..f8038b8
--- /dev/null
+++ b/fs/fscache/Makefile
@@ -0,0 +1,8 @@
+#
+# Makefile for general filesystem caching code
+#
+
+fscache-y := \
+ main.o
+
+obj-$(CONFIG_FSCACHE) := fscache.o
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
new file mode 100644
index 0000000..95dc92d
--- /dev/null
+++ b/fs/fscache/internal.h
@@ -0,0 +1,165 @@
+/* Internal definitions for FS-Cache
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+/*
+ * Lock order, in the order in which multiple locks should be obtained:
+ * - fscache_addremove_sem
+ * - cookie->lock
+ * - cookie->parent->lock
+ * - cache->object_list_lock
+ * - object->lock
+ * - object->parent->lock
+ * - fscache_thread_lock
+ *
+ */
+
+#include <linux/fscache-cache.h>
+#include <linux/sched.h>
+
+#define FSCACHE_MIN_THREADS 4
+#define FSCACHE_MAX_THREADS 32
+
+/*
+ * fsc-main.c
+ */
+extern unsigned fscache_defer_lookup;
+extern unsigned fscache_defer_create;
+extern unsigned fscache_debug;
+extern struct kobject *fscache_root;
+
+/*****************************************************************************/
+/*
+ * debug tracing
+ */
+#define dbgprintk(FMT, ...) \
+ printk(KERN_DEBUG "[%-6.6s] "FMT"\n", current->comm, ##__VA_ARGS__)
+
+/* make sure we maintain the format strings, even when debugging is disabled */
+static inline __attribute__((format(printf, 1, 2)))
+void _dbprintk(const char *fmt, ...)
+{
+}
+
+#define kenter(FMT, ...) dbgprintk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define kleave(FMT, ...) dbgprintk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) dbgprintk(FMT, ##__VA_ARGS__)
+
+#define kjournal(FMT, ...) _dbprintk(FMT, ##__VA_ARGS__)
+
+#ifdef __KDEBUG
+#define _enter(FMT, ...) kenter(FMT, ##__VA_ARGS__)
+#define _leave(FMT, ...) kleave(FMT, ##__VA_ARGS__)
+#define _debug(FMT, ...) kdebug(FMT, ##__VA_ARGS__)
+
+#elif defined(CONFIG_FSCACHE_DEBUG)
+#define _enter(FMT, ...) \
+do { \
+ if (__do_kdebug(ENTER)) \
+ kenter(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#define _leave(FMT, ...) \
+do { \
+ if (__do_kdebug(LEAVE)) \
+ kleave(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#define _debug(FMT, ...) \
+do { \
+ if (__do_kdebug(DEBUG)) \
+ kdebug(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#else
+#define _enter(FMT, ...) _dbprintk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define _leave(FMT, ...) _dbprintk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define _debug(FMT, ...) _dbprintk(FMT, ##__VA_ARGS__)
+#endif
+
+/*
+ * determine whether a particular optional debugging point should be logged
+ * - we need to go through three steps to persuade cpp to correctly join the
+ * shorthand in FSCACHE_DEBUG_LEVEL with its prefix
+ */
+#define ____do_kdebug(LEVEL, POINT) \
+ unlikely((fscache_debug & \
+ (FSCACHE_POINT_##POINT << (FSCACHE_DEBUG_ ## LEVEL * 3))))
+#define ___do_kdebug(LEVEL, POINT) \
+ ____do_kdebug(LEVEL, POINT)
+#define __do_kdebug(POINT) \
+ ___do_kdebug(FSCACHE_DEBUG_LEVEL, POINT)
+
+#define FSCACHE_DEBUG_CACHE 0
+#define FSCACHE_DEBUG_COOKIE 1
+#define FSCACHE_DEBUG_PAGE 2
+#define FSCACHE_DEBUG_OPERATION 3
+
+#define FSCACHE_POINT_ENTER 1
+#define FSCACHE_POINT_LEAVE 2
+#define FSCACHE_POINT_DEBUG 4
+
+#ifndef FSCACHE_DEBUG_LEVEL
+#define FSCACHE_DEBUG_LEVEL CACHE
+#endif
+
+/*
+ * assertions
+ */
+#if 1 /* defined(__KDEBUGALL) */
+
+#define ASSERT(X) \
+do { \
+ if (unlikely(!(X))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "FS-Cache: Assertion failed\n"); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTCMP(X, OP, Y) \
+do { \
+ if (unlikely(!((X) OP (Y)))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "FS-Cache: Assertion failed\n"); \
+ printk(KERN_ERR "%lx " #OP " %lx is false\n", \
+ (unsigned long)(X), (unsigned long)(Y)); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTIF(C, X) \
+do { \
+ if (unlikely((C) && !(X))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "FS-Cache: Assertion failed\n"); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y) \
+do { \
+ if (unlikely((C) && !((X) OP (Y)))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "FS-Cache: Assertion failed\n"); \
+ printk(KERN_ERR "%lx " #OP " %lx is false\n", \
+ (unsigned long)(X), (unsigned long)(Y)); \
+ BUG(); \
+ } \
+} while (0)
+
+#else
+
+#define ASSERT(X) do {} while (0)
+#define ASSERTCMP(X, OP, Y) do {} while (0)
+#define ASSERTIF(C, X) do {} while (0)
+#define ASSERTIFCMP(C, X, OP, Y) do {} while (0)
+
+#endif /* assert or not */
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
new file mode 100644
index 0000000..76f7c69
--- /dev/null
+++ b/fs/fscache/main.c
@@ -0,0 +1,75 @@
+/* General filesystem local caching manager
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL CACHE
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+MODULE_DESCRIPTION("FS Cache Manager");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+unsigned fscache_defer_lookup = 1;
+module_param_named(defer_lookup, fscache_defer_lookup, uint,
+ S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_defer_lookup,
+ "Defer cookie lookup to background thread");
+
+unsigned fscache_defer_create = 1;
+module_param_named(defer_create, fscache_defer_create, uint,
+ S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_defer_create,
+ "Defer cookie creation to background thread");
+
+unsigned fscache_debug;
+module_param_named(debug, fscache_debug, uint,
+ S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(fscache_debug,
+ "FS-Cache debugging mask");
+
+struct kobject *fscache_root;
+
+/*
+ * initialise the fs caching module
+ */
+static int __init fscache_init(void)
+{
+ int ret;
+
+ ret = slow_work_register_user();
+ if (ret < 0)
+ goto error_slow_work;
+
+ printk(KERN_NOTICE "FS-Cache: Loaded\n");
+ return 0;
+
+error_slow_work:
+ return ret;
+}
+
+fs_initcall(fscache_init);
+
+/*
+ * clean up on module removal
+ */
+static void __exit fscache_exit(void)
+{
+ _enter("");
+
+ slow_work_unregister_user();
+ printk(KERN_NOTICE "FS-Cache: Unloaded\n");
+}
+
+module_exit(fscache_exit);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:28 UTC
Permalink
Implement the entry points by which a cache backend may initialise, add,
declare an error upon and withdraw a cache.

Further, an object is created in sysfs under which each cache added will get
an object created:

/sys/fs/fscache/<cachetag>/

All of this is described in Documentation/filesystems/caching/backend-api.txt
added by a previous patch.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/cache.c | 249 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fscache/main.c | 7 +
2 files changed, 256 insertions(+), 0 deletions(-)


diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
index 1a28df3..355172f 100644
--- a/fs/fscache/cache.c
+++ b/fs/fscache/cache.c
@@ -16,6 +16,8 @@

LIST_HEAD(fscache_cache_list);
DECLARE_RWSEM(fscache_addremove_sem);
+DECLARE_WAIT_QUEUE_HEAD(fscache_cache_cleared_wq);
+EXPORT_SYMBOL(fscache_cache_cleared_wq);

static LIST_HEAD(fscache_cache_tag_list);

@@ -164,3 +166,250 @@ no_preference:
_leave(" = %p [first]", cache);
return cache;
}
+
+/**
+ * fscache_init_cache - Initialise a cache record
+ * @cache: The cache record to be initialised
+ * @ops: The cache operations to be installed in that record
+ * @idfmt: Format string to define identifier
+ * @...: sprintf-style arguments
+ *
+ * Initialise a record of a cache and fill in the name.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+void fscache_init_cache(struct fscache_cache *cache,
+ const struct fscache_cache_ops *ops,
+ const char *idfmt,
+ ...)
+{
+ va_list va;
+
+ memset(cache, 0, sizeof(*cache));
+
+ cache->ops = ops;
+
+ va_start(va, idfmt);
+ vsnprintf(cache->identifier, sizeof(cache->identifier), idfmt, va);
+ va_end(va);
+
+ INIT_WORK(&cache->op_gc, NULL);
+ INIT_LIST_HEAD(&cache->link);
+ INIT_LIST_HEAD(&cache->object_list);
+ INIT_LIST_HEAD(&cache->op_gc_list);
+ spin_lock_init(&cache->object_list_lock);
+ spin_lock_init(&cache->op_gc_list_lock);
+}
+EXPORT_SYMBOL(fscache_init_cache);
+
+/**
+ * fscache_add_cache - Declare a cache as being open for business
+ * @cache: The record describing the cache
+ * @ifsdef: The record of the cache object describing the top-level index
+ * @tagname: The tag describing this cache
+ *
+ * Add a cache to the system, making it available for netfs's to use.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+int fscache_add_cache(struct fscache_cache *cache,
+ struct fscache_object *ifsdef,
+ const char *tagname)
+{
+ struct fscache_cache_tag *tag;
+
+ BUG_ON(!cache->ops);
+ BUG_ON(!ifsdef);
+
+ cache->flags = 0;
+ ifsdef->event_mask = ULONG_MAX & ~(1 << FSCACHE_OBJECT_EV_CLEARED);
+ ifsdef->state = FSCACHE_OBJECT_ACTIVE;
+
+ if (!tagname)
+ tagname = cache->identifier;
+
+ BUG_ON(!tagname[0]);
+
+ _enter("{%s.%s},,%s", cache->ops->name, cache->identifier, tagname);
+
+ /* we use the cache tag to uniquely identify caches */
+ tag = __fscache_lookup_cache_tag(tagname);
+ if (IS_ERR(tag))
+ goto nomem;
+
+ if (test_and_set_bit(FSCACHE_TAG_RESERVED, &tag->flags))
+ goto tag_in_use;
+
+ cache->kobj = kobject_create_and_add(tagname, fscache_root);
+ if (!cache->kobj)
+ goto error;
+
+ ifsdef->cookie = &fscache_fsdef_index;
+ ifsdef->cache = cache;
+ cache->fsdef = ifsdef;
+
+ down_write(&fscache_addremove_sem);
+
+ tag->cache = cache;
+ cache->tag = tag;
+
+ /* add the cache to the list */
+ list_add(&cache->link, &fscache_cache_list);
+
+ /* add the cache's netfs definition index object to the cache's
+ * list */
+ spin_lock(&cache->object_list_lock);
+ list_add_tail(&ifsdef->cache_link, &cache->object_list);
+ spin_unlock(&cache->object_list_lock);
+
+ /* add the cache's netfs definition index object to the top level index
+ * cookie as a known backing object */
+ spin_lock(&fscache_fsdef_index.lock);
+
+ hlist_add_head(&ifsdef->cookie_link,
+ &fscache_fsdef_index.backing_objects);
+
+ atomic_inc(&fscache_fsdef_index.usage);
+
+ /* done */
+ spin_unlock(&fscache_fsdef_index.lock);
+ up_write(&fscache_addremove_sem);
+
+ printk(KERN_NOTICE "FS-Cache: Cache \"%s\" added (type %s)\n",
+ cache->tag->name, cache->ops->name);
+ kobject_uevent(cache->kobj, KOBJ_ADD);
+
+ _leave(" = 0 [%s]", cache->identifier);
+ return 0;
+
+tag_in_use:
+ printk(KERN_ERR "FS-Cache: Cache tag '%s' already in use\n", tagname);
+ __fscache_release_cache_tag(tag);
+ _leave(" = -EXIST");
+ return -EEXIST;
+
+error:
+ __fscache_release_cache_tag(tag);
+ _leave(" = -EINVAL");
+ return -EINVAL;
+
+nomem:
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(fscache_add_cache);
+
+/**
+ * fscache_io_error - Note a cache I/O error
+ * @cache: The record describing the cache
+ *
+ * Note that an I/O error occurred in a cache and that it should no longer be
+ * used for anything. This also reports the error into the kernel log.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+void fscache_io_error(struct fscache_cache *cache)
+{
+ set_bit(FSCACHE_IOERROR, &cache->flags);
+
+ printk(KERN_ERR "FS-Cache: Cache %s stopped due to I/O error\n",
+ cache->ops->name);
+}
+EXPORT_SYMBOL(fscache_io_error);
+
+/*
+ * request withdrawal of all the objects in a cache
+ * - all the objects being withdrawn are moved onto the supplied list
+ */
+static void fscache_withdraw_all_objects(struct fscache_cache *cache,
+ struct list_head *dying_objects)
+{
+ struct fscache_object *object;
+
+ spin_lock(&cache->object_list_lock);
+
+ while (!list_empty(&cache->object_list)) {
+ object = list_entry(cache->object_list.next,
+ struct fscache_object, cache_link);
+ list_move_tail(&object->cache_link, dying_objects);
+
+ _debug("withdraw %p", object->cookie);
+
+ spin_lock(&object->lock);
+ spin_unlock(&cache->object_list_lock);
+ fscache_raise_event(object, FSCACHE_OBJECT_EV_WITHDRAW);
+ spin_unlock(&object->lock);
+
+ cond_resched();
+ spin_lock(&cache->object_list_lock);
+ }
+
+ spin_unlock(&cache->object_list_lock);
+}
+
+/**
+ * fscache_withdraw_cache - Withdraw a cache from the active service
+ * @cache: The record describing the cache
+ *
+ * Withdraw a cache from service, unbinding all its cache objects from the
+ * netfs cookies they're currently representing.
+ *
+ * See Documentation/filesystems/caching/backend-api.txt for a complete
+ * description.
+ */
+void fscache_withdraw_cache(struct fscache_cache *cache)
+{
+ LIST_HEAD(dying_objects);
+
+ _enter("");
+
+ printk(KERN_NOTICE "FS-Cache: Withdrawing cache \"%s\"\n",
+ cache->tag->name);
+
+ /* make the cache unavailable for cookie acquisition */
+ if (test_and_set_bit(FSCACHE_CACHE_WITHDRAWN, &cache->flags))
+ BUG();
+
+ down_write(&fscache_addremove_sem);
+ list_del_init(&cache->link);
+ cache->tag->cache = NULL;
+ up_write(&fscache_addremove_sem);
+
+ /* make sure all pages pinned by operations on behalf of the netfs are
+ * written to disk */
+ cache->ops->sync_cache(cache);
+
+ /* dissociate all the netfs pages backed by this cache from the block
+ * mappings in the cache */
+ cache->ops->dissociate_pages(cache);
+
+ /* we now have to destroy all the active objects pertaining to this
+ * cache - which we do by passing them off to thread pool to be
+ * disposed of */
+ _debug("destroy");
+
+ fscache_withdraw_all_objects(cache, &dying_objects);
+
+ /* wait for all extant objects to finish their outstanding operations
+ * and go away */
+ _debug("wait for finish");
+ wait_event(fscache_cache_cleared_wq,
+ atomic_read(&cache->object_count) == 0);
+ _debug("wait for clearance");
+ wait_event(fscache_cache_cleared_wq,
+ list_empty(&cache->object_list));
+ _debug("cleared");
+ ASSERT(list_empty(&dying_objects));
+
+ kobject_put(cache->kobj);
+
+ clear_bit(FSCACHE_TAG_RESERVED, &cache->tag->flags);
+ fscache_release_cache_tag(cache->tag);
+ cache->tag = NULL;
+
+ _leave("");
+}
+EXPORT_SYMBOL(fscache_withdraw_cache);
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 7c734b7..c2f3e63 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -56,9 +56,15 @@ static int __init fscache_init(void)
if (ret < 0)
goto error_proc;

+ fscache_root = kobject_create_and_add("fscache", kernel_kobj);
+ if (!fscache_root)
+ goto error_kobj;
+
printk(KERN_NOTICE "FS-Cache: Loaded\n");
return 0;

+error_kobj:
+ fscache_proc_cleanup();
error_proc:
slow_work_unregister_user();
error_slow_work:
@@ -74,6 +80,7 @@ static void __exit fscache_exit(void)
{
_enter("");

+ kobject_put(fscache_root);
fscache_proc_cleanup();
slow_work_unregister_user();
printk(KERN_NOTICE "FS-Cache: Unloaded\n");

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:49 UTC
Permalink
Implement the cache object management state machine.

The following documentation is added to illuminate the working of this state
machine. It will also be added as:

Documentation/filesystems/caching/object.txt

====================================================
IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
====================================================

==============
REPRESENTATION
==============

FS-Cache maintains an in-kernel representation of each object that a netfs is
currently interested in. Such objects are represented by the fscache_cookie
struct and are referred to as cookies.

FS-Cache also maintains a separate in-kernel representation of the objects that
a cache backend is currently actively caching. Such objects are represented by
the fscache_object struct. The cache backends allocate these upon request, and
are expected to embed them in their own representations. These are referred to
as objects.

There is a 1:N relationship between cookies and objects. A cookie may be
represented by multiple objects - an index may exist in more than one cache -
or even by no objects (it may not be cached).

Furthermore, both cookies and objects are hierarchical. The two hierarchies
correspond, but the cookies tree is a superset of the union of the object trees
of multiple caches:

NETFS INDEX TREE : CACHE 1 : CACHE 2
: :
: +-----------+ :
+----------->| IObject | :
+-----------+ | : +-----------+ :
| ICookie |-------+ : | :
+-----------+ | : | : +-----------+
| +------------------------------>| IObject |
| : | : +-----------+
| : V : |
| : +-----------+ : |
V +----------->| IObject | : |
+-----------+ | : +-----------+ : |
| ICookie |-------+ : | : V
+-----------+ | : | : +-----------+
| +------------------------------>| IObject |
+-----+-----+ : | : +-----------+
| | : | : |
V | : V : |
+-----------+ | : +-----------+ : |
| ICookie |------------------------->| IObject | : |
+-----------+ | : +-----------+ : |
| V : | : V
| +-----------+ : | : +-----------+
| | ICookie |-------------------------------->| IObject |
| +-----------+ : | : +-----------+
V | : V : |
+-----------+ | : +-----------+ : |
| DCookie |------------------------->| DObject | : |
+-----------+ | : +-----------+ : |
| : : |
+-------+-------+ : : |
| | : : |
V V : : V
+-----------+ +-----------+ : : +-----------+
| DCookie | | DCookie |------------------------>| DObject |
+-----------+ +-----------+ : : +-----------+
: :

In the above illustration, ICookie and IObject represent indices and DCookie
and DObject represent data storage objects. Indices may have representation in
multiple caches, but currently, non-index objects may not. Objects of any type
may also be entirely unrepresented.

As far as the netfs API goes, the netfs is only actually permitted to see
pointers to the cookies. The cookies themselves and any objects attached to
those cookies are hidden from it.


===============================
OBJECT MANAGEMENT STATE MACHINE
===============================

Within FS-Cache, each active object is managed by its own individual state
machine. The state for an object is kept in the fscache_object struct, in
object->state. A cookie may point to a set of objects that are in different
states.

Each state has an action associated with it that is invoked when the machine
wakes up in that state. There are four logical sets of states:

(1) Preparation: states that wait for the parent objects to become ready. The
representations are hierarchical, and it is expected that an object must
be created or accessed with respect to its parent object.

(2) Initialisation: states that perform lookups in the cache and validate
what's found and that create on disk any missing metadata.

(3) Normal running: states that allow netfs operations on objects to proceed
and that update the state of objects.

(4) Termination: states that detach objects from their netfs cookies, that
delete objects from disk, that handle disk and system errors and that free
up in-memory resources.


In most cases, transitioning between states is in response to signalled events.
When a state has finished processing, it will usually set the mask of events in
which it is interested (object->event_mask) and relinquish the worker thread.
Then when an event is raised (by calling fscache_raise_event()), if the event
is not masked, the object will be queued for processing (by calling
fscache_enqueue_object()).


PROVISION OF CPU TIME
---------------------

The work to be done by the various states is given CPU time by the threads of
the slow work facility (see Documentation/slow-work.txt). This is used in
preference to the workqueue facility because:

(1) Threads may be completely occupied for very long periods of time by a
particular work item. These state actions may be doing sequences of
synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
getxattr, truncate, unlink, rmdir, rename).

(2) Threads may do little actual work, but may rather spend a lot of time
sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
workqueues don't necessarily have the right numbers of threads.


LOCKING SIMPLIFICATION
----------------------

Because only one worker thread may be operating on any particular object's
state machine at once, this simplifies the locking, particularly with respect
to disconnecting the netfs's representation of a cache object (fscache_cookie)
from the cache backend's representation (fscache_object) - which may be
requested from either end.


=================
THE SET OF STATES
=================

The object state machine has a set of states that it can be in. There are
preparation states in which the object sets itself up and waits for its parent
object to transit to a state that allows access to its children:

(1) State FSCACHE_OBJECT_INIT.

Initialise the object and wait for the parent object to become active. In
the cache, it is expected that it will not be possible to look an object
up from the parent object, until that parent object itself has been looked
up.

There are initialisation states in which the object sets itself up and accesses
disk for the object metadata:

(2) State FSCACHE_OBJECT_LOOKING_UP.

Look up the object on disk, using the parent as a starting point.
FS-Cache expects the cache backend to probe the cache to see whether this
object is represented there, and if it is, to see if it's valid (coherency
management).

The cache should call fscache_object_lookup_negative() to indicate lookup
failure for whatever reason, and should call fscache_obtained_object() to
indicate success.

At the completion of lookup, FS-Cache will let the netfs go ahead with
read operations, no matter whether the file is yet cached. If not yet
cached, read operations will be immediately rejected with ENODATA until
the first known page is uncached - as to that point there can be no data
to be read out of the cache for that file that isn't currently also held
in the pagecache.

(3) State FSCACHE_OBJECT_CREATING.

Create an object on disk, using the parent as a starting point. This
happens if the lookup failed to find the object, or if the object's
coherency data indicated what's on disk is out of date. In this state,
FS-Cache expects the cache to create

The cache should call fscache_obtained_object() if creation completes
successfully, fscache_object_lookup_negative() otherwise.

At the completion of creation, FS-Cache will start processing write
operations the netfs has queued for an object. If creation failed, the
write ops will be transparently discarded, and nothing recorded in the
cache.

There are some normal running states in which the object spends its time
servicing netfs requests:

(4) State FSCACHE_OBJECT_AVAILABLE.

A transient state in which pending operations are started, child objects
are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
lookup data is freed.

(5) State FSCACHE_OBJECT_ACTIVE.

The normal running state. In this state, requests the netfs makes will be
passed on to the cache.

(6) State FSCACHE_OBJECT_UPDATING.

The state machine comes here to update the object in the cache from the
netfs's records. This involves updating the auxiliary data that is used
to maintain coherency.

And there are terminal states in which an object cleans itself up, deallocates
memory and potentially deletes stuff from disk:

(7) State FSCACHE_OBJECT_LC_DYING.

The object comes here if it is dying because of a lookup or creation
error. This would be due to a disk error or system error of some sort.
Temporary data is cleaned up, and the parent is released.

(8) State FSCACHE_OBJECT_DYING.

The object comes here if it is dying due to an error, because its parent
cookie has been relinquished by the netfs or because the cache is being
withdrawn.

Any child objects waiting on this one are given CPU time so that they too
can destroy themselves. This object waits for all its children to go away
before advancing to the next state.

(9) State FSCACHE_OBJECT_ABORT_INIT.

The object comes to this state if it was waiting on its parent in
FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
so that the parent may proceed from the FSCACHE_OBJECT_DYING state.

(10) State FSCACHE_OBJECT_RELEASING.
(11) State FSCACHE_OBJECT_RECYCLING.

The object comes to one of these two states when dying once it is rid of
all its children, if it is dying because the netfs relinquished its
cookie. In the first state, the cached data is expected to persist, and
in the second it will be deleted.

(12) State FSCACHE_OBJECT_WITHDRAWING.

The object transits to this state if the cache decides it wants to
withdraw the object from service, perhaps to make space, but also due to
error or just because the whole cache is being withdrawn.

(13) State FSCACHE_OBJECT_DEAD.

The object transits to this state when the in-memory object record is
ready to be deleted. The object processor shouldn't ever see an object in
this state.


THE SET OF EVENTS
-----------------

There are a number of events that can be raised to an object state machine:

(*) FSCACHE_OBJECT_EV_UPDATE

The netfs requested that an object be updated. The state machine will ask
the cache backend to update the object, and the cache backend will ask the
netfs for details of the change through its cookie definition ops.

(*) FSCACHE_OBJECT_EV_CLEARED

This is signalled in two circumstances:

(a) when an object's last child object is dropped and

(b) when the last operation outstanding on an object is completed.

This is used to proceed from the dying state.

(*) FSCACHE_OBJECT_EV_ERROR

This is signalled when an I/O error occurs during the processing of some
object.

(*) FSCACHE_OBJECT_EV_RELEASE
(*) FSCACHE_OBJECT_EV_RETIRE

These are signalled when the netfs relinquishes a cookie it was using.
The event selected depends on whether the netfs asks for the backing
object to be retired (deleted) or retained.

(*) FSCACHE_OBJECT_EV_WITHDRAW

This is signalled when the cache backend wants to withdraw an object.
This means that the object will have to be detached from the netfs's
cookie.

Because the withdrawing releasing/retiring events are all handled by the object
state machine, it doesn't matter if there's a collision with both ends trying
to sever the connection at the same time. The state machine can just pick
which one it wants to honour, and that effects the other.


Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/filesystems/caching/fscache.txt | 5
Documentation/filesystems/caching/object.txt | 313 ++++++++++
fs/fscache/Makefile | 3
fs/fscache/internal.h | 26 +
fs/fscache/object.c | 810 +++++++++++++++++++++++++
5 files changed, 1155 insertions(+), 2 deletions(-)
create mode 100644 Documentation/filesystems/caching/object.txt
create mode 100644 fs/fscache/object.c


diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt
index 0a751f3..9e94b94 100644
--- a/Documentation/filesystems/caching/fscache.txt
+++ b/Documentation/filesystems/caching/fscache.txt
@@ -188,6 +188,11 @@ The cache backend API to FS-Cache can be found in:

Documentation/filesystems/caching/backend-api.txt

+A description of the internal representations and object state machine can be
+found in:
+
+ Documentation/filesystems/caching/object.txt
+

=======================
STATISTICAL INFORMATION
diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt
new file mode 100644
index 0000000..e8b0a35
--- /dev/null
+++ b/Documentation/filesystems/caching/object.txt
@@ -0,0 +1,313 @@
+ ====================================================
+ IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT
+ ====================================================
+
+By: David Howells <***@redhat.com>
+
+Contents:
+
+ (*) Representation
+
+ (*) Object management state machine.
+
+ - Provision of cpu time.
+ - Locking simplification.
+
+ (*) The set of states.
+
+ (*) The set of events.
+
+
+==============
+REPRESENTATION
+==============
+
+FS-Cache maintains an in-kernel representation of each object that a netfs is
+currently interested in. Such objects are represented by the fscache_cookie
+struct and are referred to as cookies.
+
+FS-Cache also maintains a separate in-kernel representation of the objects that
+a cache backend is currently actively caching. Such objects are represented by
+the fscache_object struct. The cache backends allocate these upon request, and
+are expected to embed them in their own representations. These are referred to
+as objects.
+
+There is a 1:N relationship between cookies and objects. A cookie may be
+represented by multiple objects - an index may exist in more than one cache -
+or even by no objects (it may not be cached).
+
+Furthermore, both cookies and objects are hierarchical. The two hierarchies
+correspond, but the cookies tree is a superset of the union of the object trees
+of multiple caches:
+
+ NETFS INDEX TREE : CACHE 1 : CACHE 2
+ : :
+ : +-----------+ :
+ +----------->| IObject | :
+ +-----------+ | : +-----------+ :
+ | ICookie |-------+ : | :
+ +-----------+ | : | : +-----------+
+ | +------------------------------>| IObject |
+ | : | : +-----------+
+ | : V : |
+ | : +-----------+ : |
+ V +----------->| IObject | : |
+ +-----------+ | : +-----------+ : |
+ | ICookie |-------+ : | : V
+ +-----------+ | : | : +-----------+
+ | +------------------------------>| IObject |
+ +-----+-----+ : | : +-----------+
+ | | : | : |
+ V | : V : |
+ +-----------+ | : +-----------+ : |
+ | ICookie |------------------------->| IObject | : |
+ +-----------+ | : +-----------+ : |
+ | V : | : V
+ | +-----------+ : | : +-----------+
+ | | ICookie |-------------------------------->| IObject |
+ | +-----------+ : | : +-----------+
+ V | : V : |
+ +-----------+ | : +-----------+ : |
+ | DCookie |------------------------->| DObject | : |
+ +-----------+ | : +-----------+ : |
+ | : : |
+ +-------+-------+ : : |
+ | | : : |
+ V V : : V
+ +-----------+ +-----------+ : : +-----------+
+ | DCookie | | DCookie |------------------------>| DObject |
+ +-----------+ +-----------+ : : +-----------+
+ : :
+
+In the above illustration, ICookie and IObject represent indices and DCookie
+and DObject represent data storage objects. Indices may have representation in
+multiple caches, but currently, non-index objects may not. Objects of any type
+may also be entirely unrepresented.
+
+As far as the netfs API goes, the netfs is only actually permitted to see
+pointers to the cookies. The cookies themselves and any objects attached to
+those cookies are hidden from it.
+
+
+===============================
+OBJECT MANAGEMENT STATE MACHINE
+===============================
+
+Within FS-Cache, each active object is managed by its own individual state
+machine. The state for an object is kept in the fscache_object struct, in
+object->state. A cookie may point to a set of objects that are in different
+states.
+
+Each state has an action associated with it that is invoked when the machine
+wakes up in that state. There are four logical sets of states:
+
+ (1) Preparation: states that wait for the parent objects to become ready. The
+ representations are hierarchical, and it is expected that an object must
+ be created or accessed with respect to its parent object.
+
+ (2) Initialisation: states that perform lookups in the cache and validate
+ what's found and that create on disk any missing metadata.
+
+ (3) Normal running: states that allow netfs operations on objects to proceed
+ and that update the state of objects.
+
+ (4) Termination: states that detach objects from their netfs cookies, that
+ delete objects from disk, that handle disk and system errors and that free
+ up in-memory resources.
+
+
+In most cases, transitioning between states is in response to signalled events.
+When a state has finished processing, it will usually set the mask of events in
+which it is interested (object->event_mask) and relinquish the worker thread.
+Then when an event is raised (by calling fscache_raise_event()), if the event
+is not masked, the object will be queued for processing (by calling
+fscache_enqueue_object()).
+
+
+PROVISION OF CPU TIME
+---------------------
+
+The work to be done by the various states is given CPU time by the threads of
+the slow work facility (see Documentation/slow-work.txt). This is used in
+preference to the workqueue facility because:
+
+ (1) Threads may be completely occupied for very long periods of time by a
+ particular work item. These state actions may be doing sequences of
+ synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
+ getxattr, truncate, unlink, rmdir, rename).
+
+ (2) Threads may do little actual work, but may rather spend a lot of time
+ sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
+ workqueues don't necessarily have the right numbers of threads.
+
+
+LOCKING SIMPLIFICATION
+----------------------
+
+Because only one worker thread may be operating on any particular object's
+state machine at once, this simplifies the locking, particularly with respect
+to disconnecting the netfs's representation of a cache object (fscache_cookie)
+from the cache backend's representation (fscache_object) - which may be
+requested from either end.
+
+
+=================
+THE SET OF STATES
+=================
+
+The object state machine has a set of states that it can be in. There are
+preparation states in which the object sets itself up and waits for its parent
+object to transit to a state that allows access to its children:
+
+ (1) State FSCACHE_OBJECT_INIT.
+
+ Initialise the object and wait for the parent object to become active. In
+ the cache, it is expected that it will not be possible to look an object
+ up from the parent object, until that parent object itself has been looked
+ up.
+
+There are initialisation states in which the object sets itself up and accesses
+disk for the object metadata:
+
+ (2) State FSCACHE_OBJECT_LOOKING_UP.
+
+ Look up the object on disk, using the parent as a starting point.
+ FS-Cache expects the cache backend to probe the cache to see whether this
+ object is represented there, and if it is, to see if it's valid (coherency
+ management).
+
+ The cache should call fscache_object_lookup_negative() to indicate lookup
+ failure for whatever reason, and should call fscache_obtained_object() to
+ indicate success.
+
+ At the completion of lookup, FS-Cache will let the netfs go ahead with
+ read operations, no matter whether the file is yet cached. If not yet
+ cached, read operations will be immediately rejected with ENODATA until
+ the first known page is uncached - as to that point there can be no data
+ to be read out of the cache for that file that isn't currently also held
+ in the pagecache.
+
+ (3) State FSCACHE_OBJECT_CREATING.
+
+ Create an object on disk, using the parent as a starting point. This
+ happens if the lookup failed to find the object, or if the object's
+ coherency data indicated what's on disk is out of date. In this state,
+ FS-Cache expects the cache to create
+
+ The cache should call fscache_obtained_object() if creation completes
+ successfully, fscache_object_lookup_negative() otherwise.
+
+ At the completion of creation, FS-Cache will start processing write
+ operations the netfs has queued for an object. If creation failed, the
+ write ops will be transparently discarded, and nothing recorded in the
+ cache.
+
+There are some normal running states in which the object spends its time
+servicing netfs requests:
+
+ (4) State FSCACHE_OBJECT_AVAILABLE.
+
+ A transient state in which pending operations are started, child objects
+ are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
+ lookup data is freed.
+
+ (5) State FSCACHE_OBJECT_ACTIVE.
+
+ The normal running state. In this state, requests the netfs makes will be
+ passed on to the cache.
+
+ (6) State FSCACHE_OBJECT_UPDATING.
+
+ The state machine comes here to update the object in the cache from the
+ netfs's records. This involves updating the auxiliary data that is used
+ to maintain coherency.
+
+And there are terminal states in which an object cleans itself up, deallocates
+memory and potentially deletes stuff from disk:
+
+ (7) State FSCACHE_OBJECT_LC_DYING.
+
+ The object comes here if it is dying because of a lookup or creation
+ error. This would be due to a disk error or system error of some sort.
+ Temporary data is cleaned up, and the parent is released.
+
+ (8) State FSCACHE_OBJECT_DYING.
+
+ The object comes here if it is dying due to an error, because its parent
+ cookie has been relinquished by the netfs or because the cache is being
+ withdrawn.
+
+ Any child objects waiting on this one are given CPU time so that they too
+ can destroy themselves. This object waits for all its children to go away
+ before advancing to the next state.
+
+ (9) State FSCACHE_OBJECT_ABORT_INIT.
+
+ The object comes to this state if it was waiting on its parent in
+ FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
+ so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
+
+(10) State FSCACHE_OBJECT_RELEASING.
+(11) State FSCACHE_OBJECT_RECYCLING.
+
+ The object comes to one of these two states when dying once it is rid of
+ all its children, if it is dying because the netfs relinquished its
+ cookie. In the first state, the cached data is expected to persist, and
+ in the second it will be deleted.
+
+(12) State FSCACHE_OBJECT_WITHDRAWING.
+
+ The object transits to this state if the cache decides it wants to
+ withdraw the object from service, perhaps to make space, but also due to
+ error or just because the whole cache is being withdrawn.
+
+(13) State FSCACHE_OBJECT_DEAD.
+
+ The object transits to this state when the in-memory object record is
+ ready to be deleted. The object processor shouldn't ever see an object in
+ this state.
+
+
+THE SET OF EVENTS
+-----------------
+
+There are a number of events that can be raised to an object state machine:
+
+ (*) FSCACHE_OBJECT_EV_UPDATE
+
+ The netfs requested that an object be updated. The state machine will ask
+ the cache backend to update the object, and the cache backend will ask the
+ netfs for details of the change through its cookie definition ops.
+
+ (*) FSCACHE_OBJECT_EV_CLEARED
+
+ This is signalled in two circumstances:
+
+ (a) when an object's last child object is dropped and
+
+ (b) when the last operation outstanding on an object is completed.
+
+ This is used to proceed from the dying state.
+
+ (*) FSCACHE_OBJECT_EV_ERROR
+
+ This is signalled when an I/O error occurs during the processing of some
+ object.
+
+ (*) FSCACHE_OBJECT_EV_RELEASE
+ (*) FSCACHE_OBJECT_EV_RETIRE
+
+ These are signalled when the netfs relinquishes a cookie it was using.
+ The event selected depends on whether the netfs asks for the backing
+ object to be retired (deleted) or retained.
+
+ (*) FSCACHE_OBJECT_EV_WITHDRAW
+
+ This is signalled when the cache backend wants to withdraw an object.
+ This means that the object will have to be detached from the netfs's
+ cookie.
+
+Because the withdrawing releasing/retiring events are all handled by the object
+state machine, it doesn't matter if there's a collision with both ends trying
+to sever the connection at the same time. The state machine can just pick
+which one it wants to honour, and that effects the other.
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index ecf6946..4420ac6 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -7,7 +7,8 @@ fscache-y := \
cookie.o \
fsdef.o \
main.o \
- netfs.o
+ netfs.o \
+ object.o

fscache-$(CONFIG_PROC_FS) += proc.o
fscache-$(CONFIG_FSCACHE_STATS) += stats.o
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 1638994..529f4de 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -86,6 +86,18 @@ extern int fscache_wait_bit(void *);
extern int fscache_wait_bit_interruptible(void *);

/*
+ * fsc-object.c
+ */
+extern void fscache_withdrawing_object(struct fscache_cache *,
+ struct fscache_object *);
+extern void fscache_enqueue_object(struct fscache_object *);
+
+/*
+ * fsc-operation.c
+ */
+#define fscache_start_operations(obj) BUG()
+
+/*
* fsc-proc.c
*/
#ifdef CONFIG_PROC_FS
@@ -196,7 +208,19 @@ extern const struct file_operations fscache_stats_fops;
static inline void fscache_raise_event(struct fscache_object *object,
unsigned event)
{
- BUG(); // TODO
+ if (!test_and_set_bit(event, &object->events) &&
+ test_bit(event, &object->event_mask))
+ fscache_enqueue_object(object);
+}
+
+/*
+ * drop a reference to a cookie
+ */
+static inline void fscache_cookie_put(struct fscache_cookie *cookie)
+{
+ BUG_ON(atomic_read(&cookie->usage) <= 0);
+ if (atomic_dec_and_test(&cookie->usage))
+ __fscache_cookie_put(cookie);
}

/*****************************************************************************/
diff --git a/fs/fscache/object.c b/fs/fscache/object.c
new file mode 100644
index 0000000..392a41b
--- /dev/null
+++ b/fs/fscache/object.c
@@ -0,0 +1,810 @@
+/* FS-Cache object state machine handler
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * See Documentation/filesystems/caching/object.txt for a description of the
+ * object state machine and the in-kernel representations.
+ */
+
+#define FSCACHE_DEBUG_LEVEL COOKIE
+#include <linux/module.h>
+#include "internal.h"
+
+const char *fscache_object_states[] = {
+ [FSCACHE_OBJECT_INIT] = "OBJECT_INIT",
+ [FSCACHE_OBJECT_LOOKING_UP] = "OBJECT_LOOKING_UP",
+ [FSCACHE_OBJECT_CREATING] = "OBJECT_CREATING",
+ [FSCACHE_OBJECT_AVAILABLE] = "OBJECT_AVAILABLE",
+ [FSCACHE_OBJECT_ACTIVE] = "OBJECT_ACTIVE",
+ [FSCACHE_OBJECT_UPDATING] = "OBJECT_UPDATING",
+ [FSCACHE_OBJECT_DYING] = "OBJECT_DYING",
+ [FSCACHE_OBJECT_LC_DYING] = "OBJECT_LC_DYING",
+ [FSCACHE_OBJECT_ABORT_INIT] = "OBJECT_ABORT_INIT",
+ [FSCACHE_OBJECT_RELEASING] = "OBJECT_RELEASING",
+ [FSCACHE_OBJECT_RECYCLING] = "OBJECT_RECYCLING",
+ [FSCACHE_OBJECT_WITHDRAWING] = "OBJECT_WITHDRAWING",
+ [FSCACHE_OBJECT_DEAD] = "OBJECT_DEAD",
+};
+EXPORT_SYMBOL(fscache_object_states);
+
+static void fscache_object_slow_work_put_ref(struct slow_work *);
+static int fscache_object_slow_work_get_ref(struct slow_work *);
+static void fscache_object_slow_work_execute(struct slow_work *);
+static void fscache_initialise_object(struct fscache_object *);
+static void fscache_lookup_object(struct fscache_object *);
+static void fscache_object_available(struct fscache_object *);
+static void fscache_release_object(struct fscache_object *);
+static void fscache_withdraw_object(struct fscache_object *);
+static void fscache_enqueue_dependents(struct fscache_object *);
+static void fscache_dequeue_object(struct fscache_object *);
+
+const struct slow_work_ops fscache_object_slow_work_ops = {
+ .get_ref = fscache_object_slow_work_get_ref,
+ .put_ref = fscache_object_slow_work_put_ref,
+ .execute = fscache_object_slow_work_execute,
+};
+EXPORT_SYMBOL(fscache_object_slow_work_ops);
+
+/*
+ * we need to notify the parent when an op completes that we had outstanding
+ * upon it
+ */
+static inline void fscache_done_parent_op(struct fscache_object *object)
+{
+ struct fscache_object *parent = object->parent;
+
+ _enter("OBJ%x {OBJ%x,%x}",
+ object->debug_id, parent->debug_id, parent->n_ops);
+
+ spin_lock_nested(&parent->lock, 1);
+ parent->n_ops--;
+ parent->n_obj_ops--;
+ if (parent->n_ops == 0)
+ fscache_raise_event(parent, FSCACHE_OBJECT_EV_CLEARED);
+ spin_unlock(&parent->lock);
+}
+
+/*
+ * process events that have been sent to an object's state machine
+ * - initiates parent lookup
+ * - does object lookup
+ * - does object creation
+ * - does object recycling and retirement
+ * - does object withdrawal
+ */
+static void fscache_object_state_machine(struct fscache_object *object)
+{
+ enum fscache_object_state new_state;
+
+ ASSERT(object != NULL);
+
+ _enter("{OBJ%x,%s,%lx}",
+ object->debug_id, fscache_object_states[object->state],
+ object->events);
+
+ switch (object->state) {
+ /* wait for the parent object to become ready */
+ case FSCACHE_OBJECT_INIT:
+ object->event_mask =
+ ULONG_MAX & ~(1 << FSCACHE_OBJECT_EV_CLEARED);
+ fscache_initialise_object(object);
+ goto done;
+
+ /* look up the object metadata on disk */
+ case FSCACHE_OBJECT_LOOKING_UP:
+ fscache_lookup_object(object);
+ goto lookup_transit;
+
+ /* create the object metadata on disk */
+ case FSCACHE_OBJECT_CREATING:
+ fscache_lookup_object(object);
+ goto lookup_transit;
+
+ /* handle an object becoming available; start pending
+ * operations and queue dependent operations for processing */
+ case FSCACHE_OBJECT_AVAILABLE:
+ fscache_object_available(object);
+ goto active_transit;
+
+ /* normal running state */
+ case FSCACHE_OBJECT_ACTIVE:
+ goto active_transit;
+
+ /* update the object metadata on disk */
+ case FSCACHE_OBJECT_UPDATING:
+ clear_bit(FSCACHE_OBJECT_EV_UPDATE, &object->events);
+ fscache_stat(&fscache_n_updates_run);
+ object->cache->ops->update_object(object);
+ goto active_transit;
+
+ /* handle an object dying during lookup or creation */
+ case FSCACHE_OBJECT_LC_DYING:
+ object->event_mask &= ~(1 << FSCACHE_OBJECT_EV_UPDATE);
+ object->cache->ops->lookup_complete(object);
+
+ spin_lock(&object->lock);
+ object->state = FSCACHE_OBJECT_DYING;
+ if (test_and_clear_bit(FSCACHE_COOKIE_CREATING,
+ &object->cookie->flags))
+ wake_up_bit(&object->cookie->flags,
+ FSCACHE_COOKIE_CREATING);
+ spin_unlock(&object->lock);
+
+ fscache_done_parent_op(object);
+
+ /* wait for completion of all active operations on this object
+ * and the death of all child objects of this object */
+ case FSCACHE_OBJECT_DYING:
+ dying:
+ clear_bit(FSCACHE_OBJECT_EV_CLEARED, &object->events);
+ spin_lock(&object->lock);
+ _debug("dying OBJ%x {%d,%d}",
+ object->debug_id, object->n_ops, object->n_children);
+ if (object->n_ops == 0 && object->n_children == 0) {
+ object->event_mask &=
+ ~(1 << FSCACHE_OBJECT_EV_CLEARED);
+ object->event_mask |=
+ (1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+ (1 << FSCACHE_OBJECT_EV_RETIRE) |
+ (1 << FSCACHE_OBJECT_EV_RELEASE) |
+ (1 << FSCACHE_OBJECT_EV_ERROR);
+ } else {
+ object->event_mask &=
+ ~((1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+ (1 << FSCACHE_OBJECT_EV_RETIRE) |
+ (1 << FSCACHE_OBJECT_EV_RELEASE) |
+ (1 << FSCACHE_OBJECT_EV_ERROR));
+ object->event_mask |=
+ 1 << FSCACHE_OBJECT_EV_CLEARED;
+ }
+ spin_unlock(&object->lock);
+ fscache_enqueue_dependents(object);
+ goto terminal_transit;
+
+ /* handle an abort during initialisation */
+ case FSCACHE_OBJECT_ABORT_INIT:
+ _debug("handle abort init %lx", object->events);
+ object->event_mask &= ~(1 << FSCACHE_OBJECT_EV_UPDATE);
+
+ spin_lock(&object->lock);
+ fscache_dequeue_object(object);
+
+ object->state = FSCACHE_OBJECT_DYING;
+ if (test_and_clear_bit(FSCACHE_COOKIE_CREATING,
+ &object->cookie->flags))
+ wake_up_bit(&object->cookie->flags,
+ FSCACHE_COOKIE_CREATING);
+ spin_unlock(&object->lock);
+ goto dying;
+
+ /* handle the netfs releasing an object and possibly marking it
+ * obsolete too */
+ case FSCACHE_OBJECT_RELEASING:
+ case FSCACHE_OBJECT_RECYCLING:
+ object->event_mask &=
+ ~((1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+ (1 << FSCACHE_OBJECT_EV_RETIRE) |
+ (1 << FSCACHE_OBJECT_EV_RELEASE) |
+ (1 << FSCACHE_OBJECT_EV_ERROR));
+ fscache_release_object(object);
+ spin_lock(&object->lock);
+ object->state = FSCACHE_OBJECT_DEAD;
+ spin_unlock(&object->lock);
+ fscache_stat(&fscache_n_object_dead);
+ goto terminal_transit;
+
+ /* handle the parent cache of this object being withdrawn from
+ * active service */
+ case FSCACHE_OBJECT_WITHDRAWING:
+ object->event_mask &=
+ ~((1 << FSCACHE_OBJECT_EV_WITHDRAW) |
+ (1 << FSCACHE_OBJECT_EV_RETIRE) |
+ (1 << FSCACHE_OBJECT_EV_RELEASE) |
+ (1 << FSCACHE_OBJECT_EV_ERROR));
+ fscache_withdraw_object(object);
+ spin_lock(&object->lock);
+ object->state = FSCACHE_OBJECT_DEAD;
+ spin_unlock(&object->lock);
+ fscache_stat(&fscache_n_object_dead);
+ goto terminal_transit;
+
+ /* complain about the object being woken up once it is
+ * deceased */
+ case FSCACHE_OBJECT_DEAD:
+ printk(KERN_ERR "FS-Cache:"
+ " Unexpected event in dead state %lx\n",
+ object->events & object->event_mask);
+ BUG();
+
+ default:
+ printk(KERN_ERR "FS-Cache: Unknown object state %u\n",
+ object->state);
+ BUG();
+ }
+
+ /* determine the transition from a lookup state */
+lookup_transit:
+ switch (fls(object->events & object->event_mask) - 1) {
+ case FSCACHE_OBJECT_EV_WITHDRAW:
+ case FSCACHE_OBJECT_EV_RETIRE:
+ case FSCACHE_OBJECT_EV_RELEASE:
+ case FSCACHE_OBJECT_EV_ERROR:
+ new_state = FSCACHE_OBJECT_LC_DYING;
+ goto change_state;
+ case FSCACHE_OBJECT_EV_REQUEUE:
+ goto done;
+ case -1:
+ goto done; /* sleep until event */
+ default:
+ goto unsupported_event;
+ }
+
+ /* determine the transition from an active state */
+active_transit:
+ switch (fls(object->events & object->event_mask) - 1) {
+ case FSCACHE_OBJECT_EV_WITHDRAW:
+ case FSCACHE_OBJECT_EV_RETIRE:
+ case FSCACHE_OBJECT_EV_RELEASE:
+ case FSCACHE_OBJECT_EV_ERROR:
+ new_state = FSCACHE_OBJECT_DYING;
+ goto change_state;
+ case FSCACHE_OBJECT_EV_UPDATE:
+ new_state = FSCACHE_OBJECT_UPDATING;
+ goto change_state;
+ case -1:
+ new_state = FSCACHE_OBJECT_ACTIVE;
+ goto change_state; /* sleep until event */
+ default:
+ goto unsupported_event;
+ }
+
+ /* determine the transition from a terminal state */
+terminal_transit:
+ switch (fls(object->events & object->event_mask) - 1) {
+ case FSCACHE_OBJECT_EV_WITHDRAW:
+ new_state = FSCACHE_OBJECT_WITHDRAWING;
+ goto change_state;
+ case FSCACHE_OBJECT_EV_RETIRE:
+ new_state = FSCACHE_OBJECT_RECYCLING;
+ goto change_state;
+ case FSCACHE_OBJECT_EV_RELEASE:
+ new_state = FSCACHE_OBJECT_RELEASING;
+ goto change_state;
+ case FSCACHE_OBJECT_EV_ERROR:
+ new_state = FSCACHE_OBJECT_WITHDRAWING;
+ goto change_state;
+ case FSCACHE_OBJECT_EV_CLEARED:
+ new_state = FSCACHE_OBJECT_DYING;
+ goto change_state;
+ case -1:
+ goto done; /* sleep until event */
+ default:
+ goto unsupported_event;
+ }
+
+change_state:
+ spin_lock(&object->lock);
+ object->state = new_state;
+ spin_unlock(&object->lock);
+
+done:
+ _leave(" [->%s]", fscache_object_states[object->state]);
+ return;
+
+unsupported_event:
+ printk(KERN_ERR "FS-Cache:"
+ " Unsupported event %lx [mask %lx] in state %s\n",
+ object->events, object->event_mask,
+ fscache_object_states[object->state]);
+ BUG();
+}
+
+/*
+ * execute an object
+ */
+static void fscache_object_slow_work_execute(struct slow_work *work)
+{
+ struct fscache_object *object =
+ container_of(work, struct fscache_object, work);
+ unsigned long start;
+
+ _enter("{OBJ%x}", object->debug_id);
+
+ clear_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+
+ start = jiffies;
+ fscache_object_state_machine(object);
+ fscache_hist(fscache_objs_histogram, start);
+ if (object->events & object->event_mask)
+ fscache_enqueue_object(object);
+}
+
+/*
+ * initialise an object
+ * - check the specified object's parent to see if we can make use of it
+ * immediately to do a creation
+ * - we may need to start the process of creating a parent and we need to wait
+ * for the parent's lookup and creation to complete if it's not there yet
+ * - an object's cookie is pinned until we clear FSCACHE_COOKIE_CREATING on the
+ * leaf-most cookies of the object and all its children
+ */
+static void fscache_initialise_object(struct fscache_object *object)
+{
+ struct fscache_object *parent;
+
+ _enter("");
+ ASSERT(object->cookie != NULL);
+ ASSERT(object->cookie->parent != NULL);
+ ASSERT(list_empty(&object->work.link));
+
+ if (object->events & ((1 << FSCACHE_OBJECT_EV_ERROR) |
+ (1 << FSCACHE_OBJECT_EV_RELEASE) |
+ (1 << FSCACHE_OBJECT_EV_RETIRE) |
+ (1 << FSCACHE_OBJECT_EV_WITHDRAW))) {
+ _debug("abort init %lx", object->events);
+ spin_lock(&object->lock);
+ object->state = FSCACHE_OBJECT_ABORT_INIT;
+ spin_unlock(&object->lock);
+ return;
+ }
+
+ spin_lock(&object->cookie->lock);
+ spin_lock_nested(&object->cookie->parent->lock, 1);
+
+ parent = object->parent;
+ if (!parent) {
+ _debug("no parent");
+ set_bit(FSCACHE_OBJECT_EV_WITHDRAW, &object->events);
+ } else {
+ spin_lock(&object->lock);
+ spin_lock_nested(&parent->lock, 1);
+ _debug("parent %s", fscache_object_states[parent->state]);
+
+ if (parent->state >= FSCACHE_OBJECT_DYING) {
+ _debug("bad parent");
+ set_bit(FSCACHE_OBJECT_EV_WITHDRAW, &object->events);
+ } else if (parent->state < FSCACHE_OBJECT_AVAILABLE) {
+ _debug("wait");
+
+ /* we may get woken up in this state by child objects
+ * binding on to us, so we need to make sure we don't
+ * add ourself to the list multiple times */
+ if (list_empty(&object->dep_link)) {
+ object->cache->ops->grab_object(object);
+ list_add(&object->dep_link,
+ &parent->dependents);
+
+ /* fscache_acquire_non_index_cookie() uses this
+ * to wake the chain up */
+ if (parent->state == FSCACHE_OBJECT_INIT)
+ fscache_enqueue_object(parent);
+ }
+ } else {
+ _debug("go");
+ parent->n_ops++;
+ parent->n_obj_ops++;
+ object->lookup_jif = jiffies;
+ object->state = FSCACHE_OBJECT_LOOKING_UP;
+ set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+ }
+
+ spin_unlock(&parent->lock);
+ spin_unlock(&object->lock);
+ }
+
+ spin_unlock(&object->cookie->parent->lock);
+ spin_unlock(&object->cookie->lock);
+ _leave("");
+}
+
+/*
+ * look an object up in the cache from which it was allocated
+ * - we hold an "access lock" on the parent object, so the parent object cannot
+ * be withdrawn by either party till we've finished
+ * - an object's cookie is pinned until we clear FSCACHE_COOKIE_CREATING on the
+ * leaf-most cookies of the object and all its children
+ */
+static void fscache_lookup_object(struct fscache_object *object)
+{
+ struct fscache_cookie *cookie = object->cookie;
+ struct fscache_object *parent;
+
+ _enter("");
+
+ parent = object->parent;
+ ASSERT(parent != NULL);
+ ASSERTCMP(parent->n_ops, >, 0);
+ ASSERTCMP(parent->n_obj_ops, >, 0);
+
+ /* make sure the parent is still available */
+ ASSERTCMP(parent->state, >=, FSCACHE_OBJECT_AVAILABLE);
+
+ if (parent->state >= FSCACHE_OBJECT_DYING ||
+ test_bit(FSCACHE_IOERROR, &object->cache->flags)) {
+ _debug("unavailable");
+ set_bit(FSCACHE_OBJECT_EV_WITHDRAW, &object->events);
+ _leave("");
+ return;
+ }
+
+ _debug("LOOKUP \"%s/%s\" in \"%s\"",
+ parent->cookie->def->name, cookie->def->name,
+ object->cache->tag->name);
+
+ fscache_stat(&fscache_n_object_lookups);
+ object->cache->ops->lookup_object(object);
+
+ if (test_bit(FSCACHE_OBJECT_EV_ERROR, &object->events))
+ set_bit(FSCACHE_COOKIE_UNAVAILABLE, &cookie->flags);
+
+ _leave("");
+}
+
+/**
+ * fscache_object_lookup_negative - Note negative cookie lookup
+ * @object: Object pointing to cookie to mark
+ *
+ * Note negative lookup, permitting those waiting to read data from an already
+ * existing backing object to continue as there's no data for them to read.
+ */
+void fscache_object_lookup_negative(struct fscache_object *object)
+{
+ struct fscache_cookie *cookie = object->cookie;
+
+ _enter("{OBJ%x,%s}",
+ object->debug_id, fscache_object_states[object->state]);
+
+ spin_lock(&object->lock);
+ if (object->state == FSCACHE_OBJECT_LOOKING_UP) {
+ fscache_stat(&fscache_n_object_lookups_negative);
+
+ /* transit here to allow write requests to begin stacking up
+ * and read requests to begin returning ENODATA */
+ object->state = FSCACHE_OBJECT_CREATING;
+ spin_unlock(&object->lock);
+
+ set_bit(FSCACHE_COOKIE_PENDING_FILL, &cookie->flags);
+ set_bit(FSCACHE_COOKIE_NO_DATA_YET, &cookie->flags);
+
+ _debug("wake up lookup %p", &cookie->flags);
+ smp_mb__before_clear_bit();
+ clear_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP);
+ set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+ } else {
+ ASSERTCMP(object->state, ==, FSCACHE_OBJECT_CREATING);
+ spin_unlock(&object->lock);
+ }
+
+ _leave("");
+}
+EXPORT_SYMBOL(fscache_object_lookup_negative);
+
+/**
+ * fscache_obtained_object - Note successful object lookup or creation
+ * @object: Object pointing to cookie to mark
+ *
+ * Note successful lookup and/or creation, permitting those waiting to write
+ * data to a backing object to continue.
+ *
+ * Note that after calling this, an object's cookie may be relinquished by the
+ * netfs, and so must be accessed with object lock held.
+ */
+void fscache_obtained_object(struct fscache_object *object)
+{
+ struct fscache_cookie *cookie = object->cookie;
+
+ _enter("{OBJ%x,%s}",
+ object->debug_id, fscache_object_states[object->state]);
+
+ /* if we were still looking up, then we must have a positive lookup
+ * result, in which case there may be data available */
+ spin_lock(&object->lock);
+ if (object->state == FSCACHE_OBJECT_LOOKING_UP) {
+ fscache_stat(&fscache_n_object_lookups_positive);
+
+ clear_bit(FSCACHE_COOKIE_NO_DATA_YET, &cookie->flags);
+
+ object->state = FSCACHE_OBJECT_AVAILABLE;
+ spin_unlock(&object->lock);
+
+ smp_mb__before_clear_bit();
+ clear_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP);
+ set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+ } else {
+ ASSERTCMP(object->state, ==, FSCACHE_OBJECT_CREATING);
+ fscache_stat(&fscache_n_object_created);
+
+ object->state = FSCACHE_OBJECT_AVAILABLE;
+ spin_unlock(&object->lock);
+ set_bit(FSCACHE_OBJECT_EV_REQUEUE, &object->events);
+ smp_wmb();
+ }
+
+ if (test_and_clear_bit(FSCACHE_COOKIE_CREATING, &cookie->flags))
+ wake_up_bit(&cookie->flags, FSCACHE_COOKIE_CREATING);
+
+ _leave("");
+}
+EXPORT_SYMBOL(fscache_obtained_object);
+
+/*
+ * handle an object that has just become available
+ */
+static void fscache_object_available(struct fscache_object *object)
+{
+ _enter("{OBJ%x}", object->debug_id);
+
+ spin_lock(&object->lock);
+
+ if (test_and_clear_bit(FSCACHE_COOKIE_CREATING, &object->cookie->flags))
+ wake_up_bit(&object->cookie->flags, FSCACHE_COOKIE_CREATING);
+
+ fscache_done_parent_op(object);
+ if (object->n_in_progress == 0) {
+ if (object->n_ops > 0) {
+ ASSERTCMP(object->n_ops, >=, object->n_obj_ops);
+ ASSERTIF(object->n_ops > object->n_obj_ops,
+ !list_empty(&object->pending_ops));
+ fscache_start_operations(object);
+ } else {
+ ASSERT(list_empty(&object->pending_ops));
+ }
+ }
+ spin_unlock(&object->lock);
+
+ object->cache->ops->lookup_complete(object);
+ fscache_enqueue_dependents(object);
+
+ fscache_hist(fscache_obj_instantiate_histogram, object->lookup_jif);
+ fscache_stat(&fscache_n_object_avail);
+
+ _leave("");
+}
+
+/*
+ * drop an object's attachments
+ */
+static void fscache_drop_object(struct fscache_object *object)
+{
+ struct fscache_object *parent = object->parent;
+ struct fscache_cache *cache = object->cache;
+
+ _enter("{OBJ%x,%d}", object->debug_id, object->n_children);
+
+ spin_lock(&cache->object_list_lock);
+ list_del_init(&object->cache_link);
+ spin_unlock(&cache->object_list_lock);
+
+ cache->ops->drop_object(object);
+
+ if (parent) {
+ _debug("release parent OBJ%x {%d}",
+ parent->debug_id, parent->n_children);
+
+ spin_lock(&parent->lock);
+ parent->n_children--;
+ if (parent->n_children == 0)
+ fscache_raise_event(parent, FSCACHE_OBJECT_EV_CLEARED);
+ spin_unlock(&parent->lock);
+ object->parent = NULL;
+ }
+
+ /* this just shifts the object release to the slow work processor */
+ object->cache->ops->put_object(object);
+
+ _leave("");
+}
+
+/*
+ * release or recycle an object that the netfs has discarded
+ */
+static void fscache_release_object(struct fscache_object *object)
+{
+ _enter("");
+
+ fscache_drop_object(object);
+}
+
+/*
+ * withdraw an object from active service
+ */
+static void fscache_withdraw_object(struct fscache_object *object)
+{
+ struct fscache_cookie *cookie;
+ bool detached;
+
+ _enter("");
+
+ spin_lock(&object->lock);
+ cookie = object->cookie;
+ if (cookie) {
+ /* need to get the cookie lock before the object lock, starting
+ * from the object pointer */
+ atomic_inc(&cookie->usage);
+ spin_unlock(&object->lock);
+
+ detached = false;
+ spin_lock(&cookie->lock);
+ spin_lock(&object->lock);
+
+ if (object->cookie == cookie) {
+ hlist_del_init(&object->cookie_link);
+ object->cookie = NULL;
+ detached = true;
+ }
+ spin_unlock(&cookie->lock);
+ fscache_cookie_put(cookie);
+ if (detached)
+ fscache_cookie_put(cookie);
+ }
+
+ spin_unlock(&object->lock);
+
+ fscache_drop_object(object);
+}
+
+/*
+ * withdraw an object from active service at the behest of the cache
+ * - need break the links to a cached object cookie
+ * - called under two situations:
+ * (1) recycler decides to reclaim an in-use object
+ * (2) a cache is unmounted
+ * - have to take care as the cookie can be being relinquished by the netfs
+ * simultaneously
+ * - the object is pinned by the caller holding a refcount on it
+ */
+void fscache_withdrawing_object(struct fscache_cache *cache,
+ struct fscache_object *object)
+{
+ bool enqueue = false;
+
+ _enter(",OBJ%x", object->debug_id);
+
+ spin_lock(&object->lock);
+ if (object->state < FSCACHE_OBJECT_WITHDRAWING) {
+ object->state = FSCACHE_OBJECT_WITHDRAWING;
+ enqueue = true;
+ }
+ spin_unlock(&object->lock);
+
+ if (enqueue)
+ fscache_enqueue_object(object);
+
+ _leave("");
+}
+
+/*
+ * allow the slow work item processor to get a ref on an object
+ */
+static int fscache_object_slow_work_get_ref(struct slow_work *work)
+{
+ struct fscache_object *object =
+ container_of(work, struct fscache_object, work);
+
+ return object->cache->ops->grab_object(object) ? 0 : -EAGAIN;
+}
+
+/*
+ * allow the slow work item processor to discard a ref on a work item
+ */
+static void fscache_object_slow_work_put_ref(struct slow_work *work)
+{
+ struct fscache_object *object =
+ container_of(work, struct fscache_object, work);
+
+ return object->cache->ops->put_object(object);
+}
+
+/*
+ * enqueue an object for metadata-type processing
+ */
+void fscache_enqueue_object(struct fscache_object *object)
+{
+ _enter("{OBJ%x}", object->debug_id);
+
+ slow_work_enqueue(&object->work);
+}
+
+/*
+ * enqueue the dependents of an object for metadata-type processing
+ * - the caller must hold the object's lock
+ * - this may cause an already locked object to wind up being processed again
+ */
+static void fscache_enqueue_dependents(struct fscache_object *object)
+{
+ struct fscache_object *dep;
+
+ _enter("{OBJ%x}", object->debug_id);
+
+ if (list_empty(&object->dependents))
+ return;
+
+ spin_lock(&object->lock);
+
+ while (!list_empty(&object->dependents)) {
+ dep = list_entry(object->dependents.next,
+ struct fscache_object, dep_link);
+ list_del_init(&dep->dep_link);
+
+
+ /* sort onto appropriate lists */
+ fscache_enqueue_object(dep);
+ dep->cache->ops->put_object(dep);
+
+ if (!list_empty(&object->dependents))
+ cond_resched_lock(&object->lock);
+ }
+
+ spin_unlock(&object->lock);
+}
+
+/*
+ * remove an object from whatever queue it's waiting on
+ * - the caller must hold object->lock
+ */
+void fscache_dequeue_object(struct fscache_object *object)
+{
+ _enter("{OBJ%x}", object->debug_id);
+
+ if (!list_empty(&object->dep_link)) {
+ spin_lock(&object->parent->lock);
+ list_del_init(&object->dep_link);
+ spin_unlock(&object->parent->lock);
+ }
+
+ _leave("");
+}
+
+/**
+ * fscache_check_aux - Ask the netfs whether an object on disk is still valid
+ * @object: The object to ask about
+ * @data: The auxiliary data for the object
+ * @datalen: The size of the auxiliary data
+ *
+ * This function consults the netfs about the coherency state of an object
+ */
+enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
+ const void *data, uint16_t datalen)
+{
+ enum fscache_checkaux result;
+
+ if (!object->cookie->def->check_aux) {
+ fscache_stat(&fscache_n_checkaux_none);
+ return FSCACHE_CHECKAUX_OKAY;
+ }
+
+ result = object->cookie->def->check_aux(object->cookie->netfs_data,
+ data, datalen);
+ switch (result) {
+ /* entry okay as is */
+ case FSCACHE_CHECKAUX_OKAY:
+ fscache_stat(&fscache_n_checkaux_okay);
+ break;
+
+ /* entry requires update */
+ case FSCACHE_CHECKAUX_NEEDS_UPDATE:
+ fscache_stat(&fscache_n_checkaux_update);
+ break;
+
+ /* entry requires deletion */
+ case FSCACHE_CHECKAUX_OBSOLETE:
+ fscache_stat(&fscache_n_checkaux_obsolete);
+ break;
+
+ default:
+ BUG();
+ }
+
+ return result;
+}
+EXPORT_SYMBOL(fscache_check_aux);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:23 UTC
Permalink
Implement two features of FS-Cache:

(1) The ability to request and release cache tags - names by which a cache may
be known to a netfs, and thus selected for use.

(2) An internal function by which a cache is selected by consulting the netfs,
if the netfs wishes to be consulted.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/Makefile | 1
fs/fscache/cache.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++
fs/fscache/internal.h | 20 ++++++
include/linux/fscache.h | 9 ++-
4 files changed, 195 insertions(+), 1 deletions(-)
create mode 100644 fs/fscache/cache.c


diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index bc1f3b9..556708b 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -3,6 +3,7 @@
#

fscache-y := \
+ cache.o \
fsdef.o \
main.o

diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
new file mode 100644
index 0000000..1a28df3
--- /dev/null
+++ b/fs/fscache/cache.c
@@ -0,0 +1,166 @@
+/* FS-Cache cache handling
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL CACHE
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+LIST_HEAD(fscache_cache_list);
+DECLARE_RWSEM(fscache_addremove_sem);
+
+static LIST_HEAD(fscache_cache_tag_list);
+
+/*
+ * look up a cache tag
+ */
+struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *name)
+{
+ struct fscache_cache_tag *tag, *xtag;
+
+ /* firstly check for the existence of the tag under read lock */
+ down_read(&fscache_addremove_sem);
+
+ list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+ if (strcmp(tag->name, name) == 0) {
+ atomic_inc(&tag->usage);
+ up_read(&fscache_addremove_sem);
+ return tag;
+ }
+ }
+
+ up_read(&fscache_addremove_sem);
+
+ /* the tag does not exist - create a candidate */
+ xtag = kzalloc(sizeof(*xtag) + strlen(name) + 1, GFP_KERNEL);
+ if (!xtag)
+ /* return a dummy tag if out of memory */
+ return ERR_PTR(-ENOMEM);
+
+ atomic_set(&xtag->usage, 1);
+ strcpy(xtag->name, name);
+
+ /* write lock, search again and add if still not present */
+ down_write(&fscache_addremove_sem);
+
+ list_for_each_entry(tag, &fscache_cache_tag_list, link) {
+ if (strcmp(tag->name, name) == 0) {
+ atomic_inc(&tag->usage);
+ up_write(&fscache_addremove_sem);
+ kfree(xtag);
+ return tag;
+ }
+ }
+
+ list_add_tail(&xtag->link, &fscache_cache_tag_list);
+ up_write(&fscache_addremove_sem);
+ return xtag;
+}
+
+/*
+ * release a reference to a cache tag
+ */
+void __fscache_release_cache_tag(struct fscache_cache_tag *tag)
+{
+ if (tag != ERR_PTR(-ENOMEM)) {
+ down_write(&fscache_addremove_sem);
+
+ if (atomic_dec_and_test(&tag->usage))
+ list_del_init(&tag->link);
+ else
+ tag = NULL;
+
+ up_write(&fscache_addremove_sem);
+
+ kfree(tag);
+ }
+}
+
+/*
+ * select a cache in which to store an object
+ * - the cache addremove semaphore must be at least read-locked by the caller
+ * - the object will never be an index
+ */
+struct fscache_cache *fscache_select_cache_for_object(
+ struct fscache_cookie *cookie)
+{
+ struct fscache_cache_tag *tag;
+ struct fscache_object *object;
+ struct fscache_cache *cache;
+
+ _enter("");
+
+ if (list_empty(&fscache_cache_list)) {
+ _leave(" = NULL [no cache]");
+ return NULL;
+ }
+
+ /* we check the parent to determine the cache to use */
+ spin_lock(&cookie->lock);
+
+ /* the first in the parent's backing list should be the preferred
+ * cache */
+ if (!hlist_empty(&cookie->backing_objects)) {
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ cache = object->cache;
+ if (object->state >= FSCACHE_OBJECT_DYING ||
+ test_bit(FSCACHE_IOERROR, &cache->flags))
+ cache = NULL;
+
+ spin_unlock(&cookie->lock);
+ _leave(" = %p [parent]", cache);
+ return cache;
+ }
+
+ /* the parent is unbacked */
+ if (cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX) {
+ /* cookie not an index and is unbacked */
+ spin_unlock(&cookie->lock);
+ _leave(" = NULL [cookie ub,ni]");
+ return NULL;
+ }
+
+ spin_unlock(&cookie->lock);
+
+ if (!cookie->def->select_cache)
+ goto no_preference;
+
+ /* ask the netfs for its preference */
+ tag = cookie->def->select_cache(cookie->parent->netfs_data,
+ cookie->netfs_data);
+ if (!tag)
+ goto no_preference;
+
+ if (tag == ERR_PTR(-ENOMEM)) {
+ _leave(" = NULL [nomem tag]");
+ return NULL;
+ }
+
+ if (!tag->cache) {
+ _leave(" = NULL [unbacked tag]");
+ return NULL;
+ }
+
+ if (test_bit(FSCACHE_IOERROR, &tag->cache->flags))
+ return NULL;
+
+ _leave(" = %p [specific]", tag->cache);
+ return tag->cache;
+
+no_preference:
+ /* netfs has no preference - just select first cache */
+ cache = list_entry(fscache_cache_list.next,
+ struct fscache_cache, link);
+ _leave(" = %p [first]", cache);
+ return cache;
+}
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 4113af8..0a2069a 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -28,6 +28,15 @@
#define FSCACHE_MAX_THREADS 32

/*
+ * fsc-cache.c
+ */
+extern struct list_head fscache_cache_list;
+extern struct rw_semaphore fscache_addremove_sem;
+
+extern struct fscache_cache *fscache_select_cache_for_object(
+ struct fscache_cookie *);
+
+/*
* fsc-fsdef.c
*/
extern struct fscache_cookie fscache_fsdef_index;
@@ -168,6 +177,17 @@ extern const struct file_operations fscache_stats_fops;
#define fscache_stat(stat) do {} while (0)
#endif

+/*
+ * raise an event on an object
+ * - if the event is not masked for that object, then the object is
+ * queued for attention by the thread pool.
+ */
+static inline void fscache_raise_event(struct fscache_object *object,
+ unsigned event)
+{
+ BUG(); // TODO
+}
+
/*****************************************************************************/
/*
* debug tracing
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index ea0fd0f..0b66726 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -187,6 +187,8 @@ struct fscache_netfs {
* - these are undefined symbols when FS-Cache is not configured and the
* optimiser takes care of not using them
*/
+extern struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *);
+extern void __fscache_release_cache_tag(struct fscache_cache_tag *);

/**
* fscache_register_netfs - Register a filesystem as desiring caching services
@@ -232,7 +234,10 @@ void fscache_unregister_netfs(struct fscache_netfs *netfs)
static inline
struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name)
{
- return NULL;
+ if (fscache_available())
+ return __fscache_lookup_cache_tag(name);
+ else
+ return NULL;
}

/**
@@ -247,6 +252,8 @@ struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name)
static inline
void fscache_release_cache_tag(struct fscache_cache_tag *tag)
{
+ if (fscache_available())
+ __fscache_release_cache_tag(tag);
}

/**

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:44 UTC
Permalink
Add helpers for use with wait_on_bit().

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/internal.h | 3 +++
fs/fscache/main.c | 20 ++++++++++++++++++++
2 files changed, 23 insertions(+), 0 deletions(-)


diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 4c6ba56..1638994 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -82,6 +82,9 @@ extern unsigned fscache_defer_create;
extern unsigned fscache_debug;
extern struct kobject *fscache_root;

+extern int fscache_wait_bit(void *);
+extern int fscache_wait_bit_interruptible(void *);
+
/*
* fsc-proc.c
*/
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index 48b79d2..4de41b5 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -102,3 +102,23 @@ static void __exit fscache_exit(void)
}

module_exit(fscache_exit);
+
+/*
+ * wait_on_bit() sleep function for uninterruptible waiting
+ */
+int fscache_wait_bit(void *flags)
+{
+ schedule();
+ return 0;
+}
+EXPORT_SYMBOL(fscache_wait_bit);
+
+/*
+ * wait_on_bit() sleep function for interruptible waiting
+ */
+int fscache_wait_bit_interruptible(void *flags)
+{
+ schedule();
+ return signal_pending(current);
+}
+EXPORT_SYMBOL(fscache_wait_bit_interruptible);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:38 UTC
Permalink
Add functions to register and unregister a network filesystem or other client
of the FS-Cache service. This allocates and releases the cookie representing
the top-level index for a netfs, and makes it available to the netfs.

If the FS-Cache facility is disabled, then the calls are optimised away at
compile time.

Note that whilst this patch may appear to work with FS-Cache enabled and a
netfs attempting to use it, it will leak the cookie it allocates for the netfs
as fscache_relinquish_cookie() is implemented in a later patch. This will
cause the slab code to emit a warning when the module is removed.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/Makefile | 3 +
fs/fscache/netfs.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fscache.h | 9 ++++
3 files changed, 113 insertions(+), 2 deletions(-)
create mode 100644 fs/fscache/netfs.c


diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index f88ac17..ecf6946 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -6,7 +6,8 @@ fscache-y := \
cache.o \
cookie.o \
fsdef.o \
- main.o
+ main.o \
+ netfs.o

fscache-$(CONFIG_PROC_FS) += proc.o
fscache-$(CONFIG_FSCACHE_STATS) += stats.o
diff --git a/fs/fscache/netfs.c b/fs/fscache/netfs.c
new file mode 100644
index 0000000..e028b8e
--- /dev/null
+++ b/fs/fscache/netfs.c
@@ -0,0 +1,103 @@
+/* FS-Cache netfs (client) registration
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL COOKIE
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+static LIST_HEAD(fscache_netfs_list);
+
+/*
+ * register a network filesystem for caching
+ */
+int __fscache_register_netfs(struct fscache_netfs *netfs)
+{
+ struct fscache_netfs *ptr;
+ int ret;
+
+ _enter("{%s}", netfs->name);
+
+ INIT_LIST_HEAD(&netfs->link);
+
+ /* allocate a cookie for the primary index */
+ netfs->primary_index =
+ kmem_cache_zalloc(fscache_cookie_jar, GFP_KERNEL);
+
+ if (!netfs->primary_index) {
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+ }
+
+ /* initialise the primary index cookie */
+ atomic_set(&netfs->primary_index->usage, 1);
+ atomic_set(&netfs->primary_index->n_children, 0);
+
+ netfs->primary_index->def = &fscache_fsdef_netfs_def;
+ netfs->primary_index->parent = &fscache_fsdef_index;
+ netfs->primary_index->netfs_data = netfs;
+
+ atomic_inc(&netfs->primary_index->parent->usage);
+ atomic_inc(&netfs->primary_index->parent->n_children);
+
+ spin_lock_init(&netfs->primary_index->lock);
+ INIT_HLIST_HEAD(&netfs->primary_index->backing_objects);
+
+ /* check the netfs type is not already present */
+ down_write(&fscache_addremove_sem);
+
+ ret = -EEXIST;
+ list_for_each_entry(ptr, &fscache_netfs_list, link) {
+ if (strcmp(ptr->name, netfs->name) == 0)
+ goto already_registered;
+ }
+
+ list_add(&netfs->link, &fscache_netfs_list);
+ ret = 0;
+
+ printk(KERN_NOTICE "FS-Cache: Netfs '%s' registered for caching\n",
+ netfs->name);
+
+already_registered:
+ up_write(&fscache_addremove_sem);
+
+ if (ret < 0) {
+ netfs->primary_index->parent = NULL;
+ __fscache_cookie_put(netfs->primary_index);
+ netfs->primary_index = NULL;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+EXPORT_SYMBOL(__fscache_register_netfs);
+
+/*
+ * unregister a network filesystem from the cache
+ * - all cookies must have been released first
+ */
+void __fscache_unregister_netfs(struct fscache_netfs *netfs)
+{
+ _enter("{%s.%u}", netfs->name, netfs->version);
+
+ down_write(&fscache_addremove_sem);
+
+ list_del(&netfs->link);
+ fscache_relinquish_cookie(netfs->primary_index, 0);
+
+ up_write(&fscache_addremove_sem);
+
+ printk(KERN_NOTICE "FS-Cache: Netfs '%s' unregistered from caching\n",
+ netfs->name);
+
+ _leave("");
+}
+EXPORT_SYMBOL(__fscache_unregister_netfs);
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index 0b66726..e4278b3 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -187,6 +187,8 @@ struct fscache_netfs {
* - these are undefined symbols when FS-Cache is not configured and the
* optimiser takes care of not using them
*/
+extern int __fscache_register_netfs(struct fscache_netfs *);
+extern void __fscache_unregister_netfs(struct fscache_netfs *);
extern struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *);
extern void __fscache_release_cache_tag(struct fscache_cache_tag *);

@@ -202,7 +204,10 @@ extern void __fscache_release_cache_tag(struct fscache_cache_tag *);
static inline
int fscache_register_netfs(struct fscache_netfs *netfs)
{
- return 0;
+ if (fscache_available())
+ return __fscache_register_netfs(netfs);
+ else
+ return 0;
}

/**
@@ -219,6 +224,8 @@ int fscache_register_netfs(struct fscache_netfs *netfs)
static inline
void fscache_unregister_netfs(struct fscache_netfs *netfs)
{
+ if (fscache_available())
+ __fscache_unregister_netfs(netfs);
}

/**

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:25 UTC
Permalink
Export a number of functions for CacheFiles's use.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/super.c | 1 +
security/security.c | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/fs/super.c b/fs/super.c
index 2ba4815..77cb4ec 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -287,6 +287,7 @@ int fsync_super(struct super_block *sb)
__fsync_super(sb);
return sync_blockdev(sb->s_bdev);
}
+EXPORT_SYMBOL_GPL(fsync_super);

/**
* generic_shutdown_super - common helper for ->kill_sb()
diff --git a/security/security.c b/security/security.c
index 206e538..5284255 100644
--- a/security/security.c
+++ b/security/security.c
@@ -445,6 +445,7 @@ int security_inode_create(struct inode *dir, struct dentry *dentry, int mode)
return 0;
return security_ops->inode_create(dir, dentry, mode);
}
+EXPORT_SYMBOL_GPL(security_inode_create);

int security_inode_link(struct dentry *old_dentry, struct inode *dir,
struct dentry *new_dentry)
@@ -475,6 +476,7 @@ int security_inode_mkdir(struct inode *dir, struct dentry *dentry, int mode)
return 0;
return security_ops->inode_mkdir(dir, dentry, mode);
}
+EXPORT_SYMBOL_GPL(security_inode_mkdir);

int security_inode_rmdir(struct inode *dir, struct dentry *dentry)
{

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rik van Riel
2009-04-02 18:53:11 UTC
Permalink
Post by David Howells
Export a number of functions for CacheFiles's use.
Acked-by: Rik van Riel <***@redhat.com>
David Howells
2009-04-01 23:04:33 UTC
Permalink
Provide a slab from which can be allocated the FS-Cache cookies that will be
presented to the netfs.

Also provide a slab constructor and a function to recursively discard a cookie
and its ancestor chain.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/Makefile | 1 +
fs/fscache/cookie.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/fscache/internal.h | 8 +++++++
fs/fscache/main.c | 15 +++++++++++++
4 files changed, 80 insertions(+), 0 deletions(-)
create mode 100644 fs/fscache/cookie.c


diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index 556708b..f88ac17 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -4,6 +4,7 @@

fscache-y := \
cache.o \
+ cookie.o \
fsdef.o \
main.o

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
new file mode 100644
index 0000000..47fd75b
--- /dev/null
+++ b/fs/fscache/cookie.c
@@ -0,0 +1,56 @@
+/* netfs cookie management
+ *
+ * Copyright (C) 2004-2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL COOKIE
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+struct kmem_cache *fscache_cookie_jar;
+
+/*
+ * initialise an cookie jar slab element prior to any use
+ */
+void fscache_cookie_init_once(void *_cookie)
+{
+ struct fscache_cookie *cookie = _cookie;
+
+ memset(cookie, 0, sizeof(*cookie));
+ spin_lock_init(&cookie->lock);
+ INIT_HLIST_HEAD(&cookie->backing_objects);
+}
+
+/*
+ * destroy a cookie
+ */
+void __fscache_cookie_put(struct fscache_cookie *cookie)
+{
+ struct fscache_cookie *parent;
+
+ _enter("%p", cookie);
+
+ for (;;) {
+ _debug("FREE COOKIE %p", cookie);
+ parent = cookie->parent;
+ BUG_ON(!hlist_empty(&cookie->backing_objects));
+ kmem_cache_free(fscache_cookie_jar, cookie);
+
+ if (!parent)
+ break;
+
+ cookie = parent;
+ BUG_ON(atomic_read(&cookie->usage) <= 0);
+ if (!atomic_dec_and_test(&cookie->usage))
+ break;
+ }
+
+ _leave("");
+}
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 0a2069a..4c6ba56 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -37,6 +37,14 @@ extern struct fscache_cache *fscache_select_cache_for_object(
struct fscache_cookie *);

/*
+ * fsc-cookie.c
+ */
+extern struct kmem_cache *fscache_cookie_jar;
+
+extern void fscache_cookie_init_once(void *);
+extern void __fscache_cookie_put(struct fscache_cookie *);
+
+/*
* fsc-fsdef.c
*/
extern struct fscache_cookie fscache_fsdef_index;
diff --git a/fs/fscache/main.c b/fs/fscache/main.c
index c2f3e63..48b79d2 100644
--- a/fs/fscache/main.c
+++ b/fs/fscache/main.c
@@ -56,6 +56,18 @@ static int __init fscache_init(void)
if (ret < 0)
goto error_proc;

+ fscache_cookie_jar = kmem_cache_create("fscache_cookie_jar",
+ sizeof(struct fscache_cookie),
+ 0,
+ 0,
+ fscache_cookie_init_once);
+ if (!fscache_cookie_jar) {
+ printk(KERN_NOTICE
+ "FS-Cache: Failed to allocate a cookie jar\n");
+ ret = -ENOMEM;
+ goto error_cookie_jar;
+ }
+
fscache_root = kobject_create_and_add("fscache", kernel_kobj);
if (!fscache_root)
goto error_kobj;
@@ -64,6 +76,8 @@ static int __init fscache_init(void)
return 0;

error_kobj:
+ kmem_cache_destroy(fscache_cookie_jar);
+error_cookie_jar:
fscache_proc_cleanup();
error_proc:
slow_work_unregister_user();
@@ -81,6 +95,7 @@ static void __exit fscache_exit(void)
_enter("");

kobject_put(fscache_root);
+ kmem_cache_destroy(fscache_cookie_jar);
fscache_proc_cleanup();
slow_work_unregister_user();
printk(KERN_NOTICE "FS-Cache: Unloaded\n");

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:20 UTC
Permalink
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.

This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

include/linux/pagemap.h | 5 +++++
mm/filemap.c | 18 ++++++++++++++++++
2 files changed, 23 insertions(+), 0 deletions(-)


diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1ae38aa..135028e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -396,6 +396,11 @@ static inline void wait_on_page_owner_priv_2(struct page *page)
extern void end_page_owner_priv_2(struct page *page);

/*
+ * Add an arbitrary waiter to a page's wait queue
+ */
+extern void add_page_wait_queue(struct page *page, wait_queue_t *waiter);
+
+/*
* Fault a userspace page into pagetables. Return non-zero on a fault.
*
* This assumes that two userspace pages are always sufficient. That's
diff --git a/mm/filemap.c b/mm/filemap.c
index d766c2f..15dda3c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -565,6 +565,24 @@ void wait_on_page_bit(struct page *page, int bit_nr)
EXPORT_SYMBOL(wait_on_page_bit);

/**
+ * add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
+ * @page - Page defining the wait queue of interest
+ * @waiter - Waiter to add to the queue
+ *
+ * Add an arbitrary @waiter to the wait queue for the nominated @page.
+ */
+void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
+{
+ wait_queue_head_t *q = page_waitqueue(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&q->lock, flags);
+ __add_wait_queue(q, waiter);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+EXPORT_SYMBOL_GPL(add_page_wait_queue);
+
+/**
* unlock_page - unlock a locked page
* @page: the page
*

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nick Piggin
2009-04-02 15:37:25 UTC
Permalink
Post by David Howells
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.
This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.
I prefer to nack this because it is exporting details of the page
locking mechanism. unlock_page is very heavyweight in large part
because of the memory barriers and cacheline required to check the
waitqueue. I have patches to avoid all that if the page lock is
not contended.

What's wrong with using wait_on_page_locked, like everyone else does?
David Howells
2009-04-02 16:14:52 UTC
Permalink
Post by Nick Piggin
I prefer to nack this because it is exporting details of the page
locking mechanism. unlock_page is very heavyweight in large part
because of the memory barriers and cacheline required to check the
waitqueue. I have patches to avoid all that if the page lock is
not contended.
What's wrong with using wait_on_page_locked, like everyone else does?
When fscache_read_or_alloc_pages() is called from, say, nfs_readpages(), and is
given a few hundred pages to readahead from the cache, who does the
wait_on_page_locked() on each of the _backing_ fs's pages?

The way I've arranged things to work is for the backing fs pages to be copied
to the netfs pages and released in the order they're read from the disk.

There's a small pool of threads that processes the pages. I don't want to have
to create a thread for each readpages(), and I don't want readpages() to have
to wait for all the requests it makes.

Ideally, I'd like to ask the backing fs to read directly into netfs pages, but
that's not particularly feasible at the moment.

David
Nick Piggin
2009-04-02 16:35:12 UTC
Permalink
Post by David Howells
Post by Nick Piggin
I prefer to nack this because it is exporting details of the page
locking mechanism. unlock_page is very heavyweight in large part
because of the memory barriers and cacheline required to check the
waitqueue. I have patches to avoid all that if the page lock is
not contended.
What's wrong with using wait_on_page_locked, like everyone else does?
When fscache_read_or_alloc_pages() is called from, say, nfs_readpages(), and is
given a few hundred pages to readahead from the cache, who does the
wait_on_page_locked() on each of the _backing_ fs's pages?
Presumably: at the point where data is needed.
Post by David Howells
The way I've arranged things to work is for the backing fs pages to be copied
to the netfs pages and released in the order they're read from the disk.
There's a small pool of threads that processes the pages. I don't want to have
to create a thread for each readpages(), and I don't want readpages() to have
to wait for all the requests it makes.
Or do you actually have numbers showing a problem if you just read the pages
then copy them?


If there is a problem, then why doesn't fscache_read_or_alloc_pages caller do
the work itself, then you get as many threads as you have indivisible work units,
so completing some part of the request before another wouldn't gain you anything
anyway...
David Howells
2009-04-02 17:05:07 UTC
Permalink
Post by Nick Piggin
Presumably: at the point where data is needed.
But the point where the data is needed is where filemap.c is waiting on a
netfs page. Maybe the sync_page() aop can deal with it

There's also the problem of recording and pinning the backing page I'm waiting
for. Currently I can do that by hooking the monitor block into the page
unlock watching list. If I don't do that, I have to use up yet more memory to
track those some other way. It's not impossible, but I'd like to keep memory
usage down.
Post by Nick Piggin
Or do you actually have numbers showing a problem if you just read the pages
then copy them?
I did, years ago. It wasn't particularly good, but
fscache_read_or_alloc_pages() was completely synchronous.
Post by Nick Piggin
If there is a problem, then why doesn't fscache_read_or_alloc_pages caller
do the work itself, then you get as many threads as you have indivisible
work units, so completing some part of the request before another wouldn't
gain you anything anyway...
(1) Trond stipulated FS-Cache had to be asynchronous, and it is, as far as I
can make it. I still have to invoke bmap() synchronously, though, to find
out whether I have a page in the cache to read:-/

(2) You lose the advantage of being able to process what you've got whilst the
disk is fetching stuff in the background.

David
Nick Piggin
2009-04-02 18:03:18 UTC
Permalink
Post by David Howells
Post by Nick Piggin
Presumably: at the point where data is needed.
But the point where the data is needed is where filemap.c is waiting on a
netfs page. Maybe the sync_page() aop can deal with it
I was thinking of more like just have some threads submitting the reads
and unlocking the waiters.
Post by David Howells
There's also the problem of recording and pinning the backing page I'm waiting
for. Currently I can do that by hooking the monitor block into the page
unlock watching list. If I don't do that, I have to use up yet more memory to
track those some other way. It's not impossible, but I'd like to keep memory
usage down.
Post by Nick Piggin
Or do you actually have numbers showing a problem if you just read the pages
then copy them?
I did, years ago. It wasn't particularly good, but
fscache_read_or_alloc_pages() was completely synchronous.
Performance wasn't good? Why?
Post by David Howells
Post by Nick Piggin
If there is a problem, then why doesn't fscache_read_or_alloc_pages caller
do the work itself, then you get as many threads as you have indivisible
work units, so completing some part of the request before another wouldn't
gain you anything anyway...
(1) Trond stipulated FS-Cache had to be asynchronous, and it is, as far as I
can make it. I still have to invoke bmap() synchronously, though, to find
out whether I have a page in the cache to read:-/
Why does it have to be asynchronous? Seems like an incredible complexity.
->readpage is called only in synchonous contexts, and ->readpages not but
do you even have the netfs filesystem request readahead on the backing
filesystem? (which then presumably tries to do readahead of its own)
Post by David Howells
(2) You lose the advantage of being able to process what you've got whilst the
disk is fetching stuff in the background.
This should happen via readahead on the underlying filesystem, shouldn't
it?
Rik van Riel
2009-04-02 18:51:52 UTC
Permalink
Post by David Howells
Add a function to install a monitor on the page lock waitqueue for a particular
page, thus allowing the page being unlocked to be detected.
This is used by CacheFiles to detect read completion on a page in the backing
filesystem so that it can then copy the data to the waiting netfs page.
Acked-by: Rik van Riel <***@redhat.com>
David Howells
2009-04-01 23:04:54 UTC
Permalink
Implement the cookie management part of the FS-Cache netfs client API. The
documentation and API header file were added in a previous patch.

This patch implements the following three functions:

(1) fscache_acquire_cookie().

Acquire a cookie to represent an object to the netfs. If the object in
question is a non-index object, then that object and its parent indices
will be created on disk at this point if they don't already exist. Index
creation is deferred because an index may reside in multiple caches.

(2) fscache_relinquish_cookie().

Retire or release a cookie previously acquired. At this point, the
object on disk may be destroyed.

(3) fscache_update_cookie().

Update the in-cache representation of a cookie. This is used to update
the auxiliary data for coherency management purposes.

With this patch it is possible to have a netfs instruct a cache backend to
look up, validate and create metadata on disk and to destroy it again.
The ability to actually store and retrieve data in the objects so created is
added in later patches.

Note that these functions will never return an error. _All_ errors are
handled internally to FS-Cache.

The worst that can happen is that fscache_acquire_cookie() may return a NULL
pointer - which is considered a negative cookie pointer and can be passed back
to any function that takes a cookie without harm. A negative cookie pointer
merely suppresses caching at that level.

The stub in linux/fscache.h will detect inline the negative cookie pointer and
abort the operation as fast as possible. This means that the compiler doesn't
have to set up for a call in that case.

See the documentation in Documentation/filesystems/caching/netfs-api.txt for
more information.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/cookie.c | 442 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fscache.h | 16 ++
2 files changed, 457 insertions(+), 1 deletions(-)


diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 47fd75b..cd9d065 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -7,6 +7,9 @@
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
+ *
+ * See Documentation/filesystems/caching/netfs-api.txt for more information on
+ * the netfs API.
*/

#define FSCACHE_DEBUG_LEVEL COOKIE
@@ -16,6 +19,14 @@

struct kmem_cache *fscache_cookie_jar;

+static atomic_t fscache_object_debug_id = ATOMIC_INIT(0);
+
+static int fscache_acquire_non_index_cookie(struct fscache_cookie *cookie);
+static int fscache_alloc_object(struct fscache_cache *cache,
+ struct fscache_cookie *cookie);
+static int fscache_attach_object(struct fscache_cookie *cookie,
+ struct fscache_object *object);
+
/*
* initialise an cookie jar slab element prior to any use
*/
@@ -29,6 +40,437 @@ void fscache_cookie_init_once(void *_cookie)
}

/*
+ * request a cookie to represent an object (index, datafile, xattr, etc)
+ * - parent specifies the parent object
+ * - the top level index cookie for each netfs is stored in the fscache_netfs
+ * struct upon registration
+ * - def points to the definition
+ * - the netfs_data will be passed to the functions pointed to in *def
+ * - all attached caches will be searched to see if they contain this object
+ * - index objects aren't stored on disk until there's a dependent file that
+ * needs storing
+ * - other objects are stored in a selected cache immediately, and all the
+ * indices forming the path to it are instantiated if necessary
+ * - we never let on to the netfs about errors
+ * - we may set a negative cookie pointer, but that's okay
+ */
+struct fscache_cookie *__fscache_acquire_cookie(
+ struct fscache_cookie *parent,
+ const struct fscache_cookie_def *def,
+ void *netfs_data)
+{
+ struct fscache_cookie *cookie;
+
+ BUG_ON(!def);
+
+ _enter("{%s},{%s},%p",
+ parent ? (char *) parent->def->name : "<no-parent>",
+ def->name, netfs_data);
+
+ fscache_stat(&fscache_n_acquires);
+
+ /* if there's no parent cookie, then we don't create one here either */
+ if (!parent) {
+ fscache_stat(&fscache_n_acquires_null);
+ _leave(" [no parent]");
+ return NULL;
+ }
+
+ /* validate the definition */
+ BUG_ON(!def->get_key);
+ BUG_ON(!def->name[0]);
+
+ BUG_ON(def->type == FSCACHE_COOKIE_TYPE_INDEX &&
+ parent->def->type != FSCACHE_COOKIE_TYPE_INDEX);
+
+ /* allocate and initialise a cookie */
+ cookie = kmem_cache_alloc(fscache_cookie_jar, GFP_KERNEL);
+ if (!cookie) {
+ fscache_stat(&fscache_n_acquires_oom);
+ _leave(" [ENOMEM]");
+ return NULL;
+ }
+
+ atomic_set(&cookie->usage, 1);
+ atomic_set(&cookie->n_children, 0);
+
+ atomic_inc(&parent->usage);
+ atomic_inc(&parent->n_children);
+
+ cookie->def = def;
+ cookie->parent = parent;
+ cookie->netfs_data = netfs_data;
+ cookie->flags = 0;
+
+ switch (cookie->def->type) {
+ case FSCACHE_COOKIE_TYPE_INDEX:
+ fscache_stat(&fscache_n_cookie_index);
+ break;
+ case FSCACHE_COOKIE_TYPE_DATAFILE:
+ fscache_stat(&fscache_n_cookie_data);
+ break;
+ default:
+ fscache_stat(&fscache_n_cookie_special);
+ break;
+ }
+
+ /* if the object is an index then we need do nothing more here - we
+ * create indices on disk when we need them as an index may exist in
+ * multiple caches */
+ if (cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX) {
+ if (fscache_acquire_non_index_cookie(cookie) < 0) {
+ atomic_dec(&parent->n_children);
+ __fscache_cookie_put(cookie);
+ fscache_stat(&fscache_n_acquires_nobufs);
+ _leave(" = NULL");
+ return NULL;
+ }
+ }
+
+ fscache_stat(&fscache_n_acquires_ok);
+ _leave(" = %p", cookie);
+ return cookie;
+}
+EXPORT_SYMBOL(__fscache_acquire_cookie);
+
+/*
+ * acquire a non-index cookie
+ * - this must make sure the index chain is instantiated and instantiate the
+ * object representation too
+ */
+static int fscache_acquire_non_index_cookie(struct fscache_cookie *cookie)
+{
+ struct fscache_object *object;
+ struct fscache_cache *cache;
+ uint64_t i_size;
+ int ret;
+
+ _enter("");
+
+ cookie->flags = 1 << FSCACHE_COOKIE_UNAVAILABLE;
+
+ /* now we need to see whether the backing objects for this cookie yet
+ * exist, if not there'll be nothing to search */
+ down_read(&fscache_addremove_sem);
+
+ if (list_empty(&fscache_cache_list)) {
+ up_read(&fscache_addremove_sem);
+ _leave(" = 0 [no caches]");
+ return 0;
+ }
+
+ /* select a cache in which to store the object */
+ cache = fscache_select_cache_for_object(cookie->parent);
+ if (!cache) {
+ up_read(&fscache_addremove_sem);
+ fscache_stat(&fscache_n_acquires_no_cache);
+ _leave(" = -ENOMEDIUM [no cache]");
+ return -ENOMEDIUM;
+ }
+
+ _debug("cache %s", cache->tag->name);
+
+ cookie->flags =
+ (1 << FSCACHE_COOKIE_LOOKING_UP) |
+ (1 << FSCACHE_COOKIE_CREATING) |
+ (1 << FSCACHE_COOKIE_NO_DATA_YET);
+
+ /* ask the cache to allocate objects for this cookie and its parent
+ * chain */
+ ret = fscache_alloc_object(cache, cookie);
+ if (ret < 0) {
+ up_read(&fscache_addremove_sem);
+ _leave(" = %d", ret);
+ return ret;
+ }
+
+ /* pass on how big the object we're caching is supposed to be */
+ cookie->def->get_attr(cookie->netfs_data, &i_size);
+
+ spin_lock(&cookie->lock);
+ if (hlist_empty(&cookie->backing_objects)) {
+ spin_unlock(&cookie->lock);
+ goto unavailable;
+ }
+
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ fscache_set_store_limit(object, i_size);
+
+ /* initiate the process of looking up all the objects in the chain
+ * (done by fscache_initialise_object()) */
+ fscache_enqueue_object(object);
+
+ spin_unlock(&cookie->lock);
+
+ /* we may be required to wait for lookup to complete at this point */
+ if (!fscache_defer_lookup) {
+ _debug("non-deferred lookup %p", &cookie->flags);
+ wait_on_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ _debug("complete");
+ if (test_bit(FSCACHE_COOKIE_UNAVAILABLE, &cookie->flags))
+ goto unavailable;
+ }
+
+ up_read(&fscache_addremove_sem);
+ _leave(" = 0 [deferred]");
+ return 0;
+
+unavailable:
+ up_read(&fscache_addremove_sem);
+ _leave(" = -ENOBUFS");
+ return -ENOBUFS;
+}
+
+/*
+ * recursively allocate cache object records for a cookie/cache combination
+ * - caller must be holding the addremove sem
+ */
+static int fscache_alloc_object(struct fscache_cache *cache,
+ struct fscache_cookie *cookie)
+{
+ struct fscache_object *object;
+ struct hlist_node *_n;
+ int ret;
+
+ _enter("%p,%p{%s}", cache, cookie, cookie->def->name);
+
+ spin_lock(&cookie->lock);
+ hlist_for_each_entry(object, _n, &cookie->backing_objects,
+ cookie_link) {
+ if (object->cache == cache)
+ goto object_already_extant;
+ }
+ spin_unlock(&cookie->lock);
+
+ /* ask the cache to allocate an object (we may end up with duplicate
+ * objects at this stage, but we sort that out later) */
+ object = cache->ops->alloc_object(cache, cookie);
+ if (IS_ERR(object)) {
+ fscache_stat(&fscache_n_object_no_alloc);
+ ret = PTR_ERR(object);
+ goto error;
+ }
+
+ fscache_stat(&fscache_n_object_alloc);
+
+ object->debug_id = atomic_inc_return(&fscache_object_debug_id);
+
+ _debug("ALLOC OBJ%x: %s {%lx}",
+ object->debug_id, cookie->def->name, object->events);
+
+ ret = fscache_alloc_object(cache, cookie->parent);
+ if (ret < 0)
+ goto error_put;
+
+ /* only attach if we managed to allocate all we needed, otherwise
+ * discard the object we just allocated and instead use the one
+ * attached to the cookie */
+ if (fscache_attach_object(cookie, object) < 0)
+ cache->ops->put_object(object);
+
+ _leave(" = 0");
+ return 0;
+
+object_already_extant:
+ ret = -ENOBUFS;
+ if (object->state >= FSCACHE_OBJECT_DYING) {
+ spin_unlock(&cookie->lock);
+ goto error;
+ }
+ spin_unlock(&cookie->lock);
+ _leave(" = 0 [found]");
+ return 0;
+
+error_put:
+ cache->ops->put_object(object);
+error:
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * attach a cache object to a cookie
+ */
+static int fscache_attach_object(struct fscache_cookie *cookie,
+ struct fscache_object *object)
+{
+ struct fscache_object *p;
+ struct fscache_cache *cache = object->cache;
+ struct hlist_node *_n;
+ int ret;
+
+ _enter("{%s},{OBJ%x}", cookie->def->name, object->debug_id);
+
+ spin_lock(&cookie->lock);
+
+ /* there may be multiple initial creations of this object, but we only
+ * want one */
+ ret = -EEXIST;
+ hlist_for_each_entry(p, _n, &cookie->backing_objects, cookie_link) {
+ if (p->cache == object->cache) {
+ if (p->state >= FSCACHE_OBJECT_DYING)
+ ret = -ENOBUFS;
+ goto cant_attach_object;
+ }
+ }
+
+ /* pin the parent object */
+ spin_lock_nested(&cookie->parent->lock, 1);
+ hlist_for_each_entry(p, _n, &cookie->parent->backing_objects,
+ cookie_link) {
+ if (p->cache == object->cache) {
+ if (p->state >= FSCACHE_OBJECT_DYING) {
+ ret = -ENOBUFS;
+ spin_unlock(&cookie->parent->lock);
+ goto cant_attach_object;
+ }
+ object->parent = p;
+ spin_lock(&p->lock);
+ p->n_children++;
+ spin_unlock(&p->lock);
+ break;
+ }
+ }
+ spin_unlock(&cookie->parent->lock);
+
+ /* attach to the cache's object list */
+ if (list_empty(&object->cache_link)) {
+ spin_lock(&cache->object_list_lock);
+ list_add(&object->cache_link, &cache->object_list);
+ spin_unlock(&cache->object_list_lock);
+ }
+
+ /* attach to the cookie */
+ object->cookie = cookie;
+ atomic_inc(&cookie->usage);
+ hlist_add_head(&object->cookie_link, &cookie->backing_objects);
+ ret = 0;
+
+cant_attach_object:
+ spin_unlock(&cookie->lock);
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * update the index entries backing a cookie
+ */
+void __fscache_update_cookie(struct fscache_cookie *cookie)
+{
+ struct fscache_object *object;
+ struct hlist_node *_p;
+
+ fscache_stat(&fscache_n_updates);
+
+ if (!cookie) {
+ fscache_stat(&fscache_n_updates_null);
+ _leave(" [no cookie]");
+ return;
+ }
+
+ _enter("{%s}", cookie->def->name);
+
+ BUG_ON(!cookie->def->get_aux);
+
+ spin_lock(&cookie->lock);
+
+ /* update the index entry on disk in each cache backing this cookie */
+ hlist_for_each_entry(object, _p,
+ &cookie->backing_objects, cookie_link) {
+ fscache_raise_event(object, FSCACHE_OBJECT_EV_UPDATE);
+ }
+
+ spin_unlock(&cookie->lock);
+ _leave("");
+}
+EXPORT_SYMBOL(__fscache_update_cookie);
+
+/*
+ * release a cookie back to the cache
+ * - the object will be marked as recyclable on disk if retire is true
+ * - all dependents of this cookie must have already been unregistered
+ * (indices/files/pages)
+ */
+void __fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
+{
+ struct fscache_cache *cache;
+ struct fscache_object *object;
+ unsigned long event;
+
+ fscache_stat(&fscache_n_relinquishes);
+
+ if (!cookie) {
+ fscache_stat(&fscache_n_relinquishes_null);
+ _leave(" [no cookie]");
+ return;
+ }
+
+ _enter("%p{%s,%p},%d",
+ cookie, cookie->def->name, cookie->netfs_data, retire);
+
+ if (atomic_read(&cookie->n_children) != 0) {
+ printk(KERN_ERR "FS-Cache: Cookie '%s' still has children\n",
+ cookie->def->name);
+ BUG();
+ }
+
+ /* wait for the cookie to finish being instantiated (or to fail) */
+ if (test_bit(FSCACHE_COOKIE_CREATING, &cookie->flags)) {
+ fscache_stat(&fscache_n_relinquishes_waitcrt);
+ wait_on_bit(&cookie->flags, FSCACHE_COOKIE_CREATING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ }
+
+ event = retire ? FSCACHE_OBJECT_EV_RETIRE : FSCACHE_OBJECT_EV_RELEASE;
+
+ /* detach pointers back to the netfs */
+ spin_lock(&cookie->lock);
+
+ cookie->netfs_data = NULL;
+ cookie->def = NULL;
+
+ /* break links with all the active objects */
+ while (!hlist_empty(&cookie->backing_objects)) {
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object,
+ cookie_link);
+
+ _debug("RELEASE OBJ%x", object->debug_id);
+
+ /* detach each cache object from the object cookie */
+ spin_lock(&object->lock);
+ hlist_del_init(&object->cookie_link);
+
+ cache = object->cache;
+ object->cookie = NULL;
+ fscache_raise_event(object, event);
+ spin_unlock(&object->lock);
+
+ if (atomic_dec_and_test(&cookie->usage))
+ /* the cookie refcount shouldn't be reduced to 0 yet */
+ BUG();
+ }
+
+ spin_unlock(&cookie->lock);
+
+ if (cookie->parent) {
+ ASSERTCMP(atomic_read(&cookie->parent->usage), >, 0);
+ ASSERTCMP(atomic_read(&cookie->parent->n_children), >, 0);
+ atomic_dec(&cookie->parent->n_children);
+ }
+
+ /* finally dispose of the cookie */
+ ASSERTCMP(atomic_read(&cookie->usage), >, 0);
+ fscache_cookie_put(cookie);
+
+ _leave("");
+}
+EXPORT_SYMBOL(__fscache_relinquish_cookie);
+
+/*
* destroy a cookie
*/
void __fscache_cookie_put(struct fscache_cookie *cookie)
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index e4278b3..7e4fd5c 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -192,6 +192,13 @@ extern void __fscache_unregister_netfs(struct fscache_netfs *);
extern struct fscache_cache_tag *__fscache_lookup_cache_tag(const char *);
extern void __fscache_release_cache_tag(struct fscache_cache_tag *);

+extern struct fscache_cookie *__fscache_acquire_cookie(
+ struct fscache_cookie *,
+ const struct fscache_cookie_def *,
+ void *);
+extern void __fscache_relinquish_cookie(struct fscache_cookie *, int);
+extern void __fscache_update_cookie(struct fscache_cookie *);
+
/**
* fscache_register_netfs - Register a filesystem as desiring caching services
* @netfs: The description of the filesystem
@@ -283,7 +290,10 @@ struct fscache_cookie *fscache_acquire_cookie(
const struct fscache_cookie_def *def,
void *netfs_data)
{
- return NULL;
+ if (fscache_cookie_valid(parent))
+ return __fscache_acquire_cookie(parent, def, netfs_data);
+ else
+ return NULL;
}

/**
@@ -301,6 +311,8 @@ struct fscache_cookie *fscache_acquire_cookie(
static inline
void fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
{
+ if (fscache_cookie_valid(cookie))
+ __fscache_relinquish_cookie(cookie, retire);
}

/**
@@ -316,6 +328,8 @@ void fscache_relinquish_cookie(struct fscache_cookie *cookie, int retire)
static inline
void fscache_update_cookie(struct fscache_cookie *cookie)
{
+ if (fscache_cookie_valid(cookie))
+ __fscache_update_cookie(cookie);
}

/**

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:04:59 UTC
Permalink
Add and document asynchronous operation handling for use by FS-Cache's data
storage and retrieval routines.

The following documentation is added to:

Documentation/filesystems/caching/operations.txt


================================
ASYNCHRONOUS OPERATIONS HANDLING
================================

========
OVERVIEW
========

FS-Cache has an asynchronous operations handling facility that it uses for its
data storage and retrieval routines. Its operations are represented by
fscache_operation structs, though these are usually embedded into some other
structure.

This facility is available to and expected to be be used by the cache backends,
and FS-Cache will create operations and pass them off to the appropriate cache
backend for completion.

To make use of this facility, <linux/fscache-cache.h> should be #included.


===============================
OPERATION RECORD INITIALISATION
===============================

An operation is recorded in an fscache_operation struct:

struct fscache_operation {
union {
struct work_struct fast_work;
struct slow_work slow_work;
};
unsigned long flags;
fscache_operation_processor_t processor;
...
};

Someone wanting to issue an operation should allocate something with this
struct embedded in it. They should initialise it by calling:

void fscache_operation_init(struct fscache_operation *op,
fscache_operation_release_t release);

with the operation to be initialised and the release function to use.

The op->flags parameter should be set to indicate the CPU time provision and
the exclusivity (see the Parameters section).

The op->fast_work, op->slow_work and op->processor flags should be set as
appropriate for the CPU time provision (see the Parameters section).

FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
operation and waited for afterwards.


==========
PARAMETERS
==========

There are a number of parameters that can be set in the operation record's flag
parameter. There are three options for the provision of CPU time in these
operations:

(1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
may decide it wants to handle an operation itself without deferring it to
another thread.

This is, for example, used in read operations for calling readpages() on
the backing filesystem in CacheFiles. Although readpages() does an
asynchronous data fetch, the determination of whether pages exist is done
synchronously - and the netfs does not proceed until this has been
determined.

If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
before submitting the operation, and the operating thread must wait for it
to be cleared before proceeding:

wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
fscache_wait_bit, TASK_UNINTERRUPTIBLE);


(2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
will be given to keventd to process. Such an operation is not permitted
to sleep on I/O.

This is, for example, used by CacheFiles to copy data from a backing fs
page to a netfs page after the backing fs has read the page in.

If this option is used, op->fast_work and op->processor must be
initialised before submitting the operation:

INIT_WORK(&op->fast_work, do_some_work);


(3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
will be given to the slow work facility to process. Such an operation is
permitted to sleep on I/O.

This is, for example, used by FS-Cache to handle background writes of
pages that have just been fetched from a remote server.

If this option is used, op->slow_work and op->processor must be
initialised before submitting the operation:

fscache_operation_init_slow(op, processor)


Furthermore, operations may be one of two types:

(1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
conjunction with any other operation on the object being operated upon.

An example of this is the attribute change operation, in which the file
being written to may need truncation.

(2) Shareable. Operations of this type may be running simultaneously. It's
up to the operation implementation to prevent interference between other
operations running at the same time.


=========
PROCEDURE
=========

Operations are used through the following procedure:

(1) The submitting thread must allocate the operation and initialise it
itself. Normally this would be part of a more specific structure with the
generic op embedded within.

(2) The submitting thread must then submit the operation for processing using
one of the following two functions:

int fscache_submit_op(struct fscache_object *object,
struct fscache_operation *op);

int fscache_submit_exclusive_op(struct fscache_object *object,
struct fscache_operation *op);

The first function should be used to submit non-exclusive ops and the
second to submit exclusive ones. The caller must still set the
FSCACHE_OP_EXCLUSIVE flag.

If successful, both functions will assign the operation to the specified
object and return 0. -ENOBUFS will be returned if the object specified is
permanently unavailable.

The operation manager will defer operations on an object that is still
undergoing lookup or creation. The operation will also be deferred if an
operation of conflicting exclusivity is in progress on the object.

If the operation is asynchronous, the manager will retain a reference to
it, so the caller should put their reference to it by passing it to:

void fscache_put_operation(struct fscache_operation *op);

(3) If the submitting thread wants to do the work itself, and has marked the
operation with FSCACHE_OP_MYTHREAD, then it should monitor
FSCACHE_OP_WAITING as described above and check the state of the object if
necessary (the object might have died whilst the thread was waiting).

When it has finished doing its processing, it should call
fscache_put_operation() on it.

(4) The operation holds an effective lock upon the object, preventing other
exclusive ops conflicting until it is released. The operation can be
enqueued for further immediate asynchronous processing by adjusting the
CPU time provisioning option if necessary, eg:

op->flags &= ~FSCACHE_OP_TYPE;
op->flags |= ~FSCACHE_OP_FAST;

and calling:

void fscache_enqueue_operation(struct fscache_operation *op)

This can be used to allow other things to have use of the worker thread
pools.


=====================
ASYNCHRONOUS CALLBACK
=====================

When used in asynchronous mode, the worker thread pool will invoke the
processor method with a pointer to the operation. This should then get at the
container struct by using container_of():

static void fscache_write_op(struct fscache_operation *_op)
{
struct fscache_storage *op =
container_of(_op, struct fscache_storage, op);
...
}

The caller holds a reference on the operation, and will invoke
fscache_put_operation() when the processor function returns. The processor
function is at liberty to call fscache_enqueue_operation() or to take extra
references.


Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/filesystems/caching/operations.txt | 213 ++++++++++
fs/fscache/Makefile | 3
fs/fscache/cache.c | 2
fs/fscache/internal.h | 8
fs/fscache/operation.c | 459 ++++++++++++++++++++++
5 files changed, 682 insertions(+), 3 deletions(-)
create mode 100644 Documentation/filesystems/caching/operations.txt
create mode 100644 fs/fscache/operation.c


diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt
new file mode 100644
index 0000000..b6b070c
--- /dev/null
+++ b/Documentation/filesystems/caching/operations.txt
@@ -0,0 +1,213 @@
+ ================================
+ ASYNCHRONOUS OPERATIONS HANDLING
+ ================================
+
+By: David Howells <***@redhat.com>
+
+Contents:
+
+ (*) Overview.
+
+ (*) Operation record initialisation.
+
+ (*) Parameters.
+
+ (*) Procedure.
+
+ (*) Asynchronous callback.
+
+
+========
+OVERVIEW
+========
+
+FS-Cache has an asynchronous operations handling facility that it uses for its
+data storage and retrieval routines. Its operations are represented by
+fscache_operation structs, though these are usually embedded into some other
+structure.
+
+This facility is available to and expected to be be used by the cache backends,
+and FS-Cache will create operations and pass them off to the appropriate cache
+backend for completion.
+
+To make use of this facility, <linux/fscache-cache.h> should be #included.
+
+
+===============================
+OPERATION RECORD INITIALISATION
+===============================
+
+An operation is recorded in an fscache_operation struct:
+
+ struct fscache_operation {
+ union {
+ struct work_struct fast_work;
+ struct slow_work slow_work;
+ };
+ unsigned long flags;
+ fscache_operation_processor_t processor;
+ ...
+ };
+
+Someone wanting to issue an operation should allocate something with this
+struct embedded in it. They should initialise it by calling:
+
+ void fscache_operation_init(struct fscache_operation *op,
+ fscache_operation_release_t release);
+
+with the operation to be initialised and the release function to use.
+
+The op->flags parameter should be set to indicate the CPU time provision and
+the exclusivity (see the Parameters section).
+
+The op->fast_work, op->slow_work and op->processor flags should be set as
+appropriate for the CPU time provision (see the Parameters section).
+
+FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
+operation and waited for afterwards.
+
+
+==========
+PARAMETERS
+==========
+
+There are a number of parameters that can be set in the operation record's flag
+parameter. There are three options for the provision of CPU time in these
+operations:
+
+ (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
+ may decide it wants to handle an operation itself without deferring it to
+ another thread.
+
+ This is, for example, used in read operations for calling readpages() on
+ the backing filesystem in CacheFiles. Although readpages() does an
+ asynchronous data fetch, the determination of whether pages exist is done
+ synchronously - and the netfs does not proceed until this has been
+ determined.
+
+ If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
+ before submitting the operation, and the operating thread must wait for it
+ to be cleared before proceeding:
+
+ wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+
+
+ (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
+ will be given to keventd to process. Such an operation is not permitted
+ to sleep on I/O.
+
+ This is, for example, used by CacheFiles to copy data from a backing fs
+ page to a netfs page after the backing fs has read the page in.
+
+ If this option is used, op->fast_work and op->processor must be
+ initialised before submitting the operation:
+
+ INIT_WORK(&op->fast_work, do_some_work);
+
+
+ (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
+ will be given to the slow work facility to process. Such an operation is
+ permitted to sleep on I/O.
+
+ This is, for example, used by FS-Cache to handle background writes of
+ pages that have just been fetched from a remote server.
+
+ If this option is used, op->slow_work and op->processor must be
+ initialised before submitting the operation:
+
+ fscache_operation_init_slow(op, processor)
+
+
+Furthermore, operations may be one of two types:
+
+ (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
+ conjunction with any other operation on the object being operated upon.
+
+ An example of this is the attribute change operation, in which the file
+ being written to may need truncation.
+
+ (2) Shareable. Operations of this type may be running simultaneously. It's
+ up to the operation implementation to prevent interference between other
+ operations running at the same time.
+
+
+=========
+PROCEDURE
+=========
+
+Operations are used through the following procedure:
+
+ (1) The submitting thread must allocate the operation and initialise it
+ itself. Normally this would be part of a more specific structure with the
+ generic op embedded within.
+
+ (2) The submitting thread must then submit the operation for processing using
+ one of the following two functions:
+
+ int fscache_submit_op(struct fscache_object *object,
+ struct fscache_operation *op);
+
+ int fscache_submit_exclusive_op(struct fscache_object *object,
+ struct fscache_operation *op);
+
+ The first function should be used to submit non-exclusive ops and the
+ second to submit exclusive ones. The caller must still set the
+ FSCACHE_OP_EXCLUSIVE flag.
+
+ If successful, both functions will assign the operation to the specified
+ object and return 0. -ENOBUFS will be returned if the object specified is
+ permanently unavailable.
+
+ The operation manager will defer operations on an object that is still
+ undergoing lookup or creation. The operation will also be deferred if an
+ operation of conflicting exclusivity is in progress on the object.
+
+ If the operation is asynchronous, the manager will retain a reference to
+ it, so the caller should put their reference to it by passing it to:
+
+ void fscache_put_operation(struct fscache_operation *op);
+
+ (3) If the submitting thread wants to do the work itself, and has marked the
+ operation with FSCACHE_OP_MYTHREAD, then it should monitor
+ FSCACHE_OP_WAITING as described above and check the state of the object if
+ necessary (the object might have died whilst the thread was waiting).
+
+ When it has finished doing its processing, it should call
+ fscache_put_operation() on it.
+
+ (4) The operation holds an effective lock upon the object, preventing other
+ exclusive ops conflicting until it is released. The operation can be
+ enqueued for further immediate asynchronous processing by adjusting the
+ CPU time provisioning option if necessary, eg:
+
+ op->flags &= ~FSCACHE_OP_TYPE;
+ op->flags |= ~FSCACHE_OP_FAST;
+
+ and calling:
+
+ void fscache_enqueue_operation(struct fscache_operation *op)
+
+ This can be used to allow other things to have use of the worker thread
+ pools.
+
+
+=====================
+ASYNCHRONOUS CALLBACK
+=====================
+
+When used in asynchronous mode, the worker thread pool will invoke the
+processor method with a pointer to the operation. This should then get at the
+container struct by using container_of():
+
+ static void fscache_write_op(struct fscache_operation *_op)
+ {
+ struct fscache_storage *op =
+ container_of(_op, struct fscache_storage, op);
+ ...
+ }
+
+The caller holds a reference on the operation, and will invoke
+fscache_put_operation() when the processor function returns. The processor
+function is at liberty to call fscache_enqueue_operation() or to take extra
+references.
diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index 4420ac6..6f82da2 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -8,7 +8,8 @@ fscache-y := \
fsdef.o \
main.o \
netfs.o \
- object.o
+ object.o \
+ operation.o

fscache-$(CONFIG_PROC_FS) += proc.o
fscache-$(CONFIG_FSCACHE_STATS) += stats.o
diff --git a/fs/fscache/cache.c b/fs/fscache/cache.c
index 355172f..e21985b 100644
--- a/fs/fscache/cache.c
+++ b/fs/fscache/cache.c
@@ -194,7 +194,7 @@ void fscache_init_cache(struct fscache_cache *cache,
vsnprintf(cache->identifier, sizeof(cache->identifier), idfmt, va);
va_end(va);

- INIT_WORK(&cache->op_gc, NULL);
+ INIT_WORK(&cache->op_gc, fscache_operation_gc);
INIT_LIST_HEAD(&cache->link);
INIT_LIST_HEAD(&cache->object_list);
INIT_LIST_HEAD(&cache->op_gc_list);
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 529f4de..014a830 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -95,7 +95,13 @@ extern void fscache_enqueue_object(struct fscache_object *);
/*
* fsc-operation.c
*/
-#define fscache_start_operations(obj) BUG()
+extern int fscache_submit_exclusive_op(struct fscache_object *,
+ struct fscache_operation *);
+extern int fscache_submit_op(struct fscache_object *,
+ struct fscache_operation *);
+extern void fscache_abort_object(struct fscache_object *);
+extern void fscache_start_operations(struct fscache_object *);
+extern void fscache_operation_gc(struct work_struct *);

/*
* fsc-proc.c
diff --git a/fs/fscache/operation.c b/fs/fscache/operation.c
new file mode 100644
index 0000000..e7f8d53
--- /dev/null
+++ b/fs/fscache/operation.c
@@ -0,0 +1,459 @@
+/* FS-Cache worker operation management routines
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * See Documentation/filesystems/caching/operations.txt
+ */
+
+#define FSCACHE_DEBUG_LEVEL OPERATION
+#include <linux/module.h>
+#include "internal.h"
+
+atomic_t fscache_op_debug_id;
+EXPORT_SYMBOL(fscache_op_debug_id);
+
+/**
+ * fscache_enqueue_operation - Enqueue an operation for processing
+ * @op: The operation to enqueue
+ *
+ * Enqueue an operation for processing by the FS-Cache thread pool.
+ *
+ * This will get its own ref on the object.
+ */
+void fscache_enqueue_operation(struct fscache_operation *op)
+{
+ _enter("{OBJ%x OP%x,%u}",
+ op->object->debug_id, op->debug_id, atomic_read(&op->usage));
+
+ ASSERT(op->processor != NULL);
+ ASSERTCMP(op->object->state, >=, FSCACHE_OBJECT_AVAILABLE);
+ ASSERTCMP(atomic_read(&op->usage), >, 0);
+
+ if (list_empty(&op->pend_link)) {
+ switch (op->flags & FSCACHE_OP_TYPE) {
+ case FSCACHE_OP_FAST:
+ _debug("queue fast");
+ atomic_inc(&op->usage);
+ if (!schedule_work(&op->fast_work))
+ fscache_put_operation(op);
+ break;
+ case FSCACHE_OP_SLOW:
+ _debug("queue slow");
+ slow_work_enqueue(&op->slow_work);
+ break;
+ case FSCACHE_OP_MYTHREAD:
+ _debug("queue for caller's attention");
+ break;
+ default:
+ printk(KERN_ERR "FS-Cache: Unexpected op type %lx",
+ op->flags);
+ BUG();
+ break;
+ }
+ fscache_stat(&fscache_n_op_enqueue);
+ }
+}
+EXPORT_SYMBOL(fscache_enqueue_operation);
+
+/*
+ * start an op running
+ */
+static void fscache_run_op(struct fscache_object *object,
+ struct fscache_operation *op)
+{
+ object->n_in_progress++;
+ if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
+ wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
+ if (op->processor)
+ fscache_enqueue_operation(op);
+ fscache_stat(&fscache_n_op_run);
+}
+
+/*
+ * submit an exclusive operation for an object
+ * - other ops are excluded from running simultaneously with this one
+ * - this gets any extra refs it needs on an op
+ */
+int fscache_submit_exclusive_op(struct fscache_object *object,
+ struct fscache_operation *op)
+{
+ int ret;
+
+ _enter("{OBJ%x OP%x},", object->debug_id, op->debug_id);
+
+ spin_lock(&object->lock);
+ ASSERTCMP(object->n_ops, >=, object->n_in_progress);
+ ASSERTCMP(object->n_ops, >=, object->n_exclusive);
+
+ ret = -ENOBUFS;
+ if (fscache_object_is_active(object)) {
+ op->object = object;
+ object->n_ops++;
+ object->n_exclusive++; /* reads and writes must wait */
+
+ if (object->n_ops > 0) {
+ atomic_inc(&op->usage);
+ list_add_tail(&op->pend_link, &object->pending_ops);
+ fscache_stat(&fscache_n_op_pend);
+ } else if (!list_empty(&object->pending_ops)) {
+ atomic_inc(&op->usage);
+ list_add_tail(&op->pend_link, &object->pending_ops);
+ fscache_stat(&fscache_n_op_pend);
+ fscache_start_operations(object);
+ } else {
+ ASSERTCMP(object->n_in_progress, ==, 0);
+ fscache_run_op(object, op);
+ }
+
+ /* need to issue a new write op after this */
+ clear_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags);
+ ret = 0;
+ } else if (object->state == FSCACHE_OBJECT_CREATING) {
+ op->object = object;
+ object->n_ops++;
+ object->n_exclusive++; /* reads and writes must wait */
+ atomic_inc(&op->usage);
+ list_add_tail(&op->pend_link, &object->pending_ops);
+ fscache_stat(&fscache_n_op_pend);
+ ret = 0;
+ } else {
+ /* not allowed to submit ops in any other state */
+ BUG();
+ }
+
+ spin_unlock(&object->lock);
+ return ret;
+}
+
+/*
+ * report an unexpected submission
+ */
+static void fscache_report_unexpected_submission(struct fscache_object *object,
+ struct fscache_operation *op,
+ unsigned long ostate)
+{
+ static bool once_only;
+ struct fscache_operation *p;
+ unsigned n;
+
+ if (once_only)
+ return;
+ once_only = true;
+
+ kdebug("unexpected submission OP%x [OBJ%x %s]",
+ op->debug_id, object->debug_id,
+ fscache_object_states[object->state]);
+ kdebug("objstate=%s [%s]",
+ fscache_object_states[object->state],
+ fscache_object_states[ostate]);
+ kdebug("objflags=%lx", object->flags);
+ kdebug("objevent=%lx [%lx]", object->events, object->event_mask);
+ kdebug("ops=%u inp=%u exc=%u",
+ object->n_ops, object->n_in_progress, object->n_exclusive);
+
+ if (!list_empty(&object->pending_ops)) {
+ n = 0;
+ list_for_each_entry(p, &object->pending_ops, pend_link) {
+ ASSERTCMP(p->object, ==, object);
+ kdebug("%p %p", op->processor, op->release);
+ n++;
+ }
+
+ kdebug("n=%u", n);
+ }
+
+ dump_stack();
+}
+
+/*
+ * submit an operation for an object
+ * - objects may be submitted only in the following states:
+ * - during object creation (write ops may be submitted)
+ * - whilst the object is active
+ * - after an I/O error incurred in one of the two above states (op rejected)
+ * - this gets any extra refs it needs on an op
+ */
+int fscache_submit_op(struct fscache_object *object,
+ struct fscache_operation *op)
+{
+ unsigned long ostate;
+ int ret;
+
+ _enter("{OBJ%x OP%x},{%u}",
+ object->debug_id, op->debug_id, atomic_read(&op->usage));
+
+ ASSERTCMP(atomic_read(&op->usage), >, 0);
+
+ spin_lock(&object->lock);
+ ASSERTCMP(object->n_ops, >=, object->n_in_progress);
+ ASSERTCMP(object->n_ops, >=, object->n_exclusive);
+
+ ostate = object->state;
+ smp_rmb();
+
+ if (fscache_object_is_active(object)) {
+ op->object = object;
+ object->n_ops++;
+
+ if (object->n_exclusive > 0) {
+ atomic_inc(&op->usage);
+ list_add_tail(&op->pend_link, &object->pending_ops);
+ fscache_stat(&fscache_n_op_pend);
+ } else if (!list_empty(&object->pending_ops)) {
+ atomic_inc(&op->usage);
+ list_add_tail(&op->pend_link, &object->pending_ops);
+ fscache_stat(&fscache_n_op_pend);
+ fscache_start_operations(object);
+ } else {
+ ASSERTCMP(object->n_exclusive, ==, 0);
+ fscache_run_op(object, op);
+ }
+ ret = 0;
+ } else if (object->state == FSCACHE_OBJECT_CREATING) {
+ op->object = object;
+ object->n_ops++;
+ atomic_inc(&op->usage);
+ list_add_tail(&op->pend_link, &object->pending_ops);
+ fscache_stat(&fscache_n_op_pend);
+ ret = 0;
+ } else if (!test_bit(FSCACHE_IOERROR, &object->cache->flags)) {
+ fscache_report_unexpected_submission(object, op, ostate);
+ ASSERT(!fscache_object_is_active(object));
+ ret = -ENOBUFS;
+ } else {
+ ret = -ENOBUFS;
+ }
+
+ spin_unlock(&object->lock);
+ return ret;
+}
+
+/*
+ * queue an object for withdrawal on error, aborting all following asynchronous
+ * operations
+ */
+void fscache_abort_object(struct fscache_object *object)
+{
+ _enter("{OBJ%x}", object->debug_id);
+
+ fscache_raise_event(object, FSCACHE_OBJECT_EV_ERROR);
+}
+
+/*
+ * jump start the operation processing on an object
+ * - caller must hold object->lock
+ */
+void fscache_start_operations(struct fscache_object *object)
+{
+ struct fscache_operation *op;
+ bool stop = false;
+
+ while (!list_empty(&object->pending_ops) && !stop) {
+ op = list_entry(object->pending_ops.next,
+ struct fscache_operation, pend_link);
+
+ if (test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags)) {
+ if (object->n_in_progress > 0)
+ break;
+ stop = true;
+ }
+ list_del_init(&op->pend_link);
+ object->n_in_progress++;
+
+ if (test_and_clear_bit(FSCACHE_OP_WAITING, &op->flags))
+ wake_up_bit(&op->flags, FSCACHE_OP_WAITING);
+ if (op->processor)
+ fscache_enqueue_operation(op);
+
+ /* the pending queue was holding a ref on the object */
+ fscache_put_operation(op);
+ }
+
+ ASSERTCMP(object->n_in_progress, <=, object->n_ops);
+
+ _debug("woke %d ops on OBJ%x",
+ object->n_in_progress, object->debug_id);
+}
+
+/*
+ * release an operation
+ * - queues pending ops if this is the last in-progress op
+ */
+void fscache_put_operation(struct fscache_operation *op)
+{
+ struct fscache_object *object;
+ struct fscache_cache *cache;
+
+ _enter("{OBJ%x OP%x,%d}",
+ op->object->debug_id, op->debug_id, atomic_read(&op->usage));
+
+ ASSERTCMP(atomic_read(&op->usage), >, 0);
+
+ if (!atomic_dec_and_test(&op->usage))
+ return;
+
+ _debug("PUT OP");
+ if (test_and_set_bit(FSCACHE_OP_DEAD, &op->flags))
+ BUG();
+
+ fscache_stat(&fscache_n_op_release);
+
+ if (op->release) {
+ op->release(op);
+ op->release = NULL;
+ }
+
+ object = op->object;
+
+ /* now... we may get called with the object spinlock held, so we
+ * complete the cleanup here only if we can immediately acquire the
+ * lock, and defer it otherwise */
+ if (!spin_trylock(&object->lock)) {
+ _debug("defer put");
+ fscache_stat(&fscache_n_op_deferred_release);
+
+ cache = object->cache;
+ spin_lock(&cache->op_gc_list_lock);
+ list_add_tail(&op->pend_link, &cache->op_gc_list);
+ spin_unlock(&cache->op_gc_list_lock);
+ schedule_work(&cache->op_gc);
+ _leave(" [defer]");
+ return;
+ }
+
+ if (test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags)) {
+ ASSERTCMP(object->n_exclusive, >, 0);
+ object->n_exclusive--;
+ }
+
+ ASSERTCMP(object->n_in_progress, >, 0);
+ object->n_in_progress--;
+ if (object->n_in_progress == 0)
+ fscache_start_operations(object);
+
+ ASSERTCMP(object->n_ops, >, 0);
+ object->n_ops--;
+ if (object->n_ops == 0)
+ fscache_raise_event(object, FSCACHE_OBJECT_EV_CLEARED);
+
+ spin_unlock(&object->lock);
+
+ kfree(op);
+ _leave(" [done]");
+}
+EXPORT_SYMBOL(fscache_put_operation);
+
+/*
+ * garbage collect operations that have had their release deferred
+ */
+void fscache_operation_gc(struct work_struct *work)
+{
+ struct fscache_operation *op;
+ struct fscache_object *object;
+ struct fscache_cache *cache =
+ container_of(work, struct fscache_cache, op_gc);
+ int count = 0;
+
+ _enter("");
+
+ do {
+ spin_lock(&cache->op_gc_list_lock);
+ if (list_empty(&cache->op_gc_list)) {
+ spin_unlock(&cache->op_gc_list_lock);
+ break;
+ }
+
+ op = list_entry(cache->op_gc_list.next,
+ struct fscache_operation, pend_link);
+ list_del(&op->pend_link);
+ spin_unlock(&cache->op_gc_list_lock);
+
+ object = op->object;
+
+ _debug("GC DEFERRED REL OBJ%x OP%x",
+ object->debug_id, op->debug_id);
+ fscache_stat(&fscache_n_op_gc);
+
+ ASSERTCMP(atomic_read(&op->usage), ==, 0);
+
+ spin_lock(&object->lock);
+ if (test_bit(FSCACHE_OP_EXCLUSIVE, &op->flags)) {
+ ASSERTCMP(object->n_exclusive, >, 0);
+ object->n_exclusive--;
+ }
+
+ ASSERTCMP(object->n_in_progress, >, 0);
+ object->n_in_progress--;
+ if (object->n_in_progress == 0)
+ fscache_start_operations(object);
+
+ ASSERTCMP(object->n_ops, >, 0);
+ object->n_ops--;
+ if (object->n_ops == 0)
+ fscache_raise_event(object, FSCACHE_OBJECT_EV_CLEARED);
+
+ spin_unlock(&object->lock);
+
+ } while (count++ < 20);
+
+ if (!list_empty(&cache->op_gc_list))
+ schedule_work(&cache->op_gc);
+
+ _leave("");
+}
+
+/*
+ * allow the slow work item processor to get a ref on an operation
+ */
+static int fscache_op_get_ref(struct slow_work *work)
+{
+ struct fscache_operation *op =
+ container_of(work, struct fscache_operation, slow_work);
+
+ atomic_inc(&op->usage);
+ return 0;
+}
+
+/*
+ * allow the slow work item processor to discard a ref on an operation
+ */
+static void fscache_op_put_ref(struct slow_work *work)
+{
+ struct fscache_operation *op =
+ container_of(work, struct fscache_operation, slow_work);
+
+ fscache_put_operation(op);
+}
+
+/*
+ * execute an operation using the slow thread pool to provide processing context
+ * - the caller holds a ref to this object, so we don't need to hold one
+ */
+static void fscache_op_execute(struct slow_work *work)
+{
+ struct fscache_operation *op =
+ container_of(work, struct fscache_operation, slow_work);
+ unsigned long start;
+
+ _enter("{OBJ%x OP%x,%d}",
+ op->object->debug_id, op->debug_id, atomic_read(&op->usage));
+
+ ASSERT(op->processor != NULL);
+ start = jiffies;
+ op->processor(op);
+ fscache_hist(fscache_ops_histogram, start);
+
+ _leave("");
+}
+
+const struct slow_work_ops fscache_op_slow_work_ops = {
+ .get_ref = fscache_op_get_ref,
+ .put_ref = fscache_op_put_ref,
+ .execute = fscache_op_execute,
+};

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:05 UTC
Permalink
Implement the data I/O part of the FS-Cache netfs API. The documentation and
API header file were added in a previous patch.

This patch implements the following functions for the netfs to call:

(*) fscache_attr_changed().

Indicate that the object has changed its attributes. The only attribute
currently recorded is the file size. Only pages within the set file size
will be stored in the cache.

This operation is submitted for asynchronous processing, and will return
immediately. It will return -ENOMEM if an out of memory error is
encountered, -ENOBUFS if the object is not actually cached, or 0 if the
operation is successfully queued.

(*) fscache_read_or_alloc_page().
(*) fscache_read_or_alloc_pages().

Request data be fetched from the disk, and allocate internal metadata to
track the netfs pages and reserve disk space for unknown pages.

These operations perform semi-asynchronous data reads. Upon returning
they will indicate which pages they think can be retrieved from disk, and
will have set in progress attempts to retrieve those pages.

These will return, in order of preference, -ENOMEM on memory allocation
error, -ERESTARTSYS if a signal interrupted proceedings, -ENODATA if one
or more requested pages are not yet cached, -ENOBUFS if the object is not
actually cached or if there isn't space for future pages to be cached on
this object, or 0 if successful.

In the case of the multipage function, the pages for which reads are set
in progress will be removed from the list and the page count decreased
appropriately.

If any read operations should fail, the completion function will be given
an error, and will also be passed contextual information to allow the
netfs to fall back to querying the server for the absent pages.

For each successful read, the page completion function will also be
called.

Any pages subsequently tracked by the cache will have PG_fscache set upon
them on return. fscache_uncache_page() must be called for such pages.

If supplied by the netfs, the mark_pages_cached() cookie op will be
invoked for any pages now tracked.

(*) fscache_alloc_page().

Allocate internal metadata to track a netfs page and reserve disk space.

This will return -ENOMEM on memory allocation error, -ERESTARTSYS on
signal, -ENOBUFS if the object isn't cached, or there isn't enough space
in the cache, or 0 if successful.

Any pages subsequently tracked by the cache will have PG_fscache set upon
them on return. fscache_uncache_page() must be called for such pages.

If supplied by the netfs, the mark_pages_cached() cookie op will be
invoked for any pages now tracked.

(*) fscache_write_page().

Request data be stored to disk. This may only be called on pages that
have been read or alloc'd by the above three functions and have not yet
been uncached.

This will return -ENOMEM on memory allocation error, -ERESTARTSYS on
signal, -ENOBUFS if the object isn't cached, or there isn't immediately
enough space in the cache, or 0 if successful.

On a successful return, this operation will have queued the page for
asynchronous writing to the cache. The page will be returned with
PG_fscache_write set until the write completes one way or another. The
caller will not be notified if the write fails due to an I/O error. If
that happens, the object will become available and all pending writes will
be aborted.

Note that the cache may batch up page writes, and so it may take a while
to get around to writing them out.

The caller must assume that until PG_fscache_write is cleared the page is
use by the cache. Any changes made to the page may be reflected on disk.
The page may even be under DMA.

(*) fscache_uncache_page().

Indicate that the cache should stop tracking a page previously read or
alloc'd from the cache. If the page was alloc'd only, but unwritten, it
will not appear on disk.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/fscache/Makefile | 3
fs/fscache/internal.h | 21 +
fs/fscache/page.c | 771 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fscache.h | 46 ++-
4 files changed, 835 insertions(+), 6 deletions(-)
create mode 100644 fs/fscache/page.c


diff --git a/fs/fscache/Makefile b/fs/fscache/Makefile
index 6f82da2..91571b9 100644
--- a/fs/fscache/Makefile
+++ b/fs/fscache/Makefile
@@ -9,7 +9,8 @@ fscache-y := \
main.o \
netfs.o \
object.o \
- operation.o
+ operation.o \
+ page.o

fscache-$(CONFIG_PROC_FS) += proc.o
fscache-$(CONFIG_FSCACHE_STATS) += stats.o
diff --git a/fs/fscache/internal.h b/fs/fscache/internal.h
index 014a830..e0cbd16 100644
--- a/fs/fscache/internal.h
+++ b/fs/fscache/internal.h
@@ -229,6 +229,27 @@ static inline void fscache_cookie_put(struct fscache_cookie *cookie)
__fscache_cookie_put(cookie);
}

+/*
+ * get an extra reference to a netfs retrieval context
+ */
+static inline
+void *fscache_get_context(struct fscache_cookie *cookie, void *context)
+{
+ if (cookie->def->get_context)
+ cookie->def->get_context(cookie->netfs_data, context);
+ return context;
+}
+
+/*
+ * release a reference to a netfs retrieval context
+ */
+static inline
+void fscache_put_context(struct fscache_cookie *cookie, void *context)
+{
+ if (cookie->def->put_context)
+ cookie->def->put_context(cookie->netfs_data, context);
+}
+
/*****************************************************************************/
/*
* debug tracing
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
new file mode 100644
index 0000000..512ec2c
--- /dev/null
+++ b/fs/fscache/page.c
@@ -0,0 +1,771 @@
+/* Cache page management and data I/O routines
+ *
+ * Copyright (C) 2004-2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#define FSCACHE_DEBUG_LEVEL PAGE
+#include <linux/module.h>
+#include <linux/fscache-cache.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include "internal.h"
+
+/*
+ * actually apply the changed attributes to a cache object
+ */
+static void fscache_attr_changed_op(struct fscache_operation *op)
+{
+ struct fscache_object *object = op->object;
+
+ _enter("{OBJ%x OP%x}", object->debug_id, op->debug_id);
+
+ fscache_stat(&fscache_n_attr_changed_calls);
+
+ if (fscache_object_is_active(object) &&
+ object->cache->ops->attr_changed(object) < 0)
+ fscache_abort_object(object);
+
+ _leave("");
+}
+
+/*
+ * notification that the attributes on an object have changed
+ */
+int __fscache_attr_changed(struct fscache_cookie *cookie)
+{
+ struct fscache_operation *op;
+ struct fscache_object *object;
+
+ _enter("%p", cookie);
+
+ ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+
+ fscache_stat(&fscache_n_attr_changed);
+
+ op = kzalloc(sizeof(*op), GFP_KERNEL);
+ if (!op) {
+ fscache_stat(&fscache_n_attr_changed_nomem);
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+ }
+
+ fscache_operation_init(op, NULL);
+ fscache_operation_init_slow(op, fscache_attr_changed_op);
+ op->flags = FSCACHE_OP_SLOW | (1 << FSCACHE_OP_EXCLUSIVE);
+
+ spin_lock(&cookie->lock);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs;
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ if (fscache_submit_exclusive_op(object, op) < 0)
+ goto nobufs;
+ spin_unlock(&cookie->lock);
+ fscache_stat(&fscache_n_attr_changed_ok);
+ fscache_put_operation(op);
+ _leave(" = 0");
+ return 0;
+
+nobufs:
+ spin_unlock(&cookie->lock);
+ kfree(op);
+ fscache_stat(&fscache_n_attr_changed_nobufs);
+ _leave(" = %d", -ENOBUFS);
+ return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_attr_changed);
+
+/*
+ * handle secondary execution given to a retrieval op on behalf of the
+ * cache
+ */
+static void fscache_retrieval_work(struct work_struct *work)
+{
+ struct fscache_retrieval *op =
+ container_of(work, struct fscache_retrieval, op.fast_work);
+ unsigned long start;
+
+ _enter("{OP%x}", op->op.debug_id);
+
+ start = jiffies;
+ op->op.processor(&op->op);
+ fscache_hist(fscache_ops_histogram, start);
+ fscache_put_operation(&op->op);
+}
+
+/*
+ * release a retrieval op reference
+ */
+static void fscache_release_retrieval_op(struct fscache_operation *_op)
+{
+ struct fscache_retrieval *op =
+ container_of(_op, struct fscache_retrieval, op);
+
+ _enter("{OP%x}", op->op.debug_id);
+
+ fscache_hist(fscache_retrieval_histogram, op->start_time);
+ if (op->context)
+ fscache_put_context(op->op.object->cookie, op->context);
+
+ _leave("");
+}
+
+/*
+ * allocate a retrieval op
+ */
+static struct fscache_retrieval *fscache_alloc_retrieval(
+ struct address_space *mapping,
+ fscache_rw_complete_t end_io_func,
+ void *context)
+{
+ struct fscache_retrieval *op;
+
+ /* allocate a retrieval operation and attempt to submit it */
+ op = kzalloc(sizeof(*op), GFP_NOIO);
+ if (!op) {
+ fscache_stat(&fscache_n_retrievals_nomem);
+ return NULL;
+ }
+
+ fscache_operation_init(&op->op, fscache_release_retrieval_op);
+ op->op.flags = FSCACHE_OP_MYTHREAD | (1 << FSCACHE_OP_WAITING);
+ op->mapping = mapping;
+ op->end_io_func = end_io_func;
+ op->context = context;
+ op->start_time = jiffies;
+ INIT_WORK(&op->op.fast_work, fscache_retrieval_work);
+ INIT_LIST_HEAD(&op->to_do);
+ return op;
+}
+
+/*
+ * wait for a deferred lookup to complete
+ */
+static int fscache_wait_for_deferred_lookup(struct fscache_cookie *cookie)
+{
+ unsigned long jif;
+
+ _enter("");
+
+ if (!test_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags)) {
+ _leave(" = 0 [imm]");
+ return 0;
+ }
+
+ fscache_stat(&fscache_n_retrievals_wait);
+
+ jif = jiffies;
+ if (wait_on_bit(&cookie->flags, FSCACHE_COOKIE_LOOKING_UP,
+ fscache_wait_bit_interruptible,
+ TASK_INTERRUPTIBLE) != 0) {
+ fscache_stat(&fscache_n_retrievals_intr);
+ _leave(" = -ERESTARTSYS");
+ return -ERESTARTSYS;
+ }
+
+ ASSERT(!test_bit(FSCACHE_COOKIE_LOOKING_UP, &cookie->flags));
+
+ smp_rmb();
+ fscache_hist(fscache_retrieval_delay_histogram, jif);
+ _leave(" = 0 [dly]");
+ return 0;
+}
+
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - we return:
+ * -ENOMEM - out of memory, nothing done
+ * -ERESTARTSYS - interrupted
+ * -ENOBUFS - no backing object available in which to cache the block
+ * -ENODATA - no data available in the backing object for this block
+ * 0 - dispatched a read - it'll call end_io_func() when finished
+ */
+int __fscache_read_or_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp)
+{
+ struct fscache_retrieval *op;
+ struct fscache_object *object;
+ int ret;
+
+ _enter("%p,%p,,,", cookie, page);
+
+ fscache_stat(&fscache_n_retrievals);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs;
+
+ ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+ ASSERTCMP(page, !=, NULL);
+
+ if (fscache_wait_for_deferred_lookup(cookie) < 0)
+ return -ERESTARTSYS;
+
+ op = fscache_alloc_retrieval(page->mapping, end_io_func, context);
+ if (!op) {
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+ }
+
+ spin_lock(&cookie->lock);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs_unlock;
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ ASSERTCMP(object->state, >, FSCACHE_OBJECT_LOOKING_UP);
+
+ if (fscache_submit_op(object, &op->op) < 0)
+ goto nobufs_unlock;
+ spin_unlock(&cookie->lock);
+
+ fscache_stat(&fscache_n_retrieval_ops);
+
+ /* pin the netfs read context in case we need to do the actual netfs
+ * read because we've encountered a cache read failure */
+ fscache_get_context(object->cookie, op->context);
+
+ /* we wait for the operation to become active, and then process it
+ * *here*, in this thread, and not in the thread pool */
+ if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
+ _debug(">>> WT");
+ fscache_stat(&fscache_n_retrieval_op_waits);
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ _debug("<<< GO");
+ }
+
+ /* ask the cache to honour the operation */
+ if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags)) {
+ ret = object->cache->ops->allocate_page(op, page, gfp);
+ if (ret == 0)
+ ret = -ENODATA;
+ } else {
+ ret = object->cache->ops->read_or_alloc_page(op, page, gfp);
+ }
+
+ if (ret == -ENOMEM)
+ fscache_stat(&fscache_n_retrievals_nomem);
+ else if (ret == -ERESTARTSYS)
+ fscache_stat(&fscache_n_retrievals_intr);
+ else if (ret == -ENODATA)
+ fscache_stat(&fscache_n_retrievals_nodata);
+ else if (ret < 0)
+ fscache_stat(&fscache_n_retrievals_nobufs);
+ else
+ fscache_stat(&fscache_n_retrievals_ok);
+
+ fscache_put_retrieval(op);
+ _leave(" = %d", ret);
+ return ret;
+
+nobufs_unlock:
+ spin_unlock(&cookie->lock);
+ kfree(op);
+nobufs:
+ fscache_stat(&fscache_n_retrievals_nobufs);
+ _leave(" = -ENOBUFS");
+ return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_read_or_alloc_page);
+
+/*
+ * read a list of page from the cache or allocate a block in which to store
+ * them
+ * - we return:
+ * -ENOMEM - out of memory, some pages may be being read
+ * -ERESTARTSYS - interrupted, some pages may be being read
+ * -ENOBUFS - no backing object or space available in which to cache any
+ * pages not being read
+ * -ENODATA - no data available in the backing object for some or all of
+ * the pages
+ * 0 - dispatched a read on all pages
+ *
+ * end_io_func() will be called for each page read from the cache as it is
+ * finishes being read
+ *
+ * any pages for which a read is dispatched will be removed from pages and
+ * nr_pages
+ */
+int __fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ fscache_rw_complete_t end_io_func,
+ void *context,
+ gfp_t gfp)
+{
+ fscache_pages_retrieval_func_t func;
+ struct fscache_retrieval *op;
+ struct fscache_object *object;
+ int ret;
+
+ _enter("%p,,%d,,,", cookie, *nr_pages);
+
+ fscache_stat(&fscache_n_retrievals);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs;
+
+ ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+ ASSERTCMP(*nr_pages, >, 0);
+ ASSERT(!list_empty(pages));
+
+ if (fscache_wait_for_deferred_lookup(cookie) < 0)
+ return -ERESTARTSYS;
+
+ op = fscache_alloc_retrieval(mapping, end_io_func, context);
+ if (!op)
+ return -ENOMEM;
+
+ spin_lock(&cookie->lock);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs_unlock;
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ if (fscache_submit_op(object, &op->op) < 0)
+ goto nobufs_unlock;
+ spin_unlock(&cookie->lock);
+
+ fscache_stat(&fscache_n_retrieval_ops);
+
+ /* pin the netfs read context in case we need to do the actual netfs
+ * read because we've encountered a cache read failure */
+ fscache_get_context(object->cookie, op->context);
+
+ /* we wait for the operation to become active, and then process it
+ * *here*, in this thread, and not in the thread pool */
+ if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
+ _debug(">>> WT");
+ fscache_stat(&fscache_n_retrieval_op_waits);
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ _debug("<<< GO");
+ }
+
+ /* ask the cache to honour the operation */
+ if (test_bit(FSCACHE_COOKIE_NO_DATA_YET, &object->cookie->flags))
+ func = object->cache->ops->allocate_pages;
+ else
+ func = object->cache->ops->read_or_alloc_pages;
+ ret = func(op, pages, nr_pages, gfp);
+
+ if (ret == -ENOMEM)
+ fscache_stat(&fscache_n_retrievals_nomem);
+ else if (ret == -ERESTARTSYS)
+ fscache_stat(&fscache_n_retrievals_intr);
+ else if (ret == -ENODATA)
+ fscache_stat(&fscache_n_retrievals_nodata);
+ else if (ret < 0)
+ fscache_stat(&fscache_n_retrievals_nobufs);
+ else
+ fscache_stat(&fscache_n_retrievals_ok);
+
+ fscache_put_retrieval(op);
+ _leave(" = %d", ret);
+ return ret;
+
+nobufs_unlock:
+ spin_unlock(&cookie->lock);
+ kfree(op);
+nobufs:
+ fscache_stat(&fscache_n_retrievals_nobufs);
+ _leave(" = -ENOBUFS");
+ return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_read_or_alloc_pages);
+
+/*
+ * allocate a block in the cache on which to store a page
+ * - we return:
+ * -ENOMEM - out of memory, nothing done
+ * -ERESTARTSYS - interrupted
+ * -ENOBUFS - no backing object available in which to cache the block
+ * 0 - block allocated
+ */
+int __fscache_alloc_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp)
+{
+ struct fscache_retrieval *op;
+ struct fscache_object *object;
+ int ret;
+
+ _enter("%p,%p,,,", cookie, page);
+
+ fscache_stat(&fscache_n_allocs);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs;
+
+ ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+ ASSERTCMP(page, !=, NULL);
+
+ if (fscache_wait_for_deferred_lookup(cookie) < 0)
+ return -ERESTARTSYS;
+
+ op = fscache_alloc_retrieval(page->mapping, NULL, NULL);
+ if (!op)
+ return -ENOMEM;
+
+ spin_lock(&cookie->lock);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs_unlock;
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ if (fscache_submit_op(object, &op->op) < 0)
+ goto nobufs_unlock;
+ spin_unlock(&cookie->lock);
+
+ fscache_stat(&fscache_n_alloc_ops);
+
+ if (test_bit(FSCACHE_OP_WAITING, &op->op.flags)) {
+ _debug(">>> WT");
+ fscache_stat(&fscache_n_alloc_op_waits);
+ wait_on_bit(&op->op.flags, FSCACHE_OP_WAITING,
+ fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+ _debug("<<< GO");
+ }
+
+ /* ask the cache to honour the operation */
+ ret = object->cache->ops->allocate_page(op, page, gfp);
+
+ if (ret < 0)
+ fscache_stat(&fscache_n_allocs_nobufs);
+ else
+ fscache_stat(&fscache_n_allocs_ok);
+
+ fscache_put_retrieval(op);
+ _leave(" = %d", ret);
+ return ret;
+
+nobufs_unlock:
+ spin_unlock(&cookie->lock);
+ kfree(op);
+nobufs:
+ fscache_stat(&fscache_n_allocs_nobufs);
+ _leave(" = -ENOBUFS");
+ return -ENOBUFS;
+}
+EXPORT_SYMBOL(__fscache_alloc_page);
+
+/*
+ * release a write op reference
+ */
+static void fscache_release_write_op(struct fscache_operation *_op)
+{
+ _enter("{OP%x}", _op->debug_id);
+}
+
+/*
+ * perform the background storage of a page into the cache
+ */
+static void fscache_write_op(struct fscache_operation *_op)
+{
+ struct fscache_storage *op =
+ container_of(_op, struct fscache_storage, op);
+ struct fscache_object *object = op->op.object;
+ struct page *page;
+ unsigned n;
+ void *results[1];
+ int ret;
+
+ _enter("{OP%x,%d}", op->op.debug_id, atomic_read(&op->op.usage));
+
+ spin_lock(&object->lock);
+
+ if (!fscache_object_is_active(object)) {
+ spin_unlock(&object->lock);
+ _leave("");
+ return;
+ }
+
+ fscache_stat(&fscache_n_store_calls);
+
+ /* find a page to store */
+ page = NULL;
+ n = radix_tree_gang_lookup(&object->stores, results, 0, 1);
+ if (n == 1) {
+ page = results[0];
+ _debug("gang %d [%lx]", n, page->index);
+ if (page->index <= op->store_limit)
+ radix_tree_delete(&object->stores, page->index);
+ else
+ goto superseded;
+ } else {
+ goto superseded;
+ }
+
+ spin_unlock(&object->lock);
+
+ if (page) {
+ ret = object->cache->ops->write_page(op, page);
+ end_page_fscache_write(page);
+ page_cache_release(page);
+ if (ret < 0)
+ fscache_abort_object(object);
+ else
+ fscache_enqueue_operation(&op->op);
+ }
+
+ _leave("");
+ return;
+
+superseded:
+ /* this writer is going away and there aren't any more things to
+ * write */
+ _debug("cease");
+ clear_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags);
+ spin_unlock(&object->lock);
+ _leave("");
+}
+
+/*
+ * request a page be stored in the cache
+ * - returns:
+ * -ENOMEM - out of memory, nothing done
+ * -ENOBUFS - no backing object available in which to cache the page
+ * 0 - dispatched a write - it'll call end_io_func() when finished
+ *
+ * if the cookie still has a backing object at this point, that object can be
+ * in one of a few states with respect to storage processing:
+ *
+ * (1) negative lookup, object not yet created (FSCACHE_COOKIE_CREATING is
+ * set)
+ *
+ * (a) no writes yet (set FSCACHE_COOKIE_PENDING_FILL and queue deferred
+ * fill op)
+ *
+ * (b) writes deferred till post-creation (mark page for writing and
+ * return immediately)
+ *
+ * (2) negative lookup, object created, initial fill being made from netfs
+ * (FSCACHE_COOKIE_INITIAL_FILL is set)
+ *
+ * (a) fill point not yet reached this page (mark page for writing and
+ * return)
+ *
+ * (b) fill point passed this page (queue op to store this page)
+ *
+ * (3) object extant (queue op to store this page)
+ *
+ * any other state is invalid
+ */
+int __fscache_write_page(struct fscache_cookie *cookie,
+ struct page *page,
+ gfp_t gfp)
+{
+ struct fscache_storage *op;
+ struct fscache_object *object;
+ int ret;
+
+ _enter("%p,%x,", cookie, (u32) page->flags);
+
+ ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+ ASSERT(PageFsCache(page));
+
+ fscache_stat(&fscache_n_stores);
+
+ op = kzalloc(sizeof(*op), GFP_NOIO);
+ if (!op)
+ goto nomem;
+
+ fscache_operation_init(&op->op, fscache_release_write_op);
+ fscache_operation_init_slow(&op->op, fscache_write_op);
+ op->op.flags = FSCACHE_OP_SLOW | (1 << FSCACHE_OP_WAITING);
+
+ ret = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
+ if (ret < 0)
+ goto nomem_free;
+
+ ret = -ENOBUFS;
+ spin_lock(&cookie->lock);
+
+ if (hlist_empty(&cookie->backing_objects))
+ goto nobufs;
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+ if (test_bit(FSCACHE_IOERROR, &object->cache->flags))
+ goto nobufs;
+
+ /* add the page to the pending-storage radix tree on the backing
+ * object */
+ spin_lock(&object->lock);
+
+ _debug("store limit %llx", (unsigned long long) object->store_limit);
+
+ ret = radix_tree_insert(&object->stores, page->index, page);
+ if (ret < 0) {
+ if (ret == -EEXIST)
+ goto already_queued;
+ _debug("insert failed %d", ret);
+ goto nobufs_unlock_obj;
+ }
+
+ page_cache_get(page);
+ if (TestSetPageFsCacheWrite(page))
+ BUG();
+
+ /* we only want one writer at a time, but we do need to queue new
+ * writers after exclusive ops */
+ if (test_and_set_bit(FSCACHE_OBJECT_PENDING_WRITE, &object->flags))
+ goto already_pending;
+
+ spin_unlock(&object->lock);
+
+ op->op.debug_id = atomic_inc_return(&fscache_op_debug_id);
+ op->store_limit = object->store_limit;
+
+ if (fscache_submit_op(object, &op->op) < 0)
+ goto submit_failed;
+
+ spin_unlock(&cookie->lock);
+ radix_tree_preload_end();
+ fscache_stat(&fscache_n_store_ops);
+ fscache_stat(&fscache_n_stores_ok);
+
+ /* the slow work queue now carries its own ref on the object */
+ fscache_put_operation(&op->op);
+ _leave(" = 0");
+ return 0;
+
+already_queued:
+ fscache_stat(&fscache_n_stores_again);
+already_pending:
+ spin_unlock(&object->lock);
+ spin_unlock(&cookie->lock);
+ radix_tree_preload_end();
+ kfree(op);
+ fscache_stat(&fscache_n_stores_ok);
+ _leave(" = 0");
+ return 0;
+
+submit_failed:
+ radix_tree_delete(&object->stores, page->index);
+ end_page_fscache_write(page);
+ page_cache_release(page);
+ ret = -ENOBUFS;
+ goto nobufs;
+
+nobufs_unlock_obj:
+ spin_unlock(&object->lock);
+nobufs:
+ spin_unlock(&cookie->lock);
+ radix_tree_preload_end();
+ kfree(op);
+ fscache_stat(&fscache_n_stores_nobufs);
+ _leave(" = -ENOBUFS");
+ return -ENOBUFS;
+
+nomem_free:
+ kfree(op);
+nomem:
+ fscache_stat(&fscache_n_stores_oom);
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+}
+EXPORT_SYMBOL(__fscache_write_page);
+
+/*
+ * remove a page from the cache
+ */
+void __fscache_uncache_page(struct fscache_cookie *cookie, struct page *page)
+{
+ struct fscache_object *object;
+
+ _enter(",%p", page);
+
+ ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
+ ASSERTCMP(page, !=, NULL);
+
+ fscache_stat(&fscache_n_uncaches);
+
+ /* cache withdrawal may beat us to it */
+ if (!PageFsCache(page))
+ goto done;
+
+ /* get the object */
+ spin_lock(&cookie->lock);
+
+ if (hlist_empty(&cookie->backing_objects)) {
+ ClearPageFsCache(page);
+ goto done_unlock;
+ }
+
+ object = hlist_entry(cookie->backing_objects.first,
+ struct fscache_object, cookie_link);
+
+ /* there might now be stuff on disk we could read */
+ clear_bit(FSCACHE_COOKIE_NO_DATA_YET, &cookie->flags);
+
+ /* only invoke the cache backend if we managed to mark the page
+ * uncached here; this deals with synchronisation vs withdrawal */
+ if (TestClearPageFsCache(page) &&
+ object->cache->ops->uncache_page) {
+ /* the cache backend releases the cookie lock */
+ object->cache->ops->uncache_page(object, page);
+ goto done;
+ }
+
+done_unlock:
+ spin_unlock(&cookie->lock);
+done:
+ _leave("");
+}
+EXPORT_SYMBOL(__fscache_uncache_page);
+
+/**
+ * fscache_mark_pages_cached - Mark pages as being cached
+ * @op: The retrieval op pages are being marked for
+ * @pagevec: The pages to be marked
+ *
+ * Mark a bunch of netfs pages as being cached. After this is called,
+ * the netfs must call fscache_uncache_page() to remove the mark.
+ */
+void fscache_mark_pages_cached(struct fscache_retrieval *op,
+ struct pagevec *pagevec)
+{
+ struct fscache_cookie *cookie = op->op.object->cookie;
+ unsigned long loop;
+
+#ifdef CONFIG_FSCACHE_STATS
+ atomic_add(pagevec->nr, &fscache_n_marks);
+#endif
+
+ for (loop = 0; loop < pagevec->nr; loop++) {
+ struct page *page = pagevec->pages[loop];
+
+ _debug("- mark %p{%lx}", page, page->index);
+ if (TestSetPageFsCache(page)) {
+ static bool once_only;
+ if (!once_only) {
+ once_only = true;
+ printk(KERN_WARNING "FS-Cache:"
+ " Cookie type %s marked page %lx"
+ " multiple times\n",
+ cookie->def->name, page->index);
+ }
+ }
+ }
+
+ if (cookie->def->mark_pages_cached)
+ cookie->def->mark_pages_cached(cookie->netfs_data,
+ op->mapping, pagevec);
+ pagevec_reinit(pagevec);
+}
+EXPORT_SYMBOL(fscache_mark_pages_cached);
diff --git a/include/linux/fscache.h b/include/linux/fscache.h
index 7e4fd5c..006c919 100644
--- a/include/linux/fscache.h
+++ b/include/linux/fscache.h
@@ -198,6 +198,22 @@ extern struct fscache_cookie *__fscache_acquire_cookie(
void *);
extern void __fscache_relinquish_cookie(struct fscache_cookie *, int);
extern void __fscache_update_cookie(struct fscache_cookie *);
+extern int __fscache_attr_changed(struct fscache_cookie *);
+extern int __fscache_read_or_alloc_page(struct fscache_cookie *,
+ struct page *,
+ fscache_rw_complete_t,
+ void *,
+ gfp_t);
+extern int __fscache_read_or_alloc_pages(struct fscache_cookie *,
+ struct address_space *,
+ struct list_head *,
+ unsigned *,
+ fscache_rw_complete_t,
+ void *,
+ gfp_t);
+extern int __fscache_alloc_page(struct fscache_cookie *, struct page *, gfp_t);
+extern int __fscache_write_page(struct fscache_cookie *, struct page *, gfp_t);
+extern void __fscache_uncache_page(struct fscache_cookie *, struct page *);

/**
* fscache_register_netfs - Register a filesystem as desiring caching services
@@ -375,7 +391,10 @@ void fscache_unpin_cookie(struct fscache_cookie *cookie)
static inline
int fscache_attr_changed(struct fscache_cookie *cookie)
{
- return -ENOBUFS;
+ if (fscache_cookie_valid(cookie))
+ return __fscache_attr_changed(cookie);
+ else
+ return -ENOBUFS;
}

/**
@@ -432,7 +451,11 @@ int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
void *context,
gfp_t gfp)
{
- return -ENOBUFS;
+ if (fscache_cookie_valid(cookie))
+ return __fscache_read_or_alloc_page(cookie, page, end_io_func,
+ context, gfp);
+ else
+ return -ENOBUFS;
}

/**
@@ -478,7 +501,12 @@ int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
void *context,
gfp_t gfp)
{
- return -ENOBUFS;
+ if (fscache_cookie_valid(cookie))
+ return __fscache_read_or_alloc_pages(cookie, mapping, pages,
+ nr_pages, end_io_func,
+ context, gfp);
+ else
+ return -ENOBUFS;
}

/**
@@ -504,7 +532,10 @@ int fscache_alloc_page(struct fscache_cookie *cookie,
struct page *page,
gfp_t gfp)
{
- return -ENOBUFS;
+ if (fscache_cookie_valid(cookie))
+ return __fscache_alloc_page(cookie, page, gfp);
+ else
+ return -ENOBUFS;
}

/**
@@ -530,7 +561,10 @@ int fscache_write_page(struct fscache_cookie *cookie,
struct page *page,
gfp_t gfp)
{
- return -ENOBUFS;
+ if (fscache_cookie_valid(cookie))
+ return __fscache_write_page(cookie, page, gfp);
+ else
+ return -ENOBUFS;
}

/**
@@ -551,6 +585,8 @@ static inline
void fscache_uncache_page(struct fscache_cookie *cookie,
struct page *page)
{
+ if (fscache_cookie_valid(cookie))
+ __fscache_uncache_page(cookie, page);
}

#endif /* _LINUX_FSCACHE_H */

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:10 UTC
Permalink
Change all the usages of file->f_mapping in ext3_*write_end() functions to use
the mapping argument directly. This has two consequences:

(*) Consistency. Without this patch sometimes one is used and sometimes the
other is.

(*) A NULL file pointer can be passed. This feature is then made use of by
the generic hook in the next patch, which is used by CacheFiles to write
pages to a file without setting up a file struct.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/ext3/inode.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)


diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 4a09ff1..d897033 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1231,7 +1231,7 @@ static int ext3_generic_write_end(struct file *file,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
{
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = mapping->host;

copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

@@ -1256,7 +1256,7 @@ static int ext3_ordered_write_end(struct file *file,
struct page *page, void *fsdata)
{
handle_t *handle = ext3_journal_current_handle();
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = mapping->host;
unsigned from, to;
int ret = 0, ret2;

@@ -1298,7 +1298,7 @@ static int ext3_writeback_write_end(struct file *file,
struct page *page, void *fsdata)
{
handle_t *handle = ext3_journal_current_handle();
- struct inode *inode = file->f_mapping->host;
+ struct inode *inode = mapping->host;
int ret = 0, ret2;
loff_t new_i_size;


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Rik van Riel
2009-04-02 18:52:43 UTC
Permalink
Post by David Howells
Change all the usages of file->f_mapping in ext3_*write_end() functions to use
(*) Consistency. Without this patch sometimes one is used and sometimes the
other is.
(*) A NULL file pointer can be passed. This feature is then made use of by
the generic hook in the next patch, which is used by CacheFiles to write
pages to a file without setting up a file struct.
Acked-by: Rik van Riel <***@redhat.com>
David Howells
2009-04-01 23:05:15 UTC
Permalink
Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised). The data source is a single page.

This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.

Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.

Hook the Ext2 and Ext3 operations to the generic implementation.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/ext2/inode.c | 2 ++
fs/ext3/inode.c | 3 +++
include/linux/fs.h | 7 ++++++
mm/filemap.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 73 insertions(+), 0 deletions(-)


diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b43b955..5d17070 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -817,6 +817,8 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
+ .write_one_page = generic_file_buffered_write_one_page,
};

/*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index d897033..a2e5fef 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1803,6 +1803,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.direct_IO = ext3_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
+ .write_one_page = generic_file_buffered_write_one_page,
};

static const struct address_space_operations ext3_writeback_aops = {
@@ -1818,6 +1819,7 @@ static const struct address_space_operations ext3_writeback_aops = {
.direct_IO = ext3_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
+ .write_one_page = generic_file_buffered_write_one_page,
};

static const struct address_space_operations ext3_journalled_aops = {
@@ -1832,6 +1834,7 @@ static const struct address_space_operations ext3_journalled_aops = {
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.is_partially_uptodate = block_is_partially_uptodate,
+ .write_one_page = generic_file_buffered_write_one_page,
};

void ext3_set_aops(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 61211ad..df93cdf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -532,6 +532,11 @@ struct address_space_operations {
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
+
+ /* write the contents of the source page over the page at the specified
+ * index in the target address space (the source page does not need to
+ * be related to the target address space) */
+ int (*write_one_page)(struct address_space *, pgoff_t, struct page *);
};

/*
@@ -2119,6 +2124,8 @@ extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern int generic_file_buffered_write_one_page(struct address_space *,
+ pgoff_t, struct page *);
extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
extern int generic_segment_checks(const struct iovec *iov,
diff --git a/mm/filemap.c b/mm/filemap.c
index 3c55c1a..d766c2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2320,6 +2320,67 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
}
EXPORT_SYMBOL(generic_file_buffered_write);

+/**
+ * generic_file_buffered_write_one_page - Write a single page of data to an
+ * inode
+ * @mapping - The address space of the target inode
+ * @index - The target page in the target inode to fill
+ * @source - The data to write into the target page
+ *
+ * Write the data from the source page to the page in the nominated address
+ * space at the @index specified. Note that the file will not be extended if
+ * the page crosses the EOF marker, in which case only the first part of the
+ * page will be written.
+ *
+ * The @source page does not need to have any association with the file or the
+ * target page offset.
+ */
+int generic_file_buffered_write_one_page(struct address_space *mapping,
+ pgoff_t index,
+ struct page *source)
+{
+ const struct address_space_operations *a_ops = mapping->a_ops;
+ struct page *page;
+ unsigned len;
+ loff_t isize, pos;
+ void *fsdata;
+ int ret;
+
+ pos = index;
+ pos <<= PAGE_CACHE_SHIFT;
+
+ len = PAGE_CACHE_SIZE;
+ isize = i_size_read(mapping->host);
+ if ((isize >> PAGE_CACHE_SHIFT) == index)
+ len = isize & (PAGE_CACHE_SIZE - 1);
+
+ ret = pagecache_write_begin(NULL, mapping, pos, len,
+ AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+ if (ret < 0)
+ goto sync;
+
+ copy_highpage(page, source);
+
+ ret = pagecache_write_end(NULL, mapping, pos, len, len, page, fsdata);
+ if (ret < 0)
+ goto sync;
+
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+
+sync:
+ /* the caller must handle O_SYNC themselves, but we handle S_SYNC and
+ * MS_SYNCHRONOUS here */
+ if (unlikely(IS_SYNC(mapping->host)) && !a_ops->writepage)
+ ret = generic_osync_inode(mapping->host, mapping,
+ OSYNC_METADATA | OSYNC_DATA);
+
+ /* the caller must handle O_DIRECT for themselves */
+
+ return ret;
+}
+EXPORT_SYMBOL(generic_file_buffered_write_one_page);
+
static ssize_t
__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nick Piggin
2009-04-02 14:00:04 UTC
Permalink
Post by David Howells
Add an address space operation to write one single page of data to an inode
at a page-aligned location (thus permitting the implementation to be highly
optimised). The data source is a single page.
This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.
Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.
Hook the Ext2 and Ext3 operations to the generic implementation.
Is this really worthwhile? What do you do that write_begin/write_end
doesn't? I have a patch around somewhere enable splice stealing into
the pagecache with write_begin/write_end if you are worried about the
copy. But how about you put this at the end of the series justified
with some numbers?
David Howells
2009-04-02 14:55:53 UTC
Permalink
Post by Nick Piggin
Post by David Howells
Hook the Ext2 and Ext3 operations to the generic implementation.
Is this really worthwhile?
I seem to remember being told that I'm not allowed to call such directly (by
Christoph Hellwig perhaps?), and that I should add an address space op to do
this.
Post by Nick Piggin
What do you do that write_begin/write_end doesn't?
This code calls write_begin() and write_end().

write_begin() and write_end() don't actually update the target page, and don't
so the osync stuff.

My code assumed file can be passed as NULL to write_begin() and write_end().
If file is, however, necessary then the backing fs can do something different.
cachefiles doesn't know.

Creating loads of file structs gives you ENFILE problems if you start opening
lots of files internally. This caused lots of argument when I proposed a
patch to allow the kernels to open files not on account.

I'm quite happy to ditch the extra address space op and put this code in
cachefiles itself. Doing it there would perhaps allow me to handle multiple
pages more easily.
Post by Nick Piggin
I have a patch around somewhere enable splice stealing into the pagecache
with write_begin/write_end if you are worried about the copy.
I'm not sure what you mean.
Post by Nick Piggin
But how about you put this at the end of the series justified with some
numbers?
I don't see how that helps. cachefiles has to do something to get the data
out.

David
Nick Piggin
2009-04-02 15:32:58 UTC
Permalink
Post by David Howells
Post by Nick Piggin
Post by David Howells
Hook the Ext2 and Ext3 operations to the generic implementation.
Is this really worthwhile?
I seem to remember being told that I'm not allowed to call such directly (by
Christoph Hellwig perhaps?), and that I should add an address space op to do
this.
Hmm, I guess not all filesystems define write_begin/write_end. But if you
only need to use ones that do define them?
Post by David Howells
Post by Nick Piggin
What do you do that write_begin/write_end doesn't?
This code calls write_begin() and write_end().
Yes, you framed the changelog as introducing this new callback because it
allows a highly optimised code that takes advantage of page aligned write.
So I went on a tangent thinking you were going to use it later to avoid
the data copy or something.
Post by David Howells
write_begin() and write_end() don't actually update the target page, and don't
so the osync stuff.
My code assumed file can be passed as NULL to write_begin() and write_end().
If file is, however, necessary then the backing fs can do something different.
cachefiles doesn't know.
You are knowingly squashing together fscache and the backing filesystem if
you do something like introduce a flag like PG_owner_priv_2 and disallow the
backing filesystem from reusing it. So at which point you don't have to keep
up illusions about being totally filesystem agnostic.

Many filesystems don't need the file argument to write_begin/write_end.
Christoph Hellwig
2009-04-02 16:37:12 UTC
Permalink
Post by Nick Piggin
Hmm, I guess not all filesystems define write_begin/write_end. But if you
only need to use ones that do define them?
No. write_begin/write_end are simply callbacks for the write helpers,
and locking for them is entirely filesystem-defined. E.g. xfs and the
cluster filesystems require additional locks taken first, and some
network filesystems require inode revalidations first.

They really should be taken out of the address_space_operations and
passed as callbacks to generic_file_aio_write & co.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Nick Piggin
2009-04-02 16:47:20 UTC
Permalink
Post by Christoph Hellwig
Post by Nick Piggin
Hmm, I guess not all filesystems define write_begin/write_end. But if you
only need to use ones that do define them?
No. write_begin/write_end are simply callbacks for the write helpers,
and locking for them is entirely filesystem-defined. E.g. xfs and the
cluster filesystems require additional locks taken first, and some
network filesystems require inode revalidations first.
Well they now are quite well filesystem defined. We no longer take
the page lock before calling them. Not saying it's perfect, but if
the backing fs is just using a known subset of ones that work
(like loop does).
Post by Christoph Hellwig
They really should be taken out of the address_space_operations and
passed as callbacks to generic_file_aio_write & co.
Probably yes. But it seems like it should have more discussion IMO
(unless it has already been had and I missed it).

I don't think "write_one_page" sounds like a particularly good new
API addition.

In the meantime, I would prefer cachefs use writebegin/end and it can
be converted to a new API with everyone else rather than having to
rush a new API in with it.
Christoph Hellwig
2009-04-02 16:55:05 UTC
Permalink
Post by Nick Piggin
Well they now are quite well filesystem defined. We no longer take
the page lock before calling them. Not saying it's perfect, but if
the backing fs is just using a known subset of ones that work
(like loop does).
The page lock doesn't matter. What matters is locks protecting the
io. Like the XFS iolock or cluster locks in the cluster filesystems,
and you will get silent data corruption that way.

I would have sworn loop was fixed by now as gfs people were seeing
these issues and submitting patches, but it looks like we never sorted
this out upstream.
Post by Nick Piggin
Probably yes. But it seems like it should have more discussion IMO
(unless it has already been had and I missed it).
This came up plenty of times.
Post by Nick Piggin
I don't think "write_one_page" sounds like a particularly good new
API addition.
I also thing it's not a nice one. I still haven't seen a really good
explanation of why it can't just use plain ->write
Nick Piggin
2009-04-02 17:07:54 UTC
Permalink
Post by Christoph Hellwig
Post by Nick Piggin
Well they now are quite well filesystem defined. We no longer take
the page lock before calling them. Not saying it's perfect, but if
the backing fs is just using a known subset of ones that work
(like loop does).
The page lock doesn't matter. What matters is locks protecting the
io. Like the XFS iolock or cluster locks in the cluster filesystems,
and you will get silent data corruption that way.
Hmm, I can see i_mutex being a problem, but can't see how a filesystem
takes any other locks down that chain?

Naturally a random in-kernel user misses other important things, so yes
a simple write sounds like the best option.
Post by Christoph Hellwig
Post by Nick Piggin
Probably yes. But it seems like it should have more discussion IMO
(unless it has already been had and I missed it).
This came up plenty of times.
I mean, unless the discussion agreed on write_one_page being the right
API to add, then it should not be added in fscache and fscache should
just take a workaround for the meantime.
Post by Christoph Hellwig
Post by Nick Piggin
I don't think "write_one_page" sounds like a particularly good new
API addition.
I also thing it's not a nice one. I still haven't seen a really good
explanation of why it can't just use plain ->write
Good question.
Jan Kara
2009-04-02 17:26:52 UTC
Permalink
Post by Nick Piggin
Post by Christoph Hellwig
Post by Nick Piggin
Well they now are quite well filesystem defined. We no longer take
the page lock before calling them. Not saying it's perfect, but if
the backing fs is just using a known subset of ones that work
(like loop does).
The page lock doesn't matter. What matters is locks protecting the
io. Like the XFS iolock or cluster locks in the cluster filesystems,
and you will get silent data corruption that way.
Hmm, I can see i_mutex being a problem, but can't see how a filesystem
takes any other locks down that chain?
Yes, i_mutex is one problem. Then filesystems may take other locks
in their ->aio_write callbacks - as Christoph mentioned, for example
OCFS2 has to do some network messaging to synchronize nodes in the
cluster accessing the file. I could imagine some clever filesystem doing
more clever locking than just one i_mutex covering the whole file...
Post by Nick Piggin
Naturally a random in-kernel user misses other important things, so yes
a simple write sounds like the best option.
Definitely. IMO it's hard to get the locking right without calling
->aio_write callback.

Honza
--
Jan Kara <***@suse.cz>
SuSE CR Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-02 17:22:32 UTC
Permalink
Post by Christoph Hellwig
Post by Nick Piggin
I don't think "write_one_page" sounds like a particularly good new
API addition.
I also thing it's not a nice one. I still haven't seen a really good
explanation of why it can't just use plain ->write
->write requires:

(1) A struct file. Refer to the ENFILE problem that my patch to do this
raised. The problem is that doing a tar of, say, a kernel tree
immediately opens ~30000 files internally, and then userspace falls over
with ENFILE all over the place. My tests multiply that by 8.

You (at least I think it was you) refused to countenance allocating file
structs internally to the kernel that weren't accounted.

(2) A pointer to a buffer. That means kmapping a page for the duration of
the write.

->write is quite heavy. For something like ext3 it goes down through quite a
few layers, in and out of the fs, VM and VFS. All this takes both time and
stack space.

I want to write a whole page from the kernel at a page-boundary in the file.
That means it is possible to use an extremely optimised aop to do that.

Really, what I want to do is have the filesystem issue its BIOs directly on
the netfs page, but the Ext3 people I talked to at the time thought that would
be too difficult to track.

David
David Howells
2009-04-02 18:18:11 UTC
Permalink
Post by Christoph Hellwig
Post by Nick Piggin
I don't think "write_one_page" sounds like a particularly good new
API addition.
I also thing it's not a nice one. I still haven't seen a really good
explanation of why it can't just use plain ->write
So, I can apply the attached patch. It opens a file each time it wants to
write a page and then closes it again, but it's the best way to avoid the
ENFILE problem. I'm benchmarking it at the moment.

David
---
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index 84bb8b7..a69787e 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -9,6 +9,8 @@
* 2 of the Licence, or (at your option) any later version.
*/

+#include <linux/mount.h>
+#include <linux/file.h>
#include "internal.h"

/*
@@ -796,7 +798,11 @@ int cachefiles_allocate_pages(struct fscache_retrieval *op,
int cachefiles_write_page(struct fscache_storage *op, struct page *page)
{
struct cachefiles_object *object;
- struct address_space *mapping;
+ struct cachefiles_cache *cache;
+ mm_segment_t old_fs;
+ struct file *file;
+ loff_t pos;
+ void *data;
int ret;

ASSERT(op != NULL);
@@ -814,13 +820,33 @@ int cachefiles_write_page(struct fscache_storage *op, struct page *page)

ASSERT(S_ISREG(object->backer->d_inode->i_mode));

- /* copy the page to ext3 and let it store it in its own time */
- mapping = object->backer->d_inode->i_mapping;
- ret = -EIO;
- if (mapping->a_ops->write_one_page) {
- ret = mapping->a_ops->write_one_page(mapping, page->index,
- page);
- _debug("write_one_page -> %d", ret);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ /* write the page to the backing filesystem and let it store it in its
+ * own time */
+ dget(object->backer);
+ mntget(cache->mnt);
+ file = dentry_open(object->backer, cache->mnt, O_RDWR,
+ cache->cache_cred);
+ if (IS_ERR(file)) {
+ ret = PTR_ERR(file);
+ } else {
+ ret = -EIO;
+ if (file->f_op->write) {
+ pos = (loff_t) page->index << PAGE_SHIFT;
+ data = kmap(page);
+ old_fs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = file->f_op->write(
+ file, (const void __user *) data, PAGE_SIZE,
+ &pos);
+ set_fs(old_fs);
+ kunmap(page);
+ if (ret != PAGE_SIZE)
+ ret = -EIO;
+ }
+ fput(file);
}

if (ret < 0) {
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-02 22:44:19 UTC
Permalink
Post by David Howells
So, I can apply the attached patch. It opens a file each time it wants to
write a page and then closes it again, but it's the best way to avoid the
ENFILE problem. I'm benchmarking it at the moment.
Okay. I'm not sure why, but using ->write() in that patch appears to be _much_
better than using ->write_one_page(), despite the latter doing much less work.

The benchmark was:

mke2fs -j -O dir_index -b 4096 /dev/sda6
tune2fs -o user_xattr /dev/sda6
setenforce 0
mount /dev/sda6 /var/fscache/ -o user_xattr
insmod /tmp/sunrpc.ko
insmod /tmp/lockd.ko
insmod /tmp/nfs_acl.ko
insmod /tmp/auth_rpcgss.ko
insmod /tmp/fscache.ko
insmod /tmp/nfs.ko
insmod /tmp/cachefiles.ko
sync
service cachefilesd start
sync
mount warthog:/warthog /warthog -o fsc -t nfs4
date;
tar cf - /warthog/aaa >/dev/zero &
tar cf - /warthog/aaa2 >/dev/zero &
tar cf - /warthog/aaa3 >/dev/zero &
tar cf - /warthog/aaa >/dev/zero &
tar cf - /warthog/aaa2 >/dev/zero &
tar cf - /warthog/aaa3 >/dev/zero &
tar cf - /warthog/aaa >/dev/zero &
tar cf - /warthog/aaa2 >/dev/zero;
wait;
date
sync
echo b >/proc/sysrq-trigger

Where each of /warthog/aaa{,2,3} is about 340MB of kernel tree. The test box
only has 1GB of RAM. The server has the whole data set in RAM and is
connected by GigE with more or less no other traffic on the line.


Runs without and with the patch were alternated, and the following results
were produced:

Using write_one_page():

Thu Apr 2 22:56:45 BST 2009
Thu Apr 2 23:01:03 BST 2009 4m 18s

Thu Apr 2 23:11:12 BST 2009
Thu Apr 2 23:15:12 BST 2009 4m 0s

Thu Apr 2 23:28:39 BST 2009
Thu Apr 2 23:32:01 BST 2009 3m 22s

Using write():

Thu Apr 2 23:05:19 BST 2009
Thu Apr 2 23:07:31 BST 2009 2m 12s

Thu Apr 2 23:18:28 BST 2009
Thu Apr 2 23:20:18 BST 2009 1m 50s

Thu Apr 2 23:22:52 BST 2009
Thu Apr 2 23:25:24 BST 2009 2m 32s

So the patch improves performance by about 2x, it would appear. The disk
holding the backing fs was much more continually on with the patch, so I think
write() must be promoting some disk I/O.

Or maybe, fput() calling release() is the difference. Any thoughts?

David
David Howells
2009-04-03 13:41:48 UTC
Permalink
Post by David Howells
+ data = kmap(page);
+ old_fs = get_fs();
+ set_fs(KERNEL_DS);
+ ret = file->f_op->write(
+ file, (const void __user *) data, PAGE_SIZE,
+ &pos);
+ set_fs(old_fs);
+ kunmap(page);
Hmmm... Now I've thought about it some more, I'm not sure this is a good
idea. Can this run a system out of kmap() resources? These are run by a
dynamic thread pool, where each thread in the pool can do a kmap and issue a
write, but the write may get delayed for a significant amount of time for a
number of reasons - such as block allocation, data fetching and memory
allocation.

David

David Howells
2009-04-02 16:02:33 UTC
Permalink
Post by Nick Piggin
Hmm, I guess not all filesystems define write_begin/write_end. But if you
only need to use ones that do define them?
That would be a reasonable restriction. As would excluding NFS, AFS, CIFS,
etc..
Post by Nick Piggin
Yes, you framed the changelog as introducing this new callback because it
allows a highly optimised code that takes advantage of page aligned write.
So I went on a tangent thinking you were going to use it later to avoid
the data copy or something.
I would love to avoid the data copies and do asynchronous direct I/O. However
the DIO interface appears very highly userspace centric, and is very hard to
use from within the kernel.
Post by Nick Piggin
You are knowingly squashing together fscache and the backing filesystem if
you do something like introduce a flag like PG_owner_priv_2 and disallow the
backing filesystem from reusing it.
The backing filesystem isn't disallowed from reusing it; the *netfs* is.

Understand that there are two filesystems involved: The netfs, which makes
requests of FS-Cache, and the backing filesystem which cachefiles uses to
store data.

FS-Cache places restrictions such as having to yield up FS_private_2 and
FS_owner_priv_2 on the netfs.

Cachefiles places restrictions such as must have bmap(), setxattr(),
write_one_page() on the backing fs.
Post by Nick Piggin
So at which point you don't have to keep up illusions about being totally
filesystem agnostic.
I'm trying to keep things as filesystem agnostic as I can, at both ends.
Post by Nick Piggin
Many filesystems don't need the file argument to write_begin/write_end.
That probably includes all those I care about for supporting the backing fs.
The problem is how does cachefiles tell? Perhaps a file_system_type flag?

#define FS_SUPPORTS_CACHEFILES 65536

or

#define FS_NULL_FILEPTR_OKAY 65536

perhaps.

David
Peter Staubach
2009-04-02 15:32:14 UTC
Permalink
Post by David Howells
Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised). The data source is a single page.
This is used by CacheFiles to store the contents of netfs pages into their
backing file pages.
Supply a generic implementation for this that uses the write_begin() and
write_end() address_space operations to bind a copy directly into the page
cache.
Hook the Ext2 and Ext3 operations to the generic implementation.
---
fs/ext2/inode.c | 2 ++
fs/ext3/inode.c | 3 +++
include/linux/fs.h | 7 ++++++
mm/filemap.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 73 insertions(+), 0 deletions(-)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b43b955..5d17070 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -817,6 +817,8 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
+ .write_one_page = generic_file_buffered_write_one_page,
You probably only need one of these lines, don't you?

Thanx...

ps
Post by David Howells
};
/*
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index d897033..a2e5fef 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1803,6 +1803,7 @@ static const struct address_space_operations ext3_ordered_aops = {
.direct_IO = ext3_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
+ .write_one_page = generic_file_buffered_write_one_page,
};
static const struct address_space_operations ext3_writeback_aops = {
@@ -1818,6 +1819,7 @@ static const struct address_space_operations ext3_writeback_aops = {
.direct_IO = ext3_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
+ .write_one_page = generic_file_buffered_write_one_page,
};
static const struct address_space_operations ext3_journalled_aops = {
@@ -1832,6 +1834,7 @@ static const struct address_space_operations ext3_journalled_aops = {
.invalidatepage = ext3_invalidatepage,
.releasepage = ext3_releasepage,
.is_partially_uptodate = block_is_partially_uptodate,
+ .write_one_page = generic_file_buffered_write_one_page,
};
void ext3_set_aops(struct inode *inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 61211ad..df93cdf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -532,6 +532,11 @@ struct address_space_operations {
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
unsigned long);
+
+ /* write the contents of the source page over the page at the specified
+ * index in the target address space (the source page does not need to
+ * be related to the target address space) */
+ int (*write_one_page)(struct address_space *, pgoff_t, struct page *);
};
/*
@@ -2119,6 +2124,8 @@ extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
extern ssize_t generic_file_buffered_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t, loff_t *, size_t, ssize_t);
+extern int generic_file_buffered_write_one_page(struct address_space *,
+ pgoff_t, struct page *);
extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
extern int generic_segment_checks(const struct iovec *iov,
diff --git a/mm/filemap.c b/mm/filemap.c
index 3c55c1a..d766c2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2320,6 +2320,67 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov,
}
EXPORT_SYMBOL(generic_file_buffered_write);
+/**
+ * generic_file_buffered_write_one_page - Write a single page of data to an
+ * inode
+ *
+ * Write the data from the source page to the page in the nominated address
+ * the page crosses the EOF marker, in which case only the first part of the
+ * page will be written.
+ *
+ * target page offset.
+ */
+int generic_file_buffered_write_one_page(struct address_space *mapping,
+ pgoff_t index,
+ struct page *source)
+{
+ const struct address_space_operations *a_ops = mapping->a_ops;
+ struct page *page;
+ unsigned len;
+ loff_t isize, pos;
+ void *fsdata;
+ int ret;
+
+ pos = index;
+ pos <<= PAGE_CACHE_SHIFT;
+
+ len = PAGE_CACHE_SIZE;
+ isize = i_size_read(mapping->host);
+ if ((isize >> PAGE_CACHE_SHIFT) == index)
+ len = isize & (PAGE_CACHE_SIZE - 1);
+
+ ret = pagecache_write_begin(NULL, mapping, pos, len,
+ AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+ if (ret < 0)
+ goto sync;
+
+ copy_highpage(page, source);
+
+ ret = pagecache_write_end(NULL, mapping, pos, len, len, page, fsdata);
+ if (ret < 0)
+ goto sync;
+
+ balance_dirty_pages_ratelimited(mapping);
+ cond_resched();
+
+ /* the caller must handle O_SYNC themselves, but we handle S_SYNC and
+ * MS_SYNCHRONOUS here */
+ if (unlikely(IS_SYNC(mapping->host)) && !a_ops->writepage)
+ ret = generic_osync_inode(mapping->host, mapping,
+ OSYNC_METADATA | OSYNC_DATA);
+
+ /* the caller must handle O_DIRECT for themselves */
+
+ return ret;
+}
+EXPORT_SYMBOL(generic_file_buffered_write_one_page);
+
static ssize_t
__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos)
_______________________________________________
NFSv4 mailing list
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-02 16:03:16 UTC
Permalink
Post by Peter Staubach
Post by David Howells
+ .write_one_page = generic_file_buffered_write_one_page,
+ .write_one_page = generic_file_buffered_write_one_page,
You probably only need one of these lines, don't you?
Yes: one should be in ext2_aops, thanks.

David
Rik van Riel
2009-04-02 18:48:56 UTC
Permalink
Post by David Howells
Add an address space operation to write one single page of data to an inode at
a page-aligned location (thus permitting the implementation to be highly
optimised). The data source is a single page.
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b43b955..5d17070 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -817,6 +817,8 @@ const struct address_space_operations ext2_nobh_aops = {
.direct_IO = ext2_direct_IO,
.writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
+ .write_one_page = generic_file_buffered_write_one_page,
+ .write_one_page = generic_file_buffered_write_one_page
Adding the same function pointer twice seems a bit excessive.

Other than that, the patch looks good :)
David Howells
2009-04-01 23:05:46 UTC
Permalink
Add FS-Cache option bit to nfs_server struct. This is set to indicate local
on-disk caching is enabled for a particular superblock.

Also add debug bit for local caching operations.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

include/linux/nfs_fs.h | 1 +
include/linux/nfs_fs_sb.h | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index bde2557..fd3e7f9 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -583,6 +583,7 @@ extern void * nfs_root_data(void);
#define NFSDBG_CALLBACK 0x0100
#define NFSDBG_CLIENT 0x0200
#define NFSDBG_MOUNT 0x0400
+#define NFSDBG_FSCACHE 0x0800
#define NFSDBG_ALL 0xFFFF

#ifdef __KERNEL__
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 29b1e40..a749f85 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -96,6 +96,8 @@ struct nfs_server {
unsigned int acdirmin;
unsigned int acdirmax;
unsigned int namelen;
+ unsigned int options; /* extra options enabled by mount */
+#define NFS_OPTION_FSCACHE 0x00000001 /* - local caching enabled */

struct nfs_fsid fsid;
__u64 maxfilesize; /* maximum file size */

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:01 UTC
Permalink
Define and create server-level cache index objects (as managed by nfs_client
structs).

Each server object is created in the NFS top-level index object and is itself
an index into which superblock-level objects are inserted.

Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.

The server object key is a sequence consisting of:

(1) NFS version

(2) Server address family (eg: AF_INET or AF_INET6)

(3) Server port.

(4) Server IP address.

The key blob is of variable length, depending on the length of (4).

The server object is given no coherency data to carry in the auxiliary data
permitted by the cache.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/Makefile | 2 +
fs/nfs/client.c | 5 +++
fs/nfs/fscache-index.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.c | 52 ++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 10 +++++++
include/linux/nfs_fs_sb.h | 4 +++
6 files changed, 137 insertions(+), 1 deletions(-)
create mode 100644 fs/nfs/fscache.c


diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index 0e0bb6c..8451598 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -15,4 +15,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
callback.o callback_xdr.o callback_proc.o \
nfs4namespace.o
nfs-$(CONFIG_SYSCTL) += sysctl.o
-nfs-$(CONFIG_NFS_FSCACHE) += fscache-index.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache.o fscache-index.o
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index aba3801..aa04da8 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -45,6 +45,7 @@
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_CLIENT

@@ -154,6 +155,8 @@ static struct nfs_client *nfs_alloc_client(const struct nfs_client_initdata *cl_
if (!IS_ERR(cred))
clp->cl_machine_cred = cred;

+ nfs_fscache_get_client_cookie(clp);
+
return clp;

error_3:
@@ -187,6 +190,8 @@ static void nfs_free_client(struct nfs_client *clp)

nfs4_shutdown_client(clp);

+ nfs_fscache_release_client_cookie(clp);
+
/* -EIO all pending I/O */
if (!IS_ERR(clp->cl_rpcclient))
rpc_shutdown_client(clp->cl_rpcclient);
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 6d5bb5c..ff14b03 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -47,3 +47,68 @@ void nfs_fscache_unregister(void)
{
fscache_unregister_netfs(&nfs_fscache_netfs);
}
+
+/*
+ * Layout of the key for an NFS server cache object.
+ */
+struct nfs_server_key {
+ uint16_t nfsversion; /* NFS protocol version */
+ uint16_t family; /* address family */
+ uint16_t port; /* IP port */
+ union {
+ struct in_addr ipv4_addr; /* IPv4 address */
+ struct in6_addr ipv6_addr; /* IPv6 address */
+ } addr[0];
+};
+
+/*
+ * Generate a key to describe a server in the main NFS index
+ * - We return the length of the key, or 0 if we can't generate one
+ */
+static uint16_t nfs_server_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct nfs_client *clp = cookie_netfs_data;
+ const struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *) &clp->cl_addr;
+ const struct sockaddr_in *sin = (struct sockaddr_in *) &clp->cl_addr;
+ struct nfs_server_key *key = buffer;
+ uint16_t len = sizeof(struct nfs_server_key);
+
+ key->nfsversion = clp->rpc_ops->version;
+ key->family = clp->cl_addr.ss_family;
+
+ memset(key, 0, len);
+
+ switch (clp->cl_addr.ss_family) {
+ case AF_INET:
+ key->port = sin->sin_port;
+ key->addr[0].ipv4_addr = sin->sin_addr;
+ len += sizeof(key->addr[0].ipv4_addr);
+ break;
+
+ case AF_INET6:
+ key->port = sin6->sin6_port;
+ key->addr[0].ipv6_addr = sin6->sin6_addr;
+ len += sizeof(key->addr[0].ipv6_addr);
+ break;
+
+ default:
+ printk(KERN_WARNING "NFS: Unknown network family '%d'\n",
+ clp->cl_addr.ss_family);
+ len = 0;
+ break;
+ }
+
+ return len;
+}
+
+/*
+ * Define the server object for FS-Cache. This is used to describe a server
+ * object to fscache_acquire_cookie(). It is keyed by the NFS protocol and
+ * server address parameters.
+ */
+const struct fscache_cookie_def nfs_fscache_server_index_def = {
+ .name = "NFS.server",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+ .get_key = nfs_server_get_key,
+};
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
new file mode 100644
index 0000000..c3f056f
--- /dev/null
+++ b/fs/nfs/fscache.c
@@ -0,0 +1,52 @@
+/* NFS filesystem cache interface
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+#include <linux/in6.h>
+#include <linux/seq_file.h>
+
+#include "internal.h"
+#include "fscache.h"
+
+#define NFSDBG_FACILITY NFSDBG_FSCACHE
+
+/*
+ * Get the per-client index cookie for an NFS client if the appropriate mount
+ * flag was set
+ * - We always try and get an index cookie for the client, but get filehandle
+ * cookies on a per-superblock basis, depending on the mount flags
+ */
+void nfs_fscache_get_client_cookie(struct nfs_client *clp)
+{
+ /* create a cache index for looking up filehandles */
+ clp->fscache = fscache_acquire_cookie(nfs_fscache_netfs.primary_index,
+ &nfs_fscache_server_index_def,
+ clp);
+ dfprintk(FSCACHE, "NFS: get client cookie (0x%p/0x%p)\n",
+ clp, clp->fscache);
+}
+
+/*
+ * Dispose of a per-client cookie
+ */
+void nfs_fscache_release_client_cookie(struct nfs_client *clp)
+{
+ dfprintk(FSCACHE, "NFS: releasing client cookie (0x%p/0x%p)\n",
+ clp, clp->fscache);
+
+ fscache_relinquish_cookie(clp->fscache, 0);
+ clp->fscache = NULL;
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index ccfcdc5..1d864be 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -23,13 +23,23 @@
* fscache-index.c
*/
extern struct fscache_netfs nfs_fscache_netfs;
+extern const struct fscache_cookie_def nfs_fscache_server_index_def;

extern int nfs_fscache_register(void);
extern void nfs_fscache_unregister(void);

+/*
+ * fscache.c
+ */
+extern void nfs_fscache_get_client_cookie(struct nfs_client *);
+extern void nfs_fscache_release_client_cookie(struct nfs_client *);
+
#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
static inline void nfs_fscache_unregister(void) {}

+static inline void nfs_fscache_get_client_cookie(struct nfs_client *clp) {}
+static inline void nfs_fscache_release_client_cookie(struct nfs_client *clp) {}
+
#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index a749f85..0a374b9 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -64,6 +64,10 @@ struct nfs_client {
char cl_ipaddr[48];
unsigned char cl_id_uniquifier;
#endif
+
+#ifdef CONFIG_NFS_FSCACHE
+ struct fscache_cookie *fscache; /* client index cache cookie */
+#endif
};

/*

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:51 UTC
Permalink
Permit local filesystem caching to be enabled for NFS in the kernel
configuration.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/Kconfig | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/Kconfig b/fs/nfs/Kconfig
index 36fe20d..e67f3ec 100644
--- a/fs/nfs/Kconfig
+++ b/fs/nfs/Kconfig
@@ -84,3 +84,11 @@ config ROOT_NFS
<file:Documentation/filesystems/nfsroot.txt>.

Most people say N here.
+
+config NFS_FSCACHE
+ bool "Provide NFS client caching support (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ depends on NFS_FS=m && FSCACHE || NFS_FS=y && FSCACHE=y
+ help
+ Say Y here if you want NFS data to be cached locally on disc through
+ the general filesystem cache manager

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:35 UTC
Permalink
The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches. The kAFS filesystem will use caching
automatically if it's available.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/afs/Kconfig | 8 +
fs/afs/Makefile | 3
fs/afs/cache.c | 503 ++++++++++++++++++++++++++++++++++------------------
fs/afs/cache.h | 15 --
fs/afs/cell.c | 16 +-
fs/afs/file.c | 220 +++++++++++++++--------
fs/afs/inode.c | 31 ++-
fs/afs/internal.h | 53 +++--
fs/afs/main.c | 27 +--
fs/afs/mntpt.c | 4
fs/afs/vlocation.c | 25 +--
fs/afs/volume.c | 14 +
fs/afs/write.c | 21 ++
13 files changed, 582 insertions(+), 358 deletions(-)


diff --git a/fs/afs/Kconfig b/fs/afs/Kconfig
index e7b522f..5c4e61d 100644
--- a/fs/afs/Kconfig
+++ b/fs/afs/Kconfig
@@ -19,3 +19,11 @@ config AFS_DEBUG
See <file:Documentation/filesystems/afs.txt> for more information.

If unsure, say N.
+
+config AFS_FSCACHE
+ bool "Provide AFS client caching support (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ depends on AFS_FS=m && FSCACHE || AFS_FS=y && FSCACHE=y
+ help
+ Say Y here if you want AFS data to be cached locally on disk through
+ the generic filesystem cache manager
diff --git a/fs/afs/Makefile b/fs/afs/Makefile
index a666710..4f64b95 100644
--- a/fs/afs/Makefile
+++ b/fs/afs/Makefile
@@ -2,7 +2,10 @@
# Makefile for Red Hat Linux AFS client.
#

+afs-cache-$(CONFIG_AFS_FSCACHE) := cache.o
+
kafs-objs := \
+ $(afs-cache-y) \
callback.o \
cell.o \
cmservice.o \
diff --git a/fs/afs/cache.c b/fs/afs/cache.c
index de0d7de..e2b1d3f 100644
--- a/fs/afs/cache.c
+++ b/fs/afs/cache.c
@@ -1,6 +1,6 @@
/* AFS caching stuff
*
- * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (***@redhat.com)
*
* This program is free software; you can redistribute it and/or
@@ -9,248 +9,395 @@
* 2 of the License, or (at your option) any later version.
*/

-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
- const void *entry);
-static void afs_cell_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_cache_cell_index_def = {
- .name = "cell_ix",
- .data_size = sizeof(struct afs_cache_cell),
- .keys[0] = { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
- .match = afs_cell_cache_match,
- .update = afs_cell_cache_update,
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include "internal.h"
+
+static uint16_t afs_cell_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+static uint16_t afs_cell_cache_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+static enum fscache_checkaux afs_cell_cache_check_aux(void *cookie_netfs_data,
+ const void *buffer,
+ uint16_t buflen);
+
+static uint16_t afs_vlocation_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+static uint16_t afs_vlocation_cache_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+static enum fscache_checkaux afs_vlocation_cache_check_aux(
+ void *cookie_netfs_data, const void *buffer, uint16_t buflen);
+
+static uint16_t afs_volume_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+
+static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+static void afs_vnode_cache_get_attr(const void *cookie_netfs_data,
+ uint64_t *size);
+static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t buflen);
+static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data,
+ const void *buffer,
+ uint16_t buflen);
+static void afs_vnode_cache_now_uncached(void *cookie_netfs_data);
+
+struct fscache_netfs afs_cache_netfs = {
+ .name = "afs",
+ .version = 0,
+};
+
+struct fscache_cookie_def afs_cell_cache_index_def = {
+ .name = "AFS.cell",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+ .get_key = afs_cell_cache_get_key,
+ .get_aux = afs_cell_cache_get_aux,
+ .check_aux = afs_cell_cache_check_aux,
+};
+
+struct fscache_cookie_def afs_vlocation_cache_index_def = {
+ .name = "AFS.vldb",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+ .get_key = afs_vlocation_cache_get_key,
+ .get_aux = afs_vlocation_cache_get_aux,
+ .check_aux = afs_vlocation_cache_check_aux,
+};
+
+struct fscache_cookie_def afs_volume_cache_index_def = {
+ .name = "AFS.volume",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+ .get_key = afs_volume_cache_get_key,
+};
+
+struct fscache_cookie_def afs_vnode_cache_index_def = {
+ .name = "AFS.vnode",
+ .type = FSCACHE_COOKIE_TYPE_DATAFILE,
+ .get_key = afs_vnode_cache_get_key,
+ .get_attr = afs_vnode_cache_get_attr,
+ .get_aux = afs_vnode_cache_get_aux,
+ .check_aux = afs_vnode_cache_check_aux,
+ .now_uncached = afs_vnode_cache_now_uncached,
};
-#endif

/*
- * match a cell record obtained from the cache
+ * set the key for the index entry
*/
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_cell_cache_match(void *target,
- const void *entry)
+static uint16_t afs_cell_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
{
- const struct afs_cache_cell *ccell = entry;
- struct afs_cell *cell = target;
+ const struct afs_cell *cell = cookie_netfs_data;
+ uint16_t klen;

- _enter("{%s},{%s}", ccell->name, cell->name);
+ _enter("%p,%p,%u", cell, buffer, bufmax);

- if (strncmp(ccell->name, cell->name, sizeof(ccell->name)) == 0) {
- _leave(" = SUCCESS");
- return CACHEFS_MATCH_SUCCESS;
- }
+ klen = strlen(cell->name);
+ if (klen > bufmax)
+ return 0;

- _leave(" = FAILED");
- return CACHEFS_MATCH_FAILED;
+ memcpy(buffer, cell->name, klen);
+ return klen;
}
-#endif

/*
- * update a cell record in the cache
+ * provide new auxilliary cache data
*/
-#ifdef AFS_CACHING_SUPPORT
-static void afs_cell_cache_update(void *source, void *entry)
+static uint16_t afs_cell_cache_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
{
- struct afs_cache_cell *ccell = entry;
- struct afs_cell *cell = source;
+ const struct afs_cell *cell = cookie_netfs_data;
+ uint16_t dlen;

- _enter("%p,%p", source, entry);
+ _enter("%p,%p,%u", cell, buffer, bufmax);

- strncpy(ccell->name, cell->name, sizeof(ccell->name));
+ dlen = cell->vl_naddrs * sizeof(cell->vl_addrs[0]);
+ dlen = min(dlen, bufmax);
+ dlen &= ~(sizeof(cell->vl_addrs[0]) - 1);

- memcpy(ccell->vl_servers,
- cell->vl_addrs,
- min(sizeof(ccell->vl_servers), sizeof(cell->vl_addrs)));
+ memcpy(buffer, cell->vl_addrs, dlen);
+ return dlen;
+}

+/*
+ * check that the auxilliary data indicates that the entry is still valid
+ */
+static enum fscache_checkaux afs_cell_cache_check_aux(void *cookie_netfs_data,
+ const void *buffer,
+ uint16_t buflen)
+{
+ _leave(" = OKAY");
+ return FSCACHE_CHECKAUX_OKAY;
}
-#endif
-
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vlocation_cache_match(void *target,
- const void *entry);
-static void afs_vlocation_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_vlocation_cache_index_def = {
- .name = "vldb",
- .data_size = sizeof(struct afs_cache_vlocation),
- .keys[0] = { CACHEFS_INDEX_KEYS_ASCIIZ, 64 },
- .match = afs_vlocation_cache_match,
- .update = afs_vlocation_cache_update,
-};
-#endif

+/*****************************************************************************/
/*
- * match a VLDB record stored in the cache
- * - may also load target from entry
+ * set the key for the index entry
*/
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vlocation_cache_match(void *target,
- const void *entry)
+static uint16_t afs_vlocation_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
{
- const struct afs_cache_vlocation *vldb = entry;
- struct afs_vlocation *vlocation = target;
+ const struct afs_vlocation *vlocation = cookie_netfs_data;
+ uint16_t klen;
+
+ _enter("{%s},%p,%u", vlocation->vldb.name, buffer, bufmax);
+
+ klen = strnlen(vlocation->vldb.name, sizeof(vlocation->vldb.name));
+ if (klen > bufmax)
+ return 0;

- _enter("{%s},{%s}", vlocation->vldb.name, vldb->name);
+ memcpy(buffer, vlocation->vldb.name, klen);

- if (strncmp(vlocation->vldb.name, vldb->name, sizeof(vldb->name)) == 0
- ) {
- if (!vlocation->valid ||
- vlocation->vldb.rtime == vldb->rtime
+ _leave(" = %u", klen);
+ return klen;
+}
+
+/*
+ * provide new auxilliary cache data
+ */
+static uint16_t afs_vlocation_cache_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct afs_vlocation *vlocation = cookie_netfs_data;
+ uint16_t dlen;
+
+ _enter("{%s},%p,%u", vlocation->vldb.name, buffer, bufmax);
+
+ dlen = sizeof(struct afs_cache_vlocation);
+ dlen -= offsetof(struct afs_cache_vlocation, nservers);
+ if (dlen > bufmax)
+ return 0;
+
+ memcpy(buffer, (uint8_t *)&vlocation->vldb.nservers, dlen);
+
+ _leave(" = %u", dlen);
+ return dlen;
+}
+
+/*
+ * check that the auxilliary data indicates that the entry is still valid
+ */
+static
+enum fscache_checkaux afs_vlocation_cache_check_aux(void *cookie_netfs_data,
+ const void *buffer,
+ uint16_t buflen)
+{
+ const struct afs_cache_vlocation *cvldb;
+ struct afs_vlocation *vlocation = cookie_netfs_data;
+ uint16_t dlen;
+
+ _enter("{%s},%p,%u", vlocation->vldb.name, buffer, buflen);
+
+ /* check the size of the data is what we're expecting */
+ dlen = sizeof(struct afs_cache_vlocation);
+ dlen -= offsetof(struct afs_cache_vlocation, nservers);
+ if (dlen != buflen)
+ return FSCACHE_CHECKAUX_OBSOLETE;
+
+ cvldb = container_of(buffer, struct afs_cache_vlocation, nservers);
+
+ /* if what's on disk is more valid than what's in memory, then use the
+ * VL record from the cache */
+ if (!vlocation->valid || vlocation->vldb.rtime == cvldb->rtime) {
+ memcpy((uint8_t *)&vlocation->vldb.nservers, buffer, dlen);
+ vlocation->valid = 1;
+ _leave(" = SUCCESS [c->m]");
+ return FSCACHE_CHECKAUX_OKAY;
+ }
+
+ /* need to update the cache if the cached info differs */
+ if (memcmp(&vlocation->vldb, buffer, dlen) != 0) {
+ /* delete if the volume IDs for this name differ */
+ if (memcmp(&vlocation->vldb.vid, &cvldb->vid,
+ sizeof(cvldb->vid)) != 0
) {
- vlocation->vldb = *vldb;
- vlocation->valid = 1;
- _leave(" = SUCCESS [c->m]");
- return CACHEFS_MATCH_SUCCESS;
- } else if (memcmp(&vlocation->vldb, vldb, sizeof(*vldb)) != 0) {
- /* delete if VIDs for this name differ */
- if (memcmp(&vlocation->vldb.vid,
- &vldb->vid,
- sizeof(vldb->vid)) != 0) {
- _leave(" = DELETE");
- return CACHEFS_MATCH_SUCCESS_DELETE;
- }
-
- _leave(" = UPDATE");
- return CACHEFS_MATCH_SUCCESS_UPDATE;
- } else {
- _leave(" = SUCCESS");
- return CACHEFS_MATCH_SUCCESS;
+ _leave(" = OBSOLETE");
+ return FSCACHE_CHECKAUX_OBSOLETE;
}
+
+ _leave(" = UPDATE");
+ return FSCACHE_CHECKAUX_NEEDS_UPDATE;
}

- _leave(" = FAILED");
- return CACHEFS_MATCH_FAILED;
+ _leave(" = OKAY");
+ return FSCACHE_CHECKAUX_OKAY;
}
-#endif

+/*****************************************************************************/
/*
- * update a VLDB record stored in the cache
+ * set the key for the volume index entry
*/
-#ifdef AFS_CACHING_SUPPORT
-static void afs_vlocation_cache_update(void *source, void *entry)
+static uint16_t afs_volume_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
{
- struct afs_cache_vlocation *vldb = entry;
- struct afs_vlocation *vlocation = source;
+ const struct afs_volume *volume = cookie_netfs_data;
+ uint16_t klen;
+
+ _enter("{%u},%p,%u", volume->type, buffer, bufmax);
+
+ klen = sizeof(volume->type);
+ if (klen > bufmax)
+ return 0;

- _enter("");
+ memcpy(buffer, &volume->type, sizeof(volume->type));
+
+ _leave(" = %u", klen);
+ return klen;

- *vldb = vlocation->vldb;
}
-#endif
-
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_volume_cache_match(void *target,
- const void *entry);
-static void afs_volume_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_volume_cache_index_def = {
- .name = "volume",
- .data_size = sizeof(struct afs_cache_vhash),
- .keys[0] = { CACHEFS_INDEX_KEYS_BIN, 1 },
- .keys[1] = { CACHEFS_INDEX_KEYS_BIN, 1 },
- .match = afs_volume_cache_match,
- .update = afs_volume_cache_update,
-};
-#endif

+/*****************************************************************************/
/*
- * match a volume hash record stored in the cache
+ * set the key for the index entry
*/
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_volume_cache_match(void *target,
- const void *entry)
+static uint16_t afs_vnode_cache_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
{
- const struct afs_cache_vhash *vhash = entry;
- struct afs_volume *volume = target;
+ const struct afs_vnode *vnode = cookie_netfs_data;
+ uint16_t klen;

- _enter("{%u},{%u}", volume->type, vhash->vtype);
+ _enter("{%x,%x,%llx},%p,%u",
+ vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version,
+ buffer, bufmax);

- if (volume->type == vhash->vtype) {
- _leave(" = SUCCESS");
- return CACHEFS_MATCH_SUCCESS;
- }
+ klen = sizeof(vnode->fid.vnode);
+ if (klen > bufmax)
+ return 0;
+
+ memcpy(buffer, &vnode->fid.vnode, sizeof(vnode->fid.vnode));

- _leave(" = FAILED");
- return CACHEFS_MATCH_FAILED;
+ _leave(" = %u", klen);
+ return klen;
}
-#endif

/*
- * update a volume hash record stored in the cache
+ * provide updated file attributes
*/
-#ifdef AFS_CACHING_SUPPORT
-static void afs_volume_cache_update(void *source, void *entry)
+static void afs_vnode_cache_get_attr(const void *cookie_netfs_data,
+ uint64_t *size)
{
- struct afs_cache_vhash *vhash = entry;
- struct afs_volume *volume = source;
+ const struct afs_vnode *vnode = cookie_netfs_data;

- _enter("");
+ _enter("{%x,%x,%llx},",
+ vnode->fid.vnode, vnode->fid.unique,
+ vnode->status.data_version);

- vhash->vtype = volume->type;
+ *size = vnode->status.size;
}
-#endif
-
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vnode_cache_match(void *target,
- const void *entry);
-static void afs_vnode_cache_update(void *source, void *entry);
-
-struct cachefs_index_def afs_vnode_cache_index_def = {
- .name = "vnode",
- .data_size = sizeof(struct afs_cache_vnode),
- .keys[0] = { CACHEFS_INDEX_KEYS_BIN, 4 },
- .match = afs_vnode_cache_match,
- .update = afs_vnode_cache_update,
-};
-#endif

/*
- * match a vnode record stored in the cache
+ * provide new auxilliary cache data
+ */
+static uint16_t afs_vnode_cache_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct afs_vnode *vnode = cookie_netfs_data;
+ uint16_t dlen;
+
+ _enter("{%x,%x,%Lx},%p,%u",
+ vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version,
+ buffer, bufmax);
+
+ dlen = sizeof(vnode->fid.unique) + sizeof(vnode->status.data_version);
+ if (dlen > bufmax)
+ return 0;
+
+ memcpy(buffer, &vnode->fid.unique, sizeof(vnode->fid.unique));
+ buffer += sizeof(vnode->fid.unique);
+ memcpy(buffer, &vnode->status.data_version,
+ sizeof(vnode->status.data_version));
+
+ _leave(" = %u", dlen);
+ return dlen;
+}
+
+/*
+ * check that the auxilliary data indicates that the entry is still valid
*/
-#ifdef AFS_CACHING_SUPPORT
-static cachefs_match_val_t afs_vnode_cache_match(void *target,
- const void *entry)
+static enum fscache_checkaux afs_vnode_cache_check_aux(void *cookie_netfs_data,
+ const void *buffer,
+ uint16_t buflen)
{
- const struct afs_cache_vnode *cvnode = entry;
- struct afs_vnode *vnode = target;
-
- _enter("{%x,%x,%Lx},{%x,%x,%Lx}",
- vnode->fid.vnode,
- vnode->fid.unique,
- vnode->status.version,
- cvnode->vnode_id,
- cvnode->vnode_unique,
- cvnode->data_version);
-
- if (vnode->fid.vnode != cvnode->vnode_id) {
- _leave(" = FAILED");
- return CACHEFS_MATCH_FAILED;
+ struct afs_vnode *vnode = cookie_netfs_data;
+ uint16_t dlen;
+
+ _enter("{%x,%x,%llx},%p,%u",
+ vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version,
+ buffer, buflen);
+
+ /* check the size of the data is what we're expecting */
+ dlen = sizeof(vnode->fid.unique) + sizeof(vnode->status.data_version);
+ if (dlen != buflen) {
+ _leave(" = OBSOLETE [len %hx != %hx]", dlen, buflen);
+ return FSCACHE_CHECKAUX_OBSOLETE;
}

- if (vnode->fid.unique != cvnode->vnode_unique ||
- vnode->status.version != cvnode->data_version) {
- _leave(" = DELETE");
- return CACHEFS_MATCH_SUCCESS_DELETE;
+ if (memcmp(buffer,
+ &vnode->fid.unique,
+ sizeof(vnode->fid.unique)
+ ) != 0) {
+ unsigned unique;
+
+ memcpy(&unique, buffer, sizeof(unique));
+
+ _leave(" = OBSOLETE [uniq %x != %x]",
+ unique, vnode->fid.unique);
+ return FSCACHE_CHECKAUX_OBSOLETE;
+ }
+
+ if (memcmp(buffer + sizeof(vnode->fid.unique),
+ &vnode->status.data_version,
+ sizeof(vnode->status.data_version)
+ ) != 0) {
+ afs_dataversion_t version;
+
+ memcpy(&version, buffer + sizeof(vnode->fid.unique),
+ sizeof(version));
+
+ _leave(" = OBSOLETE [vers %llx != %llx]",
+ version, vnode->status.data_version);
+ return FSCACHE_CHECKAUX_OBSOLETE;
}

_leave(" = SUCCESS");
- return CACHEFS_MATCH_SUCCESS;
+ return FSCACHE_CHECKAUX_OKAY;
}
-#endif

/*
- * update a vnode record stored in the cache
+ * indication the cookie is no longer uncached
+ * - this function is called when the backing store currently caching a cookie
+ * is removed
+ * - the netfs should use this to clean up any markers indicating cached pages
+ * - this is mandatory for any object that may have data
*/
-#ifdef AFS_CACHING_SUPPORT
-static void afs_vnode_cache_update(void *source, void *entry)
+static void afs_vnode_cache_now_uncached(void *cookie_netfs_data)
{
- struct afs_cache_vnode *cvnode = entry;
- struct afs_vnode *vnode = source;
+ struct afs_vnode *vnode = cookie_netfs_data;
+ struct pagevec pvec;
+ pgoff_t first;
+ int loop, nr_pages;
+
+ _enter("{%x,%x,%Lx}",
+ vnode->fid.vnode, vnode->fid.unique, vnode->status.data_version);
+
+ pagevec_init(&pvec, 0);
+ first = 0;
+
+ for (;;) {
+ /* grab a bunch of pages to clean */
+ nr_pages = pagevec_lookup(&pvec, vnode->vfs_inode.i_mapping,
+ first,
+ PAGEVEC_SIZE - pagevec_count(&pvec));
+ if (!nr_pages)
+ break;

- _enter("");
+ for (loop = 0; loop < nr_pages; loop++)
+ ClearPageFsCache(pvec.pages[loop]);
+
+ first = pvec.pages[nr_pages - 1]->index + 1;
+
+ pvec.nr = nr_pages;
+ pagevec_release(&pvec);
+ cond_resched();
+ }

- cvnode->vnode_id = vnode->fid.vnode;
- cvnode->vnode_unique = vnode->fid.unique;
- cvnode->data_version = vnode->status.version;
+ _leave("");
}
-#endif
diff --git a/fs/afs/cache.h b/fs/afs/cache.h
index 36a3642..5c4f6b4 100644
--- a/fs/afs/cache.h
+++ b/fs/afs/cache.h
@@ -1,6 +1,6 @@
/* AFS local cache management interface
*
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (***@redhat.com)
*
* This program is free software; you can redistribute it and/or
@@ -9,15 +9,4 @@
* 2 of the License, or (at your option) any later version.
*/

-#ifndef AFS_CACHE_H
-#define AFS_CACHE_H
-
-#undef AFS_CACHING_SUPPORT
-
-#include <linux/mm.h>
-#ifdef AFS_CACHING_SUPPORT
-#include <linux/cachefs.h>
-#endif
-#include "types.h"
-
-#endif /* AFS_CACHE_H */
+#include <linux/fscache.h>
diff --git a/fs/afs/cell.c b/fs/afs/cell.c
index 5e1df14..e19c13f 100644
--- a/fs/afs/cell.c
+++ b/fs/afs/cell.c
@@ -147,12 +147,11 @@ struct afs_cell *afs_cell_create(const char *name, char *vllist)
if (ret < 0)
goto error;

-#ifdef AFS_CACHING_SUPPORT
- /* put it up for caching */
- cachefs_acquire_cookie(afs_cache_netfs.primary_index,
- &afs_vlocation_cache_index_def,
- cell,
- &cell->cache);
+#ifdef CONFIG_AFS_FSCACHE
+ /* put it up for caching (this never returns an error) */
+ cell->cache = fscache_acquire_cookie(afs_cache_netfs.primary_index,
+ &afs_cell_cache_index_def,
+ cell);
#endif

/* add to the cell lists */
@@ -362,10 +361,9 @@ static void afs_cell_destroy(struct afs_cell *cell)
list_del_init(&cell->proc_link);
up_write(&afs_proc_cells_sem);

-#ifdef AFS_CACHING_SUPPORT
- cachefs_relinquish_cookie(cell->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_relinquish_cookie(cell->cache, 0);
#endif
-
key_put(cell->anonymous_key);
kfree(cell);

diff --git a/fs/afs/file.c b/fs/afs/file.c
index a390176..aeb6cdd 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -23,6 +23,9 @@ static void afs_invalidatepage(struct page *page, unsigned long offset);
static int afs_releasepage(struct page *page, gfp_t gfp_flags);
static int afs_launder_page(struct page *page);

+static int afs_readpages(struct file *filp, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages);
+
const struct file_operations afs_file_operations = {
.open = afs_open,
.release = afs_release,
@@ -46,6 +49,7 @@ const struct inode_operations afs_file_inode_operations = {

const struct address_space_operations afs_fs_aops = {
.readpage = afs_readpage,
+ .readpages = afs_readpages,
.set_page_dirty = afs_set_page_dirty,
.launder_page = afs_launder_page,
.releasepage = afs_releasepage,
@@ -101,37 +105,18 @@ int afs_release(struct inode *inode, struct file *file)
/*
* deal with notification that a page was read from the cache
*/
-#ifdef AFS_CACHING_SUPPORT
-static void afs_readpage_read_complete(void *cookie_data,
- struct page *page,
- void *data,
- int error)
+static void afs_file_readpage_read_complete(struct page *page,
+ void *data,
+ int error)
{
- _enter("%p,%p,%p,%d", cookie_data, page, data, error);
+ _enter("%p,%p,%d", page, data, error);

- if (error)
- SetPageError(page);
- else
+ /* if the read completes with an error, we just unlock the page and let
+ * the VM reissue the readpage */
+ if (!error)
SetPageUptodate(page);
unlock_page(page);
-
}
-#endif
-
-/*
- * deal with notification that a page was written to the cache
- */
-#ifdef AFS_CACHING_SUPPORT
-static void afs_readpage_write_complete(void *cookie_data,
- struct page *page,
- void *data,
- int error)
-{
- _enter("%p,%p,%p,%d", cookie_data, page, data, error);
-
- unlock_page(page);
-}
-#endif

/*
* AFS read page from file, directory or symlink
@@ -161,9 +146,9 @@ static int afs_readpage(struct file *file, struct page *page)
if (test_bit(AFS_VNODE_DELETED, &vnode->flags))
goto error;

-#ifdef AFS_CACHING_SUPPORT
/* is it cached? */
- ret = cachefs_read_or_alloc_page(vnode->cache,
+#ifdef CONFIG_AFS_FSCACHE
+ ret = fscache_read_or_alloc_page(vnode->cache,
page,
afs_file_readpage_read_complete,
NULL,
@@ -171,20 +156,21 @@ static int afs_readpage(struct file *file, struct page *page)
#else
ret = -ENOBUFS;
#endif
-
switch (ret) {
- /* read BIO submitted and wb-journal entry found */
- case 1:
- BUG(); // TODO - handle wb-journal match
-
/* read BIO submitted (page in cache) */
case 0:
break;

- /* no page available in cache */
- case -ENOBUFS:
+ /* page not yet cached */
case -ENODATA:
+ _debug("cache said ENODATA");
+ goto go_on;
+
+ /* page will not be cached */
+ case -ENOBUFS:
+ _debug("cache said ENOBUFS");
default:
+ go_on:
offset = page->index << PAGE_CACHE_SHIFT;
len = min_t(size_t, i_size_read(inode) - offset, PAGE_SIZE);

@@ -198,27 +184,25 @@ static int afs_readpage(struct file *file, struct page *page)
set_bit(AFS_VNODE_DELETED, &vnode->flags);
ret = -ESTALE;
}
-#ifdef AFS_CACHING_SUPPORT
- cachefs_uncache_page(vnode->cache, page);
+
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_uncache_page(vnode->cache, page);
#endif
+ BUG_ON(PageFsCache(page));
goto error;
}

SetPageUptodate(page);

-#ifdef AFS_CACHING_SUPPORT
- if (cachefs_write_page(vnode->cache,
- page,
- afs_file_readpage_write_complete,
- NULL,
- GFP_KERNEL) != 0
- ) {
- cachefs_uncache_page(vnode->cache, page);
- unlock_page(page);
+ /* send the page to the cache */
+#ifdef CONFIG_AFS_FSCACHE
+ if (PageFsCache(page) &&
+ fscache_write_page(vnode->cache, page, GFP_KERNEL) != 0) {
+ fscache_uncache_page(vnode->cache, page);
+ BUG_ON(PageFsCache(page));
}
-#else
- unlock_page(page);
#endif
+ unlock_page(page);
}

_leave(" = 0");
@@ -232,34 +216,59 @@ error:
}

/*
- * invalidate part or all of a page
+ * read a set of pages
*/
-static void afs_invalidatepage(struct page *page, unsigned long offset)
+static int afs_readpages(struct file *file, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
{
- int ret = 1;
+ struct afs_vnode *vnode;
+ int ret = 0;

- _enter("{%lu},%lu", page->index, offset);
+ _enter(",{%lu},,%d", mapping->host->i_ino, nr_pages);

- BUG_ON(!PageLocked(page));
+ vnode = AFS_FS_I(mapping->host);
+ if (vnode->flags & AFS_VNODE_DELETED) {
+ _leave(" = -ESTALE");
+ return -ESTALE;
+ }

- if (PagePrivate(page)) {
- /* We release buffers only if the entire page is being
- * invalidated.
- * The get_block cached value has been unconditionally
- * invalidated, so real IO is not possible anymore.
- */
- if (offset == 0) {
- BUG_ON(!PageLocked(page));
-
- ret = 0;
- if (!PageWriteback(page))
- ret = page->mapping->a_ops->releasepage(page,
- 0);
- /* possibly should BUG_ON(!ret); - neilb */
- }
+ /* attempt to read as many of the pages as possible */
+#ifdef CONFIG_AFS_FSCACHE
+ ret = fscache_read_or_alloc_pages(vnode->cache,
+ mapping,
+ pages,
+ &nr_pages,
+ afs_file_readpage_read_complete,
+ NULL,
+ mapping_gfp_mask(mapping));
+#else
+ ret = -ENOBUFS;
+#endif
+
+ switch (ret) {
+ /* all pages are being read from the cache */
+ case 0:
+ BUG_ON(!list_empty(pages));
+ BUG_ON(nr_pages != 0);
+ _leave(" = 0 [reading all]");
+ return 0;
+
+ /* there were pages that couldn't be read from the cache */
+ case -ENODATA:
+ case -ENOBUFS:
+ break;
+
+ /* other error */
+ default:
+ _leave(" = %d", ret);
+ return ret;
}

- _leave(" = %d", ret);
+ /* load the missing pages from the network */
+ ret = read_cache_pages(mapping, pages, (void *) afs_readpage, file);
+
+ _leave(" = %d [netting]", ret);
+ return ret;
}

/*
@@ -273,25 +282,82 @@ static int afs_launder_page(struct page *page)
}

/*
- * release a page and cleanup its private data
+ * invalidate part or all of a page
+ * - release a page and clean up its private data if offset is 0 (indicating
+ * the entire page)
+ */
+static void afs_invalidatepage(struct page *page, unsigned long offset)
+{
+ struct afs_writeback *wb = (struct afs_writeback *) page_private(page);
+
+ _enter("{%lu},%lu", page->index, offset);
+
+ BUG_ON(!PageLocked(page));
+
+ /* we clean up only if the entire page is being invalidated */
+ if (offset == 0) {
+#ifdef CONFIG_AFS_FSCACHE
+ if (PageFsCache(page)) {
+ struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
+ wait_on_page_fscache_write(page);
+ fscache_uncache_page(vnode->cache, page);
+ ClearPageFsCache(page);
+ }
+#endif
+
+ if (PagePrivate(page)) {
+ if (wb && !PageWriteback(page)) {
+ set_page_private(page, 0);
+ afs_put_writeback(wb);
+ }
+
+ if (!page_private(page))
+ ClearPagePrivate(page);
+ }
+ }
+
+ _leave("");
+}
+
+/*
+ * release a page and clean up its private state if it's not busy
+ * - return true if the page can now be released, false if not
*/
static int afs_releasepage(struct page *page, gfp_t gfp_flags)
{
+ struct afs_writeback *wb = (struct afs_writeback *) page_private(page);
struct afs_vnode *vnode = AFS_FS_I(page->mapping->host);
- struct afs_writeback *wb;

_enter("{{%x:%u}[%lu],%lx},%x",
vnode->fid.vid, vnode->fid.vnode, page->index, page->flags,
gfp_flags);

+ /* deny if page is being written to the cache and the caller hasn't
+ * elected to wait */
+#ifdef CONFIG_AFS_FSCACHE
+ if (PageFsCache(page)) {
+ if (PageFsCacheWrite(page)) {
+ if (!(gfp_flags & __GFP_WAIT)) {
+ _leave(" = F [cache busy]");
+ return 0;
+ }
+ wait_on_page_fscache_write(page);
+ }
+
+ fscache_uncache_page(vnode->cache, page);
+ ClearPageFsCache(page);
+ }
+#endif
+
if (PagePrivate(page)) {
- wb = (struct afs_writeback *) page_private(page);
- ASSERT(wb != NULL);
- set_page_private(page, 0);
+ if (wb) {
+ set_page_private(page, 0);
+ afs_put_writeback(wb);
+ }
ClearPagePrivate(page);
- afs_put_writeback(wb);
}

- _leave(" = 0");
- return 0;
+ /* indicate that the page can be released */
+ _leave(" = T");
+ return 1;
}
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index bb47217..c048f06 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -61,6 +61,11 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
return -EBADMSG;
}

+#ifdef CONFIG_AFS_FSCACHE
+ if (vnode->status.size != inode->i_size)
+ fscache_attr_changed(vnode->cache);
+#endif
+
inode->i_nlink = vnode->status.nlink;
inode->i_uid = vnode->status.owner;
inode->i_gid = 0;
@@ -149,15 +154,6 @@ struct inode *afs_iget(struct super_block *sb, struct key *key,
return inode;
}

-#ifdef AFS_CACHING_SUPPORT
- /* set up caching before reading the status, as fetch-status reads the
- * first page of symlinks to see if they're really mntpts */
- cachefs_acquire_cookie(vnode->volume->cache,
- NULL,
- vnode,
- &vnode->cache);
-#endif
-
if (!status) {
/* it's a remotely extant inode */
set_bit(AFS_VNODE_CB_BROKEN, &vnode->flags);
@@ -183,6 +179,15 @@ struct inode *afs_iget(struct super_block *sb, struct key *key,
}
}

+ /* set up caching before mapping the status, as map-status reads the
+ * first page of symlinks to see if they're really mountpoints */
+ inode->i_size = vnode->status.size;
+#ifdef CONFIG_AFS_FSCACHE
+ vnode->cache = fscache_acquire_cookie(vnode->volume->cache,
+ &afs_vnode_cache_index_def,
+ vnode);
+#endif
+
ret = afs_inode_map_status(vnode, key);
if (ret < 0)
goto bad_inode;
@@ -196,6 +201,10 @@ struct inode *afs_iget(struct super_block *sb, struct key *key,

/* failure */
bad_inode:
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_relinquish_cookie(vnode->cache, 0);
+ vnode->cache = NULL;
+#endif
iget_failed(inode);
_leave(" = %d [bad]", ret);
return ERR_PTR(ret);
@@ -340,8 +349,8 @@ void afs_clear_inode(struct inode *inode)
ASSERT(list_empty(&vnode->writebacks));
ASSERT(!vnode->cb_promised);

-#ifdef AFS_CACHING_SUPPORT
- cachefs_relinquish_cookie(vnode->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_relinquish_cookie(vnode->cache, 0);
vnode->cache = NULL;
#endif

diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 67f259d..106be66 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -21,6 +21,7 @@

#include "afs.h"
#include "afs_vl.h"
+#include "cache.h"

#define AFS_CELL_MAX_ADDRS 15

@@ -193,8 +194,8 @@ struct afs_cell {
struct key *anonymous_key; /* anonymous user key for this cell */
struct list_head proc_link; /* /proc cell list link */
struct proc_dir_entry *proc_dir; /* /proc dir for this cell */
-#ifdef AFS_CACHING_SUPPORT
- struct cachefs_cookie *cache; /* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+ struct fscache_cookie *cache; /* caching cookie */
#endif

/* server record management */
@@ -249,8 +250,8 @@ struct afs_vlocation {
struct list_head grave; /* link in master graveyard list */
struct list_head update; /* link in master update list */
struct afs_cell *cell; /* cell to which volume belongs */
-#ifdef AFS_CACHING_SUPPORT
- struct cachefs_cookie *cache; /* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+ struct fscache_cookie *cache; /* caching cookie */
#endif
struct afs_cache_vlocation vldb; /* volume information DB record */
struct afs_volume *vols[3]; /* volume access record pointer (index by type) */
@@ -302,8 +303,8 @@ struct afs_volume {
atomic_t usage;
struct afs_cell *cell; /* cell to which belongs (unrefd ptr) */
struct afs_vlocation *vlocation; /* volume location */
-#ifdef AFS_CACHING_SUPPORT
- struct cachefs_cookie *cache; /* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+ struct fscache_cookie *cache; /* caching cookie */
#endif
afs_volid_t vid; /* volume ID */
afs_voltype_t type; /* type of volume */
@@ -333,8 +334,8 @@ struct afs_vnode {
struct afs_server *server; /* server currently supplying this file */
struct afs_fid fid; /* the file identifier for this inode */
struct afs_file_status status; /* AFS status info for this file */
-#ifdef AFS_CACHING_SUPPORT
- struct cachefs_cookie *cache; /* caching cookie */
+#ifdef CONFIG_AFS_FSCACHE
+ struct fscache_cookie *cache; /* caching cookie */
#endif
struct afs_permits *permits; /* cache of permits so far obtained */
struct mutex permits_lock; /* lock for altering permits list */
@@ -428,6 +429,22 @@ struct afs_uuid {

/*****************************************************************************/
/*
+ * cache.c
+ */
+#ifdef CONFIG_AFS_FSCACHE
+extern struct fscache_netfs afs_cache_netfs;
+extern struct fscache_cookie_def afs_cell_cache_index_def;
+extern struct fscache_cookie_def afs_vlocation_cache_index_def;
+extern struct fscache_cookie_def afs_volume_cache_index_def;
+extern struct fscache_cookie_def afs_vnode_cache_index_def;
+#else
+#define afs_cell_cache_index_def (*(struct fscache_cookie_def *) NULL)
+#define afs_vlocation_cache_index_def (*(struct fscache_cookie_def *) NULL)
+#define afs_volume_cache_index_def (*(struct fscache_cookie_def *) NULL)
+#define afs_vnode_cache_index_def (*(struct fscache_cookie_def *) NULL)
+#endif
+
+/*
* callback.c
*/
extern void afs_init_callback_state(struct afs_server *);
@@ -446,9 +463,6 @@ extern void afs_callback_update_kill(void);
*/
extern struct rw_semaphore afs_proc_cells_sem;
extern struct list_head afs_proc_cells;
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_cache_cell_index_def;
-#endif

#define afs_get_cell(C) do { atomic_inc(&(C)->usage); } while(0)
extern int afs_cell_init(char *);
@@ -554,9 +568,6 @@ extern void afs_clear_inode(struct inode *);
* main.c
*/
extern struct afs_uuid afs_uuid;
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_netfs afs_cache_netfs;
-#endif

/*
* misc.c
@@ -637,10 +648,6 @@ extern int afs_get_MAC_address(u8 *, size_t);
/*
* vlclient.c
*/
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_vlocation_cache_index_def;
-#endif
-
extern int afs_vl_get_entry_by_name(struct in_addr *, struct key *,
const char *, struct afs_cache_vlocation *,
const struct afs_wait_mode *);
@@ -664,12 +671,6 @@ extern void afs_vlocation_purge(void);
/*
* vnode.c
*/
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_vnode_cache_index_def;
-#endif
-
-extern struct afs_timer_ops afs_vnode_cb_timed_out_ops;
-
static inline struct afs_vnode *AFS_FS_I(struct inode *inode)
{
return container_of(inode, struct afs_vnode, vfs_inode);
@@ -711,10 +712,6 @@ extern int afs_vnode_release_lock(struct afs_vnode *, struct key *);
/*
* volume.c
*/
-#ifdef AFS_CACHING_SUPPORT
-extern struct cachefs_index_def afs_volume_cache_index_def;
-#endif
-
#define afs_get_volume(V) do { atomic_inc(&(V)->usage); } while(0)

extern void afs_put_volume(struct afs_volume *);
diff --git a/fs/afs/main.c b/fs/afs/main.c
index 2d3e5d4..66d54d3 100644
--- a/fs/afs/main.c
+++ b/fs/afs/main.c
@@ -1,6 +1,6 @@
/* AFS client file system
*
- * Copyright (C) 2002 Red Hat, Inc. All Rights Reserved.
+ * Copyright (C) 2002,5 Red Hat, Inc. All Rights Reserved.
* Written by David Howells (***@redhat.com)
*
* This program is free software; you can redistribute it and/or
@@ -29,18 +29,6 @@ static char *rootcell;
module_param(rootcell, charp, 0);
MODULE_PARM_DESC(rootcell, "root AFS cell name and VL server IP addr list");

-#ifdef AFS_CACHING_SUPPORT
-static struct cachefs_netfs_operations afs_cache_ops = {
- .get_page_cookie = afs_cache_get_page_cookie,
-};
-
-struct cachefs_netfs afs_cache_netfs = {
- .name = "afs",
- .version = 0,
- .ops = &afs_cache_ops,
-};
-#endif
-
struct afs_uuid afs_uuid;

/*
@@ -104,10 +92,9 @@ static int __init afs_init(void)
if (ret < 0)
return ret;

-#ifdef AFS_CACHING_SUPPORT
+#ifdef CONFIG_AFS_FSCACHE
/* we want to be able to cache */
- ret = cachefs_register_netfs(&afs_cache_netfs,
- &afs_cache_cell_index_def);
+ ret = fscache_register_netfs(&afs_cache_netfs);
if (ret < 0)
goto error_cache;
#endif
@@ -142,8 +129,8 @@ error_fs:
error_open_socket:
error_vl_update_init:
error_cell_init:
-#ifdef AFS_CACHING_SUPPORT
- cachefs_unregister_netfs(&afs_cache_netfs);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_unregister_netfs(&afs_cache_netfs);
error_cache:
#endif
afs_callback_update_kill();
@@ -175,8 +162,8 @@ static void __exit afs_exit(void)
afs_vlocation_purge();
flush_scheduled_work();
afs_cell_purge();
-#ifdef AFS_CACHING_SUPPORT
- cachefs_unregister_netfs(&afs_cache_netfs);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_unregister_netfs(&afs_cache_netfs);
#endif
afs_proc_cleanup();
rcu_barrier();
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index 78db495..2b9e2d0 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -173,9 +173,9 @@ static struct vfsmount *afs_mntpt_do_automount(struct dentry *mntpt)
if (PageError(page))
goto error;

- buf = kmap(page);
+ buf = kmap_atomic(page, KM_USER0);
memcpy(devname, buf, size);
- kunmap(page);
+ kunmap_atomic(buf, KM_USER0);
page_cache_release(page);
page = NULL;

diff --git a/fs/afs/vlocation.c b/fs/afs/vlocation.c
index 849fc31..ec2a743 100644
--- a/fs/afs/vlocation.c
+++ b/fs/afs/vlocation.c
@@ -281,9 +281,8 @@ static void afs_vlocation_apply_update(struct afs_vlocation *vl,

vl->vldb = *vldb;

-#ifdef AFS_CACHING_SUPPORT
- /* update volume entry in local cache */
- cachefs_update_cookie(vl->cache);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_update_cookie(vl->cache);
#endif
}

@@ -304,11 +303,9 @@ static int afs_vlocation_fill_in_record(struct afs_vlocation *vl,
memset(&vldb, 0, sizeof(vldb));

/* see if we have an in-cache copy (will set vl->valid if there is) */
-#ifdef AFS_CACHING_SUPPORT
- cachefs_acquire_cookie(cell->cache,
- &afs_volume_cache_index_def,
- vlocation,
- &vl->cache);
+#ifdef CONFIG_AFS_FSCACHE
+ vl->cache = fscache_acquire_cookie(vl->cell->cache,
+ &afs_vlocation_cache_index_def, vl);
#endif

if (vl->valid) {
@@ -420,6 +417,11 @@ fill_in_record:
spin_unlock(&vl->lock);
wake_up(&vl->waitq);

+ /* update volume entry in local cache */
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_update_cookie(vl->cache);
+#endif
+
/* schedule for regular updates */
afs_vlocation_queue_for_updates(vl);
goto success;
@@ -465,7 +467,7 @@ found_in_memory:
spin_unlock(&vl->lock);

success:
- _leave(" = %p",vl);
+ _leave(" = %p", vl);
return vl;

error_abandon:
@@ -523,10 +525,9 @@ static void afs_vlocation_destroy(struct afs_vlocation *vl)
{
_enter("%p", vl);

-#ifdef AFS_CACHING_SUPPORT
- cachefs_relinquish_cookie(vl->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_relinquish_cookie(vl->cache, 0);
#endif
-
afs_put_cell(vl->cell);
kfree(vl);
}
diff --git a/fs/afs/volume.c b/fs/afs/volume.c
index 8bab0e3..a353e69 100644
--- a/fs/afs/volume.c
+++ b/fs/afs/volume.c
@@ -124,13 +124,11 @@ struct afs_volume *afs_volume_lookup(struct afs_mount_params *params)
}

/* attach the cache and volume location */
-#ifdef AFS_CACHING_SUPPORT
- cachefs_acquire_cookie(vlocation->cache,
- &afs_vnode_cache_index_def,
- volume,
- &volume->cache);
+#ifdef CONFIG_AFS_FSCACHE
+ volume->cache = fscache_acquire_cookie(vlocation->cache,
+ &afs_volume_cache_index_def,
+ volume);
#endif
-
afs_get_vlocation(vlocation);
volume->vlocation = vlocation;

@@ -194,8 +192,8 @@ void afs_put_volume(struct afs_volume *volume)
up_write(&vlocation->cell->vl_sem);

/* finish cleaning up the volume */
-#ifdef AFS_CACHING_SUPPORT
- cachefs_relinquish_cookie(volume->cache, 0);
+#ifdef CONFIG_AFS_FSCACHE
+ fscache_relinquish_cookie(volume->cache, 0);
#endif
afs_put_vlocation(vlocation);

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 3fb36d4..7884518 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -780,3 +780,24 @@ int afs_fsync(struct file *file, struct dentry *dentry, int datasync)
_leave(" = %d", ret);
return ret;
}
+
+/*
+ * notification that a previously read-only page is about to become writable
+ * - if it returns an error, the caller will deliver a bus error signal
+ */
+int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ struct afs_vnode *vnode = AFS_FS_I(vma->vm_file->f_mapping->host);
+
+ _enter("{{%x:%u}},{%lx}",
+ vnode->fid.vid, vnode->fid.vnode, page->index);
+
+ /* wait for the page to be written to the cache before we allow it to
+ * be modified */
+#ifdef CONFIG_AFS_FSCACHE
+ wait_on_page_fscache_write(page);
+#endif
+
+ _leave(" = 0");
+ return 0;
+}

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:40 UTC
Permalink
Add comment banners to some NFS functions so that they can be modified by the
NFS fscache patches for further information.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/file.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 0abf3f3..f5bc54d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -409,6 +409,13 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
return copied;
}

+/*
+ * Partially or wholly invalidate a page
+ * - Release the private state associated with a page if undergoing complete
+ * page invalidation
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ */
static void nfs_invalidate_page(struct page *page, unsigned long offset)
{
dfprintk(PAGECACHE, "NFS: invalidate_page(%p, %lu)\n", page, offset);
@@ -419,6 +426,12 @@ static void nfs_invalidate_page(struct page *page, unsigned long offset)
nfs_wb_page_cancel(page->mapping->host, page);
}

+/*
+ * Attempt to release the private state associated with a page
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ * - Return true (may release page) or false (may not)
+ */
static int nfs_release_page(struct page *page, gfp_t gfp)
{
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
@@ -427,6 +440,14 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
return 0;
}

+/*
+ * Attempt to clear the private state associated with a page when an error
+ * occurs that requires the cached contents of an inode to be written back or
+ * destroyed
+ * - Called if either PG_private or PG_private_2 is set on the page
+ * - Caller holds page lock
+ * - Return 0 if successful, -error otherwise
+ */
static int nfs_launder_page(struct page *page)
{
struct inode *inode = page->mapping->host;
@@ -451,6 +472,11 @@ const struct address_space_operations nfs_file_aops = {
.launder_page = nfs_launder_page,
};

+/*
+ * Notification that a PTE pointing to an NFS page is about to be made
+ * writable, implying that someone is about to modify the page through a
+ * shared-writable mapping
+ */
static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:07 UTC
Permalink
Define and create superblock-level cache index objects (as managed by
nfs_server structs).

Each superblock object is created in a server level index object and is itself
an index into which inode-level objects are inserted.

Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.

The superblock object key is a sequence consisting of:

(1) Certain superblock s_flags.

(2) Various connection parameters that serve to distinguish superblocks for
sget().

(3) The volume FSID.

(4) The security flavour.

(5) The uniquifier length.

(6) The uniquifier text. This is normally an empty string, unless the fsc=xyz
mount option was used to explicitly specify a uniquifier.

The key blob is of variable length, depending on the length of (6).

The superblock object is given no coherency data to carry in the auxiliary data
permitted by the cache. It is assumed that the superblock is always coherent.


This patch also adds uniquification handling such that two otherwise identical
superblocks, at least one of which is marked "nosharecache", won't end up
trying to share the on-disk cache. It will be possible to manually provide a
uniquifier through a mount option with a later patch to avoid the error
otherwise produced.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache-index.c | 34 +++++++++++++
fs/nfs/fscache.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 49 +++++++++++++++++++
fs/nfs/internal.h | 3 +
fs/nfs/super.c | 9 +++
include/linux/nfs_fs_sb.h | 5 ++
6 files changed, 214 insertions(+), 2 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index ff14b03..a824050 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -112,3 +112,37 @@ const struct fscache_cookie_def nfs_fscache_server_index_def = {
.type = FSCACHE_COOKIE_TYPE_INDEX,
.get_key = nfs_server_get_key,
};
+
+/*
+ * Generate a key to describe a superblock key in the main NFS index
+ */
+static uint16_t nfs_super_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct nfs_fscache_key *key;
+ const struct nfs_server *nfss = cookie_netfs_data;
+ uint16_t len;
+
+ key = nfss->fscache_key;
+ len = sizeof(key->key) + key->key.uniq_len;
+ if (len > bufmax) {
+ len = 0;
+ } else {
+ memcpy(buffer, &key->key, sizeof(key->key));
+ memcpy(buffer + sizeof(key->key),
+ key->key.uniquifier, key->key.uniq_len);
+ }
+
+ return len;
+}
+
+/*
+ * Define the superblock object for FS-Cache. This is used to describe a
+ * superblock object to fscache_acquire_cookie(). It is keyed by all the NFS
+ * parameters that might cause a separate superblock.
+ */
+const struct fscache_cookie_def nfs_fscache_super_index_def = {
+ .name = "NFS.super",
+ .type = FSCACHE_COOKIE_TYPE_INDEX,
+ .get_key = nfs_super_get_key,
+};
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index c3f056f..ab2de2c 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -23,6 +23,9 @@

#define NFSDBG_FACILITY NFSDBG_FSCACHE

+static struct rb_root nfs_fscache_keys = RB_ROOT;
+static DEFINE_SPINLOCK(nfs_fscache_keys_lock);
+
/*
* Get the per-client index cookie for an NFS client if the appropriate mount
* flag was set
@@ -50,3 +53,116 @@ void nfs_fscache_release_client_cookie(struct nfs_client *clp)
fscache_relinquish_cookie(clp->fscache, 0);
clp->fscache = NULL;
}
+
+/*
+ * Get the cache cookie for an NFS superblock. We have to handle
+ * uniquification here because the cache doesn't do it for us.
+ */
+void nfs_fscache_get_super_cookie(struct super_block *sb,
+ struct nfs_parsed_mount_data *data)
+{
+ struct nfs_fscache_key *key, *xkey;
+ struct nfs_server *nfss = NFS_SB(sb);
+ struct rb_node **p, *parent;
+ const char *uniq = data->fscache_uniq ?: "";
+ int diff, ulen;
+
+ ulen = strlen(uniq);
+ key = kzalloc(sizeof(*key) + ulen, GFP_KERNEL);
+ if (!key)
+ return;
+
+ key->nfs_client = nfss->nfs_client;
+ key->key.super.s_flags = sb->s_flags & NFS_MS_MASK;
+ key->key.nfs_server.flags = nfss->flags;
+ key->key.nfs_server.rsize = nfss->rsize;
+ key->key.nfs_server.wsize = nfss->wsize;
+ key->key.nfs_server.acregmin = nfss->acregmin;
+ key->key.nfs_server.acregmax = nfss->acregmax;
+ key->key.nfs_server.acdirmin = nfss->acdirmin;
+ key->key.nfs_server.acdirmax = nfss->acdirmax;
+ key->key.nfs_server.fsid = nfss->fsid;
+ key->key.rpc_auth.au_flavor = nfss->client->cl_auth->au_flavor;
+
+ key->key.uniq_len = ulen;
+ memcpy(key->key.uniquifier, uniq, ulen);
+
+ spin_lock(&nfs_fscache_keys_lock);
+ p = &nfs_fscache_keys.rb_node;
+ parent = NULL;
+ while (*p) {
+ parent = *p;
+ xkey = rb_entry(parent, struct nfs_fscache_key, node);
+
+ if (key->nfs_client < xkey->nfs_client)
+ goto go_left;
+ if (key->nfs_client > xkey->nfs_client)
+ goto go_right;
+
+ diff = memcmp(&key->key, &xkey->key, sizeof(key->key));
+ if (diff < 0)
+ goto go_left;
+ if (diff > 0)
+ goto go_right;
+
+ if (key->key.uniq_len == 0)
+ goto non_unique;
+ diff = memcmp(key->key.uniquifier,
+ xkey->key.uniquifier,
+ key->key.uniq_len);
+ if (diff < 0)
+ goto go_left;
+ if (diff > 0)
+ goto go_right;
+ goto non_unique;
+
+ go_left:
+ p = &(*p)->rb_left;
+ continue;
+ go_right:
+ p = &(*p)->rb_right;
+ }
+
+ rb_link_node(&key->node, parent, p);
+ rb_insert_color(&key->node, &nfs_fscache_keys);
+ spin_unlock(&nfs_fscache_keys_lock);
+ nfss->fscache_key = key;
+
+ /* create a cache index for looking up filehandles */
+ nfss->fscache = fscache_acquire_cookie(nfss->nfs_client->fscache,
+ &nfs_fscache_super_index_def,
+ nfss);
+ dfprintk(FSCACHE, "NFS: get superblock cookie (0x%p/0x%p)\n",
+ nfss, nfss->fscache);
+ return;
+
+non_unique:
+ spin_unlock(&nfs_fscache_keys_lock);
+ kfree(key);
+ nfss->fscache_key = NULL;
+ nfss->fscache = NULL;
+ printk(KERN_WARNING "NFS:"
+ " Cache request denied due to non-unique superblock keys\n");
+}
+
+/*
+ * release a per-superblock cookie
+ */
+void nfs_fscache_release_super_cookie(struct super_block *sb)
+{
+ struct nfs_server *nfss = NFS_SB(sb);
+
+ dfprintk(FSCACHE, "NFS: releasing superblock cookie (0x%p/0x%p)\n",
+ nfss, nfss->fscache);
+
+ fscache_relinquish_cookie(nfss->fscache, 0);
+ nfss->fscache = NULL;
+
+ if (nfss->fscache_key) {
+ spin_lock(&nfs_fscache_keys_lock);
+ rb_erase(&nfss->fscache_key->node, &nfs_fscache_keys);
+ spin_unlock(&nfs_fscache_keys_lock);
+ kfree(nfss->fscache_key);
+ nfss->fscache_key = NULL;
+ }
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 1d864be..22b971e 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -20,10 +20,48 @@
#ifdef CONFIG_NFS_FSCACHE

/*
+ * set of NFS FS-Cache objects that form a superblock key
+ */
+struct nfs_fscache_key {
+ struct rb_node node;
+ struct nfs_client *nfs_client; /* the server */
+
+ /* the elements of the unique key - as used by nfs_compare_super() and
+ * nfs_compare_mount_options() to distinguish superblocks */
+ struct {
+ struct {
+ unsigned long s_flags; /* various flags
+ * (& NFS_MS_MASK) */
+ } super;
+
+ struct {
+ struct nfs_fsid fsid;
+ int flags;
+ unsigned int rsize; /* read size */
+ unsigned int wsize; /* write size */
+ unsigned int acregmin; /* attr cache timeouts */
+ unsigned int acregmax;
+ unsigned int acdirmin;
+ unsigned int acdirmax;
+ } nfs_server;
+
+ struct {
+ rpc_authflavor_t au_flavor;
+ } rpc_auth;
+
+ /* uniquifier - can be used if nfs_server.flags includes
+ * NFS_MOUNT_UNSHARED */
+ u8 uniq_len;
+ char uniquifier[0];
+ } key;
+};
+
+/*
* fscache-index.c
*/
extern struct fscache_netfs nfs_fscache_netfs;
extern const struct fscache_cookie_def nfs_fscache_server_index_def;
+extern const struct fscache_cookie_def nfs_fscache_super_index_def;

extern int nfs_fscache_register(void);
extern void nfs_fscache_unregister(void);
@@ -34,6 +72,10 @@ extern void nfs_fscache_unregister(void);
extern void nfs_fscache_get_client_cookie(struct nfs_client *);
extern void nfs_fscache_release_client_cookie(struct nfs_client *);

+extern void nfs_fscache_get_super_cookie(struct super_block *,
+ struct nfs_parsed_mount_data *);
+extern void nfs_fscache_release_super_cookie(struct super_block *);
+
#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
static inline void nfs_fscache_unregister(void) {}
@@ -41,5 +83,12 @@ static inline void nfs_fscache_unregister(void) {}
static inline void nfs_fscache_get_client_cookie(struct nfs_client *clp) {}
static inline void nfs_fscache_release_client_cookie(struct nfs_client *clp) {}

+static inline void nfs_fscache_get_super_cookie(
+ struct super_block *sb,
+ struct nfs_parsed_mount_data *data)
+{
+}
+static inline void nfs_fscache_release_super_cookie(struct super_block *sb) {}
+
#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2041f68..0130700 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -5,6 +5,8 @@
#include <linux/mount.h>
#include <linux/security.h>

+#define NFS_MS_MASK (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_SYNCHRONOUS)
+
struct nfs_string;

/* Maximum number of readahead requests
@@ -41,6 +43,7 @@ struct nfs_parsed_mount_data {
unsigned int auth_flavor_len;
rpc_authflavor_t auth_flavors[1];
char *client_address;
+ char *fscache_uniq;

struct {
struct sockaddr_storage address;
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 0942fcb..87f65ae 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -60,6 +60,7 @@
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_VFS

@@ -1870,8 +1871,6 @@ static void nfs_clone_super(struct super_block *sb,
nfs_initialise_sb(sb);
}

-#define NFS_MS_MASK (MS_RDONLY|MS_NOSUID|MS_NODEV|MS_NOEXEC|MS_SYNCHRONOUS)
-
static int nfs_compare_mount_options(const struct super_block *s, const struct nfs_server *b, int flags)
{
const struct nfs_server *a = s->s_fs_info;
@@ -2036,6 +2035,7 @@ static int nfs_get_sb(struct file_system_type *fs_type,
if (!s->s_root) {
/* initial superblock/root creation */
nfs_fill_super(s, data);
+ nfs_fscache_get_super_cookie(s, data);
}

mntroot = nfs_get_root(s, mntfh);
@@ -2056,6 +2056,7 @@ static int nfs_get_sb(struct file_system_type *fs_type,
out:
kfree(data->nfs_server.hostname);
kfree(data->mount_server.hostname);
+ kfree(data->fscache_uniq);
security_free_mnt_opts(&data->lsm_opts);
out_free_fh:
kfree(mntfh);
@@ -2083,6 +2084,7 @@ static void nfs_kill_super(struct super_block *s)

bdi_unregister(&server->backing_dev_info);
kill_anon_super(s);
+ nfs_fscache_release_super_cookie(s);
nfs_free_server(server);
}

@@ -2390,6 +2392,7 @@ static int nfs4_get_sb(struct file_system_type *fs_type,
if (!s->s_root) {
/* initial superblock/root creation */
nfs4_fill_super(s);
+ nfs_fscache_get_super_cookie(s, data);
}

mntroot = nfs4_get_root(s, mntfh);
@@ -2411,6 +2414,7 @@ out:
kfree(data->client_address);
kfree(data->nfs_server.export_path);
kfree(data->nfs_server.hostname);
+ kfree(data->fscache_uniq);
security_free_mnt_opts(&data->lsm_opts);
out_free_fh:
kfree(mntfh);
@@ -2437,6 +2441,7 @@ static void nfs4_kill_super(struct super_block *sb)
kill_anon_super(sb);

nfs4_renewd_prepare_shutdown(server);
+ nfs_fscache_release_super_cookie(sb);
nfs_free_server(server);
}

diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 0a374b9..6ad7594 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -108,6 +108,11 @@ struct nfs_server {
unsigned long mount_time; /* when this fs was mounted */
dev_t s_dev; /* superblock dev numbers */

+#ifdef CONFIG_NFS_FSCACHE
+ struct nfs_fscache_key *fscache_key; /* unique key for superblock */
+ struct fscache_cookie *fscache; /* superblock cookie */
+#endif
+
#ifdef CONFIG_NFS_V4
u32 attr_bitmask[2];/* V4 bitmask representing the set
of attributes supported on this

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:56 UTC
Permalink
Register NFS for caching and retrieve the top-level cache index object cookie.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/Makefile | 1 +
fs/nfs/fscache-index.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 35 ++++++++++++++++++++++++++++++++++
fs/nfs/inode.c | 8 ++++++++
4 files changed, 93 insertions(+), 0 deletions(-)
create mode 100644 fs/nfs/fscache-index.c
create mode 100644 fs/nfs/fscache.h


diff --git a/fs/nfs/Makefile b/fs/nfs/Makefile
index ac6170c..0e0bb6c 100644
--- a/fs/nfs/Makefile
+++ b/fs/nfs/Makefile
@@ -15,3 +15,4 @@ nfs-$(CONFIG_NFS_V4) += nfs4proc.o nfs4xdr.o nfs4state.o nfs4renewd.o \
callback.o callback_xdr.o callback_proc.o \
nfs4namespace.o
nfs-$(CONFIG_SYSCTL) += sysctl.o
+nfs-$(CONFIG_NFS_FSCACHE) += fscache-index.o
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
new file mode 100644
index 0000000..6d5bb5c
--- /dev/null
+++ b/fs/nfs/fscache-index.c
@@ -0,0 +1,49 @@
+/* NFS FS-Cache index structure definition
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/nfs_fs.h>
+#include <linux/nfs_fs_sb.h>
+#include <linux/in6.h>
+
+#include "internal.h"
+#include "fscache.h"
+
+#define NFSDBG_FACILITY NFSDBG_FSCACHE
+
+/*
+ * Define the NFS filesystem for FS-Cache. Upon registration FS-Cache sticks
+ * the cookie for the top-level index object for NFS into here. The top-level
+ * index can than have other cache objects inserted into it.
+ */
+struct fscache_netfs nfs_fscache_netfs = {
+ .name = "nfs",
+ .version = 0,
+};
+
+/*
+ * Register NFS for caching
+ */
+int nfs_fscache_register(void)
+{
+ return fscache_register_netfs(&nfs_fscache_netfs);
+}
+
+/*
+ * Unregister NFS for caching
+ */
+void nfs_fscache_unregister(void)
+{
+ fscache_unregister_netfs(&nfs_fscache_netfs);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
new file mode 100644
index 0000000..ccfcdc5
--- /dev/null
+++ b/fs/nfs/fscache.h
@@ -0,0 +1,35 @@
+/* NFS filesystem cache interface definitions
+ *
+ * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _NFS_FSCACHE_H
+#define _NFS_FSCACHE_H
+
+#include <linux/nfs_fs.h>
+#include <linux/nfs_mount.h>
+#include <linux/nfs4_mount.h>
+#include <linux/fscache.h>
+
+#ifdef CONFIG_NFS_FSCACHE
+
+/*
+ * fscache-index.c
+ */
+extern struct fscache_netfs nfs_fscache_netfs;
+
+extern int nfs_fscache_register(void);
+extern void nfs_fscache_unregister(void);
+
+#else /* CONFIG_NFS_FSCACHE */
+static inline int nfs_fscache_register(void) { return 0; }
+static inline void nfs_fscache_unregister(void) {}
+
+#endif /* CONFIG_NFS_FSCACHE */
+#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index a834d1d..cd29f41 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -46,6 +46,7 @@
#include "delegation.h"
#include "iostat.h"
#include "internal.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_VFS

@@ -1436,6 +1437,10 @@ static int __init init_nfs_fs(void)
{
int err;

+ err = nfs_fscache_register();
+ if (err < 0)
+ goto out7;
+
err = nfsiod_start();
if (err)
goto out6;
@@ -1488,6 +1493,8 @@ out4:
out5:
nfsiod_stop();
out6:
+ nfs_fscache_unregister();
+out7:
return err;
}

@@ -1498,6 +1505,7 @@ static void __exit exit_nfs_fs(void)
nfs_destroy_readpagecache();
nfs_destroy_inodecache();
nfs_destroy_nfspagecache();
+ nfs_fscache_unregister();
#ifdef CONFIG_PROC_FS
rpc_proc_unregister("nfs");
#endif

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:12 UTC
Permalink
Define and create inode-level cache data storage objects (as managed by
nfs_inode structs).

Each inode-level object is created in a superblock-level index object and is
itself a data storage object into which pages from the inode are stored.

The inode object key is the NFS file handle for the inode.

The inode object is given coherency data to carry in the auxiliary data
permitted by the cache. This is a sequence made up of:

(1) i_mtime from the NFS inode.

(2) i_ctime from the NFS inode.

(3) i_size from the NFS inode.

(4) change_attr from the NFSv4 attribute data.

As the cache is a persistent cache, the auxiliary data is checked when a new
NFS in-memory inode is set up that matches an already existing data storage
object in the cache. If the coherency data is the same, the on-disk object is
retained and used; if not, it is scrapped and a new one created.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache-index.c | 123 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 1
2 files changed, 124 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index a824050..7bc54ac 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -146,3 +146,126 @@ const struct fscache_cookie_def nfs_fscache_super_index_def = {
.type = FSCACHE_COOKIE_TYPE_INDEX,
.get_key = nfs_super_get_key,
};
+
+/*
+ * Definition of the auxiliary data attached to NFS inode storage objects
+ * within the cache.
+ *
+ * The contents of this struct are recorded in the on-disk local cache in the
+ * auxiliary data attached to the data storage object backing an inode. This
+ * permits coherency to be managed when a new inode binds to an already extant
+ * cache object.
+ */
+struct nfs_fscache_inode_auxdata {
+ struct timespec mtime;
+ struct timespec ctime;
+ loff_t size;
+ u64 change_attr;
+};
+
+/*
+ * Generate a key to describe an NFS inode in an NFS server's index
+ */
+static uint16_t nfs_fscache_inode_get_key(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ const struct nfs_inode *nfsi = cookie_netfs_data;
+ uint16_t nsize;
+
+ /* use the inode's NFS filehandle as the key */
+ nsize = nfsi->fh.size;
+ memcpy(buffer, nfsi->fh.data, nsize);
+ return nsize;
+}
+
+/*
+ * Get certain file attributes from the netfs data
+ * - This function can be absent for an index
+ * - Not permitted to return an error
+ * - The netfs data from the cookie being used as the source is presented
+ */
+static void nfs_fscache_inode_get_attr(const void *cookie_netfs_data,
+ uint64_t *size)
+{
+ const struct nfs_inode *nfsi = cookie_netfs_data;
+
+ *size = nfsi->vfs_inode.i_size;
+}
+
+/*
+ * Get the auxiliary data from netfs data
+ * - This function can be absent if the index carries no state data
+ * - Should store the auxiliary data in the buffer
+ * - Should return the amount of amount stored
+ * - Not permitted to return an error
+ * - The netfs data from the cookie being used as the source is presented
+ */
+static uint16_t nfs_fscache_inode_get_aux(const void *cookie_netfs_data,
+ void *buffer, uint16_t bufmax)
+{
+ struct nfs_fscache_inode_auxdata auxdata;
+ const struct nfs_inode *nfsi = cookie_netfs_data;
+
+ memset(&auxdata, 0, sizeof(auxdata));
+ auxdata.size = nfsi->vfs_inode.i_size;
+ auxdata.mtime = nfsi->vfs_inode.i_mtime;
+ auxdata.ctime = nfsi->vfs_inode.i_ctime;
+
+ if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
+ auxdata.change_attr = nfsi->change_attr;
+
+ if (bufmax > sizeof(auxdata))
+ bufmax = sizeof(auxdata);
+
+ memcpy(buffer, &auxdata, bufmax);
+ return bufmax;
+}
+
+/*
+ * Consult the netfs about the state of an object
+ * - This function can be absent if the index carries no state data
+ * - The netfs data from the cookie being used as the target is
+ * presented, as is the auxiliary data
+ */
+static
+enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data,
+ const void *data,
+ uint16_t datalen)
+{
+ struct nfs_fscache_inode_auxdata auxdata;
+ struct nfs_inode *nfsi = cookie_netfs_data;
+
+ if (datalen != sizeof(auxdata))
+ return FSCACHE_CHECKAUX_OBSOLETE;
+
+ memset(&auxdata, 0, sizeof(auxdata));
+ auxdata.size = nfsi->vfs_inode.i_size;
+ auxdata.mtime = nfsi->vfs_inode.i_mtime;
+ auxdata.ctime = nfsi->vfs_inode.i_ctime;
+
+ if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
+ auxdata.change_attr = nfsi->change_attr;
+
+ if (memcmp(data, &auxdata, datalen) != 0)
+ return FSCACHE_CHECKAUX_OBSOLETE;
+
+ return FSCACHE_CHECKAUX_OKAY;
+}
+
+/*
+ * Define the inode object for FS-Cache. This is used to describe an inode
+ * object to fscache_acquire_cookie(). It is keyed by the NFS file handle for
+ * an inode.
+ *
+ * Coherency is managed by comparing the copies of i_size, i_mtime and i_ctime
+ * held in the cache auxiliary data for the data storage object with those in
+ * the inode struct in memory.
+ */
+const struct fscache_cookie_def nfs_fscache_inode_object_def = {
+ .name = "NFS.fh",
+ .type = FSCACHE_COOKIE_TYPE_DATAFILE,
+ .get_key = nfs_fscache_inode_get_key,
+ .get_attr = nfs_fscache_inode_get_attr,
+ .get_aux = nfs_fscache_inode_get_aux,
+ .check_aux = nfs_fscache_inode_check_aux,
+};
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 22b971e..d21b590 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -62,6 +62,7 @@ struct nfs_fscache_key {
extern struct fscache_netfs nfs_fscache_netfs;
extern const struct fscache_cookie_def nfs_fscache_server_index_def;
extern const struct fscache_cookie_def nfs_fscache_super_index_def;
+extern const struct fscache_cookie_def nfs_fscache_inode_object_def;

extern int nfs_fscache_register(void);
extern void nfs_fscache_unregister(void);

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:17 UTC
Permalink
Bind data storage objects in the local cache to NFS inodes.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 13 ++++
fs/nfs/inode.c | 6 ++
include/linux/nfs_fs.h | 10 +++
4 files changed, 191 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index ab2de2c..e3816eb 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -166,3 +166,165 @@ void nfs_fscache_release_super_cookie(struct super_block *sb)
nfss->fscache_key = NULL;
}
}
+
+/*
+ * Initialise the per-inode cache cookie pointer for an NFS inode.
+ */
+void nfs_fscache_init_inode_cookie(struct inode *inode)
+{
+ NFS_I(inode)->fscache = NULL;
+ if (S_ISREG(inode->i_mode))
+ set_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+}
+
+/*
+ * Get the per-inode cache cookie for an NFS inode.
+ */
+static void nfs_fscache_enable_inode_cookie(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ if (nfsi->fscache || !NFS_FSCACHE(inode))
+ return;
+
+ if ((NFS_SB(sb)->options & NFS_OPTION_FSCACHE)) {
+ nfsi->fscache = fscache_acquire_cookie(
+ NFS_SB(sb)->fscache,
+ &nfs_fscache_inode_object_def,
+ nfsi);
+
+ dfprintk(FSCACHE, "NFS: get FH cookie (0x%p/0x%p/0x%p)\n",
+ sb, nfsi, nfsi->fscache);
+ }
+}
+
+/*
+ * Release a per-inode cookie.
+ */
+void nfs_fscache_release_inode_cookie(struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ dfprintk(FSCACHE, "NFS: clear cookie (0x%p/0x%p)\n",
+ nfsi, nfsi->fscache);
+
+ fscache_relinquish_cookie(nfsi->fscache, 0);
+ nfsi->fscache = NULL;
+}
+
+/*
+ * Retire a per-inode cookie, destroying the data attached to it.
+ */
+void nfs_fscache_zap_inode_cookie(struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ dfprintk(FSCACHE, "NFS: zapping cookie (0x%p/0x%p)\n",
+ nfsi, nfsi->fscache);
+
+ fscache_relinquish_cookie(nfsi->fscache, 1);
+ nfsi->fscache = NULL;
+}
+
+/*
+ * Turn off the cache with regard to a per-inode cookie if opened for writing,
+ * invalidating all the pages in the page cache relating to the associated
+ * inode to clear the per-page caching.
+ */
+static void nfs_fscache_disable_inode_cookie(struct inode *inode)
+{
+ clear_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+
+ if (NFS_I(inode)->fscache) {
+ dfprintk(FSCACHE,
+ "NFS: nfsi 0x%p turning cache off\n", NFS_I(inode));
+
+ /* Need to invalidate any mapped pages that were read in before
+ * turning off the cache.
+ */
+ if (inode->i_mapping && inode->i_mapping->nrpages)
+ invalidate_inode_pages2(inode->i_mapping);
+
+ nfs_fscache_zap_inode_cookie(inode);
+ }
+}
+
+/*
+ * wait_on_bit() sleep function for uninterruptible waiting
+ */
+static int nfs_fscache_wait_bit(void *flags)
+{
+ schedule();
+ return 0;
+}
+
+/*
+ * Lock against someone else trying to also acquire or relinquish a cookie
+ */
+static inline void nfs_fscache_inode_lock(struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ while (test_and_set_bit(NFS_INO_FSCACHE_LOCK, &nfsi->flags))
+ wait_on_bit(&nfsi->flags, NFS_INO_FSCACHE_LOCK,
+ nfs_fscache_wait_bit, TASK_UNINTERRUPTIBLE);
+}
+
+/*
+ * Unlock cookie management lock
+ */
+static inline void nfs_fscache_inode_unlock(struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ smp_mb__before_clear_bit();
+ clear_bit(NFS_INO_FSCACHE_LOCK, &nfsi->flags);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&nfsi->flags, NFS_INO_FSCACHE_LOCK);
+}
+
+/*
+ * Decide if we should enable or disable local caching for this inode.
+ * - For now, with NFS, only regular files that are open read-only will be able
+ * to use the cache.
+ * - May be invoked multiple times in parallel by parallel nfs_open() functions.
+ */
+void nfs_fscache_set_inode_cookie(struct inode *inode, struct file *filp)
+{
+ if (NFS_FSCACHE(inode)) {
+ nfs_fscache_inode_lock(inode);
+ if ((filp->f_flags & O_ACCMODE) != O_RDONLY)
+ nfs_fscache_disable_inode_cookie(inode);
+ else
+ nfs_fscache_enable_inode_cookie(inode);
+ nfs_fscache_inode_unlock(inode);
+ }
+}
+
+/*
+ * Replace a per-inode cookie due to revalidation detecting a file having
+ * changed on the server.
+ */
+void nfs_fscache_reset_inode_cookie(struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+ struct nfs_server *nfss = NFS_SERVER(inode);
+ struct fscache_cookie *old = nfsi->fscache;
+
+ nfs_fscache_inode_lock(inode);
+ if (nfsi->fscache) {
+ /* retire the current fscache cache and get a new one */
+ fscache_relinquish_cookie(nfsi->fscache, 1);
+
+ nfsi->fscache = fscache_acquire_cookie(
+ nfss->nfs_client->fscache,
+ &nfs_fscache_inode_object_def,
+ nfsi);
+
+ dfprintk(FSCACHE,
+ "NFS: revalidation new cookie (0x%p/0x%p/0x%p/0x%p)\n",
+ nfss, nfsi, old, nfsi->fscache);
+ }
+ nfs_fscache_inode_unlock(inode);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index d21b590..8b4299a 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -77,6 +77,12 @@ extern void nfs_fscache_get_super_cookie(struct super_block *,
struct nfs_parsed_mount_data *);
extern void nfs_fscache_release_super_cookie(struct super_block *);

+extern void nfs_fscache_init_inode_cookie(struct inode *);
+extern void nfs_fscache_release_inode_cookie(struct inode *);
+extern void nfs_fscache_zap_inode_cookie(struct inode *);
+extern void nfs_fscache_set_inode_cookie(struct inode *, struct file *);
+extern void nfs_fscache_reset_inode_cookie(struct inode *);
+
#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
static inline void nfs_fscache_unregister(void) {}
@@ -91,5 +97,12 @@ static inline void nfs_fscache_get_super_cookie(
}
static inline void nfs_fscache_release_super_cookie(struct super_block *sb) {}

+static inline void nfs_fscache_init_inode_cookie(struct inode *inode) {}
+static inline void nfs_fscache_release_inode_cookie(struct inode *inode) {}
+static inline void nfs_fscache_zap_inode_cookie(struct inode *inode) {}
+static inline void nfs_fscache_set_inode_cookie(struct inode *inode,
+ struct file *filp) {}
+static inline void nfs_fscache_reset_inode_cookie(struct inode *inode) {}
+
#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index cd29f41..64f8719 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -122,6 +122,7 @@ void nfs_clear_inode(struct inode *inode)
BUG_ON(!list_empty(&NFS_I(inode)->open_files));
nfs_zap_acl_cache(inode);
nfs_access_zap_cache(inode);
+ nfs_fscache_release_inode_cookie(inode);
}

/**
@@ -356,6 +357,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
nfsi->attrtimeo_timestamp = now;
nfsi->access_cache = RB_ROOT;

+ nfs_fscache_init_inode_cookie(inode);
+
unlock_new_inode(inode);
} else
nfs_refresh_inode(inode, fattr);
@@ -687,6 +690,7 @@ int nfs_open(struct inode *inode, struct file *filp)
ctx->mode = filp->f_mode;
nfs_file_set_open_context(filp, ctx);
put_nfs_open_context(ctx);
+ nfs_fscache_set_inode_cookie(inode, filp);
return 0;
}

@@ -787,6 +791,7 @@ static int nfs_invalidate_mapping_nolock(struct inode *inode, struct address_spa
memset(nfsi->cookieverf, 0, sizeof(nfsi->cookieverf));
spin_unlock(&inode->i_lock);
nfs_inc_stats(inode, NFSIOS_DATAINVALIDATE);
+ nfs_fscache_reset_inode_cookie(inode);
dfprintk(PAGECACHE, "NFS: (%s/%Ld) data cache invalidated\n",
inode->i_sb->s_id, (long long)NFS_FILEID(inode));
return 0;
@@ -1031,6 +1036,7 @@ int nfs_refresh_inode(struct inode *inode, struct nfs_fattr *fattr)
spin_lock(&inode->i_lock);
status = nfs_refresh_inode_locked(inode, fattr);
spin_unlock(&inode->i_lock);
+
return status;
}

diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index fd3e7f9..8a99e79 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -185,6 +185,9 @@ struct nfs_inode {
fmode_t delegation_state;
struct rw_semaphore rwsem;
#endif /* CONFIG_NFS_V4*/
+#ifdef CONFIG_NFS_FSCACHE
+ struct fscache_cookie *fscache;
+#endif
struct inode vfs_inode;
};

@@ -207,6 +210,8 @@ struct nfs_inode {
#define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
#define NFS_INO_MOUNTPOINT (3) /* inode is remote mountpoint */
#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
+#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
+#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */

static inline struct nfs_inode *NFS_I(const struct inode *inode)
{
@@ -260,6 +265,11 @@ static inline int NFS_STALE(const struct inode *inode)
return test_bit(NFS_INO_STALE, &NFS_I(inode)->flags);
}

+static inline int NFS_FSCACHE(const struct inode *inode)
+{
+ return test_bit(NFS_INO_FSCACHE, &NFS_I(inode)->flags);
+}
+
static inline __u64 NFS_FILEID(const struct inode *inode)
{
return NFS_I(inode)->fileid;

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:27 UTC
Permalink
Add some new NFS I/O counters for FS-Cache doing things for NFS. A new line is
emitted into /proc/pid/mountstats if caching is enabled that looks like:

fsc: <rok> <rfl> <wok> <wfl> <unc>

Where <rok> is the number of pages read successfully from the cache, <rfl> is
the number of failed page reads against the cache, <wok> is the number of
successful page writes to the cache, <wfl> is the number of failed page writes
to the cache, and <unc> is the number of NFS pages that have been disconnected
from the cache.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/iostat.h | 18 ++++++++++++++++++
fs/nfs/super.c | 11 +++++++++++
include/linux/nfs_iostat.h | 12 ++++++++++++
3 files changed, 41 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/iostat.h b/fs/nfs/iostat.h
index a369528..a2ab252 100644
--- a/fs/nfs/iostat.h
+++ b/fs/nfs/iostat.h
@@ -16,6 +16,9 @@

struct nfs_iostats {
unsigned long long bytes[__NFSIOS_BYTESMAX];
+#ifdef CONFIG_NFS_FSCACHE
+ unsigned long long fscache[__NFSIOS_FSCACHEMAX];
+#endif
unsigned long events[__NFSIOS_COUNTSMAX];
} ____cacheline_aligned;

@@ -57,6 +60,21 @@ static inline void nfs_add_stats(const struct inode *inode,
nfs_add_server_stats(NFS_SERVER(inode), stat, addend);
}

+#ifdef CONFIG_NFS_FSCACHE
+static inline void nfs_add_fscache_stats(struct inode *inode,
+ enum nfs_stat_fscachecounters stat,
+ unsigned long addend)
+{
+ struct nfs_iostats *iostats;
+ int cpu;
+
+ cpu = get_cpu();
+ iostats = per_cpu_ptr(NFS_SERVER(inode)->io_stats, cpu);
+ iostats->fscache[stat] += addend;
+ put_cpu_no_resched();
+}
+#endif
+
static inline struct nfs_iostats *nfs_alloc_iostats(void)
{
return alloc_percpu(struct nfs_iostats);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 87f65ae..b5fea77 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -642,6 +642,10 @@ static int nfs_show_stats(struct seq_file *m, struct vfsmount *mnt)
totals.events[i] += stats->events[i];
for (i = 0; i < __NFSIOS_BYTESMAX; i++)
totals.bytes[i] += stats->bytes[i];
+#ifdef CONFIG_NFS_FSCACHE
+ for (i = 0; i < __NFSIOS_FSCACHEMAX; i++)
+ totals.fscache[i] += stats->fscache[i];
+#endif

preempt_enable();
}
@@ -652,6 +656,13 @@ static int nfs_show_stats(struct seq_file *m, struct vfsmount *mnt)
seq_printf(m, "\n\tbytes:\t");
for (i = 0; i < __NFSIOS_BYTESMAX; i++)
seq_printf(m, "%Lu ", totals.bytes[i]);
+#ifdef CONFIG_NFS_FSCACHE
+ if (nfss->options & NFS_OPTION_FSCACHE) {
+ seq_printf(m, "\n\tfsc:\t");
+ for (i = 0; i < __NFSIOS_FSCACHEMAX; i++)
+ seq_printf(m, "%Lu ", totals.bytes[i]);
+ }
+#endif
seq_printf(m, "\n");

rpc_print_iostats(m, nfss->client);
diff --git a/include/linux/nfs_iostat.h b/include/linux/nfs_iostat.h
index 1cb9a3f..68b10f5 100644
--- a/include/linux/nfs_iostat.h
+++ b/include/linux/nfs_iostat.h
@@ -116,4 +116,16 @@ enum nfs_stat_eventcounters {
__NFSIOS_COUNTSMAX,
};

+/*
+ * NFS local caching servicing counters
+ */
+enum nfs_stat_fscachecounters {
+ NFSIOS_FSCACHE_PAGES_READ_OK,
+ NFSIOS_FSCACHE_PAGES_READ_FAIL,
+ NFSIOS_FSCACHE_PAGES_WRITTEN_OK,
+ NFSIOS_FSCACHE_PAGES_WRITTEN_FAIL,
+ NFSIOS_FSCACHE_PAGES_UNCACHED,
+ __NFSIOS_FSCACHEMAX,
+};
+
#endif /* _LINUX_NFS_IOSTAT */

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:22 UTC
Permalink
Invalidate the FsCache page flags on the pages belonging to an inode when the
cache backing that NFS inode is removed.

This allows a live cache to be withdrawn.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache-index.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 7bc54ac..a119b56 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -253,6 +253,45 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data,
}

/*
+ * Indication from FS-Cache that the cookie is no longer cached
+ * - This function is called when the backing store currently caching a cookie
+ * is removed
+ * - The netfs should use this to clean up any markers indicating cached pages
+ * - This is mandatory for any object that may have data
+ */
+static void nfs_fscache_inode_now_uncached(void *cookie_netfs_data)
+{
+ struct nfs_inode *nfsi = cookie_netfs_data;
+ struct pagevec pvec;
+ pgoff_t first;
+ int loop, nr_pages;
+
+ pagevec_init(&pvec, 0);
+ first = 0;
+
+ dprintk("NFS: nfs_inode_now_uncached: nfs_inode 0x%p\n", nfsi);
+
+ for (;;) {
+ /* grab a bunch of pages to unmark */
+ nr_pages = pagevec_lookup(&pvec,
+ nfsi->vfs_inode.i_mapping,
+ first,
+ PAGEVEC_SIZE - pagevec_count(&pvec));
+ if (!nr_pages)
+ break;
+
+ for (loop = 0; loop < nr_pages; loop++)
+ ClearPageFsCache(pvec.pages[loop]);
+
+ first = pvec.pages[nr_pages - 1]->index + 1;
+
+ pvec.nr = nr_pages;
+ pagevec_release(&pvec);
+ cond_resched();
+ }
+}
+
+/*
* Define the inode object for FS-Cache. This is used to describe an inode
* object to fscache_acquire_cookie(). It is keyed by the NFS file handle for
* an inode.
@@ -268,4 +307,5 @@ const struct fscache_cookie_def nfs_fscache_inode_object_def = {
.get_attr = nfs_fscache_inode_get_attr,
.get_aux = nfs_fscache_inode_get_aux,
.check_aux = nfs_fscache_inode_check_aux,
+ .now_uncached = nfs_fscache_inode_now_uncached,
};

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:32 UTC
Permalink
FS-Cache page management for NFS. This includes hooking the releasing and
invalidation of pages marked with PG_fscache (aka PG_private_2) and waiting for
completion of the write-to-cache flag (PG_fscache_write aka PG_owner_priv_2).

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/file.c | 17 +++++++++++++----
fs/nfs/fscache.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 22 ++++++++++++++++++++++
3 files changed, 86 insertions(+), 4 deletions(-)


diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index f5bc54d..d3060c4 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -35,6 +35,7 @@
#include "delegation.h"
#include "internal.h"
#include "iostat.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_FILE

@@ -413,7 +414,7 @@ static int nfs_write_end(struct file *file, struct address_space *mapping,
* Partially or wholly invalidate a page
* - Release the private state associated with a page if undergoing complete
* page invalidation
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
* - Caller holds page lock
*/
static void nfs_invalidate_page(struct page *page, unsigned long offset)
@@ -424,11 +425,13 @@ static void nfs_invalidate_page(struct page *page, unsigned long offset)
return;
/* Cancel any unstarted writes on this page */
nfs_wb_page_cancel(page->mapping->host, page);
+
+ nfs_fscache_invalidate_page(page, page->mapping->host);
}

/*
* Attempt to release the private state associated with a page
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or PG_fscache is set on the page
* - Caller holds page lock
* - Return true (may release page) or false (may not)
*/
@@ -437,14 +440,16 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);

/* If PagePrivate() is set, then the page is not freeable */
- return 0;
+ if (PagePrivate(page))
+ return 0;
+ return nfs_fscache_release_page(page, gfp);
}

/*
* Attempt to clear the private state associated with a page when an error
* occurs that requires the cached contents of an inode to be written back or
* destroyed
- * - Called if either PG_private or PG_private_2 is set on the page
+ * - Called if either PG_private or fscache is set on the page
* - Caller holds page lock
* - Return 0 if successful, -error otherwise
*/
@@ -455,6 +460,7 @@ static int nfs_launder_page(struct page *page)
dfprintk(PAGECACHE, "NFS: launder_page(%ld, %llu)\n",
inode->i_ino, (long long)page_offset(page));

+ wait_on_page_fscache_write(page);
return nfs_wb_page(inode, page);
}

@@ -491,6 +497,9 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
filp->f_mapping->host->i_ino,
(long long)page_offset(page));

+ /* make sure the cache has finished storing the page */
+ wait_on_page_fscache_write(page);
+
lock_page(page);
mapping = page->mapping;
if (mapping != dentry->d_inode->i_mapping)
diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index e3816eb..de990dd 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -19,6 +19,7 @@
#include <linux/seq_file.h>

#include "internal.h"
+#include "iostat.h"
#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_FSCACHE
@@ -328,3 +329,53 @@ void nfs_fscache_reset_inode_cookie(struct inode *inode)
}
nfs_fscache_inode_unlock(inode);
}
+
+/*
+ * Release the caching state associated with a page, if the page isn't busy
+ * interacting with the cache.
+ * - Returns true (can release page) or false (page busy).
+ */
+int nfs_fscache_release_page(struct page *page, gfp_t gfp)
+{
+ if (PageFsCacheWrite(page)) {
+ if (!(gfp & __GFP_WAIT))
+ return 0;
+ wait_on_page_fscache_write(page);
+ }
+
+ if (PageFsCache(page)) {
+ struct nfs_inode *nfsi = NFS_I(page->mapping->host);
+
+ BUG_ON(!nfsi->fscache);
+
+ dfprintk(FSCACHE, "NFS: fscache releasepage (0x%p/0x%p/0x%p)\n",
+ nfsi->fscache, page, nfsi);
+
+ fscache_uncache_page(nfsi->fscache, page);
+ nfs_add_fscache_stats(page->mapping->host,
+ NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
+ }
+
+ return 1;
+}
+
+/*
+ * Release the caching state associated with a page if undergoing complete page
+ * invalidation.
+ */
+void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
+{
+ struct nfs_inode *nfsi = NFS_I(inode);
+
+ BUG_ON(!nfsi->fscache);
+
+ dfprintk(FSCACHE, "NFS: fscache invalidatepage (0x%p/0x%p/0x%p)\n",
+ nfsi->fscache, page, nfsi);
+
+ wait_on_page_fscache_write(page);
+
+ BUG_ON(!PageLocked(page));
+ fscache_uncache_page(nfsi->fscache, page);
+ nfs_add_fscache_stats(page->mapping->host,
+ NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 8b4299a..9cc63af 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -83,6 +83,21 @@ extern void nfs_fscache_zap_inode_cookie(struct inode *);
extern void nfs_fscache_set_inode_cookie(struct inode *, struct file *);
extern void nfs_fscache_reset_inode_cookie(struct inode *);

+extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
+extern int nfs_fscache_release_page(struct page *, gfp_t);
+
+/*
+ * release the caching state associated with a page if undergoing complete page
+ * invalidation
+ */
+static inline void nfs_fscache_invalidate_page(struct page *page,
+ struct inode *inode)
+{
+ if (PageFsCache(page))
+ __nfs_fscache_invalidate_page(page, inode);
+}
+
+
#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
static inline void nfs_fscache_unregister(void) {}
@@ -104,5 +119,12 @@ static inline void nfs_fscache_set_inode_cookie(struct inode *inode,
struct file *filp) {}
static inline void nfs_fscache_reset_inode_cookie(struct inode *inode) {}

+static inline int nfs_fscache_release_page(struct page *page, gfp_t gfp)
+{
+ return 1; /* True: may release page */
+}
+static inline void nfs_fscache_invalidate_page(struct page *page,
+ struct inode *inode) {}
+
#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:38 UTC
Permalink
Add read context retention so that FS-Cache can call back into NFS when a read
operation on the cache fails EIO rather than reading data. This permits NFS to
then fetch the data from the server instead using the appropriate security
context.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache-index.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index a119b56..5b10064 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -292,6 +292,30 @@ static void nfs_fscache_inode_now_uncached(void *cookie_netfs_data)
}

/*
+ * Get an extra reference on a read context.
+ * - This function can be absent if the completion function doesn't require a
+ * context.
+ * - The read context is passed back to NFS in the event that a data read on the
+ * cache fails with EIO - in which case the server must be contacted to
+ * retrieve the data, which requires the read context for security.
+ */
+static void nfs_fh_get_context(void *cookie_netfs_data, void *context)
+{
+ get_nfs_open_context(context);
+}
+
+/*
+ * Release an extra reference on a read context.
+ * - This function can be absent if the completion function doesn't require a
+ * context.
+ */
+static void nfs_fh_put_context(void *cookie_netfs_data, void *context)
+{
+ if (context)
+ put_nfs_open_context(context);
+}
+
+/*
* Define the inode object for FS-Cache. This is used to describe an inode
* object to fscache_acquire_cookie(). It is keyed by the NFS file handle for
* an inode.
@@ -308,4 +332,6 @@ const struct fscache_cookie_def nfs_fscache_inode_object_def = {
.get_aux = nfs_fscache_inode_get_aux,
.check_aux = nfs_fscache_inode_check_aux,
.now_uncached = nfs_fscache_inode_now_uncached,
+ .get_context = nfs_fh_get_context,
+ .put_context = nfs_fh_put_context,
};

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:07:03 UTC
Permalink
Add NFS mount options to allow the local caching support to be enabled.

The attached patch makes it possible for the NFS filesystem to be told to make
use of the network filesystem local caching service (FS-Cache).

To be able to use this, a recent nfsutils package is required.

There are three variant NFS mount options that can be added to a mount command
to control caching for a mount. Only the last one specified takes effect:

(*) Adding "fsc" will request caching.

(*) Adding "fsc=<string>" will request caching and also specify a uniquifier.

(*) Adding "nofsc" will disable caching.

For example:

mount warthog:/ /a -o fsc


The cache of a particular superblock (NFS FSID) will be shared between all
mounts of that volume, provided they have the same connection parameters and
are not marked 'nosharecache'.

Where it is otherwise impossible to distinguish superblocks because all the
parameters are identical, but the 'nosharecache' option is supplied, a
uniquifying string must be supplied, else only the first mount will be
permitted to use the cache.

If there's a key collision, then the second mount will disable caching and give
a warning into the kernel log.


Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/client.c | 2 ++
fs/nfs/internal.h | 1 +
fs/nfs/super.c | 25 +++++++++++++++++++++++++
3 files changed, 28 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 726fe5a..75c9cd2 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -765,6 +765,7 @@ static int nfs_init_server(struct nfs_server *server,

/* Initialise the client representation from the mount data */
server->flags = data->flags;
+ server->options = data->options;

if (data->rsize)
server->rsize = nfs_block_size(data->rsize, NULL);
@@ -1153,6 +1154,7 @@ static int nfs4_init_server(struct nfs_server *server,
/* Initialise the client representation from the mount data */
server->flags = data->flags;
server->caps |= NFS_CAP_ATOMIC_OPEN;
+ server->options = data->options;

/* Get a client record */
error = nfs4_set_client(server,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 0130700..e4d6a83 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -39,6 +39,7 @@ struct nfs_parsed_mount_data {
int acregmin, acregmax,
acdirmin, acdirmax;
int namlen;
+ unsigned int options;
unsigned int bsize;
unsigned int auth_flavor_len;
rpc_authflavor_t auth_flavors[1];
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index b5fea77..82eaadb 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -77,6 +77,7 @@ enum {
Opt_rdirplus, Opt_nordirplus,
Opt_sharecache, Opt_nosharecache,
Opt_resvport, Opt_noresvport,
+ Opt_fscache, Opt_nofscache,

/* Mount options that take integer arguments */
Opt_port,
@@ -94,6 +95,7 @@ enum {
Opt_sec, Opt_proto, Opt_mountproto, Opt_mounthost,
Opt_addr, Opt_mountaddr, Opt_clientaddr,
Opt_lookupcache,
+ Opt_fscache_uniq,

/* Special mount options */
Opt_userspace, Opt_deprecated, Opt_sloppy,
@@ -133,6 +135,9 @@ static const match_table_t nfs_mount_option_tokens = {
{ Opt_nosharecache, "nosharecache" },
{ Opt_resvport, "resvport" },
{ Opt_noresvport, "noresvport" },
+ { Opt_fscache, "fsc" },
+ { Opt_fscache_uniq, "fsc=%s" },
+ { Opt_nofscache, "nofsc" },

{ Opt_port, "port=%u" },
{ Opt_rsize, "rsize=%u" },
@@ -564,6 +569,8 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
if (clp->rpc_ops->version == 4)
seq_printf(m, ",clientaddr=%s", clp->cl_ipaddr);
#endif
+ if (nfss->options & NFS_OPTION_FSCACHE)
+ seq_printf(m, ",fsc");
}

/*
@@ -1056,6 +1063,24 @@ static int nfs_parse_mount_options(char *raw,
case Opt_noresvport:
mnt->flags |= NFS_MOUNT_NORESVPORT;
break;
+ case Opt_fscache:
+ mnt->options |= NFS_OPTION_FSCACHE;
+ kfree(mnt->fscache_uniq);
+ mnt->fscache_uniq = NULL;
+ break;
+ case Opt_nofscache:
+ mnt->options &= ~NFS_OPTION_FSCACHE;
+ kfree(mnt->fscache_uniq);
+ mnt->fscache_uniq = NULL;
+ break;
+ case Opt_fscache_uniq:
+ string = match_strdup(args);
+ if (!string)
+ goto out_nomem;
+ kfree(mnt->fscache_uniq);
+ mnt->fscache_uniq = string;
+ mnt->options |= NFS_OPTION_FSCACHE;
+ break;

/*
* options that take numeric values

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:43 UTC
Permalink
nfs_readpage_async() needs to be non-static so that it can be used as a
fallback for the local on-disk caching should an EIO crop up when reading the
cache.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/read.c | 4 ++--
include/linux/nfs_fs.h | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index f856004..98b7400 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -111,8 +111,8 @@ static void nfs_readpage_truncate_uninitialised_page(struct nfs_read_data *data)
}
}

-static int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
- struct page *page)
+int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,
+ struct page *page)
{
LIST_HEAD(one_request);
struct nfs_page *new;
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 8a99e79..fdffb41 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -516,6 +516,8 @@ extern int nfs_readpages(struct file *, struct address_space *,
struct list_head *, unsigned);
extern int nfs_readpage_result(struct rpc_task *, struct nfs_read_data *);
extern void nfs_readdata_release(void *data);
+extern int nfs_readpage_async(struct nfs_open_context *, struct inode *,
+ struct page *);

/*
* Allocate nfs_read_data structures

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:58 UTC
Permalink
Display the local caching state in /proc/fs/nfsfs/volumes.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/client.c | 7 ++++---
fs/nfs/fscache.h | 15 +++++++++++++++
2 files changed, 19 insertions(+), 3 deletions(-)


diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index aa04da8..726fe5a 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1564,7 +1564,7 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)

/* display header on line 1 */
if (v == &nfs_volume_list) {
- seq_puts(m, "NV SERVER PORT DEV FSID\n");
+ seq_puts(m, "NV SERVER PORT DEV FSID FSC\n");
return 0;
}
/* display one transport per line on subsequent lines */
@@ -1578,12 +1578,13 @@ static int nfs_volume_list_show(struct seq_file *m, void *v)
(unsigned long long) server->fsid.major,
(unsigned long long) server->fsid.minor);

- seq_printf(m, "v%u %s %s %-7s %-17s\n",
+ seq_printf(m, "v%u %s %s %-7s %-17s %s\n",
clp->rpc_ops->version,
rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_HEX_ADDR),
rpc_peeraddr2str(clp->cl_rpcclient, RPC_DISPLAY_HEX_PORT),
dev,
- fsid);
+ fsid,
+ nfs_server_fscache_state(server));

return 0;
}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 7eeaab4..2d43b67 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -143,6 +143,16 @@ static inline void nfs_readpage_to_fscache(struct inode *inode,
__nfs_readpage_to_fscache(inode, page, sync);
}

+/*
+ * indicate the client caching state as readable text
+ */
+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+ if (server->fscache && (server->options & NFS_OPTION_FSCACHE))
+ return "yes";
+ return "no ";
+}
+

#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
@@ -189,5 +199,10 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
static inline void nfs_readpage_to_fscache(struct inode *inode,
struct page *page, int sync) {}

+static inline const char *nfs_server_fscache_state(struct nfs_server *server)
+{
+ return "no ";
+}
+
#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:48 UTC
Permalink
Read pages from an FS-Cache data storage object representing an inode into an
NFS inode.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/fscache.h | 47 +++++++++++++++++++++++
fs/nfs/read.c | 18 +++++++++
3 files changed, 177 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index de990dd..fcabfdc 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -379,3 +379,115 @@ void __nfs_fscache_invalidate_page(struct page *page, struct inode *inode)
nfs_add_fscache_stats(page->mapping->host,
NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
}
+
+/*
+ * Handle completion of a page being read from the cache.
+ * - Called in process (keventd) context.
+ */
+static void nfs_readpage_from_fscache_complete(struct page *page,
+ void *context,
+ int error)
+{
+ dfprintk(FSCACHE,
+ "NFS: readpage_from_fscache_complete (0x%p/0x%p/%d)\n",
+ page, context, error);
+
+ /* if the read completes with an error, we just unlock the page and let
+ * the VM reissue the readpage */
+ if (!error) {
+ SetPageUptodate(page);
+ unlock_page(page);
+ } else {
+ error = nfs_readpage_async(context, page->mapping->host, page);
+ if (error)
+ unlock_page(page);
+ }
+}
+
+/*
+ * Retrieve a page from fscache
+ */
+int __nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode, struct page *page)
+{
+ int ret;
+
+ dfprintk(FSCACHE,
+ "NFS: readpage_from_fscache(fsc:%p/p:%p(i:%lx f:%lx)/0x%p)\n",
+ NFS_I(inode)->fscache, page, page->index, page->flags, inode);
+
+ ret = fscache_read_or_alloc_page(NFS_I(inode)->fscache,
+ page,
+ nfs_readpage_from_fscache_complete,
+ ctx,
+ GFP_KERNEL);
+
+ switch (ret) {
+ case 0: /* read BIO submitted (page in fscache) */
+ dfprintk(FSCACHE,
+ "NFS: readpage_from_fscache: BIO submitted\n");
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_READ_OK, 1);
+ return ret;
+
+ case -ENOBUFS: /* inode not in cache */
+ case -ENODATA: /* page not in cache */
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_READ_FAIL, 1);
+ dfprintk(FSCACHE,
+ "NFS: readpage_from_fscache %d\n", ret);
+ return 1;
+
+ default:
+ dfprintk(FSCACHE, "NFS: readpage_from_fscache %d\n", ret);
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_READ_FAIL, 1);
+ }
+ return ret;
+}
+
+/*
+ * Retrieve a set of pages from fscache
+ */
+int __nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode,
+ struct address_space *mapping,
+ struct list_head *pages,
+ unsigned *nr_pages)
+{
+ int ret, npages = *nr_pages;
+
+ dfprintk(FSCACHE, "NFS: nfs_getpages_from_fscache (0x%p/%u/0x%p)\n",
+ NFS_I(inode)->fscache, npages, inode);
+
+ ret = fscache_read_or_alloc_pages(NFS_I(inode)->fscache,
+ mapping, pages, nr_pages,
+ nfs_readpage_from_fscache_complete,
+ ctx,
+ mapping_gfp_mask(mapping));
+ if (*nr_pages < npages)
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_READ_OK,
+ npages);
+ if (*nr_pages > 0)
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_READ_FAIL,
+ *nr_pages);
+
+ switch (ret) {
+ case 0: /* read submitted to the cache for all pages */
+ BUG_ON(!list_empty(pages));
+ BUG_ON(*nr_pages != 0);
+ dfprintk(FSCACHE,
+ "NFS: nfs_getpages_from_fscache: submitted\n");
+
+ return ret;
+
+ case -ENOBUFS: /* some pages aren't cached and can't be */
+ case -ENODATA: /* some pages aren't cached */
+ dfprintk(FSCACHE,
+ "NFS: nfs_getpages_from_fscache: no page: %d\n", ret);
+ return 1;
+
+ default:
+ dfprintk(FSCACHE,
+ "NFS: nfs_getpages_from_fscache: ret %d\n", ret);
+ }
+
+ return ret;
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index 9cc63af..cbc010e 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -86,6 +86,12 @@ extern void nfs_fscache_reset_inode_cookie(struct inode *);
extern void __nfs_fscache_invalidate_page(struct page *, struct inode *);
extern int nfs_fscache_release_page(struct page *, gfp_t);

+extern int __nfs_readpage_from_fscache(struct nfs_open_context *,
+ struct inode *, struct page *);
+extern int __nfs_readpages_from_fscache(struct nfs_open_context *,
+ struct inode *, struct address_space *,
+ struct list_head *, unsigned *);
+
/*
* release the caching state associated with a page if undergoing complete page
* invalidation
@@ -97,6 +103,32 @@ static inline void nfs_fscache_invalidate_page(struct page *page,
__nfs_fscache_invalidate_page(page, inode);
}

+/*
+ * Retrieve a page from an inode data storage object.
+ */
+static inline int nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode,
+ struct page *page)
+{
+ if (NFS_I(inode)->fscache)
+ return __nfs_readpage_from_fscache(ctx, inode, page);
+ return -ENOBUFS;
+}
+
+/*
+ * Retrieve a set of pages from an inode data storage object.
+ */
+static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode,
+ struct address_space *mapping,
+ struct list_head *pages,
+ unsigned *nr_pages)
+{
+ if (NFS_I(inode)->fscache)
+ return __nfs_readpages_from_fscache(ctx, inode, mapping, pages,
+ nr_pages);
+ return -ENOBUFS;
+}

#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
@@ -126,5 +158,20 @@ static inline int nfs_fscache_release_page(struct page *page, gfp_t gfp)
static inline void nfs_fscache_invalidate_page(struct page *page,
struct inode *inode) {}

+static inline int nfs_readpage_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode,
+ struct page *page)
+{
+ return -ENOBUFS;
+}
+static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
+ struct inode *inode,
+ struct address_space *mapping,
+ struct list_head *pages,
+ unsigned *nr_pages)
+{
+ return -ENOBUFS;
+}
+
#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index 98b7400..e18ba79 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -24,6 +24,7 @@

#include "internal.h"
#include "iostat.h"
+#include "fscache.h"

#define NFSDBG_FACILITY NFSDBG_PAGECACHE

@@ -510,8 +511,15 @@ int nfs_readpage(struct file *file, struct page *page)
} else
ctx = get_nfs_open_context(nfs_file_open_context(file));

+ if (!IS_SYNC(inode)) {
+ error = nfs_readpage_from_fscache(ctx, inode, page);
+ if (error == 0)
+ goto out;
+ }
+
error = nfs_readpage_async(ctx, inode, page);

+out:
put_nfs_open_context(ctx);
return error;
out_unlock:
@@ -584,6 +592,15 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
return -EBADF;
} else
desc.ctx = get_nfs_open_context(nfs_file_open_context(filp));
+
+ /* attempt to read as many of the pages as possible from the cache
+ * - this returns -ENOBUFS immediately if the cookie is negative
+ */
+ ret = nfs_readpages_from_fscache(desc.ctx, inode, mapping,
+ pages, &nr_pages);
+ if (ret == 0)
+ goto read_complete; /* all pages were read */
+
if (rsize < PAGE_CACHE_SIZE)
nfs_pageio_init(&pgio, inode, nfs_pagein_multi, rsize, 0);
else
@@ -594,6 +611,7 @@ int nfs_readpages(struct file *filp, struct address_space *mapping,
nfs_pageio_complete(&pgio);
npages = (pgio.pg_bytes_written + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
nfs_add_stats(inode, NFSIOS_READPAGES, npages);
+read_complete:
put_nfs_open_context(desc.ctx);
out:
return ret;

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:06:53 UTC
Permalink
Store pages from an NFS inode into the cache data storage object associated
with that inode.

Signed-off-by: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

fs/nfs/fscache.c | 28 ++++++++++++++++++++++++++++
fs/nfs/fscache.h | 16 ++++++++++++++++
fs/nfs/read.c | 5 +++++
3 files changed, 49 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index fcabfdc..968cf5d 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -491,3 +491,31 @@ int __nfs_readpages_from_fscache(struct nfs_open_context *ctx,

return ret;
}
+
+/*
+ * Store a newly fetched page in fscache
+ * - PG_fscache must be set on the page
+ */
+void __nfs_readpage_to_fscache(struct inode *inode, struct page *page, int sync)
+{
+ int ret;
+
+ dfprintk(FSCACHE,
+ "NFS: readpage_to_fscache(fsc:%p/p:%p(i:%lx f:%lx)/%d)\n",
+ NFS_I(inode)->fscache, page, page->index, page->flags, sync);
+
+ ret = fscache_write_page(NFS_I(inode)->fscache, page, GFP_KERNEL);
+ dfprintk(FSCACHE,
+ "NFS: readpage_to_fscache: p:%p(i:%lu f:%lx) ret %d\n",
+ page, page->index, page->flags, ret);
+
+ if (ret != 0) {
+ fscache_uncache_page(NFS_I(inode)->fscache, page);
+ nfs_add_fscache_stats(inode,
+ NFSIOS_FSCACHE_PAGES_WRITTEN_FAIL, 1);
+ nfs_add_fscache_stats(inode, NFSIOS_FSCACHE_PAGES_UNCACHED, 1);
+ } else {
+ nfs_add_fscache_stats(inode,
+ NFSIOS_FSCACHE_PAGES_WRITTEN_OK, 1);
+ }
+}
diff --git a/fs/nfs/fscache.h b/fs/nfs/fscache.h
index cbc010e..7eeaab4 100644
--- a/fs/nfs/fscache.h
+++ b/fs/nfs/fscache.h
@@ -91,6 +91,7 @@ extern int __nfs_readpage_from_fscache(struct nfs_open_context *,
extern int __nfs_readpages_from_fscache(struct nfs_open_context *,
struct inode *, struct address_space *,
struct list_head *, unsigned *);
+extern void __nfs_readpage_to_fscache(struct inode *, struct page *, int);

/*
* release the caching state associated with a page if undergoing complete page
@@ -130,6 +131,19 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
return -ENOBUFS;
}

+/*
+ * Store a page newly fetched from the server in an inode data storage object
+ * in the cache.
+ */
+static inline void nfs_readpage_to_fscache(struct inode *inode,
+ struct page *page,
+ int sync)
+{
+ if (PageFsCache(page))
+ __nfs_readpage_to_fscache(inode, page, sync);
+}
+
+
#else /* CONFIG_NFS_FSCACHE */
static inline int nfs_fscache_register(void) { return 0; }
static inline void nfs_fscache_unregister(void) {}
@@ -172,6 +186,8 @@ static inline int nfs_readpages_from_fscache(struct nfs_open_context *ctx,
{
return -ENOBUFS;
}
+static inline void nfs_readpage_to_fscache(struct inode *inode,
+ struct page *page, int sync) {}

#endif /* CONFIG_NFS_FSCACHE */
#endif /* _NFS_FSCACHE_H */
diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index e18ba79..4ace3c5 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -140,6 +140,11 @@ int nfs_readpage_async(struct nfs_open_context *ctx, struct inode *inode,

static void nfs_readpage_release(struct nfs_page *req)
{
+ struct inode *d_inode = req->wb_context->path.dentry->d_inode;
+
+ if (PageUptodate(req->wb_page))
+ nfs_readpage_to_fscache(d_inode, req->wb_page, 0);
+
unlock_page(req->wb_page);

dprintk("NFS: read done (%s/%Ld %d@%Ld)\n",

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Howells
2009-04-01 23:05:30 UTC
Permalink
Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.


CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling. This is called cachefilesd and lives in
/sbin. The source for the daemon can be downloaded from:

http://people.redhat.com/~dhowells/cachefs/cachefilesd.c

And an example configuration from:

http://people.redhat.com/~dhowells/cachefs/cachefilesd.conf

The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services. Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.

CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communication with the daemon. Only one thing may have this open at once,
and whilst it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.

CacheFiles is currently limited to a single cache.

CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section. This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.


============
REQUIREMENTS
============

The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:

- dnotify.

- extended attributes (xattrs).

- openat() and friends.

- bmap() support on files in the filesystem (FIBMAP ioctl).

- The use of bmap() to detect a partial page at the end of the file.

It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.


=============
CONFIGURATION
=============

The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up cache ready for use. The following script commands are available:

(*) brun <N>%
(*) bcull <N>%
(*) bstop <N>%
(*) frun <N>%
(*) fcull <N>%
(*) fstop <N>%

Configure the culling limits. Optional. See the section on culling
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.

The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits.

(*) dir <path>

Specify the directory containing the root of the cache. Mandatory.

(*) tag <name>

Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles".

(*) debug <mask>

Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information:

1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug())

This mask can also be set through sysfs, eg:

echo 5 >/sys/modules/cachefiles/parameters/debug


==================
STARTING THE CACHE
==================

The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live.

The daemon is run as follows:

/sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]

The flags are:

(*) -d

Increase the debugging level. This can be specified multiple times and
is cumulative with itself.

(*) -s

Send messages to stderr instead of syslog.

(*) -n

Don't daemonise and go into background.

(*) -f <configfile>

Use an alternative configuration file rather than the default one.


===============
THINGS TO AVOID
===============

Do not mount other things within the cache as this will cause problems. The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.

Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.

Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).

Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.

Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.

Do not chmod files in the cache. The module creates things with minimal
permissions to prevent random users being able to access them directly.


=============
CACHE CULLING
=============

The cache may need culling occasionally to make space. This involves
discarding objects from the cache that have been used less recently than
anything else. Culling is based on the access time of data objects. Empty
directories are culled if not in use.

Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem. There are six
"limits":

(*) brun
(*) frun

If the amount of free space and the number of available files in the cache
rises above both these limits, then culling is turned off.

(*) bcull
(*) fcull

If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started.

(*) bstop
(*) fstop

If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above
these limits again.

These must be configured thusly:

0 <= bstop < bcull < brun < 100
0 <= fstop < fcull < frun < 100

Note that these are percentages of available space and available files, and do
_not_ appear as 100 minus the percentage displayed by the "df" program.

The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order. A new scan of the cache is
started as soon as space is made in the table. Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.


===============
CACHE STRUCTURE
===============

The CacheFiles module will create two directories in the directory it was
given:

(*) cache/

(*) graveyard/

The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard from which the daemon will actually delete them.

The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.


The module represents index objects as directories with the filename "I..." or
"J...". Note that the "cache/" directory is itself a special index.

Data objects are represented as files if they have no children, or directories
if they do. Their filenames all begin "D..." or "E...". If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.

Special objects are similar to data objects, except their filenames begin
"S..." or "T...".


If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child
objects:

INDEX INDEX INDEX DATA FILES
========= ========== ================================= ================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry


If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have
'+' prepended:

J1223/@23/+xy...z/+kl...m/Epqr


Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.

To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding:

OBJECT TYPE PRINTABLE ENCODED
=============== =============== ===============
Index "I..." "J..."
Data "D..." "E..."
Special "S..." "T..."

Intermediate directories are always "@" or "+" as appropriate.


Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs. The latter is used to detect stale objects in the cache and update
or retire them.


Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).


==========================
SECURITY MODEL AND SELINUX
==========================

CacheFiles is implemented to deal properly with the LSM security features of
the Linux kernel and the SELinux facility.

One of the problems that CacheFiles faces is that it is generally acting on
behalf of a process, and running in that process's context, and that includes a
security context that is not appropriate for accessing the cache - either
because the files in the cache are inaccessible to that process, or because if
the process creates a file in the cache, that file may be inaccessible to other
processes.

The way CacheFiles works is to temporarily change the security context (fsuid,
fsgid and actor security label) that the process acts as - without changing the
security context of the process when it the target of an operation performed by
some other process (so signalling and suchlike still work correctly).


When the CacheFiles module is asked to bind to its cache, it:

(1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default,
this is:

cachefiles_var_t

(2) Finds the security label of the process which issued the bind request
(presumed to be the cachefilesd daemon), which by default will be:

cachefilesd_t

and asks LSM to supply a security ID as which it should act given the
daemon's label. By default, this will be:

cachefiles_kernel_t

SELinux transitions the daemon's security ID to the module's security ID
based on a rule of this form in the policy.

type_transition <daemon's-ID> kernel_t : process <module's-ID>;

For instance:

type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;


The module's security ID gives it permission to create, move and remove files
and directories in the cache, to find and access directories and files in the
cache, to set and access extended attributes on cache objects, and to read and
write files in the cache.

The daemon's security ID gives it only a very restricted set of permissions: it
may scan directories, stat files and erase files and directories. It may
not read or write files in the cache, and so it is precluded from accessing the
data cached therein; nor is it permitted to create new files in the cache.


There are policy source files available in:

http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2

and later versions. In that tarball, see the files:

cachefilesd.te
cachefilesd.fc
cachefilesd.if

They are built and installed directly by the RPM.

If a non-RPM based system is being used, then copy the above files to their own
directory and run:

make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp

You will need checkpolicy and selinux-policy-devel installed prior to the
build.


By default, the cache is located in /var/fscache, but if it is desirable that
it should be elsewhere, than either the above policy files must be altered, or
an auxiliary policy must be installed to label the alternate location of the
cache.

For instructions on how to add an auxiliary policy to enable the cache to be
located elsewhere when SELinux is in enforcing mode, please see:

/usr/share/doc/cachefilesd-*/move-cache.txt

When the cachefilesd rpm is installed; alternatively, the document can be found
in the sources.


==================
A NOTE ON SECURITY
==================

CacheFiles makes use of the split security in the task_struct. It allocates
its own task_security structure, and redirects current->act_as to point to it
when it acts on behalf of another process, in that process's context.

The reason it does this is that it calls vfs_mkdir() and suchlike rather than
bypassing security and calling inode ops directly. Therefore the VFS and LSM
may deny the CacheFiles access to the cache data because under some
circumstances the caching code is running in the security context of whatever
process issued the original syscall on the netfs.

Furthermore, should CacheFiles create a file or directory, the security
parameters with that object is created (UID, GID, security label) would be
derived from that process that issued the system call, thus potentially
preventing other processes from accessing the cache - including CacheFiles's
cache management daemon (cachefilesd).

What is required is to temporarily override the security of the process that
issued the system call. We can't, however, just do an in-place change of the
security data as that affects the process as an object, not just as a subject.
This means it may lose signals or ptrace events for example, and affects what
the process looks like in /proc.

So CacheFiles makes use of a logical split in the security between the
objective security (task->sec) and the subjective security (task->act_as). The
objective security holds the intrinsic security properties of a process and is
never overridden. This is what appears in /proc, and is what is used when a
process is the target of an operation by some other process (SIGKILL for
example).

The subjective security holds the active security properties of a process, and
may be overridden. This is not seen externally, and is used whan a process
acts upon another object, for example SIGKILLing another process or opening a
file.

LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label.


This documentation is added by the patch to:

Documentation/filesystems/caching/cachefiles.txt

Signed-Off-By: David Howells <***@redhat.com>
Acked-by: Steve Dickson <***@redhat.com>
Acked-by: Trond Myklebust <***@netapp.com>
---

Documentation/filesystems/caching/cachefiles.txt | 501 +++++++++++++
fs/Kconfig | 1
fs/Makefile | 1
fs/cachefiles/Kconfig | 39 +
fs/cachefiles/Makefile | 18
fs/cachefiles/bind.c | 286 +++++++
fs/cachefiles/daemon.c | 754 +++++++++++++++++++
fs/cachefiles/interface.c | 449 ++++++++++++
fs/cachefiles/internal.h | 360 +++++++++
fs/cachefiles/key.c | 159 ++++
fs/cachefiles/main.c | 106 +++
fs/cachefiles/namei.c | 772 ++++++++++++++++++++
fs/cachefiles/proc.c | 134 +++
fs/cachefiles/rdwr.c | 853 ++++++++++++++++++++++
fs/cachefiles/security.c | 116 +++
fs/cachefiles/xattr.c | 291 ++++++++
16 files changed, 4840 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/caching/cachefiles.txt
create mode 100644 fs/cachefiles/Kconfig
create mode 100644 fs/cachefiles/Makefile
create mode 100644 fs/cachefiles/bind.c
create mode 100644 fs/cachefiles/daemon.c
create mode 100644 fs/cachefiles/interface.c
create mode 100644 fs/cachefiles/internal.h
create mode 100644 fs/cachefiles/key.c
create mode 100644 fs/cachefiles/main.c
create mode 100644 fs/cachefiles/namei.c
create mode 100644 fs/cachefiles/proc.c
create mode 100644 fs/cachefiles/rdwr.c
create mode 100644 fs/cachefiles/security.c
create mode 100644 fs/cachefiles/xattr.c


diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt
new file mode 100644
index 0000000..c78a49b
--- /dev/null
+++ b/Documentation/filesystems/caching/cachefiles.txt
@@ -0,0 +1,501 @@
+ ===============================================
+ CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
+ ===============================================
+
+Contents:
+
+ (*) Overview.
+
+ (*) Requirements.
+
+ (*) Configuration.
+
+ (*) Starting the cache.
+
+ (*) Things to avoid.
+
+ (*) Cache culling.
+
+ (*) Cache structure.
+
+ (*) Security model and SELinux.
+
+ (*) A note on security.
+
+ (*) Statistical information.
+
+ (*) Debugging.
+
+
+========
+OVERVIEW
+========
+
+CacheFiles is a caching backend that's meant to use as a cache a directory on
+an already mounted filesystem of a local type (such as Ext3).
+
+CacheFiles uses a userspace daemon to do some of the cache management - such as
+reaping stale nodes and culling. This is called cachefilesd and lives in
+/sbin.
+
+The filesystem and data integrity of the cache are only as good as those of the
+filesystem providing the backing services. Note that CacheFiles does not
+attempt to journal anything since the journalling interfaces of the various
+filesystems are very specific in nature.
+
+CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
+to communication with the daemon. Only one thing may have this open at once,
+and whilst it is open, a cache is at least partially in existence. The daemon
+opens this and sends commands down it to control the cache.
+
+CacheFiles is currently limited to a single cache.
+
+CacheFiles attempts to maintain at least a certain percentage of free space on
+the filesystem, shrinking the cache by culling the objects it contains to make
+space if necessary - see the "Cache Culling" section. This means it can be
+placed on the same medium as a live set of data, and will expand to make use of
+spare space and automatically contract when the set of data requires more
+space.
+
+
+============
+REQUIREMENTS
+============
+
+The use of CacheFiles and its daemon requires the following features to be
+available in the system and in the cache filesystem:
+
+ - dnotify.
+
+ - extended attributes (xattrs).
+
+ - openat() and friends.
+
+ - bmap() support on files in the filesystem (FIBMAP ioctl).
+
+ - The use of bmap() to detect a partial page at the end of the file.
+
+It is strongly recommended that the "dir_index" option is enabled on Ext3
+filesystems being used as a cache.
+
+
+=============
+CONFIGURATION
+=============
+
+The cache is configured by a script in /etc/cachefilesd.conf. These commands
+set up cache ready for use. The following script commands are available:
+
+ (*) brun <N>%
+ (*) bcull <N>%
+ (*) bstop <N>%
+ (*) frun <N>%
+ (*) fcull <N>%
+ (*) fstop <N>%
+
+ Configure the culling limits. Optional. See the section on culling
+ The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
+
+ The commands beginning with a 'b' are file space (block) limits, those
+ beginning with an 'f' are file count limits.
+
+ (*) dir <path>
+
+ Specify the directory containing the root of the cache. Mandatory.
+
+ (*) tag <name>
+
+ Specify a tag to FS-Cache to use in distinguishing multiple caches.
+ Optional. The default is "CacheFiles".
+
+ (*) debug <mask>
+
+ Specify a numeric bitmask to control debugging in the kernel module.
+ Optional. The default is zero (all off). The following values can be
+ OR'd into the mask to collect various information:
+
+ 1 Turn on trace of function entry (_enter() macros)
+ 2 Turn on trace of function exit (_leave() macros)
+ 4 Turn on trace of internal debug points (_debug())
+
+ This mask can also be set through sysfs, eg:
+
+ echo 5 >/sys/modules/cachefiles/parameters/debug
+
+
+==================
+STARTING THE CACHE
+==================
+
+The cache is started by running the daemon. The daemon opens the cache device,
+configures the cache and tells it to begin caching. At that point the cache
+binds to fscache and the cache becomes live.
+
+The daemon is run as follows:
+
+ /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
+
+The flags are:
+
+ (*) -d
+
+ Increase the debugging level. This can be specified multiple times and
+ is cumulative with itself.
+
+ (*) -s
+
+ Send messages to stderr instead of syslog.
+
+ (*) -n
+
+ Don't daemonise and go into background.
+
+ (*) -f <configfile>
+
+ Use an alternative configuration file rather than the default one.
+
+
+===============
+THINGS TO AVOID
+===============
+
+Do not mount other things within the cache as this will cause problems. The
+kernel module contains its own very cut-down path walking facility that ignores
+mountpoints, but the daemon can't avoid them.
+
+Do not create, rename or unlink files and directories in the cache whilst the
+cache is active, as this may cause the state to become uncertain.
+
+Renaming files in the cache might make objects appear to be other objects (the
+filename is part of the lookup key).
+
+Do not change or remove the extended attributes attached to cache files by the
+cache as this will cause the cache state management to get confused.
+
+Do not create files or directories in the cache, lest the cache get confused or
+serve incorrect data.
+
+Do not chmod files in the cache. The module creates things with minimal
+permissions to prevent random users being able to access them directly.
+
+
+=============
+CACHE CULLING
+=============
+
+The cache may need culling occasionally to make space. This involves
+discarding objects from the cache that have been used less recently than
+anything else. Culling is based on the access time of data objects. Empty
+directories are culled if not in use.
+
+Cache culling is done on the basis of the percentage of blocks and the
+percentage of files available in the underlying filesystem. There are six
+"limits":
+
+ (*) brun
+ (*) frun
+
+ If the amount of free space and the number of available files in the cache
+ rises above both these limits, then culling is turned off.
+
+ (*) bcull
+ (*) fcull
+
+ If the amount of available space or the number of available files in the
+ cache falls below either of these limits, then culling is started.
+
+ (*) bstop
+ (*) fstop
+
+ If the amount of available space or the number of available files in the
+ cache falls below either of these limits, then no further allocation of
+ disk space or files is permitted until culling has raised things above
+ these limits again.
+
+These must be configured thusly:
+
+ 0 <= bstop < bcull < brun < 100
+ 0 <= fstop < fcull < frun < 100
+
+Note that these are percentages of available space and available files, and do
+_not_ appear as 100 minus the percentage displayed by the "df" program.
+
+The userspace daemon scans the cache to build up a table of cullable objects.
+These are then culled in least recently used order. A new scan of the cache is
+started as soon as space is made in the table. Objects will be skipped if
+their atimes have changed or if the kernel module says it is still using them.
+
+
+===============
+CACHE STRUCTURE
+===============
+
+The CacheFiles module will create two directories in the directory it was
+given:
+
+ (*) cache/
+
+ (*) graveyard/
+
+The active cache objects all reside in the first directory. The CacheFiles
+kernel module moves any retired or culled objects that it can't simply unlink
+to the graveyard from which the daemon will actually delete them.
+
+The daemon uses dnotify to monitor the graveyard directory, and will delete
+anything that appears therein.
+
+
+The module represents index objects as directories with the filename "I..." or
+"J...". Note that the "cache/" directory is itself a special index.
+
+Data objects are represented as files if they have no children, or directories
+if they do. Their filenames all begin "D..." or "E...". If represented as a
+directory, data objects will have a file in the directory called "data" that
+actually holds the data.
+
+Special objects are similar to data objects, except their filenames begin
+"S..." or "T...".
+
+
+If an object has children, then it will be represented as a directory.
+Immediately in the representative directory are a collection of directories
+named for hash values of the child object keys with an '@' prepended. Into
+this directory, if possible, will be placed the representations of the child
+objects:
+
+ INDEX INDEX INDEX DATA FILES
+ ========= ========== ================================= ================
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
+ cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
+
+
+If the key is so long that it exceeds NAME_MAX with the decorations added on to
+it, then it will be cut into pieces, the first few of which will be used to
+make a nest of directories, and the last one of which will be the objects
+inside the last directory. The names of the intermediate directories will have
+'+' prepended:
+
+ J1223/@23/+xy...z/+kl...m/Epqr
+
+
+Note that keys are raw data, and not only may they exceed NAME_MAX in size,
+they may also contain things like '/' and NUL characters, and so they may not
+be suitable for turning directly into a filename.
+
+To handle this, CacheFiles will use a suitably printable filename directly and
+"base-64" encode ones that aren't directly suitable. The two versions of
+object filenames indicate the encoding:
+
+ OBJECT TYPE PRINTABLE ENCODED
+ =============== =============== ===============
+ Index "I..." "J..."
+ Data "D..." "E..."
+ Special "S..." "T..."
+
+Intermediate directories are always "@" or "+" as appropriate.
+
+
+Each object in the cache has an extended attribute label that holds the object
+type ID (required to distinguish special objects) and the auxiliary data from
+the netfs. The latter is used to detect stale objects in the cache and update
+or retire them.
+
+
+Note that CacheFiles will erase from the cache any file it doesn't recognise or
+any file of an incorrect type (such as a FIFO file or a device file).
+
+
+==========================
+SECURITY MODEL AND SELINUX
+==========================
+
+CacheFiles is implemented to deal properly with the LSM security features of
+the Linux kernel and the SELinux facility.
+
+One of the problems that CacheFiles faces is that it is generally acting on
+behalf of a process, and running in that process's context, and that includes a
+security context that is not appropriate for accessing the cache - either
+because the files in the cache are inaccessible to that process, or because if
+the process creates a file in the cache, that file may be inaccessible to other
+processes.
+
+The way CacheFiles works is to temporarily change the security context (fsuid,
+fsgid and actor security label) that the process acts as - without changing the
+security context of the process when it the target of an operation performed by
+some other process (so signalling and suchlike still work correctly).
+
+
+When the CacheFiles module is asked to bind to its cache, it:
+
+ (1) Finds the security label attached to the root cache directory and uses
+ that as the security label with which it will create files. By default,
+ this is:
+
+ cachefiles_var_t
+
+ (2) Finds the security label of the process which issued the bind request
+ (presumed to be the cachefilesd daemon), which by default will be:
+
+ cachefilesd_t
+
+ and asks LSM to supply a security ID as which it should act given the
+ daemon's label. By default, this will be:
+
+ cachefiles_kernel_t
+
+ SELinux transitions the daemon's security ID to the module's security ID
+ based on a rule of this form in the policy.
+
+ type_transition <daemon's-ID> kernel_t : process <module's-ID>;
+
+ For instance:
+
+ type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
+
+
+The module's security ID gives it permission to create, move and remove files
+and directories in the cache, to find and access directories and files in the
+cache, to set and access extended attributes on cache objects, and to read and
+write files in the cache.
+
+The daemon's security ID gives it only a very restricted set of permissions: it
+may scan directories, stat files and erase files and directories. It may
+not read or write files in the cache, and so it is precluded from accessing the
+data cached therein; nor is it permitted to create new files in the cache.
+
+
+There are policy source files available in:
+
+ http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
+
+and later versions. In that tarball, see the files:
+
+ cachefilesd.te
+ cachefilesd.fc
+ cachefilesd.if
+
+They are built and installed directly by the RPM.
+
+If a non-RPM based system is being used, then copy the above files to their own
+directory and run:
+
+ make -f /usr/share/selinux/devel/Makefile
+ semodule -i cachefilesd.pp
+
+You will need checkpolicy and selinux-policy-devel installed prior to the
+build.
+
+
+By default, the cache is located in /var/fscache, but if it is desirable that
+it should be elsewhere, than either the above policy files must be altered, or
+an auxiliary policy must be installed to label the alternate location of the
+cache.
+
+For instructions on how to add an auxiliary policy to enable the cache to be
+located elsewhere when SELinux is in enforcing mode, please see:
+
+ /usr/share/doc/cachefilesd-*/move-cache.txt
+
+When the cachefilesd rpm is installed; alternatively, the document can be found
+in the sources.
+
+
+==================
+A NOTE ON SECURITY
+==================
+
+CacheFiles makes use of the split security in the task_struct. It allocates
+its own task_security structure, and redirects current->act_as to point to it
+when it acts on behalf of another process, in that process's context.
+
+The reason it does this is that it calls vfs_mkdir() and suchlike rather than
+bypassing security and calling inode ops directly. Therefore the VFS and LSM
+may deny the CacheFiles access to the cache data because under some
+circumstances the caching code is running in the security context of whatever
+process issued the original syscall on the netfs.
+
+Furthermore, should CacheFiles create a file or directory, the security
+parameters with that object is created (UID, GID, security label) would be
+derived from that process that issued the system call, thus potentially
+preventing other processes from accessing the cache - including CacheFiles's
+cache management daemon (cachefilesd).
+
+What is required is to temporarily override the security of the process that
+issued the system call. We can't, however, just do an in-place change of the
+security data as that affects the process as an object, not just as a subject.
+This means it may lose signals or ptrace events for example, and affects what
+the process looks like in /proc.
+
+So CacheFiles makes use of a logical split in the security between the
+objective security (task->sec) and the subjective security (task->act_as). The
+objective security holds the intrinsic security properties of a process and is
+never overridden. This is what appears in /proc, and is what is used when a
+process is the target of an operation by some other process (SIGKILL for
+example).
+
+The subjective security holds the active security properties of a process, and
+may be overridden. This is not seen externally, and is used whan a process
+acts upon another object, for example SIGKILLing another process or opening a
+file.
+
+LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
+for CacheFiles to run in a context of a specific security label, or to create
+files and directories with another security label.
+
+
+=======================
+STATISTICAL INFORMATION
+=======================
+
+If FS-Cache is compiled with the following option enabled:
+
+ CONFIG_CACHEFILES_HISTOGRAM=y
+
+then it will gather certain statistics and display them through a proc file.
+
+ (*) /proc/fs/cachefiles/histogram
+
+ cat /proc/fs/cachefiles/histogram
+ JIFS SECS LOOKUPS MKDIRS CREATES
+ ===== ===== ========= ========= =========
+
+ This shows the breakdown of the number of times each amount of time
+ between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
+ columns are as follows:
+
+ COLUMN TIME MEASUREMENT
+ ======= =======================================================
+ LOOKUPS Length of time to perform a lookup on the backing fs
+ MKDIRS Length of time to perform a mkdir on the backing fs
+ CREATES Length of time to perform a create on the backing fs
+
+ Each row shows the number of events that took a particular range of times.
+ Each step is 1 jiffy in size. The JIFS column indicates the particular
+ jiffy range covered, and the SECS field the equivalent number of seconds.
+
+
+=========
+DEBUGGING
+=========
+
+If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
+debugging enabled by adjusting the value in:
+
+ /sys/module/cachefiles/parameters/debug
+
+This is a bitmask of debugging streams to enable:
+
+ BIT VALUE STREAM POINT
+ ======= ======= =============================== =======================
+ 0 1 General Function entry trace
+ 1 2 Function exit trace
+ 2 4 General
+
+The appropriate set of values should be OR'd together and the result written to
+the control file. For example:
+
+ echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug
+
+will turn on all function entry debugging.
diff --git a/fs/Kconfig b/fs/Kconfig
index 3942df6..c0022b1 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -69,6 +69,7 @@ config GENERIC_ACL
menu "Caches"

source "fs/fscache/Kconfig"
+source "fs/cachefiles/Kconfig"

endmenu

diff --git a/fs/Makefile b/fs/Makefile
index 24ce1fe..8fd4602 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -117,6 +117,7 @@ obj-$(CONFIG_AFS_FS) += afs/
obj-$(CONFIG_BEFS_FS) += befs/
obj-$(CONFIG_HOSTFS) += hostfs/
obj-$(CONFIG_HPPFS) += hppfs/
+obj-$(CONFIG_CACHEFILES) += cachefiles/
obj-$(CONFIG_DEBUG_FS) += debugfs/
obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
diff --git a/fs/cachefiles/Kconfig b/fs/cachefiles/Kconfig
new file mode 100644
index 0000000..80e9c61
--- /dev/null
+++ b/fs/cachefiles/Kconfig
@@ -0,0 +1,39 @@
+
+config CACHEFILES
+ tristate "Filesystem caching on files"
+ depends on FSCACHE && BLOCK
+ help
+ This permits use of a mounted filesystem as a cache for other
+ filesystems - primarily networking filesystems - thus allowing fast
+ local disk to enhance the speed of slower devices.
+
+ See Documentation/filesystems/caching/cachefiles.txt for more
+ information.
+
+config CACHEFILES_DEBUG
+ bool "Debug CacheFiles"
+ depends on CACHEFILES
+ help
+ This permits debugging to be dynamically enabled in the filesystem
+ caching on files module. If this is set, the debugging output may be
+ enabled by setting bits in /sys/modules/cachefiles/parameter/debug or
+ by including a debugging specifier in /etc/cachefilesd.conf.
+
+config CACHEFILES_HISTOGRAM
+ bool "Gather latency information on CacheFiles"
+ depends on CACHEFILES && PROC_FS
+ help
+
+ This option causes latency information to be gathered on CacheFiles
+ operation and exported through file:
+
+ /proc/fs/cachefiles/histogram
+
+ The generation of this histogram adds a certain amount of overhead to
+ execution as there are a number of points at which data is gathered,
+ and on a multi-CPU system these may be on cachelines that keep
+ bouncing between CPUs. On the other hand, the histogram may be
+ useful for debugging purposes. Saying 'N' here is recommended.
+
+ See Documentation/filesystems/caching/cachefiles.txt for more
+ information.
diff --git a/fs/cachefiles/Makefile b/fs/cachefiles/Makefile
new file mode 100644
index 0000000..32cbab0
--- /dev/null
+++ b/fs/cachefiles/Makefile
@@ -0,0 +1,18 @@
+#
+# Makefile for caching in a mounted filesystem
+#
+
+cachefiles-y := \
+ bind.o \
+ daemon.o \
+ interface.o \
+ key.o \
+ main.o \
+ namei.o \
+ rdwr.o \
+ security.o \
+ xattr.o
+
+cachefiles-$(CONFIG_CACHEFILES_HISTOGRAM) += proc.o
+
+obj-$(CONFIG_CACHEFILES) := cachefiles.o
diff --git a/fs/cachefiles/bind.c b/fs/cachefiles/bind.c
new file mode 100644
index 0000000..3797e00
--- /dev/null
+++ b/fs/cachefiles/bind.c
@@ -0,0 +1,286 @@
+/* Bind and unbind a cache from the filesystem backing it
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/ctype.h>
+#include "internal.h"
+
+static int cachefiles_daemon_add_cache(struct cachefiles_cache *caches);
+
+/*
+ * bind a directory as a cache
+ */
+int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args)
+{
+ _enter("{%u,%u,%u,%u,%u,%u},%s",
+ cache->frun_percent,
+ cache->fcull_percent,
+ cache->fstop_percent,
+ cache->brun_percent,
+ cache->bcull_percent,
+ cache->bstop_percent,
+ args);
+
+ /* start by checking things over */
+ ASSERT(cache->fstop_percent >= 0 &&
+ cache->fstop_percent < cache->fcull_percent &&
+ cache->fcull_percent < cache->frun_percent &&
+ cache->frun_percent < 100);
+
+ ASSERT(cache->bstop_percent >= 0 &&
+ cache->bstop_percent < cache->bcull_percent &&
+ cache->bcull_percent < cache->brun_percent &&
+ cache->brun_percent < 100);
+
+ if (*args) {
+ kerror("'bind' command doesn't take an argument");
+ return -EINVAL;
+ }
+
+ if (!cache->rootdirname) {
+ kerror("No cache directory specified");
+ return -EINVAL;
+ }
+
+ /* don't permit already bound caches to be re-bound */
+ if (test_bit(CACHEFILES_READY, &cache->flags)) {
+ kerror("Cache already bound");
+ return -EBUSY;
+ }
+
+ /* make sure we have copies of the tag and dirname strings */
+ if (!cache->tag) {
+ /* the tag string is released by the fops->release()
+ * function, so we don't release it on error here */
+ cache->tag = kstrdup("CacheFiles", GFP_KERNEL);
+ if (!cache->tag)
+ return -ENOMEM;
+ }
+
+ /* add the cache */
+ return cachefiles_daemon_add_cache(cache);
+}
+
+/*
+ * add a cache
+ */
+static int cachefiles_daemon_add_cache(struct cachefiles_cache *cache)
+{
+ struct cachefiles_object *fsdef;
+ struct nameidata nd;
+ struct kstatfs stats;
+ struct dentry *graveyard, *cachedir, *root;
+ const struct cred *saved_cred;
+ int ret;
+
+ _enter("");
+
+ /* we want to work under the module's security ID */
+ ret = cachefiles_get_security_ID(cache);
+ if (ret < 0)
+ return ret;
+
+ cachefiles_begin_secure(cache, &saved_cred);
+
+ /* allocate the root index object */
+ ret = -ENOMEM;
+
+ fsdef = kmem_cache_alloc(cachefiles_object_jar, GFP_KERNEL);
+ if (!fsdef)
+ goto error_root_object;
+
+ ASSERTCMP(fsdef->backer, ==, NULL);
+
+ atomic_set(&fsdef->usage, 1);
+ fsdef->type = FSCACHE_COOKIE_TYPE_INDEX;
+
+ _debug("- fsdef %p", fsdef);
+
+ /* look up the directory at the root of the cache */
+ memset(&nd, 0, sizeof(nd));
+
+ ret = path_lookup(cache->rootdirname, LOOKUP_DIRECTORY, &nd);
+ if (ret < 0)
+ goto error_open_root;
+
+ cache->mnt = mntget(nd.path.mnt);
+ root = dget(nd.path.dentry);
+ path_put(&nd.path);
+
+ /* check parameters */
+ ret = -EOPNOTSUPP;
+ if (!root->d_inode ||
+ !root->d_inode->i_op ||
+ !root->d_inode->i_op->lookup ||
+ !root->d_inode->i_op->mkdir ||
+ !root->d_inode->i_op->setxattr ||
+ !root->d_inode->i_op->getxattr ||
+ !root->d_sb ||
+ !root->d_sb->s_op ||
+ !root->d_sb->s_op->statfs ||
+ !root->d_sb->s_op->sync_fs)
+ goto error_unsupported;
+
+ ret = -EROFS;
+ if (root->d_sb->s_flags & MS_RDONLY)
+ goto error_unsupported;
+
+ /* determine the security of the on-disk cache as this governs
+ * security ID of files we create */
+ ret = cachefiles_determine_cache_security(cache, root, &saved_cred);
+ if (ret < 0)
+ goto error_unsupported;
+
+ /* get the cache size and blocksize */
+ ret = vfs_statfs(root, &stats);
+ if (ret < 0)
+ goto error_unsupported;
+
+ ret = -ERANGE;
+ if (stats.f_bsize <= 0)
+ goto error_unsupported;
+
+ ret = -EOPNOTSUPP;
+ if (stats.f_bsize > PAGE_SIZE)
+ goto error_unsupported;
+
+ cache->bsize = stats.f_bsize;
+ cache->bshift = 0;
+ if (stats.f_bsize < PAGE_SIZE)
+ cache->bshift = PAGE_SHIFT - ilog2(stats.f_bsize);
+
+ _debug("blksize %u (shift %u)",
+ cache->bsize, cache->bshift);
+
+ _debug("size %llu, avail %llu",
+ (unsigned long long) stats.f_blocks,
+ (unsigned long long) stats.f_bavail);
+
+ /* set up caching limits */
+ do_div(stats.f_files, 100);
+ cache->fstop = stats.f_files * cache->fstop_percent;
+ cache->fcull = stats.f_files * cache->fcull_percent;
+ cache->frun = stats.f_files * cache->frun_percent;
+
+ _debug("limits {%llu,%llu,%llu} files",
+ (unsigned long long) cache->frun,
+ (unsigned long long) cache->fcull,
+ (unsigned long long) cache->fstop);
+
+ stats.f_blocks >>= cache->bshift;
+ do_div(stats.f_blocks, 100);
+ cache->bstop = stats.f_blocks * cache->bstop_percent;
+ cache->bcull = stats.f_blocks * cache->bcull_percent;
+ cache->brun = stats.f_blocks * cache->brun_percent;
+
+ _debug("limits {%llu,%llu,%llu} blocks",
+ (unsigned long long) cache->brun,
+ (unsigned long long) cache->bcull,
+ (unsigned long long) cache->bstop);
+
+ /* get the cache directory and check its type */
+ cachedir = cachefiles_get_directory(cache, root, "cache");
+ if (IS_ERR(cachedir)) {
+ ret = PTR_ERR(cachedir);
+ goto error_unsupported;
+ }
+
+ fsdef->dentry = cachedir;
+ fsdef->fscache.cookie = NULL;
+
+ ret = cachefiles_check_object_type(fsdef);
+ if (ret < 0)
+ goto error_unsupported;
+
+ /* get the graveyard directory */
+ graveyard = cachefiles_get_directory(cache, root, "graveyard");
+ if (IS_ERR(graveyard)) {
+ ret = PTR_ERR(graveyard);
+ goto error_unsupported;
+ }
+
+ cache->graveyard = graveyard;
+
+ /* publish the cache */
+ fscache_init_cache(&cache->cache,
+ &cachefiles_cache_ops,
+ "%s",
+ fsdef->dentry->d_sb->s_id);
+
+ fscache_object_init(&fsdef->fscache, NULL, &cache->cache);
+
+ ret = fscache_add_cache(&cache->cache, &fsdef->fscache, cache->tag);
+ if (ret < 0)
+ goto error_add_cache;
+
+ /* done */
+ set_bit(CACHEFILES_READY, &cache->flags);
+ dput(root);
+
+ printk(KERN_INFO "CacheFiles:"
+ " File cache on %s registered\n",
+ cache->cache.identifier);
+
+ /* check how much space the cache has */
+ cachefiles_has_space(cache, 0, 0);
+ cachefiles_end_secure(cache, saved_cred);
+ return 0;
+
+error_add_cache:
+ dput(cache->graveyard);
+ cache->graveyard = NULL;
+error_unsupported:
+ mntput(cache->mnt);
+ cache->mnt = NULL;
+ dput(fsdef->dentry);
+ fsdef->dentry = NULL;
+ dput(root);
+error_open_root:
+ kmem_cache_free(cachefiles_object_jar, fsdef);
+error_root_object:
+ cachefiles_end_secure(cache, saved_cred);
+ kerror("Failed to register: %d", ret);
+ return ret;
+}
+
+/*
+ * unbind a cache on fd release
+ */
+void cachefiles_daemon_unbind(struct cachefiles_cache *cache)
+{
+ _enter("");
+
+ if (test_bit(CACHEFILES_READY, &cache->flags)) {
+ printk(KERN_INFO "CacheFiles:"
+ " File cache on %s unregistering\n",
+ cache->cache.identifier);
+
+ fscache_withdraw_cache(&cache->cache);
+ }
+
+ dput(cache->graveyard);
+ mntput(cache->mnt);
+
+ kfree(cache->rootdirname);
+ kfree(cache->secctx);
+ kfree(cache->tag);
+
+ _leave("");
+}
diff --git a/fs/cachefiles/daemon.c b/fs/cachefiles/daemon.c
new file mode 100644
index 0000000..b4c95c4
--- /dev/null
+++ b/fs/cachefiles/daemon.c
@@ -0,0 +1,754 @@
+/* Daemon interface
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/poll.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/ctype.h>
+#include "internal.h"
+
+static int cachefiles_daemon_open(struct inode *, struct file *);
+static int cachefiles_daemon_release(struct inode *, struct file *);
+static ssize_t cachefiles_daemon_read(struct file *, char __user *, size_t,
+ loff_t *);
+static ssize_t cachefiles_daemon_write(struct file *, const char __user *,
+ size_t, loff_t *);
+static unsigned int cachefiles_daemon_poll(struct file *,
+ struct poll_table_struct *);
+static int cachefiles_daemon_frun(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_fcull(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_fstop(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_brun(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_bcull(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_bstop(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_cull(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_debug(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_dir(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_inuse(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_secctx(struct cachefiles_cache *, char *);
+static int cachefiles_daemon_tag(struct cachefiles_cache *, char *);
+
+static unsigned long cachefiles_open;
+
+const struct file_operations cachefiles_daemon_fops = {
+ .owner = THIS_MODULE,
+ .open = cachefiles_daemon_open,
+ .release = cachefiles_daemon_release,
+ .read = cachefiles_daemon_read,
+ .write = cachefiles_daemon_write,
+ .poll = cachefiles_daemon_poll,
+};
+
+struct cachefiles_daemon_cmd {
+ char name[8];
+ int (*handler)(struct cachefiles_cache *cache, char *args);
+};
+
+static const struct cachefiles_daemon_cmd cachefiles_daemon_cmds[] = {
+ { "bind", cachefiles_daemon_bind },
+ { "brun", cachefiles_daemon_brun },
+ { "bcull", cachefiles_daemon_bcull },
+ { "bstop", cachefiles_daemon_bstop },
+ { "cull", cachefiles_daemon_cull },
+ { "debug", cachefiles_daemon_debug },
+ { "dir", cachefiles_daemon_dir },
+ { "frun", cachefiles_daemon_frun },
+ { "fcull", cachefiles_daemon_fcull },
+ { "fstop", cachefiles_daemon_fstop },
+ { "inuse", cachefiles_daemon_inuse },
+ { "secctx", cachefiles_daemon_secctx },
+ { "tag", cachefiles_daemon_tag },
+ { "", NULL }
+};
+
+
+/*
+ * do various checks
+ */
+static int cachefiles_daemon_open(struct inode *inode, struct file *file)
+{
+ struct cachefiles_cache *cache;
+
+ _enter("");
+
+ /* only the superuser may do this */
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* the cachefiles device may only be open once at a time */
+ if (xchg(&cachefiles_open, 1) == 1)
+ return -EBUSY;
+
+ /* allocate a cache record */
+ cache = kzalloc(sizeof(struct cachefiles_cache), GFP_KERNEL);
+ if (!cache) {
+ cachefiles_open = 0;
+ return -ENOMEM;
+ }
+
+ mutex_init(&cache->daemon_mutex);
+ cache->active_nodes = RB_ROOT;
+ rwlock_init(&cache->active_lock);
+ init_waitqueue_head(&cache->daemon_pollwq);
+
+ /* set default caching limits
+ * - limit at 1% free space and/or free files
+ * - cull below 5% free space and/or free files
+ * - cease culling above 7% free space and/or free files
+ */
+ cache->frun_percent = 7;
+ cache->fcull_percent = 5;
+ cache->fstop_percent = 1;
+ cache->brun_percent = 7;
+ cache->bcull_percent = 5;
+ cache->bstop_percent = 1;
+
+ file->private_data = cache;
+ cache->cachefilesd = file;
+ return 0;
+}
+
+/*
+ * release a cache
+ */
+static int cachefiles_daemon_release(struct inode *inode, struct file *file)
+{
+ struct cachefiles_cache *cache = file->private_data;
+
+ _enter("");
+
+ ASSERT(cache);
+
+ set_bit(CACHEFILES_DEAD, &cache->flags);
+
+ cachefiles_daemon_unbind(cache);
+
+ ASSERT(!cache->active_nodes.rb_node);
+
+ /* clean up the control file interface */
+ cache->cachefilesd = NULL;
+ file->private_data = NULL;
+ cachefiles_open = 0;
+
+ kfree(cache);
+
+ _leave("");
+ return 0;
+}
+
+/*
+ * read the cache state
+ */
+static ssize_t cachefiles_daemon_read(struct file *file, char __user *_buffer,
+ size_t buflen, loff_t *pos)
+{
+ struct cachefiles_cache *cache = file->private_data;
+ char buffer[256];
+ int n;
+
+ //_enter(",,%zu,", buflen);
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags))
+ return 0;
+
+ /* check how much space the cache has */
+ cachefiles_has_space(cache, 0, 0);
+
+ /* summarise */
+ clear_bit(CACHEFILES_STATE_CHANGED, &cache->flags);
+
+ n = snprintf(buffer, sizeof(buffer),
+ "cull=%c"
+ " frun=%llx"
+ " fcull=%llx"
+ " fstop=%llx"
+ " brun=%llx"
+ " bcull=%llx"
+ " bstop=%llx",
+ test_bit(CACHEFILES_CULLING, &cache->flags) ? '1' : '0',
+ (unsigned long long) cache->frun,
+ (unsigned long long) cache->fcull,
+ (unsigned long long) cache->fstop,
+ (unsigned long long) cache->brun,
+ (unsigned long long) cache->bcull,
+ (unsigned long long) cache->bstop
+ );
+
+ if (n > buflen)
+ return -EMSGSIZE;
+
+ if (copy_to_user(_buffer, buffer, n) != 0)
+ return -EFAULT;
+
+ return n;
+}
+
+/*
+ * command the cache
+ */
+static ssize_t cachefiles_daemon_write(struct file *file,
+ const char __user *_data,
+ size_t datalen,
+ loff_t *pos)
+{
+ const struct cachefiles_daemon_cmd *cmd;
+ struct cachefiles_cache *cache = file->private_data;
+ ssize_t ret;
+ char *data, *args, *cp;
+
+ //_enter(",,%zu,", datalen);
+
+ ASSERT(cache);
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags))
+ return -EIO;
+
+ if (datalen < 0 || datalen > PAGE_SIZE - 1)
+ return -EOPNOTSUPP;
+
+ /* drag the command string into the kernel so we can parse it */
+ data = kmalloc(datalen + 1, GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ if (copy_from_user(data, _data, datalen) != 0)
+ goto error;
+
+ data[datalen] = '\0';
+
+ ret = -EINVAL;
+ if (memchr(data, '\0', datalen))
+ goto error;
+
+ /* strip any newline */
+ cp = memchr(data, '\n', datalen);
+ if (cp) {
+ if (cp == data)
+ goto error;
+
+ *cp = '\0';
+ }
+
+ /* parse the command */
+ ret = -EOPNOTSUPP;
+
+ for (args = data; *args; args++)
+ if (isspace(*args))
+ break;
+ if (*args) {
+ if (args == data)
+ goto error;
+ *args = '\0';
+ for (args++; isspace(*args); args++)
+ continue;
+ }
+
+ /* run the appropriate command handler */
+ for (cmd = cachefiles_daemon_cmds; cmd->name[0]; cmd++)
+ if (strcmp(cmd->name, data) == 0)
+ goto found_command;
+
+error:
+ kfree(data);
+ //_leave(" = %zd", ret);
+ return ret;
+
+found_command:
+ mutex_lock(&cache->daemon_mutex);
+
+ ret = -EIO;
+ if (!test_bit(CACHEFILES_DEAD, &cache->flags))
+ ret = cmd->handler(cache, args);
+
+ mutex_unlock(&cache->daemon_mutex);
+
+ if (ret == 0)
+ ret = datalen;
+ goto error;
+}
+
+/*
+ * poll for culling state
+ * - use POLLOUT to indicate culling state
+ */
+static unsigned int cachefiles_daemon_poll(struct file *file,
+ struct poll_table_struct *poll)
+{
+ struct cachefiles_cache *cache = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &cache->daemon_pollwq, poll);
+ mask = 0;
+
+ if (test_bit(CACHEFILES_STATE_CHANGED, &cache->flags))
+ mask |= POLLIN;
+
+ if (test_bit(CACHEFILES_CULLING, &cache->flags))
+ mask |= POLLOUT;
+
+ return mask;
+}
+
+/*
+ * give a range error for cache space constraints
+ * - can be tail-called
+ */
+static int cachefiles_daemon_range_error(struct cachefiles_cache *cache,
+ char *args)
+{
+ kerror("Free space limits must be in range"
+ " 0%%<=stop<cull<run<100%%");
+
+ return -EINVAL;
+}
+
+/*
+ * set the percentage of files at which to stop culling
+ * - command: "frun <N>%"
+ */
+static int cachefiles_daemon_frun(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long frun;
+
+ _enter(",%s", args);
+
+ if (!*args)
+ return -EINVAL;
+
+ frun = simple_strtoul(args, &args, 10);
+ if (args[0] != '%' || args[1] != '\0')
+ return -EINVAL;
+
+ if (frun <= cache->fcull_percent || frun >= 100)
+ return cachefiles_daemon_range_error(cache, args);
+
+ cache->frun_percent = frun;
+ return 0;
+}
+
+/*
+ * set the percentage of files at which to start culling
+ * - command: "fcull <N>%"
+ */
+static int cachefiles_daemon_fcull(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long fcull;
+
+ _enter(",%s", args);
+
+ if (!*args)
+ return -EINVAL;
+
+ fcull = simple_strtoul(args, &args, 10);
+ if (args[0] != '%' || args[1] != '\0')
+ return -EINVAL;
+
+ if (fcull <= cache->fstop_percent || fcull >= cache->frun_percent)
+ return cachefiles_daemon_range_error(cache, args);
+
+ cache->fcull_percent = fcull;
+ return 0;
+}
+
+/*
+ * set the percentage of files at which to stop allocating
+ * - command: "fstop <N>%"
+ */
+static int cachefiles_daemon_fstop(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long fstop;
+
+ _enter(",%s", args);
+
+ if (!*args)
+ return -EINVAL;
+
+ fstop = simple_strtoul(args, &args, 10);
+ if (args[0] != '%' || args[1] != '\0')
+ return -EINVAL;
+
+ if (fstop < 0 || fstop >= cache->fcull_percent)
+ return cachefiles_daemon_range_error(cache, args);
+
+ cache->fstop_percent = fstop;
+ return 0;
+}
+
+/*
+ * set the percentage of blocks at which to stop culling
+ * - command: "brun <N>%"
+ */
+static int cachefiles_daemon_brun(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long brun;
+
+ _enter(",%s", args);
+
+ if (!*args)
+ return -EINVAL;
+
+ brun = simple_strtoul(args, &args, 10);
+ if (args[0] != '%' || args[1] != '\0')
+ return -EINVAL;
+
+ if (brun <= cache->bcull_percent || brun >= 100)
+ return cachefiles_daemon_range_error(cache, args);
+
+ cache->brun_percent = brun;
+ return 0;
+}
+
+/*
+ * set the percentage of blocks at which to start culling
+ * - command: "bcull <N>%"
+ */
+static int cachefiles_daemon_bcull(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long bcull;
+
+ _enter(",%s", args);
+
+ if (!*args)
+ return -EINVAL;
+
+ bcull = simple_strtoul(args, &args, 10);
+ if (args[0] != '%' || args[1] != '\0')
+ return -EINVAL;
+
+ if (bcull <= cache->bstop_percent || bcull >= cache->brun_percent)
+ return cachefiles_daemon_range_error(cache, args);
+
+ cache->bcull_percent = bcull;
+ return 0;
+}
+
+/*
+ * set the percentage of blocks at which to stop allocating
+ * - command: "bstop <N>%"
+ */
+static int cachefiles_daemon_bstop(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long bstop;
+
+ _enter(",%s", args);
+
+ if (!*args)
+ return -EINVAL;
+
+ bstop = simple_strtoul(args, &args, 10);
+ if (args[0] != '%' || args[1] != '\0')
+ return -EINVAL;
+
+ if (bstop < 0 || bstop >= cache->bcull_percent)
+ return cachefiles_daemon_range_error(cache, args);
+
+ cache->bstop_percent = bstop;
+ return 0;
+}
+
+/*
+ * set the cache directory
+ * - command: "dir <name>"
+ */
+static int cachefiles_daemon_dir(struct cachefiles_cache *cache, char *args)
+{
+ char *dir;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ kerror("Empty directory specified");
+ return -EINVAL;
+ }
+
+ if (cache->rootdirname) {
+ kerror("Second cache directory specified");
+ return -EEXIST;
+ }
+
+ dir = kstrdup(args, GFP_KERNEL);
+ if (!dir)
+ return -ENOMEM;
+
+ cache->rootdirname = dir;
+ return 0;
+}
+
+/*
+ * set the cache security context
+ * - command: "secctx <ctx>"
+ */
+static int cachefiles_daemon_secctx(struct cachefiles_cache *cache, char *args)
+{
+ char *secctx;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ kerror("Empty security context specified");
+ return -EINVAL;
+ }
+
+ if (cache->secctx) {
+ kerror("Second security context specified");
+ return -EINVAL;
+ }
+
+ secctx = kstrdup(args, GFP_KERNEL);
+ if (!secctx)
+ return -ENOMEM;
+
+ cache->secctx = secctx;
+ return 0;
+}
+
+/*
+ * set the cache tag
+ * - command: "tag <name>"
+ */
+static int cachefiles_daemon_tag(struct cachefiles_cache *cache, char *args)
+{
+ char *tag;
+
+ _enter(",%s", args);
+
+ if (!*args) {
+ kerror("Empty tag specified");
+ return -EINVAL;
+ }
+
+ if (cache->tag)
+ return -EEXIST;
+
+ tag = kstrdup(args, GFP_KERNEL);
+ if (!tag)
+ return -ENOMEM;
+
+ cache->tag = tag;
+ return 0;
+}
+
+/*
+ * request a node in the cache be culled from the current working directory
+ * - command: "cull <name>"
+ */
+static int cachefiles_daemon_cull(struct cachefiles_cache *cache, char *args)
+{
+ struct fs_struct *fs;
+ struct dentry *dir;
+ const struct cred *saved_cred;
+ int ret;
+
+ _enter(",%s", args);
+
+ if (strchr(args, '/'))
+ goto inval;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+ kerror("cull applied to unready cache");
+ return -EIO;
+ }
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+ kerror("cull applied to dead cache");
+ return -EIO;
+ }
+
+ /* extract the directory dentry from the cwd */
+ fs = current->fs;
+ read_lock(&fs->lock);
+ dir = dget(fs->pwd.dentry);
+ read_unlock(&fs->lock);
+
+ if (!S_ISDIR(dir->d_inode->i_mode))
+ goto notdir;
+
+ cachefiles_begin_secure(cache, &saved_cred);
+ ret = cachefiles_cull(cache, dir, args);
+ cachefiles_end_secure(cache, saved_cred);
+
+ dput(dir);
+ _leave(" = %d", ret);
+ return ret;
+
+notdir:
+ dput(dir);
+ kerror("cull command requires dirfd to be a directory");
+ return -ENOTDIR;
+
+inval:
+ kerror("cull command requires dirfd and filename");
+ return -EINVAL;
+}
+
+/*
+ * set debugging mode
+ * - command: "debug <mask>"
+ */
+static int cachefiles_daemon_debug(struct cachefiles_cache *cache, char *args)
+{
+ unsigned long mask;
+
+ _enter(",%s", args);
+
+ mask = simple_strtoul(args, &args, 0);
+ if (args[0] != '\0')
+ goto inval;
+
+ cachefiles_debug = mask;
+ _leave(" = 0");
+ return 0;
+
+inval:
+ kerror("debug command requires mask");
+ return -EINVAL;
+}
+
+/*
+ * find out whether an object in the current working directory is in use or not
+ * - command: "inuse <name>"
+ */
+static int cachefiles_daemon_inuse(struct cachefiles_cache *cache, char *args)
+{
+ struct fs_struct *fs;
+ struct dentry *dir;
+ const struct cred *saved_cred;
+ int ret;
+
+ //_enter(",%s", args);
+
+ if (strchr(args, '/'))
+ goto inval;
+
+ if (!test_bit(CACHEFILES_READY, &cache->flags)) {
+ kerror("inuse applied to unready cache");
+ return -EIO;
+ }
+
+ if (test_bit(CACHEFILES_DEAD, &cache->flags)) {
+ kerror("inuse applied to dead cache");
+ return -EIO;
+ }
+
+ /* extract the directory dentry from the cwd */
+ fs = current->fs;
+ read_lock(&fs->lock);
+ dir = dget(fs->pwd.dentry);
+ read_unlock(&fs->lock);
+
+ if (!S_ISDIR(dir->d_inode->i_mode))
+ goto notdir;
+
+ cachefiles_begin_secure(cache, &saved_cred);
+ ret = cachefiles_check_in_use(cache, dir, args);
+ cachefiles_end_secure(cache, saved_cred);
+
+ dput(dir);
+ //_leave(" = %d", ret);
+ return ret;
+
+notdir:
+ dput(dir);
+ kerror("inuse command requires dirfd to be a directory");
+ return -ENOTDIR;
+
+inval:
+ kerror("inuse command requires dirfd and filename");
+ return -EINVAL;
+}
+
+/*
+ * see if we have space for a number of pages and/or a number of files in the
+ * cache
+ */
+int cachefiles_has_space(struct cachefiles_cache *cache,
+ unsigned fnr, unsigned bnr)
+{
+ struct kstatfs stats;
+ int ret;
+
+ //_enter("{%llu,%llu,%llu,%llu,%llu,%llu},%u,%u",
+ // (unsigned long long) cache->frun,
+ // (unsigned long long) cache->fcull,
+ // (unsigned long long) cache->fstop,
+ // (unsigned long long) cache->brun,
+ // (unsigned long long) cache->bcull,
+ // (unsigned long long) cache->bstop,
+ // fnr, bnr);
+
+ /* find out how many pages of blockdev are available */
+ memset(&stats, 0, sizeof(stats));
+
+ ret = vfs_statfs(cache->mnt->mnt_root, &stats);
+ if (ret < 0) {
+ if (ret == -EIO)
+ cachefiles_io_error(cache, "statfs failed");
+ _leave(" = %d", ret);
+ return ret;
+ }
+
+ stats.f_bavail >>= cache->bshift;
+
+ //_debug("avail %llu,%llu",
+ // (unsigned long long) stats.f_ffree,
+ // (unsigned long long) stats.f_bavail);
+
+ /* see if there is sufficient space */
+ if (stats.f_ffree > fnr)
+ stats.f_ffree -= fnr;
+ else
+ stats.f_ffree = 0;
+
+ if (stats.f_bavail > bnr)
+ stats.f_bavail -= bnr;
+ else
+ stats.f_bavail = 0;
+
+ ret = -ENOBUFS;
+ if (stats.f_ffree < cache->fstop ||
+ stats.f_bavail < cache->bstop)
+ goto begin_cull;
+
+ ret = 0;
+ if (stats.f_ffree < cache->fcull ||
+ stats.f_bavail < cache->bcull)
+ goto begin_cull;
+
+ if (test_bit(CACHEFILES_CULLING, &cache->flags) &&
+ stats.f_ffree >= cache->frun &&
+ stats.f_bavail >= cache->brun &&
+ test_and_clear_bit(CACHEFILES_CULLING, &cache->flags)
+ ) {
+ _debug("cease culling");
+ cachefiles_state_changed(cache);
+ }
+
+ //_leave(" = 0");
+ return 0;
+
+begin_cull:
+ if (!test_and_set_bit(CACHEFILES_CULLING, &cache->flags)) {
+ _debug("### CULL CACHE ###");
+ cachefiles_state_changed(cache);
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
new file mode 100644
index 0000000..1e96234
--- /dev/null
+++ b/fs/cachefiles/interface.c
@@ -0,0 +1,449 @@
+/* FS-Cache interface to CacheFiles
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/mount.h>
+#include <linux/buffer_head.h>
+#include "internal.h"
+
+#define list_to_page(head) (list_entry((head)->prev, struct page, lru))
+
+struct cachefiles_lookup_data {
+ struct cachefiles_xattr *auxdata; /* auxiliary data */
+ char *key; /* key path */
+};
+
+static int cachefiles_attr_changed(struct fscache_object *_object);
+
+/*
+ * allocate an object record for a cookie lookup and prepare the lookup data
+ */
+static struct fscache_object *cachefiles_alloc_object(
+ struct fscache_cache *_cache,
+ struct fscache_cookie *cookie)
+{
+ struct cachefiles_lookup_data *lookup_data;
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ struct cachefiles_xattr *auxdata;
+ unsigned keylen, auxlen;
+ void *buffer;
+ char *key;
+
+ cache = container_of(_cache, struct cachefiles_cache, cache);
+
+ _enter("{%s},%p,", cache->cache.identifier, cookie);
+
+ lookup_data = kmalloc(sizeof(*lookup_data), GFP_KERNEL);
+ if (!lookup_data)
+ goto nomem_lookup_data;
+
+ /* create a new object record and a temporary leaf image */
+ object = kmem_cache_alloc(cachefiles_object_jar, GFP_KERNEL);
+ if (!object)
+ goto nomem_object;
+
+ ASSERTCMP(object->backer, ==, NULL);
+
+ BUG_ON(test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+ atomic_set(&object->usage, 1);
+
+ fscache_object_init(&object->fscache, cookie, &cache->cache);
+
+ object->type = cookie->def->type;
+
+ /* get hold of the raw key
+ * - stick the length on the front and leave space on the back for the
+ * encoder
+ */
+ buffer = kmalloc((2 + 512) + 3, GFP_KERNEL);
+ if (!buffer)
+ goto nomem_buffer;
+
+ keylen = cookie->def->get_key(cookie->netfs_data, buffer + 2, 512);
+ ASSERTCMP(keylen, <, 512);
+
+ *(uint16_t *)buffer = keylen;
+ ((char *)buffer)[keylen + 2] = 0;
+ ((char *)buffer)[keylen + 3] = 0;
+ ((char *)buffer)[keylen + 4] = 0;
+
+ /* turn the raw key into something that can work with as a filename */
+ key = cachefiles_cook_key(buffer, keylen + 2, object->type);
+ if (!key)
+ goto nomem_key;
+
+ /* get hold of the auxiliary data and prepend the object type */
+ auxdata = buffer;
+ auxlen = 0;
+ if (cookie->def->get_aux) {
+ auxlen = cookie->def->get_aux(cookie->netfs_data,
+ auxdata->data, 511);
+ ASSERTCMP(auxlen, <, 511);
+ }
+
+ auxdata->len = auxlen + 1;
+ auxdata->type = cookie->def->type;
+
+ lookup_data->auxdata = auxdata;
+ lookup_data->key = key;
+ object->lookup_data = lookup_data;
+
+ _leave(" = %p [%p]", &object->fscache, lookup_data);
+ return &object->fscache;
+
+nomem_key:
+ kfree(buffer);
+nomem_buffer:
+ BUG_ON(test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+ kmem_cache_free(cachefiles_object_jar, object);
+ fscache_object_destroyed(&cache->cache);
+nomem_object:
+ kfree(lookup_data);
+nomem_lookup_data:
+ _leave(" = -ENOMEM");
+ return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * attempt to look up the nominated node in this cache
+ */
+static void cachefiles_lookup_object(struct fscache_object *_object)
+{
+ struct cachefiles_lookup_data *lookup_data;
+ struct cachefiles_object *parent, *object;
+ struct cachefiles_cache *cache;
+ const struct cred *saved_cred;
+ int ret;
+
+ _enter("{OBJ%x}", _object->debug_id);
+
+ cache = container_of(_object->cache, struct cachefiles_cache, cache);
+ parent = container_of(_object->parent,
+ struct cachefiles_object, fscache);
+ object = container_of(_object, struct cachefiles_object, fscache);
+ lookup_data = object->lookup_data;
+
+ ASSERTCMP(lookup_data, !=, NULL);
+
+ /* look up the key, creating any missing bits */
+ cachefiles_begin_secure(cache, &saved_cred);
+ ret = cachefiles_walk_to_object(parent, object,
+ lookup_data->key,
+ lookup_data->auxdata);
+ cachefiles_end_secure(cache, saved_cred);
+
+ /* polish off by setting the attributes of non-index files */
+ if (ret == 0 &&
+ object->fscache.cookie->def->type != FSCACHE_COOKIE_TYPE_INDEX)
+ cachefiles_attr_changed(&object->fscache);
+
+ if (ret < 0) {
+ printk(KERN_WARNING "CacheFiles: Lookup failed error %d\n",
+ ret);
+ fscache_object_lookup_error(&object->fscache);
+ }
+
+ _leave(" [%d]", ret);
+}
+
+/*
+ * indication of lookup completion
+ */
+static void cachefiles_lookup_complete(struct fscache_object *_object)
+{
+ struct cachefiles_object *object;
+
+ object = container_of(_object, struct cachefiles_object, fscache);
+
+ _enter("{OBJ%x,%p}", object->fscache.debug_id, object->lookup_data);
+
+ if (object->lookup_data) {
+ kfree(object->lookup_data->key);
+ kfree(object->lookup_data->auxdata);
+ kfree(object->lookup_data);
+ object->lookup_data = NULL;
+ }
+}
+
+/*
+ * increment the usage count on an inode object (may fail if unmounting)
+ */
+static
+struct fscache_object *cachefiles_grab_object(struct fscache_object *_object)
+{
+ struct cachefiles_object *object =
+ container_of(_object, struct cachefiles_object, fscache);
+
+ _enter("{OBJ%x,%d}", _object->debug_id, atomic_read(&object->usage));
+
+#ifdef CACHEFILES_DEBUG_SLAB
+ ASSERT((atomic_read(&object->usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+ atomic_inc(&object->usage);
+ return &object->fscache;
+}
+
+/*
+ * update the auxilliary data for an object object on disk
+ */
+static void cachefiles_update_object(struct fscache_object *_object)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_xattr *auxdata;
+ struct cachefiles_cache *cache;
+ struct fscache_cookie *cookie;
+ const struct cred *saved_cred;
+ unsigned auxlen;
+
+ _enter("{OBJ%x}", _object->debug_id);
+
+ object = container_of(_object, struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache, struct cachefiles_cache,
+ cache);
+ cookie = object->fscache.cookie;
+
+ if (!cookie->def->get_aux) {
+ _leave(" [no aux]");
+ return;
+ }
+
+ auxdata = kmalloc(2 + 512 + 3, GFP_KERNEL);
+ if (!auxdata) {
+ _leave(" [nomem]");
+ return;
+ }
+
+ auxlen = cookie->def->get_aux(cookie->netfs_data, auxdata->data, 511);
+ ASSERTCMP(auxlen, <, 511);
+
+ auxdata->len = auxlen + 1;
+ auxdata->type = cookie->def->type;
+
+ cachefiles_begin_secure(cache, &saved_cred);
+ cachefiles_update_object_xattr(object, auxdata);
+ cachefiles_end_secure(cache, saved_cred);
+ kfree(auxdata);
+ _leave("");
+}
+
+/*
+ * discard the resources pinned by an object and effect retirement if
+ * requested
+ */
+static void cachefiles_drop_object(struct fscache_object *_object)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ const struct cred *saved_cred;
+
+ ASSERT(_object);
+
+ object = container_of(_object, struct cachefiles_object, fscache);
+
+ _enter("{OBJ%x,%d}",
+ object->fscache.debug_id, atomic_read(&object->usage));
+
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+#ifdef CACHEFILES_DEBUG_SLAB
+ ASSERT((atomic_read(&object->usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+ /* delete retired objects */
+ if (object->fscache.state == FSCACHE_OBJECT_RECYCLING &&
+ _object != cache->cache.fsdef
+ ) {
+ _debug("- retire object OBJ%x", object->fscache.debug_id);
+ cachefiles_begin_secure(cache, &saved_cred);
+ cachefiles_delete_object(cache, object);
+ cachefiles_end_secure(cache, saved_cred);
+ }
+
+ /* close the filesystem stuff attached to the object */
+ if (object->backer != object->dentry)
+ dput(object->backer);
+ object->backer = NULL;
+
+ /* note that the object is now inactive */
+ if (test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags)) {
+ write_lock(&cache->active_lock);
+ if (!test_and_clear_bit(CACHEFILES_OBJECT_ACTIVE,
+ &object->flags))
+ BUG();
+ rb_erase(&object->active_node, &cache->active_nodes);
+ wake_up_bit(&object->flags, CACHEFILES_OBJECT_ACTIVE);
+ write_unlock(&cache->active_lock);
+ }
+
+ dput(object->dentry);
+ object->dentry = NULL;
+
+ _leave("");
+}
+
+/*
+ * dispose of a reference to an object
+ */
+static void cachefiles_put_object(struct fscache_object *_object)
+{
+ struct cachefiles_object *object;
+ struct fscache_cache *cache;
+
+ ASSERT(_object);
+
+ object = container_of(_object, struct cachefiles_object, fscache);
+
+ _enter("{OBJ%x,%d}",
+ object->fscache.debug_id, atomic_read(&object->usage));
+
+#ifdef CACHEFILES_DEBUG_SLAB
+ ASSERT((atomic_read(&object->usage) & 0xffff0000) != 0x6b6b0000);
+#endif
+
+ ASSERTIFCMP(object->fscache.parent,
+ object->fscache.parent->n_children, >, 0);
+
+ if (atomic_dec_and_test(&object->usage)) {
+ _debug("- kill object OBJ%x", object->fscache.debug_id);
+
+ ASSERT(!test_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags));
+ ASSERTCMP(object->fscache.parent, ==, NULL);
+ ASSERTCMP(object->backer, ==, NULL);
+ ASSERTCMP(object->dentry, ==, NULL);
+ ASSERTCMP(object->fscache.n_ops, ==, 0);
+ ASSERTCMP(object->fscache.n_children, ==, 0);
+
+ if (object->lookup_data) {
+ kfree(object->lookup_data->key);
+ kfree(object->lookup_data->auxdata);
+ kfree(object->lookup_data);
+ object->lookup_data = NULL;
+ }
+
+ cache = object->fscache.cache;
+ kmem_cache_free(cachefiles_object_jar, object);
+ fscache_object_destroyed(cache);
+ }
+
+ _leave("");
+}
+
+/*
+ * sync a cache
+ */
+static void cachefiles_sync_cache(struct fscache_cache *_cache)
+{
+ struct cachefiles_cache *cache;
+ const struct cred *saved_cred;
+ int ret;
+
+ _enter("%p", _cache);
+
+ cache = container_of(_cache, struct cachefiles_cache, cache);
+
+ /* make sure all pages pinned by operations on behalf of the netfs are
+ * written to disc */
+ cachefiles_begin_secure(cache, &saved_cred);
+ ret = fsync_super(cache->mnt->mnt_sb);
+ cachefiles_end_secure(cache, saved_cred);
+
+ if (ret == -EIO)
+ cachefiles_io_error(cache,
+ "Attempt to sync backing fs superblock"
+ " returned error %d",
+ ret);
+}
+
+/*
+ * notification the attributes on an object have changed
+ * - called with reads/writes excluded by FS-Cache
+ */
+static int cachefiles_attr_changed(struct fscache_object *_object)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ const struct cred *saved_cred;
+ struct iattr newattrs;
+ uint64_t ni_size;
+ loff_t oi_size;
+ int ret;
+
+ _object->cookie->def->get_attr(_object->cookie->netfs_data, &ni_size);
+
+ _enter("{OBJ%x},[%llu]",
+ _object->debug_id, (unsigned long long) ni_size);
+
+ object = container_of(_object, struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ if (ni_size == object->i_size)
+ return 0;
+
+ if (!object->backer)
+ return -ENOBUFS;
+
+ ASSERT(S_ISREG(object->backer->d_inode->i_mode));
+
+ fscache_set_store_limit(&object->fscache, ni_size);
+
+ oi_size = i_size_read(object->backer->d_inode);
+ if (oi_size == ni_size)
+ return 0;
+
+ newattrs.ia_size = ni_size;
+ newattrs.ia_valid = ATTR_SIZE;
+
+ cachefiles_begin_secure(cache, &saved_cred);
+ mutex_lock(&object->backer->d_inode->i_mutex);
+ ret = notify_change(object->backer, &newattrs);
+ mutex_unlock(&object->backer->d_inode->i_mutex);
+ cachefiles_end_secure(cache, saved_cred);
+
+ if (ret == -EIO) {
+ fscache_set_store_limit(&object->fscache, 0);
+ cachefiles_io_error_obj(object, "Size set failed");
+ ret = -ENOBUFS;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * dissociate a cache from all the pages it was backing
+ */
+static void cachefiles_dissociate_pages(struct fscache_cache *cache)
+{
+ _enter("");
+}
+
+const struct fscache_cache_ops cachefiles_cache_ops = {
+ .name = "cachefiles",
+ .alloc_object = cachefiles_alloc_object,
+ .lookup_object = cachefiles_lookup_object,
+ .lookup_complete = cachefiles_lookup_complete,
+ .grab_object = cachefiles_grab_object,
+ .update_object = cachefiles_update_object,
+ .drop_object = cachefiles_drop_object,
+ .put_object = cachefiles_put_object,
+ .sync_cache = cachefiles_sync_cache,
+ .attr_changed = cachefiles_attr_changed,
+ .read_or_alloc_page = cachefiles_read_or_alloc_page,
+ .read_or_alloc_pages = cachefiles_read_or_alloc_pages,
+ .allocate_page = cachefiles_allocate_page,
+ .allocate_pages = cachefiles_allocate_pages,
+ .write_page = cachefiles_write_page,
+ .uncache_page = cachefiles_uncache_page,
+ .dissociate_pages = cachefiles_dissociate_pages,
+};
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
new file mode 100644
index 0000000..19218e1
--- /dev/null
+++ b/fs/cachefiles/internal.h
@@ -0,0 +1,360 @@
+/* General netfs cache on cache files internal defs
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fscache-cache.h>
+#include <linux/timer.h>
+#include <linux/wait.h>
+#include <linux/workqueue.h>
+#include <linux/security.h>
+
+struct cachefiles_cache;
+struct cachefiles_object;
+
+extern unsigned cachefiles_debug;
+#define CACHEFILES_DEBUG_KENTER 1
+#define CACHEFILES_DEBUG_KLEAVE 2
+#define CACHEFILES_DEBUG_KDEBUG 4
+
+/*
+ * node records
+ */
+struct cachefiles_object {
+ struct fscache_object fscache; /* fscache handle */
+ struct cachefiles_lookup_data *lookup_data; /* cached lookup data */
+ struct dentry *dentry; /* the file/dir representing this object */
+ struct dentry *backer; /* backing file */
+ loff_t i_size; /* object size */
+ unsigned long flags;
+#define CACHEFILES_OBJECT_ACTIVE 0 /* T if marked active */
+ atomic_t usage; /* object usage count */
+ uint8_t type; /* object type */
+ uint8_t new; /* T if object new */
+ spinlock_t work_lock;
+ struct rb_node active_node; /* link in active tree (dentry is key) */
+};
+
+extern struct kmem_cache *cachefiles_object_jar;
+
+/*
+ * Cache files cache definition
+ */
+struct cachefiles_cache {
+ struct fscache_cache cache; /* FS-Cache record */
+ struct vfsmount *mnt; /* mountpoint holding the cache */
+ struct dentry *graveyard; /* directory into which dead objects go */
+ struct file *cachefilesd; /* manager daemon handle */
+ const struct cred *cache_cred; /* security override for accessing cache */
+ struct mutex daemon_mutex; /* command serialisation mutex */
+ wait_queue_head_t daemon_pollwq; /* poll waitqueue for daemon */
+ struct rb_root active_nodes; /* active nodes (can't be culled) */
+ rwlock_t active_lock; /* lock for active_nodes */
+ atomic_t gravecounter; /* graveyard uniquifier */
+ unsigned frun_percent; /* when to stop culling (% files) */
+ unsigned fcull_percent; /* when to start culling (% files) */
+ unsigned fstop_percent; /* when to stop allocating (% files) */
+ unsigned brun_percent; /* when to stop culling (% blocks) */
+ unsigned bcull_percent; /* when to start culling (% blocks) */
+ unsigned bstop_percent; /* when to stop allocating (% blocks) */
+ unsigned bsize; /* cache's block size */
+ unsigned bshift; /* min(ilog2(PAGE_SIZE / bsize), 0) */
+ uint64_t frun; /* when to stop culling */
+ uint64_t fcull; /* when to start culling */
+ uint64_t fstop; /* when to stop allocating */
+ sector_t brun; /* when to stop culling */
+ sector_t bcull; /* when to start culling */
+ sector_t bstop; /* when to stop allocating */
+ unsigned long flags;
+#define CACHEFILES_READY 0 /* T if cache prepared */
+#define CACHEFILES_DEAD 1 /* T if cache dead */
+#define CACHEFILES_CULLING 2 /* T if cull engaged */
+#define CACHEFILES_STATE_CHANGED 3 /* T if state changed (poll trigger) */
+ char *rootdirname; /* name of cache root directory */
+ char *secctx; /* LSM security context */
+ char *tag; /* cache binding tag */
+};
+
+/*
+ * backing file read tracking
+ */
+struct cachefiles_one_read {
+ wait_queue_t monitor; /* link into monitored waitqueue */
+ struct page *back_page; /* backing file page we're waiting for */
+ struct page *netfs_page; /* netfs page we're going to fill */
+ struct fscache_retrieval *op; /* retrieval op covering this */
+ struct list_head op_link; /* link in op's todo list */
+};
+
+/*
+ * backing file write tracking
+ */
+struct cachefiles_one_write {
+ struct page *netfs_page; /* netfs page to copy */
+ struct cachefiles_object *object;
+ struct list_head obj_link; /* link in object's lists */
+ fscache_rw_complete_t end_io_func;
+ void *context;
+};
+
+/*
+ * auxiliary data xattr buffer
+ */
+struct cachefiles_xattr {
+ uint16_t len;
+ uint8_t type;
+ uint8_t data[];
+};
+
+/*
+ * note change of state for daemon
+ */
+static inline void cachefiles_state_changed(struct cachefiles_cache *cache)
+{
+ set_bit(CACHEFILES_STATE_CHANGED, &cache->flags);
+ wake_up_all(&cache->daemon_pollwq);
+}
+
+/*
+ * cf-bind.c
+ */
+extern int cachefiles_daemon_bind(struct cachefiles_cache *cache, char *args);
+extern void cachefiles_daemon_unbind(struct cachefiles_cache *cache);
+
+/*
+ * cf-daemon.c
+ */
+extern const struct file_operations cachefiles_daemon_fops;
+
+extern int cachefiles_has_space(struct cachefiles_cache *cache,
+ unsigned fnr, unsigned bnr);
+
+/*
+ * cf-interface.c
+ */
+extern const struct fscache_cache_ops cachefiles_cache_ops;
+
+/*
+ * cf-key.c
+ */
+extern char *cachefiles_cook_key(const u8 *raw, int keylen, uint8_t type);
+
+/*
+ * cf-namei.c
+ */
+extern int cachefiles_delete_object(struct cachefiles_cache *cache,
+ struct cachefiles_object *object);
+extern int cachefiles_walk_to_object(struct cachefiles_object *parent,
+ struct cachefiles_object *object,
+ const char *key,
+ struct cachefiles_xattr *auxdata);
+extern struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+ struct dentry *dir,
+ const char *name);
+
+extern int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
+ char *filename);
+
+extern int cachefiles_check_in_use(struct cachefiles_cache *cache,
+ struct dentry *dir, char *filename);
+
+/*
+ * cf-proc.c
+ */
+#ifdef CONFIG_CACHEFILES_HISTOGRAM
+extern atomic_t cachefiles_lookup_histogram[HZ];
+extern atomic_t cachefiles_mkdir_histogram[HZ];
+extern atomic_t cachefiles_create_histogram[HZ];
+
+extern int __init cachefiles_proc_init(void);
+extern void cachefiles_proc_cleanup(void);
+static inline
+void cachefiles_hist(atomic_t histogram[], unsigned long start_jif)
+{
+ unsigned long jif = jiffies - start_jif;
+ if (jif >= HZ)
+ jif = HZ - 1;
+ atomic_inc(&histogram[jif]);
+}
+
+#else
+#define cachefiles_proc_init() (0)
+#define cachefiles_proc_cleanup() do {} while (0)
+#define cachefiles_hist(hist, start_jif) do {} while (0)
+#endif
+
+/*
+ * cf-rdwr.c
+ */
+extern int cachefiles_read_or_alloc_page(struct fscache_retrieval *,
+ struct page *, gfp_t);
+extern int cachefiles_read_or_alloc_pages(struct fscache_retrieval *,
+ struct list_head *, unsigned *,
+ gfp_t);
+extern int cachefiles_allocate_page(struct fscache_retrieval *, struct page *,
+ gfp_t);
+extern int cachefiles_allocate_pages(struct fscache_retrieval *,
+ struct list_head *, unsigned *, gfp_t);
+extern int cachefiles_write_page(struct fscache_storage *, struct page *);
+extern void cachefiles_uncache_page(struct fscache_object *, struct page *);
+
+/*
+ * cf-security.c
+ */
+extern int cachefiles_get_security_ID(struct cachefiles_cache *cache);
+extern int cachefiles_determine_cache_security(struct cachefiles_cache *cache,
+ struct dentry *root,
+ const struct cred **_saved_cred);
+
+static inline void cachefiles_begin_secure(struct cachefiles_cache *cache,
+ const struct cred **_saved_cred)
+{
+ *_saved_cred = override_creds(cache->cache_cred);
+}
+
+static inline void cachefiles_end_secure(struct cachefiles_cache *cache,
+ const struct cred *saved_cred)
+{
+ revert_creds(saved_cred);
+}
+
+/*
+ * cf-xattr.c
+ */
+extern int cachefiles_check_object_type(struct cachefiles_object *object);
+extern int cachefiles_set_object_xattr(struct cachefiles_object *object,
+ struct cachefiles_xattr *auxdata);
+extern int cachefiles_update_object_xattr(struct cachefiles_object *object,
+ struct cachefiles_xattr *auxdata);
+extern int cachefiles_check_object_xattr(struct cachefiles_object *object,
+ struct cachefiles_xattr *auxdata);
+extern int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+ struct dentry *dentry);
+
+
+/*
+ * error handling
+ */
+#define kerror(FMT, ...) printk(KERN_ERR "CacheFiles: "FMT"\n", ##__VA_ARGS__)
+
+#define cachefiles_io_error(___cache, FMT, ...) \
+do { \
+ kerror("I/O Error: " FMT, ##__VA_ARGS__); \
+ fscache_io_error(&(___cache)->cache); \
+ set_bit(CACHEFILES_DEAD, &(___cache)->flags); \
+} while (0)
+
+#define cachefiles_io_error_obj(object, FMT, ...) \
+do { \
+ struct cachefiles_cache *___cache; \
+ \
+ ___cache = container_of((object)->fscache.cache, \
+ struct cachefiles_cache, cache); \
+ cachefiles_io_error(___cache, FMT, ##__VA_ARGS__); \
+} while (0)
+
+
+/*
+ * debug tracing
+ */
+#define dbgprintk(FMT, ...) \
+ printk(KERN_DEBUG "[%-6.6s] "FMT"\n", current->comm, ##__VA_ARGS__)
+
+/* make sure we maintain the format strings, even when debugging is disabled */
+static inline void _dbprintk(const char *fmt, ...)
+ __attribute__((format(printf, 1, 2)));
+static inline void _dbprintk(const char *fmt, ...)
+{
+}
+
+#define kenter(FMT, ...) dbgprintk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define kleave(FMT, ...) dbgprintk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) dbgprintk(FMT, ##__VA_ARGS__)
+
+
+#if defined(__KDEBUG)
+#define _enter(FMT, ...) kenter(FMT, ##__VA_ARGS__)
+#define _leave(FMT, ...) kleave(FMT, ##__VA_ARGS__)
+#define _debug(FMT, ...) kdebug(FMT, ##__VA_ARGS__)
+
+#elif defined(CONFIG_CACHEFILES_DEBUG)
+#define _enter(FMT, ...) \
+do { \
+ if (cachefiles_debug & CACHEFILES_DEBUG_KENTER) \
+ kenter(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#define _leave(FMT, ...) \
+do { \
+ if (cachefiles_debug & CACHEFILES_DEBUG_KLEAVE) \
+ kleave(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#define _debug(FMT, ...) \
+do { \
+ if (cachefiles_debug & CACHEFILES_DEBUG_KDEBUG) \
+ kdebug(FMT, ##__VA_ARGS__); \
+} while (0)
+
+#else
+#define _enter(FMT, ...) _dbprintk("==> %s("FMT")", __func__, ##__VA_ARGS__)
+#define _leave(FMT, ...) _dbprintk("<== %s()"FMT"", __func__, ##__VA_ARGS__)
+#define _debug(FMT, ...) _dbprintk(FMT, ##__VA_ARGS__)
+#endif
+
+#if 1 /* defined(__KDEBUGALL) */
+
+#define ASSERT(X) \
+do { \
+ if (unlikely(!(X))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "CacheFiles: Assertion failed\n"); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTCMP(X, OP, Y) \
+do { \
+ if (unlikely(!((X) OP (Y)))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "CacheFiles: Assertion failed\n"); \
+ printk(KERN_ERR "%lx " #OP " %lx is false\n", \
+ (unsigned long)(X), (unsigned long)(Y)); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTIF(C, X) \
+do { \
+ if (unlikely((C) && !(X))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "CacheFiles: Assertion failed\n"); \
+ BUG(); \
+ } \
+} while (0)
+
+#define ASSERTIFCMP(C, X, OP, Y) \
+do { \
+ if (unlikely((C) && !((X) OP (Y)))) { \
+ printk(KERN_ERR "\n"); \
+ printk(KERN_ERR "CacheFiles: Assertion failed\n"); \
+ printk(KERN_ERR "%lx " #OP " %lx is false\n", \
+ (unsigned long)(X), (unsigned long)(Y)); \
+ BUG(); \
+ } \
+} while (0)
+
+#else
+
+#define ASSERT(X) do {} while (0)
+#define ASSERTCMP(X, OP, Y) do {} while (0)
+#define ASSERTIF(C, X) do {} while (0)
+#define ASSERTIFCMP(C, X, OP, Y) do {} while (0)
+
+#endif
diff --git a/fs/cachefiles/key.c b/fs/cachefiles/key.c
new file mode 100644
index 0000000..81b8b2b
--- /dev/null
+++ b/fs/cachefiles/key.c
@@ -0,0 +1,159 @@
+/* Key to pathname encoder
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/slab.h>
+#include "internal.h"
+
+static const char cachefiles_charmap[64] =
+ "0123456789" /* 0 - 9 */
+ "abcdefghijklmnopqrstuvwxyz" /* 10 - 35 */
+ "ABCDEFGHIJKLMNOPQRSTUVWXYZ" /* 36 - 61 */
+ "_-" /* 62 - 63 */
+ ;
+
+static const char cachefiles_filecharmap[256] = {
+ /* we skip space and tab and control chars */
+ [33 ... 46] = 1, /* '!' -> '.' */
+ /* we skip '/' as it's significant to pathwalk */
+ [48 ... 127] = 1, /* '0' -> '~' */
+};
+
+/*
+ * turn the raw key into something cooked
+ * - the raw key should include the length in the two bytes at the front
+ * - the key may be up to 514 bytes in length (including the length word)
+ * - "base64" encode the strange keys, mapping 3 bytes of raw to four of
+ * cooked
+ * - need to cut the cooked key into 252 char lengths (189 raw bytes)
+ */
+char *cachefiles_cook_key(const u8 *raw, int keylen, uint8_t type)
+{
+ unsigned char csum, ch;
+ unsigned int acc;
+ char *key;
+ int loop, len, max, seg, mark, print;
+
+ _enter(",%d", keylen);
+
+ BUG_ON(keylen < 2 || keylen > 514);
+
+ csum = raw[0] + raw[1];
+ print = 1;
+ for (loop = 2; loop < keylen; loop++) {
+ ch = raw[loop];
+ csum += ch;
+ print &= cachefiles_filecharmap[ch];
+ }
+
+ if (print) {
+ /* if the path is usable ASCII, then we render it directly */
+ max = keylen - 2;
+ max += 2; /* two base64'd length chars on the front */
+ max += 5; /* @checksum/M */
+ max += 3 * 2; /* maximum number of segment dividers (".../M")
+ * is ((514 + 251) / 252) = 3
+ */
+ max += 1; /* NUL on end */
+ } else {
+ /* calculate the maximum length of the cooked key */
+ keylen = (keylen + 2) / 3;
+
+ max = keylen * 4;
+ max += 5; /* @checksum/M */
+ max += 3 * 2; /* maximum number of segment dividers (".../M")
+ * is ((514 + 188) / 189) = 3
+ */
+ max += 1; /* NUL on end */
+ }
+
+ max += 1; /* 2nd NUL on end */
+
+ _debug("max: %d", max);
+
+ key = kmalloc(max, GFP_KERNEL);
+ if (!key)
+ return NULL;
+
+ len = 0;
+
+ /* build the cooked key */
+ sprintf(key, "@%02x%c+", (unsigned) csum, 0);
+ len = 5;
+ mark = len - 1;
+
+ if (print) {
+ acc = *(uint16_t *) raw;
+ raw += 2;
+
+ key[len + 1] = cachefiles_charmap[acc & 63];
+ acc >>= 6;
+ key[len] = cachefiles_charmap[acc & 63];
+ len += 2;
+
+ seg = 250;
+ for (loop = keylen; loop > 0; loop--) {
+ if (seg <= 0) {
+ key[len++] = '\0';
+ mark = len;
+ key[len++] = '+';
+ seg = 252;
+ }
+
+ key[len++] = *raw++;
+ ASSERT(len < max);
+ }
+
+ switch (type) {
+ case FSCACHE_COOKIE_TYPE_INDEX: type = 'I'; break;
+ case FSCACHE_COOKIE_TYPE_DATAFILE: type = 'D'; break;
+ default: type = 'S'; break;
+ }
+ } else {
+ seg = 252;
+ for (loop = keylen; loop > 0; loop--) {
+ if (seg <= 0) {
+ key[len++] = '\0';
+ mark = len;
+ key[len++] = '+';
+ seg = 252;
+ }
+
+ acc = *raw++;
+ acc |= *raw++ << 8;
+ acc |= *raw++ << 16;
+
+ _debug("acc: %06x", acc);
+
+ key[len++] = cachefiles_charmap[acc & 63];
+ acc >>= 6;
+ key[len++] = cachefiles_charmap[acc & 63];
+ acc >>= 6;
+ key[len++] = cachefiles_charmap[acc & 63];
+ acc >>= 6;
+ key[len++] = cachefiles_charmap[acc & 63];
+
+ ASSERT(len < max);
+ }
+
+ switch (type) {
+ case FSCACHE_COOKIE_TYPE_INDEX: type = 'J'; break;
+ case FSCACHE_COOKIE_TYPE_DATAFILE: type = 'E'; break;
+ default: type = 'T'; break;
+ }
+ }
+
+ key[mark] = type;
+ key[len++] = 0;
+ key[len] = 0;
+
+ _leave(" = %p %d", key, len);
+ return key;
+}
diff --git a/fs/cachefiles/main.c b/fs/cachefiles/main.c
new file mode 100644
index 0000000..4bfa8cf
--- /dev/null
+++ b/fs/cachefiles/main.c
@@ -0,0 +1,106 @@
+/* Network filesystem caching backend to use cache files on a premounted
+ * filesystem
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/completion.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/mount.h>
+#include <linux/statfs.h>
+#include <linux/sysctl.h>
+#include <linux/miscdevice.h>
+#include "internal.h"
+
+unsigned cachefiles_debug;
+module_param_named(debug, cachefiles_debug, uint, S_IWUSR | S_IRUGO);
+MODULE_PARM_DESC(cachefiles_debug, "CacheFiles debugging mask");
+
+MODULE_DESCRIPTION("Mounted-filesystem based cache");
+MODULE_AUTHOR("Red Hat, Inc.");
+MODULE_LICENSE("GPL");
+
+struct kmem_cache *cachefiles_object_jar;
+
+static struct miscdevice cachefiles_dev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "cachefiles",
+ .fops = &cachefiles_daemon_fops,
+};
+
+static void cachefiles_object_init_once(void *_object)
+{
+ struct cachefiles_object *object = _object;
+
+ memset(object, 0, sizeof(*object));
+ spin_lock_init(&object->work_lock);
+}
+
+/*
+ * initialise the fs caching module
+ */
+static int __init cachefiles_init(void)
+{
+ int ret;
+
+ ret = misc_register(&cachefiles_dev);
+ if (ret < 0)
+ goto error_dev;
+
+ /* create an object jar */
+ ret = -ENOMEM;
+ cachefiles_object_jar =
+ kmem_cache_create("cachefiles_object_jar",
+ sizeof(struct cachefiles_object),
+ 0,
+ SLAB_HWCACHE_ALIGN,
+ cachefiles_object_init_once);
+ if (!cachefiles_object_jar) {
+ printk(KERN_NOTICE
+ "CacheFiles: Failed to allocate an object jar\n");
+ goto error_object_jar;
+ }
+
+ ret = cachefiles_proc_init();
+ if (ret < 0)
+ goto error_proc;
+
+ printk(KERN_INFO "CacheFiles: Loaded\n");
+ return 0;
+
+error_proc:
+ kmem_cache_destroy(cachefiles_object_jar);
+error_object_jar:
+ misc_deregister(&cachefiles_dev);
+error_dev:
+ kerror("failed to register: %d", ret);
+ return ret;
+}
+
+fs_initcall(cachefiles_init);
+
+/*
+ * clean up on module removal
+ */
+static void __exit cachefiles_exit(void)
+{
+ printk(KERN_INFO "CacheFiles: Unloading\n");
+
+ cachefiles_proc_cleanup();
+ kmem_cache_destroy(cachefiles_object_jar);
+ misc_deregister(&cachefiles_dev);
+}
+
+module_exit(cachefiles_exit);
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
new file mode 100644
index 0000000..83b1988
--- /dev/null
+++ b/fs/cachefiles/namei.c
@@ -0,0 +1,772 @@
+/* CacheFiles path walking and related routines
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include <linux/mount.h>
+#include <linux/namei.h>
+#include <linux/security.h>
+#include "internal.h"
+
+static int cachefiles_wait_bit(void *flags)
+{
+ schedule();
+ return 0;
+}
+
+/*
+ * record the fact that an object is now active
+ */
+static void cachefiles_mark_object_active(struct cachefiles_cache *cache,
+ struct cachefiles_object *object)
+{
+ struct cachefiles_object *xobject;
+ struct rb_node **_p, *_parent = NULL;
+ struct dentry *dentry;
+
+ _enter(",%p", object);
+
+try_again:
+ write_lock(&cache->active_lock);
+
+ if (test_and_set_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags))
+ BUG();
+
+ dentry = object->dentry;
+ _p = &cache->active_nodes.rb_node;
+ while (*_p) {
+ _parent = *_p;
+ xobject = rb_entry(_parent,
+ struct cachefiles_object, active_node);
+
+ ASSERT(xobject != object);
+
+ if (xobject->dentry > dentry)
+ _p = &(*_p)->rb_left;
+ else if (xobject->dentry < dentry)
+ _p = &(*_p)->rb_right;
+ else
+ goto wait_for_old_object;
+ }
+
+ rb_link_node(&object->active_node, _parent, _p);
+ rb_insert_color(&object->active_node, &cache->active_nodes);
+
+ write_unlock(&cache->active_lock);
+ _leave("");
+ return;
+
+ /* an old object from a previous incarnation is hogging the slot - we
+ * need to wait for it to be destroyed */
+wait_for_old_object:
+ if (xobject->fscache.state < FSCACHE_OBJECT_DYING) {
+ printk(KERN_ERR "\n");
+ printk(KERN_ERR "CacheFiles: Error:"
+ " Unexpected object collision\n");
+ printk(KERN_ERR "xobject: OBJ%x\n",
+ xobject->fscache.debug_id);
+ printk(KERN_ERR "xobjstate=%s\n",
+ fscache_object_states[xobject->fscache.state]);
+ printk(KERN_ERR "xobjflags=%lx\n", xobject->fscache.flags);
+ printk(KERN_ERR "xobjevent=%lx [%lx]\n",
+ xobject->fscache.events, xobject->fscache.event_mask);
+ printk(KERN_ERR "xops=%u inp=%u exc=%u\n",
+ xobject->fscache.n_ops, xobject->fscache.n_in_progress,
+ xobject->fscache.n_exclusive);
+ printk(KERN_ERR "xcookie=%p [pr=%p nd=%p fl=%lx]\n",
+ xobject->fscache.cookie,
+ xobject->fscache.cookie->parent,
+ xobject->fscache.cookie->netfs_data,
+ xobject->fscache.cookie->flags);
+ printk(KERN_ERR "xparent=%p\n",
+ xobject->fscache.parent);
+ printk(KERN_ERR "object: OBJ%x\n",
+ object->fscache.debug_id);
+ printk(KERN_ERR "cookie=%p [pr=%p nd=%p fl=%lx]\n",
+ object->fscache.cookie,
+ object->fscache.cookie->parent,
+ object->fscache.cookie->netfs_data,
+ object->fscache.cookie->flags);
+ printk(KERN_ERR "parent=%p\n",
+ object->fscache.parent);
+ BUG();
+ }
+ atomic_inc(&xobject->usage);
+ write_unlock(&cache->active_lock);
+
+ _debug(">>> wait");
+ wait_on_bit(&xobject->flags, CACHEFILES_OBJECT_ACTIVE,
+ cachefiles_wait_bit, TASK_UNINTERRUPTIBLE);
+ _debug("<<< waited");
+
+ cache->cache.ops->put_object(&xobject->fscache);
+ goto try_again;
+}
+
+/*
+ * delete an object representation from the cache
+ * - file backed objects are unlinked
+ * - directory backed objects are stuffed into the graveyard for userspace to
+ * delete
+ * - unlocks the directory mutex
+ */
+static int cachefiles_bury_object(struct cachefiles_cache *cache,
+ struct dentry *dir,
+ struct dentry *rep)
+{
+ struct dentry *grave, *trap;
+ char nbuffer[8 + 8 + 1];
+ int ret;
+
+ _enter(",'%*.*s','%*.*s'",
+ dir->d_name.len, dir->d_name.len, dir->d_name.name,
+ rep->d_name.len, rep->d_name.len, rep->d_name.name);
+
+ /* non-directories can just be unlinked */
+ if (!S_ISDIR(rep->d_inode->i_mode)) {
+ _debug("unlink stale object");
+ ret = vfs_unlink(dir->d_inode, rep);
+
+ mutex_unlock(&dir->d_inode->i_mutex);
+
+ if (ret == -EIO)
+ cachefiles_io_error(cache, "Unlink failed");
+
+ _leave(" = %d", ret);
+ return ret;
+ }
+
+ /* directories have to be moved to the graveyard */
+ _debug("move stale object to graveyard");
+ mutex_unlock(&dir->d_inode->i_mutex);
+
+try_again:
+ /* first step is to make up a grave dentry in the graveyard */
+ sprintf(nbuffer, "%08x%08x",
+ (uint32_t) get_seconds(),
+ (uint32_t) atomic_inc_return(&cache->gravecounter));
+
+ /* do the multiway lock magic */
+ trap = lock_rename(cache->graveyard, dir);
+
+ /* do some checks before getting the grave dentry */
+ if (rep->d_parent != dir) {
+ /* the entry was probably culled when we dropped the parent dir
+ * lock */
+ unlock_rename(cache->graveyard, dir);
+ _leave(" = 0 [culled?]");
+ return 0;
+ }
+
+ if (!S_ISDIR(cache->graveyard->d_inode->i_mode)) {
+ unlock_rename(cache->graveyard, dir);
+ cachefiles_io_error(cache, "Graveyard no longer a directory");
+ return -EIO;
+ }
+
+ if (trap == rep) {
+ unlock_rename(cache->graveyard, dir);
+ cachefiles_io_error(cache, "May not make directory loop");
+ return -EIO;
+ }
+
+ if (d_mountpoint(rep)) {
+ unlock_rename(cache->graveyard, dir);
+ cachefiles_io_error(cache, "Mountpoint in cache");
+ return -EIO;
+ }
+
+ grave = lookup_one_len(nbuffer, cache->graveyard, strlen(nbuffer));
+ if (IS_ERR(grave)) {
+ unlock_rename(cache->graveyard, dir);
+
+ if (PTR_ERR(grave) == -ENOMEM) {
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+ }
+
+ cachefiles_io_error(cache, "Lookup error %ld",
+ PTR_ERR(grave));
+ return -EIO;
+ }
+
+ if (grave->d_inode) {
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ grave = NULL;
+ cond_resched();
+ goto try_again;
+ }
+
+ if (d_mountpoint(grave)) {
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ cachefiles_io_error(cache, "Mountpoint in graveyard");
+ return -EIO;
+ }
+
+ /* target should not be an ancestor of source */
+ if (trap == grave) {
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ cachefiles_io_error(cache, "May not make directory loop");
+ return -EIO;
+ }
+
+ /* attempt the rename */
+ ret = vfs_rename(dir->d_inode, rep, cache->graveyard->d_inode, grave);
+ if (ret != 0 && ret != -ENOMEM)
+ cachefiles_io_error(cache, "Rename failed with error %d", ret);
+
+ unlock_rename(cache->graveyard, dir);
+ dput(grave);
+ _leave(" = 0");
+ return 0;
+}
+
+/*
+ * delete an object representation from the cache
+ */
+int cachefiles_delete_object(struct cachefiles_cache *cache,
+ struct cachefiles_object *object)
+{
+ struct dentry *dir;
+ int ret;
+
+ _enter(",{%p}", object->dentry);
+
+ ASSERT(object->dentry);
+ ASSERT(object->dentry->d_inode);
+ ASSERT(object->dentry->d_parent);
+
+ dir = dget_parent(object->dentry);
+
+ mutex_lock(&dir->d_inode->i_mutex);
+ ret = cachefiles_bury_object(cache, dir, object->dentry);
+
+ dput(dir);
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * walk from the parent object to the child object through the backing
+ * filesystem, creating directories as we go
+ */
+int cachefiles_walk_to_object(struct cachefiles_object *parent,
+ struct cachefiles_object *object,
+ const char *key,
+ struct cachefiles_xattr *auxdata)
+{
+ struct cachefiles_cache *cache;
+ struct dentry *dir, *next = NULL;
+ unsigned long start;
+ const char *name;
+ int ret, nlen;
+
+ _enter("{%p},,%s,", parent->dentry, key);
+
+ cache = container_of(parent->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ ASSERT(parent->dentry);
+ ASSERT(parent->dentry->d_inode);
+
+ if (!(S_ISDIR(parent->dentry->d_inode->i_mode))) {
+ // TODO: convert file to dir
+ _leave("looking up in none directory");
+ return -ENOBUFS;
+ }
+
+ dir = dget(parent->dentry);
+
+advance:
+ /* attempt to transit the first directory component */
+ name = key;
+ nlen = strlen(key);
+
+ /* key ends in a double NUL */
+ key = key + nlen + 1;
+ if (!*key)
+ key = NULL;
+
+lookup_again:
+ /* search the current directory for the element name */
+ _debug("lookup '%s'", name);
+
+ mutex_lock(&dir->d_inode->i_mutex);
+
+ start = jiffies;
+ next = lookup_one_len(name, dir, nlen);
+ cachefiles_hist(cachefiles_lookup_histogram, start);
+ if (IS_ERR(next))
+ goto lookup_error;
+
+ _debug("next -> %p %s", next, next->d_inode ? "positive" : "negative");
+
+ if (!key)
+ object->new = !next->d_inode;
+
+ /* if this element of the path doesn't exist, then the lookup phase
+ * failed, and we can release any readers in the certain knowledge that
+ * there's nothing for them to actually read */
+ if (!next->d_inode)
+ fscache_object_lookup_negative(&object->fscache);
+
+ /* we need to create the object if it's negative */
+ if (key || object->type == FSCACHE_COOKIE_TYPE_INDEX) {
+ /* index objects and intervening tree levels must be subdirs */
+ if (!next->d_inode) {
+ ret = cachefiles_has_space(cache, 1, 0);
+ if (ret < 0)
+ goto create_error;
+
+ start = jiffies;
+ ret = vfs_mkdir(dir->d_inode, next, 0);
+ cachefiles_hist(cachefiles_mkdir_histogram, start);
+ if (ret < 0)
+ goto create_error;
+
+ ASSERT(next->d_inode);
+
+ _debug("mkdir -> %p{%p{ino=%lu}}",
+ next, next->d_inode, next->d_inode->i_ino);
+
+ } else if (!S_ISDIR(next->d_inode->i_mode)) {
+ kerror("inode %lu is not a directory",
+ next->d_inode->i_ino);
+ ret = -ENOBUFS;
+ goto error;
+ }
+
+ } else {
+ /* non-index objects start out life as files */
+ if (!next->d_inode) {
+ ret = cachefiles_has_space(cache, 1, 0);
+ if (ret < 0)
+ goto create_error;
+
+ start = jiffies;
+ ret = vfs_create(dir->d_inode, next, S_IFREG, NULL);
+ cachefiles_hist(cachefiles_create_histogram, start);
+ if (ret < 0)
+ goto create_error;
+
+ ASSERT(next->d_inode);
+
+ _debug("create -> %p{%p{ino=%lu}}",
+ next, next->d_inode, next->d_inode->i_ino);
+
+ } else if (!S_ISDIR(next->d_inode->i_mode) &&
+ !S_ISREG(next->d_inode->i_mode)
+ ) {
+ kerror("inode %lu is not a file or directory",
+ next->d_inode->i_ino);
+ ret = -ENOBUFS;
+ goto error;
+ }
+ }
+
+ /* process the next component */
+ if (key) {
+ _debug("advance");
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(dir);
+ dir = next;
+ next = NULL;
+ goto advance;
+ }
+
+ /* we've found the object we were looking for */
+ object->dentry = next;
+
+ /* if we've found that the terminal object exists, then we need to
+ * check its attributes and delete it if it's out of date */
+ if (!object->new) {
+ _debug("validate '%*.*s'",
+ next->d_name.len, next->d_name.len, next->d_name.name);
+
+ ret = cachefiles_check_object_xattr(object, auxdata);
+ if (ret == -ESTALE) {
+ /* delete the object (the deleter drops the directory
+ * mutex) */
+ object->dentry = NULL;
+
+ ret = cachefiles_bury_object(cache, dir, next);
+ dput(next);
+ next = NULL;
+
+ if (ret < 0)
+ goto delete_error;
+
+ _debug("redo lookup");
+ goto lookup_again;
+ }
+ }
+
+ /* note that we're now using this object */
+ cachefiles_mark_object_active(cache, object);
+
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(dir);
+ dir = NULL;
+
+ _debug("=== OBTAINED_OBJECT ===");
+
+ if (object->new) {
+ /* attach data to a newly constructed terminal object */
+ ret = cachefiles_set_object_xattr(object, auxdata);
+ if (ret < 0)
+ goto check_error;
+ } else {
+ /* always update the atime on an object we've just looked up
+ * (this is used to keep track of culling, and atimes are only
+ * updated by read, write and readdir but not lookup or
+ * open) */
+ touch_atime(cache->mnt, next);
+ }
+
+ /* open a file interface onto a data file */
+ if (object->type != FSCACHE_COOKIE_TYPE_INDEX) {
+ if (S_ISREG(object->dentry->d_inode->i_mode)) {
+ const struct address_space_operations *aops;
+
+ ret = -EPERM;
+ aops = object->dentry->d_inode->i_mapping->a_ops;
+ if (!aops->bmap ||
+ !aops->write_one_page)
+ goto check_error;
+
+ object->backer = object->dentry;
+ } else {
+ BUG(); // TODO: open file in data-class subdir
+ }
+ }
+
+ object->new = 0;
+ fscache_obtained_object(&object->fscache);
+
+ _leave(" = 0 [%lu]", object->dentry->d_inode->i_ino);
+ return 0;
+
+create_error:
+ _debug("create error %d", ret);
+ if (ret == -EIO)
+ cachefiles_io_error(cache, "Create/mkdir failed");
+ goto error;
+
+check_error:
+ _debug("check error %d", ret);
+ write_lock(&cache->active_lock);
+ rb_erase(&object->active_node, &cache->active_nodes);
+ clear_bit(CACHEFILES_OBJECT_ACTIVE, &object->flags);
+ wake_up_bit(&object->flags, CACHEFILES_OBJECT_ACTIVE);
+ write_unlock(&cache->active_lock);
+
+ dput(object->dentry);
+ object->dentry = NULL;
+ goto error_out;
+
+delete_error:
+ _debug("delete error %d", ret);
+ goto error_out2;
+
+lookup_error:
+ _debug("lookup error %ld", PTR_ERR(next));
+ ret = PTR_ERR(next);
+ if (ret == -EIO)
+ cachefiles_io_error(cache, "Lookup failed");
+ next = NULL;
+error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(next);
+error_out2:
+ dput(dir);
+error_out:
+ if (ret == -ENOSPC)
+ ret = -ENOBUFS;
+
+ _leave(" = error %d", -ret);
+ return ret;
+}
+
+/*
+ * get a subdirectory
+ */
+struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
+ struct dentry *dir,
+ const char *dirname)
+{
+ struct dentry *subdir;
+ unsigned long start;
+ int ret;
+
+ _enter(",,%s", dirname);
+
+ /* search the current directory for the element name */
+ mutex_lock(&dir->d_inode->i_mutex);
+
+ start = jiffies;
+ subdir = lookup_one_len(dirname, dir, strlen(dirname));
+ cachefiles_hist(cachefiles_lookup_histogram, start);
+ if (IS_ERR(subdir)) {
+ if (PTR_ERR(subdir) == -ENOMEM)
+ goto nomem_d_alloc;
+ goto lookup_error;
+ }
+
+ _debug("subdir -> %p %s",
+ subdir, subdir->d_inode ? "positive" : "negative");
+
+ /* we need to create the subdir if it doesn't exist yet */
+ if (!subdir->d_inode) {
+ ret = cachefiles_has_space(cache, 1, 0);
+ if (ret < 0)
+ goto mkdir_error;
+
+ _debug("attempt mkdir");
+
+ ret = vfs_mkdir(dir->d_inode, subdir, 0700);
+ if (ret < 0)
+ goto mkdir_error;
+
+ ASSERT(subdir->d_inode);
+
+ _debug("mkdir -> %p{%p{ino=%lu}}",
+ subdir,
+ subdir->d_inode,
+ subdir->d_inode->i_ino);
+ }
+
+ mutex_unlock(&dir->d_inode->i_mutex);
+
+ /* we need to make sure the subdir is a directory */
+ ASSERT(subdir->d_inode);
+
+ if (!S_ISDIR(subdir->d_inode->i_mode)) {
+ kerror("%s is not a directory", dirname);
+ ret = -EIO;
+ goto check_error;
+ }
+
+ ret = -EPERM;
+ if (!subdir->d_inode->i_op ||
+ !subdir->d_inode->i_op->setxattr ||
+ !subdir->d_inode->i_op->getxattr ||
+ !subdir->d_inode->i_op->lookup ||
+ !subdir->d_inode->i_op->mkdir ||
+ !subdir->d_inode->i_op->create ||
+ !subdir->d_inode->i_op->rename ||
+ !subdir->d_inode->i_op->rmdir ||
+ !subdir->d_inode->i_op->unlink)
+ goto check_error;
+
+ _leave(" = [%lu]", subdir->d_inode->i_ino);
+ return subdir;
+
+check_error:
+ dput(subdir);
+ _leave(" = %d [check]", ret);
+ return ERR_PTR(ret);
+
+mkdir_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(subdir);
+ kerror("mkdir %s failed with error %d", dirname, ret);
+ return ERR_PTR(ret);
+
+lookup_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ ret = PTR_ERR(subdir);
+ kerror("Lookup %s failed with error %d", dirname, ret);
+ return ERR_PTR(ret);
+
+nomem_d_alloc:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ _leave(" = -ENOMEM");
+ return ERR_PTR(-ENOMEM);
+}
+
+/*
+ * find out if an object is in use or not
+ * - if finds object and it's not in use:
+ * - returns a pointer to the object and a reference on it
+ * - returns with the directory locked
+ */
+static struct dentry *cachefiles_check_active(struct cachefiles_cache *cache,
+ struct dentry *dir,
+ char *filename)
+{
+ struct cachefiles_object *object;
+ struct rb_node *_n;
+ struct dentry *victim;
+ unsigned long start;
+ int ret;
+
+ //_enter(",%*.*s/,%s",
+ // dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+ /* look up the victim */
+ mutex_lock_nested(&dir->d_inode->i_mutex, 1);
+
+ start = jiffies;
+ victim = lookup_one_len(filename, dir, strlen(filename));
+ cachefiles_hist(cachefiles_lookup_histogram, start);
+ if (IS_ERR(victim))
+ goto lookup_error;
+
+ //_debug("victim -> %p %s",
+ // victim, victim->d_inode ? "positive" : "negative");
+
+ /* if the object is no longer there then we probably retired the object
+ * at the netfs's request whilst the cull was in progress
+ */
+ if (!victim->d_inode) {
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(victim);
+ _leave(" = -ENOENT [absent]");
+ return ERR_PTR(-ENOENT);
+ }
+
+ /* check to see if we're using this object */
+ read_lock(&cache->active_lock);
+
+ _n = cache->active_nodes.rb_node;
+
+ while (_n) {
+ object = rb_entry(_n, struct cachefiles_object, active_node);
+
+ if (object->dentry > victim)
+ _n = _n->rb_left;
+ else if (object->dentry < victim)
+ _n = _n->rb_right;
+ else
+ goto object_in_use;
+ }
+
+ read_unlock(&cache->active_lock);
+
+ //_leave(" = %p", victim);
+ return victim;
+
+object_in_use:
+ read_unlock(&cache->active_lock);
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(victim);
+ //_leave(" = -EBUSY [in use]");
+ return ERR_PTR(-EBUSY);
+
+lookup_error:
+ mutex_unlock(&dir->d_inode->i_mutex);
+ ret = PTR_ERR(victim);
+ if (ret == -ENOENT) {
+ /* file or dir now absent - probably retired by netfs */
+ _leave(" = -ESTALE [absent]");
+ return ERR_PTR(-ESTALE);
+ }
+
+ if (ret == -EIO) {
+ cachefiles_io_error(cache, "Lookup failed");
+ } else if (ret != -ENOMEM) {
+ kerror("Internal error: %d", ret);
+ ret = -EIO;
+ }
+
+ _leave(" = %d", ret);
+ return ERR_PTR(ret);
+}
+
+/*
+ * cull an object if it's not in use
+ * - called only by cache manager daemon
+ */
+int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir,
+ char *filename)
+{
+ struct dentry *victim;
+ int ret;
+
+ _enter(",%*.*s/,%s",
+ dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+ victim = cachefiles_check_active(cache, dir, filename);
+ if (IS_ERR(victim))
+ return PTR_ERR(victim);
+
+ _debug("victim -> %p %s",
+ victim, victim->d_inode ? "positive" : "negative");
+
+ /* okay... the victim is not being used so we can cull it
+ * - start by marking it as stale
+ */
+ _debug("victim is cullable");
+
+ ret = cachefiles_remove_object_xattr(cache, victim);
+ if (ret < 0)
+ goto error_unlock;
+
+ /* actually remove the victim (drops the dir mutex) */
+ _debug("bury");
+
+ ret = cachefiles_bury_object(cache, dir, victim);
+ if (ret < 0)
+ goto error;
+
+ dput(victim);
+ _leave(" = 0");
+ return 0;
+
+error_unlock:
+ mutex_unlock(&dir->d_inode->i_mutex);
+error:
+ dput(victim);
+ if (ret == -ENOENT) {
+ /* file or dir now absent - probably retired by netfs */
+ _leave(" = -ESTALE [absent]");
+ return -ESTALE;
+ }
+
+ if (ret != -ENOMEM) {
+ kerror("Internal error: %d", ret);
+ ret = -EIO;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * find out if an object is in use or not
+ * - called only by cache manager daemon
+ * - returns -EBUSY or 0 to indicate whether an object is in use or not
+ */
+int cachefiles_check_in_use(struct cachefiles_cache *cache, struct dentry *dir,
+ char *filename)
+{
+ struct dentry *victim;
+
+ //_enter(",%*.*s/,%s",
+ // dir->d_name.len, dir->d_name.len, dir->d_name.name, filename);
+
+ victim = cachefiles_check_active(cache, dir, filename);
+ if (IS_ERR(victim))
+ return PTR_ERR(victim);
+
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(victim);
+ //_leave(" = 0");
+ return 0;
+}
diff --git a/fs/cachefiles/proc.c b/fs/cachefiles/proc.c
new file mode 100644
index 0000000..eccd339
--- /dev/null
+++ b/fs/cachefiles/proc.c
@@ -0,0 +1,134 @@
+/* CacheFiles statistics
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include "internal.h"
+
+atomic_t cachefiles_lookup_histogram[HZ];
+atomic_t cachefiles_mkdir_histogram[HZ];
+atomic_t cachefiles_create_histogram[HZ];
+
+/*
+ * display the latency histogram
+ */
+static int cachefiles_histogram_show(struct seq_file *m, void *v)
+{
+ unsigned long index;
+ unsigned x, y, z, t;
+
+ switch ((unsigned long) v) {
+ case 1:
+ seq_puts(m, "JIFS SECS LOOKUPS MKDIRS CREATES\n");
+ return 0;
+ case 2:
+ seq_puts(m, "===== ===== ========= ========= =========\n");
+ return 0;
+ default:
+ index = (unsigned long) v - 3;
+ x = atomic_read(&cachefiles_lookup_histogram[index]);
+ y = atomic_read(&cachefiles_mkdir_histogram[index]);
+ z = atomic_read(&cachefiles_create_histogram[index]);
+ if (x == 0 && y == 0 && z == 0)
+ return 0;
+
+ t = (index * 1000) / HZ;
+
+ seq_printf(m, "%4lu 0.%03u %9u %9u %9u\n", index, t, x, y, z);
+ return 0;
+ }
+}
+
+/*
+ * set up the iterator to start reading from the first line
+ */
+static void *cachefiles_histogram_start(struct seq_file *m, loff_t *_pos)
+{
+ if ((unsigned long long)*_pos >= HZ + 2)
+ return NULL;
+ if (*_pos == 0)
+ *_pos = 1;
+ return (void *)(unsigned long) *_pos;
+}
+
+/*
+ * move to the next line
+ */
+static void *cachefiles_histogram_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ (*pos)++;
+ return (unsigned long long)*pos > HZ + 2 ?
+ NULL : (void *)(unsigned long) *pos;
+}
+
+/*
+ * clean up after reading
+ */
+static void cachefiles_histogram_stop(struct seq_file *m, void *v)
+{
+}
+
+static const struct seq_operations cachefiles_histogram_ops = {
+ .start = cachefiles_histogram_start,
+ .stop = cachefiles_histogram_stop,
+ .next = cachefiles_histogram_next,
+ .show = cachefiles_histogram_show,
+};
+
+/*
+ * open "/proc/fs/cachefiles/XXX" which provide statistics summaries
+ */
+static int cachefiles_histogram_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &cachefiles_histogram_ops);
+}
+
+static const struct file_operations cachefiles_histogram_fops = {
+ .owner = THIS_MODULE,
+ .open = cachefiles_histogram_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/*
+ * initialise the /proc/fs/cachefiles/ directory
+ */
+int __init cachefiles_proc_init(void)
+{
+ _enter("");
+
+ if (!proc_mkdir("fs/cachefiles", NULL))
+ goto error_dir;
+
+ if (!proc_create("fs/cachefiles/histogram", S_IFREG | 0444, NULL,
+ &cachefiles_histogram_fops))
+ goto error_histogram;
+
+ _leave(" = 0");
+ return 0;
+
+error_histogram:
+ remove_proc_entry("fs/cachefiles", NULL);
+error_dir:
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+}
+
+/*
+ * clean up the /proc/fs/cachefiles/ directory
+ */
+void cachefiles_proc_cleanup(void)
+{
+ remove_proc_entry("fs/cachefiles/histogram", NULL);
+ remove_proc_entry("fs/cachefiles", NULL);
+}
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
new file mode 100644
index 0000000..84bb8b7
--- /dev/null
+++ b/fs/cachefiles/rdwr.c
@@ -0,0 +1,853 @@
+/* Storage object read/write
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include "internal.h"
+
+/*
+ * detect wake up events generated by the unlocking of pages in which we're
+ * interested
+ * - we use this to detect read completion of backing pages
+ * - the caller holds the waitqueue lock
+ */
+static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
+ int sync, void *_key)
+{
+ struct cachefiles_one_read *monitor =
+ container_of(wait, struct cachefiles_one_read, monitor);
+ struct cachefiles_object *object;
+ struct wait_bit_key *key = _key;
+ struct page *page = wait->private;
+
+ ASSERT(key);
+
+ _enter("{%lu},%u,%d,{%p,%u}",
+ monitor->netfs_page->index, mode, sync,
+ key->flags, key->bit_nr);
+
+ if (key->flags != &page->flags ||
+ key->bit_nr != PG_locked)
+ return 0;
+
+ _debug("--- monitor %p %lx ---", page, page->flags);
+
+ if (!PageUptodate(page) && !PageError(page))
+ dump_stack();
+
+ /* remove from the waitqueue */
+ list_del(&wait->task_list);
+
+ /* move onto the action list and queue for FS-Cache thread pool */
+ ASSERT(monitor->op);
+
+ object = container_of(monitor->op->op.object,
+ struct cachefiles_object, fscache);
+
+ spin_lock(&object->work_lock);
+ list_add_tail(&monitor->op_link, &monitor->op->to_do);
+ spin_unlock(&object->work_lock);
+
+ fscache_enqueue_retrieval(monitor->op);
+ return 0;
+}
+
+/*
+ * copy data from backing pages to netfs pages to complete a read operation
+ * - driven by FS-Cache's thread pool
+ */
+static void cachefiles_read_copier(struct fscache_operation *_op)
+{
+ struct cachefiles_one_read *monitor;
+ struct cachefiles_object *object;
+ struct fscache_retrieval *op;
+ struct pagevec pagevec;
+ int error, max;
+
+ op = container_of(_op, struct fscache_retrieval, op);
+ object = container_of(op->op.object,
+ struct cachefiles_object, fscache);
+
+ _enter("{ino=%lu}", object->backer->d_inode->i_ino);
+
+ pagevec_init(&pagevec, 0);
+
+ max = 8;
+ spin_lock_irq(&object->work_lock);
+
+ while (!list_empty(&op->to_do)) {
+ monitor = list_entry(op->to_do.next,
+ struct cachefiles_one_read, op_link);
+ list_del(&monitor->op_link);
+
+ spin_unlock_irq(&object->work_lock);
+
+ _debug("- copy {%lu}", monitor->back_page->index);
+
+ error = -EIO;
+ if (PageUptodate(monitor->back_page)) {
+ copy_highpage(monitor->netfs_page, monitor->back_page);
+
+ pagevec_add(&pagevec, monitor->netfs_page);
+ fscache_mark_pages_cached(monitor->op, &pagevec);
+ error = 0;
+ }
+
+ if (error)
+ cachefiles_io_error_obj(
+ object,
+ "Readpage failed on backing file %lx",
+ (unsigned long) monitor->back_page->flags);
+
+ page_cache_release(monitor->back_page);
+
+ fscache_end_io(op, monitor->netfs_page, error);
+ page_cache_release(monitor->netfs_page);
+ fscache_put_retrieval(op);
+ kfree(monitor);
+
+ /* let the thread pool have some air occasionally */
+ max--;
+ if (max < 0 || need_resched()) {
+ if (!list_empty(&op->to_do))
+ fscache_enqueue_retrieval(op);
+ _leave(" [maxed out]");
+ return;
+ }
+
+ spin_lock_irq(&object->work_lock);
+ }
+
+ spin_unlock_irq(&object->work_lock);
+ _leave("");
+}
+
+/*
+ * read the corresponding page to the given set from the backing file
+ * - an uncertain page is simply discarded, to be tried again another time
+ */
+static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
+ struct fscache_retrieval *op,
+ struct page *netpage,
+ struct pagevec *pagevec)
+{
+ struct cachefiles_one_read *monitor;
+ struct address_space *bmapping;
+ struct page *newpage, *backpage;
+ int ret;
+
+ _enter("");
+
+ pagevec_reinit(pagevec);
+
+ _debug("read back %p{%lu,%d}",
+ netpage, netpage->index, page_count(netpage));
+
+ monitor = kzalloc(sizeof(*monitor), GFP_KERNEL);
+ if (!monitor)
+ goto nomem;
+
+ monitor->netfs_page = netpage;
+ monitor->op = fscache_get_retrieval(op);
+
+ init_waitqueue_func_entry(&monitor->monitor, cachefiles_read_waiter);
+
+ /* attempt to get hold of the backing page */
+ bmapping = object->backer->d_inode->i_mapping;
+ newpage = NULL;
+
+ for (;;) {
+ backpage = find_get_page(bmapping, netpage->index);
+ if (backpage)
+ goto backing_page_already_present;
+
+ if (!newpage) {
+ newpage = page_cache_alloc_cold(bmapping);
+ if (!newpage)
+ goto nomem_monitor;
+ }
+
+ ret = add_to_page_cache(newpage, bmapping,
+ netpage->index, GFP_KERNEL);
+ if (ret == 0)
+ goto installed_new_backing_page;
+ if (ret != -EEXIST)
+ goto nomem_page;
+ }
+
+ /* we've installed a new backing page, so now we need to add it
+ * to the LRU list and start it reading */
+installed_new_backing_page:
+ _debug("- new %p", newpage);
+
+ backpage = newpage;
+ newpage = NULL;
+
+ page_cache_get(backpage);
+ pagevec_add(pagevec, backpage);
+ __pagevec_lru_add_file(pagevec);
+
+read_backing_page:
+ ret = bmapping->a_ops->readpage(NULL, backpage);
+ if (ret < 0)
+ goto read_error;
+
+ /* set the monitor to transfer the data across */
+monitor_backing_page:
+ _debug("- monitor add");
+
+ /* install the monitor */
+ page_cache_get(monitor->netfs_page);
+ page_cache_get(backpage);
+ monitor->back_page = backpage;
+ monitor->monitor.private = backpage;
+ add_page_wait_queue(backpage, &monitor->monitor);
+ monitor = NULL;
+
+ /* but the page may have been read before the monitor was installed, so
+ * the monitor may miss the event - so we have to ensure that we do get
+ * one in such a case */
+ if (trylock_page(backpage)) {
+ _debug("jumpstart %p {%lx}", backpage, backpage->flags);
+ unlock_page(backpage);
+ }
+ goto success;
+
+ /* if the backing page is already present, it can be in one of
+ * three states: read in progress, read failed or read okay */
+backing_page_already_present:
+ _debug("- present");
+
+ if (newpage) {
+ page_cache_release(newpage);
+ newpage = NULL;
+ }
+
+ if (PageError(backpage))
+ goto io_error;
+
+ if (PageUptodate(backpage))
+ goto backing_page_already_uptodate;
+
+ if (!trylock_page(backpage))
+ goto monitor_backing_page;
+ _debug("read %p {%lx}", backpage, backpage->flags);
+ goto read_backing_page;
+
+ /* the backing page is already up to date, attach the netfs
+ * page to the pagecache and LRU and copy the data across */
+backing_page_already_uptodate:
+ _debug("- uptodate");
+
+ pagevec_add(pagevec, netpage);
+ fscache_mark_pages_cached(op, pagevec);
+
+ copy_highpage(netpage, backpage);
+ fscache_end_io(op, netpage, 0);
+
+success:
+ _debug("success");
+ ret = 0;
+
+out:
+ if (backpage)
+ page_cache_release(backpage);
+ if (monitor) {
+ fscache_put_retrieval(monitor->op);
+ kfree(monitor);
+ }
+ _leave(" = %d", ret);
+ return ret;
+
+read_error:
+ _debug("read error %d", ret);
+ if (ret == -ENOMEM)
+ goto out;
+io_error:
+ cachefiles_io_error_obj(object, "Page read error on backing file");
+ ret = -ENOBUFS;
+ goto out;
+
+nomem_page:
+ page_cache_release(newpage);
+nomem_monitor:
+ fscache_put_retrieval(monitor->op);
+ kfree(monitor);
+nomem:
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+}
+
+/*
+ * read a page from the cache or allocate a block in which to store it
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if ran out of memory
+ * - returns -ENOBUFS if no buffers can be made available
+ * - returns -ENOBUFS if page is beyond EOF
+ * - if the page is backed by a block in the cache:
+ * - a read will be started which will call the callback on completion
+ * - 0 will be returned
+ * - else if the page is unbacked:
+ * - the metadata will be retained
+ * - -ENODATA will be returned
+ */
+int cachefiles_read_or_alloc_page(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ struct pagevec pagevec;
+ struct inode *inode;
+ sector_t block0, block;
+ unsigned shift;
+ int ret;
+
+ object = container_of(op->op.object,
+ struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ _enter("{%p},{%lx},,,", object, page->index);
+
+ if (!object->backer)
+ return -ENOBUFS;
+
+ inode = object->backer->d_inode;
+ ASSERT(S_ISREG(inode->i_mode));
+ ASSERT(inode->i_mapping->a_ops->bmap);
+ ASSERT(inode->i_mapping->a_ops->readpages);
+
+ /* calculate the shift required to use bmap */
+ if (inode->i_sb->s_blocksize > PAGE_SIZE)
+ return -ENOBUFS;
+
+ shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ op->op.flags = FSCACHE_OP_FAST;
+ op->op.processor = cachefiles_read_copier;
+
+ pagevec_init(&pagevec, 0);
+
+ /* we assume the absence or presence of the first block is a good
+ * enough indication for the page as a whole
+ * - TODO: don't use bmap() for this as it is _not_ actually good
+ * enough for this as it doesn't indicate errors, but it's all we've
+ * got for the moment
+ */
+ block0 = page->index;
+ block0 <<= shift;
+
+ block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0);
+ _debug("%llx -> %llx",
+ (unsigned long long) block0,
+ (unsigned long long) block);
+
+ if (block) {
+ /* submit the apparently valid page to the backing fs to be
+ * read from disk */
+ ret = cachefiles_read_backing_file_one(object, op, page,
+ &pagevec);
+ } else if (cachefiles_has_space(cache, 0, 1) == 0) {
+ /* there's space in the cache we can use */
+ pagevec_add(&pagevec, page);
+ fscache_mark_pages_cached(op, &pagevec);
+ ret = -ENODATA;
+ } else {
+ ret = -ENOBUFS;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * read the corresponding pages to the given set from the backing file
+ * - any uncertain pages are simply discarded, to be tried again another time
+ */
+static int cachefiles_read_backing_file(struct cachefiles_object *object,
+ struct fscache_retrieval *op,
+ struct list_head *list,
+ struct pagevec *mark_pvec)
+{
+ struct cachefiles_one_read *monitor = NULL;
+ struct address_space *bmapping = object->backer->d_inode->i_mapping;
+ struct pagevec lru_pvec;
+ struct page *newpage = NULL, *netpage, *_n, *backpage = NULL;
+ int ret = 0;
+
+ _enter("");
+
+ pagevec_init(&lru_pvec, 0);
+
+ list_for_each_entry_safe(netpage, _n, list, lru) {
+ list_del(&netpage->lru);
+
+ _debug("read back %p{%lu,%d}",
+ netpage, netpage->index, page_count(netpage));
+
+ if (!monitor) {
+ monitor = kzalloc(sizeof(*monitor), GFP_KERNEL);
+ if (!monitor)
+ goto nomem;
+
+ monitor->op = fscache_get_retrieval(op);
+ init_waitqueue_func_entry(&monitor->monitor,
+ cachefiles_read_waiter);
+ }
+
+ for (;;) {
+ backpage = find_get_page(bmapping, netpage->index);
+ if (backpage)
+ goto backing_page_already_present;
+
+ if (!newpage) {
+ newpage = page_cache_alloc_cold(bmapping);
+ if (!newpage)
+ goto nomem;
+ }
+
+ ret = add_to_page_cache(newpage, bmapping,
+ netpage->index, GFP_KERNEL);
+ if (ret == 0)
+ goto installed_new_backing_page;
+ if (ret != -EEXIST)
+ goto nomem;
+ }
+
+ /* we've installed a new backing page, so now we need to add it
+ * to the LRU list and start it reading */
+ installed_new_backing_page:
+ _debug("- new %p", newpage);
+
+ backpage = newpage;
+ newpage = NULL;
+
+ page_cache_get(backpage);
+ if (!pagevec_add(&lru_pvec, backpage))
+ __pagevec_lru_add_file(&lru_pvec);
+
+ reread_backing_page:
+ ret = bmapping->a_ops->readpage(NULL, backpage);
+ if (ret < 0)
+ goto read_error;
+
+ /* add the netfs page to the pagecache and LRU, and set the
+ * monitor to transfer the data across */
+ monitor_backing_page:
+ _debug("- monitor add");
+
+ ret = add_to_page_cache(netpage, op->mapping, netpage->index,
+ GFP_KERNEL);
+ if (ret < 0) {
+ if (ret == -EEXIST) {
+ page_cache_release(netpage);
+ continue;
+ }
+ goto nomem;
+ }
+
+ page_cache_get(netpage);
+ if (!pagevec_add(&lru_pvec, netpage))
+ __pagevec_lru_add_file(&lru_pvec);
+
+ /* install a monitor */
+ page_cache_get(netpage);
+ monitor->netfs_page = netpage;
+
+ page_cache_get(backpage);
+ monitor->back_page = backpage;
+ monitor->monitor.private = backpage;
+ add_page_wait_queue(backpage, &monitor->monitor);
+ monitor = NULL;
+
+ /* but the page may have been read before the monitor was
+ * installed, so the monitor may miss the event - so we have to
+ * ensure that we do get one in such a case */
+ if (trylock_page(backpage)) {
+ _debug("2unlock %p {%lx}", backpage, backpage->flags);
+ unlock_page(backpage);
+ }
+
+ page_cache_release(backpage);
+ backpage = NULL;
+
+ page_cache_release(netpage);
+ netpage = NULL;
+ continue;
+
+ /* if the backing page is already present, it can be in one of
+ * three states: read in progress, read failed or read okay */
+ backing_page_already_present:
+ _debug("- present %p", backpage);
+
+ if (PageError(backpage))
+ goto io_error;
+
+ if (PageUptodate(backpage))
+ goto backing_page_already_uptodate;
+
+ _debug("- not ready %p{%lx}", backpage, backpage->flags);
+
+ if (!trylock_page(backpage))
+ goto monitor_backing_page;
+
+ if (PageError(backpage)) {
+ _debug("error %lx", backpage->flags);
+ unlock_page(backpage);
+ goto io_error;
+ }
+
+ if (PageUptodate(backpage))
+ goto backing_page_already_uptodate_unlock;
+
+ /* we've locked a page that's neither up to date nor erroneous,
+ * so we need to attempt to read it again */
+ goto reread_backing_page;
+
+ /* the backing page is already up to date, attach the netfs
+ * page to the pagecache and LRU and copy the data across */
+ backing_page_already_uptodate_unlock:
+ _debug("uptodate %lx", backpage->flags);
+ unlock_page(backpage);
+ backing_page_already_uptodate:
+ _debug("- uptodate");
+
+ ret = add_to_page_cache(netpage, op->mapping, netpage->index,
+ GFP_KERNEL);
+ if (ret < 0) {
+ if (ret == -EEXIST) {
+ page_cache_release(netpage);
+ continue;
+ }
+ goto nomem;
+ }
+
+ copy_highpage(netpage, backpage);
+
+ page_cache_release(backpage);
+ backpage = NULL;
+
+ if (!pagevec_add(mark_pvec, netpage))
+ fscache_mark_pages_cached(op, mark_pvec);
+
+ page_cache_get(netpage);
+ if (!pagevec_add(&lru_pvec, netpage))
+ __pagevec_lru_add_file(&lru_pvec);
+
+ fscache_end_io(op, netpage, 0);
+ page_cache_release(netpage);
+ netpage = NULL;
+ continue;
+ }
+
+ netpage = NULL;
+
+ _debug("out");
+
+out:
+ /* tidy up */
+ pagevec_lru_add_file(&lru_pvec);
+
+ if (newpage)
+ page_cache_release(newpage);
+ if (netpage)
+ page_cache_release(netpage);
+ if (backpage)
+ page_cache_release(backpage);
+ if (monitor) {
+ fscache_put_retrieval(op);
+ kfree(monitor);
+ }
+
+ list_for_each_entry_safe(netpage, _n, list, lru) {
+ list_del(&netpage->lru);
+ page_cache_release(netpage);
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+
+nomem:
+ _debug("nomem");
+ ret = -ENOMEM;
+ goto out;
+
+read_error:
+ _debug("read error %d", ret);
+ if (ret == -ENOMEM)
+ goto out;
+io_error:
+ cachefiles_io_error_obj(object, "Page read error on backing file");
+ ret = -ENOBUFS;
+ goto out;
+}
+
+/*
+ * read a list of pages from the cache or allocate blocks in which to store
+ * them
+ */
+int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ struct list_head backpages;
+ struct pagevec pagevec;
+ struct inode *inode;
+ struct page *page, *_n;
+ unsigned shift, nrbackpages;
+ int ret, ret2, space;
+
+ object = container_of(op->op.object,
+ struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ _enter("{OBJ%x,%d},,%d,,",
+ object->fscache.debug_id, atomic_read(&op->op.usage),
+ *nr_pages);
+
+ if (!object->backer)
+ return -ENOBUFS;
+
+ space = 1;
+ if (cachefiles_has_space(cache, 0, *nr_pages) < 0)
+ space = 0;
+
+ inode = object->backer->d_inode;
+ ASSERT(S_ISREG(inode->i_mode));
+ ASSERT(inode->i_mapping->a_ops->bmap);
+ ASSERT(inode->i_mapping->a_ops->readpages);
+
+ /* calculate the shift required to use bmap */
+ if (inode->i_sb->s_blocksize > PAGE_SIZE)
+ return -ENOBUFS;
+
+ shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ pagevec_init(&pagevec, 0);
+
+ op->op.flags = FSCACHE_OP_FAST;
+ op->op.processor = cachefiles_read_copier;
+
+ INIT_LIST_HEAD(&backpages);
+ nrbackpages = 0;
+
+ ret = space ? -ENODATA : -ENOBUFS;
+ list_for_each_entry_safe(page, _n, pages, lru) {
+ sector_t block0, block;
+
+ /* we assume the absence or presence of the first block is a
+ * good enough indication for the page as a whole
+ * - TODO: don't use bmap() for this as it is _not_ actually
+ * good enough for this as it doesn't indicate errors, but
+ * it's all we've got for the moment
+ */
+ block0 = page->index;
+ block0 <<= shift;
+
+ block = inode->i_mapping->a_ops->bmap(inode->i_mapping,
+ block0);
+ _debug("%llx -> %llx",
+ (unsigned long long) block0,
+ (unsigned long long) block);
+
+ if (block) {
+ /* we have data - add it to the list to give to the
+ * backing fs */
+ list_move(&page->lru, &backpages);
+ (*nr_pages)--;
+ nrbackpages++;
+ } else if (space && pagevec_add(&pagevec, page) == 0) {
+ fscache_mark_pages_cached(op, &pagevec);
+ ret = -ENODATA;
+ }
+ }
+
+ if (pagevec_count(&pagevec) > 0)
+ fscache_mark_pages_cached(op, &pagevec);
+
+ if (list_empty(pages))
+ ret = 0;
+
+ /* submit the apparently valid pages to the backing fs to be read from
+ * disk */
+ if (nrbackpages > 0) {
+ ret2 = cachefiles_read_backing_file(object, op, &backpages,
+ &pagevec);
+ if (ret2 == -ENOMEM || ret2 == -EINTR)
+ ret = ret2;
+ }
+
+ if (pagevec_count(&pagevec) > 0)
+ fscache_mark_pages_cached(op, &pagevec);
+
+ _leave(" = %d [nr=%u%s]",
+ ret, *nr_pages, list_empty(pages) ? " empty" : "");
+ return ret;
+}
+
+/*
+ * allocate a block in the cache in which to store a page
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if ran out of memory
+ * - returns -ENOBUFS if no buffers can be made available
+ * - returns -ENOBUFS if page is beyond EOF
+ * - otherwise:
+ * - the metadata will be retained
+ * - 0 will be returned
+ */
+int cachefiles_allocate_page(struct fscache_retrieval *op,
+ struct page *page,
+ gfp_t gfp)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ struct pagevec pagevec;
+ int ret;
+
+ object = container_of(op->op.object,
+ struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ _enter("%p,{%lx},", object, page->index);
+
+ ret = cachefiles_has_space(cache, 0, 1);
+ if (ret == 0) {
+ pagevec_init(&pagevec, 0);
+ pagevec_add(&pagevec, page);
+ fscache_mark_pages_cached(op, &pagevec);
+ } else {
+ ret = -ENOBUFS;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * allocate blocks in the cache in which to store a set of pages
+ * - cache withdrawal is prevented by the caller
+ * - returns -EINTR if interrupted
+ * - returns -ENOMEM if ran out of memory
+ * - returns -ENOBUFS if some buffers couldn't be made available
+ * - returns -ENOBUFS if some pages are beyond EOF
+ * - otherwise:
+ * - -ENODATA will be returned
+ * - metadata will be retained for any page marked
+ */
+int cachefiles_allocate_pages(struct fscache_retrieval *op,
+ struct list_head *pages,
+ unsigned *nr_pages,
+ gfp_t gfp)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+ struct pagevec pagevec;
+ struct page *page;
+ int ret;
+
+ object = container_of(op->op.object,
+ struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ _enter("%p,,,%d,", object, *nr_pages);
+
+ ret = cachefiles_has_space(cache, 0, *nr_pages);
+ if (ret == 0) {
+ pagevec_init(&pagevec, 0);
+
+ list_for_each_entry(page, pages, lru) {
+ if (pagevec_add(&pagevec, page) == 0)
+ fscache_mark_pages_cached(op, &pagevec);
+ }
+
+ if (pagevec_count(&pagevec) > 0)
+ fscache_mark_pages_cached(op, &pagevec);
+ ret = -ENODATA;
+ } else {
+ ret = -ENOBUFS;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * request a page be stored in the cache
+ * - cache withdrawal is prevented by the caller
+ * - this request may be ignored if there's no cache block available, in which
+ * case -ENOBUFS will be returned
+ * - if the op is in progress, 0 will be returned
+ */
+int cachefiles_write_page(struct fscache_storage *op, struct page *page)
+{
+ struct cachefiles_object *object;
+ struct address_space *mapping;
+ int ret;
+
+ ASSERT(op != NULL);
+ ASSERT(page != NULL);
+
+ object = container_of(op->op.object,
+ struct cachefiles_object, fscache);
+
+ _enter("%p,%p{%lx},,,", object, page, page->index);
+
+ if (!object->backer) {
+ _leave(" = -ENOBUFS");
+ return -ENOBUFS;
+ }
+
+ ASSERT(S_ISREG(object->backer->d_inode->i_mode));
+
+ /* copy the page to ext3 and let it store it in its own time */
+ mapping = object->backer->d_inode->i_mapping;
+ ret = -EIO;
+ if (mapping->a_ops->write_one_page) {
+ ret = mapping->a_ops->write_one_page(mapping, page->index,
+ page);
+ _debug("write_one_page -> %d", ret);
+ }
+
+ if (ret < 0) {
+ if (ret == -EIO)
+ cachefiles_io_error_obj(
+ object, "Write page to backing file failed");
+ ret = -ENOBUFS;
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * detach a backing block from a page
+ * - cache withdrawal is prevented by the caller
+ */
+void cachefiles_uncache_page(struct fscache_object *_object, struct page *page)
+{
+ struct cachefiles_object *object;
+ struct cachefiles_cache *cache;
+
+ object = container_of(_object, struct cachefiles_object, fscache);
+ cache = container_of(object->fscache.cache,
+ struct cachefiles_cache, cache);
+
+ _enter("%p,{%lu}", object, page->index);
+
+ spin_unlock(&object->fscache.cookie->lock);
+}
diff --git a/fs/cachefiles/security.c b/fs/cachefiles/security.c
new file mode 100644
index 0000000..b5808cd
--- /dev/null
+++ b/fs/cachefiles/security.c
@@ -0,0 +1,116 @@
+/* CacheFiles security management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/fs.h>
+#include <linux/cred.h>
+#include "internal.h"
+
+/*
+ * determine the security context within which we access the cache from within
+ * the kernel
+ */
+int cachefiles_get_security_ID(struct cachefiles_cache *cache)
+{
+ struct cred *new;
+ int ret;
+
+ _enter("{%s}", cache->secctx);
+
+ new = prepare_kernel_cred(current);
+ if (!new) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ if (cache->secctx) {
+ ret = set_security_override_from_ctx(new, cache->secctx);
+ if (ret < 0) {
+ put_cred(new);
+ printk(KERN_ERR "CacheFiles:"
+ " Security denies permission to nominate"
+ " security context: error %d\n",
+ ret);
+ goto error;
+ }
+ }
+
+ cache->cache_cred = new;
+ ret = 0;
+error:
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * see if mkdir and create can be performed in the root directory
+ */
+static int cachefiles_check_cache_dir(struct cachefiles_cache *cache,
+ struct dentry *root)
+{
+ int ret;
+
+ ret = security_inode_mkdir(root->d_inode, root, 0);
+ if (ret < 0) {
+ printk(KERN_ERR "CacheFiles:"
+ " Security denies permission to make dirs: error %d",
+ ret);
+ return ret;
+ }
+
+ ret = security_inode_create(root->d_inode, root, 0);
+ if (ret < 0)
+ printk(KERN_ERR "CacheFiles:"
+ " Security denies permission to create files: error %d",
+ ret);
+
+ return ret;
+}
+
+/*
+ * check the security details of the on-disk cache
+ * - must be called with security override in force
+ */
+int cachefiles_determine_cache_security(struct cachefiles_cache *cache,
+ struct dentry *root,
+ const struct cred **_saved_cred)
+{
+ struct cred *new;
+ int ret;
+
+ _enter("");
+
+ /* duplicate the cache creds for COW (the override is currently in
+ * force, so we can use prepare_creds() to do this) */
+ new = prepare_creds();
+ if (!new)
+ return -ENOMEM;
+
+ cachefiles_end_secure(cache, *_saved_cred);
+
+ /* use the cache root dir's security context as the basis with
+ * which create files */
+ ret = set_create_files_as(new, root->d_inode);
+ if (ret < 0) {
+ _leave(" = %d [cfa]", ret);
+ return ret;
+ }
+
+ put_cred(cache->cache_cred);
+ cache->cache_cred = new;
+
+ cachefiles_begin_secure(cache, _saved_cred);
+ ret = cachefiles_check_cache_dir(cache, root);
+
+ if (ret == -EOPNOTSUPP)
+ ret = 0;
+ _leave(" = %d", ret);
+ return ret;
+}
diff --git a/fs/cachefiles/xattr.c b/fs/cachefiles/xattr.c
new file mode 100644
index 0000000..f3e7a0b
--- /dev/null
+++ b/fs/cachefiles/xattr.c
@@ -0,0 +1,291 @@
+/* CacheFiles extended attribute management
+ *
+ * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (***@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/quotaops.h>
+#include <linux/xattr.h>
+#include "internal.h"
+
+static const char cachefiles_xattr_cache[] =
+ XATTR_USER_PREFIX "CacheFiles.cache";
+
+/*
+ * check the type label on an object
+ * - done using xattrs
+ */
+int cachefiles_check_object_type(struct cachefiles_object *object)
+{
+ struct dentry *dentry = object->dentry;
+ char type[3], xtype[3];
+ int ret;
+
+ ASSERT(dentry);
+ ASSERT(dentry->d_inode);
+
+ if (!object->fscache.cookie)
+ strcpy(type, "C3");
+ else
+ snprintf(type, 3, "%02x", object->fscache.cookie->def->type);
+
+ _enter("%p{%s}", object, type);
+
+ /* attempt to install a type label directly */
+ ret = vfs_setxattr(dentry, cachefiles_xattr_cache, type, 2,
+ XATTR_CREATE);
+ if (ret == 0) {
+ _debug("SET"); /* we succeeded */
+ goto error;
+ }
+
+ if (ret != -EEXIST) {
+ kerror("Can't set xattr on %*.*s [%lu] (err %d)",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ -ret);
+ goto error;
+ }
+
+ /* read the current type label */
+ ret = vfs_getxattr(dentry, cachefiles_xattr_cache, xtype, 3);
+ if (ret < 0) {
+ if (ret == -ERANGE)
+ goto bad_type_length;
+
+ kerror("Can't read xattr on %*.*s [%lu] (err %d)",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ -ret);
+ goto error;
+ }
+
+ /* check the type is what we're expecting */
+ if (ret != 2)
+ goto bad_type_length;
+
+ if (xtype[0] != type[0] || xtype[1] != type[1])
+ goto bad_type;
+
+ ret = 0;
+
+error:
+ _leave(" = %d", ret);
+ return ret;
+
+bad_type_length:
+ kerror("Cache object %lu type xattr length incorrect",
+ dentry->d_inode->i_ino);
+ ret = -EIO;
+ goto error;
+
+bad_type:
+ xtype[2] = 0;
+ kerror("Cache object %*.*s [%lu] type %s not %s",
+ dentry->d_name.len, dentry->d_name.len,
+ dentry->d_name.name, dentry->d_inode->i_ino,
+ xtype, type);
+ ret = -EIO;
+ goto error;
+}
+
+/*
+ * set the state xattr on a cache file
+ */
+int cachefiles_set_object_xattr(struct cachefiles_object *object,
+ struct cachefiles_xattr *auxdata)
+{
+ struct dentry *dentry = object->dentry;
+ int ret;
+
+ ASSERT(object->fscache.cookie);
+ ASSERT(dentry);
+
+ _enter("%p,#%d", object, auxdata->len);
+
+ /* attempt to install the cache metadata directly */
+ _debug("SET %s #%u", object->fscache.cookie->def->name, auxdata->len);
+
+ ret = vfs_setxattr(dentry, cachefiles_xattr_cache,
+ &auxdata->type, auxdata->len,
+ XATTR_CREATE);
+ if (ret < 0 && ret != -ENOMEM)
+ cachefiles_io_error_obj(
+ object,
+ "Failed to set xattr with error %d", ret);
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * update the state xattr on a cache file
+ */
+int cachefiles_update_object_xattr(struct cachefiles_object *object,
+ struct cachefiles_xattr *auxdata)
+{
+ struct dentry *dentry = object->dentry;
+ int ret;
+
+ ASSERT(object->fscache.cookie);
+ ASSERT(dentry);
+
+ _enter("%p,#%d", object, auxdata->len);
+
+ /* attempt to install the cache metadata directly */
+ _debug("SET %s #%u", object->fscache.cookie->def->name, auxdata->len);
+
+ ret = vfs_setxattr(dentry, cachefiles_xattr_cache,
+ &auxdata->type, auxdata->len,
+ XATTR_REPLACE);
+ if (ret < 0 && ret != -ENOMEM)
+ cachefiles_io_error_obj(
+ object,
+ "Failed to update xattr with error %d", ret);
+
+ _leave(" = %d", ret);
+ return ret;
+}
+
+/*
+ * check the state xattr on a cache file
+ * - return -ESTALE if the object should be deleted
+ */
+int cachefiles_check_object_xattr(struct cachefiles_object *object,
+ struct cachefiles_xattr *auxdata)
+{
+ struct cachefiles_xattr *auxbuf;
+ struct dentry *dentry = object->dentry;
+ int ret;
+
+ _enter("%p,#%d", object, auxdata->len);
+
+ ASSERT(dentry);
+ ASSERT(dentry->d_inode);
+
+ auxbuf = kmalloc(sizeof(struct cachefiles_xattr) + 512, GFP_KERNEL);
+ if (!auxbuf) {
+ _leave(" = -ENOMEM");
+ return -ENOMEM;
+ }
+
+ /* read the current type label */
+ ret = vfs_getxattr(dentry, cachefiles_xattr_cache,
+ &auxbuf->type, 512 + 1);
+ if (ret < 0) {
+ if (ret == -ENODATA)
+ goto stale; /* no attribute - power went off
+ * mid-cull? */
+
+ if (ret == -ERANGE)
+ goto bad_type_length;
+
+ cachefiles_io_error_obj(object,
+ "Can't read xattr on %lu (err %d)",
+ dentry->d_inode->i_ino, -ret);
+ goto error;
+ }
+
+ /* check the on-disk object */
+ if (ret < 1)
+ goto bad_type_length;
+
+ if (auxbuf->type != auxdata->type)
+ goto stale;
+
+ auxbuf->len = ret;
+
+ /* consult the netfs */
+ if (object->fscache.cookie->def->check_aux) {
+ enum fscache_checkaux result;
+ unsigned int dlen;
+
+ dlen = auxbuf->len - 1;
+
+ _debug("checkaux %s #%u",
+ object->fscache.cookie->def->name, dlen);
+
+ result = fscache_check_aux(&object->fscache,
+ &auxbuf->data, dlen);
+
+ switch (result) {
+ /* entry okay as is */
+ case FSCACHE_CHECKAUX_OKAY:
+ goto okay;
+
+ /* entry requires update */
+ case FSCACHE_CHECKAUX_NEEDS_UPDATE:
+ break;
+
+ /* entry requires deletion */
+ case FSCACHE_CHECKAUX_OBSOLETE:
+ goto stale;
+
+ default:
+ BUG();
+ }
+
+ /* update the current label */
+ ret = vfs_setxattr(dentry, cachefiles_xattr_cache,
+ &auxdata->type, auxdata->len,
+ XATTR_REPLACE);
+ if (ret < 0) {
+ cachefiles_io_error_obj(object,
+ "Can't update xattr on %lu"
+ " (error %d)",
+ dentry->d_inode->i_ino, -ret);
+ goto error;
+ }
+ }
+
+okay:
+ ret = 0;
+
+error:
+ kfree(auxbuf);
+ _leave(" = %d", ret);
+ return ret;
+
+bad_type_length:
+ kerror("Cache object %lu xattr length incorrect",
+ dentry->d_inode->i_ino);
+ ret = -EIO;
+ goto error;
+
+stale:
+ ret = -ESTALE;
+ goto error;
+}
+
+/*
+ * remove the object's xattr to mark it stale
+ */
+int cachefiles_remove_object_xattr(struct cachefiles_cache *cache,
+ struct dentry *dentry)
+{
+ int ret;
+
+ ret = vfs_removexattr(dentry, cachefiles_xattr_cache);
+ if (ret < 0) {
+ if (ret == -ENOENT || ret == -ENODATA)
+ ret = 0;
+ else if (ret != -ENOMEM)
+ cachefiles_io_error(cache,
+ "Can't remove xattr from %lu"
+ " (error %d)",
+ dentry->d_inode->i_ino, -ret);
+ }
+
+ _leave(" = %d", ret);
+ return ret;
+}
Loading...