aboutsummaryrefslogtreecommitdiff
path: root/Documentation/filesystems
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--Documentation/filesystems/files.rst53
-rw-r--r--Documentation/filesystems/fscrypt.rst121
-rw-r--r--Documentation/filesystems/nfs/exporting.rst7
-rw-r--r--Documentation/filesystems/porting.rst7
4 files changed, 126 insertions, 62 deletions
diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index bcf84459917f..9e38e4c221ca 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -62,7 +62,7 @@ the fdtable structure -
be held.
4. To look up the file structure given an fd, a reader
- must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
+ must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
take care of barrier requirements due to lock-free lookup.
An example::
@@ -70,43 +70,22 @@ the fdtable structure -
struct file *file;
rcu_read_lock();
- file = lookup_fd_rcu(fd);
- if (file) {
- ...
- }
- ....
+ file = lookup_fdget_rcu(fd);
rcu_read_unlock();
-
-5. Handling of the file structures is special. Since the look-up
- of the fd (fget()/fget_light()) are lock-free, it is possible
- that look-up may race with the last put() operation on the
- file structure. This is avoided using atomic_long_inc_not_zero()
- on ->f_count::
-
- rcu_read_lock();
- file = files_lookup_fd_rcu(files, fd);
if (file) {
- if (atomic_long_inc_not_zero(&file->f_count))
- *fput_needed = 1;
- else
- /* Didn't get the reference, someone's freed */
- file = NULL;
+ ...
+ fput(file);
}
- rcu_read_unlock();
....
- return file;
-
- atomic_long_inc_not_zero() detects if refcounts is already zero or
- goes to zero during increment. If it does, we fail
- fget()/fget_light().
-6. Since both fdtable and file structures can be looked up
+5. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
- and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.
+ and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
+ issues.
-7. While updating, the fdtable pointer must be looked up while
+6. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
@@ -126,3 +105,19 @@ the fdtable structure -
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().
+On newer kernels rcu based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
+to just acquire a reference to the file in question under rcu using
+atomic_long_inc_not_zero() since the file might have already been
+recycled and someone else might have bumped the reference. In other
+words, callers might see reference count bumps from newer users. For
+this is reason it is necessary to verify that the pointer is the same
+before and after the reference count increment. This pattern can be seen
+in get_file_rcu() and __files_get_rcu().
+
+In addition, it isn't possible to access or check fields in struct file
+without first aqcuiring a reference on it under rcu lookup. Not doing
+that was always very dodgy and it was only usable for non-pointer data
+in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
+either first acquire a reference or they must hold the files_lock of the
+fdtable.
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index a624e92f2687..1b84f818e574 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -261,9 +261,9 @@ DIRECT_KEY policies
The Adiantum encryption mode (see `Encryption modes and usage`_) is
suitable for both contents and filenames encryption, and it accepts
-long IVs --- long enough to hold both an 8-byte logical block number
-and a 16-byte per-file nonce. Also, the overhead of each Adiantum key
-is greater than that of an AES-256-XTS key.
+long IVs --- long enough to hold both an 8-byte data unit index and a
+16-byte per-file nonce. Also, the overhead of each Adiantum key is
+greater than that of an AES-256-XTS key.
Therefore, to improve performance and save memory, for Adiantum a
"direct key" configuration is supported. When the user has enabled
@@ -300,8 +300,8 @@ IV_INO_LBLK_32 policies
IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
-SipHash key is derived from the master key) and added to the file
-logical block number mod 2^32 to produce a 32-bit IV.
+SipHash key is derived from the master key) and added to the file data
+unit index mod 2^32 to produce a 32-bit IV.
This format is optimized for use with inline encryption hardware
compliant with the eMMC v5.2 standard, which supports only 32 IV bits
@@ -451,31 +451,62 @@ acceleration is recommended:
Contents encryption
-------------------
-For file contents, each filesystem block is encrypted independently.
-Starting from Linux kernel 5.5, encryption of filesystems with block
-size less than system's page size is supported.
-
-Each block's IV is set to the logical block number within the file as
-a little endian number, except that:
-
-- With CBC mode encryption, ESSIV is also used. Specifically, each IV
- is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
- of the file's data encryption key.
-
-- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV.
- Currently this is only allowed with the Adiantum encryption mode.
-
-- With `IV_INO_LBLK_64 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- (which is also limited to 32 bits) is placed in bits 32-63.
-
-- With `IV_INO_LBLK_32 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- is then hashed and added mod 2^32.
-
-Note that because file logical block numbers are included in the IVs,
-filesystems must enforce that blocks are never shifted around within
-encrypted files, e.g. via "collapse range" or "insert range".
+For contents encryption, each file's contents is divided into "data
+units". Each data unit is encrypted independently. The IV for each
+data unit incorporates the zero-based index of the data unit within
+the file. This ensures that each data unit within a file is encrypted
+differently, which is essential to prevent leaking information.
+
+Note: the encryption depending on the offset into the file means that
+operations like "collapse range" and "insert range" that rearrange the
+extent mapping of files are not supported on encrypted files.
+
+There are two cases for the sizes of the data units:
+
+* Fixed-size data units. This is how all filesystems other than UBIFS
+ work. A file's data units are all the same size; the last data unit
+ is zero-padded if needed. By default, the data unit size is equal
+ to the filesystem block size. On some filesystems, users can select
+ a sub-block data unit size via the ``log2_data_unit_size`` field of
+ the encryption policy; see `FS_IOC_SET_ENCRYPTION_POLICY`_.
+
+* Variable-size data units. This is what UBIFS does. Each "UBIFS
+ data node" is treated as a crypto data unit. Each contains variable
+ length, possibly compressed data, zero-padded to the next 16-byte
+ boundary. Users cannot select a sub-block data unit size on UBIFS.
+
+In the case of compression + encryption, the compressed data is
+encrypted. UBIFS compression works as described above. f2fs
+compression works a bit differently; it compresses a number of
+filesystem blocks into a smaller number of filesystem blocks.
+Therefore a f2fs-compressed file still uses fixed-size data units, and
+it is encrypted in a similar way to a file containing holes.
+
+As mentioned in `Key hierarchy`_, the default encryption setting uses
+per-file keys. In this case, the IV for each data unit is simply the
+index of the data unit in the file. However, users can select an
+encryption setting that does not use per-file keys. For these, some
+kind of file identifier is incorporated into the IVs as follows:
+
+- With `DIRECT_KEY policies`_, the data unit index is placed in bits
+ 0-63 of the IV, and the file's nonce is placed in bits 64-191.
+
+- With `IV_INO_LBLK_64 policies`_, the data unit index is placed in
+ bits 0-31 of the IV, and the file's inode number is placed in bits
+ 32-63. This setting is only allowed when data unit indices and
+ inode numbers fit in 32 bits.
+
+- With `IV_INO_LBLK_32 policies`_, the file's inode number is hashed
+ and added to the data unit index. The resulting value is truncated
+ to 32 bits and placed in bits 0-31 of the IV. This setting is only
+ allowed when data unit indices and inode numbers fit in 32 bits.
+
+The byte order of the IV is always little endian.
+
+If the user selects FSCRYPT_MODE_AES_128_CBC for the contents mode, an
+ESSIV layer is automatically included. In this case, before the IV is
+passed to AES-128-CBC, it is encrypted with AES-256 where the AES-256
+key is the SHA-256 hash of the file's contents encryption key.
Filenames encryption
--------------------
@@ -544,7 +575,8 @@ follows::
__u8 contents_encryption_mode;
__u8 filenames_encryption_mode;
__u8 flags;
- __u8 __reserved[4];
+ __u8 log2_data_unit_size;
+ __u8 __reserved[3];
__u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
};
@@ -586,6 +618,29 @@ This structure must be initialized as follows:
The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
mutually exclusive.
+- ``log2_data_unit_size`` is the log2 of the data unit size in bytes,
+ or 0 to select the default data unit size. The data unit size is
+ the granularity of file contents encryption. For example, setting
+ ``log2_data_unit_size`` to 12 causes file contents be passed to the
+ underlying encryption algorithm (such as AES-256-XTS) in 4096-byte
+ data units, each with its own IV.
+
+ Not all filesystems support setting ``log2_data_unit_size``. ext4
+ and f2fs support it since Linux v6.7. On filesystems that support
+ it, the supported nonzero values are 9 through the log2 of the
+ filesystem block size, inclusively. The default value of 0 selects
+ the filesystem block size.
+
+ The main use case for ``log2_data_unit_size`` is for selecting a
+ data unit size smaller than the filesystem block size for
+ compatibility with inline encryption hardware that only supports
+ smaller data unit sizes. ``/sys/block/$disk/queue/crypto/`` may be
+ useful for checking which data unit sizes are supported by a
+ particular system's inline encryption hardware.
+
+ Leave this field zeroed unless you are certain you need it. Using
+ an unnecessarily small data unit size reduces performance.
+
- For v2 encryption policies, ``__reserved`` must be zeroed.
- For v1 encryption policies, ``master_key_descriptor`` specifies how
@@ -1079,8 +1134,8 @@ The caller must zero all input fields, then fill in ``key_spec``:
On success, 0 is returned and the kernel fills in the output fields:
- ``status`` indicates whether the key is absent, present, or
- incompletely removed. Incompletely removed means that the master
- secret has been removed, but some files are still in use; i.e.,
+ incompletely removed. Incompletely removed means that removal has
+ been initiated, but some files are still in use; i.e.,
`FS_IOC_REMOVE_ENCRYPTION_KEY`_ returned 0 but set the informational
status flag FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY.
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 4b30daee399a..198d805d611c 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -241,3 +241,10 @@ following flags are defined:
all of an inode's dirty data on last close. Exports that behave this
way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
waiting for writeback when closing such files.
+
+ EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock
+ requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
+ it's own ->lock() functionality as core posix_lock_file() implementation
+ has no async lock request handling yet. For more information about how to
+ indicate an async lock request from a ->lock() file_operations struct, see
+ fs/locks.c and comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 4d05b9862451..d69f59700a23 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1045,3 +1045,10 @@ filesystem type is now moved to a later point when the devices are closed:
As this is a VFS level change it has no practical consequences for filesystems
other than that all of them must use one of the provided kill_litter_super(),
kill_anon_super(), or kill_block_super() helpers.
+
+---
+
+**mandatory**
+
+Lock ordering has been changed so that s_umount ranks above open_mutex again.
+All places where s_umount was taken under open_mutex have been fixed up.