Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- Documentation/filesystems/files.rst         |  53
-rw-r--r-- Documentation/filesystems/fscrypt.rst       | 121
-rw-r--r-- Documentation/filesystems/nfs/exporting.rst |   7
-rw-r--r-- Documentation/filesystems/porting.rst       |   7
4 files changed, 126 insertions, 62 deletions
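The files.rst change in this diff explains that, with SLAB_TYPESAFE_BY_RCU, a reference taken with atomic_long_inc_not_zero() is only trustworthy if the file pointer reads the same before and after the increment. The block below is a minimal userspace model of that check, not kernel code. It uses C11 atomics in place of RCU and atomic_long_t. The names model_file, model_inc_not_zero(), model_fput() and model_get_file_rcu() are hypothetical, and freeing and slab reuse are not modelled.

```c
/*
 * Userspace model (hypothetical names, not kernel code) of the
 * "increment, then re-check the pointer" pattern for lookups under
 * SLAB_TYPESAFE_BY_RCU. In the kernel this would run under rcu_read_lock().
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_file {
	atomic_long f_count;	/* 0 means the object is being freed */
};

/* Like atomic_long_inc_not_zero(): never revive an object whose count is 0. */
static bool model_inc_not_zero(atomic_long *v)
{
	long old = atomic_load(v);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop one reference (freeing is not modelled). */
static void model_fput(struct model_file *f)
{
	atomic_fetch_sub(&f->f_count, 1);
}

/*
 * One lookup attempt. Returns the file with a reference held, or NULL.
 * *retry is set when the caller should try the lookup again.
 */
static struct model_file *model_get_file_rcu(_Atomic(struct model_file *) *slot,
					     bool *retry)
{
	struct model_file *file, *reloaded;

	*retry = false;
	file = atomic_load(slot);
	if (!file)
		return NULL;		/* nothing installed in this slot */

	if (!model_inc_not_zero(&file->f_count)) {
		*retry = true;		/* count already hit zero */
		return NULL;
	}

	/*
	 * The memory may have been recycled for a different file, so the
	 * increment may have landed on a newer user's object. Only keep the
	 * reference if the slot still points at the same object.
	 */
	reloaded = atomic_load(slot);
	if (reloaded == file)
		return file;

	model_fput(file);		/* undo the increment on the wrong object */
	*retry = true;
	return NULL;
}
```

A caller would loop while *retry is set. The pointer comparison is what separates "I pinned the file installed in this slot" from "my increment landed on a recycled object that now belongs to someone else".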
diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index bcf84459917f..9e38e4c221ca 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -62,7 +62,7 @@ the fdtable structure -
    be held.
 
 4. To look up the file structure given an fd, a reader
-   must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These
+   must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
    take care of barrier requirements due to lock-free lookup.
 
    An example::
@@ -70,43 +70,22 @@ the fdtable structure -
 	struct file *file;
 
 	rcu_read_lock();
-	file = lookup_fd_rcu(fd);
-	if (file) {
-		...
-	}
-	....
+	file = lookup_fdget_rcu(fd);
 	rcu_read_unlock();
-
-5. Handling of the file structures is special. Since the look-up
-   of the fd (fget()/fget_light()) are lock-free, it is possible
-   that look-up may race with the last put() operation on the
-   file structure. This is avoided using atomic_long_inc_not_zero()
-   on ->f_count::
-
-	rcu_read_lock();
-	file = files_lookup_fd_rcu(files, fd);
 	if (file) {
-		if (atomic_long_inc_not_zero(&file->f_count))
-			*fput_needed = 1;
-		else
-		/* Didn't get the reference, someone's freed */
-			file = NULL;
+		...
+		fput(file);
 	}
-	rcu_read_unlock();
 	....
-	return file;
-
-   atomic_long_inc_not_zero() detects if refcounts is already zero or
-   goes to zero during increment. If it does, we fail
-   fget()/fget_light().
 
-6. Since both fdtable and file structures can be looked up
+5. Since both fdtable and file structures can be looked up
    lock-free, they must be installed using rcu_assign_pointer()
    API. If they are looked up lock-free, rcu_dereference()
    must be used. However it is advisable to use files_fdtable()
-   and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues.
+   and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
+   issues.
 
-7. While updating, the fdtable pointer must be looked up while
+6. While updating, the fdtable pointer must be looked up while
    holding files->file_lock. If ->file_lock is dropped, then
    another thread expand the files thereby creating a new
    fdtable and making the earlier fdtable pointer stale.
@@ -126,3 +105,19 @@ the fdtable structure -
    Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
    the fdtable pointer (fdt) must be loaded after locate_fd().
 
+On newer kernels rcu based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
+to just acquire a reference to the file in question under rcu using
+atomic_long_inc_not_zero() since the file might have already been
+recycled and someone else might have bumped the reference. In other
+words, callers might see reference count bumps from newer users. For
+this reason it is necessary to verify that the pointer is the same
+before and after the reference count increment. This pattern can be seen
+in get_file_rcu() and __files_get_rcu().
+
+In addition, it isn't possible to access or check fields in struct file
+without first acquiring a reference on it under rcu lookup. Not doing
+that was always very dodgy and it was only usable for non-pointer data
+in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers
+either first acquire a reference or they must hold the file_lock of the
+files_struct.
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index a624e92f2687..1b84f818e574 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -261,9 +261,9 @@ DIRECT_KEY policies
 
 The Adiantum encryption mode (see `Encryption modes and usage`_) is
 suitable for both contents and filenames encryption, and it accepts
-long IVs --- long enough to hold both an 8-byte logical block number
-and a 16-byte per-file nonce.  Also, the overhead of each Adiantum key
-is greater than that of an AES-256-XTS key.
+long IVs --- long enough to hold both an 8-byte data unit index and a
+16-byte per-file nonce.  Also, the overhead of each Adiantum key is
+greater than that of an AES-256-XTS key.
 
 Therefore, to improve performance and save memory, for Adiantum a
 "direct key" configuration is supported.  When the user has enabled
@@ -300,8 +300,8 @@ IV_INO_LBLK_32 policies
 
 IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
 IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
-SipHash key is derived from the master key) and added to the file
-logical block number mod 2^32 to produce a 32-bit IV.
+SipHash key is derived from the master key) and added to the file data
+unit index mod 2^32 to produce a 32-bit IV.
 
 This format is optimized for use with inline encryption hardware
 compliant with the eMMC v5.2 standard, which supports only 32 IV bits
@@ -451,31 +451,62 @@ acceleration is recommended:
 Contents encryption
 -------------------
 
-For file contents, each filesystem block is encrypted independently.
-Starting from Linux kernel 5.5, encryption of filesystems with block
-size less than system's page size is supported.
-
-Each block's IV is set to the logical block number within the file as
-a little endian number, except that:
-
-- With CBC mode encryption, ESSIV is also used.  Specifically, each IV
-  is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
-  of the file's data encryption key.
-
-- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV.
-  Currently this is only allowed with the Adiantum encryption mode.
-
-- With `IV_INO_LBLK_64 policies`_, the logical block number is limited
-  to 32 bits and is placed in bits 0-31 of the IV.  The inode number
-  (which is also limited to 32 bits) is placed in bits 32-63.
-
-- With `IV_INO_LBLK_32 policies`_, the logical block number is limited
-  to 32 bits and is placed in bits 0-31 of the IV.  The inode number
-  is then hashed and added mod 2^32.
-
-Note that because file logical block numbers are included in the IVs,
-filesystems must enforce that blocks are never shifted around within
-encrypted files, e.g. via "collapse range" or "insert range".
+For contents encryption, each file's contents is divided into "data
+units".  Each data unit is encrypted independently.  The IV for each
+data unit incorporates the zero-based index of the data unit within
+the file.  This ensures that each data unit within a file is encrypted
+differently, which is essential to prevent leaking information.
+
+Note: the encryption depending on the offset into the file means that
+operations like "collapse range" and "insert range" that rearrange the
+extent mapping of files are not supported on encrypted files.
+
+There are two cases for the sizes of the data units:
+
+* Fixed-size data units.  This is how all filesystems other than UBIFS
+  work.  A file's data units are all the same size; the last data unit
+  is zero-padded if needed.  By default, the data unit size is equal
+  to the filesystem block size.  On some filesystems, users can select
+  a sub-block data unit size via the ``log2_data_unit_size`` field of
+  the encryption policy; see `FS_IOC_SET_ENCRYPTION_POLICY`_.
+
+* Variable-size data units.  This is what UBIFS does.  Each "UBIFS
+  data node" is treated as a crypto data unit.  Each contains variable
+  length, possibly compressed data, zero-padded to the next 16-byte
+  boundary.  Users cannot select a sub-block data unit size on UBIFS.
+
+In the case of compression + encryption, the compressed data is
+encrypted.  UBIFS compression works as described above.  f2fs
+compression works a bit differently; it compresses a number of
+filesystem blocks into a smaller number of filesystem blocks.
+Therefore a f2fs-compressed file still uses fixed-size data units, and
+it is encrypted in a similar way to a file containing holes.
+
+As mentioned in `Key hierarchy`_, the default encryption setting uses
+per-file keys.  In this case, the IV for each data unit is simply the
+index of the data unit in the file.  However, users can select an
+encryption setting that does not use per-file keys.  For these, some
+kind of file identifier is incorporated into the IVs as follows:
+
+- With `DIRECT_KEY policies`_, the data unit index is placed in bits
+  0-63 of the IV, and the file's nonce is placed in bits 64-191.
+
+- With `IV_INO_LBLK_64 policies`_, the data unit index is placed in
+  bits 0-31 of the IV, and the file's inode number is placed in bits
+  32-63.  This setting is only allowed when data unit indices and
+  inode numbers fit in 32 bits.
+
+- With `IV_INO_LBLK_32 policies`_, the file's inode number is hashed
+  and added to the data unit index.  The resulting value is truncated
+  to 32 bits and placed in bits 0-31 of the IV.  This setting is only
+  allowed when data unit indices and inode numbers fit in 32 bits.
+
+The byte order of the IV is always little endian.
+
+If the user selects FSCRYPT_MODE_AES_128_CBC for the contents mode, an
+ESSIV layer is automatically included.  In this case, before the IV is
+passed to AES-128-CBC, it is encrypted with AES-256 where the AES-256
+key is the SHA-256 hash of the file's contents encryption key.
 
 Filenames encryption
 --------------------
@@ -544,7 +575,8 @@ follows::
             __u8 contents_encryption_mode;
             __u8 filenames_encryption_mode;
             __u8 flags;
-            __u8 __reserved[4];
+            __u8 log2_data_unit_size;
+            __u8 __reserved[3];
             __u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
     };
 
@@ -586,6 +618,29 @@ This structure must be initialized as follows:
   The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
   mutually exclusive.
 
+- ``log2_data_unit_size`` is the log2 of the data unit size in bytes,
+  or 0 to select the default data unit size.  The data unit size is
+  the granularity of file contents encryption.  For example, setting
+  ``log2_data_unit_size`` to 12 causes file contents to be passed to
+  the underlying encryption algorithm (such as AES-256-XTS) in 4096-byte
+  data units, each with its own IV.
+
+  Not all filesystems support setting ``log2_data_unit_size``.  ext4
+  and f2fs support it since Linux v6.7.  On filesystems that support
+  it, the supported nonzero values are 9 through the log2 of the
+  filesystem block size, inclusive.  The default value of 0 selects
+  the filesystem block size.
+
+  The main use case for ``log2_data_unit_size`` is for selecting a
+  data unit size smaller than the filesystem block size for
+  compatibility with inline encryption hardware that only supports
+  smaller data unit sizes.  ``/sys/block/$disk/queue/crypto/`` may be
+  useful for checking which data unit sizes are supported by a
+  particular system's inline encryption hardware.
+
+  Leave this field zeroed unless you are certain you need it.  Using
+  an unnecessarily small data unit size reduces performance.
+
 - For v2 encryption policies, ``__reserved`` must be zeroed.
 
 - For v1 encryption policies, ``master_key_descriptor`` specifies how
@@ -1079,8 +1134,8 @@ The caller must zero all input fields, then fill in ``key_spec``:
 On success, 0 is returned and the kernel fills in the output fields:
 
 - ``status`` indicates whether the key is absent, present, or
-  incompletely removed.  Incompletely removed means that the master
-  secret has been removed, but some files are still in use; i.e.,
+  incompletely removed.  Incompletely removed means that removal has
+  been initiated, but some files are still in use; i.e.,
   `FS_IOC_REMOVE_ENCRYPTION_KEY`_ returned 0 but set the informational
   status flag FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY.
 
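The fscrypt.rst hunks above describe how a file offset maps to a data unit index and how each policy lays that index out in the IV. The C sketch below restates those rules under stated assumptions. The names sketch_du_bits(), sketch_du_index(), sketch_build_iv() and enum sketch_iv_policy are hypothetical and not part of any kernel or UAPI header. The SipHash-2-4 value of the inode number used by IV_INO_LBLK_32 is passed in precomputed. The 32-byte buffer is sized for Adiantum's long IV; treating 16-byte-IV modes as using only its leading bytes is an assumption of this sketch. The ESSIV step for AES-128-CBC is not shown.

```c
/*
 * Sketch (hypothetical names) of the data unit and IV rules described in
 * fscrypt.rst. Not kernel code; SipHash is not computed here.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_IV_SIZE    32	/* assumption: room for Adiantum's long IV */
#define SKETCH_NONCE_SIZE 16	/* per-file nonce */

enum sketch_iv_policy {
	SKETCH_PER_FILE_KEY,	/* default: IV is just the data unit index */
	SKETCH_DIRECT_KEY,
	SKETCH_IV_INO_LBLK_64,
	SKETCH_IV_INO_LBLK_32,
};

/* log2_data_unit_size: 0 selects the block size, else 9..log2(block size). */
static bool sketch_du_bits(unsigned int log2_du_size,
			   unsigned int log2_blocksize, unsigned int *du_bits)
{
	if (log2_du_size == 0) {
		*du_bits = log2_blocksize;
		return true;
	}
	if (log2_du_size < 9 || log2_du_size > log2_blocksize)
		return false;
	*du_bits = log2_du_size;
	return true;
}

/* Zero-based index of the fixed-size data unit containing file_offset. */
static uint64_t sketch_du_index(uint64_t file_offset, unsigned int du_bits)
{
	return file_offset >> du_bits;
}

/* Store v in little-endian byte order. */
static void put_le64(uint8_t *p, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

static void sketch_build_iv(uint8_t iv[SKETCH_IV_SIZE],
			    enum sketch_iv_policy pol, uint64_t du_index,
			    uint32_t ino, uint32_t hashed_ino,
			    const uint8_t nonce[SKETCH_NONCE_SIZE])
{
	memset(iv, 0, SKETCH_IV_SIZE);
	switch (pol) {
	case SKETCH_IV_INO_LBLK_64:
		/* index in bits 0-31, inode number in bits 32-63; the policy
		 * requires both to fit in 32 bits (not checked here) */
		du_index |= (uint64_t)ino << 32;
		break;
	case SKETCH_IV_INO_LBLK_32:
		/* hashed inode number plus index, truncated to 32 bits */
		du_index = (uint32_t)(hashed_ino + du_index);
		break;
	case SKETCH_DIRECT_KEY:
		/* nonce in bits 64-191, i.e. bytes 8-23 */
		memcpy(iv + 8, nonce, SKETCH_NONCE_SIZE);
		break;
	case SKETCH_PER_FILE_KEY:
		break;
	}
	put_le64(iv, du_index);	/* bits 0-63, little endian */
}
```

For example, with a 4096-byte block size and log2_data_unit_size set to 9, the byte at offset 8192 falls in data unit 16, and it is that index, not the block number 2, that goes into the IV.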
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 4b30daee399a..198d805d611c 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -241,3 +241,10 @@ following flags are defined:
     all of an inode's dirty data on last close. Exports that behave this
     way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
     waiting for writeback when closing such files.
+
+  EXPORT_OP_ASYNC_LOCK - Indicates that the filesystem can handle async
+    lock requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem
+    has its own ->lock() functionality, as the core posix_lock_file()
+    implementation has no async lock request handling yet. For more information
+    about how to indicate an async lock request from a ->lock() file_operations
+    struct, see fs/locks.c and the comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 4d05b9862451..d69f59700a23 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1045,3 +1045,10 @@ filesystem type is now moved to a later point when the devices are closed:
 As this is a VFS level change it has no practical consequences for
 filesystems other than that all of them must use one of the provided
 kill_litter_super(), kill_anon_super(), or kill_block_super() helpers.
+
+---
+
+**mandatory**
+
+Lock ordering has been changed so that s_umount ranks above open_mutex again.
+All places where s_umount was taken under open_mutex have been fixed up.