and do not hang waiting forever for an ack from non-existing thread.
PR: 281511
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Remove some uses of PHOLD which were there only to prevent the process'
threads from being swapped out.
Tested by: pho
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D46118
The gjournal implementation does not properly handle the freeing
of blocks that may be part of a snapshot. Adding this support to
gjournal would require considerable effort. For now we simply
do not allow snapshots to be taken on filesystems using gjournal.
Reported by: ant_mail@inbox.ru
PR: 280216
MFC after: 1 week
This is mostly to reduce the diff with CheriBSD which adds additional
constants to enum uio_rw, but also matches the normal style used for
uio_segflg.
Reviewed by: kib, emaste
Obtained from: CheriBSD
Differential Revision: https://reviews.freebsd.org/D45142
The UFS1 integrity checks added in FreeBSD 14 were too aggressive
for UFS1 filesystems created in FreeBSD 4 and 9 systems. This patch
removes those tests which can be done safely since they are not
relevant to the current implementation of UFS1.
This is a follow-on report to bug report 264450 (comments 21-28).
Reported by: slb@sonnet.com
Tested by: slb@sonnet.com
PR: 264450
MFC after: 1 week
If the ffs_write() operation specified to overwrite the whole buffer,
ffs tries to save the read by not validating allocated buffer. Then
uiommove() might fail with EFAULT, in which case pages are left zeroed
and marked valid but not read from the disk. Then vn_io_fault() logic
retries the write after holding the user pages to avoid EFAULTs. In
erronous case of really faulty buffer, or in contrived case of writing
from file to itself, we are left with zeroed buffer instead of valid
content written back to disk.
Handle the situation by releasing non-cached buffer on fault, instead
of clearing it. Note that buffers with alive dependencies cannot be
released, but also either they cannot have valid content on the disk
because dependency on data buffer means that it was not yet written, or
they were reallocated by fragment extension or ffs_reallocbks(), and are
already fully valid.
Reported by: kevans
Discussed with: mav
In collaboration with: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The link count for a UFS/FFS inode is stored in a signed 16-bit
integer. Thus the maximum link count has been 32767.
This limit has been recently hit by the poudriere build system when
doing a ports build as it needs one directory per port and the
number of ports recently passed 32767.
A long-term solution would be to use one of the spare 32-bit fields
in the inode to store the link count. However, the UFS1 format does
not have a spare and adding the spare in UFS2 would make it hard
to make it compatible when running on older kernels that use the
original link count field. So this patch uses the much simpler
approach of changing the existing link count field from a signed
16-bit value to an unsigned 16-bit value. It has the fewest lines
of code changes. The only thing that changes is the type in the
dinode and inode structures and the definition of UFS_LINK_MAX. It
has the added benefit that it works with both UFS1 and UFS2.
It allows easy backward compatibility. Indeed it is backward
compatibility that is the primary reason to go with this approach.
If a filesystem with the new organization is mounted on an older
kernel, it still needs to work. Thus if we move the new link count
to a new field, we still need to maintain the old link count as
best as possible even when running on a kernel that knows about the
larger link counts. And we would have to carry this overhead for
the indefinite future.
If we have a new link-count field, we will have to add a new
filesystem flag to indicate that we are running with larger link
counts. We will also need to add of one of the new-feature flags
to say that we have larger link counts. Older kernels clear the
new-feature flags that they do not know about, so when a filesystem
is used on an older kernel and then moved back to a newer one, the
newer one will know that the new link counts have not been maintained
and that it will be necessary to run a full fsck on the filesystem
to correct the link counts before it can be mounted.
With this change, older kernels will generally work with the bigger
counts. While it will not itself allow the link count to exceed
32767, it will have no problem working with inodes that have a link
count greater than 32767. Since it tests that i_nlink <= UFS_LINK_MAX,
counts that are bigger than 32767 will appear negative, so will
still pass the test. Of course, if they ever drop below 32767, they
will no longer be able to exceed 32767. The one issue is if the
link count ever exceeds 65535 then it will wrap to zero and the
older kernel will be none the wiser. But this corner case is likely
to be very rare since these kernels and the applications running
on them do not expect to be able to get link counts over 32767. And
over time, the use of new filesystems on older kernels will become
rarer and rarer.
Reported-by: Mark Millard running poudriere on the ports tree
Reviewed-by: kib, olce.freebsd_certner.fr
Tested-by: Peter Holm, Mark Millard
MFC-after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D42767
Apply the following automated changes to try to eliminate
no-longer-needed sys/cdefs.h includes as well as now-empty
blank lines in a row.
Remove /^#if.*\n#endif.*\n#include\s+<sys/cdefs.h>.*\n/
Remove /\n+#include\s+<sys/cdefs.h>.*\n+#if.*\n#endif.*\n+/
Remove /\n+#if.*\n#endif.*\n+/
Remove /^#if.*\n#endif.*\n/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/
Sponsored by: Netflix
Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.
Sponsored by: Netflix
A UFS/FFS snapshot file is identified with the SF_SNAPSHOT
flag to identify it as a snapshot. This flag needs to be
set before setting some of its block pointers to the special
values BLK_SNAP and BLK_NOCOPY. If the snapshot creation fails
and we call VOP_REMOVE(), the SF_SNAPSHOT flag will let the
remove routine know that the special block pointer values need
to be rolled back before attempting deletion of the file.
Also ensure that an fsck is required after setting superblock
values in the ffs_checkcgintegrity() routine.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
If a UFS/FFS filesystem develops a broken cylinder group (which is
usually detected when its check hash fails), that cylinder group
will not be usable until the filesystem has been unmounted and fsck
has been run to repair it. On the first attempt to to allocate
resources from the broken cylinder group, its available resources
are set to zero in the superblock summary information. Since it
will appear to have no resources available, no further calls will
be made to allocate resources from it. When resources are freed to
the broken cylinder group, the resource free routines will find the
cylinder group unusable so the resource will simply be discarded
and thus will not show up in the superblock summary information
until they are recovered by fsck.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
Rename to ffs_checkfreeblk() to better describe that it is checking
to find out if a block or fragment is free. Clarify its implementation.
No functional change intended.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
The ffs_inotovp() function returns a vnode from a mounted filesystem
for an inode number with specified generation number. We now
consistently return ESTALE if the inode with given generation number
no longer exists on that filesystem.
The ffs_reload() function reloads all incore data for a filesystem.
It is used after running fsck on a mounted filesystem and finding
things to fix. It now returns the EINTEGRITY error if it is unable
to find a valid superblock.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
If the root fs is forcibly unmounted then basically every process
will die with a SEGV as soon as it tries to run again because libc.so
is gone, which leaves the system basically hung. It seems better
to just panic instead, so let's do that.
Requested-by: karels
Reviewed-by: imp, mckusick, karels
Sponsored-by: Netflix
Differential Revision: https://reviews.freebsd.org/D41387
When taking a UFS/FFS snapshot, it may not succeed for example if the
filesystem is too full to hold it. When a snapshot is unable to be
successfully taken, the partial snapshot should be removed.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
When a large file is deleted or a large number of files are deleted,
even a single cylinder group with a bad check hash can generate
thousands of check-hash warnings. As with other filesystem messages
such as out-of-space, print a maximum of one check-hash error per
second. Note the limit is per filesystem. If two filesystems have
cylinder group(s) with a bad check hash, each will print a maximum
of one check-hash error message per second.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
When a file is deleted, its blocks need to be put back in the free
block list and its inode needs to be put back in the inode free list.
These lists reside in cylinder-group maps. If either some of its blocks
or its inode reside in a cylinder-group map with a bad check hash
it is not possible to free the associated resource. Since the cylinder
group cannot be repaired until the filesystem is unmounted these
resources cannot be freed. They simply accumulate in memory. And
any attempt to unmount the filesystem loops forever trying to flush them.
With this change, the resource update claims to succeed so that the
file deletion can successfully complete. The filesystem is marked as
requiring an fsck so that before the next time that the filesystem is
mounted, the offending cylinder groups are reconstructed causing the
lost resources to be reclaimed.
A better solution would be to downgrade the filesystem to read-only,
but that capability is not currently implemented.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.
No functional changes intended.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
When taking a snapshot on a UFS version 1 filesystem we need to
call ffs_oldfscompat_write() to unwind any in-memory changes that
were made to the superblock before writing it. The cause of this bug
was that the trimmed down maximum file size was not being reverted.
PR: 271352
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
Verify that the summary information does not extend past the end
of the filesystem.
No legitimate superblocks should fail as a result of these changes.
Reported-by: Robert Morris
PR: 271351
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
Commit fe5e6e2 improved FFS directory placement when creating new
directories. It is done by keeping track of the depth of directories
in the filesystem and placing those lower in the tree closer together
while spreading out those higher in the tree.
Fsck_ffs(8) checks these depths and if incorrect adjusts them to
their correct value. When running in background fsck_ffs(8) needs
to be able to make an adjustment to the depth. This commit adds
the sysctl to make such an adjustment and adds the code to fsck_ffs(8)
to use the new sysctl.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The cylinder group header structure ended with `u_int8_t cg_space[1]'
representing the beginning of the inode bitmap array. Some architectures
like the i386 rounded this up to a 4-byte boundry while other
architectures like the amd64 rounded it up to an 8-byte boundry.
Thus sizeof(struct cg) was four bytes bigger on an amd64 machine
than on an i386 machine. If a filesystem created on an i386 machine
was moved to an amd64 machine, the size of the cylinder group
calculated by the CGSIZE macro would appear to grow by four bytes.
Filesystems whose cylinder groups were exactly equal to the block
size on an i386 machine would appear to have a cylinder group that
was four bytes too big when moved to an amd64 machine. Note that
although the structure appears to be too big, it in fact is fine.
It is just the calaculation of its size that is in error.
The fix is to remove the cg_space element from the cylinder-group
structure so that the calculated size of the structure is the same
size on all architectures.
Reported by: Tijl Coosemans
Tested by: Tijl Coosemans and Peter Holm
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
Check for an uninitialed (zero valued) fs_maxbsize and set it
to its minimum valid size (fs_bsize). Uninitialed fs_maxbsize
were left by older versions of makefs(8) and the superblock
integrity checks fail when they are found.
No legitimate superblocks should fail as a result of these changes.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
If either of vnodes is shared locked, lock must not be recursed.
Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444
The algorithm for laying out new directories was devised in the 1980s
and markedly improved the performance of the filesystem. In those days
large disks had at most 100 cylinder groups and often as few as 10-20.
Modern multi-terrabyte disks have thousands of cylinder groups. The
original algorithm does not handle these large sizes well. This change
attempts to expand the scope of the original algorithm to work well
with these much larger disks while still retaining the properties
of the original algorithm for small disks.
The filesystem implementation is divided into policy routines and
implementation routines. The policy routines can be changed in any
way desired without risk of corrupting the filesystem. The policy
requests are handled by the implementation layer. If the policy
asks for an available resource, it is granted. But if it asks for
an already in-use resource, then the implementation will provide
an available one nearby the request. Thus it is impossible for a
policy to double allocate. This change is limited to the policy
implementation.
This change updates the ffs_dirpref() routine which is responsible
for selecting the cylinder group into which a new directory should
be placed. If we are near the root of the filesystem we aim to
spread them out as much as possible. As we descend deeper from the
root we cluster them closer together around their parent as we
expect them to be more closely interactive. Higher-level directories
like usr/src/sys and usr/src/bin should be separated while the
directories in these areas are more likely to be accessed together
so should be closer. And directories within commands or kernel
subsystems should be closer still.
We pick a range of cylinder groups around the cylinder group of the
directory in which we are being created. The size of the range for
our search is based on our depth from the root of our filesystem.
We then probe that range based on how many directories are already
present. The first new directory is at 1/2 (middle) of the range;
the second is in the first 1/4 of the range, then at 3/4, 1/8, 3/8,
5/8, 7/8, 1/16, 3/16, 5/16, etc.
It is desirable to store the depth of a directory in its on-disk
inode so that it is available when we need it. We add a new field
di_dirdepth to track the depth of each directory. Because there are
few spare fields left in the inode, we choose to share an existing
field in the inode rather than having one of our own. Specifically
we create a union with the di_freelink field. The di_freelink field
is used to track inodes that have been unlinked but remain referenced.
It is not needed until a rmdir(2) operation has been done on a
directory. At that point, the directory has no contents and even
if it is kept active as a current directory is no longer able to
have any new directories or files created in it. Thus the use of
di_dirdepth and di_freelink will never coincide.
Reported by: Timo Voelker
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39246
When requesting a superblock read for the sole purpose of getting
the parameters needed to find if backup parameters have been stored,
specify UFS_NOCSUM as only the base superblock is needed. This
change reduces the number of checks that the superblock must pass.
MFC after: 1 week
Have to add a check that the computed cylinder group size does not
exceed the block size of the filesystem.
It is also necessary to validate additional parameters when a
superblock is going to be used in read-only mode if its supplementary
information is going to be read in to ensure that the size and
location of the supplementary information is valid. Also when a
warning is raised let it be accepted, but bound the flagged field
to the value checked by the warning.
No legitimate superblocks should fail as a result of these changes.
Reported by: Bob Prohaska, John-Mark Gurney, and Mark Millard
Tested by: Peter Holm
Reviewed by: Peter Holm
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38668
There is another case where SU code does ffs_syncvnode(dvp) for the
parent directory dvp while the child vnode vp is locked. Avoid the
issue by relocking and returning ERELOOKUP to indicate the need of
resync.
Reported by: jkim
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37997
There is no point in clearing just this flag. Flags are reset on the
struct mount re-allocation for reuse anyway.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37966
Also deny suspension if we cannot check the above condition race-free
because there is more than one thread in the calling process.
PR: 267628, 267630
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37896
Order includes alphabetically.
Remove unneeded sys/param.h, it is already included by sys/systm.h.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37896
To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>
As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).
Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).
The new field lands in a 1-byte hole, thus it does not grow the struct.
Bump __FreeBSD_version to 1400077
Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759