Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.
This version has a step debugger, which now completely replaces the
old trace feature. Also, we moved all of the FreeBSD-specific MI
code to loader.c, reducing the diff between this and the official
FICL distribution.
The zone allocator's locks should be leaflocks, meaning that they
should never be held when entering into another subsystem, however
the sysctl grabs the zone global mutex and individual zone mutexes
while holding the lock it calls SYSCTL_OUT which recurses into the
VM subsystem in order to wire user memory to do a safe copy. This
can block and cause lock order reversals.
To fix this:
lock zone global.
get a count of the number of zones.
unlock global.
allocate temporary storage.
format and SYSCTL_OUT the banner.
lock global.
traverse list.
make sure we haven't looped more than the initial count taken
to avoid overflowing the allocated buffer.
lock each nodes.
read values and format into buffer.
unlock individual node.
unlock global.
format and SYSCTL_OUT the rest of the data.
free storage.
return.
Other problems included not checking for errors when doing sysctl out
of the column header. Fixed.
Inconsistant termination of the copied string. Fixed.
Objected to by: des (for not using sbuf)
Since the output is not variable length and I'm actually over
allocating signifigantly and I'd like to get this fixed now, I'll
work on the sbuf convertion at a later date. I would not object
to someone else taking it upon themselves to convert it to sbuf.
I hold no MAINTIANER rights to this code (for now).
been made machine independent and various other adjustments have been made
to support Alpha SMP.
- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delievered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forwared_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures has become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.
Reviewed by: jake, peter
Looked over by: eivind
It might be more correct to make stathz as close as possible to 128,
but that would involve adding complexity to the clock intr path, which
I don't want to do.
modify the scheduling properties of processes with a different real
uid but the same effective uid (i.e., daemons, et al). (note: these
cases were previously commented out, so this does not change the
compiled code at al)
Obtained from: TrustedBSD Project
The constant I was using was correct, but I mislabeled it as 256K when
it should have been 512K. This doesn't actually change the code, but
it clarifies things somewhat.
Submitted by: Chuck Cranor <chuck@research.att.com>
sf_hdtr is used to provide writev(2) style headers/trailers on the
sent data the return value is actually either the result of writev(2)
from the trailers or headers of no tailers are specified.
Fix sendfile to comply with the documentation, by returning 0 on
success.
Ok'd by: dg
We are way too inconsistent with our setting of the "schg" flag, and in
our default install, it doesn't really offer any additional security.
Reviewed by: arch@
saves 32 registers) to do on every context switch. This is only required
for SMP, so only do it there.
We should also look at moving the critical enter/exit out to the callers
Long ago, bread() set b_blkno to the disk block number as a side effect
of doing physical i/o (or it just retained the setting from when the
i/o was done). The setting is lost when buffers go away and then are
reconsituted from VM. bread() originally compensated by doing a
VOP_BMAP() to recover b_blkno, but this was no good since it sometimes
caused extra i/o or even deadlock for bread()ing metadata to do the
bmap. This was fixed in vfs_bio.c 1.33 (1995/03/03) and ffs_balloc.c
1.5, etc., by removing the VOP_BMAP() from bread() and breadn(), and
changing all (?) places that used b_blkno to set it if necessary.
ext2fs was not imported until later in 1995 and was still depending on
the old behaviour of bread() in at least ext2_balloc(). This caused
filesystem and file corruption by clobbering direct block numbers in
inodes.
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.
For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.
Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.
to struct mount.
This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.
Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().
At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
required by POSIX.1e. This maintains the current 'struct acl'
in the kernel while providing the generic external acl_t
interface required to complete the ACL editing library.
o Add the acl_get_entry() function.
o Convert the existing ACL utilities, getfacl and setfacl, to
fully make use of the ACL editing library.
Obtained from: TrustedBSD Project
structure. This field keeps track of how many levels deep we are nested
into the kernel. The nesting level is bumped at the start of a trap,
interrupt, syscall, or exception and is decremented on return. This is
used to detect the case when the kernel is returning back to a kernel
context in exception_return(). If we are returning to the kernel we need
to update the globaldata pointer register saved in the stack frame in case
we have switched CPU's between taking the initial interrupt that saved the
frame and returning. If we don't do this fixup it is possible for a CPU to
use the wrong per-cpu data. On UP systems this is not a problem, so the
code is conditional on SMP.
A count was used instead of simply checking the process status register in
the frame during exception_return() since there are critical sections at
the very start and end of a trap, exception, or interrupt from userland in
which we could trash the t7 register being used in userland. The counter
is incremented after adn before these critical sections respectively so
that we will not overwrite the saved t7 register if we are interrupted
during one of these critical sections.
nam for an unbound socket instead of leaving nam untouched in that case.
This way, the getsockname() output can be used to determine the address
family of such sockets (AF_LOCAL).
Reviewed by: iedowse
Approved by: rwatson
linuxulator so as to allow privileged processes within a jail() to
invoke the Linux initgroups() system call. This allows the Linux
"su" to work properly (better) when running a complete Linux
environment under jail(). This problem was reported by Attila
Nagy <bra@fsn.hu>.
Reviewed by: marcel
fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause
panics if these fields were zeroed while a filesystem was mounted
read-only, and then remounted read-write.
Add code to ffs_reload() which copies the fs_contigdirs pointer
from the previous superblock, and reinitialises fs_avgf* if necessary.
Reviewed by: mckusick
With the recent changes in the CAM error handling, some problems in
the error handling of sa(4) have been uncovered. Basically, a number
of conditions that are not actually errors have been mistreated as
genuine errors. In particular:
. Trying to read in variable length mode with a mismatched blocksize
between the on-tape (virtual) blocks and the read(2) supplied buffer
size, causing an ILI SCSI condition, have caused an attempt to retry
the supposedly `errored' transfer, causing the tape to be read
continuously until it eventually hit EOM. Since by default any
simple mt(1) operation does an initial test read, an `mt stat' was
sufficient to trigger this bug.
Note that it's Justin's opinion that treating a NO SENSE as an EIO
is another bug in CAM. I feel not authorized to fix cam_periph.c
without another confirmation that i'm on the right track, however.
. Hitting a filemark caused the read(2) syscall to return EIO, instead
of returning a `short read'. Note that the current fix only solves
this problem in variable length mode. Fixed length mode uses a
different code path, and since i didn't grok all the intentions behind
that handling, i did not touch it (IOW: it's still broken, and you get
an EIO upon hitting a filemark).
The solution is to keep track of those conditions inside saerror(),
and upon completion to not call cam_periph_error() in that case. We
need to make sure that the device gets unfrozen if needed though (in
case of actual errors, cam_periph_error() does this on our behalf).
Not objected by: mjacob (who currently doesn't have the time to
review the patch)
semantics don't: in practice, both policy and semantics permit
loop-back debugging operations, only it's just a subset of debugging
operations (i.e., a proc can open its own /dev/mem), and that's at a
higher layer.
This is will be required to prevent lowering the ipl when a critical_enter()
is present in the interrupt path when handling a machine check.
reviewed by: jhb
only aiod does this and is also marked P_SYSTEM, the locations that
reference p->p_vmspace usually do it within the context of the caller,
the async access from the vm system is protected by the fact that it
will skip over P_SYSTEM processes.
Ok'd by: jhb
means that the pcic98 functionality might now work (I've tested it on
my pcic machine, but not the pcic98). Since these functions are
rarely called, it is unlikely that this will have a measurable impact
on performance.
FreeBSD. This code doesn't work just yet, but does compile. We need
to start indirecting via the cinfo pointers, rather than directly
calling pcic_*. There may be other issues as well, but you gotta
start somewhere.
Obtained from: PAO3
we also reserve _adequate_ space for the mb_map submap; i.e. we need
space for nmbclusters, nmbufs, _and_ nmbcnt. Furthermore, we need to
rounddown, and not roundup, so that we are consistent.
Pointed out by: bde
Also move the insertion of the request to after the request is validated,
there's still looks like there may be some problems if an invalid address
is passed to the aio routines, basically a possible leak or having a
not completely initialized structure on the queue may still be possible.
A new sig macro was made _SIG_VALID to check the validity of a signal,
it would be advisable to use it from now on (in kern/kern_sig.c) rather
than rolling your own.
PR: kern/17152
Protect pager object list manipulation with a mutex.
It doesn't look possible to combine them under a single sx lock because
creation may block and we can't have the object list manipulation block
on anything other than a mutex because of interrupt requests.
available.
Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.
The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).
Suggested by: tegge
Approved by: phk
Add TI4451 as well.
These are untested since I don't have the hardware to test against.
Also, some O2Micro devices are #define w/o numbers as place holders so that
I can encourage people to submit them when they appear in the channels.
parts. This is based on the newcard code that turns it off :-). We
can now reboot after NEWCARD or Windows and have OLDCARD work. Add
support for the RL5C466 while I'm at it.
Treat TI1031 the same as the CLPD6832. It doesn't work yet, but sucks
less than it did before.
Also add a few #defines for other changes in the pipe.
and __i386__ are defined rather than if SMP and BETTER_CLOCK are defined.
The removal of BETTER_CLOCK would have broken this except that kern_clock.c
doesn't include <machine/smptests.h>, so it doesn't see the definition of
BETTER_CLOCK, and forward_*clock aren't called, even on 4.x. This seems to
fix the problem where a n-way SMP system would see 100 * n clk interrupts
and 128 * n rtc interrupts.
and AS4100s into single user mode. This work was done jointly by jhb and
myself, and builds on dfr's earlier work.
smp_init_secondary() / smp_start_secondary()
- use the uniq val to pass the globalp (me)
- fancy footwork to take any pending machine checks (me)
- doing things the FreeBSD way and getting the per-cpu idleproc created
correctly, and synchronizing the startup of secondaries (jhb)
mp_start()
- better recognition of available cpus (jhb)
smp_rendezvous()
- if smp hasn't started, only run the rendezvous function on the current
cpu. Sleuthing and (prior) incorrect fix by me, correct fix by jhb
smp_handle_ipi()
- more verbose handling of console messages (jhb)
- grab sched lock around setting PS_ASTPENDING (jhb)
forward_*clock()
- commented out. Joint decision by dfr, jhb and myself
General synchronization improvements (more mb()s, etc) (jhb)
Printf cleanups (joint)
Whitespace cleanups (jhb)
- don't do the stack overflow sanity check on MP systems -- p->p_addr
will be malloc'ed memory (not K0SEG) and the check will fail.
- don't ignore clock interrupts on secondaries. Alphas apparently
roundrobin clock interrupts to all cpus, so we're going to take clock
interrupts on all CPUS and not forward them.
- use the unique value to save the per-cpu globalp struct like the
comment says
- don't lower the ipl to ALPHA_PSL_IPL_HIGH: we may have a pending machine
check to take and we're not prepared for that yet, as we haven't setup
our interrupt entry points. (this may only happen on sable/lynx)
- indicate the fact that the working version of smp_init_secondary() doesn't
return (this is tied up in other changes and hasn't yet been committed).
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:
ffs_clusteralloc: map mismatch
ffs_nodealloccg: map corrupted
ffs_nodealloccg: block not in map
ffs_alloccg: map corrupted
ffs_alloccg: block not in map
ffs_alloccgblk: cyl groups corrupted
ffs_alloccgblk: can't find blk in cyl
ffs_checkblk: partially free fragment
The following panics are less likely to be related to this problem,
but might be helped by these debugging options:
ffs_valloc: dup alloc
ffs_blkfree: freeing free block
ffs_blkfree: freeing free frag
ffs_vfree: freeing free inode
If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.
o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
associated with the vnode, as those can no longer be extracted from
the ACL passed as an argument. Perform all comparisons against
the passed arguments. This actually has the effect of simplifying
a number of components of this call, as well as reducing the indent
level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.
o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
other than ACL_UNDEFINED_ID. As a temporary work-around to allow
clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
each check so that this cannot cause a failure in the short term
(this work-around will be removed when the userland libraries and
utilities are updated to take this change into account).
o Modify ufs_sync_acl_from_inode() so that it forces
ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
when synchronizing the ACL from the inode.
o Modify ufs_sync_inode_from_acl to not propagate uid and gid
information to the inode from the ACL during ACL update. Also
modify the masking of permission bits that may be set from
ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).
o Modify ufs_getacl() so that when it emulates an access ACL from
the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.
o Clean up ufs_setacl() substantially since it is no longer possible
to perform chown/chgrp operations using vop_setacl(), so all the
access control for that can be eliminated.
o Modify ufs_access() so that it passes owner uid and gid information
into vaccess_acl_posix1e().
Pointed out by: jedger
Obtained from: TrustedBSD Project
panic_cpu shared variable. I used a simple atomic operation here instead
of a spin lock as it seemed to be excessive overhead. Also, this can avoid
recursive panics if, for example, witness is broken.
can happen if witness runs out of resources during initialization or if
witness_skipspin is enabled.
Sleuthing by: Peter Jeremy <peter.jeremy@alcatel.com.au>
"inside" of locked regions. That is, an acquire atomic operation will
always enforce a memory barrier after the atomic operation and a release
operation will always enforce a memory barrier before the atomic
operation.
- Explicitly use 'mb' instead of 'wmb' in release atomic operations. The
'wmb' memory barrier is not strong enough to guarantee coherence with
other processors. This is effectively a nop since alpha_wmb() actually
performs a 'mb' and not a 'wmb', but I wanted the code to be more
correct since at some point in the future alpha_wmb()'s implementation
may switch to being a real 'wmb'.
we should call ast(). This allows us to branch to a separate Lkernelret
label so we can fixup the saved t7 register in the trapframe. Otherwise
we can run into a problem on SMP systems where a process is interrupted by
a trap or interrupt on one CPU, migrates to another CPU, and then returns
with the t7 in the stack clobbering the CPU's t7. As a result, two CPU's
would both point to the same per-CPU data and things would go downhill from
there.
Sleuthing help by: gallatin
- Add a new ddb command: 'show pcpu' similar to the i386 command added
recently. By default it displays the current CPU's info, but an optional
argument can specify the logical ID of a specific CPU to examine.
bcopy would go off the end of the array by two elements, which sometimes
causes a panic if it happens to cross into a page that isn't mapped.
Submitted by: gibbs
Reviewed by: peter
are some good reasons for not doing this, even if the linting of
the code breaks.
1) If lint were ever to understand the stuff inside the macros,
that would break the checks.
2) There are ways to use __GNUC__ to exclude overly specific
code.
3) (Not yet practical) Lint(1) needs to properlyu understand
all of te code we actually run.
Complained about by: bde
Education by: jake, jhb, eivind
It is described in ufs/ffs/fs.h as follows:
/*
* Filesystem flags.
*
* Note that the FS_NEEDSFSCK flag is set and cleared only by the
* fsck utility. It is set when background fsck finds an unexpected
* inconsistency which requires a traditional foreground fsck to be
* run. Such inconsistencies should only be found after an uncorrectable
* disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
* it has successfully cleaned up the filesystem. The kernel uses this
* flag to enforce that inconsistent filesystems be mounted read-only.
*/
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */
and non-P_SUGID cases, simplify p_cansignal() logic so that the
P_SUGID masking of possible signals is independent from uid checks,
removing redundant code and generally improving readability.
Reviewed by: tmm
Obtained from: TrustedBSD Project
the ability of unprivileged processes to deliver arbitrary signals
to daemons temporarily taking on unprivileged effective credentials
when P_SUGID is not set on the target process:
Removed:
(p1->p_cred->cr_ruid != ps->p_cred->cr_uid)
(p1->p_ucred->cr_uid != ps->p_cred->cr_uid)
o Replace two "allow this" exceptions in p_cansignal() restricting
the ability of unprivileged processes to deliver arbitrary signals
to daemons temporarily taking on unprivileged effective credentials
when P_SUGID is set on the target process:
Replaced:
(p1->p_cred->p_ruid != p2->p_ucred->cr_uid)
(p1->p_cred->cr_uid != p2->p_ucred->cr_uid)
With:
(p1->p_cred->p_ruid != p2->p_ucred->p_svuid)
(p1->p_ucred->cr_uid != p2->p_ucred->p_svuid)
o These changes have the effect of making the uid-based handling of
both P_SUGID and non-P_SUGID signal delivery consistent, following
these four general cases:
p1's ruid equals p2's ruid
p1's euid equals p2's ruid
p1's ruid equals p2's svuid
p1's euid equals p2's svuid
The P_SUGID and non-P_SUGID cases can now be largely collapsed,
and I'll commit this in a few days if no immediate problems are
encountered with this set of changes.
o These changes remove a number of warning cases identified by the
proc_to_proc inter-process authorization regression test.
o As these are new restrictions, we'll have to watch out carefully for
possible side effects on running code: they seem reasonable to me,
but it's possible this change might have to be backed out if problems
are experienced.
Submitted by: src/tools/regression/security/proc_to_proc/testuid
Reviewed by: tmm
Obtained from: TrustedBSD Project
ability of unprivileged processes to modify the scheduling properties
of daemons temporarily taking on unprivileged effective credentials.
These cases (p1->p_cred->p_ruid == p2->p_ucred->cr_uid) and
(p1->p_ucred->cr_uid == p2->p_ucred->cr_uid), respectively permitting
a subject process to influence the scheduling of a daemon if the subject
process has the same real uid or effective uid as the daemon's effective
uid. This removes a number of the warning cases identified by the
proc_to_proc iner-process authorization regression test.
o As these are new restrictions, we'll have to watch out carefully for
possible side effects on running code: they seem reasonable to me,
but it's possible this change might have to be backed out if problems
are experienced.
Reported by: src/tools/regression/security/proc_to_proc/testuid
Obtained from: TrustedBSD Project
by p_can(...P_CAN_SEE), rather than returning EACCES directly. This
brings the error code used here into line with similar arrangements
elsewhere, and prevents the leakage of pid usage information.
Reviewed by: jlemon
Obtained from: TrustedBSD Project
p_can(...P_CAN_SEE...) to getpgid(), getsid(), and setpgid(),
blocking these operations on processes that should not be visible
by the requesting process. Required to reduce information leakage
in MAC environments.
Obtained from: TrustedBSD Project
from signal authorization checking.
o p_cansignal() takes three arguments: subject process, object process,
and signal number, unlike p_cankill(), which only took into account
the processes and not the signal number, improving the abstraction
such that CANSIGNAL() from kern_sig.c can now also be eliminated;
previously CANSIGNAL() special-cased the handling of SIGCONT based
on process session. privused is now deprecated.
o The new p_cansignal() further limits the set of signals that may
be delivered to processes with P_SUGID set, and restructures the
access control check to allow it to be extended more easily.
o These changes take into account work done by the OpenBSD Project,
as well as by Robert Watson and Thomas Moestl on the TrustedBSD
Project.
Obtained from: TrustedBSD Project
toggle the P_SUGID bit explicitly, rather than relying on it being
set implicitly by other protection and credential logic. This feature
is introduced to support inter-process authorization regression testing
by simplifying userland credential management allowing the easy
isolation and reproduction of authorization events with specific
security contexts. This feature is enabled only by "options REGRESSION"
and is not intended to be used by applications. While the feature is
not known to introduce security vulnerabilities, it does allow
processes to enter previously inaccessible parts of the credential
state machine, and is therefore disabled by default. It may not
constitute a risk, and therefore in the future pending further analysis
(and appropriate need) may become a published interface.
Obtained from: TrustedBSD Project
interfaces and functionality intended for use during correctness and
regression testing. Features enabled by "options REGRESSION" may
in and of themselves introduce security or correctness problems if
used improperly, and so are not intended for use in production
systems, only in testing environments.
Obtained from: TrustedBSD Project