SOCK_LOCK(so):
- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.
- Assert socket lock in MAC entry point implementations.
- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.
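The copy-then-release pattern for externalizing the label can be sketched in
userland C as follows. This is a minimal sketch, assuming a pthread mutex in
place of SOCK_LOCK(so); struct sock, struct label and externalize_label() are
hypothetical names, not the kernel's types or MAC entry points, and snprintf()
stands in for the copy to userspace.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; not the kernel's struct socket or struct label. */
struct label {
	char text[32];
};

struct sock {
	pthread_mutex_t lock;	/* plays the role of SOCK_LOCK(so) */
	struct label label;
};

/*
 * Snapshot the label while holding the lock, then do the slow formatting
 * (standing in for the copy to userspace) with the lock released.
 */
static int
externalize_label(struct sock *so, char *buf, size_t len)
{
	struct label copy;

	assert(buf != NULL && len > 0);

	pthread_mutex_lock(&so->lock);
	copy = so->label;		/* thread-local copy */
	pthread_mutex_unlock(&so->lock);

	return (snprintf(buf, len, "label=%s", copy.text));
}
```

The lock is held only for the structure copy, so a slow or faulting consumer
cannot stall other threads that need the socket lock.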
order definition for witness. Send lock before receive lock, and
socket locks after accept but before select:
filedesc -> accept -> so_snd -> so_rcv -> sellck
All routing locks after send lock:
so_rcv -> radix node head
All protocol locks before socket locks:
unp -> so_snd
udp -> udpinp -> so_snd
tcp -> tcpinp -> so_snd
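One way to read the first chain above is as a rank assigned to each lock, with
acquisitions required to go in increasing rank. The sketch below checks that
for a sequence of acquisitions. The enum and lock_order_ok() are illustrative
assumptions; witness itself tracks a more general order than a single linear
rank.

```c
#include <assert.h>
#include <stddef.h>

/* Ranks follow the chain filedesc -> accept -> so_snd -> so_rcv -> sellck. */
enum lock_rank {
	RANK_FILEDESC,
	RANK_ACCEPT,
	RANK_SO_SND,
	RANK_SO_RCV,
	RANK_SELLCK
};

/* Return 1 if the locks in seq[] are acquired in strictly increasing rank. */
static int
lock_order_ok(const enum lock_rank *seq, size_t n)
{
	size_t i;

	assert(seq != NULL || n == 0);
	for (i = 1; i < n; i++)
		if (seq[i] <= seq[i - 1])
			return (0);
	return (1);
}
```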
before calling sotryfree().
-- Body of earlier bulk commit this belonged with --
Log:
Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:
- Assert SOCK_LOCK(so) in macros that directly manipulate so_count:
soref(), sorele().
- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().
- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.
- In some cases, perform soisdisconnected() before sotryfree(), as
the opposite order could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.
- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
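The rule that the release path always drops the lock, whether or not it frees
the object, can be sketched in userland C. This is a minimal sketch under
stated assumptions: struct obj, obj_ref() and obj_rele() are hypothetical
names, a pthread mutex stands in for SOCK_LOCK(so), and the held flag exists
only so the assertions have something to check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct obj {
	pthread_mutex_t lock;
	int held;	/* assertion aid: nonzero while lock is held */
	int count;	/* reference count, protected by lock */
};

static struct obj *
obj_alloc(void)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (o == NULL)
		return (NULL);
	pthread_mutex_init(&o->lock, NULL);
	o->count = 1;
	return (o);
}

static void
obj_lock(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	o->held = 1;
}

static void
obj_unlock(struct obj *o)
{
	o->held = 0;
	pthread_mutex_unlock(&o->lock);
}

/* Take a reference; the caller must hold the lock. */
static void
obj_ref(struct obj *o)
{
	assert(o->held);
	o->count++;
}

/*
 * Drop a reference.  Called with the lock held; always returns with the
 * lock released.  Returns 1 if the object was freed.
 */
static int
obj_rele(struct obj *o)
{
	int last;

	assert(o->held);
	assert(o->count > 0);
	last = (--o->count == 0);
	obj_unlock(o);
	if (last) {
		pthread_mutex_destroy(&o->lock);
		free(o);
	}
	return (last);
}
```

Because the lock is gone after obj_rele() either way, any work that needs the
object has to happen before the call, which is the same reason given above for
doing soisdisconnected() first.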
Otherwise, the setting of the PG_M bit by one processor could be lost if
another processor is simultaneously changing the PG_W bit.
Reviewed by: tegge@
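The lost-update hazard described above is the usual argument for making the
read-modify-write of a shared word a single atomic operation. A minimal sketch
using C11 atomics follows. PTE_M and PTE_W and their values are illustrative
assumptions, not copied from any pmap header, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative bit values (assumptions). */
#define PTE_M	0x040u	/* "modified", may be set by another processor */
#define PTE_W	0x200u	/* "wired", changed by software */

/* Set the wired bit without disturbing concurrent updates to other bits. */
static void
pte_set_wired(_Atomic uint32_t *pte)
{
	atomic_fetch_or(pte, PTE_W);
}

/* Clear the wired bit the same way. */
static void
pte_clear_wired(_Atomic uint32_t *pte)
{
	atomic_fetch_and(pte, ~PTE_W);
}
```

A plain "*pte |= PTE_W" is a separate load and store, so a bit set by another
processor between the two can be overwritten with the stale value.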
rig a PREPEND macro for ALTQ as the POLL/DEQUEUE semantic is very bad in
terms of locking. We make this a full functional queue to allow "bulk
dequeue" which will further reduce the locking overhead (for non-altq
enabled devices). Drivers will access this via the following macros, which
will show up in <net/if_var.h> once we expose ALTQ to the build:
IFQ_DRV_DEQUEUE(ifq, m) - takes a mbuf off the queue (driver queue first)
IFQ_DRV_PREPEND(ifq, m) - pushes a mbuf back to the driver queue
IFQ_DRV_PURGE(ifq) - drops all packets in both queues
IFQ_DRV_IS_EMPTY(ifq) - checks for pending mbufs in either queue
One has to make sure that the first three are protected by a driver mutex.
At the moment most network drivers still require Giant, so this is not an
issue. Even those that have their own mutex usually hold it in if_start and
the like, so this requirement is almost always satisfied.
This evolved from a discussion with Andrew Gallatin.
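The driver-queue idea can be sketched as a small two-level queue: a shared
queue guarded by a lock, and a driver-side queue that is refilled in one
locked pass. This is a minimal sketch, not the IFQ_DRV_*() macros themselves.
struct pktq, struct ifq, and the drv_*() helpers are hypothetical, ints stand
in for mbuf pointers, and the driver mutex mentioned above is assumed to be
held by the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QCAP 16

struct pktq {
	int items[QCAP];
	size_t head, len;
};

struct ifq {
	pthread_mutex_t lock;	/* protects shared */
	struct pktq shared;	/* filled by the stack */
	struct pktq drv;	/* driver side, under the driver's own mutex */
};

static int
pktq_push_tail(struct pktq *q, int v)
{
	if (q->len == QCAP)
		return (-1);
	q->items[(q->head + q->len) % QCAP] = v;
	q->len++;
	return (0);
}

static int
pktq_push_head(struct pktq *q, int v)
{
	if (q->len == QCAP)
		return (-1);
	q->head = (q->head + QCAP - 1) % QCAP;
	q->items[q->head] = v;
	q->len++;
	return (0);
}

static int
pktq_pop(struct pktq *q, int *v)
{
	if (q->len == 0)
		return (-1);
	*v = q->items[q->head];
	q->head = (q->head + 1) % QCAP;
	q->len--;
	return (0);
}

/* Driver queue first; on a miss, move packets over in one locked pass. */
static int
drv_dequeue(struct ifq *ifq, int *v)
{
	int tmp;

	if (ifq->drv.len == 0) {
		pthread_mutex_lock(&ifq->lock);
		while (ifq->drv.len < QCAP &&
		    pktq_pop(&ifq->shared, &tmp) == 0)
			pktq_push_tail(&ifq->drv, tmp);
		pthread_mutex_unlock(&ifq->lock);
	}
	return (pktq_pop(&ifq->drv, v));
}

/* Push a packet back; only the driver queue is touched, so no lock. */
static int
drv_prepend(struct ifq *ifq, int v)
{
	return (pktq_push_head(&ifq->drv, v));
}

/* True only if both queues are empty. */
static int
drv_is_empty(struct ifq *ifq)
{
	int empty;

	pthread_mutex_lock(&ifq->lock);
	empty = (ifq->drv.len == 0 && ifq->shared.len == 0);
	pthread_mutex_unlock(&ifq->lock);
	return (empty);
}
```

With this shape the shared lock is taken once per batch instead of once per
packet, and a prepend never needs the shared lock at all.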
protect fields in the socket buffer. Add accessor macros to use the
mutex (SOCKBUF_*()). Initialize the mutex in soalloc(), and destroy
it in sodealloc(). In addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
that the command succeeded. Sheesh! This makes CDROMs no longer cause an
instant panic at boot. Thanks to Jake Burkholder for providing a remote
test setup.
Also make device resets work, thanks to another typo.
pass any traffic. Unfortunately this means no full-duplex link with auto-
negotiation on hme(4) using DP83840A PHYs again.
I really thought I had tested this also on a Netra t1 100...
- add locking
- disable ALTQ3_COMPAT by default (do not remove the code to keep the diff
towards KAME small)
- put some more code under ALTQ3 conditional compilation as it should be
- account for if_xname
- some more minor compile fixes
As people started wondering:
The strange path layout "altq/altq" is there to avoid "-Isys/contrib" and
make it "-Isys/contrib/altq" instead, as we will need at least <altq/altq.h>
and <altq/if_altq.h> for kernel compilation.
The "freebsd4_..." in the previous commit is just the best tag name in the
KAME tree I could find to classify this in order to track its history. It
does *not* mean that this will go to 4-STABLE or anything of that kind.
HEAD at this point). This will not exactly live in a vendor branch, but have
the vendor backing to make it easier to exchange diffs.
This will be followed by a diff which takes most of the .c files off the
vendor branch in order to:
- add locking
- disable ALTQ3_COMPAT code (which is outdated and "un-lockable")
There is work in progress to refine the configuration API. Import this "as
is" now to have more exposure time before 5-STABLE.
This is only the import, it will be some more days until you will actually
be able to compile ALTQ support into your kernel so don't hold your breath.
HEADUPs will be posted on current@ and net@ before this is actually enabled.
No-objection: re(scottl), core(rwatson)
ruleset, the pcb is looked up once per ipfw_chk() activation.
This is done by extracting the required information out of the PCB
and caching it to the ipfw_chk() stack. This should greatly reduce
PCB locking contention and speed up the processing of UID/GID based
firewall rules (especially with large UID/GID rulesets).
Some very basic benchmarks were taken which compare the number
of in_pcblookup_hash(9) activations to the number of firewall
rules containing UID/GID based constraints before and after this patch.
The results can be viewed here:
o http://people.freebsd.org/~csjp/ip_fw_pcb.png
Reviewed by: andre, luigi, rwatson
Approved by: bmilekic (mentor)
driver tries to submit the same request repeatedly, on finding the
controller cmd queue to be full.
Submitted by: ps, vkashyap
Reviewed by: re
Approved by: re
ifp->if_output() based on debug.mpsafenet. That way once bpfwrite()
can be called without Giant, it will acquire Giant (if desired) before
entering the network stack.
the need to synchronize access to the structure. I believe this
should fit into the stack under the necessary circumstances, but
if not we can either add synchronization or use a thread-local
malloc for the duration.
versions of various routers seen:
- Introduce igmp_mtx.
- Protect global variable 'router_info_head' and list fields
in struct router_info with this mutex, as well as
igmp_timers_are_running.
- find_rti() asserts that the caller acquires igmp_mtx.
- Annotate a failure to check the return value of
MALLOC(..., M_NOWAIT).
is the actual name here) on EBus and which are PCF8584 (on systems having
a boot-bus controller the i2c are said to not be a PCF8584). Similar to the
SUNW,envctrl devices, onboard slaves for monitoring fans, temperatures and
such hang off of these i2c devices. But there's also stuff like EEPROMs
housing the hostid of the system and the boards usually have a connector to
add custom slave devices (on CP1500 there's actually a second PCF8584 with
its own I2C bus for these).
This driver already works fine but I'm not yet sure if access to the slave
devices on CP1400/CP1500 marked as "reserved for factory use" in the docs
should be blocked (most likely these are the voltage controllers which aren't
meant to be controlled by software and not even by the firmware). Once the
issues with polled mode are fixed in the common pcf(4) part in pcf.c, this
front-end should probably honour the poll-mode property of the i2c devices.
Tested on Ultra AXe and CP1500 (Netra t1 100).
OK'ed by: joerg, nsouch
- Use "envctrl" as the name when registering this module rather than "pcf";
we can't have "pcf" as the name for all pcf(4) front-ends or we would get
conflicts.
OK'ed by: joerg
- s,pcf_,pcf_isa, to better reflect the purpose of this front-end and to
avoid conflicts.
- Don't use this front-end for attaching to EBus, declaring it as an EBus
driver was a cut&paste accident according to joerg.
OK'ed by: joerg, nsouch
global and allocated variables. This strategy is derived from work
originally developed by BSDi for BSD/OS, and applied to FreeBSD by Sam
Leffler:
- Add unp_mtx, a global mutex which will protect all UNIX domain socket
related variables, structures, etc.
- Add UNP_LOCK(), UNP_UNLOCK(), UNP_LOCK_ASSERT() macros.
- Acquire unp_mtx on entering most UNIX domain socket code,
drop/re-acquire around calls into VFS, and release it on return.
- Avoid performing sodupsockaddr() while holding the mutex, so in general
move to allocating storage before acquiring the mutex to copy the data.
- Make a stack copy of the xucred rather than copying out while holding
unp_mtx. Copy the peer credential out after releasing the mutex.
- Add additional assertions of vnode locks following VOP_CREATE().
A few notes:
- Use of an sx lock for the file list mutex may cause problems with regard
to unp_mtx when garbage collecting passed file descriptors.
- The locking in unp_pcblist() for sysctl monitoring is correct subject to
the unpcb zone not returning memory for reuse by other subsystems
(consistent with similar existing concerns).
- Sam's version of this change, as with the BSD/OS version, made use of
both a global lock and per-unpcb locks. However, in practice, the
global lock covered all accesses, so I have simplified out the unpcb
locks in the interest of getting this merged faster (reducing the
overhead but not sacrificing granularity in most cases). We will want
to explore possibilities for improving lock granularity in this code in
the future.
Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS 5 snapshot provided by BSDi
present and thus that the PnPBIOS probe should be skipped instead of
having ACPI zero out the PnPBIOStable pointer.
- Make the PnPBIOStable pointer static to i386/i386/bios.c now that that is
the only place it is used.
its primary use is for the FEPS/FAS366 SCSI found in Sun Ultra 1e and 2
machines. Once the pci front-end is ported, this driver can replace the
amd(4) driver.
The code as-is is fairly stable. I've disabled tagged-queueing until I can
figure out a corruption bug related to it. I'm importing it now so that
people with these machines can (finally) stop netbooting and report bugs
before 5.3.
as otherwise the junk it contains may cause uhub_explore to give
up without ever trying to restart the port. This fixes the following
errors I was seeing with a VIA UHCI controller:
uhub0: port error, restarting port 1
uhub0: port error, giving up port 1
(time grows downward)
thread 1 thread 2
------------|------------
dec ref_cnt |
| dec ref_cnt <-- ref_cnt now zero
cmpset |
free all |
return |
|
alloc again,|
reuse prev |
ref_cnt |
| cmpset, read
| already freed
| ref_cnt
------------|------------
This should fix that by performing only a single
atomic test-and-set that will serve to decrement
the ref_cnt, only if it hasn't changed since the
earlier read, otherwise it'll loop and re-read.
This forces ordering of decrements so that truly
the thread which did the LAST decrement is the
one that frees.
This is how atomic-instruction-based refcnting
should probably be handled.
Submitted by: Julian Elischer
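The loop described above (read the count, then decrement it with one
compare-and-set that only succeeds if the count has not changed, retrying
otherwise) can be sketched with C11 atomics. ref_release() is a hypothetical
name and this is not the mbuf code itself.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Drop one reference.  Returns 1 only in the caller whose compare-and-swap
 * took the count from 1 to 0, so exactly one caller goes on to free.
 */
static int
ref_release(atomic_uint *cnt)
{
	unsigned int old = atomic_load(cnt);

	do {
		assert(old > 0);
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(cnt, &old, old - 1));
	return (old == 1);
}
```

The decision to free is taken from the value the successful swap replaced, so
no thread reads the counter again after it may have been freed.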
internal reference counters, UMA_ZONE_NOFREE. This way, those slabs
(with their ref counts) will be effectively type-stable, then using
a trick like this on the refcount is no longer dangerous:
    MEXT_REM_REF(m);
    if (atomic_cmpset_int(m->m_ext.ref_cnt, 0, 1)) {
        if (m->m_ext.ext_type == EXT_PACKET) {
            uma_zfree(zone_pack, m);
            return;
        } else if (m->m_ext.ext_type == EXT_CLUSTER) {
            uma_zfree(zone_clust, m->m_ext.ext_buf);
            m->m_ext.ext_buf = NULL;
        } else {
            (*(m->m_ext.ext_free))(m->m_ext.ext_buf,
                m->m_ext.ext_args);
            if (m->m_ext.ext_type != EXT_EXTREF)
                free(m->m_ext.ref_cnt, M_MBUF);
        }
    }
    uma_zfree(zone_mbuf, m);
Previously, a second thread hitting the above cmpset might
actually read the refcnt AFTER it has already been freed. A very
rare occurrence. Now we'll know that it won't be freed, though.
Spotted by: julian, pjd
all of the interface between the driver and the bus. This will enable
us to stop special casing eisa bus attachments in modules and treat them
like we treat all other busses.
In the longer run, we need to eliminate much (all?) of these interfaces
and switch to using the standard bus_alloc_resource(), but that's not
done right now.
# I've not updated the modules to include eisa, etc, just yet
Tested on: Compaq Proliant 3000/333 purchased for eisa work
mode. The 5704 apparently has some s00p3r s33kr1t registers for setting
the advertisement of pause frame ability (i.e., flow control) when in
autoneg mode. If we don't set these registers correctly, we may not
be able to negotiate a proper link with some switches. (Symptom is that
the NIC reports the link as up (PCS synched) but no traffic can be
exchanged.)
PR: kern/67598
Add two new functions: ttyref() and ttyrel(). ttymalloc() creates a struct
tty with a reference count of one. When ttyrel() sees the count go to zero,
the struct tty is freed.
Hold references for open ttys and for ttys which are controlling terminal
for sessions.
Until drivers start using ttyrel(), this commit will make no difference.
different cards that matched vendor/id, but weren't wi cards. This is
because the vendor foolishly didn't have unique product ids. Symbol
has a serial card that would otherwise match the wi driver, for
example...
Taken from a patch for xe posted by: Carlos Velasco
recursive entering of the socket code from the routing code:
- Modify rt_dispatch() to bundle up the sockaddr family, if any,
associated with a pending mbuf to dispatch to routing sockets, in
an m_tag on the mbuf.
- Allocate NETISR_ROUTE for use by routing sockets.
- Introduce rtsintrq, an ifqueue to be used by the netisr, and
introduce rts_input(), a function to unbundle the tagged sockaddr
and inject the mbuf and address into raw_input(), which previously
occurred in rt_dispatch().
- Introduce rts_init() to initialize rtsintrq, its mutex, and
register the netisr. Perform this at the same point in system
initialization as setup of the domains.
This change introduces asynchrony between the generation of a
pending routing socket message and delivery to sockets for use
by userspace. It avoids socket->routing->rtsock->socket use and
helps to avoid lock order reversals between the routing code and
socket code (in particular, raw socket control blocks), as route
locks are held over calls to rt_dispatch().
Reviewed by: "George V. Neville-Neil" <gnn@neville-neil.com>
Conceptual head nod by: sam
pmap_extract() already does it.
In pmap_enter(), opa has already been masked so don't do it again.
Wrap a long line (recent transgression).
Use trunc_page() in pmap_mapdev() instead of anding with PG_FRAME, since
that is what we really meant.
Submitted by: alc (first item)
- export the rest of the cpu features (and amd's features).
- turn on EFER_NXE, depending on the NX amd feature bit
- reorg the identcpu stuff a bit in order to stop treating the
amd features as second class features (since it is now a primary feature
bit set) and make it easier to export.
lives in the top 12 'available' bits. atop() in the PHYS_TO_VM_PAGE()
macro only masks off the lower bits (by accident) and the upper bits
in the 64 bit ptes turn into "interesting" index values.
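The effect of shifting a 64-bit pte without first masking it down to the
frame bits can be shown in a few lines. This is a minimal sketch: FRAME_MASK
and the bit layout are stated assumptions, and the two helpers are
hypothetical names rather than the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT	12
/* Assumed layout: bits 12..51 hold the frame, bits 52..63 hold flags. */
#define FRAME_MASK	0x000ffffffffff000ULL

/* Shifting alone drops the low flag bits but keeps the high ones. */
static uint64_t
page_index_unmasked(uint64_t pte)
{
	return (pte >> PAGE_SHIFT);
}

/* Mask to the frame bits first, then shift. */
static uint64_t
page_index(uint64_t pte)
{
	return ((pte & FRAME_MASK) >> PAGE_SHIFT);
}
```

For a pte with bit 63 set and frame number 5, page_index() yields 5 while
page_index_unmasked() yields 5 plus 2^51, which is the kind of "interesting"
index described above.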
pmap_remove() would be called with a huge range and we'd stride across
it in only 2MB chunks. This would manifest as massive cpu time and a
largely unresponsive system during hard swap. Instead, check the higher
page directories which means we can run pmap_remove() in just a few
hundred loop iterations instead of millions since we can process
address space in chunks of 512GB and 1GB as well as 2MB.
Eternal thanks to: tmm
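The saving from checking the higher levels first can be illustrated with a
toy model that only counts loop iterations. Everything here is an assumption
made for illustration: struct toy_as models an address space with a single
mapped region, and block_present() and sweep() are hypothetical helpers, not
pmap code.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M	(1ULL << 21)
#define SZ_1G	(1ULL << 30)
#define SZ_512G	(1ULL << 39)

/* Toy address space: only [map_start, map_end) has any mappings. */
struct toy_as {
	uint64_t map_start, map_end;
};

/* Does the aligned block of this size containing va overlap a mapping? */
static int
block_present(const struct toy_as *as, uint64_t va, uint64_t size)
{
	uint64_t base = va & ~(size - 1);

	return (base < as->map_end && base + size > as->map_start);
}

/* Count the loop iterations needed to sweep [sva, eva). */
static uint64_t
sweep(const struct toy_as *as, uint64_t sva, uint64_t eva, int hierarchical)
{
	uint64_t iters = 0, step, va = sva;

	assert(sva <= eva);
	while (va < eva) {
		iters++;
		if (hierarchical && !block_present(as, va, SZ_512G))
			step = SZ_512G;
		else if (hierarchical && !block_present(as, va, SZ_1G))
			step = SZ_1G;
		else
			step = SZ_2M;
		/* Advance to the next boundary of the chosen size. */
		va = (va & ~(step - 1)) + step;
	}
	return (iters);
}
```

Sweeping 2TB with one mapped 2MB region takes 1048576 iterations at a fixed
2MB stride and 1026 when empty 512GB and 1GB blocks are skipped.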
when I reordered events in accept1() to allocate a file descriptor
earlier, I didn't properly update use of goto on exit to unwind for
cases where the file descriptor is now held, but wasn't previously.
The result was that, in the event of accept() on a non-blocking socket,
or in the event of a socket error, a file descriptor would be leaked.
This ended up being non-fatal in many cases, as the file descriptor
would be properly GC'd on process exit, so only showed up for processes
that do a lot of non-blocking accept() calls, and also live for a long
time (such as qmail).
This change updates the use of goto targets to do additional unwinding.
Eyes provided by: Brian Feldman <green@freebsd.org>
Feet, hands provided by: Stefan Ehmann <shoesoft@gmx.net>,
Dimitry Andric <dimitry@andric.com>
Arjan van Leeuwen <avleeuwen@piwebs.com>
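The class of bug described above, a resource reserved earlier than before
without the error exits being updated to release it, can be sketched with a
single unwind label. This is a toy model: toy_accept(), fd_reserve() and
fd_release() are hypothetical, and a counter stands in for the file
descriptor table.

```c
#include <assert.h>
#include <errno.h>

/* Toy counter standing in for the file descriptor table. */
static int fds_in_use;

static void
fd_reserve(void)
{
	fds_in_use++;
}

static void
fd_release(void)
{
	assert(fds_in_use > 0);
	fds_in_use--;
}

/*
 * Accept-like routine.  The descriptor is reserved first, so every failure
 * after that point must leave through the unwind label.
 */
static int
toy_accept(int nonblocking, int queue_empty, int so_error)
{
	int error;

	fd_reserve();
	if (nonblocking && queue_empty) {
		error = EWOULDBLOCK;
		goto unwind;
	}
	if (so_error != 0) {
		error = so_error;
		goto unwind;
	}
	return (0);		/* success: the caller keeps the descriptor */
unwind:
	fd_release();
	return (error);
}
```

A plain "return (error)" in either failure branch would leave fds_in_use one
higher on every failed call, which is the leak pattern described above.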
which doesn't support ACPI power states. Return AE_NOT_FOUND for these
cases and don't print the warning message. Also, print the name of the
handle instead of device when unable to switch states. The device is often
not attached at this point and so its name is NULL, which doesn't help
debugging.
is generic to any threading system. This commit does not link this
file to the build yet, nor does it remove these functions from their
current location in kern_thread.c. (that commit coming up after further review)
Dividing by 0 in order to check for irq13/exception16 delivery apparently
always causes an irq13 even if we have configured for exception16 (by
setting CR0_NE). This was expected, but the timing of the irq13 was
unexpected. Without CR0_NE, the irq13 is delivered synchronously at
least on my test machine, but with CR0_NE it is delivered a little
later (about 250 nsec) in PIC mode and much later (5000-10000 nsec)
in APIC mode. So especially in APIC mode, the irq13 may arrive after
it is supposed to be shut down. It should then be masked, but the
shutdown is incomplete, so the irq goes to a null handler that just
reports it as stray. The fix is to wait a bit after dividing by 0 to
give a good chance of the irq13 being handled by its proper handler.
Removed the hack that was supposed to recover from the incomplete shutdown
of irq13. The shutdown is now even more incomplete, or perhaps just
incomplete in a different way, but the hack now has no effect because
irq13 is edge triggered and handling of edge triggered interrupts is
now optimized by skipping their masking. The hack only worked due
to it accidentally not losing races.
The incomplete shutdown of irq13 still allows unprivileged users to
generate a stray irq13 (except on systems where irq13 is actually used)
by unmasking an npx exception and causing one. The exception gets
handled properly by the exception 16 handler. A spurious irq13 is
delivered asynchronously but is harmless (as in the probe) because it
is almost perfectly not handled by the null interrupt handler.
Perfectly not handling it involves mainly not resetting the npx busy
latch. This prevents further irq13's despite them not being masked in
the [A]PIC.
with an ASUS A7N8X-E motherboard in APIC mode, since storming interrupts
don't repeat immediately. Use DELAY(1) to wait a bit for them to repeat.
This affects all systems. Only delay for the first
(10 * intr_storm_threshold) interrupts (per interrupt handler) so that
this is only a pessimization while warming up. Throttle after calling
the sub-handlers instead of before so that the long delay given by
throttling can be used instead of the DELAY(1) to detect storms after
warming up.
Reduced the throttling period from 1/10 second to 1/hz seconds so that
throttling doesn't destroy performance so much. Interrupts that are
detected as storming are effectively handled by polling at a frequency
of hz Hz. On A7N8X-E's there is another hardware or configuration bug
that makes the throttled frequency closer to 2*hz Hz.
tree, output an empty string instead of "?". This is already what
happened with DEVICE_SYSCTL_LOCATION and DEVICE_SYSCTL_PNPINFO. This
makes the output of "sysctl dev" much nicer (it won't display those
empty sysctls).
Reviewed by: des
"stray irq 9" messages on my Thinkpad. It may also help with general
reboot consistency although the recent hang on reboot was solved by
acpi_cpu.c rev 1.39.
after. Unify the paths for all Cx states. Remove cpu_idle_busy and
instead do the little profiling we need before re-enabling interrupts.
Use 1 quantum as estimate for C1 sleep duration since the timer interrupt
is the main reason we wake.
While here, change the cx_history sysctl to cx_usage and report statistics
for which idle states were used in terms of percent. This seems more
intuitive than counters. Remove the cx_stats structure since it's no
longer used. Update the man page.
Change various types which do not need explicit size.
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.
called ttyldoptim().
Use this function from all the relevant drivers.
I believe no drivers finger linesw[] directly anymore, paving the way for
locking and refcounting.
exactly as done in the cmi driver. I am quite confident this is
safe since I have been running this for more than two weeks now, on an SMP
box. A few people tested this patch for me successfully as well.
<sys/linedisc.h> (repocopied).
Temporarily use a nested include from <sys/tty.h> to get <sys/linedisc.h>
into relevant source files.
Introduce a set of inline functions named ttyld_...() to invoke
linedisc methods instead of groping around in the linesw array.
class variables in addition to per-device variables. In plain English,
this means that dev.foo0.bar is now called dev.foo.0.bar, and it is
possible to have dev.foo.bar as well.
- In subr_ndis.c, my_strcasecmp() actually behaved like my_strncasecmp():
we really need it to behave like the former, not the latter. (It was
falsely matching "RadioEnable", which defaults to 1, with "RadioEnableHW",
which the driver creates itself and sets to 0, because we were using
strlen("RadioEnable") as the length to test. This caused the radio to
always be turned off. :( )
- In if_ndis.c, only set IEEE80211_CHAN_A for channels if we actually
set any IEEE80211_MODE_11A rates. (ieee80211_attach() will "helpfully"
add IEEE80211_MODE_11A to ic_modecaps for you if you initialize any
802.11a channels. This caused "ndis0: 11a rates:" to erroneously be
displayed during driver load.)
- Also in if_ndis.c, when using TESTSETRATE() to add in any missing 802.11b
rates, remember to OR the rates with IEEE80211_RATE_BASIC, otherwise
comparing against existing basic rates won't match. (1, 2, 5.5 and
11Mbps are basic rates, according to the 802.11b spec.) This erroneously
caused 11Mbps to be added to the 11b rate list twice.
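The first item above, a comparison bounded by strlen() of the key behaving
like a prefix match, is easy to reproduce with the standard library
functions. A minimal sketch; match_prefix() and match_exact() are
hypothetical names, not the subr_ndis.c routines.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>

/* Buggy shape: comparing only strlen(key) bytes accepts any longer name. */
static int
match_prefix(const char *key, const char *name)
{
	assert(key != NULL && name != NULL);
	return (strncasecmp(key, name, strlen(key)) == 0);
}

/* Intended shape: whole-string, case-insensitive equality. */
static int
match_exact(const char *key, const char *name)
{
	assert(key != NULL && name != NULL);
	return (strcasecmp(key, name) == 0);
}
```

With key "RadioEnable" and name "RadioEnableHW", match_prefix() reports a
match and match_exact() does not.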
double NULL entries signal Witness to stop processing the array of
order entries meaning none of the spin locks are added resulting in
panics on boot.
- Add a missing NULL, NULL terminator to the Slip locks list to keep them
separate from the spin locks.
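Both failure modes above follow from how a sentinel-terminated table of lists
is walked. The sketch below uses a simplified table of names in which one
NULL ends a list and two NULLs in a row end the whole table. count_lists()
is hypothetical, and the real witness table carries more than a name per
entry.

```c
#include <assert.h>
#include <stddef.h>

/* One NULL ends a list; two NULLs in a row end the whole table. */
static size_t
count_lists(const char *const *tbl)
{
	size_t n = 0;

	assert(tbl != NULL);
	while (*tbl != NULL) {
		n++;
		while (*tbl != NULL)
			tbl++;
		tbl++;		/* step over the list terminator */
	}
	return (n);
}
```

A stray extra terminator in the middle ends the walk early, so later lists
are never seen, and a missing terminator merges two lists into one.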
that m_prepend() is not called with possibility to wait while the
pcb lock is held. What still needs revisiting is whether the
ripcbinfo lock is really required here.
Discussed with: rwatson
relationships:
Sockets: filedesc->accept->sellck
Routing: radix node head->rtentry->ifaddr
UDP: udp->udpinp
TCP: tcp->tcpinp
SLIP: slip_mtx->slip sc_mtx
Drop in a place holder section for UNIX domain sockets. Various
sections to be expanded over the next few days.
sysctls were global (hw.fxp_rnr and hw.fxp_noflow), all of them are
now per-device. Sample output of "sysctl dev.fxp0" with this patch,
with the standard %foo nodes removed:
dev.fxp0.int_delay: 1000
dev.fxp0.bundle_max: 6
dev.fxp0.rnr: 0
dev.fxp0.noflow: 0
little/big endian fashion, so that network drivers can just reference
the standard implementation and don't have to bring their own.
As discussed on arch@.
Obtained from: NetBSD
protect the registers so it was trivially possible for a sync command and
i/o command to fight each other and confuse the controller. Make the
sync fib alloc/release functions inline and remove the somewhat worthless
AAC_SYNC_LOCK_FORCE flag. Thanks to Adil Katchi for helping me to track
this down in RELENG_4.