The counters are incremented when a thread goes to sleep and decremented
either when a thread is woken up by another thread or when the sleep
times out. There existed a race in which the sleep count could be decremented
twice, resulting in an eventual underflow.
Move the decrementing of the "counters" to the thread initiating the sleep
and thus remedy the problem.
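Roughly, the fixed pattern looks like this (a sketch only; lock, variable,
and wmesg names are illustrative, based on the msleep()-based wait code
described further below):

    mtx_lock(&mbuf_mtx);
    m_mballoc_wid++;                /* one more sleeper */
    (void)msleep(&m_mballoc_wid, &mbuf_mtx, PVM, "mbwait", mbuf_wait);
    m_mballoc_wid--;                /* the sleeper itself decrements, whether
                                       it was woken up or simply timed out */
    mtx_unlock(&mbuf_mtx);

Because only the sleeping thread touches the counter on the way out, a
wakeup racing a timeout can no longer decrement it twice.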
Make sure that the mbuf allocation routines are being called safely. Since we drop our relevant
mbuf mutex and acquire Giant before we call kmem_malloc(), we have
to make sure that this does not pave the way for a fatal lock order
reversal. Check that either Giant is already held (in which case it's safe
to grab it again and recurse on it) or, if Giant is not held, that no
other locks are held before we try to acquire Giant.
Similarly, add a KASSERT, valid only in the WITNESS case, in m_reclaim() to
nail callers who end up in m_reclaim() while holding a lock.
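A hedged sketch of both checks; held_lock_count() here is a hypothetical
stand-in for however WITNESS exposes the caller's list of held locks:

    /* Before dropping the mbuf mutex and acquiring Giant: */
    KASSERT(mtx_owned(&Giant) || held_lock_count(curproc) == 0,
        ("mbuf allocation: acquiring Giant would reverse lock order"));

    /* In m_reclaim(), effective only when WITNESS is compiled in: */
    #ifdef WITNESS
    KASSERT(held_lock_count(curproc) == 0,
        ("m_reclaim() called with locks held"));
    #endif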
Pointed out by: jhb
Release the mbuf free-list (mmbfree) mutex before calling
m_reclaim() and re-acquire it when m_reclaim() returns. This means that
we now call the drain routines without holding the mutex lock and
recursing into it. This was done mainly for two reasons:
(i) Avoid the long recursion; long recursions are typically bad, and this
is the case here because we block all other code from freeing mbufs
if they need to. Doing that is counter-productive, since we're
really hoping that someone will free mbufs for us.
(ii) More importantly, avoid a potential lock order reversal. Right now,
not all the locks have been added to our networking code; but
without this change, we're introducing the possibility for deadlock.
Consider for example ip_drain(). We will likely eventually introduce
a lock for ipq there, and so ip_freef() will be called with ipq lock
held. But, ip_freef() calls m_freem() which in turn acquires the
mmbfree lock. Since we were previously calling ip_drain() with mmbfree
held, our lock order would be: mmbfree->ipq->mmbfree. Some other code
may very well lock ipq first and then call ip_freef(). This would
result in the regular lock order, ipq->mmbfree. Clearly, we have
deadlock if one thread acquires the ipq lock and sits waiting for
mmbfree while another thread calling m_reclaim() acquires mmbfree
and sits waiting for the ipq lock.
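Sketched, the new retry path looks something like this (assuming a single
mmbfree mutex; names approximate):

    mtx_unlock(&mmbfree_mtx);   /* drop the allocator lock...           */
    m_reclaim();                /* ...so ip_drain() et al. run unlocked */
    mtx_lock(&mmbfree_mtx);     /* re-acquire before retrying the       */
                                /* allocation from the free list        */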
Add a comment above m_reclaim()'s definition briefly explaining this,
and document it above the call to m_reclaim() in m_mballoc_wait().
Suggested and reviewed by: alfred
Rename the mbuf allocation flag M_WAIT to M_TRYWAIT. This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that allocations made with
M_(TRY)WAIT can also fail and must be checked.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
  malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
  value of the M_WAIT flag, this could have become a big problem.
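For illustration, a hedged sketch of the intended usage after this change
(M_TEMP and the sizes here are arbitrary choices):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>
    #include <sys/malloc.h>

    static int
    example_alloc(void)
    {
        struct mbuf *m;
        void *p;

        /* M_TRYWAIT blocks up to kern.ipc.mbuf_wait ticks, but may fail: */
        MGETHDR(m, M_TRYWAIT, MT_DATA);
        if (m == NULL)
            return (ENOBUFS);   /* must be checked, even with M_TRYWAIT */

        /* kernel malloc() has its own flag namespace: M_WAITOK, not M_WAIT */
        p = malloc(128, M_TEMP, M_WAITOK);
        free(p, M_TEMP);
        m_freem(m);
        return (0);
    }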
Make it possible to adjust the number of ext_buf counters that are possibly allocatable.
Do this because:
(i) It will make it easier to influence EXT_COUNTERS for if_sk,
if_ti (or similar) users where the driver allocates its own
ext_bufs and where it is important for the mbuf system to take
it into account when reserving necessary space for counters.
(ii) Facilitate some percentile calculation for netstat(1)
to accommodate the changes.
Here's a list of things that have changed (I may have left out a few); for a
relatively complete list, see http://people.freebsd.org/~bmilekic/mtx_journal
* Remove old (once useful) mcluster code for MCLBYTES > PAGE_SIZE which
nobody uses anymore. It was great while it lasted, but now we're moving
onto bigger and better things (Approved by: wollman).
* Practically rewrote the allocation macros in sys/sys/mbuf.h to accommodate
  new allocations which grab the necessary lock.
* Make sure that necessary mbstat variables are manipulated with
  corresponding atomic() routines (see the sketch after this list).
* Changed the "wait" routines, cleaned them up, and made one routine that
  does the job.
* Generalized MWAKEUP() macro. Got rid of m_retry and m_retryhdr, as they
are now included in the generalized "wait" routines.
* Sleep routines now use msleep().
* Free lists have locks.
* etc... probably other stuff I'm missing...
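For instance, a statistics bump in the new code looks roughly like this
(a sketch):

    #include <machine/atomic.h>

    atomic_add_long(&mbstat.m_drops, 1);  /* safe without the free-list lock */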
Things to look out for and work on later:
* find a better way to (dynamically) adjust EXT_COUNTERS
* move necessity to recurse on a lock from drain routines by providing
lock-free lower-level version of MFREE() (and possibly m_free()?).
* check out the include of mutex.h in sys/sys/mbuf.h - probably violating
  general philosophy here.
The code has been reviewed quite a bit, but problems may arise... please
don't panic! Send me email: bmilekic@freebsd.org
Reviewed by: jlemon, cp, alfred, others?
Replace the mbuf external storage reference counting code with something
that should be better.
The old code counted references to mbuf clusters by using the offset
of the cluster from the start of memory allocated for mbufs and
clusters as an index into an array of chars, which did the reference
counting. If the external storage was not a cluster then reference
counting had to be done by the code using that external storage.
NetBSD's system of linked lists of mbufs was considered, but Alfred
felt it would have locking issues when the kernel was made more
SMP friendly.
The system implemented uses a pool of unions to track external
storage. The union contains an int for counting the references and
a pointer for forming a free list. The reference counts are
incremented and decremented atomically and so should be SMP friendly.
This system can track reference counts for any sort of external
storage.
Access to the reference counting stuff is now through macros defined
in mbuf.h, so it should be easier to make changes to the system in
the future.
The possibility of storing the reference count in one of the
referencing mbufs was considered, but was rejected because it would
often leave extra mbufs allocated. Storing the reference count in
the cluster was also considered, but because the external storage
may not be a cluster this isn't an option.
The size of the pool of reference counters is available in the
stats provided by "netstat -m".
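A sketch of the scheme; the field and macro names approximate those added
to mbuf.h, but treat this as illustrative:

    union mext_refcnt {
        union mext_refcnt *next_ref;  /* free-list linkage when unused */
        u_long             refcnt;    /* the count when in use         */
    };

    /* roughly how the mbuf.h accessor macros behave: */
    #define MEXT_ADD_REF(m) atomic_add_long(&(m)->m_ext.ref_cnt->refcnt, 1)
    #define MEXT_REM_REF(m) atomic_subtract_long(&(m)->m_ext.ref_cnt->refcnt, 1)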
PR: 19866
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: alfred (glanced at by others on -net)
The variables "m_mclalloc_wid" and "m_mballoc_wid" were not in the
proper place. They should have been global and in uipc_mbuf.c, not in
mbuf.h and duplicated locally in each file that includes mbuf.h.
Sorta bug fix:
In mbuf.h, the definitions of various things for the KERNEL and
non-KERNEL cases were very screwy. This fixes all of it that I could
find.
Add the mbuf allocation wait code. This means that running out of mbuf space isn't a panic anymore, and code
which runs out of network memory will sleep to wait for it.
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: green, wollman
The m_copy* routines do not return writable copies of the data,
because in the case of mbuf clusters they only increment the reference
count rather than actually copying the data.
Add comments to this effect, and add a new routine called m_dup() that
returns a real, writable copy of an mbuf chain.
This is preliminary work required for implementing 'ipfw tee'.
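The practical difference, sketched:

    struct mbuf *ro, *rw;

    ro = m_copym(m, 0, M_COPYALL, M_DONTWAIT); /* clusters shared: read-only */
    rw = m_dup(m, M_DONTWAIT);                 /* real copy: safe to modify  */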
Reviewed by: julian
Move the NMBCLUSTERS tunable initialisation from the machine-dependent files
into uipc_mbuf.c. This reduces three sets of identical tunable code to
one set, and puts the initialisation with the mbuf code proper.
Make NMBUFs tunable as well.
Move the nmbclusters sysctl here as well.
Move the initialisation of maxsockets from param.c to uipc_socket2.c,
next to its corresponding sysctl.
Use the new tunable macros for the kern.vm.kmem.size tunable (this should have
been in a separate commit, whoops).
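The tunable macros in question look roughly like this (a sketch; the exact
macro name for this vintage may differ):

    /* pick up "kern.ipc.nmbclusters" from the loader environment at boot: */
    TUNABLE_INT_DECL("kern.ipc.nmbclusters", NMBCLUSTERS, nmbclusters);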
Instead of using SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.
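A sketch of the interface (the exact signature may differ; all "mykthread"
names are made up):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/kthread.h>
    #include <sys/proc.h>

    static int mykthread_wchan;

    static void
    mykthread_main(void *arg)
    {
        for (;;)
            tsleep(&mykthread_wchan, PWAIT, "mykt", hz);
    }

    static void
    mykthread_init(void *unused)
    {
        struct proc *p;

        /* created on the fly, rather than at compile time via SYSINIT_KT() */
        if (kthread_create(mykthread_main, NULL, &p, "mykthread"))
            panic("could not create mykthread");
    }
    SYSINIT(mykthread, SI_SUB_KTHREAD_IDLE, SI_ORDER_ANY, mykthread_init, NULL);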
One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.
This makes it possible to change the sysctl tree at runtime.
* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.
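For example, a KLD can now simply declare its nodes and have them appear
and disappear with the module (a sketch; the node name is made up):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    static int mymod_verbose = 0;
    SYSCTL_INT(_debug, OID_AUTO, mymod_verbose, CTLFLAG_RW,
        &mymod_verbose, 0, "example node registered at module load time");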
Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
If an mbuf cluster allocation fails and
the alloc is not M_DONTWAIT, then panic with "Out of mbuf clusters".
Callers that specify M_WAIT can't deal with getting a NULL buffer, so this
is a more graceful failure than randomly page faulting in the socket code
or elsewhere.
malloc() is not used directly here, but kmem_malloc() is used and it takes the same "flags" as
malloc().
Use the mbuf allocation "flags" M_WAIT and M_DONTWAIT consistently.
There is really only one boolean flag, M_DONTWAIT, but the "flags"
were always treated as enum-like values, except in some places here
where the values are tacitly converted to boolean flags. Treat
them as enum-like values everywhere, except where we tacitly assume
that there are only two values in order to convert them to the
corresponding two kmem_malloc() "flags".
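The tacit two-value conversion mentioned above amounts to something like
this (a sketch, where "how" is the caller's mbuf wait flag):

    vm_offset_t p;

    p = kmem_malloc(mb_map, MCLBYTES,
        (how == M_WAIT) ? M_WAITOK : M_NOWAIT);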
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
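The trick, for the record (M_FOOBUF is a made-up example type):

    /* file-scope malloc type, no longer visible outside this file: */
    static MALLOC_DEFINE(M_FOOBUF, "foobuf", "example driver buffers");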
A couple of finer points by: bde
Create a new branch of the kernel MIB, kern.ipc, to store
all of the configurables and instrumentation related to
inter-process communication mechanisms. Some variables,
like mbuf statistics, are instrumented here for the first
time.
For mbuf statistics: also keep track of m_copym() and
m_pullup() failures, and provide for the user's inspection
the compiled-in values of MSIZE, MHLEN, MCLBYTES, and MINCLSIZE.
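Exporting such instrumentation typically looks like this (a sketch; the
mbstat export under kern.ipc is the relevant case here):

    SYSCTL_STRUCT(_kern_ipc, KIPC_MBSTAT, mbstat, CTLFLAG_RD,
        &mbstat, mbstat, "mbuf statistics");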
Allow the allocation of mbuf
clusters greater than one page in length by calling contigmalloc1().
This uses a helper process `mclalloc' to do the allocation if
the system runs out at interrupt time to avoid calling contigmalloc
at high spl. It is not yet clear to me whether this works.
Fix a panic that could occur
when allocating memory for network buffers at interrupt time. This is due
to inadequate checking for the new mcl_map. Fixed by merging mb_map and
mcl_map into a single mb_map.
Reviewed by: wollman
Switch the CVS $Id$ keyword to $FreeBSD$.
This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.
Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
This is something of a blast from
the past, since it returns to the old system of allocating mbufs out of
a private area rather than using the kernel malloc(). While this may seem
like a backwards step to some, the new allocator is some 20% faster than
the old one and has much better caching properties.
Written by: John Wroclawski <jtw@lcs.mit.edu>