1
0
mirror of https://git.FreeBSD.org/src.git synced 2024-12-21 11:13:30 +00:00
Commit Graph

177 Commits

Author SHA1 Message Date
Marius Strobl
0d13d5fce2 Given that as of r258002 the last external user is gone, make sched_lock
static.
2014-04-29 20:51:57 +00:00
Mark Johnston
7dba784986 The arguments to sched:::off-cpu are the thread and associated process of
the thread selected to run, not the currently running thread. This fix has
already been made for ULE in r252070.

PR:		177706
MFC after:	1 week
2013-12-29 17:08:30 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Andriy Gapon
785797c341 rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST
Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
  invoked depending on its relative order with scheduler; this was not obvious
  and the bug actually used to exist

Reviewed by:	kib (ealier version)
MFC after:	14 days
2013-07-24 09:45:31 +00:00
Edward Tomasz Napierala
36af98697d Add CPU percentage limit enforcement to RCTL. The resouce name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.
2012-10-26 16:01:08 +00:00
John Baldwin
ba96d2d816 Mark the idle threads as non-sleepable and also assert that an idle
thread never blocks on a turnstile.
2012-08-22 20:01:38 +00:00
Alexander Motin
37f4e0254f Some more minor tunings inspired by bde@. 2012-08-11 20:24:39 +00:00
Alexander Motin
579895df01 Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
 - restore (4BSD) and implement (ULE) hogticks variable setting;
 - make sched_rr_interval() more tolerant to options;
 - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
 - tune some sysctl descriptions;
 - make some style fixes.
2012-08-10 19:02:49 +00:00
Alexander Motin
3d7f41175d Rework r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.

Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.

On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.

Reviewed by:	fabient
MFC after:	2 months
Sponsored by:	iXsystems, Inc.
2012-08-09 19:26:13 +00:00
Alexander Motin
48317e9e27 SCHED_4BSD scheduling quantum mechanism appears to be broken for some time.
With switchticks variable being reset each time thread preempted (that is
done regularly by interrupt threads) scheduling quantum may never expire.
It was not noticed in time because several other factors still regularly
trigger context switches.

Handle the problem by replacing that mechanism with its equivalent from
SCHED_ULE called time slice. It is effectively the same, just measured in
context of stathz instead of hz. Some unification is probably not bad.
2012-08-09 18:09:59 +00:00
Sergey Kandaurov
2aaae99d96 Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP. 2012-05-15 10:58:17 +00:00
Ryan Stone
b3e9e682cf Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives.  Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire.  These probes are defined
to fire when Solaris-specific features perform certain actions.  As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change.  These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument.  As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after:	2 weeks
2012-05-15 01:30:25 +00:00
John Baldwin
44ad547522 Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after:	2 weeks
2012-03-08 19:41:05 +00:00
John Baldwin
7e3a96ea37 Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
  for the very first context switch on APs during boot.  This avoids a
  small gap between the middle of thread_exit() and sched_throw() where
  time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
  to mi_switch() introduced by td_rux so that the code once again matches
  the comment claiming it is mimicing mi_switch().  Specifically, only
  update the per-thread stats directly and depend on ruxagg() to update
  p_rux rather than adjusting p_rux directly.  While here, move the
  timestamp bookkeeping as late in the function as possible.

Reviewed by:	bde, kib
MFC after:	1 week
2012-01-03 21:03:28 +00:00
Ed Schouten
6472ac3d8a Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
2011-11-07 15:43:11 +00:00
Xin LI
cd39bb098e Fix format strings for KTR_STATE in 4BSD ad ULE schedulers.
Submitted by:	Ivan Klymenko <fidaj@ukr.net>
PR:		kern/159904, kern/159905
MFC after:	2 weeks
Approved by:	re (kib)
2011-08-26 18:00:07 +00:00
Attilio Rao
a38f1f263b Remove pc_cpumask and pc_other_cpus usage from MI code.
Tested by:	pluknet
2011-06-13 13:28:31 +00:00
Attilio Rao
61b926921f MFC 2011-05-31 21:22:44 +00:00
Nathan Whitehorn
d098f93019 On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.

Reviewed by:	jhb
2011-05-31 15:11:43 +00:00
Attilio Rao
a0a43452ae Merge r221285 from largeSMP project:
- Remove the following sysctl:
  kern.sched.ipiwakeup.onecpu
  kern.sched.ipiwakeup.htt2

  Because they are absolutely obsolete.  Probabilly the whole wakeup
  forward mechanism should be revisited for a better fitting in modern
  hw, in the future.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
  arguments passing.

Reviewed by:	julian
Tested by:	several
2011-05-17 22:14:00 +00:00
Attilio Rao
d59dd76c22 Merge r221278 from largeSMP project:
idle_cpus_mask is just used in sched_4bsd, thus make it private for it.

Tested by:	several
2011-05-16 23:20:12 +00:00
Attilio Rao
71a19bdc64 Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
  different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
  accessed avoiding migration, because the size of cpuset_t should be
  considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
  primirally done in order to leave some more space in userland to cope
  with KBI extensions). If you need to access kernel cpuset_t from the
  userland please refer to example in this patch on how to do that
  correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by:	pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by:	jeff, jhb, sbruno
2011-05-05 14:39:14 +00:00
Attilio Rao
f0283a735c - Remove the following sysctl:
kern.sched.ipiwakeup.onecpu
  kern.sched.ipiwakeup.htt2

  Because they are absolutely obsolete. Probabilly the whole wakeup
  forward mechanism should be revisited for a better fitting in modern
  hw.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
  arguments passing

Approved by:	julian
2011-04-30 23:28:07 +00:00
Attilio Rao
3121f5347e idle_cpus_mask is just used in the SMP case and within sched_4BSD.
Declare appropriately.
2011-04-30 22:30:18 +00:00
Ryan Stone
60dd73b78b If the 4BSD scheduler tries to schedule a thread that has been pinned or
bound to an AP before SMP has started, the system will panic when we try
to touch per-CPU state for that AP because that state has not been
initialized yet.  Fix this in the same way as ULE: place all threads in
the global run queue before SMP has started.

Reviewed by:	jhb
MFC after:	1 month
2011-04-26 20:34:30 +00:00
John Baldwin
e806d352d2 Fix several places to ignore processes that are not yet fully constructed.
MFC after:	1 week
2011-04-06 17:47:22 +00:00
Fabien Thomas
586cb6ec77 Clearing the flag when preempting will let the preempted thread run
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.

As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by:	jhb, jeff
MFC after:	1 month
2011-03-31 13:59:47 +00:00
John Baldwin
2dc29adb9f Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
  just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
  the timesharing priority band can be increased.  The new timeshare range
  is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
  interactive threads.  Instead, the larger timeshare range is now split
  into separate subranges for interactive and non-interactive ("batch")
  threads.  The end result is that interactive threads and non-interactive
  threads still use the same priority ranges as before, but realtime
  threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
  or via cv_broadcastpri().  Realtime and idle priority threads will
  no longer have their priorities affected by sleeping in the kernel.

Reviewed by:	jeff
2011-01-14 17:06:54 +00:00
Matthew D Fleming
52c0b557cc One more sysctl(9) type-safety that I missed before. 2011-01-13 18:20:37 +00:00
John Baldwin
22d19207e9 - Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
  proc.  Otherwise attempts to modify thread or process data in sched_fork()
  could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
  sched_fork_thread() in ULE.  This is already done courtesy the bcopy()
  of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
  new thread's base priority (td_base_pri) to avoid bogusly inheriting a
  borrowed priority from the parent thread.

MFC after:	2 weeks
2011-01-06 22:24:00 +00:00
David Xu
c8e368a933 - Follow r216313, the sched_unlend_user_prio is no longer needed, always
use sched_lend_user_prio to set lent priority.
- Improve pthread priority-inherit mutex, when a contender's priority is
  lowered, repropagete priorities, this may cause mutex owner's priority
  to be lowerd, in old code, mutex owner's priority is rise-only.
2010-12-29 09:26:46 +00:00
David Xu
acbe332a58 MFp4:
It is possible a lower priority thread lending priority to higher priority
thread, in old code, it is ignored, however the lending should always be
recorded, add field td_lend_user_pri to fix the problem, if a thread does
not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week
2010-12-09 02:42:02 +00:00
Dimitry Andric
3e288e6238 After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
Dimitry Andric
31c6a0037e Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
Alexander Motin
a157e42516 Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.

There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
  kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
  kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
  kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
  kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.

As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.

Tested by:	many (on i386, amd64, sparc64 and powerc)
H/W donated by:	Gheorghe Ardelean
Sponsored by:	iXsystems, Inc.
2010-09-13 07:25:35 +00:00
Alexander Motin
b722ad008b Merge some SCHED_ULE features to SCHED_4BSD:
- Teach SCHED_4BSD to inform cpu_idle() about high sleep/wakeup rate to
choose optimized handler. In case of x86 it is MONITOR/MWAIT. Also it
will be needed to bypass forthcoming idle tick skipping logic to not
consume resources on events rescheduling when it won't give any benefits.
- Teach SCHED_4BSD to wake up idle CPUs without using IPI. In case of x86,
when MONITOR/MWAIT is active, it require just single memory write. This
doubles performance on some heavily switching test loads.
2010-09-11 07:08:22 +00:00
John Baldwin
d9d8d1449d Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid.  Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead.  This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by:	peter, sbruno
Reviewed by:	rookie
Obtained from:	Yahoo! (x86)
MFC after:	1 month
2010-08-06 15:36:59 +00:00
John Baldwin
3aa6d94e0c Update several places that iterate over CPUs to use CPU_FOREACH(). 2010-06-11 18:46:34 +00:00
John Baldwin
3da35a0a52 Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it.  All of the current callers already hold the
lock.

MFC after:	1 month
2010-06-03 16:02:11 +00:00
John Baldwin
1d7830edd5 Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after:	1 month
2010-05-21 17:15:56 +00:00
Attilio Rao
58060789e6 Split out an invariant in order to better check that newtd, when
provided, must be on a runqueue.

Tested by:	Giovanni Trematerra
		<giovanni dot trematerra at gmail dot com>
MFC:		2 weeks
X-MFC:		r202889
2010-01-24 18:16:38 +00:00
Attilio Rao
b0b9dee5c9 - Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
  sched_lock was acquired (without the aid of the td_lock interface) and
  the td_lock was dropped. This was going to break locking rules on other
  threads willing to access to the thread (via the td_lock interface) and
  modify his flags (allowed as long as the container lock was different
  by the one used in sched_switch).
  In order to prevent this situation, while sched_lock is acquired there
  the td_lock gets blocked. [0]
- Merge the ULE's internal function thread_block_switch() into the global
  thread_lock_block() and make the former semantic as the default for
  thread_lock_block(). This means that thread_lock_block() will not
  disable interrupts when called (and consequently thread_unlock_block()
  will not re-enabled them when called). This should be done manually
  when necessary.
  Note, however, that ULE's thread_unblock_switch() is not reaped
  because it does reflect a difference in semantic due in ULE (the
  td_lock may not be necessarilly still blocked_lock when calling this).
  While asymmetric, it does describe a remarkable difference in semantic
  that is good to keep in mind.

[0] Reported by:	Kohji Okuno
			<okuno dot kohji at jp dot panasonic dot com>
Tested by:		Giovanni Trematerra
			<giovanni dot trematerra at gmail dot com>
MFC:			2 weeks
2010-01-23 15:54:21 +00:00
Attilio Rao
6eac7e5752 - Fix a bug in sched_4bsd where the timestamp for the sleeping operation
is not cleaned up on the wakeup but reset.
  This is harmless mostly because td_slptick (and ki_slptime from
  userland) should be analyzed only with the assumption that the thread
  is actually sleeping (thus while the td_slptick is correctly set) but
  without this invariant the number is nomore consistent.
- Move td_slptick from u_int to int in order to follow 'ticks' signedness
  and wrap up accordingly [0]

[0] Submitted by:	emaste
Sponsored by:		Sandvine Incorporated
MFC			1 week
2010-01-08 14:55:11 +00:00
Konstantin Belousov
17c4c3563c Allow swap out of the kernel stack for the thread with priority greater
or equial then PSOCK, not less or equial. Higher priority has lesser
numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by:	pho
Reviewed by:	jhb
MFC after:	1 week
2009-12-31 18:52:58 +00:00
Attilio Rao
1b9d701fee Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.

Reported by:	jeff (originally), kris (currently)
Reviewed by:	jhb
Tested by:	Giuseppe Cocomazzi <sbudella at email dot it>
2009-11-03 16:46:52 +00:00
Jeff Roberson
0d2cf8374a - Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
   something more reasonable such as sizeof("32").  This shouldn't have
   caused any ill effect until we run on machines with 1000000 or more
   cpus.
2009-01-25 07:35:10 +00:00
Jeff Roberson
8f51ad55e7 - Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py.  This allows developers to quickly
   create a graphical view of ktr data for any resource in the system.
 - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
   identifying records associated with a thread or cpu.
 - Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from:	attilio
Discussed with:	jhb
Sponsored by:	Nokia
2009-01-17 07:17:57 +00:00
John Baldwin
c3ea337801 When choosing a CPU for a thread in a cpuset, prefer the last CPU that the
thread ran on if there are no other CPUs in the set with a shorter per-CPU
runqueue.
2008-07-28 20:39:21 +00:00
John Baldwin
f200843b72 Implement support for cpusets in the 4BSD scheduler.
- When a cpuset is applied to a thread, walk the cpuset to see if it is a
  "full" cpuset (includes all available CPUs).  If not, set a new
  TDS_AFFINITY flag to indicate that this thread can't run on all CPUs.
  When inheriting a cpuset from another thread during thread creation, the
  new thread also inherits this flag.  It is in a new ts_flags field in
  td_sched rather than using one of the TDF_SCHEDx flags because fork()
  clears td_flags after invoking sched_fork().
- When placing a thread on a runqueue via sched_add(), if the thread is not
  pinned or bound but has the TDS_AFFINITY flag set, then invoke a new
  routine (sched_pickcpu()) to pick a CPU for the thread to run on next.
  sched_pickcpu() walks the cpuset and picks the CPU with the shortest
  per-CPU runqueue length.  Note that the reason for the TDS_AFFINITY flag
  is to avoid having to walk the cpuset and examine runq lengths in the
  common case.
- To avoid walking the per-CPU runqueues in sched_pickcpu(), add an array
  of counters to hold the length of the per-CPU runqueues and update them
  when adding and removing threads to per-CPU runqueues.

MFC after:	2 weeks
2008-07-28 17:25:24 +00:00