1
0
mirror of https://git.FreeBSD.org/src.git synced 2024-12-17 10:26:15 +00:00
Commit Graph

11467 Commits

Author SHA1 Message Date
Bruce Evans
63b4a1f80c Sigh, the weak reference for ceill(), floorl() and truncl() was in
unreachable code due to a missing include.  This kept arm and powerpc
broken.

Reported by:	sam, grehan
2008-02-15 07:01:40 +00:00
Bruce Evans
5014f8ded4 Oops, the weak reference for ceill(), floorl() and truncl() was in the
wrong file.  This broke arm and powerpc.

Reported by:	grehan
2008-02-14 15:10:34 +00:00
Bruce Evans
3365b45e5e Use the expression fabs(x+0.0)+fabs(y+0.0) instad of a+b (where a is
|x| or |y| and b is |y| or |x|) when mixing NaN arg(s).

hypot*() had its own foot shooting for mixing NaNs -- it swaps the
args so that |x| in bits is largest, but does this before quieting
signaling NaNs, so on amd64 (where the result of adding NaNs depends
on the order) it gets inconsistent results if setting the quiet bit
makes a difference, just like a similar ia64 and i387 hardware comparison.
The usual fix (see e_powf.c 1.13 for more details) of mixing using
(a+0.0)+-(b+0.0) doesn't work on amd64 if the args are swapped (since
the rder makes a difference with SSE). Fortunately, the original args
are unchanged and don't need to be swapped when we let the hardware
decide the mixing after quieting them, but we need to take their
absolute value.

hypotf() doesn't seem to have any real bugs masked by this non-bug.
On amd64, its maximum error in 2^32 trials on amd64 is now 0.8422 ulps,
and on i386 the maximum error is unchanged and about the same, except
with certain CFLAGS it magically drops to 0.5 (perfect rounding).

Convert to __FBSDID().
2008-02-14 13:44:03 +00:00
Dag-Erling Smørgrav
096ba44775 _pthread_mutex_isowned_np(): use a more reliable method; the current code
will work in simple cases, but may fail in more complicated ones.

Reviewed by:	davidxu
2008-02-14 12:37:58 +00:00
Bruce Evans
b4437c3d32 Fix the hi+lo decomposition for 2/(3ln2). The decomposition needs to
be into 12+24 bits of precision for extra-precision multiplication,
but was into 13+24 bits.  On i386 with -O1 the bug was hidden by
accidental extra precision, but on amd64, in 2^32 trials the bug
caused about 200000 errors of more than 1 ulp, with a maximum error
of about 80 ulps.  Now the maximum error in 2^32 trials on amd64
is 0.8573 ulps.  It is still 0.8316 ulps on i386 with -O1.

The nearby decomposition of 1/ln2 and the decomposition of 2/(3ln2) in
the double precision version seem to be sub-optimal but not broken.
2008-02-14 10:23:51 +00:00
Bruce Evans
011cbae1fe Use the expression (x+0.0)-(y+0.0) instead of x+y when mixing NaN arg(s).
This uses 2 tricks to improve consistency so that more serious problems
aren't hidden in simple regression tests by noise for the NaNs:

- for a signaling NaN, adding 0.0 generates the invalid exception and
  converts to a quiet NaN, and doesn't have too many effects for other
  types of args (it converts -0 to +0 in some rounding modes, but that
  hopefully doesn't change the result after adding the NaN arg).  This
  avoids some inconsistencies on i386 and ia64.  On these arches, the
  result of an operation on 2 NaNs is apparently the largest or the
  smallest of the NaNs as bits (consistently largest or smallest for
  each arch, but the opposite).  I forget which way the comparison
  goes and if the sign bit affects it.  The quiet bit is is handled
  poorly by not always setting it before the comparision or ignoring
  it.  Thus if one of the args was originally a signaling NaN and the
  other was originally a quiet NaN, then the result depends too much
  on whether the signaling NaN has been quieted at this point, which
  in turn depends on optimizations and promotions.  E.g., passing float
  signaling NaNs to double functions must quiet them on conversion;
  on i387, loading a signaling NaN of type float or double (but not
  long double) into a register involves a conversion, so it quiets
  signaling NaNs, so if the addition has 2 register operands than it
  only sees quiet NaNs, but if the addition has a memory operand then
  it sees a signaling NaN iff it is in the memory operand.

- subtraction instead of addition is used to avoid a dubious optimization
  in old versions of gcc.  For SSE operations, mixing of NaNs apparently
  always gives the target operand.  This is not as good as the i387
  and ia64 behaviour.  It doesn't mix NaNs at all, and makes addition
  not quite commutative.  Old versions of gcc sometimes rewrite x+y
  to y+x and thus give different results (in bits) for NaNs.  gcc-3.3.3
  rewrites x+y to y+x for one of pow() and powf() but not the other,
  so starting from float NaN args x and y, powf(x, y) was almost always
  different from pow(x, y).

These tricks won't give consistency of 2-arg float and double functions
with long double ones on amd64, since long double ones use the i387
which has different semantics from SSE.

Convert to __FBSDID().
2008-02-14 09:42:24 +00:00
Bruce Evans
e7c95ee5fe s_ceill.c
s_floorl.c
s_truncl.c
2008-02-13 17:38:16 +00:00
Bruce Evans
74d68da630 On arches where long double is the same as double, alias ceil(), floor()
and trunc() to the corresponding long double functions.  This is not
just an optimization for these arches.  The full long double functions
have a wrong value for `huge', and the arches without full long doubles
depended on it being wrong.
2008-02-13 16:56:52 +00:00
Bruce Evans
6597187205 Fix the C version of ceill(x) for -1 < x <= -0 in all rounding modes.
The result should be -0, but was +0.
2008-02-13 15:22:53 +00:00
Rong-En Fan
7913e26359 - Remove duplicate tputs.3 from MLINK. As we use termcap in the bsae, remove
the one links to curs_terminfo.

Submitted by:	David Naylor <blackdragon at highveldmail.co.za>
MFC after:	3 days
2008-02-13 14:34:39 +00:00
Bruce Evans
f01bfe5c6d Fix exp2*(x) on signaling NaNs by returning x+x as usual.
This has the side effect of confusing gcc-4.2.1's optimizer into more
often doing the right thing.  When it does the wrong thing here, it
seems to be mainly making too many copies of x with dependency chains.
This effect is tiny on amd64, but in some cases on i386 it is enormous.
E.g., on i386 (A64) with -O1, the current version of exp2() should
take about 50 cycles, but took 83 cycles before this change and 66
cycles after this change.  exp2f() with -O1 only speeded up from 51
to 47 cycles.  (exp2f() should take about 40 cycles, on an Athlon in
either i386 or amd64 mode, and now takes 42 on amd64).  exp2l() with
-O1 slowed down from 155 cycles to 123 for some args; this is unimportant
since the i386 exp2l() is a fake; the wrong thing for it seems to
involve branch misprediction.
2008-02-13 10:44:44 +00:00
Bruce Evans
828f7b4a82 Rearrange the polynomial evaluation for better parallelism. This is
faster on all machines tested (old Celeron (P2), A64 (amd64 and i386)
and ia64) except on ia64 when compiled with -O1.  It takes 2 more
multiplications, so it will be slower on old machines.  The speedup
is about 8 cycles = 17% on A64 (amd64 and i386) with best CFLAGS
and some parallelism in the caller.

Move the evaluation of 2**k up a bit so that it doesn't compete too
much with the new polynomial evaluation.  Unlike the previous
optimization, this rearrangement cannot change the result, so compilers
and CPU schedulers can do it, but they don't do it quite right yet.
This saves a whole 1 or 2 cycles on A64.
2008-02-13 08:36:13 +00:00
Bruce Evans
02ef796d23 Use hardware remainder on amd64 since it is 5 to 10 times faster than
software remainder and is already used for remquo().
2008-02-13 06:01:48 +00:00
David E. O'Brien
a2c4cd4549 style.Makefile(5) 2008-02-13 05:25:43 +00:00
David E. O'Brien
6cc9986927 style(9) 2008-02-13 05:12:05 +00:00
Ruslan Ermilov
5f56182b6f Change readlink(2)'s return type and type of the last argument
to match POSIX.

Prodded by:	Alexey Lyashkov
2008-02-12 20:09:04 +00:00
Bruce Evans
a2ddfa5ea7 Fix remainder() and remainderf() in round-towards-minus-infinity mode
when the result is +-0.  IEEE754 requires (in all rounding modes) that
if the result is +-0 then its sign is the same as that of the first
arg, but in round-towards-minus-infinity mode an uncorrected implementation
detail always reversed the sign.  (The detail is that x-x with x's
sign positive gives -0 in this mode only, but the algorithm assumed
that x-x always has positive sign for such x.)

remquo() and remquof() seem to need the same fix, but I cannot test them
yet.

Use long doubles when mixing NaN args.  This trick improves consistency
of results on at least amd64, so that more serious problems like the
above aren't hidden in simple regression tests by noise for the NaNs.
On amd64, hardware remainder should be used since it is about 10 times
faster than software remainder and is already used for remquo(), but
it involves using the i387 even for floats and doubles, and the i387
does NaN mixing which is better than but inconsistent with SSE NaN mixing.
Software remainder() would probably have been inconsistent with
software remainderl() for the same reason if the latter existed.

Signaling NaNs cause further inconsistencies on at least ia64 and i386.

Use __FBSDID().
2008-02-12 17:11:36 +00:00
Rong-En Fan
67c700c106 - Update build glues for ncurses 5.6 snapshot 20080209
- While I'm here, sort macro defines in ncurses_cfg.h
2008-02-11 13:39:36 +00:00
Remko Lodder
8e167da222 After issueing a ntpdate [1] I noticed it's already 2008, reflect that
in the last modified date.

Noticed by:	brueffer [1]
2008-02-11 07:43:23 +00:00
Remko Lodder
08a155ad22 Fix typo (s/existance/existence/)
Noticed by:	ceri
2008-02-11 07:15:52 +00:00
Bruce Evans
51f86873af Use double precision for z and thus for the entire calculation of
exp2(i/TBLSIZE) * p(z) instead of only for the final multiplication
and addition.  This fixes the code to match the comment that the maximum
error is 0.5010 ulps (except on machines that evaluate float expressions
in extra precision, e.g., i386's, where the evaluation was already
in extra precision).

Fix and expand the comment about use of double precision.

The relative roundoff error from evaluating p(z) in non-extra precision
was about 16 times larger than in exp2() because the interval length
is 16 times smaller.  Its maximum was at least P1 * (1.0 ulps) *
max(|z|) ~= log(2) * 1.0 * 1/32 ~= 0.0217 ulps (1.0 ulps from the
addition in (1 + P1*z) with a cancelation error when z ~= -1/32).  The
actual final maximum was 0.5313 ulps, of which 0.0303 ulps must have
come from the additional roundoff error in p(z).  I can't explain why
the additional roundoff error was almost 3/2 times larger than the rough
estimate.
2008-02-11 05:20:02 +00:00
Bruce Evans
52453261e9 As usual, use a minimax polynomial that is specialized for float
precision.  The new polynomial has degree 4 instead of 10, and a maximum
error of 2**-30.04 ulps instead of 2**-33.15.  This doesn't affect the
final error significantly; the maximum error was and is about 0.5015
ulps on i386 -O1, and the number of cases with an error of > 0.5 ulps
is increased from 13851 to 14407.

Note that the error is only this close to 0.5 ulps due to excessive
extra precision caused by compiler bugs on i386.  The extra precision
could be obtained intentionally, and is useful for keeping the error
of the hyperbolic float functions below 1 ulp, since these functions
are implemented using expm1f.  My recent change for scaling by 2**k
had the unintentional side effect of retaining extra precision for
longer, so callers of expm1f see errors of more like 0.0015 ulps than
0.5015 ulps, and for the hyperbolic functions this reduces the maximum
error from nearly about 2 ulps to about 0.75 ulps.

This is about 10% faster on i386 (A64).  expm1* is still very slow,
but now the float version is actually significantly faster.  The
algorithm is very sophisticated but not very good except on machines
with fast division.
2008-02-09 12:53:15 +00:00
Bruce Evans
6d656800db Fix a comment about coefficients and expand a related one. 2008-02-09 10:36:07 +00:00
Dag-Erling Smørgrav
340b079be0 Use memcpy(3) instead of the BSD-specific bcopy(3).
Submitted by:	Joerg Sonnenberger <joerg@britannica.bec.de>
MFC after:	2 weeks
2008-02-08 09:48:48 +00:00
Dag-Erling Smørgrav
e97f516c09 s/MAXPATHLEN/PATH_MAX/ to reflect five-year old change to the code :)
Submitted by:	Joerg Sonnenberger <joerg@britannica.bec.de>
MFC after:	2 weeks
2008-02-08 09:44:34 +00:00
Jason Evans
157d89fe25 Fix a bug in lazy deallocation that was introduced when
arena_dalloc_lazy_hard() was split out of arena_dalloc_lazy() in revision
1.162.

Reduce thundering herd problems in lazy deallocation by randomly varying
how many probes a thread does before taking the slow path.
2008-02-08 08:02:34 +00:00
Bruce Evans
fbe8fb4d7b Fix truncl() when the result should be -0.0L. When the result is +-0.0L,
it must have the same sign as the arg in all rounding modes, but it was
always +0.0L.
2008-02-08 01:45:52 +00:00
Bruce Evans
aa7c7c47cf Oops, fix the fix in rev.1.10. logb() and logbf() were broken on
denormals, and logb() remained broken after 1.10 because the fix for
logbf() was incompletely translated.

Convert to __FBSDID().
2008-02-08 01:22:13 +00:00
Jason Evans
97091a2dd7 Clean up manipulation of chunk page map elements to remove some tenuous
assumptions about whether bits are set at various times.  This makes
adding other flags safe.

Reorganize functions in order to inline i{m,c,p,s,re}alloc().  This
allows the entire fast-path call chains for malloc() and free() to be
inlined. [1]

Suggested by:	[1] Stuart Parmenter <stuart@mozilla.com>
2008-02-08 00:35:56 +00:00
Bruce Evans
a00672cff9 Use a better method of scaling by 2**k. Instead of adding to the
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it.  This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-wrtite at an
unfortunate time for pipelines.

In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%.  I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up).  In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles.  The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%.  The manual
scheduling for plain exp[f] is harder and not as tuned.

Details specific to expm1*:
- the saving is closer to 12 cycles than to 40 for expm1* on i386 (A64).
  For some reason it is much larger for negative args.
- also convert to __FBSDID().
2008-02-07 09:42:19 +00:00
Bruce Evans
a373e66b85 Use a better method of scaling by 2**k. Instead of adding to the
exponent bits of the reduced result, construct 2**k (hopefully in
parallel with the construction of the reduced result) and multiply by
it.  This tends to be much faster if the construction of 2**k is
actually in parallel, and might be faster even with no parallelism
since adjustment of the exponent requires a read-modify-wrtite at an
unfortunate time for pipelines.

In some cases involving exp2* on amd64 (A64), this change saves about
40 cycles or 30%.  I think it is inherently only about 12 cycles faster
in these cases and the rest of the speedup is from partly-accidentally
avoiding compiler pessimizations (the construction of 2**k is now
manually scheduled for good results, and -O2 doesn't always mess this
up).  In most cases on amd64 (A64) and i386 (A64) the speedup is about
20 cycles.  The worst case that I found is expf on ia64 where this
change is a pessimization of about 10 cycles or 5%.  The manual
scheduling for plain exp[f] is harder and not as tuned.

This change ld128/s_exp2l.c has not been tested.
2008-02-07 03:17:05 +00:00
Dag-Erling Smørgrav
04539c7099 Add missing #include
Spotted by:	tinderbox
Submitted by:	Pietro Cerutti <gahr@gahr.ch>
Pointy hat to:	des
2008-02-06 23:25:29 +00:00
Dag-Erling Smørgrav
47252f4e65 Yet another pointy hat: when I zapped FBSDprivate_1.1, I forgot to move
its contents to FBSDprivate_1.0.
2008-02-06 20:45:46 +00:00
Dag-Erling Smørgrav
0957bb7d29 Add pthread_mutex_isowned_np() here as well; libthr and libkse are supposed
to have identical functionality.

MFC after:	2 weeks
2008-02-06 20:44:29 +00:00
Dag-Erling Smørgrav
d29b081fee Remove unnecessary prototype. 2008-02-06 20:43:19 +00:00
Dag-Erling Smørgrav
3cd52a7730 Add pthread_mutex_isowned_np() so there is no need for an additional
prototype next to the implementation.

MFC after:	2 weeks
2008-02-06 20:42:35 +00:00
Dag-Erling Smørgrav
7035cef466 Previous commit had a typo that resulted in symbol versioning being
(silently) disabled for libkse...

Pointy hat to:	des
2008-02-06 20:33:59 +00:00
Dag-Erling Smørgrav
d197ee06fc Give libkse the same treatment as libthr re. symbol versioning.
MFC after:	2 weeks
2008-02-06 20:30:48 +00:00
Dag-Erling Smørgrav
a9b9744ff8 Convert pthread.map to the format expected by version_gen.awk, and modify
the Makefile accordingly; libthr now explicitly uses libc's Versions.def.

MFC after:	2 weeks
2008-02-06 20:25:00 +00:00
Dag-Erling Smørgrav
facfc98226 Remove incorrectly added FBSDprivate_1.1 namespace, and move symbols which
are new in FreeBSD 8 to the appropriate namespace.
2008-02-06 20:20:29 +00:00
Dag-Erling Smørgrav
1cbdac2668 Per discussion on -threads, rename _islocked_np() to _isowned_np(). 2008-02-06 19:34:31 +00:00
Dag-Erling Smørgrav
79257dd70a Add necessary cast for tolower() argument.
Submitted by:	Joerg Sonnenberger <joerg@britannica.bec.de>
MFC after:	1 week
2008-02-06 11:39:55 +00:00
Bruce Evans
ce56838fdc As for the float trig functions and logf, use a minimax polynomial
that is specialized for float precision.  The new polynomial has degree
5 instead of 11, and a maximum error of 2**-27.74 ulps instead
of 2**-30.64.  This doesn't affect the final error significantly; the
maximum error was and is about 0.9101 ulps on amd64 -01 and the number
of cases with an error of > 0.5 ulps is actually reduced by epsilon
despite the larger error in the polynomial.

This is about 15% faster on amd64 (A64), i386 (A64) and ia64.  The asm
version is still used instead of this on i386 since it is faster and
more accurate.
2008-02-06 06:35:21 +00:00
Jason Evans
baad859d16 Track dirty unused pages so that they can be purged if they exceed a
threshold, according to the 'F' MALLOC_OPTIONS flag.  This obsoletes the
'H' flag.

Try to realloc() large objects in place.  This substantially speeds up
incremental large reallocations in the common case.

Fix a bug in arena_ralloc() that caused relocation of sub-page objects
even if the old and new sizes were in the same size class.

Maintain trees of runs and simplify the per-chunk page map.  This allows
logarithmic-time searching for sufficiently large runs in
arena_run_alloc(), whereas the previous algorithm required linear time
in the worst case.

Break various large functions into smaller sub-functions, and inline
only the functions that are in the fast path for small object
allocation/deallocation.

Remove an unnecessary check in base_pages_alloc_mmap().

Avoid integer division in choose_arena() for the NO_TLS case on
single-CPU systems.
2008-02-06 02:59:54 +00:00
Matteo Riondato
23490135ae set WARNS to 1: with WARNS=2 an aliasing error in a file generated by
rpcgen from include/rpcsvc/rex.x is exposed and I really don't know
how to fix it.

MFC after:	1 week
2008-02-05 20:03:45 +00:00
Joseph Koshy
e77357ea2d Document the return type for gelf_fsize(3).
Submitted by:		kaiw
2008-02-04 18:50:28 +00:00
Dag-Erling Smørgrav
fcd61d9141 After careful consideration (and a brief discussion with attilio@), change
the semantics of pthread_mutex_islocked_np() to return true if and only if
the mutex is held by the current thread.

Obviously, change the regression test to match.

MFC after:	2 weeks
2008-02-04 12:35:23 +00:00
Matteo Riondato
c75cd2cb75 Fix incorrect handling of malloc failures
PR:		bin/83369
MFC after:	1 week
2008-02-04 07:56:36 +00:00
Dag-Erling Smørgrav
5fd410a787 Add pthread_mutex_islocked_np(), a cheap way to verify that a mutex is
locked.  This is intended primarily to support the userland equivalent
of the various *_ASSERT_LOCKED() macros we have in the kernel.

MFC after:	2 weeks
2008-02-03 22:38:10 +00:00
Hajimu UMEMOTO
6b299433de Remove incomplete support of AI_ALL and AI_V4MAPPED.
Reported by:	"Heiko Wundram (Beenic)" <wundram__at__beenic.net>
2008-02-03 19:07:55 +00:00