GitHub Repository: freebsd/freebsd-src
Path: blob/main/lib/libc/softfloat/softfloat.txt
$NetBSD: softfloat.txt,v 1.2 2006/11/24 19:46:58 christos Exp $

SoftFloat Release 2a General Documentation

John R. Hauser
1998 December 13


-------------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic.  As many as four
formats are supported:  single precision, double precision, extended double
precision, and quadruple precision.  All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat.  It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard.  Details about the standard are available
elsewhere.


-------------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code.  The
SoftFloat header files assume an ISO/ANSI-style C compiler.  No attempt
has been made to accommodate compilers that are not ISO-conformant.  In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic.  If the
largest integer format supported by the C compiler is 32 bits, SoftFloat is
limited to only single and double precisions.  When that is the case, all
references in this document to the extended double precision, quadruple
precision, and 64-bit integers should be ignored.


-------------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



-------------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser.  This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704.  Funding was partially
provided by the National Science Foundation under grant MIP-9311980.  The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE.  Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR.  USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ANY
AND ALL LOSSES, COSTS, OR OTHER PROBLEMS ARISING FROM ITS USE.


-------------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h' header
file defines four types:  `float32' (single precision), `float64' (double
precision), `floatx80' (extended double precision), and `float128'
(quadruple precision).  The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used.  The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order of the `high' and `low'
fields.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types:  `float32' and `float64'.  Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type.  The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types.  (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)
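
As an illustrative sketch of the memory-layout claim above, the fragment
below copies a native `float' into a 32-bit integer and back.  The typedef
`float32_bits' and both helper functions are hypothetical stand-ins, not
part of SoftFloat, and the code assumes that the host's `float' is the
32-bit IEC/IEEE single-precision format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's float32 when it is identified
   with a 32-bit integer type, as the text describes. */
typedef uint32_t float32_bits;

/* Reinterpret a native float as its raw bit pattern.
   Assumption: float is a 32-bit IEC/IEEE single-precision value. */
static float32_bits float32_from_native(float f)
{
    float32_bits bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Reinterpret a raw 32-bit pattern as a native float. */
static float float32_to_native(float32_bits bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Using `memcpy' rather than a pointer cast keeps the reinterpretation within
the rules of ISO C.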

SoftFloat implements the following arithmetic operations:

 -- Conversions among all the floating-point formats, and also between
    integers (32-bit and 64-bit) and any of the floating-point formats.

 -- The usual add, subtract, multiply, divide, and square root operations
    for all floating-point formats.

 -- For each format, the floating-point remainder operation defined by the
    IEC/IEEE Standard.

 -- For each floating-point format, a ``round to integer'' operation that
    rounds to the nearest integer value in the same format.  (The
    floating-point formats can hold integer values, of course.)

 -- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.


-------------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding.  The rounding mode is selected
by the global variable `float_rounding_mode'.  This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'.  The rounding mode is initialized
to nearest/even.
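
The effect of the four modes can be sketched on a simple case:  rounding
the exact quotient value / 2^shift to an integer.  The constant and variable
names below are taken from the text, but their numeric values are
assumptions (the real ones are set in `softfloat.h'), and the two
`round_scaled' functions are hypothetical helpers that are not part of
SoftFloat.

```c
#include <assert.h>
#include <stdint.h>

/* Names from the text; the numeric values are assumptions. */
enum {
    float_round_nearest_even = 0,
    float_round_to_zero      = 1,
    float_round_down         = 2,
    float_round_up           = 3
};

/* Stand-in for the global mode variable, initialized to nearest/even. */
static int float_rounding_mode = float_round_nearest_even;

/* Hypothetical helper: round value / 2^shift to an integer under `mode'. */
static int64_t round_scaled_with_mode(int64_t value, unsigned shift, int mode)
{
    assert(shift > 0 && shift < 62);
    int64_t unit = (int64_t)1 << shift;
    int64_t q = value / unit;           /* truncated quotient */
    int64_t r = value % unit;           /* remainder, same sign as value */
    if (r < 0) { q -= 1; r += unit; }   /* now q = floor, 0 <= r < unit */
    if (r == 0) return q;               /* exact: no rounding needed */
    switch (mode) {
    case float_round_down:    return q;
    case float_round_up:      return q + 1;
    case float_round_to_zero: return value < 0 ? q + 1 : q;
    default: {                          /* nearest, ties to even */
        int64_t half = unit / 2;
        if (r != half) return r > half ? q + 1 : q;
        return (q % 2 != 0) ? q + 1 : q;
    }
    }
}

/* Same operation, reading the mode from the global variable. */
static int64_t round_scaled(int64_t value, unsigned shift)
{
    return round_scaled_with_mode(value, shift, float_rounding_mode);
}
```

For example, 5/2 = 2.5 rounds to 2 under nearest/even, toward zero, and
down, but to 3 under round-up.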


-------------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'.  The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format.  Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively.  When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified.  Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
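
The statement that bits beyond the rounding point are set to zero can be
pictured with the sketch below.  It is not SoftFloat code:  the function
name is hypothetical, and it only truncates, whereas the real operations
round according to the current rounding mode before clearing the discarded
bits.  It assumes the 64-bit `floatx80' significand, of which single
precision keeps the top 24 bits and double precision the top 53.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clear the significand bits that lie beyond the
   rounding point for the given rounding precision (32, 64, or 80).
   Truncation only; actual rounding may also add an increment first. */
static uint64_t floatx80_sig_truncate(uint64_t sig, int rounding_precision)
{
    uint64_t discard_mask;
    assert(rounding_precision == 32 || rounding_precision == 64
           || rounding_precision == 80);
    if (rounding_precision == 32)
        discard_mask = (UINT64_C(1) << 40) - 1;   /* 64 - 24 low bits */
    else if (rounding_precision == 64)
        discard_mask = (UINT64_C(1) << 11) - 1;   /* 64 - 53 low bits */
    else
        discard_mask = 0;                         /* full precision */
    return sig & ~discard_mask;
}
```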


-------------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented.  Each flag is stored as a unique bit in the global variable
`float_exception_flags'.  The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'.  The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name.  To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).
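
A minimal sketch of this flag handling follows.  The mask values are
assumptions (the real ones are set in `softfloat.h' and differ between
targets), `float_raise' here is a simplified stand-in that only sets flags
and never traps, and `float_test_and_clear' is a hypothetical helper that is
not part of SoftFloat.

```c
/* Mask names from the text; the bit positions are assumptions. */
enum {
    float_flag_inexact   = 0x01,
    float_flag_underflow = 0x02,
    float_flag_overflow  = 0x04,
    float_flag_divbyzero = 0x08,
    float_flag_invalid   = 0x10
};

/* Stand-in for the global flags variable, initialized to no exceptions. */
static int float_exception_flags = 0;

/* Simplified stand-in for float_raise: sets the flags, never traps. */
static void float_raise(int mask)
{
    float_exception_flags |= mask;
}

/* Hypothetical helper: report whether any flag in `mask' is set, then
   clear it with the statement shown in the text. */
static int float_test_and_clear(int mask)
{
    int was_set = (float_exception_flags & mask) != 0;
    float_exception_flags &= ~mask;
    return was_set;
}
```

Because the flags are sticky, a typical pattern is to clear the flags of
interest, run a sequence of operations, and test the flags once at the end.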

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding.  The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.  The other option is provided for compatibility
with some systems.  Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.


-------------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers.  The complete set of conversion functions is:

    int32_to_float32         int64_to_float32
    int32_to_float64         int64_to_float64
    int32_to_floatx80        int64_to_floatx80
    int32_to_float128        int64_to_float128

    float32_to_int32         float32_to_int64
    float64_to_int32         float64_to_int64
    floatx80_to_int32        floatx80_to_int64
    float128_to_int32        float128_to_int64

    float32_to_float64       float32_to_floatx80      float32_to_float128
    float64_to_float32       float64_to_floatx80      float64_to_float128
    floatx80_to_float32      floatx80_to_float64      floatx80_to_float128
    float128_to_float32      float128_to_float64      float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result.  Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding.  Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).  If the floating-point operand is a NaN, the largest
positive integer is returned.  Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already an
integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'.  Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero     float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero     float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero    floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero    float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
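
The NaN and overflow rules above can be sketched with the host's native
`double' in place of SoftFloat's bit-level arithmetic.  The function and the
`sketch_invalid' variable are hypothetical and are not part of SoftFloat;
the variable merely stands in for raising the invalid exception.

```c
#include <math.h>
#include <stdint.h>

/* Stand-in for the invalid exception flag (assumption, not SoftFloat). */
static int sketch_invalid = 0;

/* Hypothetical model of a round-toward-zero conversion to a 32-bit
   integer, following the rules described in the text. */
static int32_t double_to_int32_round_to_zero_sketch(double x)
{
    if (isnan(x)) {                 /* NaN: largest positive integer */
        sketch_invalid = 1;
        return INT32_MAX;
    }
    if (x >= 2147483648.0) {        /* positive overflow */
        sketch_invalid = 1;
        return INT32_MAX;
    }
    if (x <= -2147483649.0) {       /* negative overflow */
        sketch_invalid = 1;
        return INT32_MIN;
    }
    return (int32_t)x;              /* in range: C truncates toward zero */
}
```

The range checks come before the cast because converting an out-of-range
floating-point value to an integer type is undefined behavior in C.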

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended_Double-Precision_Rounding_Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard.  The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands.  The operands and result are all
of the same type.  Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y.  If x/y is exactly
halfway between two integers, n is the even integer closest to x/y.  The
remainder functions are always exact and so require no rounding.
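
The definition x - n*y can be worked through on integer-valued operands,
which every floating-point format represents exactly.  The function below
is a hypothetical illustration, not SoftFloat code:  it operates on C
integers rather than on floating-point bit patterns, and it assumes operands
small enough that doubling the remainder does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration: x - n*y with n the integer nearest to x/y,
   ties going to the even n.  Requires y != 0 and small operands. */
static int64_t ieee_remainder_int(int64_t x, int64_t y)
{
    assert(y != 0);
    int64_t ay = y < 0 ? -y : y;    /* the result does not depend on y's sign */
    int64_t r = x % ay;             /* same sign as x, |r| < ay */
    if (r < 0) r += ay;             /* now 0 <= r < ay */
    int64_t q = (x - r) / ay;       /* q = floor(x / ay) */
    /* Move to the next multiple if it is nearer, or if it is equally
       near and q is odd (so that the chosen n is even). */
    if (2 * r > ay || (2 * r == ay && q % 2 != 0))
        r -= ay;
    return r;
}
```

For example, with x = 7 and y = 2 the quotient 3.5 is a tie, so n is the
even integer 4 and the result is 7 - 8 = -1.  Unlike C's `%' operator, the
result can be negative even when both operands are positive.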

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.  This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard.  The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type.  (Note that the result is not an integer type.)  The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq    float32_le    float32_lt
    float64_eq    float64_le    float64_lt
    floatx80_eq   floatx80_le   floatx80_lt
    float128_eq   float128_le   float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_.  The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided.  The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed; and the greater-than function can be
obtained from the less-than function in the same way.
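
The derivations just described can be written out as follows.  So that
the fragment compiles on its own, `float32', `float32_eq', `float32_le', and
`float32_lt' are stand-ins built on the host's native `float' (which is
assumed to be IEC/IEEE single precision, and which sets no SoftFloat
exception flags).  The names `float32_ne', `float32_ge', and `float32_gt'
are hypothetical; SoftFloat does not provide them.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t float32;   /* stand-in for the SoftFloat type */

/* Reinterpret the bit pattern as a native float (assumes IEEE float). */
static float f32_native(float32 a)
{
    float f;
    memcpy(&f, &a, sizeof f);
    return f;
}

/* Stand-ins for the provided comparisons, using host arithmetic. */
static int float32_eq(float32 a, float32 b) { return f32_native(a) == f32_native(b); }
static int float32_le(float32 a, float32 b) { return f32_native(a) <= f32_native(b); }
static int float32_lt(float32 a, float32 b) { return f32_native(a) <  f32_native(b); }

/* Hypothetical derived comparisons, built as the text describes. */
static int float32_ne(float32 a, float32 b) { return !float32_eq(a, b); }
static int float32_ge(float32 a, float32 b) { return float32_le(b, a); }
static int float32_gt(float32 a, float32 b) { return float32_lt(b, a); }
```

Note that greater-than is obtained by swapping the operands of less-than,
not by negating less-than-or-equal; the two differ when an operand is a NaN.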

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs.  For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
    float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input.  Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
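
For single precision, such a test might look like the sketch below.  It
assumes the common convention in which a NaN is signaling when the most
significant fraction bit is 0; some architectures use the opposite
convention, and SoftFloat's own definition is target-specific.  The function
name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch: 1 if the 32-bit pattern `a' is a signaling NaN
   under the "most significant fraction bit clear" convention, else 0. */
static int float32_is_signaling_nan_sketch(uint32_t a)
{
    uint32_t exponent = (a >> 23) & 0xFF;       /* 8-bit exponent field */
    uint32_t fraction = a & 0x007FFFFF;         /* 23-bit fraction field */
    int quiet_bit_set = (fraction & 0x00400000) != 0;
    /* NaN: exponent all ones and fraction nonzero. */
    return exponent == 0xFF && fraction != 0 && !quiet_bit_set;
}
```

An infinity has an all-ones exponent but a zero fraction, so it is rejected
by the `fraction != 0' test.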

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise.  No
result is returned.  In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


-------------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://HTTP.CS.Berkeley.EDU/~jhauser/arithmetic/SoftFloat.html'.