十进制下的有效位数定义

以科学计数法记录，

十进制下， \( 1.23456 \times 10^3 \) 的有效数字为6位。

如果一个变量x的范围为[0,123456],那么这个变量x的有效数字至少为5位，但是其不能表示所有6位有效数字的数。
如果一个变量x的有效数字为5位，那么它不能区分1.23456和1.23457。

如果我们定义一个32位十进制浮点数，其尾数为23位。那么其尾码为 10^24，由于科学计数法中指数部分不影响有效位数，所以其可以写作

\[ 1.(10^{23}) * 10^E \]

其有效位数为24位。

float32

edit-3afa04072ea14ffc986188711196383f-2022-08-29-23-19-02

在只考虑规格化(normalized)的浮点数的情况下(因为非规格化的浮点数的尾数不额外补1，其可表示的尾数比规格化的少一个比特，所以无需讨论他)，尾数部分可以用23个比特表示，加上额外的补1，总共可以认为是24个比特。

24个比特的尾数换算成10进制，其为

\[ 2^{24} = 16777216 \]

规格化的浮点数写作 \(sign * M * 2^{E}\) 的形式，因此其有效位数单纯由\(M\)决定。由此

\[ 10^7<16,777,216 < 10^8 \]

在规格化的情况下，其至少可以表示7位有效数字，不能准确表示八位有效数字。在非规格化下由于少一个比特，其尾数只能表示8388608，至少只能表示6位有效数字。

其他精度

Double

Double的尾码部分有52位，

\[ 10^{15} < 2^{53} = 9007199254740992 < 10^{16} \]

规格化情况下至少可以表示15位有效数字。非规格化情况下

\[ 10^{15} < 2^{52} = 4503599627370496< 10^{16} \]

有效位数至少15位。

其他

fp24明明D24S8有，但是好像没有找到合适的文档,也许不是原生的表示，只是把float32 encode成了24位。其他格式的浮点数表示方法可以见文档 (https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-bdw-vol03-gpu_overview_3.pdf) 以及DirectX的Spec。

fp32尾数23位，阶码8位,其最大值可以记作

\[ (2 - 2^{-23}) * 2 ^ {255 - 127 - 1} = 3.4028235 × 10^{38} \]

用类似的方法计算

Format	E	M	有效数字	尾数部分	最大值
Fp16	5	10	至少3位	\(2^{10 + 1} = 2048\)	\((2 - 2^{-10} * 2 ^{ 31 - 15 - 1}) = 65504\)
Fp11	5	6	至少2位	去掉了符号位 \(2^{6 + 1} = 128\)	\((2 - 2^{-6} * 2 ^{ 31 - 15 - 1}) = 65024.0\)
Fp10	5	5	至少1位	去掉了符号位 \(2^{5 + 1} = 64\)	\((2 - 2^{-5} * 2 ^{ 31 - 15 - 1}) = 64512.0\)
Fp8(e4m3)	4	3	至少1位	\(2^{3 + 1} = 16\)	\((2 - 2^{-3} * 2 ^{ 15 - 7 - 1}) = 240.0\)

注意后面的fp8的是没有标准的，采用的计算方法和表示csapp，之前的文章整数和浮点数的机器级表示介绍过。

在shader中其表示的范围可能会发生变化，可能只用来表示[0,1]范围,但是有效位数只受\(M\)位数影响，其精度是不会变的。

有一个简单的口诀，是指数(Exponent)的bit决定了数字表示的范围(因为相当于可以移动多少个小数点)，而尾数(Mantissa)的bit决定了数字的精度。从极端的角度上来讲，假如尾数只有1个bit，那么不管能移动多少个小数点，它也只能表示两个状态。

OpenGL wiki里整理了一些图形API里常用格式的精度，可以参考:

参考：Small Float Formats - OpenGL Wiki

图片里有一个有趣的指标是Decimal digits of precision，查了一下floating point - How to calculate decimal digits of precision based on the number of bits? - Stack Overflow。

这个指标是以10为底的log,d=log[base 10](2^b)。比如32位浮点的十进制精度为7.22，说明它至少能表示7位有效数字，但是不足以表示8位有效数字。

14F和11F的特殊之处

14Bit的格式只能是unsigned格式。因为只有这样才能抠出9个bit给尾数，凑成2 ^ (9 + 1) = 1024，确保能达到3位有效数字的精度。

11Bit的也是一样，去掉signed bit才能扣出6个bit出来，这样的话只能表示2 ^ (6 + 1) = 128，确保能表示2位有效数字。

注意图形API的Subnormal的处理

大概由于subnormal的浮点处理起来比较麻烦，似乎不能用硬件实现。图形API和硬件可能不会完整实现IEEE 754标准，所以关于subnormal的处理要小心，最好假设它会被置0，加上clamp以确保其不会过小而置0，否则可能导致除0异常，部分像素出现Nan值。

OpenGL ES 比如OpenGL ES Spec 27页:https://registry.khronos.org/OpenGL/specs/es/3.0/GLSL_ES_Specification_3.00.pdf

Spec没有规定subnormal的处理方式，硬件可以将subnormal浮点数直接设置为0.

There is no limit on the number of digits in any digit-sequence. If the value of the floating point number is too large (small) to be stored as a single precision value, it is converted to positive (negative) infinity. A value with a magnitude too small to be represented as a mantissa and exponent is converted to zero Implementations may also convert subnormal (denormalized) numbers to zero.

DirectX

https://docs.microsoft.com/en-us/windows/win32/direct3d11/floating-point-rules#32-bit-floating-point-rules DX11的spec有写

Denorms are flushed to sign-preserved zero on input and output of any floating-point mathematical operation. Exceptions are made for any I/O or data movement operation that doesn't manipulate the data.

非规格化的浮点数直接置零。

浮点数有效位数