Unsigned sizes: a five year mistake
Source: Hacker News
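“A brief note for readers unfamiliar with C3: it is a systems language in the C tradition. The details below are C3's, but the trade-offs apply to any language that has to pick a type for sizes and lengths.”
Why C3 is switching to signed integers by default
C3 is moving to signed integers by default, but why are we doing this? Isn't unsigned the better fit, at least for sizes? Let's try to answer that question.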
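The pitfalls of unsigned integers
Since its early days, C3 has used unsigned sizes. While the name of the unsigned type changed over time, from usize to usz (after it was unified with the uptrdiff type), its status as the default type was never challenged.
However, unsigned integers have well-known pitfalls, the most famous being:
for (uint x = 10; x >= 0; x--) // infinite loop!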
{
…
}
This mistake is so easy to make that C3 explicitly forbids x >= 0 on unsigned types outside of macros.
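For comparison, here is a minimal C sketch of a countdown loop that does terminate with an unsigned index. The helper name sum_backwards is made up for illustration; it is one common idiom, not something prescribed by C3.

```c
#include <stddef.h>

/* Hypothetical helper: adds up arr[n-1] down to arr[0].
   The condition `i-- > 0` compares the old value of i with 0 and only
   then decrements it, so the loop body sees the indices n-1 .. 0 and
   the loop stops instead of running forever. */
long sum_backwards(const int *arr, size_t n)
{
    long total = 0;
    for (size_t i = n; i-- > 0; )
    {
        total += arr[i];
    }
    return total;
}
```

Note how the working version is less obvious than the broken one: the decrement has moved into the condition.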
Another classic C mistake is:
uint a = 0;
int b = -1;
if (a > b) { … }
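In C, both operands are converted to unsigned, which turns b into a huge unsigned value and makes the comparison give the wrong answer. Because of this, C3 implemented safe unsigned/signed comparisons, which do not convert both sides to a common type and are safe for any operand types.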
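To make the difference concrete, here is a rough C sketch of a sign-aware comparison next to what plain C does. Both function names are hypothetical, and this only illustrates the idea; it is not how the C3 compiler implements it.

```c
#include <stdbool.h>

/* Hypothetical helper: true when a is mathematically greater than b. */
bool uint_greater_than_int(unsigned int a, int b)
{
    /* Every unsigned value is greater than a negative one. */
    if (b < 0) return true;
    /* b is non-negative here, so converting it to unsigned keeps its value. */
    return a > (unsigned int)b;
}

/* What writing `a > b` means in C: b is converted to unsigned first,
   so -1 becomes UINT_MAX. */
bool c_style_greater(unsigned int a, int b)
{
    return a > (unsigned int)b;
}
```

With a = 0 and b = -1, the first function answers true and the second answers false.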
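Of course, C allows implicit conversions between unsigned and signed. While this is often a source of bugs, I think it is mostly acceptable as long as there are safeguards.
It is easy to treat the errors above as unrelated quirks: the loop that never terminates, the wrong comparison, the conversions that need to be fixed up “just right”. But they all stem from an earlier decision: making unsigned the default type for sizes. Most of this article is really about that decision.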
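A pertinent question
You might reasonably ask: “Why not require signed/unsigned conversions to be explicit?”
The answer has to do with unsigned sizes.
If sizes are unsigned, as they are in C, C++, Rust, Zig and C3, then anything that involves indexing into data must either use unsigned throughout or require casts. C's lax semantics hide the problem to some extent, but in Rust it means you frequently have to cast back and forth when working with sizes.
There are two approaches to casts:
- Sprinkle them liberally across the codebase, on the grounds that “it's an explicit conversion, so it's obvious”.
- Minimize casts and use them only where something unusual is going on: “this is risky”.
The former is easier to implement, but is essentially warning suppression. For example, suppose the code originally cast a u16 to u32. Later the variable's type is changed to u64; the cast now silently truncates the value. The cast has become a way to “silence all warnings”.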
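The u16 to u32 to u64 scenario can be sketched in C with the fixed-width types from stdint.h. The function and parameter names are made up for the example.

```c
#include <stdint.h>

/* Written while `length` was 16 bits wide: the cast only widens,
   so it is harmless. */
uint32_t to_u32_from_u16(uint16_t length)
{
    return (uint32_t)length;
}

/* The same cast after `length` was changed to 64 bits: it still
   compiles without complaint, but now drops the upper 32 bits. */
uint32_t to_u32_from_u64(uint64_t length)
{
    return (uint32_t)length;
}
```

Passing 0x100000001 to the second function returns 1, and nothing in the source hints that the cast changed meaning.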
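The “it's an explicit conversion” idea is also undermined when casts are inserted mechanically wherever the compiler demands them, rather than added after carefully reviewing each case.
Minimizing casts, on the other hand, is more challenging: we need rules that correctly allow safe implicit conversions while forcing explicit conversions in the unsafe cases.
C3 took the second approach: casts should be meaningful. But why did it still allow unsigned ↔ signed conversions? Aren't those unsafe?
The fact is that as long as you only use addition, subtraction and multiplication, and signed integers use two's complement representation, the conversion is largely safe. Since the conversions will happen frequently (remember: unsigned sizes!), making them implicit is a natural trade-off.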
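A small C sketch of why this holds, and of where it stops holding. The function names are hypothetical, and the inputs are assumed to be small enough that the signed operations do not overflow (which would be undefined behaviour in C).

```c
#include <stdint.h>
#include <stdbool.h>

/* For +, - and *, computing in signed arithmetic and then converting the
   result to unsigned gives the same bits as converting the operands first.
   Assumes a + b, a - b and a * b do not overflow int32_t. */
bool add_sub_mul_agree(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    return (uint32_t)(a + b) == ua + ub
        && (uint32_t)(a - b) == ua - ub
        && (uint32_t)(a * b) == ua * ub;
}

/* Division does not have this property. Assumes b != 0 and that
   a / b does not overflow. */
bool div_agrees(int32_t a, int32_t b)
{
    return (uint32_t)(a / b) == (uint32_t)a / (uint32_t)b;
}
```

For a = -6 and b = 3 the first check passes, while the second fails: -6 / 3 is -2, but 4294967290 / 3 is 1431655763.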
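The best laid plans
C3 has largely kept its current conversion semantics since 2021, and they worked well for five years without serious odd behaviour, until an innocent question about (foo + a) % 2 completely upended those assumptions.
To eliminate hidden pitfalls, C3 had changed the rules so that int + uint promotes to int rather than to unsigned. This lets many cases silently use signed arithmetic, which is the right thing in most situations.
But consider (foo + a) % 2 where foo happens to be greater than INT_MAX. This gives hard-to-understand results; the correct way to write it would be (foo + a) % 2U.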
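C itself promotes int + unsigned to unsigned, so the effect has to be emulated there. The sketch below assumes the sum is reinterpreted as a two's complement signed value before the remainder is taken; the helper names are hypothetical, and this is an approximation of the behaviour described, not C3 code.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 32-bit unsigned value as a signed one. */
int32_t as_signed(uint32_t u)
{
    int32_t s;
    memcpy(&s, &u, sizeof s);
    return s;
}

/* Emulates the sum being treated as signed before `% 2`. */
int32_t parity_signed(uint32_t foo, int32_t a)
{
    return as_signed(foo + (uint32_t)a) % 2;
}

/* The sum stays unsigned, as when writing `% 2U`. */
uint32_t parity_unsigned(uint32_t foo, int32_t a)
{
    return (foo + (uint32_t)a) % 2U;
}
```

With foo = 2147483649 (INT_MAX + 2) and a = 0, the signed version yields -1 while the unsigned version yields 1, so a check such as `== 1` quietly stops matching odd values.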
This was unacceptable, not because it was hard to fix, but because it was unexpected. Almost everywhere else you could simply ignore whether the underlying conversion was signed or unsigned – it just worked. But / and %? Here the solution broke down. Because it “just worked” elsewhere, it was fairly opaque which sub‑expression was signed or unsigned. The convenience turned a minor issue into a big one.
The immediate reaction was to patch it: issue an error on “unsigned / signed” and “unsigned % signed”. However, more issues were lurking in the shadows.
The tricky wrap
If you write a ring buffer, how do you make sure that calculating offsets wraps correctly?
The naïve solution is:
index = (start + offset) % length;
This works as long as offset is positive. What about negative values? A common simple solution is:
index = ((start + offset) % length + length) % length;
Since offset is negative we can assume signed numbers; barring extremely large offsets (causing signed overflow) this works.
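As a C sketch, with a hypothetical helper name and the stated assumptions that length is positive and start + offset does not overflow:

```c
/* Hypothetical helper: wraps start + offset into the range [0, length).
   offset may be negative. Assumes length > 0 and that start + offset
   does not overflow int. */
int wrap_index(int start, int offset, int length)
{
    /* The first % may give a negative remainder; adding length and
       taking % again moves it into [0, length). */
    return ((start + offset) % length + length) % length;
}
```

For a buffer of length 5, stepping back 3 from index 1 gives index 3, as expected.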
Now remember how we started with unsigned sizes? Using unsigned everywhere leads to code that looks like this:
index = ((start - offset_back) % length + length) % length;
That is completely wrong – but also hard to detect. It will sometimes wrap correctly, but mostly not.
The correct code for unsigned arithmetic should be something like:
index = (start + length - (offset_back % length)) % length;
Regardless of the rules we apply to unsigned ↔ signed conversions, there is simply no way for the compiler to tell us that the first “offset_back” example is broken for unsigned.
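Here are the two unsigned variants side by side as a C sketch. The function names are made up; the fixed version assumes start < length and that start + length does not overflow.

```c
#include <stdint.h>

/* The signed idiom copied over to unsigned types. start - offset_back
   wraps around to a huge value when offset_back > start, and the
   remainder of that value is only right for some lengths. */
uint32_t wrap_back_broken(uint32_t start, uint32_t offset_back, uint32_t length)
{
    return ((start - offset_back) % length + length) % length;
}

/* Reduces offset_back first, so the subtraction can never go below zero.
   Assumes start < length and that start + length does not overflow. */
uint32_t wrap_back_fixed(uint32_t start, uint32_t offset_back, uint32_t length)
{
    return (start + length - (offset_back % length)) % length;
}
```

Stepping back 3 from index 1 in a buffer of length 5 should give 3; the broken version returns 4. With length 4 both versions return 2, because 4 divides 2^32 evenly, which is why the bug can hide behind power-of-two buffer sizes.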
The unsigned size
It seems hard to solve the problem with unsigned, so perhaps we’re making a faulty assumption.
Look back in time: C was originally designed around signed integers, with the int type at its core. This all changed when the type of sizeof was standardized to the unsigned size_t.
That single change single‑handedly introduced unsigned arithmetic as a common thing in C code. Finding this new shiny thing, people started to use `unsigned` to encode “this value can’t be negative” and talked about how using `unsigned` helped since it allowed them to express larger sums.
That didn’t mean it was without problems. In fact, the problems were so significant that in the 90s Java decided to drop unsigned types entirely in its design. Java’s reaction was perhaps a little extreme, but it did achieve the goal of making a large set of common bugs – related to unsigned – just go away.
Go should give us pause: it’s a low‑level language, created as a reaction to problems in C++, by people who knew exactly what unsigned sizes cost – and they picked signed sizes.
With any bounded integers, problems arise when we close in on the boundaries. For a 32‑bit signed int that is approximately ±2 billion, for an unsigned 32‑bit integer it’s 0 … ≈4 billion. The lower “unsafe” boundary for unsigned, zero, lies so much closer to everyday values than either boundary for signed integers – there is simply no contest.
This is exactly why we see problems for things like the case with %.
But what about the range? While it’s true that you get twice the range, surprisingly often the code in the range above INT_MAX is quite bug‑ridden. Any code doing something like
(2U * index) / 2U
in this range will have quite the surprise coming. It’s worse than that: overflow for signed values generally produces an invalid, negative number – but unsigned overflow often produces a plausible number, just the wrong one. Not to mention that on modern 64‑bit machines you’ll run out of memory before you can use a full signed 64‑bit integer.
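A tiny C sketch of that expression with 32-bit unsigned values (the function name is made up):

```c
#include <stdint.h>

/* Looks like a no-op, but 2U * index wraps modulo 2^32 once index
   reaches 2^31, and the division cannot undo that. */
uint32_t double_then_halve(uint32_t index)
{
    return (2U * index) / 2U;
}
```

For index = 1000 the result is 1000. For index = 3000000000 it is 852516352: a perfectly plausible index, just not the right one.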
Isn’t it valuable to be in the right range by design?
The answer seems to be no, judging from work on verification frameworks, as unsigned only encodes modulo behaviour and not actual ranges. It might be argued that you can make unsigned overflow an error (this is indeed what Rust does in debug builds), but that removes useful properties of unsigned arithmetic:
(a + b) - c == a + (b - c) // true when unsigned arithmetic wraps
If overflow is not allowed, the equality no longer holds, which is a pitfall in itself.
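A C sketch of the difference, using hand-written checked helpers. All names here are hypothetical; the checked functions report failure through their return value rather than trapping.

```c
#include <stdint.h>
#include <stdbool.h>

/* Checked addition: returns false instead of wrapping. */
bool checked_add(uint32_t x, uint32_t y, uint32_t *out)
{
    if (y > UINT32_MAX - x) return false;
    *out = x + y;
    return true;
}

/* Checked subtraction: returns false instead of wrapping below zero. */
bool checked_sub(uint32_t x, uint32_t y, uint32_t *out)
{
    if (y > x) return false;
    *out = x - y;
    return true;
}

/* (a + b) - c with overflow treated as an error. */
bool left_grouping(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
{
    uint32_t sum;
    return checked_add(a, b, &sum) && checked_sub(sum, c, out);
}

/* a + (b - c) with overflow treated as an error. */
bool right_grouping(uint32_t a, uint32_t b, uint32_t c, uint32_t *out)
{
    uint32_t diff;
    return checked_sub(b, c, &diff) && checked_add(a, diff, out);
}
```

With a = 5, b = 3, c = 4, wrapping arithmetic gives 4 for both groupings. With the checked helpers, the left grouping still gives 4, but the right grouping fails because b - c dips below zero along the way.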
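So the fact that we use unsigned so often is more or less a historical accident. It is error-prone and silently hides bugs. Maybe the solution isn't to make it more ergonomic?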
Signed first
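As you might expect, C3 has adopted signed sizes and lengths. Since unsigned types are now much rarer, we no longer need any implicit conversions between unsigned and signed. Unsigned versus signed comparisons? Gone as well.
While making this change I also started removing unnecessary uses of uint and ulong, and found some code that looked suspicious or was outright wrong. On top of that, the code became cleaner once it only used int and signed sizes. It was here I realized that I had been internalizing the cost of using unsigned: after working in C or C++ for a while, you get into the habit of looking for problems that unsigned might cause, and of using patterns that are less intuitive but work for both unsigned and signed variables.
I'm a bit embarrassed that it took me so long to make the change, which also shows how deeply ingrained the habit is. I simply assumed that unsigned sizes were the best option, and that the problem was only one of improving usability and removing as many pitfalls as possible. This despite Go and Java already having shown a better way with signed sizes.
Even after deciding to make the change, converting unsigned to signed initially felt awkward and wrong, as if I were doing something forbidden, which shows just how far gone I was. But seeing each change make the code both easier to reason about and more correct, I could no longer deny the evidence.
Some notes on the C3 change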
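- This change was discussed in the C3 Discord before it was implemented, and got the affectionate name “iszmageddon”, derived from the isz type (roughly corresponding to ssize_t) becoming the default size type.
- To promote signed sizes more clearly, isz was renamed sz, giving the asymmetric sz/usz pair in version 0.8.0. This makes it easy to remember which type is preferred. The change was consequently renamed “szmageddon”.
- Initially, implicit signed ↔ unsigned conversions were left largely unchanged, but they were later removed entirely.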