Building Reliable Software: The Trap of Convenience
Source: Dev.to
When I started learning to program a PC (as opposed to programming an Amiga), it was still the 20th century.
In the days of yore, when electricity was a novel concept and computer screens had to be illuminated by candlelight in the evenings, we’d use languages like C or Pascal. While the standard libraries of those languages provided most of the needed primitives, it was by no means a “batteries included” situation. Even where a standard‑library solution existed, we’d still drop to inline assembly for performance‑critical sections because those computers were definitely not fast enough to spare any CPU cycles.
PCs were also still similar enough that they used the same CPU architecture, and thus the same machine code, so the assembly sections were not that hard to maintain.
Today, the x86 architectures come with dozens of optional extensions and you’re not even guaranteed to encounter an “Intel” machine (technically referred to as amd64). RISC CPUs are coming back to reclaim computing thanks to the efforts of Apple (Apple Silicon), Amazon (AWS Graviton), Microsoft (Azure Cobalt), and other ARM licensees.
In 2026, writing assembly code is something you only do if there is absolutely no other solution. The number of versions of that inline section keeps growing with every new CPU family. Meanwhile, modern compilers are getting so good at optimizing the resulting machine code, and computers are so fast, that manual optimization is usually not worth the effort—unless it absolutely is.
So modern programming languages split into two camps:
- System programming languages that optimize for performance with an extra focus on safety (e.g., Rust).
- Application programming languages that optimize for productivity—the speed at which we produce useful software, rather than the speed at which that software runs.
Productivity Through Convenience
Productivity demands higher‑order abstractions. Instead of representing how the underlying hardware works, the programming languages and their libraries model how people think about the problems.
Thanks to this, instead of writing several pages of C code to allocate a send buffer, open a socket, set its options, resolve the target hostname, establish a connection, and so on, you can fetch and parse a web resource in a few lines of Python:
```python
import requests

def fetch_json(url):
    data = requests.get(url)
    return data.json()
```
With just a few keystrokes I can achieve what used to take me hours to type out. Thanks to both Python (and C#, TypeScript, and likely also your favorite language) and the requests library (and its equivalents) being open‑source and freely available, we can all collectively and individually build more complex systems with less effort.
Note: As I mentioned in my previous post, it’s systems all the way down. All those systems make a (conscious or not) choice about what it means to be a reliable tool.
As a fun exercise, look at the example above and try to figure out what the biggest problem with that bit of code is. It certainly works for the happy path, which would make it pass a lot of the unit tests!
Let’s walk through several (but not all—the complete list would be way too long) of the things that can go wrong in just two lines of code.
A Litany of Failure Modes
```python
data = requests.get(url)
return data.json()
```
- Invalid URL – An error is raised if `url` is not a valid URL (or not even a string).
- DNS resolution failure – An error is raised if the target hostname cannot be resolved (the domain does not exist or the DNS server is unreachable).
- Multiple IP addresses – If the hostname resolves to several IPv4/IPv6 addresses, each is tried in sequence. Because no timeout is specified, the default system TCP timeout is used (six connection attempts totaling about 127 seconds on modern Linux) for each individual address. If none accept the connection, an error is raised.
- Protocol mismatch – If the target system does not speak HTTP (or speaks a different protocol) and returns random gibberish, an error is raised.
- TLS handshake failure – An error is raised if the protocol is HTTPS and the server does not offer any TLS version we trust.
- Invalid TLS certificate – An error is raised if the server’s certificate is broken, expired, or not trusted by any of the certificates in our trust store.
- Hostname‑certificate mismatch – An error is raised if the certificate is valid but does not match the target hostname.
- Read timeout – If the server stops responding, an error is raised. Because no timeout is specified, the default system TCP read timeout (≈ 60 seconds on modern Linux) is used. If the server sends anything during that window, the timer resets as the read is considered successful.
- Redirects – If the response is a valid HTTP redirect, the request is automatically retried from the top of this list using the new URL.
- Invalid JSON – If we finally get a response that is not a valid JSON string, `data.json()` raises an error.
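Several of these failure modes stem from relying on implicit defaults. A minimal hardening of the original function makes the timeouts explicit and turns HTTP error statuses into exceptions; the `(5, 30)` connect/read timeout values and the function name are illustrative choices, not library defaults:

```python
import requests

def fetch_json_strict(url, timeout=(5, 30)):
    # timeout is a (connect, read) pair in seconds -- illustrative values;
    # without it, requests will happily wait on the OS defaults.
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn HTTP 4xx/5xx into exceptions
    return response.json()
```

This doesn't make the failure modes go away; it makes them surface quickly and loudly, which is the prerequisite for handling them deliberately.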
Different parts of the above are problematic—or extremely problematic—depending on what your code is attempting to achieve.
- If your goal is to download a movie, preserve artifacts of a system you’re about to delete, or create complete copies of websites for a project like the Internet Archive’s Wayback Machine, then you probably want the code to keep trying, possibly with custom back‑off strategies, rather than giving up on the first transient error. The desired outcome is to access the resource at all costs.
Failing Fast
If you guessed that the problem with the code is that it could crash, you probably guessed wrong. I’m going with “probably”, because I don’t know what your use case is. But most systems handle failing fast rather gracefully.
A simple try/except block wrapped around the call to our function could take care of specifying the fallback behavior. And even if that is absent, the underlying framework is likely built to withstand the error and return a proper response instead of crashing—like in the old days. What it can’t do is rewind the time it took the code to fail.
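Such a wrapper could look like the sketch below; the default value, the ten-second timeout, and the exact set of caught exceptions are assumptions you'd tune to your use case:

```python
import requests

def fetch_json_or_default(url, default=None):
    # Fail fast: catch network-level errors (requests.RequestException)
    # and JSON parse errors (ValueError), returning a caller-supplied
    # fallback instead of crashing.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError):
        return default
```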
Resisting Abuse
You can’t have reliability without at least some resilience (though failing reliably is also a form of consistency). Therefore you need to teach the system how to defend itself against undesirable behaviors—some outright malicious, some merely careless.
In the example above, extremely malicious behaviors would include:
- configuring a domain’s DNS zone to resolve to 511 different IP addresses, all from non‑routable network segments such as `192.168.0.0/16`;
- having a domain resolve to 9 non‑routable IPs and one that returns an HTTP redirect to the same domain;
- pointing the URL to a server that streams the response by sending one byte every 50 seconds, thus never triggering a read timeout.
If those numbers sound oddly specific, it’s because we tried all those things internally at Saleor. I don’t remember why we chose 511 IPs—maybe Cloudflare limited the number of records, maybe it didn’t matter because no one would wait for that test to time out anyway.
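The slow-drip attack works because requests applies its read timeout per socket read, so a server sending one byte at a time resets the clock forever. One defense is to stream the body and enforce a total wall-clock deadline yourself; this is a sketch, and the 30-second deadline, chunk size, and helper name are all illustrative:

```python
import time
import requests

def fetch_with_deadline(url, deadline=30, chunk_size=8192):
    # requests' timeout applies to each individual socket read, so a
    # server trickling bytes can hold a connection open indefinitely.
    # Streaming the body lets us enforce a total wall-clock deadline.
    start = time.monotonic()
    body = bytearray()
    with requests.get(url, stream=True, timeout=10) as response:
        for chunk in response.iter_content(chunk_size=chunk_size):
            body.extend(chunk)
            if time.monotonic() - start > deadline:
                raise TimeoutError(f"Download exceeded {deadline}s deadline")
    return bytes(body)
```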
A malicious actor could also ask your system to access a URL of an internal service they can’t reach directly. If the URL comes from an untrusted source, it could be used to probe your internal network for open ports, based on the error codes you return. And if your system is foolish enough to surface the entire “unexpected response” from such a URL, you risk leaking credentials.
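A common first line of defense against this kind of SSRF is to resolve the hostname and reject private, loopback, and link-local addresses before making the request. The sketch below illustrates the idea; the function name is an assumption, the blocklist is not exhaustive, and on its own this does not protect against DNS rebinding between the check and the actual request:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def assert_url_is_public(url):
    # Resolve every address the hostname maps to and reject the URL
    # if any of them points into private, loopback, or link-local space.
    hostname = urlparse(url).hostname
    if not hostname:
        raise ValueError("URL has no hostname")
    for info in socket.getaddrinfo(hostname, None):
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            raise ValueError(f"Refusing non-public address: {ip}")
```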
Did you know?
Any EC2 instance on AWS can request `http://169.254.169.254/latest/meta-data/` to learn about its own IAM role. A subsequent call to `http://169.254.169.254/latest/meta-data/iam/security-credentials//` returns both the AWS access key and its secret. Yikes!
Final Thoughts
Convenience is the biggest pitfall of modern, high‑abstraction productivity. All the important bits and compromises are buried deep in the convenience layers, making it impossible to reason about systems without “popping the hood.” Meanwhile, your IDE, code‑review tools, and whiteboard interviews surface the types of problems that—in the grand scheme of things—don’t matter much: the ones your system can recover from automatically.
If you ever need to access the great unknown from Python code, take a look at the requests‑hardened wrapper we created for requests. It makes it safe to point the library at untrusted URLs from code that doesn’t have forever to wait for the outcome. It also works around a DoS potential in Python’s standard library that we reported responsibly (the issue is public only because the maintainers asked us to make it public).
Takeaway:
Make sure your team doesn’t mistake simple code for the simplicity of the underlying systems. The reassurance offered by invisible complexity is a false one.
Happy failures. Farewell, and until next time!