The Ticket Was Closed, but the User Couldn't Pay
Source: Dev.to
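The backend returned 200.
The mobile app showed an error.
The user tapped “Pay” three times.
Three pending charges hit their account. One order was created. Their balance was insufficient. And the incident log showed zero failures.
Every engineer on the team did their job. Nobody solved the problem.
This is the most common way engineering teams fail: not through incompetence, but by doing the work correctly at the wrong unit. And until you recognize the difference between “completing the task” and “solving the business problem,” you will keep shipping perfectly working systems that deliver a poor user experience.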
The Ticket-Thinker vs. The System-Owner
Most engineers early in their careers think in tickets.
Ticket assigned → code written → tests pass → PR merged → ticket closed. Done.
This is fine when you’re learning. It’s a liability when you’re trying to grow.
The engineer who closes tickets is useful. The engineer who asks “what problem does this ticket actually solve, and am I solving it in the right place?” is dangerous in the best way.
Here’s the distinction in practice.
The backend engineer builds a payment endpoint. It processes charges correctly, returns the right status codes, has proper error handling. 100% test coverage. Ticket closed.
The mobile engineer builds the payment screen. It calls the endpoint, handles the response, shows confirmation or error. Smooth UI. Ticket closed.
The problem nobody owned: what happens when the network drops after the backend processes the charge but before the mobile app receives the confirmation?
The backend: charge processed. No error.
The mobile: timeout. Shows “Payment failed.” User retries.
The user: charged twice.
Both engineers solved their assigned problem correctly. The business problem — charge the user once and confirm it reliably — went unsolved. Because that problem lived in the space between their tickets, and nobody was watching that space.
Real Scenario 1: The Payment That Worked and Failed at the Same Time
This happens in production more than any team admits.
In a payment flow, the sequence is: mobile initiates → backend charges → payment processor confirms → backend responds → mobile confirms to user.
Network latency exists at every arrow in that chain.
If the connection between the backend and mobile drops after the payment processor confirms but before the backend responds to the mobile, both the backend log and the payment processor log show success. The mobile app shows “Payment failed. Please try again.”
A user who trusts the mobile app retries. Now they’re charged twice.
The fix isn’t purely a backend fix. It isn’t purely a mobile fix. It requires:
Idempotency keys — the mobile generates a unique key per payment attempt and sends it with every request. The backend uses it to guarantee that retrying the same request never creates a duplicate charge, regardless of how many times the network drops and retries.
// Mobile: generate the idempotency key once per payment attempt and persist it
const storageKey = `pending_payment_key_${orderId}`;
let idempotencyKey = localStorage.getItem(storageKey);
if (!idempotencyKey) {
  // First attempt only: retries must reuse the stored key, not mint a new one
  idempotencyKey = `pay_${userId}_${orderId}_${Date.now()}`;
  localStorage.setItem(storageKey, idempotencyKey);
}
// Send with every retry of this specific payment
const response = await fetch('/api/payments', {
  method: 'POST',
  headers: {
    'Idempotency-Key': idempotencyKey,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ amount, currency, orderId })
});
// Backend: check for an existing successful charge with this key
async function processPayment(req) {
  const idempotencyKey = req.headers['idempotency-key'];
  if (!idempotencyKey) {
    throw new Error('Missing Idempotency-Key header');
  }
  const existing = await db.payments.findOne({ idempotencyKey });
  if (existing?.status === 'success') {
    return existing; // Return the same result. Don’t charge again.
  }
  // Caveat: this lookup alone can race under concurrent retries;
  // a unique constraint on idempotencyKey is still needed to close that gap.
  const charge = await paymentProcessor.charge(req.body);
  await db.payments.create({ idempotencyKey, ...charge });
  return charge;
}
This solution only exists if a backend engineer and a mobile engineer sit down together and ask: what does the user experience look like when the network misbehaves? Not: does my component work?
That’s the difference.
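As a rough illustration of the guarantee described above, the following sketch simulates three taps on “Pay” that share one idempotency key. Everything in it is hypothetical: `createPaymentService`, the fake charge function, and the in-memory `Map` stand in for a real backend, payment processor, and database. It also keeps the in-flight request in the map, which is one possible way (not necessarily the author's) to stop concurrent retries from racing past the lookup.

```javascript
// Hypothetical in-memory stand-in for a payments table keyed by idempotency key.
// A real system would use a database with a unique index on that key.
function createPaymentService(chargeFn) {
  const resultsByKey = new Map(); // idempotencyKey -> Promise of the charge result

  return async function processPayment(idempotencyKey, payment) {
    if (!idempotencyKey) {
      throw new Error('Missing idempotency key');
    }
    // Register the in-flight promise before any await, so a concurrent
    // retry with the same key shares it instead of charging again.
    if (!resultsByKey.has(idempotencyKey)) {
      const pending = chargeFn(payment).catch((err) => {
        // Forget failed charges so the client can retry with the same key.
        resultsByKey.delete(idempotencyKey);
        throw err;
      });
      resultsByKey.set(idempotencyKey, pending);
    }
    return resultsByKey.get(idempotencyKey);
  };
}

async function demo() {
  // Fake processor that counts how many times the card is actually charged.
  let chargeCount = 0;
  const fakeCharge = async (payment) => {
    chargeCount += 1;
    return { status: 'success', chargeId: `ch_${chargeCount}`, amount: payment.amount };
  };

  const processPayment = createPaymentService(fakeCharge);
  const payment = { amount: 4200, currency: 'USD', orderId: 'order_1' };

  // Three taps on "Pay" for the same payment attempt, all with the same key.
  const results = await Promise.all([
    processPayment('pay_key_1', payment),
    processPayment('pay_key_1', payment),
    processPayment('pay_key_1', payment),
  ]);
  return { results, chargeCount };
}

demo().then(({ chargeCount }) => console.log('charges made:', chargeCount));
// charges made: 1
```

All three calls resolve to the same charge record, so the mobile app can show a confirmation no matter which response actually makes it back over the network.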
Real Scenario 2: The Smart Device That “Works”
A team builds a smart home device. Hardware, mobile app, cloud backend: three separate engineering workstreams.
The hardware engineer ships firmware that correctly sends state changes to the cloud API. Tests pass. Ticket closed.
The mobile engineer ships an app that correctly receives state changes from the cloud and updates the UI. Tests pass. Ticket closed.
The backend engineer ships an API that receives from hardware and sends to mobile. Load tested. Ticket closed.
Users buy the device. They press the button to turn on their light.
The light turns on 11 seconds later.
Nobody’s system is broken. The latency was distributed across three components, each one individually fine, each one adding 3–4 seconds of its own processing and polling delay. Nobody measured the end-to-end journey. Nobody owned the number that the user actually experiences: the time between button press and light turning on.
The product reviews say “laggy” and “unresponsive.” The engineering team looks at their metrics and sees nothing wrong.
This is what happens when reliability is treated as a component property instead of a system property.
Real reliability, the kind users actually experience, only exists at the intersection of every layer. The backend can be 99.9% available. If the mobile SDK polls every 5 seconds, the user can wait up to 5 seconds before the backend is even consulted. Hardware transmission latency comes on top of that, and cloud-to-mobile push latency on top of that.
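The arithmetic behind that stacking can be sketched as a simple latency budget. The stage names, the per-stage numbers, and the 1-second target below are illustrative assumptions, not measurements from this article; the point is only that worst-case delays add up across layers even when each layer looks acceptable on its own.

```javascript
// Hypothetical worst-case latency per stage, in milliseconds.
// These values are illustrative assumptions, not real measurements.
const stages = [
  { name: 'mobile polling interval', worstCaseMs: 5000 },
  { name: 'hardware transmission', worstCaseMs: 3000 },
  { name: 'cloud-to-mobile push', worstCaseMs: 3000 },
];

// Sum the worst case of every stage the user's action passes through.
function endToEndWorstCase(stageList) {
  return stageList.reduce((total, stage) => total + stage.worstCaseMs, 0);
}

// Compare the end-to-end total against a user-facing budget.
function checkBudget(stageList, budgetMs) {
  const totalMs = endToEndWorstCase(stageList);
  return { totalMs, budgetMs, withinBudget: totalMs <= budgetMs };
}

console.log(checkBudget(stages, 1000));
// { totalMs: 11000, budgetMs: 1000, withinBudget: false }
```

No single stage here would trip a per-component alert, yet the total lands in the same range as the 11-second delay users complained about.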
The only way to catch this is to instrument the entire journey, not individual components:
// Instrument the user-facing journey end to end.
// Not just "did the API respond?" but "did the user get feedback?"
const journeyStart = performance.now();
try {
  await hardwareCommandAPI.send(deviceId, 'toggle_light');
  // Poll for state change confirmation from device
  await waitForDeviceStateChange(deviceId, 'on', { timeoutMs: 2000 });
} finally {
  // Record the latency even if the send or the wait throws,
  // so the slowest journeys are not silently dropped from the metric
  const userFacingLatency = performance.now() - journeyStart;
  metrics.record('light_toggle_user_latency_ms', userFacingLatency);
}
When this number starts living in your dashboards, cross-functional conversations change. “The API is fast” stops being the end of the discussion.
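One way such a dashboard number might be derived is a percentile over the recorded samples, compared against a user-facing target. This is a minimal sketch: the sample values, the 1-second budget, and the nearest-rank `percentile` helper are assumptions for illustration, and a real metrics backend would normally compute percentiles for you.

```javascript
// Nearest-rank percentile over recorded samples (0 < p <= 100).
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new Error('No samples recorded');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical samples of light_toggle_user_latency_ms, in milliseconds.
const samples = [850, 920, 1100, 1300, 2400, 3100, 4800, 9500, 10200, 11000];
const budgetMs = 1000; // assumed product target, not from the article

const p50 = percentile(samples, 50);
const p95 = percentile(samples, 95);
console.log({ p50, p95, overBudget: p95 > budgetMs });
// { p50: 2400, p95: 11000, overBudget: true }
```

Tracking a high percentile rather than an average keeps the slowest journeys visible, which is where “laggy” reviews tend to come from.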
Why Engineers Stay Stuck in the Ticket Mindset
It’s not laziness. It’s incentive structure.
Most engineering teams measure and reward what’s visible: tickets closed, PRs merged, features shipped, uptime of individual services.
Nobody measures “how many times did an engineer spot a problem outside their lane and raise it?” Nobody gives performance review credit for the mobile engineer who asked the backend team: “what happens to our payment UI if your charge endpoint takes 8 seconds instead of 200ms?” And then followed up with: “here’s what the user sees, here’s the drop-off in our funnel.”
The ticket system creates invisible walls between components. Each engineer optimizes for their component. The user lives in the space between the walls and has no advocate unless someone consciously takes on that role.
One of the clearest signs of engineering maturity is the ability to think beyond the ticket and own the user outcome.
Not deeper technical expertise in one domain. The willingness to hold the end-to-end user journey in your head while working in one specific layer of it.
What Cross-Functional Reliability Actually Looks Like
Collabor