Why Always-On Engineering Culture Signals Systemic Failure
The idea that being available 24/7 is a sign of dedication is a lie. It’s actually a leading indicator that your architecture and your processes are broken. If you have to be constantly firefighting, you haven't built something resilient. You've just built something that requires a human sacrifice to keep running.
We're being told that AI is about to 10x the productivity of the white-collar workforce. The math is pretty simple, even if the implications are messy. If these tools actually work, I should be able to produce my entire week's worth of output by Monday at noon.
This creates a massive tension in how we measure value. If the work itself becomes cheap and fast, what happens to the person who is still billing by the hour or measuring success by the number of emails sent? We need to figure out if we're actually getting more efficient, or if we're just creating a new kind of digital burnout.
The myth of the heroic developer
The "hero developer" is a liability, not an asset. We often celebrate the person who stays up until 3:00 AM to fix a production outage, but that person is actually a symptom of a broken process. When a single engineer becomes the only person who understands a specific service or can navigate a complex deployment pipeline, you've created a single point of failure. If that developer leaves or even just takes a week of vacation, the team's velocity drops.
Crunch time is usually just the visible result of poor estimation or shifting requirements. It's easy to mistake frantic, high-intensity work for high performance, but it's actually a sign that the planning phase failed. Relying on individual heroics to meet deadlines makes your delivery dates unpredictable.
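You can see this pattern in how teams manage critical infrastructure. A team that relies on "tribal knowledge" instead of documented, reproducible processes is a team waiting for a disaster. The deploy job below is a typical example: it only works on one engineer's machine, because it reads the API key from a file in that engineer's home directory.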
deploy_service:
  script:
    - export API_KEY=$(cat /Users/dave/secrets/key.txt)
    - ./deploy.sh --env production --token $API_KEY
The fix isn't to work harder; it's to automate the mundane and document the complex. Moving from manual, person-dependent steps to a standardized CI/CD pipeline reduces the need for any one person to be a hero. This involves:
- Standardizing build environments using containers.
- Automating secret management through tools like HashiCorp Vault.
- Implementing rigorous code reviews to spread knowledge across the team.
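As a small first step away from the hard-coded path in the job above, a deploy script can read the key from an environment variable that the CI system injects. The following is a minimal sketch under that assumption, not a Vault integration: `load_secret` and `MissingSecretError` are hypothetical names, and the variable name `API_KEY` is borrowed from the job above.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_secret(name, env=None):
    """Return the secret called `name`, or fail loudly if it is missing.

    `env` defaults to the process environment; passing a plain dict
    makes the function easy to exercise in tests.
    """
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        # Failing here is deliberate: a deploy should never fall back
        # to a file on someone's laptop.
        raise MissingSecretError(
            f"{name} is not set; configure it in the CI secret store"
        )
    return value
```

A deploy script would call `load_secret("API_KEY")` once at startup. Whoever runs the job, on whichever machine, gets the same behavior: either the secret is present or the run stops with a clear message.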
Building for autonomy
You can't have autonomous systems if your humans are tethered to their phones. If every deployment requires a developer to sit there and watch the logs for an hour, you haven't built an automated pipeline; you've just built a more expensive way to be stressed. True autonomy comes from a CI/CD pipeline that is opinionated enough to block bad code before it ever touches production.
The goal is to move the "point of failure" as far left as possible. This means your test suite needs to be more than just a collection of passing green checks; it needs to be a rigorous gatekeeper. I've seen teams try to do this by adding more integration tests, but that often just makes the pipeline slow and flaky. Instead, focus on high-coverage unit tests and automated canary deployments. When you use a canary pattern, you're only exposing a small fraction of your traffic to the new version. If the error rate spikes, the system rolls itself back.
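name: Deploy to Production
on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the deploy and metrics scripts exist on the runner.
      - uses: actions/checkout@v4
      - name: Deploy Canary
        run: ./deploy_script.sh --canary --weight 10
      - name: Monitor Error Rates
        run: ./check_metrics.sh --threshold 0.01
      # Runs only if an earlier step in this job failed.
      - name: Rollback on Failure
        if: failure()
        run: ./deploy_script.sh --rollback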
This setup works because it removes the human decision-making process from the emergency. The logic is simple: if the error rate exceeds 1%, the deployment is reverted. It's not perfect, and monitoring logic can get messy, but it's better than a 3:00 AM phone call. To make this work, your infrastructure needs to support these automated transitions, which usually requires a service mesh or a sophisticated load balancer configuration.
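The post does not show what `check_metrics.sh` does internally. The following is a minimal Python sketch of the kind of decision such a check might make, assuming the monitoring system can report a request count and an error count for the canary. `CanaryStats`, `error_rate`, `should_rollback`, and the `min_requests` guard are all hypothetical; only the 1% threshold comes from the workflow above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStats:
    """Traffic observed by the canary during the monitoring window."""
    requests: int
    errors: int


def error_rate(stats):
    """Fraction of canary requests that failed; 0.0 when there was no traffic."""
    if stats.requests == 0:
        return 0.0
    return stats.errors / stats.requests


def should_rollback(stats, threshold=0.01, min_requests=100):
    """Return True when the canary's error rate exceeds `threshold`.

    With fewer than `min_requests` requests the sample is treated as too
    small to judge and no rollback is triggered. That guard is an
    assumption of this sketch; a stricter policy could wait for more
    traffic or fail closed instead.
    """
    if stats.requests < min_requests:
        return False
    return error_rate(stats) > threshold
```

A wrapper script would exit with a non-zero status when `should_rollback` returns `True`, which is what causes the `if: failure()` step in the workflow to run. Keeping the decision in a pure function also means the threshold logic can be unit tested without deploying anything.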
Measuring output over presence
The shift toward measuring output rather than presence is a logical response to the fragmentation of the modern office, but I think the implementation details will be where the real friction lies. It’s easy to celebrate the end of "green dot" culture, where being active on Slack is mistaken for being productive. However, switching to purely measurable output assumes we have a standardized way to quantify what "good" looks like across different roles. For a software engineer, pull request velocity and deployment frequency are tangible metrics. For a product manager or a designer, those same metrics are often useless or, worse, actively harmful to the quality of their work.
I see a risk that we aren't actually moving away from surveillance, but simply trading one form of it for another. If we stop monitoring when someone logs on, we will likely start monitoring how many tickets they close or how many lines of code they commit. This doesn't necessarily grant autonomy; it just changes the metric of control. We might find ourselves in a loop where the harder it is to measure a specific type of deep, contemplative work, the less that work gets valued by the organization.
The real test will be whether this transition actually reduces burnout or if it just creates a new, more intense pressure to produce visible artifacts at all times. I'm curious to see if teams can actually sustain this without falling back into the trap of rewarding the loudest, most visible contributors.
Conclusion
Stop looking for the one engineer who can save your entire sprint by working through the weekend. That person is usually just a single point of failure waiting to happen, and once they burn out, everything that depended on them stalls.
If you want a team that actually scales, stop tracking commit timestamps and start looking at how much agency your engineers have to make their own decisions. It’s much harder to build that kind of autonomy than it is to just monitor a Slack status.
I'm curious if we're actually moving toward better management, or if we're just finding more sophisticated ways to micromanage the same old metrics.