Taking over a live system at scale, no downtime

Taking over a live system means keeping it running while you make it safe and rebuild it underneath.

Taking it over, not patching or rewriting

If an engineering vendor has left you a fragile system that the business depends on, and it is live and at scale, your options are narrower than they feel. You do not have to choose between patching it forever and rewriting it from a blank page. You can take it over: stabilize it enough to run safely, build the replacement in parallel, and move customers onto it one at a time with no downtime.

This is the situation we were brought into on CORE Home, a multi-tenant real estate platform from Inside Real Estate. It was already live across web and mobile, with pilot brokerages serving real homeowners and real data every day. The inherited system had done its job getting the product to market. It was not built for where the product was going next, and going offline was never an option. We took full ownership of the running stack and rebuilt it underneath the live product.

A moment like this is hard because it is the least forgiving one. The product is the business, customers are on it right now, and the team that built it is gone. The instinct is to reach for one of two extremes: hold the system together with fixes, or scrap it and start again. Both are wrong for something at scale. What follows is the sequence we used instead, written so you can map it onto your own system.

If you are reading this mid-switch, sourcing a team to take the system over, the order of operations below doubles as a way to judge the teams you talk to. The steps are not exotic. The judgment is in doing them in the right order, and in refusing the rewrite-first instinct that feels decisive and is usually the most dangerous move on the table.

Key Takeaways

Stabilize before you rebuild. Get control of the deploy process and clear the worst performance problems so the system is safe to operate. Stabilization is the runway, not the destination.
Build the replacement in parallel on the same data layer, so the old and new systems can be compared against each other and any difference points straight at the new code.
Cut customers over one at a time behind a feature flag, so a single broken case never takes the whole platform down.
The takeover team to trust stabilizes and diagnoses before it pitches a rewrite.

First, make the running system safe to operate

Before you change a line of the architecture, make the running system safe to operate. You cannot rebuild a platform you are afraid to deploy. The first job is to regain control of the two things that decide whether any change is safe: how code ships, and how the system holds up under real load.

On CORE Home that meant two moves. We put the deploy process onto a CI/CD pipeline, which took releases from barely one a day to as many as we needed. And we cleared the worst performance bottlenecks under the load the platform was actually seeing. Those two changes turned a system we had inherited into one we could run with confidence, and gave us a stable base to rebuild from.

Deploys: on demand, up from barely one a day

Moving the deploy process onto a CI/CD pipeline and clearing the worst performance bottlenecks made the platform safe to operate, and cut a projected six-figure sum from annual infrastructure costs along the way.

Stabilizing also buys you something less obvious: time to understand the system before you commit to changing it. While we brought deploys and performance under control on CORE Home, we were also learning how the platform actually behaved, which parts were load-bearing, and where the real risks sat. By the time we proposed the rebuild, it was a plan grounded in the running system rather than a guess made from the outside.

The mindset that matters here is that stabilization is the runway, not the destination. The goal is not to make the inherited system pleasant to work in. It is to make it safe enough to operate while you build its replacement. A team that wants to start rewriting before it can ship a release safely is solving the problems in the wrong order, and you should read that as a warning.

Build the replacement on the data you already have

Once the system is safe to operate, build the replacement in parallel, on the data you already have. The riskiest version of a rebuild changes the code and the data at the same time, because then a bad result could come from either one and you cannot tell which. Keeping the data layer unchanged removes that ambiguity.

We ran the rebuilt backend alongside the old one, both pointed at the same data. On the read paths, where it is safe to run both against the same records, we compared them: a difference pointed straight at the new code, because the data was identical. The new backend only had to be proven correct for one tenant at a time, never for all of them at once. We left the schema alone until the migration was over, and changed it only when a single system was left to own it.
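A minimal sketch of that read-path comparison, assuming both backends can be called as plain functions that return a record as a dictionary. Every name here (`shadow_read`, `diff_records`, `Mismatch`) is illustrative and not taken from the CORE Home codebase. The old backend's answer is always the one served; the new backend is only compared on the side.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
ReadFn = Callable[[str, str], Record]  # (tenant_id, key) -> record

_MISSING = object()  # distinguishes "field absent" from "field is None"


@dataclass
class Mismatch:
    tenant_id: str
    key: str
    field: str
    old_value: Any
    new_value: Any


def diff_records(tenant_id: str, key: str, old: Record, new: Record) -> List[Mismatch]:
    """Field-by-field diff of two responses built from the same stored data."""
    mismatches = []
    for name in sorted(set(old) | set(new)):
        old_value = old.get(name, _MISSING)
        new_value = new.get(name, _MISSING)
        if old_value != new_value:
            mismatches.append(Mismatch(tenant_id, key, name, old_value, new_value))
    return mismatches


def shadow_read(
    tenant_id: str,
    key: str,
    read_old: ReadFn,
    read_new: ReadFn,
    report: Callable[[Mismatch], None],
) -> Record:
    """Serve the old backend's answer; compare the new backend's answer on the side."""
    served = read_old(tenant_id, key)
    try:
        candidate = read_new(tenant_id, key)
        for mismatch in diff_records(tenant_id, key, served, candidate):
            report(mismatch)
    except Exception as exc:
        # A failure in the new code must never affect the live response.
        report(Mismatch(tenant_id, key, "<error>", None, repr(exc)))
    return served
```

Because both readers see the same stored record, anything passed to `report` is a difference in code, not in data. A real system would more likely run the comparison off the request path or on a sample of traffic.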

Reads are the safe place to compare; writes are where you commit. So a single system owned each tenant's writes at any moment, never both at once. That is the line between a careful takeover and a reckless one: you do not run two systems writing the same records and hope they agree. Each tenant kept one writer until it was verified and moved for good.
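The single-writer rule can be sketched as a small router, under the assumption that write ownership is tracked per tenant and defaults to the old system. `WriteRouter`, `hand_over`, and the `Backend` enum are hypothetical names for illustration, not the actual implementation.

```python
from enum import Enum
from typing import Any, Callable, Dict

WriteFn = Callable[[str, str, Any], None]  # (tenant_id, key, value)


class Backend(Enum):
    OLD = "old"
    NEW = "new"


class WriteRouter:
    """Routes each tenant's writes to exactly one backend at a time."""

    def __init__(self, write_old: WriteFn, write_new: WriteFn) -> None:
        self._writers: Dict[Backend, WriteFn] = {
            Backend.OLD: write_old,
            Backend.NEW: write_new,
        }
        self._owner: Dict[str, Backend] = {}

    def owner_of(self, tenant_id: str) -> Backend:
        # Tenants that have not been moved stay on the old system.
        return self._owner.get(tenant_id, Backend.OLD)

    def hand_over(self, tenant_id: str, verified: bool) -> None:
        """Move write ownership to the new backend, only after verification."""
        if not verified:
            raise ValueError(f"tenant {tenant_id!r} is not verified for cutover")
        self._owner[tenant_id] = Backend.NEW

    def write(self, tenant_id: str, key: str, value: Any) -> Backend:
        owner = self.owner_of(tenant_id)
        self._writers[owner](tenant_id, key, value)  # one writer, never both
        return owner
```

There is deliberately no code path that calls both writers for the same request, so dual writes cannot happen by accident.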

The mechanics of sharing one data layer, and why we deferred the schema changes until the end, are their own subject. We covered them in how we rebuilt a backend without migrating the data.

Cut over without a maintenance window

With a verified replacement running in parallel, cut customers over one at a time, with no maintenance window. A platform-wide switch means everyone moves on the same night, and one broken case forces you to roll all of them back. Moving one at a time keeps the blast radius to a single customer.

Two decisions made that possible on CORE Home. The rebuilt backend honored the same API contracts the live web and mobile apps already used, so only one side moved and no customer had to ship an app update the day we cut over. And every tenant's move sat behind its own feature flag: in effect a canary release run one tenant at a time, with each move staged, verified, and independently reversible. When the last tenant was across, we retired the old system, because a parallel run that never ends is its own kind of debt.
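As a rough sketch of a per-tenant cutover, assume the flags live in a simple mapping from tenant ID to boolean and that `verify` is whatever post-cutover check you trust (smoke tests, error rates, a parity report). The function names and the flag store are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable


def backend_for(tenant_id: str, flags: Dict[str, bool]) -> str:
    """Pick the backend for a request; tenants without a flag stay on the old one."""
    return "new" if flags.get(tenant_id, False) else "old"


def cut_over(
    tenants: Iterable[str],
    flags: Dict[str, bool],
    verify: Callable[[str], bool],
) -> Dict[str, bool]:
    """Move tenants one at a time; a failed tenant is rolled back on its own."""
    results: Dict[str, bool] = {}
    for tenant_id in tenants:
        flags[tenant_id] = True  # route this tenant to the new backend
        try:
            ok = verify(tenant_id)
        except Exception:
            ok = False
        if not ok:
            flags[tenant_id] = False  # roll back this tenant only
        results[tenant_id] = ok
    return results
```

A failed verification flips one flag back and leaves every other tenant where it was, which is what keeps the blast radius to a single customer. Because both backends serve the same API contract, the apps never need to know which one answered.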

The full set of decisions that kept tenants running, and the trade-offs behind each, are in how we rebuilt a live multi-tenant SaaS without downtime.

What taking it over the right way bought

Taken over this way, the platform did not just survive the handover. It scaled, on the same foundation we built during the rebuild.

Tenants on the platform today: 5,000+

From a handful of pilot brokerages, CORE Home now serves more than 2 million homeowners and over 100,000 agents, at more than 2 million requests a day, on a foundation that has held for four years without a re-platform.

Three things came with the new foundation. Provisioning a brokerage used to be manual work across branding, domains, and release steps; we wired it into the client billing system, so a branded tenant now spins up when a package is sold. Each branded app used to carry its own build; the product now runs from one shared codebase, with branding and configuration set per tenant. And because configuration lives in one place, a single change reaches web and mobile at the same time.
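One way to picture "one shared codebase, configuration per tenant" is a set of shared defaults with per-tenant overrides layered on top. The keys, tenant IDs, and values below are invented for illustration; the source does not describe the real configuration schema.

```python
from typing import Any, Dict, Mapping

# Shared defaults that every tenant inherits (hypothetical keys).
DEFAULTS: Dict[str, Any] = {
    "brand_name": "Example Homes",
    "primary_color": "#004466",
    "domain": "app.example.com",
    "show_valuations": True,
}

# Per-tenant overrides, e.g. written when a package is sold (hypothetical data).
TENANT_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "acme-realty": {"brand_name": "Acme Realty", "domain": "homes.acme.example"},
}


def resolve_config(overrides: Mapping[str, Any]) -> Dict[str, Any]:
    """Layer a tenant's overrides on top of the shared defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def config_for(tenant_id: str) -> Dict[str, Any]:
    """Single source of truth that both the web and mobile clients would read."""
    return resolve_config(TENANT_OVERRIDES.get(tenant_id, {}))
```

If web and mobile both call `config_for`, a change to one tenant's overrides reaches both clients at once, with no per-brand build involved.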

Patching the inherited system would not have reached any of this, and getting here never required taking the product offline. The takeover and the scale are one story: a platform made safe to operate, rebuilt underneath itself, and carried to where the business needed it without a dark day.

“Over the years Danubio has become the partner we turn to when the work is critical and there's no margin for error. CORE Home was exactly that. They rebuilt the platform underneath the running product without taking a single tenant offline. The kind of team you wish you'd had from the start.”

Andrew Hartnett, EVP Product & Engineering, Inside Real Estate

How to recognize a team you can hand this to

If you are the one choosing who to hand a live system to, the signal to watch is what a team does before it proposes a rebuild. The team you want stabilizes first and diagnoses honestly. It asks to see how the system runs, makes the unsafe parts safe, and finds the real problems before it pitches a plan.

Be wary of the opposite. A team that opens with a full rewrite, before it understands what the system does or can deploy it safely, is selling you a plan instead of judgment. The rewrite is the most expensive and most dangerous option on the table, and reaching for it first, sight unseen, usually means the team is more comfortable with a blank page than with the system you actually have.

The difference usually shows up in the first conversation. A team that has taken over live systems will talk about the order of the work and the risks before it talks about the end state. It will want access to the running system, the deploy process, and the incident history early, because that is where the real picture is. A team that can tell you what it would build over six months but not how it would operate your system next week has skipped the part that actually keeps you up at night.

The work compounds from there. The same team that took over CORE Home and rebuilt it later shipped HomeSearch AI in five months, on the date the company had committed to publicly. The foundation held, which is the point of building it the way we did. A team you can hand a fragile system to is usually the team you keep.

This is the work we do when we stabilize, modernize, and scale an existing product. If a vendor has left you a system the business runs on, you can take it over the same way: stabilize it, rebuild it in parallel, and move onto it without going dark.

Frequently asked questions

Our engineering vendor left us with a fragile, business-critical system. What should we do first?

Stabilize it before you rebuild it. Get control of the deploy process and clear the worst performance problems so the system is safe to operate. A team that wants to rewrite everything before it can even ship a release safely is solving the problems in the wrong order. On CORE Home the first move was putting deploys on a CI/CD pipeline and clearing the bottlenecks, which bought a platform we could run while we rebuilt it.

Can a new team take over a live production system without taking it offline?

Yes. Keep the existing data layer, run the new system alongside the old one, and compare the two on the read paths where running both against the same records is safe. Then move customers across one at a time behind a feature flag, so any problem stays contained to one of them and can be rolled back independently of the rest. The product keeps serving everyone else throughout.

Should we patch the inherited system, rebuild it, or replace it?

For a system the business depends on at scale, the answer is rarely one of those alone. Stabilize it enough to operate safely, then rebuild the foundation underneath the running product and retire the old system once it is empty. Patching alone leaves the structural problems in place, and a from-scratch rewrite either takes the product offline or splits your effort for months.

How do we tell a team that will stabilize first from one that just wants to rewrite everything?

Watch what they do before they pitch. A team you can hand a live system to will ask to see how it runs, make the unsafe parts safe, and diagnose the real problems before it proposes a rebuild. A team that opens with a full rewrite, before it understands what the system does or can deploy it safely, is selling a plan rather than judgment.

Keep reading

How we rebuilt a backend without migrating the data

How we rebuilt a live multi-tenant SaaS without downtime

Do you actually need Kafka?
