Rebuilding a multi-tenant SaaS, zero downtime

CORE Home, the multi-tenant platform, running across web and mobile after the rebuild.

What we took on

You can rebuild the backend under a live multi-tenant SaaS without taking a single customer offline. It's a sequencing problem before it's an architecture problem, and the sequence is the part most teams get wrong.

When we took ownership of CORE Home, it was already running. Web and mobile both existed. A small set of pilot brokerages were live, with real homeowners and real data moving through the product every day. The inherited system had done its job getting CORE Home to market. It wasn't built for where the product was going next.

Before we touched the architecture, we made the running system safe to operate. We moved the deploy process onto a CI/CD pipeline, taking releases from barely one a day to deploying on demand, and cleared the worst performance bottlenecks under real load. That work also cut annual infrastructure costs by a projected six-figure sum. Stabilizing the inherited system first is what earned the room to rebuild it safely.

Inside Real Estate needed a foundation that could carry millions of homeowners, a full white-label rollout, and a product the business could own for years. So we rebuilt that foundation underneath the running product, six months from the first line of code to the last tenant on the new API, with no downtime. Four years on, the same platform carries more than 5,000 tenants. This is how the migration worked, decision by decision.

Key Takeaways

Keep the rebuilt backend behind the same API the live apps already call, so only one side moves.
Run the old and new systems in parallel over one shared data layer, so parity is automatic.
Cut tenants over one at a time behind a feature flag, so a single broken case never takes everyone down.
Retire the old stack only once it is empty, so the migration actually ends.

What does 'without downtime' actually mean?

On CORE Home, without downtime meant three things: no tenant went offline, no customer was forced to update an app the day the backend changed, and the migration had a real end date. Those three constraints shaped the work more than the architecture diagram did.

A multi-tenant platform raises the stakes. For more than 90% of mid-size and large enterprises, a single hour of downtime now costs over $300,000, according to ITIC's 2024 Hourly Cost of Downtime Survey. One backend serves every brokerage, so a clumsy change can hit all of them at once. The risk that matters is blast radius: how many tenants one mistake can reach.

The other quiet risk is coupling. On CORE Home, web and mobile both talked to the backend. If a backend change forced a matching client release, every tenant would have to ship an app update the moment we cut over. We treated that coupling as the first thing to design away.

Why is a live multi-tenant rebuild so hard?

The hard part of rebuilding a live platform is replacing the engine while the car is moving, for every passenger at once, with no moment where the road is closed. The new code is the easy part.

The track record for this kind of work is sobering. Data migrations are notorious for overrunning on time and budget, and most of that risk comes from moving everything at once and meeting the problems live. The way through is to make the change incremental, a strangler fig migration, where the rebuilt system grows in behind the old one and the old one is retired only once it is empty.

The inherited backend was tangled enough that a change in one surface could reach further than we expected. Performance under real load was already a problem, and the system was not sized for a roadmap that pointed at millions of homeowners.

The obvious approach, a single big-bang cutover, is the one that gets teams in trouble. You stage everything, pick a weekend, flip the switch, and hope. If one tenant breaks, the only option is to roll everyone back. The bigger the platform, the worse that bet gets.

Multi-tenant makes all of this sharper. There is no quiet corner to fail in when one mistake can reach every tenant at once. So the entire plan came down to a single goal: shrink how far any one mistake can travel.

The four decisions that kept tenants running

Four decisions did the work. We kept the rebuilt backend behind the same API contracts the live apps already used. We ran the old and new systems in parallel over one shared data layer. We cut tenants over one at a time behind feature flags. And we retired the legacy stack only once it was empty. Each one exists to shrink blast radius.

The migration shape: one API contract, two backends over one shared data layer, tenants moving across one at a time. The legacy backend is retired once empty.

Preserve the API contracts

The rebuilt backend answered to the same endpoints the existing web and mobile apps already depended on. It slid in underneath without a client release tied to the swap. A coordinated migration release would have forced every live tenant to ship a client update at the exact moment we changed the backend. By keeping the contract stable, only one side moved during the migration. That is what kept tenants safe.

A stable contract did not mean a frozen product. We could add to what the new backend returned, as long as we did not break what the existing clients read. Additive changes were safe to ship. Renames and removals had to wait until the clients were ready for them. The contract was the seam that let us swap backends without the apps noticing.
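
The additive-only rule can be checked mechanically. The sketch below is a minimal illustration, not CORE Home's actual tooling: it compares a response from the old backend with one from the new backend and reports only the differences an existing client would notice. The function name `breaking_changes` and the sample payloads are hypothetical.

```python
from typing import Any


def breaking_changes(old: Any, new: Any, path: str = "") -> list[str]:
    """List the ways `new` breaks a client written against `old`.

    Extra fields in `new` are ignored (additive changes are safe).
    Missing fields and changed types are reported, because existing
    clients already read them.
    """
    where = path or "<root>"
    if isinstance(old, dict):
        if not isinstance(new, dict):
            return [f"{where}: expected object, got {type(new).__name__}"]
        problems: list[str] = []
        for key, old_value in old.items():
            child = f"{path}.{key}" if path else key
            if key not in new:
                problems.append(f"{child}: removed or renamed")
            else:
                problems.extend(breaking_changes(old_value, new[key], child))
        return problems
    if isinstance(old, list):
        if not isinstance(new, list):
            return [f"{where}: expected list, got {type(new).__name__}"]
        # Assumption: list items share one shape, so the first item is enough.
        if old and new:
            return breaking_changes(old[0], new[0], f"{path}[]")
        return []
    if old is not None and type(old) is not type(new):
        return [f"{where}: type changed from {type(old).__name__} to {type(new).__name__}"]
    return []


# Hypothetical payloads for one endpoint.
legacy = {"id": 7, "address": {"city": "Austin"}, "value": 410000}
additive = {"id": 7, "address": {"city": "Austin", "zip": "78701"},
            "value": 410000, "valuation_source": "avm"}
breaking = {"id": 7, "address": {"town": "Austin"}, "value": "410000"}

print(breaking_changes(legacy, additive))  # []
print(breaking_changes(legacy, breaking))
```

The additive payload passes because it only adds fields. The second payload renames `city` and changes the type of `value`, and both are reported.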

Run old and new in parallel, over one data layer

Both stacks served real traffic throughout the migration, a parallel run with reconciliation. We preserved the existing data structure and pointed the rebuilt backend at the same data layer, then compared the two on the read paths, where running both against the same records is safe. A difference on a read pointed straight at the new code, since the data was identical. The rebuild only ever had to be correct for one tenant at a time.

Writes are different, because you cannot run both backends against the same write without applying it twice. So we verified a tenant before flipping it, and the feature flag then sent that tenant's reads and writes to exactly one backend. Sharing one copy of the data is also the place blast radius is not automatically contained: a write bug in the new backend lands in the data every tenant reads, so we kept its write surface small and guarded the destructive paths. Our separate write-up on preserving the data layer covers those trade-offs in full.

The schema changes we wanted came later, after the last tenant moved, when a single backend was left to change.
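
The read-path comparison described in this section can be sketched as a thin wrapper. This is an assumption-laden illustration rather than the code we ran: `ReadReconciler`, the handler signature, and the in-memory `ROWS` store are all hypothetical, and the comparison is strict equality, so a real version would need to tolerate additive fields.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical handler shape: (tenant_id, resource) -> response body.
ReadHandler = Callable[[str, str], Any]


@dataclass
class ReadReconciler:
    """Serve reads from the legacy backend and shadow them on the rebuilt one."""

    legacy: ReadHandler
    rebuilt: ReadHandler
    mismatches: list[tuple[str, str]] = field(default_factory=list)

    def read(self, tenant_id: str, resource: str) -> Any:
        served = self.legacy(tenant_id, resource)
        try:
            candidate = self.rebuilt(tenant_id, resource)
        except Exception:
            # A crash in the shadow call must never reach the customer.
            self.mismatches.append((tenant_id, resource))
            return served
        if candidate != served:
            # Same data underneath, so the difference points at the new code.
            self.mismatches.append((tenant_id, resource))
        return served  # the legacy answer is always the one returned


# One shared data layer, stood in for by a dict.
ROWS = {("acme", "homes/1"): {"value": 410000},
        ("acme", "homes/2"): {"value": 250000}}


def legacy_read(tenant_id: str, resource: str) -> dict:
    return dict(ROWS[(tenant_id, resource)])


def rebuilt_read(tenant_id: str, resource: str) -> dict:
    row = dict(ROWS[(tenant_id, resource)])
    if resource == "homes/2":
        row["value"] = str(row["value"])  # deliberate bug for the example
    return row


reconciler = ReadReconciler(legacy_read, rebuilt_read)
reconciler.read("acme", "homes/1")
reconciler.read("acme", "homes/2")
print(reconciler.mismatches)  # [('acme', 'homes/2')]
```

Only reads are shadowed here. Writes are deliberately left out of the wrapper, for the double-apply reason given above.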

Cut over behind feature flags

Every tenant's move sat behind a feature flag that decided whether that tenant's traffic went to the old backend or the new one. We staged the cutover, verified that one tenant against the new backend, and flipped the flag. If something looked wrong, we rolled that single tenant back without touching the rest of the platform. This is a canary release done per tenant, where each tenant is its own canary. A platform-wide flip would have meant everyone moved together, with no way to pause for one broken case. Flipping one tenant at a time gave each migration its own verification path, so a failure stayed contained to that brokerage.

A flag per tenant also changed how we worked. We had a queue of small releases, each a decision we could make, check, and undo on its own schedule. The pace stayed steady, which is what you want when real homeowners are on the other side of the change.
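
A per-tenant flag is, at its core, a lookup in front of two backends. The following is a minimal sketch under stated assumptions: `TenantRouter`, `cut_over`, and `roll_back` are hypothetical names, the flags live in an in-memory set, and `verify` stands in for whatever checks a team runs before a flip.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical backend shape: (tenant_id, request) -> response.
Backend = Callable[[str, str], str]


@dataclass
class TenantRouter:
    legacy: Backend
    rebuilt: Backend
    on_rebuilt: set[str] = field(default_factory=set)  # the per-tenant flags

    def cut_over(self, tenant_id: str, verify: Callable[[str], bool]) -> bool:
        """Flip one tenant, but only if its own verification passes."""
        if not verify(tenant_id):
            return False
        self.on_rebuilt.add(tenant_id)
        return True

    def roll_back(self, tenant_id: str) -> None:
        """Return one tenant to the legacy backend; nobody else is touched."""
        self.on_rebuilt.discard(tenant_id)

    def handle(self, tenant_id: str, request: str) -> str:
        # Reads and writes for a tenant go to exactly one backend.
        backend = self.rebuilt if tenant_id in self.on_rebuilt else self.legacy
        return backend(tenant_id, request)


router = TenantRouter(
    legacy=lambda tenant, req: f"legacy:{tenant}:{req}",
    rebuilt=lambda tenant, req: f"rebuilt:{tenant}:{req}",
)
router.cut_over("acme", verify=lambda tenant: True)
print(router.handle("acme", "GET /homes"))   # rebuilt:acme:GET /homes
print(router.handle("birch", "GET /homes"))  # legacy:birch:GET /homes
router.roll_back("acme")
print(router.handle("acme", "GET /homes"))   # legacy:acme:GET /homes
```

Because the flag is keyed by tenant, rolling back is the same cheap operation as cutting over, which is what keeps each move reversible.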

Retire the legacy stack once it is empty

The migration ended only when nothing was left on the old system. With every tenant moved across, we decommissioned the legacy stack, which left one platform to operate. This was a deadline we set on purpose. Long-running parallel infrastructure tends to quietly become permanent, and we did not want to operate two platforms a year later.

Setting the end condition up front kept us honest. The finish line was a single event: the old stack switched off with nothing left running on it. A ready backend and a majority of tenants moved still counted as in progress. Naming that finish line is what stops a clean migration from decaying into two systems you maintain forever.
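
The end condition is simple enough to write down as a predicate, which is part of why it is hard to argue with. A minimal sketch, with hypothetical function names and tenant IDs:

```python
def remaining_on_legacy(all_tenants: set[str], on_rebuilt: set[str]) -> set[str]:
    """Tenants whose traffic still goes to the old backend."""
    return all_tenants - on_rebuilt


def migration_finished(all_tenants: set[str], on_rebuilt: set[str],
                       legacy_stack_running: bool) -> bool:
    """Done only when the old stack is empty AND switched off."""
    return not remaining_on_legacy(all_tenants, on_rebuilt) and not legacy_stack_running


tenants = {"acme", "birch", "cedar"}
print(migration_finished(tenants, {"acme", "birch"}, legacy_stack_running=True))  # False
print(migration_finished(tenants, tenants, legacy_stack_running=True))            # False
print(migration_finished(tenants, tenants, legacy_stack_running=False))           # True
```

The middle case is the one that matters: every tenant has moved, but the old stack is still running, so the migration still counts as in progress.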

How do you migrate tenants one at a time?

Moving one tenant at a time is slower to plan and far safer to run than moving everyone at once. Each tenant became its own small, reversible release, and the migration was a long series of those, with no single event big enough to take the platform down.

Big-bang cutover: everyone moves on one weekend

Stage it all, flip the switch, and hope. One broken tenant forces a full rollback. The more tenants on the platform, the larger and riskier the single event becomes.

One tenant at a time: each move is small and reversible

Stage one tenant, verify it, flip its flag. A problem affects exactly one brokerage and can be rolled back independently. The platform keeps serving everyone else the whole time.

There is a real cost to this discipline. It takes longer, and it asks the rebuilt backend to be production-correct for each tenant before that tenant moves. That is the trade we wanted. We would rather spend more calendar time than carry the risk of a migration that can take every brokerage down on the same night.

What did the rebuild unlock?

The payoff shows up in what the platform carries now. Against an industry where most migrations overrun or stall, this one finished in six months, and the foundation has run for four years without a re-platform, growing by orders of magnitude since the pilot brokerages.

5,000+ tenants on the platform today

Up from a handful of pilot brokerages, the platform now serves more than 2 million homeowners and over 100,000 agents, at more than 2 million requests a day.

Three changes came with the new foundation. First, the foundation itself: we moved core behavior onto Laravel, which matched Inside Real Estate's broader stack and hiring, with clear separation of concerns. Search, valuation, and notifications became services that scale on their own.

Second, onboarding. Provisioning a brokerage used to be manual: branding, domains, account records, and mobile release steps, all coordinated by hand. We turned it into infrastructure, wired into the client's billing system, so a branded tenant now spins up automatically when a package is sold.

Third, white-label at scale. Each branded app used to carry its own build and release overhead. We moved the product onto tenant configuration: branding, domains, app names, and feature toggles all live in config, and web and mobile run from one shared codebase.

Because configuration lives in one place, a single change reaches web and mobile at the same time. What looks like two products to a homeowner is two clients of one system to us. The web and mobile experiences stay in step because the platform underneath them is genuinely one.
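
To make the idea of one configuration feeding two clients concrete, here is a minimal sketch. The field names, the `client_bootstrap` function, and the sample tenant are assumptions for illustration and do not reflect CORE Home's real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantConfig:
    """Everything that makes one brokerage's app look like its own (hypothetical fields)."""

    tenant_id: str
    app_name: str
    domain: str
    primary_color: str
    features: frozenset[str] = frozenset()


def client_bootstrap(config: TenantConfig, platform: str) -> dict:
    """Payload a web or mobile client would load at startup."""
    return {
        "platform": platform,
        "appName": config.app_name,
        "domain": config.domain,
        "theme": {"primaryColor": config.primary_color},
        "features": sorted(config.features),
    }


acme = TenantConfig(
    tenant_id="acme",
    app_name="Acme Home",
    domain="home.acme.example",
    primary_color="#0B5FFF",
    features=frozenset({"valuation", "notifications"}),
)

web = client_bootstrap(acme, "web")
mobile = client_bootstrap(acme, "mobile")
print(web["features"])  # ['notifications', 'valuation']
```

Both payloads are derived from the same record, so changing `primary_color` or a feature toggle once changes what web and mobile receive on their next load.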

One platform across every tenant
Tenants provisioned automatically on a sale
Web and mobile from one codebase
Four years, no re-platform
Independent, scalable services

“Over the years Danubio has become the partner we turn to when the work is critical and there's no margin for error. CORE Home was exactly that. They rebuilt the platform underneath the running product without taking a single tenant offline.”

Andrew Hartnett, EVP Product & Engineering, Inside Real Estate

On the same platform, we later shipped HomeSearch AI in five months. The foundation held, which is the whole point of building it the way we did.

What we'd tell a team facing the same rebuild

If you are staring at a live platform that has outgrown its foundation, design the migration path as carefully as you design the architecture. The architecture decides where you end up. The migration decides whether you survive the trip. And if you inherited that platform from a team that has since moved on, the same sequence is how we take over a live system at scale: stabilize it enough to operate safely, then rebuild.

Three things carried us. Keep every step small enough that one failure affects one tenant. Keep every step reversible, so a bad move stays an inconvenience. And give the migration a real finish line, so you do not end up running two platforms forever.

None of this is exotic. Preserved contracts, a shared data layer, feature flags, and the discipline to turn the old thing off are ordinary tools. The judgment is in the sequence, and in refusing the tempting big-bang weekend that turns a hard problem into a dangerous one.

This is the work we do when we rebuild and scale an existing product, and it usually grows into a longer engagement. For the screens, the stack, and the numbers, the CORE Home case study has the full detail.

Frequently asked questions

Can you rebuild a backend without taking customers offline?

Yes. The trick is that only one side moves at a time. We kept the rebuilt backend behind the same API the live apps already called, ran it next to the old one, and moved tenants across one at a time. There was no shared cutover, so no customer went offline.

How do you migrate tenants one at a time?

Each tenant cutover sat behind a feature flag. We staged the move, verified that one tenant against the new backend, and flipped the flag. If something looked wrong, we rolled that single tenant back without touching anyone else. Every migration had its own verification path, so a failure stayed contained to that one tenant.

What happens to the database during a live migration?

We preserved the data layer during the migration. Both the old and rebuilt backends ran against the same data structure, so the two always saw identical records and any difference in a response pointed at the new code. We only started changing the schema after every tenant had moved to the new backend.

How long does a rebuild like this take?

The CORE Home rebuild took about six months, from the first line of code to the last tenant moving onto the new API. How long it takes depends on the tenant count and how disciplined you are about parity. The migration only ends when the last tenant is across and the old backend is switched off.

Sources

ITIC, 2024 Hourly Cost of Downtime Survey. Retrieved April 10, 2026. itic-corp.com
Martin Fowler, Strangler Fig Application. Retrieved June 2, 2026. martinfowler.com
Thoughtworks, Parallel Run with Reconciliation. Retrieved June 2, 2026. thoughtworks.com
Martin Fowler, Canary Release. Retrieved June 2, 2026. martinfowler.com
