My Journey recovering from a major UniFi Controller failure

I run a fairly complex UniFi networking setup segregating more than 20 networks, and I had been using a well-known third-party UniFi hosting provider since 2018.

Three days ago they updated their UniFi Controller, which I had been using for free as a long-term beta member from years back. The update broke my setup, and they neither admitted the problem nor offered to help.

I asked them to provide me with a backup, since the Site Migrate and Backup functionality had been cut from my account. They refused, so I had to re-do my network...

Since my current config was still running on the UniFi gear, I aimed for a smooth transition instead of rebuilding from scratch. I also didn't want to completely overhaul my rack just so the Proxmox node hosting the UniFi Controller VM would stay permanently connected to the gear, which would mess up my HA setup.

My UniFi Controller is published for control traffic to serve the other locations I support (friends and family), so I had to make sure it stayed reachable from the Internet at all times. This also simplified the migration, since the gear only needs Internet access to be adopted and re-integrated into the network.
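Publishing the controller this way roughly means passing two ports through pfSense/HAProxy: the plain-HTTP inform port (8080) and the TLS UI/API port (8443). A minimal sketch of what that could look like; the backend address and names are placeholders, not my actual config:

```haproxy
# Sketch: pass UniFi control traffic through HAProxy on pfSense.
# 8080 is the HTTP inform endpoint; 8443 is passed through as raw TCP
# so the controller's own TLS certificate reaches the devices unchanged.
frontend unifi-inform
    bind *:8080
    mode http
    default_backend unifi-http

backend unifi-http
    mode http
    server controller 192.168.10.10:8080 check

frontend unifi-ui
    bind *:8443
    mode tcp
    default_backend unifi-tls

backend unifi-tls
    mode tcp
    server controller 192.168.10.10:8443 check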

The Before

Network setup (one AP missing, lent to my sister until her gear arrives)
The Proxmox nodes fail the network uplink over to each other via 10G if one leg breaks

Above you see the basic map view of the network, followed by the network setup from a broader perspective.

The Plan

  1. Install the new UniFi Controller into a VM on one Proxmox host.
  2. Pre-configure the VLANs and networks from documentation.
  3. Publish the Controller via pfSense/HAProxy.
  4. Change the UniFi Controller (inform) IP in the USG via the command line.
  5. Reset the access points one by one.
  6. Adopt them with the new controller.
  7. Reset switch2 and adopt it.
  8. Prepare the USG ports on switch2.
  9. Move the USG patch cables (one by one, keeping the uplink).
  10. Reset the core switch (activating the STP-blocked secondary link between switch1 and switch2).
  11. Adopt the core switch and configure ports for the DSL modem, the USG, and the trunks.
  12. Move the Proxmox node and one AP to switch2.
  13. Reset switch1, adopt it, and configure the NAS, Proxmox, and UAP ports.
  14. Move the Proxmox node and the AP back.
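Steps 4 to 6 boil down to pointing each device at the new controller's inform URL. A rough sketch of the commands involved; the hostnames and inform URL here are placeholders for my setup, not the real addresses:

```shell
# Step 4, on the USG: SSH in with the existing device credentials and
# point it at the new controller's inform endpoint.
ssh admin@usg.example.lan
set-inform http://controller.example.com:8080/inform

# Steps 5-6, on each access point: after a factory reset the default
# credentials are ubnt/ubnt; tell the AP where the new controller lives.
ssh ubnt@ap.example.lan
set-inform http://controller.example.com:8080/inform
```

After `set-inform`, the device shows up as pending in the new controller's UI, where it can be adopted and provisioned.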

The Result

All went smoothly because I had planned for redundancy of the Internet uplink, the Proxmox uplink, and the switch links, so I never had to plug in via Ethernet to configure a switch that had become isolated from the rest of the network.

Well, not quite. I had one issue to overcome: when resetting the DSL switch I lost connectivity to the UniFi Controller VM, since its uplink went through the DSL-connected pfSense. So I moved the VM into the modem's VLAN to connect it directly to the provider NAT, just for the adoption of the core switch.

But wait, where is the USG?
This part was trickier: I had to make sure the controller had Internet access while the central router was down. How did I make it work?

Remember the LTE uplink that is available to pfSense as a fallback for the published services?
Here in Germany, LTE providers don't give us public IPs with port forwarding, even though the contract was a home Internet plan, not just a data plan.

So I have a cloud-hosted VM acting as a VPN exit node for the LTE connection. I simply added my DSL pfSense to the same instance via VPN and configured NAT to fail over between the two links when one goes down, effectively bypassing my Cloudflare load balancer, which listens on the port forwards of the DSL line.
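The failover idea on the cloud VM can be sketched as a DNAT rule that targets whichever tunnel peer is currently reachable. This is a simplified hand-rolled version, assuming two VPN tunnels with hypothetical peer addresses; my actual setup does the equivalent in the VPN/NAT configuration:

```shell
#!/bin/sh
# Sketch: on the cloud VM, forward the UniFi control ports (8080 inform,
# 8443 UI) down whichever tunnel is currently up. Peer IPs are placeholders.
DSL_PEER=10.8.0.2   # pfSense reached via the DSL tunnel
LTE_PEER=10.8.0.3   # pfSense reached via the LTE tunnel

# Prefer the DSL path if its peer answers a ping, otherwise fall back to LTE.
if ping -c 1 -W 2 "$DSL_PEER" >/dev/null 2>&1; then
  TARGET="$DSL_PEER"
else
  TARGET="$LTE_PEER"
fi

# DNAT inbound controller traffic to the chosen peer, and masquerade so
# the return path goes back through the cloud VM.
iptables -t nat -F PREROUTING
iptables -t nat -A PREROUTING -p tcp -m multiport --dports 8080,8443 \
  -j DNAT --to-destination "$TARGET"
iptables -t nat -A POSTROUTING -d "$TARGET" -j MASQUERADE
```

Run periodically (e.g. from cron), this flips the forwarding target whenever one link dies, which is what lets the published controller survive either uplink going down.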

With this setup I could reset the USG while the controller stayed reachable through the VPN connections, bypassing the USG entirely.

How did I connect to it after the reset?
I created a WLAN bridged to the USG's native LAN network and connected to it with my Mac to configure the USG.

After the USG was adopted, I had migrated my entire UniFi network to a new controller without any interruption to any hosted service.

The side effect: I can now cancel my Cloudflare load balancer subscription!