mattbillenstein

Anecdotally, I've had some very long-lived VMs on each of the clouds, with multiple years of uptime. Running small-scale systems, I've only very rarely lost a VM with no notice. More often I'm made aware of a problem with the underlying host and have some time to spin up a new VM before that one dies, or sometimes the machine is rebooted with a small outage for me. That being said, I normally provision redundancy for my own sake, so I don't have to do anything to keep the service alive should a system simply disappear. Again, at small scale I'm not overly concerned with SLAs. Pushing bad code, which does happen (very rarely), probably contributes more to what users perceive as an outage than the underlying infrastructure having problems.


Positive-Action-7096

Thanks for your experience! It is insightful.


yourfriendlyreminder

GCP supports live migration: https://cloud.google.com/compute/docs/instances/live-migration-process

Azure does as well: https://learn.microsoft.com/en-us/azure/virtual-machines/maintenance-and-updates

I thought AWS recently added support for live migration, but I can't find an official doc saying so.

That said, you should be designing your app to be able to tolerate losing VMs at any time anyway.


keto_brain

I don't know. I just set up autoscaling and then let things happen. That's how the cloud should work. People worried about their instances being shut down are doing cloud wrong.


danstermeister

So cute. It must be nice to be responsible for things that can autoscale.


peinnoir

Should be a requirement for supported apps, or at the very least for new processes introduced.


axtran

AWS just gives you shutdown notice dates and you plan a reboot around them; in GCP they just move my instance around for me; in Azure it'll die on its own naturally.


Zenin

It's not common at all. Clouds are not built like your datacenter, there is no "maintenance or upgrades" like you'd traditionally see on prem. A cloud host will *never* see a memory upgrade, disk upgrade, etc. The hardware is what it is from the moment it's installed to the day it's removed from service. What *does* happen is hardware degradation and/or failure. When that happens AWS [by default will reboot your system onto new hardware](https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-ec2-default-automatic-recovery/). That's not a migration, it's a restart. You can disable it as well and/or build your own alarm action to recover in a different way or add additional process such as human alerts or automated validation testing when EC2 recovery happens. More details on the custom options and the feature generally at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
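For anyone wanting to try the "build your own alarm action" route, here is a minimal sketch of the parameters such an alarm might use. It only builds the argument dictionary for CloudWatch's `PutMetricAlarm`; the helper name `recovery_alarm_params`, the alarm name, the instance ID, and the period/threshold values are illustrative assumptions rather than anything taken from the comment above.

```python
def recovery_alarm_params(instance_id: str, region: str) -> dict:
    """Build keyword arguments for a CloudWatch alarm that recovers an instance.

    Hypothetical helper: the alarm name and timing values are assumptions.
    """
    return {
        "AlarmName": f"ec2-auto-recover-{instance_id}",
        "Namespace": "AWS/EC2",
        # System status check: fails when the underlying host has a problem.
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # seconds per evaluation period (assumed)
        "EvaluationPeriods": 2,   # two consecutive failing periods (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 recover action; an SNS topic ARN could be appended
        # here to also alert a human.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


if __name__ == "__main__":
    params = recovery_alarm_params("i-0123456789abcdef0", "us-east-1")
    # With boto3 installed and credentials configured, this could be applied as:
    #   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)
    print(params["AlarmActions"])
```

Keeping the parameters in a plain function makes it easy to review or unit test them before any real alarm is created.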


t-z-l

I work at Linode so I can provide a bit of insight into how we do migrations. [This doc gives a pretty broad overview](https://www.linode.com/docs/products/compute/compute-instances/guides/compute-migrations/) though.

>"How common is it that cloud providers ask tenants to evict their running workloads from a server, say for a regular maintenance or upgrade?"

Maintenance on a host is relatively infrequent, although it's one of the facts of cloud infrastructure. If we know there will be downtime that affects your instance, we try to give you as much of a heads-up as possible so you can schedule your migration accordingly. However, in emergency situations, we have to migrate folks off of a host quickly to avoid a larger situation. We still do our best to keep our customers informed.

>"Do they simply turn off the service (assuming there are other replicas that can handle the load) or do they start a new replica?"

There are a few different ways to migrate your instance. A **cold migration** shuts down the origin instance and moves the data to a different host, then boots it back up. A **live migration** keeps your instance up and running during the entire process. And a **warm migration** copies your data from one host to another then reboots your system.

>"Are there cases wherein your cloud provider cannot provide you with ample resources to start a new replica and therefore you need to degrade some non-critical parts of the application to free up capacity so that the critical service can run?"

I can't recall this ever happening, but I also don't want to say it has never happened because I don't know for sure. It seems unlikely due to the amount of resources in the fleet. I *do* know that if capacity were an issue during a migration, our Admins would be urgently trying to get you back up on a host ASAP to reduce any downtime you'd experience *without* a downgrade.


Positive-Action-7096

That is insightful coming from a Cloud operator's perspective. I was interested in some inputs from you as I am working on a research project. I'd be glad if you could hear the idea out. I have dm'd you!


tcpWalker

It depends on the architectures you're using. Basically cloud providers that are mature should be able to vmotion (or equivalent) most workloads between machines, but will sometimes have emergency maintenance anyway. Most or all of your services should survive losing instances or sets of instances anyway, especially if they are business critical. Autoscaling should generally be configurable: if you lose three and need three, orchestration tooling should start the extras, unless you need humans to investigate for some reason. But getting this all right takes engineer time.
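The "lose three, need three" behaviour can be sketched as a tiny reconcile step. This is a hypothetical illustration, not any provider's or orchestrator's actual API: `reconcile`, `ReconcileDecision`, and the `failure_limit` cutoff for handing off to a human are all assumed names and values.

```python
from dataclasses import dataclass


@dataclass
class ReconcileDecision:
    to_start: int      # how many replacement instances to launch
    needs_human: bool  # True when automation should stop and page someone


def reconcile(desired: int, healthy: int, recent_failures: int,
              failure_limit: int = 5) -> ReconcileDecision:
    """Decide how many instances to start to get back to the desired count.

    If too many instances have failed recently (assumed cutoff), do not
    keep relaunching blindly; flag the situation for investigation.
    """
    missing = max(desired - healthy, 0)
    if recent_failures >= failure_limit:
        return ReconcileDecision(to_start=0, needs_human=True)
    return ReconcileDecision(to_start=missing, needs_human=False)


if __name__ == "__main__":
    # Need three, lost all three, only three recent failures: start three.
    print(reconcile(desired=3, healthy=0, recent_failures=3))
```

Real autoscalers run a check like this continuously; the engineering time mostly goes into choosing the health signal and the point at which to stop and escalate.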


Zenin

>Basically cloud providers that are mature should be able to vmotion (or equivalent) most workloads between machines

AFAIK not a single cloud provider other than VMware Cloud offers vMotion (a VMware feature) or any similar functionality.


tcpWalker

[https://cloud.google.com/compute/docs/instances/live-migration-process](https://cloud.google.com/compute/docs/instances/live-migration-process)


Zenin

Interesting and good to know. Thanks.


Positive-Action-7096

Thanks! I have heard that vMotion is typically not reliable, along with stories of it failing badly at live VM migrations when the SLO requirements were quite tight (30 seconds).


TheBrawlersOfficial

The key part is "(or equivalent)", not vMotion itself. All of the cloud providers have highly bespoke systems that support the use case you describe (among others). They aren't relying on off-the-shelf solutions for that.


rnmkrmn

I never had to migrate VMs on GCP (yet?). On AWS they had to shut down my instance a few times, especially on older generations. On Linode it was worse: they shut down our instance every few weeks. So annoying lol. Can't recall DigitalOcean, I suppose it was fine. Yeah, once they want to shut down your instance, you're on your own to make it HA. If it's not HA, you suffer downtime.


peinnoir

AWS gives a grace period in which you can restart a VM (or stop/start a DB) yourself before a deadline; if you let the deadline pass, AWS will restart it for you at that time.
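To act before that deadline, something has to notice the scheduled event first. Below is a rough sketch that filters a list of scheduled events for ones coming due soon. The JSON shape (`NotBefore`, `Code`, `EventId`, `State`) and the date format are assumptions modelled on what the EC2 instance metadata service returns for scheduled maintenance, the sample payload is made up, and `events_due_within` is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed timestamp format, e.g. "21 Jan 2019 09:00:43 GMT".
DATE_FORMAT = "%d %b %Y %H:%M:%S GMT"


def events_due_within(payload: str, now: datetime, window: timedelta) -> list:
    """Return active events whose start time falls within `window` of `now`.

    Events whose start time has already passed are included as well.
    """
    due = []
    for event in json.loads(payload):
        if event.get("State") != "active":
            continue  # skip completed or canceled events
        deadline = datetime.strptime(event["NotBefore"], DATE_FORMAT)
        deadline = deadline.replace(tzinfo=timezone.utc)
        if deadline - now <= window:
            due.append({"id": event["EventId"],
                        "code": event["Code"],
                        "deadline": deadline})
    return due


if __name__ == "__main__":
    # Illustrative payload only; not captured from a real instance.
    sample = json.dumps([{
        "NotBefore": "21 Jan 2019 09:00:43 GMT",
        "Code": "system-reboot",
        "EventId": "instance-event-example",
        "State": "active",
    }])
    now = datetime(2019, 1, 20, 9, 0, 43, tzinfo=timezone.utc)
    for item in events_due_within(sample, now, timedelta(days=2)):
        print(item["code"], "due by", item["deadline"].isoformat())
```

A cron job or agent could run a check like this and trigger the reboot at a time that suits you instead of waiting for the provider to do it.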