As part of their daily routine, administrators of the VMware vSphere virtualization platform update the virtual infrastructure and its components by rolling out patches or carrying out upgrades, including the most complex ones to new major versions. The most convenient way to do this is with vSphere Lifecycle Manager (vLCM), which automates the process. vLCM is the next generation of vSphere Update Manager (VUM), previously used for planning and rolling out updates. With it, you can update not only the ESXi hypervisor but also drivers and firmware, and the updates follow the desired-state configuration model.
Today, we will talk about patching and upgrading the VMware vSphere platform itself: the problems that may arise and the best practices to adopt in your organization so that everything goes smoothly.
Preparing to update the virtual environment
First, ensure that the vCenter Server Appliance administrator (root) and SSO administrator (administrator@vsphere.local) accounts are available. By default, the VCSA root password expires after 90 days, which can lock you out at an inconvenient moment. Note that resetting a lost root password requires restarting vCenter Server. It is also sometimes recommended to change passwords after maintenance of infrastructure components.
You need direct access to the ESXi hosts to take powered-off snapshots of vCenter and the Platform Services Controllers (PSCs, required in vSphere 6.7 and earlier) and for diagnostic purposes. Make sure you have administrator access to the hosts.
Double-check DNS services, especially if something has gone wrong before. Verify that the A (forward) and PTR (reverse) DNS records for vCenter Server, ESXi, and other devices resolve correctly. This simple check takes seconds but prevents problems during patches and updates. Also check NTP: ensure that the time settings for all components of the virtual environment, including ESXi, vCenter Server, SDDC Manager, devices, storage, and switches, are correct. Many problems that seem inexplicable at first glance are caused by incorrect time synchronization.
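The forward and reverse DNS check described above can be scripted. The following is a minimal sketch using only Python's standard `socket` module; the function name `check_dns` and the host names in the usage note are illustrative assumptions, and the resolver functions are passed in as parameters so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable, Optional


def check_dns(
    fqdn: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Optional[str]:
    """Return None if the A and PTR records agree, else a short problem description."""
    try:
        ip = forward(fqdn)  # A (forward) lookup
    except OSError as exc:
        return f"{fqdn}: forward (A) lookup failed: {exc}"
    try:
        ptr_name = reverse(ip)[0]  # PTR (reverse) lookup, first item is the host name
    except OSError as exc:
        return f"{fqdn}: reverse (PTR) lookup for {ip} failed: {exc}"
    # Compare case-insensitively and ignore a trailing dot.
    if ptr_name.rstrip(".").lower() != fqdn.rstrip(".").lower():
        return f"{fqdn}: PTR for {ip} points to {ptr_name}"
    return None
```

Calling `check_dns("vcsa01.example.com")` (a hypothetical name) for each vCenter Server, ESXi host, and appliance in your inventory returns `None` when both records agree and a description of the mismatch otherwise.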
Check the alarms on the vCenter and ESXi servers. Review and resolve vSAN health check warnings that may prevent automated patching through Lifecycle Manager.
If you use Reduced Downtime Upgrade, available in vCenter Server 8.0.2 and newer, double-check the firewall rules for the temporary IP address that the upgrade uses.
Use DRS rules to keep vCenter Server and key virtual machines (such as KMS, DNS, and AD) on a dedicated ESXi host. It is better to use the “should” rule (for a more flexible configuration). This will allow you to quickly find and recover VCSA using the Host Client with direct access to ESXi.
Remember to back up the vCenter server using the file-based backup and restore function, which is available through the Virtual Appliance Management Interface (VAMI). If you don’t have a recent backup, use the “Backup Now” button to create one.
Also, export the configuration of distributed switches to have a fresh copy of the configuration. Keep this backup in an accessible place.
Download vCenter and ESXi updates locally before applying them in the production environment. This will reduce the risk of update operation failures due to network interruptions.
Disable vCenter HA – this should be done before updating the vCenter server.
And remember, for major updates (for example, major versions of vSphere), you should have a plan prepared in advance for testing the virtual infrastructure after the update to ensure it was successful. You should also understand the conditions under which you will have to revert to the previous version.
Carrying out updates
First, reboot vCenter Server and associated components if they have been running for a long time. This will allow you to understand the “health” of the server before major upgrades and determine whether there are problems before the upgrade itself. This will cause additional minor downtime, but will speed up the restoration of the service if problems arise.
Immediately before performing the upgrade, create a snapshot of vCenter and all associated PSCs in a powered-off state. After shutting down vCenter cleanly, you will have to use the ESXi Host Client to take the snapshot. This gives you a recovery point in case of problems during the update.
If you have multiple environments connected via Enhanced Linked Mode, you need to shut down all linked vCenter and PSC server instances and create snapshots of them. Failure to do so will result in replication errors and conflicts after a rollback. If you need to roll back (for example, an error occurred on one of several servers during the update), you will need to roll back on all servers. In complex environments, you may need to take several series of snapshots over the course of the update. For example, once you have upgraded half of all vCenter and PSC servers, you can take snapshots of all servers again, and if something happens, you can roll back to this point, where half of the servers have already been updated.
Additionally, to avoid errors and issues resulting from host management agent upgrades on ESXi hosts, consider setting vSphere DRS to “Partially Automated” mode so that migrations do not occur while vCenter servers are being upgraded. Afterward, remember to return it to the original mode. However, do not disable DRS completely, as this will delete all resource pools.
To avoid errors and delays caused by ESXi host agent upgrades triggering elections in the HA cluster, consider disabling vSphere HA before upgrading. You can enable it again afterward.
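Because a rollback in Enhanced Linked Mode must be applied to every linked server, it helps to know which snapshot series exists on all of them. The sketch below is a hypothetical bookkeeping helper, not a vSphere API: it assumes you record snapshot labels per server yourself, oldest first, and that one series uses the same label on every server.

```python
from typing import Dict, List, Optional


def latest_common_checkpoint(snapshots: Dict[str, List[str]]) -> Optional[str]:
    """Return the newest snapshot label that exists on every server, or None.

    `snapshots` maps a server name to its snapshot labels, oldest first.
    """
    if not snapshots:
        return None
    histories = list(snapshots.values())
    # Labels that every server has in common.
    shared = set(histories[0]).intersection(*histories[1:])
    # Walk the first server's history from newest to oldest.
    for label in reversed(histories[0]):
        if label in shared:
            return label
    return None
```

If one server is missing the latest series, the function falls back to the previous series that all servers share, which is the only point the whole linked group can safely return to.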
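The DRS and HA changes above are easy to forget to undo. One way to make the restore step automatic is to wrap the upgrade in a context manager. The sketch below is illustrative only: `ClusterSettings` is a hypothetical in-memory stand-in for the real cluster configuration, which in practice you would read and write through your automation tool of choice.

```python
from contextlib import contextmanager
from dataclasses import dataclass, replace
from typing import Iterator


@dataclass
class ClusterSettings:
    """Hypothetical stand-in for a cluster's DRS/HA configuration."""
    drs_enabled: bool
    drs_automation: str  # e.g. "manual", "partiallyAutomated", "fullyAutomated"
    ha_enabled: bool


@contextmanager
def upgrade_window(cluster: ClusterSettings) -> Iterator[ClusterSettings]:
    """Relax DRS and disable HA for the upgrade, then restore the saved values."""
    saved = replace(cluster)  # copy of the original settings
    # DRS itself stays enabled: disabling it would delete resource pools.
    cluster.drs_automation = "partiallyAutomated"
    cluster.ha_enabled = False
    try:
        yield cluster
    finally:
        # Runs even if the upgrade step raises an exception.
        cluster.drs_automation = saved.drs_automation
        cluster.ha_enabled = saved.ha_enabled
```

Running the upgrade steps inside `with upgrade_window(cluster):` means the original DRS automation level and HA state come back whether the upgrade succeeds or fails.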
The key is not to rush – do not update multiple vCenter servers and/or PSC controllers simultaneously. This will cause replication errors between components and ultimately cause the entire update to fail.
Remember – first we update the vCenter server, then the ESXi servers.
If only ESXi updates are available, roll them out, especially if they include critical security fixes. Do not wait for vCenter updates to be released. In environments using Enhanced Linked Mode, all vCenter and PSC servers must be upgraded to the same version before performing ESXi upgrades.
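The Enhanced Linked Mode rule above (all vCenter and PSC servers on the same version before ESXi upgrades begin) can be expressed as a simple gate. This is a sketch under stated assumptions: the version strings are plain dotted numbers that you collect yourself, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '8.0.2' into a comparable tuple (8, 0, 2)."""
    return tuple(int(part) for part in text.split("."))


def esxi_upgrade_blockers(mgmt_versions: Dict[str, str]) -> List[str]:
    """Return the vCenter/PSC servers that lag behind the newest management version.

    An empty list means all management servers run the same version,
    so ESXi upgrades may proceed.
    """
    if not mgmt_versions:
        return []
    newest = max(parse_version(v) for v in mgmt_versions.values())
    return sorted(
        name for name, v in mgmt_versions.items() if parse_version(v) != newest
    )
```

Comparing parsed tuples rather than raw strings matters because a string comparison would rank "8.0.10" below "8.0.2".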
Post-update procedures
If you have a post-upgrade test plan, run it and make sure everything functions as you expect.
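A post-upgrade test plan is easier to repeat when it is written down as a list of named checks. The sketch below is a generic runner with hypothetical names; each check is any function you supply that returns `True` on success, and a check that raises an exception is counted as failed.

```python
from typing import Callable, Dict, List


def run_test_plan(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A crashing check is treated as a failed check, not a crashed plan.
            ok = False
        if not ok:
            failed.append(name)
    return failed
```

An empty result means the plan passed. A non-empty one gives you the list to weigh against the rollback conditions you defined before the upgrade.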
Restore DRS and HA: if you disabled vSphere HA and/or set DRS to “Partially Automated” mode, restore the original settings. Also, if you disabled vCenter HA, do not forget to enable it again.
Don’t forget to delete snapshots! Over time, snapshots grow, degrade storage performance, and become harder to remove. But do this only after you have checked the functionality and synchronization of all components.
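To keep upgrade snapshots from being forgotten, you can flag any that have outlived their purpose. The following sketch assumes you already have snapshot names and creation times from your own inventory export; the function name and the three-day default threshold are assumptions for illustration, so adjust the limit to your own policy.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_snapshots(
    created: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=3),  # assumed threshold, not a product default
) -> List[str]:
    """Return the names of snapshots older than `max_age`, sorted alphabetically."""
    return sorted(name for name, ts in created.items() if now - ts > max_age)
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to check against a known set of timestamps.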
Clear your browser cache if unexpected behavior occurs in vSphere Client during or after an update. This often resolves anomalies in vSphere Client and ESXi Host Client.
Also, with major upgrades, you can change the root passwords on ESXi and VCSA, as well as the administrator@vsphere.local password, saving the new passwords in your password manager and/or offline.
And, if all is well, create vCenter Server backups at the file level using the built-in solution, or at the virtual machine level using a third-party backup product.
Organizational recommendations
- Inform users and managers about the consequences of updates – not everyone may understand the details of this process (for example, that the infrastructure continues to operate during updates).
- Establish maintenance windows to make the process predictable – this will allow system administrators and other stakeholders to be prepared for unexpected downtime.
- Consider how some systems behave under vMotion – they can take a long time to migrate off the ESXi host scheduled for the update, especially if the guest OS handles a heavy transaction load. vMotion has been greatly improved in recent years, and vMotion Notifications helps applications prepare for and recover from migration.
- Apply ITIL definitions for change management – “standard” routine changes are not disruptive, such as deploying a new VM. “Emergency” changes require immediate action, such as rolling out critical security patches. Using ITIL workflows helps clarify the significance of tasks and facilitates their prioritization, ensuring consistency within the organization. Always designate critical updates as “emergency” because you are the one who will have to deal with the consequences if something happens.
Architectural recommendations
- Configure file backup and recovery of the vCenter server.
- Ensure that vSphere HA is not using the vCenter Server as the isolation address (das.isolationaddress). Instead, configure multiple addresses (das.isolationaddress0 through das.isolationaddress9) so that the unavailability of a single address does not trigger unnecessary HA failovers.
- Limit the number of plugins on vCenter Server where possible. Fewer plugins improve compatibility and simplify updates, making the system more manageable and secure.
- Limit the number of VIBs on ESXi – to ensure security, install only essential software and remove all non-VMware components that are not absolutely necessary for your environment to function (hardware vendor NIC and HBA drivers are the usual exception).
- Use Enhanced Linked Mode (ELM) only if you really need it – it makes updates and upgrades more difficult.
- Provide N+1 cluster resources – vMotion and DRS work best when the cluster has enough free resources to absorb migrated workloads, because each host goes into Maintenance Mode while it is being upgraded.
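The N+1 recommendation in the last item comes down to simple arithmetic: the workload must still fit after one host is removed. A minimal sketch follows, assuming memory is the limiting resource and using a hypothetical function name. It takes the conservative case where the largest host is the one in Maintenance Mode.

```python
from typing import List


def fits_with_one_host_down(host_capacity_gb: List[float], used_gb: float) -> bool:
    """True if the current load still fits after the largest host is evacuated."""
    if len(host_capacity_gb) < 2:
        # A single-host cluster has nowhere to migrate workloads.
        return False
    remaining = sum(host_capacity_gb) - max(host_capacity_gb)
    return used_gb <= remaining
```

For three hosts with 512 GB each, 900 GB of used memory fits on the remaining two hosts (1024 GB), while 1100 GB does not. The same check can be repeated for CPU or any other resource you track.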
Conclusion
Both small updates and large upgrades require a serious approach, first of all at the organizational level. All stakeholders who may be affected by potential downtime, misconfigurations, and data loss must be aware of the current update schedule. This will allow you to plan a strategy in advance in case of unforeseen circumstances. Be sure to make backups and snapshots of critical components so that you can roll back to the original state. Most importantly, take your time, plan, and do not take irreversible actions (such as disabling DRS in the cluster before updating).