VCSA does not boot due to file system errors

Due to a power failure of the storage where the vCenter Server Appliance resides, the VCSA does not boot. Connecting to the console shows the following output:

Failed to check /dev/log_vg/log

When you see this screen, none of the services are started because the appliance has not fully booted. This means there is no way to connect to the H5 client or to the VAMI on port 5480.

Why does the VCSA not boot and where do I start troubleshooting?

Two things on the screenshot above are important, and this is where we start:

  • Failed to start File System Check on /dev/log_vg/log
  • journalctl -xb

First we take a look at ‘journalctl -xb’. To do this we need to supply the root password and launch the Bash shell:

launch BASH

Now that we have shell access we can take a look at ‘journalctl -xb’:

Type G to go to the bottom of the log file:

journalctl -xb

Work upwards; the most relevant entries are at the bottom. For the sake of this blog post, I typed -S, which toggles word wrap in the pager. In this case, I turned word wrap on.
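If paging through the whole log is impractical, the same boot log can also be narrowed down non-interactively. This is a minimal sketch and not part of the original troubleshooting session. It assumes the standard systemd journalctl options -b (current boot), -p err (error priority and worse) and --no-pager; the function name boot_errors is made up for illustration.

```shell
# Hypothetical helper: print the last N error-level (or worse) messages
# of the current boot without opening the pager. N defaults to 50.
boot_errors() {
    lines=${1:-50}
    journalctl -b -p err --no-pager | tail -n "$lines"
}
```

Calling `boot_errors 20` from the maintenance shell would then show the last 20 error entries, which is usually where the failing volume is named.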

File System Check

Going up a little I find these entries:

journalctl showing more info about the failed volume

There is a problem with a certain inode and File System Check (fsck) should be run. Let’s see how we can do that. Is it as simple as running:
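The exact command is not reproduced in this copy of the post. What follows is a minimal sketch, assuming the volume path from the console error and an ext-family file system, where -n means "report only, change nothing". The function name check_then_repair is illustrative.

```shell
# Volume path as reported in the console error message.
LOG_VOLUME=/dev/log_vg/log

# Hypothetical helper: do a read-only pass first, and only start the
# interactive repair when that pass reports problems.
check_then_repair() {
    vol=$1
    if fsck -n "$vol"; then
        echo "$vol is clean"
    else
        # Without -y, fsck asks for confirmation before each repair.
        fsck "$vol"
    fi
}
```

From the maintenance shell this would be called as `check_then_repair "$LOG_VOLUME"`.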

It seems like it. Running the above command finds some errors and offers to repair them. I confirmed.

Other volumes

Let’s check the other logical volumes (lvm). First we will run ‘lsblk’ to take a look at the drive layout:

With lsblk we take a look at the drive layout

Remark: When we look at the TYPE column, we see the disks, e.g. sda, sdb, etc. The difference between sda and the rest is that sda is partitioned with standard partitions, whereas LVM has been set up on the remaining disks.

I checked all other volumes and found none of them were having issues.
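The layout check and the pass over the other volumes can be sketched as follows. This is an assumption about how it could be scripted, not the commands used in the original session. It relies on the util-linux lsblk options -o, -r (raw) and -n (no headings), and on LVM volumes being reachable under /dev/mapper; the function names are hypothetical.

```shell
# Show the drive layout: disks, partitions and LVM logical volumes.
show_layout() {
    lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT
}

# Turn raw "NAME TYPE" lines (as printed by `lsblk -rno NAME,TYPE`)
# into /dev/mapper paths for the LVM logical volumes only.
lvm_devices() {
    awk '$2 == "lvm" { print "/dev/mapper/" $1 }'
}

# Read-only check (-n) of every logical volume; nothing is repaired here.
# A swap volume will report an error because it holds no file system.
check_all_lvm() {
    lsblk -rno NAME,TYPE | lvm_devices | while read -r dev; do
        echo "== $dev"
        fsck -n "$dev"
    done
}
```

Running `show_layout` first and `check_all_lvm` afterwards gives a per-volume report; any volume that reports errors can then be repaired individually.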

Reboot

To reboot while you are in maintenance boot:

After the reboot, I could connect to the H5 client and clear the relevant errors.

Remark

This blog post is very similar to this one here. Although they are very much alike, the issues in the older blog post were on a standard partition on a VCSA 6.5 whereas the issues described and addressed in this post are on a VCSA 7.0 LVM physical volume.

esxtop output is not displaying as it should

You connect to your ESXi host and launch esxtop, but the esxtop output is not displaying as it should. Instead, it looks like the screenshot below:

esxtop displaying incorrect

Your esxtop output will be displayed correctly if you are using a terminal emulator that defaults to xterm as the TERM environment variable. Some terminal emulators will use another terminal emulator value by default, eg. xterm-256color. ESXi does not map xterm-256color to one of the values it knows, so it doesn’t know how to display the output.

There is a KB article that explains how to resolve this:

Output of esxtop defaults to non-interactive CSV with unknown TermInfo (2001448)

The value of the environment variable TERM is used by the server to control how input is recognized by the system, and what capabilities exist for output.

Let us first have a look at what the TERM variable is set to in my case:

I am receiving the following output:

echo TERM output

My terminal emulator tries to connect to the endpoint (ESXi) with xterm-256color. Now let’s take a look at what values this endpoint does support:

terminfo_values

Any of the values above can be assigned to TERM. The value my terminal emulator uses is not among the supported terminfo types, so the ESXi host cannot map it to a known type and does not know how to display the esxtop output correctly.

When we update the TERM environment variable to xterm and try to run esxtop again, the output will show nicely formatted.
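The check and the fix can be sketched together as follows. This is an illustration rather than the exact commands from the post. It assumes the terminfo database lives under /usr/share/terminfo with one subdirectory per first letter, which is the common layout; the function name term_supported is hypothetical.

```shell
# Hypothetical helper: succeed if a terminfo entry exists for the given
# terminal type. Entries are stored as <dir>/<first letter>/<name>.
term_supported() {
    term=$1
    dir=${2:-/usr/share/terminfo}
    first=${term%"${term#?}"}   # first character of the name
    [ -n "$term" ] && [ -e "$dir/$first/$term" ]
}

# Fall back to xterm for this session when the current value is unknown.
if ! term_supported "${TERM:-}"; then
    TERM=xterm
    export TERM
fi
echo "$TERM"
```

The change only lasts for the current SSH session, so a permanent fix is to have the terminal emulator itself announce xterm.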

Let’s check esxtop again to make sure the outcome is as expected:

esxtop displaying correct

NSX-T password expiration alarms in the Home Lab

The challenge

I have a couple of NSX-T environments in my home lab. I logged on to one of them and saw a couple of open NSX-T password expiration alarms.

Password expiration alarms

CAUTION

Password expiration should be part of your password policy strategy. Disabling the password expiration on a production system is not a good strategy.

The solution

With my sharp googling skills, I found this reference in the NSX-T 3.0 docs:

https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.0/installation/GUID-89E9BD91-6FD4-481A-A76F-7A20DB5B916C.html

So I cleared the ‘password-expiration’ setting for the admin user, not even bothering to open the event details. I just assumed this was about the admin user.

Done.

Not true. Some time later that day I found that the alarms were still open. I figured this was some sort of timing issue and that the alarms had not been cleared automatically yet, so I set them to resolved manually. Almost the same minute the alarms were triggered again, so it was no timing issue. If only I had counted the alarms the first time, I would have seen that there were more alarms than NSX-T components on which I had cleared the password expiration for the admin user.

Password expiration, read the details

It was only when I read the alarm in detail that I noticed it was not the same alarm I had seen before. This alarm was not about the password expiration of the admin user but about that of the audit user. The alarms are very much the same; only the username is different, so it is easily overlooked.

So, doing the math: initially I had 8 open alarms, of which 3 were resolved automatically after changing the password expiration of the admin user, one on the NSX-T Manager and one on each of the 2 edge nodes. That left 5 open alarms to take care of. Checking all the alarms gave me the following actions:

  • clear alarm for the root user on NSX-T Manager
  • clear alarms for the root user and the audit user on the NSX-T Edge 1 and 2
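To keep track of which user still needs attention on which node, the remaining work can be written out as a small list. The sketch below only prints the lines to enter; it does not talk to NSX-T. It assumes the NSX CLI command form `clear user <username> password-expiration` as I read it in the linked documentation, and the node labels manager, edge1 and edge2 are placeholders.

```shell
# node:user pairs taken from the list of remaining alarms above.
PENDING="manager:root edge1:root edge1:audit edge2:root edge2:audit"

# Print, per node, the NSX CLI line to enter for each user.
nsx_expiry_commands() {
    for pair in $PENDING; do
        node=${pair%%:*}
        user=${pair#*:}
        echo "$node> clear user $user password-expiration"
    done
}
nsx_expiry_commands
```

Five lines come out, matching the five open alarms, which makes it harder to overlook a user the way I did with audit.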

CAUTION

Password expiration should be part of your password policy strategy. Disabling the password expiration on a production system is not a good strategy.

Upgrade Methodology – Upgrade the homelab

This post is not an end-to-end upgrade guide but a methodology guide. Everything is more or less straightforward if one uses the correct methodology.

As this is a home lab I have chosen not to add complexity by adding additional nodes to the various components for High Availability but depend on vSphere HA. This eases and shortens my upgrade path. In a production environment every application should be evaluated and talked through with the stakeholders whether it can rely on vSphere HA or some form of application HA should be introduced.

The release of vSphere 7 introduced a starting point for an update/upgrade round in the lab. I have not always used the methodology when upgrading components as I should have in the lab. When I used this methodology in the current upgrade round it has come to light that some components were not interoperable with each other. Why is this important? If you have a problem with your environment and you call VMware support, they will go through the logs and verify the environment is on par with the documentation.

So where do I start?

Reading the docs first takes time but it will save a lot of time in the long run. Can’t stress it enough RTFM !!

I always create my own high level upgrade guide. This will include the research that I have done and it also includes the upgrade path to follow.

Phase 1: Information gathering

Determine the components and versions

Be thorough in determining the components and versions (BOM or Bill Of Materials).

  • VMware vSphere 6.7U3
  • vRealize Log Insight 4.5
  • vRealize Operations 7.5
  • VMware Horizon 7.11

You will see below that I had forgotten to include VMware Horizon in my component set. This could have been catastrophic, as some components might stop working when you don’t follow the correct upgrade procedure.

Gather the documentation

After you determine which components are used in the environment you can go and search for the necessary documentation. Find the release notes, upgrade guides and other relevant documentation.

I used the following documentation set:

Release notes:

Upgrade documentation:

Other:

Remarks

In the ‘Compatibility Considerations’ section of Important information before upgrading to vSphere 7.0, vRealize Operations is mentioned as not compatible with vSphere 7.0. This is overruled by the VMware Product Interoperability Matrix for the two products (see the VMware Product Interoperability Matrix overview section).

VMware Hardware Compatibility List

What I learned from these documents was that I was not sure the NVMe drives in my hosts would be compatible. After all they are consumer grade NVMe drives and are not on the VMware HCL.
I recently installed two NVMe drives, a CT500P1SSD8 and a CT1000P1SSD8 and at the time of install the Crucial CT500P1SSD8 was not recognized. A quick googling showed me a blog post from William Lam that replaced the vSphere 6.7 U3 nvme driver with one from a vSphere 6.5U2 install.
I will discuss how I determined if there would be issues around this in ESXi 7.0 in a later phase.

All components should be checked with the vendor and the VMware HCL. Be aware that the vendor and VMware might not always agree and that the VMware HCL might not always be in sync with the documentation of the hardware vendor. You should always follow the VMware HCL but be aware of the following KB. If vSAN hardware is involved it is advised to use extreme caution, as the VMware HCL has a specific section for vSAN Ready Nodes and for Build Your Own.

VMware Product Interoperability Matrix overview

I researched which versions would be compatible with vSphere 6.7 and vSphere 7.0, as I was not yet sure whether I would be able to upgrade to vSphere 7.0.

One product I forgot about was VMware Horizon. I’m currently on 7.10 and the VMware Product Interoperability Matrix shows that at least 7.12 is needed to be supported. As I currently use Horizon with full clones this should not pose many problems, and I am planning to upgrade this as well. Had I been using Linked Clones or Instant Clones, this could have been worse.

Update sequence for vSphere 7.0 and its compatible VMware products

Now that we checked all the info on whether everything would be compatible and supported after the upgrade, it is time to check the knowledge base article on the update sequence. This article shows what must be upgraded before another component can be upgraded to keep a supported environment.

Phase 2: The upgrade

vRealize Log Insight upgrade

vRealize Log Insight was running version 4.6. Digging back through the release notes showed me that I had to upgrade from 4.6 > 4.7(.1) > 4.8 > 8.1. The VMware Product Interoperability Matrix showed that vRealize Log Insight 8.1 was compatible with both vSphere 6.7U3 and vSphere 7.0.

The upgrade process was painless. It just took a lot of time. The process itself is straightforward. Go to Administration > Management > Cluster, upload the pak file and follow the screens. In my case I had to do this again and again, because I had not upgraded Log Insight for a long time.

vRealize Operations Manager upgrade

Upgrading the vRealize Operations Manager node is a breeze too, mainly because it is a simple setup with only one master node. vRealize Operations Manager was running version 7.5. I missed the 8.1 release so I upgraded to 8.0 first.

There are a couple of things that need some attention.

  1. Always run the vRealize Operations Manager Pre-Upgrade Readiness Assessment Tool (APUAT pak)
  2. Make sure to upgrade the OS through the OS pak files first, then the vROps pak file
  3. As I upgraded to 8.0 I had to switch files to execute the 8.1 upgrade

The Pre-Upgrade Readiness Assessment Tool showed me warnings for two items:

  1. Validating product version. Make sure to run vRealize Operations Manager – 6.6.1, 6.7, 7.0 and 7.5 Virtual Appliance upgrade, as product version is 7.5.0. Ensure product and upgrade versions meet the requirements.
  2. Checking /dev/sda partition size. The size of the partition is less than 20GB. Increase the size of the partition to be greater or equal to 20 GB (https://kb.vmware.com/s/article/75298).

Both are easily addressed. The first one gives a warning to use the correct pak file when upgrading. The second one refers to a KB article that has only a couple of steps:

  • take vRealize Operations offline
  • shut down guest OS
  • increase hard disk in vCenter
  • boot Virtual Machine
  • take vRealize Operations online

After addressing both warnings I was able to upgrade.

vCenter upgrade

Simple and easy when you are already on the VCSA with an embedded Platform Services Controller (PSC).

Run the installer and choose upgrade. Supply the source vCenter information and destination vCenter information and click Next – Next – Finish. Grab a drink and wait. It is a two part process. The first part will deploy the new machine with the chosen information and the second part will migrate the data from the old vCenter to the new vCenter.

ESXi upgrade

Before the actual upgrade could take place I needed to be sure that everything would work after the upgrade. Within the vExpert vCommunity I had seen a nice and easy way to do this. I am sorry that I can’t give credit to the person that I got the idea from.

  • Create a bootable USB ESXi installer or use the iDRAC or equivalent technology to boot your server from the ESXi installer
  • Find an empty USB flash drive
  • Put your server in Maintenance Mode
  • Shutdown and boot your server from the USB installer or the iDRAC or equivalent technology
  • Install to the empty USB drive – BE CAREFUL not to install to the wrong location

Upgrade check workaround

ESXi will create a VMFS volume from the remaining local space where ESXi is installed by default. After installing I tried to add the ESXi host to vCenter but failed because it had detected the local VMFS volume from the original install and that was conflicting with the one that was still present in vCenter but disconnected. I rebooted the ESXi host, booted into the original drive, verified nothing was on the local drive in the original install and deleted the datastore. Rebooted again into the USB drive and now could add the USB installed ESXi 7.0 to vCenter. Now I was able to get a glimpse of how everything was seen from an ESXi 7.0 install. The NVMe drives I was worried about were showing all fine.

Again this is a home lab and not all components are on the VMware HCL so this adds some extra steps like checking from an actual install. This would not be necessary in a production environment where everything has green checks on the VMware HCL.

The actual upgrade

Upgrading ESXi hosts is easiest through vSphere Lifecycle Manager (the rebranded VMware Update Manager).

I imported the Dell customized ISO, created a baseline and did a Host Compliance Check. The Host Compliance Check reported the host as Incompatible and led me to the following two knowledge base articles:

I had to remove everything based on the qedi and qedf drivers
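Finding and removing those VIBs from the ESXi shell could look like the sketch below. The two knowledge base articles remain the authoritative procedure. The name pattern (anything starting with qedi or qedf) is my assumption and may not cover every related VIB on a given image, and the function names are hypothetical. It uses `esxcli software vib list` and `esxcli software vib remove -n`.

```shell
# Filter `esxcli software vib list` output (read from stdin) down to the
# names of VIBs that start with qedi or qedf.
qed_vibs() {
    awk '$1 ~ /^qed[if]/ { print $1 }'
}

# Hypothetical helper: list the matching VIBs, then remove them one by one.
remove_qed_vibs() {
    esxcli software vib list | qed_vibs | while read -r vib; do
        echo "removing $vib"
        esxcli software vib remove -n "$vib"
    done
}
```

It is worth running `esxcli software vib list | qed_vibs` on its own first to review what would be removed, with the host in Maintenance Mode, before calling `remove_qed_vibs`.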

VMware Tools / VM Hardware Compatibility

Upgrading the VMware Tools and the VM Hardware Compatibility is the last part in the process. Determine the viability of each VM to upgrade the VMware Tools and the VM Hardware Compatibility. For most VMs this won’t pose a problem. Nevertheless there are some vendor appliances that will need to run a specific version.

vSAN upgrade

Although vSAN is not really a separate component to upgrade, you will need to upgrade the on-disk format. This is an online upgrade that will not impact the running VMs.

Conclusion

Good preparation of an upgrade is key !!

Upgrade Methodology checks:

  1. Determine your BOM (Bill Of Materials)
  2. Check the documentation first
    1. Read the Release Notes
    2. Read the Upgrade guides for each component
  3. Check the HCL
  4. Check the Interoperability Guide
  5. Determine the update sequence
  6. Upgrade according to your plan

Use iPerf to test NIC speed between two ESXi hosts

Sometimes you want or need to use iPerf to test the NIC speed between two ESXi hosts. I did because I was seeing a NIC with low throughput in my lab.

How can we test raw speeds between the two hosts? iPerf comes to the rescue. I was looking into how to do this on an ESXi host. It doesn’t come as a surprise that I found the solution at William Lam’s virtuallyghetto.com. Apparently iperf has been included in ESXi since 6.5 U2. You used to have to copy iperf to iperf.copy. In ESXi 7.0 that has been done for you, although you will need to look for /usr/lib/vmware/vsan/bin/iperf3.copy.

ESXi host 1 (iperf server)

Disable the firewall:

Change to the directory containing the iperf binary

Execute iPerf as server

Overview of the used parameters:

-s will start iperf as server
-B defines the IP the iperf server will listen on
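Put together, the server side could look like the sketch below. The commands themselves are not preserved in this copy of the post, so this is a reconstruction under assumptions: SERVER_IP is a placeholder for the vmkernel IP of host 1, the binary path is the ESXi 7.0 one mentioned above, and the function name is made up.

```shell
# Placeholder address: the vmkernel IP on host 1 that iperf should bind to.
SERVER_IP=192.0.2.10
IPERF_DIR=/usr/lib/vmware/vsan/bin

start_iperf_server() {
    # The firewall would otherwise block the test traffic; re-enable it afterwards.
    esxcli network firewall set --enabled false
    cd "$IPERF_DIR" || return 1
    # -s: run as server, -B: listen on this address only
    ./iperf3.copy -s -B "$SERVER_IP"
}
```

Calling `start_iperf_server` on host 1 keeps iperf in the foreground until it is stopped with Ctrl+C.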

ESXi host 2 (iperf client)

Disable the firewall:

Change to the directory containing the iperf binary

Execute iPerf as client

Overview of the used parameters:

-i sets the reporting interval in seconds
-t sets how long iperf will run, in seconds
-c runs iperf as a client and connects to the given server IP, which determines the vmkernel interface that is used
-fm sets the report format to Mbit/s
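The client side can be sketched the same way. As with the server, this is a reconstruction under assumptions rather than the original command: SERVER_IP is a placeholder for the address the server on host 1 listens on, the interval of 1 second and the duration of 10 seconds are example values, and the function name is hypothetical.

```shell
# Placeholder address: where the iperf server on host 1 is listening.
SERVER_IP=192.0.2.10
IPERF_DIR=/usr/lib/vmware/vsan/bin

run_iperf_client() {
    # The firewall would otherwise block the test traffic; re-enable it afterwards.
    esxcli network firewall set --enabled false
    cd "$IPERF_DIR" || return 1
    # Report every second, run for 10 seconds, print results in Mbit/s.
    ./iperf3.copy -i 1 -t 10 -c "$SERVER_IP" -fm
}
```

Running `run_iperf_client` on host 2 while the server is active on host 1 prints one throughput line per interval followed by a summary.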

Don’t forget to re-enable the firewall on both systems.

esxcli network firewall set --enabled true

My first deploy with VMware Cloud Foundation

I have been working on a script to deploy environments on a regular basis in my homelab. While I have made great progress, I have not been able to complete it due to lack of time. It did improve my PowerShell scripting skills.

A while ago I followed a webinar about VMware Cloud Foundation Lab Constructor (VLC in short). This will deploy a VCF environment in a decent amount of time. With little effort I have been able to get this up and running multiple times. There are some pitfalls I ran into. My goal is to get to learn more about VCF, NSX-T and K8s all in a VMware Validated Design (VVD) setup.

You can get access too by completing the registration form at tiny.cc/getVLC.

The following files are included in the download:

  • Example_DNS_Entries.txt
  • VCF Lab Constructor Install Guide 39.pdf
  • VLCGui.ps1
  • add_3_hosts.json
  • add_3_hosts_bulk_commission VSAN.json
  • default-vcf-ems.json
  • default_mgmt_hosthw.json
  • maradns-2.0.16-1.x86_64.rpm
  • mkisofs.exe
  • plink.exe

As I already have a DNS infrastructure in place I used ‘Example_DNS_Entries.txt‘ as a reference to create all the necessary DNS entries.

Read the documentation pdf FIRST. It will give you a good insight in what will be set up, won’t be set up and how everything will be set up. I’m not planning to repeat info that is included in the documentation. The only thing that I have copied from this pdf is the disclaimer because I feel it is important:

Below I have included the various configuration files and split them to show the different parts and also to show where I deviated from the defaults. These are the configuration files that the VLC script will use:

  • Management domain:
    • default-vcf-ems.json → changed all ip addresses, gateways, hostnames, networks and licenses
    • default_mgmt_hosthw.json → changed the number of CPUs (8 → 12), the amount of RAM (32 and 64 → 80) and the disk sizes (50, 150 and 175 → 150)
  • Workload domain
    • add_3_hosts.json → changed the hostname, management IP and IP gateway

To deploy VCF and be able to deploy NSX-T you will need a good amount of resources. The minimum host resources needed to deploy NSX-T are 12 vCPUs (there is a workaround to lower the vCPU requirements for NSX-T) and 80GB of RAM.

The configuration files

The first file is the ‘default_mgmt_hosthw.json’. This file describes the specs for the (virtual) hardware for the management domain hosts:

default management host hardware json

The second file is the ‘default-vcf-ems.json’. This file describes the configuration for all software components for the management domain:

default VCF EMS JSON

The last configuration file is ‘add_3_hosts.json’. This configuration file is optional and can be used to prepare three extra hosts for the first workload domain:

Where did I change the defaults

There are some settings that I changed from the defaults aside from changing the names and network settings:

  • in the ‘default_mgmt_hosthw.json’ I have changed the CPU to 12 to be able to deploy NSX-T
  • in the ‘default_mgmt_hosthw.json’ I have changed the RAM to 80 to be able to deploy NSX-T

How do we start

If you meet the prerequisites it is fairly simple. Fire up the ‘VLCGui.ps1’. This will present the following GUI, which lets you supply all the necessary information and connect to your physical environment. It speaks for itself; just make sure the Cluster, Network Name and Datastore fields are highlighted blue like the following.

What’s next

I hope to expand this initial post with a couple of follow-up posts. These are the topics that I’m currently thinking about:

  • NSX-T
  • importing the upgrade and deployment bundles
  • K8s

… and maybe more …

Additional info

Support:

Slack VLC Support channel – http://tiny.cc/getVLCSlack

Some blogs:

https://blog.bertello.org/2019/08/building-nested-vcf-using-vcf-lab-constructor-vlc/ and https://blog.bertello.org/category/automation/

https://my-sddc.net/

https://vinfrastructure.it/2019/10/vmware-cloud-foundation-3-9/

https://blogs.vmware.com/cloud-foundation/