Quick tip: Setting up TrueSSO?

Are you setting up TrueSSO? Are you looking to use signed certificates to secure the communication between the Connection Server and the Enrollment Server?

Try to find the documentation on using signed certificates to secure that communication. I challenge you: you will not find it easily.

What and why?

You are allowing access to the Unified Access Gateway from the internet. You will want those services to have signed certificates to secure the communication, which will turn that icon in the Horizon client green. To enable end-to-end signed communication, you will need to make sure that you have certs all the way. In the end you are creating tunnels to backend services.

On top of that you want to add TrueSSO to the equation, because you want a seamless sign-on experience. This means more certificates. You follow the guides (and all the blog posts that are built on this information), so you are almost there.

However, one step is exporting the ‘vdm.ec’ certificate from the Connection Server and importing it on the Enrollment Server. That is exactly where the information is missing, or at least hard to find. None of the guides actually talk about CA-signed certificates for this step. You are going to all this effort to get those components signed by a (Microsoft) CA. Don’t you think that you should use signed certificates here as well? I think so!

Where can I find the documentation?

Here is the documentation on the VMware websites on setting up TrueSSO:

… and also some great blog articles out there:

Search no more: you can find it here on the docs.vmware.com site. It is just in another section and a bit hard to find.

Photon OS 3.0 update the maximum failed OS login attempts

A client of mine was looking for a way to update the maximum failed OS login attempts, because their monitoring solution kept locking the root user account on the VMware Unified Access Gateway 2212. This version of the UAG is based on Photon OS 3.0.

He asked me to verify where to change the configuration to update the maximum failed OS login attempts. This is normally set at UAG deploy time and there is no easy option to change it afterwards.

Be aware that this is a rather unconventional change because these values shouldn’t be changed from the default, especially if you want to be compliant with CIS audits for example.

This is the default and also the line that you need to change

Open system-auth located in /etc/pam.d

  • deny=3

Change deny=3 to the maximum value you want. If you change it to 0 (zero), no local user will ever be denied based on the maximum failed OS login attempts.

  • even_deny_root

Leave this option out if the root user should not be locked out after reaching the maximum failed OS login attempts.

  • unlock_time=86400

Default unlock time for all users

  • root_unlock_time=300

Unlock time for the root user account
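The four options above can be tried out on a copy of the line before touching the appliance. The sketch below is only a string rewrite and does not edit /etc/pam.d/system-auth. The sample line, including the module name pam_tally2.so, the file= path and onerr=fail, is an assumption for illustration; compare it with the line on your own UAG first.

```python
import re

# Hypothetical sample of the tally line in /etc/pam.d/system-auth.
# The module name and the file=/onerr= options are assumptions; verify on your appliance.
SAMPLE_LINE = (
    "auth required pam_tally2.so file=/var/log/tallylog deny=3 "
    "onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300"
)


def update_tally_line(line, deny=None, lock_root=True):
    """Return a copy of the PAM line with a new deny value and/or without even_deny_root."""
    result = []
    for token in line.split():
        if deny is not None and re.fullmatch(r"deny=\d+", token):
            token = f"deny={deny}"
        if token == "even_deny_root" and not lock_root:
            continue  # dropping this option exempts root from the lockout
        result.append(token)
    return " ".join(result)


if __name__ == "__main__":
    # Raise the limit to 10 attempts and stop locking out root.
    print(update_tally_line(SAMPLE_LINE, deny=10, lock_root=False))
```

Keep a backup of system-auth before pasting any changed line back, since a typo in a PAM file can lock everyone out.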

Source: https://www.stigviewer.com/stig/vmware_vsphere_6.7_photon_os/2022-09-27/

Deleting the datastore where a content library is hosted is probably not the best idea

Deleting the datastore where a content library is hosted is probably not the best idea, but … yes, a stupid error, and now what? If you are not faint of heart (and know how to take a snapshot), you can rectify this. You should contact GSS, as there is no documented solution and this might break.

Take a snapshot and verify if the vCenter backups are in a healthy status. Yes? Ok go ahead.

Log on to the vCenter and create a new Content Library and name it ‘i-made-an-error’. Use the new datastore you want to use and keep the rest of the settings default as these don’t really matter.

Open an SSH session to the vCenter and connect to the PostgreSQL database ‘VCDB’.

To show which tables are present within the database:

Show an overview of the Content Libraries added (make sure to add the trailing ;):

Show the Content Library entries in the vCenter database

We now have an overview of the Content Libraries, with the one that is throwing an error highlighted.

In the following overview we find the library id from the new Content Library we just added and also the corresponding storage id.

Database Content Library storage ids

I will replace the storage id of the faulty Content Library, found in the previous screenshot, with the storage id of the new Content Library.

Update the storage id for the faulty Content Library
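The logic of the fix (look up the storage id of the new library, then point the faulty library at it) can be rehearsed outside vCenter. The sketch below uses an in-memory SQLite database as a stand-in. The table names, column names and ids are hypothetical and are not the VCDB schema; it only illustrates the shape of the lookup and the update.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in. Table, column and id values are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE library (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE library_storage (library_id TEXT, storage_id TEXT);
        INSERT INTO library VALUES ('lib-old', 'faulty-library');
        INSERT INTO library VALUES ('lib-new', 'i-made-an-error');
        INSERT INTO library_storage VALUES ('lib-old', 'storage-on-deleted-datastore');
        INSERT INTO library_storage VALUES ('lib-new', 'storage-on-new-datastore');
        """
    )
    return conn


def storage_id_for(conn, library_name):
    """Return the storage id currently mapped to the named library."""
    row = conn.execute(
        "SELECT s.storage_id FROM library l "
        "JOIN library_storage s ON s.library_id = l.id WHERE l.name = ?",
        (library_name,),
    ).fetchone()
    if row is None:
        raise LookupError(f"no storage mapping for library {library_name!r}")
    return row[0]


def repoint_storage(conn, faulty_name, donor_name):
    """Point the faulty library at the storage id used by the donor library."""
    new_storage = storage_id_for(conn, donor_name)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE library_storage SET storage_id = ? "
            "WHERE library_id = (SELECT id FROM library WHERE name = ?)",
            (new_storage, faulty_name),
        )
    return new_storage


if __name__ == "__main__":
    db = build_demo_db()
    repoint_storage(db, "faulty-library", "i-made-an-error")
    print(storage_id_for(db, "faulty-library"))
```

Doing the lookup and the update in one small transaction mirrors why the snapshot matters on the real system: if the update goes wrong, you want a clean way back.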

There are a couple of places that helped me in solving this:

https://communities.vmware.com/t5/VMware-vCenter-Discussions/Content-library-item-delete-issue/td-p/2266050

https://tinkertry.com/how-to-remove-vmware-vsphere-zombie-datastore

https://vmninja.wordpress.com/2019/04/05/remove-inaccessible-datastore-from-inventory/

Enable Workload Management does not finish

Some time ago we were having issues in the Tanzu PoC class for partners we were teaching. One of the students had an environment where the Enable Workload Management process was unable to finish the creation of the Supervisor Cluster.

It was an interesting issue, because when we verified all the settings everything looked correctly configured at the UI level. Nevertheless, when we went to the virtual service we saw that it was down because the servers in the pool were not up.

When Enable Workload Management is unable to finish, there are some usual suspects. Most of the time the details entered in the Enable Workload Management wizard are simply not correct. Validation of the supplied values could be better, I believe. You only find out that you need to start verifying the components when it takes too long. The following milestones can be checked.

  • Are the Supervisor Control Plane VMs created?
  • Do the Supervisor Control Plane VMs have the correct number of IPs?
  • Are the NSX ALB Service Engine VMs created?

During the troubleshooting, we verified these usual suspects. We also verified all values supplied in the different consoles, both on the Workload Management configuration page in the vSphere client and on the NSX ALB. It seemed that this student had done everything correctly. We started to rule out issues by pinging, running curl against the relevant IPs and checking the logs.

At some point we arrived at the Service Engines and went from there. At lunch time I stumbled onto this blog post from Nick Schmidt (a fellow vExpert), which gave our troubleshooting a jump start:

https://dev.to/ngschmidt/troubleshooting-with-vmware-nsx-alb-avi-vantage-23pc

This showed how to connect to the networking namespace on the Service Engine and this helped a lot.

If you do not connect to the networking namespace, you will see the configuration on an OS level. Within the networking namespace you troubleshoot within the correct context.

Although the web UI showed the correct values for the configured routes, they were not applied correctly on the NSX ALB SE.

Here are the steps that I executed when connected to one of the NSX ALB Service Engines:

Now we saw that there was a route missing within this namespace. We went back to the web UI, deleted the route and re-created it, et voilà: the servers in the pool came up and therefore the virtual service was alive.
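Spotting a missing route by eye is easy to get wrong when the table is long. The sketch below compares the routes you configured in the web UI against the text output of `ip route` captured inside the Service Engine's networking namespace. The sample output and all addresses are made up for illustration, and the parser only looks at the first field of each line.

```python
def parse_route_destinations(ip_route_output):
    """Return the destination (first field) of every non-empty line of `ip route` output."""
    destinations = set()
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields:
            destinations.add(fields[0])
    return destinations


def missing_routes(expected, ip_route_output):
    """Return the expected destinations that do not appear in the route table."""
    present = parse_route_destinations(ip_route_output)
    return [dest for dest in expected if dest not in present]


# Illustrative output only; addresses and interface names are invented.
SAMPLE_OUTPUT = """\
default via 10.0.0.1 dev eth1
10.0.0.0/24 dev eth1 proto kernel scope link src 10.0.0.21
"""

if __name__ == "__main__":
    configured_in_ui = ["10.0.0.0/24", "10.10.0.0/16"]
    print(missing_routes(configured_in_ui, SAMPLE_OUTPUT))
```

Any destination the function returns is a candidate for the delete-and-re-create treatment described above.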

vLCM fails to upgrade a firmware component

I recently experienced an issue in an HPE environment where vSphere Lifecycle Manager (vLCM) failed to upgrade the firmware on an HP FlexFabric 534FLR-SFP+ Adapter.

On HPE Gen10 servers it is possible to leverage vSphere Lifecycle Manager to manage not only the ESXi version but also the firmware and drivers of the different hardware components. vLCM leverages a vendor tool, in HPE’s case either HPE OneView or HPE iLO Amplifier Pack, to do the heavy lifting for the firmware.

Apparently it fails when multiple adapters with firmware v7.15.97 or earlier are present in the system. The upgrade succeeds on one adapter but not on the subsequent adapter(s), see here. The KB specifically mentions HPE OneView, but as I experienced, it also affects HPE iLO Amplifier Pack, which makes sense.
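To find out in advance which hosts are likely to hit this, the condition can be written down as a small check: more than one adapter at firmware v7.15.97 or earlier. This is a sketch under the assumption that the version strings are plain dotted numbers; how you collect them per host is up to you.

```python
# Threshold taken from the issue description: v7.15.97 or earlier is affected.
AFFECTED_UP_TO = (7, 15, 97)


def parse_version(text):
    """Turn 'v7.15.97' or '7.15.97' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def is_affected(firmware_version):
    """True if a single adapter's firmware is at or below the threshold."""
    return parse_version(firmware_version) <= AFFECTED_UP_TO


def host_at_risk(adapter_versions):
    """True if more than one adapter in the host runs an affected firmware."""
    affected = [version for version in adapter_versions if is_affected(version)]
    return len(affected) > 1


if __name__ == "__main__":
    # Example host with two adapters on the threshold version.
    print(host_at_risk(["7.15.97", "v7.15.97"]))
```

Comparing tuples of integers avoids the classic trap of comparing version strings alphabetically, where "7.9" would sort after "7.15".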

The following screenshot shows two hosts out of compliance with the image, because of that specific firmware. Other hosts in that cluster upgraded the firmware on the adapter just fine. It really is due to the version to upgrade from.

vLCM Cluster Image settings and Compliance

Resolution

The KB article provides a link to a firmware upgrade utility, but that one is for ESXi 6.0 / 6.5. You can download the 7.0 version here.

Now that we have downloaded the firmware update utility, put the host into Maintenance Mode and copy the utility onto the ESXi host. Putting it in the /tmp directory has the (dis)advantage that it is removed when the machine is rebooted.

SSH to the host and install the firmware update utility (Smart Component):

This should be the output:

Now go to the directory where the firmware update utility is installed and run it:

This should be the output:

If the return value is 1, that is a good sign. I had to rerun it a few times because of return value 0. I also had a return value of 106, which didn’t change after several runs. I rebooted that host, ran it again and then it went OK.
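If you need to repeat this on several hosts, the return values above can be turned into a tiny decision helper. The mapping reflects only what I observed in this environment and is not taken from vendor documentation; the action names are made up for this sketch.

```python
def next_action(return_value):
    """Map the utility's return value to the step that worked for me (observations only)."""
    if return_value == 1:
        return "done"  # firmware update reported success
    if return_value == 0:
        return "rerun"  # running the utility again eventually gave 1
    if return_value == 106:
        return "reboot-then-rerun"  # reruns alone did not help, a reboot did
    return "investigate"  # anything else was not seen during this exercise


if __name__ == "__main__":
    for value in (1, 0, 106, 7):
        print(value, next_action(value))
```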

As a final step, clean up by removing the firmware update utility:

Reboot. When the host is back, Check Compliance again and you should be good to go.