Some time ago we were having issues in the Tanzu PoC class for partners we were teaching. One of the students had an environment where the Enable Workload Management process was unable to finish the creation of the Supervisor Cluster.
It was an interesting issue because when we verified all the settings, everything appeared to be configured correctly at the UI level. Nevertheless, when we went to the virtualservice, we saw that it was down because the servers in the pool were not up.
When the Enable Workload Management process is unable to finish, there are some usual suspects. Most of the time the details entered in the Enable Workload Management wizard are just not correct. I believe validation of the supplied values could be better. Only when the process takes too long do you know that you need to start verifying the components. The following milestones can be checked:
- Are the Supervisor Control Plane VMs created?
- Do the Supervisor Control Plane VMs have the correct number of IPs?
- Are the NSX ALB Service Engine VMs created?
During the troubleshooting, we verified these usual suspects. We also verified all values supplied in the different consoles, both on the Workload Management configuration page in the vSphere Client and on the NSX ALB. It seemed that this student had done everything correctly. We started to rule out issues by pinging, running curl against the relevant IPs and checking the logs.
At some point we arrived at the Service Engines and went from there. At lunch time I stumbled onto this blog post from Nick Schmidt (a fellow vExpert), which gave our troubleshooting a real boost:
This showed how to connect to the networking namespace on the Service Engine and this helped a lot.
If you do not connect to the networking namespace, you will see the configuration on an OS level. Within the networking namespace you troubleshoot within the correct context.
Although the web UI showed the correct values for the configured routes, they were not applied correctly on the NSX ALB SE.
Here are the steps that I executed when connected to one of the NSX ALB Service Engines:
ifconfig --> shows the network configuration of the NSX ALB SE
ip route --> shows the routes; only the management route was listed
ip netns show --> shows the network namespaces; only one was listed in this environment, namely avi_ns1, as there was also only one tenant
ip netns exec avi_ns1 bash --> launches a shell within the avi_ns1 namespace
ip route --> shows the routes from the avi_ns1 namespace
Now we saw that there was a route missing within this namespace. We went back to the web UI, deleted the route and re-created it, et voilà: the servers in the pool came up and therefore the virtualservice was alive.
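If you want to repeat this check without eyeballing the output, the comparison can be sketched as a small shell helper. This is only a sketch and not part of what we ran in class: the function name missing_routes is my own, and the prefixes in the usage comment are placeholders that you would replace with the routes configured in your own NSX ALB. It reads `ip route` output on stdin and prints every expected destination that has no entry.

```shell
#!/usr/bin/env bash
# missing_routes (hypothetical helper): reads `ip route` output on stdin and
# prints each expected destination, passed as an argument, that has no entry.
missing_routes() {
  local table dest
  table="$(cat)"
  for dest in "$@"; do
    # The first field of every `ip route` line is the destination:
    # a prefix such as 192.0.2.0/24, or the word "default".
    if ! printf '%s\n' "$table" |
      awk -v d="$dest" '$1 == d { found = 1 } END { exit !found }'; then
      echo "$dest"
    fi
  done
}

# Usage on a Service Engine, as root. The namespace avi_ns1 is the one from
# this environment; the prefixes are placeholders:
#   ip netns exec avi_ns1 ip route | missing_routes default 192.0.2.0/24
```

No output means that every expected destination is present in the namespace. Anything that is printed is a route that the web UI may show but that never made it onto the SE.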