When customers with vSphere+NSX-T-based foundations apply a stemcell update, update a tile, or upgrade PAS (Pivotal Application Service) from 2.2 to 2.3, their Cloud Foundry may become unreachable: the redeploy recreates the VMs, emptying their NSX-T static load balancer server pools.

This blog post describes a method to ensure availability during upgrades. We use a combination of customized Operations Manager resource configs and BOSH VM Extensions.

The sample workflow in this post covers upgrading PAS 2.2 to PAS 2.3 alongside an Operations Manager upgrade; however, it can be adapted to stemcell and tile upgrades as well.

Operations Manager 2.3 introduces the capability to manage the full lifecycle of NSX-T load balancer pool membership, relieving customers of the responsibility of manually assigning VMs to server pools. This keeps the BOSH instance group VMs behind NSX-T load balancers available even during upgrades, re-deploys, and IP address re-assignment.

To enable these features without downtime for deployments using Operations Manager 2.2, one must migrate membership from out-of-band VM assignment to BOSH-managed assignment prior to upgrading to Operations Manager 2.3. This blog post describes that migration process.

0. Procedure

  • Review manually-created static [or Dynamic] NSX-T Load Balancer Server Pools
  • Craft BOSH VM Extensions
  • Craft Operations Manager resource configs
  • Upgrade Operations Manager to version 2.3
  • Stage VM extensions with the Operations Manager API
  • Stage Pivotal Application Service tile version 2.3.0
  • Stage new resource configs with the Operations Manager API
  • Apply changes (deploy)

1. Review Manually-Created Static NSX-T Load Balancer Server Pools

We log into our NSX Manager to review the Server Pool configuration of our Static Load Balancers:

In the table below, we describe the purpose of each load balancer server pool. The most important are the first two; they allow HTTP(S) access to our applications. The “Job” column names the job that is load balanced by each server pool; jobs can be viewed on the Pivotal Application Service “Status” page in Operations Manager. The “BOSH instance group” column names the instance group for that job within BOSH.

| Server Pool               | Description                                  | Port | Job         | BOSH instance group |
|---------------------------|----------------------------------------------|------|-------------|---------------------|
| PAS-GoRouter443ServerPool | HTTPS access to cf apps                      | 443  | Router      | router              |
| PAS-GoRouter80ServerPool  | HTTP access to cf apps                       | 80   | Router      | router              |
| PAS-SSHProxyServerPool    | cf ssh                                       | 2222 | Diego Brain | diego_brain         |
| PAS-TCPRouterServerPool   | TCP access to non-HTTP(S) cf apps (optional) | *    | TCP Router  | tcp_router          |

* - the TCP router server pool uses a port range, e.g. 1024-1124, and does not load balance a single port.

We double-check to make sure the IP addresses of the VMs backing each server pool correspond to the IP addresses of the jobs. For example, we check the IP addresses backing the PAS-GoRouter443ServerPool server pool:

We cross-reference those IP addresses to the IPs of the Router VMs in our foundation:

We cross-reference all server pools.
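If the BOSH CLI is targeted at the foundation’s BOSH Director, the job IPs can also be listed from the command line. This is a sketch; the environment alias pcf is an assumption, and the deployment name is the cf GUID we find in step 3.1:

# list the Router instances and their IPs
bosh -e pcf -d cf-ebd5ce1f7b11714cbb94 vms | grep router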

2. Craft BOSH VM Extensions

For each BOSH instance group, we prepare a JSON file describing a VM extension that will be applied to that instance group.

Here is an example of the VM extension configuration for the Router job:

{
  "cloud_properties": {
    "nsxt": {
      "lb": {
        "server_pools": [
          {
            "name": "PAS-GoRouter443ServerPool",
            "port": 443
          },
          {
            "name": "PAS-GoRouter80ServerPool",
            "port": 80
          }
        ]
      }
    }
  },
  "name": "http_https_lb"
}

The server_pools array should contain a list of JSON objects with the name of the NSX-T load balancer server pool, and where applicable, a port. When configuring load balancers that use a port range, such as the TCP router load balancer, the port should be omitted.
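For example, here is a sketch of the VM extension for the TCP Router; the extension name tcp_lb is our own arbitrary choice, and the server pool name comes from the table in step 1. Note the absence of a port:

{
  "cloud_properties": {
    "nsxt": {
      "lb": {
        "server_pools": [
          {
            "name": "PAS-TCPRouterServerPool"
          }
        ]
      }
    }
  },
  "name": "tcp_lb"
}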

The name (e.g. http_https_lb) is arbitrary but must match the name used in the resource config in the following section.

In our case, we create 3 VM extensions: router_vm-extension.json, diego_brain_vm-extension.json, tcp_router_vm-extension.json.

3. Craft Operations Manager resource configs

For each BOSH instance group, we prepare a JSON file describing the resource config that BOSH will use when redeploying the instance group during the upgrade.

3.0 Authenticate To Use the Operations Manager API

We authenticate in order to obtain our UAA access token.

FOUNDATION_URL= # your ops manager URL goes here, e.g. https://pcf.example.com
UAA_ACCESS_TOKEN= # your token goes here, a very long string. Very long.
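One way to obtain the token is with the UAA CLI (uaac); this is a sketch, assuming an Ops Manager admin user named admin. The opsman client secret is blank, and uaac prompts for the admin password:

uaac target "$FOUNDATION_URL/uaa" --skip-ssl-validation
uaac token owner get opsman admin -s ''
uaac context # the access_token field is the value for UAA_ACCESS_TOKEN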

3.1 Determine the GUID of the PAS Foundation

We use the Operations Manager API to determine the GUID of the PAS foundation:

curl "$FOUNDATION_URL/api/v0/staged/products" -H "Authorization: Bearer $UAA_ACCESS_TOKEN"

This will return a long string of JSON. We’re looking for the entry with "type":"cf": [..., {"installation_name":"some-guid","guid":"some-guid","type":"cf","product_version":"2.2.2"}, ...]. In our foundation, the GUID is cf-ebd5ce1f7b11714cbb94.
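If you have jq installed, you can extract the GUID directly:

curl "$FOUNDATION_URL/api/v0/staged/products" -H "Authorization: Bearer $UAA_ACCESS_TOKEN" | jq -r '.[] | select(.type=="cf") | .guid'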

3.2 Determine the GUID of the Jobs That Belong in Server Pools

CF_GUID= # the GUID we got from the previous step
curl "$FOUNDATION_URL/api/v0/staged/products/$CF_GUID/jobs" -H "Authorization: Bearer $UAA_ACCESS_TOKEN"

This also returns a long string of JSON. We want to find the GUIDs for the jobs (BOSH instance groups) router, diego_brain, and tcp_router (the corresponding GUIDs in our foundation are router-1a40bb1433cd790d3920, diego_brain-247c1f4a616b5d43546e, and tcp_router-93c7a99605c109f53f8c).

As before, if you have jq installed, you can pipe the output through the following filter to isolate the important components: jq -r '.jobs[] | select(.name=="router" or .name=="diego_brain" or .name=="tcp_router")'.

3.3 Get Existing Resource Configs Attached to the Jobs

We run the following command for the GUID of each job we’re interested in:

curl "$FOUNDATION_URL/api/v0/staged/products/$CF_GUID/jobs/$JOB_GUID/resource_config" -H "Authorization: Bearer $UAA_ACCESS_TOKEN" > $JOB_NAME-resource-config.json

Where CF_GUID is set to the GUID from step 3.1, JOB_GUID is set to the GUID from step 3.2, and JOB_NAME is set to the name of the BOSH instance group for this job.

This produces a result like {"instance_type":{"id":"automatic"},"instances":2,"nsx_security_groups":null,"nsx_lbs":[],"additional_vm_extensions":[]}.
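Rather than running the command three times by hand, we can loop over the job GUIDs (shown here with the GUIDs from our foundation; substitute your own) and derive each file name by stripping the GUID suffix:

for JOB_GUID in router-1a40bb1433cd790d3920 diego_brain-247c1f4a616b5d43546e tcp_router-93c7a99605c109f53f8c; do
  JOB_NAME="${JOB_GUID%-*}" # e.g. router-1a40bb1433cd790d3920 -> router
  curl "$FOUNDATION_URL/api/v0/staged/products/$CF_GUID/jobs/$JOB_GUID/resource_config" \
    -H "Authorization: Bearer $UAA_ACCESS_TOKEN" > "$JOB_NAME-resource-config.json"
done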

3.4 Write New Resource Configs For Each Job

Edit each file from the previous step to add the name of the corresponding VM extension from step 2 (e.g. http_https_lb), in quotes, to the additional_vm_extensions array. [resource_config]

We do this once for each load-balanced job (router, diego_brain, and tcp_router).
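If you prefer jq to editing by hand, a one-liner accomplishes the same change (shown for the router job; use the extension name you chose for each of the other jobs):

jq '.additional_vm_extensions += ["http_https_lb"]' router-resource-config.json > tmp.json && mv tmp.json router-resource-config.json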

4. Upgrade Operations Manager to Version 2.3

Following the steps in the Upgrading Pivotal Cloud Foundry documentation, we upgrade Operations Manager to version 2.3.

Do not follow steps to upgrade the PAS tile at this time.

If you are using the VMware NSX-T tile, and it is version 2.2 or earlier, you should now stage version 2.3 of that tile.

Do not press “Apply Changes” during this step.

5. Stage VM Extensions with the Operations Manager API

For each VM extension JSON file we wrote in step 2, we run the following curl command:

curl "$FOUNDATION_URL/api/v0/staged/vm_extensions" -X POST -H "Authorization: Bearer $UAA_ACCESS_TOKEN" -d "@${VM_EXTENSION_FILE}" -H "Content-Type: application/json"

Where ${VM_EXTENSION_FILE} is the path to the file. We expect to see an HTTP status 200 and an empty JSON object {} returned for each call.
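Expressed as a loop over the three files from step 2:

for VM_EXTENSION_FILE in router_vm-extension.json diego_brain_vm-extension.json tcp_router_vm-extension.json; do
  curl "$FOUNDATION_URL/api/v0/staged/vm_extensions" -X POST \
    -H "Authorization: Bearer $UAA_ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d "@${VM_EXTENSION_FILE}"
done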

6. Stage Pivotal Application Service Tile Version 2.3.0

Following the steps in the Upgrading Pivotal Cloud Foundry documentation, we stage PAS tile version 2.3.0.

Do not press “Apply Changes” during this step.

7. Stage New Resource Configs with the Operations Manager API

For each resource config JSON file we wrote in step 3.4, we run the following curl command:

curl "$FOUNDATION_URL/api/v0/staged/products/${CF_GUID}/jobs/${JOB_GUID}/resource_config" -X PUT -H "Authorization: Bearer $UAA_ACCESS_TOKEN" -d "@${RESOURCE_CONFIG_FILE}" -H "Content-Type: application/json"

Where ${RESOURCE_CONFIG_FILE} is the path to the file. We expect to see an HTTP status 200 and an empty JSON object {} returned for each call.
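As in step 3.3, this can be expressed as a loop over the job GUIDs (again, substitute your own), deriving each file name from the instance group name:

for JOB_GUID in router-1a40bb1433cd790d3920 diego_brain-247c1f4a616b5d43546e tcp_router-93c7a99605c109f53f8c; do
  JOB_NAME="${JOB_GUID%-*}"
  curl "$FOUNDATION_URL/api/v0/staged/products/${CF_GUID}/jobs/${JOB_GUID}/resource_config" \
    -X PUT -H "Authorization: Bearer $UAA_ACCESS_TOKEN" \
    -H "Content-Type: application/json" \
    -d "@${JOB_NAME}-resource-config.json"
done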

8. Apply Changes (Deploy)

Press “Review Pending Changes” in the Operations Manager 2.3 UI, then press “Apply Changes”.

At the end of this step, you will have a PAS 2.3.0 foundation, with networking optionally provided by VMware NSX-T tile 2.3.0, where each job VM is located in the appropriate NSX-T load balancer server pool.

9. Gotchas

We encountered an incident in which the NSX-T load balancer was unable to forward traffic to the newly-deployed gorouters.

Rebooting one of the NSX-T Edges restored the flow of traffic from the NSX-T load balancer to the gorouters. We are unsure of the root cause; however, since existing load balancer pools continued to function, we suspect the Edge had become incapable of honoring server pool membership updates.

10. Troubleshooting

We find the Traceflow networking tool (Tools → Traceflow) invaluable when debugging network failures. In the screenshot below, we examine the gorouter/0 VM’s ability to communicate with the load balancer (IP address 10.144.15.4) on port 443 (HTTPS). In this case, we determined the gorouter’s CID using the bosh vms command, but we could just as easily have looked it up on the Status page of the PAS tile in Operations Manager:
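For reference, a sketch of the CID lookup with the BOSH CLI; the environment alias pcf is an assumption, and the deployment name is the cf GUID from step 3.1:

# the VM CID column of the output holds each router instance's cloud ID
bosh -e pcf -d cf-ebd5ce1f7b11714cbb94 vms | grep router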

References

VM Extensions we used for our deployment:

The Bash commands we followed when we upgraded: script.

Acknowledgements

Josh Gray of the PEZ Team was instrumental in discovering the behavior and providing resources to test remediation. The BOSH vSphere CPI Team provided invaluable support.

Bryan Kelly of Cerner provided invaluable feedback, pointing out that this process is relevant not only to 2.2 → 2.3 upgrades but also to stemcell upgrades and tile upgrades.

Corrections & Updates

2018-11-20

Added Gotchas and Troubleshooting sections after suggestions from Bryan Kelly.

Footnotes

[or Dynamic] This blog post focuses on static pools, but the procedure is identical for dynamic pools since the BOSH vSphere CPI “looks up the NSGroup [of the server pool] and adds the VM to the NSGroup”.

[resource_config] In a prettified JSON file, our change would look like the following:

  {
    "instance_type": {
      "id": "automatic"
    },
    "instances": 2,
    "nsx_security_groups": null,
    "nsx_lbs": [],
    "additional_vm_extensions": [
+     "http_https_lb"
    ]
  }