What is this?

Provisioning an instance of API-M and importing some APIs can be performed in about 3 minutes (using the Consumption tier). However, that’s a single instance, with a single API. Is that OK for production?

I’ve been getting asked more frequently about ‘next-level’ patterns and practices around API-M: specifically resiliency patterns, multiple sites via ExpressRoute, and protecting not only the front end of API-M but the ‘backend’ access, too.

How can we go about achieving this? Let’s look at some possible scenarios, and what we can do to make our deployment ‘production grade’.

The constants

In the following diagrams, we have some constants. We have:

  • 2x sites, each with a single ExpressRoute circuit
    • These sites do NOT have any form of backup, such as a crossed ER link or a site-to-site VPN - just the single circuit.
  • A requirement to connect API-M into a VNET
  • A requirement to connect to external services or APIs, such as Azure Logic Apps

The initial ‘highly available external VNET integration’ design

The design below is a fairly standard starting point when people begin asking about running API-M in a resilient manner.

Discuss

We have 2x separate stacks, with API-M ‘externally’ VNET integrated. This means each instance gets a publicly accessible interface, and does NOT get a private IP which can be accessed from the VNET. We add a Traffic Manager out the front to load-balance/HA/DR the API-M endpoints. Exactly how depends on the services, but at any given time you’re operating in one stack or the other.

Any APIs we have on-premises can be accessed by way of the VNET integration; however, access to API-M itself must traverse the internet in some manner.
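To make that concrete, here’s a minimal sketch (Python, using the requests library - the hostname, API path and subscription key are all made-up placeholders, not from any real deployment) of what a client sees when calling through Traffic Manager. Traffic Manager is DNS-based, so the profile simply resolves to whichever regional API-M public endpoint is currently healthy/preferred, and the call itself still rides the public internet:

```python
import socket
import requests

# Hypothetical names - substitute your own Traffic Manager profile,
# API path and API-M subscription key.
TRAFFIC_MANAGER_HOST = "api.contoso-example.trafficmanager.net"
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"

# Traffic Manager is DNS-based: the profile name resolves to whichever
# regional API-M public endpoint is currently healthy/preferred.
print("Resolves to:", socket.gethostbyname(TRAFFIC_MANAGER_HOST))

# The API call itself still traverses the public internet to that endpoint.
resp = requests.get(
    f"https://{TRAFFIC_MANAGER_HOST}/echo/resource",  # hypothetical API path
    headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
    timeout=10,
)
print(resp.status_code)
```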

Pros and cons

Pros

  • Easy to deploy
  • Straightforward to understand traffic flows
  • Achieves very basic levels of HA

Cons

  • High administrative overhead, due to multiple developer portal instances (this is a large con)
  • Connectivity into the instance needs to traverse the internet
  • Fails to meet a few HA/DR design considerations or requirements from enterprise organizations
  • ‘Backend’ HA/DR doesn’t exist (from on-prem); if a particular API-M instance goes down, that region goes down, so to speak

The initial ‘internal VNET integration trying to use global VNET peering’ design

The diagram below illustrates a common issue: attempting to use internal VNET integration with global VNET peering.

Discuss

Again, we have 2x separate stacks, but this time the API-Ms are ‘internally’ VNET integrated. This means the instances get an internal IP in a VNET subnet, and do NOT get a publicly usable IP which can be accessed from the internet. We have to manage our own DNS, because Traffic Manager only works with external endpoints and the API-M instance only responds to requests with a valid host header, and we also have to manage any form of active/active or active/passive traffic routing (of which none exists here).

Any APIs we have on-premises can be accessed by way of the VNET integration, and the good news is that traffic to a given instance traverses the Azure backbone over private addressing, never touching the internet.

A common problem people run into is trying to peer VNETs across regions and expecting this to work. It does not, because the underlying API-M instance uses a Basic Load Balancer, and traversal over a globally peered VNET is not supported for Basic Load Balancers.

An interesting note is that the API-M instances can still access public APIs without any routing or NATing. This is because the API-M instance still has a public interface for administrative duties (such as Azure managing the instance). This means that when an external API is added as a backend, the source of that traffic is this otherwise unusable interface.
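To see why the DNS point above matters, here’s a rough sketch (Python requests; the private IP, hostname and key are made-up placeholders) of calling an internally integrated gateway by its private IP while overriding the host header. Without a matching host header - or, more sensibly, a DNS record that points the gateway hostname at the private IP - API-M won’t serve the request:

```python
import requests

# Hypothetical values - the private IP of the internally VNET-integrated
# API-M gateway, and the hostname it is configured to respond to.
APIM_PRIVATE_IP = "10.10.1.5"
APIM_HOSTNAME = "api.contoso-example.internal"
SUBSCRIPTION_KEY = "<your-apim-subscription-key>"

# API-M only answers requests carrying a host header it knows about, which
# is why internal deployments need their own DNS rather than Traffic Manager.
resp = requests.get(
    f"https://{APIM_PRIVATE_IP}/echo/resource",  # hypothetical API path
    headers={
        "Host": APIM_HOSTNAME,
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
    },
    # The certificate is issued for the hostname, not the IP, so validation
    # is skipped here; a proper DNS record avoids this entirely.
    verify=False,
    timeout=10,
)
print(resp.status_code)
```

In practice you’d create the DNS records (and keep TLS happy) rather than override headers like this - the snippet just shows where the host-header requirement bites.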

Pros and cons

Pros

  • Easy-ish to deploy
  • Straightforward to understand traffic flows
  • All (most) traffic traverses the Azure backbone, privately, within your VNETs
  • On-premises services can access the instances via private connectivity
  • Can implement network security controls such as NSGs

Cons

  • Global VNET peering simply doesn’t work with what we’re trying to do here
  • High administrative overhead, due to multiple developer portal instances (this is a large con)
  • Now need to manage DNS - records must be configured manually, as API-M only responds to requests with a valid host header
  • ‘Backend’ HA/DR still doesn’t apply here for on-premises services, due to the peered-VNET routing constraints
  • Fails to meet a few HA/DR design considerations or requirements from enterprise organizations

Short note

One thing to note is that VNET peering works intra-region - so, for example, if you have 2x peered VNETs in Australia East, traffic traversal works fine. It’s only VNETs ‘globally’ peered across regions where this doesn’t work.


The ‘we are doing some cool stuff now with S2S VPN’ design

The diagram below addresses the VNET routing issue with a network-to-network (site-to-site, or S2S) VPN, giving a good solution for on-premises HA/DR. We’re starting to get some serious functionality out of these designs.

Discuss

Again, we have 2x separate stacks, and the API-Ms are still ‘internally’ VNET integrated. The key difference here is that we’ve scrapped the globally peered VNETs in favour of using Azure VPN Gateways to establish a site-to-site VPN. In front of the ExpressRoute circuits, we put an Application Gateway per region and configure API-M as an HA backend pool. Our on-premises services access the Application Gateway, and in the event of API-M falling over, the Application Gateway will, in theory, route to the other region via the VPN.
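As a rough illustration of what the Application Gateway health probe is doing for us, the sketch below checks each regional gateway and returns the first healthy one, falling back to the other region if the local API-M is down. It assumes API-M’s built-in status endpoint (/status-0123456789abcdef) as the probe path, and the hostnames are placeholders resolved by your own DNS:

```python
from typing import List, Optional

import requests

# Hypothetical regional gateway hostnames (resolved via your own DNS
# to the private IPs of each internally integrated API-M instance).
REGIONAL_GATEWAYS = [
    "apim-australiaeast.contoso-example.internal",
    "apim-australiasoutheast.contoso-example.internal",
]

# API-M's built-in status endpoint, commonly used as the custom health
# probe path when fronting API-M with Application Gateway.
HEALTH_PATH = "/status-0123456789abcdef"


def first_healthy_gateway(hosts: List[str]) -> Optional[str]:
    """Return the first gateway that answers its health probe, else None."""
    for host in hosts:
        try:
            resp = requests.get(f"https://{host}{HEALTH_PATH}", timeout=5)
            if resp.status_code == 200:
                return host
        except requests.RequestException:
            continue  # unreachable - try the next region over the S2S VPN
    return None


if __name__ == "__main__":
    target = first_healthy_gateway(REGIONAL_GATEWAYS)
    print("Routing traffic to:", target or "no healthy gateway found")
```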

Pros and cons

Pros

  • All (most) traffic traverses the Azure backbone, privately, within your VNETs
  • On-premises services can access the instances via private connectivity
  • Can implement network security controls such as NSGs
  • Can fail over between regions for an on-premises site, if API-M fails

Cons

  • Slightly higher complexity
  • Higher cost with the addition of VPN and Application Gateway
  • High administrative overhead, due to multiple developer portal instances (this is a large con)
  • Now need to manage DNS - records must be configured manually, as API-M only responds to requests with a valid host header

The ‘But I need to add another site and I’m sick of 2x API-M instances’ design

The diagram below builds on top of our last one in a few major ways.

Discuss

Our 2x separate stacks have been replaced by a single, scaled API-M instance - this is a feature of the Premium SKU where you can deploy gateway instances into multiple regions, all centrally managed.

The Application Gateways, etc. still exist; however, we’ve moved to a ‘hub-and-spoke’ topology where we route our spoke traffic through a ‘hub’ - this means we can add more spokes and scale the API-M instance further out where required.

If a new site is added, or a new stack is required (say in West Europe), we configure our networking and VPN, scale the API-M instance into that region, and we’re done.
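As a small sketch of how that scales, each region the Premium instance is deployed into gets its own regional gateway endpoint, and adding a site is then mostly a routing-preference question. The hostnames below follow the default Azure-provided -<region>-01.regional.azure-api.net pattern, but treat the whole thing as hypothetical - custom domains and your own DNS will look different:

```python
from typing import List

# A minimal sketch of how per-site routing preferences grow as regions are
# added to a single, scaled (Premium) API-M instance.

APIM_NAME = "contoso-apim"  # hypothetical instance name

# Regions the instance is currently scaled into.
REGIONS = ["australiaeast", "australiasoutheast"]


def regional_gateway(region: str) -> str:
    """Default Azure-provided regional gateway hostname (assumed pattern)."""
    return f"{APIM_NAME}-{region}-01.regional.azure-api.net"


def preferences_for_site(nearest_region: str) -> List[str]:
    """Nearest region first, every other region as failover."""
    ordered = [nearest_region] + [r for r in REGIONS if r != nearest_region]
    return [regional_gateway(r) for r in ordered]


# Adding a new stack (say West Europe) is just another entry here, plus the
# networking/VPN and an API-M scale-out into that region.
REGIONS.append("westeurope")

for site, region in [("Sydney DC", "australiaeast"), ("Amsterdam DC", "westeurope")]:
    print(site, "->", preferences_for_site(region))
```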

It’s important to note that the developer portal cannot be made HA - it always runs in the ‘primary’ region, and if that region fails, so does the portal.

Pros and cons

Pros

  • All (most) traffic traverses the Azure backbone, privately, within your VNETs
  • On-premises services can access the instances via private connectivity
  • Can implement network security controls such as NSGs
  • Can fail over between regions for an on-premises site, if API-M fails
  • Can easily scale to more than 2 sites
  • Routed traffic is plumbed through an NVA or Azure Firewall, for example, giving a single point of control for certain traffic types

Cons

  • High administrative overhead
  • Higher complexity
  • Higher cost again, with VPN gateways and an NVA to route traffic in the hub
  • The API-M developer portal runs only in a single (primary) region; if that region fails, the developer portal fails
  • Now need to manage DNS - records must be configured manually, as API-M only responds to requests with a valid host header

Finally, the ‘Now I’m sick of managing the network’ design

The diagram below builds on top of our last one by replacing the manual hub-and-spoke design with Azure Virtual WAN. This is, however, a fundamental change.

Discuss

Same stacks, same API-M, etc. - the key difference here is that we have completely redesigned our network topology to utilize Azure Virtual WAN.

AZVWAN (‘AY-ZEE-VEE-WAN’ from here on, because it’s fun to say) automates the manual hub-and-spoke work we did before by overlaying the management plane across the entire deployment.

This means we have a single entry point for routing, traffic ingress and egress, VNET-to-VNET transitive connectivity, on-premises connectivity, security, logging, troubleshooting and more.

AZVWAN is also much more scalable than a manual hub-and-spoke, from both a performance and an administrative-overhead perspective.

Pros and cons

Pros

  • All (most) traffic traverses the Azure backbone, privately, within your VNETs
  • On-premises services can access the instances via private connectivity
  • Can implement network security controls such as NSGs
  • Can fail over between regions for an on-premises site, if API-M fails
  • Can easily scale to more than 2 sites
  • Single point of management for all of our connectivity into Azure
  • Automates the provision and administration of the ‘spokes’

Cons

  • Higher cost again with all the components and AZVWAN
  • The API-M developer portal runs only in a single (primary) region; if that region fails, the developer portal fails
  • Now need to manage DNS - records must be configured manually, as API-M only responds to requests with a valid host header

Summary

By combining certain products and services, we can design a highly available, resilient architecture for both external and on-premises clients of API-M. It’s also worth noting again that the example sites above have no HA/DR for ExpressRoute; we could simplify a bunch of things here if the sites had another ER link to another region, or an S2S VPN, as backup.

The designs begin to get complex because we are pushing complexity to the wrong places. If we rely on the ER backup mentioned above and have a simple mechanism to operate in either one stack or the other, things are much simpler. That, however, might not accommodate multiple sites - something our final design incorporates with ease.

One thing which could be coming is support for Azure Private Link (Private Endpoints) - this would perhaps get around our peered VNET issue and reduce complexity.

The above isn’t an absolute or exhaustive list of possible solutions, just some ways to solve some common problems.

I guess, as always, ‘it depends’.