Wednesday, March 5, 2014

Strategies for managing OAAM to OAM connections in production

Many Oracle Access Management 11g customers opt to deploy a combination of Oracle Access Manager and Oracle Adaptive Access Manager using the Advanced Integration option. This combination of product features can provide strong, adaptive authentication and fraud mitigation for online applications. In this post, we examine a number of strategies for configuring the connectivity between these components in order to provide scalability and high availability for production deployments.

Note: The information in this post applies to the 11g R2 versions of OAAM and OAM only (at the time of writing, 11.1.2.0, 11.1.2.1 and 11.1.2.2).

Before continuing, readers are advised to consult Appendix C of the Oracle Fusion Middleware Integration Guide for Oracle Identity Management Suite (11.1.2.2 release here) to familiarize themselves with the Advanced Integration option, in terms of its features, benefits and configuration steps. This post concentrates only on the configuration of the parameters controlling the OAP connection pool between OAAM and OAM.


The problem we are trying to solve


When OAM and OAAM are deployed using the Advanced Integration pattern, the two product components play different roles during the authentication process. Through the use of the OAAM Authentication Scheme in OAM, the collection of credentials (and thus the entire authentication flow with the user's browser) is handled by OAAM. The actual authentication (or, in fact, credential validation) step is still performed by OAM via a back-channel OAP (Oracle Access Protocol) call from OAAM.

OAAM uses its configured logic to collect the username and password from the user, with the aid of virtual strong authentication devices, fraud detection rules and the like. Once it has collected these credentials, it uses an embedded OAM Access SDK client (or custom AccessGate) to pass them to the OAM server. OAM validates the credentials against its configured LDAP identity store and returns the result to OAAM. Should the authentication succeed, OAAM then generates a Delegated Authentication Protocol (DAP) token and redirects the user back to OAM with this token in order to create the necessary OAM session. Because every authentication depends on this back-channel call, it is critical for the performance and availability of production deployments that the OAP connection mechanism between OAAM and OAM is correctly configured.


How OAAM manages connections to OAM


Unlike OAM webgates, which are completely configurable via the webgate profile in the OAM console (which in turn generates the ObAccessClient.xml file), OAM Access SDK clients (such as OAAM) do not use the webgate profile for anything other than basic authentication to the OAM server. What this means is that while the webgate ID and password are important, OAAM will essentially ignore any other settings on the webgate profile. In particular, it ignores the settings controlling the number of primary and secondary OAP connections that should be created against each OAM server, which allow for load balancing and high availability when configuring webgates.

Instead, OAAM's connection pool is configured via a number of OAAM properties, which provide somewhat less flexibility in terms of support for load balancing. Please also see Appendix C of the Oracle Fusion Middleware Administrator's Guide for Oracle Adaptive Access Manager (11.1.2.2 release here). We explore these properties below, before discussing a number of strategies that can be used to ensure a production-ready deployment:
  • oaam.uio.oam.webgate_id - defines the webgate ID used by OAAM. This defaults to IAMSuiteAgent and should not be changed.
  • oaam.oam.csf.credentials.enabled - this property, when set, uses the Fusion Middleware Credential Store Framework (CSF) to securely store passwords, such as the webgate password. This should always be set to true.
  • oaam.uio.oam.security.mode - defines the communication security between OAAM and OAM; can be 1 (open), 2 (simple) or 3 (cert). Open is the default.
  • oaam.uio.oam.host - defines the primary OAM hostname to which OAP connections should be established.
  • oaam.uio.oam.port - defines the OAP port for the primary OAM host (this defaults to 5575).
  • oaam.uio.oam.secondary.host - defines the secondary, or failover, OAM hostname. OAP connections will only be established to this host if connections to the primary OAM host fail.
  • oaam.uio.oam.secondary.host.port - defines the OAP port for the secondary OAM host (this defaults to 5575).
  • oaam.oam.oamclient.minConInPool - defines the minimum number of OAP connections that OAAM will maintain in its pool. Each OAAM server applies this setting to its own pool.
  • oaam.uio.oam.num_of_connections - defines the target (maximum) number of OAP connections to the primary OAM server that OAAM will maintain in its pool. Each OAAM server applies this setting to its own pool. The default value is 5.
  • oaam.uio.oam.secondary.host.num_of_connections - defines the target (maximum) number of OAP connections to the secondary OAM server that OAAM will maintain in its pool. Each OAAM server applies this setting to its own pool. The default value is 5.
  • oaam.oam.oamclient.timeout - the period in seconds that a request will wait for an available OAP connection before timing out. The default is 3600 seconds (1 hour), which is far too high and should always be reduced to no more than 60 seconds in production.
  • oaam.oam.oamclient.periodForWatcher - defines the rest period (in seconds) for the OAAM Pool Watcher thread, a thread which periodically checks the health of connections in the pool. The default is 3600 seconds (1 hour), which should probably be reduced to around 300 (5 minutes) for production deployments.
  • oaam.oam.oamclient.initDelayForWatcher - defines the initial delay (in seconds) before the OAAM Pool Watcher thread starts to check connections. The default is 3600 seconds (1 hour), which should probably be reduced to around 300 (5 minutes) for production deployments.
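
The tuning advice in the list above can be captured in a small review script. The following is only a sketch in Python: the property names, stated defaults and suggested values are taken from the list, while the function name review_pool_settings and the idea of holding the overrides in a dict are assumptions made for illustration and are not part of OAAM.

```python
# Defaults as stated in the property list above (all values in seconds).
STATED_DEFAULTS = {
    "oaam.oam.oamclient.timeout": 3600,
    "oaam.oam.oamclient.periodForWatcher": 3600,
    "oaam.oam.oamclient.initDelayForWatcher": 3600,
}

# Production values suggested in the property list above (all in seconds).
SUGGESTED_VALUES = {
    "oaam.oam.oamclient.timeout": 60,
    "oaam.oam.oamclient.periodForWatcher": 300,
    "oaam.oam.oamclient.initDelayForWatcher": 300,
}


def review_pool_settings(props):
    """Return warnings for a dict of OAAM property overrides.

    Hypothetical helper: OAAM itself does not ship this function.
    """
    warnings = []
    for name, suggested in SUGGESTED_VALUES.items():
        # A property that is not overridden falls back to its stated default.
        value = int(props.get(name, STATED_DEFAULTS[name]))
        if value > suggested:
            warnings.append(f"{name}={value} is above the suggested {suggested} seconds")
    if str(props.get("oaam.oam.csf.credentials.enabled", "")).lower() != "true":
        warnings.append("oaam.oam.csf.credentials.enabled should be set to true")
    if props.get("oaam.uio.oam.webgate_id", "IAMSuiteAgent") != "IAMSuiteAgent":
        warnings.append("oaam.uio.oam.webgate_id should remain IAMSuiteAgent")
    return warnings


if __name__ == "__main__":
    overrides = {
        "oaam.oam.csf.credentials.enabled": "true",
        "oaam.oam.oamclient.timeout": "60",
    }
    for warning in review_pool_settings(overrides):
        print(warning)
```

With the overrides shown, only the two watcher properties are reported, since they are left at their one-hour defaults.
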
Perusing the above properties, the immediate observation is that only a single primary and a single secondary OAM server can be specified. This is of limited use for large-scale production deployments, where requests from OAAM typically need to be load balanced across a number of OAM servers. Below, we explore a number of options that can work.

 

Options for OAAM to OAM connection load balancing


1: Override deployment-wide properties on a per-host basis

In a deployment where the number of OAAM nodes exactly matches the number of OAM nodes, a sensible and robust load balancing approach is simply to allocate a single primary and a single secondary OAM server to each OAAM server. This can be achieved by overriding the deployment-wide oaam.uio.oam.host and oaam.uio.oam.secondary.host settings on each individual OAAM host. To do this, first ensure that you delete the applicable property values from the OAAM database via the OAAM console. Then pass a unique value to each OAAM server instance at startup via a Java system property, e.g. -Doaam.uio.oam.host=<primary_host_name> and -Doaam.uio.oam.secondary.host=<secondary_host_name>.

Consider a deployment comprising two OAAM hosts (Host A and Host B) and two further OAM hosts (Host C and Host D). Using this approach, the OAAM hosts would be configured as follows:
  • Host A - oaam.uio.oam.host: Host C and oaam.uio.oam.secondary.host: Host D
  • Host B - oaam.uio.oam.host: Host D and oaam.uio.oam.secondary.host: Host C

This configuration would ensure that both OAM hosts receive an equivalent number of connections, thus providing load balancing, while also providing resilience should either OAM server become unavailable. This approach does, though, suffer from a number of drawbacks, including the following:
  • unsuitable for deployments where the number of OAM nodes differs from the number of OAAM nodes.
  • manageability is reduced as OAAM console cannot be used to configure per-host parameter values.
  • would not scale much beyond two nodes while still providing high availability. The loss of more than one OAM node at any one time would potentially render certain OAAM nodes unusable.
  • no way to rebalance load across OAM nodes in case an OAAM node goes down.
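
The per-host assignment described above can be generated rather than written by hand. The sketch below, in Python, pairs OAAM host i with OAM host i as primary and the next OAM host (wrapping around) as secondary. That pairing rule is an assumption chosen because it reproduces the two-node example; the function names and the host names hostA to hostD are hypothetical. Only the two -D property names come from the post.

```python
def assign_oam_servers(oaam_hosts, oam_hosts):
    """Pair each OAAM host with a (primary, secondary) OAM host.

    OAAM host i gets OAM host i as primary and OAM host i+1 (wrapping
    around) as secondary, so every OAM host is primary exactly once.
    """
    if len(oaam_hosts) != len(oam_hosts):
        raise ValueError("this approach assumes equal numbers of OAAM and OAM hosts")
    if len(oam_hosts) < 2:
        raise ValueError("at least two OAM hosts are needed for a distinct secondary")
    count = len(oam_hosts)
    return {
        oaam: (oam_hosts[i], oam_hosts[(i + 1) % count])
        for i, oaam in enumerate(oaam_hosts)
    }


def jvm_flags(primary, secondary):
    """Render the two startup flags that override the deployment-wide values."""
    return [
        f"-Doaam.uio.oam.host={primary}",
        f"-Doaam.uio.oam.secondary.host={secondary}",
    ]


if __name__ == "__main__":
    plan = assign_oam_servers(["hostA", "hostB"], ["hostC", "hostD"])
    for oaam_host, (primary, secondary) in plan.items():
        print(oaam_host, " ".join(jvm_flags(primary, secondary)))
```

Note that with more than two nodes this pairing still gives each OAAM host only one fallback, which is the scaling limitation listed in the drawbacks above.
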

2: Use virtual hostnames

The second option is similar to the first, in that it allows for the definition of a single primary and a single secondary OAM server for each OAAM server. In this case, though, rather than overriding domain-wide property values, the approach is to use virtual hostnames to define the OAM servers. For example, we would define the following:
  • oaam.uio.oam.host: oam-primary.domain.com
  • oaam.uio.oam.secondary.host: oam-secondary.domain.com

We would then use the /etc/hosts file on each OAAM node to define exactly which physical OAM server IP address the virtual hostnames oam-primary and oam-secondary should resolve to. In our above scenario, OAAM Host A would have entries in its hosts file mapping oam-primary to the IP address for OAM Host C and oam-secondary to the IP address for OAM Host D. Host B would instead map oam-primary to the IP address for OAM Host D and oam-secondary to the IP address for OAM Host C. In cases where OAAM and OAM servers are co-located on the same hardware, we can use a shortcut and specify "localhost" as the oaam.uio.oam.host value.

This approach provides much the same benefits as the first option and incurs the same drawbacks, with the possible exception that it may prove somewhat easier to manage in production. In particular, the fact that any of the virtual mappings could be changed dynamically (without needing to restart OAAM) would be a definite advantage of this strategy.
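
As an illustration of the hosts-file mappings just described, the following Python sketch prints the two entries each OAAM node would need. The virtual hostnames are the ones used in the post; the IP addresses are placeholders from the 192.0.2.0/24 documentation range, and the function name hosts_file_entries is hypothetical.

```python
def hosts_file_entries(primary_ip, secondary_ip, domain="domain.com"):
    """Return /etc/hosts lines mapping the virtual OAM names to real IPs.

    Each line has the form: IP address, fully qualified name, short alias.
    """
    return [
        f"{primary_ip}\toam-primary.{domain}\toam-primary",
        f"{secondary_ip}\toam-secondary.{domain}\toam-secondary",
    ]


if __name__ == "__main__":
    # Placeholder addresses: assume OAM Host C is 192.0.2.13 and Host D is 192.0.2.14.
    host_c, host_d = "192.0.2.13", "192.0.2.14"
    per_node = {
        "OAAM Host A": hosts_file_entries(host_c, host_d),
        "OAAM Host B": hosts_file_entries(host_d, host_c),
    }
    for node, lines in per_node.items():
        print(f"# /etc/hosts additions on {node}")
        print("\n".join(lines))
```
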

3: Use an external load balancer

Perhaps the most obvious solution to this problem is to insert some form of external load balancer between OAAM and OAM. In this case, OAAM is configured such that the oaam.uio.oam.host property points to the address of the load balancer, which then in turn distributes requests to the OAM servers according to whatever algorithm is desired. In this scenario, it does not even make sense to define the oaam.uio.oam.secondary.host property (unless there is a second, redundant load balancer in place) since it's assumed that the load balancer itself will only route requests to active OAM nodes. This approach has a number of benefits when compared to options 1 and 2 above, including the following:
  • can be used to balance load from any number of OAAM servers to any number of OAM servers; there is no requirement for symmetry
  • better scalability beyond 2 nodes
  • better manageability via load balancer console, rather than host files/command-line switches
These benefits do come at a cost, however, in terms of increased complexity within the deployment. There will also be a physical cost to procuring and commissioning the necessary load balancing device.

In addition, some caveats need to be mentioned at this point. Firstly, while it may seem an obvious point, it's worth remembering that OAP is a long-lived, TCP-based protocol and thus the load balancer used must be able to handle such a protocol. OAP is not HTTP, so an HTTP-only load balancer cannot be used here.

The fact that OAP connections are long-lived can introduce some unforeseen complications, like the ones described in this excellent post by Chris Johnson. Unless the load balancer is able to dynamically rebalance connections, it is possible that an OAM server outage could result in an unbalanced connection load even after the troublesome server is brought back online. The only way to mitigate this situation would be to perform a managed rolling restart of the OAAM cluster once all the OAM servers are up again.

The comments in that post about connection timeouts are also applicable; it is best to configure the load balancer so as not to time out idle/long-lived connections if possible. If not, these timeouts should be set for as long as possible, since we do not have the equivalent of the webgate "Max Session Time" parameter available through OAAM's configuration properties. If it is not possible to avoid connection timeouts, then as a mitigation, be sure to set the oaam.oam.oamclient.periodForWatcher property to a low enough value, to increase the likelihood that the OAAM pool watcher will detect and re-establish a timed-out connection before a real client request attempts to use it.
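
The rebalancing caveat above can be illustrated with a toy simulation. This Python sketch is a simplified model, not a description of how any particular load balancer or the OAAM pool actually behaves: it assumes a round-robin balancer that only places new connections, a pool of ten connections, and two hypothetical OAM servers named oam1 and oam2.

```python
from collections import Counter
from itertools import cycle


def open_connections(count, servers):
    """Place `count` new connections round-robin across the available servers."""
    picker = cycle(servers)
    return [next(picker) for _ in range(count)]


def fail_server(connections, failed, survivors):
    """Drop connections to the failed server and re-open them on the survivors.

    Connections to healthy servers are long-lived and are left untouched.
    """
    kept = [server for server in connections if server != failed]
    lost = len(connections) - len(kept)
    return kept + open_connections(lost, survivors)


if __name__ == "__main__":
    servers = ["oam1", "oam2"]
    pool = open_connections(10, servers)
    print("initial:", dict(Counter(pool)))

    pool = fail_server(pool, "oam1", ["oam2"])
    print("after oam1 outage:", dict(Counter(pool)))

    # oam1 comes back, but nothing closes the existing long-lived
    # connections, so the distribution stays exactly as it was.
    print("after oam1 recovery:", dict(Counter(pool)))

    # A rolling restart of OAAM re-opens every connection via the balancer.
    pool = open_connections(10, servers)
    print("after OAAM restart:", dict(Counter(pool)))
```

In this model all ten connections end up on oam2 after the outage and stay there after oam1 recovers, until the OAAM pool is rebuilt.
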

4: Use a combination of the above

While there is obviously no perfect answer or one-size-fits-all solution here, the most sensible approach may well be to combine the above options; a number of the more unpleasant side effects caused by load balancing OAP can be avoided by using a direct host connection (either option 1 or 2) for the primary OAM server connection. If a load balancer is available, it could be used as the secondary, thus allowing the solution to scale beyond two nodes without compromising availability.
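
A minimal sketch of what the combined option might look like as property values, again in Python. The four property names and the default OAP port 5575 come from the post; the function name combined_settings and the host names oam-primary.domain.com and oam-lb.domain.com are hypothetical, and the direct primary host would be set per OAAM node using option 1 or option 2.

```python
def combined_settings(direct_oam_host, load_balancer_host,
                      direct_port=5575, load_balancer_port=5575):
    """Return OAAM properties with a direct primary and a load-balanced secondary."""
    return {
        # Primary: a direct connection to one OAM server (option 1 or 2).
        "oaam.uio.oam.host": direct_oam_host,
        "oaam.uio.oam.port": str(direct_port),
        # Secondary: the load balancer in front of all OAM servers (option 3).
        "oaam.uio.oam.secondary.host": load_balancer_host,
        "oaam.uio.oam.secondary.host.port": str(load_balancer_port),
    }


if __name__ == "__main__":
    settings = combined_settings("oam-primary.domain.com", "oam-lb.domain.com")
    for name, value in settings.items():
        print(f"{name}: {value}")
```
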
