Introduction
In the Network Active-Passive (NAP) configuration of Terracotta, data storage is handled by the Terracotta servers themselves, with no reliance on external storage. The active L2 server keeps the passive L2 servers up to date, over the network, on all changes that occur in the cluster, eliminating the need for a shared disk. The L2s also keep track of which server is active at any point in time.
To use Terracotta in this way, certain network configurations are necessary to ensure that there is no split-brain and that the L1s and the L2s behave deterministically when a failure (network, machine, etc.) does occur.
This document outlines two network configurations that work with Terracotta. Other network configurations may also work reliably, but these two have been well tested. For any other configuration, please contact us and we can help you determine whether it will work or whether more investigation is needed.
Deployment Configuration: Simple (no network redundancy)

Description
This is the simplest NAP configuration for Terracotta. There is no network redundancy, so when any failure occurs there is a good chance that all or part of the cluster will stop functioning. All failover activity is left up to the Terracotta software alone.
In this diagram, the IP addresses are merely examples to demonstrate that the L1s (L1a & L1b) and L2s (TCserverA & TCserverB) can live on different subnets. The actual addressing scheme is specific to your environment. The single switch is a single point of failure.
Additional configuration
No additional network or operating system configuration is necessary in this configuration. Each machine needs a proper network configuration (IP address, subnet mask, gateway, DNS, NTP, hostname) and must be plugged into the network.
Failure scenarios
| TestID | Failure | Expected Outcome |
| --- | --- | --- |
| FS1 | Loss of L1a (link or system) | Cluster continues as normal using only L1b |
| FS2 | Loss of L1b (link or system) | Cluster continues as normal using only L1a |
| FS3 | Loss of L1a & L1b | Non-functioning cluster |
| FS4 | Loss of Switch | Non-functioning cluster |
| FS5 | Loss of Active L2 (link or system) | Passive L2 becomes new Active L2, L1s fail over to new Active L2 |
| FS6 | Loss of Passive L2 | Cluster continues as normal without TC redundancy |
| FS7 | Loss of TCservers A & B | Non-functioning cluster |
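The failure table above can be read as a small decision rule. The following is a minimal sketch of that rule, not Terracotta code: the function name `simple_outcome` and the string results are hypothetical, the component names are taken from the diagram labels, and TCserverA is assumed to be the active L2 unless stated otherwise.

```python
from typing import Iterable

# Component names follow the labels used in this section.
L1S = {"L1a", "L1b"}
L2S = {"TCserverA", "TCserverB"}
SWITCH = "switch"


def simple_outcome(failed: Iterable[str], active: str = "TCserverA") -> str:
    """Return the expected cluster outcome for a set of failed components.

    Hypothetical helper that only encodes the FS1-FS7 table; it does not
    talk to a real cluster.
    """
    failed = set(failed)
    unknown = failed - L1S - L2S - {SWITCH}
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")

    # Losing the single switch, every L1, or every L2 stops the cluster.
    if SWITCH in failed or L1S <= failed or L2S <= failed:
        return "non-functioning"

    notes = []
    if active in failed:
        notes.append("passive L2 takes over as active")
    elif failed & L2S:
        notes.append("no TC redundancy")
    if failed & L1S:
        survivor = (L1S - failed).pop()
        notes.append(f"only {survivor} serving")
    return "functioning" + (": " + ", ".join(notes) if notes else "")
```

For example, `simple_outcome(["TCserverA"])` corresponds to FS5 and `simple_outcome(["switch"])` to FS4.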
Network testing
After the network has been configured, you can test your configuration with simple ping tests.
| TestID | Host | Action | Expected Outcome |
| --- | --- | --- | --- |
| NT1 | all | ping every other host | successful ping |
| NT2 | all | pull network cable during continuous ping | ping failure until link restored |
| NT3 | switch | reload | all pings cease until reload complete and links restored |
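For NT1, it can help to generate the full list of pings up front so that no host pair is skipped. The sketch below only builds the commands; how they are executed on each source host is left to you. The function name `ping_plan` is hypothetical, and the addresses in the usage note are placeholders from documentation ranges, not the addresses in the diagram.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def ping_plan(hosts: Dict[str, str], count: int = 3) -> List[Tuple[str, List[str]]]:
    """Return (source host, ping argv) for every ordered pair of distinct hosts.

    `hosts` maps a host name to its IP address. Each argv is meant to be
    run on the named source host; this function does not run anything.
    """
    plan = []
    for (src, _src_ip), (_dst, dst_ip) in permutations(hosts.items(), 2):
        # "-c" limits the number of echo requests (Linux and macOS ping).
        plan.append((src, ["ping", "-c", str(count), dst_ip]))
    return plan
```

With four hosts, for example `{"L1a": "192.0.2.11", "L1b": "192.0.2.12", "TCserverA": "198.51.100.21", "TCserverB": "198.51.100.22"}`, the plan contains 12 entries, one per direction of each host pair.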
Deployment Configuration: Fully Redundant

Description
This is the fully redundant NAP configuration for Terracotta. It relies on the failover capabilities of Terracotta, the switches, and the operating system. In this scenario it is even possible to sustain certain double failures and still maintain a fully functioning cluster.
In this diagram, the IP addressing scheme is merely to demonstrate that the L1s (L1a & L1b) can be on a different subnet than the L2s (TCserverA & TCserverB). The actual addressing scheme will be specific to your environment. If you choose to implement with a single subnet, there is no need for VRRP/HSRP, but you will still need to configure a single VLAN (which can be VLAN 1) for all TC cluster machines.
In this diagram, two switches are connected with trunked links for redundancy and implement Virtual Router Redundancy Protocol (VRRP) or Hot Standby Router Protocol (HSRP) to provide redundant network paths to the cluster servers in the event of a switch failure. Additionally, every server is configured with a primary and a secondary network link, both controlled by the operating system. In the event of a NIC or link failure on any single link, the operating system should fail over to the backup link without disturbing (e.g. restarting) the Java processes (L1 or L2) on the system.
The Terracotta failover is identical to that in the simple case above, except that both NICs on a single host would need to fail before the TC software initiates any failover of its own.
Additional configuration
- Switch - Switches need to implement VRRP or HSRP to provide redundant gateways for each subnet. Switches also need a trunked connection of two or more lines to prevent any single link failure from splitting the virtual router in two.
- Operating System - Hosts need to be configured with bonded network interfaces connected to the two different switches. For Linux, choose mode 1 (active-backup). More information about Linux channel bonding can be found in the RedHat Linux Reference Guide. Pay special attention to the amount of time it takes for your VRRP or HSRP implementation to reconverge after a recovery. You don't want your NICs to fail back to a switch that is not yet ready to pass traffic. This should be tunable in your bonding configuration.
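One way to keep a recovered link out of service until the switch is ready is the Linux bonding driver's `updelay` option, which the driver expects to be a multiple of `miimon`. The sketch below derives such a value from a reconvergence time that you have measured yourself. The function name `bonding_options` and the default `miimon` of 100 ms are assumptions, and where the resulting option string goes depends on your distribution.

```python
def bonding_options(reconverge_ms: int, miimon_ms: int = 100) -> str:
    """Build a Linux bonding option string for active-backup (mode 1).

    reconverge_ms: measured time for VRRP/HSRP to reconverge after a
                   switch recovers (you supply this; it is not detected).
    miimon_ms:     link monitoring interval in milliseconds.
    """
    if reconverge_ms < 0 or miimon_ms <= 0:
        raise ValueError("reconverge_ms must be >= 0 and miimon_ms must be > 0")
    # Round up to the next multiple of miimon so the recovered link waits
    # at least as long as the switch needs (integer ceiling division).
    updelay = -(-reconverge_ms // miimon_ms) * miimon_ms
    return f"mode=active-backup miimon={miimon_ms} updelay={updelay}"
```

For a measured reconvergence time of 2550 ms, this yields `mode=active-backup miimon=100 updelay=2600`. Treat the result as a starting point and confirm it with the network tests later in this document.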
Failure scenarios
| TestID | Failure | Expected Outcome |
| --- | --- | --- |
| FS8 | Loss of any primary network link | Failover to standby link |
| FS9 | Loss of all primary links | All nodes fail over to their secondary link |
| FS10 | Loss of any switch | Remaining switch assumes the VRRP address; hosts fail over to their other NIC if necessary |
| FS11 | Loss of any L1 (both links or system) | Cluster continues as normal using only the other L1 |
| FS12 | Loss of Active L2 | Passive L2 becomes the new Active L2, all L1s fail over to the new Active L2 |
| FS13 | Loss of Passive L2 | Cluster continues as normal without TC redundancy |
| FS14 | Loss of both switches | Non-functioning cluster |
| FS15 | Loss of single link in switch trunk | Cluster continues as normal without trunk redundancy |
| FS16 | Loss of both trunk links | Possibly non-functioning cluster, depending on VRRP or HSRP implementation |
| FS17 | Loss of both L1s | Non-functioning cluster |
| FS18 | Loss of both L2s | Non-functioning cluster |
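The link-level rows of this table (FS8, FS9, FS10, FS14) follow from how an active-backup bond picks its link. The sketch below models only that choice. It assumes every host's primary link is cabled to switchA and its standby link to switchB, which is an assumption about the diagram; the names `link_in_use`, `host:primary`, and `host:standby` are hypothetical labels, not real interface names.

```python
from typing import Optional, Set

# Assumed cabling: primary link -> switchA, standby link -> switchB.
LINKS = (("primary", "switchA"), ("standby", "switchB"))


def link_in_use(host: str, failed: Set[str]) -> Optional[str]:
    """Return 'primary', 'standby', or None if the host is cut off.

    `failed` holds failed components: a switch name ("switchA") or a
    host link written as "<host>:<link>", e.g. "L1a:primary".
    """
    for link, switch in LINKS:
        # A link carries traffic only if both the link and its switch are up.
        if f"{host}:{link}" not in failed and switch not in failed:
            return link
    return None
```

A result of `None` for a host means the Terracotta-level failover described in FS11 to FS13 would have to take over, since the operating system has no link left to use.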
Network testing
After the network has been configured, you can test your configuration with simple ping tests and various failure scenarios.
| TestID | Host | Action | Expected Outcome |
| --- | --- | --- | --- |
| NT4 | any | ping every other host | successful ping |
| NT5 | any | pull primary link during continuous ping to any other host | failover to secondary link, no noticeable network interruption |
| NT6 | any | pull standby link during continuous ping to any other host | no effect |
| NT7 | Active L2 | pull both network links | Passive L2 becomes Active, L1s fail over to new Active L2 |
| NT8 | Passive L2 | pull both network links | no effect |
| NT9 | switchA | reload | nodes detect link down and fail over to standby link, brief network outage if VRRP transition occurs |
| NT10 | switchB | reload | brief network outage if VRRP transition occurs |
| NT11 | switch | pull single trunk link | no effect |
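Outcomes such as "no noticeable network interruption" and "brief network outage" are easier to compare between runs when the outage is measured. The sketch below computes the longest outage from a series of continuous-ping samples. The sample format, a timestamp in seconds paired with a flag for whether a reply arrived, and the function name `longest_outage` are assumptions; collecting the samples (for instance by parsing ping output) is not shown.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds.

    Each sample is (timestamp_seconds, reply_received), in time order.
    An outage runs from the first lost reply to the next good reply, or
    to the last sample if the link never recovers.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first lost reply of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
        last_ts = ts
    if outage_start is not None and last_ts is not None:
        # The series ended while replies were still being lost.
        longest = max(longest, last_ts - outage_start)
    return longest
```

What counts as "brief" is for you to decide for your environment; the function only reports the number.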
Cluster Tests with Terracotta
Run these tests after the network tests, once network failover has been verified.
Active L2 System Loss Tests - Verify Passive Takeover
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TAL1 | Active L2 Loss - Kill | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. kill -9 the Terracotta PID on L2-A (Active) | L2-B (passive) becomes active and takes the load. No drop in TPS on failover. |
| TAL2 | Active L2 Loss - Clean Shutdown | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Run ~/bin/stop-tc-server.sh on L2-A (Active) | L2-B (passive) becomes active and takes the load. No drop in TPS on failover. |
| TAL3 | Active L2 Loss - Power Down | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Power down L2-A (Active) | L2-B (passive) becomes active and takes the load. No drop in TPS on failover. |
| TAL4 | Active L2 Loss - Reboot | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Reboot L2-A (Active) | L2-B (passive) becomes active and takes the load. No drop in TPS on failover. |
| TAL5 | Active L2 Loss - Pull Plug | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Pull the power cable on L2-A (Active) | L2-B (passive) becomes active and takes the load. No drop in TPS on failover. |
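To judge "no drop in TPS" from measurements, you need some allowance for normal noise between samples. The sketch below compares the mean TPS before and after the failover against a tolerance. The function name `tps_held` and the 5% default tolerance are assumptions made for illustration; how you sample TPS from your application is not covered here.

```python
from statistics import mean
from typing import Sequence


def tps_held(before: Sequence[float], after: Sequence[float],
             tolerance: float = 0.05) -> bool:
    """Return True if mean TPS after failover is within `tolerance`
    (a fraction, 0.05 = 5%) of the mean TPS before it."""
    if not before or not after:
        raise ValueError("need at least one TPS sample on each side of the failover")
    baseline = mean(before)
    if baseline <= 0:
        raise ValueError("baseline TPS must be positive")
    # Fractional drop relative to the pre-failover baseline.
    drop = (baseline - mean(after)) / baseline
    return drop <= tolerance
```

Pick a tolerance that reflects the run-to-run variation you see without any failure injected, and take the "after" samples once L2-B has become active.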
Passive L2 System Loss Tests
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TPL1 | Passive L2 Loss - Kill | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. kill -9 L2-B (Passive) | The data directory needs to be cleaned up; then, when L2-B is restarted, it resyncs state from the active server. |
| TPL2 | Passive L2 Loss - Clean Shutdown | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Run ~/bin/stop-tc-server.sh on L2-B (Passive) | The data directory needs to be cleaned up; then, when L2-B is restarted, it resyncs state from the active server. |
| TPL3 | Passive L2 Loss - Power Down | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Power down L2-B (Passive) | The data directory needs to be cleaned up; then, when L2-B is restarted, it resyncs state from the active server. |
| TPL4 | Passive L2 Loss - Reboot | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Reboot L2-B (Passive) | The data directory needs to be cleaned up; then, when L2-B is restarted, it resyncs state from the active server. |
| TPL5 | Passive L2 Loss - Pull Plug | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run app<br>2. Pull plug on L2-B (Passive) | The data directory needs to be cleaned up; then, when L2-B is restarted, it resyncs state from the active server. |
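The cleanup step in the expected results can be scripted so that it is done the same way in every test run. The sketch below empties a directory while keeping the directory itself. The function name `clean_data_dir` is hypothetical, and the location of the data directory depends on your server configuration, so the caller must pass it in. It assumes the passive L2 process is already stopped, and it deletes data irreversibly, so double-check the path before using anything like it.

```python
import shutil
from pathlib import Path


def clean_data_dir(data_dir: Path) -> int:
    """Remove everything inside data_dir but keep the directory itself.

    Returns the number of top-level entries removed. Intended to be run
    on the stopped passive L2 only, before restarting it.
    """
    if not data_dir.is_dir():
        raise NotADirectoryError(str(data_dir))
    removed = 0
    # Snapshot the listing first so deletion does not disturb iteration.
    for entry in list(data_dir.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed
```

After the directory is empty, restart L2-B and confirm that it comes up as passive and resyncs from the active server.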
Failover/Failback Test
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TFO1 | Failover/Failback | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application<br>2. kill -9 (or run stop-tc-server) on L2-A (Active)<br>3. After L2-B takes over as Active, run start-tc-server on L2-A (L2-A is now passive)<br>4. kill -9 (or run stop-tc-server) on L2-B (L2-A is now Active) | After the first failover (L2-A -> L2-B), txns should continue. L2-A should come up cleanly in passive mode when start-tc-server is run. When the second failover occurs (L2-B -> L2-A), L2-A should process txns. |
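The role changes expected in TFO1 can be traced with a toy state model. This is a minimal sketch for checking your reading of the test steps, not Terracotta's election logic; the class name `L2Pair` and the role strings are hypothetical.

```python
from typing import Dict, Optional


class L2Pair:
    """Toy model of an active/passive L2 pair (illustration only)."""

    def __init__(self, active: str, passive: str) -> None:
        self.state: Dict[str, str] = {active: "ACTIVE", passive: "PASSIVE"}

    def stop(self, name: str) -> None:
        """Model kill -9 or a clean stop of one server."""
        was_active = self.state[name] == "ACTIVE"
        self.state[name] = "DOWN"
        if was_active:
            # Promote the surviving passive server, if there is one.
            for other, role in self.state.items():
                if role == "PASSIVE":
                    self.state[other] = "ACTIVE"
                    break

    def start(self, name: str) -> None:
        """Model starting a stopped server; it joins as passive if an
        active server already exists."""
        if self.state[name] != "DOWN":
            return
        has_active = "ACTIVE" in self.state.values()
        self.state[name] = "PASSIVE" if has_active else "ACTIVE"

    def active(self) -> Optional[str]:
        return next((n for n, r in self.state.items() if r == "ACTIVE"), None)
```

Walking through the steps: after `stop("L2-A")` the active server is L2-B, after `start("L2-A")` L2-A is passive, and after `stop("L2-B")` the active server is L2-A again.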
Loss of Switch - This test can only be run on a redundant network
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TSL1 | Loss of 1 Switch | 2 switches in redundant configuration. L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application<br>2. Power down or pull the plug on one switch | All traffic transparently moves to the other switch with no interruptions |
Loss of Network Connectivity
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TNL1 | Loss of NIC wiring (Active) | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application<br>2. Remove network cable on L2-A | All traffic transparently moves to L2-B with no interruptions |
| TNL2 | Loss of NIC wiring (Passive) | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application<br>2. Remove network cable on L2-B | No user impact on cluster |
Terracotta Cluster Failure
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TF1 | Process Failure Recovery | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application<br>2. Bring down all L1s and L2s<br>3. Start L2s, then L1s | Cluster should come up and begin taking txns again |
| TF2 | Server Failure Recovery | L2-A is active, L2-B is passive. All systems are running and available to take traffic. | 1. Run application<br>2. Power down all machines<br>3. Start L2s, then L1s | Should be able to run application once all servers are up |
Client Failure Tests
| TestID | Test | Setup | Steps | Expected Result |
| --- | --- | --- | --- | --- |
| TCF1 | L1 Failure | L2-A is active, L2-B is passive. There are 2 L1s, L1-A and L1-B. All systems are running and available to take traffic. | 1. Run application<br>2. kill -9 L1-A | L1-B should take all incoming traffic. Some timeouts may occur due to txns in process when the L1 fails over. |