


Release: 2.6.2
Publish Date: July, 2008



Operator Guide

Introduction

Living successfully with a Terracotta deployment, as with any production deployment, requires developing and rigorously adhering to an operating plan. This plan should include a runbook that describes what steps to take for all routine operations on your cluster as well as how to monitor the health of your cluster and take action against unplanned events, such as hardware failures, load spikes, and the like.

Set Up Monitoring

Terracotta provides a wealth of information about the current state of your cluster. Before deployment, you should have a plan for what to monitor and how you are going to integrate it into your network operations center.

There are a few different ways to retrieve this information for your purposes.

The Terracotta Administration Console

The Terracotta Administration Console provides a graphical view of all of the runtime information available from Terracotta. The Terracotta Administration Console is especially useful during development and testing, but, because it does not offer alert functionality, it may not be suitable for your operations center.

The Terracotta JMX Interface

The Terracotta JMX Interface can be used to retrieve all of the data presented by the Terracotta Administration Console. Additionally, because it uses the standard JMX protocol, it can be integrated into your existing monitoring and management infrastructure. Likewise, it can be used to create alerts when values cross certain tolerance thresholds.

The Terracotta Cluster Statistics Recorder and Snapshot Visualization Tool

In addition to the runtime statistics offered via JMX and the administration console, Terracotta offers highly detailed statistics that may be recorded, stored into a database, and viewed using the Terracotta Snapshot Visualization tool. Because gathering these more detailed statistics can add significant load to the Terracotta server, they are not available as always-on runtime statistics. You should, therefore, be judicious in your gathering of these statistics.

For information on using the Terracotta Cluster Statistics Recorder and the Snapshot Visualization Tool, see the Cluster Statistics Recorder Guide.

Baselines And Tolerances: Understanding How Your Application Behaves

During your performance and destructive testing phase, you should gather baselines for the relevant statistics that you are going to monitor. You should also establish tolerance bands for them so that you know when the health of your cluster is degrading. Once in production, you should maintain a regular schedule of baselining your application, re-evaluating your tolerances, and adjusting your runbook accordingly. You must also do this for every major software release.

Baselining and tolerance setting is not something you can just do once and forget about. It is an ongoing process of adapting to changing workload characteristics and changing software. The better you understand how your application behaves under a wide range of conditions, the better equipped you will be to handle critical situations as they arise.
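One simple way to turn baseline samples into a tolerance band is sketched below. This is only an illustration: the guide does not prescribe a formula, so the choice of mean plus or minus three standard deviations, the function names, and the sample transaction rates are all assumptions.

```python
import statistics


def tolerance_band(samples, k=3.0):
    """Return (low, high) as the sample mean +/- k population standard deviations.

    The multiplier k is an assumed default, not a Terracotta recommendation.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    return mean - k * spread, mean + k * spread


def within_tolerance(value, band):
    """Return True if value lies inside the (low, high) band, inclusive."""
    low, high = band
    return low <= value <= high


# Hypothetical transactions-per-second readings gathered during a load test.
baseline_tps = [980, 1010, 1005, 995, 1020, 990]
band = tolerance_band(baseline_tps)
```

Re-running `tolerance_band` on fresh samples after each release or workload change keeps the band in step with how the application currently behaves.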

Create A Dashboard View Of Your Runtime Statistics

You should create a dashboard view of all of the most important statistics that indicate the health of your cluster. This will let you see at a glance when your application starts to operate outside its tolerances.

Monitoring for Issues

The following section describes various issues that may arise during the ongoing operation of your cluster, how to monitor and alert for them, and what corrective actions to take. You can use this section as the starting point for your cluster operations runbook.

Issue: Disk Full

Description

If a filesystem that the Terracotta server writes to is full, the Terracotta server will throw an exception. If the filesystem in question is only used for logging, logging will stop, but the cluster will continue to operate. If the filesystem in question contains the Terracotta server's data store (i.e., the Sleepycat database), the cluster will fail to operate.

Symptoms/Manifestation

The system logs (for example, /var/log/messages or /var/adm/messages) will contain a "disk full" message. If the filesystem containing the Terracotta server's data store is full, the Terracotta server log will contain an exception similar to the following:

TerracottaServerLog: [WorkerThread(commit_changes_stage,2)] ERROR com.tc.server.TCServerMain - Thread:Thread[WorkerThread(commit_changes_stage,2),5,TC Thread Group] got an uncaught exception. calling CallbackOnExitHandlers. Environment invalid because of previous exception: com.sleepycat.je.RunRecoveryException: (JE 3.2.76) . IOE during Write. Caused by: java.io.IOException: No space left on device

Monitoring

Monitor the disk usage using your standard system monitoring tools.

Suggested Tolerances

  • Green: less than 65% full.
  • Yellow: between 65% and 85% full.
  • Red: greater than 85% full.
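The tolerances above can be checked from a script. The following is a minimal sketch using Python's standard `shutil.disk_usage`; the function names are illustrative, and the percentage it computes may differ slightly from what tools such as `df` report because of filesystem reserved space.

```python
import shutil


def classify_disk(percent_full):
    """Map a percent-full value onto the suggested green/yellow/red bands."""
    if percent_full < 65:
        return "green"
    if percent_full <= 85:
        return "yellow"
    return "red"


def disk_percent_full(path):
    """Return how full the filesystem containing path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total
```

For example, `classify_disk(disk_percent_full(data_dir))` could be evaluated periodically, where `data_dir` is whichever directory your Terracotta server writes its data store or logs to.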

Actions

  • Make sure that there are no other processes filling the disk.
  • Make sure your log rotation facilities are working properly.
  • Increase available disk space to the filesystem.
  • If the Terracotta data store is growing without bounds, this may indicate either a memory leak in your application or that the Terracotta distributed garbage collector is not cleaning up garbage as fast as it is created. See the Tuning Guide for garbage creation considerations and information on tuning the distributed garbage collector.

Issue: Disk I/O Throughput Degradation

Symptoms/Manifestation

High disk access latency.

Monitoring

Monitor disk I/O throughput and latency using your standard system monitoring tools.

Issue: Terracotta Server Low On Memory

Description

For a number of reasons, the Terracotta server may run low on memory. See the Tuning Guide for details.

Symptoms/Manifestation

Low memory in the Terracotta server is indicated by:

  • Long, frequent garbage collection cycles in the Terracotta server JVM (the JVM garbage collector, not the Terracotta Distributed Garbage Collector).
  • Pauses during full JVM garbage collection which cause the server to stop responding to client requests.
  • OutOfMemoryErrors in the Terracotta server log.

Monitoring

There are a number of ways to monitor Terracotta server heap usage:

  • Scrape the Terracotta server log files for OutOfMemoryError.
  • Use the Terracotta Administration Console or the Terracotta JMX interface to monitor the heap statistics of the Terracotta server.
  • Use the Snapshot Visualization Tool (SVT) to sample the following statistics:
    • cache-objects-evict-request
    • cache-objects-evicted
    • vm-garbage-collector
    • memory
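Log scraping can be as simple as the sketch below, which scans log lines for `OutOfMemoryError` and, for convenience, also for the "No space left on device" message quoted under the Disk Full issue. The function and label names are illustrative, and the location of the Terracotta server log is left to the caller since it depends on your configuration.

```python
import re

# Patterns taken from the messages quoted in this guide.
PATTERNS = {
    "out_of_memory": re.compile(r"OutOfMemoryError"),
    "disk_full": re.compile(r"No space left on device"),
}


def scan_log_lines(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
    return hits


def scan_log_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return scan_log_lines(handle)
```

A monitoring job could call `scan_log_file` on a schedule and raise an alert whenever the returned list is not empty.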

Actions

You may only need to increase the heap size of the Terracotta server JVM. Additionally, you may need to tune the Terracotta virtual memory manager as described in the Tuning Guide.

Issue: Terracotta Server CPU Utilization Too High

Symptoms/Manifestation

If the Terracotta server's CPU utilization is high, you may see a dip in the transaction rate reported by the server; you may also see a dip in the transaction rate of the cluster as a whole.

Monitoring

Monitor Terracotta server CPU utilization using your standard system monitoring tools or through the Terracotta JMX interface. Make sure you monitor the utilization of all processors/cores.

Suggested Tolerances

  • Green: less than 65% utilization
  • Yellow: between 65% and 80% utilization
  • Red: greater than 80% utilization

Actions

If not all processors/cores are being utilized equally, you may have a hardware or operating system issue, in which case you should consult your hardware or software manufacturer. If all processors/cores are being utilized equally, you may need to increase the CPU capacity of the Terracotta server machine.
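The decision described above can be sketched as a small check over per-core readings. The color bands come from the suggested tolerances; the 30-point `imbalance_threshold` is purely an assumption for illustration, since this guide does not define what counts as unequal utilization, and collecting the per-core percentages is left to your monitoring tools.

```python
SEVERITY = ["green", "yellow", "red"]


def classify_cpu(percent):
    """Map a utilization percentage onto the suggested green/yellow/red bands."""
    if percent < 65:
        return "green"
    if percent <= 80:
        return "yellow"
    return "red"


def assess_cores(per_core_percent, imbalance_threshold=30.0):
    """Classify each core and report whether load is spread roughly evenly.

    imbalance_threshold is an assumed value: the largest allowed gap, in
    percentage points, between the busiest and the idlest core.
    """
    if not per_core_percent:
        raise ValueError("need at least one core reading")
    statuses = [classify_cpu(p) for p in per_core_percent]
    spread = max(per_core_percent) - min(per_core_percent)
    return {
        "statuses": statuses,
        "worst": max(statuses, key=SEVERITY.index),
        "balanced": spread <= imbalance_threshold,
    }
```

A result that is red and balanced points toward adding CPU capacity, while a red and unbalanced result points toward the hardware or operating system investigation described above.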

Issue: Terracotta Server CPU Failure

Symptoms/Manifestation

The kernel should panic and the machine will probably go into a reboot sequence. You may also see messages in the operating system logs (e.g., /var/log or /var/adm).

Monitoring

Monitor the health of the Terracotta server machine's CPU using your standard system monitoring tools.

Actions

In the event of a kernel panic, the Terracotta server will fail over to the passive standby Terracotta server. If configured properly, the cluster will automatically reconnect to the new active Terracotta server and resume normal operations. You must now return your cluster to a highly available configuration by deploying a new passive Terracotta server.

Issue: Terracotta Client CPU Failure

Symptoms/Manifestation

As with a failure of a Terracotta server CPU, the kernel should panic and the machine will probably go into a reboot sequence. When the client is disconnected from the Terracotta server, it will be automatically removed from the cluster and any cluster resources it held will be reclaimed.

Monitoring

Monitor the health of a Terracotta client machine's CPU using standard system monitoring tools. If you are using a load balancer, it should also be configured to detect a cluster node failure.

Actions

If you are using a load balancer, the load balancer should automatically detect the failure and rebalance load across the other cluster nodes. To regain capacity, you should deploy a replacement machine.

Issue: Terracotta Server Distributed Garbage Collection Performance Degradation

Description

Under certain conditions, the Terracotta Distributed Garbage Collector's performance may degrade. For more information on this issue, see the Tuning Guide.

Symptoms/Manifestation

If the Terracotta Distributed Garbage Collector (DGC) falls behind, you will see the following:

  • Gradual reduction in cluster throughput
  • Increased managed object count
  • Increased DGC cycle times
  • Increase in disk usage by the Terracotta server's data store.

Monitoring

Use the Terracotta Administration Console or the Terracotta JMX interface to monitor the following:

  • managed object count
  • DGC total time
  • DGC pause time

Suggested Tolerances

The DGC tolerances will vary depending on your application, but as a rule of thumb, DGC cycle times over 30 seconds and pause times over 10 seconds are a cause for concern.
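As a sketch of how this rule of thumb might be encoded, the function below flags cycle and pause times over the stated limits. It also flags a managed object count that rises on every sample, which is an assumed heuristic for "the DGC is falling behind" rather than anything Terracotta defines. How the values are read (Administration Console or JMX) is not shown.

```python
DGC_CYCLE_LIMIT_SECONDS = 30.0  # rule of thumb from this guide
DGC_PAUSE_LIMIT_SECONDS = 10.0  # rule of thumb from this guide


def dgc_concerns(cycle_seconds, pause_seconds, object_counts=()):
    """Return a list of human-readable concerns; an empty list means none."""
    concerns = []
    if cycle_seconds > DGC_CYCLE_LIMIT_SECONDS:
        concerns.append("DGC cycle time over 30 seconds")
    if pause_seconds > DGC_PAUSE_LIMIT_SECONDS:
        concerns.append("DGC pause time over 10 seconds")
    # Assumed heuristic: at least three samples, each larger than the last.
    counts = list(object_counts)
    if len(counts) >= 3 and all(b > a for a, b in zip(counts, counts[1:])):
        concerns.append("managed object count rose on every sample")
    return concerns
```

Because the tolerances vary by application, the two limits are kept as named constants so they can be replaced with values from your own baselines.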

Actions

See the Tuning Guide for information on tuning the distributed garbage collector.

Issue: Terracotta Server Object Cache Hit Rate Degradation

Description

The Terracotta server keeps object data in a memory cache for fast access. If the active set of objects doesn't fit in the server's memory cache, you may experience performance degradation.

Symptoms/Manifestation

Poor cache hit rate in the Terracotta server is indicated by:

  • reduction in cluster throughput
  • poor read/write performance given constant disk lookup

Monitoring

  • Use the Terracotta Administration Console or the Terracotta JMX interface to monitor the Terracotta server cache miss rate
  • Use the SVT to sample l2-faults-from-disk

Issue: Application Deadlock

Description

An application concurrency bug may lead to a cluster-wide deadlock. Sometimes, however, what appears to be a deadlock may actually be a very slow operation.

Monitoring

You should have a health check specific to your application that alerts your operations center when there is a degradation in throughput outside certain tolerances. In addition, you may also use the transaction rate from the Terracotta JMX interface or the Terracotta Administration Console as a proxy for cluster throughput.

Actions

  • Using the Terracotta Administration Console or the Terracotta JMX interface, take a series of thread dumps 5 seconds apart for 1-2 minutes. If the relevant threads' stack traces appear stationary, there is a high likelihood that you have a deadlock.
  • TO RECOVER: Kill one client JVM at a time. In the best case, the first client JVM restart relieves the cluster of the deadlock; in the worst case, the last client JVM restart does.
  • TO ANALYZE: Based on the thread dumps and the lock profiler in the Terracotta Administration Console, determine where in the application code, if anywhere, the deadlock might be occurring. Review the code to ensure that locks are always obtained in the same sequence, which eliminates the possibility of an application deadlock. Also consider using tools that help with deadlock detection.
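Comparing a series of thread dumps by eye is tedious, so the comparison step can be sketched in code. The example below assumes each dump has already been parsed into a dictionary mapping thread name to a tuple of stack frames; that representation, the function name, and the sample frames are hypothetical, and the parsing itself is not shown.

```python
def stationary_threads(dumps):
    """Return sorted names of threads whose stack is identical in every dump.

    dumps is a list of dicts: thread name -> tuple of stack frame strings.
    """
    if len(dumps) < 2:
        raise ValueError("need at least two thread dumps to compare")
    first, rest = dumps[0], dumps[1:]
    stuck = [
        name
        for name, stack in first.items()
        if all(other.get(name) == stack for other in rest)
    ]
    return sorted(stuck)


# Hypothetical parsed dumps taken a few seconds apart.
dumps = [
    {"worker-1": ("Cart.lock", "Cart.add"), "worker-2": ("Order.save",)},
    {"worker-1": ("Cart.lock", "Cart.add"), "worker-2": ("Order.load",)},
    {"worker-1": ("Cart.lock", "Cart.add"), "worker-2": ("Order.save",)},
]
candidates = stationary_threads(dumps)
```

The result is only a list of candidates: idle threads that are legitimately waiting for work also look stationary, so the stacks still need to be read to confirm that the threads are blocked on locks.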

Cluster Events

Description

The Terracotta JMX interface includes event notifications for cluster events. A node coming online or going offline will fire an event that you can use to monitor the disposition of the cluster. See the JMX Guide for details.

Upgrading To A New Version Of Terracotta

Upgrading your cluster to a new version of Terracotta should be done in phases: one part of the cluster is taken out of service, upgraded, and returned to service, and then the remaining part is taken out of service, upgraded, and returned to service. While a part is out of service, its load should fail over according to your third-party load balancer or workload router rules.

The steps are as follows:

  1. Take half of the Terracotta client servers (application servers) out of service. Application load carried by those machines should fail over to the remainder of the active cluster.
  2. Upgrade Terracotta on the out-of-service machines.
  3. Take the active Terracotta server out of service. This will cause the passive Terracotta server to be promoted to the active server and cause the active Terracotta clients to fail over to the new active server.
  4. Upgrade Terracotta on the out-of-service Terracotta server.
  5. Restart the out-of-service Terracotta clients.
  6. Restart the out-of-service Terracotta server such that it is the passive server.
  7. When the upgraded passive server is up-to-date, take all of the currently active cluster out of service, including the active Terracotta server.
  8. Simultaneously, return the upgraded Terracotta clients back into service.
  9. Upgrade the remainder of the out-of-service cluster.
  10. Restart the out-of-service Terracotta clients.
  11. Restart the out-of-service Terracotta server such that it is the passive server.

NOTE: THIS ASSUMES THAT THE TERRACOTTA DATA FORMAT STORED IN THE SLEEPYCAT DATABASE FILES IS COMPATIBLE BETWEEN THE TWO RELEASES. In the unlikely case that it is not, you must either remove the data files before restarting the upgraded Terracotta server or point the upgraded server to an empty directory location in tc-config.xml.

ALWAYS MAKE SURE THAT YOU TEST THIS PROCEDURE in a staging environment before performing it in production.

Deploying A New Version of Your Application Software

Upgrading your cluster to a new version of your application software should be done in a phased approach where parts of the cluster are taken out of service for upgrade, then returned to service while the other parts are taken out of service, upgraded, then returned to service.

The steps are as follows:

  1. Take half of the Terracotta client servers (application servers) out of service. Application load carried by those machines should fail over to the remainder of the active cluster according to your third-party load balancer or workload router rules.
  2. Upgrade your application on the out-of-service machines.
  3. Take the remainder of the active Terracotta clients out of service.
  4. Simultaneously, return the upgraded Terracotta clients back to service.
  5. Upgrade your application on the out-of-service machines.
  6. Return the upgraded Terracotta clients back to service.

Class Schema Change Considerations

Terracotta does not use Java serialization to share object data, so upgrading your application code is not subject to the same class-versioning limitations that serialization imposes. However, there are some considerations that you must make when upgrading your Terracotta-enabled code.

Change Type | Current Support | Future Support | Notes
Add, delete, or modify methods | Yes | Yes | Only the object data is stored and manipulated by Terracotta, so changes to methods will have no effect on existing object data. However, you must make sure that method changes are reflected in the Terracotta configuration as necessary.
Add a field | Yes | Yes | If you add a field to a class, its value will be the default value for primitives and null for references. You must take care to initialize new fields using Terracotta's "onLoad" feature. See the Configuration Guide and Reference for details.
Delete a field | Yes | Yes | The field data is preserved for objects created using the old class schema, but will not be visible to code using the new class schema. Objects created using the new class schema will not create data for that field.
Modify a field name | No | Yes | Modifying a field name is currently the same as deleting the old field and creating a new field. There is not currently direct support for migrating the field data from the old field name to the new field name, although that can be done with a custom object data translator (see below).
Modify a field type | Partial | Yes | This is supported if the old field type can be coerced into the new field type (e.g., 'int' to 'long').
Modify an instance field to a static field | No | Yes |
Modify a static field to an instance field | Yes | Yes | This is equivalent to adding a new field.
Change the name of a class | No | Yes |
Change the type of a root | No | Yes |

Custom Object Data Translation

While there is no direct support for modifying the class schema in the Terracotta object database, it is possible to create a custom object data translator that will convert an object graph from one format to another. One possible approach is to write code that walks an existing object graph, copying values from that existing object graph to a new object graph with the new class format. Once the translation has occurred, code using the new class format would then use the new object graph. The old object graph will fall out of scope and be garbage collected.
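The graph-walking approach can be sketched as follows. The example is written in Python purely to show the shape of the technique; a real translator for a Terracotta application would be written in Java against your own classes. The `OldCustomer` and `NewCustomer` classes and the renamed field are hypothetical, and nothing here is a Terracotta API.

```python
class OldCustomer:
    """Old class format: the field is called 'name'."""

    def __init__(self, name, referrer=None):
        self.name = name
        self.referrer = referrer  # another OldCustomer, or None


class NewCustomer:
    """New class format: the field has been renamed to 'full_name'."""

    def __init__(self, full_name, referrer=None):
        self.full_name = full_name
        self.referrer = referrer  # another NewCustomer, or None


def translate(old, seen=None):
    """Copy an OldCustomer graph into NewCustomer objects.

    The 'seen' map guarantees each old object is translated exactly once,
    so shared references stay shared and cycles terminate.
    """
    if old is None:
        return None
    if seen is None:
        seen = {}
    if id(old) in seen:
        return seen[id(old)]
    new = NewCustomer(old.name)
    seen[id(old)] = new  # register before recursing so cycles resolve
    new.referrer = translate(old.referrer, seen)
    return new
```

This recursive form is fine for small graphs; a very deep graph would call for an explicit work queue instead of recursion.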

ALWAYS MAKE SURE THAT YOU TEST ALL UPGRADE PROCEDURES in a staging environment before performing them in production.

Backup The Cluster Database

The Terracotta cluster database is a Sleepycat database that contains all the object data for the Terracotta virtual heap as well as some metadata about the state of the cluster. To back up this database, we provide a simple utility that creates a snapshot of the database at the moment it attaches to the Sleepycat Environment and copies that snapshot to a backup location. This backup may be performed while the Terracotta server is running. The backup tool is designed to take a full backup, and the snapshot it produces has a consistent view of the data in the database.

To retrieve the backup tool, download the attachment to this page called BerkelyDbBackup.zip. It contains code, a backup script, and a README file containing instructions for configuring and running the backup script.




Copyright Information

Copyright © 2005-2007
Terracotta, Inc.
All Rights Reserved

This publication (the "Documentation") and the Terracotta software which it describes (the "Software") are protected to the maximum extent permitted under applicable law, including but not limited to, the regulations set forth in Title 17 of the United States Code, and California law. This Documentation, or any parts thereof, may not be reproduced in any form, by any method, for any purpose, without the express written consent of Terracotta. Terracotta makes no warranty, either express or implied, including but not limited to any implied warranties of merchantability or fitness for a particular purpose, with respect to the Software discussed in this Documentation, and the Documentation itself (collectively, "the Materials"). The Materials are made available solely on an "as-is" basis. In no event shall Terracotta be liable to anyone for special, collateral, incidental, indirect, punitive, exemplary, or consequential damages in connection with, or arising from the purchase or use of, the Materials. Under no circumstances and regardless of the cause of action alleged, shall Terracotta's liability exceed the purchase price of the Software described herein. Terracotta reserves the right to revise and improve its Software and Documentation as it deems fit. The Documentation describes the state of the Software at the time of publication.

Trademarks
"Terracotta," the stylized "T" logo, and "Open Terracotta" are trademarks of Terracotta. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective holders.

Government Use
Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in FAR 12.212 (Commercial Computer Software-Restricted Rights) and DFAR 267.7202 (Rights in Technical Data and Computer Software), as applicable.
