The Truth About Synchronous Replication

Now more than ever, data is the enterprise. Only 6% of companies that suffer a total data loss survive. Until recently, backup has been firmly in the domain of the tape array, and it has become increasingly difficult to perform as data volumes outgrow the available backup window and storage capacity. Yet backup is not the main problem, as any CIO will tell you. The old joke goes something like this: "Our backup systems work fine, however our restore systems are another matter ... Ha Ha Ha." For the CIO with everything (including a large budget), replication has been seen as a way out of this cycle. Replicated data is moved from one storage system to another, where it lives in its original form on highly available disk arrays. No transfer from disk to tape is required, and the data can quickly be checked to make sure it is good. Replication comes in two forms: synchronous and asynchronous.
In a synchronous replication scheme, all data is committed to both sets of storage before the client that wrote the data is informed that the write has completed. This method brings a number of technical difficulties. Writing data to the remote storage can take a relatively long time (because of the bandwidth and latency of the link between locations), and the client must wait for that write to finish, which leads to very slow storage access speeds. Another problem arises when one of the sets of storage becomes disconnected. Some systems have elegant recovery mechanisms for these events, but others can take an extended amount of time before the data is safe again.
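The latency cost described above can be illustrated with some simple arithmetic. The following is a minimal sketch, not a model of any particular product: the function names and the figures (0.5 ms per array commit, a 20 ms round trip between sites) are assumptions chosen only for illustration, and it assumes the local and remote commits run in parallel.

```python
def sync_write_latency_ms(local_ms: float, remote_ms: float, rtt_ms: float) -> float:
    """Client-visible latency when a write must commit on both arrays.

    Assumes the local and remote commits proceed in parallel, so the
    client waits for whichever path finishes last.
    """
    return max(local_ms, rtt_ms + remote_ms)


def max_serial_writes_per_second(latency_ms: float) -> float:
    """Upper bound on writes per second for one client issuing writes one at a time."""
    return 1000.0 / latency_ms


# Hypothetical figures: 0.5 ms to commit on either array, 20 ms round trip between sites.
local_only_ms = 0.5
sync_ms = sync_write_latency_ms(local_ms=0.5, remote_ms=0.5, rtt_ms=20.0)

print(sync_ms)                                           # 20.5
print(max_serial_writes_per_second(local_only_ms))       # 2000.0
print(round(max_serial_writes_per_second(sync_ms), 1))   # 48.8
```

Under these assumed numbers, a client that waits for each write to be acknowledged drops from about 2000 writes per second to under 50 once every write has to cross the link.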
In the asynchronous scheme, data is written to one storage array but may not be sent to the other until a certain amount of time has elapsed. This means that the client can be informed that the data has been written before the data has been passed to potentially slow off-site storage. The downside is the possibility of data loss if a fault occurs before the data has been put onto both sets of storage.
Without much thought, it can be assumed that synchronous replication is better than asynchronous replication, because the data being copied is always on the two separate arrays. However, as is the case most of the time, that which appears obvious without thought turns out to be more complex once thought is applied. Replication is no exception. The non-obvious issue with synchronous replication becomes apparent when you expand your view to include the applications which actually generate and control the data, from small applications (such as Microsoft Word) all the way to large applications (such as CRM suites or OLTP database systems).
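The difference in acknowledgment behavior between the two schemes can be sketched with a toy model. The `Replicator` class and its method names are hypothetical and exist only for this illustration; real arrays involve networking, ordering, and failure handling that are not modeled here.

```python
class Replicator:
    """Toy model of a primary array with one replica (illustrative only)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []    # records committed on the primary array
        self.secondary = []  # records committed on the replica
        self.pending = []    # acknowledged to the client, not yet on the replica

    def write(self, record: str) -> str:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the replica commits before the client is told.
            self.secondary.append(record)
        else:
            # Asynchronous: acknowledge now, ship the record later.
            self.pending.append(record)
        return "ack"

    def flush(self) -> None:
        """Ship all pending records to the replica (the periodic async transfer)."""
        self.secondary.extend(self.pending)
        self.pending.clear()

    def lost_if_primary_fails(self) -> list:
        """Acknowledged records that exist only on the primary.

        The replica always holds a prefix of the primary's records in this model.
        """
        return self.primary[len(self.secondary):]


async_rep = Replicator(synchronous=False)
async_rep.write("a")
async_rep.write("b")
async_rep.flush()
async_rep.write("c")  # acknowledged, but the fault happens before the next flush

sync_rep = Replicator(synchronous=True)
for record in ("a", "b", "c"):
    sync_rep.write(record)

print(async_rep.lost_if_primary_fails())  # ['c']
print(sync_rep.lost_if_primary_fails())   # []
```

This captures only the storage-level view, in which the synchronous scheme looks strictly safer; it says nothing yet about data that still lives in the application.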
As can be seen from the diagram, the data comprising the whole data set for an application resides in two key locations. First, there is application-resident data, which resides on the application server (this could be your workstation running Microsoft Word in a simple case, or a database application server storing a company's entire financial records in a more complex case). Second, there is storage-resident data, which resides on the storage array. Both sets of data together make up the application data set. If either piece is missing, the data as a whole is incomplete. Replicating only the storage-resident data therefore leaves you with data loss in the best case and unrecoverable data in the worst case.
Where data integrity is paramount, logs of changes to the data are kept. These logs are typically kept on both the application server and the storage system so that recovery can be assured. Log recovery always works best when working from a "known good" point. With asynchronous replication, the replica is often taken from a known good point, whereas with synchronous replication this is impossible to ensure. Take a replicated snapshot, for example. The application is usually placed into a state where the application-resident data is forced to the storage array. Then a snapshot of the data is taken. The application can then be returned to normal operation while the snapshot is replicated. This snapshot will always lead to a good recovery. Any data created since the last snapshot was taken can be rebuilt from the log files as required.
In application areas where the data rate is so high that replication is the method for backup, the additional issue of accidental data removal at the source can only be avoided with a proper asynchronous replication scheme. With synchronous replication, the removal of source data is compounded by the near-instant removal of the same data at the target array.
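The replicated-snapshot procedure and the accidental-removal problem described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Application` class, its attribute names, and the dictionary-based "arrays" are all hypothetical, and the quiesce step is reduced to flushing an in-memory buffer.

```python
import copy


class Application:
    """Toy application with a synchronous mirror and a snapshot replica."""

    def __init__(self):
        self.buffer = {}            # application-resident data, not yet on storage
        self.storage = {}           # storage-resident data on the primary array
        self.mirror = {}            # synchronous replica: follows every storage change
        self.snapshot_replica = {}  # asynchronous replica: last replicated snapshot

    def update(self, key, value):
        """Change data in the application; nothing reaches storage yet."""
        self.buffer[key] = value

    def flush(self):
        """Force application-resident data to storage (mirrored synchronously)."""
        for key, value in self.buffer.items():
            self.storage[key] = value
            self.mirror[key] = value
        self.buffer.clear()

    def delete(self, key):
        """Remove data at the source; the synchronous mirror follows immediately."""
        self.buffer.pop(key, None)
        self.storage.pop(key, None)
        self.mirror.pop(key, None)

    def replicate_snapshot(self):
        """Quiesce, snapshot, then replicate the snapshot as a known good point."""
        self.flush()                             # step 1: force data to the array
        snapshot = copy.deepcopy(self.storage)   # step 2: take the snapshot
        self.snapshot_replica = snapshot         # step 3: replicate it


app = Application()
app.update("ledger", "v1")
app.replicate_snapshot()

app.update("ledger", "v2")   # lives only on the application server for now
print(app.mirror)            # {'ledger': 'v1'}  (mirror cannot see buffered data)

app.flush()
app.delete("ledger")         # accidental removal at the source
print(app.mirror)            # {}
print(app.snapshot_replica)  # {'ledger': 'v1'}
```

In this sketch the synchronous mirror never reflects data still held in the application, and it loses the record the moment the source does, while the snapshot replica keeps the last known good copy.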
A properly implemented asynchronous replication strategy will often lead to better data recoverability than synchronous replication, and will provide 100% certainty in that recovery. Sometimes the obvious isn't.
COMMENTARY FROM THE DESK OF GEOFF BARRALL, CTO, BlueArc, www.bluearc.com