Protect Microsoft Exchange in virtual and physical environments

Jerry Melnick

Network management - For many companies, email has become a more important communication tool than the phone. Communication among internal employees, communication with vendors and partners, email integration with enterprise applications, collaboration through shared documents and schedules, and the ability to capture and archive business interactions all contribute to a growing reliance on email.

Businesses of all sizes, from multinational companies to small and medium businesses, now use Microsoft Exchange's messaging and collaboration features to run their operations. Any problem, even a short interruption, can put the business in a difficult position. Clearly, Exchange has become a critical application for many businesses. When businesses look at the available solutions for protecting key enterprise applications, Exchange is often the first application they target.

Improving the availability of Exchange involves reducing or eliminating the many potential causes of downtime. Planned downtime has less impact because it can be scheduled for nights or weekends, when user activity is greatly reduced. Unplanned downtime has worse consequences and can visibly affect the company. It can be caused by hardware errors, software errors, operational errors, data loss or site-wide problems. To protect Exchange successfully, you need to make sure that no single point of failure can make Exchange servers, storage or networks unavailable. This article shows how to identify these points of failure and describes some best practices for minimizing or eliminating them, depending on how much availability your company needs from Exchange and on your resources and budget.

Options for Exchange availability

Most Exchange availability products fall into three categories: traditional failover clusters, virtualization clusters and data replication. Some solutions combine elements of both clustering and data replication; however, no single solution addresses every cause of downtime. Failover clusters and virtualization clusters rely on shared storage and the ability to run applications on an alternate server if the primary server is down or needs maintenance. Data replication software maintains a copy of application data, at a local or remote location, and supports manual or automatic failover to handle planned or unplanned server outages.

All of these products rely on standby servers to provide availability. Applications can be moved to an alternate server if the primary server is down or needs maintenance. Redundant components can also be added within a server to reduce the likelihood of server failure.

Eliminating automatic failover eliminates downtime

Most availability products are based on a recovery process called 'failover' that starts after an error occurs. Automatic failover moves the application's processing to a standby server after an unexpected error, or on command for planned maintenance. This method brings applications back online quickly, but it still causes application downtime, loses in-flight transactions along with the application logic and data held in memory, and exposes the system to potential data corruption. Even a routine failover can cause an interruption of minutes or even tens of minutes, including the time needed to restart the application and, after an unplanned error, to recover data. In the worst case, software errors in scripts or operational procedures cause the failover itself to fail, extending downtime to hours or even days. Reducing the number of failovers, shortening failover time and making the failover process reliable all help reduce Exchange downtime.
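The three levers named above (fewer failovers, shorter failovers, reliable failovers) can be compared with simple arithmetic. The following is a minimal sketch, not a measurement: the function names and every figure in the scenarios are hypothetical and chosen only to illustrate the calculation.

```python
def annual_downtime_minutes(failovers_per_year, minutes_per_failover):
    """Total downtime per year attributable to failovers."""
    return failovers_per_year * minutes_per_failover


def availability_percent(downtime_minutes, minutes_per_year=365 * 24 * 60):
    """Availability as the percentage of the year the service was up."""
    return 100.0 * (1 - downtime_minutes / minutes_per_year)


# Hypothetical scenarios, for illustration only.
scenarios = {
    "6 failovers x 10 min": (6, 10),
    "2 failovers x 10 min": (2, 10),
    "6 failovers x 2 min": (6, 2),
    "1 failed failover x 8 h": (1, 8 * 60),
}

for label, (count, minutes) in scenarios.items():
    downtime = annual_downtime_minutes(count, minutes)
    print(f"{label}: {downtime} min/year, "
          f"{availability_percent(downtime):.4f}% available")
```

Under these assumed figures, a single failover that fails and takes eight hours to resolve costs more downtime than dozens of routine failovers, which is why failover reliability matters as much as failover speed.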

Local server redundancy and basic automatic failover address the most frequently occurring errors that cause unplanned Exchange downtime. Data loss, data corruption and site-wide disasters are rarer but can cause far more serious problems, and they require additional solution components aimed specifically at them.

Evaluate the causes of unplanned downtime

Unexpected downtime may be caused by a number of different events:

Catastrophic server failures caused by memory, processor or motherboard faults.
Server component failures such as power supply, fan, internal drive, disk controller, host bus adapter and network adapter failures.
Software errors in the operating system, firmware and applications.
Site-wide problems such as power failures, network outages, fires, floods or natural disasters.

Each of these causes of unplanned downtime is described in more detail in the sections below.

How to avoid server hardware errors

The core components of a server include power supplies, fans, memory, CPUs and logic boards. Purchasing robust servers, performing recommended routine maintenance and checking server error logs for signs of future problems all reduce the risk of failover caused by server failure.

Failovers caused by server component failures can be significantly reduced by adding component-level redundancy. Robust servers offer redundant cooling and power supplies. ECC memory, which can correct single-bit memory errors, has been a standard feature of most servers for years. Newer memory technologies include advanced ECC, online spare memory and mirrored memory, which provide additional protection but are available only on high-end servers. Online spare and mirrored memory can significantly increase memory costs and may not be cost-effective for many Exchange environments.

Internal disks, disk controllers, host bus adapters and network adapters can all be duplicated. However, adding component redundancy to every server can be costly and complex.
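The benefit of component-level redundancy can be estimated with the standard parallel and series availability formulas. This is a rough sketch that assumes component failures are independent, which real hardware only approximates; the function names and the 99% figure are hypothetical.

```python
def redundant_availability(component_availability, copies):
    """Availability of `copies` identical components in parallel, where the
    group works as long as at least one copy works.
    Assumes failures are independent."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(availabilities):
    """Availability of a chain in which every part must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figure: a component that is up 99% of the time.
single = redundant_availability(0.99, 1)
dual = redundant_availability(0.99, 2)
print(f"single component: {single:.4f}")
print(f"dual components:  {dual:.4f}")

# A server that needs power, cooling and a disk path all working at once.
print(f"server, no redundancy:   {series_availability([single] * 3):.4f}")
print(f"server, dual components: {series_availability([dual] * 3):.4f}")
```

The calculation also shows the trade-off the text describes: each duplicated component raises availability, but the gain has to be weighed against the cost and complexity of duplicating it on every server.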

Reduce storage errors with hardware

Storage protection relies on device redundancy combined with RAID to protect data access and data integrity from hardware failures. The considerations differ for local storage and shared network storage.

Key steps for protecting internal storage

In cluster solutions, internal storage is used only for static and temporary system data. Data replication solutions maintain a copy of all internal data on a secondary server. Either way, an unprotected internal storage failure causes an unplanned server failure, with the downtime and risks that come with a failover to another server. Internal storage is fairly easy to protect by adding extra disks configured as RAID 1. For full redundancy, a second disk controller is also required, with each disk in the RAID 1 pair connected to a separate controller.

Shared storage protection

Shared storage depends on the redundancy within the storage system itself. Fortunately, storage systems from many vendors offer full redundancy, including disks, storage controllers, caches, network controllers, power and cooling. Redundant, synchronized write caches are available in many storage systems and allow high-performance write caching without the risk of data corruption associated with a single write cache. It is important, however, to use only fully redundant storage systems; low-cost, non-redundant storage systems should be avoided.

Shared storage systems are accessed over a Fibre Channel or Ethernet storage network. To ensure uninterrupted access to shared storage, these networks must be designed to avoid every single point of failure. This requires redundant network paths, redundant network switches and redundant connections to each storage system. Multiple host bus adapters (HBAs) in a server protect it from HBA and path failures. Multi-path IO software, which is required to support redundant HBAs, is available in many standard operating systems (including MPIO for Windows) and is also provided by many storage system vendors; examples include EMC PowerPath, HP Secure Path and Hitachi Dynamic Link Manager. However, these competing solutions are not universally supported by all storage network and storage system manufacturers, so choosing the right multi-path software for a specific environment is often very difficult. The problem gets worse if your storage environment includes a variety of network components and storage systems. Multi-path IO software may be difficult to configure and may not be compatible with all storage network or storage system components.
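One way to check a storage network design for single points of failure is to model it as a graph and test whether the server can still reach the storage array after each component is removed in turn. The sketch below is illustrative only: the component names and both topologies are hypothetical, and a real design review would also cover power, cabling and software paths.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search: can `start` reach `goal` once `removed` is gone?"""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, server, storage):
    """Return components whose loss alone cuts the server off from storage."""
    components = set(links) - {server, storage}
    return sorted(c for c in components
                  if not reachable(links, server, storage, removed=c))


def undirected(pairs):
    """Build an adjacency map from a list of (a, b) connections."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


# Hypothetical topology: two HBAs, but both cabled to the same switch.
shared_switch = undirected([
    ("server", "hba1"), ("server", "hba2"),
    ("hba1", "switch1"), ("hba2", "switch1"),
    ("switch1", "ctrlA"), ("switch1", "ctrlB"),
    ("ctrlA", "array"), ("ctrlB", "array"),
])
print(single_points_of_failure(shared_switch, "server", "array"))

# Same hardware plus a second switch, with each HBA on its own switch.
dual_fabric = undirected([
    ("server", "hba1"), ("server", "hba2"),
    ("hba1", "switch1"), ("hba2", "switch2"),
    ("switch1", "ctrlA"), ("switch2", "ctrlB"),
    ("ctrlA", "array"), ("ctrlB", "array"),
])
print(single_points_of_failure(dual_fabric, "server", "array"))
```

The first topology reports `['switch1']`: redundant HBAs do not help when both paths pass through one switch. The second reports an empty list because every component has an alternative path around it.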

Avoid network connection failures

The network infrastructure itself must be fault-tolerant, with redundant network paths, switches, routers and other network components. Server connections can also be duplicated to avoid failovers caused by the failure of a single server component. Make sure that the redundant physical network hardware does not share common components. For example, a dual-port network card shares common hardware logic, so a single card failure can disable both ports. Full redundancy requires either two separate adapters or a built-in network port combined with a separate network adapter.

Software that controls failover and load sharing across multiple adapters is known as NIC teaming or bonding and offers several options: fault tolerance (active/passive operation with failover), load balancing (transmitting on multiple adapters but receiving on only one) and link aggregation (simultaneous transmitting and receiving on multiple adapters). Load balancing and link aggregation also include failover.

Choosing among these configuration options can be difficult and must take into account the capabilities and design goals of the whole network. For example, link aggregation requires support in the network switches and includes several protocol options such as Gigabit EtherChannel and IEEE 802.3ad. It also requires that all connections go to the same switch.

Minimize software errors

Software errors may appear at the operating system level or the Exchange application level. In virtualized environments, the hypervisor itself or the virtual machines may also experience problems. In addition to hard failures, performance or functional problems can seriously affect Exchange users even while all software components continue to run. Besides installing and configuring software properly and applying hotfixes promptly, the best way to improve software reliability is to use effective testing tools. Fortunately, there is a wide choice of Exchange availability and management tools from Microsoft as well as third parties.

Reduce operating errors

Operational errors are a major cause of downtime. Proven, well-documented procedures and properly trained, qualified IT staff reduce the risk of operational errors. However, some availability solutions can increase that risk by requiring special skills and training, by requiring the development and maintenance of complex failover scripts, or by requiring precisely coordinated configuration changes across servers.

Protecting against site failures

Site failures can be as simple as an air conditioning failure or a roof leak affecting a single building; a power failure can affect a local area, and a major hurricane can affect a large geographic region. Such outages can last anywhere from a few hours to a few days or even weeks. While these failures are less likely than hardware or software errors, their consequences can be far more severe.

Disaster recovery solutions based on data replication are the most common way to protect Exchange from site failures while minimizing the downtime involved in recovery. A data replication solution that transfers data changes in real time and optimizes wide area network bandwidth mitigates the risk of data loss in a site failure. Virtualization-based solutions can reduce hardware requirements at the backup site and simplify configuration management and testing while the system is in operation.

Where sites are close enough to each other to support a high-speed, low-latency network connection, solutions that provide better availability without data loss are also an option.

Reliability of automatic failover

The investment in redundant hardware and availability software is wasted if the failover process is not reliable. It is important to choose a robust availability solution that delivers reliable failover, and to make sure that your IT staff are fully skilled and trained. The solution must be installed, configured, maintained and tested properly.
Several features of a solution contribute to failover reliability:

Simple installation, configuration and maintenance, which reduces the time and expertise demanded of IT staff and lowers the risk of errors.
Avoiding failover scripts or policies, which reduces failover errors.
Detecting actual hardware and software errors rather than inferring errors from timeouts.
Reserving guaranteed resources for failover rather than relying on algorithms that can be risky.

Protection against data loss and errors

Many data loss and corruption problems require solutions that go well beyond hardware redundancy and failover. Errors in application logic, or mistakes by users or IT staff, can lead to accidentally deleted files or logs, incorrect data and other data loss and integrity problems. Certain types of hardware or software errors can corrupt data. Site-wide problems or natural disasters may also cause loss of data access or loss of data. Beyond the need to protect current data, business requirements commonly add the need to archive and retrieve historical data, often spanning several years and multiple data types. Full protection against data loss and corruption requires a comprehensive backup and recovery strategy, along with an accompanying disaster recovery plan.

Traditionally, backup and recovery strategies were based on writing data to tape that was then stored off-site. However, this method has several weaknesses:

Backup operations require storage and processing resources, which can interfere with production and may require some applications to be stopped during the backup.
Backup intervals usually range from a few hours to a full day, with the risk of losing several hours of data updates made between backups.
Using tape backups for disaster recovery can lead to recovery times of up to several days, an unacceptable level of downtime for many organizations.

Data replication is a better solution for both data protection and disaster recovery. Data replication solutions capture data changes on the primary production system and send them in real time to a system at a remote disaster recovery site, at a local site, or both. It is still possible for a system failure to occur before data changes have been replicated, but the exposure is seconds or minutes rather than hours or days. Data replication can be combined with error detection and automated failover tools to get a disaster recovery site up and running in minutes or hours rather than days. Local data copies can be used to reduce tape backup requirements and to separate tape backup from production system operation, eliminating resource contention and removing backup window restrictions.
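The difference in exposure between periodic backup and replication can be made concrete with a worst-case calculation. This is a simplified sketch: it assumes the failure happens just before the next copy would have been taken, and the message rate, intervals and replication lag are hypothetical values, not figures from any product.

```python
def worst_case_loss_seconds(copy_interval_seconds):
    """Worst case: the failure happens just before the next copy is taken,
    so everything written since the previous copy is lost."""
    return copy_interval_seconds


def messages_at_risk(copy_interval_seconds, messages_per_hour):
    """Rough count of messages written during one worst-case window."""
    return messages_per_hour * copy_interval_seconds / 3600


# Hypothetical numbers: periodic backups versus replication that lags
# the production system by about 30 seconds.
MESSAGES_PER_HOUR = 2000
strategies = {
    "nightly tape backup": 24 * 3600,
    "backup every 4 hours": 4 * 3600,
    "replication, ~30 s lag": 30,
}

for name, interval in strategies.items():
    at_risk = messages_at_risk(interval, MESSAGES_PER_HOUR)
    print(f"{name}: up to {worst_case_loss_seconds(interval)} s of updates, "
          f"about {at_risk:.0f} messages")
```

The same calculation can be rerun with an organization's own message volume to judge whether a given backup interval is acceptable.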

Consider the causes of planned downtime

Hardware and software reconfiguration, hardware upgrades, software hotfixes, new service packs and new software releases can all require planned downtime. Planned downtime can be scheduled at night or on weekends when system activity is low, but there are still issues to consider. IT staff morale may suffer if off-hours work takes place regularly. Companies may need to pay overtime for this work. And stopping the application even at night or on the weekend can still cause problems for the many companies that use their systems 24/7.

Using standby servers in an availability solution may allow a reconfiguration or upgrade to be applied to one server while Exchange continues to run on another. After the reconfiguration or upgrade is completed, Exchange can be moved to the upgraded server with minimal downtime. Most of the work can be done during normal working hours. Virtualization-based solutions that can move applications from one server to another without downtime can reduce planned downtime even further. However, it is important to know that changes to application data structures and formats may not be possible with this type of upgrade.

Other benefits of virtualization

Although not required for Exchange protection, the latest server virtualization technologies provide unique benefits that can make Exchange protection easier and more effective.

Virtualization makes it easy to set up evaluation, test and development environments without additional dedicated hardware. Many companies cannot afford the extra hardware needed to test Exchange in a traditional physical environment, yet effective testing is one of the keys to avoiding problems when making configuration changes, installing hotfixes or moving to a new release.
Virtualization allows resources to be adjusted dynamically to meet peak loads. Buying enough capacity to handle peak loads can be very expensive. Conversely, sizing the configuration only for typical load reduces performance at peak times and eventually leads to the disruption associated with upgrading and replacing production hardware.

Update 25 May 2019
