Author: Site Editor Publish Time: 2022-05-24 Origin: Site
Data center availability has long been a problem for IT operations because of silos, or gaps, between IT operations, security operations, and facilities. Enterprises must close these gaps to make more accurate and comprehensive decisions, especially when it comes to data center optimization.
The draft Data Center Optimization Plan released in November 2018 proposes a number of new metrics for measuring U.S. federal data center optimization efforts, including new metrics for data center availability. If mandated, the Data Center Optimization Initiative (DCOI) availability metrics could present new challenges for the U.S. government. While data center facility availability can be measured by a single metric, that approach has proven to be highly inaccurate and may actually hinder an agency's ability to predict and address the issues that must be resolved to maintain data center availability and any interdependencies critical to the agency's mission.
This is why U.S. federal agencies could benefit from measuring sub-metrics that represent the operational status, availability, and risk of data centers and their infrastructure. Using a business services approach to data center optimization (dynamically grouping components by geographic location, application type, or technology stack) allows agencies to anticipate and resolve issues faster, and so better ensure availability.
A business service architecture collects metrics on the operational status, availability, and risk of the IT components underlying a business service, together with a dynamic, real-time map of the infrastructure and applications that support the service. This gives IT managers a real-time operational view that helps them identify and isolate the underlying issues that affect the service. Devices can be abstracted, and the metrics of individual devices and IT services can be rolled up ("bubbled up") into a combined metric that represents the overall state of the business service. Exposing the sub-metrics as well, however, gives executives and managers a deeper understanding of the overall availability state of the data center than the combined metric alone.
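To make the roll-up idea concrete, here is a minimal Python sketch. It assumes each component simply reports whether it is healthy, and it treats the healthy fraction as the metric. The inventory, the field names, and the functions `sub_metrics` and `combined_metric` are all hypothetical and chosen for illustration. A real monitoring platform would define these differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inventory: names, locations, and stacks are illustrative only.
components = [
    {"name": "web-1", "location": "east", "stack": "web", "healthy": True},
    {"name": "web-2", "location": "west", "stack": "web", "healthy": False},
    {"name": "db-1", "location": "east", "stack": "database", "healthy": True},
    {"name": "db-2", "location": "west", "stack": "database", "healthy": True},
]


def sub_metrics(components, key):
    """Group components by `key` and return the healthy fraction per group."""
    groups = defaultdict(list)
    for component in components:
        groups[component[key]].append(1.0 if component["healthy"] else 0.0)
    return {group: mean(values) for group, values in groups.items()}


def combined_metric(components):
    """Roll every component up into one health figure for the whole service."""
    return mean(1.0 if c["healthy"] else 0.0 for c in components)


by_location = sub_metrics(components, "location")
by_stack = sub_metrics(components, "stack")
overall = combined_metric(components)
```

With this sample data the combined figure is 0.75, which says only that something is degraded. The grouped sub-metrics show that the shortfall is confined to the "west" location and the "web" stack, which is the kind of detail a single rolled-up number hides.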
Suppose an agency has four identical servers, any one of which can host the entire workload, so only one needs to be operational at a time. The other three servers are essentially backups that can take over if the active system fails. In this example, if one server fails, the service is still 100% available. However, the operational health of the system drops to 75%, and the risk rises to 25%.
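The arithmetic in this example can be sketched in a few lines of Python. The class `ServicePool` and its field names are hypothetical. The formulas are one plausible reading of the example (health as the fraction of servers still working, risk as its complement), and the proportional drop in availability once too few servers remain is an added assumption, not something the example specifies.

```python
from dataclasses import dataclass


@dataclass
class ServicePool:
    total: int     # servers assigned to the business service
    healthy: int   # servers currently operational
    required: int  # servers needed to carry the full workload

    def availability(self) -> float:
        # Assumption: full availability while enough servers remain,
        # degrading proportionally once fewer than `required` are healthy.
        return min(1.0, self.healthy / self.required)

    def health(self) -> float:
        # Fraction of the pool that is still operational.
        return self.healthy / self.total

    def risk(self) -> float:
        # Taken here as the complement of operational health.
        return 1.0 - self.health()


# Four identical servers, one is enough for the workload, one has failed.
pool = ServicePool(total=4, healthy=3, required=1)
print(pool.availability(), pool.health(), pool.risk())  # prints 1.0 0.75 0.25
```

Keeping the three figures separate is the point of the exercise: availability alone would report 100% and hide the fact that a quarter of the redundancy is already gone.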
These metrics are important because they remove the barriers that prevent executives from having oversight of business services. Previously, a data center administrator might receive an alert indicating that server CPU utilization had crossed a certain threshold and would then have to respond manually. With more granular metrics, utilization alerts can automatically trigger the addition of another server or two to support more traffic, and business service policies can be automatically adjusted to recalculate the new operating conditions, availability, and risk metrics without manual intervention. Redundancy and self-healing capabilities should be incorporated into every layer of the data center.
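A policy of this kind might look like the following Python sketch. It assumes the alert fires when CPU utilization rises above a threshold, and it stands in for real provisioning by simply incrementing counters. The names `Pool` and `on_utilization_alert`, the 80% threshold, and the step size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    total: int    # servers assigned to the business service
    healthy: int  # servers currently operational

    def metrics(self) -> dict:
        # Recalculate the sub-metrics from the current pool state.
        health = self.healthy / self.total
        return {"health": round(health, 2), "risk": round(1.0 - health, 2)}


def on_utilization_alert(pool: Pool, cpu_utilization: float,
                         threshold: float = 0.80, step: int = 1) -> dict:
    """Hypothetical event policy: scale out, then recompute the metrics."""
    if cpu_utilization > threshold:
        # Placeholder for a real provisioning call; new servers are
        # assumed to come up healthy.
        pool.total += step
        pool.healthy += step
    return pool.metrics()


# Pool from the earlier example: four servers, one of them failed.
pool = Pool(total=4, healthy=3)
before = pool.metrics()
after = on_utilization_alert(pool, cpu_utilization=0.92)
```

In this run the alert adds one server, so health moves from 0.75 to 0.8 and risk from 0.25 to 0.2, with no operator involved. A utilization reading below the threshold leaves the pool and its metrics unchanged.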
When it comes to data center optimization, definitions of health, availability, and risk cannot be generalized. IT operations teams can define them and create automation and event policies as needed. As more software-defined services, artificial intelligence, machine learning, and advanced analytics enter the data center, IT operations teams will have more ways to gain actionable IT insights, understand the interdependencies between infrastructure and applications, and automate manual tasks to improve efficiency. Mapping the topology between business processes and the systems that run them can facilitate automation, including remediation, configuration management database enhancements, and advanced event scaling, reducing the effort spent on management, maintenance, and troubleshooting.