With increasing digitalisation, companies are becoming more and more dependent on their data and, according to Gartner, cloud computing is the new reality for South Africa.

By Hayden Sadler, country head at Infinidat South Africa

It is therefore essential for them that this data is always available. Interruptions need to be prevented at all costs. The traditional approach of fixing malfunctions and failures as quickly as possible after they have been discovered is no longer sufficient. Predictive prevention of system failures and problems needs to become the norm. As storage systems play a key role in data availability, it is essential that they are up and running at all times.

So, how is constant availability of storage systems best achieved? These days, storage systems are far too extensive and complex to rely on manual controls. It simply is not practical to have a member of the IT team perform near-permanent system checks by running through a checklist to ensure that all components run within optimal parameters. This means that automation is the way forward.

However, the demands on the necessary automation are very high. Automated systems must be able to recognise where problems are likely to develop by means of a precise prognosis and prevent them before they occur. This is where artificial intelligence (AI) and machine learning (ML) come into their own. Only through the use of these advanced technologies can predictive maintenance and problem prevention become a reality. Not only are they able to detect the development of potential problems, they can also do so for problems that the developers did not anticipate (e.g. memory leaks) and that would therefore go undetected when system performance is checked against specific, predefined parameters.
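To illustrate why a check against fixed parameters can miss a slowly developing problem such as a memory leak, here is a minimal sketch. It is not an ML model and not any vendor's method: it simply contrasts a static limit with a least-squares trend test. The sample data, the 4000 MB limit and the 0.5 MB per hour slope threshold are all assumptions made up for the example.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def static_check(values, limit):
    """Classic parameter check: alarm only once a sample exceeds the limit."""
    return any(v > limit for v in values)


def trend_check(values, min_slope):
    """Trend check: alarm when usage keeps growing, whatever the level."""
    return slope(values) > min_slope


# Hypothetical hourly memory usage in MB: grows by 2 MB every hour.
memory_mb = [500 + 2 * hour for hour in range(48)]

static_alarm = static_check(memory_mb, limit=4000)    # limit is an assumption
trend_alarm = trend_check(memory_mb, min_slope=0.5)   # threshold is an assumption
print(static_alarm, trend_alarm)
```

In this made-up series the static check stays silent because usage is still far below the limit, while the trend check flags the steady growth long before the limit would be reached. A learning system would go further by deriving what counts as abnormal growth from observed data rather than from a hand-picked threshold.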

Artificial Intelligence for IT Operations (AIOps) is the method to apply to realise predictive maintenance for all systems, and storage systems are no different. With a combination of big data analysis and machine learning, AIOps can proactively detect issues in a very granular way before they have an impact on users, both during the development phase and after deployment in the field. In addition, AIOps is a great method to predict and plan future storage requirements. This is particularly beneficial during tough times such as the national lockdown.

AIOps uses empirical data from the vendor’s global install base to learn to detect technical issues and capacity bottlenecks before they affect the business. The system analyses problems that have already occurred to identify patterns, and it then uses these patterns to predict similar cases in the future. Of course, there have also been monitoring systems in the past that tried to detect emerging problems at an early stage. However, these could only detect conditions that the developers had anticipated and observed in advance and coded accordingly. An AIOps system, on the other hand, is much more flexible, learns continuously and can also identify types of emerging problems for which it was not specifically programmed. This makes AIOps much more effective and precise in identifying problems and additional capacity requirements before they arise. The result is actionable insight on how to prevent issues, which leads to uninterrupted operations and business continuity.

Relatively “young” systems in the field pose a unique challenge to the predictive algorithms, as the AIOps system does not yet have enough usage data to analyse in order to make its forecasts. That is why solutions need to match the system to a known usage pattern as quickly as possible, allowing multiple algorithms to compete on which can provide the most accurate predictions for each dataset. AIOps then uses the algorithm that best derives the present state of the system from a relatively short past to make its predictions moving forward. This process is repeated periodically, so the longer the system runs, the more the algorithm is adjusted based on actual operating data. It is crucial that, with this approach, the system has prognostic capabilities right from the start. Predictive maintenance is in the very genes of the system, so to speak.
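The idea of letting several algorithms compete on a short history can be sketched as a simple backtest. This is an illustration under stated assumptions, not a description of any vendor's implementation: the three candidate forecasters, the holdout length of three samples and the usage figures are all hypothetical. Each candidate is fitted on the older part of the history and scored on how well it reproduces the most recent samples.

```python
def naive_forecast(history, steps):
    """Repeat the last observed value."""
    return [history[-1]] * steps


def mean_forecast(history, steps):
    """Repeat the average of the history."""
    avg = sum(history) / len(history)
    return [avg] * steps


def linear_forecast(history, steps):
    """Extend a least-squares straight line fitted to the history."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    den = sum((i - x_mean) ** 2 for i in range(n))
    slope = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(history)) / den
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + k) for k in range(steps)]


# Hypothetical candidate set; a real system would use richer models.
CANDIDATES = {
    "naive": naive_forecast,
    "mean": mean_forecast,
    "linear": linear_forecast,
}


def mean_abs_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pick_best(history, holdout=3):
    """Score every candidate on the last `holdout` samples and pick the winner."""
    if len(history) < holdout + 2:
        raise ValueError("need at least holdout + 2 samples")
    train, test = history[:-holdout], history[-holdout:]
    errors = {
        name: mean_abs_error(forecast(train, holdout), test)
        for name, forecast in CANDIDATES.items()
    }
    return min(errors, key=errors.get), errors


# Hypothetical used capacity in TB for the first eight days of a new system.
usage_tb = [10, 12, 14, 16, 18, 20, 22, 24]
best, errors = pick_best(usage_tb)
print(best, errors)
```

Re-running `pick_best` as more operating data arrives mirrors the periodic re-evaluation described above: the winning algorithm can change once the longer history reveals a different usage pattern.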

In addition, a support team should perform an anomaly detection process by analysing the IO profile of each dataset and looking for performance anomalies that indicate a problem in the external communication environment. Often, data access problems are not caused by the storage system, but by the network or specific hosts, for example. By their very nature, these causes are universal and require little or no training data to detect. After analysis, the support team proactively engages the customer to resolve any bottlenecks or misconfigurations before they become apparent to users. In addition, this team feeds AIOps with the knowledge gained, so that in future it will be able to predict problems whose causes lie outside the storage system itself.
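Because such external causes need little or no training data, even a plain statistical rule can surface them. The sketch below is one possible approach, not the process any particular support team uses: it flags hosts whose average IO latency deviates strongly from the median of all hosts, using the median absolute deviation (MAD) as a robust yardstick. The host names, latency values and the 3.5 cut-off are assumptions chosen for illustration.

```python
from statistics import median


def flag_outlier_hosts(latency_ms_by_host, threshold=3.5):
    """Return hosts whose latency is an outlier by the modified z-score.

    The modified z-score is 0.6745 * |value - median| / MAD, which stays
    stable even when one host is far off and would distort a plain mean.
    """
    values = list(latency_ms_by_host.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # all hosts behave (almost) identically; nothing to flag
    return sorted(
        host
        for host, value in latency_ms_by_host.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )


# Hypothetical average read latency per host, in milliseconds.
latency_ms = {
    "host-a": 1.1,
    "host-b": 1.0,
    "host-c": 1.2,
    "host-d": 0.9,
    "host-e": 9.5,
}

suspects = flag_outlier_hosts(latency_ms)
print(suspects)
```

When a single host stands out like this while the others see normal latency, the storage system is an unlikely culprit, which points the investigation at that host or its network path.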

In the current situation, storage vendors in South Africa will no longer be able to compete without implementing AIOps. This applies to capacity planning as much as to predictive maintenance. One of the more common storage AIOps predictions is the time it would take for the free capacity to run out. Here, a unique solution is required in order to meet the challenge of increasing capacity: instead of going through long procurement cycles that take months, Capacity On Demand (COD) allows customers to receive more capacity than they paid for initially, so they can grow instantly and serve their business units. When additional capacity is required, it is simply unlocked, just like in the public cloud. The role of AIOps in the whole life cycle of a storage system is therefore twofold: on the one hand, it detects problems before they become noticeable to the user. On the other hand, it is able to reliably forecast future capacity bottlenecks and thus ensures that the company can purchase new capacity in time to ensure business continuity. It also means that they can postpone spending money on storage until the absolute last moment.
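The prediction of when free capacity runs out can be approximated, in its simplest form, by extrapolating recent growth. The following sketch assumes linear growth and daily samples; the function name, the figures and the linear model itself are illustrative assumptions, and a production forecast would account for seasonality and changes in growth rate.

```python
def days_until_full(used_tb_per_day, total_tb):
    """Estimate days until capacity is exhausted from daily usage samples.

    Fits a least-squares line to the samples and extrapolates from the
    latest sample. Returns None when usage is flat or shrinking.
    """
    n = len(used_tb_per_day)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(used_tb_per_day) / n
    den = sum((i - x_mean) ** 2 for i in range(n))
    growth_per_day = (
        sum((i - x_mean) * (y - y_mean) for i, y in enumerate(used_tb_per_day)) / den
    )
    if growth_per_day <= 0:
        return None
    return (total_tb - used_tb_per_day[-1]) / growth_per_day


# Hypothetical system: 150 TB usable, growing by 2 TB per day.
used_tb = [100, 102, 104, 106, 108]
remaining_days = days_until_full(used_tb, total_tb=150)
print(remaining_days)
```

With these made-up numbers, 42 TB remain and usage grows by 2 TB per day, so the estimate is about 21 days. An estimate like this is what would prompt unlocking additional capacity or starting a purchase in good time.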

When looking for a new storage infrastructure, local companies need to take into account not only factors like the supported network protocol and the storage medium (hard disk, SSD, tape, etc.). They will also have to consider the scalability, consumption model (pay as you grow) and predictive capabilities of a storage system to find the product that best suits their business. Here, choosing the right data storage provider is key.