For many years, tuning storage performance (particularly on shared storage arrays) has been seen as a dark art that requires skill and experience.

But as the market evolves with new technology, it might be assumed that less effort is needed to match the requirements of specific application workloads.

However, storage still needs to be “tuned”.

In this article, we discuss some of the issues involved and what can be done to optimise storage hardware to the demands of modern applications.

External storage has typically been used as a permanent store for application data, with the downside that the speed of persistent media is vastly slower than main memory.

Hard drives have latency (response) times measured in milliseconds and are good for sequential rather than random input/output (I/O) performance.

NAND flash offers a performance boost with good random I/O handling, albeit at the expense of the lifetime of the media. Flash storage also needs to perform background tasks such as garbage collection, which can add temporary spikes to response times.

DRAM offers lower latency than either disk or flash and, as we will discuss, can be used as a cache to improve performance.

Finally, we should remember that for shared arrays, servers and storage are connected via a network, and with scale-out node-based solutions (as in hyper-converged infrastructure), there is a network connecting the nodes that is essential to maintain data integrity.

With all that said, here are the areas where we can configure tuning options.

  • Data layout – Distributing data across physical media can improve I/O performance. Individual disk and flash drives have limited I/O capability, so “striping” across multiple devices spreads I/O across many concurrent read/write streams. With RAID data protection, stripe width can’t be extended indefinitely, because larger RAID groups are less resilient and take longer to rebuild. RAID-6 schemes extend protection at the expense of extra space and parity calculation overhead. An alternative is to use erasure coding, but this is suited more to object-type data.
  • Caching – Caching data on flash or in DRAM allows I/O latency to be improved by serving read requests from the cache in a shared array or in the application host. Write I/O can also be accelerated but needs to be protected against hardware failure by replication and/or writing to a persistent cache device. Modern caching solutions like Nimble’s Adaptive Flash or HPE 3PAR’s Adaptive Flash Cache look to optimise the use of more expensive resources while improving performance.
  • Network tuning – In shared storage environments, Fibre Channel and Ethernet networks can be tuned to improve performance. For Fibre Channel, this can mean looking at settings like buffer credits and for Ethernet looking at packet size. Obviously, having non-blocking switches ensures that point-to-point throughput is guaranteed for each port on the switch. Network design also has an impact. Historically, Fibre Channel networks were designed using a range of topologies, based on saving cost. Today, Fibre Channel and Ethernet are approaching individual port speeds that are hard to saturate (32Gbps for FC, 40Gbps for Ethernet), so port sharing isn’t a big issue. However, if ports can be dedicated as much as possible then this helps eliminate bottlenecks.
  • Tiering – The use of tiering typically offers cost savings but is a performance option too. Data can be placed on the most appropriate tier of storage based on I/O performance needs and cost effectiveness of media. Tiering algorithms have developed rapidly over the years, moving from LUN to block-based tiering. Getting a tiering algorithm and data layout right can improve performance without resorting to purchasing additional hardware.
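
As a rough illustration of the striping idea under “Data layout”, the sketch below maps logical block addresses onto devices in simple round-robin fashion, with no parity. The function name `locate`, the 8-block stripe unit and the four-device group are assumptions chosen for the example, not values from any particular array.

```python
from collections import Counter


def locate(lba, stripe_unit, n_devices):
    """Map a logical block address to (device index, block offset on that device).

    Plain round-robin striping without parity; real arrays add RAID or
    erasure-coding layouts on top of this basic idea.
    """
    # Which stripe unit the block falls in, and its position inside that unit.
    unit_number, within_unit = divmod(lba, stripe_unit)
    # Stripe units are dealt out to devices in turn; 'row' counts full passes.
    row, device = divmod(unit_number, n_devices)
    return device, row * stripe_unit + within_unit


# A 64-block sequential read across 4 devices with an 8-block stripe unit.
blocks_per_device = Counter(locate(lba, 8, 4)[0] for lba in range(64))
print(dict(blocks_per_device))
```

In this example the 64-block read lands evenly on all four devices, which is why a single large request can be serviced by several drives at once.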

To get the best out of tuning, the starting point is to know the I/O profile of the application.

This can be quite variable, but we can break it down into a number of categories.

Structured data – This is typically represented by SQL (Oracle, SQL Server) and NoSQL (MongoDB, CouchDB) databases and has a mixed I/O profile. Access to stored data is typically random (excluding full table scans), whereas changes are logged as small append-style writes. Traditional wisdom has been to place data on RAID-5 storage and logs on RAID-10, but this was really only relevant where there was little I/O caching. With modern storage arrays, most database loads (except the most intensive) are easily managed without manual placement of data. For more intensive workloads, placing logs on high performance storage is a good strategy.

Virtual servers – Server virtualisation introduces the “I/O blender” effect that randomises even sequential I/O workloads. This is because data is distributed across a LUN or volume from many virtual machines, each of which acts independently of the others, generating a random workload profile. Improving performance for virtual servers means deploying faster media (to reduce individual I/O latency) or introducing caching. Both vSphere (VMware) and Hyper-V (Microsoft) allow caching to be implemented for individual virtual machines. There are also third-party caching solutions that integrate into the hypervisor to improve I/O performance. For HCI, VMware’s Virtual SAN offers an all-flash option that uses a mix of high performance and high capacity flash to optimise I/O workloads.
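
The “I/O blender” effect can be seen in a toy model. The sketch below assumes each virtual machine reads sequentially within its own region of a shared LUN and that the array receives the requests in strict round-robin order. That ordering is an idealisation, and `blend`, `sequential_fraction` and the region size are hypothetical names and values used only for illustration.

```python
def sequential_fraction(lbas):
    """Fraction of requests whose address immediately follows the previous one."""
    pairs = list(zip(lbas, lbas[1:]))
    return sum(1 for prev, cur in pairs if cur == prev + 1) / len(pairs)


def blend(n_vms, requests_per_vm, region_size):
    """Interleave one sequential stream per VM, round-robin, as the array sees them."""
    streams = [
        range(vm * region_size, vm * region_size + requests_per_vm)
        for vm in range(n_vms)
    ]
    # zip(*streams) takes one request from every VM in turn.
    return [lba for step in zip(*streams) for lba in step]


single_vm = blend(1, 1000, 100_000)
eight_vms = blend(8, 1000, 100_000)
print(sequential_fraction(single_vm))  # 1.0: every request follows the last
print(sequential_fraction(eight_vms))  # 0.0: no two adjacent requests are contiguous
```

Each stream is perfectly sequential on its own, yet the combined stream seen at the LUN contains no contiguous pairs at all, which is why faster media or caching helps more than layout tuning here.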

Virtual desktops – The challenge of delivering I/O performance for virtual desktops brings up the same issues of randomness as virtual servers, with a couple of differences. First, most desktops are made from a single image so there is a large amount of duplicate data when starting many desktops. Second, virtual desktops are booted frequently, potentially daily, and so there are some intensive read (startup) and write (shutdown) periods. Virtual desktop performance can be vastly improved by caching and deduplicating the desktop image in a shared array or using third-party software. Non-persistent desktops can even be cached in DRAM. This solution works out much cheaper than buying an expensive all-flash system.

Web servers – Web and other read-intensive applications (like content management systems) will benefit from the use of additional read cache. This can be implemented in shared arrays, or as dedicated cache in the hypervisor. There’s an obvious trade-off here in avoiding backend I/O altogether with efficient caching in the web server itself, but these cache systems still have a limit and so need to fall back on external I/O at some stage.
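
To put a rough number on the benefit of read cache, average read latency can be modelled as a weighted mix of cache hits and backend reads. The sketch below uses that simple model. The 0.1ms cache and 5ms backend figures are illustrative assumptions, not measurements, and `effective_read_latency_ms` is a hypothetical helper.

```python
def effective_read_latency_ms(hit_ratio, cache_ms, backend_ms):
    """Average read latency for a given cache hit ratio (simple weighted model)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * backend_ms


# Assumed latencies: 0.1 ms from cache, 5 ms from backend disk.
for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    latency = effective_read_latency_ms(hit_ratio, cache_ms=0.1, backend_ms=5.0)
    print(f"hit ratio {hit_ratio:.2f}: {latency:.3f} ms average")
```

The model shows why the last few percent of hit ratio matter: while misses remain much slower than hits, they dominate the average.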

Email servers – Looking back 10 or 15 years, the demand for I/O per mailbox on a platform like Microsoft Exchange would have been quite high. With successive product releases, the I/O demands per user have dropped by a factor of 15 to 20. Exchange 2016 requires around 5% of the IOPS per user of Exchange 2003. As a result, Exchange can be deployed successfully on JBOD systems rather than a SAN. Having said that, email platforms like Exchange will benefit from increased use of cache and distributed data layout like wide striping.

Analytics – Many analytics tools read and re-read the same data as they build up a profile of the data. These tools are sensitive to latency and need to be able to execute queries in parallel, hence the design of Hadoop across many physical storage nodes with multiple disk spindles, for example. So, improving performance for analytics workloads is about reducing I/O read latency. This can mean using flash media, or adding more cache to external storage or to the hosts running the analytics software. Data placement isn’t that useful as it’s hard to predict exactly what data will be used when running analytics software. So, the focus for analytics is getting the right balance of storage and cache and designing for the ability to increase cache as desired.

With all of the above scenarios, having access to detailed metrics that show performance and resource usage is critical. With the right data, the impact of any change can be assessed and measured against the cost in additional resources.
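
One way to assess a change is to compare latency distributions before and after it, looking at tail percentiles as well as the mean. The sketch below does this with a nearest-rank percentile. The sample values are made up for illustration, and `percentile` and `summarise` are hypothetical helpers rather than part of any monitoring tool.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile (pct in 1..100) of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(samples):
    """Mean, median and 99th-percentile latency for one measurement run."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p99": percentile(samples, 99),
    }


# Made-up latencies in milliseconds: most requests get faster after the
# change, but a few slow outliers remain.
before = [4.0] * 95 + [20.0] * 5
after = [0.5] * 95 + [20.0] * 5
print("before:", summarise(before))
print("after: ", summarise(after))
```

In this made-up data the mean and median improve sharply while the 99th percentile does not move, the kind of result that an average alone would hide.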

