Be careful when using SAN

Be careful when using SAN (Storage Area Networks) or similar shared storage solutions (or any other virtualization, consolidation or cloud solution).

This week it happened again: A customer called us because he was having trouble with his on-line shop (note the date!). Everybody in his company was complaining that the databases were answering slowly.

When looking at the box (with iostat) we saw some I/O load, some pending reads in InnoDB (SHOW ENGINE INNODB STATUS and SHOW GLOBAL STATUS LIKE 'InnoDB%') and a very bad InnoDB buffer pool hit ratio (about 80%; yes, I know hit ratios are a questionable metric, but sometimes they are helpful).
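For what it is worth, the hit ratio itself is trivial to derive from the two Innodb_buffer_pool_read* status counters. A minimal sketch, with made-up counter values for illustration (in practice you would take them from SHOW GLOBAL STATUS):

```shell
# Sketch: compute the InnoDB buffer pool hit ratio from the two
# standard status counters. The values below are invented examples;
# in real life you would read them from:
#   SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
read_requests=1000000   # Innodb_buffer_pool_read_requests (logical reads)
disk_reads=200000       # Innodb_buffer_pool_reads (reads that hit disk)

# hit ratio = (logical reads - disk reads) / logical reads * 100
hit_ratio=$(awk -v r="$read_requests" -v d="$disk_reads" \
    'BEGIN { printf "%.1f", (r - d) / r * 100 }')
echo "Buffer pool hit ratio: ${hit_ratio}%"
```

With these example numbers the ratio comes out at 80%, roughly what we saw on the box; a healthy OLTP system is usually well above 99%.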

The customer affirmed that he had not changed anything on his box for a few days. The day before, everything had been working fine, but this day, in the afternoon, the system suddenly became slow. He further told me that they were producing some monthly reports, but only on the slaves.

We found one query that had been running for 25 minutes, so at first I assumed this was the evil thing. After we killed the query the system relaxed a bit, but it was still under pretty heavy I/O load, and the customer was of the opinion that we were getting close to the end of peak hours. That would explain the relaxation of the system...
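A quick way to spot such long runners is to filter the process list by run time. Here is a minimal sketch that works on a saved, tab-separated SHOW PROCESSLIST dump; the column layout is simplified (Id, Command, Time, Info only) and the row values are invented for illustration:

```shell
# Sketch: from a simplified, tab-separated process list dump
# (columns: Id, Command, Time, Info), emit a KILL statement for
# every query running longer than 20 minutes (1200 seconds).
processlist=$(printf '12\tQuery\t1500\tSELECT ... FROM orders ...\n13\tSleep\t3\tNULL\n')

echo "$processlist" | awk -F'\t' '$2 == "Query" && $3 > 1200 {
    print "KILL " $1 ";"
}'
# -> KILL 12;
```

On a live server you could get the same effect by querying information_schema.PROCESSLIST directly and feeding the result back to the mysql client.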

Up to this point, no clue what the problem or its cause could have been. A bit frustrating.

Luckily one of the System Administrators just came in and complained that we were filling up his SAN. He complained about a Slave of our slow Master. On the Slave we found a query (an end-of-month report) that had been running for 2 to 3 hours and had generated a temporary table of about 350 Gbyte! This table filled up the SAN to about 99%.
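In hindsight, a one-liner over the df output would have caught this right away. A sketch, using the 90% threshold the System Administrator mentioned:

```shell
# Sketch: warn about every local filesystem that is more than 90% full
# (the threshold above which, per the sysadmin, the SAN degrades).
# df -hP gives POSIX-style output with "Use%" in column 5 and the
# mount point in column 6.
df -hP | awk 'NR > 1 {
    use = $5 + 0              # "99%" -> 99
    if (use > 90) print "WARNING: " $6 " is " $5 " full"
}'
```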

The System Administrator further mentioned that filling a SAN to more than about 90% will slow it down (why ever).

This set my alarm bells ringing: Did we not have an I/O problem that suddenly started 2 to 3 hours ago? We found that our Master and the Slave were accidentally located on the same SAN.

When we killed the query the table was removed automatically, and after a few minutes the disk space was released again. I was told that it would take some hours until the SAN relaxed again (why ever). The customer confirmed that the next day everything was working fine again. So we can be quite sure that filling up the SAN from the Slave caused the problem on our production system.

Conclusion: A SAN can have several unexpected side effects and impacts on performance. If you do not want to experience unpredictable performance impacts, try to stay on dedicated storage.

See also our commit demo test:

trx_san_vm.png

Comments

so you don't have `df -h` in troubleshooting commands list? ;-)
Anonymous