Friday, December 18, 2020

Fixing bad T10-PI checksums

Background (in case you care)

Let's say you have some kind of hypothetical data disaster where you lose all redundancy in a RAID-6 pool through no fault or design decision of your own, maybe because of a firmware bug. Then let's say a hard drive physically bites the dust at the worst possible time.

Then let's imagine that you send the drive off to a data recovery firm that is able to recover 100% of the data (awesome!!!), but they don't know about T10-PI checksums and thus don't copy them onto the clone. Let's assume that due to various circumstances this was your one shot at it and they can't just reclone with T10-PI for some reason.

So then pretend that you get your 100% successful clone back, only to find out that your disk is unreadable in the array because the disk, which was correctly formatted with T10-PI Type 2 checksums, does not contain the correct checksums for each sector. Every read of every sector on the disk will fail.

So now your data is sitting there on the drive, just waiting for you to pull it off. Simple, right? Just find somewhere else you can read the data off by disabling the checksum verification. Well, let's say that for some reason your "RAID-6" pool is actually proprietary RAID-6 from the vendor, so you're stuck using their array to read the data and you can't disable T10-PI. Uh oh.

Now in this hypothetical scenario, you almost have a solution. You can find some random Linux system where you could disable T10-PI, except it won't have access to the proprietary vendor RAID. You could get around the proprietary RAID issue by simply using the vendor's hardware, but let's say that the vendor can't find a way to turn T10-PI verification off, either on your own system or on a system physically sitting in their lab.

Let's also assume that the storage array requires T10-PI to be enabled. You can't just ignore the T10-PI somewhere and dd the data to a new disk, since the new disk would not have the correct T10-PI checksums in place and the array would reject it upon insertion. Also, keep in mind that most non-JBOD arrays can't actually present a single disk as a single disk. Otherwise you could dd the image onto a disk in the array from elsewhere (image file captured elsewhere, piped over ssh, or whatever), which would be very nice and easy.

Now what do you do?

For the (maybe) three other people in the world who will ever run into this implausible hypothetical scenario, there is a solution!

You can disable T10-PI checking on the disk! There's got to be an easy tool for that, right? Well, no. sg_format can reformat the disk with or without T10-PI. However, reformatting the disk destroys your data.

What you need to do at this point is set the checksums to a special value in each sector on the disk. Yes, each and every sector.

It turns out that there is a tool called ddpt that is about 99% of what is needed. It can read and write data similar to dd, but it has some nice enhancements. One extremely useful feature is that it can selectively enable and disable verification of checksums for reads and writes. Now you can read (and write) disks with bad T10-PI checksums, even if dd fails in the same situation.

The only real task at that point is overwriting the eight bytes of protection information in each sector with 0xff. That is a special value that disables checksum validation, though the details vary somewhat between protection types. As I understand it, writing 0xff to all eight bytes works for Type 1, Type 2, and Type 3 (archive.org link).
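To make the "special value" a little more concrete, here is a minimal Python sketch of how the eight bytes are laid out and why all-0xff covers every type. It is my reading of the T10-PI rules, not code from fix_t10pi: the tuple is a 2-byte guard tag (CRC), a 2-byte application tag, and a 4-byte reference tag, all big-endian. Types 1 and 2 skip checking when the application tag is 0xFFFF, and Type 3 additionally requires the reference tag to be 0xFFFFFFFF. The names parse_pi and checking_disabled are made up for illustration.

```python
import struct

# Escape values that tell the device to skip verification (assumed from the T10-PI rules).
APP_ESCAPE = 0xFFFF
REF_ESCAPE = 0xFFFFFFFF

ALL_FF = b"\xff" * 8  # the value written into every sector's PI field


def parse_pi(pi: bytes):
    """Split an 8-byte PI tuple into (guard, app_tag, ref_tag), big-endian."""
    if len(pi) != 8:
        raise ValueError("protection information must be exactly 8 bytes")
    return struct.unpack(">HHI", pi)


def checking_disabled(pi: bytes, pi_type: int) -> bool:
    """Return True if this PI tuple disables verification for the given type."""
    _guard, app_tag, ref_tag = parse_pi(pi)
    if pi_type in (1, 2):
        return app_tag == APP_ESCAPE
    if pi_type == 3:
        return app_tag == APP_ESCAPE and ref_tag == REF_ESCAPE
    raise ValueError("pi_type must be 1, 2, or 3")
```

With this model, checking_disabled(ALL_FF, t) is true for t in 1, 2, and 3, which is why filling all eight bytes is the lowest-effort choice.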

Solution

I wrote a simple C program that reads a file and writes out the same data to a different file, except with the T10-PI bytes in each sector overwritten with 0xff. It also optionally writes out a second file with just the user data, no T10-PI bytes included. By using pipes, this can simultaneously read from a source disk, "fix" the T10-PI checksums, write to a target disk, and generate a hash of the user data on the fly (or copy the user data itself to a separate destination rather than just hash it).
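The idea is simple enough to sketch in a few lines of Python. This is not the actual fix_t10pi source (see the repository for that), just an illustration under two assumptions: the stream interleaves each block of user data with its 8 bytes of protection information, and the logical block size is 4096 bytes as in the usage example below. The function name fix_pi_stream and its parameters are hypothetical.

```python
import io

PI_LEN = 8  # bytes of protection information following each block (assumed layout)


def fix_pi_stream(src, dst, user_dst=None, block_size=4096):
    """Copy src to dst, replacing each block's PI bytes with 0xff.

    src, dst, and user_dst are buffered binary streams. If user_dst is given,
    it receives the user data only, with the PI bytes stripped.
    Returns the number of blocks processed.
    """
    stride = block_size + PI_LEN
    count = 0
    while True:
        block = src.read(stride)
        if not block:
            break  # clean end of input
        if len(block) != stride:
            raise ValueError("input ended in the middle of a block")
        data = block[:block_size]
        dst.write(data)
        dst.write(b"\xff" * PI_LEN)  # discard the old PI, write the escape value
        if user_dst is not None:
            user_dst.write(data)
        count += 1
    return count
```

Run against in-memory buffers (io.BytesIO) it is easy to check that the user data comes through untouched and only the trailing eight bytes of each block change.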

Usage example:

ddpt if=/dev/sde of=- status=progress iflag=pt bs=4096 --protect=3 | ./fix_t10pi /dev/stdin /dev/stdout >(sha256sum - > somefile.sha256sum) | ddpt if=- of=/dev/sdr oflag=pt bs=4096 --protect=0,3

Easy! Well, that's assuming that you have a decent JBOD or other location that can expose the raw disk to the OS, not RAID... and that's assuming that the JBOD works correctly and doesn't throw lots of SCSI aborts because of a design flaw. I'm not sure why anyone would want to buy a JBOD like that...

The code for this is located at https://github.com/BYUHPC/fix_t10pi. If you need to use this, I extend my condolences to you for actually experiencing this hypothetical scenario.

Acknowledgements

Billy Wilson (who also works at the BYU Office of Research Computing) played a huge role in finding a solution to this very elaborate, implausible, and obviously hypothetical thought exercise that spanned three months.

I wish to sincerely thank Doug Gilbert for his help in figuring this out. He was able to figure out and explain the right method for working around the bad checksums and then writing "good" ones out to disk. It seemed simple after he explained it, but it certainly was not simple for me prior to that. (Doug wrote ddpt and tons of other disk-related tools and is a Linux kernel maintainer for parts of the SCSI code.)

I also sincerely thank Martin K. Petersen for his very valuable insights into the details of T10-PI and how to work with it. Martin introduced us to some very useful low level tools and ways to work with T10-PI.

SEO Attempt

"T10 PI" can be written in multiple ways, and there are all sorts of other ways to refer to T10PI and related ideas. Data Integrity Extension (DIX) and Data Integrity Field (DIF) are two of those other terms. I would like to make sure that people can find this document if it's useful...

By the way, if you benefited from this document I would love to hear about it (and I can extend my condolences to you).
