Environment
The Great Lakes Scratch Filesystem (as of April 2021 the issue remains).
Issue
ARC has recently become aware that archiving data with the tar or archivetar commands can silently lose the contents of very small files (smaller than 3 KB). This happens only on the Great Lakes Scratch filesystem.
Impact
- Any user using tar --sparse | -S or the archivetar utility on Great Lakes scratch
- Data on scratch is safe; only the archived version is corrupted
- The use of those tools or options on other storage, such as Turbo, Lighthouse Scratch, and MiStorage, is not impacted
- The archivetar utility has been updated as of May 4th to not use the --sparse option to tar. Older versions have been removed
Resolution
Data Recovery Options
Impacted data that was deleted from Scratch after April 14 can be recovered via snapshot. Contact arcts-support@umich.edu for the recovery process. Scratch has no backups, so older data is lost.
What do I need to do if I use or have used tar --sparse | -S or the archivetar tool on Great Lakes /scratch?
- Discontinue using the --sparse or -S option to tar on Great Lakes Scratch
- If you still have a known good copy of the data, re-archive it without the --sparse option, or with archivetar version 0.10 or newer (released May 4)
- Future data archived with archivetar that contains sparse files will be much larger than the source data. This can be mitigated by using one of the compression options, such as --gzip, --bzip2, or --lz4.
- Check any data that was archived with the mentioned tools or options for corruption. Corrupted files keep their original size but contain no data. See “Checking Data for Corruption” below for details
- Check with others in your organization to raise awareness of this risk
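The re-archiving step can be sketched as follows, assuming GNU tar. The function creates a gzip-compressed archive without the --sparse option and then uses tar --compare (-d) to check the archive against the files still on disk. The function name, directory name, and archive name are hypothetical placeholders.

```shell
# rearchive SRC_DIR ARCHIVE
# Sketch: archive SRC_DIR into ARCHIVE with gzip and WITHOUT -S / --sparse,
# then compare the archive against the source. Names are hypothetical.
rearchive() {
    src_dir=$1
    archive=$2
    parent=$(dirname "$src_dir")
    name=$(basename "$src_dir")
    # -c create, -z gzip, -f archive file; note the absence of -S / --sparse
    tar -czf "$archive" -C "$parent" "$name" || return 1
    # -d (--compare) exits non-zero if archive members differ from the files on disk
    tar -dzf "$archive" -C "$parent"
}
```

For example, `rearchive /scratch/path/to/project_data project_data.tar.gz` (a hypothetical path) returns a non-zero status if the comparison finds a difference.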
Checking Data for Corruption
Only the archived data has been corrupted. If you still have the original source data, we recommend re-archiving the data rather than checking for corruption.
Expand the suspect archive in a new location before checking it. The impact is restricted to small files. If the data is expanded on scratch, you cannot use the sparse factor or the number of blocks reported by the stat command. If the data is expanded on Turbo or another storage system, you can use the stat command; any file reporting 0 blocks should be suspect.
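The block check can be sketched like this, assuming GNU find and data expanded somewhere other than Great Lakes scratch. It lists non-empty regular files that report 0 allocated blocks. The function name is hypothetical, and paths containing tabs or newlines are not handled.

```shell
# list_zero_block_files DIR
# Print every non-empty regular file under DIR that reports 0 allocated blocks.
# GNU find -printf: %b = allocated 512-byte blocks, %s = size in bytes, %p = path.
list_zero_block_files() {
    find "$1" -type f -size +0c -printf '%b\t%s\t%p\n' |
        awk -F '\t' '$1 == 0 {print $3}'
}
```

Splitting on tabs rather than on any whitespace keeps paths that contain spaces intact.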
Corrupted files will show up as type ‘data’ after archiving, regardless of their type before archiving. You can use the file command to check.
A good file will likely, but not always, report a type other than ‘data’:
$ file goodfile.txt
goodfile.txt: Bourne-Again shell script, ASCII text executable
Files marked ‘data’ are not necessarily corrupted; it depends on your data format. But all corrupted files will show up as type ‘data’:
$ file corrupted.txt
corrupted.txt: data
Corrupted files will show up as having no data when cat’d:
$ cat goodfile.txt
<contents of file>
$ cat corrupted.txt
$
Prints nothing
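A file that is mapped entirely as empty space reads back as a run of NUL bytes, which is why cat appears to print nothing. A minimal sketch of a direct test for that condition, assuming GNU stat and cmp; the function name is hypothetical. A legitimate file can also consist only of NUL bytes, so a match marks a file as suspect rather than proving corruption.

```shell
# is_all_zero FILE
# Exit 0 when FILE is non-empty and every byte is NUL, 1 otherwise,
# and 2 when FILE cannot be examined.
is_all_zero() {
    zsize=$(stat -c '%s' "$1") || return 2
    [ "$zsize" -gt 0 ] || return 1
    # Compare the first $zsize bytes of FILE against /dev/zero (GNU cmp -n).
    cmp -s -n "$zsize" "$1" /dev/zero
}
```

Usage: `is_all_zero corrupted.txt && echo "suspect: corrupted.txt"`.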
Building a list of suspect files can be done by looking at the sparse factor of files. You can do this with the find command.
# Requires modification for paths with spaces in name
$ find . -type f -printf "%S\t%p\n" | gawk '$1 < 1 {print $2}'
You can combine this with the xargs file command to check if they are of type ‘data’ and thus should be checked:
$ find . -type f -printf "%S\t%p\n" | gawk '$1 < 1 {print $2}' | xargs file
The final command to create a list of suspect files to verify is:
$ find . -type f -printf "%S\t%p\n" | gawk '$1 < 1 {print $2}' | xargs file | grep 'data$' | gawk -F: '{print $1}'
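The pipeline above splits on whitespace, so it mishandles paths that contain spaces. One possible variant is sketched below, assuming GNU stat and the file utility. Instead of find's %S sparse factor it compares the allocated bytes reported by stat against the file size, which expresses the same "sparse factor below 1" condition in integer arithmetic. Paths are passed through -exec so they are never re-split. The function name is hypothetical, and as with the stat check, it should be run on data expanded outside Great Lakes scratch.

```shell
# list_suspect_files DIR
# Print non-empty regular files under DIR that have fewer bytes allocated
# than their size and that file(1) classifies as plain "data".
list_suspect_files() {
    find "$1" -type f -size +0c -exec sh -c '
        for f in "$@"; do
            blocks=$(stat -c %b "$f") || continue   # allocated blocks
            bsize=$(stat -c %B "$f") || continue    # bytes per reported block
            size=$(stat -c %s "$f") || continue     # apparent size in bytes
            # Fewer allocated bytes than the size means a sparse factor below 1.
            [ $((blocks * bsize)) -lt "$size" ] || continue
            # file -b prints only the type, without the file name.
            if [ "$(file -b "$f")" = "data" ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh {} +
}
```

Usage: `list_suspect_files . > suspect_files.txt` writes one path per line for manual verification.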
Additional Information
The file system that underlies Great Lakes scratch uses an optimization for very small files: their contents are stored with the metadata (data in inode) in the very fast SSD metadata pool. This increases performance when reading very small files and does not consume any extra space.
When tar is given the --sparse option, it checks for sparse data by comparing the number of consumed blocks against the reported file size. Files whose data is stored in the metadata report 0 blocks. This causes tar to write the file into the archive as a sparse file of exactly the same size as the original non-sparse data, but mapped entirely as empty space.
The result is a file that reports the same size as the original but consumes no space on disk, because it is entirely sparse. tar never copied the actual data because it saw no blocks containing data.
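The size-versus-blocks mismatch can be illustrated on an ordinary file system with a deliberately sparse file. This is a sketch assuming GNU coreutils; the function name is hypothetical. It creates a 2048-byte file that is entirely a hole, prints its size and allocated block count, and dumps its first bytes. On most file systems the hole is not allocated, so the block count is typically 0 while the size stays 2048, which is the same picture a data-in-inode file presents on Great Lakes scratch.

```shell
# sparse_demo
# Create a 2048-byte file consisting only of a hole and show what stat reports.
sparse_demo() {
    demo=$(mktemp) || return 1
    truncate -s 2048 "$demo"
    # %s = apparent size in bytes, %b = allocated blocks
    stat -c 'size=%s blocks=%b' "$demo"
    # Dump the first 16 bytes in hex; a hole reads back as NUL bytes.
    od -An -tx1 -N 16 "$demo"
    rm -f "$demo"
}

sparse_demo
```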