File Formats for Data Curation
A file format defines how information is encoded in a computer file so that an application can recognize and open it. The format is usually indicated by the file name extension. Each file type (such as text, images, or sound) has many file formats available.
Which formats do you use the most?
- .xls (Microsoft Excel)
- .mp3 (for digital audio)
- .docx (Microsoft Word)
- .gdoc (Google Document)
These are all commonly used file formats, but what if we told you that they are not recommended for data curation? They are proprietary, which makes them unsustainable for long-term preservation.
Warm-up - Exercise 1
A dancing club in Saskatoon has kept its documents since the early 2000s. Recently, a club member wanted to refer back to an agenda from July 5, 2003 (please see the WordPerfect file below) but was unable to open it.
Could you download that document and try to recover it on your machine?
Please share your experience in Padlet.
What are proprietary formats?
- The file you tried to open in the warm-up exercise is in an older proprietary format that can no longer be opened. The same inaccessibility can affect other proprietary formats, such as Microsoft Word or Google Docs.
- Proprietary formats are restricted by software patents, a lack of published format specifications, or built-in encryption that prevents open use by the public.
- As a result, using a proprietary format requires specific software provided by a single vendor.
- In some cases, an industry may treat specific file formats as a de facto standard even if the formats are proprietary and rely on expensive software.
We recommend open formats because they are more sustainable and easier to preserve in the long term.
What are open formats?
- They are non-proprietary.
- They are freely available for everyone to use.
- Because their specifications are published, open-source developers can write software that reads the format even if a particular vendor stops supporting it.
- They may decrease the risk of technical obsolescence by removing the dependency on one vendor's underlying technology.
Other Considerations
- Open, standard documentation
- Common usage by the research community
- Standard representation (ASCII, Unicode)
- Unencrypted
- Uncompressed
- File quality
  - Encodings that support higher resolution produce larger files than lower-quality formats
  - The trade-off for higher quality is more storage space and less convenience when sharing the file with others
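Two of the considerations above, standard representation and compression, can be checked roughly from a file's bytes. The following is a minimal sketch rather than a definitive test: the function names are hypothetical, only two well-known compression signatures (gzip and ZIP) are included, and text in other Unicode encodings such as UTF-16 would be reported as unrecognized.

```python
# Leading bytes ("magic numbers") of two common compressed containers.
COMPRESSED_SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
}


def describe_bytes(data: bytes) -> dict:
    """Report whether bytes look like ASCII/UTF-8 text and whether they
    start with a known compression signature (hypothetical helper)."""
    compression = None
    for signature, name in COMPRESSED_SIGNATURES.items():
        if data.startswith(signature):
            compression = name
            break

    # Try the stricter encoding first: every ASCII file is also valid UTF-8.
    encoding = None
    for candidate in ("ascii", "utf-8"):
        try:
            data.decode(candidate)
            encoding = candidate
            break
        except UnicodeDecodeError:
            continue

    return {"encoding": encoding, "compression": compression}


def describe_file(path: str) -> dict:
    """Read a whole file and describe it (fine for small files only)."""
    with open(path, "rb") as handle:
        return describe_bytes(handle.read())
```

A result with `encoding` set to `None` suggests a binary or non-UTF-8 file that deserves a closer look before deposit; it does not by itself mean the format is unsuitable.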
We recommend these common file formats
File Type | Recommended Formats | Avoided Formats |
---|---|---|
Text | XML, ASCII, TXT, PDF, LaTeX, .docx | .doc, .wpd |
Images | TIFF, JPEG2000, PNG, JPEG/JFIF | RAW, Adobe Photoshop, PDF |
Video | MOV, MPEG-2 | .wmv |
Audio | PCM, WAVE, DSD | CD, DVD, .m4p, .mp3, .xmi, .mod |
Dataset | CSV, TSV, .db, .sqlite, Shapefile, .xlsx | .xls |
Web Data | JSON, XML, HTML | |
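The table above can be turned into a small lookup for screening files before deposit. This is only a sketch: the function name `classify` is hypothetical, and the extensions chosen for formats the table names without one (for example `.tif` for TIFF, `.jp2` for JPEG2000, `.tex` for LaTeX, `.psd` for Adobe Photoshop, `.shp` for Shapefile) are assumptions. Entries without an obvious single extension (ASCII, PCM, DSD, RAW, CD, DVD, MPEG-2) are left out.

```python
from pathlib import Path

# Extensions transcribed from the table; see the assumptions noted above.
RECOMMENDED = {
    ".xml", ".txt", ".pdf", ".tex", ".docx",       # text
    ".tif", ".tiff", ".jp2", ".png", ".jpg", ".jpeg",  # images
    ".mov",                                         # video
    ".wav",                                         # audio
    ".csv", ".tsv", ".db", ".sqlite", ".shp", ".xlsx",  # datasets
    ".json", ".html",                               # web data
}
AVOIDED = {".doc", ".wpd", ".psd", ".wmv", ".m4p", ".mp3", ".xmi", ".mod", ".xls"}


def classify(filename: str) -> str:
    """Return 'avoid', 'recommended', or 'unknown' based on the extension only."""
    extension = Path(filename).suffix.lower()
    if extension in AVOIDED:
        return "avoid"
    if extension in RECOMMENDED:
        return "recommended"
    return "unknown"


for name in ["agenda.wpd", "prices.csv", "notes"]:
    print(name, "->", classify(name))
```

Because the lookup sees only the extension, it cannot capture context from the table: PDF is listed as recommended for text but avoided for images, and this sketch reports `.pdf` as recommended in both cases.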
Exercise 2
Please help us to preserve datasets for the long term.
Access a dataset:
Yarmand, Shahram, 2019, “Replication data for: Stochastic and Deterministic Modeling of the Future Price of Crude oil and Bottled Water”, https://doi.org/10.5683/SP2/VPF8J8, Borealis, V1
Download one Excel (.xls) file and convert it to CSV.
Please share your experience in Padlet.
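If you would rather script the conversion than use a spreadsheet application, one possible approach is sketched below. It assumes the third-party pandas library is installed, along with xlrd, which pandas needs to read legacy .xls files. The function names and the output naming scheme are hypothetical and are not part of the exercise.

```python
import re
from pathlib import Path


def csv_name(workbook_path: str, sheet_name: str) -> str:
    """Build an output name such as 'prices_Sheet_1.csv' (hypothetical scheme)."""
    safe_sheet = re.sub(r"[^A-Za-z0-9_-]+", "_", sheet_name).strip("_")
    return f"{Path(workbook_path).stem}_{safe_sheet}.csv"


def convert_workbook(workbook_path: str, out_dir: str = ".") -> list:
    """Write every sheet of an Excel workbook to its own UTF-8 CSV file."""
    # Imported here so the helper above works without pandas installed.
    import pandas as pd

    # sheet_name=None loads all sheets as a dict of {sheet name: DataFrame}.
    sheets = pd.read_excel(workbook_path, sheet_name=None)
    written = []
    for sheet_name, frame in sheets.items():
        target = Path(out_dir) / csv_name(workbook_path, sheet_name)
        frame.to_csv(target, index=False, encoding="utf-8")
        written.append(str(target))
    return written
```

Calling `convert_workbook("downloaded_file.xls")`, where the file name is a placeholder for whichever file you downloaded, would write one CSV per sheet into the current directory. CSV stores cell values only, so formulas, formatting, and charts in the workbook are not carried over; it is worth comparing the output against the original.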
Congrats!
Now you know which file formats are suitable for data curation, so your research data can be preserved for the long term!
Need help?
Please reach out to research.data@ubc.ca for assistance with any of your research data questions.