LibGuides: Data Management Planning: File Naming and Formats

Best Practices for File Naming

How you choose to name your data files has a large impact on your ability to find and understand those files later on. File names should remain consistent, logical, and descriptive in order to maximize accessibility and findability. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

When choosing a file name, check for any database management limitations on file name length and use of special characters. In general, lower-case names are less dependent on particular software and platforms. Avoid using spaces and special characters in file names, directory paths, and field names. Instead, consider using underscores ( _ ) or dashes ( - ) to separate meaningful parts of file names. Avoid $ % ^ & # | : and similar.
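As a rough illustration of these rules, the Python sketch below cleans up an existing file name. The function name sanitize_file_name and the choice to keep only ASCII letters, digits, underscores, and dashes are assumptions made for this example, not part of the DataONE guidance.

```python
import re


def sanitize_file_name(name: str, lowercase: bool = True) -> str:
    """Return `name` with spaces and special characters replaced.

    Hypothetical helper for illustration only.
    """
    # Split off the extension so the final dot is preserved.
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Replace each run of characters other than ASCII letters, digits,
    # underscores, or dashes with a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    # Keep only letters and digits in the extension.
    ext = re.sub(r"[^A-Za-z0-9]+", "", ext)
    cleaned = f"{stem}.{ext}" if ext else stem
    return cleaned.lower() if lowercase else cleaned


print(sanitize_file_name("Field Notes (2001) $final.CSV"))
# prints: field_notes_2001_final.csv
```

Passing lowercase=False keeps the original capitalization, which may be preferable when acronyms carry meaning.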

Example of Descriptive File Name:

Sevilleta_LTER_NM_2001_NPP.csv

Sevilleta_LTER is the project name
NM is the state abbreviation
2001 is the calendar year
NPP represents Net Primary Productivity data
csv is the file type: ASCII comma-separated values
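The naming pattern in this example can be sketched in Python as follows. The function names build_file_name and parse_file_name are hypothetical, and the parser assumes the last three underscore-separated parts are always state, year, and data type.

```python
from pathlib import Path


def build_file_name(project, state, year, data_type, file_type):
    """Join the descriptive parts with underscores and add the file type."""
    parts = [project, state, str(year), data_type]
    return "_".join(parts) + "." + file_type


def parse_file_name(file_name):
    """Split a name built by build_file_name back into its parts."""
    path = Path(file_name)
    # Split from the right because the project name may itself
    # contain underscores (for example "Sevilleta_LTER").
    project, state, year, data_type = path.stem.rsplit("_", 3)
    return {
        "project": project,
        "state": state,
        "year": int(year),
        "data_type": data_type,
        "file_type": path.suffix.lstrip("."),
    }


print(build_file_name("Sevilleta_LTER", "NM", 2001, "NPP", "csv"))
# prints: Sevilleta_LTER_NM_2001_NPP.csv
```

Building names from a fixed list of parts keeps them consistent across a project and makes them easy to parse later.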

When organizing these data files together, the name of the top-level folder should include the project title, unique identifier, and date (year).

Source: Adapted from DataONE

Best Practices for File Formats

The file formats you choose now will affect your own ability to open the data in the future as well as others' ability to access the data.

Non-proprietary (open) file formats maximize access to the data and are more sustainable for the future. Consider migrating your data into an open format in addition to keeping a copy in the original software format. If it is necessary to use a proprietary file format, make sure to include the name and version of the software used to generate the file, as well as the company that made the software, in a readme.txt.

File formats should also be:

Unencrypted
Uncompressed
In common use by your research community
In a standard representation
Based on an open, documented standard

Preferred file formats include:

Containers: TAR, GZIP, ZIP
Databases: XML, CSV
Geospatial: NetCDF, GeoTIFF, DBF, SHP
Moving images: MOV, MPEG, AVI, MXF
Sounds: WAVE, AIFF, MP3, MXF
Statistics: ASCII, DTA, POR, SAS, SAV
Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
Tabular data: CSV
Text: XML, PDF/A, HTML, ASCII, UTF-8
Web archive: WARC
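One way to act on this list is to flag files whose extension does not correspond to a preferred format. The Python sketch below is illustrative only: the extension-to-category mapping covers just a few of the formats above, and the specific extensions chosen (for example .nc for NetCDF) are assumptions rather than part of the source list.

```python
from pathlib import Path

# Partial, hypothetical mapping from file extensions to the
# categories in the list above. Extend it to suit your project.
PREFERRED_EXTENSIONS = {
    ".csv": "Tabular data",
    ".tif": "Still images",
    ".tiff": "Still images",
    ".png": "Still images",
    ".wav": "Sounds",
    ".nc": "Geospatial",
    ".xml": "Text",
    ".warc": "Web archive",
}


def preferred_category(file_name):
    """Return the category for a preferred format, or None if unlisted."""
    return PREFERRED_EXTENSIONS.get(Path(file_name).suffix.lower())


for name in ["Sevilleta_LTER_NM_2001_NPP.csv", "notes.docx"]:
    category = preferred_category(name)
    print(name, "->", category or "not in the preferred list")
```

A file that is not in the mapping is not necessarily unsuitable; it is only a prompt to check the format by hand.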

Data repositories treat file formats differently so make sure to research what your chosen archive accepts. Note that not all repositories are able to migrate data files to new file formats for preservation.

Data Versioning

Versioning refers to saving new copies of your files when you make changes so that you can go back and retrieve specific versions of your files later. This is especially useful in collaborations, so that researchers on different teams know when changes have been made. Versioning also allows you to return to an earlier version of the data rather than retracing your missteps.

General Rules:

Record the date of the update and the collaborator involved in the file name of each version.
Create a revision history document in a readme.txt to make it easier for collaborators to coordinate and understand changes.
Consider archiving older versions of the data in a separate folder or archive system.
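A minimal Python sketch of the first two rules is shown below. The function names, the YYYYMMDD date format, the tab-separated layout of the revision history line, and the collaborator ID "jdoe" are all assumptions for illustration; adapt them to your project's conventions.

```python
from datetime import date
from pathlib import Path


def versioned_name(file_name, collaborator, updated=None):
    """Add the update date and collaborator to a file name."""
    updated = updated or date.today()
    path = Path(file_name)
    return f"{path.stem}_{updated:%Y%m%d}_{collaborator}{path.suffix}"


def revision_entry(file_name, collaborator, summary, updated=None):
    """Build one line for a revision history kept in readme.txt."""
    updated = updated or date.today()
    return f"{updated.isoformat()}\t{collaborator}\t{file_name}\t{summary}"


new_name = versioned_name("Sevilleta_LTER_NM_2001_NPP.csv", "jdoe", date(2024, 3, 5))
print(new_name)
# prints: Sevilleta_LTER_NM_2001_NPP_20240305_jdoe.csv
print(revision_entry(new_name, "jdoe", "Corrected plot IDs", date(2024, 3, 5)))
```

Writing the date as year, month, day means that versions sort chronologically when listed by name.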

Manual File Versioning:

One of the simplest ways to version is to manually save a new version each time you make significant changes. This method is best used when only one person is working on the files and few versions are needed.

Software File Versioning:

CSUF Student Dropbox Business

This file sharing software records version changes.

Every time you press "save" on a document, a new version is saved in the version history.
You can preview and revert to a previous version at any time.
Documents can be shared with others, and Dropbox will track who uploaded or updated each file and when.
Any type of document can be stored and versioned with Dropbox.
Cons: Version history is not updated when a file is edited offline.

Google Drive

Drive's word processing, spreadsheet, and presentation software automatically create versions as you edit.

Any time you edit files on Google Drive, new versions are saved as you go.
Version information includes who was editing the file and the date and time the new version was created.
You can also see what changes were made from one version to the next and revert to a previous version at any time.
Cons: You are restricted to Google software, which may not work with your research.

Git

Git is a free and open source distributed version control system with more features than the above options. Unlike the previous software options, Git is designed specifically for version tracking.

Files are kept in a repository. Users clone a copy of the repository to edit, then commit their changes and push them back to the shared repository when they are done.
This system is often used by groups writing software and code, but it can be used for any kind of file or project.
