NORM File Format by xkcd is licensed CC BY NC
The organization of your data is vital to make sure it can be used and understood by future users. File organization and naming conventions are often unique to the project and can be highly personalized. The important thing is to be consistent and to write the conventions down. Spending a little time on file management strategies early in the project planning process can save lots of time (and headaches) later. After determining conventions for file naming and organization, document and share them with collaborators and anyone else who may need access to the data. Groups should establish a convention and save it to a shared space so that everyone can follow the same conventions.
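This page gives an introduction to file structure, file naming conventions, file formats for preservation and reuse, and tools for renaming files in bulk.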
It is important to use a consistent file structure in order to ensure all of your files can be found.
This file structure should be recorded in your readme.txt file and in your data documentation. The readme.txt file should sit at the top of the file structure hierarchy so that it is easy to find.
Try to keep raw data, processed data, code and outputs in separate folders in order to avoid confusion.
The names and folders should follow a file naming convention (see File Naming Conventions).
The exact file structure can differ according to the needs of the researcher.
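As a rough illustration of the points above, the following Python sketch creates separate folders for raw data, processed data, code and outputs, and records the layout in a readme.txt at the top of the hierarchy. The folder names and the function name `create_project_skeleton` are assumptions made for this example, not a prescribed standard.

```python
from pathlib import Path

# Hypothetical folder names; adapt them to the needs of your own project.
FOLDERS = ["data/raw", "data/processed", "code", "outputs"]


def create_project_skeleton(root):
    """Create the project folders and a top-level readme.txt that lists them."""
    root = Path(root)
    for folder in FOLDERS:
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        (root / folder).mkdir(parents=True, exist_ok=True)

    readme = root / "readme.txt"
    if not readme.exists():  # never overwrite documentation that already exists
        lines = ["Project file structure", ""] + [f"{name}/" for name in FOLDERS]
        readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return root
```

Calling `create_project_skeleton("my_project")` would create the folders under `my_project` and leave any existing readme.txt untouched.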
The TIER protocol offers a framework and an online collaborative tool for organizing data. See the figure below for a recommended file structure to begin with:
Organizing data figure by Project TIER: licensed CC BY NC
Content adapted from: UR
It is useful to establish a best practice for file naming in order to manage both paper and electronic records. Consistent naming conventions make sorting more predictable and files easier to find, establish uniformity, give clues to file and folder contents, and support version control. Here are some things to keep in mind, but remember to use only the ideas that will serve your collection best:
Use leading zeroes in numbers so that files sort in numeric order. For example, 07 will sort before 10, but 7 will not. Consider how many files you will have, and use enough digits to cover them (e.g., 01-99 for fewer than 100 files, 001-999 for fewer than 1,000).
If you want files to be organized first by date, then the date should come first. If you want to organize first by project or experiment name, then that name should come first.
For sorting by date, date order should be YYYY-MM-DD (e.g. Program_GeneseoAuthors_2022-10-26.docx)
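The effect of leading zeroes and of the YYYY-MM-DD date order can be checked with a short Python sketch. It assumes a plain character-by-character sort, which is what many tools apply to file names; the file names and the helper `padded_name` are made up for this example.

```python
from datetime import date


def padded_name(prefix, number, width, extension):
    """Build a name such as sample_07.csv with a zero-padded number."""
    return f"{prefix}_{number:0{width}d}.{extension}"


numbers = [7, 70, 10]
unpadded = sorted(f"sample_{n}.csv" for n in numbers)
padded = sorted(padded_name("sample", n, 2, "csv") for n in numbers)

days = [date(2022, 10, 26), date(2022, 2, 3), date(2021, 12, 31)]
iso_names = sorted(f"{d.isoformat()}_notes.txt" for d in days)
us_names = sorted(f"{d:%m-%d-%Y}_notes.txt" for d in days)

print(unpadded)   # ['sample_10.csv', 'sample_7.csv', 'sample_70.csv']
print(padded)     # ['sample_07.csv', 'sample_10.csv', 'sample_70.csv']
print(iso_names)  # ['2021-12-31_notes.txt', '2022-02-03_notes.txt', '2022-10-26_notes.txt']
print(us_names)   # ['02-03-2022_notes.txt', '10-26-2022_notes.txt', '12-31-2021_notes.txt']
```

Only the padded numbers and the YYYY-MM-DD dates come out in numeric and chronological order.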
Project name, experiment name or acronym
Initials or name of researcher
Date or range of dates when data was collected
Location or spatial information
Type of data
Type of analysis
Conditions
Description of experiment
Unique identifier
Language
Name or pseudonym of interviewee
Sample name
Version number of file (with leading zeroes)
Three-letter file extension for the file format
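A few of the elements listed above can be combined into a file name with a small helper. The sketch below is one possible convention, not a required one: the element order, the underscore separator, the `v` prefix on the version, and the function names `clean` and `build_filename` are all assumptions made for illustration.

```python
import re
from datetime import date


def clean(part):
    """Drop everything except letters, digits and hyphens (an assumed rule)."""
    return re.sub(r"[^A-Za-z0-9-]+", "", part)


def build_filename(project, researcher, collected, version, extension):
    """Join project, researcher, collection date and version into one name."""
    parts = [
        clean(project),
        clean(researcher),
        collected.isoformat(),  # YYYY-MM-DD
        f"v{version:02d}",      # version number with a leading zero
    ]
    return "_".join(parts) + "." + extension.lstrip(".")


print(build_filename("Soil Survey", "JD", date(2022, 10, 26), 3, "csv"))
# SoilSurvey_JD_2022-10-26_v03.csv
```

Putting the project name first means files group by project; moving the date to the front would group them by date instead.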
"Working" file formats (i.e, those used when collecting and working with project data) are not always ideal for re-use or long-term preservation. They may not meet the requirements of data archives or repositories, or satisfy research funders' requirements.
We offer the following general guidelines for selecting file formats for preservation and reuse. KnightScholar, a repository service provided by the Library, supports many file types. More information can be found on our policies page.
"File extensions" by xkcd is licensed CC BY NC
Open, non-proprietary formats are far more likely to remain usable even if the software that created them is no longer available or functional. Formats whose documentation is complete and freely available also have a higher likelihood of long-term preservation. If the program that created the file is the only option for reading or accessing the data, the file is likely to be in a proprietary, non-open format. As a general rule, plain text formats, such as comma- or tab-delimited files, are open formats and are typically better for re-use and long-term preservation.
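As a small illustration of a plain text format, the Python sketch below writes a table to a comma-delimited file and reads it back using only the standard library. The function names, the UTF-8 encoding choice, and the sample values are assumptions made for this example.

```python
import csv


def write_csv(path, header, rows):
    """Write a header and data rows to a comma-delimited UTF-8 text file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path):
    """Read the file back as a list of rows; every value comes back as text."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

For example, `write_csv("samples.csv", ["site", "depth_cm"], [["A", 10], ["B", 25]])` produces a file that any text editor or spreadsheet program can open.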
Formats that compress the information in a file often produce smaller files, but some kinds of compression permanently remove data from the file. Such formats are "lossy," while formats that restore all of the original information when uncompressed are "lossless."
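The lossless case can be demonstrated with Python's standard `zlib` module: the data is compressed, decompressed, and compared byte for byte with the original. The sample data is made up for this sketch, and it does not demonstrate a lossy format, where such a comparison would fail.

```python
import zlib

# Made-up sample data: a header line followed by 1000 identical readings.
original = b"temperature,humidity\n" + b"21.5,40\n" * 1000

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original))                   # 8021
print(len(compressed) < len(original)) # True
print(restored == original)            # True: nothing was lost
```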
If the encryption key, passphrase, or password to a file is lost, there may be no way to retrieve the data from the file later, rendering it unusable to others.
Uncompiled source code is more readily re-usable by others and has a far greater likelihood of remaining usable over time since recompiling is possible on different architectures and platforms.
Content adapted from File Management and File Formats for Preservation by RDMS Cornell University which is licensed CC BY.
Windows File Explorer and the macOS Finder both allow you to rename multiple files at once.
Bulk Rename Utility (Windows)
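Files can also be renamed in bulk with a short script. The Python sketch below is a minimal example, assuming you want to replace one piece of text in every file name within a single folder; the function names `plan_renames` and `apply_renames` are invented for this illustration. It builds the list of planned renames first so that they can be reviewed before anything changes.

```python
from pathlib import Path


def plan_renames(folder, old, new):
    """Return (source, target) pairs for files whose name contains `old`."""
    pairs = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and old in path.name:
            pairs.append((path, path.with_name(path.name.replace(old, new))))
    return pairs


def apply_renames(pairs):
    """Rename each file, refusing to overwrite a file that already exists."""
    for source, target in pairs:
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        source.rename(target)
```

A cautious workflow is to print the result of `plan_renames("my_folder", " ", "_")`, check it, and only then pass it to `apply_renames`. Working on a copy of the folder first is a sensible precaution.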