Library Website : Research Data Management at SUNY Geneseo: Organization

Data Organization Overview

NORM File Format by xkcd is licensed CC BY NC

The organization of your data is vital to make sure it can be used and understood by future users. File organization and naming conventions are often unique to the project and can be highly personalized. The important thing is to be consistent and to write the conventions down. Spending a little time on file management strategies early in the project planning process can save lots of time (and headaches) later. After determining conventions for file naming and organization, document and share them with collaborators and anyone else who may need access to the data. Groups should establish a convention and save it to a shared space so that everyone can follow the same conventions.

This page gives an introduction to:

Structuring your files
File naming conventions
File types
File renaming tools

Structuring your files

It is important to use a consistent file structure in order to ensure all of your files can be found.
This file structure should be recorded in your readme.txt file and in your data documentation. This readme.txt file should be located at the top of the file structure hierarchy so it can be easy to find.
Try to keep raw data, processed data, code and outputs in separate folders in order to avoid confusion.
The names and folders should follow a file naming convention (see File Naming Conventions).
The exact file structure can differ according to the needs of the researcher.

The TIER protocol offers a framework and an online collaborative tool for organizing data. See the figure below for a recommended file structure to begin with:

Infographic showing data organization. Details listed nearby.

Organizing data figure by Project TIER: licensed CC BY NC

Folder: Project/
- File: The Read Me File
- File: The Report
- Folder: Data/
  - Folder: InputData/
    - File: Input Data Files
    - Folder: Metadata/
      - File: Data Sources Guide
      - File: Codebooks
  - Folder: AnalysisData/
    - File: Analysis Data Files
    - File: The Data Appendix
  - Folder: IntermediateData/
- Folder: Scripts/
  - Folder: ProcessingScripts/
  - Folder: DataAppendixScripts/
  - Folder: AnalysisScripts/
  - File: The Master Script
- Folder: Output/
  - Folder: DataAppendixOutput
  - Folder: Results

Content adapted from: UR

File Naming Conventions

It is useful to establish a best practice for file naming in order to manage both paper and electronic records. The consistent use of naming conventions makes sorting more predictable, finding files easier, establishing uniformity, gives clues to file and folder contents, and version control. Here are things to keep in mind, but remember to use only the ideas that will serve your collection best!:

Be consistent
Create names that will allow useful sorting (see below for more tips for sorting)
Include only alphanumeric characters
Keep names as short as possible and make them easy to read (Windows, OS X, and Linux all limit names to 255 characters, but under 32 characters is advised)
Use camel case or snake case to distinguish words (e.g. LiveOnStage01.docx or Program_GeneseoAuthors_2022-10-26.xlsx )
Avoid spaces, abbreviations, and most symbols except underscore “_” and hyphen “-“ (hyphens should only be used in the root filename, preferably for dates)
Use the filename for version control (e.g. CollectionDevPolicy_rev2019-02Feb.docx, Minutes_draft_2018-08-18.pdf, Minutes_final_2018-08-18.pdf)
Consider putting the initials of the author in the filename (e.g. DMP_Draft_2023-02-20_tlp.docx)
Avoid generic file names
Avoid using acronym names that cannot be easily understood, or are not explained in the readme.txt file.

Tips for naming your files for easy sorting

Use leading zeroes when it comes to numbers. 07 will sort above 70, but 7 will not. Consider how many files you will have, and use that many digits. (i.e., less than 100 use 01-99. More than 100 use 001-999.)
If you want files to be organized first by date, then date should be first. If you want to organize first by project/experiment name, then the that name should come first.
For sorting by date, date order should be YYYY-MM-DD (e.g. Program_GeneseoAuthors_2022-10-26.docx)

Consider including these descriptors in your File Naming Convention Record:

Project name, experiment name or acronym
Initials or name of researcher
Date or range of dates when data was collected
Location or spatial information
Type of data
Type of analysis
Conditions
Description of experiment
Unique identifier
Language
Name or pseudonym of interviewee
Sample name
Version number of file (with leading zeroes)
Three letter file extension for the file format

Content adapted from: Cornell, UR

File Formats

"Working" file formats (i.e, those used when collecting and working with project data) are not always ideal for re-use or long-term preservation. They may not meet the requirements of data archives or repositories, or satisfy research funders' requirements.

We offer the following general guidelines for selecting file formats for preservation and reuse. KnightScholar, a repository service provided by the Library supports many file types. More information can be found on our policies page.

"File extensions" by xkcd is licensed CC BY NC

Principles for selecting file formats

Select open, non-proprietary formats

Open, non-proprietary formats are far more likely to remain usable even if the software that created them is not available or no longer functional. Formats whose documentation is complete and freely available also have a higher likelihood of long-term preservation. If the program that created the file is the only option for reading or accessing the data, it is likely to be a proprietary, non-open format. As a general rule, plain text formats, such as comma- or tab- delimited files, are open formats and are typically better for re-use and long-term preservation.

Example of a proprietary format: Photoshop .psd file
Example of an open format: .tiff image file

Select "lossless" formats

Formats that compress the information in a file are often smaller, but the compression often permanently removes data from the file. These formats are "lossy," while formats that do not result in the loss of information when uncompressed are "lossless."

Example of lossy formats: .mp3 audio file, .jpeg image file
Example of lossless formats: .wav audio file, .tiff image file

Select unencrypted and uncompiled formats

If the encryption key, passphrase, or password to a file is lost, there may be no way to retrieve the data from the file later, rendering it unusable to others. Uncompiled source code is more readily re-usable by others and has a far greater likelihood of remaining usable over time since recompiling is possible on different architectures and platforms.

Content adapted from File Management and File Formats for Preservation by RDMS Cornell University which is licensed CC BY.

File Renaming Tools

The Windows Explorer and the Mac Finder will allow you to rename multiple files at once.
Bulk Rename Utility (Windows)

Research Data Management at SUNY Geneseo

Data Support Services Available from the Library