Skip to Main Content

Research Data Management at SUNY Geneseo

Data management includes the processes of collecting, organizing, describing, sharing, and preserving data. This guide will help you plan for and prepare a data management plan and learn about the data lifecycle and research data management.

Data Organization Overview

NORM File Format by xkcd is licensed CC BY NC 

The organization of your data is vital to make sure it can be used and understood by future users. File organization and naming conventions are often unique to the project and can be highly personalized. The important thing is to be consistent and to write the conventions down. Spending a little time on file management strategies early in the project planning process can save lots of time (and headaches) later. After determining conventions for file naming and organization, document and share them with collaborators and anyone else who may need access to the data. Groups should establish a convention and save it to a shared space so that everyone can follow the same conventions.

This page gives an introduction to:

  • Structuring your files
  • File naming conventions
  • File types
  • File renaming tools

Structuring your files

  • It is important to use a consistent file structure in order to ensure all of your files can be found.

  • This file structure should be recorded in your readme.txt file and in your data documentation. This readme.txt file should be located at the top of the file structure hierarchy so it can be easy to find.  

  • Try to keep raw data, processed data, code and outputs in separate folders in order to avoid confusion. 

  • The names and folders should follow a file naming convention (see File Naming Conventions). 

  • The exact file structure can differ according to the needs of the researcher.

The TIER protocol offers a framework and an online collaborative tool for organizing data. See the figure below for a recommended file structure to begin with:

Infographic showing data organization. Details listed nearby.

Organizing data figure by Project TIER: licensed CC BY NC

 

  • Folder: Project/
    • File: The Read Me File
    • File: The Report
    • Folder: Data/
      • Folder: InputData/
        • File: Input Data Files
        • Folder: Metadata/
          • File: Data Sources Guide
          • File: Codebooks
      • Folder: AnalysisData/
        • File: Analysis Data Files
        • File: The Data Appendix
      • Folder: IntermediateData/
    • Folder: Scripts/
      • Folder: ProcessingScripts/
      • Folder: DataAppendixScripts/
      • Folder: AnalysisScripts/
      • File: The Master Script
    • Folder: Output/
      • Folder: DataAppendixOutput
      • Folder: Results

Content adapted from: UR

File Naming Conventions

It is useful to establish a best practice for file naming in order to manage both paper and electronic records. The consistent use of naming conventions makes sorting more predictable, finding files easier, establishing uniformity, gives clues to file and folder contents, and version control. Here are things to keep in mind, but remember to use only the ideas that will serve your collection best!:

  • Be consistent
  • Create names that will allow useful sorting (see below for more tips for sorting)
  • Include only alphanumeric characters
  • Keep names as short as possible and make them easy to read (Windows, OS X, and Linux all limit names to 255 characters, but under 32 characters is advised)
  • Use camel case or snake case to distinguish words (e.g. LiveOnStage01.docx or Program_GeneseoAuthors_2022-10-26.xlsx )
  • Avoid spaces, abbreviations, and most symbols except underscore “_” and hyphen “-“ (hyphens should only be used in the root filename, preferably for dates)
  • Use the filename for version control (e.g. CollectionDevPolicy_rev2019-02Feb.docx, Minutes_draft_2018-08-18.pdf, Minutes_final_2018-08-18.pdf)
  • Consider putting the initials of the author in the filename (e.g. DMP_Draft_2023-02-20_tlp.docx)
  • Avoid generic file names
  • Avoid using acronym names that cannot be easily understood, or are not explained in the readme.txt file.

Tips for naming your files for easy sorting

  • Use leading zeroes when it comes to numbers. 07 will sort above 70, but 7 will not. Consider how many files you will have, and use that many digits. (i.e., less than 100 use 01-99. More than 100 use 001-999.)

  • If you want files to be organized first by date, then date should be first. If you want to organize first by project/experiment name, then the that name should come first.

  • For sorting by date, date order should be YYYY-MM-DD (e.g. Program_GeneseoAuthors_2022-10-26.docx)

Consider including these descriptors in your File Naming Convention Record:

  • Project name, experiment name or acronym 

  • Initials or name of researcher

  • Date or range of dates when data was collected

  • Location or spatial information

  • Type of data

  • Type of analysis

  • Conditions

  • Description of experiment

  • Unique identifier

  • Language

  • Name or pseudonym of interviewee

  • Sample name

  • Version number of file (with leading zeroes)

  • Three letter file extension for the file format

 

Content adapted from: Cornell, UR

File Formats

"Working" file formats (i.e, those used when collecting and working with project data) are not always ideal for re-use or long-term preservation. They may not meet the requirements of data archives or repositories, or satisfy research funders' requirements.

We offer the following general guidelines for selecting file formats for preservation and reuse. KnightScholar, a repository service provided by the Library supports many file types. More information can be found on our policies page.

"File extensions" by xkcd is licensed CC BY NC

Principles for selecting file formats

Select open, non-proprietary formats

Open, non-proprietary formats are far more likely to remain usable even if the software that created them is not available or no longer functional. Formats whose documentation is complete and freely available also have a higher likelihood of long-term preservation. If the program that created the file is the only option for reading or accessing the data, it is likely to be a proprietary, non-open format. As a general rule, plain text formats, such as comma- or tab- delimited files, are open formats and are typically better for re-use and long-term preservation.

  • Example of a proprietary format: Photoshop .psd file
  • Example of an open format: .tiff image file

Select "lossless" formats

Formats that compress the information in a file are often smaller, but the compression often permanently removes data from the file. These formats are "lossy," while formats that do not result in the loss of information when uncompressed are "lossless."

  • Example of lossy formats: .mp3 audio file, .jpeg image file
  • Example of lossless formats: .wav audio file, .tiff image file

Select unencrypted and uncompiled formats

If the encryption key, passphrase, or password to a file is lost, there may be no way to retrieve the data from the file later, rendering it unusable to others. Uncompiled source code is more readily re-usable by others and has a far greater likelihood of remaining usable over time since recompiling is possible on different architectures and platforms.

Content adapted from File Management and File Formats for Preservation by RDMS Cornell University which is licensed CC BY.

File Renaming Tools