There are several important elements to digital preservation, including data protection, backup, and archiving. In this lesson, these concepts are introduced and best practices are highlighted, with case-study examples of how things can go wrong. By exploring the logistical, technical, and policy implications of data preservation, participants will be able to identify their preservation needs and be ready to implement good data preservation practices by the end of the module.
Different types of new data may be created in the course of a project, for instance visualizations, plots, statistical outputs, or a new dataset created by integrating multiple source datasets. Whenever possible, document your workflow (the process used to clean, analyze, and visualize data), noting what data products are created at each step. Depending on the nature of the project, this documentation might take the form of a computer script, or it may be notes in a text file describing the process you used (i.e. process metadata). If workflows are preserved along with the data products, they can be re-executed, enabling those products to be reproduced.
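A scripted workflow can serve as its own process metadata. The sketch below is a minimal illustration, not a prescribed method: the field names, value ranges, and sample records are all hypothetical, and the step comments note which data product each stage creates.

```python
"""Workflow script that doubles as process metadata.

Step comments record which data product each stage creates.
Field names, ranges, and sample values are illustrative only.
"""

def clean(rows):
    # Step 1 -> data product: cleaned records.
    # Drop records with missing or physically implausible temperatures.
    return [r for r in rows
            if r.get("temp_c") not in (None, "")
            and -60.0 <= float(r["temp_c"]) <= 60.0]

def summarize(rows):
    # Step 2 -> data product: summary statistics derived from cleaned data.
    temps = [float(r["temp_c"]) for r in rows]
    return {"n": len(temps), "mean_temp_c": sum(temps) / len(temps)}

raw = [  # stand-in for records read from a raw data file
    {"temp_c": "12.1"}, {"temp_c": ""},
    {"temp_c": "999"}, {"temp_c": "14.3"},
]
cleaned = clean(raw)
summary = summarize(cleaned)
print(summary)
```

Because each step is a function with a documented product, the same script can be re-run later to regenerate the derived data from the raw inputs.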
When describing the process for creating derived data products, the following information should be included in the data documentation or the companion metadata file.
When searching for data, whether locally on one’s machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access the data. To reproduce the search and obtain similar, if not identical, results, it is necessary to document which terms and queries were used.
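One lightweight way to document searches is to append each one to a log as it is run. The helper below is a hypothetical sketch (the function and field names are not from any particular tool); the point is simply that the exact source, query string, and result count are captured with a timestamp.

```python
import datetime
import json

def log_query(logfile, source, query, n_results):
    """Append one search record to a log file, one JSON object per line.

    Illustrative only: field names are assumptions, not a standard.
    """
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source": source,        # repository or database searched
        "query": query,          # exact terms or query string used
        "n_results": n_results,  # number of results returned at the time
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

A line-per-record log like this can later be replayed to re-run each documented query against the same source.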
Follow the steps below to choose the most appropriate software to meet your needs:
Identify what you want to achieve (discover data, analyze data, write a paper, etc.)
Identify the necessary software features for your project (i.e. functional requirements)
Identify logistics features of the software that are required, such as licensing, cost, time constraints, user expertise, etc. (i.e. non-functional requirements)
Determine what software has been used by others with similar requirements
Ask around (yes, really); find out what people like
Find out what software your institution has licensed
Search the web (e.g. directory services, open source sites, forums)
Follow up with an independent assessment
Generate a list of software candidates
Evaluate the list; iterate back to Step 1 as needed
As feasible, try a few software candidates that seem promising
Outliers may not be the result of actual observations, but rather of errors in data collection, data recording, or other parts of the data life cycle. Statistical and graphical methods can be used to identify outliers for closer examination.
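One common statistical screen is the interquartile-range (IQR) rule: flag values falling outside [Q1 − k·IQR, Q3 + k·IQR], conventionally with k = 1.5. The sketch below is a minimal, dependency-free version of that rule; a flagged value is a candidate for review against field notes, not automatically an error.

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for closer examination."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Simple linear-interpolation quantile of the sorted data.
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))
```

Here the value 95 is flagged, prompting a check of the original data sheet, while a dataset with no extreme values yields an empty list.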
Understand the input geospatial data parameters, including scale, map projection, geographic datum, and resolution, when integrating data from multiple sources. Care should be taken to ensure that the geospatial parameters of the source datasets can be legitimately combined.
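A simple pre-combination check is to compare those parameters field by field across the source datasets. The sketch below assumes the parameters are held in plain metadata dicts (a hypothetical structure, not a standard API); any mismatches it reports should be reconciled, e.g. by reprojection or resampling, before the data are integrated.

```python
def incompatible_fields(a, b, fields=("projection", "datum", "resolution_m")):
    """Return the geospatial parameters on which two dataset
    descriptions disagree. `a` and `b` are plain metadata dicts
    (illustrative structure only)."""
    return [f for f in fields if a.get(f) != b.get(f)]

# Hypothetical parameter records for two source datasets.
site_a = {"projection": "UTM Zone 10N", "datum": "NAD83", "resolution_m": 30}
site_b = {"projection": "UTM Zone 10N", "datum": "WGS84", "resolution_m": 30}

print(incompatible_fields(site_a, site_b))  # the datum differs
```

In practice a GIS library would perform the actual datum transformation; the check here only makes the mismatch visible before combining.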
Understanding the types, processes, and frameworks of workflows and analyses is helpful for researchers seeking to understand more about research, how it was created, and what it may be used for. This lesson uses a subset of data analysis types to introduce reproducibility, iterative analysis, documentation, provenance and different types of processes. Described in more detail are the benefits of documenting and establishing informal (conceptual) and formal (executable) workflows.
Data citation is a key practice that supports the recognition of data creation as a primary research output rather than as a mere byproduct of research. Providing reliable access to research data should be a routine practice, similar to the practice of linking researchers to bibliographic references. After completing this lesson, participants should be able to define data citation and describe its benefits; identify the roles of various actors in supporting data citation; recognize common metadata elements and persistent data locators and describe the process for obtaining one; and summarize best practices for supporting data citation.
When entering data, common goals include creating data sets that are valid, that have gone through an established process to ensure quality, and that are organized and reusable. This lesson outlines best practices for creating data files. It details options for data entry and integration, and provides examples of processes used for data cleaning, organization, and manipulation.
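One way to build quality into data entry is to validate each row against a controlled vocabulary and plausible value ranges as it is entered. The sketch below is illustrative only: the species codes, column names, and range limits are hypothetical stand-ins for whatever rules a real project defines.

```python
VALID_SPECIES = {"PIPO", "PSME", "ABCO"}  # hypothetical controlled vocabulary

def validate_row(row):
    """Return a list of problems found in one data-entry row (empty if valid)."""
    problems = []
    if row.get("species") not in VALID_SPECIES:
        problems.append("unknown species code: %r" % row.get("species"))
    try:
        dbh = float(row.get("dbh_cm", ""))
        if not 0 < dbh < 1000:  # plausible range for diameter at breast height
            problems.append("dbh_cm out of range: %s" % dbh)
    except ValueError:
        problems.append("dbh_cm is not numeric: %r" % row.get("dbh_cm"))
    return problems

print(validate_row({"species": "XXXX", "dbh_cm": "abc"}))
```

Rejecting or flagging rows at entry time is far cheaper than tracking down the same errors during analysis.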
Data management planning is the starting point in the data life cycle. Creating a formal document that outlines what you will do with the data during and after the completion of research helps to ensure that the data is safe for current and future use. This lesson describes the benefits of a data management plan (DMP), outlines the components of a DMP, details tools for creating a DMP, provides NSF DMP information, and demonstrates the use of an example DMP.
The Data Management Skillbuilding Hub is a repository for open educational resources regarding data management, meaning that it is a collection of learning resources freely contributed by anyone willing to share them. Materials such as lessons, best practices, and videos are stored in the DataONEorg GitHub repository and are also searchable through the Data Management Training Clearinghouse. We invite you to submit your own educational resources so that the Data Management Skillbuilding Hub can remain an up-to-date and sustainable educational tool for all to benefit from. You can easily contribute learning materials to the Skillbuilding Hub via GitHub online.
Quality assurance and quality control are phrases used to describe activities that prevent errors from entering or staying in a data set. These activities ensure the quality of the data before it is collected, entered, or analyzed, and they involve actively monitoring and maintaining data quality throughout the study. In this lesson, we define and provide examples of quality assurance, quality control, data contamination, and types of errors that may be found in data sets. After completing this lesson, participants will be able to describe best practices in quality assurance and quality control and relate them to different phases of data collection and entry.
When first sharing research data, researchers often raise questions about the value, benefits, and mechanisms for sharing. Many stakeholders and interested parties, such as funding agencies, communities, other researchers, or members of the public may be interested in research, results and related data. This lesson addresses data sharing in the context of the data life cycle, the value of sharing data, concerns about sharing data, and methods and best practices for sharing data.
As rapidly changing technology enables researchers to collect large, complex datasets with relative ease, the need to effectively manage these data increases in kind. This is the first lesson in a series of education modules intended to provide a broad overview of various topics related to research data management. It covers: trends in data collection, storage and loss, the importance and benefits of data management, and an introduction to the data life cycle.
Conversations regarding research data often intersect with questions related to ethical, legal, and policy issues for managing research data. This lesson will define copyrights, licenses, and waivers, discuss ownership and intellectual property, and describe some reasons for data restriction. After completing this lesson, participants will be able to identify ethical, legal, and policy considerations that surround the use and management of research data.
What is metadata? Metadata is data (or documentation) that describes and provides context for data, and it is everywhere around us. Metadata allows us to understand the details of a dataset, including: where it was collected, how it was collected, what gaps in the data mean, what the units of measurement are, who collected the data, how it should be attributed, etc. By creating and providing good descriptive metadata for our own data, we enable others to efficiently discover and use the data products from our research. This lesson explores the importance of metadata to data authors, data users, and organizations, and highlights the utility of metadata. It provides an overview of the different metadata standards that exist, and the core elements that are consistent across them. It guides users in selecting a metadata standard to work with and introduces the best practices needed for writing a high-quality metadata record.
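The core elements shared across standards can be sketched as a simple checklist. The record and element names below are illustrative and standard-agnostic, not drawn from any particular schema; a real record would follow the chosen standard's own element names.

```python
# Core elements common to most metadata standards (names are
# illustrative, not taken from any specific schema).
CORE_ELEMENTS = ("title", "creator", "abstract", "coverage", "units", "methods")

def missing_core_elements(record):
    """Return the core elements absent from (or empty in) a metadata record."""
    return [e for e in CORE_ELEMENTS if not record.get(e)]

# Hypothetical partial record for a dataset.
record = {
    "title": "Hourly soil moisture at Site A, 2019-2021",
    "creator": "J. Researcher",
    "abstract": "Hourly volumetric soil moisture from buried probes.",
    "coverage": {"spatial": "Site A", "temporal": "2019-01-01/2021-12-31"},
    "units": {"soil_moisture": "m3/m3"},
}

print(missing_core_elements(record))  # "methods" has not been written yet
```

A completeness check like this catches gaps, such as the missing methods description above, before the record is published alongside the data.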