A data management plan (DMP) is a document that defines how data will be handled throughout the lifecycle of a project, from its acquisition to its archival. While these documents are typically used in research projects to meet funder requirements, they can also be used within a corporate environment to create structure and alignment between stakeholders. Because DMPs highlight the types of data that will be used within a project and address how that data will be managed throughout its lifecycle, stakeholders, such as governance teams, can provide clear feedback on the storage and dissemination of sensitive data, such as personally identifiable information (PII), at the outset of a project. These documents help teams avoid compliance and regulatory pitfalls, and they can serve as templates for how to approach and manage data in future projects.
A data management plan typically has five components:
1. A statement of purpose
2. Data definitions
3. Data collection and access
4. Frequently asked questions (FAQs)
5. Research data limitations
Each of these focus areas enables research agencies and research funders (or perhaps your data management team) to assess the amount of risk associated with a given project. The data management plan also addresses how to manage that risk. For example, if sensitive data is used within a project, is it appropriate to re-use that data for future projects? Depending on the sensitivity of that data, it may not be appropriate, or it may require additional user consent.
Each component of a data management plan focuses on a particular piece of information. The descriptions below cover each one in more detail.
1. Statement of purpose: This explains why the team needs to acquire specific types of data over the course of the project. It should clearly outline the question that the team is attempting to answer with this dataset.
2. Data definitions: Data descriptions help end users and their audiences understand naming conventions and their correspondence with specific datasets. Some of this information may also be held within the metadata, typically labeling data by its data sources and file formats. Creating and abiding by pre-defined metadata standards throughout the data acquisition process will also ensure a more consistent collection and smoother integration process.
3. Data collection and access: This section of a DMP highlights how data will be collected, stored, and accessed from a data repository. It will likely address the source of any existing data or the approach that will be taken to create new data, such as an experiment. It should also contain information about the timing of the data, that is, how often it will be updated and over what period of time. The type and timing of the data will generally inform how it is stored and how third parties access it. For example, unstructured data is generally better suited to a non-relational system than a relational one, and larger datasets will require more compute power than smaller ones. There may also be restrictions on data sharing due to privacy or intellectual property rights. Since project stakeholders will expect sensitive data, such as personally identifiable information (PII), to be treated with the utmost care and security, it's important for data owners to be clear about their data management practices, particularly in this area. This includes answering questions about the data's long-term preservation, such as data archiving or data re-use. For data that is not sensitive in nature, there will be an expectation to provide a pathway for third parties to access raw data and research results.
4. Frequently asked questions: This section can be considered a “catch-all” for other common questions within data management projects, such as sharing plans, citation preferences, and data backup methods. Researchers or data owners may want to highlight any digital object identifiers (DOIs) for owners of adjacent or related projects. Additionally, if project owners are archiving data, they'll also need to address how long the archive will exist. Will it live for one year, five years, or perhaps indefinitely?
5. Research data limitations: This section addresses known limitations of the dataset upfront, particularly those that limit how well findings generalize to broader populations. For example, data may be focused on a specific population segment, such as a particular geography, gender, race, or age group.
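For teams that want to track these five components in a machine-readable form, the structure above can be sketched as a small Python record. This is only an illustrative sketch, not a standard DMP schema: the class `DataManagementPlan`, its field names, and the helpers `missing_sections` and `review_flags` are all hypothetical names introduced here, and the two review rules (flag PII, flag an undecided archive period) are assumptions drawn from the questions raised in the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataManagementPlan:
    """Hypothetical record holding the five DMP components described above."""
    statement_of_purpose: str = ""
    data_definitions: Dict[str, str] = field(default_factory=dict)  # dataset name -> description
    collection_and_access: str = ""
    faqs: Dict[str, str] = field(default_factory=dict)              # question -> answer
    limitations: List[str] = field(default_factory=list)
    contains_pii: bool = False            # assumption: sensitivity tracked as a simple flag
    archive_years: Optional[int] = None   # None means the retention period is undecided


# The five sections, in the order the list above presents them.
REQUIRED_SECTIONS = [
    "statement_of_purpose",
    "data_definitions",
    "collection_and_access",
    "faqs",
    "limitations",
]


def missing_sections(plan: DataManagementPlan) -> List[str]:
    """Return the names of required sections that are still empty."""
    return [name for name in REQUIRED_SECTIONS if not getattr(plan, name)]


def review_flags(plan: DataManagementPlan) -> List[str]:
    """Return items a governance reviewer would likely want to look at."""
    flags = []
    if plan.contains_pii:
        flags.append("governance review required: plan includes PII")
    if plan.archive_years is None:
        flags.append("archive retention period not specified")
    return flags


if __name__ == "__main__":
    draft = DataManagementPlan(
        statement_of_purpose="Measure churn drivers for the 2024 customer cohort.",
        contains_pii=True,
    )
    print(missing_sections(draft))
    print(review_flags(draft))
```

A draft that fills in only the statement of purpose would be reported as missing the other four sections, which gives reviewers a quick completeness check before a plan is circulated.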
Data management plans are predominantly used in academic settings, particularly for federally funded programs, such as those of the National Institutes of Health (NIH) and National Science Foundation (NSF), but corporations can also use them in their research or data governance functions. Because academics and researchers need to comply with funder requirements in grant applications, many research institutions create a DMP tool to provide participants with the relevant template for their research project. Data governance teams within organizations can set up similar protocols to take in data requests from stakeholders advocating for new data initiatives.
Researchers in both the private and public sectors look to different funding agencies to sponsor research and innovation initiatives. DMPs mitigate risk for both parties by ensuring that data owners have assessed the value of the data as well as their own responsibilities for managing it, such as security and disaster recovery measures.
Data governance initiatives
Data management plans are also helpful for new data initiatives in business settings, where they help all stakeholders understand the importance of new data sources and how those sources tie to business outcomes. As developments in hybrid cloud, artificial intelligence, the internet of things (IoT), and edge computing continue to spur the growth of big data, enterprises will need to find ways to manage that complexity within their data systems.