What data assets should you publish and what data assets should you archive?

Archival must happen, publishing can happen.

Selecting an Archive

Data archiving: mid- or long-term?

In the Data Management Plan, the researcher describes whether the data will be stored for the mid term or the long term.

Mid-term archive

According to the VU RDM Policy, all publication-related data should be archived for at least ten years for verification and replication of research. For this purpose, Vrije Universiteit Amsterdam offers researchers two organisational repositories in which to archive their data: DataverseNL and Yoda. Other archival options may be used depending on the discipline, as described in faculty data management policy documents.

Long-term archive

Data relevant for future research should be archived for the long term. A dataset is relevant for future research when at least one of the following general criteria applies:

1. The data have a scientific or historical value
2. The data are unique
3. Others may want to reuse the data
4. The data cannot be reproduced

Researchers should bear in mind that repositories can charge for archiving data. These costs can vary according to the data volume and the archive used. It is important that you consider in advance how you will budget for these costs. Whatever archiving option is used, proper descriptions of the dataset(s) and adding metadata are important.

Deposit your data

VU Amsterdam requests that researchers archive the data used in a publication in a repository for at least ten years after the release of the publication (see also VU Policies & Regulations). There are a lot of digital archives and many more keep appearing.

The right archival option depends on the nature of the data and the field of science, as described in faculty or departmental data management policy documents. The university offers two general repositories for data archiving: DataverseNL and Yoda. The RDM Support Desk and faculty data stewards can help researchers select a repository that meets all the relevant criteria, such as privacy (sensitivity) and dataset size.

DataverseNL - an online platform for the publication of citable research data in a semi-open environment. DataverseNL allows users to link publications to datasets directly, and to share the data through online archives such as DANS.

Specifications:

  • For publishing research data on the internet
  • The researcher publishing the data decides whether access to the data is public or restricted
  • Not suitable for privacy-sensitive or otherwise sensitive information
  • Enables researchers to publish open data according to grant providers’ regulations
  • Generates a link (persistent identifier), e.g. for data citations in publications
  • Retention period is at least 10 years

Yoda - besides active storage, Yoda also has an archive function: the vault. You can use the vault in two ways:

  • For archiving data securely; data are only available for verification purposes and may be accessed only by special request. A special procedure will be followed if anyone requests access to the data in order to verify them.
  • For publishing data; data can be available for anyone, or on request. The data will get a persistent identifier as well.

Before sending data to the vault, you will need to add metadata. A data steward, metadata specialist or functional manager can help you with the metadata and the entire process of sending data to the vault. Please get in touch with the RDM Support Desk to find this help.

Archiving vs. Publishing Data

There is a difference between archiving and publishing data. When we talk about archiving data, we mean that data are deposited securely, in a fixed state, in a location that is not accessible to the public or even a colleague at the VU. Archiving often happens for data that are confidential - for privacy or other reasons - and that should not be accessible publicly. Archiving is usually done for verification purposes, or, in case of medical research, to comply with the preservation requirements within the WMO.

Publishing refers to depositing data in a public repository that allows others to view, access and download your data. You can set certain restrictions, but as a rule of thumb, publishing should only happen for data that are not confidential at all. That includes data that have been anonymised, or were not personal to begin with, and data that were never otherwise confidential. If you cannot publish any data at all, we do usually recommend trying to publish some documentation, such as data collection protocols, scripts, codebooks, etc. In this way, others can see how the research was carried out, even if they cannot simply access the data.

Use the image below to remind yourself of the difference between archiving and publishing, and read the data publication page to find out what aspects are important when you decide to publish your data.

A sketch diagram by Scriberia illustrating the archive or publish data journey

This illustration is created by Scriberia with The Turing Way community. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

Choosing a different repository

Besides the repositories offered by the VU, there are many others. Unless you are working with personal or otherwise confidential data and you need to archive them in Yoda, you are, in principle, free to choose a different repository from the ones hosted by the VU.

There can be various reasons to decide to use a different repository, including funder requirements, preferences of research partners, and a repository being a common choice in your field. For example, Dutch archaeologists mostly use DANS Data Stations to deposit and publish their data. Using a repository that is a common choice in your field will make your data more findable for your colleagues and increase the visibility of your work as a researcher. Some of the data repositories most commonly used in the Netherlands include:

  • DANS Data Stations: a domain-agnostic research data repository hosted by the Data Archiving and Networked Services, an institute of NWO and KNAW. DANS also develops policies, services and new infrastructures for research data and provides researchers with advice on how to preserve their data. VU researchers are also welcome to deposit their data at DANS-EASY;
  • 4TU.ResearchData: a repository for science, engineering and design data hosted by the 4TU Federation. This is a consortium of the four Dutch technical universities: TU Delft, TU Eindhoven, University of Twente and Wageningen University and Research. VU researchers are also welcome to deposit their data at 4TU;
  • Zenodo: a domain-agnostic research data repository hosted by CERN in Switzerland and funded by the European Commission. Zenodo does not only host data, but also presentations, conference proceedings and policy documents. It is also possible to archive GitHub repositories directly into Zenodo, by which you contribute to Open Science by making a snapshot of your code available in its current form and for the long term;
  • OSF (Open Science Framework): a data management and research dissemination platform. The VU is an institutional member of the OSF, which means that you can sign up (and in) using your VU account by clicking on the Institution Button on the sign in/up pages. You can use the OSF to create registrations and preregistrations for your research, to publish preprints, and publish and share data and documentation. You can also link other repositories such as DataverseNL to your OSF project. The same goes for GitHub and storage options such as Research Drive and Surfdrive. Do be careful about what you connect! A full guide for VU OSF users, including instructions about connecting external storage can be found here.

You can also find repositories via the Registry of Research Data Repositories. When you are choosing a repository, it is important to check that it provides all the services you need. A good way to find out is to check if a repository has a Core Trust Seal, which is a form of certification for quality repositories. But if a repository does not have the Core Trust Seal, it does not necessarily mean it is not a good repository. As a minimum, you should check that:

  • The repository provides a persistent identifier, such as a DOI;
  • The repository enables you to add rich metadata to your dataset and ideally follows an internationally recognised metadata standard, such as Dublin Core or DataCite;
  • The repository offers functionality to publish data with an embargo or under restrictions, if you need that;
  • The repository allows you to add a licence to the dataset;
  • The repository is funded sustainably for at least the next 50 years;
  • And, in some cases, that the repository’s servers are located in the EU.
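
The minimum checks above can be kept as a simple checklist when comparing candidate repositories. The sketch below is illustrative only: the class, the field names and the example repository are hypothetical, and the answers have to be filled in by hand from a repository's own documentation.

```python
from dataclasses import dataclass


@dataclass
class RepositoryProfile:
    """Hand-filled answers to the minimum checks (all field names are hypothetical)."""
    name: str
    persistent_identifier: bool   # e.g. the repository assigns a DOI
    rich_metadata: bool           # e.g. Dublin Core or DataCite
    licence_selectable: bool      # a licence can be attached to the dataset
    sustainable_funding: bool
    embargo_or_restrictions: bool = False
    servers_in_eu: bool = False


def unmet_checks(repo, need_embargo=False, need_eu_servers=False):
    """Return the names of the checks that the repository does not meet."""
    required = {
        "persistent_identifier": repo.persistent_identifier,
        "rich_metadata": repo.rich_metadata,
        "licence_selectable": repo.licence_selectable,
        "sustainable_funding": repo.sustainable_funding,
    }
    # Conditional checks only count when the project actually needs them.
    if need_embargo:
        required["embargo_or_restrictions"] = repo.embargo_or_restrictions
    if need_eu_servers:
        required["servers_in_eu"] = repo.servers_in_eu
    return [check for check, ok in required.items() if not ok]


example = RepositoryProfile(
    name="Example Repository",  # hypothetical repository
    persistent_identifier=True,
    rich_metadata=True,
    licence_selectable=True,
    sustainable_funding=True,
)
print(unmet_checks(example))
print(unmet_checks(example, need_eu_servers=True))
```

An empty list means all the checks that apply to your project are met; anything listed is a point to raise with the repository or with the RDM Support Desk.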

More recommendations for choosing a data repository can be found on CESSDA.

If you would like advice about what would be a good place for you to archive your research data, you can always reach out to the RDM Support Desk.

Data Publication

Open Access and Open Science

Open Access publishing means that you make your publication freely accessible online to everyone without restrictions. VU believes that government-funded research should be available free of charge to as many people as possible.

Open Access publishing is one component of Open Science. The European Commission has defined open science as follows: “Open Science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools. The idea captures a systemic change to the way science and research have been carried out for the last fifty years: shifting from the standard practices of publishing research results in scientific publications towards sharing and using all available knowledge at an earlier stage in the research process” (Definition taken from Nationaal Programma Open Science). This includes making openly available research data, methods and documentation where possible. As such, RDM and the practices outlined in the Research Support Handbook are a precondition of Open Science. You can read more about Open Science in the Netherlands on the website of the Nationaal Programma Open Science and join the Open Science Community Amsterdam, the community of VU employees interested in Open Science (joint with the University of Amsterdam).

Publishing your data in a data journal

Instead of archiving research data in a data repository, you may choose to publish an article about your data collection. This is not necessarily common for all disciplines. Some examples of data journals where you can publish your data and dataset, are:

Persistent Identifier

A persistent identifier (PID) is a durable reference to a digital dataset, document, website or other object. It is a kind of ISBN for digital files. By using a persistent identifier, you make sure that your dataset will be findable well into the future. DOIs and Handles are commonly used PIDs. The data archiving options at the VU commonly offer DOIs.

Most data archives or repositories offer a persistent identifier and generate this automatically when research data are archived. For example, this is the case for DataverseNL at the VU. In Yoda at the VU, assigning a PID is possible, but does not happen automatically. Please get in touch with the RDM Support Desk if you have questions about assigning a PID when you archive data in Yoda.
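
A DOI becomes a resolvable link when it is prefixed with the doi.org resolver address. The following is a minimal sketch, using the DOI of the Scriberia illustration cited in this handbook as input. The function name is hypothetical, and the pattern is only a loose format check, not proof that a DOI is actually registered.

```python
import re

# Loose pattern: "10.", a registrant code of 4-9 digits, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Return the https://doi.org/ link for a DOI, accepting a few common prefixes."""
    cleaned = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"Not a plausible DOI: {doi!r}")
    return "https://doi.org/" + cleaned


print(doi_to_url("doi:10.5281/zenodo.3332807"))
```

Citing the full link rather than the bare identifier lets readers reach the dataset landing page directly, even if the repository later changes its own web addresses.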

Licensing the data

A data licence agreement is a legal instrument that lets others know what they can and cannot do with your research data (and any documentation, scripts and metadata that are published with the data). It is important to consider what kind of limitations are relevant. An important component can be a guideline on how people should cite the dataset. Other components could be:

  • Whether people may make copies or even distribute copies
  • Who should be contacted for access to re-use the data
  • Etc.

In principle, Dataverse allows you to choose your terms of use. Some data repositories require you to use a certain licence if you want to deposit your data with them. At Dryad, for example, all datasets are published under the terms of Creative Commons Zero to minimise legal barriers and to maximise the impact for research and education. Some funders may also require that you publish the data as open data. Open data are data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and share alike (Open Knowledge International definition). If you need help with drawing up license agreements, you can contact the IXA office.

Additional websites and tools:

Publishing research software under an appropriate license is crucial for its accessibility, usability, and further integration into research. Choosing a license usually happens right when you start developing the software or when you put it in a public repository, rather than when the software is finished and fully baked.

A software license states how other people may re-use your code and under which circumstances. For research software, it is recommended (and often required by funders) that licenses are as permissive as possible.

There are many licenses out there; below we list some very frequently used licenses in research software. However, if none of these licenses fit your case, there are several tools that can help you to choose a suitable software license. If you need guidance in choosing a licence for your software, get in touch with the RDM Support Desk.

MIT License

The MIT License is a popular choice, due to its readability and permissiveness. It allows users to reuse the software for any purpose, including using, copying, modifying, and distributing it, provided they include the original copyright notice and license text.

However, its permissiveness means that derivative works can be closed-source: they must retain your copyright notice and license text, but their own source code does not have to be shared, which might not align with all scientific openness goals.

GNU GPLv3

The GNU General Public License (GPLv3) is another option, designed to ensure that the software and any derivatives remain open-source.

This encourages collaborative improvement of software. Any software that includes GPL-licensed code must also be open-source under the GPL, potentially deterring commercial use or integration with proprietary software. In short, if you want others to use your code, but only on the condition that the code using it is also open source, this is the way to go.

Apache License 2.0

The Apache License 2.0 allows for modification and distribution of the software and its derivative works, with the requirement that changes to the original code are documented.

It is a more complex license than the MIT License and can be incompatible with some GPL-licensed software: it is compatible with GPLv3 but not with GPLv2. Further specifics go beyond the scope of the handbook.

Adding a license to GitHub

On GitHub you can add a license when creating a new repository, by selecting the license from the drop-down menu. If your repository already exists, add a new file called “LICENSE” using the “+”-button at the top of the repository (see below).

Location of file creation button

On the next page, start typing LICENSE as the file name, and a button to “Choose a license template” should automatically pop up. Follow the steps provided by GitHub to finish adding the license to the repository.

You should now see your license shown on the main page of your repository.

Further considerations

  • If you are reusing software or libraries written by someone else, you must stick to the clauses of the licence given to the original software/library;
  • When choosing a licence, do not just think about what others may do with the software, but also what you might want to do with the software in the future.
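
Besides the LICENSE file, many projects also mark individual source files with an SPDX short identifier, so that tools and readers can tell which licence applies to a file. The identifiers below are the standard SPDX names for the three licences discussed above. The helper function, and the choice of GPL-3.0-or-later rather than GPL-3.0-only, are assumptions made for illustration.

```python
# Standard SPDX short identifiers for the licences discussed in this section.
SPDX_IDS = {
    "MIT License": "MIT",
    "GNU GPLv3": "GPL-3.0-or-later",  # assumption: "or later" variant chosen
    "Apache License 2.0": "Apache-2.0",
}


def spdx_header(licence_name, comment_prefix="#"):
    """Build the one-line SPDX header to place at the top of a source file."""
    spdx_id = SPDX_IDS[licence_name]  # raises KeyError for licences not listed here
    return f"{comment_prefix} SPDX-License-Identifier: {spdx_id}"


print(spdx_header("MIT License"))
print(spdx_header("Apache License 2.0", comment_prefix="//"))
```

The header does not replace the LICENSE file; it only points to the licence whose full text should still be included in the repository.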

Dataset Registration

Registration & Findability

When you have finalised a dataset and are ready to archive it, there are many options available. Depending on the research and the choices made earlier, the archive provides descriptive fields to fill in for a dataset. The descriptions in the archives are often created automatically using metadata standards such as DataCite or Dublin Core, or some other type of standard. See also the item Metadata in this LibGuide.

When registering a dataset in an archive it is important to use unique identifiers to allow for increased findability and easy attribution & citation. Examples of this are:

  • Personal names: try to consistently use the same notation for all researchers and assistants that are included as authors
  • ORCID: using a unique identifier like this for all authors is recommended. More information is available here.
  • Institutional names: avoid using different versions (or language versions) of the names of participating institutes, organisations and departments. In the case of the VU, the official written name is: Vrije Universiteit Amsterdam. For each organisation or institute that is included, make sure that the official name is used each time.
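
An ORCID iD can be checked for typing errors before it is entered in a registration form, because its final character is a checksum over the first fifteen digits (ISO 7064 MOD 11-2). The sketch below is a minimal illustration with hypothetical function names. It only checks the format and the checksum, not whether the iD belongs to the intended author.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the layout and checksum of an ORCID iD such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # example iD from ORCID's documentation
print(is_valid_orcid("0000-0002-1825-0098"))  # last character altered
```

A check like this catches most single-character typos, which helps keep author records consistent across archives.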

Some archives also allow you to preregister your project/dataset. Examples are:

Register your Dataset in PURE

Just like your publications, data that you have collected for your research constitute research output, too. Therefore you are required to record your datasets in PURE. Your datasets can be of interest to others, which can in turn lead to new collaboration opportunities. Datasets recorded in PURE also appear in reports that are used for research evaluations. Even if access to your dataset is closed, you are required to register your dataset in PURE. It is a record of the research, data collection and analysis that you have carried out.

Benefits of recording your dataset in PURE

  • It increases the visibility and findability of your datasets
  • It contributes to re-use and transparency
  • It boosts your collaboration opportunities
  • It counts towards research evaluations and assessments

How to register your dataset in PURE?

Screenshot: adding a dataset to your PURE profile
  1. Log into the VU Research Portal (PURE) using your VU or VUmc credentials
  2. Click on the “+” (plus) icon next to “Datasets” in the overview
  3. You can fill in the form using this manual (NL)/manual (EN), and read more about the various metadata in use (generic and subject specific)
  4. Click on “Save” to store the registration

Footnotes

  1. For the source code, see https://github.com/ufal/public-license-selector/↩︎