How can I archive and publish my data?

Publish & Share
All data and software leading to a published result, must be archived and published.

Selecting an Archive

Data archiving: mid- or long-term?

In the Data Management Plan the researcher describes if the data will be stored for the mid or the long term.

A monochrome picture of two women operating the ENIAC, an early computer

Mid-term archive

According to the VU RDM Policy, all publication-related data should be archived for at least ten years for verification and replication of research. For this purpose, VU Amsterdam offers researchers two options to archive their data in one of the organisational repositories (DataverseNL and Yoda). Other archival options may be used depending on the discipline as described in faculty data management policy documents.

Long-term archive

Data relevant for future research should be archived for the long term. A dataset is relevant for future research when at least one of the following general criteria applies:

1. The data have a scientific or historical value
2. The data are unique
3. Others may want to reuse the data
4. The data cannot be reproduced

Researchers should bear in mind that repositories can charge for archiving data. These costs can vary according to the data volume and the archive used. It is important that you consider in advance how you will budget for these costs. Whatever archiving option is used, proper descriptions of the dataset(s) and adding metadata are important.

Deposit your data

VU Amsterdam requests that researchers archive the data used in a publication in a repository for at least ten years after the release of the publication (see also VU Policies & Regulations). There are a lot of digital archives and many more keep appearing.

The right archival option depends on the nature of the data and the field of science as described in faculty or departmental data management policy documents. The university offers 2 different general repositories for data archiving.

  1. The RDM Support Desk and faculty data stewards can help researchers with the selection of a repository that meets all the relevant criteria of privacy (sensitivity), dataset size, etc.
  2. DataverseNL - an online platform for the publication of citable research data in a semi-open environment. DataverseNL allows users to link publications to datasets directly, and to share the data through online archives such as DANS.

Specifications:

  • For publishing research data on the internet
  • The researcher publishing the data decides whether access to the data is public or restricted
  • Not suitable for privacy or otherwise sensitive information
  • Enables researchers to publish open data according to grant providers’ regulations
  • Generates a link (persistent identifier), e.g. for data citations in publications
  • Retention period is at least 10 years

Yoda - besides active storage, Yoda also has an archive function: the vault. You can use the vault in two ways:

  • For archiving data securely; data are only available for verification purposes and may be access only by special request. A special procedure will be followed if anyone requests access to the data in order to verify them.
  • For publishing data; data can be available for anyone, or on request. The data will get a persistent identifier as well.

Before sending data to the vault, you will need to add metadata. A data steward, metadata specialist or functional manager can help you with the metadata and the entire process of sending data to the vault. Please get in touch with the RDM Support Desk to find this help.

Choosing a different repository

Besides the repositories offered by VU Amsterdam, there are many others. Unless you are working with personal or otherwise confidential data and you need to archive them in Yoda, you are, in principle, free to choose a different repository from the ones hosted by VU Amsterdam.

There can be various reasons to decide to use a different repository, including funder requirements, preferences of research partners, and a repository being a common choice in your field. For example, Dutch archaeologists mostly use DANS Data Stations to deposit and publish their data. Using a repository that is a common choice in your field will make your data more findable for your colleagues and increase the visibility of your work as a researcher. Some of the data repositories most commonly used in the Netherlands include:

  • DANS Data Stations: a domain-agnostic research data repository hosted by the Data Archiving and Networked Services, an institute of NWO and KNAW. DANS also develops policies, services and new infrastructures for research data and provides researchers with advice on how to preserve their data. VU researchers are also welcome to deposit their data at DANS-EASY;
  • 4TU.ResearchData: a repository for science, engineering and design data hosted by the 4TU Federation. This is a consortium of the four Dutch technical universities: TU Delft, TU Eindhoven, University of Twente and Wageningen University and Research. VU researchers are also welcome to deposit their data at 4TU;
  • Zenodo: a domain-agnostic research data repository hosted by CERN in Switzerland and funded by the European Commission. Zenodo does not only host data, but also presentations, conference procedures and policy documents. It is also possible to archive GitHub repositories directly into Zenodo, by which you contribute to Open Science by making a snapshot of your code available in its current form and for the long term;
  • OSF (Open Science Framework): a data management and research dissemination platform. VU Amsterdam is an institutional member of the OSF, which means that you can sign up (and in) using your VU account by clicking on the Institution Button on the sign in/up pages. You can use the OSF to create registrations and preregistrations for your research, to publish preprints, and publish and share data and documentation. You can also link other repositories such as DataverseNL to your OSF project. The same goes for GitHub and storage options such as Research Drive and Surfdrive. Do be careful about what you connect! A full guide for VU OSF users, including instructions about connecting external storage can be found here.

You can also find repositories via the Registry of Research Data Repositories. When you are choosing a repository, it is important to check that it provides all the services you need. A good way to find out is to check if a repository as a Core Trust Seal, which is a form of certification for quality repositories. But if a repository does not have the Core Trust Seal, it does not necessarily mean it is not a good repository. As a minimum, you should check that:

  • The repository provides a persistent identifier, such as a DOI;
  • The repository enables you to add rich metadata to your dataset and ideally follows an internationally recognised metadata standard, such as Dublin Core or DataCite;
  • The repository offers functionality to publish data with an embargo or under restrictions, if you need that;
  • The repository allows you to add a licence to the dataset;
  • The repository is funded sustainably for at least the next 50 years;
  • And, in some cases, that the repository’s servers are located in the EU.

More recommendations for choosing a data repository can be found on CESSDA.

If you would like advice about what would be a good place for you to archive your research data, you can always reach out to the RDM Support Desk.

Alternative strategies

Publishing your data in a data journal

Instead of archiving research data in a data repository, you may choose to publish an article about your data collection. This is not necessarily common for all disciplines. Some examples of data journals where you can publish your data and dataset, are:

Publishing your data as supplementary information with your article

Another way to make your data available, is to add them as supplementary information with your article in a journal. At first sight, this may seem a practical solution, because the publication and the underlying data in that case appear together as part of a single publication. However, making your data available as a seperate piece of research output has other advantages:

  • The dataset will be citable on its own, which also enables you to get acknowledged for the work on your dataset
  • Datasets with many files or many different types of files are easier to structure and present in a repository
  • You can assign different levels of accessibility (unrestricted, restricted or closed) if necessary, which is not possible in a publication
  • You don’t transfer copyright of your data publication to the publisher of your article

If you make your data available through a repository, you can link from your article to your dataset and the other way around, so that you can present them as related research output.

Persistent Identifier

A Persistent Identifier (PID) is a durable reference to a digital dataset, document, website or other object. In the context of research data and software, it is essentially a URL that will never break. By using a Persistent Identifier, you make sure that your dataset will be findable well into the future when it is registered online (for example at DataCite. Another advantage is that it makes a digital object citable.

Multiple PID systems

There are multiple PID systems, each with its own particular properties. Examples of widely used PIDs in the research domain include the following.

See the Persistent Identifier guide of Netwerk Digitaal Erfgoed for a more elaborate overview. Apart from widely used domain-agnostic PIDs, there is a wide range of domain-specific unique identifiers that can be used. If you are interested in domain-specific identifiers, it is useful to ask colleagues in your department or discipline.

Persistent Identifiers for data and software in repositories

Persistent Identifiers can be assigned to datasets and software upon their deposit in a repository. In many repositories, this is a DOI. Data repositories are entitled to generate Persistent Identifiers for data and software. This is one of the reasons why archiving and publishing data and software has to be done in a repository. After the process of uploading data or software to a repository, a Persistent Identifier will be generated. Upon publishing the data or software, the DOI is registered online (usually at DataCite when it concerns a dataset).

Some repositories enable their users to reserve a Persistent Identifier before the publishing process has finished, so that you can include the Persistent Identifier in a publication before the data will be actually published, or to include the Persistent Identifier in a readme file. This is for example possible in Zenodo.

The repositories offered by VU Amsterdam, Yoda and DataverseNL provide DOIs for deposited datasets and software.

Creating and using an ORCiD

Researchers can use an ORCiD to identify their research output as their work. You can request an ORCiD yourself. Instructions for setting up an ORCiD and connecting it to your VU research profile in PURE are available in this ORCiD LibGuide. An ORCiD is often asked for when you submit a publication or upload data or software to a repository. You can use your ORCiD record to create a research profile as well.

Using a ROR

Researchers can use the ROR for VU Amsterdam when filling metadata forms for their research output to show that their work has been created within their employment at VU Amsterdam.

Licensing data and software

Licensing the data

A data licence agreement is a legal instrument that lets others know what they can and cannot do with your research data (and any documentation, scripts and metadata that are published with the data - information about software licensing can be found on the software licensing page). It is important to consider what kind of limitations are relevant. Usually, at least the following questions are considered:

  • Can people make copies or even distribute copies?
  • Do others (and you) reusing the dataset need to acknowledge you as the author of the original dataset? (This is called Attribution.)
  • Do others (and you) who reuse the dataset and/or make derivatives of the dataset need to share their work under a similar licence? (This is called Share-Alike.)
  • Can others (and you) use your dataset commercially? (A restriction on commercial use is called Non-Commercial.)
  • Can others (and you) create a new work based on the dataset? (This is called a Derivative.)

The considerations above are the ‘building blocks’ that Creative Commons licences use. There are also other considerations, and also other licences.

An image of open data, made up of public domain icons

In principle, Dataverse allows you to choose your terms of use. If you publish your data in Yoda, there is guidance available on how to choose a licence and how to customise licences. Some data repositories require you to use a certain licence if you want to deposit your data with them. At Dryad, for example, all datasets are published under the terms of Creative Commons Zero to minimise legal barriers and to maximise the impact for research and education. Some funders may also require that you publish the data as open data. Open data are data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and share alike (Open Knowledge International definition). If you need help with drawing up licence agreements, you can contact the VU’s legal office.

Additional websites and tools:

Licensing software

Publishing research software under an appropriate licence is crucial for its accessibility, usability, and further integration into research. Choosing a licence usually happens right when you start developing the software or when you put it in a public repository, rather than when the software is finished and fully baked.

A software licence states how other people may re-use your code and under which circumstances. For research software, it is recommended (and often required by funders) that licences are as permissible as possible.

There are many licences out there; below we list some very frequently used licences in research software. However, if none of these licences fit your case, there are several tools that can help you to choose a suitable software licence. If you need guidance in choosing a licence for your software, get in touch with the RDM Support Desk.

MIT License

The MIT License is a popular choice, due to its readability and permissiveness. It allows users to reuse the software for any purpose, including using, copying, modifying, and distributing it, provided they include the original copyright notice and licence text.

However, its permissiveness means that derivative works can be closed-source and do not need to mention that they use your code, which might not align with all scientific openness goals or general.

GNU GPLv3

The GNU General Public License (GPLv3) is another option, designed to ensure that the software and any derivatives remain open-source.

This encourages collaborative improvement of software. Any software that includes GPL-licensed code must also be open-source under the GPLpotentially deterring commercial use or integration with proprietary software. In conclusion, when you want your code to be used by others, but only the code that uses your code is also open source, this is the way to go.

Apache License 2.0

The Apache License 2.0 allows for modification and distribution of the software and its derivative works, with the requirement that changes to the original code are documented.

It is a more complex licence than the MIT License and can be incompatible with GPL-licensed software. The specifics of this go beyond the scope of the handbook.

Adding a licence to GitHub

On GitHub you add a licence on creating a new repository, by selecting the licence from the drop-down menu. If your repository already exists, add a new file called “LICENSE” using the “+”-button on top of the repository (see below).

Location of file creation button

One the next page, start to type LICENSE as the file name, and a button to “Choose a license template” should automatically pop up. Follow the steps provided by GitHub to finish adding the licence to the repository.

You should now see your licence shown on the main page of your repository.

Further considerations

  • If you are reusing software or libraries written by someone else, you must stick to the clauses of the licence given to the original software/library;
  • When choosing a licence, do not just think about what others may do with the software, but also what you might want to do with the software in the future.

Dataset Registration

Registration & Findability

When you have finished finalizing a dataset or research software and are ready to archive it, there are many options available. Depending on the research and choices made earlier the archive provides the option to fill in descriptive fields for a dataset. The descriptions in the archives often are automatically created using metadata standards like DataCite or Dublin Core, or some other type of standard. See also section Metadata

When registering a dataset or research software in an archive it is important to use unique identifiers to allow for increased findability and easy attribution & citation. Examples of this are:

  • Personal names: try to consistently use the same notation for all researchers and assistants that are included as authors
  • ORCID: using a unique identifier like this for all authors is recommended. More information is available here.
  • Institutional names: avoid using different versions (or language versions) of participating Institutes/organisations and departments. In the case of VU Amsterdam the official written name is: Vrije Universiteit Amsterdam. For each organisation or Institute that is included: try to make sure that the official name is used each time.

Some archives also allow you to preregister your project/dataset/software. Examples are:

Register your Dataset/Software in PURE

Just like your publications, data that you have collected for your research constitute research output, too. Therefore you are required to record your datasets and research software in PURE. Your datasets and research software can be of interest to others, which can in turn lead to new collaboration opportunities. Datasets and research software recorded in PURE also appear in reports that are used for research evaluations. Even if access to your dataset and research software is closed, you are required to register your dataset and research software in PURE. It is a record of the research, data collection and analysis that you have carried out.

Benefits of recording your dataset/software in PURE
  • It increases the visibility and findability of your datasets and research software
  • It contributes to re-use and transparency
  • It boosts your collaboration opportunities
  • It counts towards research evaluations and assessments
How to register your dataset/software in PURE?

An image of PURE, indicating where to add a new dataset or software
  1. Log into the VU Research Portal (PURE) using your VU credentials
  2. Click on the “+” (plus) icon next to selecting “Datasets/Software” in the overview
  3. You can fill in the form using the following manuals and read more about the various metadata in use (generic and subject specific):
  4. Click on “Save” to store the registration

Footnotes

  1. For the source code, see https://github.com/ufal/public-license-selector/↩︎