# Dataverse utilities
This is a generalized set of utilities which help with managing Dataverse repositories. This has nothing to do with the Microsoft product of the same name.
With these utilities you can:
- Upload your data sets from a tab-separated-value spreadsheet
- Bulk release multiple data sets
- Bulk delete (unpublished) assets
- Quickly duplicate records
- Replace licences
- and more!
Get your copy today!
## Important note
These are console utilities, meaning that they will run in a command prompt window, PowerShell, bash, zsh, etc. If the sentence you just read is gibberish to you, then these utilities are probably not for you. While they don't require any programming knowledge to use, you will still need to be able to install Python.
Source code (and this documentation) is available at the GitHub repository https://github.com/ubc-library-rc/dataverse_utils, and the user-friendly version of the documentation is at https://ubc-library-rc.github.io/dataverse_utils. Presumably you know this already, otherwise you wouldn't be reading this.
## Installation
Any installation will require the use of the command line/command prompt (see above).
The easiest installation is with pipx, which will allow you to run these utilities completely isolated from the rest of your Python installation[s]. This should work on any platform which supports pipx:

```
pipx install dataverse_utils
```
There is also a server-specific version if you need to use the dv_pg_facet_date utility. That utility can only be run on a server hosting a Dataverse instance, so for the vast majority of users it will be unusable. It can also be installed with pipx:

```
pipx install 'dataverse_utils[server]'
```

Note the extra quotes. You can install the server version even if you don't have server access, but there's no reason to.
### Upgrading
Just as easy as installation:

```
pipx upgrade dataverse_utils
```
Other methods of installing Python packages can be found at https://packaging.python.org/tutorials/installing-packages/.
## Downloading the source code
Source code is available at https://github.com/ubc-library-rc/dataverse_utils. Working on the assumption that git is installed, you can download the whole works with:

```
git clone https://github.com/ubc-library-rc/dataverse_utils
```
If you have mkdocs installed, you can view the documentation in a web browser by running mkdocs serve from the top-level directory of the downloaded source files.
## The components
### Console utilities
There are nine (9) console utilities currently available.
- dv_del: Bulk (unpublished) file deletion utility.
- dv_ldc_uploader: A utility which scrapes Linguistic Data Consortium metadata from their website, converts it to Dataverse JSON and uploads it, with the possibility of including local files. As of early 2023, there is an issue which requires attaching a manually downloaded certificate chain. Don't worry, that's not as hard as it sounds.
- dv_list_files: Lists all the files in a Dataverse record, potentially including all versions and draft versions.
- dv_manifest_gen: Creates a simple tab-separated value file which can be edited and then used to upload files along with file-level metadata. Normally the file will be edited after creation, usually in a spreadsheet program such as Excel.
- dv_pg_facet_date: A server-based tool which updates the publication date facet and performs a study reindex.
- dv_record_copy: Copies an existing Dataverse study metadata record to a target collection, or replaces an existing record.
- dv_release: A bulk release utility. Releases all the unreleased studies in a Dataverse collection at once, or individually if persistent identifiers are supplied.
- dv_replace_licence: Replaces the licence associated with a PID with text from a Markdown file. Also available as dv_replace_license for those using American English.
- dv_upload_tsv: Takes a tsv file in the format produced by dv_manifest_gen and does all the uploading and metadata entry.
More information about these can be found on the console utilities page.
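All of these are standard console scripts, so if you just need a quick reminder of a utility's options you can ask the tool itself. A minimal check, assuming the conventional -h/--help flag (the authoritative option lists are on the console utilities page):

```
dv_upload_tsv -h
dv_release --help
```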
### Python package: dataverse_utils
If you want to use the Python package directly, you should install it with pip instead of pipx, although, to be fair, you don't have to; it will just make your life much easier. If you have no interest in using dataverse_utils code in your own code, you can safely ignore this section.
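For instance, a plain pip installation into your current Python environment, which makes the package importable, looks like this:

```
python -m pip install dataverse_utils
```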
The package contains a variety of utility functions which, for the most part, allow uploads of files and associated metadata without having to touch the Dataverse GUI or construct complex JSON.
For example, the upload_file function requires no JSON attachments:
```python
import dataverse_utils

dataverse_utils.upload_file('/path/to/file.ext',
                            dv='https://targetdataverse.invalid',
                            descr='A file description',
                            tags=['Data', 'Example', 'Spam'],
                            dirlabel=['path/to/spam'],
                            mimetype='application/geo+json')
```
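Because upload_file handles one file per call, a batch upload is just a loop. Here's a minimal sketch which reuses only the parameters shown above; the file names, description and tags are placeholders:

```python
import dataverse_utils

# Hypothetical batch of files; each call mirrors the single-file example
for fname in ('spam.csv', 'eggs.csv', 'ham.csv'):
    dataverse_utils.upload_file(f'/path/to/{fname}',
                                dv='https://targetdataverse.invalid',
                                descr='One file from a batch upload',
                                tags=['Data', 'Example'],
                                dirlabel=['path/to/spam'],
                                mimetype='text/csv')
```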
Consult the API reference for full details.
#### ldc

The ldc component represents the Linguistic Data Consortium, or LDC. The ldc module is designed to harvest LDC metadata from its catalogue, convert it to Dataverse JSON, then upload it to a Dataverse installation. Once the study has been created, the general dataverse_utils module can handle the file uploading.

The ldc module requires the dryad2dataverse package. Because of this, it takes a tiny bit more effort to use: LDC material doesn't include some of the metadata Dataverse requires (such as a contact name and email), so you must supply it yourself. Here's a snippet that shows how it works:
```python
import dataverse_utils as du
import dataverse_utils.ldc as ldc

# LDC records lack contact metadata, so supply it here
ldc.ds.constants.DV_CONTACT_EMAIL = 'iamcontact@test.invalid'
ldc.ds.constants.DV_CONTACT_NAME = 'Generic Support Email'
KEY = 'IAM-YOUR-DVERSE-APIKEY'
stud = 'LDC2021T02'  # LDC study number
a = ldc.Ldc(stud)
a.fetch_record()
# Data goes into the 'ldc' dataverse
info = a.upload_metadata(url='https://dataverse.invalid',
                         key=KEY,
                         dv='ldc')
hdl = info['data']['persistentId']
with open('/Users/you/tmp/testme.tsv') as fil:
    du.upload_from_tsv(fil, hdl=hdl, dv='https://dataverse.invalid',
                       apikey=KEY)
```
Note that one method uses key and the other apikey. This is what is known as ad hoc.
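Since every LDC record follows the same fetch/upload pattern, bulk-creating records is just a loop over catalogue numbers. A minimal sketch using only the classes and methods shown above; the study numbers are illustrative:

```python
import dataverse_utils.ldc as ldc

ldc.ds.constants.DV_CONTACT_EMAIL = 'iamcontact@test.invalid'
ldc.ds.constants.DV_CONTACT_NAME = 'Generic Support Email'
KEY = 'IAM-YOUR-DVERSE-APIKEY'

# Example LDC catalogue numbers; substitute your own
for stud in ('LDC2021T02', 'LDC2020S01'):
    rec = ldc.Ldc(stud)
    rec.fetch_record()
    info = rec.upload_metadata(url='https://dataverse.invalid',
                               key=KEY,
                               dv='ldc')
    print(stud, info['data']['persistentId'])
```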
More information is available at the API reference.