EJT editorial standard for the semantic enhancement of specimen data in taxonomy literature

This paper describes a set of guidelines for the citation of zoological and botanical specimens in the European Journal of Taxonomy. The guidelines stipulate controlled vocabularies and precise formats for presenting the specimens examined within a taxonomic publication, which allow for the rich data associated with the primary research material to be harvested, distributed and interlinked online via international biodiversity data aggregators. Herein we explain how the EJT editorial standard was defined and how this initiative fits into the journal’s project to semantically enhance its publications using the Plazi TaxPub DTD extension. By establishing a standardised format for the citation of taxonomic specimens, the journal intends to widen the distribution of and improve accessibility to the data it publishes. Authors who conform to these guidelines will benefit from higher visibility and new ways of visualising their work. In a wider context, we hope that other taxonomy journals will adopt this approach to their publications, adapting their working methods to enable domain-specific text mining to take place. If specimen data can be efficiently cited, harvested and linked to wider resources, we propose that there is also the potential to develop alternative metrics for assessing impact and productivity within the natural sciences.


Introduction
Taxonomy and the urgent need for integrated bioinformatics
Publications containing descriptive taxonomy and nomenclatural acts constitute a pillar for developing robust hypotheses on identity and relationships in the natural world. Species names and the treatments associated with them are a prerequisite for reliable research across the natural sciences (Wägele et al. 2011), which in turn plays a pivotal role in effective conservation management and sustainable development (e.g., Groombridge 1992; Heywood et al. 1995; McCook et al. 2010). At a time when there is growing concern across the planet for the cascading effects of climate change and biodiversity loss on agriculture, land use and human welfare (IPCC 2018), it is now urgent for taxon concepts to move beyond the description of isolated organisms and towards an integrative systems approach that is able to address species interactions, both with their environment and with other species (Hardisty et al. 2013).
Despite its fundamental role in subsequent fields of research, alpha taxonomy is suffering from impediments that affect end-user accessibility, as well as the valorisation of its authors and their output (Ebach et al. 2011). The establishment of best practices in the citation of collection specimens, taxon concepts and bibliographic works would contribute to the interlinked resource infrastructure that is called for by the community (Hobern et al. 2019) and help to resolve some of the issues. However, the challenge therein is twofold, as we must overcome both sociological (e.g., habits and 'traditional' ways of working) and technical barriers in order to engineer change. This notion was a guiding principle throughout the European Journal of Taxonomy's project for the semantic enhancement of its publications. Achieving dynamic data exchange was not the only goal; as well as the chosen workflow (the technical framework), the applicability and relevance of the method (the sociological factor) were equally paramount to ensuring that authors embrace the increasingly technological research paradigm.

The European Journal of Taxonomy: spearheading innovative publishing workflows
The European Journal of Taxonomy (EJT) is a peer-reviewed international journal in descriptive taxonomy of eukaryotic organisms (zoology, entomology, botany, palaeontology). The journal was founded by a consortium of European natural history institutions to take advantage of the shift from paper to online publications, as supported by the recent modifications in the governing Codes of nomenclature in zoology (ICZN 2009, 2012) and botany (McNeill et al. 2012; Turland et al. 2018). From the journal's onset in 2011, EJT's articles have been published directly online as individual PDFs in Diamond Open Access (Bénichou et al. 2011). However, in accordance with the journal's founding principle to spearhead innovative publishing techniques, it had always been envisaged to provide publications in a machine-readable format, namely Extensible Mark-up Language (XML).
Beyond the advantages of XML's stability as an archiving format (Morrissey et al. 2010; cOAlition S 2019; Library of Congress 2019), which would serve to guarantee the longevity of its publications, EJT understood that the way to increase the visibility of taxonomic research was to provide machine-readable and semantically enhanced text (Agosti & Egloff 2009; Penev et al. 2010; Penev et al. 2011a; Penev et al. 2012; Miller et al. 2012; Miller et al. 2015). Moreover, in light of recent academic movements such as the FAIR Data Principles (Findable, Accessible, Interoperable, Reusable) for Open Science (Wilkinson et al. 2016) and international initiatives to mobilise biodiversity data, e.g., GBIF (https://gbif.org), the Catalogue of Life (https://www.catalogueoflife.org) and DiSSCo (https://www.dissco.eu), it became an increasing priority to spread the alpha taxonomy data contained within EJT articles throughout the ecosystem of dynamic, stable, free-to-use and interconnected platforms available on the Web.

A community-wide solution
The promotion of taxonomy, systematics and collection-based research via scientific publishing is a vision that EJT shares with the Consortium of European Taxonomic Facilities (CETAF), who officially endorsed the journal as its flagship title in 2016. EJT therefore aims to provide the taxonomic community with all of the modern interactive web-based facilities expected of a high-level, high-impact journal.
An additional motivation behind the creation of an international, cross-institutional journal was to enable its members to collectively tackle the challenges of the digital transition. Indeed, the unprecedented technological advances associated with 21st-century scientific publishing have given rise to complex strategic and technical issues related to the visibility, access, format and financial structure of academic journals, especially publicly funded titles (Bénichou et al. 2011, 2012; Côtez et al. 2018).
XML has proven itself to be an indispensable format. The availability of a "machine readable format (for example XML)" now features in the mandatory quality criteria for Plan S-compliant journals (cOAlition S 2019: Art. 9.2). However, this is a real technical barrier for many smaller Open Access publishers, especially those who do not impose article processing charges (APCs). Given that EJT is representative of typical natural history journals and their technical set-up, publishing on behalf of public-sector authors, the EJT-XML project reflects the operational and sociological constraints experienced by many other independent journals.
Thus, in line with EJT's role as incubator on behalf of its members, the solution had to be transposable to other institutional journals managed within the consortium's network, and even further afield into the wider community of taxonomic publishing. We here report on the process of defining a new post-publishing XML workflow and the associated formatting that enables optimised encoding of the published content.

Project analysis
The EJT steering committee, consisting of the general directors of the members of the EJT consortium, asked for an assessment of the potential ways to achieve a semantically enhanced XML version of EJT's publications. In 2015, the Naturalis Biodiversity Centre (an EJT consortium member) was tasked with investigating the different methods of obtaining and utilizing an XML version of EJT content.
The report submitted by Naturalis in March 2017 concluded that, while the field of taxonomy provided exciting opportunities for interlinking data to the wider biodiversity community, elaborating an in-house workflow operated by the EJT production team was not advisable. Given the journal's independent position (i.e., entirely managed, produced and published directly by the consortium members), the development of a tailor-made XML production platform was considered too costly, and performing manual encoding would be too time-consuming and technical for the desk editors to undertake. The only existing platform that offered integrated services for XML encoding of taxonomy literature was the Pensoft ARPHA Journal Publishing System (see Penev et al. 2019). EJT felt that it was important to investigate an alternative to this single solution which, although highly performant, presented issues of workflow rigidity, cost and commercial monopoly.

Plazi and GoldenGATE
Based on its findings, the Naturalis team carried out a proof-of-concept using GoldenGATE to apply mark-up to EJT articles published in PDF (Fig. 1). GoldenGATE is a semi-automatic retro-conversion tool for encoding taxonomic literature. The open-source program was developed by the Swiss NGO Plazi (http://plazi.org) to support and promote the interoperability of taxonomic treatments with other relevant cyber infrastructure components (name servers, biodiversity resources, etc.). It was specifically conceived to process "born-digital" PDFs (as opposed to digitised paper publications) by recovering the text from the rendering instructions embedded in the file and discovering structural elements such as figures, tables and textual sub-sections (e.g., taxonomic treatments). Once identified, these elements are parsed using the tags defined in the TaxPub extension to the JATS DTD (Catapano 2010), which in turn allows the text to be annotated with scalable links to external sources. The process can be highly automated by developing a journal-specific template, chaining the various steps in the conversion process and using batch processing.

The EJT-Plazi workflow
A collaboration between EJT and Plazi was chosen owing to the latter's valuable expertise in domain-specific mark-up for taxonomic literature (Agosti et al. 2019a), as well as for their established distribution network for the harvested content; both essential points for EJT. Furthermore, Plazi's data warehouse Treatment Bank (https://treatmentbank.org), combined with the group's active status as a trusted data provider for the Global Biodiversity Information Facility (GBIF) and the Biodiversity Literature Repository (BLR; https://zenodo.org/communities/biosyslit/; Agosti 2019b), offered further scope for the distribution of the taxonomic treatments, images, specimens and other data published by EJT.
Using GoldenGATE and TaxPub, the taxonomic treatments and specimen data featured within EJT articles are converted into Darwin Core (DwC) archives (Wieczorek et al. 2012), the biodiversity informatics data standard developed by the Biodiversity Information Standards (TDWG), which is the preferred format for publishing data to the GBIF network. Once converted, data relating to the treatments and specimens published in EJT is accessible via Treatment Bank and GBIF within a few days of publication, at no extra cost or effort for the author. The figures and captions, along with a full copy of the article, are also available via the Biodiversity Literature Repository. All data harvested from an article is explicitly linked to the original publication using full original citations and a DOI. This means that the sub-article elements are integrated into a much larger network of biodiversity data, which helps to ensure their preservation but also maximises the reach of the article (Fig. 2).

Semantic enhancement of specimen citations
A goldmine of data
The details given by authors about the specimens studied in their research, especially for the type material used in nomenclatural acts, are extremely rich and highly structured. Within the realm of biodiversity informatics, the physical specimens studied in a systematic account can be used as a valuable anchor to unambiguously identify taxa, analyse the associated collecting data and track any research based on these vectors.

The EJT-Plazi workflow. Plazi processes the PDF of an EJT publication to extract the sub-article elements and distribute them to biodiversity aggregators. Taxonomic treatments, images, tables, scientific names and even the fine-grain specimen data are semantically enhanced and available on a variety of platforms.

Fig. 3.
A taxonomic treatment extracted from the European Journal of Taxonomy and displayed on the Plazi Treatment Bank. In the right-hand sidebar, the parsing performed on the specimen citations has been used to generate graphic charts of the material studied. In the inset, a map has been generated via Google Maps by plotting the extracted geocoordinates.

Plazi demonstrated how the information relating to the collection and preservation of physical specimens can be parsed to a fine degree and used to generate Darwin Core archives of occurrence records (Wieczorek et al. 2012). Once available in a machine-readable format, these records can be used to search, cite and track specimens through a multitude of criteria (by locality, date, collector, repository, etc.) (Fig. 3), providing researchers and collection managers with sophisticated ways to query data sets and carry out their work (e.g., Nicolson & Tucker 2017; Miller 2019).

Structuring text for automatic parsing
During the first test phase, EJT submitted to Plazi a sample of 30 articles representing the diversity of disciplines (entomology, zoology, palaeontology, botany) and article types (species descriptions, revisions, check-lists, etc.) published by the journal. These publications were processed within GoldenGATE using algorithms to automatically mark up the article structure and the material sections.
Certain highly structured elements such as dates and geo-coordinates were correctly identified and encoded. However, the wide range of methods, punctuation and vocabularies used to describe the specimens examined resulted in a considerable number of errors that were time-consuming to correct manually. Difficulties mainly stemmed from matching occurrence data to the correct specimen and delimiting unstructured data, e.g., distinguishing a habitat description from a locality within a text string.
Thus it became clear that a standardised format for the presentation of specimen data was required in order to facilitate fine-grain harvesting of details related to the physical material (specimens) studied. If correctly formatted, these data could be harvested, converted into occurrence records, then integrated into the wider infrastructure of biodiversity informatics (Dikow 2019). Moreover, by establishing standardised formatting that different authors and journals could follow, we also hoped to enable text mining for other taxonomy journals that publish in PDF.

Defining the EJT format
The specimen citations contained within the test set of articles, as well as those found in further articles published by EJT and other taxonomic journals (Zoosystema, Adansonia, Geodiversitas, ZooKeys, Zootaxa, the Zoological Journal of the Linnean Society, the Botanical Journal of the Linnean Society, PhytoKeys, Phytotaxa), were analysed to establish the types of data most commonly presented.
Once the specimen citations had been broken down into a list of highly recurrent fields, these fields were mapped to DwC terms (http://rs.tdwg.org/dwc/terms/). This permitted the creation of a flexible template for specimen citations, designed to fit a broad array of specimen data, which was tested on a wide sample of citations and refined according to feedback from Plazi, the EJT production staff and scientific editors, as well as several active EJT authors. The resulting template and formatting guidelines are described in detail hereafter as the Material Citations Formatting Guidelines.
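To illustrate what mapping recurrent citation fields to Darwin Core terms can look like, the following is a minimal sketch in Python. The field labels on the left-hand side (for example "collector" or "specimen_code") and the function name to_dwc are assumptions made for this illustration and are not the labels used by EJT or Plazi; the term names on the right-hand side are taken from the Darwin Core terms namespace cited above.

```python
# Namespace of the Darwin Core terms referred to in the text.
DWC_NAMESPACE = "http://rs.tdwg.org/dwc/terms/"

# Hypothetical citation field labels (left) mapped to Darwin Core term names (right).
FIELD_TO_DWC = {
    "country": "country",
    "locality": "locality",
    "latitude": "decimalLatitude",
    "longitude": "decimalLongitude",
    "elevation": "verbatimElevation",
    "date": "eventDate",
    "collector": "recordedBy",
    "institution": "institutionCode",
    "specimen_code": "catalogNumber",
    "type_status": "typeStatus",
}


def to_dwc(fields):
    """Rename known citation fields to full Darwin Core term IRIs.

    Fields without a mapping are left out of the result rather than guessed.
    """
    return {
        DWC_NAMESPACE + FIELD_TO_DWC[name]: value
        for name, value in fields.items()
        if name in FIELD_TO_DWC
    }


if __name__ == "__main__":
    record = to_dwc({"country": "Thailand", "institution": "HNHM", "specimen_code": "97479"})
    for term, value in record.items():
        print(term, "=", value)
```

Keeping the mapping in a plain dictionary means the template can be extended with further fields without touching the conversion logic.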

Material citations formatting guidelines
In accordance with the European Journal of Taxonomy's FAIR Data & Open Science policy (available from https://europeanjournaloftaxonomy.eu), the formatting guide for zoological and botanical specimen citations is presented below. Authors are encouraged to prepare their manuscripts according to this model prior to submission, but they will also be given an opportunity to comply upon acceptance of the article.
While EJT strongly recommends that authors adhere to the guidelines given below, the fine-grain formatting of the material citations is not compulsory; if an author decides not to comply or that the material is not appropriate, EJT will perform reduced formatting during production. In this case, the majority of the specimen data will not be tagged and converted into a machine-readable format; this means that the specimen-related information from the paper will not be included in major databases.
Only specimen data presented in the 'Materials examined' section will be tagged and converted for distribution. At this time, any specimen data presented in a separate table or section of the paper cannot be linked back to the specimen citation to form a full occurrence record.

Order
Each material citation is composed of diverse data fields (number of specimens, locality, date collected, etc.) that EJT identifies using Darwin Core (DwC) terms. To do this efficiently, it is important to ensure that the different fields of a material citation are consistently presented in the same order throughout the article or, at the very least, within a taxon treatment.

Botany
Details on how to format each data field are provided in the 'Data fields' section.

Punctuation
A bullet point "•" (Unicode: hex 2022, decimal 8226) is used to signify the beginning of a material citation. In Microsoft Word, the following keyboard shortcuts can be used to obtain a bullet point:
• for Mac: Alt + 8 (QWERTY keyboard) / Alt + shift + full stop (AZERTY)
• for Windows: Alt + 0149 on the numeric keypad
Within a citation, a semicolon ";" delimits each different field. Semicolons should not be used elsewhere in a material citation.
A single field can be composed of several details, which are separated by commas (e.g., the details region, area and town for the 'locality' field). In the following example, the 'locality' field is composed of two details: Province ("Eastern Cape Province") and town ("Cradock"):
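As a rough illustration of why this punctuation matters for automatic parsing, the sketch below splits a material paragraph on bullet points, then semicolons, then commas. It is written in Python and is not the GoldenGATE implementation; the function name split_citations, the sample string and the repository acronym "ABCD" are invented for this illustration and are not the journal's own example.

```python
BULLET = "\u2022"  # the bullet point that opens each material citation


def split_citations(paragraph):
    """Split a material paragraph into citations, fields and details.

    Returns one list per citation; each citation is a list of fields, and
    each field is a list of comma-separated details. The text before the
    first bullet (the country or water body) is ignored here.
    """
    citations = []
    for chunk in paragraph.split(BULLET)[1:]:
        fields = [field.strip() for field in chunk.split(";") if field.strip()]
        citations.append(
            [[detail.strip() for detail in field.split(",")] for field in fields]
        )
    return citations


if __name__ == "__main__":
    # Invented sample text, for illustration only.
    sample = (
        "SOUTH AFRICA \u2022 1 specimen; Eastern Cape Province, Cradock; ABCD 123 "
        "\u2022 2 specimens; same locality; ABCD 124"
    )
    for citation in split_citations(sample):
        print(citation)
```

Because the three delimiters are reserved for these roles, the splitting needs no knowledge of the content of each field; a semicolon used elsewhere in a citation would break this assumption, which is why the guidelines exclude it.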

Type material
Zoology
Type material should be presented in separate paragraphs with relevant subheadings (Holotype, Paratypes, etc.).

Basionyms & synonyms
In botanical articles, the type material of basionyms and homotypic synonyms is presented in the same paragraph as the relevant scientific name and bibliographic reference (just under the treatment heading), preceded by the mention "Type:" in bold. All heterotypic synonyms under the recognised name are cited accordingly with their basionyms.
This presentation should be used regardless of whether the specimen has been examined (indicated by an exclamation mark in this context) or not.

Repetitive data
Repetitive data can be indicated with terms such as "same data as for holotype", "same data as for preceding", "same locality", "ibid.", etc., as long as the method used is consistent throughout the paper.

'Missing' elements
Zoology
It is not necessary to include information such as "no date" or "no locality data"; list only the elements that are available.

Label citations
Double quotation marks (" ") are used to represent label citations that cannot be reliably interpreted and formatted as recommended in these guidelines. This data will simply be parsed as a verbatim citation. EJT recommends including photos of labels as figures if they contain data that cannot be standardised.
Only quotation marks should be used to present verbatim label data and they should not appear elsewhere in a material citation.

Author interpretation
Use square brackets [ ] to distinguish data that has been interpreted from a label, e.g., coordinates interpreted from a locality, or translations of foreign text:

Locality
The locality data is listed from least to most specific, using commas to divide each detail.
It is recommended to employ the English name in current usage where possible. If a different system is used, e.g., variant spellings or archaic names from label transcriptions, these should preferably be identified using quotes, with their current names given in square brackets.

Specimen count (zoology)
This field can contain several indications about the specimen(s) cited: number, nature (e.g., specimen, juv., shell, exuviae), sex and type status. All subsequent data in the same citation will be applied to the specimen(s) presented.
THAILAND • 3 shells, same data as for preceding; HNHM 97479 • 16 specimens (preserved in ethanol); same data as for preceding; UF 76457.

Data fields
The different data fields of a material citation that EJT identifies for conversion and diffusion are explained below, along with the format required to achieve maximum output and accuracy.

Country/ Water body
The citations must be listed by either country or water body (e.g., ocean/sea), using a separate paragraph for each new zone. The country or water body is presented in capital letters.
If the material is organised by region, use the following format:

Geographic coordinates
Various formats are accepted but it is important to include the degree symbol (°) as well as the direction (N/E/S/W), which distinguishes the data as a geographic coordinate:
• degrees minutes seconds: 40°26′46″ N, 79°58′56″ W
• degrees decimal minutes: 40°26.767′ N, 79°58.933′ W
• decimal degrees: 40.446° N, 79.982° W
Geographic coordinates should be presented to a maximum of 5 decimal places. Latitude and longitude are separated with a comma. Latitude is cited first, then longitude.
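The three accepted notations can all be reduced to signed decimal degrees, as the following Python sketch suggests. It is an illustration only, not the parser used in the EJT-Plazi workflow; the names to_decimal and parse_pair are hypothetical. It relies on the two features the guidelines insist on: the degree symbol and the direction letter.

```python
import re

# Degrees are mandatory; minutes and seconds are optional; a direction letter closes the value.
_COORD = re.compile(
    r"(?P<deg>\d+(?:\.\d+)?)°\s*"
    r"(?:(?P<min>\d+(?:\.\d+)?)[′']\s*)?"
    r'(?:(?P<sec>\d+(?:\.\d+)?)[″"]\s*)?'
    r"(?P<dir>[NSEW])"
)


def to_decimal(coord):
    """Convert a single latitude or longitude string to signed decimal degrees."""
    match = _COORD.fullmatch(coord.strip())
    if match is None:
        raise ValueError(f"not a recognised coordinate: {coord!r}")
    value = (
        float(match["deg"])
        + float(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    # South and west are negative by convention.
    return -value if match["dir"] in "SW" else value


def parse_pair(text):
    """Parse 'latitude, longitude' and round both values to 5 decimal places."""
    lat, lon = (part.strip() for part in text.split(","))
    return round(to_decimal(lat), 5), round(to_decimal(lon), 5)


if __name__ == "__main__":
    for example in (
        "40°26′46″ N, 79°58′56″ W",
        "40°26.767′ N, 79°58.933′ W",
        "40.446° N, 79.982° W",
    ):
        print(example, "->", parse_pair(example))
```

A string without the degree symbol or the direction letter is rejected rather than guessed at, which mirrors the reason these two elements are required in a citation.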

Altitude/elevation/depth
This type of measurement should be explicit in the material citations, e.g.:
• Altitude: alt. 489 m or 547 m a.s.l.
• Depth: depth 20 m
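The value of an explicit label can be shown with a short Python sketch that classifies a field as an elevation or a depth. The patterns cover only the three forms quoted above; the function name classify_measurement and the result labels "elevation_m" and "depth_m" are assumptions made for this illustration.

```python
import re

# Patterns for the explicit forms quoted in the guidelines.
_ELEVATION = re.compile(
    r"^(?:alt\.\s*(?P<a>\d+(?:\.\d+)?)\s*m|(?P<b>\d+(?:\.\d+)?)\s*m\s*a\.s\.l\.)$"
)
_DEPTH = re.compile(r"^depth\s*(?P<d>\d+(?:\.\d+)?)\s*m$")


def classify_measurement(field):
    """Return ('elevation_m', value) or ('depth_m', value), or None if the field is ambiguous."""
    field = field.strip()
    match = _ELEVATION.match(field)
    if match:
        return ("elevation_m", float(match["a"] or match["b"]))
    match = _DEPTH.match(field)
    if match:
        return ("depth_m", float(match["d"]))
    return None


if __name__ == "__main__":
    for text in ("alt. 489 m", "547 m a.s.l.", "depth 20 m", "20 m"):
        print(text, "->", classify_measurement(text))
```

A bare "20 m" returns None in this sketch: without the label, a program cannot tell an altitude from a depth, which is the ambiguity the guideline is meant to remove.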

Collector and collection number (botany)
The collector's name and field number are cited together in italics.
For botanical disciplines that do not catalogue specimens on sheets (e.g., algae, diatoms), we ask that authors use "collected by: X", because the term "leg." does not have the same meaning across all botanical fields.

Additional data
Ideally, the data fields identified above should be listed before any other collection data. If a different order is used, it is important to be as consistent as possible throughout the paper, or at least within a single treatment. Semicolons may be used to separate any additional data into appropriate fields, e.g.:
Additional data can also be given in the appropriate field between brackets, e.g.:

Repository data
The repository data field should be composed of an institution acronym followed by a specimen code/catalogue number/barcode (where available).

Zoology
institution acronym
Acronyms for repositories must feature in a distinct list in the Materials and methods section, under a heading called "Repositories", "Institutional acronyms" or "Institutional abbreviations". Institution codes must follow GRSciColl (https://gbif.org/grscicoll) where possible.
specimen code
Where a specimen code is available, it should be explicit which specimen it refers to. This guarantees unambiguous interpretation, both by human readers and upon encoding. For example, in the citation below, we cannot distinguish which specimens are catalogued under which code:
This citation should be presented as follows:
Use the word "to" instead of a hyphen or an en dash in order to show a range of specimen numbers, e.g., "NHMUK 213584 to 213595".
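Writing ranges with the word "to" makes them straightforward to expand into individual specimen codes, as the following Python sketch indicates. It is an illustration under simple assumptions (an upper-case acronym followed by purely numeric codes); the function name expand_range is hypothetical and the sketch is not part of the EJT-Plazi tooling.

```python
import re

# An upper-case institution acronym, a start number, the word "to", and an end number.
_RANGE = re.compile(r"^(?P<inst>[A-Z]+)\s+(?P<start>\d+)\s+to\s+(?P<end>\d+)$")


def expand_range(text):
    """Expand 'ACRONYM start to end' into one code per specimen.

    Text that is not a range is returned unchanged as a single-item list.
    """
    text = text.strip()
    match = _RANGE.match(text)
    if match is None:
        return [text]
    start, end = int(match["start"]), int(match["end"])
    if end < start:
        raise ValueError(f"range runs backwards: {text!r}")
    width = len(match["start"])  # preserve any leading zeros in the codes
    return [f"{match['inst']} {number:0{width}d}" for number in range(start, end + 1)]


if __name__ == "__main__":
    codes = expand_range("NHMUK 213584 to 213595")
    print(len(codes), codes[0], codes[-1])
```

A hyphen or en dash, by contrast, may legitimately occur inside a single catalogue number, so it cannot safely be read as a range marker.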
Botany
repository
Acronyms of herbaria must follow Index Herbariorum (http://sweetgum.nybg.org/science/ih) and a phrase to this effect will be included in the 'Materials and methods' section under the heading 'Repositories'. Any acronyms used for repositories that do not feature in the Index Herbariorum must also be given here, e.g.:
UANT = University of Antwerp, Belgium
identifier in the repository
Each identifier (barcode and/or other catalogue numbers) should be cited exactly as it is registered in the repository. Each individual code is presented within square brackets immediately after the herbarium acronym. E.g.:
In the case of type material, use the same convention for all fields, except for repository[identifier in the repository]: introduce the field with the nature of the type: holotype, lectotype, etc.

First results
By respecting the standardised format explained in this paper, the specimens cited in EJT publications are quickly and efficiently converted into DwC archives by Plazi using a semi-automatic mark-up workflow developed within GoldenGATE. Since the project was launched in early 2018, EJT and Plazi have contributed 2155 treatments and 7888 material citations to major aggregators of biodiversity data, as well as 2288 figures and 7324 bibliographic references: all directly linked to the original publication.
Spot checks carried out on the occurrence records obtained from the encoded specimen citations reveal that the use of the standard formatting greatly improves the precision and quantity of harvested data. Any discrepancies in the distributed data compared to the original data (stemming from misinterpretation by the parsing algorithms) can be easily reported ("contact" details available directly on the derived web pages) and corrected (updates performed by Plazi and automatically propagated throughout the network).

Workflow evaluation
The most evident drawback of the GoldenGATE workflow is the approach of retro-conversion, whereby pattern recognition and natural language processing techniques are used to decipher the PDF post-publication. This method is seldom chosen by publishers setting up digital workflows owing to the difficulties connected to a computer interpreting the text correctly. However, for the EJT "born-digital" PDF publications, the text is recovered from the rendering instructions embedded in the PDF, which facilitates the task. In addition, a template has been built to automate the conversion process, right up to the fine granularity of material citations. Using Plazi's expertise and the EJT standardised formatting as much as possible, we have not only improved the quality and quantity of the data harvested from the articles, but also the speed with which Plazi can perform encoding.

Finding a consensus for best practices
The process of establishing a standard for citing specimens within taxonomic publications exposed the variety of methods currently used and accepted by publishers. While it is important for researchers to benefit from flexibility in journals to present their work in a form that befits the nuances of their subject matter, it is clear that the taxonomic community needs to invest time and effort into reaching a consensus for standardising data and sharing best practices.
Reaching a consensus on working methods that call on community-wide standards, such as those suggested in the Material Citations Formatting Guidelines, could have a considerable impact on issues that contribute to the "taxonomic impediment", but the solutions, even those easily attainable by all, will only be effective through adequate buy-in and through our willingness to change. For example, the appropriate citation of taxonomic works, acts and authorities has long been evoked as a subject that could improve the recognition of taxonomy researchers (Ebach et al. 2011), yet there is still much confusion about which course of action should be followed, resulting in anarchic practices across different journals and even from one article to another. This is a purely sociological problem, which was directly addressed in the recent paper "Consortium of European Taxonomic Facilities (CETAF) best practices in electronic publishing in taxonomy". An authoritative response is therefore available; the question now remains whether the community will assimilate these best practices.
Best practices and the interoperability of biodiversity data are recurrent themes in CETAF meetings, workgroups and publications (e.g., Keklikoglou et al. 2019), and other leading publishers of taxonomy are equally concerned with these issues, e.g., Pensoft (Penev et al. 2019), who also recommends that authors submitting to the journal ZooKeys apply the format developed by EJT and described in this paper (see: https://zookeys.pensoft.net/about#MaterialsExaminedFormattingGuidelines).
Another pertinent example of the need for controlled vocabularies and well-maintained registers is the citation of digital bio-collections in taxonomic work. Over the past decade, a huge effort has been made by natural history institutions to digitise their collections. This initiative demands serious resources, but also consensus on how to identify, retrieve and cite specimens once they are digitised. Once again, the CETAF has published directives on recommended best practices (Güntsch et al. 2017) for the implementation of globally unique identifiers (GUIDs), but bio-collection managers, informatics communities, researchers and publishers still need to agree on which type of identifier to use, when to assign it and how to cite it, in order for real progress to be made (Guralnick et al. 2015). The citation of specimen GUIDs has not been covered within the present guidelines for this very reason but will be addressed in a subsequent paper.
Producing well-formatted occurrence records and handling associated GUIDs may take extra effort from researchers and from editors during manuscript preparation, but they constitute a solid foundation for subsequent study and will greatly contribute to the creation of an interlinked network of robust data sources. The standardised documentation of physical specimens and collection events would facilitate bio-collection management and, if stricter practices for the citation of material were consistently respected by researchers and publishers, improve interpretation by both humans and machines.

Using digital identifiers to assess impact
The traceable nature of linked data could potentially offer alternative metrics for evaluating research impact (McDade et al. 2011). However, in order to achieve metrics that are sufficiently representative in terms of quality and quantity to make reliable statements, it is urgent to reach a consensus in the use of controlled vocabularies and identifiers, starting with unique IDs for physical specimens, collections and institutions. This would allow specimens to be tracked from the moment they are collected to their deposition in a repository and throughout any subsequent movements, permitting the identification and retrieval of a given physical specimen despite any later changes in local collection management. With such a tagging system in place, it would also be possible to infer the number of times a specimen has been used in research papers by tracking its identifier through every publication that cites it, or to imagine using GUIDs to gather statistics about the representation of museum collections across different fields of research (Guralnick et al. 2015; Nicolson et al. 2019). However, for as long as there is no standardised approach to the citation of biological specimens, any data obtained on a large scale will have limited power.
As an institutional publisher, a specific area of interest for EJT was the identification of institution codes within research articles, indicating where specimens are held. However, the lack of an agreed method for identifying a given institution, in fields other than botany, means that these results cannot currently be exploited. For example, within the test batch of converted EJT articles, the Natural History Museum of London was identified by several acronyms: NHM, NHMUK and BMNH. Without a recognised register of official acronyms for zoological collections, as is the case with botanical collections and the institutional register Index Herbariorum, it would be difficult to make any statements about the collections represented by the journal's publications.
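The effect of variant acronyms on any collection-level statistic can be sketched in a few lines of Python. The alias table below is purely illustrative: choosing "NHMUK" as the preferred form is an assumption made for this example and does not represent an entry in any official register, and the function name count_by_institution is hypothetical.

```python
from collections import Counter

# Hypothetical alias table; "NHMUK" is chosen as the preferred form for illustration only.
ALIASES = {"NHM": "NHMUK", "BMNH": "NHMUK", "NHMUK": "NHMUK"}


def count_by_institution(codes, aliases=ALIASES):
    """Count cited specimens per institution after resolving known acronym variants."""
    return Counter(aliases.get(code, code) for code in codes)


if __name__ == "__main__":
    cited = ["NHM", "BMNH", "NHMUK", "HNHM"]
    print(Counter(cited))               # variants counted as separate institutions
    print(count_by_institution(cited))  # variants merged under one acronym
```

Without an agreed register, every journal or aggregator would have to maintain its own alias table of this kind, and the resulting counts would not be comparable between them.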
If we hope to align our efforts and build a truly seamless global infrastructure for biodiversity knowledge, the institutions dedicated to the curation, documentation and preservation of the primary research matter must take a central role in any proposed scheme. The recommendation in this paper is to follow Index Herbariorum (for botany) and for all other fields the Global Register of Scientific Collections (GRSciColl, a compendium of earlier directories recently merged together and re-launched by GBIF), but this resource will only be as strong as the will of its contributors, i.e., the community must make the effort to contribute, develop and utilise the registry. We therefore recommend that institutions take an active role in enriching and controlling this resource, which offers institutions and collections a foothold in the developing landscape of interlinked data, specimens, publications and institutions. While awaiting the community-curation functionality to be activated by GBIF (intended for late 2019), management should consult the website (https://www.gbif.org/grscicoll) and reflect upon how their institution and collections should be presented. This includes consensus on and communication about a unique and unambiguous acronym to represent their institution (and collections) that should be consistently used as an identifier in all scholarly output.

Conclusion
By contributing data to aggregators such as GBIF and BLR, the EJT-Plazi workflow not only improves access to and citation of taxonomic literature, but also means that the specimens studied and their linked taxonomic information are rapidly made available to the wider scientific community. Another huge benefit of this workflow is that the original data source is consistently cited and linked throughout all subsequent representations.
The EJT-XML project was presented to the international community of biodiversity scientists at the Biodiversity Next meeting in Leiden, The Netherlands in October 2019, where it was received with enthusiasm. We hope that this paper will now serve as a reference for progress in the field of biodiversity informatics, as well as for EJT authors to prepare their manuscripts.
The standard formatting for material citations presented in this paper will be adapted and implemented by four other titles produced by the Muséum national d'histoire naturelle (Paris), and additional journals produced by EJT member institutions are expected to follow suit. We hope that any taxonomic journal or author who would like to contribute to the dynamic exchange of biodiversity data might consider adopting a similar approach to formatting and inform Plazi of their efforts.
Within the scope of this project, EJT has concentrated on standardising the presentation of specimen data to facilitate semantic enhancement and text mining. The next step will be to work on integrating persistent identifiers for sub-article elements (e.g., treatments), specimens and institutions, with the aim of further improving the accessibility of its publications and establishing alternative metrics for taxonomy. EJT intends to take an active role in addressing these issues (CETAF E-publishing workgroup; Biodiversity Next 2019; TDWG working groups) and will continue to collaborate closely with the CETAF to define best practices across European natural science institutions.