Tidy long-table formats (Wickham 2014) are the most predictable and compatible format for the core content of trait datasets and should be used for standalone uploads on low-threshold repositories.

Besides the core trait data (value, trait & taxon), harmonized entries should be added as duplicates (see ‘Glossary’). If data contain multi-layered information on the measurement or occurrence (including geolocation and date information), further columns may be added to the core set of columns (see description of Extensions). For reasons of reproducibility and openness, data should not be uploaded in proprietary spreadsheet formats (like ‘.xlsx’) but rather in comma-separated text files (‘.csv’ or ‘.txt’).
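As a rough illustration of the long-table layout, the following Python sketch reshapes a small species-by-trait matrix into one row per taxon and trait and serialises it as comma-separated text. The column names `taxon`, `trait` and `value`, the helper functions and the example values are invented for this sketch and are not taken from the standard; the glossary remains the authoritative source for column names.

```python
import csv
import io

# Hypothetical wide table: one row per taxon, one column per trait.
# All names and values are made up for illustration only.
wide_rows = [
    {"taxon": "Taxon A", "body_length_mm": "22.5", "wing_morphology": "brachypterous"},
    {"taxon": "Taxon B", "body_length_mm": "15.0", "wing_morphology": "dimorphic"},
]


def wide_to_long(rows, id_column="taxon"):
    """Return one record per (taxon, trait) pair, skipping empty cells."""
    long_rows = []
    for row in rows:
        for trait, value in row.items():
            if trait == id_column or value in ("", None):
                continue
            long_rows.append({"taxon": row[id_column], "trait": trait, "value": value})
    return long_rows


def to_csv_text(rows, fieldnames=("taxon", "trait", "value")):
    """Serialise the long table as comma-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


long_rows = wide_to_long(wide_rows)
csv_text = to_csv_text(long_rows)
```

Writing `csv_text` to a `.csv` file gives a plain-text upload that can be read without proprietary spreadsheet software.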

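There are two possibilities to integrate further information with the core trait data columns: (1) adding it as extra columns in the same table as the core data, or (2) storing it in separate data sheets that are linked to the core data by identifiers.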

The choice between the two is a trade-off between data consistency and readability on the one hand and avoidance of content duplication on the other:

For standalone dataset publications on a hosting service that contain little information beyond the core trait data columns, the first is the preferred format, since it facilitates later analysis of cofactors and correlations. If datasets from different sources are merged, the information is readily available without the risk of breaking the reference to an external data sheet. Key data columns would also be placed in the same table as the core data when traits are assessed at a higher level of organisation, e.g. microbial functional traits assessed at the community level from a soil sample. Here, location or measurement information is the primary focus of the investigation (see vocabulary extensions below). A general definition of whether a column describes asset data or is part of the central dataset is ill-advised. Therefore, our glossary of terms and its extensions should be used to describe the scientific data according to the study context.

The latter links separate data sheets by identifiers, which has the advantage of tidy datasets and avoids duplication of verbose information (Wickham 2014). As a rule of thumb, the columns of the ‘Measurement or Fact’ and ‘Occurrence’ extensions would be stored in a separate data sheet. Darwin Core Archives (DwC-A; Robertson et al. 2009; http://eol.org/info/structured_data_archives) are the recommended structure for GBIF (GBIF 2017, http://tools.gbif.org/dwca-assistant/) and EOL TraitBank (Parr et al. 2016, http://eol.org/info/cp_archives). These are .zip archives containing the data tables as .txt files along with a descriptive metadata file (in .xml format). Detailed descriptions and tools for validation can be found on the websites of EOL (http://eol.org/info/cp_archives) and GBIF (http://tools.gbif.org/dwca-assistant/).
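The linked layout can be sketched in a few lines of Python. The example below keeps verbose occurrence information in a separate sheet and joins it to the core trait table on an identifier. The column names `occurrenceID`, `decimalLatitude`, `decimalLongitude` and `eventDate` are borrowed from Darwin Core terms; the remaining names, the `join_on_id` helper and all values are assumptions made for this illustration, not part of the standard or of any validation tool.

```python
# Core trait table: each row refers to an occurrence by identifier only.
# All values are made up for illustration.
core = [
    {"occurrenceID": "occ-1", "taxon": "Taxon A", "trait": "body_length_mm", "value": "22.5"},
    {"occurrenceID": "occ-1", "taxon": "Taxon A", "trait": "wing_morphology", "value": "brachypterous"},
    {"occurrenceID": "occ-2", "taxon": "Taxon B", "trait": "body_length_mm", "value": "15.0"},
]

# Separate occurrence sheet: location and date are stored once per occurrence.
occurrences = [
    {"occurrenceID": "occ-1", "decimalLatitude": "51.05",
     "decimalLongitude": "13.74", "eventDate": "2016-06-01"},
    {"occurrenceID": "occ-2", "decimalLatitude": "49.87",
     "decimalLongitude": "8.65", "eventDate": "2016-06-03"},
]


def join_on_id(core_rows, extension_rows, key="occurrenceID"):
    """Attach the matching extension row to every core row.

    Raises KeyError when a core row points to an identifier that is
    missing from the extension sheet, i.e. when the reference is broken.
    """
    lookup = {row[key]: row for row in extension_rows}
    merged = []
    for row in core_rows:
        if row[key] not in lookup:
            raise KeyError(f"broken reference: {row[key]!r} not in extension sheet")
        merged.append({**row, **lookup[row[key]]})
    return merged


merged = join_on_id(core, occurrences)
```

The result `merged` corresponds to the single-table layout: location and date are repeated on every trait row of the same occurrence. This is the duplication that the linked layout avoids, while the explicit check for missing identifiers shows the broken-reference risk that the single-table layout avoids.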

The metadata of any dataset that employs this data structure should refer to the respective version of the Ecological Traitdata Standard as “Schneider et al. 2017 Ecological Traitdata Standard v1.0, DOI: XXXX.xxxx, URL: https://ecologicaltraitdata.github.io/ETS/v1.0/”. In addition to the versioned online reference, the dataset should also cite the methods paper “Schneider et al. (in preparation) …” for an explanation of the rationale.

R Tools

The R package ‘traitdataform’ (https://www.github.com/fdschneider/traitdataform) provides tools to transform heterogeneous datasets into a long-table format and to create standardised taxon and trait columns based on public ontologies. See the package documentation site and vignettes for further information.

References

GBIF. 2017. Darwin Core Archives - How-to Guide. http://eol.org/info/structured_data_archives.

Parr, C. S., K. S. Schulz, J. Hammock, N. Wilson, P. Leary, J. Rice, C. Jr, et al. 2016. TraitBank: Practical semantics for organism attribute data. Semantic Web 7:577–588.

Robertson, T., M. Döring, J. Wieczorek, R. De Giovanni, and D. Vieglais. 2009. Darwin Core Text Guide. http://rs.tdwg.org/dwc/terms/guides/text/index.htm.

Wickham, H. 2014. Tidy data. Journal of Statistical Software 59:1–23.