Assistance for handling functional trait data and transferring them into the Ecological Trait-data Standard (Schneider et al. 2018, https://terminologies.gfbio.org/terms/ets/pages/ doi: 10.5281/zenodo.1485739).

There are two major use cases for the package:

  • preparing your own trait datasets for upload into public databases, and
  • harmonizing trait datasets from different sources by moulding them into a unified format.

The toolset of the package includes

  1. transforming typical trait-data formats (e.g. species-trait-matrix or measurement-table data) into a unified long-table format and mapping column names onto terms provided in the Ecological Trait-data Standard (ETS) (see section 1. Reading data),
  2. mapping trait concepts onto a user-provided traitlist (i.e. a thesaurus of traits) or globally accessible URIs (see section 2. Standardize traits) and unifying units and factor levels (see vignette ‘Handling units with trait-data’),
  3. mapping species concepts onto globally accessible definitions via URIs (pointing to the GFBio taxonomic ontology server) (see section 3. Standardize taxa),
  4. merging and handling compiled trait-data, while keeping track of the metadata for each original dataset (see section 4. Working with trait-datasets), and
  5. saving trait datasets into a desired format using templates (e.g. for project-specific databases or online repositories) (see section 5. Writing data).

This vignette contains step-by-step instructions for transferring your own data into a standardized trait-dataset for upload to public databases. See Schneider et al. 2018 ‘Towards an Ecological Trait-data Standard’ (pre-print on biorxiv.org, doi: 10.1101/328302) for a discussion of the rationale.


1. Reading data

load data from source

The first step is to load your data into R. This can be your own data, read from a file, or data published elsewhere and directly accessible via a URL.

R knows many ways of getting your data into an R object. In most cases you would read an object from a csv or txt file while maintaining the column headers.

carabids <- read.table("../../data/carabid traits final.txt", header = TRUE)

If reading files from a file repository, you can pass the URL directly to the reading function (as in the Heteroptera example further below).
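As a minimal sketch of reading delimited data while maintaining the column headers, the following base-R example parses a small tab-delimited text. The inline text stands in for a file or URL, and the species names and values are invented for illustration only.

```r
# Tab-delimited toy data; in practice this would be a file path or a url()
txt <- "species\tbody_length\tbody_width\nSpecies a\t2.3\t1.1\nSpecies b\t1.8\t0.8\n"

# header = TRUE is the default of read.delim(); it is spelled out for clarity
toy <- read.delim(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Inspect column names and types before passing the table on
str(toy)
```

Checking the column types at this stage helps to spot trait columns that were read as text because of stray characters.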

Most trait data are stored in one of the following two formats:

  • species × trait matrix : a single account of a trait value for each species (in rows) for several different traits (in columns). No replicates of species are reported. This is the most likely format for literature data, where aggregate measurements or facts for entire species have been collated into a single lookup table.
  • observation wide table : in case of measured data, authors may report multiple raw measurements of different traits (in columns) taken from a single observation instance of a species, i.e. an individual (in rows). Repeated measures of the same trait might also be included as columns or pooled into average values. This is valuable for investigations of intra-specific variation, and also leaves room for filtering by co-factors or analyzing trait response along environmental gradients.

In both cases, additional information on the species or observation may be stored in further columns (e.g. the unit in which a value is reported or the literature source for this measurement or fact, or the date and geolocation of sampling), or in a separate data sheet linked via identifiers for trait, taxon, occurrence or sampling/measurement event. As the column names and the width of the table varies with the number of traits included, merging data from different sources requires user-defined mapping and manual harmonization of these structures.

A more effective format is the measurement long-table (Kattge et al., 2011; Wickham, 2014; Parr et al., 2016), where each row is reserved for a single measurement or fact of a specific trait. This allows repeated measurements on a single individual to be stored by linking the data from separate rows via a unique identifier for each individual (labelled occurrenceID). Multivariate trait measurements can also be recorded in this format by linking multiple rows via a unique measurement identifier. Long-table datasets offer multiple advantages for data manipulation (e.g. filtering, sub-setting and aggregating data), visualization (e.g. plotting measured values by factor variable or taxon) and statistical modelling (e.g. ANOVA for testing differences of trait values by sex) (Wickham, 2014). Each row of the dataset can be interpreted as a statement of an ‘entity x having a qualitative/quantitative feature y’ (Garnier et al., 2017; Schneider et al., 2018). As long-table formats draw from a defined set of columns, merging datasets is much easier.
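To make the difference between the two layouts concrete, the following base-R sketch stacks a small species × trait matrix into a measurement long-table. The data values are invented, and the manual stacking is only an illustration of the target structure; it is not how the package implements the conversion.

```r
# Toy species-trait matrix (values invented for illustration only)
wide <- data.frame(
  species     = c("Species a", "Species b"),
  body_length = c(2.3, 1.8),
  body_width  = c(1.1, 0.8),
  stringsAsFactors = FALSE
)
trait_cols <- c("body_length", "body_width")

# One row per measurement: stack the trait columns on top of each other
long <- data.frame(
  scientificName = rep(wide$species, times = length(trait_cols)),
  traitName      = rep(trait_cols, each = nrow(wide)),
  traitValue     = unlist(wide[trait_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Running number for each individual measurement
long$measurementID <- seq_len(nrow(long))
print(long)
```

The number of columns of the long table stays the same no matter how many traits are added, which is what makes merging datasets from different sources straightforward.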

The function as.traitdata() provided in the package assists in transferring data into the measurement long-table format. For this function to work, it needs to know at least which columns of the original data contain trait values (parameter traits) and which column contains the taxonomic concept (parameter taxa).

Note that in the output table the columns are renamed according to the ETS. The essential columns are traitName and traitValue for the reported measurement or fact, as well as scientificName for the taxon concept. The newly assigned column measurementID contains a running number for each individual trait measurement.
The function automatically interprets data as a species × trait matrix if the taxa column contains only unique entries and no duplicates. In case of multiple assignments to the same taxon, the script assumes an observation wide-table and creates a new column occurrenceID which links measurements taken on the same individual. Both occurrenceID and measurementID can be provided by the author using the parameters occurrences and measurements (each given as a column name or a vector).
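The rule described above can be sketched in a few lines of base R. The helper guess_table_type() is hypothetical and not part of the package; it only mirrors the described logic of checking the taxa column for duplicates.

```r
# Hypothetical helper mirroring the rule described above (not a package function)
guess_table_type <- function(taxa) {
  if (anyDuplicated(taxa) == 0) "species-trait matrix" else "observation table"
}

guess_table_type(c("Acalypta nigrina", "Acalypta parvula"))
guess_table_type(c("Acalypta nigrina", "Acalypta nigrina", "Acalypta parvula"))

# For an observation table without identifiers, a running number per row
# can serve as occurrenceID (toy data, two individuals of the same species)
obs <- data.frame(
  SpeciesID   = c("Acalypta nigrina", "Acalypta nigrina", "Acalypta parvula"),
  Body_length = c(2.35, 2.10, 1.84),
  stringsAsFactors = FALSE
)
obs$occurrenceID <- seq_len(nrow(obs))
```

If the automatic guess would be wrong for your data, for instance a species list that accidentally contains duplicate names, supplying occurrences explicitly avoids the ambiguity.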

heteroptera_raw <-  read.delim(url("https://ndownloader.figshare.com/files/5633883", 
                                         encoding = "windows-1252"),
                                    stringsAsFactors=TRUE)

# Data publication: Gossner, Martin M.; Simons, Nadja K.; Höck, Leonhard;
# Weisser, Wolfgang W. (2016): Morphometric measures of Heteroptera sampled in
# grasslands across three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1


dataset2 <- as.traitdata(heteroptera_raw,
              traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
                         "Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
                         "Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
                         "Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
                         "Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
                         "Hind.Femur_width", "Rostrum_length", "Rostrum_width", 
                         "Wing_length", "Wing_width"),
              taxa = "SpeciesID",
              occurrences = "ID"
              )
#> Input is taken to be an occurrence table/an observation -- trait matrix 
#> (i.e. with individual specimens per row and multiple trait measurements in columns). 
#> If this is not the case, please provide parameters!

# show different trait measurements for same occurrence/individual
subset(dataset2, occurrenceID == "5" ) 
#>        scientificName          traitName traitValue measurementID
#> 5    Acalypta parvula        Body_length       1.84             5
#> 430  Acalypta parvula         Body_width       0.82           430
#> 855  Acalypta parvula        Body_height       0.56           855
#> 1280 Acalypta parvula      Thorax_length       0.17          1280
#> 1705 Acalypta parvula       Thorax_width       0.84          1705
#> 2130 Acalypta parvula         Head_width       0.36          2130
#> 2555 Acalypta parvula          Eye_width       0.10          2555
#> 2980 Acalypta parvula       Antenna_Seg1       0.07          2980
#> 3405 Acalypta parvula       Antenna_Seg2       0.06          3405
#> 3830 Acalypta parvula       Antenna_Seg3       0.40          3830
#> 4255 Acalypta parvula       Antenna_Seg4       0.15          4255
#> 5105 Acalypta parvula Front.Tibia_length       0.39          4716
#> 5530 Acalypta parvula   Mid.Tibia_length       0.40          5141
#> 5955 Acalypta parvula  Hind.Tibia_length       0.50          5566
#> 6380 Acalypta parvula Front.Femur_length       0.42          5991
#> 6805 Acalypta parvula  Hind.Femur_length       0.43          6416
#> 7230 Acalypta parvula  Front.Femur_width       0.08          6841
#> 7655 Acalypta parvula   Hind.Femur_width       0.07          7266
#> 8080 Acalypta parvula     Rostrum_length       0.80          7691
#> 8505 Acalypta parvula      Rostrum_width       0.08          8116
#> 8930 Acalypta parvula        Wing_length       1.78          8541
#> 9355 Acalypta parvula         Wing_width       0.65          8966
#>      occurrenceID
#> 5               5
#> 430             5
#> 855             5
#> 1280            5
#> 1705            5
#> 2130            5
#> 2555            5
#> 2980            5
#> 3405            5
#> 3830            5
#> 4255            5
#> 5105            5
#> 5530            5
#> 5955            5
#> 6380            5
#> 6805            5
#> 7230            5
#> 7655            5
#> 8080            5
#> 8505            5
#> 8930            5
#> 9355            5
#> 
#> This trait-dataset contains 23 traits for 179 taxa ( 9386 measurements in total).
#> NULL

Providing the parameters taxa and occurrences allows the user to be explicit about the structure of the output data.

specify units

For a standardisation of quantitative trait data, the unit of measurement is essential. Often, this information is kept in the metadata descriptions. But for a standardised table containing measurements from different sources, this information should always accompany the measurement value. The ETS suggests the term traitUnit to contain the unit for each measurement in the data table.

The function as.traitdata() creates this column via its parameter units (see the example in the next section). This can be done for all traits in a single stroke (if all reported values refer to the same unit) or for each trait specifically (if they use different measurement units or if the table comprises a mixture of quantitative and qualitative traits).
Accordingly, the parameter units takes a single character string, or a vector of character strings, containing valid entries as expected by the package ‘units’ (Pebesma et al. 2016, https://github.com/edzer/units/, v0.4-5). Examples are ‘mm’, ‘m2’ or ‘m^2’, and ‘m/s’.

The vignette “Handling units with trait-data” covers the various use cases of unit assignment and of harmonizing trait-data with different units and different factor levels in depth.

keep additional information

The raw data might contain further information on the individuals or the trait measurement itself in further data columns that are valuable for later analysis. This can be for instance data about the sex or developmental stage of the individual, the sampling or preservation method of the specimen, or the conditions under which the measurement was taken.

The parameter keep allows you to specify which columns contain valuable information as a character vector. As a negative version of keep, specifying drop would allow you to name the columns that are not valuable, while all others will be kept. Not specifying keep or drop will result in dropping all columns except the core measurement and identifier columns.

The three extensions of the ETS provide standard terms for this kind of information:

  • The Taxon extension provides further terms for specifying the taxonomic resolution of the observation and to ensure the correct reference in case of synonyms and homonyms.
  • The Measurement Or Fact extension provides terms to describe information at the level of single measurements or reported facts, such as the original literature reference for the reported value, the method of measurement or statistical method of aggregation. It provides important information that allows for the tracking of potential sources of noise or bias in measured data (e.g. variation in measurement method) or aggregated values (e.g. statistical method), as well as the source of reported facts (e.g. literature source or expert reference).
  • The Occurrence extension contains vocabulary to describe information on the observation context of individual specimens, such as sex, life stage or age. This also includes the method of sampling and preservation, as well as the date and geographical location, which provide an important resource to analyze trait variation due to differences in space and time.

We highly recommend mapping the input columns into these standard terms by providing a named vector for keep that gives the target ETS terms as vector names.

dataset2 <- as.traitdata(heteroptera_raw,
              traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
                         "Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
                         "Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
                         "Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
                         "Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
                         "Hind.Femur_width", "Rostrum_length", "Rostrum_width", 
                         "Wing_length", "Wing_width"),
              taxa = "SpeciesID",
              occurrences = "ID",
              units = "mm",
              keep = c(order = "Order", family = "Family", 
                       sex = "Sex", lifeStage = "Wing_development", 
                       basisOfRecordDescription = "Source", 
                       verbatimLocality = "Center_Sampling_region", 
                       references = "Voucher_ID" )
)
#> Input is taken to be an occurrence table/an observation -- trait matrix 
#> (i.e. with individual specimens per row and multiple trait measurements in columns). 
#> If this is not the case, please provide parameters!

head(dataset2)
#>     scientificName   traitName traitValue traitUnit measurementID
#> 1 Acalypta nigrina Body_length       2.35        mm             1
#> 2 Acalypta nigrina Body_length       2.10        mm             2
#> 3 Acalypta nigrina Body_length       2.17        mm             3
#> 4 Acalypta nigrina Body_length       2.15        mm             4
#> 5 Acalypta parvula Body_length       1.84        mm             5
#> 6 Acalypta parvula Body_length       1.81        mm             6
#>   occurrenceID     order   family           basisOfRecordDescription
#> 1            1 Hemiptera Tingidae Zoological State Collection Munich
#> 2            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3            3 Hemiptera Tingidae Zoological State Collection Munich
#> 4            4 Hemiptera Tingidae Zoological State Collection Munich
#> 5            5 Hemiptera Tingidae Zoological State Collection Munich
#> 6            6 Hemiptera Tingidae Zoological State Collection Munich
#>                                     references sex lifeStage
#> 1      Treuchtlingen_leg.Seidenstücker_21.5.48   f         m
#> 2            Nürnberg_leg.Seidenstücker_9.7.44   f         b
#> 3 Gunzenshausen_Mfr._leg.Seidenstücker_15.4.58   m         m
#> 4            Nürnberg_leg.Seidenstücker_9.7.44   m         b
#> 5                Nürnberg_Seidenstücker_4.8.45   f         m
#> 6               Erlangen_Seidenstücker_30.9.40   f         b
#>    verbatimLocality
#> 1 48°57’ N 10°55’ E
#> 2 49°27’ N 11°05’ E
#> 3 49°07’ N 10°45’ E
#> 4 49°27’ N 11°05’ E
#> 5 49°27’ N 11°05’ E
#> 6 49°36’ N 11°00’ E
#> 
#> This trait-dataset contains 23 traits for 179 taxa ( 9386 measurements in total).
#> 
#> [ ] :

Note that an entry without a name in the named vector keeps its original column name. Note also that no checking for valid column names (as compared to the traitdata glossary) is performed at this stage. This is to ensure that the raw data table created by as.traitdata() can contain any columns that the author considers relevant. The keep parameter can thus also be used to rename columns into intuitive column names.
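The renaming behaviour of a partially named vector can be illustrated in base R. This is a sketch of the described behaviour and not the package's internal code; the toy data frame reuses column names from the Heteroptera example, while its values and the column Notes are invented.

```r
# Toy raw data; values and the column 'Notes' are invented
raw <- data.frame(
  Sex              = c("f", "m"),
  Wing_development = c("m", "b"),
  Voucher_ID       = c("voucher_1", "voucher_2"),
  Notes            = c("x", "y"),
  stringsAsFactors = FALSE
)

# Named entries are renamed; the unnamed entry keeps its original name
keep <- c(sex = "Sex", lifeStage = "Wing_development", "Voucher_ID")

target <- names(keep)
if (is.null(target)) target <- rep("", length(keep))  # fully unnamed vector
unnamed <- target == ""
target[unnamed] <- unname(keep[unnamed])

# Select the requested columns and apply the target names
kept <- setNames(raw[unname(keep)], target)
names(kept)
#> [1] "sex"        "lifeStage"  "Voucher_ID"
```

The column Notes is not listed in keep and is therefore absent from the result.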

derived trait-values

Many traits comprise compound measures of multiple traits, such as length-mass ratios or morphometric indices. Other traits must be refined in terms of factor levels, or reduced to binary trait values. Many of these tasks can be achieved on the raw data matrix, before conversion into the long-table format, using base functions like transform(), factor() or match(), or the mutate() function provided by the package ‘plyr’.

However, if the data are converted to long-table format, these tasks may become tedious as they require splitting the data before the computation can be done. The function mutate.traitdata() performs these tasks (working as a wrapper to plyr::mutate()) while keeping an eye on the units.

dataset2 <- mutate.traitdata(dataset2, 
                            Body_shape = Body_length/Body_width, 
                            Body_volume = Body_length*Body_width*Body_height,
                            Wingload = Wing_length*Wing_width/Body_volume)

head(dataset2[dataset2$traitName %in% c("Body_shape", "Body_volume", "Wingload"),])
#>        scientificName  traitName traitValue traitUnit measurementID
#> 9387 Acalypta nigrina Body_shape   2.043478         1          <NA>
#> 9388 Acalypta nigrina Body_shape   1.721311         1          <NA>
#> 9389 Acalypta nigrina Body_shape   2.028037         1          <NA>
#> 9390 Acalypta nigrina Body_shape   2.067308         1          <NA>
#> 9391 Acalypta parvula Body_shape   2.243902         1          <NA>
#> 9392 Acalypta parvula Body_shape   1.885417         1          <NA>
#>      occurrenceID     order   family           basisOfRecordDescription
#> 9387            1 Hemiptera Tingidae Zoological State Collection Munich
#> 9388            2 Hemiptera Tingidae Zoological State Collection Munich
#> 9389            3 Hemiptera Tingidae Zoological State Collection Munich
#> 9390            4 Hemiptera Tingidae Zoological State Collection Munich
#> 9391            5 Hemiptera Tingidae Zoological State Collection Munich
#> 9392            6 Hemiptera Tingidae Zoological State Collection Munich
#>                                        references sex lifeStage
#> 9387      Treuchtlingen_leg.Seidenstücker_21.5.48   f         m
#> 9388            Nürnberg_leg.Seidenstücker_9.7.44   f         b
#> 9389 Gunzenshausen_Mfr._leg.Seidenstücker_15.4.58   m         m
#> 9390            Nürnberg_leg.Seidenstücker_9.7.44   m         b
#> 9391                Nürnberg_Seidenstücker_4.8.45   f         m
#> 9392               Erlangen_Seidenstücker_30.9.40   f         b
#>       verbatimLocality
#> 9387 48°57’ N 10°55’ E
#> 9388 49°27’ N 11°05’ E
#> 9389 49°07’ N 10°45’ E
#> 9390 49°27’ N 11°05’ E
#> 9391 49°27’ N 11°05’ E
#> 9392 49°36’ N 11°00’ E
#> 
#> This trait-dataset contains 26 traits for 179 taxa ( 9386 measurements in total).
#> 
#> [ ] :

Note that all existing traits remain untouched and additional trait measures will be added to the dataset, unless a definition replaces an already existing trait.

It is important to note that the mutate function works at the level of data resolution that is provided by the data, i.e. for occurrence data with multiple measurements on a single individual, the data columns are mutated per occurrenceID.
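What mutating per occurrenceID means can be sketched in base R: the long table is cast back to one row per individual, the compound trait is computed, and the result is appended as new rows. This is an illustration under simplified assumptions (only three columns, no unit handling) and not the implementation of mutate.traitdata(). The four values are taken from the output printed earlier for occurrences 2 and 5.

```r
# Long table with two traits for two individuals (occurrenceID 2 and 5)
long <- data.frame(
  occurrenceID = c(2, 5, 2, 5),
  traitName    = c("Body_length", "Body_length", "Body_width", "Body_width"),
  traitValue   = c(2.10, 1.84, 1.22, 0.82),
  stringsAsFactors = FALSE
)

# Cast to one row per individual so that traits can be combined
wide <- reshape(long, idvar = "occurrenceID", timevar = "traitName",
                direction = "wide")

# Compute the compound trait per individual
shape <- data.frame(
  occurrenceID = wide$occurrenceID,
  traitName    = "Body_shape",
  traitValue   = wide$traitValue.Body_length / wide$traitValue.Body_width,
  stringsAsFactors = FALSE
)

# Append the derived measurements; existing rows remain untouched
long_mutated <- rbind(long, shape)
round(shape$traitValue, 6)
#> [1] 1.721311 2.243902
```

The two ratios agree with the Body_shape values listed for occurrences 2 and 5 in the output above.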

2. Standardize traits

The function as.traitdata() produced a tidy and correctly formatted version of your own trait data. We now turn to the challenging task of standardisation.

The field traitID is meant to contain a globally valid reference to a trait definition that applies to the measurement in question. Due to the heterogeneity of approaches, research questions and taxonomic focus in trait-based research, it is hard to come up with universal trait definitions that can be employed in each and every research context. The mode of measurement and the precise prescriptions of a sampling procedure have been formalized in published handbooks (e.g. Cornelissen et al., 2003; Perez-Harguindeguy et al., 2013; or, for invertebrates, Moretti et al., 2017), but these are of limited use in harmonising trait data that pre-date or ignore these standards. Thesauri, e.g. the TOP Thesaurus of plant traits (Garnier et al., 2017, employed by TRY) or Gramene.org, offer definitions of plant traits in a formal language. For soil invertebrates, the T-SITA thesaurus offers a set of traits relevant for this organism group (see Schneider et al., 2018 for a more detailed distinction of thesauri and ontologies). All in all, Uniform Resource Identifiers (URIs) that provide a stable reference to an unambiguous definition and can be referenced from the dataset exist for only a few organism groups and trait methodologies.

Refer to trait definitions via URIs

Thus, the key information must be provided manually as a data object of your own in R. However, traitdataform assists in creating your own reference list of traits, a so-called ‘thesaurus’, which is used to feed trait definitions, units or identifiers into the dataset.

The function to create an object of class ‘thesaurus’ is as.thesaurus(), which combines several objects created by as.trait(). The ETS provides a set of terms to describe trait concepts, which can be provided as input parameters to as.trait(). Using the as.trait() function allows assigning flexible trait definitions while ensuring compliance with the terms of the traitdata standard outlined above. It also allows building a library of trait definitions where single traits can be reused in multiple projects.

E.g. if all of the traits reported in your dataset refer to a definition published under a publicly available identifier, the thesaurus can be built from as.trait() objects that carry this identifier.
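A minimal sketch of such a thesaurus is given here, assuming the package traitdataform is installed. It uses only the as.trait() arguments that appear elsewhere in this vignette, and the T-SITA links are the ones used in the data.frame variant further down. The object name thesaurus_uri is chosen for illustration.

```r
library(traitdataform)

# Base URL of the T-SITA thesaurus entries, as used elsewhere in this vignette
base_uri <- "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait="

# One as.trait() object per trait, each carrying a public identifier
thesaurus_uri <- as.thesaurus(
  body_length = as.trait("body_length",
                         expectedUnit = "mm", valueType = "numeric",
                         identifier = paste0(base_uri, "Body_length")),
  antenna_length = as.trait("antenna_length",
                         expectedUnit = "mm", valueType = "numeric",
                         identifier = paste0(base_uri, "Antenna_length"))
)

class(thesaurus_uri)
```

Building the thesaurus from single as.trait() objects makes it easy to reuse individual trait definitions across projects.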

Alternatively, a thesaurus can be created from a data.frame, which might be easier if only trait name and identifier are to be provided and more specific trait definitions are not to be stored in the R object.

thesaurus1 <- as.thesaurus(data.frame(
                      trait = c("body_length",  "antenna_length", "metafemur_length", "eyewidth_corr"),
                      identifier = paste0("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=", 
                                          c("Body_length", "Antenna_length", "Femur_length", "Eye_diameter")), 
                      valueType = c("numeric"),
                      expectedUnit = "mm")
)

To transfer the user-provided traits and trait values into standardised values, the function standardize_traits() merges the data table with a reference table of trait definitions to produce values of a compliant format.

The output table now contains each trait measurement twice: once as originally provided (in traitName, traitValue and traitUnit) and once standardised into the target terms and units requested by the thesaurus (in traitNameStd, traitValueStd and traitUnitStd).
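The principle of this merge can be sketched in base R. This is an illustration of the idea only and not the implementation of standardize_traits(): the measurement values, the standard trait names and the hard-coded conversion table to_mm are assumptions made for this example, whereas the package relies on the ‘units’ package for conversions.

```r
# Toy measurements, one of them reported in cm (values invented)
measurements <- data.frame(
  traitName  = c("body_length", "eyewidth_corr"),
  traitValue = c(0.21, 0.13),
  traitUnit  = c("cm", "mm"),
  stringsAsFactors = FALSE
)

# Reference table of trait definitions (standard names are assumptions)
reference <- data.frame(
  trait        = c("body_length", "eyewidth_corr"),
  traitNameStd = c("Body_length", "Eye_diameter"),
  expectedUnit = "mm",
  stringsAsFactors = FALSE
)

# Hard-coded conversion factors into mm, for illustration only
to_mm <- c(mm = 1, cm = 10, m = 1000)

# Merge by trait name, then add standardised values next to the originals
std <- merge(measurements, reference,
             by.x = "traitName", by.y = "trait", all.x = TRUE)
std$traitValueStd <- std$traitValue * unname(to_mm[std$traitUnit])
std$traitUnitStd  <- std$expectedUnit
std[, c("traitName", "traitValue", "traitUnit",
        "traitNameStd", "traitValueStd", "traitUnitStd")]
```

Keeping the original columns alongside the standardised ones makes every conversion traceable back to the reported value.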

refer to own trait definitions

If no published trait concept can be referenced, trait-datasets should be accompanied by a dataset-specific thesaurus. Ideally this is stored as an asset along with your trait dataset in the same data publication or in a separate publication. This can be a csv or txt file, or a website providing direct and stable links to each trait definition.

This reference file should contain at least the following fields for each trait concept:

  • trait should be a short descriptive name. No spaces should be used. Instead, use underscores or capital letters to separate multiple words (e.g. ‘body_length’ or ‘bodyLength’).
  • traitDescription: a detailed and unambiguous, human readable definition.
  • valueType to specify the expected kind of entries. Set it to ‘numeric’ for quantitative traits, ‘integer’ for counts or ordinal traits, ‘character’ for trait values that are provided as free text, ‘factor’ for traits that take one of few non-ordinal levels, ‘logical’ for binary/boolean entries (yes/no).
    • For numeric traits, the parameter expectedUnit should provide the expected unit for the trait. The R script will then try to convert trait values into this unit.
    • For categorical traits of kind ‘factor’ or ‘integer’, the field factorLevels should contain a list of the valid factor levels, separated by semicolons. In case of ordinal traits, the order of the levels must correspond precisely to the possible integer values.
  • comments may contain examples and clarifications
  • optionally, identifier may specify an alphanumeric ID for the specific use in your dataset, but this function is also covered by having defined unambiguous trait labels in field trait which recur in field traitName of the main dataset.

Refer to the ETS set of terms to describe trait concepts for further definitions of these terms, as well as to the best practice guidelines for trait-data publications.
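A reference file with these fields could be prepared in base R as sketched below. The trait wing_development, its description and its factor levels are hypothetical examples, and the file is written to a temporary location instead of being published alongside a dataset.

```r
# Dataset-specific thesaurus as a plain table; the second trait is hypothetical
thesaurus_df <- data.frame(
  trait            = c("body_length", "wing_development"),
  traitDescription = c("From the tip of the head to the end of the abdomen",
                       "Degree of wing development (hypothetical example)"),
  valueType        = c("numeric", "factor"),
  expectedUnit     = c("mm", NA),
  factorLevels     = c(NA, "macropterous;brachypterous"),
  comments         = c("measured in dorsal view", "levels are examples only"),
  stringsAsFactors = FALSE
)

# Store the reference file as csv (here in a temporary file)
path <- tempfile(fileext = ".csv")
write.csv(thesaurus_df, path, row.names = FALSE)

# Read it back and split the semicolon-separated factor levels
reread <- read.csv(path, stringsAsFactors = FALSE)
wing_levels <- strsplit(
  reread$factorLevels[reread$trait == "wing_development"],
  ";", fixed = TRUE
)[[1]]
wing_levels
#> [1] "macropterous"  "brachypterous"
```

Storing the levels in a single semicolon-separated field keeps the reference file to one row per trait.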

# Gossner, Martin M.; Simons, Nadja K.; Höck, Leonhard; Weisser, Wolfgang W.
# (2016): Morphometric measures of Heteroptera sampled in grasslands across
# three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1 
# following the definitions in data publication 
# http://www.esapubs.org/archive/ecol/E096/102/metadata.php

thesaurus2 <-  as.thesaurus(
    Body_length = as.trait("Body_length", identifier = "t1",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "From the tip of the head to the end of the abdomen"),
    Body_width = as.trait("Body_width", identifier = "t2",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the body"),
    Body_height = as.trait("Body_height",identifier = "t3",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Thickest part of the body"),
    Thorax_length = as.trait("Thorax_length", identifier = "t4",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Longest part of the pronotum"),
    Thorax_width = as.trait("Thorax_width", identifier = "t5",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the pronotum"),
    Head_width = as.trait("Head_width", identifier = "t6",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the head including eyes"),
    Eye_width = as.trait("Eye_width", identifier = "t7",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the left eye"),
    Antenna_Seg1 = as.trait("Antenna_Seg1", identifier = "t8",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of first antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg2 = as.trait("Antenna_Seg2", identifier = "t9",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of second antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg3 = as.trait("Antenna_Seg3", identifier = "t10",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of third antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg4 = as.trait("Antenna_Seg4", identifier = "t11",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of fourth antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg5 = as.trait("Antenna_Seg5", identifier = "t12",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of fifth antenna segment (only Pentatomoidea)",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Front.Tibia_length = as.trait("Front.Tibia_length", identifier = "t13",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the tibia of the foreleg"),
    Mid.Tibia_length = as.trait("Mid.Tibia_length", identifier = "t14",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the tibia of the mid leg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Tibia_length"),
    Hind.Tibia_length = as.trait("Hind.Tibia_length", identifier = "t15",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the tibia of the hind leg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Tibia_length"),
    Front.Femur_length = as.trait("Front.Femur_length", identifier = "t16",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the femur of the foreleg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
    Hind.Femur_length = as.trait("Hind.Femur_length", identifier = "t17",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the femur of the hind leg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
    Front.Femur_width = as.trait("Front.Femur_width", identifier = "t18",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Width of the femur of the foreleg"),
    Hind.Femur_width = as.trait("Hind.Femur_width", identifier = "t18",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Width of the femur of the hind leg"),
    Rostrum_length = as.trait("Rostrum_length", identifier = "t19",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the rostrum including all segments"),
    Rostrum_width = as.trait("Rostrum_width", identifier = "t20",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the rostrum"),
    Wing_length = as.trait("Wing_length", identifier = "t21",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Longest part of the forewing",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Wing"),
    Wing_width = as.trait("Wing_width", identifier = "t22",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the forewing",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Wing")
  )

Applying standardize_traits() will refer to this dataset-specific thesaurus and append it as an attribute to the R object.

dataset2Std <- standardize_traits(dataset2, thesaurus2)
subset(dataset2Std, occurrenceID == 2)
#>        scientificName          traitName traitID traitValue traitUnit
#> 185  Acalypta nigrina        Body_length      t1       2.10        mm
#> 610  Acalypta nigrina         Body_width      t2       1.22        mm
#> 871  Acalypta nigrina        Body_height      t3       0.67        mm
#> 1337 Acalypta nigrina      Thorax_length      t4       0.24        mm
#> 1721 Acalypta nigrina       Thorax_width      t5       0.95        mm
#> 2309 Acalypta nigrina         Head_width      t6       0.45        mm
#> 2734 Acalypta nigrina          Eye_width      t7       0.13        mm
#> 3364 Acalypta nigrina       Antenna_Seg1      t8       0.12        mm
#> 3422 Acalypta nigrina       Antenna_Seg2      t9       0.04        mm
#> 3888 Acalypta nigrina       Antenna_Seg3     t10       0.44        mm
#> 4271 Acalypta nigrina       Antenna_Seg4     t11       0.15        mm
#> 4895 Acalypta nigrina Front.Tibia_length     t13       0.40        mm
#> 5524 Acalypta nigrina   Mid.Tibia_length     t14       0.42        mm
#> 5563 Acalypta nigrina  Hind.Tibia_length     t15       0.56        mm
#> 6048 Acalypta nigrina Front.Femur_length     t16       0.48        mm
#> 6412 Acalypta nigrina  Hind.Femur_length     t17       0.56        mm
#> 7021 Acalypta nigrina  Front.Femur_width     t18       0.12        mm
#> 7569 Acalypta nigrina   Hind.Femur_width     t18       0.10        mm
#> 7870 Acalypta nigrina     Rostrum_length     t19       0.92        mm
#> 8118 Acalypta nigrina      Rostrum_width     t20       0.08        mm
#> 8598 Acalypta nigrina        Wing_length     t21       1.66        mm
#> 8968 Acalypta nigrina         Wing_width     t22       0.57        mm
#>            traitNameStd traitValueStd traitUnitStd measurementID
#> 185         Body_length          2.10           mm             2
#> 610          Body_width          1.22           mm           427
#> 871         Body_height          0.67           mm           852
#> 1337      Thorax_length          0.24           mm          1277
#> 1721       Thorax_width          0.95           mm          1702
#> 2309         Head_width          0.45           mm          2127
#> 2734          Eye_width          0.13           mm          2552
#> 3364       Antenna_Seg1          0.12           mm          2977
#> 3422       Antenna_Seg2          0.04           mm          3402
#> 3888       Antenna_Seg3          0.44           mm          3827
#> 4271       Antenna_Seg4          0.15           mm          4252
#> 4895 Front.Tibia_length          0.40           mm          4713
#> 5524   Mid.Tibia_length          0.42           mm          5138
#> 5563  Hind.Tibia_length          0.56           mm          5563
#> 6048 Front.Femur_length          0.48           mm          5988
#> 6412  Hind.Femur_length          0.56           mm          6413
#> 7021  Front.Femur_width          0.12           mm          6838
#> 7569   Hind.Femur_width          0.10           mm          7263
#> 7870     Rostrum_length          0.92           mm          7688
#> 8118      Rostrum_width          0.08           mm          8113
#> 8598        Wing_length          1.66           mm          8538
#> 8968         Wing_width          0.57           mm          8963
#>      occurrenceID     order   family           basisOfRecordDescription
#> 185             2 Hemiptera Tingidae Zoological State Collection Munich
#> 610             2 Hemiptera Tingidae Zoological State Collection Munich
#> 871             2 Hemiptera Tingidae Zoological State Collection Munich
#> 1337            2 Hemiptera Tingidae Zoological State Collection Munich
#> 1721            2 Hemiptera Tingidae Zoological State Collection Munich
#> 2309            2 Hemiptera Tingidae Zoological State Collection Munich
#> 2734            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3364            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3422            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3888            2 Hemiptera Tingidae Zoological State Collection Munich
#> 4271            2 Hemiptera Tingidae Zoological State Collection Munich
#> 4895            2 Hemiptera Tingidae Zoological State Collection Munich
#> 5524            2 Hemiptera Tingidae Zoological State Collection Munich
#> 5563            2 Hemiptera Tingidae Zoological State Collection Munich
#> 6048            2 Hemiptera Tingidae Zoological State Collection Munich
#> 6412            2 Hemiptera Tingidae Zoological State Collection Munich
#> 7021            2 Hemiptera Tingidae Zoological State Collection Munich
#> 7569            2 Hemiptera Tingidae Zoological State Collection Munich
#> 7870            2 Hemiptera Tingidae Zoological State Collection Munich
#> 8118            2 Hemiptera Tingidae Zoological State Collection Munich
#> 8598            2 Hemiptera Tingidae Zoological State Collection Munich
#> 8968            2 Hemiptera Tingidae Zoological State Collection Munich
#>                             references sex lifeStage  verbatimLocality
#> 185  Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 610  Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 871  Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 1337 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 1721 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 2309 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 2734 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3364 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3422 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3888 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 4271 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 4895 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 5524 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 5563 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 6048 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 6412 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 7021 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 7569 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 7870 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 8118 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 8598 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 8968 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 
#> This trait-dataset contains 26 traits for 179 taxa ( 9386 measurements in total).
#> NULL

attributes(dataset2Std)$traits[,c("trait", "identifier","traitDescription","expectedUnit")]
#>                                 trait identifier
#> Body_length               Body_length         t1
#> Body_width                 Body_width         t2
#> Body_height               Body_height         t3
#> Thorax_length           Thorax_length         t4
#> Thorax_width             Thorax_width         t5
#> Head_width                 Head_width         t6
#> Eye_width                   Eye_width         t7
#> Antenna_Seg1             Antenna_Seg1         t8
#> Antenna_Seg2             Antenna_Seg2         t9
#> Antenna_Seg3             Antenna_Seg3        t10
#> Antenna_Seg4             Antenna_Seg4        t11
#> Antenna_Seg5             Antenna_Seg5        t12
#> Front.Tibia_length Front.Tibia_length        t13
#> Mid.Tibia_length     Mid.Tibia_length        t14
#> Hind.Tibia_length   Hind.Tibia_length        t15
#> Front.Femur_length Front.Femur_length        t16
#> Hind.Femur_length   Hind.Femur_length        t17
#> Front.Femur_width   Front.Femur_width        t18
#> Hind.Femur_width     Hind.Femur_width        t18
#> Rostrum_length         Rostrum_length        t19
#> Rostrum_width           Rostrum_width        t20
#> Wing_length               Wing_length        t21
#> Wing_width                 Wing_width        t22
#>                                                        traitDescription
#> Body_length          From the tip of the head to the end of the abdomen
#> Body_width                                      Widest part of the body
#> Body_height                                   Thickest part of the body
#> Thorax_length                              Longest part of the pronotum
#> Thorax_width                                Widest part of the pronotum
#> Head_width                       Widest part of the head including eyes
#> Eye_width                                   Widest part of the left eye
#> Antenna_Seg1                            Length of first antenna segment
#> Antenna_Seg2                           Length of second antenna segment
#> Antenna_Seg3                            Length of third antenna segment
#> Antenna_Seg4                           Length of fourth antenna segment
#> Antenna_Seg5       Length of fifth antenna segment (only Pentatomoidea)
#> Front.Tibia_length                   Length of the tibia of the foreleg
#> Mid.Tibia_length                     Length of the tibia of the mid leg
#> Hind.Tibia_length                   Length of the tibia of the hind leg
#> Front.Femur_length                   Length of the femur of the foreleg
#> Hind.Femur_length                   Length of the femur of the hind leg
#> Front.Femur_width                     Width of the femur of the foreleg
#> Hind.Femur_width                     Width of the femur of the hind leg
#> Rostrum_length             Length of the rostrum including all segments
#> Rostrum_width                                Widest part of the rostrum
#> Wing_length                                Longest part of the forewing
#> Wing_width                                  Widest part of the forewing
#>                    expectedUnit
#> Body_length                  mm
#> Body_width                   mm
#> Body_height                  mm
#> Thorax_length                mm
#> Thorax_width                 mm
#> Head_width                   mm
#> Eye_width                    mm
#> Antenna_Seg1                 mm
#> Antenna_Seg2                 mm
#> Antenna_Seg3                 mm
#> Antenna_Seg4                 mm
#> Antenna_Seg5                 mm
#> Front.Tibia_length           mm
#> Mid.Tibia_length             mm
#> Hind.Tibia_length            mm
#> Front.Femur_length           mm
#> Hind.Femur_length            mm
#> Front.Femur_width            mm
#> Hind.Femur_width             mm
#> Rostrum_length               mm
#> Rostrum_width                mm
#> Wing_length                  mm
#> Wing_width                   mm
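
Because traitID is meant to point to one trait concept, it can be worth checking that the identifiers in a thesaurus are unique. In the listing above, Front.Femur_width and Hind.Femur_width share the identifier t18. A minimal base-R sketch of such a check follows; it uses a hand-copied excerpt of the table rather than the package object, and the object names are chosen for illustration only.

```r
# Toy excerpt of the trait table; trait names and identifiers are copied
# from the listing above, the data frame itself is an illustrative stand-in.
traits <- data.frame(
  trait = c("Hind.Femur_length", "Front.Femur_width",
            "Hind.Femur_width", "Rostrum_length"),
  identifier = c("t17", "t18", "t18", "t19"),
  stringsAsFactors = FALSE
)

# Flag every row whose identifier occurs more than once
# (checking from both ends marks all members of a duplicated group).
is_dup <- duplicated(traits$identifier) |
  duplicated(traits$identifier, fromLast = TRUE)

shared <- traits[is_dup, ]
print(shared)
```

If shared has any rows, the affected traits cannot be told apart by traitID alone.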

3. Standardize taxa

For taxon name standardisation, the function standardize_taxa() makes use of fuzzy matching algorithms provided by the package ‘taxize’ by Scott Chamberlain to match the entries of column scientificName against the GBIF Backbone Taxonomy. The result is written into a new column scientificNameStd. Additional columns comprise the order (for ambiguous names), the reported taxon rank, as well as a globally unique taxon ID which references the taxon to GBIF Backbone Taxonomy in a universal URI format.

The parameter return controls which taxonomic fields are added to the output; by default it contains c("taxonID", "scientificNameStd", "order", "taxonRank"). If further layers of taxonomic information are desired, they can be added to this vector.

Note that for this to work, scientificName must contain the full name of the species or higher taxon, without abbreviations (both spaces and underscores are handled correctly). Note also that taxon name mapping requires an internet connection and might take some time, depending on the length of your species list.
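
Since abbreviated names will not match, a quick offline check of the name column before calling standardize_taxa() can save a slow round trip. The sketch below uses base R only; the names in raw_names are hypothetical, and the underscore replacement is optional because the function is described as handling underscores itself.

```r
# Hypothetical raw names, not taken from the example datasets.
raw_names <- c("Abax_parallelepipedus", "Agonum  ericeti ", "A. nigrina")

# Replace underscores, collapse repeated whitespace, trim both ends.
clean_names <- trimws(gsub("[_[:space:]]+", " ", raw_names))

# Flag abbreviated genus names such as "A. nigrina":
# a single capital letter followed by a full stop at the start of the name.
abbreviated <- grepl("^[A-Z]\\.", clean_names)

# Names that need to be spelled out by hand before matching.
clean_names[abbreviated]
```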

Single-stroke standardization

The functions standardize_traits() and standardize_taxa() are applied sequentially but not necessarily in that order. The output of the first step can be piped into the second step.

To make things even simpler, the functions for format conversion and standardization come with a wrapper function, standardize(). A single call to standardize() runs all steps, provided that the parameters required by the intermediate steps are supplied. It accepts all the optional parameters described above.

As an alternative input pathway, all parameters to standardize() can be specified as attributes of the input object, where the function picks them up automatically. This makes it possible to write recipes for data integration in projects that pull data from multiple sources.
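
The idea of reading parameters from attributes can be illustrated with a few lines of base R. This is only a sketch of the general pattern, not the package's implementation: resolve_param() is a hypothetical helper, and the data values are made up.

```r
# Hypothetical helper: use an explicitly supplied value if there is one,
# otherwise fall back to an attribute of the same name on the input object.
resolve_param <- function(x, name, value = NULL) {
  if (!is.null(value)) value else attr(x, name, exact = TRUE)
}

# Made-up raw data with the recipe stored in its attributes.
raw <- data.frame(name_correct = c("Abax_ater", "Agonum_gracile"),
                  body_length = c(19.5, 7.3),
                  stringsAsFactors = FALSE)
attr(raw, "taxa") <- "name_correct"
attr(raw, "units") <- "mm"

taxa_col  <- resolve_param(raw, "taxa")         # taken from the attribute
units_val <- resolve_param(raw, "units", "cm")  # explicit value takes precedence
```

Storing the recipe on the object keeps the raw data and the instructions for standardizing them in one place.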

4. Working with trait-datasets

combine multiple traitdata tables

After standardizing trait and taxon concepts into unified definitions and converting trait values into harmonized units, it is straightforward to combine multiple trait-datasets into one using rbind(). This can be applied before or after the standardisation process, depending on the use case. Use cases of merging data are:

  • You collected data from different sources and want to harmonize taxon and trait names: bring the data into long-table format and merge them into one data object, then harmonize taxa and units following a uniform standard.
  • No unified trait list or taxon reference exists for the heterogeneous data assembled from different sources (e.g. because they span many different taxa): apply standardization against different reference systems before merging the datasets.

The function call will append the data tables while merging the common columns and maintaining columns that are not present in all datasets (this might produce lots of NA). The column datasetID will be added to keep track of the origin of the data. By default this column will contain the object names of the original datasets, but it can be replaced by more meaningful IDs using the parameter datasetID.

newdata <- rbind(dataset1Std, dataset2Std, 
                datasetID = c("vanderplas17", "gossner15")
              )

Note that the package provides a method for the base function rbind() that handles this merge. Documentation can be accessed via ?rbind.traitdata.
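
The merging behaviour described above (shared columns are stacked, columns missing from one dataset are filled with NA, and a datasetID column records the origin) can be sketched in base R. bind_fill() below is a hypothetical helper written for illustration; it is not the package's rbind() method, and the trait values are made up.

```r
# Hypothetical helper: row-bind two data frames, filling columns that are
# missing from either one with NA, and record the origin of each row.
bind_fill <- function(a, b, ids) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  out <- rbind(a[all_cols], b[all_cols])
  out$datasetID <- rep(ids, times = c(nrow(a), nrow(b)))
  out
}

# Two toy long tables; only the second one has a 'sex' column.
d1 <- data.frame(scientificName = "Abax ater", traitName = "body_length",
                 traitValue = 19.5, stringsAsFactors = FALSE)
d2 <- data.frame(scientificName = "Acalypta nigrina", traitName = "Body_length",
                 traitValue = 2.1, sex = "f", stringsAsFactors = FALSE)

merged <- bind_fill(d1, d2, ids = c("vanderplas17", "gossner15"))
print(merged)
```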

maintaining metadata

The rbind() method also handles dataset-level metadata as described in the section ‘Metadata’ of the Ecological Trait-data Standard (e.g. author or bibliographicCitation). It adds a column datasetID, as well as datasetName and author, if those were provided via the parameter metadata of the as.traitdata() call that created the data. The function as.metadata() provides a standard structure for this purpose.

metadata1 <- as.metadata(
      datasetName = "Carabid traits",
      datasetID = "carabids",
      bibliographicCitation =  bibentry(
        bibtype = "Article",
        title = "Sensitivity of functional diversity metrics to sampling intensity",
        journal = "Methods in Ecology and Evolution",
        author = c(as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
        ),
        year = 2017,
        doi = "10.1111/2041-210x.12728"
      ),
      author = "Fons van der Plas",
      license = "http://creativecommons.org/publicdomain/zero/1.0/"
       )

dataset1 <- as.traitdata(carabids,
  taxa = "name_correct",
  thesaurus = thesaurus1,
  units = "mm",
  keep = c(datasetID = "source_measurement", measurementRemark = "note"),
  metadata = metadata1
)
#> Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!

head(dataset1)
#>          scientificName      traitName traitValue traitUnit measurementID
#> 1 Abax_parallelepipedus antenna_length   8.518519        mm             1
#> 2  Acupalpus_meridianus antenna_length   0.700000        mm             2
#> 3        Agonum_ericeti antenna_length   3.743386        mm             3
#> 4    Agonum_fuliginosum antenna_length   3.500000        mm             4
#> 5        Agonum_gracile antenna_length   3.220000        mm             5
#> 6     Agonum_marginatum antenna_length   5.030000        mm             6
#>   datasetID  measurementRemark
#> 1     klink               <NA>
#> 2  WOODCOCK               <NA>
#> 3     klink               <NA>
#> 4    ribera               <NA>
#> 5    ribera deduced_from_genus
#> 6    ribera               <NA>
#> 
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling
#> intensity." _Methods in Ecology and Evolution_. doi:
#> 10.1111/2041-210x.12728 (URL:
#> http://doi.org/10.1111/2041-210x.12728).
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/

Note the use of the bibentry() function to create a formal bibliographic entry. Note also that the metadata affect how the dataset is printed to the R console. This makes it easier for data users to acknowledge authorship and ownership of the data, while also providing a machine-readable structure that can easily be accessed further down the line.

metadata2 <- as.metadata(
  datasetName = "Heteroptera morphometry traits",
  datasetID = "heteroptera",
  bibliographicCitation =  bibentry(
    bibtype = "Article",
    title = "Morphometric measures of Heteroptera sampled in grasslands across three regions of Germany",
    journal = "Ecology",
    volume = 96,
    issue = 4,
    pages = 1154,
    author = c(as.person("Martin M. Gossner, Nadja K. Simons, Leonhard Hoeck, Wolfgang W. Weisser")),
    year = 2015,
    doi = "10.1890/14-2159.1"
  ),
  author = "Martin M. Gossner",
  license = "http://creativecommons.org/publicdomain/zero/1.0/"
)

dataset2 <- as.traitdata(heteroptera_raw,
  taxa = "SpeciesID",
  thesaurus = thesaurus2,
  units = "mm",
  keep = c(sex = "Sex", references = "Source", lifestage = "Wing_development"),
  metadata =  metadata2
)
#> it seems you are providing repeated measures of traits on multiple specimens of the same species (i.e. an occurrence table)! Sequential identifiers for the occuences will be added. If your dataset contains user-defined occurrenceIDs you may specify the column name in parameter 'occurrences'.

database <- rbind(dataset1, dataset2, 
                datasetID = c("vanderplas17", "gossner15"), 
                metadata_as_columns = TRUE
                ) 
#> Warning in rbind(deparse.level, ...): There seems to be no overlap in trait names of the provided datasets. 
#> It is recommended to map 'traitNameStd' of each dataset to the same thesaurus or ontology!
#> Warning in rbind(deparse.level, ...): There seems to be no overlap in taxon names of the provided datasets!
#> It is recommended to map 'ScientificNameStd' of each dataset to the same thesaurus or ontology!

head(database)
#>          scientificName      traitName traitValue traitUnit measurementID
#> 1 Abax_parallelepipedus antenna_length   8.518519        mm             1
#> 2  Acupalpus_meridianus antenna_length   0.700000        mm             2
#> 3        Agonum_ericeti antenna_length   3.743386        mm             3
#> 4    Agonum_fuliginosum antenna_length   3.500000        mm             4
#> 5        Agonum_gracile antenna_length   3.220000        mm             5
#> 6     Agonum_marginatum antenna_length   5.030000        mm             6
#>   occurrenceID references  sex
#> 1         <NA>       <NA> <NA>
#> 2         <NA>       <NA> <NA>
#> 3         <NA>       <NA> <NA>
#> 4         <NA>       <NA> <NA>
#> 5         <NA>       <NA> <NA>
#> 6         <NA>       <NA> <NA>
#>                                             license datasetID
#> 1 http://creativecommons.org/publicdomain/zero/1.0/  carabids
#> 2 http://creativecommons.org/publicdomain/zero/1.0/  carabids
#> 3 http://creativecommons.org/publicdomain/zero/1.0/  carabids
#> 4 http://creativecommons.org/publicdomain/zero/1.0/  carabids
#> 5 http://creativecommons.org/publicdomain/zero/1.0/  carabids
#> 6 http://creativecommons.org/publicdomain/zero/1.0/  carabids
#>      datasetName            author  measurementRemark lifestage
#> 1 Carabid traits Fons van der Plas               <NA>      <NA>
#> 2 Carabid traits Fons van der Plas               <NA>      <NA>
#> 3 Carabid traits Fons van der Plas               <NA>      <NA>
#> 4 Carabid traits Fons van der Plas               <NA>      <NA>
#> 5 Carabid traits Fons van der Plas deduced_from_genus      <NA>
#> 6 Carabid traits Fons van der Plas               <NA>      <NA>
#> 
#> This trait-dataset contains 27 traits for 299 taxa ( 9386 measurements in total).
#> $carabids
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling
#> intensity." _Methods in Ecology and Evolution_. doi:
#> 10.1111/2041-210x.12728 (URL:
#> http://doi.org/10.1111/2041-210x.12728).
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#> 
#> $heteroptera
#> 
#>  heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions
#> of Germany." _Ecology_, *96*, 1154. doi: 10.1890/14-2159.1 (URL:
#> http://doi.org/10.1890/14-2159.1).
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/

The detailed metadata of both datasets (e.g. license and bibliographic citation) are stored in the attributes of the merged dataset and displayed when it is printed to the R console. You can access the metadata via the attributes() function, e.g. attributes(database)$metadata.
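
The access pattern can be tried without the package. In the sketch below a plain list stands in for the object returned by as.metadata(), so it runs with base R only; the printed output above suggests that a merged dataset holds one such entry per datasetID, which is mirrored here as an assumption.

```r
# Stand-in for dataset-level metadata: a named list with one entry per
# dataset (assumed structure; the package's as.metadata() is not used here).
meta <- list(
  carabids = list(
    datasetName = "Carabid traits",
    author = "Fons van der Plas",
    license = "http://creativecommons.org/publicdomain/zero/1.0/"
  )
)

# Toy data table with the metadata attached as an attribute.
toy <- data.frame(traitName = "antenna_length", traitValue = 3.5,
                  stringsAsFactors = FALSE)
attr(toy, "metadata") <- meta

# Retrieve single fields through attributes().
carabid_meta <- attributes(toy)$metadata$carabids
carabid_meta$license
names(carabid_meta)
```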

writing data recipes

For projects compiling data from multiple sources, it is best practice to work from the original raw data, ideally by pulling them directly from their original repository, and to apply all changes and standardisation steps in an R script. If many field-level changes are necessary, lookup tables help to keep the script slim.
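
A lookup table can be applied with match() in base R. The codes and harmonized values below are hypothetical and only illustrate the mechanism.

```r
# Hypothetical lookup table mapping raw codes to harmonized values.
lookup <- data.frame(
  raw = c("f", "m", "F", "M"),
  std = c("female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Raw column as it might appear in a source file.
raw_sex <- c("f", "M", "m", "x")

# match() returns NA for codes that are absent from the lookup table,
# so unmapped values stay visible instead of being silently altered.
sex_std <- lookup$std[match(raw_sex, lookup$raw)]

# Codes that still need an entry in the lookup table.
unmapped <- unique(raw_sex[is.na(sex_std)])
```

Keeping the lookup table in a separate csv file means that corrections can be reviewed and extended without touching the script.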

traitdataform allows you to store all parameters required for the standardization call in the attributes of the R object. A script for a single data source can then look like this:

carabids <- utils::read.delim(url("https://datadryad.org/bitstream/handle/10255/dryad.134418/carabid%20traits%20final.txt", 
                                encoding = "UTF-8")
                              )

attr(carabids, 'metadata') <- traitdataform::as.metadata(
      datasetName = "Carabid traits",
      datasetID = "carabids",
      bibliographicCitation =  utils::bibentry(
        bibtype = "Article",
        title = "Sensitivity of functional diversity metrics to sampling intensity",
        journal = "Methods in Ecology and Evolution",
        author = c(utils::as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
        ),
        year = 2017,
        doi = "10.1111/2041-210x.12728"
      ),
      author = "Fons van der Plas",
      license = "http://creativecommons.org/publicdomain/zero/1.0/"
       )

attr(carabids, 'thesaurus') <-  traitdataform::as.thesaurus(
          body_length = traitdataform::as.trait("body_length",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length"),
          antenna_length = traitdataform::as.trait("antenna_length",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Antenna_length"),
          metafemur_length = traitdataform::as.trait("femur_length",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
          eyewidth_corr = traitdataform::as.trait("eye_diameter",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Eye_diameter")
        )

attr(carabids, 'taxa') <- "name_correct"
attr(carabids, 'units') <- "mm"
attr(carabids, 'keep') <-  c(measurementDeterminedBy = "source_measurement", measurementRemarks = "note")

When thus specified, the data can be re-formatted simply by calling standardize(carabids).

5. Writing data

The final step in converting trait data into a standardised format, before uploading the file to a public file hosting service, is to save them in a file format that is internationalized, portable and long-term accessible. Internationalization refers to the file encoding (‘UTF-8’ should be used; ‘ASCII’ is possible for data with no special characters), to the decimal delimiter (‘.’ is highly recommended) and to internationally accepted formatting standards for values such as dates (the international norm for date entries is ISO 8601, i.e. “YYYY-MM-DD”). Portability means that the file can be opened on all operating systems (the ‘end of line’ character is particularly important here) and does not rely on proprietary software (like MS Excel or database tools). Long-term accessibility is ensured by choosing a text-based file format (txt, csv or tsv) and by packaging the primary data with all necessary metadata.

The base R function write.table() gives full control over these parameters and should be used to export trait-data.
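
As a minimal sketch, the call below writes a toy long table as a UTF-8 encoded csv file with ‘.’ as decimal delimiter. The data frame toy stands in for a standardized dataset, and the file is written to a temporary directory; the file name and the choice of "\n" as end-of-line character are assumptions made for this example.

```r
# Toy long-table excerpt standing in for a standardized trait dataset.
toy <- data.frame(
  scientificName = c("Abax_parallelepipedus", "Agonum_ericeti"),
  traitName = "antenna_length",
  traitValue = c(8.518519, 3.743386),
  traitUnit = "mm",
  stringsAsFactors = FALSE
)

# Write to a temporary location so that the example leaves no files behind.
out_file <- file.path(tempdir(), "carabids_std.csv")

write.table(toy, file = out_file,
            sep = ",", dec = ".", quote = TRUE, eol = "\n",
            row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm that it round-trips.
check <- read.csv(out_file, encoding = "UTF-8", stringsAsFactors = FALSE)
```

Dates, if present, can be brought into ISO 8601 form beforehand with format(as.Date(x), "%Y-%m-%d").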

Along with these primary data, you should make all ancillary tables available, e.g. the metadata in a human-readable form as well as the lookup tables of traits and taxa:

capture.output(attributes(dataset1Std)$metadata, file = "metadata.txt")

write.table(attributes(dataset1Std)$traits, file = "traits.csv", 
            sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE)
write.table(attributes(dataset1Std)$taxonomy, file = "taxa.csv", 
            sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE)

When publishing the trait data on file servers like Figshare or Zenodo, those files should be uploaded together as a single package (e.g. in a zip file). R can create the archive for you using zip():

zip("carabids_std.zip", c("carabids_std.csv", "metadata.txt", "traits.csv", "taxa.csv") )

More advice on publishing trait data in a standardised way can be found in our ‘Best practice examples for primary data publication’ (Schneider et al., 2018).

References

Cornelissen, J. H. C., Lavorel, S., Garnier, E., Diaz, S., Buchmann, N., Gurvich, D. E., … Van Der Heijden, M. G. A. (2003). A handbook of protocols for standardised and easy measurement of plant functional traits worldwide. Australian Journal of Botany, 51(4), 335–380.

Garnier, E., Stahl, U., Laporte, M.-A., Kattge, J., Mougenot, I., Kühn, I., … Klotz, S. (2017). Towards a thesaurus of plant characteristics: An ecological contribution. Journal of Ecology, 105(2), 298–309. doi:10.1111/1365-2745.12698

Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., … Wirth, C. (2011). A generic structure for plant trait databases. Methods in Ecology and Evolution, 2(2), 202–213. doi:10.1111/j.2041-210X.2010.00067.x

Moretti, M., Dias, A. T., Bello, F., Altermatt, F., Chown, S. L., Azcárate, F. M., … others. (2017). Handbook of protocols for standardized measurement of terrestrial invertebrate functional traits. Functional Ecology, 31(3), 558–567. doi:10.1111/1365-2435.12776

Parr, C. S., Schulz, K. S., Hammock, J., Wilson, N., Leary, P., Rice, J., … Corrigan, R. J., Jr. (2016). TraitBank: Practical semantics for organism attribute data. Semantic Web, 7(6), 577–588. doi:10.3233/SW-150190

Perez-Harguindeguy, N., Diaz, S., Garnier, E., Lavorel, S., Poorter, H., Jaureguiberry, P., … Gurvich, D. E. (2013). New handbook for standardised measurement of plant functional traits worldwide. Australian Journal of Botany, 61(3), 167–234.

Schneider, F. D., Jochum, M., Le Provost, G., Ostrowski, A., Penone, C., Fichtmüller, D., … Simons, N. K. (2018). Towards an Ecological Trait-data Standard. bioRxiv, 328302. doi:10.1101/328302

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. doi:10.18637/jss.v059.i10