Introduction to 'traitdataform'

Assistance for handling functional trait data and transferring them into the Ecological Trait-data Standard (Schneider et al. 2018, https://terminologies.gfbio.org/terms/ets/pages/ doi: 10.5281/zenodo.1485739).

There are two major use cases for the package:

preparation of own trait datasets for upload into public data bases, and
harmonizing trait datasets from different sources by moulding them into a unified format.

The toolset of the package includes

transforming typical trait-data formats (e.g. species-trait-matrix or measurement-table data) into a unified long-table format and mapping column names into terms provided in the Ecological Trait-data Standard (ETS) (see section 1. Reading data),
mapping of trait concepts onto a user-provided trait list (i.e. a thesaurus of traits) or globally accessible URIs, and unifying units and factor levels (see section 2. Standardize traits),
mapping of species concepts onto globally accessible definitions via URIs (pointing to GFBio taxonomic ontology server) (see section 3. Standardize taxa),
merging and handling compiled trait-data, while keeping track of the metadata for each original dataset (see section 4. Working with trait-datasets),
saving trait datasets into a desired format using templates (e.g. for project-specific databases or online repositories) (see section 5. Writing data).

This vignette contains step-by-step instructions for transferring your own data into a standardized trait-dataset for upload to public databases. See Schneider et al. (2019, Towards an Ecological Trait-data Standard, Methods in Ecology and Evolution, DOI: 10.1111/2041-210X.13288) for a discussion of the rationale.

1. Reading data

load data from source

The first step is to load your data into R. This can be your own data, read from a file, or data published elsewhere and directly accessible via a URL.

R knows many ways of getting your data into an R object. In most cases you would read an object from a csv or txt file while maintaining the column headers.

carabids <- read.table("../../data/carabid traits final.txt", header = TRUE)
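As a minimal, self-contained sketch of the same step, the following reads tab-separated text with a header row. The inline text stands in for a file on disk, and all names and values in it are made up for illustration.

```r
# Toy tab-separated content standing in for a file on disk (values are made up)
raw_text <- "name_correct\tbody_length\tantenna_length
Species_a\t10.5\t4.2
Species_b\t3.1\t1.7"

# header = TRUE keeps the column headers; sep = "\t" matches the delimiter
toy <- read.table(text = raw_text, header = TRUE, sep = "\t",
                  stringsAsFactors = FALSE)

# Inspect column names and types before going any further
str(toy)
```

Checking the column types with str() at this point helps to spot numeric columns that were read as text.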

If reading files from a file repository, you can refer to the URL directly, e.g.

# pulling data from van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017) Sensitivity of functional diversity metrics to sampling intensity. Methods in Ecology and Evolution 8(9): 1072-1080. https://doi.org/10.1111/2041-210x.12728

carabids <- read.delim("https://datadryad.org/stash/downloads/file_stream/23901", stringsAsFactors = FALSE)

Most trait data are stored in one of the following two formats:

species\(\times\)trait matrix : a single account of a trait value for each species (in rows) for a couple of different traits (in columns). No replicates of species are reported. This is the most likely format for literature data, where aggregate measurements or facts for entire species have been collated into a single lookup table.
observation wide table : in case of measured data, authors may report multiple raw measurements of different traits (in columns) taken from a single observation instance of a species, i.e. an individual (in rows). Repeated measures of the same trait might also be included as columns or pooled into average values. This is valuable for investigations of intra-specific variation, and also leaves space for filtering by co-factors or analyzing trait response along environmental gradients.

In both cases, additional information on the species or observation may be stored in further columns (e.g. the unit in which a value is reported or the literature source for this measurement or fact, or the date and geolocation of sampling), or in a separate data sheet linked via identifiers for trait, taxon, occurrence or sampling/measurement event. As the column names and the width of the table varies with the number of traits included, merging data from different sources requires user-defined mapping and manual harmonization of these structures.

A more effective format is the measurement long-table (Kattge et al., 2011; Wickham, 2014; Parr et al., 2016), where each row is reserved for a single measurement or fact of a specific trait. This allows repeated measurements on a single individual to be stored by linking the data from separate rows via a unique identifier for each individual (labelled occurrenceID). Multivariate trait measurements can also be recorded in this format by linking multiple rows via a unique measurement identifier. Long-table datasets offer multiple advantages for data manipulation (e.g. filtering, sub-setting and aggregating data), visualization (e.g. plotting measured values by factor variable or taxon) and statistical modelling (e.g. ANOVA for testing differences in trait value by sex) (Wickham, 2014). Each row of the dataset can be interpreted as a statement of an ‘entity x having a qualitative/quantitative feature y’ (Garnier et al., 2017; Schneider et al., 2018). As long-table formats draw from a defined set of columns, merging datasets is much easier.
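To make the difference between the two layouts concrete, here is a base R sketch that turns a toy species-trait matrix into a measurement long-table. It only illustrates the idea and is not the package's implementation; the data are invented, and the output column names simply follow the ETS terms used in this vignette.

```r
# Toy species x trait matrix (values are made up for illustration)
wide <- data.frame(
  taxon          = c("Species_a", "Species_b"),
  body_length    = c(10.5, 3.1),
  antenna_length = c(4.2, 1.7),
  stringsAsFactors = FALSE
)
trait_cols <- c("body_length", "antenna_length")

# One row per measurement: the taxon is repeated for each trait column
long <- data.frame(
  verbatimScientificName = rep(wide$taxon, times = length(trait_cols)),
  verbatimTraitName      = rep(trait_cols, each = nrow(wide)),
  verbatimTraitValue     = unlist(wide[trait_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Running number identifying each single measurement
long$measurementID <- seq_len(nrow(long))
long
```

The long table has a fixed set of columns regardless of how many traits the source table contained, which is what makes merging across sources straightforward.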

The function as.traitdata() provided in the package assists in transferring data into the measurement long-table format. For this function to work, it needs to know at least which columns of the original data contain trait values (parameter traits) and which column contains the taxonomic concept (parameter taxa).

dataset1 <- as.traitdata(carabids, 
                         taxa = "name_correct",
                         traits = c("body_length", 
                                    "antenna_length", 
                                    "metafemur_length", 
                                    "eyewidth_corr"),
                         units = "mm"
                         )
#> Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!

head(dataset1)
#>   verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1  Abax_parallelepipedus       body_length          15.846561                mm
#> 2   Acupalpus_meridianus       body_length           2.670000                mm
#> 3         Agonum_ericeti       body_length           5.873016                mm
#> 4     Agonum_fuliginosum       body_length           5.090000                mm
#> 5         Agonum_gracile       body_length           4.880000                mm
#> 6      Agonum_marginatum       body_length           8.250000                mm
#>   measurementID measurementDeterminedBy measurementRemarks
#> 1             1                   klink               <NA>
#> 2             2                WOODCOCK               <NA>
#> 3             3                   klink               <NA>
#> 4             4                  ribera               <NA>
#> 5             5                  ribera deduced_from_genus
#> 6             6                  ribera               <NA>
#> 
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

Note that in the output table the columns have been renamed according to the ETS. The essential columns are verbatimTraitName, verbatimTraitValue for the reported measurement or fact as well as verbatimScientificName for the taxon concept. The newly assigned column measurementID contains a running number for each individual trait measurement.
The function automatically interprets data as a species\(\times\)traits matrix if the taxa column contains only unique entries and no duplicates. In case of multiple assignments to the same taxon, the script assumes an observation wide-table and creates a new column occurrenceID which links measurements taken on the same individual. Both occurrenceID and measurementID can be provided by the author using the parameter occurrences (as a column name or a vector) or measurements (as a column name or a vector).

#
# heteroptera_raw
#
# dataset included in package traitdataform 
#
# Data publication: M. Gossner, Martin; K. Simons, Nadja; Hoeck, Leonhard; W.
# Weisser, Wolfgang (2016): Morphometric measures of Heteroptera sampled in
# grasslands across three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1


dataset2 <- as.traitdata(heteroptera_raw,
              traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
                         "Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
                         "Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
                         "Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
                         "Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
                         "Hind.Femur_width", "Rostrum_length", "Rostrum_width", 
                         "Wing_length", "Wing_width"),
              taxa = "SpeciesID",
              occurrences = "ID"
              )

# show different trait measurements for same occurrence/individual
subset(dataset2, occurrenceID == "5" )

This allows the user to be explicit about the structure of the output data.
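The automatic interpretation described above boils down to a uniqueness check on the taxon column. The sketch below mimics that check in base R; guess_layout() is a hypothetical helper written for this illustration and is not a function of traitdataform.

```r
# Hypothetical helper, not part of traitdataform: guesses the input layout
# from whether any taxon name occurs more than once
guess_layout <- function(taxa) {
  if (anyDuplicated(taxa) == 0) {
    "species-trait matrix"
  } else {
    "observation wide-table"
  }
}

guess_layout(c("Species_a", "Species_b"))
guess_layout(c("Species_a", "Species_a", "Species_b"))

# In the second case each input row is one individual, so a running
# number per row can serve as occurrenceID
taxa <- c("Species_a", "Species_a", "Species_b")
occurrenceID <- seq_along(taxa)
```

A dataset with exactly one measured individual per species would be read as a species-trait matrix by such a check, which is one reason to pass the identifiers explicitly.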

specify units

For a standardisation of quantitative trait data, the unit of measurement is essential. Often, this information is kept in the metadata descriptions. But for a standardised table containing measurements from different sources, this information should always accompany the measurement value. The ETS suggests the term verbatimTraitUnit to contain the original author’s unit for each measurement in the data table.

The function as.traitdata() creates this column via its parameter units (see example above). This can be done for all traits in a single stroke (if all reported values refer to the same unit) or to each trait specifically (if they used different measurement units or if the table comprises a mixture of quantitative and qualitative traits).
Accordingly, the parameter units takes a single character string, or a vector of character strings, containing valid entries as expected by the package ‘units’ (Pebesma et al., 2016, https://github.com/r-quantities/units). Examples are ‘mm’, ‘m2’ or ‘m^2’, ‘m/s’.
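To picture what a per-trait unit assignment produces in the long table, consider the base R sketch below. It is an illustration under assumptions and not the package's interface: the trait names, values and the lookup vector trait_units are invented, and using NA for a qualitative trait is a choice made for this example only.

```r
# Toy long table mixing quantitative and qualitative traits (made-up values)
long <- data.frame(
  verbatimTraitName  = c("body_length", "body_mass", "diet"),
  verbatimTraitValue = c("10.5", "0.8", "herbivore"),
  stringsAsFactors = FALSE
)

# Hypothetical per-trait units; NA marks a qualitative trait without a unit
trait_units <- c(body_length = "mm", body_mass = "g", diet = NA)

# Look up the unit of each row by its trait name
long$verbatimTraitUnit <- unname(trait_units[long$verbatimTraitName])
long
```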

keep additional information

The raw data might contain further information on the individuals or the trait measurement itself in further data columns that are valuable for later analysis. This can be for instance data about the sex or developmental stage of the individual, the sampling or preservation method of the specimen, or the conditions under which the measurement was taken.

The parameter keep allows you to specify which columns contain valuable information as a character vector. As a negative version of keep, specifying drop would allow you to name the columns that are not valuable, while all others will be kept. Not specifying keep or drop will result in dropping all columns except the core measurement and identifier columns.

dataset2 <- as.traitdata(heteroptera_raw,
              traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
                         "Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
                         "Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
                         "Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
                         "Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
                         "Hind.Femur_width", "Rostrum_length", "Rostrum_width", 
                         "Wing_length", "Wing_width"),
              taxa = "SpeciesID",
              occurrences = "ID",
              keep = c("Sex")
              )
#> Input is taken to be an occurrence table/an observation -- trait matrix 
#> (i.e. with individual specimens per row and multiple trait measurements in columns). 
#> If this is not the case, please provide parameters!

head(dataset2) 
#>   verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1       Acalypta nigrina       Body_length               2.35                mm
#> 2       Acalypta nigrina       Body_length               2.10                mm
#> 3       Acalypta nigrina       Body_length               2.17                mm
#> 4       Acalypta nigrina       Body_length               2.15                mm
#> 5       Acalypta parvula       Body_length               1.84                mm
#> 6       Acalypta parvula       Body_length               1.81                mm
#>   measurementID occurrenceID Sex
#> 1             1            1   f
#> 2             2            2   f
#> 3             3            3   m
#> 4             4            4   m
#> 5             5            5   f
#> 6             6            6   f
#> 
#> This trait-dataset contains 23 traits for 179 taxa ( 9386 measurements in total).
#> 
#>  heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

The three extensions of the ETS provide standard terms for this kind of information:

The Taxon extension provides further terms for specifying the taxonomic resolution of the observation and to ensure the correct reference in case of synonyms and homonyms.
The Measurement Or Fact extension provides terms to describe information at the level of single measurements or reported facts, such as the original literature reference for the reported value, the method of measurement or statistical method of aggregation. It provides important information that allows for the tracking of potential sources of noise or bias in measured data (e.g. variation in measurement method) or aggregated values (e.g. statistical method), as well as the source of reported facts (e.g. literature source or expert reference).
The Occurrence extension contains vocabulary to describe information on the observation context of individual specimens, such as sex, life stage or age. This also includes the method of sampling and preservation, as well as the date and geographical location, which provide an important resource to analyze trait variation due to differences in space and time.

We highly recommend mapping the input columns into these standard terms by providing a named vector for keep that gives the target ETS terms as vector names.

dataset2 <- as.traitdata(heteroptera_raw,
              traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
                         "Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
                         "Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
                         "Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
                         "Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
                         "Hind.Femur_width", "Rostrum_length", "Rostrum_width", 
                         "Wing_length", "Wing_width"),
              taxa = "SpeciesID",
              occurrences = "ID",
              units = "mm",
              keep = c(order = "Order", family = "Family", 
                       sex = "Sex", lifeStage = "Wing_development", 
                       basisOfRecordDescription = "Source", 
                       verbatimLocality = "Center_Sampling_region", 
                       references = "Voucher_ID" )
)
#> Input is taken to be an occurrence table/an observation -- trait matrix 
#> (i.e. with individual specimens per row and multiple trait measurements in columns). 
#> If this is not the case, please provide parameters!

head(dataset2)
#>   verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1       Acalypta nigrina       Body_length               2.35                mm
#> 2       Acalypta nigrina       Body_length               2.10                mm
#> 3       Acalypta nigrina       Body_length               2.17                mm
#> 4       Acalypta nigrina       Body_length               2.15                mm
#> 5       Acalypta parvula       Body_length               1.84                mm
#> 6       Acalypta parvula       Body_length               1.81                mm
#>   measurementID occurrenceID     order   family
#> 1             1            1 Hemiptera Tingidae
#> 2             2            2 Hemiptera Tingidae
#> 3             3            3 Hemiptera Tingidae
#> 4             4            4 Hemiptera Tingidae
#> 5             5            5 Hemiptera Tingidae
#> 6             6            6 Hemiptera Tingidae
#>             basisOfRecordDescription
#> 1 Zoological State Collection Munich
#> 2 Zoological State Collection Munich
#> 3 Zoological State Collection Munich
#> 4 Zoological State Collection Munich
#> 5 Zoological State Collection Munich
#> 6 Zoological State Collection Munich
#>                                     references sex lifeStage  verbatimLocality
#> 1      Treuchtlingen_leg.Seidenstücker_21.5.48   f         m 48°57’ N 10°55’ E
#> 2            Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3 Gunzenshausen_Mfr._leg.Seidenstücker_15.4.58   m         m 49°07’ N 10°45’ E
#> 4            Nürnberg_leg.Seidenstücker_9.7.44   m         b 49°27’ N 11°05’ E
#> 5                Nürnberg_Seidenstücker_4.8.45   f         m 49°27’ N 11°05’ E
#> 6               Erlangen_Seidenstücker_30.9.40   f         b 49°36’ N 11°00’ E
#> 
#> This trait-dataset contains 23 traits for 179 taxa ( 9386 measurements in total).
#> 
#>  heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

Note that an element without a name in the named vector keeps its original column name. Note also that no checking for valid column names (as compared to the traitdata glossary) is performed at this stage. This ensures that the raw data table created by as.traitdata() can contain any columns that the author considers relevant. The keep parameter can also be used to rename columns into intuitive column names.
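The renaming rule for a partly named vector can be sketched in base R as follows. This is only an illustration of the rule stated above, not the package's code; the toy data frame and its values are made up.

```r
# Partly named vector: names are target terms, values are source columns
keep <- c(sex = "Sex", "Source", lifeStage = "Wing_development")

# Elements without a name fall back to the original column name
target <- ifelse(names(keep) == "", keep, names(keep))

# Toy raw data (made-up values) to apply the mapping to
raw <- data.frame(
  Sex              = c("f", "m"),
  Source           = c("museum", "field"),
  Wing_development = c("m", "b"),
  stringsAsFactors = FALSE
)

kept <- raw[, unname(keep), drop = FALSE]
names(kept) <- target
names(kept)
```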

derived trait-values

Many traits comprise compound measures of multiple traits, such as length-mass ratios or morphometric indices. Other traits must be refined in terms of factor levels, or reduced to binary trait values. Many of these tasks can be achieved on the matrix raw data using base functions like transform(), factor() or match() or the mutate() function provided by the package ‘plyr’ before conversion into the long-table format.
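On a wide table, such a derived trait is a one-liner with transform(). The sketch below uses a toy table with invented values and column names modelled on the Heteroptera example.

```r
# Toy wide table with one individual per row (values are made up)
raw <- data.frame(
  ID          = 1:2,
  Body_length = c(2.35, 2.10),
  Body_width  = c(1.15, 1.22)
)

# Add a compound trait as a new column, computed row by row
raw <- transform(raw, Body_shape = Body_length / Body_width)
raw
```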

However, if the data are converted to long-table format, these tasks may become tedious as they require splitting the data before the computation can be done. The function mutate.traitdata() performs these tasks (working as a wrapper to plyr::mutate()) while keeping an eye on the units.

dataset2 <- mutate.traitdata(dataset2, 
                            Body_shape = Body_length/Body_width, 
                            Body_volume = Body_length*Body_width*Body_height,
                            Wingload = Wing_length*Wing_width/Body_volume)

head(dataset2[dataset2$verbatimTraitName %in% c("Body_shape", "Body_volume", "Wingload"),])
#>      verbatimScientificName verbatimTraitName verbatimTraitValue
#> 9387       Acalypta nigrina        Body_shape           2.043478
#> 9388       Acalypta nigrina        Body_shape           1.721311
#> 9389       Acalypta nigrina        Body_shape           2.028037
#> 9390       Acalypta nigrina        Body_shape           2.067308
#> 9391       Acalypta parvula        Body_shape           2.243902
#> 9392       Acalypta parvula        Body_shape           1.885417
#>      verbatimTraitUnit measurementID occurrenceID     order   family
#> 9387                 1          <NA>            1 Hemiptera Tingidae
#> 9388                 1          <NA>            2 Hemiptera Tingidae
#> 9389                 1          <NA>            3 Hemiptera Tingidae
#> 9390                 1          <NA>            4 Hemiptera Tingidae
#> 9391                 1          <NA>            5 Hemiptera Tingidae
#> 9392                 1          <NA>            6 Hemiptera Tingidae
#>                basisOfRecordDescription
#> 9387 Zoological State Collection Munich
#> 9388 Zoological State Collection Munich
#> 9389 Zoological State Collection Munich
#> 9390 Zoological State Collection Munich
#> 9391 Zoological State Collection Munich
#> 9392 Zoological State Collection Munich
#>                                        references sex lifeStage
#> 9387      Treuchtlingen_leg.Seidenstücker_21.5.48   f         m
#> 9388            Nürnberg_leg.Seidenstücker_9.7.44   f         b
#> 9389 Gunzenshausen_Mfr._leg.Seidenstücker_15.4.58   m         m
#> 9390            Nürnberg_leg.Seidenstücker_9.7.44   m         b
#> 9391                Nürnberg_Seidenstücker_4.8.45   f         m
#> 9392               Erlangen_Seidenstücker_30.9.40   f         b
#>       verbatimLocality
#> 9387 48°57’ N 10°55’ E
#> 9388 49°27’ N 11°05’ E
#> 9389 49°07’ N 10°45’ E
#> 9390 49°27’ N 11°05’ E
#> 9391 49°27’ N 11°05’ E
#> 9392 49°36’ N 11°00’ E
#> 
#> This trait-dataset contains 26 traits for 179 taxa ( 9386 measurements in total).
#> 
#>  heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

Note that all existing traits remain untouched and additional trait measures will be added to the dataset, unless a definition replaces an already existing trait.

It is important to note that the mutate function works at the level of data resolution that is provided by the data, i.e. for occurrence data with multiple measurements on a single individual, the data columns are mutated per occurrenceID.
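What "mutated per occurrenceID" means can be shown with a base R sketch that computes a derived trait separately for each individual in a toy long table. This is a simplified illustration with invented values, not the internals of mutate.traitdata(), and it ignores units entirely.

```r
# Toy long table: two individuals, two traits each (values are made up)
long <- data.frame(
  occurrenceID       = c(1, 1, 2, 2),
  verbatimTraitName  = c("Body_length", "Body_width",
                         "Body_length", "Body_width"),
  verbatimTraitValue = c(2.35, 1.15, 2.10, 1.22),
  stringsAsFactors = FALSE
)

# Split by individual, compute the derived trait, return one new row each
derived <- do.call(rbind, lapply(split(long, long$occurrenceID), function(d) {
  values <- setNames(d$verbatimTraitValue, d$verbatimTraitName)
  data.frame(
    occurrenceID       = d$occurrenceID[1],
    verbatimTraitName  = "Body_shape",
    verbatimTraitValue = values[["Body_length"]] / values[["Body_width"]],
    stringsAsFactors = FALSE
  )
}))

# Append the derived rows; the existing measurements stay untouched
long <- rbind(long, derived)
rownames(long) <- NULL
long
```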

2. Standardize traits

The function as.traitdata() produced a tidy and correctly formatted version of your own trait data. We now turn to the challenging task of standardisation.

The field traitID is meant to contain a globally valid reference to a trait definition that applies to the measurement in question. Due to the heterogeneity of approaches, research questions and taxonomic focus in trait-based research, it is hard to come up with universal trait definitions that can be employed in each and every research context. The mode of measurement or the precise prescriptions of a sampling procedure have been formalized into published handbooks (e.g. Cornelissen et al., 2003; Perez-Harguindeguy et al., 2013; or for invertebrates, Moretti et al., 2017), but these are of limited use in harmonising trait data that pre-date or ignore this standard. Thesauri, e.g. the TOP Thesaurus of plant traits (Garnier et al., 2017, employed by TRY) or Gramene.org, offer definitions of plant traits in a formal language. For soil invertebrates, the T-SITA thesaurus offers a set of traits relevant for this organism group (see Schneider et al., 2018 for a more detailed distinction of thesauri and ontologies). All in all, Uniform Resource Identifiers (URIs) that provide a stable reference to an unambiguous definition and can be referenced from the dataset exist for only a few organism groups and trait methodologies.

Refer to trait definitions via URIs

Thus, the key information must be provided manually as a data object in R. However, traitdataform assists in creating your own reference list of traits, a so-called ‘thesaurus’, that will be used to feed trait definitions, units or identifiers into the dataset.

The function to create an object of class ‘thesaurus’ is as.thesaurus(), which combines several objects created by as.trait(). The ETS provides a set of terms to describe trait concepts, which can be provided as input parameters to as.trait(). Using the as.trait() function allows assigning flexible trait definitions while ensuring compliance with the terms of the traitdata standard outlined above. It also allows building a library of trait definitions where single traits can be reused in multiple projects.

as.trait("body_length",
         expectedUnit = "mm", valueType = "numeric",
         traitDescription = "The known longest dimension of the physical structure of organisms",
         identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length",
         author = "Maggenti and Maggenti, 2005",
         broaderTerm = c("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_dimension"),
         narrowerTerm = c("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Female_body_length",
                          "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Male_body_length")
         )
#> 
#>  body_length :
#> 
#>  Defined as: The known longest dimension of the physical structure of
#>          organisms
#>           
#>  Broader term: http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_dimension
#> 
#>  Narrower term: http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Female_body_length;
#>          http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Male_body_length
#> 
#>  Value type:  numeric 
#>  Expected unit:  mm 
#> 
#>   http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length

For example, if all of the traits reported in your dataset refer to a definition published under a publicly available identifier, the thesaurus could be created like this:

thesaurus1 <- as.thesaurus(
          body_length = as.trait("body_length",
                  expectedUnit = "mm", valueType = "numeric",
                  identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length"),
          antenna_length = as.trait("antenna_length",
                  expectedUnit = "mm", valueType = "numeric",
                  identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Antenna_length"),
          metafemur_length = as.trait("femur_length",
                  expectedUnit = "mm", valueType = "numeric",
                  identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
          eyewidth_corr = as.trait("eye_diameter",
                  expectedUnit = "mm", valueType = "numeric",
                  identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Eye_diameter")
        )

Alternatively, a thesaurus can be created from a data.frame, which might be easier if only trait name and identifier are to be provided and more specific trait definitions are not to be stored in the R object.

thesaurus1 <- as.thesaurus(data.frame(
                      trait = c("body_length",  "antenna_length", "metafemur_length", "eyewidth_corr"),
                      identifier = paste0("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=", 
                                          c("Body_length", "Antenna_length", "Femur_length", "Eye_diameter")), 
                      valueType = c("numeric"),
                      expectedUnit = "mm")
)

To transfer the user-provided traits and trait values into standardised values, the function standardize_traits() merges the data table with a reference table of trait definitions to produce values of a compliant format.

dataset1Std <- standardize_traits(dataset1, thesaurus1)
head(dataset1Std)
#>   traitID   traitName traitValue traitUnit verbatimScientificName
#> 1       2 body_length  15.846561        mm  Abax_parallelepipedus
#> 2       2 body_length   2.670000        mm   Acupalpus_meridianus
#> 3       2 body_length   5.873016        mm         Agonum_ericeti
#> 4       2 body_length   5.090000        mm     Agonum_fuliginosum
#> 5       2 body_length   4.880000        mm         Agonum_gracile
#> 6       2 body_length   8.250000        mm      Agonum_marginatum
#>   verbatimTraitName verbatimTraitValue verbatimTraitUnit measurementID
#> 1       body_length          15.846561                mm             1
#> 2       body_length           2.670000                mm             2
#> 3       body_length           5.873016                mm             3
#> 4       body_length           5.090000                mm             4
#> 5       body_length           4.880000                mm             5
#> 6       body_length           8.250000                mm             6
#>   measurementDeterminedBy measurementRemarks
#> 1                   klink               <NA>
#> 2                WOODCOCK               <NA>
#> 3                   klink               <NA>
#> 4                  ribera               <NA>
#> 5                  ribera deduced_from_genus
#> 6                  ribera               <NA>
#> 
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

The output table now contains the originally provided trait measurements (in verbatimTraitName, verbatimTraitValue and verbatimTraitUnit) alongside their standardised counterparts (in traitName, traitValue and traitUnit), expressed in the target terms and units requested by the thesaurus.
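Conceptually, this standardisation step is a join between the long table and a lookup table of trait definitions. The base R sketch below illustrates that idea only and is not the implementation of standardize_traits(): the measurement values are made up, the lookup table is a hand-written stand-in for a thesaurus, and unit conversion is skipped by assuming that the reported and expected units already agree.

```r
# Toy long table (values are made up)
long <- data.frame(
  verbatimScientificName = c("Species_a", "Species_b"),
  verbatimTraitName      = c("body_length", "eyewidth_corr"),
  verbatimTraitValue     = c(15.8, 0.9),
  verbatimTraitUnit      = "mm",
  stringsAsFactors = FALSE
)

# Hand-written stand-in for a thesaurus: one row per trait definition
lookup <- data.frame(
  verbatimTraitName = c("body_length", "eyewidth_corr"),
  traitName         = c("body_length", "eye_diameter"),
  traitID           = paste0("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=",
                             c("Body_length", "Eye_diameter")),
  expectedUnit      = "mm",
  stringsAsFactors = FALSE
)

# Join the definitions onto every measurement, keeping unmatched rows
std <- merge(long, lookup, by = "verbatimTraitName", all.x = TRUE)

# Assumption: reported and expected units agree, so values carry over as-is
std$traitValue <- std$verbatimTraitValue
std$traitUnit  <- std$expectedUnit
std[, c("traitID", "traitName", "traitValue", "traitUnit", "verbatimTraitName")]
```

Keeping the verbatim columns next to the standardised ones means the original author's entries can always be checked against the result.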

refer to own trait definitions

If no published trait concept can be referenced, trait-datasets should be accompanied by a dataset-specific thesaurus. Ideally this is stored as an asset along with your trait dataset in the same data publication or in a separate publication. This can be a csv or txt file, or a website providing direct and stable links to each trait definition.

This reference file should contain at least the following fields for each trait concept:

trait should be a short descriptive name. No spaces should be used. Rather use a scheme with underscores or capital letters to highlight multiple words (e.g. ‘body_length’ or ‘bodyLength’).
traitDescription: a detailed and unambiguous, human readable definition.
valueType to specify the expected kind of entries. Set it to ‘numeric’ for quantitative traits, ‘integer’ for counts or ordinal traits, ‘character’ for trait values that are provided as free text, ‘factor’ for traits that take one of few non-ordinal levels, ‘logical’ for binary/boolean entries (yes/no).
- For numeric traits, the parameter expectedUnit should provide the expected unit for the trait. The R script will then try to convert trait values into this unit.
- For categorical traits of kind ‘factor’ or ‘integer’, the field factorLevels should contain a list of the valid factor levels, separated by semicolons. For ordinal traits, the order of the levels must correspond exactly to the sequence of possible integer values.
comments may contain examples and clarifications
optionally, identifier may specify an alphanumeric ID for the specific use in your dataset, but this function is also covered by having defined unambiguous trait labels in field trait which recur in field verbatimTraitName of the main dataset.

Refer to the ETS set of terms for describing trait concepts for further definitions of these terms, as well as to the best practice guidelines for trait-data publications.
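As a rough illustration of such a reference file, the sketch below builds a small thesaurus table with the fields listed above, runs a few consistency checks and stores it as a csv file. The two trait entries, their descriptions and the object names are hypothetical; only base R is used, and the package itself is not needed for this step.

```r
# Hypothetical dataset-specific thesaurus with the fields listed above.
thesaurus_tbl <- data.frame(
  trait            = c("body_length", "wing_development"),
  traitDescription = c("Distance from the tip of the head to the end of the abdomen",
                       "Degree of wing development of the adult"),
  valueType        = c("numeric", "factor"),
  expectedUnit     = c("mm", NA),
  factorLevels     = c(NA, "macropterous;brachypterous;apterous"),
  comments         = c("measured on dry specimens", NA),
  identifier       = c("t1", "t2"),
  stringsAsFactors = FALSE
)

# Basic consistency checks before publishing the file.
stopifnot(
  !anyNA(thesaurus_tbl$trait),
  !any(grepl("[[:space:]]", thesaurus_tbl$trait)),  # no spaces in trait names
  !anyDuplicated(thesaurus_tbl$trait),              # trait labels are unique
  !anyDuplicated(thesaurus_tbl$identifier),         # identifiers are unique
  all(thesaurus_tbl$valueType %in%
        c("numeric", "integer", "character", "factor", "logical"))
)

# Split the semicolon-separated factor levels of one trait.
wing_levels <- strsplit(
  thesaurus_tbl$factorLevels[thesaurus_tbl$trait == "wing_development"], ";"
)[[1]]

# Store the thesaurus as a csv asset next to the data and read it back.
thesaurus_file <- tempfile(fileext = ".csv")
write.csv(thesaurus_tbl, thesaurus_file, row.names = FALSE)
thesaurus_read <- read.csv(thesaurus_file, stringsAsFactors = FALSE)
```

Checking for duplicated labels and identifiers at this stage is cheap and avoids ambiguous trait references later on.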

# M. Gossner, Martin; K. Simons, Nadja; Hoeck, Leonhard; W. Weisser, Wolfgang
# (2016): Morphometric measures of Heteroptera sampled in grasslands across
# three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1 
# following the definitions in data publication 
# http://www.esapubs.org/archive/ecol/E096/102/metadata.php

thesaurus2 <-  as.thesaurus(
    Body_length = as.trait("Body_length", identifier = "t1",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "From the tip of the head to the end of the abdomen"),
    Body_width = as.trait("Body_width", identifier = "t2",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the body"),
    Body_height = as.trait("Body_height",identifier = "t3",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Thickest part of the body"),
    Thorax_length = as.trait("Thorax_length", identifier = "t4",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Longest part of the pronotum"),
    Thorax_width = as.trait("Thorax_width", identifier = "t5",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the pronotum"),
    Head_width = as.trait("Head_width", identifier = "t6",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the head including eyes"),
    Eye_width = as.trait("Eye_width", identifier = "t7",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the left eye"),
    Antenna_Seg1 = as.trait("Antenna_Seg1", identifier = "t8",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of first antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg2 = as.trait("Antenna_Seg2", identifier = "t9",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of second antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg3 = as.trait("Antenna_Seg3", identifier = "t10",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of third antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg4 = as.trait("Antenna_Seg4", identifier = "t11",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of fourth antenna segment",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Antenna_Seg5 = as.trait("Antenna_Seg5", identifier = "t12",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of fifth antenna segment (only Pentatomoidea)",
                            broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
    Front.Tibia_length = as.trait("Front.Tibia_length", identifier = "t13",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the tibia of the foreleg"),
    Mid.Tibia_length = as.trait("Mid.Tibia_length", identifier = "t14",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the tibia of the mid leg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Tibia_length"),
    Hind.Tibia_length = as.trait("Hind.Tibia_length", identifier = "t15",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the tibia of the hind leg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Tibia_length"),
    Front.Femur_length = as.trait("Front.Femur_length", identifier = "t16",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the femur of the foreleg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
    Hind.Femur_length = as.trait("Hind.Femur_length", identifier = "t17",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the femur of the hind leg",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
    Front.Femur_width = as.trait("Front.Femur_width", identifier = "t18",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Width of the femur of the foreleg"),
    Hind.Femur_width = as.trait("Hind.Femur_width", identifier = "t19",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Width of the femur of the hind leg"),
    Rostrum_length = as.trait("Rostrum_length", identifier = "t20",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Length of the rostrum including all segments"),
    Rostrum_width = as.trait("Rostrum_width", identifier = "t21",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the rostrum"),
    Wing_length = as.trait("Wing_length", identifier = "t22",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Longest part of the forewing",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Wing"),
    Wing_width = as.trait("Wing_width", identifier = "t23",
                            expectedUnit = "mm", valueType = "numeric",
                            traitDescription = "Widest part of the forewing",
                            broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Wing")
  )

Applying standardize_traits() will refer to this dataset-specific thesaurus and append it as an attribute to the R object.

dataset2Std <- standardize_traits(dataset2, thesaurus2)
subset(dataset2Std, occurrenceID == 2)
#>      traitID          traitName traitValue traitUnit verbatimScientificName
#> 185        1        Body_length       2.10        mm       Acalypta nigrina
#> 610        2         Body_width       1.22        mm       Acalypta nigrina
#> 871        3        Body_height       0.67        mm       Acalypta nigrina
#> 1337       4      Thorax_length       0.24        mm       Acalypta nigrina
#> 1721       5       Thorax_width       0.95        mm       Acalypta nigrina
#> 2309       6         Head_width       0.45        mm       Acalypta nigrina
#> 2734       7          Eye_width       0.13        mm       Acalypta nigrina
#> 3364       8       Antenna_Seg1       0.12        mm       Acalypta nigrina
#> 3422       9       Antenna_Seg2       0.04        mm       Acalypta nigrina
#> 3888      10       Antenna_Seg3       0.44        mm       Acalypta nigrina
#> 4271      11       Antenna_Seg4       0.15        mm       Acalypta nigrina
#> 4895      13 Front.Tibia_length       0.40        mm       Acalypta nigrina
#> 5524      14   Mid.Tibia_length       0.42        mm       Acalypta nigrina
#> 5563      15  Hind.Tibia_length       0.56        mm       Acalypta nigrina
#> 6048      16 Front.Femur_length       0.48        mm       Acalypta nigrina
#> 6412      17  Hind.Femur_length       0.56        mm       Acalypta nigrina
#> 7021      18  Front.Femur_width       0.12        mm       Acalypta nigrina
#> 7569      19   Hind.Femur_width       0.10        mm       Acalypta nigrina
#> 7870      20     Rostrum_length       0.92        mm       Acalypta nigrina
#> 8118      21      Rostrum_width       0.08        mm       Acalypta nigrina
#> 8598      22        Wing_length       1.66        mm       Acalypta nigrina
#> 8968      23         Wing_width       0.57        mm       Acalypta nigrina
#>       verbatimTraitName verbatimTraitValue verbatimTraitUnit measurementID
#> 185         Body_length               2.10                mm             2
#> 610          Body_width               1.22                mm           427
#> 871         Body_height               0.67                mm           852
#> 1337      Thorax_length               0.24                mm          1277
#> 1721       Thorax_width               0.95                mm          1702
#> 2309         Head_width               0.45                mm          2127
#> 2734          Eye_width               0.13                mm          2552
#> 3364       Antenna_Seg1               0.12                mm          2977
#> 3422       Antenna_Seg2               0.04                mm          3402
#> 3888       Antenna_Seg3               0.44                mm          3827
#> 4271       Antenna_Seg4               0.15                mm          4252
#> 4895 Front.Tibia_length               0.40                mm          4713
#> 5524   Mid.Tibia_length               0.42                mm          5138
#> 5563  Hind.Tibia_length               0.56                mm          5563
#> 6048 Front.Femur_length               0.48                mm          5988
#> 6412  Hind.Femur_length               0.56                mm          6413
#> 7021  Front.Femur_width               0.12                mm          6838
#> 7569   Hind.Femur_width               0.10                mm          7263
#> 7870     Rostrum_length               0.92                mm          7688
#> 8118      Rostrum_width               0.08                mm          8113
#> 8598        Wing_length               1.66                mm          8538
#> 8968         Wing_width               0.57                mm          8963
#>      occurrenceID     order   family           basisOfRecordDescription
#> 185             2 Hemiptera Tingidae Zoological State Collection Munich
#> 610             2 Hemiptera Tingidae Zoological State Collection Munich
#> 871             2 Hemiptera Tingidae Zoological State Collection Munich
#> 1337            2 Hemiptera Tingidae Zoological State Collection Munich
#> 1721            2 Hemiptera Tingidae Zoological State Collection Munich
#> 2309            2 Hemiptera Tingidae Zoological State Collection Munich
#> 2734            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3364            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3422            2 Hemiptera Tingidae Zoological State Collection Munich
#> 3888            2 Hemiptera Tingidae Zoological State Collection Munich
#> 4271            2 Hemiptera Tingidae Zoological State Collection Munich
#> 4895            2 Hemiptera Tingidae Zoological State Collection Munich
#> 5524            2 Hemiptera Tingidae Zoological State Collection Munich
#> 5563            2 Hemiptera Tingidae Zoological State Collection Munich
#> 6048            2 Hemiptera Tingidae Zoological State Collection Munich
#> 6412            2 Hemiptera Tingidae Zoological State Collection Munich
#> 7021            2 Hemiptera Tingidae Zoological State Collection Munich
#> 7569            2 Hemiptera Tingidae Zoological State Collection Munich
#> 7870            2 Hemiptera Tingidae Zoological State Collection Munich
#> 8118            2 Hemiptera Tingidae Zoological State Collection Munich
#> 8598            2 Hemiptera Tingidae Zoological State Collection Munich
#> 8968            2 Hemiptera Tingidae Zoological State Collection Munich
#>                             references sex lifeStage  verbatimLocality
#> 185  Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 610  Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 871  Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 1337 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 1721 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 2309 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 2734 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3364 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3422 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 3888 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 4271 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 4895 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 5524 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 5563 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 6048 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 6412 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 7021 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 7569 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 7870 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 8118 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 8598 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 8968 Nürnberg_leg.Seidenstücker_9.7.44   f         b 49°27’ N 11°05’ E
#> 
#> This trait-dataset contains 26 traits for 179 taxa ( 9386 measurements in total).
#> NULL

attributes(dataset2Std)$traits[,c("trait", "identifier","traitDescription","expectedUnit")]
#>                                 trait identifier
#> Body_length               Body_length          1
#> Body_width                 Body_width          2
#> Body_height               Body_height          3
#> Thorax_length           Thorax_length          4
#> Thorax_width             Thorax_width          5
#> Head_width                 Head_width          6
#> Eye_width                   Eye_width          7
#> Antenna_Seg1             Antenna_Seg1          8
#> Antenna_Seg2             Antenna_Seg2          9
#> Antenna_Seg3             Antenna_Seg3         10
#> Antenna_Seg4             Antenna_Seg4         11
#> Antenna_Seg5             Antenna_Seg5         12
#> Front.Tibia_length Front.Tibia_length         13
#> Mid.Tibia_length     Mid.Tibia_length         14
#> Hind.Tibia_length   Hind.Tibia_length         15
#> Front.Femur_length Front.Femur_length         16
#> Hind.Femur_length   Hind.Femur_length         17
#> Front.Femur_width   Front.Femur_width         18
#> Hind.Femur_width     Hind.Femur_width         19
#> Rostrum_length         Rostrum_length         20
#> Rostrum_width           Rostrum_width         21
#> Wing_length               Wing_length         22
#> Wing_width                 Wing_width         23
#>                                                        traitDescription
#> Body_length          From the tip of the head to the end of the abdomen
#> Body_width                                      Widest part of the body
#> Body_height                                   Thickest part of the body
#> Thorax_length                              Longest part of the pronotum
#> Thorax_width                                Widest part of the pronotum
#> Head_width                       Widest part of the head including eyes
#> Eye_width                                   Widest part of the left eye
#> Antenna_Seg1                            Length of first antenna segment
#> Antenna_Seg2                           Length of second antenna segment
#> Antenna_Seg3                            Length of third antenna segment
#> Antenna_Seg4                           Length of fourth antenna segment
#> Antenna_Seg5       Length of fifth antenna segment (only Pentatomoidea)
#> Front.Tibia_length                   Length of the tibia of the foreleg
#> Mid.Tibia_length                     Length of the tibia of the mid leg
#> Hind.Tibia_length                   Length of the tibia of the hind leg
#> Front.Femur_length                   Length of the femur of the foreleg
#> Hind.Femur_length                   Length of the femur of the hind leg
#> Front.Femur_width                     Width of the femur of the foreleg
#> Hind.Femur_width                     Width of the femur of the hind leg
#> Rostrum_length             Length of the rostrum including all segments
#> Rostrum_width                                Widest part of the rostrum
#> Wing_length                                Longest part of the forewing
#> Wing_width                                  Widest part of the forewing
#>                    expectedUnit
#> Body_length                  mm
#> Body_width                   mm
#> Body_height                  mm
#> Thorax_length                mm
#> Thorax_width                 mm
#> Head_width                   mm
#> Eye_width                    mm
#> Antenna_Seg1                 mm
#> Antenna_Seg2                 mm
#> Antenna_Seg3                 mm
#> Antenna_Seg4                 mm
#> Antenna_Seg5                 mm
#> Front.Tibia_length           mm
#> Mid.Tibia_length             mm
#> Hind.Tibia_length            mm
#> Front.Femur_length           mm
#> Hind.Femur_length            mm
#> Front.Femur_width            mm
#> Hind.Femur_width             mm
#> Rostrum_length               mm
#> Rostrum_width                mm
#> Wing_length                  mm
#> Wing_width                   mm

3. Standardize taxa

For taxon name standardisation, the function standardize_taxa() makes use of fuzzy matching algorithms provided by the package ‘taxize’ by Scott Chamberlain to match the entries of column verbatimScientificName against the GBIF Backbone Taxonomy. The result is written into a new column scientificName. Additional columns comprise the order (for ambiguous names), the reported taxon rank, as well as a globally unique taxon ID which references the taxon to GBIF Backbone Taxonomy in a universal URI format.

If further layers of taxonomic information are desired as an output, the function takes the parameter return, which by default contains c("taxonID", "scientificName", "order", "taxonRank"). Other specifications can be added here.

Note that for this to work, verbatimScientificName must contain the full name of the species or higher taxon, without abbreviations (spaces and underscores are handled correctly). Note also that taxon name mapping requires an internet connection and might take some time, depending on the length of your species list.
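Before sending a long species list to the name resolver, it can help to screen the names offline. The following base-R sketch is only a suggestion: the example names and the pattern used to flag abbreviated genus names are assumptions, and the step is not part of standardize_taxa().

```r
# Hypothetical verbatim names as they might appear in a raw file.
verbatim <- c("Abax_parallelepipedus", "  Agonum ericeti ",
              "A. gracile", "Carabidae")

# Normalise separators and whitespace for visual inspection.
cleaned <- trimws(gsub("[_[:space:]]+", " ", verbatim))

# Flag abbreviated genus names such as "A. gracile": one or two letters
# followed by a full stop at the start of the name.
abbreviated <- grepl("^[[:alpha:]]{1,2}\\.", cleaned)

screening <- data.frame(verbatim, cleaned, abbreviated,
                        stringsAsFactors = FALSE)
print(screening)
```

Names flagged here would need to be expanded by hand in the raw data or via a lookup table before the mapping step.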

dataset1Std <- standardize_taxa(dataset1)
head(dataset1Std)
#>          scientificName verbatimScientificName verbatimTraitName
#> 1 Abax parallelepipedus  Abax_parallelepipedus       body_length
#> 2 Abax parallelepipedus  Abax_parallelepipedus    antenna_length
#> 3 Abax parallelepipedus  Abax_parallelepipedus     eyewidth_corr
#> 4 Abax parallelepipedus  Abax_parallelepipedus  metafemur_length
#> 5  Acupalpus meridianus   Acupalpus_meridianus    antenna_length
#> 6  Acupalpus meridianus   Acupalpus_meridianus  metafemur_length
#>   verbatimTraitValue verbatimTraitUnit                             taxonID
#> 1          15.846561                mm http://www.gbif.org/species/5754772
#> 2           8.518519                mm http://www.gbif.org/species/5754772
#> 3           0.481250                mm http://www.gbif.org/species/5754772
#> 4           5.608466                mm http://www.gbif.org/species/5754772
#> 5           0.700000                mm http://www.gbif.org/species/1037633
#> 6           0.750000                mm http://www.gbif.org/species/1037633
#>   measurementID warnings taxonRank  kingdom     phylum   class      order
#> 1             1            species Animalia Arthropoda Insecta Coleoptera
#> 2           121            species Animalia Arthropoda Insecta Coleoptera
#> 3           361            species Animalia Arthropoda Insecta Coleoptera
#> 4           241            species Animalia Arthropoda Insecta Coleoptera
#> 5           122            species Animalia Arthropoda Insecta Coleoptera
#> 6           242            species Animalia Arthropoda Insecta Coleoptera
#>      family measurementDeterminedBy measurementRemarks
#> 1 Carabidae                   klink               <NA>
#> 2 Carabidae                   klink               <NA>
#> 3 Carabidae                   klink               <NA>
#> 4 Carabidae                   klink               <NA>
#> 5 Carabidae                WOODCOCK               <NA>
#> 6 Carabidae                WOODCOCK               <NA>
#> 
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

Single-stroke standardization

The functions standardize_traits() and standardize_taxa() are applied sequentially but not necessarily in that order. The output of the first step can be piped into the second step.

To make things even simpler, the functions for format conversion and standardization come with a wrapper function standardize(). If all parameters required by the intermediate steps are provided, a single call runs the whole sequence, taking all the optional parameters described above.

dataset1Std <- standardize(carabids,
            thesaurus = thesaurus1,
            taxa = "name_correct",
            units = "mm",
            keep = c(measurementDeterminedBy = "source_measurement")
            )

As an alternative input pathway, all parameters to standardize() can be specified as attributes of the input object, where the function will find them automatically. This allows for the specification of recipes for data integration in projects that pull data from multiple sources.

4. Working with trait-datasets

combine multiple traitdata tables

After standardizing trait and taxon concepts into unified definitions and converting trait values into harmonized units, it is straightforward to combine multiple trait-datasets into one using rbind(). This can be applied before or after the standardisation process, depending on the use case. Use cases of merging data are:

You collected data from different sources and want to harmonize taxon and trait names: bring the data into long-table format and merge them into one data object, then harmonize taxa and units following a uniform standard.
No unified trait list or taxon reference exists for the heterogeneous data assembled from different sources (e.g. because they span many different taxa): apply standardization to the different reference systems before merging the datasets.

The function call will append the data tables while merging the common columns and maintaining columns that are not present in all datasets (this might produce lots of NA). The column datasetID will be added to keep track of the origin of the data. By default this column will contain the object names of the original datasets, but it can be replaced by more meaningful IDs using the parameter datasetID.

newdata <- rbind(dataset1Std, dataset2Std, 
                datasetID = c("vanderplas17", "gossner15")
              )

Note that the package provides a method for the base function rbind() that handles this merge. Documentation can be accessed via ?rbind.traitdata.
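To illustrate conceptually what such a merge does, the sketch below stacks two toy tables with partly different columns, fills the gaps with NA and records the origin of each row. The helper `stack_with_id()` is hypothetical and written in base R; it is not the package's rbind.traitdata method, which additionally takes care of the metadata.

```r
# Illustrative helper (not the package's method): stack data frames,
# keep the union of columns, fill gaps with NA and record the origin.
stack_with_id <- function(..., datasetID) {
  tables <- list(...)
  stopifnot(length(tables) == length(datasetID))
  all_cols <- unique(unlist(lapply(tables, names)))
  padded <- Map(function(tbl, id) {
    for (col in setdiff(all_cols, names(tbl))) {
      tbl[[col]] <- NA               # add columns absent from this table
    }
    tbl <- tbl[all_cols]             # same column order everywhere
    tbl$datasetID <- id              # keep track of the source dataset
    tbl
  }, tables, datasetID)
  do.call(rbind, unname(padded))
}

# Two toy tables; only the second one reports the sex of the specimen.
a <- data.frame(scientificName = "Abax parallelepipedus",
                traitName = "body_length", traitValue = 15.846561,
                stringsAsFactors = FALSE)
b <- data.frame(scientificName = "Acalypta nigrina",
                traitName = "Body_length", traitValue = 2.10,
                sex = "f", stringsAsFactors = FALSE)

merged <- stack_with_id(a, b, datasetID = c("carabids", "heteroptera"))
print(merged)
```

The column sex is NA for the first dataset because it was never recorded there, which is the kind of gap mentioned above.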

maintaining metadata

The function will handle dataset-level metadata as described in the section ‘Metadata’ of the Ecological Trait-data Standard (e.g. author or bibliographicCitation). It adds a column datasetID, as well as datasetName and author, if those are provided in the parameter metadata of the as.traitdata() call that creates the data. The function as.metadata() provides a standard structure for this case.

metadata1 <- as.metadata(
      datasetName = "Carabid traits",
      datasetID = "carabids",
      bibliographicCitation =  bibentry(
        bibtype = "Article",
        title = "Sensitivity of functional diversity metrics to sampling intensity",
        journal = "Methods in Ecology and Evolution",
        author = c(as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
        ),
        year = 2017,
        doi = "10.1111/2041-210x.12728"
      ),
      author = "Fons van der Plas",
      license = "http://creativecommons.org/publicdomain/zero/1.0/"
       )

dataset1 <- as.traitdata(carabids,
  taxa = "name_correct",
  thesaurus = thesaurus1,
  units = "mm",
  keep = c(datasetID = "source_measurement", measurementRemark = "note"),
  metadata = metadata1
)
#> Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!

head(dataset1)
#>   verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1  Abax_parallelepipedus    antenna_length           8.518519                mm
#> 2   Acupalpus_meridianus    antenna_length           0.700000                mm
#> 3         Agonum_ericeti    antenna_length           3.743386                mm
#> 4     Agonum_fuliginosum    antenna_length           3.500000                mm
#> 5         Agonum_gracile    antenna_length           3.220000                mm
#> 6      Agonum_marginatum    antenna_length           5.030000                mm
#>   measurementID datasetID  measurementRemark
#> 1             1     klink               <NA>
#> 2             2  WOODCOCK               <NA>
#> 3             3     klink               <NA>
#> 4             4    ribera               <NA>
#> 5             5    ribera deduced_from_genus
#> 6             6    ribera               <NA>
#> 
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

Note the use of the bibentry() function to create a formal bibliographic entry. This also affects how the dataset is printed in the R console. It makes it easier for data users to acknowledge authorship and ownership of the data, while also providing a machine-readable structure that can easily be accessed further down the line.
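As a minimal sketch of what machine-readable access can look like, the following base-R example stores a bibentry() in a metadata list attached to a toy data frame and retrieves single fields from it. The object `toy` and the plain list used as metadata are stand-ins chosen for illustration; a real trait-dataset would carry the structure created by as.metadata().

```r
library(utils)  # bibentry() and as.person() are provided by 'utils'

citation_entry <- bibentry(
  bibtype = "Article",
  title   = "Sensitivity of functional diversity metrics to sampling intensity",
  journal = "Methods in Ecology and Evolution",
  author  = as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer"),
  year    = 2017,
  doi     = "10.1111/2041-210x.12728"
)

# Toy table standing in for a trait-dataset; metadata is a plain list here.
toy <- data.frame(verbatimTraitName = "antenna_length",
                  verbatimTraitValue = 8.518519)
attr(toy, "metadata") <- list(
  datasetID = "carabids",
  license   = "http://creativecommons.org/publicdomain/zero/1.0/",
  bibliographicCitation = citation_entry
)

# Retrieve single fields for further processing.
meta <- attributes(toy)$metadata
doi_url <- paste0("https://doi.org/", meta$bibliographicCitation$doi)
citation_text <- format(meta$bibliographicCitation, style = "text")

print(doi_url)
print(citation_text)
```

Because the citation is a structured object rather than free text, fields such as the DOI can be extracted without parsing a formatted reference string.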

metadata2 <- as.metadata(
  datasetName = "Heteroptera morphometry traits",
  datasetID = "heteroptera",
  bibliographicCitation =  bibentry(
    bibtype = "Article",
    title = "Morphometric measures of Heteroptera sampled in grasslands across three regions of Germany",
    journal = "Ecology",
    volume = 96,
    issue = 4,
    pages = 1154,
    author = c(as.person("Martin M. Gossner , Nadja K. Simons, Leonhard Hoeck, Wolfgang W. Weisser")),
    year = 2015,
    doi = "10.1890/14-2159.1"
  ),
  author = "Martin M. Gossner",
  license = "http://creativecommons.org/publicdomain/zero/1.0/"
)

dataset2 <- as.traitdata(heteroptera_raw,
  taxa = "SpeciesID",
  thesaurus = thesaurus2,
  units = "mm",
  keep = c(sex = "Sex", references = "Source", lifestage = "Wing_development"),
  metadata =  metadata2
)
#> Input is taken to be an occurrence table/an observation -- trait matrix 
#> (i.e. with individual specimens per row and multiple trait measurements in columns). 
#> If this is not the case, please provide parameters!

database <- rbind(dataset1, dataset2, 
                datasetID = c("vanderplas17", "gossner15"), 
                metadata_as_columns = TRUE
                ) 

head(database)
#>   verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1  Abax_parallelepipedus    antenna_length           8.518519                mm
#> 2   Acupalpus_meridianus    antenna_length           0.700000                mm
#> 3         Agonum_ericeti    antenna_length           3.743386                mm
#> 4     Agonum_fuliginosum    antenna_length           3.500000                mm
#> 5         Agonum_gracile    antenna_length           3.220000                mm
#> 6      Agonum_marginatum    antenna_length           5.030000                mm
#>   measurementID occurrenceID references  sex datasetID    datasetName
#> 1             1         <NA>       <NA> <NA>  carabids Carabid traits
#> 2             2         <NA>       <NA> <NA>  carabids Carabid traits
#> 3             3         <NA>       <NA> <NA>  carabids Carabid traits
#> 4             4         <NA>       <NA> <NA>  carabids Carabid traits
#> 5             5         <NA>       <NA> <NA>  carabids Carabid traits
#> 6             6         <NA>       <NA> <NA>  carabids Carabid traits
#>              author                                           license
#> 1 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 2 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 3 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 4 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 5 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 6 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#>    measurementRemark lifestage
#> 1               <NA>      <NA>
#> 2               <NA>      <NA>
#> 3               <NA>      <NA>
#> 4               <NA>      <NA>
#> 5 deduced_from_genus      <NA>
#> 6               <NA>      <NA>
#> 
#> This trait-dataset contains 27 traits for 299 taxa ( 9386 measurements in total).
#> $carabids
#> 
#>  carabids : Carabid traits by Fons van der Plas .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
#> 
#> 
#> $heteroptera
#> 
#>  heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#> 
#>     When using these data, you must acknowledge the following usage policies: 
#> 
#>     Cite this trait dataset as: 
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#> 
#>     Published under: http://creativecommons.org/publicdomain/zero/1.0/ 
#> 
#>     This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"

The detailed metadata of both datasets (e.g. license and bibliographic citation) are stored in the attributes of the dataset and displayed when the dataset is printed in the R console. You can access the metadata via the attributes() function, e.g.

attributes(dataset1)$metadata$bibliographicCitation
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
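
The mechanism behind this is plain R object attributes. As a minimal sketch in base R, using a made-up toy data frame and a hypothetical metadata list (not the structure created by as.metadata()), information attached this way travels with the object and can be retrieved by name:

```r
# Toy trait table; names and values are made up for illustration.
toy <- data.frame(
  scientificName = c("Agonum_ericeti", "Agonum_gracile"),
  traitName      = "antenna_length",
  traitValue     = c(3.74, 3.22)
)

# Attach a plain list as the 'metadata' attribute (hypothetical content).
attr(toy, "metadata") <- list(
  datasetID = "toy",
  license   = "http://creativecommons.org/publicdomain/zero/1.0/"
)

# Retrieve single entries, either via attributes() or via attr().
toy_license <- attributes(toy)$metadata$license
toy_id      <- attr(toy, "metadata")$datasetID
```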

writing data recipes

For projects compiling data from multiple sources, it is best practice to work from the original raw data, ideally by pulling them directly from their original repository, and to script all changes and standardisation steps in R. If many changes to individual fields are necessary, you can store them in lookup tables to keep the script slim.
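
As a minimal sketch of the lookup-table approach, assuming a hypothetical raw table with misspelled taxon names and a hypothetical two-column correction table, the corrections can be applied with match() instead of a long series of individual replacements:

```r
# Hypothetical raw data containing two misspelled taxon names.
raw <- data.frame(
  name  = c("Agonum ericetti", "Agonum gracile", "Acupalpus meridanus"),
  value = c(3.74, 3.22, 0.70),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per correction.
lookup <- data.frame(
  name_raw     = c("Agonum ericetti", "Acupalpus meridanus"),
  name_correct = c("Agonum ericeti", "Acupalpus meridianus"),
  stringsAsFactors = FALSE
)

# Find each raw name in the lookup table; NA means no correction is listed.
idx <- match(raw$name, lookup$name_raw)

# Use the corrected name where one exists, otherwise keep the original.
raw$name_correct <- ifelse(is.na(idx), raw$name, lookup$name_correct[idx])
```

In a real project the lookup table would typically live in its own csv file next to the script, so that corrections can be reviewed and extended without touching the code.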

traitdataform allows you to store all parameters required for the standardization call in the attributes of the R object. A script for a single data source can then look like this:

carabids <- utils::read.delim(
  url("https://datadryad.org/stash/downloads/file_stream/24267", encoding = "UTF-8")
)

attr(carabids, 'metadata') <- traitdataform::as.metadata(
      datasetName = "Carabid traits",
      datasetID = "carabids",
      bibliographicCitation =  utils::bibentry(
        bibtype = "Article",
        title = "Sensitivity of functional diversity metrics to sampling intensity",
        journal = "Methods in Ecology and Evolution",
        author = c(utils::as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
        ),
        year = 2017,
        doi = "10.1111/2041-210x.12728"
      ),
      author = "Fons van der Plas",
      license = "http://creativecommons.org/publicdomain/zero/1.0/"
       )

attr(carabids, 'thesaurus') <- traitdataform::as.thesaurus(
          body_length = traitdataform::as.trait("body_length",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length"),
          antenna_length = traitdataform::as.trait("antenna_length",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Antenna_length"),
          metafemur_length = traitdataform::as.trait("femur_length",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
          eyewidth_corr = traitdataform::as.trait("eye_diameter",
                              expectedUnit = "mm", valueType = "numeric",
                              identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Eye_diameter")
        )

attr(carabids, 'taxa') <- "name_correct"
attr(carabids, 'units') <- "mm"
attr(carabids, 'keep') <-  c(measurementDeterminedBy = "source_measurement", measurementRemarks = "note")

When thus specified, the data can be re-formatted simply by calling standardize(carabids).

5. Writing data

The final step in converting trait data into a standardised format, before uploading the file to a public file hosting service, is to save them in a file format that is internationalized, portable and long-term accessible. Internationalization refers to the file encoding (‘UTF-8’ should be used; ‘ASCII’ is possible for data with no special characters), to the decimal delimiter (using ‘.’ is highly recommended) and to internationally accepted formatting standards for values such as dates (the international norm for date entries is ISO 8601, i.e. “YYYY-MM-DD”). Portability means that the file can be opened on all operating systems (the ‘end of line’ character is particularly important here) and does not rely on proprietary software (like MS Excel or database tools). Long-term accessibility is ensured by choosing a text-based file format (txt, csv or tsv) and by packaging the primary data with all necessary metadata.
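
A minimal sketch of these conventions in base R, using made-up values: dates are parsed from a local format and written as ISO 8601, decimal commas are converted to points, and the character encoding is checked. All object names are hypothetical.

```r
# Made-up records with German-style dates and decimal commas.
dates_local  <- c("24.12.2016", "03.01.2017")
values_local <- c("3,74", "0,70")

# Parse the local date format and write it out as ISO 8601 (YYYY-MM-DD).
dates_iso <- format(as.Date(dates_local, format = "%d.%m.%Y"), "%Y-%m-%d")

# Replace the decimal comma by a point before converting to numeric.
values_num <- as.numeric(sub(",", ".", values_local, fixed = TRUE))

# Convert a string with special characters to UTF-8 and check its validity.
remark  <- enc2utf8("K\u00f6rperl\u00e4nge")
is_utf8 <- validUTF8(remark)
```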

The base R function write.table() gives full control over these parameters and should be used to export trait-data.

write.table(dataset1Std, file = "carabids_std.csv", 
            sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE,
            fileEncoding = "UTF-8")
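
To confirm that an exported file can be read back without loss, a round trip is a quick check. The following sketch uses a made-up toy table and a temporary file rather than dataset1Std; the object names are hypothetical and the argument values are chosen for illustration:

```r
# Toy table standing in for a standardised trait dataset (made-up values).
toy <- data.frame(
  scientificName = c("Agonum ericeti", "Agonum gracile"),
  traitName      = c("antenna_length", "antenna_length"),
  traitValue     = c(3.74, 3.22),
  stringsAsFactors = FALSE
)

# Write to a temporary csv file with explicit encoding and delimiters.
path <- tempfile(fileext = ".csv")
write.table(toy, file = path, sep = ",", dec = ".", quote = TRUE,
            row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back with matching settings.
toy_back <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# TRUE if column names, types and values survived the round trip.
roundtrip_ok <- isTRUE(all.equal(toy, toy_back))
```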

Along with these primary data, you should make any ancillary data tables available, e.g. the metadata in a human-readable form, as well as the lookup tables of traits and taxa:

capture.output(attributes(dataset1Std)$metadata, file = "metadata.txt")

write.table(attributes(dataset1Std)$traits, file = "traits.csv", 
            sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE,
            fileEncoding = "UTF-8")
write.table(attributes(dataset1Std)$taxonomy, file = "taxa.csv", 
            sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE,
            fileEncoding = "UTF-8")

When publishing the trait data on file hosting services like Figshare or Zenodo, these files should be uploaded together as a single deposit (e.g. in a zip file). R can create the archive for you using zip():

zip("carabids_std.zip", c("carabids_std.csv", "metadata.txt", "traits.csv", "taxa.csv") )

More advice on publishing trait data in a standardised way can be found in our ‘Best practice examples for primary data publication’ (Schneider et al., 2018).

References

Cornelissen, J. H. C., Lavorel, S., Garnier, E., Diaz, S., Buchmann, N., Gurvich, D. E., … Van Der Heijden, M. G. A. (2003). A handbook of protocols for standardised and easy measurement of plant functional traits worldwide. Australian Journal of Botany, 51(4), 335–380.

Garnier, E., Stahl, U., Laporte, M.-A., Kattge, J., Mougenot, I., Kühn, I., … Klotz, S. (2017). Towards a thesaurus of plant characteristics: An ecological contribution. Journal of Ecology, 105(2), 298–309. doi:10.1111/1365-2745.12698

Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., … Wirth, C. (2011). A generic structure for plant trait databases. Methods in Ecology and Evolution, 2(2), 202–213. doi:10.1111/j.2041-210X.2010.00067.x

Moretti, M., Dias, A. T., Bello, F., Altermatt, F., Chown, S. L., Azcárate, F. M., … others. (2017). Handbook of protocols for standardized measurement of terrestrial invertebrate functional traits. Functional Ecology, 31(3), 558–567. doi:10.1111/1365-2435.12776

Parr, C. S., Schulz, K. S., Hammock, J., Wilson, N., Leary, P., Rice, J., & Corrigan, R. J., Jr. (2016). TraitBank: Practical semantics for organism attribute data. Semantic Web, 7(6), 577–588. doi:10.3233/SW-150190

Pebesma, E., Mailund, T., & Hiebert, J. (2016). Measurement Units in R. The R Journal, 8(2), 486–494. doi:10.32614/RJ-2016-061

Perez-Harguindeguy, N., Diaz, S., Garnier, E., Lavorel, S., Poorter, H., Jaureguiberry, P., … Gurvich, D. E. (2013). New handbook for standardised measurement of plant functional traits worldwide. Australian Journal of Botany, 61(3), 167–234.

Schneider, F. D., Jochum, M., Provost, G. L., Ostrowski, A., Penone, C., Fichtmüller, D., … Simons, N. K. (2018). Towards an Ecological Trait-data Standard. bioRxiv, 328302. doi:10.1101/328302

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. doi:10.18637/jss.v059.i10

Florian D. Schneider