This vignette contains step-by-step instructions for transferring your own data into a standardized trait dataset for upload to public databases. The output object uses the trait data standard put forward in Schneider et al. XX (refer to pub).

1. reading data

First, load your own data into R, preferably in a species–trait matrix, occurrence table or measurement longtable format (see the notes on different formats of trait data).

You may rename the columns of the original file to match the column names described in the trait data standard, but this vignette also describes how to map the column names during data handling.

R offers many ways of getting your original data into an R object. In most cases you would read the data from a csv or txt file while maintaining the column headers.
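As a minimal sketch of this step, the following base R code writes a small toy file and reads it back with its column headers intact. The file content and the column names (species, body_length, antenna_length) are made up for illustration; replace csv_path with the path to your own file.

```r
# Toy file standing in for your own data; all names and values are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "species,body_length,antenna_length",
  "Abax parallelepipedus,15.8,8.5",
  "Acupalpus meridianus,2.7,0.7"
), csv_path)

# header = TRUE keeps the column headers as column names;
# stringsAsFactors = FALSE keeps taxon names as plain character strings.
raw <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Inspect column names and types before going on.
str(raw)
```

For tab-separated txt files, read.delim() or read.table(..., sep = "\t", header = TRUE) work the same way.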

2. transfer into measurement longtable format

As explained in Schneider et al. XX, most trait data are stored in one of the following formats:

  • species-trait matrix: a single account of a trait value for each species (in rows) for a couple of different traits (in columns). No replicates of species are reported. This is the most likely format for literature data, where aggregate measurements or facts for entire species have been collated into a single lookup table.
  • occurrence wide table: In the case of measured data, authors may report multiple raw measurements of different traits (in columns) taken from a single occurrence of a species, i.e. an individual specimen (in rows). Repeated measures of the same trait might also be included as columns or pooled into average values. This is valuable for investigations of intra-specific variation, and also leaves room for filtering by cofactors or analysing trait response along environmental gradients.
  • measurement long table: For a standardisation of trait data for use in online databases, we propose a measurement long table format, where each row comprises the reporting of a single measurement or fact, linked to a trait definition as well as a valid taxon name, and optionally to other layers of information. This data format is more predictable in terms of columns and thus easier to merge with other datasets.

In all cases, additional information on the reported value may be stored in further columns (e.g. the unit in which a value is reported or the literature source for this measurement or fact), or in a separate data sheet linked via identifiers for trait, taxon, occurrence or sampling/measurement event. Examples below will explain how this information can be added to the main data sheet.

The function as.traitdata() provided in the package assists in transferring any data format into the measurement longtable format. For this function to work, it needs to know which columns of the original data contain trait values (parameter traits), and which column contains the taxonomic specification (parameter taxa).

## The dataset 'carabids' is now available for use!
## Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
##          scientificName   traitName traitValue traitUnit measurementID
## 1 Abax_parallelepipedus body_length  15.846561        mm             1
## 2  Acupalpus_meridianus body_length   2.670000        mm             2
## 3        Agonum_ericeti body_length   5.873016        mm             3
## 4    Agonum_fuliginosum body_length   5.090000        mm             4
## 5        Agonum_gracile body_length   4.880000        mm             5
## 6     Agonum_marginatum body_length   8.250000        mm             6
##   measurementRemarks datasetID
## 1               <NA>     klink
## 2               <NA>  WOODCOCK
## 3               <NA>     klink
## 4               <NA>    ribera
## 5 deduced_from_genus    ribera
## 6               <NA>    ribera
## 
## This trait-dataset contains 4 traits for 120 taxa ( 6 measurements in total).
## 
##  carabids : Carabid traits by Fons van der Plas .
## 
##     When using these data, you must acknowledge the following usage policies: 
## 
##     Cite this trait dataset as: 
## van der Plas F, van Klink R, Manning P, Olff H and Fischer M
## (2017). "Sensitivity of functional diversity metrics to sampling
## intensity." _Methods in Ecology and Evolution_. doi:
## 10.1111/2041-210x.12728 (URL:
## http://doi.org/10.1111/2041-210x.12728).
## 
##     Published under: http://creativecommons.org/publicdomain/zero/1.0/

Note that in the output table the columns have been named after the traitdata standard proposed in the whitepaper (ref). The essential columns are traitName and traitValue for the reported measurement or fact, as well as scientificName for the taxon assignment. The function automatically interprets the data as a species-trait matrix if the taxa column contains only unique entries and no duplicates.
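To make the reshaping idea concrete, here is a base R sketch that turns a small species-trait matrix into a long table with the essential columns named above. It is not the implementation of as.traitdata(); the toy data frame and its input column names are assumptions made for illustration only.

```r
# Toy species-trait matrix; names and values are hypothetical.
wide <- data.frame(
  species = c("Abax parallelepipedus", "Acupalpus meridianus"),
  body_length = c(15.8, 2.7),
  antenna_length = c(8.5, 0.7),
  stringsAsFactors = FALSE
)
trait_cols <- c("body_length", "antenna_length")

# stack() collects all trait columns into one 'values' column and
# records the originating column name in 'ind'.
stacked <- stack(wide[trait_cols])

# One row per measurement or fact, using the column names of the standard.
long <- data.frame(
  scientificName = rep(wide$species, times = length(trait_cols)),
  traitName = as.character(stacked$ind),
  traitValue = stacked$values,
  stringsAsFactors = FALSE
)
long$measurementID <- seq_len(nrow(long))

# The heuristic described in the text: unique taxon entries suggest
# a species-trait matrix rather than an occurrence table.
is_species_matrix <- anyDuplicated(wide$species) == 0
```

Each taxon now appears once per trait, which is what makes the long table predictable in its columns.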

In case of occurrence table data, an occurrenceID is provided automatically, or can be provided by the author using the parameter occurrences (as a column name or a vector of occurrence IDs).

pulldata("heteroptera_raw")
## loading dataset 'heteroptera_raw' from original data source! 
##  When using this data, please cite the original publication: 
## Gossner MM, Simons NK, Höck L and Weisser WW (2015). "Morphometric
## measures of Heteroptera sampled in grasslands across three regions
## of Germany." _Ecology_, *96*, pp. 1154. doi: 10.1890/14-2159.1
## (URL: http://doi.org/10.1890/14-2159.1).
## The dataset 'heteroptera_raw' is now available for use!
##        scientificName          traitName traitValue measurementID
## 5    Acalypta parvula        Body_length       1.84             5
## 430  Acalypta parvula         Body_width       0.82           430
## 855  Acalypta parvula        Body_height       0.56           855
## 1280 Acalypta parvula      Thorax_length       0.17          1280
## 1705 Acalypta parvula       Thorax_width       0.84          1705
## 2130 Acalypta parvula         Head_width       0.36          2130
## 2555 Acalypta parvula          Eye_width       0.10          2555
## 2980 Acalypta parvula       Antenna_Seg1       0.07          2980
## 3405 Acalypta parvula       Antenna_Seg2       0.06          3405
## 3830 Acalypta parvula       Antenna_Seg3       0.40          3830
## 4255 Acalypta parvula       Antenna_Seg4       0.15          4255
## 4716 Acalypta parvula Front.Tibia_length       0.39          4716
## 5141 Acalypta parvula   Mid.Tibia_length       0.40          5141
## 5566 Acalypta parvula  Hind.Tibia_length       0.50          5566
## 5991 Acalypta parvula Front.Femur_length       0.42          5991
## 6416 Acalypta parvula  Hind.Femur_length       0.43          6416
## 6841 Acalypta parvula  Front.Femur_width       0.08          6841
## 7266 Acalypta parvula   Hind.Femur_width       0.07          7266
## 7691 Acalypta parvula     Rostrum_length       0.80          7691
## 8116 Acalypta parvula      Rostrum_width       0.08          8116
## 8541 Acalypta parvula        Wing_length       1.78          8541
## 8966 Acalypta parvula         Wing_width       0.65          8966
##      occurrenceID
## 5               5
## 430             5
## 855             5
## 1280            5
## 1705            5
## 2130            5
## 2555            5
## 2980            5
## 3405            5
## 3830            5
## 4255            5
## 4716            5
## 5141            5
## 5566            5
## 5991            5
## 6416            5
## 6841            5
## 7266            5
## 7691            5
## 8116            5
## 8541            5
## 8966            5
## 
## This trait-dataset contains 23 traits for 179 taxa ( 22 measurements in total).
## NULL

case example: provide measurement unit

For a standardisation of quantitative trait data, the unit of measurement is essential. Often, this information is kept in the metadata descriptions. But for a standardised table containing measurements from different sources, this information should always accompany the measurement value. A common way to provide the unit is to add another column to your original data table containing the unit in an unambiguous format. The function as.traitdata() assists in adding the units via its parameter units.

This can be done for all traits in a single stroke (if all reported values refer to the same unit) or for each trait specifically (if they use different measurement units or if the table comprises a mixture of quantitative and qualitative traits).

The parameter units takes a single character string, or a vector of character strings, containing valid entries as expected by the package ‘units’ (Pebesma et al. 2016, https://github.com/edzer/units/, v0.4-5). Examples are ‘mm’, ‘m2’ or ‘m^2’, and ‘m/s’.

pulldata("carabids")
## The dataset 'carabids' is now available for use!
## Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
##          scientificName   traitName traitValue traitUnit measurementID
## 1 Abax_parallelepipedus body_length  15.846561        mm             1
## 2  Acupalpus_meridianus body_length   2.670000        mm             2
## 3        Agonum_ericeti body_length   5.873016        mm             3
## 4    Agonum_fuliginosum body_length   5.090000        mm             4
## 5        Agonum_gracile body_length   4.880000        mm             5
## 6     Agonum_marginatum body_length   8.250000        mm             6
##   measurementRemarks datasetID
## 1               <NA>     klink
## 2               <NA>  WOODCOCK
## 3               <NA>     klink
## 4               <NA>    ribera
## 5 deduced_from_genus    ribera
## 6               <NA>    ribera
## 
## This trait-dataset contains 4 traits for 120 taxa ( 6 measurements in total).
## 
##  carabids : Carabid traits by Fons van der Plas .
## 
##     When using these data, you must acknowledge the following usage policies: 
## 
##     Cite this trait dataset as: 
## van der Plas F, van Klink R, Manning P, Olff H and Fischer M
## (2017). "Sensitivity of functional diversity metrics to sampling
## intensity." _Methods in Ecology and Evolution_. doi:
## 10.1111/2041-210x.12728 (URL:
## http://doi.org/10.1111/2041-210x.12728).
## 
##     Published under: http://creativecommons.org/publicdomain/zero/1.0/

A character vector should have the same length as the provided vector of trait names (in parameter traits), or otherwise should be a named vector of the form c(trait1 = "mm", trait2 = "mm2"), where only the traits provided will receive units while the others will remain blank.

pulldata("arthropodtraits")
## loading dataset 'arthropodtraits' from original data source! 
##  When using this data, please cite the original publication: 
## Gossner MM, Simons NK, Achtziger R, Blick T, Dorow W, Dziock F,
## Köhler F, Rabitsch W and Weisser WW (2015). "A summary of eight
## traits of Coleoptera, Hemiptera, Orthoptera and Araneae, occurring
## in grasslands in Germany." _Scientific Data_, *2*, pp. 150013.
## doi: 10.1038/sdata.2015.13 (URL:
## http://doi.org/10.1038/sdata.2015.13).
## The dataset 'arthropodtraits' is now available for use!
## Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
##         scientificName traitName traitValue traitUnit measurementID
## 1 Anyphaena accentuata Body_Size       6.25        mm             1
## 2 Aculepeira ceropegia Body_Size         11        mm             2
## 3     Agalenatea redii Body_Size       6.93        mm             3
## 4   Araneus diadematus Body_Size      11.88        mm             4
## 5    Araneus marmoreus Body_Size      10.03        mm             5
## 6    Araneus quadratus Body_Size      12.25        mm             6
## 
## This trait-dataset contains 10 traits for 1230 taxa ( 6 measurements in total).
## 
## [ ] :
pulldata("heteroptera")
## loading dataset 'heteroptera' from original data source! 
##  When using this data, please cite the original publication: 
## Gossner MM, Simons NK, Höck L and Weisser WW (2015). "Morphometric
## measures of Heteroptera sampled in grasslands across three regions
## of Germany." _Ecology_, *96*, pp. 1154. doi: 10.1890/14-2159.1
## (URL: http://doi.org/10.1890/14-2159.1).
## The dataset 'heteroptera' is now available for use!
## Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
##                 scientificName   traitName traitValue traitUnit
## 1             Acalypta nigrina Body_length       2.19        mm
## 2             Acalypta parvula Body_length       1.73        mm
## 3         Acalypta platycheila Body_length       2.22        mm
## 4           Acetropis carinata Body_length       5.73        mm
## 5      Adelphocoris lineolatus Body_length       6.84        mm
## 6 Adelphocoris quadripunctatus Body_length       7.95        mm
##   measurementID
## 1             1
## 2             2
## 3             3
## 4             4
## 5             5
## 6             6
## 
## This trait-dataset contains 10 traits for 0 taxa ( 6 measurements in total).
## 
## [ ] :

Logical or factorial traits usually don’t come with a unit. In mixed data, the unit field for these traits should be left empty, i.e. given as "" or as NA.
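The behaviour of a named units vector can be sketched in base R as a simple lookup by trait name. This only illustrates the principle and is not the package's own code; the trait names and values below are hypothetical.

```r
# Toy long table mixing quantitative and logical traits (hypothetical data).
long <- data.frame(
  traitName = c("body_length", "wing_area", "is_winged"),
  traitValue = c("15.8", "4.2", "TRUE"),
  stringsAsFactors = FALSE
)

# Named vector: only the traits listed here receive a unit.
units_by_trait <- c(body_length = "mm", wing_area = "mm2")

# Looking up by name returns NA for traits without an entry,
# so the logical trait ends up without a unit.
long$traitUnit <- unname(units_by_trait[long$traitName])
```

Here the row for is_winged gets NA as its unit, matching the recommendation for logical and factorial traits.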

case example: raw data are coded as numeric factor levels

The data table should be human readable, so you may consider translating numeric codes into true factor levels via the function mutate.traitdata().

This may not be useful if the numeric levels correspond to fine grained distinctions that cannot be translated into short factor levels.

A translation into factors is even ill-advised if the factor levels are ordinal, i.e. they correspond to a sequence of logically ordered levels, because the ordering would be lost in the translation: the traitdata object will not keep the ordinal level definitions of the original R data.frame. In this case, integer values are the best way to describe the relational structure of the factor levels.

Please don’t forget to provide a definition of factor levels in the metadata description of variables or in an accompanying dataset containing trait definitions.
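A short base R sketch of both situations follows. The codes, the level labels and the trait names are hypothetical; in practice the meaning of each code comes from your own metadata.

```r
# Nominal trait coded as numbers in the raw data (hypothetical coding:
# 1 = herbivore, 2 = predator, 3 = omnivore).
guild_code <- c(1, 3, 2, 1)
guild_labels <- c("herbivore", "predator", "omnivore")

# Translate the codes into readable labels; as.character() stores the
# label itself rather than R's internal integer code.
feeding_guild <- as.character(
  factor(guild_code, levels = 1:3, labels = guild_labels)
)

# Ordinal trait (hypothetical: 1 = low, 2 = medium, 3 = high dispersal):
# keep the integers so that the order of the levels stays meaningful.
dispersal_rank <- as.integer(c(1, 3, 2, 1))
```

The labels for the nominal trait and the meaning of the integer ranks both belong in the trait definitions that accompany the dataset.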

case example: keep additional data columns

The raw data might contain further information on the specimen or the trait measurement itself in further data columns that are valuable for later analysis. This can be for instance data about the sex or developmental stage of the individual, the sampling or preservation method of the specimen, or the conditions under which the measurement was taken.

The parameter keep allows you to specify, as a character vector, which columns contain valuable information. As a negative version of keep, the parameter drop allows you to name the columns that are not valuable, while all others will be kept. Not specifying keep or drop will result in dropping all columns except the core measurement and identifier columns.

pulldata("heteroptera_raw")
## loading dataset 'heteroptera_raw' from original data source! 
##  When using this data, please cite the original publication: 
## Gossner MM, Simons NK, Höck L and Weisser WW (2015). "Morphometric
## measures of Heteroptera sampled in grasslands across three regions
## of Germany." _Ecology_, *96*, pp. 1154. doi: 10.1890/14-2159.1
## (URL: http://doi.org/10.1890/14-2159.1).
## The dataset 'heteroptera_raw' is now available for use!
##     scientificName   traitName traitValue measurementID occurrenceID Sex
## 1 Acalypta nigrina Body_length       2.35             1            1   f
## 2 Acalypta nigrina Body_length       2.10             2            2   f
## 3 Acalypta nigrina Body_length       2.17             3            3   m
## 4 Acalypta nigrina Body_length       2.15             4            4   m
## 5 Acalypta parvula Body_length       1.84             5            5   f
## 6 Acalypta parvula Body_length       1.81             6            6   f
## 
## This trait-dataset contains 23 traits for 179 taxa ( 6 measurements in total).
## 
## [ ] :

The traitdata standard (whitepaper) suggests standard names for much of this extra information, which might fall into the domain of the extensions for occurrence or measurementOrFact (see below). We highly recommend mapping the provided columns to these standard names by using the rename feature of the as.traitdata() function. This is simply achieved by providing a named vector for keep that uses the compatible column names as vector names.

## Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
##          scientificName   traitName traitValue traitUnit measurementID
## 1 Abax_parallelepipedus body_length  15.846561        mm             1
## 2  Acupalpus_meridianus body_length   2.670000        mm             2
## 3        Agonum_ericeti body_length   5.873016        mm             3
## 4    Agonum_fuliginosum body_length   5.090000        mm             4
## 5        Agonum_gracile body_length   4.880000        mm             5
## 6     Agonum_marginatum body_length   8.250000        mm             6
##   measurementDeterminedBy
## 1                   klink
## 2                WOODCOCK
## 3                   klink
## 4                  ribera
## 5                  ribera
## 6                  ribera
## 
## This trait-dataset contains 3 traits for 120 taxa ( 6 measurements in total).
## 
##  carabids : Carabid traits by Fons van der Plas .
## 
##     When using these data, you must acknowledge the following usage policies: 
## 
##     Cite this trait dataset as: 
## van der Plas F, van Klink R, Manning P, Olff H and Fischer M
## (2017). "Sensitivity of functional diversity metrics to sampling
## intensity." _Methods in Ecology and Evolution_. doi:
## 10.1111/2041-210x.12728 (URL:
## http://doi.org/10.1111/2041-210x.12728).
## 
##     Published under: http://creativecommons.org/publicdomain/zero/1.0/

Note that an entry without a name in the named vector maintains its original column name. Note also that no checking for valid column names (as compared to the traitdata glossary) is performed at this stage. This is to ensure that the raw data table created by as.traitdata() can contain any columns that the author considers relevant. The keep parameter can be used to rename columns into intuitive column names.
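The renaming rule can be sketched in base R as follows. This is an illustration of the behaviour described above, not the package's implementation; the raw column names source_measurement and note are assumptions for the toy data.

```r
# Toy raw data with two extra columns (hypothetical names and values).
raw <- data.frame(
  species = c("Abax parallelepipedus", "Acupalpus meridianus"),
  body_length = c(15.8, 2.7),
  source_measurement = c("klink", "WOODCOCK"),
  note = c(NA, "deduced_from_genus"),
  stringsAsFactors = FALSE
)

# Named entries are renamed (name = new column name, value = old column
# name); the unnamed entry "note" keeps its original name.
keep <- c(measurementDeterminedBy = "source_measurement", "note")

# names(keep) is "" for unnamed entries, so fall back to the old name there.
new_names <- ifelse(names(keep) == "", keep, names(keep))

kept <- raw[unname(keep)]
names(kept) <- new_names
```

After this step, kept has the columns measurementDeterminedBy and note, ready to be attached to the long table.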

case example: adding further information on traits, species or single measurements

Beyond measurement units, further information might be available that is not recorded in the raw data table, but is related to the trait type, the taxon, the individual or specimen, or to the reported fact, measurement or sampling event.

In most cases this information is kept in separate data sheets of your file, e.g. the place where a specimen has been sampled or the literature source from which a species value has been cited. In this case, a unique identifier might link to this other datasheet, such as a number for each individual occurrence of a specimen (occurrenceID) or an identifier for a single measurement or reported fact (measurementID).

The trait data standard provides two extensions of the namespace that should be used to describe these data:

  • the occurrence extension contains information on the level of individual specimens, such as date and location and method of sampling and preservation, or physiological specifications of the phenotype, such as sex, life stage or age.
  • the measurementOrFact extension takes information at the level of single measurements or reported values, such as the original literature from where the value is cited, the method of measurement or statistical method of aggregation.

The extensions are compatible with Darwin Core Standard and EOL TraitBank.

You may decide to keep the information in a separate data sheet. In that case, the traitdata table should at least contain a column with the respective identifier that points to the covariate datasheet. The identifier might also take the format of a globally valid URI or API call.

It is, however, recommended to add this information directly as separate columns within the data table to enable an analysis of cofactors and correlations further down the road. This way, if datasets from different sources are merged, the information is readily available without the risk of breaking the reference to an external datasheet.

The function as.traitdata() provides a set of parameters to add information at the different levels. The following three examples will illustrate how to add covariates to each occurrence or measurement. The principle is always the same: A unique identifier for these levels of information can be associated with a vector or data table containing the additional information, which will then be merged into the data table. The functionality builds on the base R function merge.data.frame, but checks for compatibility with the glossary of terms of the traitdata standard.
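The merging principle can be sketched with base R's merge(), which the text names as the underlying function. The example does not show the parameters of as.traitdata(); the two toy tables and the covariate columns sex and lifeStage are assumptions for illustration.

```r
# Long table of measurements, two occurrences (hypothetical values).
traits <- data.frame(
  occurrenceID = c(1, 1, 2),
  traitName = c("Body_length", "Body_width", "Body_length"),
  traitValue = c(2.35, 1.15, 2.10),
  stringsAsFactors = FALSE
)

# Separate sheet with one row of covariates per occurrence.
occurrences <- data.frame(
  occurrenceID = c(1, 2),
  sex = c("f", "m"),
  lifeStage = c("adult", "adult"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every measurement, even if an occurrence has no
# entry in the covariate sheet (its covariates would then be NA).
merged <- merge(traits, occurrences, by = "occurrenceID", all.x = TRUE)
```

The same pattern applies at the measurement level, with measurementID as the key instead of occurrenceID.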

Adding information on specimen level (occurrence Extension)

  • under construction -

Adding information on measurement or fact (measurementOrFact Extension)

  • under construction -

case example: mutate original columns into derived values

Many traits comprise compound measures of multiple traits, such as length-mass ratios or morphometric indices. Other traits must be refined in terms of factor levels, or reduced to binary trait values. Many of these tasks can be achieved on the raw data using base functions like transform(), factor() or match(), or the mutate() function provided by the package ‘plyr’.

The function mutate.traitdata() performs these tasks on a traitdata object (working as a wrapper to plyr::mutate()).

##        scientificName  traitName traitValue measurementID occurrenceID Sex
## 9387 Acalypta nigrina Body_shape   2.043478          <NA>            1   f
## 9388 Acalypta nigrina Body_shape   1.721311          <NA>            2   f
## 9389 Acalypta nigrina Body_shape   2.028037          <NA>            3   m
## 9390 Acalypta nigrina Body_shape   2.067308          <NA>            4   m
## 9391 Acalypta parvula Body_shape   2.243902          <NA>            5   f
## 9392 Acalypta parvula Body_shape   1.885417          <NA>            6   f
## 
## This trait-dataset contains 26 traits for 179 taxa ( 6 measurements in total).
## 
## [ ] :

Note that all existing traits remain untouched and additional trait measures will be added to the dataset, unless a definition replaces an already existing trait (such as ‘Stratum_use’ in this example).

It is important to note that the mutate function works at the level of data resolution that is provided by the data, i.e. for occurrence data with multiple measurements on a single individual, the data columns are mutated per occurrenceID.
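The per-occurrence logic can be sketched in base R. This is not how mutate.traitdata() is implemented; it only shows what "mutated per occurrenceID" means. The toy values and the definition of Body_shape as length divided by width are assumptions made for illustration.

```r
# Toy long table: two traits measured on each of two individuals.
long <- data.frame(
  occurrenceID = c(1, 1, 2, 2),
  traitName = c("Body_length", "Body_width", "Body_length", "Body_width"),
  traitValue = c(2.0, 1.0, 3.0, 2.0),
  stringsAsFactors = FALSE
)

# Spread to one row per occurrence and one column per trait.
per_occurrence <- tapply(
  long$traitValue,
  list(long$occurrenceID, long$traitName),
  mean
)

# Derived trait, computed within each occurrence (assumed definition).
body_shape <- per_occurrence[, "Body_length"] / per_occurrence[, "Body_width"]

derived <- data.frame(
  occurrenceID = as.numeric(rownames(per_occurrence)),
  traitName = "Body_shape",
  traitValue = unname(body_shape),
  stringsAsFactors = FALSE
)

# Existing traits stay untouched; the derived trait is appended.
combined <- rbind(long, derived)
```

Because the ratio is formed within each occurrence, values from different individuals are never mixed.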

3. standardise taxon names and trait values

Steps 1 and 2 produced a tidy and correctly formatted version of your own trait data. We now turn to the challenging task of standardisation. Two aspects of trait data need thorough standardisation: the names of species and higher taxa need to be mapped to globally accepted definitions, and the names of traits should be referenced to unambiguous definitions and, where possible, translated to standard units and accepted factor levels.

taxon name standardisation

For taxon name standardisation, the function standardize.taxonomy() makes use of fuzzy matching algorithms provided by the package ‘taxize’ to match the entries of the column scientificName against the GBIF Backbone Taxonomy (taxize v). The result is written into a new column scientificNameStd. Additional columns comprise the order (for ambiguous names), the reported taxon rank, and a globally unique taxon ID which references the taxon in the GBIF Backbone Taxonomy in a universal URI format.

If further layers of taxonomic information are desired as an output, the function takes the parameter return, which by default contains c("taxonID", "scientificNameStd", "order", "taxonRank"). Other specifications can be added here.

Note that for this to work, scientificName must contain the full name of the species or higher taxon, without abbreviations (spaces and underscores are both handled correctly).

Note also that taxon name mapping requires an internet connection and might take some time, depending on the length of your species list.

##          scientificName        traitName traitValue traitUnit
## 1 Abax_parallelepipedus      body_length  15.846561        mm
## 2 Abax_parallelepipedus   antenna_length   8.518519        mm
## 3 Abax_parallelepipedus metafemur_length   5.608466        mm
## 4  Acupalpus_meridianus   antenna_length   0.700000        mm
## 5  Acupalpus_meridianus metafemur_length   0.750000        mm
## 6  Acupalpus_meridianus      body_length   2.670000        mm
##       scientificNameStd                             taxonID measurementID
## 1 Abax parallelepipedus http://www.gbif.org/species/5754772             1
## 2 Abax parallelepipedus http://www.gbif.org/species/5754772           121
## 3 Abax parallelepipedus http://www.gbif.org/species/5754772           241
## 4  Acupalpus meridianus http://www.gbif.org/species/1037633           122
## 5  Acupalpus meridianus http://www.gbif.org/species/1037633           242
## 6  Acupalpus meridianus http://www.gbif.org/species/1037633             2
##   warnings taxonRank  kingdom     phylum   class      order    family
## 1            species Animalia Arthropoda Insecta Coleoptera Carabidae
## 2            species Animalia Arthropoda Insecta Coleoptera Carabidae
## 3            species Animalia Arthropoda Insecta Coleoptera Carabidae
## 4            species Animalia Arthropoda Insecta Coleoptera Carabidae
## 5            species Animalia Arthropoda Insecta Coleoptera Carabidae
## 6            species Animalia Arthropoda Insecta Coleoptera Carabidae
##   measurementDeterminedBy
## 1                   klink
## 2                   klink
## 3                   klink
## 4                WOODCOCK
## 5                WOODCOCK
## 6                WOODCOCK
## 
## This trait-dataset contains 3 traits for 120 taxa ( 6 measurements in total).
## 
##  carabids : Carabid traits by Fons van der Plas .
## 
##     When using these data, you must acknowledge the following usage policies: 
## 
##     Cite this trait dataset as: 
## van der Plas F, van Klink R, Manning P, Olff H and Fischer M
## (2017). "Sensitivity of functional diversity metrics to sampling
## intensity." _Methods in Ecology and Evolution_. doi:
## 10.1111/2041-210x.12728 (URL:
## http://doi.org/10.1111/2041-210x.12728).
## 
##     Published under: http://creativecommons.org/publicdomain/zero/1.0/

trait name and value standardisation

Due to the heterogeneity of approaches and research questions related to trait-based research, a universal trait definition standard does not exist at the time of writing. It is therefore difficult to assign globally unique identifiers that provide a reference to an unambiguous definition. Some databases offer a list of traits in one way or another, e.g. as a datasheet or in-text table, but few offer a stable URI reference or an API. Exceptions are the Gramene Ontology, which offers trait definitions for crop plants, and the TOP thesaurus for plant traits (http://top-thesaurus.org), which is rather comprehensive, but does not provide easy means of referencing. Many such trait ontologies are currently under construction for different animal phyla.

In most cases, you would instead refer to your own lookup table, a so-called thesaurus of traits, using dataset-specific identifiers. The thesaurus may also be part of the metadata accompanying your trait dataset.

To transfer the user-provided traits and trait values into standardised values, the function standardize.traits() merges the data table with a reference table of trait definitions to produce values in a compliant format.
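The idea of merging against a reference table can be sketched in base R. This is an illustration under stated assumptions and not the code of standardize.traits(): the reference table, the identifier "T001" and the hand-written conversion factors are hypothetical, while the output column names follow those visible in the printed tables of this vignette.

```r
# Toy long table with mixed units (hypothetical values).
traits <- data.frame(
  traitName = c("body_length", "body_length"),
  traitValue = c(1.5, 20),
  traitUnit = c("cm", "mm"),
  stringsAsFactors = FALSE
)

# Hypothetical reference table of trait definitions.
reference <- data.frame(
  trait = "body_length",
  traitID = "T001",
  traitUnitStd = "mm",
  stringsAsFactors = FALSE
)

# Attach the definition to every measurement of the trait.
std <- merge(traits, reference, by.x = "traitName", by.y = "trait", all.x = TRUE)
std$traitNameStd <- std$traitName

# Hand-written conversion factors to millimetres (assumption for this
# sketch; a real implementation would rely on a units library).
to_mm <- c(mm = 1, cm = 10, m = 1000)
std$traitValueStd <- std$traitValue * unname(to_mm[std$traitUnit])
```

The original traitValue and traitUnit columns are kept next to the standardised ones, so the conversion remains traceable.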

refer to an existing trait ontology

A few trait ontologies do exist: the TOP Thesaurus of plant traits (used by TRY) and Gramene.org offer definitions of plant traits via an API, and for soil invertebrates the T-SITA thesaurus offers a set of traits relevant for this organism group. To date, no script for systematic access to these ontologies can be provided here. Thus, the key information must be provided manually as a separate data object in R.

This procedure is only recommended if all of the traits reported in your dataset refer to a definition in an online thesaurus.

We highly encourage the implementation of open online resources for these glossaries (e.g. via APIs), which would allow looking up existing trait definitions programmatically and matching user-provided names to accepted trait names via fuzzy matching.

refer to an own trait thesaurus

If no published trait definition is available that can be referenced, trait-datasets should be accompanied by a dataset-specific glossary of traits, or thesaurus. A thesaurus provides a “controlled vocabulary designed to clarify the definition and structuring of key terms and associated concepts in a specific discipline”.

Ideally this thesaurus is stored as an asset along with your trait dataset or in a public file on the internet. This can be a csv or txt file published on an open access repository (figshare, researchgate or github, to name but a few), or a website providing direct links to the trait definition (URI).

This reference file should contain at least the following columns:

  • trait should be a short descriptive name. No spaces should be used. Rather use a scheme with underscores or capital letters to highlight multiple words (e.g. ‘body_length’ or ‘bodyLength’).
  • identifier should specify an alphanumeric ID for the specific use in your dataset or - better - a URI that reliably links to the definition of the trait measurement on an online repository. This could be achieved by providing an online version of your traitlist (TODO: provide instructions in wiki how to achieve this). We highly encourage you to submit your own trait definitions to existing ontology servers to facilitate this process of trait standardisation (e.g. with GFBio).
  • traitDescription: a detailed, unambiguous and human-readable definition.
  • valueType to specify the expected kind of entries. Set it to ‘numeric’ for quantitative traits, ‘integer’ for counts or ordinal traits, ‘character’ for trait values that are provided as free text, ‘factor’ for traits that take one of a few non-ordinal levels, and ‘logical’ for binary/boolean entries (yes/no).
    • For numeric traits, the field expectedUnit should provide the expected unit for the trait. The R script will then try to convert trait values into this unit.
    • For categorical traits of kind ‘factor’ or ‘integer’, the field factorLevels should contain a list of the valid factor levels separated by semicolons. In case of ordinal traits, the order must correspond exactly to the sequence of possible integer values.
  • comments may contain examples and clarifications.
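A reference table with the columns listed above could look like the following base R sketch. The two trait definitions, their identifiers and descriptions are invented for illustration, and the closing check is only a convenience, not part of the package.

```r
# Hypothetical thesaurus with one numeric and one factorial trait.
thesaurus <- data.frame(
  trait = c("body_length", "feeding_guild"),
  identifier = c("T001", "T002"),
  traitDescription = c(
    "Length from the tip of the head to the end of the abdomen.",
    "Main feeding mode of the adult."
  ),
  valueType = c("numeric", "factor"),
  expectedUnit = c("mm", NA),
  factorLevels = c(NA, "herbivore;predator;omnivore"),
  comments = c("", ""),
  stringsAsFactors = FALSE
)

# Split the semicolon-separated levels of a factorial trait.
guild_levels <- strsplit(
  thesaurus$factorLevels[thesaurus$trait == "feeding_guild"],
  ";",
  fixed = TRUE
)[[1]]

# Simple sanity checks: required columns present, no spaces in trait names.
required <- c("trait", "identifier", "traitDescription", "valueType")
has_required <- all(required %in% names(thesaurus))
names_ok <- !any(grepl(" ", thesaurus$trait, fixed = TRUE))
```

Saved with write.csv(), such a table can be published alongside the trait dataset as its dataset-specific glossary.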

The trait thesaurus can be created from a data.frame using the function as.thesaurus(). The parameter replace can be used to map existing column names to the expected names outlined above (see the function rename() of the plyr package).

In R, the thesaurus can also be created manually by passing objects of class ‘trait’ to the function as.thesaurus(), which will use them to create a valid data frame. This is especially useful if your data comprise only a small number of traits. Using the as.trait() syntax allows a more flexible trait definition and ensures compliance with the terms of the traitdata standard outlined above. It also allows building a library of trait definitions in which single traits can be reused in multiple projects.

standardise trait data

The function standardize.traits() now finally has all it needs to complete its job.

##           scientificName   traitName
## 1  Abax_parallelepipedus body_length
## 2          Amara_brunnea body_length
## 3   Limodromus_assimilis body_length
## 4      Carabus_nemoralis body_length
## 5 Bradycellus_ruficollis body_length
## 6   Acupalpus_meridianus body_length
##                                                       traitID traitValue
## 1 http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length  15.846561
## 2 http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length   5.000000
## 3 http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length   8.981481
## 4 http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length  18.450000
## 5 http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length   2.880000
## 6 http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length   2.670000
##   traitUnit      scientificNameStd traitNameStd traitValueStd traitUnitStd
## 1        mm  Abax parallelepipedus  body_length            NA           mm
## 2        mm          Amara brunnea  body_length            NA           mm
## 3        mm     Platynus assimilis  body_length            NA           mm
## 4        mm      Carabus nemoralis  body_length            NA           mm
## 5        mm Bradycellus ruficollis  body_length            NA           mm
## 6        mm   Acupalpus meridianus  body_length            NA           mm
##                               taxonID measurementID
## 1 http://www.gbif.org/species/5754772             1
## 2 http://www.gbif.org/species/9102035            14
## 3 http://www.gbif.org/species/8137203            80
## 4 http://www.gbif.org/species/8056040            55
## 5 http://www.gbif.org/species/5755986            42
## 6 http://www.gbif.org/species/1037633             2
##                                                                  warnings
## 1                                                                        
## 2                                                                        
## 3  A synonym was provided! Automatically mapped to accepted species name!
## 4                                                                        
## 5                                                                        
## 6                                                                        
##   taxonRank  kingdom     phylum   class      order    family
## 1   species Animalia Arthropoda Insecta Coleoptera Carabidae
## 2   species Animalia Arthropoda Insecta Coleoptera Carabidae
## 3   species Animalia Arthropoda Insecta Coleoptera Carabidae
## 4   species Animalia Arthropoda Insecta Coleoptera Carabidae
## 5   species Animalia Arthropoda Insecta Coleoptera Carabidae
## 6   species Animalia Arthropoda Insecta Coleoptera Carabidae
##   measurementDeterminedBy
## 1                   klink
## 2                   klink
## 3                   klink
## 4                  ribera
## 5                  ribera
## 6                WOODCOCK
## 
## This trait-dataset contains 3 traits for 120 taxa ( 6 measurements in total).
## 
##  carabids : Carabid traits by Fons van der Plas .
## 
##     When using these data, you must acknowledge the following usage policies: 
## 
##     Cite this trait dataset as: 
## van der Plas F, van Klink R, Manning P, Olff H and Fischer M
## (2017). "Sensitivity of functional diversity metrics to sampling
## intensity." _Methods in Ecology and Evolution_. doi:
## 10.1111/2041-210x.12728 (URL:
## http://doi.org/10.1111/2041-210x.12728).
## 
##     Published under: http://creativecommons.org/publicdomain/zero/1.0/

In terms of standardisation, the function performs the following steps:

  • Unit conversion: on all numerical traits, unit conversion to the target unit will be attempted. Unit conversion can only be successfully performed if both columns traitUnit and traitUnitStd are provided with valid unit names for the numeric trait.
  • factor level checking : if a controlled vocabulary is provided in the trait thesaurus, the function checks whether the provided factor levels are valid and asks for a mapping vector otherwise. (This check is currently not functional.)
  • logical value harmonization : for logical traits, the function harmonizes the standardised output. By default it produces a vector of TRUE and FALSE entries. Missing values return NA. The parameters output and categories can be provided to function standardize(). See ?fixlogical for further detail.
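
The three steps can be illustrated with a few lines of base R. This is a sketch of the underlying ideas only, not the package's implementation: the helper names, the conversion table and the accepted yes/no spellings are all assumptions made for this example.

```r
# 1. Unit conversion: hypothetical lookup of factors relative to millimetres
to_mm <- c(mm = 1, cm = 10, m = 1000)
convert_length <- function(value, from, to) {
  # unknown unit names yield NA instead of a converted value
  unname(value * to_mm[from] / to_mm[to])
}
lengths_mm <- convert_length(c(1.5, 12), from = c("cm", "mm"), to = "mm")

# 2. Factor level checking: report values that are not in the
#    semicolon-separated controlled vocabulary
invalid_levels <- function(values, factorLevels) {
  valid <- strsplit(factorLevels, ";", fixed = TRUE)[[1]]
  setdiff(unique(values), valid)
}
unknown <- invalid_levels(
  c("macropterous", "winged"),
  "apterous;brachypterous;macropterous"
)

# 3. Logical value harmonization: map common spellings to TRUE/FALSE,
#    everything else (including missing values) becomes NA
harmonize_logical <- function(x,
                              true = c("yes", "y", "true", "1"),
                              false = c("no", "n", "false", "0")) {
  x <- tolower(trimws(as.character(x)))
  out <- rep(NA, length(x))
  out[x %in% true] <- TRUE
  out[x %in% false] <- FALSE
  out
}
flags <- harmonize_logical(c("Yes", "no", NA, "1"))
```

In the package itself, the harmonization of logical values is documented under ?fixlogical.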

georeference standardization for Biodiversity Exploratories

The traitdata standard has been developed within the Biodiversity Exploratories project (DFG), a long-term assessment of plant and animal communities in three regions across Germany starting in 2008. Trait data collected on one of the 300 project plots can be georeferenced using the function standardize.exploratories().

To access the high-resolution location data, credentials to BEXIS will be requested. If credentials are incorrect or missing, only low-resolution geolocation can be extracted.

single stroke standardization

The functions described here are applied sequentially. The output of the first step can be piped into the second step, etc.

To make things even simpler, the functions for format conversion and standardization are wrapped into a single function named standardize(). If all parameters required by the intermediate steps are provided, a single call that takes all the optional parameters described above runs the entire workflow.

4. Working with trait datasets

combine multiple traitdata tables

Combining separate datasets can be done using rbind() before or after the standardisation process, depending on the use case. Use cases of merging data are:

  • You collected data from different sources and want to harmonize taxon and trait names: bring the data into long-table format and merge them into one data object, then harmonize taxa and units following a uniform standard.
  • No unified trait list or taxon reference exists for the heterogeneous data assembled from different sources (e.g. because they span many different taxa): apply the standardization to the different reference systems before merging the datasets.

The function call will append the data tables while merging the common columns and maintaining columns that are not present in all datasets (this might produce lots of NA). The column datasetID will be added to keep track of the origin of the data. By default this column will contain the object names of the original datasets, but it can be replaced by more meaningful IDs using the parameter datasetID.

## Warning in rbind(deparse.level, ...): There seems to be no overlap in trait names of the provided datasets. 
##            It is recommended to map 'traitNameStd' of each dataset to the same thesaurus or ontology!
## Warning in rbind(deparse.level, ...): There seems to be no overlap in taxon names of the provided datasets. 
##             It is recommended to map 'ScientificNameStd' of each dataset to the same thesaurus or ontology!

Note that the package provides a method for the base function rbind(). Documentation can be accessed via ?rbind.traitdata.
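
The appending behaviour described above (shared columns are merged, unmatched columns are kept and filled with NA, and a datasetID column records the origin) can be sketched in base R. The helper bind_fill() is hypothetical and written for this example only; it is not the package's rbind method, and the data values are invented.

```r
# Hypothetical helper (not the package's rbind method)
bind_fill <- function(..., datasetID = NULL) {
  datasets <- list(...)
  if (is.null(datasetID)) {
    # default IDs: the object names as written in the call
    datasetID <- vapply(
      as.list(substitute(list(...)))[-1], deparse, character(1)
    )
  }
  all_cols <- unique(unlist(lapply(datasets, names)))
  filled <- Map(function(d, id) {
    # add columns that this dataset lacks, filled with NA
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d <- d[all_cols]
    d$datasetID <- id
    d
  }, datasets, datasetID)
  out <- do.call(rbind, unname(filled))
  rownames(out) <- NULL
  out
}

# Toy long tables with partly different columns (values are invented)
beetles <- data.frame(
  scientificName = c("Abax parallelepipedus", "Amara brunnea"),
  traitName = "body_length",
  traitValue = c(15.8, 5.0),
  traitUnit = "mm",
  stringsAsFactors = FALSE
)
bugs <- data.frame(
  scientificName = "Nabis ferus",
  traitName = "Body_length",
  traitValue = 7.9,
  sex = "f",
  stringsAsFactors = FALSE
)

combined <- bind_fill(beetles, bugs)
renamed <- bind_fill(beetles, bugs, datasetID = c("carabids", "heteroptera"))
```

In this toy example the trait names differ in spelling ('body_length' and 'Body_length'), which is the kind of mismatch the warnings shown above point to.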

The function will handle metadata information on the dataset level as described in the section ‘Metadata’ of the Traitdata Standard (e.g. author or bibliographicCitation) and add a column ‘datasetID’ as well as ‘datasetName’ and ‘author’ if those are provided in the attributes of the input objects. This can be achieved by using the metadata parameter of as.traitdata().

pulldata("carabids")
## The dataset 'carabids' is now available for use!
## Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
pulldata("heteroptera_raw")
## loading dataset 'heteroptera_raw' from original data source! 
##  When using this data, please cite the original publication: 
## Gossner MM, Simons NK, Höck L and Weisser WW (2015). "Morphometric
## measures of Heteroptera sampled in grasslands across three regions
## of Germany." _Ecology_, *96*, pp. 1154. doi: 10.1890/14-2159.1
## (URL: http://doi.org/10.1890/14-2159.1).
## The dataset 'heteroptera_raw' is now available for use!
## it seems you are providing repeated measures of traits on multiple specimens of the same species (i.e. an occurrence table)! Sequential identifiers for the occuences will be added. If your dataset contains user-defined occurrenceIDs you may specify the column name in parameter 'occurrences'.
## Warning in rbind(deparse.level, ...): There seems to be no overlap in trait names of the provided datasets. 
##            It is recommended to map 'traitNameStd' of each dataset to the same thesaurus or ontology!
## Warning in rbind(deparse.level, ...): There seems to be no overlap in taxon names of the provided datasets. 
##             It is recommended to map 'ScientificNameStd' of each dataset to the same thesaurus or ontology!

The detailed metadata information (e.g. license and bibliographic citation) will be stored in the attributes of the dataset and displayed when calling it in R.

add data layers

When storing data, it may be advisable to move repetitive entries into a separate data sheet and link to it via datasetID. In R, you can use the function merge() to map metadata information from a second data table into the core data table based on the datasetID.
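
A minimal sketch of such a link with base R's merge() is shown below. The core table holds invented toy measurements, and the column names of the metadata table other than datasetID are assumptions made for this example.

```r
# Core data table: one row per measurement (toy values, invented)
core <- data.frame(
  datasetID = c("carabids", "carabids", "heteroptera"),
  scientificName = c("Abax parallelepipedus", "Amara brunnea", "Nabis ferus"),
  traitValue = c(15.8, 5.0, 7.9),
  stringsAsFactors = FALSE
)

# Metadata table: one row per dataset, stored once instead of per measurement
dataset_info <- data.frame(
  datasetID = c("carabids", "heteroptera"),
  bibliographicCitation = c("van der Plas et al. (2017)",
                            "Gossner et al. (2015)"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every measurement, even if its dataset has no metadata row
merged <- merge(core, dataset_info, by = "datasetID", all.x = TRUE)
```

Keeping the metadata in its own table avoids repeating the same citation in every row of the stored file.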

compile aggregate values

The function cast.traitdata() rearranges the long-table format into the more intuitive wide-table or matrix format. This can be used to preview complex datasets or to perform calculations on multivariate values.

##         scientificName traitName traitValue traitUnit measurementID
## 1 Anyphaena accentuata Body_Size       6.25        mm             1
## 2 Aculepeira ceropegia Body_Size         11        mm             2
## 3     Agalenatea redii Body_Size       6.93        mm             3
## 4   Araneus diadematus Body_Size      11.88        mm             4
## 5    Araneus marmoreus Body_Size      10.03        mm             5
## 6    Araneus quadratus Body_Size      12.25        mm             6
## 
## This trait-dataset contains 10 traits for 1230 taxa ( 6 measurements in total).
## 
## [ ] :
## Aggregation requires fun.aggregate: length used as default
##   value Body_Size Dispersal_ability Feeding_guild Feeding_mode
## 1 (all)      1230              1230          1230         1230
##   Feeding_specialization Feeding_tissue Feeding_plant_part
## 1                    755            306                450
##   Endophagous_lifestyle Stratum_use Stratum_use_short
## 1                   117        1230              1230
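
The message in the output above indicates that, when no aggregation function is supplied, the number of measurements per cell is counted. What such an aggregation does can be sketched with base R's tapply() on a toy long table. The values and the trait name 'Leg_length' are invented for illustration, and this sketch does not use cast.traitdata() itself.

```r
# Toy long table with repeated measurements (values are invented)
long <- data.frame(
  scientificName = c("Araneus diadematus", "Araneus diadematus",
                     "Araneus quadratus", "Araneus quadratus"),
  traitName = c("Body_Size", "Body_Size", "Body_Size", "Leg_length"),
  traitValue = c(11.5, 12.5, 12.25, 20),
  stringsAsFactors = FALSE
)

# Species x trait matrix of mean values; empty combinations are NA
wide_mean <- tapply(
  long$traitValue,
  list(long$scientificName, long$traitName),
  mean
)

# The same layout with the number of measurements per cell
wide_n <- tapply(
  long$traitValue,
  list(long$scientificName, long$traitName),
  length
)
```

Choosing the aggregation function explicitly (for example the mean) avoids accidentally analysing counts instead of trait values.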