Introduction to 'traitdataform'
Florian D. Schneider
Assistance for handling functional trait data and transferring them into the Ecological Trait-data Standard (Schneider et al. 2018, https://terminologies.gfbio.org/terms/ets/pages/ doi: 10.5281/zenodo.1485739).
There are two major use cases for the package:
- preparing your own trait datasets for upload to public databases, and
- harmonizing trait datasets from different sources by moulding them into a unified format.
The toolset of the package includes
- transforming typical trait-data formats (e.g. species-trait-matrix or measurement-table data) into a unified long-table format and mapping column names into terms provided in the Ecological Trait-data Standard (ETS) (see section 1. Reading data),
- mapping trait concepts onto a user-provided trait list (i.e. a thesaurus of traits) or globally accessible URIs, and unifying units and factor levels (see section 2. Standardize traits),
- mapping species concepts onto globally accessible definitions via URIs (pointing to the GFBio taxonomic ontology server) (see section 3. Standardize taxa),
- merging and handling compiled trait data, while keeping track of the metadata for each original dataset (see section 4. Working with trait-datasets),
- saving trait datasets in a desired format using templates (e.g. for project-specific databases or online repositories) (see section 5. Writing data).
This vignette contains step-by-step instructions for transferring your own data into a standardized trait dataset for upload to public databases. See Schneider et al. (2019), Towards an Ecological Trait-data Standard, Methods in Ecology and Evolution, DOI: 10.1111/2041-210X.13288, for a discussion of the rationale.
1. Reading data
load data from source
The first step is to load your data into R. This can be your own data, read from file, or data published elsewhere, directly accessible via a URL.
R knows many ways of getting your data into an R object. In most cases you would read an object from a csv or txt file while maintaining the column headers.
carabids <- read.table("../../data/carabid traits final.txt", header = TRUE)
If reading files from a file repository, you can refer to the URL directly, e.g.
# pulling data from van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017) Sensitivity of functional diversity metrics to sampling intensity. Methods in Ecology and Evolution 8(9): 1072-1080. https://doi.org/10.1111/2041-210x.12728
carabids <- read.delim("https://datadryad.org/stash/downloads/file_stream/23901", stringsAsFactors = FALSE)
Most trait data are stored in one of the following two formats:
- species × trait matrix: a single account of a trait value for each species (in rows) for a couple of different traits (in columns). No replicates of species are reported. This is the most likely format for literature data, where aggregate measurements or facts for entire species have been collated into a single lookup table.
- observation wide table : in case of measured data, authors may report multiple raw measurements of different traits (in columns) taken from a single observation instance of a species, i.e. an individual (in rows). Repeated measures of the same trait might also be included as columns or pooled into average values. This is valuable for investigations of intra-specific variation, and also leaves space for filtering by co-factors or analyzing trait response along environmental gradients.
In both cases, additional information on the species or observation may be stored in further columns (e.g. the unit in which a value is reported or the literature source for this measurement or fact, or the date and geolocation of sampling), or in a separate data sheet linked via identifiers for trait, taxon, occurrence or sampling/measurement event. As the column names and the width of the table varies with the number of traits included, merging data from different sources requires user-defined mapping and manual harmonization of these structures.
A more effective format is the measurement long-table (Kattge et al., 2011; Wickham, 2014; Parr et al., 2016), where each row is reserved for a single measurement or fact of a specific trait. This allows repeated measurements on a single individual to be stored by linking the data from separate rows via a unique identifier for each individual (labelled occurrenceID). Also, multivariate trait measurements can be recorded in this format by linking multiple rows via a unique measurement identifier. Long-table datasets offer several advantages for data manipulation (e.g. filtering, sub-setting and aggregating data), visualization (e.g. plotting measured values by factor variable or taxon) and statistical modelling (e.g. ANOVA for testing differences in trait value by sex) (Wickham, 2014). Each row of the dataset can be interpreted as a statement of an ‘entity x having a qualitative/quantitative feature y’ (Garnier et al., 2017; Schneider et al., 2018). As long-table formats draw from a defined set of columns, merging datasets is much easier.
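To illustrate the difference between the two layouts, the following base-R sketch reshapes a small species × trait matrix into a measurement long-table. The toy data frame and its column names are made up for this example, and the code is not meant to reflect how the package performs the conversion internally.

```r
# Toy species x trait matrix (names and values are invented for illustration)
wide <- data.frame(
  species     = c("sp_a", "sp_b"),
  body_length = c(10.2, 4.5),
  eye_width   = c(1.1, 0.6),
  stringsAsFactors = FALSE
)

trait_cols <- c("body_length", "eye_width")

# stack() collapses the trait columns into one value column ("values")
# and one column holding the original column name ("ind")
stacked <- stack(wide[trait_cols])

long <- data.frame(
  species    = rep(wide$species, times = length(trait_cols)),
  traitName  = as.character(stacked$ind),
  traitValue = stacked$values,
  stringsAsFactors = FALSE
)

# Running number for each single measurement
long$measurementID <- seq_len(nrow(long))

long
```

Each row of the result now holds exactly one measurement, so adding a further trait adds rows rather than columns.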
The function as.traitdata() provided in the package assists in transferring data into the measurement long-table format. For this function to work, it needs to know at least which columns of the original data contain trait values (parameter traits) and which column contains the taxonomic concept (parameter taxa).
dataset1 <- as.traitdata(carabids,
taxa = "name_correct",
traits = c("body_length",
"antenna_length",
"metafemur_length",
"eyewidth_corr"),
units = "mm"
)
#> Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
head(dataset1)
#> verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1 Abax_parallelepipedus body_length 15.846561 mm
#> 2 Acupalpus_meridianus body_length 2.670000 mm
#> 3 Agonum_ericeti body_length 5.873016 mm
#> 4 Agonum_fuliginosum body_length 5.090000 mm
#> 5 Agonum_gracile body_length 4.880000 mm
#> 6 Agonum_marginatum body_length 8.250000 mm
#> measurementID measurementDeterminedBy measurementRemarks
#> 1 1 klink <NA>
#> 2 2 WOODCOCK <NA>
#> 3 3 klink <NA>
#> 4 4 ribera <NA>
#> 5 5 ribera deduced_from_genus
#> 6 6 ribera <NA>
#>
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#>
#> carabids : Carabid traits by Fons van der Plas .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
Note that in the output table the columns have been renamed according to the ETS. The essential columns are verbatimTraitName and verbatimTraitValue for the reported measurement or fact, as well as verbatimScientificName for the taxon concept. The newly assigned column measurementID contains a running number for each individual trait measurement.
The function automatically interprets data as a species × traits matrix if the taxa column contains only unique entries and no duplicates. In case of multiple assignments to the same taxon, the script assumes an observation wide-table and creates a new column occurrenceID, which links measurements taken on the same individual. Both occurrenceID and measurementID can be provided by the author using the parameter occurrences (as a column name or a vector) or measurements (as a column name or a vector).
#
# heteroptera_raw
#
# dataset included in package traitdataform
#
# Data publication: M. Gossner, Martin; K. Simons, Nadja; Hoeck, Leonhard; W.
# Weisser, Wolfgang (2016): Morphometric measures of Heteroptera sampled in
# grasslands across three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1
dataset2 <- as.traitdata(heteroptera_raw,
traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
"Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
"Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
"Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
"Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
"Hind.Femur_width", "Rostrum_length", "Rostrum_width",
"Wing_length", "Wing_width"),
taxa = "SpeciesID",
occurrences = "ID"
)
# show different trait measurements for same occurrence/individual
subset(dataset2, occurrenceID == "5" )
This allows the user to be explicit about the structure of the output data.
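The layout rule described earlier (unique taxa imply a species × trait matrix, repeated taxa imply an observation wide-table) can be sketched in base R. The helper name guess_layout() is hypothetical and only mirrors the described behaviour; the package's internal check may differ.

```r
# Hypothetical helper: guess the input layout from the taxon column
guess_layout <- function(data, taxa) {
  if (anyDuplicated(data[[taxa]]) == 0) {
    "species-trait matrix"      # every taxon occurs exactly once
  } else {
    "observation wide-table"    # repeated taxa imply several individuals
  }
}

# Toy inputs (invented for illustration)
matrix_like <- data.frame(species = c("sp_a", "sp_b"), len = c(1, 2))
obs_like    <- data.frame(species = c("sp_a", "sp_a"), len = c(1, 2))

guess_layout(matrix_like, "species")
guess_layout(obs_like, "species")
```

Because the guess rests on duplicates alone, a table of individuals that happens to contain one specimen per species would be taken for a species × trait matrix, which is one reason to state occurrences explicitly.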
specify units
For a standardisation of quantitative trait data, the unit of measurement is essential. Often, this information is kept in the metadata descriptions. But for a standardised table containing measurements from different sources, this information should always accompany the measurement value. The ETS suggests the term verbatimTraitUnit to contain the original author’s unit for each measurement in the data table.
The function as.traitdata() creates this column via its parameter units (see example above). This can be done for all traits in a single stroke (if all reported values refer to the same unit) or for each trait specifically (if they use different measurement units or if the table comprises a mixture of quantitative and qualitative traits).
Accordingly, the parameter units takes a single character string, or a vector of character strings, containing valid entries as expected by the package ‘units’ (Pebesma et al., 2016, https://github.com/r-quantities/units). Examples are ‘mm’, ‘m2’ or ‘m^2’, ‘m/s’.
keep additional information
The raw data might contain further information on the individuals or the trait measurement itself in further data columns that are valuable for later analysis. This can be for instance data about the sex or developmental stage of the individual, the sampling or preservation method of the specimen, or the conditions under which the measurement was taken.
The parameter keep allows you to specify which columns contain valuable information as a character vector. As a negative version of keep, specifying drop would allow you to name the columns that are not valuable, while all others will be kept. Not specifying keep or drop will result in dropping all columns except the core measurement and identifier columns.
dataset2 <- as.traitdata(heteroptera_raw,
traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
"Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
"Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
"Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
"Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
"Hind.Femur_width", "Rostrum_length", "Rostrum_width",
"Wing_length", "Wing_width"),
taxa = "SpeciesID",
occurrences = "ID",
keep = c("Sex")
)
#> Input is taken to be an occurrence table/an observation -- trait matrix
#> (i.e. with individual specimens per row and multiple trait measurements in columns).
#> If this is not the case, please provide parameters!
head(dataset2)
#> verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1 Acalypta nigrina Body_length 2.35 mm
#> 2 Acalypta nigrina Body_length 2.10 mm
#> 3 Acalypta nigrina Body_length 2.17 mm
#> 4 Acalypta nigrina Body_length 2.15 mm
#> 5 Acalypta parvula Body_length 1.84 mm
#> 6 Acalypta parvula Body_length 1.81 mm
#> measurementID occurrenceID Sex
#> 1 1 1 f
#> 2 2 2 f
#> 3 3 3 m
#> 4 4 4 m
#> 5 5 5 f
#> 6 6 6 f
#>
#> This trait-dataset contains 23 traits for 179 taxa ( 9386 measurements in total).
#>
#> heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
The three extensions of the ETS provide standard terms for this kind of information:
- The Taxon extension provides further terms for specifying the taxonomic resolution of the observation and to ensure the correct reference in case of synonyms and homonyms.
- The Measurement Or Fact extension provides terms to describe information at the level of single measurements or reported facts, such as the original literature reference for the reported value, the method of measurement or statistical method of aggregation. It provides important information that allows for the tracking of potential sources of noise or bias in measured data (e.g. variation in measurement method) or aggregated values (e.g. statistical method), as well as the source of reported facts (e.g. literature source or expert reference).
- The Occurrence extension contains vocabulary to describe information on the observation context of individual specimens, such as sex, life stage or age. This also includes the method of sampling and preservation, as well as the date and geographical location, which provide an important resource to analyze trait variation due to differences in space and time.
We highly recommend mapping the input columns into these standard terms by providing a named vector for keep that gives the target ETS terms as vector names.
dataset2 <- as.traitdata(heteroptera_raw,
traits = c("Body_length", "Body_width", "Body_height", "Thorax_length",
"Thorax_width", "Head_width", "Eye_width", "Antenna_Seg1",
"Antenna_Seg2", "Antenna_Seg3", "Antenna_Seg4", "Antenna_Seg5",
"Front.Tibia_length", "Mid.Tibia_length", "Hind.Tibia_length",
"Front.Femur_length", "Hind.Femur_length", "Front.Femur_width",
"Hind.Femur_width", "Rostrum_length", "Rostrum_width",
"Wing_length", "Wing_width"),
taxa = "SpeciesID",
occurrences = "ID",
units = "mm",
keep = c(order = "Order", family = "Family",
sex = "Sex", lifeStage = "Wing_development",
basisOfRecordDescription = "Source",
verbatimLocality = "Center_Sampling_region",
references = "Voucher_ID" )
)
#> Input is taken to be an occurrence table/an observation -- trait matrix
#> (i.e. with individual specimens per row and multiple trait measurements in columns).
#> If this is not the case, please provide parameters!
head(dataset2)
#> verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1 Acalypta nigrina Body_length 2.35 mm
#> 2 Acalypta nigrina Body_length 2.10 mm
#> 3 Acalypta nigrina Body_length 2.17 mm
#> 4 Acalypta nigrina Body_length 2.15 mm
#> 5 Acalypta parvula Body_length 1.84 mm
#> 6 Acalypta parvula Body_length 1.81 mm
#> measurementID occurrenceID order family
#> 1 1 1 Hemiptera Tingidae
#> 2 2 2 Hemiptera Tingidae
#> 3 3 3 Hemiptera Tingidae
#> 4 4 4 Hemiptera Tingidae
#> 5 5 5 Hemiptera Tingidae
#> 6 6 6 Hemiptera Tingidae
#> basisOfRecordDescription
#> 1 Zoological State Collection Munich
#> 2 Zoological State Collection Munich
#> 3 Zoological State Collection Munich
#> 4 Zoological State Collection Munich
#> 5 Zoological State Collection Munich
#> 6 Zoological State Collection Munich
#> references sex lifeStage verbatimLocality
#> 1 Treuchtlingen_leg.Seidenstücker_21.5.48 f m 48°57’ N 10°55’ E
#> 2 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 3 Gunzenshausen_Mfr._leg.Seidenstücker_15.4.58 m m 49°07’ N 10°45’ E
#> 4 Nürnberg_leg.Seidenstücker_9.7.44 m b 49°27’ N 11°05’ E
#> 5 Nürnberg_Seidenstücker_4.8.45 f m 49°27’ N 11°05’ E
#> 6 Erlangen_Seidenstücker_30.9.40 f b 49°36’ N 11°00’ E
#>
#> This trait-dataset contains 23 traits for 179 taxa ( 9386 measurements in total).
#>
#> heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
Note that an entry without a name in the named vector keeps its original column name. Note also that no checking for valid column names (as compared to the traitdata glossary) is performed at this stage. This is to ensure that the raw data table created by as.traitdata() can contain any columns that the author considers relevant. The keep parameter can also be used to rename columns into intuitive column names.
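The renaming behaviour of a named keep vector can be mimicked in base R as follows. This is a sketch of the described behaviour only: rename_kept() is a hypothetical helper, and the toy data frame is invented for the example.

```r
# Hypothetical helper mirroring the described behaviour of a named `keep` vector
rename_kept <- function(data, keep) {
  new_names <- names(keep)
  # a completely unnamed vector has NULL names
  if (is.null(new_names)) new_names <- rep("", length(keep))
  # entries without a name keep their original column name
  unnamed <- new_names == ""
  new_names[unnamed] <- keep[unnamed]
  out <- data[, keep, drop = FALSE]
  names(out) <- new_names
  out
}

# Toy raw data (invented for illustration)
raw <- data.frame(
  Sex    = c("f", "m"),
  Family = c("Tingidae", "Miridae"),
  Notes  = c("x", "y"),
  stringsAsFactors = FALSE
)

# "Sex" is renamed to the ETS term "sex"; "Family" keeps its name; "Notes" is dropped
kept <- rename_kept(raw, c(sex = "Sex", "Family"))
names(kept)
```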
derived trait-values
Many traits comprise compound measures of multiple traits, such as length-mass ratios or morphometric indices. Other traits must be refined in terms of factor levels, or reduced to binary trait values. Many of these tasks can be achieved on the matrix raw data using base functions like transform(), factor() or match(), or the mutate() function provided by the package ‘plyr’, before conversion into the long-table format.
However, if the data are converted to long-table format, these tasks may become tedious as they require splitting the data before the computation can be done. The function mutate.traitdata() performs these tasks (working as a wrapper to plyr::mutate()) while keeping an eye on the units.
dataset2 <- mutate.traitdata(dataset2,
Body_shape = Body_length/Body_width,
Body_volume = Body_length*Body_width*Body_height,
Wingload = Wing_length*Wing_width/Body_volume)
head(dataset2[dataset2$verbatimTraitName %in% c("Body_shape", "Body_volume", "Wingload"),])
#> verbatimScientificName verbatimTraitName verbatimTraitValue
#> 9387 Acalypta nigrina Body_shape 2.043478
#> 9388 Acalypta nigrina Body_shape 1.721311
#> 9389 Acalypta nigrina Body_shape 2.028037
#> 9390 Acalypta nigrina Body_shape 2.067308
#> 9391 Acalypta parvula Body_shape 2.243902
#> 9392 Acalypta parvula Body_shape 1.885417
#> verbatimTraitUnit measurementID occurrenceID order family
#> 9387 1 <NA> 1 Hemiptera Tingidae
#> 9388 1 <NA> 2 Hemiptera Tingidae
#> 9389 1 <NA> 3 Hemiptera Tingidae
#> 9390 1 <NA> 4 Hemiptera Tingidae
#> 9391 1 <NA> 5 Hemiptera Tingidae
#> 9392 1 <NA> 6 Hemiptera Tingidae
#> basisOfRecordDescription
#> 9387 Zoological State Collection Munich
#> 9388 Zoological State Collection Munich
#> 9389 Zoological State Collection Munich
#> 9390 Zoological State Collection Munich
#> 9391 Zoological State Collection Munich
#> 9392 Zoological State Collection Munich
#> references sex lifeStage
#> 9387 Treuchtlingen_leg.Seidenstücker_21.5.48 f m
#> 9388 Nürnberg_leg.Seidenstücker_9.7.44 f b
#> 9389 Gunzenshausen_Mfr._leg.Seidenstücker_15.4.58 m m
#> 9390 Nürnberg_leg.Seidenstücker_9.7.44 m b
#> 9391 Nürnberg_Seidenstücker_4.8.45 f m
#> 9392 Erlangen_Seidenstücker_30.9.40 f b
#> verbatimLocality
#> 9387 48°57’ N 10°55’ E
#> 9388 49°27’ N 11°05’ E
#> 9389 49°07’ N 10°45’ E
#> 9390 49°27’ N 11°05’ E
#> 9391 49°27’ N 11°05’ E
#> 9392 49°36’ N 11°00’ E
#>
#> This trait-dataset contains 26 traits for 179 taxa ( 9386 measurements in total).
#>
#> heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
Note that all existing traits remain untouched and additional trait measures will be added to the dataset, unless a definition replaces an already existing trait.
It is important to note that the mutate function works at the level of data resolution that is provided by the data, i.e. for occurrence data with multiple measurements on a single individual, the data columns are mutated per occurrenceID.
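What computing a derived trait per occurrenceID involves can be sketched in base R: the long-table is spread to one row per individual, the new trait is computed, and the result is appended as new long-table rows. The toy table below is invented for illustration, and the sketch ignores units and metadata, which mutate.traitdata() takes care of.

```r
# Toy long-table: two individuals with two traits each
long <- data.frame(
  occurrenceID = c(1, 1, 2, 2),
  traitName    = c("Body_length", "Body_width", "Body_length", "Body_width"),
  traitValue   = c(2.35, 1.15, 2.10, 1.22),
  stringsAsFactors = FALSE
)

# Spread to one row per individual so that traits of the same occurrence line up
wide <- reshape(long, idvar = "occurrenceID", timevar = "traitName",
                direction = "wide")
names(wide) <- sub("^traitValue\\.", "", names(wide))

# Compute the derived trait for each individual
wide$Body_shape <- wide$Body_length / wide$Body_width

# Append the derived values as additional long-table rows
derived <- data.frame(
  occurrenceID = wide$occurrenceID,
  traitName    = "Body_shape",
  traitValue   = wide$Body_shape,
  stringsAsFactors = FALSE
)
long <- rbind(long, derived)

long
```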
2. Standardize traits
The function as.traitdata() produced a tidy and correctly formatted version of your own trait data. We now turn to the challenging task of standardisation.
The field traitID is meant to contain a globally valid reference to a trait definition that applies to the measurement in question. Due to the heterogeneity of approaches, research questions and taxonomic focus in trait-based research, it is hard to come up with universal trait definitions that can be employed in each and every research context. The mode of measurement or the precise prescriptions of a sampling procedure have been formalized in published handbooks (e.g. Cornelissen et al., 2003; Perez-Harguindeguy et al., 2013; or for invertebrates, Moretti et al., 2017), but are of limited use in harmonising trait data that pre-date or ignore these standards. Thesauri, e.g. the TOP Thesaurus of plant traits (Garnier et al., 2017, employed by TRY) or Gramene.org, offer definitions of plant traits in a formal language. For soil invertebrates, the T-SITA thesaurus offers a set of traits relevant for this organism group (see Schneider et al., 2018 for a more detailed distinction of thesauri and ontologies). All in all, Uniform Resource Identifiers (URIs) that provide a stable reference to an unambiguous definition, and that can be referenced from the dataset, exist for only a few organism groups and trait methodologies.
Refer to trait definitions via URIs
Thus, the key information must be provided manually as a separate data object in R. However, traitdataform assists in creating your own reference list of traits, a so-called ‘thesaurus’, which is used to feed trait definitions, units or identifiers into the dataset.
The function to create an object of class ‘thesaurus’ is as.thesaurus(), which combines several objects created by as.trait(). The ETS provides a set of terms to describe trait concepts, which can be provided as input parameters to as.trait(). Using the as.trait() function allows assigning flexible trait definitions while ensuring compliance with the terms of the trait-data standard outlined above. It also allows building a library of trait definitions in which single traits can be reused in multiple projects.
as.trait("body_length",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "The known longest dimension of the physical structure of organisms",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length",
author = "Maggenti and Maggenti, 2005",
broaderTerm = c("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_dimension"),
narrowerTerm = c("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Female_body_length",
"http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Male_body_length")
)
#>
#> body_length :
#>
#> Defined as: The known longest dimension of the physical structure of
#> organisms
#>
#> Broader term: http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_dimension
#>
#> Narrower term: http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Female_body_length;
#> http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Male_body_length
#>
#> Value type: numeric
#> Expected unit: mm
#>
#> http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length
For example, if all of the traits reported in your dataset refer to a definition published under a publicly available identifier, the thesaurus could be created like this:
thesaurus1 <- as.thesaurus(
body_length = as.trait("body_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length"),
antenna_length = as.trait("antenna_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Antenna_length"),
metafemur_length = as.trait("femur_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
eyewidth_corr = as.trait("eye_diameter",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Eye_diameter")
)
Alternatively, a thesaurus can be created from a data.frame, which might be easier if only trait name and identifier are to be provided and more specific trait definitions are not to be stored in the R object.
thesaurus1 <- as.thesaurus(data.frame(
trait = c("body_length", "antenna_length", "metafemur_length", "eyewidth_corr"),
identifier = paste0("http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=",
c("Body_length", "Antenna_length", "Femur_length", "Eye_diameter")),
valueType = c("numeric"),
expectedUnit = "mm")
)
To transfer the user-provided traits and trait values into standardised values, the function standardize_traits() merges the data table with a reference table of trait definitions to produce values of a compliant format.
dataset1Std <- standardize_traits(dataset1, thesaurus1)
head(dataset1Std)
#> traitID traitName traitValue traitUnit verbatimScientificName
#> 1 2 body_length 15.846561 mm Abax_parallelepipedus
#> 2 2 body_length 2.670000 mm Acupalpus_meridianus
#> 3 2 body_length 5.873016 mm Agonum_ericeti
#> 4 2 body_length 5.090000 mm Agonum_fuliginosum
#> 5 2 body_length 4.880000 mm Agonum_gracile
#> 6 2 body_length 8.250000 mm Agonum_marginatum
#> verbatimTraitName verbatimTraitValue verbatimTraitUnit measurementID
#> 1 body_length 15.846561 mm 1
#> 2 body_length 2.670000 mm 2
#> 3 body_length 5.873016 mm 3
#> 4 body_length 5.090000 mm 4
#> 5 body_length 4.880000 mm 5
#> 6 body_length 8.250000 mm 6
#> measurementDeterminedBy measurementRemarks
#> 1 klink <NA>
#> 2 WOODCOCK <NA>
#> 3 klink <NA>
#> 4 ribera <NA>
#> 5 ribera deduced_from_genus
#> 6 ribera <NA>
#>
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#>
#> carabids : Carabid traits by Fons van der Plas .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
The output table now retains the originally provided trait measurements (in verbatimTraitName, verbatimTraitValue and verbatimTraitUnit) alongside their standardised counterparts (in traitName, traitValue and traitUnit), expressed in the target terms and units requested by the thesaurus.
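Conceptually, this step is a lookup of each verbatim trait name in the thesaurus, followed by a unit conversion. The base-R sketch below illustrates the idea on an invented toy table. It uses hand-written conversion factors as a stand-in for a proper unit conversion, so it should be read as an illustration under these assumptions and not as the implementation of standardize_traits().

```r
# Toy long-table with verbatim names and units (invented values)
dat <- data.frame(
  verbatimTraitName  = c("body_length", "eyewidth_corr"),
  verbatimTraitValue = c(1.58, 0.9),
  verbatimTraitUnit  = c("cm", "mm"),
  stringsAsFactors = FALSE
)

# Toy thesaurus as a plain lookup table
thes <- data.frame(
  trait        = c("body_length", "eyewidth_corr"),
  traitName    = c("body_length", "eye_diameter"),
  expectedUnit = c("mm", "mm"),
  stringsAsFactors = FALSE
)

# Hand-written conversion factors to millimetres (assumption for this sketch)
to_mm <- c(mm = 1, cm = 10, m = 1000)

# Look up each verbatim trait name in the thesaurus
idx <- match(dat$verbatimTraitName, thes$trait)

# Add standardised columns next to the verbatim ones
dat$traitName  <- thes$traitName[idx]
dat$traitUnit  <- thes$expectedUnit[idx]
dat$traitValue <- dat$verbatimTraitValue * unname(to_mm[dat$verbatimTraitUnit])

dat
```

Keeping the verbatim columns untouched means the original author's entries can always be recovered from the standardised table.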
refer to own trait definitions
If no published trait concept can be referenced, trait-datasets should be accompanied by a dataset-specific thesaurus. Ideally this is stored as an asset along with your trait dataset in the same data publication or in a separate publication. This can be a csv or txt file, or a website providing direct and stable links to each trait definition.
This reference file should contain at least the following fields for each trait concept:
- trait: a short descriptive name. Do not use spaces; instead use underscores or capital letters to separate words (e.g. ‘body_length’ or ‘bodyLength’).
- traitDescription: a detailed and unambiguous, human-readable definition.
- valueType: the expected kind of entries. Set it to ‘numeric’ for quantitative traits, ‘integer’ for counts or ordinal traits, ‘character’ for trait values that are provided as free text, ‘factor’ for traits that take one of a few non-ordinal levels, and ‘logical’ for binary/boolean entries (yes/no).
  - For numeric traits, the parameter expectedUnit should provide the expected unit for the trait. The R script will then try to convert trait values into this unit.
  - For categorical traits of kind ‘factor’ or ‘integer’, the field factorLevels should contain a list of the valid factor levels, separated by semicolons. For ordinal traits, the order of the levels must correspond precisely to the possible integer values.
- comments: may contain examples and clarifications.
- identifier (optional): may specify an alphanumeric ID for specific use in your dataset, but this function is also covered by defining unambiguous trait labels in the field trait, which recur in the field verbatimTraitName of the main dataset.
Refer to the ETS set of terms to describe trait concepts for further definitions of these terms, as well as the best practice guidelines for trait-data publications.
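Before publishing such a reference file, it can be worth running a few consistency checks on it. The following base-R sketch tests a thesaurus table for duplicated names or identifiers, spaces in trait names and unknown value types. The helper check_thesaurus() and the toy table are hypothetical and not part of the package.

```r
# Toy thesaurus table (entries invented; note the duplicated identifier "t2")
thes <- data.frame(
  trait      = c("Body_length", "Body_width", "Body_height"),
  identifier = c("t1", "t2", "t2"),
  valueType  = c("numeric", "numeric", "numeric"),
  stringsAsFactors = FALSE
)

# Value types listed in the field description above
allowed_types <- c("numeric", "integer", "character", "factor", "logical")

# Hypothetical helper: collect all problems instead of stopping at the first one
check_thesaurus <- function(x) {
  problems <- character(0)
  if (anyDuplicated(x$trait) > 0)
    problems <- c(problems, "duplicated trait names")
  if (anyDuplicated(x$identifier) > 0)
    problems <- c(problems, "duplicated identifiers")
  if (any(grepl("[[:space:]]", x$trait)))
    problems <- c(problems, "trait names contain spaces")
  if (!all(x$valueType %in% allowed_types))
    problems <- c(problems, "unknown valueType")
  problems
}

check_thesaurus(thes)
```

An empty result means none of the four checks failed; it does not confirm that the definitions themselves are unambiguous.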
# M. Gossner, Martin; K. Simons, Nadja; Hoeck, Leonhard; W. Weisser, Wolfgang
# (2016): Morphometric measures of Heteroptera sampled in grasslands across
# three regions of Germany. figshare.
# https://doi.org/10.6084/m9.figshare.c.3307611.v1
# following the definitions in data publication
# http://www.esapubs.org/archive/ecol/E096/102/metadata.php
thesaurus2 <- as.thesaurus(
Body_length = as.trait("Body_length", identifier = "t1",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "From the tip of the head to the end of the abdomen"),
Body_width = as.trait("Body_width", identifier = "t2",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Widest part of the body"),
Body_height = as.trait("Body_height",identifier = "t3",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Thickest part of the body"),
Thorax_length = as.trait("Thorax_length", identifier = "t4",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Longest part of the pronotum"),
Thorax_width = as.trait("Thorax_width", identifier = "t5",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Widest part of the pronotum"),
Head_width = as.trait("Head_width", identifier = "t6",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Widest part of the head including eyes"),
Eye_width = as.trait("Eye_width", identifier = "t7",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Widest part of the left eye"),
Antenna_Seg1 = as.trait("Antenna_Seg1", identifier = "t8",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of first antenna segment",
broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
Antenna_Seg2 = as.trait("Antenna_Seg2", identifier = "t9",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of second antenna segment",
broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
Antenna_Seg3 = as.trait("Antenna_Seg3", identifier = "t10",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of third antenna segment",
broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
Antenna_Seg4 = as.trait("Antenna_Seg4", identifier = "t11",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of fourth antenna segment",
broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
Antenna_Seg5 = as.trait("Antenna_Seg5", identifier = "t12",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of fifth antenna segment (only Pentatomoidea)",
broaderTerm = "http://ecologicaltraitdata.github.io/TraitDataList/Antenna_length"),
Front.Tibia_length = as.trait("Front.Tibia_length", identifier = "t13",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of the tibia of the foreleg"),
Mid.Tibia_length = as.trait("Mid.Tibia_length", identifier = "t14",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of the tibia of the mid leg",
broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Tibia_length"),
Hind.Tibia_length = as.trait("Hind.Tibia_length", identifier = "t15",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of the tibia of the hind leg",
broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Tibia_length"),
Front.Femur_length = as.trait("Front.Femur_length", identifier = "t16",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of the femur of the foreleg",
broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
Hind.Femur_length = as.trait("Hind.Femur_length", identifier = "t17",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of the femur of the hind leg",
broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
Front.Femur_width = as.trait("Front.Femur_width", identifier = "t18",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Width of the femur of the foreleg"),
Hind.Femur_width = as.trait("Hind.Femur_width", identifier = "t19",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Width of the femur of the hind leg"),
Rostrum_length = as.trait("Rostrum_length", identifier = "t20",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Length of the rostrum including all segments"),
Rostrum_width = as.trait("Rostrum_width", identifier = "t21",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Widest part of the rostrum"),
Wing_length = as.trait("Wing_length", identifier = "t22",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Longest part of the forewing",
broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Wing"),
Wing_width = as.trait("Wing_width", identifier = "t23",
expectedUnit = "mm", valueType = "numeric",
traitDescription = "Widest part of the forewing",
broaderTerm = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Wing")
)
Applying standardize_traits() will refer to this dataset-specific thesaurus and append it as an attribute to the R object.
dataset2Std <- standardize_traits(dataset2, thesaurus2)
subset(dataset2Std, occurrenceID == 2)
#> traitID traitName traitValue traitUnit verbatimScientificName
#> 185 1 Body_length 2.10 mm Acalypta nigrina
#> 610 2 Body_width 1.22 mm Acalypta nigrina
#> 871 3 Body_height 0.67 mm Acalypta nigrina
#> 1337 4 Thorax_length 0.24 mm Acalypta nigrina
#> 1721 5 Thorax_width 0.95 mm Acalypta nigrina
#> 2309 6 Head_width 0.45 mm Acalypta nigrina
#> 2734 7 Eye_width 0.13 mm Acalypta nigrina
#> 3364 8 Antenna_Seg1 0.12 mm Acalypta nigrina
#> 3422 9 Antenna_Seg2 0.04 mm Acalypta nigrina
#> 3888 10 Antenna_Seg3 0.44 mm Acalypta nigrina
#> 4271 11 Antenna_Seg4 0.15 mm Acalypta nigrina
#> 4895 13 Front.Tibia_length 0.40 mm Acalypta nigrina
#> 5524 14 Mid.Tibia_length 0.42 mm Acalypta nigrina
#> 5563 15 Hind.Tibia_length 0.56 mm Acalypta nigrina
#> 6048 16 Front.Femur_length 0.48 mm Acalypta nigrina
#> 6412 17 Hind.Femur_length 0.56 mm Acalypta nigrina
#> 7021 18 Front.Femur_width 0.12 mm Acalypta nigrina
#> 7569 19 Hind.Femur_width 0.10 mm Acalypta nigrina
#> 7870 20 Rostrum_length 0.92 mm Acalypta nigrina
#> 8118 21 Rostrum_width 0.08 mm Acalypta nigrina
#> 8598 22 Wing_length 1.66 mm Acalypta nigrina
#> 8968 23 Wing_width 0.57 mm Acalypta nigrina
#> verbatimTraitName verbatimTraitValue verbatimTraitUnit measurementID
#> 185 Body_length 2.10 mm 2
#> 610 Body_width 1.22 mm 427
#> 871 Body_height 0.67 mm 852
#> 1337 Thorax_length 0.24 mm 1277
#> 1721 Thorax_width 0.95 mm 1702
#> 2309 Head_width 0.45 mm 2127
#> 2734 Eye_width 0.13 mm 2552
#> 3364 Antenna_Seg1 0.12 mm 2977
#> 3422 Antenna_Seg2 0.04 mm 3402
#> 3888 Antenna_Seg3 0.44 mm 3827
#> 4271 Antenna_Seg4 0.15 mm 4252
#> 4895 Front.Tibia_length 0.40 mm 4713
#> 5524 Mid.Tibia_length 0.42 mm 5138
#> 5563 Hind.Tibia_length 0.56 mm 5563
#> 6048 Front.Femur_length 0.48 mm 5988
#> 6412 Hind.Femur_length 0.56 mm 6413
#> 7021 Front.Femur_width 0.12 mm 6838
#> 7569 Hind.Femur_width 0.10 mm 7263
#> 7870 Rostrum_length 0.92 mm 7688
#> 8118 Rostrum_width 0.08 mm 8113
#> 8598 Wing_length 1.66 mm 8538
#> 8968 Wing_width 0.57 mm 8963
#> occurrenceID order family basisOfRecordDescription
#> 185 2 Hemiptera Tingidae Zoological State Collection Munich
#> 610 2 Hemiptera Tingidae Zoological State Collection Munich
#> 871 2 Hemiptera Tingidae Zoological State Collection Munich
#> 1337 2 Hemiptera Tingidae Zoological State Collection Munich
#> 1721 2 Hemiptera Tingidae Zoological State Collection Munich
#> 2309 2 Hemiptera Tingidae Zoological State Collection Munich
#> 2734 2 Hemiptera Tingidae Zoological State Collection Munich
#> 3364 2 Hemiptera Tingidae Zoological State Collection Munich
#> 3422 2 Hemiptera Tingidae Zoological State Collection Munich
#> 3888 2 Hemiptera Tingidae Zoological State Collection Munich
#> 4271 2 Hemiptera Tingidae Zoological State Collection Munich
#> 4895 2 Hemiptera Tingidae Zoological State Collection Munich
#> 5524 2 Hemiptera Tingidae Zoological State Collection Munich
#> 5563 2 Hemiptera Tingidae Zoological State Collection Munich
#> 6048 2 Hemiptera Tingidae Zoological State Collection Munich
#> 6412 2 Hemiptera Tingidae Zoological State Collection Munich
#> 7021 2 Hemiptera Tingidae Zoological State Collection Munich
#> 7569 2 Hemiptera Tingidae Zoological State Collection Munich
#> 7870 2 Hemiptera Tingidae Zoological State Collection Munich
#> 8118 2 Hemiptera Tingidae Zoological State Collection Munich
#> 8598 2 Hemiptera Tingidae Zoological State Collection Munich
#> 8968 2 Hemiptera Tingidae Zoological State Collection Munich
#> references sex lifeStage verbatimLocality
#> 185 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 610 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 871 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 1337 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 1721 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 2309 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 2734 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 3364 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 3422 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 3888 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 4271 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 4895 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 5524 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 5563 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 6048 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 6412 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 7021 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 7569 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 7870 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 8118 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 8598 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#> 8968 Nürnberg_leg.Seidenstücker_9.7.44 f b 49°27’ N 11°05’ E
#>
#> This trait-dataset contains 26 traits for 179 taxa ( 9386 measurements in total).
#> NULL
attributes(dataset2Std)$traits[,c("trait", "identifier","traitDescription","expectedUnit")]
#> trait identifier
#> Body_length Body_length 1
#> Body_width Body_width 2
#> Body_height Body_height 3
#> Thorax_length Thorax_length 4
#> Thorax_width Thorax_width 5
#> Head_width Head_width 6
#> Eye_width Eye_width 7
#> Antenna_Seg1 Antenna_Seg1 8
#> Antenna_Seg2 Antenna_Seg2 9
#> Antenna_Seg3 Antenna_Seg3 10
#> Antenna_Seg4 Antenna_Seg4 11
#> Antenna_Seg5 Antenna_Seg5 12
#> Front.Tibia_length Front.Tibia_length 13
#> Mid.Tibia_length Mid.Tibia_length 14
#> Hind.Tibia_length Hind.Tibia_length 15
#> Front.Femur_length Front.Femur_length 16
#> Hind.Femur_length Hind.Femur_length 17
#> Front.Femur_width Front.Femur_width 18
#> Hind.Femur_width Hind.Femur_width 19
#> Rostrum_length Rostrum_length 20
#> Rostrum_width Rostrum_width 21
#> Wing_length Wing_length 22
#> Wing_width Wing_width 23
#> traitDescription
#> Body_length From the tip of the head to the end of the abdomen
#> Body_width Widest part of the body
#> Body_height Thickest part of the body
#> Thorax_length Longest part of the pronotum
#> Thorax_width Widest part of the pronotum
#> Head_width Widest part of the head including eyes
#> Eye_width Widest part of the left eye
#> Antenna_Seg1 Length of first antenna segment
#> Antenna_Seg2 Length of second antenna segment
#> Antenna_Seg3 Length of third antenna segment
#> Antenna_Seg4 Length of fourth antenna segment
#> Antenna_Seg5 Length of fifth antenna segment (only Pentatomoidea)
#> Front.Tibia_length Length of the tibia of the foreleg
#> Mid.Tibia_length Length of the tibia of the mid leg
#> Hind.Tibia_length Length of the tibia of the hind leg
#> Front.Femur_length Length of the femur of the foreleg
#> Hind.Femur_length Length of the femur of the hind leg
#> Front.Femur_width Width of the femur of the foreleg
#> Hind.Femur_width Width of the femur of the hind leg
#> Rostrum_length Length of the rostrum including all segments
#> Rostrum_width Widest part of the rostrum
#> Wing_length Longest part of the forewing
#> Wing_width Widest part of the forewing
#> expectedUnit
#> Body_length mm
#> Body_width mm
#> Body_height mm
#> Thorax_length mm
#> Thorax_width mm
#> Head_width mm
#> Eye_width mm
#> Antenna_Seg1 mm
#> Antenna_Seg2 mm
#> Antenna_Seg3 mm
#> Antenna_Seg4 mm
#> Antenna_Seg5 mm
#> Front.Tibia_length mm
#> Mid.Tibia_length mm
#> Hind.Tibia_length mm
#> Front.Femur_length mm
#> Hind.Femur_length mm
#> Front.Femur_width mm
#> Hind.Femur_width mm
#> Rostrum_length mm
#> Rostrum_width mm
#> Wing_length mm
#> Wing_width mm
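Since each identifier is meant to point to exactly one trait, it can be worth checking a lookup table of this kind for duplicated identifiers. The following is a minimal sketch in base R; the small table is a hypothetical stand-in for attributes(dataset2Std)$traits and is not taken from the dataset.

```r
# Hypothetical lookup table with the same kind of columns as
# attributes(x)$traits (the values are made up for illustration)
traits <- data.frame(
  trait      = c("Front.Femur_width", "Hind.Femur_width", "Rostrum_length"),
  identifier = c("t18", "t18", "t20"),
  stringsAsFactors = FALSE
)

# Identifiers that occur more than once
dup_ids <- unique(traits$identifier[duplicated(traits$identifier)])

# All traits that share one of those identifiers
clashes <- traits[traits$identifier %in% dup_ids, ]
print(clashes)
```

An empty result means that every trait has its own identifier.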
3. Standardize taxa
For taxon name standardisation, the function standardize_taxa() makes use of fuzzy matching algorithms provided by the package ‘taxize’ by Scott Chamberlain to match the entries of column verbatimScientificName against the GBIF Backbone Taxonomy. The result is written into a new column scientificName. Additional columns comprise the order (for ambiguous names), the reported taxon rank, as well as a globally unique taxon ID which references the taxon to GBIF Backbone Taxonomy in a universal URI format.
If further layers of taxonomic information are desired as an output, the function takes the parameter return, which by default contains c("taxonID", "scientificName", "order", "taxonRank"). Other specifications can be added here.
Note that for this to work, verbatimScientificName must contain the full name of the species or higher taxon, without abbreviations (both spaces and underscores are handled). Note also that taxon name mapping requires an internet connection and might take some time, depending on the length of your species list.
dataset1Std <- standardize_taxa(dataset1)
head(dataset1Std)
#> scientificName verbatimScientificName verbatimTraitName
#> 1 Abax parallelepipedus Abax_parallelepipedus body_length
#> 2 Abax parallelepipedus Abax_parallelepipedus antenna_length
#> 3 Abax parallelepipedus Abax_parallelepipedus eyewidth_corr
#> 4 Abax parallelepipedus Abax_parallelepipedus metafemur_length
#> 5 Acupalpus meridianus Acupalpus_meridianus antenna_length
#> 6 Acupalpus meridianus Acupalpus_meridianus metafemur_length
#> verbatimTraitValue verbatimTraitUnit taxonID
#> 1 15.846561 mm http://www.gbif.org/species/5754772
#> 2 8.518519 mm http://www.gbif.org/species/5754772
#> 3 0.481250 mm http://www.gbif.org/species/5754772
#> 4 5.608466 mm http://www.gbif.org/species/5754772
#> 5 0.700000 mm http://www.gbif.org/species/1037633
#> 6 0.750000 mm http://www.gbif.org/species/1037633
#> measurementID warnings taxonRank kingdom phylum class order
#> 1 1 species Animalia Arthropoda Insecta Coleoptera
#> 2 121 species Animalia Arthropoda Insecta Coleoptera
#> 3 361 species Animalia Arthropoda Insecta Coleoptera
#> 4 241 species Animalia Arthropoda Insecta Coleoptera
#> 5 122 species Animalia Arthropoda Insecta Coleoptera
#> 6 242 species Animalia Arthropoda Insecta Coleoptera
#> family measurementDeterminedBy measurementRemarks
#> 1 Carabidae klink <NA>
#> 2 Carabidae klink <NA>
#> 3 Carabidae klink <NA>
#> 4 Carabidae klink <NA>
#> 5 Carabidae WOODCOCK <NA>
#> 6 Carabidae WOODCOCK <NA>
#>
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#>
#> carabids : Carabid traits by Fons van der Plas .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
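Because standardize_taxa() expects full names and needs an internet connection, a quick offline check of the name column can help to spot problematic entries beforehand. The following is a minimal sketch in base R and not part of traitdataform; the name vector and the regular expressions are illustrative assumptions.

```r
# Hypothetical names; in practice this would be the column
# verbatimScientificName of a trait-dataset
verbatim_names <- c("Abax_parallelepipedus", "A. meridianus",
                    "Agonum ericeti", "Carabus sp.")

# Underscores are handled by the matching, but replacing them
# makes the following checks simpler
clean_names <- trimws(gsub("_", " ", verbatim_names, fixed = TRUE))

# Flag abbreviated genus names ("A. meridianus") and
# open nomenclature such as "sp." or "cf."
abbreviated <- grepl("^[A-Z]\\. ", clean_names) |
  grepl(" (sp|spp|cf|aff)\\.?( |$)", clean_names)

print(data.frame(verbatim_names, clean_names, abbreviated))
```

Flagged entries would need to be spelled out by hand before the name mapping is run.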
Single-stroke standardization
The functions standardize_traits() and standardize_taxa() are applied sequentially, in either order. The output of the first step can be piped into the second step.
To make things even simpler, the functions for format conversion and standardization come with a wrapper function standardize(). A single call runs all steps at once, provided that all necessary parameters for the intermediate steps are supplied. It takes all the optional parameters described above.
dataset1Std <- standardize(carabids,
thesaurus = thesaurus1,
taxa = "name_correct",
units = "mm",
keep = c(measurementDeterminedBy = "source_measurement")
)
As an alternative input pathway, all parameters to standardize() can be specified as attributes of the input object and will be picked up automatically by the function. This allows for the specification of recipes for data integration for projects pulling data from multiple sources.
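The general idea of reading parameters from attributes can be sketched in a few lines of base R. This is an illustration of the mechanism only, not the package's internal code; get_param() and describe_recipe() are hypothetical helpers.

```r
# Hypothetical helper: use the explicit argument if given,
# otherwise fall back to the attribute of the same name
get_param <- function(x, name, value = NULL) {
  if (!is.null(value)) value else attr(x, name, exact = TRUE)
}

# Toy wrapper that resolves its parameters in this way
describe_recipe <- function(x, taxa = NULL, units = NULL) {
  list(taxa  = get_param(x, "taxa", taxa),
       units = get_param(x, "units", units))
}

raw <- data.frame(name_correct = "Abax_parallelepipedus", body_length = 15.8)
attr(raw, "taxa")  <- "name_correct"
attr(raw, "units") <- "mm"

recipe   <- describe_recipe(raw)                # parameters read from attributes
override <- describe_recipe(raw, units = "cm")  # explicit argument takes precedence
```

In this sketch an explicit argument overrides the attribute; whether standardize() resolves conflicts in the same way is not stated here and should be checked in its documentation.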
4. Working with trait-datasets
combine multiple traitdata tables
After standardizing trait and taxon concepts into unified definitions and converting trait values into harmonized units, it is straightforward to combine multiple trait-datasets into one using rbind(). This can be applied before or after the standardisation process, depending on the use case. Use cases of merging data are:
- You collected data from different sources and want to harmonize taxon and trait names: bring the data into long-table format and merge them into one data object, then harmonize taxa and units following a uniform standard.
- No unified trait list or taxon reference exists for the heterogeneous data assembled from different sources (e.g. because they span many different taxa): apply standardization to different reference systems before merging the datasets.
The function call will append the data tables while merging the common columns and maintaining columns that are not present in all datasets (this might produce lots of NA). The column datasetID will be added to keep track of the origin of the data. By default this column will contain the object names of the original datasets, but it can be replaced by more meaningful IDs using the parameter datasetID.
Note that the package provides a method for the base function rbind() that handles this merge. Documentation can be accessed via ?rbind.traitdata.
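To make the merging behaviour more concrete, the following base R sketch appends two tables with partly different columns, pads the gaps with NA and records the origin of each row. It is a simplified illustration and not the code of rbind.traitdata; bind_fill() and the two toy tables are hypothetical.

```r
# Hypothetical helper: append data frames, padding missing columns with NA
bind_fill <- function(..., datasetID = NULL) {
  tables <- list(...)
  if (is.null(datasetID)) datasetID <- paste0("dataset", seq_along(tables))
  all_cols <- unique(unlist(lapply(tables, names)))
  padded <- Map(function(tab, id) {
    for (col in setdiff(all_cols, names(tab))) tab[[col]] <- NA  # add absent columns
    tab <- tab[all_cols]     # same column order in every table
    tab$datasetID <- id      # remember where each row came from
    tab
  }, tables, datasetID)
  do.call(rbind, padded)
}

a <- data.frame(verbatimScientificName = "Abax_parallelepipedus",
                verbatimTraitValue = 8.52, measurementRemark = "none",
                stringsAsFactors = FALSE)
b <- data.frame(verbatimScientificName = "Acalypta nigrina",
                verbatimTraitValue = 2.10, sex = "f",
                stringsAsFactors = FALSE)

merged <- bind_fill(a, b, datasetID = c("vanderplas17", "gossner15"))
print(merged)
```

The columns measurementRemark and sex each exist in only one input table, so the merged table contains NA where the other dataset has no entry.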
maintaining metadata
The function will handle metadata information on the dataset level as described in the section ‘Metadata’ of the Ecological Trait-data Standard (e.g. author or bibliographicCitation) and add a column datasetID as well as datasetName and author if those are provided in the parameter metadata of the as.traitdata() function call which creates the data. The function as.metadata() provides a standard structure for this case.
metadata1 <- as.metadata(
datasetName = "Carabid traits",
datasetID = "carabids",
bibliographicCitation = bibentry(
bibtype = "Article",
title = "Sensitivity of functional diversity metrics to sampling intensity",
journal = "Methods in Ecology and Evolution",
author = c(as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
),
year = 2017,
doi = "10.1111/2041-210x.12728"
),
author = "Fons van der Plas",
license = "http://creativecommons.org/publicdomain/zero/1.0/"
)
dataset1 <- as.traitdata(carabids,
taxa = "name_correct",
thesaurus = thesaurus1,
units = "mm",
keep = c(datasetID = "source_measurement", measurementRemark = "note"),
metadata = metadata1
)
#> Input is taken to be a species -- trait matrix. If this is not the case, please provide parameters!
head(dataset1)
#> verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1 Abax_parallelepipedus antenna_length 8.518519 mm
#> 2 Acupalpus_meridianus antenna_length 0.700000 mm
#> 3 Agonum_ericeti antenna_length 3.743386 mm
#> 4 Agonum_fuliginosum antenna_length 3.500000 mm
#> 5 Agonum_gracile antenna_length 3.220000 mm
#> 6 Agonum_marginatum antenna_length 5.030000 mm
#> measurementID datasetID measurementRemark
#> 1 1 klink <NA>
#> 2 2 WOODCOCK <NA>
#> 3 3 klink <NA>
#> 4 4 ribera <NA>
#> 5 5 ribera deduced_from_genus
#> 6 6 ribera <NA>
#>
#> This trait-dataset contains 4 traits for 120 taxa ( 480 measurements in total).
#>
#> carabids : Carabid traits by Fons van der Plas .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
Note the use of the bibentry() function to create a formal bibliographic entry. Note also that this affects how the dataset is printed to the R console. This makes it easier for data users to acknowledge authorship and ownership of the data, while also providing a machine-readable structure that can easily be accessed further down the line.
metadata2 <- as.metadata(
datasetName = "Heteroptera morphometry traits",
datasetID = "heteroptera",
bibliographicCitation = bibentry(
bibtype = "Article",
title = "Morphometric measures of Heteroptera sampled in grasslands across three regions of Germany",
journal = "Ecology",
volume = 96,
issue = 4,
pages = 1154,
author = c(as.person("Martin M. Gossner , Nadja K. Simons, Leonhard Hoeck, Wolfgang W. Weisser")),
year = 2015,
doi = "10.1890/14-2159.1"
),
author = "Martin M. Gossner",
license = "http://creativecommons.org/publicdomain/zero/1.0/"
)
dataset2 <- as.traitdata(heteroptera_raw,
taxa = "SpeciesID",
thesaurus = thesaurus2,
units = "mm",
keep = c(sex = "Sex", references = "Source", lifestage = "Wing_development"),
metadata = metadata2
)
#> Input is taken to be an occurrence table/an observation -- trait matrix
#> (i.e. with individual specimens per row and multiple trait measurements in columns).
#> If this is not the case, please provide parameters!
database <- rbind(dataset1, dataset2,
datasetID = c("vanderplas17", "gossner15"),
metadata_as_columns = TRUE
)
head(database)
#> verbatimScientificName verbatimTraitName verbatimTraitValue verbatimTraitUnit
#> 1 Abax_parallelepipedus antenna_length 8.518519 mm
#> 2 Acupalpus_meridianus antenna_length 0.700000 mm
#> 3 Agonum_ericeti antenna_length 3.743386 mm
#> 4 Agonum_fuliginosum antenna_length 3.500000 mm
#> 5 Agonum_gracile antenna_length 3.220000 mm
#> 6 Agonum_marginatum antenna_length 5.030000 mm
#> measurementID occurrenceID references sex datasetID datasetName
#> 1 1 <NA> <NA> <NA> carabids Carabid traits
#> 2 2 <NA> <NA> <NA> carabids Carabid traits
#> 3 3 <NA> <NA> <NA> carabids Carabid traits
#> 4 4 <NA> <NA> <NA> carabids Carabid traits
#> 5 5 <NA> <NA> <NA> carabids Carabid traits
#> 6 6 <NA> <NA> <NA> carabids Carabid traits
#> author license
#> 1 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 2 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 3 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 4 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 5 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> 6 Fons van der Plas http://creativecommons.org/publicdomain/zero/1.0/
#> measurementRemark lifestage
#> 1 <NA> <NA>
#> 2 <NA> <NA>
#> 3 <NA> <NA>
#> 4 <NA> <NA>
#> 5 deduced_from_genus <NA>
#> 6 <NA> <NA>
#>
#> This trait-dataset contains 27 traits for 299 taxa ( 9386 measurements in total).
#> $carabids
#>
#> carabids : Carabid traits by Fons van der Plas .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
#>
#>
#> $heteroptera
#>
#> heteroptera : Heteroptera morphometry traits by Martin M. Gossner .
#>
#> When using these data, you must acknowledge the following usage policies:
#>
#> Cite this trait dataset as:
#> Gossner MM, Simons NK, Hoeck L, Weisser WW (2015). "Morphometric
#> measures of Heteroptera sampled in grasslands across three regions of
#> Germany." _Ecology_, *96*, 1154. doi:10.1890/14-2159.1
#> <https://doi.org/10.1890/14-2159.1>.
#>
#> Published under: http://creativecommons.org/publicdomain/zero/1.0/
#>
#> This dataset conforms to: [1] "Ecological Trait-data Standard (ETS) v0.10"
The detailed metadata information of both datasets (e.g. license and bibliographic citation) will be stored in the attributes of the dataset and displayed when calling it in the R console. You can access the metadata via the attributes() function, for example:
attributes(dataset1)$metadata$bibliographicCitation
#> van der Plas F, van Klink R, Manning P, Olff H, Fischer M (2017).
#> "Sensitivity of functional diversity metrics to sampling intensity."
#> _Methods in Ecology and Evolution_. doi:10.1111/2041-210x.12728
#> <https://doi.org/10.1111/2041-210x.12728>.
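Because the citation is a bibentry object, it can be queried field by field or converted into other formats. The sketch below rebuilds the citation with utils::bibentry() so that it runs on its own; with a real trait-dataset the object would come from attributes(dataset1)$metadata$bibliographicCitation instead.

```r
library(utils)

# Stand-alone copy of the citation used above
cit <- bibentry(
  bibtype = "Article",
  title   = "Sensitivity of functional diversity metrics to sampling intensity",
  journal = "Methods in Ecology and Evolution",
  author  = c(person("Fons", "van der Plas"), person("Roel", "van Klink"),
              person("Pete", "Manning"), person("Han", "Olff"),
              person("Markus", "Fischer")),
  year    = 2017,
  doi     = "10.1111/2041-210x.12728"
)

cit$doi                               # access a single field
print(format(cit, style = "text"))    # human-readable reference
print(toBibtex(cit))                  # BibTeX entry for reference managers
```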
writing data recipes
For projects compiling data from multiple sources, it is recommended best practice to refer to the original raw data, potentially even by pulling them from their original repository, and to script all changes and standardisation procedures in R. If many field-based changes are necessary, you can refer to lookup tables to keep the script slim.
traitdataform allows you to script all parameters required for the standardization call into the attributes of the R object. A script for a single data source can then look like this:
carabids <- utils::read.delim(url("https://datadryad.org/stash/downloads/file_stream/24267",
encoding = "UTF-8")
)
attr(carabids, 'metadata') <- traitdataform::as.metadata(
datasetName = "Carabid traits",
datasetID = "carabids",
bibliographicCitation = utils::bibentry(
bibtype = "Article",
title = "Sensitivity of functional diversity metrics to sampling intensity",
journal = "Methods in Ecology and Evolution",
author = c(utils::as.person("Fons van der Plas, Roel van Klink, Pete Manning, Han Olff, Markus Fischer")
),
year = 2017,
doi = "10.1111/2041-210x.12728"
),
author = "Fons van der Plas",
license = "http://creativecommons.org/publicdomain/zero/1.0/"
)
attr(carabids, 'thesaurus') <- traitdataform::as.thesaurus(
body_length = traitdataform::as.trait("body_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Body_length"),
antenna_length = traitdataform::as.trait("antenna_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Antenna_length"),
metafemur_length = traitdataform::as.trait("femur_length",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Femur_length"),
eyewidth_corr = traitdataform::as.trait("eye_diameter",
expectedUnit = "mm", valueType = "numeric",
identifier = "http://t-sita.cesab.org/BETSI_vizInfo.jsp?trait=Eye_diameter")
)
attr(carabids, 'taxa') <- "name_correct"
attr(carabids, 'units') <- "mm"
attr(carabids, 'keep') <- c(measurementDeterminedBy = "source_measurement", measurementRemarks = "note")
When thus specified, the data can be re-formatted simply by calling standardize(carabids).
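The lookup-table approach for field-based changes mentioned above could look like the following base R sketch. The table contents are hypothetical; in a real recipe the lookup table would typically be read from a csv file kept alongside the script.

```r
# Hypothetical lookup table mapping verbatim entries to harmonized ones
lookup <- data.frame(
  verbatim   = c("klink", "WOODCOCK", "ribera"),
  harmonized = c("van Klink", "Woodcock", "Ribera"),
  stringsAsFactors = FALSE
)

# Toy raw data with the field to be harmonized
raw <- data.frame(
  name_correct = c("Abax_parallelepipedus", "Acupalpus_meridianus",
                   "Agonum_gracile"),
  source_measurement = c("klink", "WOODCOCK", "ribera"),
  stringsAsFactors = FALSE
)

# Replace entries found in the lookup table, keep all others unchanged
idx <- match(raw$source_measurement, lookup$verbatim)
raw$source_measurement <- ifelse(is.na(idx),
                                 raw$source_measurement,
                                 lookup$harmonized[idx])
```

Keeping the replacements in a table rather than in the script means that corrections can be reviewed and extended without touching the code.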
5. Writing data
The final step in converting trait data into a standardised format before uploading the file to a public file hosting service is saving them in a file format that is internationalized, portable and long-term accessible. Internationalization refers to the file encoding (‘UTF-8’ should be used; ‘ASCII’ is possible for data with no special characters) as well as the use of decimal delimiters (‘.’ is highly recommended) and internationally accepted formatting standards for values such as dates (the international norm for date entries is ISO 8601, i.e. “YYYY-MM-DD”). Portability means that the file can be opened on all operating systems (the ‘end of line’ character is particularly important here) and does not rely on proprietary software (like MS Excel or database tools). Long-term accessibility is ensured by choosing a text-based file format (txt, csv or tsv) and by packaging the primary data with all necessary metadata.
The base R function write.table() gives full control over these parameters and should be used to export trait-data.
write.table(dataset1Std, file = "carabids_std.csv",
sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE)
Along with these primary data, you should make any ancillary data tables available, e.g. the metadata in a human-readable form, as well as the lookup tables of traits and taxa:
capture.output(attributes(dataset1Std)$metadata, file = "metadata.txt")
write.table(attributes(dataset1Std)$traits, file = "traits.csv",
sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE)
write.table(attributes(dataset1Std)$taxonomy, file = "taxa.csv",
sep = ",", dec = ".", quote = TRUE, eol = "\n", row.names = FALSE)
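The recommendations on encoding and date formats can be tried out on a small example. The sketch below uses a hypothetical table with a made-up date column, writes it as UTF-8 with ISO 8601 dates into a temporary directory, and reads it back.

```r
# Hypothetical table with a non-ASCII value and a locally formatted date
dat <- data.frame(
  references = "N\u00fcrnberg",
  traitValue = 2.1,
  eventDate  = "09.07.2015",
  stringsAsFactors = FALSE
)

# Convert the date into ISO 8601 (YYYY-MM-DD)
dat$eventDate <- format(as.Date(dat$eventDate, format = "%d.%m.%Y"), "%Y-%m-%d")

# Write with explicit encoding, decimal delimiter and line ending
out <- file.path(tempdir(), "example_std.csv")
write.table(dat, file = out, sep = ",", dec = ".", quote = TRUE,
            eol = "\n", row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to verify that it can be parsed again
back <- read.csv(out, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
print(back)
```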
When publishing the trait data on file servers like Figshare or Zenodo, those files should be uploaded in a single file repository (e.g. in a zip file). R can do the archiving for you using zip().
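A possible call is sketched below. The file names are those used in the export steps above; the archive name and the temporary working directory with placeholder files are assumptions made so that the example runs on its own. Note that utils::zip() relies on an external zip program being installed.

```r
files <- c("carabids_std.csv", "metadata.txt", "traits.csv", "taxa.csv")

# Placeholder files in a temporary directory; in practice these are
# the files written in the previous steps
workdir <- file.path(tempdir(), "traitdata_upload")
dir.create(workdir, showWarnings = FALSE)
for (f in files) writeLines("placeholder", file.path(workdir, f))

# Only attempt the archiving if a zip program is available
if (nzchar(Sys.which("zip"))) {
  old <- setwd(workdir)
  utils::zip(zipfile = "carabids_traitdata.zip", files = files)
  setwd(old)
}
```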
More advice for publishing trait data in a standardised way can be found in our ‘Best practice examples for primary data publication’ (Schneider et al., 2018).
References
Cornelissen, J. H. C., Lavorel, S., Garnier, E., Diaz, S., Buchmann, N., Gurvich, D. E., … Van Der Heijden, M. G. A. (2003). A handbook of protocols for standardised and easy measurement of plant functional traits worldwide. Australian Journal of Botany, 51(4), 335–380.
Garnier, E., Stahl, U., Laporte, M.-A., Kattge, J., Mougenot, I., Kühn, I., … Klotz, S. (2017). Towards a thesaurus of plant characteristics: An ecological contribution. Journal of Ecology, 105(2), 298–309. doi:10.1111/1365-2745.12698
Kattge, J., Ogle, K., Bönisch, G., Díaz, S., Lavorel, S., Madin, J., … Wirth, C. (2011). A generic structure for plant trait databases. Methods in Ecology and Evolution, 2(2), 202–213. doi:10.1111/j.2041-210X.2010.00067.x
Moretti, M., Dias, A. T., Bello, F., Altermatt, F., Chown, S. L., Azcárate, F. M., … others. (2017). Handbook of protocols for standardized measurement of terrestrial invertebrate functional traits. Functional Ecology, 31(3), 558–567. doi:10.1111/1365-2435.12776
Parr, C. S., Schulz, K. S., Hammock, J., Wilson, N., Leary, P., Rice, J., … J, R. (2016). TraitBank: Practical semantics for organism attribute data. Semantic Web, 7(6), 577–588. doi:10.3233/SW-150190
Pebesma, E., Mailund, T., & Hiebert, J. (2016). Measurement Units in R. The R Journal, 8(2), 486–494. doi:10.32614/RJ-2016-061
Perez-Harguindeguy, N., Diaz, S., Garnier, E., Lavorel, S., Poorter, H., Jaureguiberry, P., … Gurvich, D. E. (2013). New handbook for standardised measurement of plant functional traits worldwide. Australian Journal of Botany, 61(3), 167–234.
Schneider, F. D., Jochum, M., Provost, G. L., Ostrowski, A., Penone, C., Fichtmüller, D., … Simons, N. K. (2018). Towards an Ecological Trait-data Standard. bioRxiv, 328302. doi:10.1101/328302
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. doi:10.18637/jss.v059.i10