User:DarTar/Data model feedback

From biodiversity citizen science
Revision as of 00:33, 5 January 2020 by DarTar (talk | contribs) (initial feedback)

Some feedback on the data and data model as of January 4, 2020.

Observation entities

Item description
A description like "Observation of Diodia virginiana" may not be persistent if the identification changes, whereas a description like "iNaturalist observation by dartar" should in principle be persistent. Is the rationale for including the taxon to make the items searchable?
Observed on
Is there a reason why the time is stripped? The iNat API returns a full timestamp in its response, e.g. "time_observed_at": "2019-12-28T21:22:10-05:00".
Observer
Can you strip the full URL and leave just the username?
Scientific name
If you have it in the observation data, it would be useful to store the most recent observation-level taxon ID in the observation record itself, instead of only resolving to the QID of the taxon. This would allow querying and filtering the observation records directly.
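
To make the three suggestions above concrete, here is a minimal sketch of how an observation could be normalized before import. Only the "time_observed_at" field and its sample value come from the API response quoted above; the "observer_url" and "taxon_id" keys, the profile URL shape, the taxon ID value and the function names are assumptions for illustration, not the actual import code.

```python
from datetime import datetime
from urllib.parse import urlparse


def observer_username(value: str) -> str:
    """Return the last path segment of a profile URL, or the value unchanged."""
    # Assumption: the username is the final path segment of the profile URL.
    path = urlparse(value).path if "://" in value else value
    return path.rstrip("/").rsplit("/", 1)[-1]


def normalize_observation(raw: dict) -> dict:
    """Keep the full timestamp, a bare username and the taxon ID."""
    # fromisoformat keeps the time of day and the UTC offset (Python 3.7+).
    observed = datetime.fromisoformat(raw["time_observed_at"])
    return {
        "observed_on": observed.isoformat(),
        "observer": observer_username(raw["observer_url"]),  # hypothetical key
        "taxon_id": raw["taxon_id"],  # hypothetical key, stored on the record
    }


example = normalize_observation({
    "time_observed_at": "2019-12-28T21:22:10-05:00",
    "observer_url": "https://www.inaturalist.org/people/dartar",  # assumed shape
    "taxon_id": 12345,  # placeholder value, not a real taxon ID
})
```

With the taxon ID stored on the record itself, filtering observations by taxon would not require resolving the taxon QID first.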

Taxon entities

Taxon hierarchy
I don't know what you can or want to retrieve about taxa, but it would be fantastic to have at least the following represented:
  • the iNat taxon ID as a statement
  • a link to the parent taxon
  • a statement with the parent taxon ID
  • some basic ID mapping (GBIF / Wikidata for starters?)
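
As a rough illustration of the list above, a taxon entity could carry these four kinds of statements. This is only a sketch: the class, the field names and the statement labels are hypothetical, and all identifier values are placeholders rather than real iNat, GBIF or Wikidata IDs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaxonEntity:
    """Hypothetical in-memory shape of a taxon entity before import."""
    inat_taxon_id: int
    parent_qid: Optional[str] = None            # link to the parent taxon item
    parent_inat_taxon_id: Optional[int] = None  # parent taxon ID as a value
    external_ids: dict = field(default_factory=dict)  # e.g. GBIF, Wikidata

    def statements(self) -> list:
        """Return (label, value) pairs, skipping anything that is missing."""
        out = [("inat_taxon_id", self.inat_taxon_id)]
        if self.parent_qid is not None:
            out.append(("parent_taxon", self.parent_qid))
        if self.parent_inat_taxon_id is not None:
            out.append(("parent_inat_taxon_id", self.parent_inat_taxon_id))
        for scheme, ident in sorted(self.external_ids.items()):
            out.append((f"{scheme}_id", ident))
        return out


# Placeholder identifiers only.
taxon = TaxonEntity(
    inat_taxon_id=100,
    parent_qid="Q2",
    parent_inat_taxon_id=50,
    external_ids={"gbif": "111", "wikidata": "Q999"},
)
```

Storing the parent both as an item link and as a plain ID keeps the hierarchy navigable in the Wikibase while still allowing joins against the original iNat data.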

Wikibase data model

Instance of?
Wouldn't it be useful to have a notion of "instance of" to separate different types of entities?
Multi-language support
In my opinion, there's no need to have multiple languages in the Wikibase data model, since an observation will likely include only structured data. The only exception would be taxon entities, if you're planning to ingest localized common names.

Data

CC0 or all observations?
The data is too sparse to do meaningful analyses and queries: the most common taxa have fewer than 300 data points. I would consider ingesting the entire iNat data dump from GBIF, and maybe creating a smaller instance with the exact same data model for fast prototyping and debugging.