User:DarTar/Data model feedback

From biodiversity citizen science

Some feedback on the data and data model as of January 4, 2020.

Observation entities

Item description
I feel a description like "Observation of Diodia virginiana" may not be persistent if the ID changes, whereas a description like "iNaturalist observation by dartar" should in principle be persistent. Is the rationale for adding the taxon to make the items searchable?
Observed on
Is there a reason why the time is stripped? I see the iNat API gives a full timestamp as a response, e.g. "time_observed_at": "2019-12-28T21:22:10-05:00"
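To illustrate what is lost when the time is stripped, here is a minimal Python sketch that parses the timestamp quoted above. It only uses the standard library; the variable names are illustrative and not taken from the bot code.

```python
from datetime import datetime, timezone

# Timestamp exactly as quoted from the iNat API response above.
raw = "2019-12-28T21:22:10-05:00"

# fromisoformat keeps the UTC offset, so the result is timezone-aware.
observed = datetime.fromisoformat(raw)

# The local calendar date, which is all that survives if the time is stripped.
date_only = observed.date().isoformat()

# The same instant in UTC falls on the next calendar day.
as_utc = observed.astimezone(timezone.utc).isoformat()

print(date_only)  # 2019-12-28
print(as_utc)     # 2019-12-29T02:22:10+00:00
```

Keeping the full value would preserve both the local date and the exact instant, which matters for observations made near midnight.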
Observer
Can you strip the full URL and just leave the username?
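A minimal sketch of that stripping in Python, assuming the stored value is a profile URL whose last path segment is the username. The URL shape and the helper name `username_from_url` are assumptions for illustration, not something confirmed by the current data.

```python
from urllib.parse import urlparse


def username_from_url(value: str) -> str:
    """Return the last non-empty path segment, or the value itself if there is none."""
    segments = [s for s in urlparse(value).path.split("/") if s]
    return segments[-1] if segments else value


# Hypothetical profile URL; the real stored form may differ.
print(username_from_url("https://www.inaturalist.org/people/dartar"))  # dartar
# A bare username passes through unchanged.
print(username_from_url("dartar"))  # dartar
```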
Scientific name
If you have it in the observation data, it would be useful to store the most recent observation-level taxon ID in the observation record itself, instead of just resolving to the QID of the taxon. This would allow querying and filtering the observation records directly.

Taxon entities

Taxon hierarchy
I don't know what you can / want to retrieve about taxa but it would be fantastic to have at least the following represented:
  • the iNat taxon ID as a statement
  • a link to the parent taxon
  • a statement with the parent taxon ID
  • some basic ID mapping (GBIF / Wikidata for starters?)
The bot to enrich the taxon items is ready to run. The taxon ID is already part of that bot, since it is also listed in the data dumps obtained from iNaturalist. For the parent taxon, I would like to rely on the lineage stored in Wikidata and use federated queries to obtain that information. If I am not mistaken, those parent relations can be subject to disagreement, so if those views change, we would not need to update that information here. With regard to the Wikidata mapping: yes, totally, that is also set up to be included. Some species are already covered with Wikidata mappings. --Andrawaag (talk) 13:29, 5 January 2020 (UTC)
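As a rough sketch of the federated lookup described above, the Python helper below builds a SPARQL query that asks Wikidata for the parent taxon of a given iNaturalist taxon ID. It relies on the Wikidata properties P3151 (iNaturalist taxon ID), P171 (parent taxon) and P225 (taxon name). The function name and the sample ID are hypothetical, and how the query would be joined to local observation items depends on property IDs in this Wikibase that are not shown here.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def parent_taxon_query(inat_taxon_id: str) -> str:
    """Build a federated SPARQL query for the parent taxon stored in Wikidata."""
    # Guard against injecting arbitrary text into the query string.
    if not inat_taxon_id.isdigit():
        raise ValueError("iNaturalist taxon IDs are expected to be numeric")
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?taxon ?parent ?parentName WHERE {{
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?taxon wdt:P3151 "{inat_taxon_id}" ;   # iNaturalist taxon ID
           wdt:P171 ?parent .              # parent taxon
    ?parent wdt:P225 ?parentName .         # taxon name of the parent
  }}
}}
""".strip()


# "12345" is a placeholder ID used only to show the shape of the query.
print(parent_taxon_query("12345"))
```

Because the lineage is read from Wikidata at query time, a change in the parent relation there would be picked up without any update to the local items.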

Wikibase data model

Instance of?
Wouldn't it be useful to have a notion of "instance of" to separate different types of entities?
Multi-language support
In my opinion, there's no need to have multiple languages in the Wikibase data model, since an observation will likely include only structured data. The only exception would be for taxon entities if you're planning to ingest localized common names.

Data

CC0 or all observations?
The data is too sparse for meaningful analyses and queries: the most common taxa have fewer than 300 data points. I would consider ingesting the entire iNat data dump from GBIF, and maybe creating a smaller instance with the exact same data model for fast prototyping and debugging.
Unfortunately we are not there yet, but that is certainly the plan eventually. Currently we are held back a bit by the limitations of the API and by the data model, which still needs to mature. However, the bot code, which I will shortly reference here, is written in such a way that extending it to other licenses is just a matter of changing a set of parameters. --Andrawaag (talk) 13:24, 5 January 2020 (UTC)