Portals Overview

All newly digitized collections records will be immediately available in one of two Symbiota data portals established for the consortia. Symbiota is an open-source software project specifically designed to aid biologists in establishing specimen-based virtual floras and faunas. These data portals provide advanced specimen search capabilities, dynamic species checklist generation, distribution mapping, automatic identification keys, image library management, and the ability to integrate this information into attractive species pages. In addition to providing public access to integrated data and a platform for integrating data with other biodiversity resources, the portals will offer tools that support the digitization of collection information directly from images of the specimen labels. These tools will incorporate concepts of crowdsourcing, duplicate record harvesting, optical character recognition (OCR), natural language processing (NLP), and web services.

Consortium of North American Lichen Herbaria

Consortium of North American Bryophyte Herbaria

Data Management

Typically, larger institutions maintain in-house specimen databases, and the portals display only a snapshot of their data. Regular synchronization between the portal snapshot and the central database ensures that the snapshot within the portal stays up-to-date. Some smaller institutions choose to manage their specimen data directly within the portal. Since the portal then acts as their central management database, with record modifications reflected as they are made (a live dataset), there is no need for infrastructure to regularly update the portal data. While Symbiota is not designed as a robust, full-service collections management system, it does offer enough infrastructure and management tools to accommodate the needs of small institutions that lack the technical support and knowledge to properly maintain an in-house data management system.

Linking Images to the Portal

  1. Nightly Image Uploads: Images of the labels will be uploaded to the web servers via an FTP drop box. The upload process will involve the creation of 2-3 web versions (thumbnail, medium, large) of the images which will be displayed through the web portals.
  2. Linking Images to Portal Database: During the upload process, the barcode identifier will be obtained either from the image file name or directly from the image using OCR. The barcode identifier will be used to locate and link specimen records that already exist within the portal database. Records that already exist will be given a processing status of “pending review”. In cases where the specimen record does not already exist, the image will be linked to a new specimen record populated only with the barcode identifier. The new record will be given a processing status of “unprocessed”. If the imaging workflow records the most recent identification, this data will be appended to the record at this time.
  3. Automated OCR: Automated scripts will attempt to harvest raw text from each “unprocessed” image. When valid text is returned, it will be stored as a raw text block linked to the specimen record. Processing status will be changed to “OCR processed”.
  4. Automated NLP: Automated scripts will attempt to parse raw text into Darwin Core compliant data fields. On success, data will be appended to the appropriate Symbiota data fields. Processing status will be changed to “NLP parsed”.
  5. Automated Duplicate Record Query: Automated scripts will further process all records for which the NLP parsing scripts returned a collector, collector number, and collection date. This process will use those fields to search the integrated consortium database for duplicate records that have already been processed at another institution. Pending duplicates will be linked and the processing status will be changed to “pending duplicate”.
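The upload and automated processing steps above can be sketched as a minimal pipeline. This is an illustrative assumption, not the actual Symbiota implementation: the file-naming convention, Darwin Core field names, regular expressions, and status strings are stand-ins chosen to mirror the workflow described above.

```python
import re

# Illustrative processing-status values taken from the workflow above.
UNPROCESSED = "unprocessed"
OCR_PROCESSED = "OCR processed"
NLP_PARSED = "NLP parsed"

def barcode_from_filename(filename):
    """Extract a barcode identifier such as 'ASU0012345' from an image
    file name like 'ASU0012345.jpg' (naming convention assumed here)."""
    match = re.match(r"([A-Z]+\d+)", filename)
    return match.group(1) if match else None

def parse_label_text(raw_text):
    """Very naive stand-in for the NLP step: pull a few Darwin Core
    fields out of raw OCR text with regular expressions."""
    fields = {}
    m = re.search(r"Coll(?:ector)?[.:]?\s*(.+?)\s+(?:No\.?|#)\s*(\d+)", raw_text)
    if m:
        fields["recordedBy"] = m.group(1)
        fields["recordNumber"] = m.group(2)
    m = re.search(r"\b(\d{1,2} [A-Z][a-z]+ \d{4})\b", raw_text)
    if m:
        fields["eventDate"] = m.group(1)
    return fields

def process_image(filename, raw_ocr_text):
    """Link an image to a stub specimen record, store any OCR text, and
    run the naive parser, advancing the processing status at each step."""
    record = {"catalogNumber": barcode_from_filename(filename),
              "processingStatus": UNPROCESSED}
    if raw_ocr_text.strip():
        record["verbatimLabel"] = raw_ocr_text
        record["processingStatus"] = OCR_PROCESSED
        parsed = parse_label_text(raw_ocr_text)
        if parsed:
            record.update(parsed)
            record["processingStatus"] = NLP_PARSED
    return record
```

A record that yields collector, collector number, and collection date (status “NLP parsed”) would then be eligible for the duplicate query in step 5; records with empty OCR output stay “unprocessed” and fall through to manual review.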

Specimen Record Review

All specimen records will require review. Depending on the results of the automated processing steps, the review process will consist of simple approval, minor editing, importing duplicate record data, reprocessing the OCR, applying trained NLP parsers, and/or simply keystroking the label information.

Use Case Scenarios

  1. Simple Approval: In the best-case scenario, reviewers will simply need to approve the record.
  2. Minor Editing: Most records will likely need some type of data adjustment before approval.
  3. Importing Duplicate / Exsiccatae Record Data: In cases where a duplicate or exsiccatae record has already been processed in a partner institution, the reviewer will have the ability to view a list of pending duplicate records and selectively import data from the best matching record. Reviewers will have the ability to process these records in batches.
  4. Reprocessing of OCR: The reviewer will have the ability to rerun OCR on a particular image from the review page.
  5. Reprocessing NLP Parsers: The reviewer will have the ability to reparse the raw text. There may be two alternative parsing algorithms, and one may work better with some label formats than others. Furthermore, the central parsing algorithms will have the ability to "learn" how to better parse labels that share the same layout, e.g., labels from the same collector, or from a herbarium that used pre-printed label forms. The reviewer will have the ability to select label profiles that were specifically trained to parse database fields based on field location or word frequency within the label content.
  6. Keystroking Label Information: Labels that are hand-written or yield generally poor OCR output will have to be hand-typed into the data entry form. Unfortunately, keystroking will be necessary for many of the older labels; however, these labels tend to have little information that needs to be entered.
  7. Portal and Central Database Synchronization: In addition to regular updates of the data snapshot within the data portals, collections that maintain in-house central databases need the ability to transfer new or edited records that have been processed within the portal. Collections that manage their data directly within the portal have no need for this infrastructure since their central and portal datasets are the same.
    1. Refresh Portal Snapshots: When the portal features a data snapshot of an herbarium’s central database, the snapshot needs to be refreshed at regular intervals. Portals have several built-in tools and services to accomplish this. For more information, visit the Symbiota documentation website.
    2. Download New Records: Records entered within the portal from the images of the specimen labels need to be transferred to the collection’s central database at regular intervals. Password-protected download modules will aid collection managers in performing regular downloads in the data formats that best match their needs. For example, collections using Specify as their data management system will be able to download recently reviewed records as a Darwin Core CSV file and import them into their central database using the Specify Workbench.
    3. Downloading Recent Edits: Portals have the ability to make use of crowdsourcing and community involvement to aid in data cleaning, georeferencing, and error resolution. Edits will need to be regularly downloaded by data managers and integrated into their central database. To ensure that edits not yet transferred to the central database are not copied over at the next refresh of the data snapshot, the edits are preserved (versioned) in a separate layer from the snapshot and reapplied as needed.
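The "Download New Records" step could look roughly like the following sketch. The column subset, the "reviewed" status value, and the function name are assumptions for illustration; actual Symbiota exports and the Darwin Core term set a collection maps in the Specify Workbench will differ.

```python
import csv
import io

# Subset of Darwin Core column headers a workbench import could map (assumed).
DWC_COLUMNS = ["catalogNumber", "recordedBy", "recordNumber",
               "eventDate", "scientificName", "locality"]

def export_reviewed_records(records, reviewed_status="reviewed"):
    """Write records whose processing status indicates review is complete
    to a Darwin Core CSV string; missing fields are left blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_COLUMNS)
    writer.writeheader()
    for rec in records:
        if rec.get("processingStatus") == reviewed_status:
            writer.writerow({col: rec.get(col, "") for col in DWC_COLUMNS})
    return buffer.getvalue()
```

A data manager would run such an export on a schedule, then load the resulting CSV through the Specify Workbench (or an equivalent import tool); records still in an automated-processing status are excluded until review is complete.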

This material is based upon work supported by the National Science Foundation grant ADBC#1115116. Any opinions, findings, conclusions, or
recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.