

Notes for Editing Wisconsin Lichen/Bryophyte Portal Occurrence Records 


Parse OCR (LBCC): Under the image there will be a read-out of the OCR (optical character recognition) effort. The PARSE OCR (LBCC) button toward the bottom will insert that info into the appropriate fields on the occurrence data tab. Anything inserted will be highlighted in green and can be edited to correct spelling or parsing, e.g. moving location info into the Locality fields, putting habitat info in the correct places, and correcting numbers and dates. Using the Parse OCR button is optional, but it would be interesting to see how well this function operates.


Collector: First name or initials (no spaces) followed by last name. Limit to first collector or collector associated with the collection number.

Examples: John W. Thomson, Thomas H. Nash, III

Associated collectors: List other collectors in the same format; include et al. if this is written on the label.

Number: Collection number as represented on the label. Include numbers and letters.

Examples: 34111, 28-B, 346a, PFS-276. If no number is found, enter “s.n.”

Date: Enter in format of yyyy-mm-dd.

Examples:

Label reads “June 6, 1923”: enter 1923-06-06.

Year and month but no day given (June 1923): enter 1923-06-00.

Only year given (1923): enter 1923-00-00.

Range of days indicated (Aug. 12-14, 1978): assume the earliest day and enter 1978-08-12; also write “Aug. 12-14, 1978” in the Verbatim Date field.
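The date rules above can be sketched in a few lines of Python. This is only an illustration of the yyyy-mm-dd convention with 00 for unknown parts; the function name portal_date is hypothetical and is not part of the portal.

```python
def portal_date(year, month=None, day=None):
    """Format a label date as yyyy-mm-dd, writing 00 for unknown parts."""
    if month is None:
        # A day without a month cannot be placed, so it is dropped too.
        day = None
    return f"{year:04d}-{month or 0:02d}-{day or 0:02d}"


# For a range such as "Aug. 12-14, 1978", pass the earliest day (12)
# and record the full range in the Verbatim Date field by hand.
print(portal_date(1923, 6, 6))
print(portal_date(1923, 6))
print(portal_date(1923))
print(portal_date(1978, 8, 12))
```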

Verbatim Date: This is seen by clicking the +/pencil icon next to the Associated Collector box. Enter date as written for non-standard date formats or if the information is confusing or incomplete.

Example: label reads “Summer 1884”, enter 1884-00-00 in the Date field and write out “Summer 1884” into the Verbatim Date field. NOTE: Many European collectors often write dates a bit differently than in the US, by inverting the day and month. So 6/12/73 may mean June 12, 1973 or December 6, 1973. If unsure, write it in the verbatim field.

Exsiccati Title: Complete the title of Exsiccati using the drop down box.  Enter exsiccati number.  Click on the Dupes? button.  Possible exact duplicates will pop up in a separate window (be sure to allow pop ups).  Select appropriate record, and transfer all/some of the fields to import.  



Scientific Name: This field is already populated for imaged specimens.

Author: This field will be automatically populated, but is editable.

ID Qualifier: The determiner's expression of uncertainty in their identification. This will be listed on the label along with the scientific name. This is a good place to enter forms or varieties not recognized by Esslinger.

Example: cf., s.l., aff.

Identified By: Enter the name of the person who identified the specimen, also called the determiner. Use the same format as for Collector. If no name is specifically indicated, leave blank at this time. [Updated: If blank, enter the Collector's name, enter the year of collection for the date identified, and enter 'assumed to be collector' in the ID Remarks field.]

Date Identified: The date the identification was made. Date can be entered as free form text and does not need to be in a standard date format. If no date indicated enter "s.d." or “unknown”.

ID Remarks: Any additional notes regarding the identification of the specimen. (This field is accessed by clicking on the +/ pencil symbol next to the box)


LOCALITY: Enter Country, State/Province as indicated on the label. The County field will supply drop down options after entering a few letters, use this for the correct spelling. This is also the place to enter parishes as found in LA, or boroughs as found in AK. Write out the names completely (St. Bernard Parish, North Slope Borough, etc.) County can also be used for third level locality information most often found out of the US, such as District (Franklin District). Municipality would be fourth level locality info; however we choose not to use it. Enter other specific location info into the Locality field. Write out as it appears on label, but it is OK to correct any spelling errors or write “near” if not specifically mentioned. If you are making an assumption about correct spelling or location, enter this into the Notes section under the Misc. heading.

Example: The specimen is from Florida and the collector has written “J. ville” as the location. Enter “Jacksonville” in the locality field, and write something such as “verbatim label location J. ville” in the Notes field.

Be sure to add information from the header of the label, if present, e.g. Keele River Region, Central McKenzie Mountains, Lake Itasca Biological Station. Beware of labels that carry a herbarium name from one state or province while the collection location is in a different one.

Lat/Long: The fields displayed are for decimal form. If other forms given, click on the “Tools” button to expand the selection. Use the Lat/Long boxes to enter xx°xx’xxs form, then click Insert Lat/Long Values. The decimal values will automatically be populated in the boxes above. Use the Town, Range, Section boxes and click Insert TRS Values (you must click this button for data to be saved). This will populate the Verbatim Coordinate field. Use the Verbatim Coordinate field if missing data. UTM can also be entered directly in the verbatim coordinates field. If a standard format is used, decimal lat/long will be calculated automatically. Clicking on the double arrows to the left of the verbatim field will recalculate the lat/long. Note that TRS values will not convert to decimal form at this time. 
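For reference, the arithmetic behind converting degrees, minutes and seconds to decimal form is shown below. This is a generic sketch of the standard conversion, not the portal's own code; the function name and the six-decimal rounding are assumptions made for the example.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to decimal degrees.

    Southern and western hemispheres are returned as negative values.
    """
    value = abs(degrees) + minutes / 60 + seconds / 3600
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return round(value, 6)  # six decimals is an assumption, not a portal rule


print(dms_to_decimal(43, 4, 30, "N"))
print(dms_to_decimal(89, 24, 0, "W"))
```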

Elevation in Meters: Enter the elevation if in meters. If only feet given, enter this in the Verbatim Elevation field. Click enter, tab or click on the << button to convert to meters.

Example: Enter 23 ft, 23 feet, 23’ into the Verbatim Elevation field.
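The conversion the form performs can be approximated as follows. This is a sketch only: the accepted spellings are taken from the example above, while the rounding to whole meters and the function name are assumptions and may differ from what the portal does.

```python
import re

FEET_TO_METERS = 0.3048  # exact definition of the international foot

# Accepts the forms shown above: "23 ft", "23 feet", "23’" (or "23'").
_FEET_PATTERN = re.compile(
    r"\s*(\d+(?:\.\d+)?)\s*(?:ft\.?|feet|foot|['’])\s*", re.IGNORECASE
)


def verbatim_feet_to_meters(verbatim):
    """Return the elevation in whole meters, or None if the text is not in feet."""
    match = _FEET_PATTERN.fullmatch(verbatim)
    if match is None:
        return None
    return round(float(match.group(1)) * FEET_TO_METERS)


print(verbatim_feet_to_meters("23 ft"))
```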

[The Georeferencing fields are used when using the embedded mapping tools. Contact the collection manager for permission. Please read up on the Best Practices for Georeferencing before using.

Georeferenced by: Enter your user name

Georeference Sources: When using the batch tool, this will automatically fill.  Enter any other sources for your search (Google maps, Acme mapper, Wikipedia, etc.)

Georeference Remarks: Add any information that helped to locate the point

Georeference Protocol: Enter Best Practices for Georeferencing

Georef. Verification Status: Can be set anywhere from low to high confidence.]


MISC. Habitat: Enter habitat info, but do not include specific locality info (which should have been entered into the Locality field).

Examples: Along valley on west facing slope; in dry streambed; in bog; in goat prairie surrounded by Artemisia grasslands.

Substrate: What the lichen was growing upon. It is not necessary to add the word “on” as this is assumed.

Examples: bark of spruce trees; granite boulders; basalt rock outcrops

Associated Taxa: Include the collector’s listing of other species found with this collection.

Description: Specific characteristics of the specimen.

Examples: old and finely fruited specimens; a crustose greenish brown plaque...scraped off with knife; rare

Notes: This is a good place for any other random or label information, or notes about the specimen not contained in other fields.

CURATION: This is for any information regarding Type Status at this time. 


**Processing Status: Please change the status from Unprocessed to Pending Review.

!Click Save Edits before anything else or you will lose all the info you have entered!

Determination History: Once you have saved information you entered on the Occurrence tab, you may leave the page and create an entry for any annotations on the label, including previous name if an update was done. At this time, if no specific name for Determiner, write in the collector's name, with the year it was collected, add 'assumed to be collector' in the id notes field.  Enter any notes as to the determination. 

IMPORTANT: Look at the boxes under the Add New Determination button. Be certain to check or uncheck the boxes as necessary to ensure the correct current name! Do NOT change Scientific Name!! Click Add New Determination when finished. This will take you back to the Occurrence Data page. Verify that the scientific name under Latest Identification is correct. Any questions, please ask!


The imaging workflow is a Java v7 application for accumulating label images and their associated metadata for submission to the central FTP site in Florida for processing.

Setting up the Workflow Application.

Download Java v7.

Download the workflow application.

This is a zip file containing the Java class files that comprise the application and three data files that can be used to populate dropdown menus in the application. Once you have saved the zip file on your computer, right click on the file name and select "Extract All" in the box that opens in response. Record the path to the folder that you extract the files into.

To create a Windows desktop shortcut to the application, right click on the desktop and select New Shortcut.  Enter "java production.ImagingWorkflowApplication" into the textbox labeled "Type the location of the item:".  Click "Next" and enter an appropriate name for the shortcut.  Click "Finish".  Then right click on the desktop shortcut you just created, select “Properties” and enter the path to the folder into which you extracted the workflow into the “Start in” field.  For instance, if you extracted the zip file into the C:\ImagingWorkflow folder and the class files are in the C:\ImagingWorkflow\production folder, enter "C:\ImagingWorkflow" into the "Start in" textbox. Click “OK”.

Using the Workflow Application

Clicking on the shortcut will open a panel with six tabs.  The first tab, labeled "Working Folder", will be displayed. Click on the button labeled "Select a Working Folder" to begin. This allows you to select the folder into which the application saves the images and metadata. Two subfolders within the working folder will be created by the workflow application at this point, "processing" and "destination".  The camera that is used to record the label images should be set to send the images to the processing folder. Many cameras will automatically create subfolders within this folder, but that will not cause problems with this application. The workflow application will monitor the processing folder and its subfolders for incoming files from the camera. The destination folder will receive the files and their associated metadata files after processing.

After you have selected a working folder, click on the "Select Data Sources" tab to select data sources that the application will use to populate dropdown menus containing scientific names (a required field) and exsiccati names (optional). If you click on the "Scientific Name File" option you will be asked to select an authority file.  This file should be a comma- or tab-delimited file containing the scientific names you would like to have in the scientific name dropdown menu. The lichen_taxa.tab and bryophytes_taxa.tab files included in the application zip file can be used for this purpose.

If you choose to use a text file containing exsiccati names and numbers you can upload it by clicking on the Upload button. You can use the asu_exsiccati_full.tab (which contains column names in the first row) included in the zip file for this purpose.  If you click on the "Column names in first row" checkbox the entries in the first row of the file will appear in the dropdown menu that appears to allow mapping of the appropriate columns of the file to the exsiccati names, exsiccati numbers and scientific names.  Otherwise the dropdown menu will display column numbers. You must select columns for the application to use or the exsiccati dropdowns will not be populated correctly.

When this has been completed, click on the "Persist Metadata" tab.  In this tab you can enter metadata that will be persisted between sessions and between images.  The five textboxes at the top allow entry of metadata that will be persisted between sessions. The checkboxes at the bottom allow the selection of metadata that will not be reset between images during data accumulation. Once these have been entered, click on the "Enter Data" tab.

Two values are required for every label: barcode and scientific name. The barcode is assumed to come from a barcode reader and the scientific name will come from the Scientific Name dropdown menu. When a selection is made from the Scientific Name dropdown menu two things happen: (1) If an exsiccati name file has been uploaded and its columns correctly mapped, the application reads the exsiccati name file, finds the rows that have the chosen scientific name in the column you mapped as the scientific name column, and adds the value from the column you mapped as the exsiccati name column to the "Exsiccati" dropdown menu. Subsequently, if a value is selected from the Exsiccati dropdown, the file is read again and the rows that match the selected value have their values from the column mapped as the exsiccati number added to the Exsiccati Number dropdown menu. (2) If a working folder has been selected, a "New Session" button will appear at the bottom of the tab.
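The cascading lookup described in (1) can be pictured with the short sketch below. It is not the application's code (which is written in Java); the dictionary keys, the function names, the sorted ordering and the sample rows are all made up for illustration.

```python
# Made-up rows standing in for a mapped exsiccati file.
SAMPLE_ROWS = [
    {"sciname": "Cladonia rangiferina", "exsiccati": "Example Exsiccati A", "number": "12"},
    {"sciname": "Cladonia rangiferina", "exsiccati": "Example Exsiccati B", "number": "3"},
    {"sciname": "Peltigera canina", "exsiccati": "Example Exsiccati A", "number": "40"},
]


def exsiccati_options(rows, scientific_name):
    """Exsiccati names offered once a scientific name has been chosen."""
    return sorted({r["exsiccati"] for r in rows if r["sciname"] == scientific_name})


def exsiccati_number_options(rows, exsiccati):
    """Exsiccati numbers offered once an exsiccati name has been chosen."""
    return sorted({r["number"] for r in rows if r["exsiccati"] == exsiccati})


print(exsiccati_options(SAMPLE_ROWS, "Cladonia rangiferina"))
print(exsiccati_number_options(SAMPLE_ROWS, "Example Exsiccati A"))
```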

When a selection is made from the Country dropdown menu the State/Province dropdown is populated with only those states or provinces which are associated with that country. To repopulate the State/Province dropdown with all of the states regardless of country (only the United States, Canada, Mexico, Norway, Sweden, Denmark and Finland currently have states or provinces associated with them and only those states or provinces will appear in the State/Province dropdown menu) select the empty line at the top of the Country dropdown. When a selection is made in the State/Province dropdown the country it is associated with is selected in the Country dropdown. Both the Country and State/Province dropdown menus are editable so users can enter values currently not in our database.

Clicking on the New Session button begins the monitoring of the processing folder for incoming files. A folder is created in the destination folder with the date as its name. Subsequent sessions for a given day have "_1", "_2", etc. appended to the folder name after the date. After clicking this button you can no longer select a new working folder without closing and reopening the application. Once the New Session button has been clicked, new files sent to the processing folder by the camera have their names displayed at the bottom of the tab and, if a barcode has been entered, "Enter" and "Delete" buttons appear at the bottom next to the New Session button. The Delete button removes all files from the processing folder and its subfolders. The Enter button renames the image file to the contents of the barcode field, moves it to the destination folder and enters its metadata from the application into the metadata file for the session.  If there is more than one file in the processing folder when the Enter button is clicked, or if the button is clicked twice with the same value in the barcode field, the additional files are named with "_a", "_b", etc. appended to the filename after the barcode.  Those fields in the Enter Data tab that have not been checked in the Persist Metadata tab are cleared and, if it is the first entry for the session, the values of the working folder and the metadata entered in the Persist Metadata tab are saved in a file that is read in subsequent sessions. After this, if you close the application and reopen it, this file is read and used to set the working folder and populate the dropdown menus, and the application goes straight to the Enter Data tab. To change these persisted values, go to the relevant tabs and reset them before clicking on the New Session button.
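The two naming rules in this paragraph (session folders and image files) can be sketched as follows. This is an illustration in Python rather than the application's Java code; the function names are hypothetical, and the date format used for the folder name is an assumption since it is not specified here.

```python
from string import ascii_lowercase


def next_session_folder(date_label, existing_folders):
    """First session of the day uses the date; later ones get _1, _2, ..."""
    if date_label not in existing_folders:
        return date_label
    counter = 1
    while f"{date_label}_{counter}" in existing_folders:
        counter += 1
    return f"{date_label}_{counter}"


def next_image_name(barcode, existing_names, extension=".jpg"):
    """First image is <barcode>.jpg; additional ones get _a, _b, ..."""
    candidates = [f"{barcode}{extension}"]
    candidates += [f"{barcode}_{letter}{extension}" for letter in ascii_lowercase]
    for name in candidates:
        if name not in existing_names:
            return name
    raise ValueError(f"no free file name left for barcode {barcode}")


# The date format below is assumed for the example.
print(next_session_folder("2012-03-05", {"2012-03-05"}))
print(next_image_name("ABC0001234", {"ABC0001234.jpg"}))
```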

The fifth tab is called Manage Data and contains a single "Delete" button that allows the user to remove images from subfolders of the destination folder. When you remove images this way any references to that image in the metadata file are also removed. Only files with the .jpg file extension can be removed. Session folders can also be removed if they are subfolders of the current destination folder unless you try to remove the current session folder.

The final tab is called "Archive Data". It is intended to be used to save accumulated data to an external storage drive. Clicking on the "Archive" button allows the user to select the drive and folder to save the data in. If the "Overwrite" check box is checked, all subfolders of the destination folder will be transferred to the archive folder and already-existing files with the same name will be overwritten. If the overwrite checkbox is unchecked, only new files will be transferred.

Contact Robert Anglin for assistance in setting up and using this application.


We have developed the following guidelines for preparing and documenting shipments of lichen and bryophyte specimens to The New York Botanical Garden for digitization for the Lichen-Bryophyte TCN Project.  The goal is to make the transfer of specimens for this project as accurate and efficient as possible, and to minimize the amount of time that the specimens are away from their home institutions.

Preparation of specimens for shipment:

Barcoding of specimens

As soon as the project barcodes have been received at New York, we will distribute these to each institution. 

All specimens sent to New York for digitization should be barcoded in advance by the home institution.  The barcodes should be positioned so they are visible without opening the packet or any packet flaps, and there should be a white border of at least ¼ inch around the barcode lines; generally the barcode labels have at least this much of a border on the label itself.  The barcodes should be applied in numerical sequence within a species, preferably within a genus.  Some institutions have collections both mounted on sheets and as individual specimens; these should be barcoded in series within the type of preparation, but not across preparation types.  In other words, barcode all the specimens of Bryum argenteum mounted on sheets in sequence, and barcode all specimens of Bryum argenteum in packets in sequence, but the sequence doesn’t have to be continuous across mounted and non-mounted specimens.

Spreadsheet inventory of shipments

A spreadsheet should accompany the shipment of specimens that includes the following information: barcode number, preparation type (packet or sheets), genus, species, subspecific  taxon.  If you use Excel or comparable spreadsheet software, you can use functions provided with the software to advance the barcode number automatically and copy data from the preceding entry.   

Below is a mock-up of the spreadsheet columns:

Barcode number | Prep. Type (mounted or packeted, if herbarium has both) | “Filed as” Genus | “Filed as” species | Subspecific qualifier (if present) | Subspecific name (if present)

The names used in the genus and species columns should be those under which the specimens are filed in your herbarium.   The spreadsheet will be used not only to create the skeletal data records for your specimens, but will also serve as inventory control, which will facilitate the checking in and out of specimens when they are sent out and received. The time you take to make the list will be largely offset by the time you will save counting and re-counting specimens before shipment and after return, or, heaven forbid, figuring out where the discrepancy is if these counts don’t match!
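For institutions that would rather generate the inventory with a script than with spreadsheet functions, a minimal sketch is given below. The column names follow the list above; the barcode prefix, the seven-digit padding, the function names and the sample specimens are made up for illustration.

```python
import csv
import io

COLUMNS = [
    "Barcode number", "Prep. Type", "Filed as Genus",
    "Filed as species", "Subspecific qualifier", "Subspecific name",
]


def build_inventory(prefix, first_number, specimens, digits=7):
    """Assign consecutive barcodes to specimens listed in filing order."""
    rows = []
    for offset, specimen in enumerate(specimens):
        barcode = f"{prefix}{first_number + offset:0{digits}d}"
        rows.append([barcode, *specimen])
    return rows


def to_csv(rows):
    """Render the inventory as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data: two packets filed under the same species.
specimens = [
    ("packet", "Bryum", "argenteum", "", ""),
    ("packet", "Bryum", "argenteum", "var.", "lanatum"),
]
print(to_csv(build_inventory("ABC", 101, specimens)))
```

Saving the result with a .csv extension lets it be opened in Excel or comparable software for review.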

Specimens mounted on herbarium sheets should be placed in thin paper folders labeled with the genus and species name; including the barcode range on the folder is helpful, though not required.  For maximum protection of the specimens, folders should be grouped in 12 to 16 inch bundles that are sandwiched between corrugates and tied with two evenly-spaced strings.  Bundles should be placed in new boxes 12 to 20 inches high with a bursting strength of approximately 275 lbs.  We will use these same materials to return the specimens to their home institutions.

Specimens not mounted on sheets (that is, loose packets) can be prepared in one of two ways:

Line up in species and barcode order in cardboard trays that fit comfortably inside your shipping boxes.  Use some kind of marker indicating the “filed as” name for each species.  This marker is ideally a differently-colored piece of stiff paper or cardboard cut to the size of the packets that has the species name on it and marks the beginning of the sequence of specimens with that “filed as” name.  The trays can be stacked inside each box, but should be separated by corrugates to prevent damage to packets.

Bundled using paper such as unprinted newsprint in groups of 5 to 10 specimens each, with each bundle labeled with the “filed as” name.  This method is a bit more time-consuming, but will contain the specimen contents and reduce the potential for contamination, should specimen fragments or dirt fall out of the packets during transfer.

Transaction Management:  We have decided to create a new transaction category for the shipments we receive of specimens for digitization.  We will call these “Incoming Loans for Digitization.”  Unlike other incoming loans to NYBG, these will not be assigned to a particular researcher.  All loans for digitization should be addressed to:


Dr. Barbara Thiers

Director, William and Lynda Steere Herbarium

The New York Botanical Garden

2900 Southern Blvd

Bronx, NY 10458-5126


LOAN OF [Bryophytes] [Lichens] FOR DIGITIZATION



The barcode will be used as the primary identifier linking together specimen images, the online specimen records, and the records within the herbarium’s central database. Due to the important role of this identifier, regularly using a barcode reader will avoid common transcription errors that occur when keying numbers by hand. For most of the participating collections, barcodes will be purchased by the TCN. Institutions already purchasing barcodes through their preferred provider will be given funds to continue.

Barcode Requirements

  • Unique Identifiers: The barcode must uniquely identify the specimen within the collection. It is important that methods are in place to ensure that no two specimens receive the same barcode identifier, which is particularly important for collections that make their own barcodes using a barcode printer.
  • Must be stable: Ideally the barcode should never be modified. Therefore, ensure that the barcode is truly unique at the time of assignment so that it doesn’t have to be reassigned.
  • Where to Place: The final location for the barcode should be one that is easy to scan without opening a packet or disturbing the specimen any more than is needed. Remember that the most important role for barcodes is to supply an easy and reliable method for identifying a specimen when performing curatorial management tasks. If one is processing a group of specimens for a loan, one should be able to quickly go through a stack and scan each specimen without much trouble. Note that OCR returns of text immediately to the right or left of the barcode can be problematic. In order to reduce the OCR "noise" that a barcode can create, it is preferable if the barcode is above or below the label with no adjacent text on a horizontal plane.

Barcode Recommendations and Comments

  • Global Unique Identifiers (GUID): Ideally, the barcode identifier would uniquely identify that specimen relative to all other specimens found worldwide. The current TDWG recommendations for creating unique herbarium identifiers are to use: <institution code>:<collection code>:12345678. For more information on the Darwin Core recommendations: http://rs.tdwg.org/dwc/terms/index.htm#occurrenceID
  • Format: Barcodes are often alphanumeric. The most common barcode standard used for herbarium specimens is Code 39. It’s a good idea to avoid using special characters (!@#$%&) and spaces when possible. The size of the barcode label will depend on the space available on the specimen. Smaller sized lichen and bryophyte specimens may make a barcode of the full GUID (ca 18 digits) impractical. In this case, the barcode might only represent the numeric portion of the identifier or have a collection code of only one to two digits.
  • Set Number of Digits: Barcodes with a uniform number of digits aid in catching and avoiding errors within the database. For the numeric portion of the barcode, collections typically use 7-8 digits with left padded zeros. For example, ABC herbarium with 275,429 lichen specimens might have a barcode sequence from ABC:L:0000001 to ABC:L:0275429. If the collection chose to match barcode and accession number, a specimen with an accession number of 1234 would be something like ABC:L:0001234.
  • Readable Identifier: Include the human readable digits with the barcode so that one can read the identifier without the need of a barcode reader.
  • Ordering Barcodes: When ordering preprinted barcodes, it’s a good idea to order enough extra barcodes to cover incoming specimens for the next 10 years or more.
  • Pre-printed -vs- Barcode Printing: This TCN project recommends using pre-printed barcodes.
    • Pre-ordered barcodes bought in quantity are typically the cheaper way to go in the long run. This option avoids the need to buy and maintain printer, ink, blank barcodes, etc.
    • Purchasing pre-printed barcodes as a batch ensures that each barcode is unique. If one prints their own, ensure that your database application restricts the entry of duplicate barcodes.
    • Affixing pre-printed barcodes to specimens is generally fast since one doesn’t have to wait for the printer to pop out a barcode.
    • Barcode printers make it easier to print 3 barcodes of the same number for specimens with 3 sheets. However, a regular printer can be used for printing an occasional barcode.
    • Using a barcode printer may be easier if a collection wishes to match barcodes with accession number, yet one needs to be very careful with typing the accession number in correctly. Errors like this can lead to more than one specimen having the same barcode.
  • Sequential –vs– Matching Accession Number
    • Matching barcodes is more work, time consuming, and typically the more expensive option. This is particularly true if barcodes are preprinted.
    • If one matches barcodes with accession numbers, a method is needed to ensure that no two specimens receive the same barcode identifier. Multiple specimens accidentally being given the same accession number is a typical problem within herbaria (e.g. stamp failed to advance).
    • Not matching produces one more identification number that can be assigned to a specimen. Institutions that decide not to match barcodes with accession numbers, often decide to do away with the old accession number in place of the new barcodes.
  • Multiple Sheet Specimens: There is disagreement on how to handle specimens that consist of multiple sheets. Some prefer that each sheet gets its own barcode while others assign the same barcode to each sheet. From the database perspective, using a single barcode identifier for all sheets of a specimen is preferred. Ideally, a single specimen should be represented by a single record within the database, whether the specimen consists of one or ten sheets. When general users query a database, they typically want the return count to correctly identify the true number of unique specimens rather than the number of sheets. When they look at the details of a specimen record, they typically prefer to view images of all the sheets at once rather than having to click on separate records, each representing a different sheet. Finally, in the event of an annotation, the data managers should not have to enter the same annotation for multiple records within the database. Multiple records representing a single specimen not only increase the data maintenance workload, but also create an increased possibility of ambiguity if each record states something different because each record was updated differently.
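The identifier format and the uniqueness requirement discussed above can be checked with a few lines of Python. This is a sketch under the assumptions of the ABC:L:0001234 example (letters-only codes, seven digits); the function names are hypothetical and a real collection would adapt the pattern to its own codes.

```python
import re
from collections import Counter


def make_identifier(institution, collection, number, digits=7):
    """Build <institution code>:<collection code>:<zero-padded number>."""
    return f"{institution}:{collection}:{number:0{digits}d}"


def is_well_formed(identifier, digits=7):
    """True if the identifier has two letter codes and exactly `digits` digits."""
    pattern = rf"[A-Za-z]+:[A-Za-z]+:\d{{{digits}}}"
    return re.fullmatch(pattern, identifier) is not None


def duplicated(identifiers):
    """Identifiers that were assigned more than once."""
    counts = Counter(identifiers)
    return sorted(ident for ident, count in counts.items() if count > 1)


print(make_identifier("ABC", "L", 1234))
print(is_well_formed("ABC:L:1234"))
print(duplicated(["ABC:L:0000001", "ABC:L:0000002", "ABC:L:0000001"]))
```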


File storage for the label images will be handled by the ADBC HUB (iDigBio). Image submission will take place using password protected FTP. To obtain connection information for the FTP site and have an upload profile created for you, email Robert Anglin or Edward Gilbert. Once submitted, images will be processed to create three web versions (basic web, thumbnail, and large) that are placed on a web server and linked to specimen records located within the lichen and bryophyte web portals.


  1. Image specimen labels using your preferred workflow. Images are expected to follow these rules:
    • Images must be saved as JPGs.
    • Images are named using the unique specimen identifier (i.e. barcode) using a consistent naming convention. This identifier is used to link the images to their proper specimen record.
  2. Load the images on to the FTP server. FTP connection information will be supplied to each provider upon request. Each herbarium has a folder within the FTP base folder. Within these institution folders, one will find “lichens” and “bryophytes” folders. Since lichens and bryophytes have their own data portals, make sure to process the lichen and bryophyte images separately by placing them in the correct upload folders.
  3. Load skeletal data files into same folder as images. This file contains specimens data recorded during the imaging process, such as most recent identification, catalog number (accession number), collector, collection number, collection date, exsiccati name, exsiccati number, etc. Submission of these files is not required, though strongly encouraged. Recording the most recent identification at time of imaging is especially important since it is not always possible to determine filing information from the specimen image alone. This is a particular issue with collections that annotate specimens by simply filing the specimen under the new identification without attaching a formal annotation label. The skeletal data files are expected to adhere to the following rules:
    • File should have a file extension of .csv, .txt, .tab, or .dat. For example, skeletal_12Dec2011.csv, ABC_skeletal_12102011.txt, etc.
    • File can be comma, tab, or pipe delimited. Files with .csv extensions are expected to be standard comma-delimited files (CSV). Files ending in .tab are assumed to be tab delimited. Files with .txt or .dat extensions will be analyzed to determine the delimiter. Note that Excel has the unfortunate tendency to automatically format numeric fields, remove leading zeros, modify date formats, and sometimes assume lat/longs should be rounded to the nearest cent. If you open the skeletal records in Excel for review, it is highly recommended that you do not save the file!
    • CatalogNumber is the only required field since this is what is used to link the data to the correct images. All other fields are optional, though the most recent identification fields are strongly encouraged. Most recent identification can be placed within either of the following fields: scientificName (if with author) or sciname (without author). Alternatively, separate fields can be used for genus, specificepithet, taxonRank, and infraSpecificEpithet.
    • The first row should contain the field (column) names. The field names should follow those approved for upload to the web portals (Symbiota platform). Most Darwin Core fields are importable within the Symbiota portals. A few additional fields are also acceptable. See the Symbiota Import Quick Guide for a list of fields typically imported into a portal. Note that while field names are not case sensitive, they must otherwise exactly match the naming format used within the portals (e.g. no spaces).
    • A unique identifier (barcode number, accession, database pk, etc.) must be supplied for each record to be loaded. If left blank, the record will be skipped. This field must be named catalogNumber. The identifier must follow the same format used within the image file names.
    • Note that a Skeletal Data Java application has been made available through this project. This application is to be used during the imaging process to aid in renaming the images using the barcode and in collecting skeletal data. The resulting CSV skeletal data file can then be uploaded and processed along with the images.
  4. Loading scripts will be triggered nightly to process images and skeletal data files.
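
The skeletal file rules above can be checked before upload with a small script. The following is only a minimal sketch in Python, not the portal's actual loading script: the function names are hypothetical, and the delimiter check for .txt and .dat files (counting candidate delimiters in the header row) is an assumed stand-in for whatever analysis the real loader performs. All values are kept as text, so leading zeros in identifiers survive.

```python
import csv
import io
from pathlib import Path

# Delimiters allowed by the rules above: comma, tab, or pipe.
ALLOWED_DELIMITERS = (",", "\t", "|")


def guess_delimiter(filename: str, header_line: str) -> str:
    """Choose a delimiter from the file extension (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext == ".csv":
        return ","
    if ext == ".tab":
        return "\t"
    if ext in (".txt", ".dat"):
        # Assumption: the most frequent allowed delimiter in the header wins.
        return max(ALLOWED_DELIMITERS, key=header_line.count)
    raise ValueError(f"unsupported skeletal file extension: {ext!r}")


def read_skeletal(filename: str, text: str) -> list[dict[str, str]]:
    """Return loadable records, keyed by lower-cased field name."""
    lines = text.splitlines()
    delimiter = guess_delimiter(filename, lines[0] if lines else "")
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        raise ValueError("skeletal file is empty")
    # Field names are not case sensitive, so compare them in lower case.
    fields = [name.strip().lower() for name in rows[0]]
    if "catalognumber" not in fields:
        raise ValueError("skeletal file must contain a catalogNumber column")
    records = []
    for row in rows[1:]:
        record = dict(zip(fields, (value.strip() for value in row)))
        # Records with a blank identifier are skipped, as described above.
        if record.get("catalognumber"):
            records.append(record)
    return records
```

Calling read_skeletal with the file name and the file's text returns one dictionary per usable row; rows without a catalogNumber value are silently dropped, mirroring the rule that such records are skipped.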

General Notes

  • For both image and skeletal data records, the unique identifier is used as the central link between the specimen records and images. If a specimen record does not already exist (typically the case), a new record will be created, populated with the identifier, and assigned a processing status of "unprocessed". Information from skeletal data files will be appended to these records. If a record already exists, images will simply be linked to the existing record and tagged as needing review.
  • Record data from a skeletal data file will be appended to an existing record without overwriting any existing information. If there is already data within a field, new skeletal data that contradicts the existing data will be appended to the general notes field (occurrenceRemarks) with a clear indication of which field the data belongs to.
  • Images loaded more than once will replace existing web images. This is a nice method for replacing faulty images (e.g. out of focus). However, it should be noted that if the imaging workflow incorrectly names a new image with the same name as an existing image, the good image will be replaced with the bad. If multiple images of a single specimen are to be loaded, make sure the file names are unique. 
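
The append-without-overwrite rule for skeletal data can be illustrated with a short Python sketch. This is an illustration of the behavior described above, not the portal's code: the function name and the exact wording of the note written to occurrenceRemarks are assumptions.

```python
def merge_skeletal(existing: dict[str, str], skeletal: dict[str, str]) -> dict[str, str]:
    """Fill empty fields from skeletal data; never overwrite existing values."""
    merged = dict(existing)  # work on a copy so the stored record is untouched
    conflicts = []
    for field, value in skeletal.items():
        if not value:
            continue  # nothing to add for this field
        current = merged.get(field, "")
        if not current:
            merged[field] = value  # field was empty: append the new data
        elif current != value:
            # Contradicting data is kept, labeled with the field it belongs to.
            conflicts.append(f"{field}: {value}")
    if conflicts:
        # Hypothetical note format; the real wording is not documented here.
        note = "skeletal data conflict(s): " + "; ".join(conflicts)
        remarks = merged.get("occurrenceRemarks", "")
        merged["occurrenceRemarks"] = f"{remarks}; {note}" if remarks else note
    return merged
```

With this approach an existing collector name is never replaced by the skeletal value; the differing value ends up in occurrenceRemarks where a reviewer can see it.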

The goal of this project is to efficiently digitize collection data by harvesting this information directly from images of the specimen labels. The first step of the workflow is to quickly image all specimen labels, annotations, and notes associated with the specimen and make them available online. Digitization of the label information will be done via an online interface that is optimized by optical character recognition (OCR), natural language processing (NLP), duplicate harvesting, and crowdsourcing.

There is some disagreement over the “best practices” for imaging specimen records. Rather than being able to simply select between a “right” and “wrong” way, one often needs to find a balance between “best” and “practical” solutions. Furthermore, due to the diversity of how lichen and bryophyte collections are stored and managed, workflows and solutions are going to vary relative to the needs and available resources of each institution. Below are some general notes to follow and links to other imaging projects. Note that since this project concentrates on obtaining images of specimen labels rather than the specimens themselves, some of the issues that are relevant to other projects do not apply here.

Recommendations, General Notes, and Good Practices

  • Label Images the Central Goal: Capturing images of all written/typed material associated with each specimen is the primary goal of this project.
  • Image in focus with good lighting: For good OCR results, label images must be in focus. Before each session, ensure that the camera settings are appropriately set and quality images are being produced. Since it is difficult to determine if the image is truly in focus using only the camera’s built-in screen, it is recommended that the camera be plugged into a full-sized computer monitor.
  • Background: Avoid colored or black backgrounds when imaging since they can significantly interfere with OCR output, at least when using Tesseract. A simple white background is preferred!
  • Resolution: For OCR, an x-height of 20 pixels or better is preferred. Some state that the preferred resolution for OCR is 300 dpi (dots per inch); however, dpi values can be misleading when a camera is used rather than a scanner. DPI is only meaningful if the document is captured at 1:1, which is not always the case since a camera's focal distance will vary according to stand placement. Typically, one inch within an image obtained from a camera does not correspond to one inch on the label. Font height is another critical factor affecting OCR results: a 16pt font at 200 dpi will return better OCR results than an 8pt font at 300 dpi. A better measure of image resolution for OCR purposes is obtained by counting the x-height of the text in pixels (x-height is the height of the lower case x). The 20 pixel recommendation comes from Tesseract (an open source OCR program).
  • JPG images: Images submitted to the project should be in the JPG format.
  • Multiple Images per Specimen: Since labels, annotations and notes can be placed on all sides of a specimen packet, multiple images may be needed for each specimen. Multiple images can have a suffix added to the identifier within the file name consisting of an underscore plus a letter or number (e.g. ABC12345678_a.jpg, ABC12345678_2.jpg, etc).
  • Image Submission: The HUB will cover image storage for the project. Images will be transferred to the HUB server via FTP. Images placed within the FTP folders system will be processed nightly. Processing consists of the creation of web image versions and their integration within the CNALH and CNABH data portals. See the Image Submission page for more information.
  • One specimen per Image: There can only be one specimen within each image. It is not uncommon for multiple lichen and bryophyte specimens to be stored together attached to a single large herbarium sheet. In order to conserve space, a few collections are using this opportunity to cut these sheets into smaller specimens that can be stored within card cabinets. Some of the other imaging teams have decided to capture all of the specimens on a sheet within a single image and then break the image into multiple separate images. This is only possible if the following requirements are fulfilled: 1) The camera is powerful enough to capture each label with a 20 pixel x-height. 2) The edge labels are in focus and as clear as those in the middle. 3) The composite image can be efficiently separated into multiple images. See the online video featuring the NY Botanical Garden's imaging workflow to see an example of how this can be done.
  • Capture of Specimen Images is optional and should not hinder the central goal of the project. Since chemical analysis and microscopic images are generally needed for positive identifications of lichens and bryophytes, the general consensus is that macro-images of the whole specimen, even at high resolution, are of little scientific value to the project. However, if an institution is able to capture specimen images without significant extra effort, these images can be incorporated into the data portals by submitting the images in the same manner as the label images. Below are some general rules and comments concerning specimen image capture:
    • Lossless Image for Archive: Original images are best archived using a lossless (http://en.wikipedia.org/wiki/Lossless_data_compression) image format (e.g. RAW, lossless TIFF, etc). To save space, some projects store their archive images as compressed JPGs. If this is the course taken, the images should be modified and resaved as little as possible. Every time you resave a JPG image, information is lost. Note that images of the specimen label do not contain the information detail that requires a TIFF archive. Compressed JPG images should suffice for a label image archive.
    • Maximum resolution: Highest resolution is the best, yet what can practically be captured and stored is dependent on the limitations of the equipment, processing time, and file storage. Computer storage is generally cheap, yet when you are talking about hundreds of thousands of large herbarium specimen images, long-term storage can become problematic.
    • Many herbarium imaging projects capture their archive images at 300-600 dpi. Most store the original images as TIFFs, or an equivalent format. A 300 dpi TIFF image of a typical vascular herbarium specimen (12 x 18”) is roughly equivalent to a 20 megapixel (MP) image. The file size of an uncompressed TIFF is generally 3 bytes for each pixel (20MP = 60MB). One image per specimen for 100,000 herbarium specimens translates to roughly 6TB of storage, which does not include the web versions of the images. If 600 dpi is your goal, multiply these numbers by a factor of 4. Lichen and bryophyte specimens are typically smaller. A 4 x 5” label at 300 dpi = ca. 2 megapixels = ca. 6MB TIFF file = ca. 1MB JPG or smaller depending on compression ratio.
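
The resolution and storage figures in the list above follow from simple arithmetic, sketched here in Python. The function names are hypothetical. Two assumptions are built in: sizes use decimal units (1 MB = 1,000,000 bytes), and the x-height of a font is taken to be about half its point size, which is only a rough rule of thumb that varies by typeface.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def megapixels(width_in, height_in, dpi):
    """Pixel count, in millions, of a width x height inch area captured at dpi."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000


def uncompressed_megabytes(mp, bytes_per_pixel=3):
    """Uncompressed image size in decimal MB, at 3 bytes per pixel by default."""
    return mp * bytes_per_pixel


def archive_terabytes(image_count, mb_per_image):
    """Total archive size in decimal TB (1 TB = 1,000,000 MB)."""
    return image_count * mb_per_image / 1_000_000


def x_height_pixels(point_size, pixels_per_inch, x_height_ratio=0.5):
    """Approximate x-height in pixels; x_height_ratio=0.5 is an assumption."""
    return point_size / POINTS_PER_INCH * pixels_per_inch * x_height_ratio
```

For a 12 x 18 inch sheet at 300 dpi, megapixels gives 19.44 MP, which is about 58 MB uncompressed and about 5.8 TB for 100,000 images. Under the half-point-size assumption, a 16pt font at 200 pixels per inch has an x-height of roughly 22 pixels, while an 8pt font at 300 pixels per inch reaches only about 17 pixels, which is why the larger font at lower resolution is expected to give better OCR results.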

See also Mike Bevans' (New York Botanical Garden's Information Manager of Digitization) blog at: http://www.digitalphotorepro.blogspot.com/


Data capture will involve a semi-automated process consisting of the following steps (figure 1):

  1. Imaging all specimen labels 
  2. Automated scripts will prepare images for label processing, storage and web access
  3. Images of labels will be converted to text using Optical Character Recognition (OCR)
  4. Text will be parsed into appropriate database fields employing Natural Language Processing (NLP)
  5. Human-assisted review with the ability to manually edit or enter the data as necessary
  6. Semi-automated geo-referencing
  7. Data quality control procedures  

A team of at least two people collaborating with the local curator(s) or collections manager(s) will be responsible for imaging specimens at each imaging institution. A central team at WIS consisting of an IT coordinator and a transcription/geo-referencing specialist will be responsible for maximal automation of the image preparation, and the initial transcription and geo-referencing processes. Final editing of label and location information including entering information for handwritten labels will be handled in collaboration between the central institution, the imaging institutions and the institutions owning the specimens, employing transcription assistants and frequently help from trained volunteers. Local imaging teams will collaborate with the central team to develop standard protocols, which will be available on-line as video training modules and other documentation. For a more detailed description of how the web portals will be utilized to process the specimen labels, visit the documentation covering the web portals (http://lbcc.limnology.wisc.edu/node/6).

[Figure 1: workflow diagram]

Institutional imaging teams will consist of a minimum of two people working together to capture images of labels, annotations, and notes for all specimens included in the proposal (for additional imaging of actual specimens, see below). Lichens and bryophytes frequently are archived in paper packets, multiples of which may be affixed to a larger herbarium sheet, or they may be stored upright individually in boxes or drawers. Labels usually are on the outside of the packet with additional information like annotations, chemistry, etc. in other places, including within the packet. Therefore, imaging all information frequently will involve opening the packet and in some cases taking several images for each specimen. For increased speed and efficiency, cameras and light stands will be used to capture the images instead of using a flatbed scanner; this approach has been used successfully in similar projects by some of the participants (e.g., ASU, F, WTU). All specimens will be barcoded. Depending on the institution, the barcode will identify the specimen using a variable combination of the elements of a Globally Unique Identifier (GUID; InstitutionCode : CollectionCode : CatalogNumber). The barcode number will later be linked to the specimen in the database during the transcription process. Ideally, all images for one specimen will have the barcode label visible. This is possible if the barcode label is not attached to the packet until after the images are captured. Images will be renamed using the barcode identifier and a letter suffix to denote multiple images of the same specimen. Typically, renaming the images using a barcode reader as part of the imaging workflow tends to be the most reliable method. However, the barcode can also be captured using Optical Character Recognition (OCR) software to read the barcode as well as the text of the barcode.
Experience has shown that a team of two people can capture images of labels, annotations, and notes for between 300 and 400 specimens within a work day, which is less than reported for imaging vascular plant specimens due to the different storage methods. Since images will be taken in groups of specimens from each species, images will be stored in folders labeled with the institution and species name. At the end of each day, images organized in these folders will be uploaded to a central storage facility (central server in figure 1). For more information, visit the image page (http://lbcc.limnology.wisc.edu/node/4).

Once images are uploaded to the central server, processing scripts will manipulate them in preparation for web access, label transcription, and archiving. The barcode will be obtained from the image file name or directly from the image using OCR. In cases where OCR is employed to capture the barcode, the barcode as well as the text of the barcode will be read and compared as a verification step. The image file will be renamed to contain the barcode as part of its name.
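
Reading the barcode back out of an image file name can be sketched as follows in Python. This assumes the naming convention given in the imaging recommendations (identifier, then an optional underscore plus a letter or number, then .jpg, as in ABC12345678_a.jpg) and further assumes that the identifier itself contains no underscore. The function names are hypothetical and this is not the project's processing script.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# identifier, optional "_suffix", then .jpg or .jpeg (any letter case)
FILE_NAME = re.compile(
    r"^(?P<identifier>[^_.]+)(?:_(?P<suffix>[A-Za-z0-9]+))?\.jpe?g$",
    re.IGNORECASE,
)


def parse_image_name(path):
    """Return (identifier, suffix) for an image path, or None if it does not fit."""
    match = FILE_NAME.match(PurePosixPath(path).name)
    if match is None:
        return None
    return match.group("identifier"), match.group("suffix") or ""


def group_by_specimen(paths):
    """Map each identifier to the list of image paths that carry it."""
    groups = defaultdict(list)
    for path in paths:
        parsed = parse_image_name(path)
        if parsed is not None:
            groups[parsed[0]].append(path)
    return dict(groups)
```

Grouping by identifier gives every image of one specimen under a single key, which is the link between the image files and the specimen record described above. Files that do not match the pattern are left out rather than guessed at.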

The barcode will be obtained from the image file name and is used as the primary key in the database while the species name and holding institution will be obtained from the folder name and parsed into the appropriate database fields; thereby linking a database record to the image(s) and specimen. OCR is then used on typed labels to transcribe the label text from the image. Extensive experience at ASU (where the PI and co-PI worked until recently) and WTU assures that widely used fonts have an acceptable transcription rate with some known problems (e.g., the numeral one is frequently confused with the letter l). Initially this information is stored as one text block in the database record and then parsed applying Natural Language Processing (NLP) algorithms. To optimize NLP, extensive lookup tables will be developed containing collector names (including abbreviations and common misspellings), collection number formats, and date formats. Large thesauri are available and will be expanded to include additional species names with authorities, and geographic names. Within the CNALH, such a thesaurus already exists containing all lichen species names from the Index Fungorum, Integrated Taxonomic Information System (ITIS), and the list of lichens for North America published by Ted Esslinger (2010) among other international sources. The CNABH has been building up a thesaurus of species names starting with ITIS taxonomy and augmenting the liverwort and hornwort data with a taxonomy supplied by F and the Early Land Plants Today project (http://www.mapress.com/phytotaxa/content/2010/f/pt00009p021.pdf). A moss thesaurus is currently being compiled from collaborating institutional resources with major inputs from DUKE, MO and NY. Extensive experience with this approach at ASU has shown that species name, collector name, collection number, and date can be parsed reliably in about 75% of labels. The species name will also be available from the folder name. 
Comparing the two and checking against entries in an authority table will provide a high degree of reliability. Additional improvements will be made on batches of labels that have the same layout, e.g., from the same collector, or when a herbarium has used pre-printed label forms. The system will be trainable and certain information may be parsed into appropriate database fields based on its location on the label. At this point of automation, considerable information is computer accessible and searchable. Records meeting minimum requirements can be published although still marked as needing manual proofing. These records are then made available in the consortia for limited analytical purposes. Duplicate specimens will be further processed using the FilteredPush approach (Macklin et al. 2009), which will enable the recognition of these specimens as low priority for manual checking and link all duplicates of a specimen to avoid redundancy in manual editing efforts. Geographic information can then be searched and specimens will be grouped for rapid geo-referencing.
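
The cross-check between the species name parsed from the label and the species name taken from the folder could look roughly like the Python sketch below. The function name and the three status labels are hypothetical, the authority table is reduced to a plain list of accepted names, and matching is exact after normalizing case and spacing; the project's actual NLP matching is not described in this detail.

```python
def reconcile_species(parsed_name, folder_name, authority):
    """Compare two name sources against an authority list.

    Returns (accepted_name_or_None, status), where status is one of the
    hypothetical labels "confirmed", "single-source", or "needs-review".
    """
    def normalize(name):
        # Collapse runs of whitespace and ignore letter case.
        return " ".join(name.split()).lower()

    lookup = {normalize(name): name for name in authority}
    from_label = lookup.get(normalize(parsed_name))
    from_folder = lookup.get(normalize(folder_name))

    if from_label and from_folder:
        if from_label == from_folder:
            return from_label, "confirmed"
        return None, "needs-review"  # both valid but they disagree
    if from_label or from_folder:
        # Only one source matches the authority list, e.g. after an OCR error.
        return from_label or from_folder, "single-source"
    return None, "needs-review"
```

For example, if OCR reads the numeral one in place of the letter l, the label name fails the authority lookup while the folder name still matches, so the record receives the folder name with the "single-source" status instead of being accepted silently.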

To ensure data of high quality, the OCR and NLP results for each specimen label will be reviewed by personnel of the institution owning the specimen, a national network of volunteers, or hourly workers coordinated by a central volunteer coordinator. Handwritten labels and those that failed transformation via OCR for other reasons will have to be keyed into the database at this point. A web application is currently being developed for this purpose as part of the SYMBIOTA package. This application will allow the editor to view the label image and the OCR/NLP results in the database and then edit the database fields accordingly. Having this editing step on the web provides the opportunity for remote access to editing tasks. This will allow for major volunteer involvement comparable to the successful British program ‘Herbaria@Home’ (see outreach section for more detail).

TDWG-ratified geo-referencing protocols and standards (http://wiki.tdwg.org/Geospatial, Chapman and Wieczorek 2006) will be followed. Existing scripts will be used to obtain decimal coordinates for records with UTM and Township, Range, Section (TRS) information. BioGeoMancer and GEOLocate geo-referencing services will be used as appropriate. Geo-referencing will only need to be completed for ca. 30% or less of the specimens that have not been originally geo-referenced by the collector because collectors typically collect multiple specimens at any one location. This process will take place primarily during the final year of the project, thus increasing efficiency by allowing geo-referencing to be done as a batch process performed on the project’s combined dataset. In this manner, coordinates can be accurately and efficiently assigned to multiple specimens that share matching locality descriptions. Based on earlier correspondence, we infer that the HUB will be involved in this process, as this will be central to all collections being digitized in the ADBC program.
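
Batch geo-referencing depends on finding specimens that share a locality description. A minimal Python sketch of that grouping step is shown below. It assumes records are dictionaries using the Darwin Core field names country, stateProvince, county, locality, decimalLatitude, and decimalLongitude, and it treats localities as matching when they are identical after ignoring case, punctuation, and extra spaces. The function names are hypothetical and the real matching used by the project may be more tolerant.

```python
import re
from collections import defaultdict

LOCALITY_FIELDS = ("country", "stateProvince", "county", "locality")


def locality_key(record):
    """Build a comparison key from the locality fields of one record."""
    text = " | ".join(record.get(field, "") for field in LOCALITY_FIELDS).lower()
    # Replace punctuation and whitespace runs with one space; keep the separators.
    text = re.sub(r"[^\w|]+", " ", text)
    return " ".join(text.split())


def group_for_georeferencing(records):
    """Group catalog numbers of records lacking coordinates by locality key."""
    groups = defaultdict(list)
    for record in records:
        if record.get("decimalLatitude") and record.get("decimalLongitude"):
            continue  # already geo-referenced, e.g. by the collector
        groups[locality_key(record)].append(record["catalogNumber"])
    return dict(groups)
```

Each group then needs its coordinates determined only once, after which the result can be assigned to every catalog number in that group.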

Due to the unique storage of lichens and bryophytes, we cannot at this time provide specimen images together with the label images for the bulk of collections proposed to be digitized here. Since both lichens and bryophytes are usually rather small organisms that require a lens-view and often also microscopic details for taxonomic purposes, either high-resolution scans or direct macroscopic and microscopic imaging are required; high-resolution specimen imaging even with expensive cameras will not result in sufficient resolution to elucidate the necessary specimen details (this has been tested by participating institutions). Thus, meaningful specimen imaging will involve a time investment of at least 5 minutes per specimen (macroscopic shots only) and up to 15 minutes or more if microscopic images are necessary; in addition, such imaging can only be done by personnel with taxonomic training as it is necessary to take images that show representative details of a specimen. This effort would represent a taxonomic review and verification of specimen identification, which is not part of this funding request. Therefore, instead of bulk specimen imaging, we propose a mixed strategy where participants place already existing images at the disposal of the project and one or two selected and representative specimens will be imaged for each species by taxonomic experts of each participating institution as part of existing synergistic research programs. The consortium web sites allow for uploading such images and linking them to a specific specimen using the barcode and from there to the species in general using the species name.

  1. Imaging of Specimen Labels
  2. Label Image Processing
  3. Label Information Pre-Processing
  4. Manual Label Processing
  5. Geo-referencing
  6. Specimen Imaging

This material is based upon work supported by the National Science Foundation grant ADBC#1115116. Any opinions, findings, conclusions, or
recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.