Image Recommendations

The goal of this project is to efficiently digitize collection data by harvesting this information directly from images of the specimen labels. The first step of the workflow is to quickly image all specimen labels, annotations, and notes associated with the specimen and make them available online. Digitization of the label information will be done via an online interface that is optimized by optical character recognition (OCR), natural language processing (NLP), duplicate harvesting, and crowdsourcing.

There is some disagreement over the “best practices” for imaging specimen records. Rather than simply selecting between a “right” and a “wrong” way, one often needs to find a balance between “best” and “practical” solutions. Furthermore, because of the diversity in how lichen and bryophyte collections are stored and managed, workflows and solutions will vary with the needs and available resources of each institution. Below are some general notes to follow and links to other imaging projects. Note that since this project concentrates on obtaining images of specimen labels rather than the specimens themselves, some of the issues that are relevant to other projects do not apply here.

Recommendations, General Notes, and Good Practices

  • Label Images the Central Goal: Capturing images of all written/typed material associated with each specimen is the primary goal of this project.
  • Image in focus with good lighting: For good OCR results, label images must be in focus. Before each session, ensure that the camera settings are correct and that quality images are being produced. Since it is difficult to determine whether an image is truly in focus using only the camera’s built-in screen, it is recommended that the camera be connected to a full-sized computer monitor.
  • Background: Avoid colored or black backgrounds when imaging since they can significantly interfere with OCR output, at least when using Tesseract. A simple white background is preferred.
  • Resolution: For OCR, an x-height of 20 pixels or better is preferred. Some state that the preferred resolution for OCR is 300 dpi (dots per inch); however, dpi values can be misleading when a camera is used rather than a scanner. DPI is only meaningful if the document is captured at 1:1 scale, which is not always the case because the distance between camera and label varies with stand placement. Typically, one inch within an image obtained from a camera does not correspond to one inch on the label. Furthermore, font height is another critical factor affecting OCR results. A 16pt font at 200 dpi will return better OCR results than an 8pt font at 300 dpi. A better measure of image resolution for OCR purposes is the x-height of the text in pixels (x-height is the height of the lowercase x). According to the documentation for Tesseract (an open source OCR program), an x-height of 20 pixels or better is preferred.
  • JPG images: Images submitted to the project should be in the JPG format.
  • Multiple Images per Specimen: Since labels, annotations, and notes can be placed on all sides of a specimen packet, multiple images may be needed for each specimen. Multiple images can have a suffix added to the identifier within the file name consisting of an underscore plus a letter or number (e.g. ABC12345678_a.jpg, ABC12345678_2.jpg, etc.).
  • Image Submission: The HUB will cover image storage for the project. Images will be transferred to the HUB server via FTP. Images placed within the FTP folder system will be processed nightly. Processing consists of creating web versions of the images and integrating them into the CNALH and CNABH data portals. See the Image Submission page for more information.
  • One specimen per Image: There can only be one specimen within each image. It is not uncommon for multiple lichen and bryophyte specimens to be stored together, attached to a single large herbarium sheet. In order to conserve space, a few collections are using this opportunity to cut these sheets into smaller specimens that can be stored within card cabinets. Some of the other imaging teams have decided to capture all of the specimens on a sheet within a single image and then split that image into separate images, one per specimen. This is only possible if the following requirements are fulfilled: 1) The camera is powerful enough to capture each label with a 20 pixel x-height. 2) The edge labels are in focus and as clear as those in the middle. 3) The composite image can be efficiently separated into multiple images. See the online video featuring the NY Botanical Garden's imaging workflow for an example of how this can be done.
  • Capture of Specimen Images is optional and should not hinder the central goal of the project. Since chemical analysis and microscopic images are generally needed for positive identifications of lichens and bryophytes, the general consensus is that macro-images of the whole specimen, even at high resolution, are of little scientific value to the project. However, if an institution is able to capture specimen images without significant extra effort, these images can be incorporated into the data portals by submitting them in the same manner as the label images. Below are some general rules and comments concerning specimen image capture:
    • Lossless Image for Archive: Original images are best archived using a lossless (http://en.wikipedia.org/wiki/Lossless_data_compression) image format (e.g. RAW, lossless TIFF, etc). To save space, some projects store their archive images as compressed JPGs. If this is the course taken, the images should be modified and resaved as little as possible. Every time you resave a JPG image, information is lost. Note that images of the specimen label do not contain the information detail that requires a TIFF archive. Compressed JPG images should suffice for a label image archive.
    • Maximum resolution: The highest resolution is best, yet what can practically be captured and stored depends on the limitations of the equipment, processing time, and file storage. Computer storage is generally cheap, yet when you are talking about hundreds of thousands of large herbarium specimen images, long-term storage can become problematic.
    • Many herbarium imaging projects capture their archive images at 300-600 dpi. Most store the original images as TIFFs, or an equivalent format. A 300 dpi TIFF image of a typical vascular herbarium specimen (12 x 18”) is roughly equivalent to a 20 megapixel (MP) image. The file size of an uncompressed TIFF is generally 3 bytes for each pixel (20MP = 60MB). One image per specimen for 100,000 herbarium specimens translates to roughly 6TB of storage (100,000 x 60MB), which does not include the web versions of the images. If 600 dpi is your goal, multiply these numbers by a factor of 4. Lichen and bryophyte specimens are typically smaller. A 4 x 5” label at 300 dpi = ca. 2 megapixels = ca. 6MB TIFF file = ca. 1MB JPG or smaller depending on compression ratio.
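
The resolution and storage arithmetic in the notes above can be checked with a short script. The following Python sketch is illustrative only and is not part of the project's tooling. The function names are hypothetical, and the x-height ratio of 0.5 (x-height as a fraction of the nominal font size) is an assumed, typeface-dependent approximation. Measuring the x-height directly in a test image remains the more reliable check.

```python
# Illustrative sketch only: function names and the default x-height ratio
# are assumptions, not part of the project's workflow or of Tesseract.

def effective_ppi(image_width_px: int, field_width_in: float) -> float:
    """Pixels per inch actually achieved by a camera: the image width in
    pixels divided by the real-world width (in inches) covered by the frame."""
    return image_width_px / field_width_in


def x_height_px(font_pt: float, ppi: float, x_height_ratio: float = 0.5) -> float:
    """Approximate x-height in pixels for text of a given point size.

    One point is 1/72 inch. The x-height ratio is assumed here to be 0.5;
    the real value varies by typeface.
    """
    return font_pt / 72.0 * x_height_ratio * ppi


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bytes_per_pixel: int = 3) -> int:
    """Size of an uncompressed 24-bit image (3 bytes per pixel)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bytes_per_pixel


if __name__ == "__main__":
    # Font size matters as much as dpi: compare the two cases from the text.
    print(f"16pt at 200 ppi: {x_height_px(16, 200):.1f} px x-height")
    print(f" 8pt at 300 ppi: {x_height_px(8, 300):.1f} px x-height")

    # Storage for 12 x 18 inch sheets at 300 dpi, 100,000 specimens.
    per_sheet = uncompressed_bytes(12, 18, 300)
    print(f"Per sheet: {per_sheet / 1e6:.1f} MB")
    print(f"100,000 sheets: {per_sheet * 100_000 / 1e12:.2f} TB")

    # A 4 x 5 inch label at 300 dpi.
    print(f"Per label: {uncompressed_bytes(4, 5, 300) / 1e6:.1f} MB")
```

Under the assumed ratio, the 16pt text at 200 ppi clears the 20 pixel guideline while the 8pt text at 300 ppi does not, which matches the point made under Resolution. The storage figures use decimal units (1 MB = 1,000,000 bytes) and ignore file headers and any compression.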

See also Mike Bevans' (New York Botanical Garden's Information Manager of Digitization) blog at: http://www.digitalphotorepro.blogspot.com/

This material is based upon work supported by the National Science Foundation grant ADBC#1115116. Any opinions, findings, conclusions, or
recommendations expressed in the material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.