Date of Award

Fall 12-2010

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Committee Director

Steven J. Zeil

Committee Member

Kurt Maly

Committee Member

Mohammad Zubair

Call Number for Print

Special Collections LD4331.C65 M87 2010

Abstract

In recent years, there has been tremendous growth in the online availability of resources that had previously been restricted to paper archives. OCR (Optical Character Recognition) tools can be used to digitize an existing corpus and make it available online. A number of federal agencies, universities, laboratories, and companies are placing their collections online and making them searchable via metadata fields such as author, title, and publishing organization. Manually creating metadata for a large collection is extremely time-consuming, and the task is difficult to automate, particularly for collections whose documents vary in layout and structure. The Extract project at ODU has developed an automated metadata extraction system to support such collections. A template language was developed to describe where metadata appears within these varied layouts, and it has been used for several years. An alternative template language based on XPath was later proposed. This thesis presents the implementation of a Java-based interpreter that executes the new template language to extract metadata from documents, and it evaluates the relative power and understandability of the two template languages.

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

DOI

10.25777/vekv-w277
