Date of Award

Winter 2006

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

Committee Director

Kurt Maly

Committee Director

Mohammad Zubair

Committee Member

Steven Zeil

Committee Member

Frank C. Thames

Committee Member

C. Michael Overstreet

Committee Member

Ravi Mukkamala

Abstract

With the growth of the Internet and related tools, there has been a rapid growth of online resources. In particular, by using high-quality OCR (Optical Character Recognition) tools it has become easy to convert an existing corpus into digital form and make it available online. However, a number of organizations have legacy collections that lack metadata. The lack of metadata hampers not only the discovery and dispersion of these collections over the Web, but also their interoperability with other collections. Unfortunately, manual metadata creation is expensive and time-consuming for a large collection, and most existing automated metadata extraction approaches have focused on specific domains and homogeneous collections.

Developing an approach to extract metadata automatically from a large number of challenges. In particular, the heterogeneous legacy collection poses a following issues need to be addressed: (1) Heterogeneity, i.e. how to achieve a high accuracy for a heterogeneous collection; (2) Scaling, i.e. how to apply an automated metadata extraction approach to a very large collection; (3) Evolution, i.e. how to process new documents added to a collection over time; (4) Adaptability, i.e. how to apply an approach to a new document collection; (5) Complexity, i.e. how many document features can be handled, and how complex the features should be.

In this dissertation, we propose a template-based metadata extraction approach to address these issues. The key idea of addressing the heterogeneity is to classify documents into equivalent groups so that each document group contains similar documents only. Next, for each document group we create a template that contains a set of rules to instruct a template engine how to extract metadata from documents in the group. Templates are written in an XML-based language and kept in separate files. Our approach of decoupling rules from programming codes and representing them in a XML format is easy to adapt to another collection with documents in different styles.

We developed our test bed by downloading about 10,000 documents from DTIC (Defense Technical Information Center) document collection that consists of scanned versions of documents in PDF (Portable Document Format) format. We have evaluated our approach on the test bed consisting of documents from DTIC collection, and our results are encouraging. We have also demonstrated how the extracted metadata can be utilized to integrate our test bed with an interoperable digital library framework based on OAI (Open Archives Initiative).

DOI

10.25777/3w53-dq19

ISBN

9780549083290

Share

COinS