WHIRL(ID:8182/)Word-based Heterogeneous Information Representation Language Cohen ATT Shannon Labs 1997 References: sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval. WHIRL allows queries that integrate information from information sources, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This leads to a reduction in the amount of human engineering required to field an integration system. External link: Online copy that synergistically combines properties of logic-based and text-based representation systems. WHIRL is a subset of non-recursive Datalog that has been extended by introducing an atomic type for textual entities, an atomic operation for computing textual similarity, and a "soft" semantics; that is, inferences in WHIRL are associated with numeric scores, and presented to the user in decreasing order by score. We show that WHIRL strictly generalizes both ranked retrieval of documents, and logical deduction; that non-trivial queries about large databases can be answered efficiently; that WHIRL can be used to accurately integrate data from heterogeneous information sources, such as those found on the Web; that WHIRL can be used effectively for inductive classification of text; and finally, that WHIRL can be used to semi-automatically generate extraction programs for structured documents. External link: Online copy |