Identifying semantic components

Principal Investigator(s): 
Collin Baker

Identifying potential threats related to weapons of mass destruction (WMD) before they materialize requires the ability to discover and analyze low-observable WMD-related information in data of all types, including social media. To help build the robust natural language understanding (NLU) systems needed for this goal, this project investigates the automatic identification of semantic components: sub-lexical elements of linguistic meaning that can be composed in different ways to capture the meanings of words. It focuses especially on theories and representations of aspects of lexical meaning that are important parts of the cognitive structure on which language understanding relies but that are not easily captured in corpus-derived distributional semantic representations. ICSI researchers will automatically identify semantic components using three kinds of existing resources, brought into register with each other. The first is a set of cross-language datasets documenting variation in semantic categories across languages; these permit the identification of cross-linguistically recurring semantic components that form a universal or near-universal repertoire of semantic building blocks, combining differently in different languages. The second is a set of richly detailed lexical resources, such as FrameNet and WordNet, which explicitly capture semantic relations among words, including underlying conceptual gestalts or “bundles” of meaning that are central to language understanding but are rarely expressed directly. The third is a body of corpus-derived word co-occurrence statistics. The project will use machine learning methods to identify semantic components from the juxtaposition of these resources, and will assess the resulting semantic representations against human word similarity judgments so that their performance can be compared with that of other approaches to semantic representation.
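As a rough illustration of the evaluation step described above, the sketch below scores a set of word vectors against human similarity ratings using Spearman rank correlation over cosine similarities, a common way of making such comparisons. It is not the project's own method or code: the function names, the two-dimensional vectors, and the ratings are all hypothetical, chosen only to make the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, judgments):
    """Correlate model similarities with human ratings for shared word pairs."""
    model_scores, human_scores = [], []
    for (w1, w2), rating in judgments.items():
        # Skip pairs the representation does not cover.
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)


# Hypothetical toy data, for illustration only.
toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.3, 0.9],
}
toy_judgments = {
    ("cat", "dog"): 9.0,
    ("car", "truck"): 8.5,
    ("dog", "truck"): 3.0,
    ("cat", "car"): 2.0,
}

rho = evaluate(toy_vectors, toy_judgments)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```

A rank correlation is used here because human ratings and model similarities lie on different scales; only the relative ordering of word pairs is compared. Running the same function over several candidate representations would give one figure per representation for side-by-side comparison.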

Funded by the Defense Threat Reduction Agency (DTRA).