The PanLex Project

David Kamholz, Jonathan Pool, Susan Colowick, and Laura Welcher

PanLex Project

Tuesday, January 27, 2015
12:30 p.m., Conference Room 5A

The goal of open-domain automatic translation of text among all human languages is ambitious, but what about the more limited goal of translating any lexeme into any language? Even this involves trillions of mappings, and quality degrades when translations are inferred through intermediate languages. From 2006 to 2010, researchers at the University of Washington’s Turing Center, headed by Oren Etzioni, combined digital dictionaries into a “translation graph” and inferred translations automatically with greater efficiency than existing methods allowed. Their graph, built mainly from Wiktionaries and from Freelang and Freedict dictionaries, contained about 3 million lexemes. They discovered computationally practical sampling algorithms for inferring new translations without loss of precision.
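To make the idea of a translation graph concrete, here is a minimal sketch in Python. It is not the Turing Center's algorithm or PanLex's data model: the word pairs are invented toy data, the function names are hypothetical, and the scoring is a naive count of intermediate lexemes rather than the sampling methods mentioned above. It only illustrates why inference through a polysemous intermediate word (English "spring") can produce a wrong candidate, and why support from several independent paths helps.

```python
from collections import defaultdict

# Toy attested translation pairs (invented for illustration).
# Each node is a (language code, lexeme) tuple.
ATTESTED = [
    (("eng", "spring"), ("fra", "printemps")),
    (("eng", "spring"), ("fra", "ressort")),
    (("eng", "spring"), ("deu", "Frühling")),
    (("eng", "spring"), ("deu", "Feder")),
    (("fra", "printemps"), ("deu", "Frühling")),
    (("fra", "ressort"), ("spa", "muelle")),
    (("spa", "muelle"), ("deu", "Feder")),
]


def build_graph(pairs):
    """Build an undirected graph: lexeme node -> set of attested translations."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def infer_translations(graph, source, target_lang):
    """Naive two-step inference (an assumption of this sketch).

    For each lexeme in target_lang that is not directly attested as a
    translation of source, count how many distinct intermediate lexemes
    connect it to source.
    """
    support = defaultdict(int)
    for intermediate in graph[source]:
        for candidate in graph[intermediate]:
            if (
                candidate[0] == target_lang
                and candidate != source
                and candidate not in graph[source]
            ):
                support[candidate[1]] += 1
    return dict(support)


graph = build_graph(ATTESTED)
result = infer_translations(graph, ("fra", "ressort"), "deu")
print(result)
```

In this toy graph, French "ressort" (a mechanical spring) reaches German "Feder" through two intermediates (English "spring" and Spanish "muelle") but reaches "Frühling" (the season) only through the ambiguous English word, so the wrong candidate receives less support.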

Now sponsored by The Long Now Foundation in San Francisco and partnering with The Rosetta Project and the Internet Archive, the PanLex project is building out this translation graph and making it available to research and development communities via an API, monthly database snapshots, and online applications. PanLex’s database has now grown to contain 21 million lexemes, in about 10,000 languages and dialects, written in 60 scripts, with 1.2 billion attested pairwise translations. No longer a proof-of-concept project, it aspires to integrate all known lexical translations into a consistent data structure, supporting theoretical and applied research in language typology, semantic universals, machine translation, web search, text summarization, human-computer interaction, controlled languages, endangered-language revitalization, biolinguistic diversity, and other fields.

PanLex is open-source and employs Unicode, PostgreSQL, GNU/Linux, and other open standards and open-source software. Its content is currently curated by a small team, but crowdsourced contributions are planned.

Bios:

David Kamholz, lexical data specialist, received his linguistics Ph.D. from UC Berkeley in 2014. He has worked on computational lexicography and the historical linguistics of Austronesian languages.

Jonathan Pool, project director, has a Ph.D. in political science from The University of Chicago. His research and teaching have focused on language politics and policy, and on language choice as a decision-theoretic problem.

The PanLex team also includes research associate Susan Colowick, Rosetta Project director Laura Welcher, local volunteers, and occasional interns. The project’s steering committee includes Emily Bender at the University of Washington and Steven Bird at the University of Melbourne. The project has a 19-member advisory committee.