UBC, Berkeley researchers reconstruct ancient languages

University of British Columbia and Berkeley researchers have used a sophisticated new computer system to quickly reconstruct protolanguages – the rudimentary ancient tongues from which modern languages evolved.


The results, which are 85 per cent accurate when compared to the painstaking manual reconstructions performed by linguists, will be published next week in the Proceedings of the National Academy of Sciences.


“We’re hopeful our tool will revolutionize historical linguistics much the same way that statistical analysis and computer power revolutionized the study of evolutionary biology,” says UBC Assistant Prof. of Statistics Alexandre Bouchard-Côté, lead author of the study.


“And while our system won’t replace the nuanced work of skilled linguists, it could prove valuable by enabling them to increase the number of modern languages they use as the basis for their reconstructions.”


Protolanguages are reconstructed by grouping words with common meanings from related modern languages, analyzing common features, and then applying sound-change rules and other criteria to derive the common parent.
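
The first step described above, grouping words with a common meaning across related modern languages, can be sketched in a few lines of Python. This is only an illustration: the function name `group_by_meaning` and the data layout are assumptions made for this example, and the words are taken from two rows of the example table at the end of this article.

```python
from collections import defaultdict

# (language, meaning, word) triples copied from the article's example table.
ENTRIES = [
    ("Fijian", "bird", "manumanu"),
    ("Melanau", "bird", "manuk"),
    ("Inabaknon", "bird", "manok"),
    ("Fijian", "wind", "cagi"),
    ("Melanau", "wind", "parjay"),
    ("Inabaknon", "wind", "bariyo"),
]


def group_by_meaning(entries):
    """Collect, for each meaning, the word used in each language."""
    groups = defaultdict(dict)
    for language, meaning, word in entries:
        groups[meaning][language] = word
    return dict(groups)


groups = group_by_meaning(ENTRIES)
for meaning, words in groups.items():
    print(meaning, words)
```

Each group is then the raw material for the later steps: comparing the forms and working back to a common parent.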


The new tool designed by Bouchard-Côté and colleagues at the University of California, Berkeley analyzes sound changes at the level of basic phonetic units, and can operate at much greater scale than previous computerized tools.

"What excites me about this system is that it takes so many of the great ideas that linguists have had about historical reconstruction, and it automates them at a new scale: more data, more words, more languages, but less time," said Dan Klein, an associate professor of computer science at UC Berkeley and co-author of the paper published online today (Feb. 11) in the journal Proceedings of the National Academy of Sciences.


The research team's computational model relies on probabilistic reasoning – which combines logic and statistics to predict an outcome – to reconstruct protolanguages from more than 600 Austronesian languages, drawing on an existing database of more than 140,000 words and replicating with 85 percent accuracy what linguists had done manually. While manual reconstruction is a meticulous process that can take years, this system can perform a large-scale reconstruction in a matter of days or even hours, researchers said.


Not only will this program help linguists rebuild the world's proto-languages faster and on a larger scale, boosting our understanding of ancient civilizations based on their vocabularies, but it can also provide clues to how languages might change years from now.


"Our statistical model can be used to answer scientific questions about languages over time, not only to make inferences about the past, but also to extrapolate how language might change in the future," said Tom Griffiths, associate professor of psychology, director of UC Berkeley's Computational Cognitive Science Lab and another co-author of the paper.


The discovery advances UC Berkeley's mission to make sense of big data and to use new technology to document and maintain endangered languages as critical resources for preserving cultures and knowledge. For example, researchers plan to use the same computational model to reconstruct indigenous North American proto-languages.


Humans' earliest written records date back less than 6,000 years, long after the advent of many proto-languages. While archeologists can catch direct glimpses of ancient languages in written form, linguists typically use what is known as the "comparative method" to probe the past. This method establishes relationships between languages by identifying sounds that change with regularity over time, to determine whether the languages share a common mother language.


"To understand how language changes -- which sounds are more likely to change and what they will become -- requires reconstructing and analyzing massive amounts of ancestral word forms, which is where automatic reconstructions play an important role," said Alexandre Bouchard-Côté, an assistant professor of statistics at the University of British Columbia and lead author of the study, which he started while a graduate student at UC Berkeley.


The UC Berkeley computational model is based on the established linguistic theory that words evolve along the branches of a family tree – much like a genealogical tree – reflecting linguistic relationships that evolve over time, with the roots and nodes representing proto-languages and the leaves representing modern languages.
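
A family tree of this kind can be represented with a very small data structure. The sketch below is illustrative only: the class `LanguageNode` and the internal node names "Proto-A" and "Proto-B" are hypothetical, and the grouping of the three modern languages is invented for the example rather than taken from any real classification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LanguageNode:
    """One node of the tree: a leaf is a modern language, anything else a proto-language."""
    name: str
    children: List["LanguageNode"] = field(default_factory=list)

    def is_modern(self) -> bool:
        return not self.children

    def modern_languages(self) -> List[str]:
        """Names of all leaves below (or at) this node."""
        if self.is_modern():
            return [self.name]
        names = []
        for child in self.children:
            names.extend(child.modern_languages())
        return names

    def proto_languages(self) -> List[str]:
        """Names of the root and all internal nodes below (or at) this node."""
        if self.is_modern():
            return []
        names = [self.name]
        for child in self.children:
            names.extend(child.proto_languages())
        return names


# Hypothetical grouping, for illustration only.
tree = LanguageNode("Proto-A", [
    LanguageNode("Fijian"),
    LanguageNode("Proto-B", [LanguageNode("Melanau"), LanguageNode("Inabaknon")]),
])
print(tree.modern_languages())
print(tree.proto_languages())
```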


Using an algorithm known as a Markov chain Monte Carlo sampler, the program sorted through sets of cognates (words in different languages that share a common sound, history and origin) to calculate the odds of which set is derived from which proto-language. At each step, it stored a hypothesized reconstruction for each cognate and each ancestral language.
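
To give a feel for what a Markov chain Monte Carlo sampler does, here is a toy Metropolis sampler in Python. It is a sketch under strong assumptions and not the researchers' model: the candidate ancestral forms are supplied by hand, the score is simply the total edit distance to the modern words, and "bituon" is a simplified spelling of the table's Inabaknon form. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def total_distance(candidate, daughters):
    return sum(edit_distance(candidate, word) for word in daughters)


def sample_reconstructions(daughters, candidates, steps=5000, seed=0):
    """Metropolis sampler whose target weight is exp(-total edit distance)."""
    rng = random.Random(seed)
    current = rng.choice(candidates)
    history = []
    for _ in range(steps):
        proposal = rng.choice(candidates)  # symmetric proposal
        log_ratio = (total_distance(current, daughters)
                     - total_distance(proposal, daughters))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        history.append(current)  # store the hypothesis held at this step
    return Counter(history)


counts = sample_reconstructions(["biten", "bituon"], ["bituqen", "biten", "kalo"])
print(counts.most_common())
```

Because this toy score knows nothing about regular sound change, it simply favours the candidate closest to the modern words; its output should not be read as a linguistic claim.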


"Because the sound changes and reconstructions are closely linked, our system uses them to repeatedly improve each other," Klein said. "It first fixes its predicted sound changes and deduces better reconstructions of the ancient forms. It then fixes the reconstructions and re-analyzes the sound changes. These steps are repeated, and both predictions gradually improve as the underlying structure emerges over time."

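The back-and-forth Klein describes can be illustrated with a deliberately tiny Python sketch. It is not the published system: it works on single sounds rather than whole words, uses made-up data for three hypothetical languages A, B and C, and picks one best reconstruction per round instead of sampling. Every name in it is an assumption made for the example.

```python
from collections import Counter, defaultdict


def initial_changes(alphabet, stay=0.7):
    """Starting guess: a sound most likely stays the same in a daughter language."""
    other = (1.0 - stay) / (len(alphabet) - 1)

    def prob(language, ancestor, sound):
        return stay if sound == ancestor else other
    return prob


def reconstruct(prob, cognate_sets, alphabet):
    """Fix the sound-change estimates; pick the best ancestral sound per set."""
    reconstructions = []
    for daughters in cognate_sets:
        def score(ancestor):
            p = 1.0
            for language, sound in daughters.items():
                p *= prob(language, ancestor, sound)
            return p
        reconstructions.append(max(alphabet, key=score))
    return reconstructions


def estimate_changes(reconstructions, cognate_sets, alphabet, smoothing=0.1):
    """Fix the reconstructions; re-count how each ancestral sound turned out."""
    counts = defaultdict(Counter)
    for ancestor, daughters in zip(reconstructions, cognate_sets):
        for language, sound in daughters.items():
            counts[(language, ancestor)][sound] += 1

    def prob(language, ancestor, sound):
        seen = counts[(language, ancestor)]
        return (seen[sound] + smoothing) / (sum(seen.values()) + smoothing * len(alphabet))
    return prob


def alternate(cognate_sets, alphabet, rounds=5):
    prob = initial_changes(alphabet)
    for _ in range(rounds):
        reconstructions = reconstruct(prob, cognate_sets, alphabet)
        prob = estimate_changes(reconstructions, cognate_sets, alphabet)
    return reconstructions, prob


ALPHABET = ["p", "t", "k", "f"]
# Made-up data: language B regularly shows "f" where A and C show "p".
COGNATE_SETS = [
    {"A": "p", "B": "f", "C": "p"},
    {"A": "p", "B": "f", "C": "p"},
    {"A": "p", "B": "f", "C": "p"},
    {"A": "t", "B": "t", "C": "t"},
    {"A": "k", "B": "k", "C": "k"},
]
reconstructions, prob = alternate(COGNATE_SETS, ALPHABET)
print(reconstructions, round(prob("B", "p", "f"), 3))
```

In this toy run the reconstructions settle on "p" for the first three sets, and the model learns that an ancestral "p" usually surfaces as "f" in language B, which is the kind of mutual reinforcement the quote describes.
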

BACKGROUND | PROTOLANGUAGES


Most protolanguages do not leave written records, but in some instances reconstructions can be partially verified against ancient texts or literary histories. A notable exception is well-documented Latin, the protolanguage of the Romance languages, which include modern French, Italian, Portuguese, Romanian, Catalan and Spanish.


Examples of Protolanguage Words Reconstructed By UBC Tool

Modern Languages                               Reconstructed Protolanguage
English    Fijian      Melanau    Inabaknon    Manual      Automated
star       kalokalo    biten      bitu’on      bituqen     bituqen
bird       manumanu    manuk      manok        qayam       qayam
wind       cagi        parjay     bariyo       bali        beliu
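
One simple way to quantify how close an automated reconstruction is to a manual one is edit distance: the number of single-character insertions, deletions or substitutions needed to turn one form into the other. The Python sketch below applies it to the three rows above. It is an illustration only; this article does not state how the 85 per cent accuracy figure was computed, and the function name is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# (manual, automated) reconstructions copied from the table above.
RECONSTRUCTIONS = {
    "star": ("bituqen", "bituqen"),
    "bird": ("qayam", "qayam"),
    "wind": ("bali", "beliu"),
}

distances = {meaning: edit_distance(manual, automated)
             for meaning, (manual, automated) in RECONSTRUCTIONS.items()}
print(distances)
```

For these rows the first two pairs are identical and the third differs by two edits (one substituted vowel and one added vowel).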