documentation and description work on little-known languages

Most of the world’s approximately 7000 languages have not been documented in any depth, and many have never been recorded or described at all. At ELA, we work with speakers of lesser-known languages to produce high-quality video and audio recordings, which are then transcribed, annotated, and translated into a language of wider communication whenever possible.

By documenting the vocabulary of a language, analyzing its grammar, and collecting texts, we create a flexible, multi-purpose record that can be useful for learners, linguists, and speakers themselves. Most of ELA’s documentation work is done in the New York City area, in close collaboration with immigrant communities who have brought their languages and cultures with them. We also provide support to speakers and activists working to document their own languages back home, and to aspiring linguists and students interested in getting involved in language documentation and revitalization.

ELA has recorded hundreds of hours of video and audio in some 50 different languages from all over the world, belonging to many different language families. ELA researchers are engaged in ongoing, in-depth documentation of languages as different as Ikota, Gurung, Wakhi, Purhepecha, and Koda. Special attention is given to culture-specific speech genres and unique literary and verbal art traditions: ELA’s archive includes a Torah portion given in Bukhori, oral histories in Nones, and examples of the abaimahani and arumahani genres of Garifuna music.

Watch the Bukhori Torah portion:

Language documentation ensures that future generations will have access to their own linguistic and cultural heritage. It also creates a new presence for the language on the internet, a potentially powerful boost for the prestige and visibility of a language. Only 5 percent of the world’s languages have a real presence online.

For more information on our strategies and techniques for documenting languages, see our How page.



Supported by a National Science Foundation grant, Kratylos is a new online tool under joint development by ELA and computer scientist Raphael Finkel at the University of Kentucky. Kratylos will enable linguists to share and analyze language data more easily and will offer new ways of collecting data online.

Today, a major gap exists in the electronic ecosystem for fieldworkers and other linguists who use software such as Toolbox and FLEx: there is still no easy method for sharing projects containing a lexicon and glossed interlinearized texts in a way that enables complex searches and the elicitation of feedback through the Internet.

Kratylos complements rather than competes with existing database software. It builds on the FieldWorks Language Explorer (FLEx) software developed by SIL, which has a large and highly active international user community, and effectively replicates FLEx’s powerful search features for online users. With Kratylos, a FLEx project can be transformed into a linked online concordance and dictionary, complete with audio and video media: a record of a language.

Why XML?

Best practices in linguistic documentation demand the use of formats that are maximally interoperable and least likely to become obsolete. As a result, linguistic data in electronic format is increasingly being encoded in Extensible Markup Language, XML for short. XML is a format for encoding documents that can be read easily by both humans and computers. Information must be classified using start-tags and end-tags so that each part of the document belongs unambiguously to one of the sections or categories defined in the schema. To take a simple example, what we would write informally as:

Step 1: Find A

Step 2: Find B

Step 3: Connect A to B

would be expressed as the following in XML, where each step is enclosed by a start-tag and an end-tag and the whole set of steps is embedded within a procedure, with its own start- and end-tags.

    <procedure>
        <step number="1">Find A.</step>
        <step number="2">Find B.</step>
        <step number="3">Connect A to B.</step>
    </procedure>

The tags tell us unambiguously where the string begins and ends, and they inform us that it is a step. In the schema employed, steps have the attribute “number”, which marks each step distinctly. There are serious benefits to storing linguistic information (lexical data and interlinearized texts) in a tagged, well-delineated, hierarchical format. The result is a human-readable, unambiguous, and highly interoperable code that can be used for years to come.
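As a minimal sketch of what this tagging makes possible, the following Python snippet reads the procedure example with the standard library's xml.etree.ElementTree module. The variable names are illustrative assumptions and are not part of Kratylos or FLEx.

```python
import xml.etree.ElementTree as ET

# The example procedure: three <step> elements wrapped in one <procedure>.
PROCEDURE_XML = """
<procedure>
    <step number="1">Find A.</step>
    <step number="2">Find B.</step>
    <step number="3">Connect A to B.</step>
</procedure>
"""

root = ET.fromstring(PROCEDURE_XML)

# Each <step> carries its position in the "number" attribute and its content
# between the start-tag and end-tag, so the boundaries never have to be guessed.
steps = {int(step.get("number")): step.text for step in root.findall("step")}

for number in sorted(steps):
    print(f"Step {number}: {steps[number]}")
```

Because the structure is explicit, the same few lines would work on a procedure with any number of steps, in any order.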

While XML is a great way to store linguistic data, there are still no readymade solutions for displaying and searching such data. Many general programs exist for viewing XML more easily, but they are not particularly well suited for linguistic analysis. In addition, XML viewers are stand-alone programs that are not designed to facilitate sharing data through the internet, which is crucial for documentary linguists.

Collaborating and Crowdsourcing With Kratylos

Kratylos will offer a new way for linguists to share their data, whether in XML or other standard formats, in the form of online corpora and dictionaries. This includes transforming XML exports from FLEx into a linked, searchable online corpus (complete with multimedia files) and dictionary.
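To make the idea of a linked concordance and dictionary concrete, here is a rough sketch in Python. It is not Kratylos code: the XML layout below is a hypothetical, heavily simplified stand-in for an interlinear export (not the actual FLEx schema), and the word forms are made-up placeholders rather than data from any language.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified interlinear markup (NOT the real FLEx export schema).
# "wordA", "wordB", "wordC" are placeholders, not forms from any real language.
SAMPLE = """
<text id="demo">
  <sentence n="1">
    <word form="wordA" gloss="dog"/>
    <word form="wordB" gloss="run"/>
  </sentence>
  <sentence n="2">
    <word form="wordA" gloss="dog"/>
    <word form="wordC" gloss="sleep"/>
  </sentence>
</text>
"""

def build_concordance(xml_string):
    """Map each word form to the (text id, sentence number) pairs where it occurs."""
    root = ET.fromstring(xml_string)
    concordance = defaultdict(list)
    for sentence in root.findall("sentence"):
        location = (root.get("id"), int(sentence.get("n")))
        for word in sentence.findall("word"):
            concordance[word.get("form")].append(location)
    return dict(concordance)

def build_dictionary(xml_string):
    """Collect the sorted list of glosses attested for each word form."""
    root = ET.fromstring(xml_string)
    entries = defaultdict(set)
    for word in root.iter("word"):
        entries[word.get("form")].add(word.get("gloss"))
    return {form: sorted(glosses) for form, glosses in entries.items()}

concordance = build_concordance(SAMPLE)
dictionary = build_dictionary(SAMPLE)
print(concordance["wordA"])  # every place the form occurs in the text
print(dictionary["wordA"])   # the glosses recorded for that form
```

The concordance and the dictionary share the word form as a key, which is what lets a dictionary entry link back to every sentence in which the form appears.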

The development of Kratylos began in 2015 and will continue through 2017. The system is being tested with complex real-world language data from four of ELA’s ongoing language documentation projects: Ikota, Koda, Purhepecha, and Wakhi. As data collection and analysis proceed, the FLEx databases for each project are increasing in complexity, allowing us to test Kratylos against a wide range of linguistic issues. By making these projects freely available online as easily searchable corpora and dictionaries, we will be able to involve researchers and community members directly in the documentation process.

The fate of the world’s linguistic diversity may very well hang on our ability to take advantage of “crowdsourcing” strategies for language documentation in the coming decades. Crowdsourcing initiatives are underway for the collection of audio and transcription, perhaps the most effective example being BOLD (Basic Oral Language Documentation) (Bird 2010), in which audio/video data is collected, re-spoken, and then translated orally in a transcription-free workflow. While BOLD targets participants with low exposure to technology and areas that may be off the grid, there is a growing but unmet need for similar strategies aimed at technology-savvy contributors. Kratylos will fill that gap by allowing the guided transcription of texts.

Stay Tuned!

Currently, Kratylos is a working prototype still under development, but it can already create online databases from a user’s uploaded FLEx, Praat, or ELAN data and play associated audio files. Within the next year or two, we hope it will be a valuable public tool, free and easy to use. Email us at info@elalliance.org for more information or if you are interested in uploading your own data to the test site.

[Image: YouTube screenshot, Jose Juarez (ALNY)]

ALNY will be both the first-ever urban language archive and a unique portrait of New York, capturing the linguistic life of the city. The archive will be a resource for communities, scholars, and the general public, for all New Yorkers and for anyone interested in languages and cities. It will be powerful enough to enable research on little-known languages and straightforward enough for schoolchildren to explore. The front end will be an evolving, fully accessible, state-of-the-art audio-visual archive. On the back end, ALNY will be integrated with Kratylos, an innovative software tool for analyzing and crowdsourcing linguistic data, which is currently being built at ELA with the support of a three-year National Science Foundation grant.

Phase I of building ALNY will proceed over the course of 2016. The archive should be fully operational after Phase I, but Phase II will involve building a fully user-friendly front end, full integration with Kratylos, mobile versions, and other features.

To see some of the material that will be housed in ALNY, visit individual language pages, where a sample of our recordings is currently housed. ELA’s YouTube channel features over 200 videos in several dozen languages, including songs in Garifuna, oral histories in little-documented Jewish languages of Central Asia like Juhuri and Bukhori, and much more. ALNY will enable ELA to make all our material fully available, discoverable, and searchable.