Endangered Language Alliance


documentation and description work on little-known languages

Most of the world’s approximately 7000 languages have not been documented in any depth, and many have never been recorded or described at all. At ELA, we work with speakers of lesser-known languages to produce high-quality video and audio recordings, which are then transcribed, annotated, and translated into a language of wider communication whenever possible.

By documenting the vocabulary of a language, analyzing its grammar, and collecting texts, we create a flexible, multi-purpose record that can be useful for learners, linguists, and speakers themselves. Most of ELA’s documentation work is done in the New York City area, in close collaboration with immigrant communities who have brought their languages and cultures with them. We also provide support to speakers and activists working to document their own languages back home, and to aspiring linguists and students interested in getting involved in language documentation and revitalization.

ELA has recorded hundreds of hours of video and audio in some 50 different languages from all over the world, belonging to many different language families. ELA researchers are engaged in ongoing, in-depth documentation of languages as different as Ikota, Gurung, Wakhi, Purhepecha, and Koda. Special attention is given to culture-specific speech genres and unique literary and verbal art traditions: ELA’s archive includes a Torah portion given in Bukhori, oral histories in Nones, and examples of the abaimahani and arumahani genres of Garifuna music.

Watch the Bukhori Torah portion:

Language documentation ensures that future generations will have access to their own linguistic and cultural heritage. It also creates a new presence for the language on the internet, a potentially powerful boost for the prestige and visibility of a language. Only 5 percent of the world’s languages have a real presence online.

For more information on our strategies and techniques for documenting languages, see our How page.

















*For a full description, see Kaufman, Daniel & Raphael Finkel. 2018. Kratylos: A tool for sharing interlinearized and lexical data in diverse formats. Language Documentation & Conservation 12. 124-146.

Supported by a National Science Foundation grant, Kratylos is a new online tool under joint development by ELA and computer scientist Raphael Finkel at the University of Kentucky. Kratylos will enable linguists to share and analyze language data more easily and will offer new ways of collecting data online.

Today, a major gap exists in the electronic ecosystem for fieldworkers and other linguists who use software such as Toolbox and FLEx: there is still no easy method for sharing projects containing a lexicon and glossed interlinearized texts in a way that enables complex searches and the elicitation of feedback through the Internet.

Kratylos complements rather than competes with existing database software, building on the Fieldworks Language Explorer (FLEx) software developed by SIL, which has a large international and highly active user community. Kratylos effectively replicates its powerful search features for online users. With Kratylos, a FLEx project can be transformed into a linked online concordance and dictionary, complete with audio and video media — a record of a language.

Why XML?

Best practices in linguistic documentation demand the use of formats that are maximally interoperable and least likely to become obsolete. As a result, linguistic data in electronic format is increasingly being encoded in Extensible Markup Language, XML for short. XML is a format for encoding documents that can be read easily by both humans and computers. Information must be classified using start-tags and end-tags so that each part of the document belongs unambiguously to one of the sections or categories that exist within the scheme. To take a simple example, what we would write informally as,

Step 1: Find A

Step 2: Find B

Step 3: Connect A to B

would be expressed as the following in XML, where each step is enclosed by a start-tag and an end-tag and the whole set of steps is embedded within a procedure, with its own start- and end-tags.

    <procedure>
        <step number="1">Find A.</step>
        <step number="2">Find B.</step>
        <step number="3">Connect A to B.</step>
    </procedure>

The tags tell us unambiguously where the string begins and ends, and also inform us that it is a step. In the schema employed, steps have the attribute “number”, which marks each step distinctly. There are serious benefits to storing linguistic information (lexical data and interlinearized texts) in a tagged, well-delineated, hierarchical format. The result is a human-readable, unambiguous, and highly interoperable code that can be used for years to come.
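To make the point about machine readability concrete, the following is a minimal sketch of how a program could read the steps back out of such a document. It uses Python's standard `xml.etree.ElementTree` module; the variable and function names (`PROCEDURE_XML`, `read_steps`) are illustrative assumptions and are not part of Kratylos or FLEx.

```python
import xml.etree.ElementTree as ET

# The example procedure from the text, wrapped in its enclosing <procedure> element.
PROCEDURE_XML = """
<procedure>
    <step number="1">Find A.</step>
    <step number="2">Find B.</step>
    <step number="3">Connect A to B.</step>
</procedure>
"""


def read_steps(xml_text):
    """Return a list of (number, text) pairs, one for each <step> element."""
    root = ET.fromstring(xml_text.strip())
    # The "number" attribute identifies each step; the element text is its content.
    return [(int(step.get("number")), step.text) for step in root.findall("step")]


steps = read_steps(PROCEDURE_XML)
for number, text in steps:
    print(f"Step {number}: {text}")
```

Because the tags and the attribute carry the structure, the program never has to guess where a step begins or ends, which is the property that makes the format attractive for lexical data and interlinearized texts.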

While XML is a great way to store linguistic data, there are still no readymade solutions for displaying and searching such data. Many general programs exist for viewing XML more easily, but they are not particularly well suited for linguistic analysis. In addition, XML viewers are stand-alone programs that are not designed to facilitate sharing data through the internet, which is crucial for documentary linguists.

Collaborating and Crowdsourcing With Kratylos

Kratylos will offer a new way for linguists to share their data, whether in XML or other standard formats, in the form of online corpora and dictionaries. This includes transforming XML exports from FLEx into a linked, searchable online corpus (complete with multimedia files) and dictionary.
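As a rough illustration of what turning an XML export into a searchable corpus involves, the sketch below builds a tiny concordance that maps each gloss to the sentences in which it occurs. The XML layout, the element and attribute names, and the word forms are all invented placeholders for this example; they are not the actual FLEx export schema, and the code is not how Kratylos itself is implemented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified interlinear corpus (NOT the real FLEx export format).
# Word forms and glosses are made-up placeholders.
SAMPLE_CORPUS = """
<corpus>
  <sentence id="s1" translation="The dog sleeps.">
    <word form="wa" gloss="dog"/>
    <word form="mi" gloss="sleep"/>
  </sentence>
  <sentence id="s2" translation="The dog eats.">
    <word form="wa" gloss="dog"/>
    <word form="ta" gloss="eat"/>
  </sentence>
</corpus>
"""


def build_concordance(xml_text):
    """Map each gloss to the list of sentence ids in which it appears."""
    root = ET.fromstring(xml_text.strip())
    index = defaultdict(list)
    for sentence in root.iter("sentence"):
        for word in sentence.iter("word"):
            index[word.get("gloss")].append(sentence.get("id"))
    return dict(index)


concordance = build_concordance(SAMPLE_CORPUS)
print(concordance["dog"])
```

A dictionary entry for a gloss could then link to every sentence id listed for it, which is the basic idea behind a linked dictionary and corpus.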

The development of Kratylos began in 2015 and will continue through 2017. The system is being tested with complex real-world language data from four of ELA’s ongoing language documentation projects: Ikota, Koda, Purhepecha, and Wakhi. As data collection and analysis proceed, the FLEx databases for each project are increasing in complexity, allowing us to test Kratylos against a wide range of linguistic issues. By making these projects freely available online as easily searchable corpora and dictionaries, we will be able to involve researchers and community members directly in the documentation process.

The fate of the world’s linguistic diversity may very well hang on our ability to take advantage of “crowdsourcing” strategies for language documentation in the coming decades. Crowdsourcing initiatives are underway for the collection of audio and transcription, perhaps the most effective example being BOLD (Basic Oral Language Documentation) (Bird 2010), in which audio/video data is collected, re-spoken, and then translated orally in a transcription-free workflow. While BOLD targets participants with low exposure to technology and areas that may be off the grid, there is a growing but unmet need for similar strategies aimed at technology-savvy contributors. Kratylos will fill that gap by allowing the guided transcription of texts.

Stay Tuned!

Currently, Kratylos is a working prototype still under development, but it is already able to create online databases from a user’s uploaded FLEx, Praat, or ELAN data, as well as play associated audio files. Within the next year or two, we hope it will be a valuable public tool, free and easy to use. Email us at info@elalliance.org for more information or if you are interested in uploading your own data to the test site.

Jose Juarez (ALNY)

Since its founding in 2010, the Endangered Language Alliance (ELA) has been recording a wide range of unique materials with speakers of over 100 minority, endangered, and Indigenous languages. The core of the collection consists of recordings reflecting the immigrant and diaspora communities in and around New York City, a world capital of linguistic diversity. In 2018, ELA began formalizing its collections as the Archive of the Languages of New York (ALNY) with the dual aim of ensuring the preservation of and providing access to this ever-growing collection.

ELA’s collections include approximately 5 TB (and growing) of primarily born-digital recordings (video, audio, transcriptions, translations, lexical data, and fieldwork sessions). Among the materials are oral histories, historical narratives, songs, folktales, and a variety of other linguistic materials representing both a range of distinctive communities from around the world and the linguistic life of one of the world’s most diverse cities. Many recordings were made in New York over the past decade, while others came out of fieldwork by ELA staff or partners in communities in Belize, Mexico, Nepal, Tajikistan, Turkey, Indonesia, and numerous other sites around the world. A small but valuable portion of the recordings were made on tape before the advent of digital recording and have subsequently been digitized.

ELA’s collections foreground the linguistic and cultural contributions of communities that are underrepresented linguistically, culturally, politically, and otherwise, with the ultimate aim of making them maximally accessible and useful both to community members themselves and to a wider public. The significance of this humanities collection goes beyond linguistics with its offerings such as unique recipes in the Indigenous Mixtec language of Mexico, arumahani a cappella songs by traditional masters from the linked Garifuna communities in Belize and New York, oral histories from the Himalayan diaspora in Queens, folktales from storytellers in the Pamir mountains of Tajikistan, narratives of cultural survival from the Tsou people of Taiwan, and much more.

For many languages, ELA’s collections represent the only recordings or materials available, or at least the only materials that are public, high quality, and have been through some degree of annotation and analysis. This is true for languages such as Seke (Nepal), Ishkashimi (Tajikistan), and several Iranian Jewish languages including Judeo-Kashani and Judeo-Shirazi.

In other cases, ELA’s collections represent the most complete corpora available anywhere for particular languages (examples include Wakhi, Zaza, Koda, Loke, Gurung, Neo-Mandaic, and Bishnupriya Manipuri) or else represent a unique subset of materials that does not exist elsewhere. Examples of the latter include Ladino recordings about Sephardic Jewish history in New York, recordings of the broadcasts of the NYC Indigenous radio stations Alcal and Kichwa Hatari, materials related to health and community among Indigenous Mexican New Yorkers, over 500 diary entries in a dozen languages about the COVID-19 pandemic, and more.

ALNY is a work in progress, but it is currently being set up as a series of public-facing collections as part of the Internet Archive, a long-standing non-profit initiative. These collections are organized by language and by project, but are searchable across a range of metadata information following the Dublin Core-OLAC standard used by many other language archives. When fully public and integrated with ELA’s website and other digital efforts, the archive will be discoverable, searchable, and open to scholars and communities. It will also be integrated with Kratylos, an innovative software tool for analyzing and crowdsourcing linguistic data, which is currently being built at ELA with the support of a National Science Foundation grant.

For more immediate access to a curated set of recordings, visit ELA’s YouTube channel, which has over 900 videos in dozens of languages.


Perlin, Ross. 2019. A Grammar of Trung. Himalayan Linguistics, 18(2). http://dx.doi.org/10.5070/H918244579

Perlin, Ross and Daniel Kaufman (eds). 2019. Languages of New York City (1st and 2nd edition), map. New York: Endangered Language Alliance.

Perlin, Ross. 2019. “Talk of the Town”, Artforum, October 2019.

Kaufman, Daniel and Ross Perlin. 2018. “Language documentation in diaspora communities” in Kenneth Rehg and Lyle Campbell (eds.), The Oxford Handbook of Endangered Languages, Oxford: Oxford University Press.

Gurung, Nawang, Ross Perlin, Daniel Kaufman, Mark Turin, & Sienna R. Craig. 2018. Orality and Mobility: Documenting Himalayan Voices in New York City. Verge: Studies in Global Asias, 4 (2), 64-80.

Kaufman, Daniel and Raphael Finkel. 2018. “Kratylos: A Tool for Sharing Interlinearized and Lexical Data in Diverse Formats.” Language Documentation and Conservation 12: 124-146. Reprinted in CUNY Academic Works.

Borjian, Habib and Daniel Kaufman. 2016. “Juhuri: from the Caucasus to New York City” in Maryam Borjian and Charles Häberl (eds.), Special Issue: Middle Eastern Languages in Diasporic USA Communities, International Journal of Sociology of Language, (237), 51–74.

Perlin, Ross. 2016. “The Race to Save a Dying Language”, Guardian, August 17.

Borjian, Habib and Ross Perlin. 2015. “Bukhori in New York”, Cahier de Studia Iranica 57:15-27.

Perlin, Ross. 2014. “Endangered Speakers”, n+1 20.

Borjian, Habib. 2014. “What Is Judeo-Median—and How Does it Differ from Judeo-Persian?” Journal of Jewish Languages, 2(2): 117-142. doi: https://doi.org/10.1163/22134638-12340026

Blevins, Juliette. 2010. Saving endangered languages in the United States. A Living Legacy: Preserving Intangible Culture. Washington, D.C.: United States Department of State, Bureau of International Information Programs. 6-10.


Kaufman, Daniel. 2017. “Saisiyat Morphology: A Review Article”. Oceanic Linguistics 56(1):278-293.

Kaufman, Daniel. 2013. “A Grammar of Tamambo, the Language of Western Malo, Vanuatu” (review). Oceanic Linguistics 51(3): 286-299.

Kaufman, Daniel. 2012. “Endangered Austronesian and Australian Aboriginal Languages: Essays on language documentation, archiving and revitalization” (review). Oceanic Linguistics 51(2): 589-596.

Kaufman, Daniel. 2007. “Salako or Badameà: Sketch Grammar, Texts and Lexicon of a Kanayatn Dialect in West Borneo” (review). Oceanic Linguistics 46(2): 624-633.


Perlin, Ross. 2020. Counting New York: The City and the Census. New York Public Library. Mar. 4.

Kaufman, Daniel and Ross Perlin. 2019. Memorial Sloan Kettering Cancer Center (Immigrant Health & Cancer Disparities Services).

Kaufman, Daniel. 2019. Language work with Indigenous Immigrants in NYC. CNY Humanities Corridor Workshop: Celebrating Indigenous and Refugee Language Communities in New York State. Cornell University. Sept. 27.

Kaufman, Daniel. 2019. Language revitalization and language access in NYC. Global Language Justice Book Workshop. Mellon-Sawyer Language Justice Project. Columbia University. Aug. 28.

Perlin, Ross and Daniel Kaufman. 2019. Language Access and NYC’s true linguistic diversity. Presentation to the Mayor’s Office of Immigrant Affairs Language Access team. July 10.

Kaufman, Daniel. 2019. Linguistic Research with Diaspora Communities. LSA Summer Institute, UC Davis. June 30.

Kaufman, Daniel. 2019. Making Way for Indigenous Languages in the City: The View From New York. Invited Talk. HELISET TTE SKAL Let the Languages Live Conference. Victoria, British Columbia. June 24.

Perlin, Ross. 2019. “Stateless, oral, immigrant cultures in New York”, Workshop on Language in Its Settings, Columbia University, 31 May 2019.

Kaufman, Daniel. 2019. From field data recording to online interlinear glossed text corpus. NYU Fieldwork Discussion Group. May 3.

Kaufman, Daniel, Habib Borjian, Daniel Barry, Ross Perlin, Kathryn Rafailov and Matthew Zaslansky. 2017. “Endangered Iranian Languages”. North American Conference in Iranian Linguistics (NACIL 1), 2017 April 28-30.

Kaufman, Daniel and Raphael Finkel. 2019. Demonstration of Kratylos software. Technological showcase. International Conference on Language Documentation and Conservation. University of Hawai’i at Manoa.

Kaufman, Daniel. 2019. Community based research across borders. Invited panelist for Bringing Latin American Perspectives on Community Based Research to SSILA at the LSA conference. Jan. 2-4.

Kaufman, Daniel. 2018. Examining Austronesian prosody through the lens of hip-hop. Invited talk. CCLS Lecture Series, University of Cologne, Germany.

Kaufman, Daniel. 2018. Ways of engaging with urban linguistic diversity: A critical view from New York. Plenary talk. Big Cities, Small Languages Conference. ZAS, Berlin. Nov. 14-16.

Kaufman, Daniel. 2018. Indigenous languages in NYC: Ideology and conservation. Indigenous Languages: From Endangerment to Revitalization to Resilience. University of Michigan Center for Southeast Asian Studies. Oct. 25.

Kaufman, Daniel, Tony Woodbury and B’alam Mateo. 2018. Roundtable facilitator on collaborative documentation. Sound Systems of Latin America III. University of Mass. at Amherst. Oct. 19-21.

Alvarez, Jackeline & Daniel Kaufman. 2018. A Comparative Analysis of Alcozauca and Cuautipan Mixteco Deictics. First ILLC Conference. Long Island University. (Part of the NSF REU mentorship Program.)

Kaufman, Daniel. 2018. The Austronesians: Family relations and inter-family contact across six millennia. The Greater South China Sea Interaction Zone: A Workshop to Explore Interdisciplinary Interventions into the Study of the Ancient East Eurasian South. Columbia University.

Kaufman, Daniel. 2018. Discussant for Panel 326: Language Choice and Identity in South and Southeast Asia. Association for Asian Studies 2018 Annual Conference, Washington DC.

Kaufman, Daniel. 2010. “Greenberg’s 16th Slayed in the Bronx?”. Harvard GSAS Workshop in Language Universals and Linguistic Fieldwork, 2010 April 13.