My Research Application & Corpus Data

This page brings together two complementary research outputs developed for research on Turkish–English code-switching.

These two resources are designed to function as an integrated research ecosystem: the TREN annotation system provides the methodological infrastructure, while the intra-word code-switching corpus provides annotated empirical data for quantitative and qualitative linguistic analysis.

Use the links below to access the application repository and documentation, and to download the corpus releases and supporting files.

1. TREN: A Corpus Annotation Tool for Code-Switching Data

TREN is a system developed for the annotation and analysis of Turkish–English code-switching data in corpus-based linguistic research. It is designed as a semi-automatic annotation application that integrates automatic processing with user-controlled manual intervention, enabling transparent and reproducible analysis of code-switching.

The system consists of an interactive graphical annotation interface and an underlying processing pipeline. Through the interface, users can load raw textual data, preprocess it into token-based representations, and inspect or revise automatically assigned labels. The processing pipeline supports language identification, rule-based morphological analysis, and sentence-level computations, facilitating fine-grained analysis of bilingual data.
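The flow described above (raw text, then token-based records, then a sentence-level computation) can be illustrated with a minimal Python sketch. It is not TREN's code: the `Token` record, the label names, the toy English lexicon, and the token-majority rule for the Matrix Language are all assumptions made for illustration, and the criteria TREN actually applies are not described on this page.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical label set; the labels TREN actually uses may differ.
TR, EN, OTHER = "TR", "EN", "OTHER"

# Toy lexicon standing in for a real language identifier (assumption).
ENGLISH_WORDS = {"meeting", "deadline", "the", "is", "tomorrow"}


@dataclass
class Token:
    sentence_id: int  # 1-based sentence number
    index: int        # 1-based position inside the sentence
    form: str         # surface form as it appears in the text
    lang: str         # automatically assigned label, open to manual revision


def guess_language(form: str) -> str:
    """Rough guess: lexicon lookup for English, everything else Turkish."""
    if not any(ch.isalpha() for ch in form):
        return OTHER  # punctuation and numbers
    return EN if form.lower() in ENGLISH_WORDS else TR


def tokenize(text: str) -> List[Token]:
    """Split raw text into sentences, then into labelled token records."""
    tokens = []
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sid, sentence in enumerate(sentences, start=1):
        # Words (keeping internal apostrophes), digit runs, or single punctuation marks.
        forms = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*|\d+|[^\w\s]", sentence)
        for idx, form in enumerate(forms, start=1):
            tokens.append(Token(sid, idx, form, guess_language(form)))
    return tokens


def matrix_language(tokens: List[Token], sentence_id: int) -> Optional[str]:
    """Token-majority heuristic: the most frequent language in the sentence.

    This is a simplification chosen for the sketch, not TREN's criterion.
    Returns None for empty sentences and for ties.
    """
    counts = Counter(
        t.lang for t in tokens
        if t.sentence_id == sentence_id and t.lang != OTHER
    )
    ranked = counts.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


tokens = tokenize("Yarın meeting var. The deadline is tomorrow.")
sentence_labels = {sid: matrix_language(tokens, sid) for sid in (1, 2)}
```

Keeping each token as a small record with its own label is what makes manual revision straightforward: a corrected `lang` value can be written back to one record and the sentence-level result recomputed.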

TREN’s features include:

• semi-automatic token-level language identification for Turkish and English
• support for the annotation of intra-word code-switching structures
• rule-based detection of Turkish morphemes attached to English stems
• morphological glossing based on the Leipzig Glossing Rules
• interactive manual correction and user supervision of automatic labels
• concordance (KWIC) and frequency-based inspection tools
• automatic computation of Matrix Language and Embedded Language at the sentence level
• flexible export of annotated data in .csv and .txt formats
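To make the idea of rule-based detection of Turkish morphemes on English stems more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not TREN's rule set: the stem list and the suffix table are toy resources, only a single case or plural suffix is recognised, and the function names are hypothetical.

```python
from typing import Optional, Tuple

# Toy English stem list (assumption; a real system would use a full lexicon).
ENGLISH_STEMS = {"meeting", "download", "deadline", "post"}

# A few Turkish nominal suffixes with Leipzig-style gloss labels.
# Several forms are ambiguous in real data (e.g. -i can also be a
# possessive); the sketch always picks the case reading.
TURKISH_SUFFIXES = {
    "ler": "PL", "lar": "PL",
    "de": "LOC", "da": "LOC", "te": "LOC", "ta": "LOC",
    "den": "ABL", "dan": "ABL", "ten": "ABL", "tan": "ABL",
    "e": "DAT", "a": "DAT", "ye": "DAT", "ya": "DAT",
    "i": "ACC", "ı": "ACC", "u": "ACC", "ü": "ACC",
    "yi": "ACC", "yı": "ACC", "yu": "ACC", "yü": "ACC",
}


def split_intraword(token: str) -> Optional[Tuple[str, str, str]]:
    """Return (stem, suffix, gloss label) if the token looks like an
    English stem followed by one Turkish suffix, otherwise None."""
    lowered = token.lower().replace("’", "'")
    # Try longer stems first so a short stem cannot shadow a longer one.
    for stem in sorted(ENGLISH_STEMS, key=len, reverse=True):
        if lowered.startswith(stem) and len(lowered) > len(stem):
            # The suffix may be attached directly or after an apostrophe.
            rest = lowered[len(stem):].lstrip("'")
            if rest in TURKISH_SUFFIXES:
                return stem, rest, TURKISH_SUFFIXES[rest]
    return None


def format_gloss(token: str) -> Optional[str]:
    """Build a hyphen-separated gloss such as 'meeting-DAT'."""
    parts = split_intraword(token)
    if parts is None:
        return None
    stem, _suffix, label = parts
    return f"{stem}-{label}"


example = format_gloss("meeting'e")
```

A purely string-based rule like this over-generates: the Turkish word "posta" ("mail") would be segmented as post + DAT. Cases of that kind are the reason automatic labels benefit from interactive manual correction.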

TREN is intended for use by researchers working on bilingual and multilingual language data, particularly in contexts where fine-grained annotation of code-switching is required.

2. Turkish–English Intra-Word Code-Switching Corpus

This corpus focuses specifically on intra-word code-switching, where an English lexical stem combines with Turkish suffixes. It was created as part of a research project on Turkish–English code-switching and was fully annotated using the TREN annotation system. The corpus contains 236 intra-word code-switching instances and a total of 3,132 tokens.

The dataset includes:

• Token-level language labels
• Morphological glossing following the Leipzig Glossing Rules
• Matrix and Embedded Language information
• Sentence-level identifiers
• Structured source metadata
• Raw subreddit data batches
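As a rough sketch of how token-level annotations of this kind might be read and summarised in Python, the example below parses a small CSV sample with the standard library. The column names (`sentence_id`, `token`, `lang`, `gloss`, `matrix_lang`), the `MIXED` label, and the three sample rows are invented for illustration and are not taken from the corpus; check the actual release files for the real schema before adapting it.

```python
import csv
import io
from collections import Counter

# Invented sample rows with hypothetical column names (not corpus data).
SAMPLE = """sentence_id,token,lang,gloss,matrix_lang
1,Yarın,TR,,TR
1,meeting'e,MIXED,meeting-DAT,TR
1,gidiyorum,TR,,TR
"""


def load_rows(handle):
    """Read annotated tokens into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def label_counts(rows):
    """Count how often each token-level language label occurs."""
    return Counter(row["lang"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE))
counts = label_counts(rows)

# Share of tokens carrying the (hypothetical) intra-word label.
intraword_share = counts["MIXED"] / len(rows)
```

For a downloaded file, `io.StringIO(SAMPLE)` would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` points to the exported .csv file.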

Note: Please download the most recent release packages for both the application and the corpus from the GitHub Releases pages to access the complete application files and the full corpus dataset with all associated data.

TREN Application: https://github.com/bostanberkay/TREN/releases

Turkish–English Intra-Word Code-Switching Corpus: https://github.com/bostanberkay/turkish-english-intraword-code-switching-corpus/releases