Search | VHL Regional Portal

Languages with more speakers tend to be harder to (machine-)learn.

Koplenig, Alexander; Wolfer, Sascha.

Sci Rep ; 13(1): 18521, 2023 Oct 28.

Article in English | MEDLINE | ID: mdl-37898699

ABSTRACT

Computational language models (LMs), most notably exemplified by the widespread success of OpenAI's ChatGPT chatbot, show impressive performance on a wide range of linguistic tasks, thus providing cognitive science and linguistics with a computational working model to empirically study different aspects of human language. Here, we use LMs to test the hypothesis that languages with more speakers tend to be easier to learn. In two experiments, we train several LMs, ranging from very simple n-gram models to state-of-the-art deep neural networks, on written cross-linguistic corpus data covering 1293 different languages and statistically estimate learning difficulty. Using a variety of quantitative methods and machine learning techniques to account for phylogenetic relatedness and geographical proximity of languages, we show that there is robust evidence for a relationship between learning difficulty and speaker population size. However, contrary to expectations derived from previous research, our results suggest that languages with more speakers tend to be harder to learn.

Subject(s)

Language , Linguistics , Humans , Phylogeny , Neural Networks, Computer , Machine Learning

Is More Always Better? Testing the Addition Bias for German Language Statistics.

Wolfer, Sascha.

Cogn Sci ; 47(9): e13339, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37705294

ABSTRACT

This replication study aims to investigate a potential bias toward addition in the German language, building upon previous findings of Winter and colleagues who identified a similar bias in English. Our results confirm a bias in word frequencies and binomial expressions, aligning with these previous findings. However, the analysis of distributional semantics based on word vectors did not yield consistent results for German. Furthermore, our study emphasizes the crucial role of selecting appropriate translational equivalents, highlighting the significance of considering language-specific factors when testing for such biases for languages other than English.

Subject(s)

Language , Semantics , Humans , Bias , Seasons

A large quantitative analysis of written language challenges the idea that all languages are equally complex.

Koplenig, Alexander; Wolfer, Sascha; Meyer, Peter.

Sci Rep ; 13(1): 15351, 2023 Sep 16.

Article in English | MEDLINE | ID: mdl-37717109

ABSTRACT

One of the fundamental questions about human language is whether all languages are equally complex. Here, we approach this question from an information-theoretic perspective. We present a large-scale quantitative cross-linguistic analysis of written language by training a language model on more than 6500 different documents as represented in 41 multilingual text collections consisting of ~3.5 billion words or ~9.0 billion characters and covering 2069 different languages that are spoken as a native language by more than 90% of the world population. We statistically infer the entropy of each language model as an index of what we call average prediction complexity. We compare complexity rankings across corpora and show that a language that tends to be more complex than another language in one corpus also tends to be more complex in another corpus. In addition, we show that speaker population size predicts entropy. We argue that both results constitute evidence against the equi-complexity hypothesis from an information-theoretic perspective.
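
As a rough illustration of using entropy as an index of prediction complexity, the sketch below estimates the cross-entropy (in bits per character) of a character bigram model on held-out text. This is not the model used in the study, which the abstract does not specify; the add-one smoothing, the function name `bigram_cross_entropy`, and the toy strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_cross_entropy(train, test):
    """Bits per character of an add-one-smoothed character bigram model.

    Hypothetical helper: the model is fitted on `train` and scored on `test`.
    Lower values mean the next character is easier to predict.
    """
    # Assumption: the alphabet is the union of characters in both texts.
    alphabet_size = len(set(train) | set(test))
    pair_counts = Counter(zip(train, train[1:]))
    context_counts = Counter(train[:-1])

    total_bits = 0.0
    n_predictions = 0
    for prev, cur in zip(test, test[1:]):
        # Add-one (Laplace) smoothing keeps unseen pairs at non-zero probability.
        prob = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + alphabet_size)
        total_bits -= math.log2(prob)
        n_predictions += 1
    return total_bits / n_predictions if n_predictions else 0.0


if __name__ == "__main__":
    predictable = bigram_cross_entropy("abababab", "abab")
    surprising = bigram_cross_entropy("abababab", "aabb")
    print(predictable < surprising)
```

In this toy setting, held-out text that follows the training pattern receives a lower entropy estimate than text that departs from it, which is the sense in which entropy indexes how hard a text is to predict.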

Testing the Relationship between Word Length, Frequency, and Predictability Based on the German Reference Corpus.

Koplenig, Alexander; Kupietz, Marc; Wolfer, Sascha.

Cogn Sci ; 46(6): e13090, 2022 Jun.

Article in English | MEDLINE | ID: mdl-35661231

ABSTRACT

In a recent article, Meylan and Griffiths (Meylan & Griffiths, 2021, henceforth, M&G) focus their attention on the significant methodological challenges that can arise when using large-scale linguistic corpora. To this end, M&G revisit a well-known result of Piantadosi, Tily, and Gibson (2011, henceforth, PT&G) who argue that average information content is a better predictor of word length than word frequency. We applaud M&G who conducted a very important study that should be read by any researcher interested in working with large-scale corpora. The fact that M&G mostly failed to find clear evidence in favor of PT&G's main finding motivated us to test PT&G's idea on a subset of the largest archive of German language texts designed for linguistic research, the German Reference Corpus consisting of ~43 billion words. We only find very little support for the primary data point reported by PT&G.

Subject(s)

Language , Linguistics , Humans , Reading

Studying Lexical Dynamics and Language Change via Generalized Entropies: The Problem of Sample Size.

Koplenig, Alexander; Wolfer, Sascha; Müller-Spitzer, Carolin.

Entropy (Basel) ; 21(5), 2019 May 03.

Article in English | MEDLINE | ID: mdl-33267178

ABSTRACT

Recently, it was demonstrated that generalized entropies of order α offer novel and important opportunities to quantify the similarity of symbol sequences where α is a free parameter. Varying this parameter makes it possible to magnify differences between different texts at specific scales of the corresponding word frequency spectrum. For the analysis of the statistical properties of natural languages, this is especially interesting, because textual data are characterized by Zipf's law, i.e., there are very few word types that occur very often (e.g., function words expressing grammatical relationships) and many word types with a very low frequency (e.g., content words carrying most of the meaning of a sentence). Here, this approach is systematically and empirically studied by analyzing the lexical dynamics of the German weekly news magazine Der Spiegel (consisting of approximately 365,000 articles and 237,000,000 words that were published between 1947 and 2017). We show that, analogous to most other measures in quantitative linguistics, similarity measures based on generalized entropies depend heavily on the sample size (i.e., text length). We argue that this makes it difficult to quantify lexical dynamics and language change and show that standard sampling approaches do not solve this problem. We discuss the consequences of the results for the statistical analysis of languages.
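
To make the role of the free parameter α concrete, the following sketch computes a generalized entropy of order α over word frequencies and a Jensen-Shannon-type divergence built from it. The abstract does not give the formula, so the Tsallis-type form used here, the whitespace tokenization, and the helper names (`generalized_entropy`, `word_distribution`, `generalized_divergence`) are assumptions for illustration only.

```python
import math
from collections import Counter


def generalized_entropy(probs, alpha):
    """Tsallis-type entropy of order alpha (assumed form).

    In the limit alpha -> 1 this is the Shannon entropy in nats.
    """
    probs = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)


def word_distribution(text):
    """Relative word frequencies using naive whitespace tokenization."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def generalized_divergence(text_a, text_b, alpha):
    """Jensen-Shannon-type divergence based on the entropy of order alpha."""
    p = word_distribution(text_a)
    q = word_distribution(text_b)
    vocab = set(p) | set(q)
    mixture = [(p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in vocab]
    return (generalized_entropy(mixture, alpha)
            - 0.5 * generalized_entropy(p.values(), alpha)
            - 0.5 * generalized_entropy(q.values(), alpha))


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the dog sat on the log"
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, generalized_divergence(a, b, alpha))
```

Under this assumed form, larger values of α give more weight to high-frequency words and smaller values give more weight to the low-frequency tail. Because the estimates are computed from observed frequencies, they change with text length, which is the sample-size dependence the abstract describes.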

The statistical trade-off between word order and word structure - Large-scale evidence for the principle of least effort.

Koplenig, Alexander; Meyer, Peter; Wolfer, Sascha; Müller-Spitzer, Carolin.

PLoS One ; 12(3): e0173614, 2017.

Article in English | MEDLINE | ID: mdl-28282435

ABSTRACT

Languages employ different strategies to transmit structural and grammatical information. While, for example, grammatical dependency relationships in sentences are mainly conveyed by the ordering of the words for languages like Mandarin Chinese or Vietnamese, the word ordering is much less restricted for languages such as Inupiatun or Quechua, as these languages (also) use the internal structure of words (e.g. inflectional morphology) to mark grammatical relationships in a sentence. Based on a quantitative analysis of more than 1,500 unique translations of different books of the Bible in almost 1,200 different languages that are spoken as a native language by approximately 6 billion people (more than 80% of the world population), we present large-scale evidence for a statistical trade-off between the amount of information conveyed by the ordering of words and the amount of information conveyed by internal word structure: languages that rely more strongly on word order information tend to rely less on word structure information and vice versa. Or put differently, if less information is carried within the word, more information has to be spread among words in order to communicate successfully. In addition, we find that, despite differences in the way information is expressed, there is also evidence for a trade-off between different books of the biblical canon that recurs with little variation across languages: the more informative the word order of the book, the less informative its word structure and vice versa. We argue that this might suggest that, on the one hand, languages encode information in very different (but efficient) ways. On the other hand, content-related and stylistic features are statistically encoded in very similar ways.

Subject(s)

Bible , Communication Barriers , Language Arts , Humans

Fast word reading in pure alexia: "fast, yet serial".

Bormann, Tobias; Wolfer, Sascha; Hachmann, Wibke; Neubauer, Claudia; Konieczny, Lars.

Neurocase ; 21(2): 251-67, 2015.

Article in English | MEDLINE | ID: mdl-24592898

ABSTRACT

Pure alexia is a severe impairment of word reading in which individuals process letters serially with a pronounced length effect. Yet, there is considerable variation in the performance of alexic readers with generally very slow, but also occasionally fast responses, an observation addressed rarely in previous reports. It has been suggested that "fast" responses in pure alexia reflect residual parallel letter processing or that they may even be subserved by an independent reading system. Four experiments assessed fast and slow reading in a participant (DN) with pure alexia. Two behavioral experiments investigated frequency, neighborhood, and length effects in forced fast reading. Two further experiments measured eye movements when DN was forced to read quickly, or could respond faster because words were easier to process. Taken together, there was little support for the proposal that "qualitatively different" mechanisms or reading strategies underlie both types of responses in DN. Instead, fast responses are argued to be generated by the same serial-reading strategy.

Subject(s)

Alexia, Pure/psychology , Alexia, Pure/pathology , Brain/pathology , Eye Movements , Humans , Male , Middle Aged , Neuropsychological Tests , Pattern Recognition, Visual , Reading , Semantics

An eye movement study on the role of the visual field defect in pure alexia.

Bormann, Tobias; Wolfer, Sascha A; Hachmann, Wibke; Lagrèze, Wolf A; Konieczny, Lars.

PLoS One ; 9(7): e100898, 2014.

Article in English | MEDLINE | ID: mdl-24999811

ABSTRACT

Pure alexia is a severe impairment of word reading which is usually accompanied by a right-sided visual field defect. Patients with pure alexia exhibit better preserved writing and a considerable word length effect, claimed to result from a serial letter processing strategy. Two experiments compared the eye movements of four patients with pure alexia to controls with simulated visual field defects (sVFD) when reading single words. Besides differences in response times and differential effects of word length on word reading in both groups, fixation durations and the occurrence of a serial, letter-by-letter fixation strategy were investigated. The analyses revealed quantitative and qualitative differences between pure alexic patients and unimpaired individuals reading with sVFD. The patients with pure alexia read words slower and exhibited more fixations. The serial, letter-by-letter fixation strategy was observed only in the patients but not in the controls with sVFD. It is argued that the VFD does not cause pure alexic reading.

Subject(s)

Alexia, Pure/physiopathology , Eye Movements , Visual Fields , Aged , Alexia, Pure/diagnostic imaging , Fixation, Ocular , Humans , Linguistics , Middle Aged , Reaction Time , Saccades , Tomography, X-Ray Computed
