Språknett

To advance the development of Large Language Models (LLMs) for Norwegian, it is essential to identify, collect, and accurately represent a comprehensive inventory of understudied multi-word language patterns of Norwegian, termed constructions (Goldberg 2024; Torrent et al. 2024; Tayyar Madabushi et al. 2025). These language units often involve irregular syntax, non-compositional semantics, and pragmatic nuances that are central to human verbal communication but often elude traditional linguistic resources. Consider the Norwegian multi-word reduplicative construction XP og XP (Fru Blom) …, exemplified below, which has no direct equivalents in non-Scandinavian languages:

– Har du hus? – Hus og hus, fru Blom, det er mer ei hytte...
have you house.sg house.sg and house.sg Mrs. Blom it is more a cabin
‘– Do you have a house?’ ‘– Yes and no / I do not quite agree, it is more like a cabin.’

Such constructions are neither full idioms nor “free” expressions and usually fall between the cracks of grammars and dictionaries. For a language with little inflectional morphology like Norwegian, combinatorial properties of words realized in constructions are especially relevant.

Kutuzov et al. (2021) report on the development of NorBERT, the first large-scale monolingual language model trained from scratch on Norwegian language data. As NLP continues to advance toward more robust models, new challenges have emerged, particularly involving linguistic units that are not accounted for by dictionaries, part-of-speech and morphological parsers, or reference grammars. One of the most pressing issues is the accurate representation of language-specific multi-word constructions that can be integrated into the training data for Norwegian LLMs. A major resource on Norwegian constructions, in the form of a comprehensive digital database, is missing (Fjeld 2009: 103), which poses a persistent challenge for NLP systems. Addressing this gap is crucial for the development of LLMs that can effectively process Norwegian in all its linguistic complexity.

The EliNor project fills this knowledge gap by conducting the first large-scale investigation of Norwegian constructions and making the findings available in the form of a major digital resource for Norwegian, the Norwegian Constructicon (Språknett). For each construction included in this large database, it will be possible to consult its meaning and syntactic properties, and to find illustrative examples and available cross-linguistic equivalents in ten languages. This large-scale resource will represent a major portion of Norwegian grammar. The infrastructure of the Constructicon is inspired by The Great Norwegian Encyclopedia (Store Norske Leksikon, https://snl.no/) and will enable linguists to register their descriptions of individual Norwegian constructions as academic article-like entries, recognized as research results in the National Research Data Archive (Nasjonalt vitenarkiv). We detailed the urgent need for building the Norwegian Constructicon in Endresen & Mikkelsen (2024).

In 2025, we developed this resource to Technology Readiness Level 4. We collected 2000 Norwegian constructions at https://constructicon.github.io/norwegian/, designed a new user-friendly interface, and presented its preliminary version (https://spraknett.uit.no/) to researchers and university instructors of Norwegian at the kick-off research seminar, “Norwegian Grammar as a Constructicon,” funded by the Norwegian Directorate for Higher Education and Skills (HK-dir).

The seminar was held on August 25–27, 2025, on Sommarøy (https://uit.no/noko). Both the linguistic content and the technical functionalities of the digital database were thoroughly discussed. EliNor will enable the comprehensive integration of the detailed feedback we received into the Constructicon resource: the interface currently lacks several crucial functionalities that EliNor aims to develop, including gloss formatting, advanced search mechanisms, data security, a Nynorsk parallel-aligned database, and principles for protecting author rights, among others.

EliNor will incorporate the collected feedback to advance the Constructicon to Technology Readiness Levels 6–7.

We are currently exploring more objective methods of compiling the inventory of constructions, in particular by using mMERGE, a novel corpus-driven algorithm for the inductive discovery of multiword expressions (MWEs) based on the interaction of multiple distributional dimensions (Ben Youssef 2024; Gries 2022). Implemented in Julia, the algorithm combines measures of token frequency, dispersion across corpus parts, directional association strength, and contextual entropy in order to identify lexically cohesive and distributionally stable sequences.
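As a rough illustration of how these four dimensions can be computed for a candidate sequence, the following Python sketch scores adjacent word pairs in a corpus that is split into parts. It is not the mMERGE algorithm and not its Julia implementation: the function name score_bigrams, the restriction to two-word sequences, the use of simple range (share of corpus parts containing the pair) as the dispersion measure, ΔP as the directional association measure, and right-context Shannon entropy are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(corpus_parts):
    """Score every adjacent word pair in a corpus split into parts.

    corpus_parts: list of token lists, one list per corpus part.
    Returns {(w1, w2): {"freq", "dispersion", "delta_p_fwd",
                        "delta_p_bwd", "entropy"}}.
    Hypothetical helper for illustration only; not the mMERGE code.
    """
    freq = Counter()                      # token frequency of each pair
    first = Counter()                     # how often a word opens a pair
    second = Counter()                    # how often a word closes a pair
    parts_containing = Counter()          # number of parts with the pair
    right_context = defaultdict(Counter)  # words that follow each pair
    total = 0                             # total number of pair tokens

    for part in corpus_parts:
        seen = set()
        for i in range(len(part) - 1):
            pair = (part[i], part[i + 1])
            freq[pair] += 1
            first[pair[0]] += 1
            second[pair[1]] += 1
            total += 1
            seen.add(pair)
            if i + 2 < len(part):
                right_context[pair][part[i + 2]] += 1
        for pair in seen:
            parts_containing[pair] += 1

    scores = {}
    for (w1, w2), a in freq.items():
        # 2x2 contingency table: a = w1 w2, b = w1 + other,
        # c = other + w2, d = neither.
        b = first[w1] - a
        c = second[w2] - a
        d = total - a - b - c
        # Delta P: P(outcome | cue) - P(outcome | no cue), in both directions.
        delta_p_fwd = a / (a + b) - (c / (c + d) if c + d else 0.0)
        delta_p_bwd = a / (a + c) - (b / (b + d) if b + d else 0.0)
        # Shannon entropy (bits) of the words directly following the pair.
        ctx = right_context[(w1, w2)]
        n = sum(ctx.values())
        entropy = 0.0
        if n:
            entropy = -sum((k / n) * math.log2(k / n) for k in ctx.values())
        scores[(w1, w2)] = {
            "freq": a,
            "dispersion": parts_containing[(w1, w2)] / len(corpus_parts),
            "delta_p_fwd": delta_p_fwd,
            "delta_p_bwd": delta_p_bwd,
            "entropy": entropy,
        }
    return scores


if __name__ == "__main__":
    # Toy two-part corpus, invented for this example.
    toy = [
        ["hus", "og", "hus", "fru", "blom"],
        ["hus", "og", "hus", "det", "er"],
    ]
    for pair, values in score_bigrams(toy).items():
        print(pair, values)
```

In a sketch like this, a pair that is frequent, occurs in many corpus parts, shows strong association in at least one direction, and is followed by varied material would be a plausible candidate unit. How the actual algorithm weighs and combines the dimensions is described in the cited works, not here.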

More information about this method is available in a recording of a guest talk given at UiT on May 12, 2026, by visiting postdoctoral research fellow Chadi Ben Youssef (University of Neuchâtel, Switzerland): https://vimeo.com/1190013424