Blog Corpus
The Norwegian Blog Corpus (NBC)
To identify and explore the linguistic properties of Norwegian constructions, EliNor needs data that are representative of informal Norwegian. The existing Norwegian corpora each fall short in some respect. A new corpus is needed that is more “focused” (Barbaresi 2019) than the Norwegian web corpus NoWaC, includes sociodemographic information about language users, contains more informal and spoken-like language than Leksikografisk bokmålskorpus, and is significantly larger and more recent than the available spoken corpora (Nordisk dialektkorpus, the BigBrother-corpus).
To fill this gap, EliNor is building the Norwegian Blog Corpus (NBC), which will serve as the major data source for the Norwegian Constructicon. The corpus is compiled by Olaf Mikkelsen (https://github.com/omikke/NBC) and contains 18 million words of blog text written by more than 800 bloggers between 2010 and 2022. It still requires further data cleaning and additional annotation, and it is currently at Technology Readiness Level 3. EliNor aims to advance it to Technology Readiness Levels 6-7 by developing a modern, professional interface with advanced filtering options, including sociodemographic variables, time periods, and stylistic features. The goal is for the corpus to become a versatile and indispensable resource for linguistic research and for the development of Norwegian language technologies.