LiRI Wiki

Linguistic Research Infrastructure - University of Zurich


Swissdox@LiRI: Users' guide

Access the Swissdox@LiRI database at https://swissdox.linguistik.uzh.ch/

Swissdox@LiRI is a tool that allows you to retrieve large quantities of Swiss media data for research purposes.

LiRI cooperates with Schweizer Mediendatenbank AG (SMD) to make the Swissdox database easily accessible to researchers. The Swissdox@LiRI database includes approximately 23 million published media articles from a wide range of Swiss media sources (both print and digital) covering many decades, and is updated daily with approximately 5000 to 6000 new articles.

With Swissdox@LiRI you can retrieve articles by filtering for language, time interval, keywords, sources and document types. More metadata is available but cannot currently be used for filtering. After submitting a query, Swissdox@LiRI compiles the corresponding dataset, and notifies you when it becomes available. In this guide, Swissdox@LiRI's functionality is explained in more detail.

Note that some libraries provide access to the Swissdox essentials application, which allows users to browse and search the Swissdox archive and retrieve individual texts. Swissdox essentials gives access to the whole Swissdox database, which Swissdox@LiRI does not, but it can only be used locally at those institutions. Swissdox@LiRI instead aims to facilitate quantitative and/or computational language and media research.

Getting started

To use Swissdox@LiRI, you first need to log in. Upon visiting the site, you will be redirected to the eduGAIN login interface, where you need to select your affiliation and log in using your institutional credentials.

If you are a member of a supporting institution,1) you need to provide information about at least one research project that you are part of. Other people from the same institution can be added to existing projects. Queries are associated with a research project, and retrieved datasets are accessible to all project members.

Users from other institutions that have contracted Swissdox@LiRI services need to be added by the Swissdox administrator of their institution.

Corpus query

On the Corpus query page, you can define a query to retrieve articles matching your particular research interests.

Below is a list of attributes that the database can be filtered by:

Field Description
Languages Most articles are available in German and French; fewer articles are available in Italian, Romansh and English.
Document type PDLN identifiers describe the type of the respective source.
Source Media articles come from a large list of sources.
Content keywords A list of terms provided by the user, of which one or more must be present in an article; keywords are case-sensitive; asterisks (*) can be used as placeholders.
Examples
finden will match the exact word finden, but will not match auffinden or findig
find* will match all words starting with find (e.g. finden and findig), but will not match auffinden or auffindbar
*finden will match all words ending with finden (e.g. finden and auffinden), but will not match auffindbar or findig
Time intervals Most available articles date from the last 25 years, but some date back to the beginning of the last century.
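
The keyword rules described above can be tried out locally with a small Python sketch. It only mirrors the documented behaviour (case-sensitive matching, with the asterisk standing for any run of word characters); how Swissdox@LiRI matches keywords internally is not specified here, and the function name keyword_matches is a placeholder chosen for this illustration.

```python
import re

def keyword_matches(keyword, word):
    """Return True if `word` matches `keyword`, where '*' is a placeholder.

    Assumption for this sketch: '*' stands for zero or more word
    characters, and matching is case-sensitive, as described in the table.
    """
    # Escape every literal part and join the parts with a wildcard pattern.
    literal_parts = [re.escape(part) for part in keyword.split('*')]
    pattern = r'\w*'.join(literal_parts)
    return re.fullmatch(pattern, word) is not None

# The examples from the table:
for keyword in ('finden', 'find*', '*finden'):
    for word in ('finden', 'findig', 'auffinden', 'auffindbar'):
        print(keyword, word, keyword_matches(keyword, word))
```

Because matching is case-sensitive, a keyword such as covid would not match Covid in this sketch.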

Example

If you are interested in German articles containing the word “Covid” that were published in January 2022, you need to use the following filters:

Field Value
Languages German
Document type (leave blank)
Source (leave blank)
Content keywords “Covid”
Date ranges “2022-01-01 ~ 2022-01-31”

Note that if no option is selected, no filtering is performed.

Submitting a query

On the next page, you can provide a meaningful name for your query.

You can choose to be informed by email when a dataset has been compiled, which may be useful for long-running queries. To prevent datasets from automatically being removed, an expiry date can be specified.

Retrieved datasets

On the Retrieved datasets page, you see a summary of the queries that you and other project members have submitted to Swissdox@LiRI, alongside operational parameters such as the number of retrieved articles.

In general, the amount of time a query takes to finish is proportional to the amount of data requested. Thus, queries with fewer results will typically complete faster.

Clicking Details will bring up more information about the respective query. Via Open query you will load the filters of a completed query.

The Download icon on this page and the links sent via email will start the download of a dataset in a compressed TSV (tab-separated values) format.

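Windows users can use, for instance, 7-Zip or WinZip to uncompress the file. Mac users may want to use The Unarchiver. In the Mac or Linux terminal, you can use the xz command (provided the xz utilities are installed) to unpack the file. The download is a single xz-compressed TSV file rather than a tar archive, so tar is not the right tool here. The -k option keeps the compressed original:

xz -dk filename.tsv.xz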
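
If no unpacking tool is at hand, Python's standard lzma module can also write the uncompressed file to disk. The following is a minimal sketch; the function name decompress_xz and the file names are placeholders chosen for illustration.

```python
import lzma
import shutil

def decompress_xz(src_path, dest_path):
    """Decompress the xz file at src_path and write the result to dest_path."""
    # Binary mode on both sides; copyfileobj streams the data in chunks,
    # so the whole dataset never has to fit into memory.
    with lzma.open(src_path, 'rb') as src, open(dest_path, 'wb') as dest:
        shutil.copyfileobj(src, dest)
```

Calling decompress_xz('filename.tsv.xz', 'filename.tsv') would then leave the compressed download in place and create the uncompressed TSV file next to it.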

It is usually not necessary to uncompress the files for processing, though: the contents can be extracted on the fly, for instance via the xzcat command:

xzcat filename.tsv.xz

You can also use programming languages such as Python to directly read from the compressed TSV file. Here is an example snippet, showing how such a file can be efficiently processed:

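import lzma

def read_xz_compressed_tsv(filepath):
    """Yield each row of an xz-compressed TSV file as a list of fields."""
    # Text mode decompresses on the fly; the with block closes the file.
    with lzma.open(filepath, mode='rt', encoding='utf-8') as fh:
        for line in fh:
            # Skip blank lines and lines starting with '#'.
            if not line.strip() or line.startswith('#'):
                continue
            # Strip only the line ending so that empty trailing fields survive.
            yield line.rstrip('\r\n').split('\t')

for row in read_xz_compressed_tsv('file.tsv.xz'):
    print(row)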
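
Working with rows as plain lists means addressing columns by position. If the first line of the download holds the column names (an assumption of this sketch; check your own file), each row can instead be turned into a dictionary. The function name read_rows_as_dicts is a placeholder, and no particular column names are implied.

```python
import lzma

def read_rows_as_dicts(filepath):
    """Yield one dict per data row, keyed by the names found in the first line.

    Assumption: the first line of the file is a header row.
    """
    with lzma.open(filepath, mode='rt', encoding='utf-8') as fh:
        header = None
        for line in fh:
            fields = line.rstrip('\r\n').split('\t')
            if header is None:
                # The first line is taken as the list of column names.
                header = fields
                continue
            # Pair each column name with the value in the same position.
            yield dict(zip(header, fields))
```

Like the snippet above, this reads the file line by line, so even large datasets can be processed without loading them completely into memory.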


1) Currently UZH, ZHAW, University of Basel, ETH Zurich, University of Bern
langtech/swissdox/start.txt · Last modified: 2022/03/18 16:52 by Johannes Graën