corpus-linguistics

How to develop a corpus (corpus analysis)


I am going to build a linguistic corpus, but I don't understand which technologies I should use for it. Is it true that for developing a corpus for any language I necessarily have to use the IMS Corpus Workbench (CWB) and CQPweb technologies?

I want to understand which technologies should be used for creating a linguistic corpus.


Solution

  • While I worked at a British university with a reputation for corpus analysis, I wrote several pieces of software for handling (large) text corpora. I also worked briefly on the Cobuild project, which at the time had the largest text corpus of the English language, with 200 million words of data (mid-1990s).

    A corpus, as has been mentioned in the comments, is just a collection of texts. You might want to have some metadata available so you can select sub-corpora based on genre, age, or type of language (written vs. spoken, etc.) for contrastive analyses. So you should be able to easily include some texts and exclude others when processing the corpus.
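    For illustration, sub-corpus selection over per-text metadata might look like this (a minimal Python sketch; the field names `genre`, `mode`, and `year` are hypothetical choices, not any standard):

```python
# Minimal sketch: a corpus as a list of texts, each carrying metadata.
# The field names (genre, mode, year) are illustrative, not a standard.
corpus = [
    {"id": "t1", "genre": "news",     "mode": "written", "year": 1995, "text": "..."},
    {"id": "t2", "genre": "fiction",  "mode": "written", "year": 1988, "text": "..."},
    {"id": "t3", "genre": "dialogue", "mode": "spoken",  "year": 1995, "text": "..."},
]

def select(corpus, **criteria):
    """Return the sub-corpus whose metadata matches all given criteria."""
    return [t for t in corpus
            if all(t.get(key) == value for key, value in criteria.items())]

written = select(corpus, mode="written")                 # t1 and t2
spoken_1995 = select(corpus, mode="spoken", year=1995)   # t3
```

    In a real corpus the metadata would live in headers or a database rather than inline, but the principle is the same: every text carries the attributes you filter on.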

    The other main feature you need is searching for words (which is one of the main features in corpus analysis). So you need an inverted index, where each word has a list of the positions in the text that it occurs in. With this you can get pretty much instant results even with very large corpora.
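    A toy version of such an inverted index (a Python sketch, not any particular corpus tool):

```python
from collections import defaultdict

def build_index(tokens):
    """Inverted index: map each word form to the list of its token positions."""
    index = defaultdict(list)
    for pos, word in enumerate(tokens):
        index[word.lower()].append(pos)
    return index

tokens = "the cat sat on the mat".split()
index = build_index(tokens)
print(index["the"])  # -> [0, 4]
```

    A lookup is then a single dictionary access, independent of corpus size; on disk, the same idea becomes a sorted term file with postings lists of positions.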

    You might also want to have annotations, such as part-of-speech information. You can attach them to the word tokens in the original text, or keep a 'parallel' data set where they sit at the same positions. You might also want to include this in the index, so you can easily search for words with a particular part of speech.
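    One way to sketch the 'parallel' annotation layer together with an annotation-aware index (the tokens and tags below are made-up examples, and the tag names are hypothetical):

```python
from collections import defaultdict

tokens = ["The", "dogs", "run", "for", "a", "run"]
tags   = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]  # parallel layer, same positions

# Index keyed on (word form, tag) so searches can be annotation-sensitive.
tagged_index = defaultdict(list)
for pos, (word, tag) in enumerate(zip(tokens, tags)):
    tagged_index[(word.lower(), tag)].append(pos)

print(tagged_index[("run", "VERB")])  # -> [2]
print(tagged_index[("run", "NOUN")])  # -> [5]
```

    Because the annotation layer is aligned by position, any further layer (lemmas, chunks, etc.) can be added the same way without touching the original text.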

    If you use ready-made software, it might restrict what you can do with your data. So it's a trade-off between learning to use an existing tool (and possibly finding out it cannot do what you need) and writing your own system, with the freedom to decide what features you need. I found that a graphical user interface is the most time-consuming aspect, so if you can stick with the command line (or, nowadays, a web interface), writing your own shouldn't be too bad...

    My go-to recommendation is to have a look at Managing Gigabytes by Witten, Moffat, and Bell. It's a bit old by now, but it is a really helpful book when thinking about how to implement searching over large text files. I pretty much used it as a blueprint to implement the system I was using then.