Tags: statistics, analysis, survey, freeform

How to categorize and tabulate free-form answers to a survey question?


I want to analyze the answers to a web survey (the Git User's Survey 2008, if anyone is interested). Some of the questions were free-form, such as "How did you hear about Git?". With more than 3,000 replies, analyzing them entirely by hand is out of the question (especially since the survey contains quite a few free-form questions).

How can I group those replies into categories (probably based on the keywords used in each response) at least semi-automatically (i.e. the program can ask for confirmation), and then tabulate them (count the number of entries in each category)? One answer can belong to more than one category, although for simplicity one can assume that the categories are orthogonal / mutually exclusive.

What I'd like to know is at least a keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).
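
To make the goal concrete, here is a minimal sketch of keyword-based grouping and tabulation. It is written in Python for brevity (the same logic ports directly to Perl hashes). The rule table, the category names, the sample replies, and the "uncategorized" bucket are all made up for illustration; a real keyword list would have to be built up while reading a sample of the replies.

```python
import re
from collections import Counter

# Hypothetical keyword rules (assumption): category -> keywords that trigger it.
RULES = {
    "news site": ["slashdot", "reddit", "lwn"],
    "linux kernel": ["kernel", "lkml", "linus"],
    "word of mouth": ["friend", "coworker", "colleague"],
}

def categorize(reply, rules=RULES):
    """Return the sorted list of categories whose keywords occur in the reply."""
    words = set(re.findall(r"[a-z0-9]+", reply.lower()))
    return sorted(cat for cat, keywords in rules.items() if words & set(keywords))

def tabulate(replies, rules=RULES):
    """Count replies per category; a reply may count towards several categories."""
    counts = Counter()
    for reply in replies:
        # Replies matching no rule are collected so they can be reviewed by hand.
        counts.update(categorize(reply, rules) or ["uncategorized"])
    return counts

# Made-up sample replies (assumption), not taken from the survey data.
replies = [
    "Read about it on Slashdot",
    "A friend told me, then I saw it on LWN",
    "Linus' talk at Google",
    "don't remember",
]

for category, n in tabulate(replies).most_common():
    print(f"{category:15} {n}")
```

The "uncategorized" bucket is where the semi-automatic part would come in: those replies are the ones to inspect in order to add new keywords or categories.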


Possible solution No. 1 (partial): Bayesian categorization

(added 2009-05-21)

One solution I thought of would be to use something like the algorithm (and the mathematical method behind it) used for Bayesian spam filtering, except that instead of one or two categories ("spam" and "ham") there would be more, and the categories themselves would be created adaptively / interactively.
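
A rough sketch of this idea, assuming a multinomial naive Bayes classifier with add-one (Laplace) smoothing, which is one common way spam filters of this kind are built. It is in Python for brevity; the class name, the `confirm` callback, and the training sentences are hypothetical. New categories appear simply by training with a label that has not been seen before.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over any number of categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training replies
        self.vocab = set()                       # all words seen during training

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, category):
        words = self.tokens(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def scores(self, text):
        """Log-probability score of the text for every known category."""
        total_docs = sum(self.doc_counts.values())
        result = {}
        for cat, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[cat].values())
            logp = math.log(n_docs / total_docs)  # prior
            for w in self.tokens(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                logp += math.log((self.word_counts[cat][w] + 1)
                                 / (total_words + len(self.vocab)))
            result[cat] = logp
        return result

    def guess(self, text):
        s = self.scores(text)
        return max(s, key=s.get) if s else None

def label_interactively(nb, replies, confirm):
    """Propose a category for each reply, let `confirm` accept or correct it,
    and learn from the confirmed label. `confirm(reply, guess)` returns the label."""
    labels = []
    for reply in replies:
        label = confirm(reply, nb.guess(reply))
        nb.train(reply, label)
        labels.append(label)
    return labels

# Made-up training data (assumption).
nb = NaiveBayes()
nb.train("read about it on slashdot", "news site")
nb.train("saw an article on lwn", "news site")
nb.train("a friend told me", "word of mouth")
nb.train("my coworker recommended it", "word of mouth")

print(nb.guess("a friend at work recommended it"))
```

In an actual session `confirm` would prompt the person doing the analysis; here it is left as a callback so the classifier can be exercised without a terminal.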


Solution

  • Text::Ngrams + Algorithm::Cluster

    1. Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
    2. Cluster the vectors using Algorithm::Cluster to determine the groupings and also the keywords which correspond to the groups.
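
The two steps above can be illustrated with a small, self-contained sketch. Note that this is not the Text::Ngrams or Algorithm::Cluster API: it is Python, it uses plain word counts (unigrams) as the vectors, and it swaps in a simple greedy threshold clustering on cosine similarity in place of the k-means or hierarchical methods a clustering library would offer. The stop-word list, the threshold of 0.3, and the sample replies are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Small, hypothetical stop-word list (assumption).
STOPWORDS = {"a", "an", "the", "it", "on", "in", "at", "me", "my", "i",
             "of", "about", "from"}

def vectorize(reply):
    """Step 1: sparse word-count vector for one reply."""
    words = re.findall(r"[a-z0-9]+", reply.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def cluster(replies, threshold=0.3):
    """Step 2: each reply joins the most similar existing cluster if the
    similarity reaches the threshold, otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [reply, ...]}
    for reply in replies:
        vec = vectorize(reply)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)  # summed counts; cosine ignores scale
            best["members"].append(reply)
        else:
            clusters.append({"centroid": Counter(vec), "members": [reply]})
    return clusters

def keywords(c, n=3):
    """Most frequent words in a cluster, usable as a label for the group."""
    return [w for w, _ in c["centroid"].most_common(n)]

# Made-up sample replies (assumption).
replies = [
    "Read about it on Slashdot",
    "Slashdot article",
    "A friend told me",
    "friend from work",
    "Saw it on Slashdot",
]

for c in cluster(replies):
    print(keywords(c, 1), len(c["members"]))
```

The result of a greedy pass depends on the order of the replies and on the threshold, so the groups it produces are best treated as suggestions to confirm by hand rather than final categories.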