Creating an ARFF file from python output

gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {'dail': 1, 'focus': 1, 'actions': 1, 'trade': 2, 'protest': 1, 'identify': 1, 'previous': 1, 'detectives': 1, 'republican': 1, 'group': 1, 'monitor': 1, 'clashes': 1, 'civil': 1, 'charge': 1, 'breaches': 1, 'travelling': 1, 'main': 1, 'disrupt': 1, 'real': 1, 'policing': 3, 'march': 6, 'finance': 1, 'drawn': 1, 'assistant': 1, 'protesters': 1, 'emphasised': 1, 'department': 1, 'traffic': 2, 'outbreak': 1, 'culprits': 1, 'proportionate': 1, 'instructions': 1, 'warned': 2, 'commanders': 1, 'michael': 2, 'exploit': 1, 'culminating': 1, 'large': 2, 'continue': 1, 'team': 1, 'hijack': 1, 'disorder': 1, 'square': 1, 'leaders': 1, 'deal': 2, 'people': 3, 'streets': 1, 'demonstrations': 2, 'observed': 1, 'street': 2, 'college': 1, 'organised': 1, 'operation': 1, 'special': 1, 'shown': 1, 'attendance': 1, 'normal': 1, 'unions': 2, 'individuals': 1, 'safety': 2, 'prosecuted': 1, 'ira': 1, 'ground': 1, 'public': 2, 'told': 1, 'body': 1, 'stewards': 2, 'obey': 1, 'business': 1, 'gathered': 1, 'assemble': 1, 'garda': 5, 'sinn': 1, 'broken': 1, 'fachtna': 1, 'management': 2, 'possibility': 1, 'groups': 3, 'put': 1, 'affiliated': 1, 'strong': 2, 'security': 1, 'stage': 1, 'behaviour': 1, 'involved': 1, 'route': 2, 'violence': 1, 'dublin': 3, 'fein': 1, 'ensure': 2, 'stand': 1, 'act': 2, 'contingency': 1, 'troublemakers': 2, 'facilitate': 2, 'road': 1, 'members': 1, 'prepared': 1, 'presence': 1, 'sullivan': 2, 'reassure': 1, 'number': 3, 'community': 1, 'strategic': 1, 'visible': 2, 'addressed': 1, 'notify': 1, 'trained': 1, 'eirigi': 1, 'city': 4, 'gpo': 1, 'from': 3, 'crowd': 1, 'visit': 1, 'wood': 1, 'editor': 1, 'peaceful': 4, 'expected': 2, 'today': 1, 'commissioner': 4, 'quay': 1, 'ictu': 1, 'advance': 1, 'murphy': 2, 'gardai': 6, 'aware': 1, 'closures': 1, 'courts': 1, 'branch': 1, 'deployed': 1, 'made': 1, 'thousands': 1, 'socialist': 1, 'work': 1, 'supt': 2, 'feehan': 1, 'mr': 1, 'briefing': 1, 'visited': 1, 'manner': 1, 'irish': 2, 'metropolitan': 1, 'spotters': 1, 'organisers': 1, 'in': 13, 'dissident': 1, 'evidence': 1, 'tom': 1, 'arrangements': 3, 'experience': 1, 'allowed': 1, 'sought': 1, 'rally': 1, 'connell': 1, 'officers': 3, 'potential': 1, 'holding': 1, 'units': 1, 'place': 2, 'events': 1, 'dignified': 1, 'planned': 1, 'independent': 1, 'added': 2, 'plans': 1, 'congress': 1, 'centre': 3, 'comprehensive': 1, 'measures': 1, 'yesterday': 2, 'alert': 1, 'important': 1, 'moving': 1, 'plan': 2, 'highly': 1, 'law': 2, 'senior': 2, 'fair': 1, 'recent': 1, 'refuse': 1, 'attempt': 1, 'brady': 1, 'liaising': 1, 'conscious': 1, 'light': 1, 'clear': 1, 'headquarters': 1, 'wing': 1, 'chief': 2, 'maintain': 1, 'harcourt': 1, 'order': 2, 'left': 1}}

I have a python script that extracts words from text files and counts the number of times they occur in the file.

I want to add them to an ".ARFF" file to use for weka classification. Above is an example output of my python script. How do I go about inserting them into an ARFF file, keeping each text file separate. Each file is differentiated by {"with their words in here!!"}

Solution

There are details on the ARFF file format here and it's very simple to generate. For example, using a cut-down version of your Python dictionary, the following script:

import re

d = { 'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': 
      {'dail': 1,
       'focus': 1,
       'actions': 1,
       'trade': 2,
       'protest': 1,
       'identify': 1 }}

for original_filename in d.keys():
    m = re.search('^(.*)\.html$',original_filename,)
    if not m:
        print "Ignoring the file:", original_filename
        continue
    output_filename = m.group(1)+'.arff'
    with open(output_filename,"w") as fp:
        fp.write('''@RELATION wordcounts

@ATTRIBUTE word string
@ATTRIBUTE count numeric

@DATA
''')
        for word_and_count in d[original_filename].items():
            fp.write("%s,%d\n" % word_and_count)

Generates output of the form:

@RELATION wordcounts

@ATTRIBUTE word string
@ATTRIBUTE count numeric

@DATA
dail,1
focus,1
actions,1
trade,2
protest,1
identify,1

... in a file called gardai-plan-crackdown-on-troublemakers-at-protest-2438316.arff. If that's not exactly what you want, I'm sure you can easily alter it. (For example, if the "words" might have spaces or other punctuation in them, you probably want to quote them.)