link to original dataset
I have downloaded this dataset: The TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure:
.
└── trec06p/
    ├── data
    ├── data-delay
    ├── full
    ├── full-delay
    ├── ham25
    ├── ham25-delay
    ├── ham50
    ├── ham50-delay
    ├── spam25
    ├── spam25-delay
    ├── spam50
    └── spam50-delay
Some questions:
- [...] (data-delay, full-delay)
- What does full mean in this case? (Is it just the labels?)
- [...] full-delay subfolder?
- Why is the data-delay folder empty?

Before reading the answer, please note that since I neither participated in the TREC06 task nor am the data creator/provider, I can only make educated guesses about the questions you have on the dataset.
First, reading the task paper helps https://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf =)
Next, the right download link for future readers would be https://plg.uwaterloo.ca/~gvcormac/treccorpus06/
And now, some summary:
A: All the actual textual data are found in the trec06p/data/**/* files:
/trec06p
    /data
        /000
            /000
            ...
            /299
        ...
        /126
            /000
            /021
The rest of the directories are just indices pointing to subsets of the data, used to emulate the different forms of evaluation.
- trec06p/full/index: the index that points to all the data points in trec06p/data/**/*
- trec06p/full-delay/index: the index for the delayed feedback evaluation
- trec06p/ham*-delay/index: the indices that point only to the non-spam labelled emails in the delayed feedback evaluation
- trec06p/spam*-delay/index: the indices that point only to the spam labelled emails in the delayed feedback evaluation

So essentially, the unique entries of trec06p/ham*-delay/index + trec06p/spam*-delay/index = trec06p/full-delay/index
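
The relationship guessed above can be checked rather than taken on faith. Here is a minimal sketch, assuming every index line has the form "label path" (the same assumption the parsing code further down makes). The names read_index_ids and union_of_indices are hypothetical helpers of mine, not part of any TREC tooling.

```python
def read_index_ids(index_path):
    """Return the set of (folder, file) ids referenced by one index file."""
    ids = set()
    with open(index_path) as fin:
        for line in fin:
            parts = line.split()
            if len(parts) != 2:
                continue  # Skip blank or malformed lines.
            _label, ref = parts
            # Same id scheme as the rest of this answer: last two path parts.
            ids.add(tuple(ref.split('/')[-2:]))
    return ids


def union_of_indices(index_paths):
    """Return the union of the ids referenced by several index files."""
    ids = set()
    for index_path in index_paths:
        ids |= read_index_ids(index_path)
    return ids
```

From inside trec06p, you could collect the ham*-delay/index and spam*-delay/index paths (for example with glob.glob), pass them to union_of_indices, and compare the result with read_index_ids('full-delay/index'). I have not verified what that comparison returns on the real corpus.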
As for why the data-delay folder is empty, I don't have an answer... You'd have to ask the data provider/creator.
Now for the fun coding part =)
Let's step back a little and think about what we essentially have:
- the actual email data in trec06p/data/**/*
- spam/ham labels of each email in trec06p/full/index
- Spam/SPAM/Ham/HAM labels of a subset of emails in trec06p/full-delay/index
So...
import pandas as pd
from tqdm import tqdm
from lazyme import find_files

data_rows = {}
# Assuming you're in the `trec06p` directory.
# P/S: you can use any other file path list function,
# I just use lazyme.find_files because I find it convenient.
for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    # Note that not all files are in utf8/ascii charset,
    # so you'll have to read them in binary to store them.
    # Also note: THIS CAN BE DANGEROUS IF THERE ARE EXECUTABLES IN THE DATA!!!
    # Assuming that there aren't any.
    with open(fn, 'rb') as fin:
        # Use the last two path components, e.g. ('000', '000'), as the id.
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()

full_labels = {}
with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label

full_delay_labels = {}
with open('./full-delay/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        # You'll notice that the labels are repeated per data point,
        # but they are exactly the same (ignoring case).... -_-
        if data_id in full_delay_labels:
            assert label.lower() == full_delay_labels[data_id].lower()
        full_delay_labels[data_id] = label.lower()
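
The loading loop above relies on lazyme.find_files, and its comment notes that any other file listing function would do. As a sketch of that alternative, here is a standard-library-only version using pathlib. It assumes the layout shown earlier, with files exactly two levels below the data directory. The names iter_email_files and load_raw_emails are hypothetical.

```python
from pathlib import Path


def iter_email_files(data_dir):
    """Yield ((folder, name), path) for every file two levels below data_dir."""
    for path in sorted(Path(data_dir).glob('*/*')):
        # Skip sub-directories and macOS Finder metadata files.
        if not path.is_file() or path.name == '.DS_Store':
            continue
        yield (path.parent.name, path.name), path


def load_raw_emails(data_dir):
    """Read each email as bytes, keyed by the same (folder, name) id as above."""
    return {data_id: path.read_bytes()
            for data_id, path in iter_email_files(data_dir)}
```

Calling load_raw_emails('./data') from inside trec06p should give a dictionary shaped like data_rows, without the extra dependency.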
About the capitalised labels in trec06p/*-delay/index: if we look carefully at the check
if data_id in full_delay_labels: assert label.lower() == full_delay_labels[data_id].lower()
we see that the capitalised and lower-case labels for a given email are always the same.
Q: So why is there a difference?
A: Not sure, best to ask data provider/creator
Is there any difference between the labels in trec06p/full-delay/index and trec06p/full/index? There doesn't seem to be any.
>>> any(full_labels[data_id] != full_delay_labels[data_id] for data_id in full_labels)
False
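
Note that the one-liner above raises a KeyError if some id in full_labels is missing from full_delay_labels. A slightly more defensive sketch reports missing ids and case-insensitive label mismatches separately. The helper name compare_labels is hypothetical, and the dictionaries below are made-up toy data standing in for the real index contents.

```python
def compare_labels(full_labels, delay_labels):
    """Return (ids missing from delay_labels, ids whose labels disagree)."""
    missing = sorted(k for k in full_labels if k not in delay_labels)
    mismatched = sorted(
        k for k in full_labels
        if k in delay_labels
        and full_labels[k].lower() != delay_labels[k].lower()
    )
    return missing, mismatched


# Toy data, NOT taken from the corpus.
toy_full = {('000', '000'): 'ham', ('000', '001'): 'spam', ('000', '002'): 'spam'}
toy_delay = {('000', '000'): 'Ham', ('000', '001'): 'SPAM'}

missing, mismatched = compare_labels(toy_full, toy_delay)
print(missing)     # [('000', '002')]
print(mismatched)  # []
```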
Given what we know above:
import pandas as pd
from tqdm import tqdm
from lazyme import find_files

# Raw bytes of every email, keyed by (folder, file name).
data_rows = {}
for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    with open(fn, 'rb') as fin:
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()

# spam/ham label of every email, keyed the same way.
full_labels = {}
with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label

# One row per email: the raw bytes and the label, aligned on the id.
df = pd.DataFrame({'binary': pd.Series(data_rows), 'label': full_labels})
Not really; it's pretty hard and messy to guess the encoding of a binary file, but you can try this (though not all files specify charset=... in their content):
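import mmap
import re

def find_charset(fn):
    """Return the first charset=... value declared in the file, or None."""
    with open(fn, 'rb') as f:
        # Memory-map the file so the regex can scan it without reading it all.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            match = re.search(br'charset\=([!-~\s]{5,})\n', view)
            if match is None:
                return None  # No charset declaration found in this file.
            value = match.group(1).decode('utf8')
    # Keep only the charset name: cut at the first delimiter, drop quotes.
    return re.split(";|,|\n", value)[0].strip('"').strip("'")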
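
As an alternative to the regular expression, Python's standard email package can parse the raw bytes and report the charset declared in the headers of the message and of its MIME parts. This is a different technique from the one above, sketched under the assumption that each file holds a single RFC 822 style message. The helper names declared_charsets and decode_with_fallback are hypothetical, and the latin-1 fallback is my own choice, not something the corpus documents.

```python
import email


def declared_charsets(raw_bytes):
    """Return the charsets declared by the message and its MIME parts."""
    msg = email.message_from_bytes(raw_bytes)
    charsets = []
    for part in msg.walk():
        charset = part.get_content_charset()  # Lower-cased name, or None.
        if charset and charset not in charsets:
            charsets.append(charset)
    return charsets


def decode_with_fallback(raw_bytes, charsets, fallback='latin-1'):
    """Decode with the first charset that works, else with the fallback."""
    for charset in charsets:
        try:
            return raw_bytes.decode(charset)
        except (LookupError, UnicodeDecodeError):
            continue  # Unknown codec name or bytes that do not fit it.
    return raw_bytes.decode(fallback)


# A made-up message, NOT taken from the corpus.
sample = (b'From: a@example.com\r\n'
          b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
          b'\r\n'
          b'caf\xe9\r\n')
print(declared_charsets(sample))  # ['iso-8859-1']
```

latin-1 maps every byte to a character, so the fallback never raises, but it may produce the wrong characters for messages that are really in another encoding.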