I am currently working on a project where I want to classify some text. For that, I first had to annotate text data. I did this using a web tool and now have the corresponding JSON file (containing the annotations) and plain .txt files (containing the raw text). I now want to train different classifiers on the data and eventually predict the desired outcome.
However, I am struggling with where to start. I haven't really found what I've been looking for on the internet, so that's why I'm asking here.
How would I proceed with the JSON and .txt files? As far as I understand, I'd have to somehow convert this information into a .csv containing the labels, the text, but also "none" for text that has not been annotated. So I guess that's why I'd use the .txt files: to merge them with the annotation files and detect whether a text sentence (or word) has a label or not. Then I could load the .csv data into the model.
Could someone give me a hint on where to start or how I should proceed? Everything I've found so far covers the case where the data is already converted and ready to preprocess, but I am struggling with what to do with the results of the annotation process.
My JSON looks something like this:
{
  "annotatable": {"parts": ["s1p1"]},
  "anncomplete": true,
  "sources": [],
  "metas": {},
  "entities": [
    {
      "classId": "e_1",
      "part": "s1p1",
      "offsets": [{"start": 11, "text": "This is the text"}],
      "coordinates": [],
      "confidence": {"state": "pre-added", "who": ["user:1"], "prob": 1},
      "fields": {
        "f_4": {
          "value": "3",
          "confidence": {"state": "pre-added", "who": ["user:1"], "prob": 1}
        }
      },
      "normalizations": {}
    }
  ],
  "relations": []
}
Each text is given a classId (e_1 in this case) and a field value (f_4 with the value 3 in this case). I'd need to extract this step by step: first extracting the entity with the corresponding text (and adding "none" where nothing has been annotated), and in a second step retrieving the field information with the corresponding text.
The corresponding .txt file is simply this:
This is the text
I have all .json files in one folder and all .txt in another.
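To make the structure concrete, this is roughly how I imagine pulling the classId, annotated text, and field value out of one such JSON (just my own sketch based on the structure above, with the example annotation embedded as a string for illustration):

```python
import json

# The example annotation from above, embedded as a string for illustration
ann_json = '''{"annotatable": {"parts": ["s1p1"]},
"anncomplete": true, "sources": [], "metas": {},
"entities": [{"classId": "e_1", "part": "s1p1",
"offsets": [{"start": 11, "text": "This is the text"}], "coordinates": [],
"confidence": {"state": "pre-added", "who": ["user:1"], "prob": 1},
"fields": {"f_4": {"value": "3", "confidence": {"state": "pre-added", "who": ["user:1"], "prob": 1}}},
"normalizations": {}}], "relations": []}'''

ann = json.loads(ann_json)
for entity in ann['entities']:
    class_id = entity['classId']                 # "e_1"
    text = entity['offsets'][0]['text']          # "This is the text"
    # Collect field name -> value, e.g. {"f_4": "3"}
    fields = {name: f['value'] for name, f in entity['fields'].items()}
```

In practice I'd read each file from the annotations folder with `json.load` instead of a hard-coded string.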
So, let's assume you have a JSON file where the labels are indexed by the corresponding line in your raw txt file:
{
    "0": "politics",
    "1": "sports",
    "2": "weather"
}
And a txt
file with the correspondingly indexed raw text:
0 The American government has launched ... today.
1 FC Barcelona has won ... the country.
2 The forecast looks ... okay.
Then first, you would indeed need to connect the examples with their labels before you go on to featurize the text and build a machine learning model. If your examples are aligned by index, an ID, or any other identifying information, as in my example, you could do:
import json

with open('labels.json') as json_file:
    labels = json.load(json_file)
# This results in a Python dictionary where you can look up a label given an index.

with open('raw.txt') as txt_file:
    raw_texts = txt_file.readlines()
# This results in a list where you can retrieve the raw text by index: raw_texts[index].
Now that you can match your raw text to your labels, you may want to put them in a single DataFrame for ease of use (assuming for now they are ordered the same way):
import pandas as pd

data = pd.DataFrame({
    'label': list(labels.values()),
    'text': raw_texts
})
#       label        text
# 0  politics  Sentence_1
# 1    sports  Sentence_2
# 2   weather  Sentence_3
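If some lines have no annotation at all (the "none" case from the question), a plain `dict.get` with a default fills the gap. A minimal sketch, assuming the labels dict uses string line indices and using made-up toy data:

```python
import pandas as pd

# Toy stand-ins: line 1 has no annotation
labels = {"0": "politics", "2": "weather"}
raw_texts = ["Sentence_1", "Sentence_2", "Sentence_3"]

# Fall back to "none" for any line index missing from the labels dict
data = pd.DataFrame({
    'label': [labels.get(str(i), 'none') for i in range(len(raw_texts))],
    'text': raw_texts
})
```

This keeps unannotated sentences in the dataset with an explicit "none" label instead of silently dropping them.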
Now, you can use different machine learning libraries, but the one I would recommend for starters is definitely scikit-learn. Its documentation gives a good explanation of how to convert your raw text strings into features a machine learning model can use.
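That step can be sketched with `TfidfVectorizer` (the texts below are toy examples, not your real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "The American government has launched a new program today.",
    "FC Barcelona has won the cup in the country.",
    "The forecast looks okay.",
]

# Turn the raw strings into a sparse TF-IDF feature matrix:
# one row per document, one column per vocabulary term.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
```

You would call `fit_transform` on your training texts and `transform` (without refitting) on any later texts.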
And afterwards, it shows how to train a classifier using these features.
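A rough sketch of that part, chaining the vectorizer and a classifier in a pipeline so you can pass raw strings directly (texts and labels here are toy placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the 'text' and 'label' columns of the DataFrame
texts = ["government passes new law", "team wins the match", "rain expected tomorrow"]
labels = ["politics", "sports", "weather"]

# The pipeline vectorizes the strings and fits the classifier in one go
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

prediction = model.predict(["the match was won by the team"])[0]
```

With a real dataset you would of course split into train and test sets before judging accuracy.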
The DataFrame I showed should be just right to start testing out these scikit-learn techniques.