I am working on an NLP application with WKS and, after training, got rather poor results.
Is there a way to download the annotated documents with their entity classifications, for both the train and test sets, so that I can automatically identify in detail where the key differences are and fix them?
The documents that were annotated by humans can be downloaded under "Assets" / "Documents" -> Download Document Sets (button on the right side).
The following Python code lets you look at the data inside the downloaded file:
import json
import zipfile

import pandas as pd

# Replace the placeholder with the path to the ZIP file downloaded from WKS.
with zipfile.ZipFile("<YOUR DOWNLOADED FILE>", "r") as zf:
    with zf.open("documents.json") as arch:
        data = arch.read()

documents = json.loads(data)
print(json.dumps(documents, indent=2, separators=(",", ":")))
# One row per document, with the fields of interest as columns.
df_documentos = pd.DataFrame(
    [
        {
            "name": documento["name"],
            "text": documento["text"],
            "status": documento["status"],
            "id": documento["id"],
            "createdDate": "{:14.0f}".format(documento["createdDate"]),
            "modifiedDate": "{:14.0f}".format(documento["modifiedDate"]),
        }
        for documento in documents
    ]
)
df_documentos
# Read the description of the document sets from the same ZIP file.
with zipfile.ZipFile("<YOUR DOWNLOADED FILE>", "r") as zf:
    with zf.open("sets.json") as arch:
        data = arch.read()

sets = json.loads(data)
print(json.dumps(sets, indent=2, separators=(",", ":")))
# One row per document set; the loop variable avoids shadowing the built-in `set`.
df_sets = pd.DataFrame(
    [
        {
            "type": doc_set["type"],
            "name": doc_set["name"],
            "count": "{:6.0f}".format(doc_set["count"]),
            "id": doc_set["id"],
            "createdDate": "{:14.0f}".format(doc_set["createdDate"]),
            "modifiedDate": "{:14.0f}".format(doc_set["modifiedDate"]),
        }
        for doc_set in sets
    ]
)
df_sets
You can then iterate over the JSON files in the "gt" folder of the compressed file to get the detailed sentence splitting, tokenization and annotations.
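A minimal sketch of that iteration could look like the following. The helper names load_gt_files and summarize are my own, and I am assuming the folder appears in the archive as "gt/" with files ending in ".json". I am not assuming anything about the schema inside those files, so the sketch only reports the top-level structure of each one.

```python
import json
import zipfile


def load_gt_files(zip_path, folder="gt/"):
    """Return {archive member name: parsed JSON} for each .json file under `folder`.

    Assumption: the folder shows up in the archive with the prefix "gt/".
    """
    parsed = {}
    with zipfile.ZipFile(zip_path, "r") as zf:
        for name in zf.namelist():
            # Skip directory entries and anything outside the folder.
            if name.startswith(folder) and name.endswith(".json"):
                with zf.open(name) as arch:
                    parsed[name] = json.loads(arch.read())
    return parsed


def summarize(parsed):
    """Describe the top-level shape of each file without assuming its schema."""
    summary = {}
    for name, content in parsed.items():
        if isinstance(content, dict):
            # For JSON objects, list the top-level keys.
            summary[name] = sorted(content.keys())
        else:
            # For anything else (e.g. a JSON array), report the Python type.
            summary[name] = type(content).__name__
    return summary
```

Calling summarize(load_gt_files("<YOUR DOWNLOADED FILE>")) should show which keys each file exposes, which is a reasonable starting point before extracting sentences, tokens and annotations into DataFrames like the ones above.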
What I need is to be able to download the annotations that the machine learning model produced for the TEST documents, which are visible under "Machine Learning Model" / "Performance" / "View Decoding Results".
With these I would be able to identify specific deviations that could lead me to revise the Type dictionary and the annotation criteria.
I am sorry, but this feature is not currently available.
You can submit a feature request at the following URL: https://ibm-data-and-ai.ideas.aha.io/?project=WKS
Thank you.