I created a document-term matrix from multiple txt files. The result is a dataframe with each column being a word, and each row being a file (my final goal is to visualize the document-term matrix with matplotlib).
My dataframe also has an index, but I would rather have a column with the name of each file, since each filename is a year (for example, "1905.txt", "1906.txt", etc.). The dataframe looks something like this:
| | Hello | I | am |
|---|---|---|---|
| 0 | 1 | 2 | 1 |
| 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 2 |
And I want something like this:
| | Hello | I | am |
|---|---|---|---|
| 1905.txt | 1 | 2 | 1 |
| 1906.txt | 1 | 1 | 1 |
| 1907.txt | 0 | 1 | 2 |
It would be even better without the ".txt" extension.
How can I proceed?
Here's my current code:
from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import numpy as np
import re

# create a list for all txt files
corpus = []
# with pathlib, get all files in the corpus list
for fichier in Path("/Users/MyPath/files").rglob("*.txt"):
    corpus.append(fichier)
corpus.sort()

# read each file and replace every non-letter character with a space
all_documents = []
for fichier_txt in corpus:
    with open(fichier_txt) as f:
        fichier_txt_chaine = f.read()
    fichier_txt_chaine = re.sub('[^A-Za-z]', ' ', fichier_txt_chaine)
    all_documents.append(fichier_txt_chaine)

# here I am using sklearn, but this part is not relevant for my question
coun_vect = CountVectorizer(stop_words="english")
count_matrix = coun_vect.fit_transform(all_documents)
count_array = count_matrix.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
allDataframe = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out())
print(allDataframe)
allDataframe.to_csv("Matrice_doc_term.csv")
I suppose my problem is similar to this one, but I don't know how to adapt the answer to my code: Python Pandas add Filename Column CSV
You most likely just need to pass the index to the DataFrame constructor:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=corpus)
Or, since you have Path objects in corpus and just want the filename:
pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=[f.name for f in corpus])
Or for just the stem (the filename without the ".txt" extension):
pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names_out(),
             index=[f.stem for f in corpus])
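To see the effect without running the whole pipeline, here is a minimal sketch on toy data. The paths, the count matrix, and the column name `file_year` are made up for illustration; only the `index=` argument and `Path.stem` come from the approach above. It also shows one way to turn the labeled index into a regular column, in case you really want a column rather than an index.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the real data: three file paths and a small
# count matrix with one row per file and one column per word.
corpus = [Path("files/1905.txt"), Path("files/1906.txt"), Path("files/1907.txt")]
count_array = [[1, 2, 1], [1, 1, 1], [0, 1, 2]]
words = ["Hello", "I", "am"]

# Path.stem drops the directory and the final suffix: "files/1905.txt" -> "1905".
df = pd.DataFrame(count_array, columns=words, index=[f.stem for f in corpus])

# Optional: give the index a name, then move it into an ordinary column.
# "file_year" is an arbitrary name chosen here; it must not match a word column.
df_with_column = df.rename_axis("file_year").reset_index()

print(df)
print(df_with_column)
```

If you keep the labels as the index, `to_csv` writes them as the first column by default, so the CSV already contains the years. When moving the index into a column, pick a name that cannot appear in your vocabulary, since `reset_index` raises an error if a column with that name already exists.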