python pandas dataframe term-document-matrix

Add a column with filenames to a DataFrame with pandas


I created a document-term matrix from multiple txt files. The result is a dataframe with each column being a word, and each row being a file (my final goal is to visualize the document-term matrix with matplotlib).

My dataframe also has an index, but I would rather have a column with the name of each file, since each filename is a year (for example, "1905.txt", "1906.txt", etc.). The dataframe looks something like this:

   Hello  I  am
0      1  2   1
1      1  1   1
2      0  1   2

And I want something like this:

          Hello  I  am
1905.txt      1  2   1
1906.txt      1  1   1
1907.txt      0  1   2

It would be even better without the ".txt".

How can I proceed?

Here's my current code:

from sklearn.feature_extraction.text import CountVectorizer
from pathlib import Path
import pandas as pd
import numpy as np
import re

# Collect all .txt files (recursively) as a sorted list of Path objects
corpus = sorted(Path("/Users/MyPath/files").rglob("*.txt"))

 
# Read each file and replace every non-letter character with a space
all_documents = []
for fichier_txt in corpus:
    with open(fichier_txt) as f:
        fichier_txt_chaine = f.read()
    fichier_txt_chaine = re.sub('[^A-Za-z]', ' ', fichier_txt_chaine)
    all_documents.append(fichier_txt_chaine)

# Here I am using sklearn, but this part is not relevant to my question
coun_vect = CountVectorizer(stop_words="english")
count_matrix = coun_vect.fit_transform(all_documents)

count_array = count_matrix.toarray()
allDataframe = pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names())
print(allDataframe)
allDataframe.to_csv("Matrice_doc_term.csv")

I suppose my problem is similar to this one, but I don't know how to adapt the answer to my code: Python Pandas add Filename Column CSV


Solution

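  • You most likely just need to pass the index to the DataFrame constructor (on scikit-learn 1.0 or later, use get_feature_names_out() in place of get_feature_names(), which was removed in 1.2):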

    pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names(),
                 index=corpus)
    

    Or, since you have Path objects in corpus and just want the filename:

    pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names(),
                 index=[f.name for f in corpus])
    

    Or, for just the stem (the filename without the ".txt" extension):

    pd.DataFrame(data=count_array, columns=coun_vect.get_feature_names(),
                 index=[f.stem for f in corpus])
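
If you would rather have the filenames in an ordinary column than in the index, which is what the question literally asks for, one option is to name the index and then call reset_index(). This is only a minimal sketch: the corpus paths, count_array, and columns values below are hypothetical stand-ins for the real data built in the question, and the column name "year" is my own choice.

```python
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the sorted Path list and the count array
# produced by CountVectorizer in the question.
corpus = [Path("files/1905.txt"), Path("files/1906.txt"), Path("files/1907.txt")]
count_array = [[1, 2, 1], [1, 1, 1], [0, 1, 2]]
columns = ["Hello", "I", "am"]

# Same idea as above: use the file stems as the index.
df = pd.DataFrame(data=count_array, columns=columns,
                  index=[f.stem for f in corpus])

# Name the index, then move it into a regular column.
df_with_column = df.rename_axis("year").reset_index()

# The stems are strings; convert them to integers so they sort and
# plot as numbers. This assumes every filename is a plain year.
df_with_column["year"] = df_with_column["year"].astype(int)

print(df_with_column)
```

Keeping the years in the index is usually enough for plotting, so the reset_index() step is only needed if a real column is required, for example to control how it is written by to_csv.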