I am trying to build a model in which I load a dataframe (an Excel file from Kaggle) and use the google/tapas-large-finetuned-wtq model to query this dataset. Querying 259 rows (memory usage 62.9 KB) works fine, but when I query 260 rows (memory usage 63.1 KB) I get the error "index out of range in self". I have attached a screenshot for reference; the data I used can be found in the Kaggle datasets.
The code I am using is:
from transformers import pipeline
import pandas as pd
import torch

# df is the dataframe loaded earlier from the Kaggle Excel file,
# e.g. df = pd.read_excel("<path to the Kaggle file>")
question = "Which Country code has the quantity 30604?"
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq")
c = tqa(table=df[:100], query=question)['cells']
The error is raised on the last line, as you can see in the screenshot.
Please let me know how I can work toward a solution. Any tips would be welcome.
Because of the way TAPAS works, it needs to flatten the table into a sequence of word pieces, and that sequence needs to fit into the specified maximum sequence length (the default is 512). TAPAS has a pruning mechanism that will try to drop tokens, but it will never drop cells. Therefore, at a sequence length of 512 there is no way to fit a table with more than 512 cells.
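As a rough pre-check (a minimal sketch; the fits_in_budget helper and the 512 default are my own illustration, not part of the TAPAS API), you can compare the cell count of a dataframe against the sequence budget before calling the pipeline:

import pandas as pd

def fits_in_budget(table: pd.DataFrame, max_seq_len: int = 512) -> bool:
    # Each cell contributes at least one word piece, and the question and
    # special tokens take additional slots, so the cell count alone is a
    # lower bound on the flattened sequence length.
    num_cells = table.shape[0] * table.shape[1]
    return num_cells < max_seq_len

If this returns False, the flattened table cannot fit, no matter how short the individual cell contents are.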
If you really want to run the model on 1.8M rows, I would suggest that you split your data row-wise. For your table, for example, you would need blocks with a maximum of ~8 rows.
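Here is a minimal sketch of that row-wise split, reusing the tqa pipeline and question from the code above (query_in_chunks and the chunk size of 8 are my own illustration; pick the chunk size so that rows × columns stays well under the budget):

import pandas as pd

def query_in_chunks(table: pd.DataFrame, question: str, tqa, chunk_size: int = 8):
    # Query each block of rows separately so every flattened block fits
    # within the model's sequence budget; collect the per-block results.
    answers = []
    for start in range(0, len(table), chunk_size):
        block = table.iloc[start:start + chunk_size].astype(str)  # TAPAS expects string cells
        answers.append(tqa(table=block, query=question))
    return answers

Note that each block is answered independently, so you still have to decide how to combine or rank the per-block answers; for aggregation-style questions this simple split is not equivalent to querying the whole table at once.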
Alternatively, you can increase the sequence length, but that will also increase the cost of running the model.
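If you go that route, one possible starting point is to pass a tokenizer configured with a larger model_max_length (a sketch under assumptions: I am assuming this is enough for your transformers version, and the checkpoint's position and row/column embedding tables still impose hard limits, so check model.config.max_position_embeddings before relying on it):

from transformers import TapasTokenizer, pipeline

# Assumption: raising the tokenizer's limit only helps if the checkpoint's
# position embeddings (and row/column type vocabularies) can cover the longer
# sequence; otherwise the same "index out of range in self" error will return.
tokenizer = TapasTokenizer.from_pretrained("google/tapas-large-finetuned-wtq", model_max_length=1024)
tqa = pipeline(task="table-question-answering", model="google/tapas-large-finetuned-wtq", tokenizer=tokenizer)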