ETA: It was super obvious. All I needed to do (after specifying the split in the data call) was add .to_pandas().
I've scoured the documentation but I am not finding what I need, and I feel a little bit like I'm going crazy. I think perhaps I'm just not searching for the right terms, or I'm missing something very obvious.
I have the Hugging Face datasets library installed and am able to successfully download a dataset from the Hub in my notebook.
from datasets import load_dataset
ds = load_dataset("papluca/language-identification")
When I run ds, I see the following:
DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 70000
    })
    validation: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
})
The problem is that once it's in my notebook, I cannot figure out how to access the data itself. I'd like it to be in a pandas DataFrame so that I can work with the data as usual. I did figure out that if I run the code below (specifying train as the split), type(ds) changes to Dataset, but I still can't figure out how to actually view the data itself.
ds = load_dataset("papluca/language-identification", split="train")
When I run ds this time, it returns:
Dataset({
    features: ['labels', 'text'],
    num_rows: 70000
})
What (probably very obvious) step am I missing to be able to work with the data, so that running something like df.head() returns the below?
id | text | language
0 | the grass is green | english
1 | bonjour, ca va? | french
2 | como se dice | spanish
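First, convert the Hugging Face dataset to a pandas DataFrame by selecting a split and calling to_pandas():
df = ds['train'].to_pandas()
Then
df.head()
works fine. If you already passed split="train" to load_dataset, ds is a single Dataset rather than a DatasetDict, so call ds.to_pandas() directly.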