jupyter-notebook jupyter huggingface-datasets

View Hugging Face data in Jupyter


ETA: It was super obvious. All I needed to do (after specifying the split in the data call) was add .to_pandas().

I've scoured the documentation but can't find what I need, and I feel a little bit like I'm going crazy. Perhaps I'm just not searching for the right terms, or I'm missing something very obvious.

I have the Hugging Face datasets library installed and can successfully download a dataset from the Hub in my notebook:

from datasets import load_dataset
ds = load_dataset("papluca/language-identification")

and when I run ds I see the following:

DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 70000
    })
    validation: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 10000
    })
})

The problem is that once it's in my notebook, I can't figure out how to access the data itself. I'd like it in a pandas DataFrame so that I can work on the data like normal. I did figure out that if I run the below (specifying train as the split), ds becomes a Dataset instead of a DatasetDict, but I still can't figure out how to actually view the data itself.

ds = load_dataset("papluca/language-identification", split="train")

and when I run ds this time it returns

Dataset({
    features: ['labels', 'text'],
    num_rows: 70000
})

What (probably very obvious) step am I missing to be able to work with the data so that if I run something like df.head() it will return the below?

id | text               | language
0  | the grass is green | english
1  | bonjour, ca va?    | french
2  | como se dice       | spanish

Solution

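  • First, convert the Hugging Face dataset to a pandas DataFrame. With the DatasetDict returned by the first load_dataset call, select a split and then convert it:

    df = ds['train'].to_pandas()

    If you already passed split="train" to load_dataset, ds is a single Dataset, so call ds.to_pandas() directly.

    After that, df.head() works as usual.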