
Re-create MNIST Dataset in PyTorch


I am a newbie in PyTorch and, in spite of quite a lot of searching, I am unable to grasp some concepts about datasets. Say I retrieve the MNIST dataset as follows:

import torch
import torchvision
data = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST("/Users/Myself/PyTorch_tutorials",
               transform=torchvision.transforms.ToTensor(),
               download=True),
        batch_size=128,
        shuffle=True)

It took me a while to understand that a DataLoader object is an iterable, so I can check the shape of one training batch with

next(iter(data))[0].shape

returning

torch.Size([128, 1, 28, 28])

So I gather that 128 is the number of training rows (as per the batch_size argument), 28*28 are the pixels, and the second dimension is the label.

I also saw that the dataset is organised in such way that one could iterate like

for (x,y) in data:
    ...  # do something with x and y

but (FIRST QUESTION) I cannot figure out where the (x,y) tuple of training data and labels is defined, given that

next(iter(data))[0].shape

returns what seems to be a single tensor of shape (128,1,28,28). How does DataLoader know that the second dimension in that tensor is a label? And what if it were multi-dimensional?

Now to the main difficulty. Say that, for learning purposes, I would like to recreate the same dataset from scratch, from a numpy array. I downloaded a .csv file with 59999 rows and 785 columns, the first containing the labels (column name "5") and the remaining 784 the pixel values. I am not interested in labels for now, just the pixel values (the dataset is to be fed to an autoencoder such as this one).

I tried this

import pandas as pd
import numpy as np
data = pd.read_csv("mnist_train.csv")
labels = data["5"].values
datapoints = data.iloc[:,1:]

And then I tried

from torch.utils.data import DataLoader, TensorDataset

batch_size = 128
dataset_pytor = TensorDataset(torch.from_numpy(datapoints.values.reshape(-1,28,28)).unsqueeze(1))
my_loader = DataLoader(dataset_pytor, shuffle=True, batch_size=batch_size)

but it does not work. The code I run later fails, and the failure boils down to this:

for x in my_loader:
    
    x = x.to(device)
AttributeError: 'list' object has no attribute 'to'

I cannot understand what is going on. If I run

for x in my_loader:
    print(type(x))

every batch is of type list.

If I run the same loop over the DataLoader defined above from the built-in MNIST dataset (my_data below)

for x in my_data:
    print(type(x))

I also get lists. Only if I do

for x,y in my_data:
    print(type(x))

then I get

<class 'torch.Tensor'>

Why is this? My question is then, how to recreate the MNIST dataset from a numpy array?


Solution

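  • To answer your questions in order: [128, 1, 28, 28] is the shape of the image tensor only, where 128 is the batch size, 1 is the number of channels (an RGB image would have 3 here) and the two 28s are, as you rightly put it, the height and width of the image. So, for your first question: the DataLoader does not treat the second dimension as the label, because the labels are not in that tensor at all. Each item of the MNIST dataset is an (image, label) pair, and the DataLoader batches the images and the labels separately. Each batch is therefore a pair of tensors: next(iter(data))[0] holds the images and next(iter(data))[1] holds the labels, with shape [128]. That pair is what for (x,y) in data unpacks.

    As to your second question, after you convert the csv file into a TensorDataset object, if you iterate through it, you get a tuple, even though you passed only one tensor. The first element of each tuple is your tensor. The DataLoader keeps that structure, so each batch is a list with one tensor in it, and you need x[0] (or for (x,) in my_loader) before calling .to(device).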

      for x in dataset_pytor:
          print(x[0].shape)
          break
    

    Try the above code with your dataset_pytor object
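To see why the loop over my_loader hands you a list, here is a minimal sketch that uses random numbers in place of the CSV pixels. The array size and the names pixels, images, dataset and loader are my own, not from your code, and it assumes you want float images scaled to [0, 1], as ToTensor() produces.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the CSV pixel values (assumption: 256 rows of 784 pixels).
pixels = np.random.randint(0, 256, size=(256, 784)).astype(np.float32)
images = torch.from_numpy(pixels.reshape(-1, 28, 28)).unsqueeze(1) / 255.0

# TensorDataset was given a single tensor, so every sample is a 1-tuple.
dataset = TensorDataset(images)
sample = dataset[0]
print(type(sample), sample[0].shape)

# The DataLoader keeps that structure: each batch is a list with one tensor.
loader = DataLoader(dataset, batch_size=128, shuffle=True)
batch = next(iter(loader))
print(type(batch), len(batch))

# Unpack the single element to get the image tensor itself.
(x,) = batch
print(x.shape, x.dtype)
```

In your training loop that becomes for (x,) in my_loader: followed by x = x.to(device).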

    An easy way to recreate MNIST is to create your own dataset object:

    I am assuming the first column of the dataframe contains the labels and the remaining 784 columns contain the pixel values.

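      import numpy as np
      import torch
      from torch.utils.data import Dataset

      class MNIST(Dataset):
          def __init__(self, dataframe, transform=None):
              self.transform = transform
              # First column holds the labels, the other 784 columns the pixels.
              self.labels = dataframe.iloc[:, 0].to_numpy(dtype=np.int64)
              self.pixels = dataframe.iloc[:, 1:].to_numpy(dtype=np.float32)

          def __len__(self):
              return len(self.labels)

          def __getitem__(self, index):
              # Shape (1, 28, 28); dividing by 255 mimics the [0, 1] range of ToTensor().
              img_tensor = torch.from_numpy(self.pixels[index].reshape(1, 28, 28)) / 255.0
              label_tensor = torch.tensor(self.labels[index])
              if self.transform is not None:
                  img_tensor = self.transform(img_tensor)
              return img_tensor, label_tensor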
    

    Pass your dataframe to the class above to create the dataset:

      from torch.utils.data import DataLoader

      # data is your dataframe
      trainset = MNIST(data)
      my_loader = DataLoader(trainset, shuffle=True, batch_size=128)
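If you do not need per-sample transforms, a TensorDataset built from two tensors should give the same kind of (x, y) batches as the built-in dataset without a custom class. A rough sketch, assuming the label column comes first and using a random dataframe (frame, a made-up stand-in for your CSV) so that it runs without the file:

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for pd.read_csv("mnist_train.csv"):
# one label column followed by 784 pixel columns (300 rows is an arbitrary choice).
rng = np.random.default_rng(0)
raw = np.column_stack([rng.integers(0, 10, size=300),
                       rng.integers(0, 256, size=(300, 784))])
frame = pd.DataFrame(raw)

# Labels as int64, pixels as float32 scaled to [0, 1] like ToTensor() would do.
labels = torch.from_numpy(frame.iloc[:, 0].to_numpy(dtype=np.int64))
pixels = frame.iloc[:, 1:].to_numpy(dtype=np.float32) / 255.0
images = torch.from_numpy(pixels.reshape(-1, 1, 28, 28))

# Two tensors in, so every sample is an (image, label) pair.
loader = DataLoader(TensorDataset(images, labels), batch_size=128, shuffle=True)

for x, y in loader:
    print(x.shape, y.shape)  # torch.Size([128, 1, 28, 28]) torch.Size([128])
    break
```

The custom class is still the better fit once you want a transform applied to each image, since TensorDataset only indexes the tensors it was given.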