I am new to PyTorch and, in spite of quite a search, I am unable to grasp some concepts about datasets. Say I retrieve the MNIST dataset as follows:
import torch
import torchvision
data = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST("/Users/Myself/PyTorch_tutorials",
                               transform=torchvision.transforms.ToTensor(),
                               download=True),
    batch_size=128,
    shuffle=True)
It took me a while to understand that a DataLoader object is an iterable, so I can check the shape of one training batch with
next(iter(data))[0].shape
returning
torch.Size([128, 1, 28, 28])
So I gather that 128 is the number of training rows (as per the batch_size argument), 28*28 are the pixels, and the second dimension is the label.
I also saw that the dataset is organised in such a way that one could iterate like
for (x,y) in data:
    pass  # do something
but (FIRST QUESTION) I cannot figure out where the (x,y) tuple, i.e. training data and labels, is defined, given that
next(iter(data))[0].shape
returns what seems a single tensor, of shape (128,1,28,28).
How does DataLoader know that the second dimension in that tensor is a label? And what if it were multi-dimensional?
Now to the main difficulty. Say that, for learning purposes, I would like to recreate the same dataset from scratch, from a numpy array. I downloaded a .csv file with 59999 rows and 795 columns, the first containing the labels (column name "5"), the remaining ones the pixel values. I am not interested in the labels for now, just the pixel values (the dataset is to be fed to an autoencoder such as this one).
I tried this
import pandas as pd
import numpy as np
data = pd.read_csv("mnist_train.csv")
labels = data["5"].values
datapoints = data.iloc[:,1:]
And then I tried
from torch.utils.data import TensorDataset, DataLoader
batch_size = 128
dataset_pytor = TensorDataset(torch.from_numpy(datapoints.values.reshape(-1,28,28)).unsqueeze(1))
my_loader = DataLoader(dataset_pytor, shuffle=True, batch_size=batch_size)
but it does not work. The code I run later fails, and the failure boils down to this:
for x in my_loader:
    x = x.to(device)
AttributeError: 'list' object has no attribute 'to'
I cannot understand what is going on. If I run
for x in my_loader:
    print(type(x))
I get lists.
If I run the same for my_data (the DataLoader defined above as data, using the built-in MNIST dataset)
for x in my_data:
    print(type(x))
I also get lists. Only if I do
for x,y in my_data:
    print(type(x))
then I get
<class 'torch.Tensor'>
Why is this? My question is then, how to recreate the MNIST dataset from a numpy array?
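To answer your questions in order: (128, 1, 28, 28) is the shape of the image tensor, where 128 is the batch size, 1 is the number of channels (an RGB image would have 3 here), and the two 28s are, as you rightly put it, the height and width of the image. So none of these dimensions is the label. Each batch from the DataLoader is a pair (images, labels): next(iter(data))[0] selects only the images, next(iter(data))[1] holds the 128 labels, and that pair is what for (x, y) in data unpacks.
As to your second question, after you convert the csv file into a TensorDataset object, if you iterate through it, you get a tuple for each sample, even when the dataset holds a single tensor. The DataLoader likewise returns each batch as a list containing one tensor, which is why x.to(device) fails on the list while x[0].to(device) works. The first element of each tuple is your tensor.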
for x in dataset_pytor:
    print(x[0].shape)
    break
Try the above code with your dataset_pytor object.
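To see where the (x, y) pair comes from, here is a minimal sketch on synthetic data (the random tensors and the names images_only and pairs are my own stand-ins, not part of MNIST). It assumes the default collate behaviour of DataLoader, which turns the per-sample tuples into a list with one batched tensor per tuple position.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 10 random "images" and 10 integer labels
images = torch.rand(10, 1, 28, 28)
labels = torch.randint(0, 10, (10,))

# One tensor in the dataset: every sample is a 1-tuple,
# so every batch is a list holding a single tensor
images_only = DataLoader(TensorDataset(images), batch_size=4)
batch = next(iter(images_only))
(x,) = batch  # unpack the only element; batch[0] works as well

# Two tensors in the dataset: every batch holds images and labels
pairs = DataLoader(TensorDataset(images, labels), batch_size=4)
xb, yb = next(iter(pairs))

print(type(batch), len(batch), x.shape)
print(xb.shape, yb.shape)
```

So in your loop you could write for (x,) in my_loader: and then call x.to(device) on the tensor itself.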
An easy way to recreate MNIST is to create your own Dataset class. I am assuming that the first column of the dataframe contains the labels:
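import torch
from torch.utils.data import Dataset, DataLoader

class MNIST(Dataset):
    def __init__(self, dataframe, transform=None):
        self.dataframe = dataframe
        self.transform = transform
        # Pixel values: every column except the first
        self.sample = self.dataframe.iloc[:, 1:]
        # Labels: the first column, whatever its name (it is "5" in your csv)
        self.label = self.dataframe.iloc[:, 0]

    def __len__(self):
        return len(self.label)

    def __getitem__(self, index):
        # One row of 784 pixels becomes a tensor of shape (1, 28, 28)
        img_tensor = torch.from_numpy(self.sample.iloc[index].values.reshape(-1, 28, 28))
        label_tensor = torch.tensor(int(self.label.iloc[index]))
        # Apply the optional transform to the image only
        if self.transform is not None:
            img_tensor = self.transform(img_tensor)
        return img_tensor, label_tensor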
Your dataframe will be given as an input to create the above class.
trainset = MNIST(data)  # data is your dataframe
my_loader = DataLoader(trainset, shuffle=True, batch_size=128)
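If you would rather stay with TensorDataset, a rough sketch of building the dataset straight from a numpy array could look like the following. The array called table is a hypothetical random stand-in for the csv contents (with a real file you would take the values of your dataframe instead), and I am assuming column 0 holds the labels and the other 784 columns hold pixel values in 0..255. Dividing by 255 mimics the scaling that ToTensor applies to 8-bit images.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the csv: column 0 = label, columns 1..784 = pixels
rng = np.random.default_rng(0)
table = rng.integers(0, 256, size=(16, 785))
table[:, 0] = rng.integers(0, 10, size=16)

# Scale pixels to [0, 1] as float32 and reshape to (N, 1, 28, 28)
pixels = table[:, 1:].astype(np.float32) / 255.0
images = torch.from_numpy(pixels).reshape(-1, 1, 28, 28)
targets = torch.from_numpy(table[:, 0].astype(np.int64))

# Two tensors in the dataset, so each batch unpacks as (x, y)
loader = DataLoader(TensorDataset(images, targets), batch_size=8, shuffle=True)
x, y = next(iter(loader))
print(x.shape, x.dtype, y.shape, y.dtype)
```

Keeping the labels in the dataset costs nothing, and an autoencoder can simply ignore y in the training loop.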