Tags: r, out-of-memory, working-directory, revolution-r

R - set working directory to HDFS


I need to create some data frames from very large data sets in R. Is there a way to change my working directory so that the R objects I create are saved into HDFS? I don't have enough space under /home to save these large data frames, but I need to use a few data frame functions that require a data frame as input.


Solution

  • If we are using a data frame to do operations on data from HDFS, we are technically using memory, not disk space. The limiting factor is therefore memory (RAM), not the available disk space in any working directory, so changing the working directory won't make much difference (a quick check of the in-memory footprint is sketched after the code below).

    You don't need to copy the file from HDFS to the local compute context to process it as a data frame.

    Use rxReadXdf() to convert the XDF data set into a data frame directly from HDFS.

    Something like this (assuming you are in the Hadoop compute context):

    hdfsFS <- RxHdfsFileSystem()
    # hdfsFS stores the Hadoop file system details
    
    airDS <- RxTextData(file="/data/revor/AirlineDemoSmall.csv", fileSystem=hdfsFS)
    # text data source created from the CSV file at the above HDFS location
    
    airxdf <- RxXdfData(file="/data/AirlineXdf", fileSystem=hdfsFS)
    # location where the composite XDF file will be created in HDFS
    # make sure this location exists in HDFS
    
    
    airXDF <- rxImport(inData=airDS, outFile=airxdf)
    # import the CSV into a composite XDF in HDFS
    
    
    airDataFrame <- rxReadXdf(file=airXDF)
    
    # Now airDataFrame is a data frame in memory
    # use class(airDataFrame) to double check
    # do your required operations on this data frame
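
    As a quick sanity check on the memory point above, here is a minimal sketch that confirms the result is an ordinary in-memory data frame and shows roughly how much RAM it occupies. It assumes airDataFrame was created by the steps above and uses only base R functions (class(), dim(), object.size()), nothing RevoScaleR-specific:

    class(airDataFrame)                              # should print "data.frame"
    dim(airDataFrame)                                # rows and columns now held in RAM
    format(object.size(airDataFrame), units = "MB")  # approximate memory footprint
    # It is this memory footprint, not disk space in a working directory,
    # that limits how large a data frame you can create.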