r, h2o, r-bigmemory

How to handle huge data and build a model in R


I have been trying to build a model (clustering) on a dataset of 70 million records in R. I have tried every option I could find: the ff library, h2o (which throws an error), and the bigmemory and biganalytics packages, but I could not handle such huge data in R with any of them.

Could you please suggest any other working option that I can use to build the model? My laptop has 4 GB of RAM and a 64-bit processor.


Solution

  • As the name suggests, machine learning needs a machine (PC), and more than that, a machine suited to the specific job. Still, there are some techniques to deal with limited hardware:

    1. Down-Sampling

    Most of the time you don't need all of the data for machine learning; you can sample your data to get a much smaller set that can be handled on your laptop.

    Of course, you may need some tool (e.g. a database) to do the sampling work on your laptop; see the sketch below.
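    For illustration, here is a minimal sketch of database-backed sampling with the sqldf package. The file name `records.csv` and the 700,000-row sample size are placeholders, not details from the question:

    ```r
    # read.csv.sql() imports the CSV into a temporary on-disk SQLite database
    # (not into RAM) and returns only the rows selected by the query.
    library(sqldf)

    sampled <- read.csv.sql(
      "records.csv",                                              # placeholder file name
      sql = "select * from file order by random() limit 700000"   # roughly 1% of 70m rows
    )
    ```

    The resulting sample is small enough to cluster with the usual in-memory tools.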

    2. Data Points

    Depending on the number of variables you have, not every record may be unique. You can "aggregate" your data by your key variables: each unique combination of values is a data point, and the number of duplicates can be used as the weight in clustering methods.

    But depending on the chosen clustering method and the purpose of the project, this aggregated data may not give you the best model. A sketch of the aggregation step follows below.
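    As a sketch of the aggregation step with data.table, where `dt` is your data (or a sample of it) already loaded, and `var1`/`var2` stand in for whatever your key variables are:

    ```r
    library(data.table)

    # Collapse duplicate rows into unique "data points" and count how often
    # each combination occurs; the count becomes the case weight.
    points <- dt[, .(weight = .N), by = .(var1, var2)]
    ```

    The `weight` column can then be passed to any clustering routine that accepts case weights.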

    3. Split into Parts

    Assuming you have all your data in one CSV file, you can read it in chunks using data.table::fread by specifying the number of rows that fit on your laptop.

    https://stackoverflow.com/a/21801701/5645311

    You can process each data chunk in R separately and build a model on each one. Eventually you will have many clustering results, which you can combine as a kind of bagging method; a rough sketch follows below.
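    A rough sketch of the chunked workflow; the file name, chunk size, and the choice of k-means with five centres are illustrative assumptions only:

    ```r
    library(data.table)

    infile     <- "records.csv"                    # placeholder file name
    chunk_size <- 1e6                              # rows per chunk; tune to your RAM
    header     <- names(fread(infile, nrows = 0L)) # column names only
    n          <- nrow(fread(infile, select = 1L)) # total row count via one column

    models <- list()
    for (i in seq_len(ceiling(n / chunk_size))) {
      skip  <- (i - 1L) * chunk_size + 1L          # +1 skips the header line
      chunk <- fread(infile, skip = skip, nrows = chunk_size,
                     header = FALSE, col.names = header)

      # One clustering per chunk -- k-means on the numeric columns as an example
      num_cols    <- names(chunk)[sapply(chunk, is.numeric)]
      models[[i]] <- kmeans(scale(chunk[, ..num_cols]), centers = 5)

      rm(chunk); gc()                              # free memory before the next chunk
    }
    # 'models' now holds one clustering result per chunk; combining them
    # (e.g. by clustering the pooled centroids) is the bagging-style step.
    ```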

    4. Cloud Solution

    Nowadays cloud solutions are really popular, and you can move your data manipulation and modelling work to the cloud.

    If that feels too expensive for the whole project, you can do the down-sampling in the cloud and then bring the sample back to your laptop, in case you cannot find a suitable local tool for the sampling work.

    5. A New Machine

    This is the option I would think of first. A new machine may still not handle all of your data (depending on the number of variables), but it will definitely make the other calculations more efficient.

    For a personal project, 32 GB of RAM with an i7 CPU would be good enough to start machine learning. A Titan GPU would give you a speed boost with some machine learning methods (e.g. xgboost, lightgbm, keras, etc.).

    For commercial purposes, a server or cluster solution makes more sense for running a clustering job on 70 million records.