r igraph social-graph

Social graph analysis. 60GB and 100 million nodes


Good evening,

I am trying to analyse the aforementioned data (edgelist or Pajek format). My first thought was R with the igraph package, but memory limitations (6GB) won't do the trick. Will a 128GB PC be able to handle the data? Are there any alternatives that don't require the whole graph in RAM?

Thanks in advance.

P.S.: I have found several programs, but I would like to hear some pro (yeah, that's you) opinions on the matter.


Solution

  • If you only want degree distributions, you likely don't need a graph package at all. I recommend the bigtabulate package so that

    1. your R objects are file backed so that you aren't limited by RAM
    2. you can parallelize the degree computation using foreach

    Check out their website for more details. To give a quick example of this approach, let's first create an example with an edgelist involving 1 million edges among 1 million nodes.

    set.seed(1)
    N <- 1e6
    M <- 1e6
    edgelist <- cbind(sample(1:N,M,replace=TRUE),
                      sample(1:N,M,replace=TRUE))
    colnames(edgelist) <- c("sender","receiver")
    write.table(edgelist,file="edgelist-small.csv",sep=",",
                row.names=FALSE,col.names=FALSE)
    

    I next concatenate this file 10 times to make the example a bit bigger.

    system("
    for i in $(seq 1 10) 
    do 
      cat edgelist-small.csv >> edgelist.csv 
    done")
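
    The system() call above assumes a Unix-like shell. Where none is available, the same concatenation can be sketched in base R with file.append(). The file names below are hypothetical temporary files used only for illustration, not the files from this example.

    ```r
    # Hypothetical temp files standing in for edgelist-small.csv / edgelist.csv.
    small_file <- tempfile(fileext = ".csv")
    big_file   <- tempfile(fileext = ".csv")

    # A tiny two-edge file to keep the sketch fast.
    writeLines(c("1,2", "3,4"), small_file)

    # Start from an empty target so reruns do not keep growing it.
    file.create(big_file)

    # Append the small file to the big one 10 times.
    for (i in 1:10) {
      file.append(big_file, small_file)
    }

    n_lines <- length(readLines(big_file))
    ```

    Creating the target file fresh each time avoids accidentally appending to output left over from a previous run.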
    

    Next we load the bigtabulate package and read in the text file with our edgelist. The command read.big.matrix() creates a file-backed object in R.

    library(bigmemory)    # provides read.big.matrix()
    library(bigtabulate)  # provides bigtable()
    x <- read.big.matrix("edgelist.csv", header = FALSE, 
                         type = "integer",sep = ",", 
                         backingfile = "edgelist.bin", 
                         descriptor = "edgelist.desc")
    nrow(x)  # 1e7 as expected
    

    We can compute the outdegrees by using bigtable() on the first column.

    outdegree <- bigtable(x,1)
    head(outdegree)
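
    For intuition about why this count parallelizes well (point 2 above), here is a minimal base-R sketch, not the bigtabulate implementation: the rows are split into chunks, each chunk is tabulated independently, and the partial counts are summed. The toy data, N, and chunk_size are illustrative assumptions. Since the per-chunk calls do not depend on each other, the lapply() could be replaced by a foreach loop.

    ```r
    # Toy stand-in for the sender column; with real data each chunk
    # would be a slice of rows read from the file-backed matrix.
    set.seed(1)
    N <- 1000                                   # number of nodes (illustrative)
    senders <- sample(1:N, 1e5, replace = TRUE)

    # Split the row indices into chunks (chunk size is an arbitrary choice).
    chunk_size <- 2e4
    chunks <- split(seq_along(senders),
                    ceiling(seq_along(senders) / chunk_size))

    # Count senders within each chunk; every call is independent.
    partial <- lapply(chunks, function(idx) tabulate(senders[idx], nbins = N))

    # Summing the per-chunk counts gives the full outdegree vector.
    outdeg_chunked <- Reduce(`+`, partial)

    # Same result as tabulating everything at once.
    stopifnot(identical(outdeg_chunked, tabulate(senders, nbins = N)))
    ```

    Unlike a table, tabulate() with nbins = N also reports a count of 0 for nodes that never appear as a sender.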
    

    A quick sanity check to make sure bigtable() is working as expected:

    # Check table worked as expected for first "node"
    j <- as.numeric(names(outdegree[1]))  # get name of first node
    all.equal(as.numeric(outdegree[1]),   # outdegree's answer
              sum(x[,1]==j))              # manual outdegree count
    

    To get the indegrees, run bigtable(x,2) on the second column instead.
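
    Going from per-node degrees to a degree distribution is one more tabulation. The sketch below uses base R's table() on a tiny hand-made sender column instead of a bigtable() result, assuming the latter behaves like a named vector of counts, as the sanity check above suggests.

    ```r
    # Tiny hand-made sender column (illustrative data only).
    senders <- c(1, 1, 1, 2, 2, 3, 4, 4, 5)

    # Node -> outdegree.
    outdeg <- table(senders)

    # Outdegree -> number of nodes having that outdegree.
    deg_dist <- table(outdeg)

    # The same distribution expressed as proportions of nodes.
    deg_prop <- prop.table(deg_dist)
    ```

    Nodes that never appear as a sender are absent from the table, so the zero-degree class has to be added separately if it matters for the analysis.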