I have been using variations of the same pheatmap code to make heat maps for several months now without any problems, but lately it has stopped being able to cluster rows. Columns still cluster as normal, but whenever I try to add row clustering I get the same error message about NA/NaN/Inf in the data.
All of my datasets look very similar, with primarily just the number of rows changing (between 40 and 2000+). Here is the head of the data I'm currently using, with all 0s already replaced with NA:
> head(protdata, 4)
PR1 PO1 WA1 PR2 PO2 WA2 PR3 PO3 WA3 PR4 PO4 WA4 PR5 PO5
[1,] 0.004420 NA 0.002370 0.00141 0.002890 0.003740 4.36e-03 0.005370 0.00143 0.002070 NA 0.00428 0.005220 NA
[2,] 0.000233 8.85e-06 0.000136 NA 0.000056 0.000713 5.98e-05 NA NA 0.000541 NA NA 0.006700 4.95e-05
[3,] 0.001220 1.79e-05 0.000447 0.00183 0.000136 NA 6.99e-04 0.000298 0.00267 0.001330 NA NA 0.000655 1.36e-04
[4,] 0.001170 6.84e-04 0.000282 0.00173 0.001620 0.000648 1.05e-03 0.003570 0.00101 0.001410 NA NA 0.002960 NA
WA5 PR6 PO6 WA6 PR7 PO7 WA7
[1,] 0.001030 0.00448 NA 1.53e-03 0.005220 0.005520 1.86e-03
[2,] 0.000139 0.00145 0.000484 8.88e-05 0.000118 0.000122 1.79e-05
[3,] 0.003680 0.00033 NA NA NA 0.000163 3.99e-03
[4,] 0.000393 0.00023 NA NA 0.000625 NA 7.15e-04
There are a lot of 0s in the datasets, but clustering has always worked as long as they are converted to NA. None of the columns or rows have zero variance (a quick check is sketched after the code below). Here is the code I've been using to make the heat maps:
library(pheatmap)
library(RColorBrewer)

protdata <- as.matrix(input[,-1])   # drop the first (gene name) column, keep the numeric values
protdata[protdata == 0] <- NA       # treat 0s as missing
rownames <- input[,1]
annotation_row <- data.frame(rownames)
rownames(protdata) <- annotation_row$Gene
pheatmap(log10(protdata), scale="row", border_color=NA, na_col="white", breaks=seq(-2,2,.01),
         color=colorRampPalette(rev(brewer.pal(n=7, name="RdYlBu")))(400))
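For reference, this is roughly how I convinced myself there are no all-NA or zero-variance rows or columns (a quick sketch on the protdata matrix built above):
# no rows that are entirely NA
any(rowSums(!is.na(protdata)) == 0)
# no zero-variance rows or columns (variance computed on the non-NA values)
any(apply(protdata, 1, var, na.rm=TRUE) == 0, na.rm=TRUE)
any(apply(protdata, 2, var, na.rm=TRUE) == 0, na.rm=TRUE)
All three come back FALSE.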
And here is the error message I keep getting:
Error in hclust(d, method = method) :
NA/NaN/Inf in foreign function call (arg 10)
The only way I can get a plot to appear is by adding cluster_rows=FALSE to the call above (shown below). I am stumped as to why this was working perfectly and now isn't, when as far as I know I haven't changed anything about the way I'm inputting my data.
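For completeness, the version that does draw a plot is just the same call with row clustering switched off:
pheatmap(log10(protdata), scale="row", cluster_rows=FALSE, border_color=NA, na_col="white",
         breaks=seq(-2,2,.01), color=colorRampPalette(rev(brewer.pal(n=7, name="RdYlBu")))(400))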
Any help would be greatly appreciated!!
I converted your file into a CSV and read it in:
mat = read.csv("peet_protdata.csv",row.names=1)
mat[mat==0] = NA
There are indeed no all-NA or zero-variance rows, as you said, but if you do the dist calculation there are NAs in some of the entries, indicating that between some pairs of rows it is not possible to calculate a euclidean distance (they have no columns in which both are observed). The euclidean distance matrix needs to be free of NAs for the clustering to work:
sum(is.na(as.matrix(dist(mat))))
[1] 434
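The mechanism is easy to reproduce with made-up numbers (not your data): when two rows share no column in which both are observed, dist() has nothing left to compute a distance from and returns NA for that entry, and hclust() then fails on it:
toy = rbind(a = c(1, 2, NA, NA),
            b = c(NA, NA, 3, 4),
            c = c(1, 2, 3, 4))
dist(toy)   # the a-b entry is NA: those two rows are never observed in the same column
# hclust(dist(toy)) then fails for the same reason pheatmap's row clustering does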
Below is a quick (and admittedly nasty) bit of code to find the rows producing the most NAs and remove them until the distance matrix is complete:
giveNAs = which(is.na(as.matrix(dist(mat))),arr.ind=TRUE)
head(giveNAs)
row col
G103 18 1
G100 53 1
So, for example, rows 18 and 1 are giving you problems, and you can see they share no complete (pairwise) observations:
mat[c(1,18),]
PR1 PO1 WA1 PR2 PO2 WA2 PR3 PO3 WA3 PR4
G56 NA NA NA 0.000483209 NA NA 0.000433088 NA NA 0.000203604
G103 NA NA NA NA NA NA NA NA NA NA
PO4 WA4 PR5 PO5 WA5 PR6 PO6 WA6 PR7
G56 0.000294898 0.000269724 0.000299341 0.00046987 NA NA NA NA NA
G103 NA NA NA NA NA NA NA NA NA
PO7 WA7 PR8 PO8 WA8 PR9 PO9 WA9
G56 0.000682594 0.000656168 0.000702988 NA NA NA NA NA
G103 NA NA NA NA NA 0.000629987 0.000504159 NA
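You can confirm there is no column in which both of those rows are observed:
sum(!is.na(mat[1,]) & !is.na(mat[18,]))
[1] 0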
We tabulate which rows appear most often among the NA entries and check how many of the worst offenders need to be removed:
tab = sort(table(c(giveNAs)),decreasing=TRUE)   # how often each row index shows up in an NA entry
checkNA = sapply(1:length(tab),function(i){
  # drop the i worst rows and count the NAs left in the distance matrix
  sum(is.na(as.matrix(dist(mat[-as.numeric(names(tab[1:i])),]))))
})
rmv = names(tab)[1:min(which(checkNA==0))]      # smallest set whose removal leaves no NAs
rmv
[1] "18" "53" "81" "84" "54" "97" "55" "38" "70" "100" "31" "93"
[13] "52" "80" "91"
We remove those 15 rows:
mat = mat[-as.numeric(rmv),]
pheatmap(mat)
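With those rows gone the distance matrix is complete, so row clustering works again. You should also be able to go back to your original call (log transform, row scaling, and your colour settings, with pheatmap and RColorBrewer loaded as in your question) on the filtered matrix, something like:
pheatmap(log10(mat), scale="row", border_color=NA, na_col="white", breaks=seq(-2,2,.01),
         color=colorRampPalette(rev(brewer.pal(n=7, name="RdYlBu")))(400))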