I have a simple example running on a Dataproc master node where Tachyon, Spark, and Hadoop are installed.
I'm hitting a replication error when writing to Tachyon from Spark. Is there any way to specify that no replication is needed?
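For reference, the write is roughly of this shape (a minimal sketch; the app name, Tachyon master address, and output path are placeholders, not the exact job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: app name, Tachyon master address, and output path
// are placeholders, not the actual job.
val sc = new SparkContext(new SparkConf().setAppName("tachyon-write-sketch"))

val data = sc.parallelize(1 to 1000)
// Writing through the tachyon:// scheme; Tachyon then persists blocks to its
// HDFS under-filesystem, which is where the replication error below surfaces.
data.saveAsTextFile("tachyon://localhost:19998/tmp/result")

sc.stop()
```

The failure shows up in the logs as: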
15/10/17 08:45:21 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/tachyon/workers/1445071000001/3/8 could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3110)
The portion of the log shown above is just a warning, but a Spark error follows immediately after it.
I checked the Tachyon config docs, and found something that might be causing this:
`tachyon.underfs.hdfs.impl = "org.apache.hadoop.hdfs.DistributedFileSystem"`
Given that this is all on a Dataproc master node, with Hadoop preinstalled and HDFS working with Spark, I would think that this is a problem solvable from within Tachyon.
You can adjust the default replication by manually setting `dfs.replication` inside `/etc/hadoop/conf/hdfs-site.xml` to some value other than Dataproc's default of 2. Setting it just on your master should at least cover driver calls, `hadoop fs` calls, and it appears to propagate correctly into `hadoop distcp` calls as well, so you most likely don't need to worry about also setting it on every worker, as long as the workers get their FileSystem configs from job-scoped configurations.
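For example (a sketch; a value of 1 matches a single-node setup where only one copy is possible anyway), the property in `/etc/hadoop/conf/hdfs-site.xml` would look like:

```xml
<!-- /etc/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

If you'd rather scope the change to a single Spark job instead of editing the cluster config, the same key can be set on the job's Hadoop configuration (a sketch of the job-scoped route mentioned above):

```scala
// Job-scoped override: applies to FileSystem instances created from
// this SparkContext's Hadoop configuration, leaving cluster defaults alone.
sc.hadoopConfiguration.set("dfs.replication", "1")
```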
Note that a replication of 1 already means a single copy of the data in total, rather than "one replica in addition to the main copy", so replication can't really go lower than 1. The minimum replication is enforced separately and is controlled with `dfs.namenode.replication.min` in the same `hdfs-site.xml`; you can see it referenced as `minReplication` in BlockManager.java.
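If you ever did need to change that enforced minimum (you shouldn't for this error, since its default is already 1), it would go in the same file (a sketch):

```xml
<!-- /etc/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
</property>
```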