If we have a folder named folder containing only .txt files, we can read them all with sc.textFile("folder/*.txt"). But what if the folder folder contains more folders named date-wise, like 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case the structure is even more nested and complex, so a general answer is preferred.
If the directory structure is regular, let's say something like this:
folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt
you can use a * wildcard for each level of nesting, as shown below:
>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']
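If the nesting depth varies from branch to branch, a single fixed-depth glob won't match everything. As a minimal sketch, assuming the same pyspark shell (so sc already exists) and the hypothetical /folder layout above: Hadoop-style path strings accept a comma-separated list, so globs of different depths can be combined in one textFile call, or you can build one RDD per pattern and merge them with union:

>>> # Comma-separated paths: Hadoop's input format splits on commas,
>>> # so two-level and three-level matches land in one RDD.
>>> logs = sc.textFile("/folder/*/*.log,/folder/*/*/*.log")
>>>
>>> # Equivalent alternative: one RDD per glob, merged with union().
>>> patterns = ["/folder/*/*.log", "/folder/*/*/*.log"]
>>> logs = sc.union([sc.textFile(p) for p in patterns])

Hadoop globs also support {a,b} alternation, e.g. "/folder/{03,04}/*.log", if you only want some of the date folders.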