apache-spark

Read all files in a nested folder in Spark


If we have a folder named folder containing all .txt files, we can read them all using sc.textFile("folder/*.txt"). But what if I have a folder named folder that contains further subfolders named by date, such as 03, 04, ..., which in turn contain some .log files? How do I read these in Spark?

In my case, the structure is even more nested and complex, so a general answer is preferred.


Solution

  • If the directory structure is regular, let's say something like this:

    folder
    ├── a
    │   ├── a
    │   │   └── aa.txt
    │   └── b
    │       └── ab.txt
    └── b
        ├── a
        │   └── ba.txt
        └── b
            └── bb.txt
    

    you can use a * wildcard for each level of nesting, as shown below:

    >>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()
    
    [u'file:/folder/a/a/aa.txt',
     u'file:/folder/a/b/ab.txt',
     u'file:/folder/b/a/ba.txt',
     u'file:/folder/b/b/bb.txt']
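
  • If the nesting is irregular or arbitrarily deep, a recursive file listing is an option. The following is a minimal sketch, assuming Spark 3.0 or later, where the DataFrame reader supports the recursiveFileLookup and pathGlobFilter options; /folder and the *.log filter stand in for the hypothetical layout from the question:

    >>> df = (spark.read
    ...       .option("recursiveFileLookup", "true")  # descend into every subfolder
    ...       .option("pathGlobFilter", "*.log")      # keep only files matching *.log
    ...       .text("/folder"))
    >>> df.count()  # one row per line, across all matched files

    With the RDD API, sc.textFile also accepts a comma-separated list of paths, so several globs can be combined in a single call, e.g. sc.textFile("/folder/03/*.log,/folder/04/*.log").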