google-cloud-dataflow apache-beam apache-beam-io

Apache Beam FileIO match - What's the better/more efficient way to match files?


I'm just wondering: does using a wildcard affect how Beam matches files? For instance, if I want to match a file with Apache Beam, is there an advantage to specifying a direct path to the file (e.g. folder/subfolder/file.txt)? Or, if I pass just a wildcard to the match() method, would that be equally efficient or worse in terms of the framework's performance?

Thanks


Solution

  • Compared to the cost of reading the file (and of spinning up workers, if running on a distributed runner), the cost of matching is negligible. On the other hand, multiple separate reads (each with its own direct path) will generally incur more overhead than a single read over a wildcard match.