I am doing
cat this_files_* > output.txt  # reads approx. 20 million files and concatenates them into output.txt
When I ran this from the command line last week on another dataset, it was significantly faster than it is now, run through a bash script main.sh.
Is it possible that cat is slower or limited when run from a bash script?
Also, suggestions for doing this faster are welcome! (Bash or Python)
The main reason I am doing it with cat is that other options ran into RAM issues, and it just seems easier. (After this, I run: sort output.txt > output_sorted.txt)
When cat is not reading from a terminal on stdin, writing to a terminal on stdout, or actively sending error messages to a terminal on stderr, its performance doesn't depend on the terminal at all -- and here, stdout is redirected to output.txt. Similarly, cat never inspects its parent process, so it doesn't know or care whether it was started by an interactive shell, a script, or any other process.
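If you want to verify this on your own machine, timing the identical command both ways should show no meaningful difference (a minimal sketch; the one-line main.sh below is a hypothetical stand-in for your script):

# run the command directly in the interactive shell and time it
time cat this_files_* > output.txt

# wrap the identical command in a script and time that instead
printf '%s\n' 'cat this_files_* > output.txt' > main.sh
time bash main.sh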
That said, using cat makes your code unnecessarily inefficient.
sort this_files_* >output_sorted.txt
...still requires an amount of working space that scales with the size of your files -- as is also required in the cat case -- but as long as you're using a high-quality sort implementation such as the GNU one, it can use temporary files for storage and so sort contents larger than RAM.
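If that single sort invocation is still too slow, GNU sort has knobs aimed at exactly this larger-than-RAM case (a sketch, assuming GNU coreutils; /mnt/scratch is a placeholder for any filesystem with plenty of free space):

# -S 50%       lets sort use up to half of physical RAM before spilling to disk
# --parallel=4 sorts with four threads
# -T DIR       puts the temporary spill files on a disk of your choosing
# -o FILE      writes the result directly to FILE
sort -S 50% --parallel=4 -T /mnt/scratch -o output_sorted.txt this_files_*

Note that the -T directory needs roughly as much free space as the input itself, since the temporary spill files can approach the total data size.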