awksedgrep

More efficient sed on huge input files


I regularly have to reduce huge db sql dumps (more than 100gb) down to more manageable file sizes by removing unnecessary INSERT statements. I do that with the following script. I'm concerned that my script involves iterating multiple times through the source file, which is obviously computationally expensive. Is there a way to combine all my SED statements into one, so the source file only needs to be processed once, or can be processed in a more efficient way?

sed '/INSERT INTO `attendance_log`/d' input.sql | \
sed '/INSERT INTO `analytics_models_log`/d' | \
sed '/INSERT INTO `backup_logs`/d' | \
sed '/INSERT INTO `config_log`/d' | \
sed '/INSERT INTO `course_completion_log`/d' | \
sed '/INSERT INTO `errorlog`/d' | \
sed '/INSERT INTO `log`/d' | \
sed '/INSERT INTO `logstore_standard_log`/d' | \
sed '/INSERT INTO `mnet_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `portfolio_log`/d' | \
sed '/INSERT INTO `prog_completion_log`/d' | \
sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | \
sed '/INSERT INTO `totara_sync_log`/d' | \
sed '/INSERT INTO `prog_messagelog`/d' | \
sed '/INSERT INTO `stats_daily`/d' | \
sed '/INSERT INTO `course_modules_completion`/d' | \
sed '/INSERT INTO `question_attempt_step_data`/d' | \
sed '/INSERT INTO `scorm_scoes_track`/d' | \
sed '/INSERT INTO `question_attempts`/d' | \
sed '/INSERT INTO `grade_grades_history`/d' | \
sed '/INSERT INTO `task_log`/d' > reduced.sql 

Is this idea going in the right direction?

cat input.sql | sed '/INSERT INTO `analytics_models_log`/d' | sed '/INSERT INTO `backup_logs`/d' | sed '/INSERT INTO `config_log`/d' | sed '/INSERT INTO `course_completion_log`/d' | sed '/INSERT INTO `errorlog`/d' | sed '/INSERT INTO `log`/d' | sed '/INSERT INTO `logstore_standard_log`/d' | sed '/INSERT INTO `mnet_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `portfolio_log`/d' | sed '/INSERT INTO `prog_completion_log`/d' | sed '/INSERT INTO `local_amosdatasend_log_entry`/d' | sed '/INSERT INTO `totara_sync_log`/d' | sed '/INSERT INTO `prog_messagelog`/d' | sed '/INSERT INTO `stats_daily`/d' | sed '/INSERT INTO `course_modules_completion`/d' | sed '/INSERT INTO `question_attempt_step_data`/d' | sed '/INSERT INTO `scorm_scoes_track`/d' | sed '/INSERT INTO `question_attempts`/d' | sed '/INSERT INTO `grade_grades_history`/d' | sed '/INSERT INTO `task_log`/d' > reduced.sql 

Solution

  • If you have multiple sed ... | sed ... you can combine them by writing sed -e ... -e ... or sed '...; ...'. But in this case there is an even more efficient method:

    sed -E '/INSERT INTO `(attendance_log|analytics_models_log|...)`/d'
    

    Alternatively, switch to grep which could be even faster:

    grep -vE 'INSERT INTO `(attendance_log|analytics_models_log|...)`'
    

    or

    grep -vFf <(printf 'INSERT INTO `%s`\n' attendance_log analytics_models_log ...)
    

    You could even try to replace all ..._log and logs with a regex, if this is what you want. With this, you only have to explicitly list the non-log files:

    INSERT INTO `([^`]*logs?|local_amosdatasend_log_entry|stats_daily|...)`