regex, perl

Extract multiple tables of data from Spark logs


I'm trying to extract the tabular data from Apache PySpark logs using a Perl one-liner.

Below is a sample log file; it contains three tabular outputs from Spark:

24/06/19 01:00:00 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:01 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:02 INFO org.apache.spark.scheduler.DAGScheduler: Registering RDD 0 (text at <stdin>:1)
24/06/19 01:00:03 DEBUG pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: This is a debug message from my PySpark code.
+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
24/06/19 01:00:04 INFO pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: Processing data in MyPySparkApp.
24/06/19 01:00:05 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on host localhost: Executor lost connection, trying to reconnect.
24/06/19 01:00:07 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:08 INFO org.apache.spark.SparkContext: Stopped SparkContext
+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+

24/06/19 01:00:09 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:09 INFO org.apache.spark.SparkContext: Stopped SparkContext

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

root
 |-- empno: integer (nullable = true)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: integer (nullable = true)
 |-- hiredate: string (nullable = true)
 |-- sal: integer (nullable = true)
 |-- comm: integer (nullable = true)
 |-- deptno: integer (nullable = true)
 
24/06/19 01:00:20 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:21 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:22 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:23 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp2

When there is only one tabular output, the command below works:

perl -0777 -ne ' while(m/^\x2b(.+)\x2b$/gsm) { print "$&\n" } ' spark.log # \x2b="+"

but with multiple tabular outputs, it pulls all the text from the first table's opening border to the last table's closing border as one match. How do I get all three tabular outputs from my sample log?

Expected output:

Table-1:

+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+

Table-2:

+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+

Table-3:

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

Solution

  • The problem with your attempt is that . matches any character (including line feed characters) when the s modifier is used, so the greedy .+ stretches from the first table's border all the way to the last.


    The following will do the trick, and it does it without loading the entire log file into memory:

    perl -ne'print if /^[|+]/' spark.log 
    

    Same idea, using grep:

    grep '^[|+]' spark.log
    

    The following version identifies the individual tables (allowing you to do something at the table level):

    perl -gne'
       print "Table ", ++$i, ":\n", $&
          while /
             ^
             ( \+ .* \+\n )
             \| .* \|\n
             \1
             (?: \| .* \|\n )*
             \1
          /xmg;
    ' spark.log
    

    The same, but using fewer lines:

    perl -gne'
       print "Table ", ++$i, ":\n", $&
          while /^(\+.*\+\n)\|.*\|\n\1(?:\|.*\|\n)*\1/mg;
    ' spark.log
    

    These are stricter about what counts as a table, so they may be more robust than the first solution. Note that they read the whole log file into memory; that isn't strictly necessary, but it keeps the one-liner simpler.

    (Use -0777 instead of -g if your Perl is too old to support -g.)
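    For completeness, the same table-numbering idea can be sketched in awk as well. This is my own line-oriented variant, looser than the backreference versions above: it simply treats every maximal run of lines starting with + or | as one table.

```shell
# Label each maximal run of "+"/"|" lines as a table, printing a blank
# line after it. Lines with leading spaces (like printSchema's " |-- ...")
# don't match /^[+|]/, so the schema dump is skipped.
awk '
  /^[+|]/ { if (!in_table) print "Table " ++n ":"; in_table = 1; print; next }
          { if (in_table) print ""; in_table = 0 }
' spark.log
```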