apache-spark log4j log4j2 emr amazon-emr

How to reduce logs for Apache Spark in EMR?


I have a question about an Apache Spark job running on AWS EMR. Each time I run the job it generates a lot of logs, around 5-10 GB in my case, but about 80% of them are INFO-level messages that are useless to me. How can I reduce these logs?

I used log4j2 to change Spark's log level to "warn" to avoid the unnecessary logs, but the logs come from different components (some from Spark, some from YARN, some from EMR) and they are all merged together. How can I fix this? Does anyone have experience with this? I don't want to reconfigure each node in the cluster.

I tried the solution below, but it doesn't seem to work on EMR:

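Logger logger = LogManager.getLogger("sparklog");
// The log4j2 API Logger has no setLevel(); Configurator (org.apache.logging.log4j.core.config) sets it by name
Configurator.setLevel("sparklog", Level.WARN);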

The log4j2.xml configuration file I used is below:
<Configuration status="WARN" monitorInterval="300"> <!-- reload the configuration file every 300 seconds -->
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n" /> <!-- controls the output format -->
        </Console>
    </Appenders>
    <Loggers>
        <Logger name="sparklog" level="warn" additivity="false"> <!-- sets the level of the "sparklog" logger -->
            <AppenderRef ref="Console" />
        </Logger>
        <Root level="error">
            <AppenderRef ref="Console" />
        </Root>
    </Loggers>
</Configuration>

Solution

  • Since no one answered my question, here is the solution I found myself.

    1. Upload the configuration file to your master node.

    scp -i ~/.ssh/emr_dev.pem /Users/x/log4j_files/log4j.properties hadoop@ec2-xxx-xxx-xxx.eu-west-1.compute.amazonaws.com:/usr/tmp/
    

    2. In your submit script, attach the file with the --files option:

    "--files": "/usr/tmp/log4j.properties"
    

    This solution works properly for me.
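
    The steps above do not show what log4j.properties contains. A minimal sketch of such a file might look like the following, assuming the log4j 1.x properties syntax (which the file name suggests, and which Spark versions before 3.3 use). The appender name "console" and the pattern are illustrative choices, not taken from my actual file:

    ```properties
    # Assumed example: log only WARN and above to a console appender
    log4j.rootCategory=WARN, console

    # Hypothetical appender named "console" writing to stderr
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    ```

    Depending on the Spark version and deploy mode, the driver and executors may also need to be pointed at the shipped file (for example through a -Dlog4j.configuration JVM option), so treat this as a starting point rather than a complete setup.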