python, logging, azure-application-insights, fastapi, opencensus

Recursive logging issue when using Opencensus with FastAPI


I have a problem with my implementation of OpenCensus, logging in Python, and FastAPI. I want to log incoming requests to Application Insights in Azure, so I added a FastAPI middleware to my code following the Microsoft docs and this GitHub post:

import os

from fastapi import FastAPI, Request
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.attributes_helper import COMMON_ATTRIBUTES
from opencensus.trace.propagation.trace_context_http_header_format import TraceContextPropagator
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.span import SpanKind
from opencensus.trace.tracer import Tracer

# Imports, attribute keys and app setup assumed from the rest of the application
# (not shown in the original snippet)
HTTP_HOST = COMMON_ATTRIBUTES['HTTP_HOST']
HTTP_METHOD = COMMON_ATTRIBUTES['HTTP_METHOD']
HTTP_PATH = COMMON_ATTRIBUTES['HTTP_PATH']
HTTP_ROUTE = COMMON_ATTRIBUTES['HTTP_ROUTE']
HTTP_URL = COMMON_ATTRIBUTES['HTTP_URL']
HTTP_STATUS_CODE = COMMON_ATTRIBUTES['HTTP_STATUS_CODE']

app = FastAPI()
propagator = TraceContextPropagator()

@app.middleware('http')
async def middleware_opencensus(request: Request, call_next):
    tracer = Tracer(
        span_context=propagator.from_headers(request.headers),
        exporter=AzureExporter(connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']),
        sampler=AlwaysOnSampler(),
        propagator=propagator)

    with tracer.span('main') as span:
        span.span_kind = SpanKind.SERVER
        tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
        tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
        tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
        tracer.add_attribute_to_current_span(HTTP_ROUTE, request.url.path)
        tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))

        response = await call_next(request)
        tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)

    return response

This works great when running locally, and all incoming requests to the API are logged to Application Insights. Since implementing OpenCensus, however, when the app is deployed in a Container Instance on Azure, after a couple of days (approximately 3) what looks like a recursive logging issue arises (more than 30,000 logs per second!), with messages such as Queue is full. Dropping telemetry, before the app finally crashes after a few hours of runaway logging:

[screenshot of the container logs flooded with the repeated error messages]

Our logger.py file where we define our logging handlers is as follows:

import logging.config
import os
import tqdm
from pathlib import Path
from opencensus.ext.azure.log_exporter import AzureLogHandler


class TqdmLoggingHandler(logging.Handler):
    """
        Class for enabling logging during a process with a tqdm progress bar.
        Using this handler logs will be put above the progress bar, pushing the
        process bar down instead of replacing it.
    """
    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.formatter = logging.Formatter(fmt='%(asctime)s <%(name)s> %(levelname)s: %(message)s',
                                           datefmt='%d-%m-%Y %H:%M:%S')

    def emit(self, record):
        try:
            msg = self.format(record)
            tqdm.tqdm.write(msg)
            self.flush()
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)


logging_conf_path = Path(__file__).parent
logging.config.fileConfig(logging_conf_path / 'logging.conf')

logger = logging.getLogger(__name__)
logger.addHandler(TqdmLoggingHandler(logging.DEBUG))  # Add tqdm handler to root logger to replace the stream handler
if os.getenv('APPLICATION_INSIGHTS_CONNECTION_STRING'):
    logger.addHandler(AzureLogHandler(connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))

warning_level_loggers = ['urllib3', 'requests']
for lgr in warning_level_loggers:
    logging.getLogger(lgr).setLevel(logging.WARNING)
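
The logging.conf file itself isn't included here. Purely as a hypothetical illustration (not our actual configuration), an equivalent setup expressed with logging.config.dictConfig could look like this:

import logging.config

# Hypothetical stand-in for the unshown logging.conf, for illustration only:
# a single console handler using the same format as the TqdmLoggingHandler above.
LOGGING_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'default': {
            'format': '%(asctime)s <%(name)s> %(levelname)s: %(message)s',
            'datefmt': '%d-%m-%Y %H:%M:%S',
        },
    },
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'default',
            'stream': 'ext://sys.stdout',
        },
    },
    'root': {
        'level': 'INFO',
        'handlers': ['console'],
    },
}

logging.config.dictConfig(LOGGING_CONFIG)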

Does anyone have any idea what could be causing this issue, or has anyone encountered something similar? I can't tell what the 'first' error log is because of the sheer volume of logging.

Please let me know if additional information is required.

Thanks in advance!


Solution

  • We decided to revisit the problem and found two helpful threads describing similar, if not exactly the same, behaviour to what we were seeing:

    As described in the second thread, it seems that OpenCensus attempts to send a trace to Application Insights and, on failure, batches the failed logs and retries sending them after 15 s (the default). This goes on indefinitely until it succeeds, which can cause the huge, seemingly recursive spam of failure logs.

    One solution, introduced and proposed by Izchen in this comment, is to set enable_local_storage=False, as sketched below.
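
    Applied to the AzureExporter used in the middleware above, this would look roughly like the following sketch (based on the opencensus-ext-azure exporter options, not our exact code; the same flag is available on AzureLogHandler):

    exporter = AzureExporter(
        connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING'],
        # Without local storage, telemetry that fails to send is dropped instead
        # of being persisted and retried indefinitely.
        enable_local_storage=False,
    )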

    Another solution would be to migrate to OpenTelemetry, which should not have this potential problem; this is the solution we are currently running. Do keep in mind that OpenCensus is still the officially supported application-monitoring solution from Microsoft and OpenTelemetry is still very young. OpenTelemetry does seem to have a lot of support, however, and is gaining more and more traction.

    As for the implementation of OpenTelemetry, we did the following to trace our requests:

    if os.getenv('APPLICATION_INSIGHTS_CONNECTION_STRING'):
        from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
        from opentelemetry import trace
        from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
        from opentelemetry.propagate import extract
        from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_NAMESPACE, SERVICE_INSTANCE_ID, Resource
        from opentelemetry.sdk.trace import TracerProvider
        from opentelemetry.sdk.trace.export import BatchSpanProcessor
    
        provider = TracerProvider()
    
        processor = BatchSpanProcessor(AzureMonitorTraceExporter.from_connection_string(
            os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))
        provider.add_span_processor(processor)
        trace.set_tracer_provider(provider)
    
        FastAPIInstrumentor.instrument_app(app)
    

    OpenTelemetry supports a lot of instrumentors that can be used to create spans for, for example, Requests, PyMongo, Elasticsearch, Redis, etc. => https://opentelemetry.io/registry/.
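
    For example, tracing outgoing HTTP calls made with the requests library (assuming the opentelemetry-instrumentation-requests package is installed) is roughly:

    from opentelemetry.instrumentation.requests import RequestsInstrumentor

    # After this call, each outgoing `requests` call produces a client span on the
    # tracer provider configured above.
    RequestsInstrumentor().instrument()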

    If you want to write your own custom tracers/spans like in the OpenCensus example above, you can do something like this:

    # These attribute keys still come from OpenCensus, purely for convenience
    from opencensus.trace.attributes_helper import COMMON_ATTRIBUTES

    HTTP_HOST = COMMON_ATTRIBUTES['HTTP_HOST']
    HTTP_METHOD = COMMON_ATTRIBUTES['HTTP_METHOD']
    HTTP_PATH = COMMON_ATTRIBUTES['HTTP_PATH']
    HTTP_ROUTE = COMMON_ATTRIBUTES['HTTP_ROUTE']
    HTTP_URL = COMMON_ATTRIBUTES['HTTP_URL']
    HTTP_STATUS_CODE = COMMON_ATTRIBUTES['HTTP_STATUS_CODE']
    
    provider = TracerProvider()
    
    processor = BatchSpanProcessor(AzureMonitorTraceExporter.from_connection_string(
            os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)
    
    @app.middleware('http')
    async def middleware_opentelemetry(request: Request, call_next):
        tracer = trace.get_tracer(__name__)
    
        with tracer.start_as_current_span('main',
                                          context=extract(request.headers),
                                          kind=trace.SpanKind.SERVER) as span:
            span.set_attributes({
                HTTP_HOST: request.url.hostname,
                HTTP_METHOD: request.method,
                HTTP_PATH: request.url.path,
                HTTP_ROUTE: request.url.path,
                HTTP_URL: str(request.url)
            })
    
            response = await call_next(request)
            span.set_attribute(HTTP_STATUS_CODE, response.status_code)
    
        return response
    

    The AzureLogHandler from our logger.py configuration wasn't needed any more with this solution and was thus removed.

    Some other sources that might be useful: