python performance feedparser

Python program using feedparser slows over time


I have a Python program that runs in a loop, downloading 20k RSS feeds with feedparser and inserting the feed data into an RDBMS.

I have observed that it starts at 20-30 feeds a minute and gradually slows down. After a couple of hours it is down to 4-5 feeds an hour. If I kill the program and restart it from where it left off, the throughput goes back to 20-30 feeds a minute.

It certainly is not MySQL which is slowing down.

What could be potential issues with the program?


Solution

  • In all likelihood the issue is memory-related. You are probably holding the feeds in memory, or somehow accumulating objects that aren't getting garbage collected. To diagnose:

    1. Look at the size of your process (Task Manager on Windows, top on Unix/Linux) and monitor how it grows as feeds are processed.
    2. Then use a memory profiler to figure out what exactly is consuming the memory.
    3. Once you have found it, you can restructure the code accordingly.
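    Alongside an external tool like top, you can have the program report its own memory use. A minimal sketch using the standard-library `resource` module (Unix/Linux only; the function name and the allocation are illustrative):

    ```python
    import resource

    def peak_rss_kb():
        # ru_maxrss is the peak resident set size: kilobytes on Linux,
        # bytes on macOS
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    before = peak_rss_kb()
    data = [b"x" * 1024 for _ in range(10000)]  # allocate roughly 10 MB
    after = peak_rss_kb()
    ```

    Logging a value like this after every batch of feeds will show whether memory really climbs steadily over the run.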

    A few tips:

    1. Make an explicit garbage collection call (gc.collect()) after clearing any large data structures you no longer need
    2. Use a multiprocessing scheme where you spawn multiple processes that each handle a smaller number of feeds
    3. Consider moving to a 64-bit system if you are on 32-bit
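    The first two tips can be sketched together with `multiprocessing.Pool`. The `maxtasksperchild` argument recycles each worker process after a fixed number of tasks, so any memory a worker accumulates is reclaimed when it exits. The `process_feed` function here is a hypothetical stub; the real feedparser call and DB insert would go in its place:

    ```python
    import gc
    from multiprocessing import Pool

    def process_feed(url):
        # feedparser.parse(url) and the RDBMS insert would go here;
        # this stub just builds and discards a stand-in structure
        parsed = {"url": url, "entries": []}
        result = parsed["url"]
        parsed = None   # drop the reference to the bulky data (tip 1)
        gc.collect()    # force a collection pass
        return result

    if __name__ == "__main__":
        urls = ["http://example.com/feed/%d" % i for i in range(100)]
        # each worker is replaced after 25 feeds, so leaks cannot
        # accumulate past that point (tip 2)
        with Pool(processes=4, maxtasksperchild=25) as pool:
            results = pool.map(process_feed, urls, chunksize=5)
    ```

    Even if the leak is never found, worker recycling caps its effect, which matches the observation that restarting the program restores throughput.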

    Some memory profiler suggestions:

    1. https://pypi.python.org/pypi/memory_profiler (this one is quite good, and the decorators are helpful)
    2. https://stackoverflow.com/a/110826/559095
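    If installing a third-party profiler isn't an option, the standard-library `tracemalloc` module (Python 3.4+) can point at the allocation site directly. A minimal sketch, with a deliberately leaky loop standing in for the feed-processing code:

    ```python
    import tracemalloc

    tracemalloc.start()

    leaked = []
    for i in range(1000):
        leaked.append(b"x" * 1024)  # simulate a structure that keeps growing

    # statistics are sorted by size, so [0] is the biggest allocation site
    snapshot = tracemalloc.take_snapshot()
    top = snapshot.statistics("lineno")[0]
    ```

    In a real run, the top entry's file and line number would point at whatever is accumulating the feed data.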