python pandas matplotlib pandoc

Efficient access to data in a series of transient Python scripts


Pandoc has a filter that accepts Python snippets and uses (for example) Matplotlib to generate charts. I want to write documents that contain many charts, all generated from a common data source (e.g. a pandas data frame).

As an example:

Here's the first chart:

~~~{.matplotlib}
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect('somedb.db')
query = '''SELECT something'''

df = pd.read_sql_query(query, conn).dropna()
fig, ax = plt.subplots()
ax.something()
~~~

The problem is that every chart has to regenerate the data frame, which is expensive. What I'd like to do is create the data frame once and reuse it in every subsequent chart.

Any ideas?


Solution

  • The author of pandoc-plot kindly provided the following answer on GitHub:

    Out-of-the-box there's no handling of your use-case in the pandoc-plot filter. Each code block that gets turned into a plot is intended to be independent from all others. This has many benefits, most importantly performance -- I wrote pandoc-plot for book-sized workloads, with close to 100 figures.

    Using a preamble doesn't work because the preamble script gets copy-pasted into every code block before pandoc-plot renders a figure. Therefore, the creation of your dataframe will still be duplicated.

    I would recommend you proceed with a script to wrap your usage of pandoc. For example (assuming you use bash):

    # Run a script that goes through your expensive computation,
    # storing the results as a CSV file
    python create-data.py
    
    # Render the document, where plots can reference the file created by
    # your Python script instead of re-creating the pandas dataframe for every plot
    pandoc --filter pandoc-plot ...
    
    # Clean up temporary data file if you know where it is
    

    You can communicate between the bash script above and your document plots using environment variables.
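
As an illustration of the approach above, here is a minimal sketch of what `create-data.py` might contain. It is not taken from the pandoc-plot author's answer. The environment variable name `PLOT_DATA_CSV` and the helpers `cache_path` and `export_query_to_csv` are hypothetical names chosen for this example, and it uses only the standard library so that the expensive query runs exactly once.

```python
# create-data.py (sketch): run the expensive query once and cache the result as CSV.
import csv
import os
import sqlite3
import tempfile

# Hypothetical variable name; the wrapper script and the plot blocks must agree on it.
ENV_VAR = "PLOT_DATA_CSV"


def cache_path():
    """Return the CSV path from the environment, or a default in the temp directory."""
    default = os.path.join(tempfile.gettempdir(), "plot-data.csv")
    return os.environ.get(ENV_VAR, default)


def export_query_to_csv(db_path, query, csv_path):
    """Run `query` against a SQLite database and write header plus rows to `csv_path`.

    Returns the number of data rows written.
    """
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(query)
        header = [col[0] for col in cursor.description]
        rows = cursor.fetchall()
    finally:
        conn.close()
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Intended call, using the placeholder database and query from the question:
#   export_query_to_csv("somedb.db", "SELECT something", cache_path())
```

With the wrapper script exporting `PLOT_DATA_CSV` before running both `create-data.py` and pandoc, each `{.matplotlib}` block could then replace the database query with something like `df = pd.read_csv(os.environ["PLOT_DATA_CSV"]).dropna()`. Reading a CSV is still repeated per plot, but it is typically much cheaper than re-running the query.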