Running Python in excel, how can a dataframe be used as the data for an excel table, to then make pivot charts with the data.
For example, in excel, cell A1 contains a pd.DataFrame: pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])
. This dataframe is 4x4 (including the index which is output in an excel column), how can this be loaded into a excel table with 4 columns, 1 header row, and 3 rows? Ability to have a variable number of rows and headers would be nice also.
This can easily be viewed as a range in excel by changing the cell output to 'Excel Value' from 'Python Object', but making that range a table fails due to spill errors and so can't go into a pivot chart.
This could avoided easily by plotting using python, but that looses the excel chart familiarity and portability to other MS programs which is needed (unless anyone knows of a way to create excel pivot charts that keep MS program functionality in python?)
Ideally need a solution without VBA.
The issue here only partially has to do with "Python in Excel", namely as far as the shape is concerned (1). Otherwise, the difficulty lies with connecting the PivotTable to a dynamic range (2), setting up the refresh (3), and adding a timestamp for that refresh (4).
1. Shape dataframe
With your current use of pd.DataFrame
, you'll end up with a blank top-left cell. E.g., with PY()
in cell A1
:
A | B | C | |
---|---|---|---|
1 | 2 | 0 | 0 |
2 | 0 | 2 | 2 |
3 | 1 | 2 | 2 |
Left like this, the connected PivotTable will protest: "The PivotTable field name is not valid...". To avoid this (the easy way), you want to fill up A1
with a "header" as well. Either leave out the index altogether (it will not show), or use a trick and add a name for the columns (not the index):
E.g. for PY()
in cell A1
with "Excel Value" selected, we can use something like below to have the data spill into the sheet:
from string import ascii_uppercase
rows = np.random.randint(1,11)
cols = 4
df = pd.DataFrame(np.random.randint(0,3,size=(rows,cols)),
columns=pd.Index(list(ascii_uppercase[:cols]), name='index'),
index=range(1, rows+1))
df
N.B.1 Make sure that the length of your data is at least one row (hence randint(1, ...)
above). With 0 rows, you will likely trigger an error with the PivotTable: "One or more field names used in the report are no longer valid", plus the selected "Fields" will reset.
N.B.2 You can add more columns, and if you relabel columns with "Fields" used by the PivotTable (e.g., from "B"
for column C to "B"
for column E), this will keep the PivotTable functionality as is. However, if you drop a label, the PivotTable will also drop that "Field" and you will have to reconfigure the "Fields" (go to "PivotTable Analyze" -> "Field List").
2. Source PivotTable
To use the spilled range as input for a PivotTable, it is best to create a "Name". Go to "Formulas -> Name Manager", add a "Name" and for "Refers to" use a reference to the cell with =PY()
and add a #
for the spill. Here, I've used df_data
with =Sheet1!$A$1#
.
Now, go to "Insert -> PivotTable". At "Table/Range" fill in the "Name" df_data
(without =
at the beginning). For auto-refresh at certain intervals / opening the file, make sure also to tick the box for "Add this data to the Data Model":
Configure the PivotTable, and go to "PivotTable Analyze" to add your PivotChart. N.B. You have to select a cell inside your PivotTable to see "PivotTable Analyze" in the ribbon.
If you are happy with manual refreshing, you can skip ticking the "Data Model" box, and simply use "Data -> Refresh" or "Refresh all" as required, and you are done.
3. Configure refresh
For auto-refreshing, go to "Data -> Queries & Connections" and in the window pane (to the right) select WorksheetConnection_Book1!df_data
. Right click to enter "Properties...":
Now, set the desired refresh in minutes + refresh upon opening the file. N.B. PY()
cells do not auto-refresh, also not on opening. You will need to use "reset" in "Formulas -> Python (Preview)" to trigger a re-calculation.
With refresh for your python
code:
4. Add a timestamp for the refresh
Somewhat annoyingly, the above implementation lacks a timestamp, so that the user cannot ascertain that there was a refresh. With VBA adding one would be easy. Without the aid of VBA, you can get fairly close by following the instructions provided in this answer by @Wizhi. Here, you do not have to tick the "Add this data to the Data Model" box. After adding the Query, copy the refresh settings that you used for the PivotChart connection, and the timestamp and data refresh will run successively at the interval, with a negligible delta.
You can end up with something like this: