I'm trying to get table contents from a Confluence page in JSON format. It's all SSO so I can only use an API key and I haven't found a way to access Confluence with the requests library. Unfortunately, the output of the Confluence API is plain html.
This is what I've got so far. Can the Confluence library output tables in JSON format (rather than displaying the raw html code in a dictionary)?
from atlassian import Confluence
import os
user = "me@myself.com"
api_key = os.environ['confluence_api_key']
server = "https://xxxxxx.atlassian.net"
api_url = "/rest/api/content"
page_id = "12345"
confluence = Confluence(url=server, username=user, password=api_key)
page = confluence.get_page_by_title("TEST", "page 1", expand="body.storage")
content = page["body"]["storage"]
print(content)
And the output looks like this:
{'value': '<p>Something something.</p><p /><table data-layout="default" ac:local-id="xxx"><colgroup><col style="width: 226.67px;" /><col style="width: 226.67px;" /><col style="width: 226.67px;" /></colgroup><tbody><tr><th><p><strong>name</strong></p></th><th><p><strong>type</strong></p></th><th><p><strong>comment</strong></p></th></tr><tr><td><p>text1</p></td><td><p>varchar(10)</p></td><td><p /></td></tr><tr><td><p>123</p></td><td><p>int</p></td><td><p /></td></tr></tbody></table>', 'representation': 'storage', 'embeddedContent': [], '_expandable': {'content': '/rest/api/content/12345'}}
404 error with the requests library:
request_url = "{server}{api_url}/{page_id}?expand=body.storage".format(
server=server, api_url=api_url, page_id=page_id
)
requestResponse = requests.get(request_url, auth=(user, api_key))
print(requestResponse.status_code)
Why would you want to use the requests library directly? The atlassian python API is using it already, sparing you some work.
I happen to have run into the same problem this week and I had to parse tables with BeautifulSoup. I think for a generic solution it would be best to get your tables as Dataframes:
from atlassian import Confluence
import os
from bs4 import BeautifulSoup
import pandas as pd
user = "me@myself.com"
api_key = os.environ['confluence_api_key']
server = "https://xxxxxx.atlassian.net"
confluence = Confluence(url=server, username=user, password=api_key)
page = confluence.get_page_by_title("TEST", "page 1", expand="body.storage")
body = page["body"]["storage"]["value"]
tables_raw = [[[cell.text for cell in row("th") + row("td")]
for row in table("tr")]
for table in BeautifulSoup(body, features="lxml")("table")]
tables_df = [pd.DataFrame(table) for table in tables_raw]
for table_df in tables_df:
print(table_df)
You can then convert your DataFrames to JSON using to_json depending on how you want to structure your dictionaries...
EDIT: style information (and other tags like links) get lost in this case (we only get the text of the cell) so beware if you want to update the content of your page after modification
Also if you wanna use table content as dictionary keys you might want to change row/column index
EDIT2: This is an old answer but since it got upvoted recently, I would just like to add that in this case, tables_raw
can be computed with the pandas built-in read_html
:
tables_df = pd.read_html(body)
This will even set tables headers as df column names directly and has arguments to extract links or parse dates. However, especially if you need more than cell.text
(in my initial case I wanted to import icons) in your df, the above answer is still valid.