I am running Apache Druid datastore locally. I am loading data from a Kafka stream.
On Druid, I can see the column names:
And then, using druiddb (https://github.com/betodealmeida/druid-dbapi), I write a SQL query and read the data into my Python environment as a pandas DataFrame. However, some column names do not appear:
from druiddb import connect
# https://github.com/betodealmeida/druid-dbapi
import pandas as pd
druid_host = "localhost"
druid_port = 8888
druid_path = "/druid/v2/sql"
druid_scheme = "http"
druid_query = """SELECT * FROM malaria_cases_full"""
druid_connection = connect(host=druid_host, port=druid_port, path=druid_path, scheme=druid_scheme)
druid_cursor = druid_connection.cursor()
df = pd.DataFrame(druid_cursor.execute(druid_query))
df.head(n=10)
I suggest using the (official?) Python connector for Druid, which is pydruid.
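For context, the usual reason the names vanish is that a generic DB-API cursor yields rows as plain tuples, so pd.DataFrame falls back to integer column labels, whereas pydruid's cursor yields rows as namedtuples, whose field names pandas picks up automatically (the names are also recoverable from the cursor's description attribute). A minimal sketch, using made-up rows in place of a live Druid result set:

```python
from collections import namedtuple
import pandas as pd

# Plain tuples (what a generic DB-API cursor may yield): names are lost.
plain_rows = [
    ("2016-06-27T18:50:07.084Z", "true"),
    ("2016-06-27T10:20:13.238Z", "false"),
]
df_plain = pd.DataFrame(plain_rows)
print(df_plain.columns.tolist())  # [0, 1] -- integer labels, no names

# Namedtuples (what pydruid's cursor yields): pandas uses the field names.
Row = namedtuple("Row", ["channel", "isRobot"])
named_rows = [Row("#en.wikipedia", "true"), Row("#fr.wikipedia", "false")]
df_named = pd.DataFrame(named_rows)
print(df_named.columns.tolist())  # ['channel', 'isRobot']
```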
Or simply use read_sql with a SQLAlchemy engine:
# pip install sqlalchemy==1.4.4 pydruid  (pydruid provides the "druid" dialect)
import pandas as pd
from sqlalchemy import MetaData, Table
from sqlalchemy.engine import create_engine

engine = create_engine("druid://localhost:8888/druid/v2/sql")  # add ?header=True if needed
ta = Table("wikipedia", MetaData(bind=engine), autoload=True)
df = pd.read_sql(ta.select(), engine)
Output:
print(df.columns)
Index(['__time', 'isRobot', 'channel', 'flags', 'isUnpatrolled', 'page',
'diffUrl', 'added', 'comment', 'commentLength', 'isNew', 'isMinor',
'delta', 'isAnonymous', 'user', 'deltaBucket', 'deleted', 'namespace',
'cityName', 'countryName', 'regionIsoCode', 'metroCode',
'countryIsoCode', 'regionName'],
dtype='object')
print(df)
__time isRobot ... countryIsoCode regionName
21243 2016-06-27T18:50:07.084Z true ...
10272 2016-06-27T10:20:13.238Z false ... AU New South Wales
... ... ... ... ... ...
21271 2016-06-27T18:51:31.698Z false ...
5773 2016-06-27T06:16:43.741Z false ...
[24433 rows x 24 columns]
Used datasource (wikipedia):