Tags: sql, database, pentaho, data-warehouse, pdi

Data insertion takes too long in Pentaho (PDI)


I'm trying to insert 500k+ rows into a table, but it takes too long. The speed starts at around 1000 rows/s but quickly drops to about 100 rows/s.

Here is the query I'm executing in my transformation:

SELECT
    accident_index,
    longitude,
    latitude,
    accident_severity AS accident_severity_fk,
    number_of_vehicles,
    number_of_casualties,
    EXTRACT('year' FROM formatted_date) * 10000
        + EXTRACT('month' FROM formatted_date) * 100
        + EXTRACT('day' FROM formatted_date) AS date_fk,
    r.road_information_pk AS road_information_fk,
    speed_limit,
    p.pedestrian_crossing_information_pk AS pedestrian_crossing_information_fk,
    l.light_conditions_pk AS light_conditions_fk,
    w.weather_conditions_pk AS weather_conditions_fk,
    rs.road_surface_conditions_pk AS road_surface_conditions_fk,
    s.special_conditions_at_site_pk AS special_conditions_at_site_fk,
    c.carriageway_hazards_pk AS carriageway_hazards_fk,
    urban_or_rural_area AS urban_or_rural_area_fk,
    time
FROM staging.accidents ac
LEFT JOIN core.road_information r
       ON r.road_type = ac.road_type
      AND r.junction_control = COALESCE(ac.junction_control, 'Unknown')
LEFT JOIN core.pedestrian_crossing_information p
       ON p.pedestrian_crossing_human_control = COALESCE(ac.pedestrian_crossing_human_control, 'Unknown')
      AND p.pedestrian_crossing_physical_facilities = COALESCE(ac.pedestrian_crossing_physical_facilities, 'Unknown')
LEFT JOIN core.light_conditions l
       ON l.light_conditions = COALESCE(ac.light_conditions, 'Unknown')
LEFT JOIN core.weather_conditions w
       ON w.weather_conditions = COALESCE(ac.weather_conditions, 'Unknown')
LEFT JOIN core.road_surface_conditions rs
       ON rs.road_surface_conditions = COALESCE(ac.road_surface_conditions, 'Unknown')
LEFT JOIN core.special_conditions_at_site s
       ON s.special_conditions_at_site = COALESCE(ac.special_conditions_at_site, 'Unknown')
LEFT JOIN core.carriageway_hazards c
       ON c.carriageway_hazards = COALESCE(ac.carriageway_hazards, 'Unknown')

I'm using an Insert/Update step. For testing purposes, I tried inserting only a couple of rows and skipping the joins, but it ran even slightly slower. Any idea how I can improve the speed?


Solution

  • Your speed problem is more likely on the write side than the read side. The best way to check is to disable the hop going into the Insert/Update step and run the transformation. It should run at several thousand or even tens of thousands of rows per second. If that's the case, optimising your read query is pointless; the bottleneck is in the output.

    Insert/Update is notoriously slow, as it needs a round trip to look up each row and decide whether to insert or update it. If the target table is initially empty, there's no need for Insert/Update: just sort the rows on the lookup keys, pick the values you want for each attribute, and use a Table Output step.
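    Since the read query already runs inside the database, another option for an empty target is to do the whole load as one set-based statement (e.g. via an Execute SQL script step) so no rows stream through PDI at all. A sketch, assuming PostgreSQL and a hypothetical fact table name `core.accidents_fact` (substitute your actual target table and column list):

    ```sql
    -- Sketch only: core.accidents_fact is a placeholder name.
    -- A single INSERT ... SELECT keeps every row inside the database,
    -- avoiding per-row round trips from PDI entirely.
    INSERT INTO core.accidents_fact
    SELECT
        accident_index,
        longitude,
        latitude
        -- ... same remaining columns and joins as the query above ...
    FROM staging.accidents ac
    LEFT JOIN core.light_conditions l
           ON l.light_conditions = COALESCE(ac.light_conditions, 'Unknown');
    -- (repeat the remaining LEFT JOINs from the original query)
    ```

    This trades PDI's row streaming for one database-side statement, so throughput is limited only by the database itself.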

    If you do need to update the table, there are different strategies to speed that up, depending on how much of the dataset is new vs. existing: