I have a complicated db query that selects schools based on real estate listing attributes, performance statistics for the schools, and the distance of each listing to public transit. Users create a Search
object, and a method find_schools
in the search.rb
has this query:
School.where(id: school_ids).narrow_schools_for_search(self,prop_type,status,year).joins(listings:
:cta_listings).joins(:performance_stats).where("cta_listings.distance <= ?",
self.cta_distance).where.not(performance_stats: {"#{sort_column.to_sym}" =>
nil}).distinct.limit(30).order("performance_stats.#{sort_column} DESC")
School.rb
scope :narrow_schools_for_search, ->(search,prop_type,status,year) {joins(:listings).joins(:performance_stats)
.where("listings.beds >= ?",search.beds).where("listings.price <= ?",search.max_price)
.where("listings.price >= ?",search.min_price).where(listings: {prop_type: prop_type, status: status})
.where(performance_stats: {year: year}).distinct}
has_many :performance_stats, dependent: :destroy
has_many :assignments, dependent: :destroy
has_many :listings, through: :assignments
Listing.rb
has_many :assignments, dependent: :destroy
has_many :schools, through: :assignments
has_many :cta_listings, dependent: :destroy
has_many :cta_stations, through: :cta_listings
has_many :metra_listings, dependent: :destroy
has_many :metra_stations, through: :metra_listings
PerformanceStat.rb
belongs_to :school
I need the Schools ordered by an attribute in the associated table PerformanceStats, which is a user defined attribute sort_column
. The query works in development env (sqlite3), but fails on the staging app (PG) with this error:
ActiveRecord::StatementInvalid (PG::InvalidColumnReference: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
I need to add a select statement that includes the column name by which I am sorting schools.
The advice on other posts like this one is to do something like:
Widget.select('"widgets".*, "widget_steps.name"')
So, for my case, I tried this:
sort_for_select = "performance_stats.#{sort_column}"
School.select('"schools".*, "#{sort_for_select"').where(id: school_ids).narrow_schools_for_search(self,prop_type,status,year).joins(listings:
:cta_listings).joins(:performance_stats).where("cta_listings.distance <= ?",
self.cta_distance).where.not(performance_stats: {sort_column.to_sym =>
nil}).distinct.limit(30).order("performance_stats.#{sort_column} DESC")
But my editor is indicating I'm not actually escaping to to ruby. I tried it anyway, and sure enough, it fails with
ActiveRecord::StatementInvalid (PG::UndefinedColumn: ERROR: column "#{sort_for_select}" does not exist.
Then I tried hard coding the sort_column
:
School.select('"schools".*, "performance_stats.grall_adjpicalc"').where(id: school_ids).narrow_schools_for_search(self,prop_type,status,year).joins(listings:
:cta_listings).joins(:performance_stats).where("cta_listings.distance <= ?",
self.cta_distance).where.not(performance_stats: {grall_adjpicalc:
nil}).distinct.limit(30).order("performance_stats.grall_adjpicalc DESC")
This works in dev environment, but if fails on the staging app with this error:
ActiveRecord::StatementInvalid (PG::UndefinedColumn: ERROR: column "performance_stats.grall_adjpicalc" does not exist
So at this point I have to deploy every time to test new ideas. I know PG in development would be ideal, but I lost an entire week trying to change over and couldn't get it to work. ended up losing everything and had to reseed from scratch.
I have 3 questions:
What am I doing wrong with the Select statement?
Is there another fast way to do this that avoids this issue? I was thinking instead of Distinct, perhaps I could go with uniq, converting to an array then sort the array accordingly.
How can I get the variable sort_column
into the select statement?
Any thoughts or suggestions are much appreciated!
Update: the working code in my original answer was slow, often ending up in timeouts on Heroku. I ended up with a query that operates 3x faster using arel_tables. The working code looks like this (self
is an @search
object). I'm not a professional coder, so without a doubt this could perform even faster, and I welcome any suggestions to speed up performance. It still takes 5-8 seconds depending on the query. But at least I'm not getting timed out anymore.
school = School.arel_table
pstat = PerformanceStat.arel_table
schools = school.project(Arel.star).join(pstat).on(pstat[:school_id].eq(school[:id]).and(
school[:area_id].in(self.area_ids).and(
pstat[:year].eq(year)
)
)
)
query = schools.to_sql
school_ids = School.find_by_sql(query).pluck(:id)
This gives me an array of possible school ids, now I need to find the listings in these schools that match the other search parameters, including max distance to public transit if selected. Schools and listings are joined via an Assignment model.
listing = Listing.arel_table
assignment = Assignment.arel_table
if self.cta || self.metra
cta_listing = CtaListing.arel_table
metra_listing = MetraListing.arel_table
listing_assign = listing.join(assignment).on(assignment[:listing_id].eq(listing[:id])).join(cta_listing, Arel::Nodes::OuterJoin).on(cta_listing[:listing_id].eq(listing[:id])).join(metra_listing, Arel::Nodes::OuterJoin).on(metra_listing[:listing_id].eq(listing[:id]))
selected_listings = listing_assign.project(assignment[:school_id], listing[:id]).where(
assignment[:school_id].in(school_ids).and(
cta_listing[:distance].lteq(self.cta_distance).or(
metra_listing[:distance].lteq(self.metra_distance))).and(
listing[:prop_type].in(prop_type).and(
listing[:status].in(status).and(
listing[:beds].gteq(self.beds).and(
listing[:active].eq(true).and(
listing[:price].lteq(self.max_price).and(
listing[:price].gteq(self.min_price))))))))
else
listing_assign = listing.join(assignment).on(assignment[:listing_id].eq(listing[:id]))
selected_listings = listing_assign.project(assignment[:school_id], listing[:id]).where(
assignment[:school_id].in(school_ids).and(
listing[:prop_type].in(prop_type).and(
listing[:status].in(status).and(
listing[:beds].gteq(self.beds).and(
listing[:active].eq(true).and(
listing[:price].lteq(self.max_price).and(
listing[:price].gteq(self.min_price))))))))
end
q = selected_listings.to_sql
listings = Listing.find_by_sql(q)
Now I have an AR Relation of all the listings that match the search. I can't limit to 30 schools before this point, because I'm not sure if a school will have any listings that match the requirements. Every school must have at lease one listing. I need to return the top 30 schools and their listings, so first I create an array of arrays with each listings id and the corresponding school_id.
listings_array = listings.map{|x| [x.school_id,x.id]}
Then I convert this array of arrays into a hash grouping by school_id:
listings_hash = listings_array.group_by{|school_id| school_id.shift}.transform_values do |values|
values.flatten.uniq
end
Now I can sort these schools by the selected column and pick the top 30.
if sort_column.nil?
sort_column = "grall_adjpicalc"
end
schools = School.where(id: listings_hash.keys).includes(:performance_stats).where(performance_stats:
{year: year}).order("performance_stats.#{sort_column} desc").limit(30)
Now I have our top 30 schools and can return an array of arrays with each school and its corresponding listing_ids.
schools_array = schools.map{|school| [school,listings_hash[school.id]]}
return schools_array
This is much longer that the previous answer, but at least 3x faster. Can you find a way to make this significantly faster?