GROUP BY and Fields Returned

When you use the GROUP BY clauise in your database query, you can't just reference all the fields you want to return directly in the query. This is because when you group the content of rows together you end up needing to tell the query which of the values that are in the field across the columns that you are grouping together it is that you wish to have the query return in the results.

The only fields where you don't need to be more specific with are those fields that are actually referenced in the \GROUP BY clause itself since you are grouping the records that have the same value in those fields and so you know that all of the rows being grouped have the same value in those fields.

For example:

SELECT product_id FROM sales GROUP BY product_id

This will produce a list of all the product_id values in the table but will not display any duplicates. We don't need to do anything special with the first product_id reference here because the rows are being grouped by this field.

Now let's add another field.

SELECT product_id, sale_date FROM sales GROUP BY product_id

The extra field we have added here is anbiguous because we have not said which sale date to return and there is nothing that says that all of the records with the same product_id will have the same sale date.

The way that we resolve this is to ensure that all of the fields that are not directly referenced in the GROUP BY use an aggregating function to specify what value to return for the group. In this particular instance if we want the latest date we'd use MAX() or if we want the earliest date we'd use MIN(). Any of the aggregating functions will resolve the issue of what value to use but not all will necessarily be meaningful. For example we could use AVG() with this query but the average date isn't likely to be what we are looking for in most instances and COUNT() which tells us the number of different dates will be even less likely to be useful.

The query gets even more difficult to write when we want multiple values to be retrieved from the same record within the group. Let's say that we not only want the latest purchase date for each product but also the quantity purchased on that date. We can't use the following:

SELECT product_id, MAX(sale_date) AS sdate, MAX(quantity) AS qty FROM sales GROUP BY product_id

That query returns a specific value for each field but while we are getting the latest date we are getting the maximum sale quantity which is not necessarily the quantity from the latest sale.

The simplest solution to this is to consider if we really need the quantity. If we don't then we can simply remove the field and keep our query simple. If we do really need the quantity and we want to retrieve the results in a single query then we can no longer use a simple query with GROUP BY but will instead need to use a subquery or join to ensure that the correct values are retrieved.

 

This article written by Stephen Chapman, Felgall Pty Ltd.

go to top

FaceBook Follow
Twitter Follow
Donate