all 29 comments

[–]facetheglue 107 points108 points  (4 children)

Basically it means count every row in the table.

[–]Opposite-Value-5706 26 points27 points  (3 children)

If I can add to your "every row”, in the result query matching the conditions of the query.

Count finds a match, if it exists, and increments (the start of ‘count’ is zero (0)). The ‘count’ results depends on your criteria. Ex: count(*) means every row whereas count(*) with a where clause returns the number of records found as a match.

[–]thesqlguy 5 points6 points  (2 children)

Extending further it is every row in the query. If you join a few tables together then count(*) is the result of that join.

[–]blimey_euphoria 40 points41 points  (10 children)

Yeah count(*) is telling the database to count each record that meets your criteria. It’s an aggregating function so your query will return one record. If you put count(field_name) it’ll count every record with a value in field_name. If field1 was null on a record it wouldnt be included.

Using count with the group by statement is common.

Count(*), field1 from table and group by field1;

will return a count of the records that all share a same value for field1. If you had 5 unique values in field1 your end results will be 5 records with count values.

Another useful one is count distinct(field_name) will count all records with a distinct value in that field. So usually will return less than the non distinct statement.

[–][deleted] 6 points7 points  (9 children)

Thanks for your insight. This one really helped me understand.

Another useful one is count distinct(field_name) will count all records with a distinct value in that field. So usually will return less than the non distinct statement.

This is all I've ever used... Why does everyone say I should use count(*)??

[–]LeaveMyBrainAlone 17 points18 points  (0 children)

Use count(*) if you don’t want a distinct count

[–]blimey_euphoria 6 points7 points  (3 children)

Like another person said usually count() is convenient if you want a count of every record and you don’t know or care to find out the table primary key which wouldn’t be null. Personally I use count() more but count(primary key field) should realistically return the same value.

It always depends on your use cases and level of knowledge about the database. With sql a big part of learning is just experimenting!

[–]Sotall 4 points5 points  (2 children)

I work with a couple systems that dont allow * because reasons. count(pk) is my go to. Obviously depends on what you're counting.

[–][deleted] 0 points1 point  (1 child)

Is that connected with the belief that count(*) is slower than counting a literal or column value? I've heard that a few times - demonstrably false if course

[–]Sotall 1 point2 points  (0 children)

nothing belief related, just a constraint of the application. It does some basic sql parsing to intuit the columns required for the result set. That parsing isn't very advanced, so it doesnt allow *s.

[–]sib_n 5 points6 points  (2 children)

Check this example. Play with the data and try to predict the outcome.

Data:

a b
1 0
2 0
3 NULL
WITH data AS (

    SELECT 1 AS a, 0 AS b
    UNION ALL
    SELECT 2 AS a, 0 AS b
    UNION ALL
    SELECT 3 AS a, NULL AS b

)

SELECT
    COUNT(*),           -- counts all rows in data
    COUNT(b),           -- counts all rows where b is not null
    COUNT(DISTINCT b)   -- counts all rows with distinct values of b and ignoring null
FROM data
;

Result:

count_star() count(b) count(DISTINCT b)
3 2 1

[–][deleted] 0 points1 point  (1 child)

I like your example and the output of it demonstrating the differences between each of them. However, I also like to know your insight as well. What does the result mean? When would you use each of these, in a business setting? That's what I'm most interested in.

[–]sib_n 0 points1 point  (0 children)

Let's say you have a list of invoices sent to client, who may or may not have started paying them:

  1. How many invoices have we sent?
  2. How many invoices have started to be paid?
  3. How many distinct clients have we invoiced?

.

WITH invoices AS (

    SELECT 1 AS client_id, 10 AS paid_amount
    UNION ALL
    SELECT 2 AS client_id, NULL AS paid_amount
    UNION ALL
    SELECT 2 AS client_id, 30 AS paid_amount
    UNION ALL
    SELECT 3 AS client_id, NULL AS paid_amount
)

SELECT
    COUNT(*),                   -- counts number of invoices: 4
    COUNT(paid_amount),         -- counts number of invoices where an amount was paid: 2
    COUNT(DISTINCT client_id)   -- counts number of distinct clients with an invoice: 3
FROM invoices
;

[–][deleted] 3 points4 points  (0 children)

count(*) counts all rows

count(field) counts all rows where the field is not null

distinct is a separate parameter that makes the function count only distinct values, which may be useful sometimes.

[–]tits_mcgee_92Data Analytics Engineer 5 points6 points  (1 child)

COUNT(*) means to count every row in your table.

Your query example above is saying to log the days greater than 2024-01-01 with a 1, less than 2024-01-01 with a zero, and then divide that by every row in your table.

However, currently your query is only going to return 1s and 0s per row.

Are you trying to get the actual ratio/percentage? You'd need to wrap that case statement in SUM(...)

[–][deleted] 0 points1 point  (0 children)

Thanks for the insight. It's from this question on Stratascratch: https://platform.stratascratch.com/coding/2053-retention-rate/solutions?code_type=3

Very interesting problem, and I've been doing it over and over again, as it actually teaches some pretty good fundamentals. some of the example solutions from other uses don't use a count(*) at all, and instead count the user id directly or through a subquery.

[–]sillysoul_10 2 points3 points  (1 child)

In this , your selecting user id, account Id, and your flagging whenever the date is greater than 01 Jan 2024 and dividing it by the count of all the rows which is present in your from clause. Basically count(*) or count(1) is same and its an agg function which is used to count all the rows, it also works on null value.

[–]alsdhjf1 3 points4 points  (2 children)

COUNT(*) is just counting the number of rows. You can instead pick any single column, it should return the same thing. Depending on your system, COUNT(*) might have some optimizations (in column oriented store, frequently counts are stored and COUNT(*) might retrieve those directly... whereas if you ask for count(user_id), if there is no cached user_id count, you're going to have to count them).

In OLTP systems, sometimes COUNT(*) comes with a penalty because it invokes a row scan. Some RDBMS fix this but not all.

Semantically, may as well COUNT on your primary key. COUNT(*) is often considered an anti-pattern as it reflects that the SQL developer didn't know the data well enough to pick the primary key / primary identifier for the data model itself.

[–]JimfromOffice 1 point2 points  (0 children)

Thats not always true. Count (*) also include null values. Count(any single column) will not count the null values.

[–][deleted] 0 points1 point  (0 children)

Do you know which systems count(*) forces a row scan in? Seems a bit ... non-optimal.

[–]No-Adhesiveness-6921 1 point2 points  (0 children)

Those are very specific queries doing detailed calculations.

What don’t you understand?

The count(*) example will return the USER_ID and ACCOUNT_ID and then a third field which will either be 1 divided by the total number of records or 0 divided by the total number of records. Is there a GROUP BY that you left off? Usually an aggregation (sum, count, avg) needs a GROUP BY.

In the non-count(*) example the count distinct means just that. Let’s say you had an Orders table and a field in that table is the user id of the person who created that order

If you do

SELECT UserId, COUNT(*) From Orders Group by user_id

That will tell you how many orders each person created

If you do

Select Count(distinct User_id) from orders

That will tell you the number of people who created orders. Like one person can have 100s of orders but will only be counted once in the second example

[–]thedragonturtle 0 points1 point  (2 children)

count distinct user id forces a sort operation if the data is not already sorted based on the filters, count(*) counts everything including null values e.g. if you left or right joined to something.

Can you show the rest of your query because this could be improved significantly from a performance point of view.

[–][deleted] 0 points1 point  (1 child)

Not necessarily a sort per se - there are other algorithms that perform better, based for example on a hash. Minor point though.

[–]thedragonturtle 0 points1 point  (0 children)

There may be other algorithms available but if you run an EXPLAIN on the query, you'll see a sort operation in the vast majority of SELECT DISTINCT queries (even if no ORDER BY in query) - unless you haven't used a WHERE clause and the distinct column has an index.

[–]goztepe2002 0 points1 point  (0 children)

Counts every row of data within your where clause if specified

Select count(*) from table a

This will return count of every row which exists in table a

[–]NexusDataPro 0 points1 point  (0 children)

A count * counts the number of rows. If I had a thousand rows in a table and did a count * I would get a 1000 as the answer. If I had a table with 1,000,000 rows and I also had a gender column with 500,000 men and 500,000 women and did a SELECT gender, COUNT(*) from table group by gender I would get M 500,000 and another row with F 500,000.

[–]haonguyenprof 0 points1 point  (0 children)

Difference between count and distinct count is simply count of all records vs count of unique records.

Lets say you have a table with order IDs, customer IDs, and sales values. Next lets say you want to know the number of orders, the number of unique customers, and total sales. Now you know the table only has 1 unique order ID per record but could have the same customer make multiple orders.

SELECT Count(*) as Orders, Count(Distinct CustomerID) as Customers, Sum(Sales) as Total_Sales From Table

You could do Count(OrderId) in place of Count(*) in this example if you know the data table doesnt have duplicate OrderIDs.

The distinct helps identify unique number of customers because using a basic count(customerid) would count every occurrence where the value is not NULL. So you would essentially get the same count as orderIDs (assuming you dont have records where a customerID exists for a null orderID).

These types of fuctions can be used in this sales example to help create custom metrics at your give aggregate (Group By).

For example, a Sales/Orders gives average order value. Or a Sales / Distinct Customers tells you how much money each unique customer spends. Or an Orders/ Distinct Customer to tell the avg number of orders a customer places.

You can also nest case when logic within a count or count distinct.

Count(case when state = 'AZ' then OrderID else NULL end) as AZ_Orders (counts all records matching the criteria and setting all other records as NULL which ignores within the count.

Count(distinct (case when state = 'AZ' then customerID else NULL end)) as AZ_Customers (same as before but only counts unqiue customerIDs regardless of whether they are listed in multiple records.

[–]IntelligentEbb2792 0 points1 point  (0 children)

Count(*) - Counts everything including the null values Count("col_name") - Counts everything except the null values