Skip to Content

Implementing retention analysis & cohorts

This is an advanced topic that assumes good, pre-existing knowledge of SQL and Cube.

Whether you’re selling groceries, financial services, or gym memberships, successful recruitment of new customers is only truly successful if they return to buy from you again. The metric that reflects this is called retention, and the approach we use is customer retention analysis. Retention analysis is typically done using cohort analysis.

Cohort analysis is a technique to see how variables change over in different groups with different starting conditions. Retention is a simplified one, where the starting condition is usually the time of signup and the variable is simply activity.

It’s usually visualized as a cohort grid or retention curves.

Cohort retention analysis is pretty hard to do in SQL. We need to have the user-date combination, which tells us about a user’s activity on that date, including dates with no activity. To do this, we need to make a tricky join, which gives us a dates list. Once we have it, we can “fill it” with users’ activities.

The example below shows monthly cohort retention. The same technique can be used for daily or weekly retention.

The SQL code in this guide is Postgres-compliant. The final SQL code may be different depending on your database. Also, this technique requires at least 1 user to be active during the month, otherwise this month will not be included in the months’ list.

cubes: - name: monthly_retention sql: | SELECT users.id as user_id, date_trunc('month', users.created_at) as signup_month, months_list.activity_month as activity_month, data.monthly_pageviews FROM users LEFT JOIN ( SELECT DISTINCT (date_trunc('month', pages.original_timestamp)) as activity_month FROM pages ) as months_list ON months_list.activity_month >= date_trunc('month', users.created_at) LEFT JOIN ( SELECT p.user_id, date_trunc('month', p.original_timestamp) as activity_month, COUNT(DISTINCT p.id) as monthly_pageviews FROM pages p GROUP BY 1,2 ) as data ON data.activity_month = months_list.activity_month AND data.user_id = users.id
cube(`monthly_retention`, { sql: `SELECT users.id as user_id, date_trunc('month', users.created_at) as signup_month, months_list.activity_month as activity_month, data.monthly_pageviews FROM users LEFT JOIN ( SELECT DISTINCT (date_trunc('month', pages.original_timestamp)) as activity_month FROM pages ) as months_list ON months_list.activity_month >= date_trunc('month', users.created_at) LEFT JOIN ( SELECT p.user_id, date_trunc('month', p.original_timestamp) as activity_month, COUNT(DISTINCT p.id) as monthly_pageviews FROM pages p GROUP BY 1,2 ) as data ON data.activity_month = months_list.activity_month AND data.user_id = users.id` })

The SQL above provides the base table for our retention cube. It would show signup months and activity months with pageviews:

user_idsignup_monthactivity_monthmonthly_pageviews
11/181/1810
11/182/185
11/183/180
22/182/1812
22/183/180
33/183/185

Now we can calculate a total count of users and the total count of active users, who has more than 0 page views, for every month. Based on these two measures we can calculate monthly percentage_of_active.

cubes: - name: monthly_retention # ... measures: - name: total_count sql: user_id type: count_distinct public: false - name: total_active_count sql: user_id type: count_distinct filters: - sql: monthly_pageviews > 0 drill_members: - users.id - users.email - name: percentage_of_active sql: "100.0 * {total_active_count} / NULLIF({total_count}, 0)" type: number format: percent drill_members: - users.email - bots.team - bots.last_seen - percentage_of_active
cube(`monthly_retention`, { // ... measures: { total_count: { sql: `user_id`, type: `count_distinct`, public: false }, total_active_count: { sql: `user_id`, type: `count_distinct`, filters: [{ sql: `${CUBE}.monthly_pageviews > 0` }], drill_members: [users.id, users.email] }, percentage_of_active: { sql: `100.0 * ${total_active_count} / NULLIF(${total_count}, 0)`, type: `number`, format: `percent`, drill_members: [ users.email, bots.team, bots.last_seen, percentage_of_active ] } } })

To be able to build cohorts, we need to group by two dimensions: signup date, which will define our cohorts, and months since signup, which will show how the percentage of active users is changing.

cubes: - name: monthly_retention # ... dimensions: - name: months_since_signup sql: "DATEDIFF('month', signup_month, activity_month)" type: number - name: signup_date sql: "(signup_month AT TIME ZONE 'America/Los_Angeles')" type: time
cube(`monthly_retention`, { // ... dimensions: { months_since_signup: { sql: `DATEDIFF('month', ${CUBE}.signup_month, ${CUBE}.activity_month)`, type: `number` }, signup_date: { sql: `(signup_month AT TIME ZONE 'America/Los_Angeles')`, type: `time` } } })

Note, we are explicitly setting the signup_month timezone. date_trunc returns UTC dates and not setting a correct timezone would lead to wrong results due to time shift.

Was this page useful?