Data modeling to filter customers who bought a group of products but not some other products?

This topic originated from this GitHub issue and is to discuss ways of doing data modeling so that you can filter customers who haven’t bought some products (C) but have bought other products (A and B) as in the following:

SELECT DISTINCT customer FROM transactions WHERE purchased_product in (‘A’, ‘B’) AND customer NOT IN ( SELECT DISTINCT customer FROM transactions WHERE purchased_product in (‘C’) )

We have tried to solve this with a single cube schema using a nested query, like above, and by joining two cube schemas.

Here is an example schema for the nested query approach:

cube(Example, {
sql: SELECT * FROM example WHERE ${FILTER_PARAMS.Example.productRef.filter('product_ref')} AND customer_ref NOT IN( SELECT DISTINCT customer_ref FROM example WHERE ${FILTER_PARAMS.Example.productRef2.filter('product_ref')} ) ,
dimensions: {
productRef: {
type: ‘string’,
sql: ‘product_ref’
},
productRef2: {
type: ‘string’,
sql: ‘product_ref’
},
customerRef: {
type: ‘string’,
sql: ‘customer_ref’
}
}
});

Here is an example schema for a solution using a join between two cube schemas:

cube(Example1, {
sql: SELECT * FROM example WHERE ${FILTER_PARAMS.Example1.productRef.filter('product_ref')} ,
joins:{
Example2: {
relationship: belongsTo,
sql: ${Example1}.customer_ref = ${Example2}.customer_ref
}
},
segments: {
excludeCustomers : {
sql: ${Example1}.customer_ref != ${Example2}.customer_ref
}
},
dimensions: {
productRef: {
type: ‘string’,
sql: ‘product_ref’
},
customerRef: {
type: ‘string’,
sql: ‘customer_ref’
}
}
});
cube(Example2, {
sql: SELECT * FROM example WHERE ${FILTER_PARAMS.Example2.productRef.filter('product_ref')} ,
dimensions: {
productRef: {
type: ‘string’,
sql: ‘product_ref’
},
customerRef: {
type: ‘string’,
sql: ‘customer_ref’
}
}
});

The main issue we see is that FILTER_PARAM generates WHERE clauses in the final query which are guaranteed to never return any data. The second case will generate the following Clickhouse query:

SELECT
example1.customer_ref example1__customer_ref
FROM
(
SELECT * FROM example WHERE product_ref IN (?)
) AS example1
LEFT JOIN (
SELECT * FROM example WHERE product_ref IN (?)
) AS example2 ON example1.customer_ref = example2.customer_ref
WHERE
(example1.customer_ref != example2.customer_ref)
AND (example1.product_ref IN (?))
AND (example2.product_ref IN (?))
GROUP BY example1__customer_ref
ORDER BY example1__customer_ref ASC
LIMIT 10000

This part of the where statement removes all customer_refs and the final query never returns any data. This filter should only be applied in the join select statement.

AND (example2.product_ref IN (?))

1 Like

This can be handled using following cube definition:

cube(`PurchaseFunnel`, {
  sql: `select 
    c.id as customer_id,
    purchased_products.id purchased_product_id,
    not_purchased_products.id not_purchased_product_id
  FROM customers c
  CROSS JOIN products purchased_products
  CROSS JOIN products not_purchased_products
  LEFT JOIN transactions purchased ON purchased.customer_id = c.id AND purchased.product_id = purchased_products.id
  LEFT JOIN transactions not_purchased ON not_purchased.customer_id = u.id AND not_purchased.product_id = not_purchased_products.id
  WHERE purchased.id IS NOT NULL AND not_purchased.id IS NULL
`,

  measures: {
    customerCount: {
      sql: `customer_id`,
      type: `countDistinct`
    }
  },

  dimensions: {
    purchasedProductId: {
      sql: `purchased_product_id`,
      type: `string`
    },

    notPurchasedProductId: {
      sql: `not_purchased_product_id`,
      type: `string`
    }
  }
});

If counts of customers and products are high countDistinctApprox measure type should be used instead together with Cube Store pre-aggregations. This way cardinality of a rollup would be just N ^ 2 where N is a count of products.