Choosing between 2 approaches for querying MongoDB data

We are trying to decide between 2 different approaches for using Cube.js with MongoDB. The context: we have analytics events data that lives in Snowflake and application data that lives in MongoDB, and the two need to be joined for the dashboards we are building. So it seems that we can either:

  1. Use the MongoDB Connector for BI directly from Cube. Cubes for application data come from the MongoDB data source; cubes for analytics events come from the Snowflake data source (sketched below).
  2. Send the MongoDB data to our Snowflake instance using an ETL tool (we have used Stitch for things like this in the past). All cubes come from the Snowflake data source.
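
To make option 1 concrete, here is a minimal sketch of how I imagine the cubes would declare their sources, assuming Snowflake is configured as the `default` data source and the BI connector under a hypothetical name `mongobi` (table names are placeholders too):

```javascript
// Option 1: each cube declares which data source its SQL runs against.
cube(`Users`, {
  sql: `SELECT * FROM users`,
  dataSource: `mongobi`, // MongoDB Connector for BI
});

cube(`Events`, {
  sql: `SELECT * FROM events`,
  dataSource: `default`, // Snowflake
});
```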

Are there strong reasons to go with one approach or the other here? Originally I thought option 1 would be better: querying the MongoDB Connector for BI directly would be faster since we wouldn’t have to wait for the ETL from Mongo → Snowflake. But I’ve since learned that cross-datasource joins need to use rollupJoin with external pre-aggregations (see the sketch below), which makes me think this might not be as big of an advantage as I thought.
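
For reference, this is my understanding of what the rollupJoin setup would look like (a sketch based on the docs; all cube and member names are made up): each source cube builds its own external rollup, and a rollupJoin pre-aggregation joins the two rollups inside Cube Store rather than joining the databases directly.

```javascript
cube(`Users`, {
  sql: `SELECT * FROM users`,
  dataSource: `mongobi`, // MongoDB Connector for BI

  dimensions: {
    id: { sql: `id`, type: `string`, primaryKey: true },
    name: { sql: `name`, type: `string` },
  },

  preAggregations: {
    usersRollup: {
      external: true, // stored in Cube Store so it can participate in the join
      dimensions: [CUBE.id, CUBE.name],
    },
  },
});

cube(`Events`, {
  sql: `SELECT * FROM events`,
  dataSource: `default`, // Snowflake

  joins: {
    Users: {
      relationship: `belongsTo`,
      sql: `${CUBE.userId} = ${Users.id}`,
    },
  },

  measures: {
    count: { type: `count` },
  },

  dimensions: {
    userId: { sql: `user_id`, type: `string` },
  },

  preAggregations: {
    eventsRollup: {
      external: true,
      measures: [CUBE.count],
      dimensions: [CUBE.userId],
    },
    // The cross-datasource join happens between the two rollups above,
    // not between MongoDB and Snowflake themselves.
    eventsByUser: {
      type: `rollupJoin`,
      measures: [CUBE.count],
      dimensions: [Users.name],
      rollups: [Users.usersRollup, CUBE.eventsRollup],
    },
  },
});
```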

Would appreciate any insight here. Thanks!

Hey @Doron_Roberts-Kedes :wave:

Thanks for this really great and detailed question!

First, let me address this statement:

querying the MongoDB Connector for BI directly would be faster since we wouldn’t have to wait for the ETL from Mongo → Snowflake

So, I guess you’re concerned about minimizing time-to-insight so you can query the most recent, up-to-date data. If so, please note that you have plenty of options to make sure that pre-aggregations are rebuilt as soon as the data updates (see refresh strategies) and that they are rebuilt in reasonable time (see time partitioning and incremental pre-aggregations). IMO, the use of rollupJoins is not a relevant criterion for the decision-making here.
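
To illustrate, a partitioned, incrementally refreshed pre-aggregation on the Snowflake side might look something like this (a sketch; member names and intervals are made up, so tune them to your data):

```javascript
cube(`Events`, {
  sql: `SELECT * FROM events`,
  dataSource: `default`, // Snowflake

  measures: {
    count: { type: `count` },
  },

  dimensions: {
    createdAt: { sql: `created_at`, type: `time` },
  },

  preAggregations: {
    eventsByDay: {
      measures: [CUBE.count],
      timeDimension: CUBE.createdAt,
      granularity: `day`,
      // Time partitioning: each month becomes a separate partition,
      // so a rebuild only touches a small slice of the data.
      partitionGranularity: `month`,
      refreshKey: {
        every: `10 minutes`,   // how often to check for fresh data
        incremental: true,     // only rebuild the most recent partitions
        updateWindow: `7 day`, // how far back late-arriving updates can land
      },
    },
  },
});
```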

A more relevant criterion that I can think of: if you have complex enough data schema definitions, you’ll need to use different SQL dialects in different cubes (one dialect for the MongoDB Connector for BI, another for Snowflake). While it’s not a huge issue, you’ll have to account for the database you’ll be defining each cube against.
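
For example (a sketch with made-up columns): the Connector for BI speaks the MySQL dialect, so the same calendar-day dimension has to be written differently in the two cubes.

```javascript
// Snowflake dialect
cube(`Events`, {
  dataSource: `default`,
  sql: `SELECT * FROM events`,
  dimensions: {
    day: {
      sql: `DATE_TRUNC('day', ${CUBE}.created_at)`, // Snowflake syntax
      type: `time`,
    },
  },
});

// The MongoDB Connector for BI exposes a MySQL-compatible dialect
cube(`Users`, {
  dataSource: `mongobi`,
  sql: `SELECT * FROM users`,
  dimensions: {
    signupDay: {
      sql: `DATE(${CUBE}.created_at)`, // MySQL syntax
      type: `time`,
    },
  },
});
```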

Another criterion might be the expected load and whether your MongoDB instance can sustain the reads and writes, both related and unrelated to Cube. Again, it might not be an issue at all if the load is not very high and you use a scalable (cloud-based) deployment.

I hope this helps!


Hey @igorlukanin thanks so much for the thorough & detailed response.

IMO, the use of rollupJoins is not a relevant criterion for the decision-making here.

^ Very useful. We’ll get a rollupJoin POC working.

if you have complex enough data schema definitions, you’ll need to use different SQL dialects in different cubes

Cool, makes sense. If our data schema definitions become sufficiently complex, I think this is a price worth paying. We do use a scalable, cloud-based deployment for Mongo, so I’m not worried about Cube-generated load for now. Using the right refresh strategy will let us minimize time-to-insight, whereas relying on the Mongo → Snowflake ETL introduces a significant minimum latency.
