In today’s data-driven world, the landscape is dominated by cloud-based data products.
These services are fantastic if you’re willing to pay for them, as they eliminate the hassle of managing infrastructure.
However, if you’re not ready to bear the cost, open-source solutions could be your best bet.
Imagine having all the components of a Modern Data Stack (MDS) in a single, secure machine that you can set up in minutes and start getting tangible results in a matter of hours.
Welcome to “MDS in a Box” – Dorian’s Way.
Why Should You Care?
- Single Secured Machine: No need to worry about multiple systems and integrations; everything runs on one secure machine.
- Quick Setup: Get everything up and running in minutes, and start seeing results in hours.
- Open Source: Free to set up, with complete ownership and control over your data and infrastructure.
Let’s dive into how to build this MDS in a box, covering everything from data extraction to scheduling.
Source Data (Extract Layer)
The first step in any data pipeline is extracting data from various sources.
Here are some tools that can help:
Airbyte
Airbyte is a powerful tool with a vast array of connectors for different data sources. It offers a user-friendly GUI and a Python library for more advanced users.
- Pros:
- Lots of connectors.
- A graphical interface where you can set up your data flows.
- Offers a Python-based library for managing ETL tasks, PyAirbyte (see the sketch after this list).
- Cons:
- Some connectors might be broken or limited.
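If you go the PyAirbyte route, an extraction can live in a short Python script. The sketch below uses the demo source-faker connector purely as a stand-in; swap in the connector and config for your actual source.

```python
# Minimal PyAirbyte sketch: pull a connector's data into pandas.
# "source-faker" is a demo connector; replace it with your real source.
import airbyte as ab

source = ab.get_source(
    "source-faker",
    config={"count": 1_000},   # connector-specific configuration
    install_if_missing=True,   # installs the connector locally on first run
)
source.check()                 # validate connection and config
source.select_all_streams()    # or source.select_streams(["users"])

result = source.read()         # extract into the local cache
users_df = result["users"].to_pandas()
print(users_df.head())
```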
Streampipe
Streampipe is a SQL-based tool that is faster and more stable than Airbyte but offers fewer connectors.
- Pros:
- Speed and stability.
- SQL-based tool.
- Cons:
- Limited connectors.
Custom APIs with Python
For more customized solutions, you can write your own extraction logic in Python against any API. Libraries like dlt (from dltHub) make these pipelines straightforward to build and deploy.
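As a rough sketch of that approach, here is what a small dlt pipeline pulling JSON from a hypothetical REST endpoint and landing it in DuckDB might look like; the URL and resource name are placeholders.

```python
# Sketch of a custom extraction with dlt: pull JSON from an API into DuckDB.
# The endpoint URL is a placeholder; adapt auth and pagination to your API.
import dlt
import requests

@dlt.resource(name="orders", write_disposition="append")
def orders():
    response = requests.get("https://api.example.com/orders", timeout=30)
    response.raise_for_status()
    yield response.json()      # dlt normalizes nested JSON into tables

pipeline = dlt.pipeline(
    pipeline_name="custom_api",
    destination="duckdb",      # lands in a local DuckDB file
    dataset_name="raw",
)
info = pipeline.run(orders())
print(info)
```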
Store Data
Once the data is extracted, the next step is to store it efficiently. Here are some excellent storage solutions:
DuckDB
They say that Big Data is dead. Most people don’t have/don’t need that much data.
Enter DuckDB. The free, simple, fast, and non-scalable engine for data-hungry professionals.
DuckDB is perfect for local testing with its blazing-fast performance and SQL-first approach. For production projects, you can use MotherDuck, a cloud service based on DuckDB.
- Pros:
- Fast and SQL-based.
- Cons:
- Primarily for local analysis (MotherDuck for production)
- Doesn’t scale for big data (which for most cases is fine)
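To give a feel for how lightweight local analysis with DuckDB is, the sketch below queries Parquet files in place with plain SQL; the file path and column names are placeholders.

```python
# DuckDB sketch: query raw files directly with SQL, no server to run.
import duckdb

con = duckdb.connect("warehouse.duckdb")  # a single local file is your warehouse

# DuckDB can scan Parquet/CSV in place; the path below is a placeholder.
df = con.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM read_parquet('data/orders/*.parquet')
    GROUP BY customer_id
    ORDER BY total_spent DESC
""").df()

print(df.head())
```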
Postgres
PostgreSQL is a reliable option for small production-ready projects.
- Pros:
- Solid for smaller projects.
- Cons:
- Won’t scale well for large analytical workloads, as it’s a row-oriented rather than analytical (columnar) database.
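If your extracted data already lives in a DataFrame, loading it into Postgres takes only a few lines with SQLAlchemy; the connection string and table name below are placeholders.

```python
# Sketch: load a pandas DataFrame into Postgres via SQLAlchemy.
# The connection string and table name are placeholders for your own setup.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/analytics")

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [19.99, 42.50]})
orders.to_sql("raw_orders", engine, if_exists="append", index=False)
```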
Clickhouse
For enterprise-scale and real-time applications, Clickhouse is an excellent choice.
- Pros:
- It’s fast.
- Scales well for large data and real-time processing.
- Cons:
- More complex setup and maintenance.
- There are potential limitations with JOINs. It works better with wide tables.
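If ClickHouse is the right fit, the official clickhouse-connect Python client keeps day-to-day interaction simple; host, credentials, and the table below are placeholders.

```python
# Sketch: query ClickHouse from Python with the clickhouse-connect client.
# Host, credentials, and the table are placeholders for your own deployment.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123, username="default")

client.command("""
    CREATE TABLE IF NOT EXISTS events (
        event_time DateTime,
        user_id UInt64,
        event_type String
    ) ENGINE = MergeTree ORDER BY (event_time, user_id)
""")

df = client.query_df("SELECT event_type, count() AS cnt FROM events GROUP BY event_type")
print(df)
```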
StarRocks
StarRocks is a next-generation, blazing-fast massively parallel processing (MPP) database designed to make real-time analytics easy for enterprises. It is built to power sub-second queries at scale.
- Pros
- High-speed ingestion and real-time updates.
- Supports complex queries and multi-table joins efficiently.
- Compatible with MySQL protocols and major BI tools like Tableau and Power BI.
- Integration with modern data catalogs like Apache Iceberg and Hudi.
- Cons:
- Newer to the market, with a smaller community compared to established solutions like Clickhouse.
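Because StarRocks speaks the MySQL protocol, any MySQL client library works against it. The sketch below uses pymysql; the host, port (typically 9030 for queries), credentials, and table are placeholders.

```python
# Sketch: query StarRocks over its MySQL-compatible protocol with pymysql.
# Host, port, credentials, and the table are placeholders for your own cluster.
import pymysql

conn = pymysql.connect(host="localhost", port=9030, user="root", password="", database="analytics")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```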
Cloud-based alternatives
This is the one area where a cloud data warehouse might be a better fit than an open source project.
Storing your data in a product like Snowflake, Google BigQuery, or even Databricks/Microsoft Fabric might be the way to go.
- Pros:
- They’re fast.
- They’re scalable.
- Easy to access by your end users (business analysts, data scientists, BI specialists, etc.)
- They’re easy to integrate with other tools.
- Cons:
- Can become costly if not governed properly.
Model Data (Semantic Layer)
Transforming raw data into meaningful, reusable models is where the semantic layer comes in.
Below are some tools to help with that.
dbt
dbt (data build tool) is at the core of the modern data stack, allowing you to transform data using SQL and manage your data pipeline.
- Pros:
- Widely used and powerful.
- Cons:
- Can drive up cloud warehouse costs if not managed carefully.
- The open-source version (dbt Core) doesn’t include the UI or the scheduler; if you want these features, you need the paid dbt Cloud offering.
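Even without dbt Cloud, you can drive dbt Core from your own scripts or scheduler. Here is a minimal sketch using the programmatic entry point introduced in dbt-core 1.5, run from inside an existing dbt project (the model selector is just an example):

```python
# Sketch: invoke dbt Core programmatically (dbt-core >= 1.5).
# Assumes it runs from inside an existing dbt project directory.
from dbt.cli.main import dbtRunner

runner = dbtRunner()
result = runner.invoke(["run", "--select", "staging+"])  # selector is an example

if not result.success:
    raise RuntimeError("dbt run failed", result.exception)
```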
SQLMesh
SQLMesh provides column-level lineage, making it easier to track data transformations.
- Pros:
- Open source user interface.
- Detailed lineage tracking.
- No need to learn Jinja templating.
- Unit tests & semantic validation.
- Cons:
- Smaller user base and community than dbt.
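One nice consequence of SQLMesh’s design is that models can be written in plain SQL or in Python. Roughly, a Python model looks like the sketch below; the model name, columns, and upstream query are placeholders.

```python
# Sketch of a SQLMesh Python model; table and column names are placeholders.
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model

@model(
    "analytics.daily_orders",
    columns={"order_date": "date", "order_count": "int"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Fetch upstream data through the context and aggregate it with pandas.
    df = context.fetchdf("SELECT order_date, order_id FROM raw.orders")
    return df.groupby("order_date", as_index=False).agg(order_count=("order_id", "count"))
```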
Pandas
For Python enthusiasts, Pandas is an indispensable tool. Similar data libraries like Polars, Ibis, Modin, and Rapids also offer powerful data manipulation capabilities.
- Pros:
- Flexible and powerful.
- Cons:
- Can be memory-intensive for large datasets.
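For reference, a typical modeling step in pandas is a handful of chained transformations; the column names below are illustrative.

```python
# Sketch: a small modeling step in pandas; column names are illustrative.
import pandas as pd

orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "order_date": pd.to_datetime(["2024-01-02", "2024-01-15", "2024-01-20"]),
        "amount": [19.99, 5.00, 42.50],
    }
)

monthly_revenue = (
    orders
    .assign(month=orders["order_date"].dt.to_period("M"))
    .groupby(["customer_id", "month"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "revenue"})
)
print(monthly_revenue)
```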
Schedule Processes
There’s no point in setting up data pipelines if you can’t automate them. Scheduling and orchestrating data flows is critical for maintaining solid processes.
Here are some tools to consider:
Cron
Cron is simple and powerful for scheduling tasks.
- Pros:
- Easy to use and reliable.
- Cons:
- Basic and not ideal for complex workflows.
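The usual pattern is a small Python entry point that runs the pipeline end to end, invoked by a crontab entry. The script path, schedule, and steps below are placeholders.

```python
# run_pipeline.py - a sketch of an entry point for cron to call.
# Example crontab entry (placeholder path, runs daily at 06:00):
#   0 6 * * * /usr/bin/python3 /opt/mds/run_pipeline.py >> /var/log/mds.log 2>&1
import logging

logging.basicConfig(level=logging.INFO)

def main() -> None:
    logging.info("Extract: pull data from sources")
    logging.info("Load: write raw data to the warehouse")
    logging.info("Transform: run the modeling layer")

if __name__ == "__main__":
    main()
```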
Airflow
Airflow offers advanced orchestration capabilities but can feel dated and complex.
- Pros:
- Task-based orchestration.
- Large community.
- Cons:
- Requires maintenance and can be difficult to debug.
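For a sense of what a minimal Airflow (2.x) DAG looks like with the TaskFlow API, here is a sketch; the DAG name and task bodies are placeholders.

```python
# Sketch: a minimal Airflow DAG using the TaskFlow API; task bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def mds_daily():
    @task
    def extract() -> str:
        return "raw data extracted"

    @task
    def transform(payload: str) -> None:
        print(f"transforming: {payload}")

    transform(extract())

mds_daily()
```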
Dagster
Dagster is a more modern orchestration tool with asset-oriented design and seamless dbt integration.
- Pros:
- Advanced features and dbt integration.
- Pipeline oriented (data assets can be shared between tasks).
- Cons:
- Smaller user base.
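A taste of the asset-oriented style: you declare data assets as functions, and Dagster wires the dependencies from the parameter names. The asset names below are placeholders, and the dbt integration lives in the separate dagster-dbt package.

```python
# Sketch: Dagster software-defined assets; asset names are placeholders.
import pandas as pd
from dagster import Definitions, asset

@asset
def raw_orders() -> pd.DataFrame:
    # In practice this would pull from your extract layer.
    return pd.DataFrame({"order_id": [1, 2], "amount": [19.99, 42.50]})

@asset
def daily_revenue(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Dagster infers the dependency on raw_orders from the parameter name.
    return pd.DataFrame({"revenue": [raw_orders["amount"].sum()]})

defs = Definitions(assets=[raw_orders, daily_revenue])
```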
GitHub Actions
GitHub Actions is a CI/CD tool that can also be used for scheduling and automating data workflows. It integrates directly with GitHub repositories.
It’s worth noting that GitHub Actions is a proprietary service rather than an open-source solution, and it isn’t designed specifically for ETL, but it can be an affordable and sensible option if your code is already hosted on GitHub.
- Pros:
- Integrates directly with GitHub.
- Supports a wide range of workflows and custom scripts.
- Free for public repositories and includes a generous free tier for private repositories.
- Cons:
- It can be complex to set up advanced workflows.
- Limited to GitHub-hosted repositories.
- It is a paid service, not open source.
- Not specifically designed for ETL purposes.
Conclusion
Building an MDS in a Box this way gives you a streamlined, efficient, and simple data infrastructure.
With the right tools for data extraction, storage, modeling, and scheduling, you can set up a powerful data stack that meets your needs.
Embrace the power of open source, and take control of your data with “MDS in a Box” – Dorian’s Way.
Highly passionate about data, analysis, visualization, and everything that helps people make informed decisions.
I love what I do! I am working to improve speed in every aspect of my life and that of our clients.
I find comfort in helping people, so if you have a question, give me a shout!