
Case Studies
Case Study: Cloud Support Advisory
The Client: Major Social Media Company
Date: 2020
Project Lead: Tyler Fischella
Case Study: Video Conference Redesign
The Client: Major Social Media Company
Date: 2020
Project Lead: Tyler Fischella
Case Study: Data Governance Design
The Client: Major Social Media Company
Date: 2021
Project Lead: Tyler Fischella
Case Study: Cloud Cost Optimization
The Client: Fortune 500 Toy Company
Date: 2021
Project Lead: Tyler Fischella
Case Study: Machine Learning Advisory
The Client: Accounting Software Company
Date: 2022
Project Lead: Tyler Fischella
Case Study: Privacy & Ads Advisory
The Client: Digital Advertising & Monetization Business
Date: 2022
Project Lead: Tyler Fischella
Case Study: Kubernetes Migration
The Client: Global Recruiting Platform
Date: 2023
Project Lead: Tyler Fischella
What is Datamesh?
The old way of data management…
Legacy data management practices often rely on centralized, monolithic architectures, where data is governed and accessed from various disconnected systems. This approach creates bottlenecks, limits data discoverability, and hinders scalability, leading to slow decision-making and increased operational costs. In contrast, a data mesh promotes a decentralized, domain-oriented structure, where each team treats its data as a product, abstracted away from the underlying infrastructure. The primary focus of this approach is improvement of data cataloging, pipelines, quality, and accessibility while maintaining compliance through data lineage, data profiling, and auditing capabilities. This approach ensures value-driven cost controls and accountability.
0. Getting Started (Cloud)
To get started, you need a cloud provider or infrastructure service that aligns with your data needs and budget. For data warehousing, services like Google BigQuery, Amazon Redshift, or Snowflake provide scalable solutions optimized for analytics. For data lakes, consider AWS Lake Formation, Azure Data Lake, or Google Cloud Storage for flexible, large-scale data storage.
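Once a warehouse such as BigQuery is provisioned, domain teams can query it directly from Python. The snippet below is a minimal sketch using the google-cloud-bigquery client library; the dataset and table names are hypothetical, and it assumes credentials are already configured in the environment.

# Sketch: querying a domain-owned BigQuery dataset with the official client library.
# Assumes Application Default Credentials are set up; the table name is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT event_date, COUNT(*) AS event_count
    FROM `analytics.daily_events`
    GROUP BY event_date
    ORDER BY event_date DESC
    LIMIT 7
"""

for row in client.query(query).result():
    print(row.event_date, row.event_count)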
Organizations can then leverage these open source technologies to build a Datamesh framework:
OpenLineage and Apache Airflow can be leveraged for lineage (see the sketch after this list).
Apache Ranger is often used for decentralized access control.
dbt (data build tool) is useful for transforming and testing data pipelines.
Great Expectations provides automated data quality checks, empowering teams to maintain high data standards autonomously.
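As a rough illustration of the lineage piece, here is a minimal Airflow DAG sketch. It assumes a recent Airflow release with the apache-airflow-providers-openlineage package installed and configured against a lineage backend; with that provider in place, each task run emits lineage events automatically, so no lineage-specific code appears in the DAG itself. The DAG and task names are hypothetical.

# Sketch: a minimal domain pipeline DAG. With the OpenLineage provider configured,
# lineage events are emitted per task run to the configured backend automatically.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    # Placeholder extraction step; a real task would read from a source system.
    print("extracting orders")


def load_orders():
    # Placeholder load step; a real task would write to the domain's warehouse.
    print("loading orders")


with DAG(
    dag_id="orders_domain_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_orders", python_callable=load_orders)
    extract >> load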
Alternatively, pre-built SaaS solutions also exist, such as Collibra, Informatica, and Alation; each of these vendors provides data cataloging, data quality, data lineage, and auditing capabilities. Google Cloud also launched its data governance platform, which I helped design, called Dataplex. These services can be further enhanced through extensive metadata collection and data-loss-prevention services.
1. Data Loss Prevention (DLP)
Traditional data loss prevention often relies on centralized security frameworks, creating performance challenges as data scales. In a data mesh, DLP can be embedded in each data domain to allow domain teams to define security controls specific to their data types, using tools like Apache Ranger for managing access policies and Sentry for monitoring data access patterns. Domain-level DLP frameworks, such as those enabled by HashiCorp Vault for secrets management, ensure sensitive data remains protected near its source, limiting exposure risks and aligning with compliance requirements.
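As a rough sketch of domain-level secrets handling, the snippet below uses the hvac client for HashiCorp Vault. It assumes a Vault server with the KV v2 secrets engine enabled and standard VAULT_ADDR/VAULT_TOKEN environment variables; the paths and field names are hypothetical.

# Sketch: storing and reading a domain-scoped credential in HashiCorp Vault (KV v2).
# Assumes VAULT_ADDR and VAULT_TOKEN are set; the secret path is hypothetical.
import os

import hvac

client = hvac.Client(url=os.environ["VAULT_ADDR"], token=os.environ["VAULT_TOKEN"])

# Each domain keeps its secrets under its own path prefix.
client.secrets.kv.v2.create_or_update_secret(
    path="domains/orders/warehouse-readonly",
    secret={"username": "orders_ro", "password": "example-only"},
)

read = client.secrets.kv.v2.read_secret_version(path="domains/orders/warehouse-readonly")
print(read["data"]["data"]["username"])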
Impact on AI: Embedding DLP at the domain level allows AI models to incorporate sensitive information within secure environments, opening the potential for more robust models without compromising privacy or regulatory compliance.
2. Data Annotations and Metadata Management
Annotations enrich datasets with context, facilitating their effective use across teams. Centralized legacy annotation frameworks often limit data discoverability. In a data mesh, annotations and metadata management can be handled by each domain using open-source tools like Amundsen or DataHub for metadata tracking and cataloging. These tools allow teams to document data origin, transformations, and context in a shared repository, which can also be integrated with Apache Atlas for enterprise-grade metadata and governance capabilities.
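As an illustration, the sketch below pushes a short description and a couple of custom properties to DataHub using its Python emitter (the acryl-datahub package). The dataset name, properties, and GMS endpoint are hypothetical, and the exact classes can vary between DataHub releases.

# Sketch: annotating a dataset in DataHub with a description and custom properties.
# Assumes a DataHub GMS endpoint at localhost:8080; all names are hypothetical.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(platform="bigquery", name="orders.daily_summary", env="PROD")

properties = DatasetPropertiesClass(
    description="Daily order rollup owned by the orders domain.",
    customProperties={"domain": "orders", "steward": "data-eng"},
)

emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties))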
Impact on AI: Rich metadata and annotations enable data scientists to quickly understand data context, crucial for training high-accuracy, domain-specific models, and reducing time spent on data exploration and preparation.
3. Data Classification
Data classification is crucial for organizing, securing, and managing data at scale. In a legacy environment, classification processes are often centralized, leading to gaps as data changes over time. With a data mesh, classification can be standardized yet decentralized, enabling domain teams to apply tags and labels using tools like Apache Atlas or OpenMetadata. These tools allow classification to be uniformly applied and updated within each domain, helping to ensure that sensitive data remains properly labeled and governed across the organization.
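For instance, Apache Atlas exposes a REST API for attaching classifications to entities. The sketch below (using the requests library) tags an entity as PII; it assumes the PII classification type already exists in Atlas, and the URL, credentials, and entity GUID are placeholders.

# Sketch: attaching a "PII" classification to an entity via the Apache Atlas REST API.
# The Atlas URL, credentials, and entity GUID below are placeholders.
import requests

ATLAS_URL = "http://localhost:21000/api/atlas/v2"
ENTITY_GUID = "00000000-0000-0000-0000-000000000000"  # placeholder GUID

response = requests.post(
    f"{ATLAS_URL}/entity/guid/{ENTITY_GUID}/classifications",
    json=[{"typeName": "PII"}],
    auth=("admin", "admin"),  # placeholder credentials
    timeout=30,
)
response.raise_for_status()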
Impact on AI: Uniform data classification across domains supports AI models by making it easier to locate, structure, and understand datasets. Consistent classification also supports automated cataloging and recommendation engines, enabling data scientists to discover relevant datasets across the organization.
4. Data Profiling
Data profiling—analyzing datasets to understand their content, structure, and quality—is essential to effective data management. Legacy data profiling often occurs infrequently, providing only a limited view of data quality. In a data mesh, profiling is a domain-specific, ongoing process. Each domain team can utilize tools like Great Expectations or Deequ (an open-source library for data quality) to continuously profile and validate data. These tools help identify data quality issues early, like outliers and inconsistencies, and enable more accurate, reliable datasets.
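A quick profiling pass over a pandas DataFrame with Great Expectations might look like the sketch below. It uses the pandas-style API available in classic Great Expectations releases (entry points differ across versions), and the column names and thresholds are hypothetical.

# Sketch: lightweight profiling/validation of a domain dataset with Great Expectations.
# Uses the pandas-style API; column names and value ranges are hypothetical.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "amount": [19.99, 5.50, 42.00, 7.25],
    }
)

dataset = ge.from_pandas(df)

dataset.expect_column_values_to_not_be_null("order_id")
dataset.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

results = dataset.validate()
print(results.success)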
Impact on AI: Continuous data profiling improves AI model performance by ensuring data consistency. Ongoing profiling detects data drift or distribution changes, alerting teams to quality issues that could impact model accuracy, allowing for proactive adjustments.
5. Data Quality Management
Data quality management is fundamental in a data mesh, where data is distributed across domain teams. Legacy quality practices often rely on centralized checks, leading to inefficiencies in maintaining standards across diverse datasets. In a data mesh, tools like Great Expectations or dbt support automated testing, validation, and feedback loops within each domain. Each team can set and monitor quality thresholds tailored to their data, maintaining high standards that align with organizational objectives. With Great Expectations, domains can automate validation steps and continuous monitoring.
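One common pattern is a quality gate at the end of a domain's pipeline that fails the run when validation drops below a domain-defined threshold. The sketch below is library-agnostic and purely illustrative; in practice the check results would come from Great Expectations or dbt test output, and the threshold is hypothetical.

# Sketch: a domain-level quality gate with a hypothetical pass-rate threshold.
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool


def enforce_quality_gate(results: list[CheckResult], min_pass_ratio: float) -> None:
    """Raise if the share of passing checks drops below the domain's threshold."""
    passed = sum(1 for r in results if r.passed)
    ratio = passed / len(results)
    if ratio < min_pass_ratio:
        failing = [r.name for r in results if not r.passed]
        raise RuntimeError(f"Quality gate failed ({ratio:.0%} passing): {failing}")


# Example: the orders domain requires at least 95% of checks to pass.
enforce_quality_gate(
    [CheckResult("not_null_order_id", True), CheckResult("amount_in_range", True)],
    min_pass_ratio=0.95,
)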
Impact on AI: High data quality improves AI model outputs, reducing the time spent on data cleaning and ensuring AI models are trained on reliable, consistent data. Decentralized quality management allows faster model training cycles and builds trust in AI-driven insights.