Data Quality and Governance: A Practical Guide for Small Companies

Small companies sometimes treat data quality as an afterthought, focusing instead on getting data quickly to make business decisions. However, poor data quality compounds over time, leading to lost opportunities, bad choices, and expensive cleanup efforts. This guide shows how to implement practical data quality and governance processes that scale with your company.

The Three Pillars of Data Quality

1. Source-Level Quality

Quality starts at data ingestion. Whether data comes from application events, third-party APIs, or manual entry, you need controls at the point where data enters your system. Poor-quality data that makes it past this first checkpoint will cause issues throughout your pipeline. Moreover, finding and fixing data quality issues becomes exponentially more expensive as data moves downstream.

Validate data as it arrives

Incoming data must be validated before it enters your system. At a minimum, check that required fields exist and contain valid values. For example, ensure email addresses match a valid pattern, dates are formatted correctly, and numeric fields contain reasonable values. Modern data ingestion tools like Fivetran and Airbyte provide these capabilities out of the box. When building custom ingestion, add validation rules that match your business needs. Load the good records and route the failures to an “error” table. Broken data often follows a pattern, so it can usually be fixed and reloaded, and sometimes that fix can be added to the permanent pipeline.
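
Below is a minimal sketch of this pattern in Python. It is illustrative rather than a reference implementation: the required fields, the email regex, and the idea of returning “good” and “failed” batches for separate tables are assumptions you would adapt to your own schema.

```python
import re
from datetime import datetime, timezone

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_order(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is loadable."""
    errors = []
    for field in ("order_id", "email", "amount", "order_date"):
        if record.get(field) in (None, ""):
            errors.append(f"missing required field: {field}")
    if record.get("email") and not EMAIL_RE.match(str(record["email"])):
        errors.append("invalid email format")
    amount = record.get("amount")
    if amount not in (None, ""):
        try:
            if float(amount) <= 0:
                errors.append("amount must be positive")
        except (TypeError, ValueError):
            errors.append("amount is not numeric")
    return errors

def split_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Good records go to the main table; failures go to an 'error' table with the reason attached."""
    good, failed = [], []
    for record in records:
        errors = validate_order(record)
        if errors:
            failed.append({**record, "errors": errors, "failed_at": datetime.now(timezone.utc).isoformat()})
        else:
            good.append(record)
    return good, failed
```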

Profile incoming data patterns

Data profiling helps you understand your data’s characteristics and is useful for detecting anomalies. Track metrics like record counts, value distributions, and field completeness. This baseline understanding helps you spot issues quickly. For instance, if you typically receive 1000 orders per day and suddenly see 10 orders, something is likely wrong. Similarly, if a field that’s usually 99% complete drops to 50% complete, investigate. Checks like these are easy to write yourself, and many off-the-shelf tools provide them.
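
As an illustration, here is a small sketch of such checks in Python; the 50% volume threshold and 95% completeness floor are assumptions you would tune to your own baselines.

```python
def profile_batch(records: list[dict], fields: list[str]) -> dict:
    """Compute simple profile metrics: record count and per-field completeness."""
    count = len(records)
    completeness = {
        f: (sum(1 for r in records if r.get(f) not in (None, "")) / count if count else 0.0)
        for f in fields
    }
    return {"record_count": count, "completeness": completeness}

def check_against_baseline(profile: dict, expected_count: int, min_completeness: float = 0.95) -> list[str]:
    """Flag anomalies: volume far below the historical baseline, or fields that suddenly go empty."""
    warnings = []
    if profile["record_count"] < 0.5 * expected_count:
        warnings.append(f"record count {profile['record_count']} is below 50% of baseline {expected_count}")
    for field_name, ratio in profile["completeness"].items():
        if ratio < min_completeness:
            warnings.append(f"field '{field_name}' is only {ratio:.0%} complete")
    return warnings
```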

Preserve original data copies

Permanently preserve a copy of data exactly as it arrived. This might seem wasteful, but storage is cheap compared to the cost of losing data. When issues arise (and they will), having the original data lets you investigate what went wrong and reprocess if needed. This is especially crucial for compliance - many regulations require maintaining original records. Store this data in your Bronze layer (read Architecture of Data Platform for details on this layer) and treat it as immutable: never modify it after the initial write. This data is often stored with heightened security and supported by additional processes to meet compliance requirements; for example, the CCPA requires that personal data be deleted on request.
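
One way to make the immutable Bronze copy concrete is to write each payload, untouched, to a path keyed by source, arrival date, and content hash. The sketch below assumes file or object storage; the directory layout and metadata fields are illustrative.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def write_bronze_copy(raw_payload: bytes, source: str, bronze_root: str = "bronze") -> Path:
    """Write the payload exactly as received; the file is never modified after this initial write."""
    arrived_at = datetime.now(timezone.utc)
    digest = hashlib.sha256(raw_payload).hexdigest()[:16]
    path = Path(bronze_root) / source / arrived_at.strftime("%Y/%m/%d") / f"{digest}.raw"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(raw_payload)
    # Keep a small metadata record alongside the payload for audits and later reprocessing.
    meta = {"source": source, "arrived_at": arrived_at.isoformat(), "sha256": digest, "bytes": len(raw_payload)}
    path.with_suffix(".meta.json").write_text(json.dumps(meta))
    return path
```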

Document data lineage

Data lineage tracks how data moves and transforms through your system. At the source level, record where data came from, when it arrived, and what initial transformations were applied. This documentation is crucial for debugging issues and maintaining compliance. For example, if a report shows incorrect numbers, data lineage helps you track back to the source to determine whether the issue was bad input data, a faulty transformation, or a business logic problem. There are many tools that assist with lineage tracking.
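
At the source level, lineage can start as nothing more than a metadata record written with every load. A sketch of what that record might contain (the field names are illustrative):

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    """Minimal lineage for one load: where the data came from, when, and what was done to it."""
    source_system: str            # e.g. "stripe_api" or "orders_csv_upload"
    source_reference: str         # file path, API endpoint, or batch id
    arrived_at: str
    target_table: str
    transformations: list[str] = field(default_factory=list)

def record_load(source_system: str, source_reference: str, target_table: str,
                transformations: list[str]) -> dict:
    """Build the lineage entry; in practice you would append it to a lineage table or log."""
    return asdict(LineageRecord(
        source_system=source_system,
        source_reference=source_reference,
        arrived_at=datetime.now(timezone.utc).isoformat(),
        target_table=target_table,
        transformations=transformations,
    ))

# Example:
# record_load("orders_csv_upload", "s3://raw/orders/2024-06-01.csv", "bronze.orders",
#             ["dropped duplicate order_ids", "normalized timestamps to UTC"])
```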

2. Process-Level Quality

As data moves through your system, quality issues can emerge at each step. Processing errors can corrupt data, transformations can drop records, and integrating data from multiple sources can create inconsistencies. A multi-layer quality approach catches issues before they impact business decisions.

Ingestion Layer Checks

The ingestion layer is your first defense against data quality issues. This layer should detect both technical problems (like format changes) and business issues (like missing data). Most importantly, it should catch issues before they propagate through your system. For example, if a vendor changes their API format, you want to see this at ingestion rather than discovering broken reports days later.
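
A simple way to catch a format change at this layer is to compare incoming field names against the schema you expect. The expected column set below is a placeholder you would maintain per source.

```python
EXPECTED_COLUMNS = {"order_id", "customer_id", "amount", "currency", "created_at"}

def check_schema(batch: list[dict]) -> list[str]:
    """Compare incoming field names against the expected schema and report drift."""
    if not batch:
        return ["batch is empty"]
    incoming = set().union(*(record.keys() for record in batch))
    issues = []
    missing = EXPECTED_COLUMNS - incoming
    unexpected = incoming - EXPECTED_COLUMNS
    if missing:
        issues.append(f"missing expected columns: {sorted(missing)}")
    if unexpected:
        issues.append(f"unexpected new columns (possible vendor format change): {sorted(unexpected)}")
    return issues
```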

Processing Layer Validation

The processing layer is where data from different sources comes together and transformations occur. This is where subtle data quality issues often emerge. For instance, a seemingly correct join might drop records due to truncated key values, or a transformation might produce unexpected results with certain inputs. Regular validation at this layer prevents these issues from reaching users.
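
For the silent-join-drop case in particular, a small reconciliation check helps: compare the keys that went into the join with the keys that came out. The 1% tolerance below is an assumption to adjust for your data.

```python
def reconcile_join(input_keys: set, joined_keys: set, tolerance: float = 0.01) -> list[str]:
    """Flag joins that drop more records than expected, e.g. due to truncated or mismatched keys."""
    if not input_keys:
        return ["no input keys to reconcile"]
    dropped = input_keys - joined_keys
    drop_rate = len(dropped) / len(input_keys)
    if drop_rate > tolerance:
        sample = sorted(dropped)[:5]
        return [f"join dropped {drop_rate:.1%} of records; sample of missing keys: {sample}"]
    return []
```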

Reporting Layer Verification

The reporting layer is your last chance to catch issues before they reach business users. This layer should focus on business rule validation and the accuracy of final metrics. For example, if revenue suddenly drops by 50%, you want to catch this before executives see the report and make decisions based on incorrect data.
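
A last-mile check can be as simple as comparing the metric about to be published against the prior period. The drop and spike thresholds below are placeholders, not recommendations.

```python
def verify_metric(current: float, prior: float, max_drop: float = 0.5, max_spike: float = 2.0) -> list[str]:
    """Sanity-check a final business metric against the prior period before publishing a report."""
    if prior <= 0:
        return ["prior-period value is zero or negative; verify inputs manually"]
    ratio = current / prior
    issues = []
    if ratio < (1 - max_drop):
        issues.append(f"metric dropped {1 - ratio:.0%} versus the prior period; hold the report and investigate")
    if ratio > max_spike:
        issues.append(f"metric is {ratio:.1f}x the prior period; check for double-counting or backfills")
    return issues

# Example: verify_metric(current=52_000, prior=101_500)
```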

3. Usage-Level Quality

Data quality isn’t just about accuracy - it’s about usability. Even perfectly accurate data can be worthless if users can’t find it or don’t understand it. Usage-level quality focuses on making data accessible, understandable, and trustworthy for your users. This is particularly important in small companies where users often need to be self-sufficient with data.

Document Definitions Clearly

Clear documentation is the foundation of data usability. When different teams interpret metrics differently, it leads to confusion and conflicts. For example, if marketing defines “active users” differently than product, you’ll have endless meetings resolving discrepancies instead of making decisions.
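
One lightweight remedy is a small, version-controlled metric registry that every report reads from, so “active users” means exactly one thing. The definitions below are illustrative, not prescriptive.

```python
# A single shared place where metric definitions live, reviewed like any other code change.
METRIC_DEFINITIONS = {
    "active_users": {
        "definition": "Distinct users with at least one session in the trailing 28 days",
        "owner": "product",
        "sql": "SELECT COUNT(DISTINCT user_id) FROM sessions WHERE session_date >= CURRENT_DATE - 28",
    },
    "revenue": {
        "definition": "Sum of captured payments, net of refunds, in USD",
        "owner": "finance",
        "sql": "SELECT SUM(amount_usd) FROM payments WHERE status = 'captured'",
    },
}

def describe(metric: str) -> str:
    """Give analysts and dashboards the canonical definition instead of letting each re-derive it."""
    entry = METRIC_DEFINITIONS[metric]
    return f"{metric}: {entry['definition']} (owner: {entry['owner']})"
```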

Maintain Consistent Metrics

Consistency in metrics is crucial for business decisions. When the same metric is calculated differently in different reports, it erodes trust in data. This issue is particularly common in small companies where reports often evolve organically rather than through central planning.

Track Data Usage

Understanding how data is used helps you focus quality efforts where they matter most. There’s no point in spending time improving the quality of unused data while neglecting critical datasets. Usage tracking also helps identify potential training needs or documentation gaps.

Monitor Query Patterns

Query patterns can reveal both problems and opportunities. Slow queries indicate the need for optimization, while frequent manual calculations suggest the need for new standardized metrics.

Practical Governance

Governance often sounds bureaucratic and heavyweight, but small companies need governance too - just implemented practically. Good governance enables growth and prevents chaos, while poor governance (or none at all) leads to data swamps and compliance nightmares. The key is finding the right balance for your size and needs.

Compliance and Security

Modern data platforms handle sensitive information ranging from personal data to financial records. Regulations like GDPR, CCPA, and industry-specific requirements aren’t optional, even for small companies. However, compliance doesn’t have to be overwhelming if built into your processes from the start.

Documentation and Discovery

In small companies, tribal knowledge often rules. This works until key people leave or the company grows. Good documentation doesn’t just help current team members; it’s crucial for onboarding new people and scaling the organization. The key is making documentation maintainable and actually useful.

Alert Management

Alert fatigue is real and dangerous. Too many alerts lead to ignored alerts, which defeats their purpose. Effective alert management requires balancing the need for monitoring with the reality of human attention limits.

Dynamic Thresholds

Static thresholds often lead to false positives. A 20% drop in orders might be expected on weekends but critical on weekdays. Dynamic thresholds adjust to your business patterns, reducing noise while catching real issues.
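
A sketch of a day-of-week-aware threshold, assuming you have a few weeks of daily counts to learn from; the 40% drop allowance is an assumption to tune.

```python
from datetime import date
from statistics import mean

def dynamic_threshold(history: dict[date, int], today: date, drop_fraction: float = 0.4) -> float:
    """Build the threshold from the same weekday in recent history, so a quiet Sunday
    is not judged against a busy Tuesday."""
    same_weekday = [count for day, count in history.items() if day.weekday() == today.weekday()]
    if len(same_weekday) < 3:
        return 0.0  # not enough history; skip the check rather than alert on noise
    return (1 - drop_fraction) * mean(same_weekday)

def is_volume_alert(todays_count: int, history: dict[date, int], today: date) -> bool:
    """Alert only when today's volume falls below the weekday-specific threshold."""
    threshold = dynamic_threshold(history, today)
    return threshold > 0 and todays_count < threshold
```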

Anomaly Detection

Simple threshold-based alerts aren’t enough for complex data patterns. Modern anomaly detection can catch subtle issues while reducing false positives. The key is combining statistical methods with business context.
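
As one illustration of combining statistics with business context, the sketch below uses a robust z-score (median and MAD) plus a minimum absolute change, so statistically unusual but commercially trivial moves are ignored; both cutoffs are assumptions.

```python
from statistics import median

def robust_anomaly(history: list[float], value: float,
                   z_cutoff: float = 3.5, min_abs_change: float = 50.0) -> bool:
    """Flag a value far from the historical median in MAD terms, but only when the
    absolute change is also large enough to matter to the business."""
    if len(history) < 10:
        return False  # not enough history for a meaningful baseline
    med = median(history)
    mad = median(abs(x - med) for x in history) or 1e-9  # guard against zero spread
    robust_z = 0.6745 * (value - med) / mad
    return abs(robust_z) > z_cutoff and abs(value - med) > min_abs_change
```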

Alert Routing

Getting alerts to the right people is as important as detecting issues. Different problems need different responders, and not everyone needs to know about every issue.
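
Routing can start as a plain mapping from alert category and severity to a destination; the teams and channels below are placeholders.

```python
# Who hears about what: critical issues page someone, the rest land in team channels.
ROUTES = {
    ("ingestion", "critical"): "on-call pager",
    ("ingestion", "warning"): "#data-eng",
    ("reporting", "critical"): "#analytics plus email to the data lead",
    ("reporting", "warning"): "#analytics",
}

def route_alert(category: str, severity: str) -> str:
    """Send each alert to the narrowest audience that can act on it; never drop it silently."""
    return ROUTES.get((category, severity), "#data-eng")
```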

Getting Started

Starting a data quality and governance program can feel overwhelming. The key is to start small, focus on high-impact areas, and build incrementally. Remember, imperfect coverage of critical data is better than perfect coverage of non-critical data.

Begin with Critical Data Elements

Start with the data that drives key business decisions. This might be revenue data, customer information, or product metrics. Don’t try to tackle everything at once. If a data quality issue would cause immediate business problems, that’s where you should focus first.

Implement Basic Quality Checks

Start with simple, high-impact checks. You can build sophistication over time, but basic checks catch the most critical issues. For example, ensuring order amounts are positive or customer emails are valid catches obvious problems before they propagate.

Document as You Go

Documentation is easiest when done during implementation. Waiting until later means details are forgotten or recorded incorrectly. Plus, good documentation immediately helps new team members understand the system.

Expand Coverage Gradually

Once basic coverage is in place for critical data, gradually expand. Use what you’ve learned from initial implementation to improve your approach. Pay attention to what works and what doesn’t in your environment.

Connection to Architecture

While this guide focuses on data quality and governance, these practices integrate naturally with the Medallion Architecture discussed in our overview. Each layer in the Medallion Architecture provides natural quality checkpoints and governance boundaries.

Bronze Layer Integration

The Bronze layer is your first line of defense. This is where source-level quality checks happen and where you preserve original data. It’s also where you begin separating sensitive data and implementing security controls.

Silver Layer Integration

Most processing occurs in the Silver layer, making it crucial for process-level quality. This is where you standardize data and create consistent views across sources.

Gold Layer Integration

The Gold layer focuses on business metrics and reporting, making it the primary focus for usage-level quality. This is where you ensure data is accurate and useful.

Remember, quality and governance aren’t separate from your architecture - they’re integral parts of it. Building them in from the start is always easier than adding them later.

Back to Overview