# PostgreSQL Data Management System
## Overview
This collection of PostgreSQL functions analyzes table structures, creates optimized materialized views, and maintains their health over time. It consists of two integrated subsystems, one for table analysis and one for materialized view management, that work together to improve database performance, data quality, and maintenance efficiency.
## Core Subsystems
### 1. Table Analysis Subsystem
This subsystem analyzes database tables to identify their characteristics, data quality, and optimal strategies for keys, partitioning, and ordering.
**Key Features:**
- Statistical sampling for efficient analysis of large tables
- Column-level fitness evaluation for primary/foreign key suitability
- Data quality assessment with encoding issue detection
- Identification of optimal column combinations for partitioning
- Detection of timestamp columns suitable for ordering
- Overall Data Quality Index (DQI) calculation
**Primary Functions:**
- `grok_analyze_table_fitness`: Main entry point for table analysis
- `grok_analyze_column_stats`: Analyzes individual column characteristics
- `grok_analyze_column_combinations`: Evaluates column pairs for composite keys
- `grok_calculate_dqi`: Calculates the overall Data Quality Index
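
The sampling and key-fitness ideas above can be sketched in plain SQL. This is a minimal illustration, not the code inside `grok_analyze_column_stats`: the `demo_customer` table and its columns are hypothetical, and the uniqueness and null ratios are assumed stand-ins for whatever metrics the functions actually compute.

```sql
-- Hypothetical demo table; not part of c77_mvc.
CREATE TEMP TABLE demo_customer AS
SELECT g AS customer_id,
       'region_' || (g % 5) AS region,
       CASE WHEN g % 10 = 0 THEN NULL
            ELSE 'user' || g || '@example.com'
       END AS email
FROM generate_series(1, 10000) AS g;

-- Sample roughly 10% of rows, then measure how unique and how complete
-- each candidate column is within the sample.
-- REPEATABLE fixes the seed so repeated runs pick the same sample.
SELECT count(*) AS sampled_rows,
       round(count(DISTINCT customer_id)::numeric / NULLIF(count(*), 0), 3) AS customer_id_uniqueness,
       round(count(DISTINCT region)::numeric / NULLIF(count(*), 0), 3) AS region_uniqueness,
       round((count(*) - count(email))::numeric / NULLIF(count(*), 0), 3) AS email_null_ratio
FROM demo_customer TABLESAMPLE BERNOULLI (10) REPEATABLE (42);
```

A column whose uniqueness ratio stays at 1 with no nulls is a plausible key candidate, while a low-cardinality column such as `region` is a better fit for partitioning.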
### 2. Materialized View Management Subsystem
This subsystem creates, monitors, and maintains optimized materialized views based on insights from the table analysis.
**Key Features:**
- Optimized materialized view creation with proper indexing
- Automatic handling of character encoding issues
- Synthetic key generation for uniqueness
- Content hash generation for efficient change detection
- Health monitoring with staleness detection
- Automated maintenance and remediation actions
**Primary Functions:**
- `grok_create_optimized_matv`: Creates a complete materialized view system
- `grok_manage_matv_health`: Monitors and maintains materialized view health
- `grok_check_matv_mismatches`: Detects inconsistencies between source and materialized views
- `grok_perform_matv_action`: Executes maintenance actions on materialized views
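
The synthetic key, content hash, and encoding flag listed above can be pictured as three derived columns. The query below is a sketch under stated assumptions: the rows, the column names, the `|` separator, and the use of `md5` are all illustrative and are not taken from `grok_create_optimized_matv`.

```sql
-- Hypothetical source rows; column names are illustrative only.
WITH src (region, customer_id, name, updated_at, modified_by) AS (
    VALUES
        ('east', 1, 'Alice', timestamp '2024-01-01 10:00', 'etl'),
        ('east', 1, 'Alice', timestamp '2024-02-01 10:00', 'admin'),
        ('west', 2, 'Zoë',   timestamp '2024-01-15 09:00', 'etl')
)
SELECT
    src.*,
    -- Synthetic key component: position of the row within its partition, newest first.
    row_number() OVER (PARTITION BY region, customer_id
                       ORDER BY updated_at DESC) AS rn,
    -- Content hash over the columns that matter for change detection;
    -- modified_by is left out, mirroring the "exclude from hash" parameter.
    md5(concat_ws('|', region, customer_id, name, updated_at)) AS content_hash,
    -- Encoding flag: true when the text contains a character outside 7-bit ASCII.
    -- PostgreSQL text cannot contain NUL, so the range starts at \x01.
    name ~ '[^\x01-\x7F]' AS has_encoding_issue
FROM src;
```

Filtering on `rn = 1` keeps only the newest row per partition, which is one way to read the "filter to latest records only" option shown in the usage examples.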
## Architecture & Design Patterns
The system implements several important design patterns:
1. **View Layering Pattern**: Creates multiple views serving different purposes:
   - `vtw_*`: View To Watch (source view with data quality enhancement)
   - `matc_*`: MATerialized Copy (physical storage with indexes)
   - `vm_*`: View of Materialized view (clean data for querying)
   - `vprob_*`: View of PROBlematic data (encoding issues for review)
2. **Data Quality Management Pattern**: Automatically detects, flags, and segregates problematic data:
   - Non-ASCII character detection
   - Cleansed versions of problematic text
   - Separate views for clean vs. problematic data
3. **Change Detection Pattern**: Implements efficient methods to detect data changes:
   - Content hash generation from relevant columns
   - Timestamp-based staleness detection
   - Sampling-based consistency validation
4. **Maintenance Strategy Pattern**: Provides multiple strategies for maintaining materialized views:
   - Refresh: Updates with fresh data from the source
   - Repair: Rebuilds indexes and constraints
   - Reindex: Rebuilds indexes without dropping them
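
The layering and maintenance patterns above can be sketched with a handful of plain statements. Everything in this example is hypothetical (the `demo` schema, the table, the object names, and the `encoding_issue` column); the definitions that c77_mvc actually generates may differ, and the repair action is omitted because its steps are not described here.

```sql
-- Hypothetical source table; not part of c77_mvc.
CREATE SCHEMA IF NOT EXISTS demo;
CREATE TABLE demo.customer (
    customer_id int PRIMARY KEY,
    name        text,
    updated_at  timestamptz
);
INSERT INTO demo.customer VALUES
    (1, 'Alice', now()),
    (2, 'Zoë',   now());

-- vtw_: source view that flags rows containing non-ASCII text.
CREATE VIEW demo.vtw_customer AS
SELECT c.*,
       coalesce(c.name ~ '[^\x01-\x7F]', false) AS encoding_issue
FROM demo.customer AS c;

-- matc_: physical copy. The unique index is what permits
-- REFRESH MATERIALIZED VIEW ... CONCURRENTLY.
CREATE MATERIALIZED VIEW demo.matc_customer AS
SELECT * FROM demo.vtw_customer;
CREATE UNIQUE INDEX matc_customer_uq ON demo.matc_customer (customer_id);

-- vm_: clean rows for querying; vprob_: flagged rows for review.
CREATE VIEW demo.vm_customer AS
SELECT * FROM demo.matc_customer WHERE NOT encoding_issue;
CREATE VIEW demo.vprob_customer AS
SELECT * FROM demo.matc_customer WHERE encoding_issue;

-- Two of the maintenance actions expressed as plain statements.
REFRESH MATERIALIZED VIEW CONCURRENTLY demo.matc_customer;  -- refresh
REINDEX INDEX demo.matc_customer_uq;                        -- reindex
```

Readers query `vm_customer` for clean data, while `vprob_customer` isolates the rows that need review, so flagged rows never have to be filtered out by hand.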
## Usage Examples
### Analyzing a Table
```sql
-- Analyze a table to identify key characteristics and data quality
SELECT config.grok_analyze_table_fitness(
    'public',           -- Source schema
    'customer_data',    -- Source table
    ARRAY['id', 'uid']  -- Columns to exclude from key fitness evaluation
);
```
### Creating an Optimized Materialized View
```sql
-- Create an optimized materialized view system based on analysis results
SELECT config.grok_create_optimized_matv(
    'public',                            -- Source schema
    'customer_data',                     -- Source table
    'analytics',                         -- Target schema
    'matc_customer_summary',             -- Target materialized view name
    ARRAY['region', 'customer_type'],    -- Partition columns
    ARRAY['updated_at', 'customer_id'],  -- Order-by columns
    ARRAY['created_by', 'modified_by'],  -- Columns to exclude from hash
    true                                 -- Filter to latest records only
);
```
### Monitoring Materialized View Health
```sql
-- Check health of a materialized view
SELECT config.grok_manage_matv_health(
    'analytics',              -- Schema
    'matc_customer_summary',  -- Materialized view name
    'daily',                  -- Validation type: 'quick', 'daily', or 'full'
    NULL                      -- Action: NULL for check only, or 'refresh', 'repair', 'reindex'
);
```
### Maintaining Materialized View Health
```sql
-- Refresh a stale materialized view
SELECT config.grok_manage_matv_health(
    'analytics',              -- Schema
    'matc_customer_summary',  -- Materialized view name
    'daily',                  -- Validation type
    'refresh'                 -- Action to perform
);
```
## Performance Considerations
- **Sampling**: The system uses statistical sampling for efficient analysis of large tables
- **Concurrent Refresh**: Uses concurrent refresh when possible (requires unique indexes)
- **Validation Modes**: Offers different validation modes with performance/thoroughness tradeoffs:
  - `quick`: Fastest, uses 0.1% sampling, 3-day staleness threshold
  - `daily`: Medium, uses 1% sampling, 1-day staleness threshold
  - `full`: Most thorough, uses 100% sampling, 12-hour staleness threshold
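
To make the tradeoff concrete, the three modes can be written as a lookup and compared against a refresh time. The percentages and thresholds below are copied from the list above; the CTE layout, the `last_refresh` value, and the `is_stale` column are assumptions made for illustration, not the logic inside `grok_manage_matv_health`.

```sql
WITH validation_modes (mode, sample_percent, staleness_threshold) AS (
    VALUES
        ('quick', 0.1,   interval '3 days'),
        ('daily', 1.0,   interval '1 day'),
        ('full',  100.0, interval '12 hours')
),
-- Hypothetical last-refresh time for one materialized view.
last_refresh (refreshed_at) AS (
    VALUES (now() - interval '2 days')
)
SELECT m.mode,
       m.sample_percent,
       m.staleness_threshold,
       -- A view is stale when its age exceeds the threshold for the chosen mode.
       (now() - r.refreshed_at) > m.staleness_threshold AS is_stale
FROM validation_modes AS m
CROSS JOIN last_refresh AS r
ORDER BY m.sample_percent;
```

With the hypothetical two-day-old refresh, `quick` reports `is_stale = false`, while `daily` and `full` both report `true`.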
## Dependencies
This system depends on the following database objects:
1. **Table Fitness Audit Table**:
   - `config.table_fitness_audit`: Stores table analysis results
2. **Materialized View Statistics Table**:
   - `public.c77_dbh_matv_stats`: Stores materialized view refresh statistics
## Best Practices
1. **Initial Analysis**: Run table analysis before creating materialized views to identify optimal configuration
2. **Regular Health Checks**: Schedule periodic health checks using `grok_manage_matv_health`
3. **Validation Types**: Use `quick` for frequent checks, `daily` for daily maintenance, and `full` for critical views
4. **Monitoring**: Track Data Quality Index (DQI) over time to detect data quality trends
5. **Maintenance Windows**: Schedule refreshes during low-usage periods for large materialized views
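
One way to schedule the checks and refreshes described above is the `pg_cron` extension. This is only a sketch: `pg_cron` is not listed as a dependency of this system and must be installed separately, and the job names and schedules are illustrative. The function arguments follow the usage examples in this README.

```sql
-- Assumes: CREATE EXTENSION pg_cron; has already been run by a superuser.
-- Hourly lightweight check, no action taken.
SELECT cron.schedule(
    'matv-quick-check',
    '0 * * * *',
    $$SELECT config.grok_manage_matv_health('analytics', 'matc_customer_summary', 'quick', NULL)$$
);

-- Daily validation plus refresh at 02:30, assumed here to be a low-usage window.
SELECT cron.schedule(
    'matv-daily-refresh',
    '30 2 * * *',
    $$SELECT config.grok_manage_matv_health('analytics', 'matc_customer_summary', 'daily', 'refresh')$$
);
```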
## Error Handling
All functions include error handling that provides:
- Clear error messages indicating what went wrong
- Processing notes to track execution steps
- Safe failure modes that avoid leaving the database in an inconsistent state
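
The three properties above can be illustrated with a small PL/pgSQL function. This is a hypothetical sketch of the general shape, not code from the system: the function name, the `jsonb` return structure, and the note texts are all assumptions.

```sql
-- Hypothetical function in the session's temporary schema.
CREATE OR REPLACE FUNCTION pg_temp.demo_safe_action(p_relation text)
RETURNS jsonb
LANGUAGE plpgsql
AS $$
DECLARE
    v_notes text[] := ARRAY[]::text[];  -- processing notes, appended step by step
BEGIN
    v_notes := v_notes || format('Checking relation %s', p_relation);

    -- to_regclass returns NULL instead of raising when the relation is missing.
    IF to_regclass(p_relation) IS NULL THEN
        RAISE EXCEPTION 'relation % does not exist', p_relation;
    END IF;

    v_notes := v_notes || 'Relation found'::text;
    RETURN jsonb_build_object('status', 'ok', 'notes', to_jsonb(v_notes));
EXCEPTION
    WHEN OTHERS THEN
        -- Database changes made inside this block are rolled back,
        -- while local variables such as v_notes keep their values.
        RETURN jsonb_build_object(
            'status', 'error',
            'error',  SQLERRM,
            'notes',  to_jsonb(v_notes)
        );
END;
$$;

SELECT pg_temp.demo_safe_action('pg_catalog.pg_class');
SELECT pg_temp.demo_safe_action('no_such_schema.no_such_table');
```

The second call returns a structured error together with the notes collected before the failure, so the caller can see how far execution got.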
## Troubleshooting
Common issues and solutions:
1. **Stale Materialized Views**: Call `grok_manage_matv_health` with the action `'refresh'`
2. **Encoding Issues**: Call `grok_manage_matv_health` with the action `'repair'`
3. **Index Performance Issues**: Call `grok_manage_matv_health` with the action `'reindex'`
4. **Missing Statistics**: Ensure `public.c77_dbh_matv_stats` table is populated with refresh statistics
## Extension Points
The system is designed to be extended in several ways:
1. Add custom data quality checks in the `vtw_` view creation
2. Extend partition and order-by column validation logic
3. Implement additional maintenance actions in `grok_perform_matv_action`
4. Add custom health metrics to `grok_manage_matv_health`