c77_mvc/USAGE.md
2025-03-27 07:00:55 -05:00

336 lines
11 KiB
Markdown

# c77_mvc PostgreSQL Extension Usage Guide
This guide provides detailed instructions on how to use the c77_mvc extension for materialized view management and table fitness analysis in PostgreSQL.
## Table of Contents
1. [Overview](#overview)
2. [Table Fitness Analysis](#table-fitness-analysis)
3. [Materialized View Management](#materialized-view-management)
4. [Materialized View Health Monitoring](#materialized-view-health-monitoring)
5. [Advanced Use Cases](#advanced-use-cases)
6. [Best Practices](#best-practices)
7. [Function Reference](#function-reference)
## Overview
The c77_mvc extension provides two main sets of functionality:
1. **Table Fitness Analysis**: Evaluate table structure for data quality, partitioning suitability, and optimization opportunities
2. **Materialized View Management**: Create and maintain optimized materialized views with content hashing, synthetic keys, and encoding status tracking
## Table Fitness Analysis
Table fitness analysis helps you evaluate table structure and data quality to inform optimization decisions.
### Basic Table Analysis
To perform a basic analysis on a table:
```sql
SELECT * FROM public.c77_mvc_analyze_table_fitness('schema_name', 'table_name');
```
This returns a JSON object containing:
- Column statistics
- Recommended partition combinations
- Order-by candidates
- Data quality index
- Analysis notes
### Example with Excluding Key Columns
If you have columns that should not be considered for key combinations:
```sql
SELECT * FROM public.c77_mvc_analyze_table_fitness(
'schema_name',
'table_name',
ARRAY['id', 'created_at']::text[]
);
```
### Interpreting Analysis Results
The analysis result includes:
1. **Column Stats**: Individual column metrics including:
- Null ratio
- Uniqueness ratio
- Encoding issue ratio
- Fitness score
2. **Recommended Partition Combinations**: Column pairs that work well together for partitioning:
- Uniqueness ratio
- Discrimination power
- Average fitness score
3. **Order-by Candidates**: Columns suitable for ordering data:
- Timestamp columns
- Text columns parseable as timestamps
4. **Data Quality Index (DQI)**: Overall score from 0-100 indicating data quality
Example query to extract key information:
```sql
SELECT
run_id,
run_timestamp,
analysis_result->>'data_quality_index' as dqi,
analysis_result->'recommended_partition_combinations' as partition_recommendations
FROM public.c77_mvc_table_fitness_audit
WHERE source_schema = 'schema_name' AND source_table = 'table_name'
ORDER BY run_id DESC
LIMIT 1;
```
## Materialized View Management
The extension provides tools to create and manage optimized materialized views.
### Creating Optimized Materialized Views
To create an optimized materialized view with synthetic keys and content hashing:
```sql
SELECT * FROM public.c77_mvc_create_optimized_matv(
'source_schema', -- Source schema name
'source_table', -- Source table name
'target_schema', -- Target schema for materialized view
'matc_target_mv_name', -- Target materialized view name (should start with matc_)
ARRAY['column1', 'column2'], -- Partition columns
ARRAY['timestamp_column'], -- Order-by columns
ARRAY['exclude_column1'], -- Columns to exclude from content hash calculation (optional)
false -- Filter to get only latest records (optional)
);
```
This creates:
1. `vtw_target_mv_name`: A view with regexp_replace for character columns
2. `matc_target_mv_name`: A materialized view derived from the vtw_ view
3. `vm_target_mv_name`: A view that filters out encoding issues
4. `vprob_target_mv_name`: A view showing only records with encoding issues
### View Structure and Purpose
When you create an optimized materialized view, multiple objects are created:
| Object Type | Naming Pattern | Purpose |
|-------------|----------------|---------|
| View | vtw_* | Source view with content hash, synthetic key, and encoding status |
| Materialized View | matc_* | Materialized copy of the vtw_ view |
| View | vm_* | Clean data view (excludes encoding issues) |
| View | vprob_* | Problematic data view (only encoding issues) |
### Example Use Case
Scenario: Creating a materialized view of customer data:
```sql
SELECT * FROM public.c77_mvc_create_optimized_matv(
'sales',
'customers',
'reporting',
'matc_customer_summary',
ARRAY['customer_id', 'region'],
ARRAY['last_updated'],
ARRAY['notes', 'internal_comments'],
false
);
```
To query clean data:
```sql
SELECT * FROM reporting.vm_customer_summary;
```
To check for encoding issues:
```sql
SELECT * FROM reporting.vprob_customer_summary;
```
## Materialized View Health Monitoring
The extension provides tools to monitor and maintain the health of materialized views.
### Checking Materialized View Health
```sql
SELECT * FROM public.c77_mvc_manage_matv_health(
'schema_name', -- Schema containing the materialized view
'matc_view_name', -- Materialized view name (should start with matc_)
'quick', -- Validation type: 'quick', 'daily', or 'full'
NULL -- Action: NULL, 'refresh', 'repair', or 'reindex'
);
```
Validation types:
- `quick`: Fast check with 0.1% sample (3-day threshold)
- `daily`: More thorough check with 1% sample (1-day threshold)
- `full`: Complete check with 100% sample (12-hour threshold)
### Automated Refresh
To check health and refresh if needed:
```sql
SELECT * FROM public.c77_mvc_manage_matv_health(
'schema_name',
'matc_view_name',
'daily',
'refresh' -- Will refresh if stale based on thresholds
);
```
### Monitoring Multiple Views
Example script to monitor all materialized views in a schema:
```sql
DO $$
DECLARE
view_record RECORD;
result JSONB;
BEGIN
FOR view_record IN
SELECT matviewname
FROM pg_matviews
WHERE schemaname = 'target_schema'
AND matviewname LIKE 'matc_%'
LOOP
RAISE NOTICE 'Checking view: %', view_record.matviewname;
SELECT * FROM public.c77_mvc_manage_matv_health('target_schema', view_record.matviewname, 'quick', NULL) INTO result;
RAISE NOTICE 'Status: %', result->>'status';
END LOOP;
END;
$$;
```
## Advanced Use Cases
### Customizing Character Encoding Handling
The extension detects non-ASCII characters using the regex pattern `[^\x00-\x7F]`. For custom handling:
1. Create your own view that modifies the `vtw_` view:
```sql
CREATE OR REPLACE VIEW custom_schema.my_custom_vtw AS
SELECT *,
CASE
WHEN column1 ~ '[^\x00-\x7F]' OR column2 ~ '[^\x20-\x7E]' THEN 'CUSTOM_ENCODING_ISSUE'
ELSE 'CLEAN'
END AS encoding_status
FROM schema_name.vtw_original_view;
```
### Batch Refresh Strategy
Example of a batch refresh strategy based on analysis:
```sql
DO $$
DECLARE
view_record RECORD;
health_result JSONB;
estimated_time INTERVAL;
total_time INTERVAL := '0 seconds'::INTERVAL;
max_batch_time INTERVAL := '2 hours'::INTERVAL;
views_to_refresh TEXT[] := '{}';
BEGIN
-- Gather health stats and estimated times
FOR view_record IN
SELECT matviewname
FROM pg_matviews
WHERE schemaname = 'target_schema'
AND matviewname LIKE 'matc_%'
LOOP
SELECT * FROM public.c77_mvc_manage_matv_health('target_schema', view_record.matviewname, 'quick', NULL)
INTO health_result;
IF health_result->>'status' = 'Stale' THEN
estimated_time := (health_result->>'estimated_refresh_time')::INTERVAL;
-- Add to batch if we don't exceed max time
IF (total_time + estimated_time) < max_batch_time THEN
views_to_refresh := array_append(views_to_refresh, view_record.matviewname);
total_time := total_time + estimated_time;
END IF;
END IF;
END LOOP;
-- Refresh the batch
FOREACH view_record.matviewname IN ARRAY views_to_refresh
LOOP
RAISE NOTICE 'Refreshing %', view_record.matviewname;
PERFORM public.c77_mvc_manage_matv_health('target_schema', view_record.matviewname, 'quick', 'refresh');
END LOOP;
END;
$$;
```
## Best Practices
### Table Fitness Analysis
1. **Analyze Regularly**: Run table fitness analysis regularly to track data quality changes
2. **Compare Over Time**: Store historical analysis results for trend tracking
3. **Sample Size Consideration**: For very large tables, adjust confidence level and margin of error:
```sql
SELECT public.c77_mvc_calculate_sample_size(10000000, 0.95, 0.05);
```
### Materialized View Management
1. **Naming Convention**: Follow the expected naming pattern:
- Materialized views must start with `matc_`
- Source views will be created with `vtw_` prefix
- Read views will be created with `vm_` prefix
- Problem record views will have `vprob_` prefix
2. **Partition Column Selection**: Choose partition columns based on:
- Table fitness analysis recommendations
- High uniqueness ratio
- Low null ratio
- Business requirements for data segmentation
3. **Order-by Column Selection**: Choose columns that:
- Represent timestamps or dates
- Have a clear logical ordering in the data
- Are regularly populated (low null ratio)
4. **Refresh Strategy**: Consider:
- Data change frequency
- Query load patterns
- Validation type based on criticality
5. **Performance Monitoring**:
- Track refresh times
- Monitor the c77_dbh_matv_stats table for historical performance
## Function Reference
### Table Fitness Analysis
| Function | Description |
|----------|-------------|
| `c77_mvc_analyze_table_fitness(source_schema, source_table, exclude_key_columns)` | Main function for table fitness analysis |
| `c77_mvc_calculate_sample_size(total_rows, confidence_level, margin_of_error)` | Calculate appropriate sample size for analysis |
| `c77_mvc_analyze_column_stats(temp_table_name, col_name, column_type, sample_size, total_rows, exclude_key_columns)` | Analyze individual column statistics |
| `c77_mvc_analyze_column_combinations(temp_table_name, column_stats, sample_size, total_rows, exclude_key_columns)` | Analyze column combinations for partitioning |
| `c77_mvc_identify_order_by_candidates(temp_table_name, column_stats)` | Identify columns suitable for ordering |
| `c77_mvc_calculate_dqi(column_stats)` | Calculate Data Quality Index |
### Materialized View Management
| Function | Description |
|----------|-------------|
| `c77_mvc_create_optimized_matv(source_schema, source_table, target_schema, target_matview, partition_columns, order_by_columns, exclude_columns_from_hash, filter_latest_only)` | Create an optimized materialized view |
| `c77_mvc_manage_matv_health(target_schema, matview_name, validation_type, action)` | Check and manage materialized view health |
| `c77_mvc_check_matv_mismatches(target_schema, matview_name, validation_type)` | Check for mismatches between source and materialized view |
| `c77_mvc_create_indexes(target_schema, target_mv_name, partition_columns)` | Create indexes on a materialized view |
| `c77_mvc_validate_matv_inputs(schema_name, matview_name, vtw_name)` | Validate materialized view inputs |
| `c77_mvc_validate_order_by_columns(source_schema, source_table, order_by_columns)` | Validate order-by columns |
| `c77_mvc_collect_matv_stats(full_matview_name, full_vtw_name)` | Collect materialized view statistics |
| `c77_mvc_estimate_matv_refresh_time(full_matview_name)` | Estimate refresh time for a materialized view |