diff --git a/README.md b/README.md
new file mode 100644
index 0000000..69e87ba
--- /dev/null
+++ b/README.md
@@ -0,0 +1,173 @@
+# PostgreSQL Data Management System
+
+## Overview
+
+This collection of PostgreSQL functions forms a comprehensive data management system designed to analyze table structures, create optimized materialized views, and maintain their health over time. The system consists of two integrated subsystems that work together to improve database performance, data quality, and maintenance efficiency.
+
+## Core Subsystems
+
+### 1. Table Analysis Subsystem
+
+This subsystem analyzes database tables to identify their characteristics, data quality, and optimal strategies for keys, partitioning, and ordering.
+
+**Key Features:**
+- Statistical sampling for efficient analysis of large tables
+- Column-level fitness evaluation for primary/foreign key suitability
+- Data quality assessment with encoding issue detection
+- Identification of optimal column combinations for partitioning
+- Detection of timestamp columns suitable for ordering
+- Overall Data Quality Index (DQI) calculation
+
+**Primary Functions:**
+- `grok_analyze_table_fitness`: Main entry point for table analysis
+- `grok_analyze_column_stats`: Analyzes individual column characteristics
+- `grok_analyze_column_combinations`: Evaluates column pairs for composite keys
+- `grok_calculate_dqi`: Calculates the overall Data Quality Index
+
+### 2. Materialized View Management Subsystem
+
+This subsystem creates, monitors, and maintains optimized materialized views based on insights from the table analysis.
+
+**Key Features:**
+- Optimized materialized view creation with proper indexing
+- Automatic handling of character encoding issues
+- Synthetic key generation for uniqueness
+- Content hash generation for efficient change detection
+- Health monitoring with staleness detection
+- Automated maintenance and remediation actions
+
+**Primary Functions:**
+- `grok_create_optimized_matv`: Creates a complete materialized view system
+- `grok_manage_matv_health`: Monitors and maintains materialized view health
+- `grok_check_matv_mismatches`: Detects inconsistencies between source and materialized views
+- `grok_perform_matv_action`: Executes maintenance actions on materialized views
+
+## Architecture & Design Patterns
+
+The system implements several important design patterns:
+
+1. **View Layering Pattern**: Creates multiple views serving different purposes:
+   - `vtw_*`: View To Watch (source view with data quality enhancement)
+   - `matc_*`: MATerialized Copy (physical storage with indexes)
+   - `vm_*`: View of Materialized view (clean data for querying)
+   - `vprob_*`: View of PROBlematic data (encoding issues for review)
+
+2. **Data Quality Management Pattern**: Automatically detects, flags, and segregates problematic data:
+   - Non-ASCII character detection
+   - Cleansed versions of problematic text
+   - Separate views for clean vs. problematic data
+
+3. **Change Detection Pattern**: Implements efficient methods to detect data changes:
+   - Content hash generation from relevant columns
+   - Timestamp-based staleness detection
+   - Sampling-based consistency validation
+
+4. **Maintenance Strategy Pattern**: Provides multiple strategies for maintaining materialized views:
+   - Refresh: Updates with fresh data from the source
+   - Repair: Rebuilds indexes and constraints
+   - Reindex: Rebuilds indexes without dropping them
+
+## Usage Examples
+
+### Analyzing a Table
+
+```sql
+-- Analyze a table to identify key characteristics and data quality
+SELECT config.grok_analyze_table_fitness(
+    'public',           -- Source schema
+    'customer_data',    -- Source table
+    ARRAY['id', 'uid']  -- Columns to exclude from key fitness evaluation
+);
+```
+
+### Creating an Optimized Materialized View
+
+```sql
+-- Create an optimized materialized view system based on analysis results
+SELECT config.grok_create_optimized_matv(
+    'public',                            -- Source schema
+    'customer_data',                     -- Source table
+    'analytics',                         -- Target schema
+    'matc_customer_summary',             -- Target materialized view name
+    ARRAY['region', 'customer_type'],    -- Partition columns
+    ARRAY['updated_at', 'customer_id'],  -- Order-by columns
+    ARRAY['created_by', 'modified_by'],  -- Columns to exclude from hash
+    true                                 -- Filter to latest records only
+);
+```
+
+### Monitoring Materialized View Health
+
+```sql
+-- Check health of a materialized view
+SELECT config.grok_manage_matv_health(
+    'analytics',              -- Schema
+    'matc_customer_summary',  -- Materialized view name
+    'daily',                  -- Validation type: 'quick', 'daily', or 'full'
+    NULL                      -- Action (NULL for check only, 'refresh', 'repair', 'reindex')
+);
+```
+
+### Maintaining Materialized View Health
+
+```sql
+-- Refresh a stale materialized view
+SELECT config.grok_manage_matv_health(
+    'analytics',              -- Schema
+    'matc_customer_summary',  -- Materialized view name
+    'daily',                  -- Validation type
+    'refresh'                 -- Action to perform
+);
+```
+
+## Performance Considerations
+
+- **Sampling**: The system uses statistical sampling for efficient analysis of large tables
+- **Concurrent Refresh**: Uses concurrent refresh when possible (requires unique indexes)
+- **Validation Modes**: Offers different validation modes with performance/thoroughness tradeoffs:
+  - `quick`: Fastest, uses 0.1% sampling, 3-day staleness threshold
+  - `daily`: Medium, uses 1% sampling, 1-day staleness threshold
+  - `full`: Most thorough, uses 100% sampling, 12-hour staleness threshold
+
+## Dependencies
+
+This system depends on the following database objects:
+
+1. **Table Fitness Audit Table**:
+   - `config.table_fitness_audit`: Stores table analysis results
+
+2. **Materialized View Statistics Table**:
+   - `public.c77_dbh_matv_stats`: Stores materialized view refresh statistics
+
+## Best Practices
+
+1. **Initial Analysis**: Run table analysis before creating materialized views to identify optimal configuration
+2. **Regular Health Checks**: Schedule periodic health checks using `grok_manage_matv_health`
+3. **Validation Types**: Use `quick` for frequent checks, `daily` for daily maintenance, and `full` for critical views
+4. **Monitoring**: Track Data Quality Index (DQI) over time to detect data quality trends
+5. **Maintenance Windows**: Schedule refreshes during low-usage periods for large materialized views
+
+## Error Handling
+
+All functions include comprehensive error handling with:
+- Clear error messages indicating what went wrong
+- Processing notes to track execution steps
+- Safe failure modes that avoid leaving the database in an inconsistent state
+
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **Stale Materialized Views**: Use `grok_manage_matv_health` with action='refresh'
+2. **Encoding Issues**: Use `grok_manage_matv_health` with action='repair'
+3. **Index Performance Issues**: Use `grok_manage_matv_health` with action='reindex'
+4. **Missing Statistics**: Ensure `public.c77_dbh_matv_stats` table is populated with refresh statistics
+
+## Extension Points
+
+The system is designed to be extended in several ways:
+
+1. Add custom data quality checks in the `vtw_` view creation
+2. Extend partition and order-by column validation logic
+3. Implement additional maintenance actions in `grok_perform_matv_action`
+4. Add custom health metrics to `grok_manage_matv_health`
diff --git a/grok_perform_matv_action-readme.md b/grok_perform_matv_action-readme.md
new file mode 100644
index 0000000..f42841f
--- /dev/null
+++ b/grok_perform_matv_action-readme.md
@@ -0,0 +1,82 @@
+# Function: grok_perform_matv_action
+
+## Overview
+This function performs maintenance actions on a materialized view based on its current health status, applying the appropriate remediation strategy.
+
+## Schema
+`config.grok_perform_matv_action`
+
+## Parameters
+- `full_matview_name` (text): Full name of the materialized view (schema.name)
+- `schema_name` (text): Schema containing the materialized view
+- `matview_name` (text): Name of the materialized view
+- `action` (text): Action to perform: 'refresh', 'repair', or 'reindex'
+- `mismatched_records` (bigint): Number of records that don't match between materialized view and source
+- `total_matview_records` (bigint): Total number of records in the materialized view
+- `time_diff` (interval): Time since last refresh
+- `mismatch_threshold` (numeric): Threshold percentage that determines when a refresh is needed
+- `time_threshold` (interval): Time threshold that determines when a refresh is needed
+- `encoding_issues` (bigint): Number of records with encoding issues
+
+## Return Value
+Returns a JSONB object indicating the action result:
+```json
+{
+  "action_performed": true,
+  "action_result": "Refreshed successfully (concurrently)"
+}
+```
+
+Or in case no action was taken or an error occurred:
+```json
+{
+  "action_performed": false,
+  "action_result": "Action skipped: threshold not met or invalid action"
+}
+```
+
+## Description
+This function implements a conditional maintenance system for materialized views based on their current health. It supports three types of actions:
+
+1. **Refresh**: Updates the materialized view with current data from the source view
+   - Uses concurrent refresh if a unique index exists
+   - Falls back to non-concurrent refresh if no unique index is found
+   - Only performed if mismatch ratio exceeds the threshold or time since last refresh exceeds the time threshold
+
+2. **Repair**: Rebuilds indexes and constraints to address encoding issues
+   - Drops all existing indexes (except primary keys)
+   - Drops primary key and unique constraints
+   - Recreates standard indexes on content_hash and synthetic_key
+   - Analyzes the table to update statistics
+   - Only performed if encoding issues are detected
+
+3. **Reindex**: Rebuilds all indexes without dropping them
+   - Can be used for routine maintenance
+   - Always performed when requested (no threshold check)
+
+The function intelligently applies the most appropriate technique based on the materialized view's structure and current state.
+
+## Index Management
+For materialized views with unique indexes, the function uses PostgreSQL's REFRESH MATERIALIZED VIEW CONCURRENTLY command, which allows queries to continue running against the materialized view during the refresh. For views without unique indexes, it falls back to the standard non-concurrent refresh.
+
+## Error Handling
+If an error occurs during action execution, the function returns information about the failure without raising an exception, allowing the calling process to continue.
+
+## Dependencies
+This function doesn't directly call other functions but is likely called by `config.grok_manage_matv_health`.
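The refresh rule and the concurrent-refresh fallback described above can be sketched in plain SQL. This is an illustrative sketch only, not the function's actual body: the `health` and `ratios` CTEs and their column aliases are hypothetical, the sample values are borrowed from this file's usage example, and querying `pg_index` is one assumed way to detect a usable unique index.

```sql
-- Hypothetical sketch: would a refresh be triggered for these inputs?
-- Assumes mismatch_threshold is a percentage, as the parameter list states.
WITH health AS (
    SELECT 155::bigint         AS mismatched_records,
           12345::bigint       AS total_matview_records,
           interval '25:30:00' AS time_diff,
           1.0::numeric        AS mismatch_threshold,
           interval '24:00:00' AS time_threshold
),
ratios AS (
    SELECT h.*,
           -- NULLIF guards against division by zero on an empty view
           100.0 * mismatched_records / NULLIF(total_matview_records, 0) AS mismatch_percent
    FROM health h
)
SELECT round(mismatch_percent, 2)                             AS mismatch_percent,
       COALESCE(mismatch_percent > mismatch_threshold, false) AS exceeds_mismatch,
       time_diff > time_threshold                             AS exceeds_time,
       COALESCE(mismatch_percent > mismatch_threshold, false)
           OR time_diff > time_threshold                      AS refresh_needed
FROM ratios;

-- Assumed check: is there a unique index that permits a concurrent refresh?
SELECT EXISTS (
    SELECT 1
    FROM pg_index i
    JOIN pg_class c     ON c.oid = i.indrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'analytics'
      AND c.relname = 'matc_daily_sales'
      AND i.indisunique
      AND i.indexprs IS NULL   -- plain columns only, no expressions
      AND i.indpred IS NULL    -- no partial-index WHERE clause
) AS can_refresh_concurrently;
```

With these sample values both refresh conditions hold (about 1.26% mismatched against a 1.0% threshold, and 25.5 hours against 24 hours), so a refresh would be triggered. The second query reflects PostgreSQL's own rule that `REFRESH MATERIALIZED VIEW CONCURRENTLY` needs a unique index on plain columns that covers all rows.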
+
+## Usage Example
+```sql
+SELECT config.grok_perform_matv_action(
+    'analytics.matc_daily_sales',
+    'analytics',
+    'matc_daily_sales',
+    'refresh',
+    155,
+    12345,
+    '25:30:00'::interval,
+    1.0,
+    '24:00:00'::interval,
+    0
+);
+```
diff --git a/grok_set_validation_params-readme.md b/grok_set_validation_params-readme.md
new file mode 100644
index 0000000..4a34809
--- /dev/null
+++ b/grok_set_validation_params-readme.md
@@ -0,0 +1,69 @@
+# Function: grok_set_validation_params
+
+## Overview
+This function sets validation parameters and thresholds based on the specified validation type for materialized view health checks.
+
+## Schema
+`config.grok_set_validation_params`
+
+## Parameters
+- `validation_type` (text): Type of validation to configure: 'quick', 'daily', or 'full'
+
+## Return Value
+Returns a JSONB object containing validation parameters and thresholds:
+```json
+{
+  "params": {
+    "sample_percent": 0.1,
+    "confidence": 0.95,
+    "margin": 0.03
+  },
+  "mismatch_threshold": 0.1,
+  "time_threshold": "3 days"
+}
+```
+
+## Description
+This function configures appropriate validation parameters and thresholds based on the specified validation type. It supports three validation modes, each with its own balance between thoroughness and performance:
+
+1. **Quick** (default): Light validation for frequent checks
+   - Sampling: 0.1% of records
+   - Confidence level: 95%
+   - Margin of error: 3%
+   - Mismatch threshold: 0.1% (data mismatch tolerance)
+   - Time threshold: 3 days (acceptable staleness)
+
+2. **Daily**: Medium validation for daily maintenance
+   - Sampling: 1% of records
+   - Confidence level: 99%
+   - Margin of error: 1%
+   - Mismatch threshold: 0.05% (data mismatch tolerance)
+   - Time threshold: 1 day (acceptable staleness)
+
+3. **Full**: Thorough validation for critical checks
+   - Sampling: 100% of records (full scan)
+   - Confidence level: 99%
+   - Margin of error: 0.5%
+   - Mismatch threshold: 0.01% (data mismatch tolerance)
+   - Time threshold: 12 hours (acceptable staleness)
+
+If an invalid validation type is provided, the function defaults to 'quick' mode parameters.
+
+## Parameter Explanations
+- `sample_percent`: Percentage of records to sample during validation
+- `confidence`: Statistical confidence level for sampling
+- `margin`: Acceptable margin of error for sampling
+- `mismatch_threshold`: Maximum acceptable percentage of mismatched records
+- `time_threshold`: Maximum acceptable time since last refresh
+
+## Dependencies
+This function is likely called by other materialized view health check functions to configure validation parameters.
+
+## Usage Example
+```sql
+-- Get validation parameters for daily checks
+SELECT config.grok_set_validation_params('daily');
+
+-- Get validation parameters for thorough health check
+SELECT config.grok_set_validation_params('full');
+```
diff --git a/grok_validate_matv_inputs-readme.md b/grok_validate_matv_inputs-readme.md
new file mode 100644
index 0000000..6bf0668
--- /dev/null
+++ b/grok_validate_matv_inputs-readme.md
@@ -0,0 +1,70 @@
+# Function: grok_validate_matv_inputs
+
+## Overview
+This function validates the existence of a materialized view and its source view before performing operations on them, ensuring inputs are valid.
+
+## Schema
+`config.grok_validate_matv_inputs`
+
+## Parameters
+- `schema_name` (text): Schema containing the materialized view and source view
+- `matview_name` (text): Name of the materialized view
+- `vtw_name` (text): Optional name of the source view (if not provided, derived from matview_name)
+
+## Return Value
+Returns a JSONB object with validation results:
+
+Success case:
+```json
+{
+  "full_matview_name": "schema.matview_name",
+  "full_vtw_name": "schema.vtw_name",
+  "notes": []
+}
+```
+
+Error case:
+```json
+{
+  "error": "Materialized view schema.matview_name does not exist",
+  "notes": []
+}
+```
+
+## Description
+This function performs input validation before executing operations on materialized views by:
+
+1. Constructing the fully qualified names for the materialized view and source view
+2. Checking if the materialized view exists in pg_matviews
+3. Checking if the source view exists in either pg_views or pg_tables
+4. Returning appropriate error messages if either object is missing
+
+If `vtw_name` is not provided, the function derives it by replacing 'matc_' with 'vtw_' in the materialized view name, following the standard naming convention.
+
+## Validation Checks
+The function checks:
+- Materialized view existence using the pg_matviews system catalog
+- Source view existence using both pg_views and pg_tables system catalogs (handles both views and tables)
+
+## Error Handling
+If validation fails, the function returns a descriptive error message indicating which object is missing. If an unexpected error occurs during validation, it returns a generic error message with the exception details.
+
+## Dependencies
+This function doesn't call other functions but is likely called by materialized view management functions before performing operations.
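The name derivation and the existence checks described above can be approximated with a single query. This is a sketch under stated assumptions rather than the function's real implementation: the `input` and `names` CTEs and the output column names `matview_exists` and `source_exists` are hypothetical, and the sample schema and view names come from this file's usage example.

```sql
-- Hypothetical sketch of the validation steps.
WITH input AS (
    SELECT 'analytics'::text        AS schema_name,
           'matc_daily_sales'::text AS matview_name,
           NULL::text               AS vtw_name      -- NULL means "derive it"
),
names AS (
    SELECT schema_name,
           matview_name,
           -- Derive the source name from the naming convention when not supplied
           COALESCE(vtw_name, replace(matview_name, 'matc_', 'vtw_')) AS source_name
    FROM input
)
SELECT schema_name || '.' || matview_name AS full_matview_name,
       schema_name || '.' || source_name  AS full_vtw_name,
       EXISTS (SELECT 1 FROM pg_matviews m
               WHERE m.schemaname = n.schema_name
                 AND m.matviewname = n.matview_name) AS matview_exists,
       -- The source may be a view or a table, so both catalogs are consulted
       EXISTS (SELECT 1 FROM pg_views v
               WHERE v.schemaname = n.schema_name
                 AND v.viewname = n.source_name)
       OR EXISTS (SELECT 1 FROM pg_tables t
               WHERE t.schemaname = n.schema_name
                 AND t.tablename = n.source_name)    AS source_exists
FROM names n;
```

Note that `replace` substitutes every occurrence of `matc_`, not only a leading prefix, so a name containing that string twice would be altered in both places. Whether the real function anchors the substitution to the prefix is not stated here.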
+
+## Usage Example
+```sql
+-- Validate materialized view with automatic source view name derivation
+SELECT config.grok_validate_matv_inputs(
+    'analytics',
+    'matc_daily_sales',
+    NULL
+);
+
+-- Validate materialized view with explicit source view name
+SELECT config.grok_validate_matv_inputs(
+    'analytics',
+    'matc_daily_sales',
+    'custom_source_view'
+);
+```
diff --git a/grok_validate_order_by_columns-readme.md b/grok_validate_order_by_columns-readme.md
new file mode 100644
index 0000000..1b1bf95
--- /dev/null
+++ b/grok_validate_order_by_columns-readme.md
@@ -0,0 +1,63 @@
+# Function: grok_validate_order_by_columns
+
+## Overview
+This function validates that specified order-by columns exist in a source table and contain data that can be parsed as timestamps, ensuring they can be used for deterministic ordering.
+
+## Schema
+`config.grok_validate_order_by_columns`
+
+## Parameters
+- `source_schema` (text): Schema containing the source table
+- `source_table` (text): Name of the source table
+- `order_by_columns` (text[]): Array of column names to validate
+
+## Return Value
+Returns a text array containing warning messages for any issues found:
+```
+{
+  "Warning: column_name not found in schema.table",
+  "Warning: column_name contains unparseable timestamp data: error message"
+}
+```
+
+## Description
+This function validates columns intended for use in ORDER BY clauses, particularly for generating synthetic keys in materialized views. It performs two types of validation:
+
+1. **Existence Check**: Verifies each column exists in the specified table
+2. **Timestamp Parsing**: Tests if each column's data can be parsed as a timestamp
+
+For timestamp parsing, the function attempts to convert the column data using:
+```sql
+TO_TIMESTAMP(SUBSTRING(NULLIF(column, ''), 1, 19), 'YYYY-MM-DD HH24:MI:SS')
+```
+
+This validation approach ensures that:
+- Columns are valid for the source table
+- Timestamp columns can be parsed consistently
+- The ORDER BY clause will produce deterministic results
+
+## Timestamp Parsing Details
+The timestamp parsing logic:
+- Uses NULLIF to convert empty strings to NULL before parsing
+- Takes only the first 19 characters using SUBSTRING
+- Uses a fixed format of 'YYYY-MM-DD HH24:MI:SS'
+
+This standardized parsing gives consistent ordering behavior as long as the first 19 characters of each value match that format. Values that do not match are reported as unparseable through the warnings described below.
+
+## Error Handling
+The function collects warnings without failing, allowing for a complete validation report:
+- Missing columns generate a warning
+- Unparseable timestamp data generates a warning with the specific error
+- If an unexpected error occurs, it returns a general error message
+
+## Dependencies
+This function is likely called by other functions that create materialized views to validate order-by columns before using them.
+
+## Usage Example
+```sql
+SELECT config.grok_validate_order_by_columns(
+    'public',
+    'customers',
+    ARRAY['created_at', 'updated_at']
+);
+```
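To see how the parsing expression from the Timestamp Parsing Details section treats different inputs, it can be applied to a few literal values. This is a standalone sketch: the `samples` CTE and the `raw_value` column are hypothetical stand-ins for a real order-by column, and only the expression itself is taken from the text above.

```sql
-- Hypothetical sample values run through the documented parsing expression.
WITH samples(raw_value) AS (
    VALUES ('2024-03-15 08:30:00.123456'),  -- fractional seconds are cut off by SUBSTRING
           ('2024-03-15 08:30:00'),         -- already in the expected 19-character form
           (''),                            -- empty string becomes NULL via NULLIF
           (NULL)                           -- NULL passes through as NULL
)
SELECT raw_value,
       TO_TIMESTAMP(
           SUBSTRING(NULLIF(raw_value, ''), 1, 19),
           'YYYY-MM-DD HH24:MI:SS'
       ) AS parsed
FROM samples;
```

The first two rows parse to the same instant, while the empty string and the NULL both yield NULL rather than an error. A value that does not follow the fixed format at all, such as free text, would be expected to raise a parse error, which is the case the function reports as a warning.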