Initial commit: Clean SEO analysis system
23
.env.example
Normal file
@@ -0,0 +1,23 @@
# WordPress Configuration
WORDPRESS_URL=https://yoursite.com
WORDPRESS_USERNAME=your_username
WORDPRESS_APP_PASSWORD=your_application_password

# OpenRouter API Configuration
OPENROUTER_API_KEY=your_openrouter_api_key

# AI Model Selection (choose one)
# Recommended: anthropic/claude-3.5-sonnet (best quality, $3/$15 per 1M tokens)
# Budget: meta-llama/llama-3.1-70b-instruct (free tier available)
# Alternative: openai/gpt-4-turbo ($10/$30 per 1M tokens)
AI_MODEL=anthropic/claude-3.5-sonnet

# Script Configuration
BATCH_SIZE=100
API_DELAY_SECONDS=0.5

# Analysis Settings
ANALYSIS_MIN_POSITION=11
ANALYSIS_MAX_POSITION=30
ANALYSIS_MIN_IMPRESSIONS=50
ANALYSIS_TOP_N_POSTS=20
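These settings are read by `config.py`, which is not shown in this chunk of the commit. As a rough illustration of how such a file can be parsed with the standard library alone, assuming plain KEY=VALUE lines (the real loader may use `python-dotenv` instead), a minimal sketch:

```python
import os

def parse_env_lines(lines):
    """Parse KEY=VALUE lines; '#' comments and blank lines are ignored."""
    values = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

def load_env(path=".env"):
    """Read a .env file; real environment variables win over file values."""
    try:
        with open(path) as fh:
            values = parse_env_lines(fh)
    except FileNotFoundError:
        values = {}
    for key in values:
        if key in os.environ:
            values[key] = os.environ[key]
    return values
```

Letting real environment variables override the file keeps the same `.env` usable across machines while allowing per-shell overrides.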
48
.gitignore
vendored
Normal file
@@ -0,0 +1,48 @@
# Configuration
.env
.env.local

# Virtual Environment
venv/
env/
ENV/
.venv

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/

# Input files (sensitive/large)
input/analytics/
input/**/*.csv
input/**/*.txt

# Output files (generated results)
output/results/
output/logs/
output/**/*.csv
output/**/*.txt
output/**/*.log
output/**/*.md

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Backup/rollback files
*.bak
rollback_*.csv
*_backup.csv
310
PROJECT_GUIDE.md
Normal file
@@ -0,0 +1,310 @@
# SEO Analysis & Improvement System - Project Guide

## 📋 Overview

A complete 4-phase SEO analysis pipeline that:
1. **Integrates** Google Analytics, Search Console, and WordPress data
2. **Identifies** high-potential keywords for optimization (positions 11-30)
3. **Discovers** new content opportunities using AI
4. **Generates** a comprehensive report with a 90-day action plan

## 📂 Project Structure

```
seo/
├── input/                         # SOURCE DATA (your exports)
│   ├── new-propositions.csv       # WordPress posts
│   ├── README.md                  # How to export data
│   └── analytics/
│       ├── ga4_export.csv         # Google Analytics
│       └── gsc/
│           ├── Pages.csv          # GSC pages (required)
│           ├── Requêtes.csv       # GSC queries (optional)
│           └── ...
│
├── output/                        # RESULTS (auto-generated)
│   ├── results/
│   │   ├── seo_optimization_report.md   # 📍 PRIMARY OUTPUT
│   │   ├── posts_with_analytics.csv
│   │   ├── posts_prioritized.csv
│   │   ├── keyword_opportunities.csv
│   │   └── content_gaps.csv
│   │
│   ├── logs/
│   │   ├── import_log.txt
│   │   ├── opportunity_analysis_log.txt
│   │   └── content_gap_analysis_log.txt
│   │
│   └── README.md                  # Output guide
│
├── 🚀 run_analysis.sh             # Run entire pipeline
├── analytics_importer.py          # Phase 1: Merge data
├── opportunity_analyzer.py        # Phase 2: Find wins
├── content_gap_analyzer.py        # Phase 3: Find gaps
├── report_generator.py            # Phase 4: Generate report
├── config.py
├── requirements.txt
├── .env.example
└── .gitignore
```

## 🚀 Getting Started

### Step 1: Prepare Input Data

**Place WordPress posts CSV:**
```
input/new-propositions.csv
```

**Export Google Analytics 4:**
1. Go to: Analytics > Reports > Engagement > Pages and Screens
2. Set date range: Last 90 days
3. Download CSV → Save as: `input/analytics/ga4_export.csv`

**Export Google Search Console (Pages):**
1. Go to: Performance
2. Set date range: Last 90 days
3. Export CSV → Save as: `input/analytics/gsc/Pages.csv`

### Step 2: Run Analysis

```bash
# Run entire pipeline
./run_analysis.sh

# OR run steps individually
./venv/bin/python analytics_importer.py
./venv/bin/python opportunity_analyzer.py
./venv/bin/python content_gap_analyzer.py
./venv/bin/python report_generator.py
```

### Step 3: Review Report

Open: **`output/results/seo_optimization_report.md`**

Contains:
- Executive summary with current metrics
- Top 20 posts ranked by opportunity (with AI recommendations)
- Keyword opportunities breakdown
- Content gap analysis
- 90-day phased action plan

## 📊 What Each Script Does

### `analytics_importer.py` (Phase 1)
**Purpose:** Merge analytics data with WordPress posts

**Input:**
- `input/new-propositions.csv` (WordPress posts)
- `input/analytics/ga4_export.csv` (Google Analytics)
- `input/analytics/gsc/Pages.csv` (Search Console)

**Output:**
- `output/results/posts_with_analytics.csv` (enriched dataset)
- `output/logs/import_log.txt` (matching report)

**Handles:** French and English column names, URL normalization, multi-source merging

### `opportunity_analyzer.py` (Phase 2)
**Purpose:** Identify high-potential optimization opportunities

**Input:**
- `output/results/posts_with_analytics.csv`

**Output:**
- `output/results/keyword_opportunities.csv` (26 opportunities)
- `output/logs/opportunity_analysis_log.txt`

**Features:**
- Filters posts at positions 11-30 (pages 2-3)
- Calculates opportunity scores (0-100)
- Generates AI recommendations for the top 20 posts
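The position/impression filter can be sketched with the standard library alone. This is an illustration, not the actual analyzer code; the column names `position` and `impressions` are assumptions about the enriched CSV layout:

```python
def find_opportunities(rows, min_pos=11, max_pos=30, min_impressions=50):
    """Keep posts ranking on pages 2-3 with enough search visibility."""
    hits = []
    for row in rows:
        try:
            position = float(row["position"])
            impressions = int(row["impressions"])
        except (KeyError, ValueError):
            continue  # skip rows with missing or malformed analytics data
        if min_pos <= position <= max_pos and impressions >= min_impressions:
            hits.append(row)
    # Highest-visibility opportunities first
    return sorted(hits, key=lambda r: int(r["impressions"]), reverse=True)
```

The thresholds mirror `ANALYSIS_MIN_POSITION`, `ANALYSIS_MAX_POSITION`, and `ANALYSIS_MIN_IMPRESSIONS` from `.env`.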

### `content_gap_analyzer.py` (Phase 3)
**Purpose:** Discover new content opportunities

**Input:**
- `output/results/posts_with_analytics.csv`
- `input/analytics/gsc/Requêtes.csv` (optional)

**Output:**
- `output/results/content_gaps.csv`
- `output/logs/content_gap_analysis_log.txt`

**Features:**
- Topic cluster extraction
- Gap identification
- AI-powered content suggestions

### `report_generator.py` (Phase 4)
**Purpose:** Create a comprehensive report with an action plan

**Input:**
- All analysis results from phases 1-3

**Output:**
- `output/results/seo_optimization_report.md` ← **PRIMARY DELIVERABLE**
- `output/results/posts_prioritized.csv`

**Features:**
- Comprehensive markdown report
- All 262 posts ranked
- 90-day action plan with estimated gains

## 📈 Understanding Your Report

### Key Metrics (Executive Summary)
- **Total Posts:** All posts analyzed
- **Monthly Traffic:** Current organic traffic
- **Total Impressions:** Search visibility (90 days)
- **Average Position:** Current ranking position
- **Opportunities:** Posts ready to optimize

### Top 20 Posts to Optimize
Each post shows:
- **Title** (the post name)
- **Current Position** (search ranking)
- **Impressions** (search visibility)
- **Traffic** (organic visits)
- **Priority Score** (0-100 opportunity rating)
- **Status** (page 1 vs pages 2-3)
- **Recommendations** (how to improve)

### Priority Scoring (0-100)
Higher scores indicate more potential gain for less effort.

Calculated from:
- **Position (35%)** - How close the post is to page 1
- **Traffic Potential (30%)** - Search impressions
- **CTR Gap (20%)** - Improvement opportunity
- **Content Quality (15%)** - Existing engagement
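The weighting above can be sketched as a small function. Note that the normalization of each component (the caps and the expected CTR) is an assumption made for illustration, not the exact formula used by `opportunity_analyzer.py`:

```python
def priority_score(position, impressions, ctr, engagement,
                   max_impressions=10000, expected_ctr=0.05):
    """Combine the four weighted components into a 0-100 score.

    Inputs are normalized to 0-1 before weighting; the normalization
    choices here are illustrative assumptions.
    """
    # Position 11 (closest to page 1) scores ~1.0; position 30 scores ~0.0
    position_component = max(0.0, min(1.0, (30 - position) / 19))
    traffic_component = min(1.0, impressions / max_impressions)
    # How far current CTR falls short of what the position could earn
    ctr_gap_component = max(0.0, min(1.0, (expected_ctr - ctr) / expected_ctr))
    quality_component = max(0.0, min(1.0, engagement))
    score = (0.35 * position_component +
             0.30 * traffic_component +
             0.20 * ctr_gap_component +
             0.15 * quality_component)
    return round(100 * score, 1)
```

A post sitting at position 11 with high impressions and a large CTR gap scores near 100, while a low-visibility post at position 29 scores in the single digits.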

## 🎯 Action Plan

### Week 1-2: Quick Wins (+100 visits/month)
- Focus on posts at positions 11-15
- Update SEO titles and meta descriptions
- 30-60 minutes per post

### Week 3-4: Core Optimization (+150 visits/month)
- Posts 6-15 in the priority list
- Add content sections
- Improve structure with headers
- 2-3 hours per post

### Week 5-8: New Content (+300 visits/month)
- Create 3-5 new posts from the gap analysis
- Target high-search-demand topics
- 4-6 hours per post

### Week 9-12: Refinement (+100 visits/month)
- Monitor ranking improvements
- Refine underperforming optimizations
- Prepare the next round of analysis

**Total: +650 visits/month potential gain**

## 🔧 Configuration

Edit `.env` to customize the analysis:
```bash
# Position range for opportunities
ANALYSIS_MIN_POSITION=11
ANALYSIS_MAX_POSITION=30

# Minimum impressions to consider
ANALYSIS_MIN_IMPRESSIONS=50

# Posts for AI recommendations
ANALYSIS_TOP_N_POSTS=20
```

## 🐛 Troubleshooting

### Missing Input Files
```
❌ Error: File not found: input/...
```
→ Check that all files are in the correct locations

### Empty Report Titles
✓ Fixed: post titles are now loaded via flexible column-name mapping

### No Opportunities Found
```
⚠️ No opportunities found in specified range
```
→ Try lowering `ANALYSIS_MIN_IMPRESSIONS` in `.env`

### API Errors
```
❌ AI generation failed: ...
```
→ Check `OPENROUTER_API_KEY` in `.env` and your account balance

## 📚 Additional Resources

- **`input/README.md`** - How to export analytics data
- **`output/README.md`** - Output files guide
- **`QUICKSTART_ANALYSIS.md`** - Step-by-step tutorial
- **`ANALYSIS_SYSTEM.md`** - Technical documentation

## ✅ Success Checklist

- [ ] All input files placed in the `input/` directory
- [ ] `.env` file configured with API key
- [ ] Ran `./run_analysis.sh` successfully
- [ ] Reviewed `output/results/seo_optimization_report.md`
- [ ] Identified 5-10 quick wins to start with
- [ ] Created an action plan for the first week

## 🎓 Key Learnings

### Why Positions 11-30 Matter
- **Page 1** posts are hard to move
- **Pages 2-3** posts are easy wins (small improvements move them up)
- **Quick gains:** moving up 1-2 positions can increase CTR by 20-30%

### CTR Expectations by Position
- Position 1: ~30% CTR
- Positions 5-10: 4-7% CTR
- Positions 11-15: 1-2% CTR (quick wins)
- Positions 16-20: 0.8-1% CTR
- Positions 21-30: ~0.5% CTR
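The CTR bands above can be turned into a coarse lookup for estimating the visit gain from a position move. The band boundaries and mid-band values are a simplification of the list, not data from the pipeline:

```python
# Approximate organic CTR by position band (upper bound, CTR),
# taken from the list above
CTR_BY_POSITION = [
    (1, 0.30), (10, 0.05), (15, 0.015), (20, 0.009), (30, 0.005),
]

def estimated_ctr(position):
    """Return the CTR of the first band whose upper bound covers the position."""
    for max_pos, ctr in CTR_BY_POSITION:
        if position <= max_pos:
            return ctr
    return 0.0

def monthly_visit_gain(impressions, current_pos, target_pos):
    """Extra visits expected from moving a post up in the rankings."""
    return impressions * (estimated_ctr(target_pos) - estimated_ctr(current_pos))
```

For example, a post with 1000 monthly impressions moving from position 12 to position 8 would gain roughly 1000 × (5% − 1.5%) = 35 visits/month under these assumptions.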

### Content Quality Signals
- A high bounce rate suggests less relevant content
- Low traffic points to a poor CTR or position
- Low impressions indicate insufficient optimization

## 📞 Support

### Check Logs First
```
output/logs/import_log.txt
output/logs/opportunity_analysis_log.txt
output/logs/content_gap_analysis_log.txt
```

### Common Issues
1. **Empty titles** → Fixed with flexible column name mapping
2. **File not found** → Check that file locations match the project structure
3. **API errors** → Verify the API key and account balance
4. **No opportunities** → Lower the minimum impressions threshold

## 🚀 Ready to Optimize?

1. Prepare your input data
2. Run `./run_analysis.sh`
3. Open the report
4. Start with the quick wins
5. Track improvements after 4 weeks

Good luck boosting your SEO! 📈

---

**Last Updated:** February 2026
**System Status:** Production Ready ✅
474
README.md
Normal file
@@ -0,0 +1,474 @@

# WordPress SEO Automation Tool

Programmatically optimize SEO titles and meta descriptions across all WordPress posts using AI-powered generation and a CSV review workflow.

## Features

- **AI-Powered SEO Generation**: Uses the OpenRouter API (Claude, GPT-4, Llama, etc.) to create optimized titles and descriptions
- **Plugin Support**: Auto-detects and works with both Yoast SEO and Rank Math
- **CSV Review Workflow**: Generate proposals, review in Excel/Sheets, approve changes before applying
- **Safety Features**: Dry-run mode, rollback CSV generation, detailed logging
- **SEO Best Practices**: Enforces 50-60 char titles, 150-160 char descriptions, keyword optimization
- **Batch Processing**: Handle hundreds or thousands of posts efficiently

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [WordPress Configuration](#wordpress-configuration)
- [OpenRouter API Setup](#openrouter-api-setup)
- [Usage](#usage)
- [Workflow](#workflow)
- [SEO Plugin Comparison](#seo-plugin-comparison)
- [Troubleshooting](#troubleshooting)
- [Cost Estimates](#cost-estimates)

## Prerequisites

- WordPress site with the Yoast SEO or Rank Math plugin installed
- Python 3.8 or higher
- WordPress Application Password (for REST API access)
- OpenRouter API key (for AI-powered generation)

## Installation

### 1. Clone or Download

```bash
cd /Users/acid/Documents/seo
```

### 2. Create Virtual Environment

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

### 3. Install Dependencies

```bash
pip install -r requirements.txt
```

### 4. Configure Environment Variables

Copy the example environment file:

```bash
cp .env.example .env
```

Edit `.env` with your credentials:

```env
WORDPRESS_URL=https://yoursite.com
WORDPRESS_USERNAME=your_username
WORDPRESS_APP_PASSWORD=your_application_password
OPENROUTER_API_KEY=your_openrouter_api_key
AI_MODEL=anthropic/claude-3.5-sonnet
```

## WordPress Configuration

### Step 1: Create Application Password

1. Log in to WordPress Admin
2. Go to **Users → Profile**
3. Scroll to the **Application Passwords** section
4. Enter an application name: "SEO Automation"
5. Click **Add New Application Password**
6. Copy the generated password (it is only shown once)
7. Add it to the `.env` file as `WORDPRESS_APP_PASSWORD`

### Step 2: Verify REST API Access

Test your authentication:

```bash
curl --user "your_username:your_app_password" \
  "https://yoursite.com/wp-json/wp/v2/posts?per_page=1&context=edit"
```

You should receive a JSON response with post data.

### Step 3: SEO Plugin Requirements

**For Yoast SEO:**
- Yoast SEO Free or Premium installed and activated
- Meta fields automatically accessible via REST API

**For Rank Math:**
- Rank Math Free or Pro installed and activated
- Meta fields automatically accessible via REST API

**Both plugins are supported** - the scripts auto-detect which one you're using.

## OpenRouter API Setup

### Why OpenRouter?

OpenRouter provides access to multiple AI models through a single API:
- **Claude 3.5 Sonnet** (recommended): Best quality, $3/$15 per 1M tokens
- **GPT-4 Turbo**: Strong performance, $10/$30 per 1M tokens
- **Llama 3.1 70B**: Free tier available
- **Gemini Pro 1.5**: Good balance, $1.25/$5 per 1M tokens

### Get API Key

1. Visit [https://openrouter.ai/](https://openrouter.ai/)
2. Sign up or log in
3. Go to the **API Keys** section
4. Create a new API key
5. Add it to the `.env` file as `OPENROUTER_API_KEY`

### Choose AI Model

Edit `AI_MODEL` in `.env`:

```env
# Best quality (recommended)
AI_MODEL=anthropic/claude-3.5-sonnet

# Budget option (free)
AI_MODEL=meta-llama/llama-3.1-70b-instruct

# OpenAI
AI_MODEL=openai/gpt-4-turbo
```

## Usage

### Step 1: Generate SEO Proposals

Fetch all posts and generate AI-powered SEO suggestions:

```bash
python fetch_posts_and_generate_seo.py
```

**Options:**

```bash
# Test with the first 5 posts
python fetch_posts_and_generate_seo.py --limit 5

# Specify the output file
python fetch_posts_and_generate_seo.py --output my_proposals.csv

# Use rule-based generation (no AI/API costs)
python fetch_posts_and_generate_seo.py --no-ai
```

This creates a CSV file in the `output/` directory with proposals for all posts.

### Step 2: Review Proposals

1. Open the generated CSV file in Excel or Google Sheets
2. Review each row:
   - Check `proposed_seo_title` (should be 50-60 chars)
   - Check `proposed_meta_description` (should be 150-160 chars)
   - Edit proposals if needed
3. Set the `status` column to `approved` for changes you want to apply
4. Set the `status` column to `rejected` for posts to skip
5. Save the CSV file

**CSV Columns:**

| Column | Description |
|--------|-------------|
| post_id | WordPress post ID |
| post_url | Post permalink |
| post_title | Original post title |
| current_seo_title | Current SEO title (from Yoast/Rank Math) |
| current_meta_description | Current meta description |
| proposed_seo_title | AI-generated SEO title |
| proposed_meta_description | AI-generated meta description |
| primary_keyword | Detected primary keyword |
| title_length | Character count of the proposed title |
| description_length | Character count of the proposed description |
| title_validation | Validation message |
| description_validation | Validation message |
| generation_method | 'ai' or 'rule-based' |
| status | Set to 'approved' to apply changes |
| notes | Your notes (optional) |
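Filtering the reviewed CSV down to its approved rows is a one-liner with `csv.DictReader`. A minimal sketch; `approved_rows` is a hypothetical helper, not part of the actual scripts:

```python
import csv

def approved_rows(fh):
    """Yield only the rows a reviewer marked 'approved' in the status column."""
    for row in csv.DictReader(fh):
        if row.get("status", "").strip().lower() == "approved":
            yield row

def load_approved(path):
    """Read a proposals CSV and return its approved rows as a list."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(approved_rows(fh))
```

Normalizing with `strip().lower()` makes the check tolerant of "Approved" or trailing spaces typed during review in a spreadsheet.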

### Step 3: Test with Dry Run

Before applying changes, test with dry-run mode:

```bash
python apply_approved_changes.py --input output/seo_proposals_YYYYMMDD_HHMMSS.csv --dry-run
```

This shows what would be updated without actually making changes.
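The dry-run behavior described above follows a common pattern: route every write through one function that checks a flag first. A sketch under assumed names (`apply_changes` and `update_fn` are illustrative, not the script's actual API):

```python
def apply_changes(changes, update_fn, dry_run=False):
    """Walk the approved changes; only call WordPress when dry_run is False."""
    applied = 0
    for change in changes:
        if dry_run:
            print(f"[DRY RUN] would update post {change['post_id']}")
            continue
        update_fn(change)  # e.g. a REST call to /wp-json/wp/v2/posts/<id>
        applied += 1
    return applied
```

Because the same loop runs in both modes, the dry run exercises the exact code path that a real run would, minus the final write.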

### Step 4: Apply Approved Changes

Apply the approved changes to WordPress:

```bash
python apply_approved_changes.py --input output/seo_proposals_YYYYMMDD_HHMMSS.csv
```

The script will:
1. Create a rollback CSV with the original values
2. Ask for confirmation
3. Apply all approved changes
4. Generate a detailed log file
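The rollback step amounts to snapshotting the current SEO fields before they are overwritten. A minimal sketch; the column set is assumed from the proposals CSV described above, not taken from the script itself:

```python
import csv

def write_rollback(posts, fh):
    """Snapshot current SEO fields so the changes can be reverted later."""
    fields = ["post_id", "current_seo_title", "current_meta_description"]
    writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)
```

`extrasaction="ignore"` lets the same row dicts used for the update pass straight through, with any extra columns dropped from the snapshot.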

## Workflow

### Complete Workflow Diagram

```
1. Generate Proposals
   └─> python fetch_posts_and_generate_seo.py
       └─> Fetches all posts from WordPress
       └─> Generates AI-powered SEO suggestions
       └─> Exports to CSV: output/seo_proposals_YYYYMMDD_HHMMSS.csv

2. Review & Edit
   └─> Open CSV in Excel/Google Sheets
       └─> Review proposed titles and descriptions
       └─> Edit as needed
       └─> Set status='approved' for changes to apply
       └─> Save CSV

3. Test (Optional)
   └─> python apply_approved_changes.py --input <csv> --dry-run
       └─> Simulates changes without applying

4. Apply Changes
   └─> python apply_approved_changes.py --input <csv>
       └─> Creates rollback CSV
       └─> Applies approved changes to WordPress
       └─> Generates log file

5. Verify
   └─> Check WordPress admin (post editor)
   └─> View source on frontend
   └─> Monitor search performance
```

### Safety Features

- **Dry Run Mode**: Test without applying changes
- **Rollback CSV**: Automatically created before applying changes
- **Detailed Logging**: All operations logged to `output/application_log_YYYYMMDD_HHMMSS.txt`
- **Validation**: Enforces character limits and checks for duplicates
- **Confirmation Prompt**: Requires a 'yes' confirmation before applying changes
- **Rate Limiting**: Prevents overwhelming the WordPress server

## SEO Plugin Comparison

### Should You Switch from Yoast to Rank Math?

**Current: Yoast SEO Free**
- ✓ Market leader (12M users)
- ✓ Reliable and well-tested
- ✗ Only 1 focus keyword (vs unlimited in Rank Math)
- ✗ No redirect manager (premium only, $118.80/year)
- ✗ Limited schema support
- ✗ No internal linking suggestions

**Alternative: Rank Math Free**
- ✓ **Unlimited focus keywords** (vs 1 in Yoast Free)
- ✓ **Redirect manager included** (premium in Yoast)
- ✓ **20+ rich snippet types** (FAQ, Product, Recipe, etc.)
- ✓ **Better performance** (40% less code)
- ✓ **Internal linking suggestions**
- ✓ **Google Trends integration**
- ✓ **One-click Yoast migration** (preserves all data)
- ✗ Smaller community (900K vs 12M users)

**Recommendation for free-tier users:** Switch to Rank Math Free

**Migration Steps:**
1. Install the Rank Math plugin
2. Run the Setup Wizard → Import from Yoast
3. All SEO data is automatically transferred
4. Deactivate (don't delete) Yoast as a backup
5. Test a few posts
6. If satisfied, delete Yoast

**These scripts work with both plugins** - they auto-detect which one you're using.

## SEO Best Practices (2026)

### Title Optimization
- **Length**: 50-60 characters (≤600 pixels in SERPs)
- **Keyword placement**: Primary keyword in the first 60 characters
- **Uniqueness**: Every post must have a unique title
- **Compelling**: Written to improve click-through rate (CTR)
- **Natural**: No keyword stuffing

### Meta Description Optimization
- **Length**: 150-160 characters (optimal for SERP display)
- **User intent**: Address what the reader will learn or gain
- **Keyword inclusion**: Primary keyword appears naturally
- **Uniqueness**: Every post must have a unique description
- **Value proposition**: Highlight what makes the content unique
- **CTR focused**: Compelling language to encourage clicks

**Note**: Google rewrites 62%+ of meta descriptions, but they still matter for:
- CTR when not overridden
- Social media sharing (Open Graph)
- Signaling relevance to search engines
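The length guidelines above can be checked mechanically. These functions mirror the `validate_seo_title()` / `validate_meta_description()` helpers mentioned under "Extending the Scripts", but the signatures and return values here are assumptions for illustration:

```python
def validate_seo_title(title):
    """Check the 50-60 character guideline; returns (ok, message)."""
    n = len(title)
    if n < 50:
        return False, f"too short ({n} chars, aim for 50-60)"
    if n > 60:
        return False, f"too long ({n} chars, may be truncated in SERPs)"
    return True, "ok"

def validate_meta_description(description):
    """Check the 150-160 character guideline; returns (ok, message)."""
    n = len(description)
    if n < 150:
        return False, f"too short ({n} chars, aim for 150-160)"
    if n > 160:
        return False, f"too long ({n} chars, may be truncated in SERPs)"
    return True, "ok"
```

Character counts are only a proxy for the pixel limits Google actually applies, so a title at exactly 60 characters can still be truncated if it uses wide glyphs.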

## Troubleshooting

### Error: "Authentication failed"

**Cause**: Invalid WordPress username or application password

**Solution**:
1. Verify the username is correct (not the email address)
2. Regenerate the application password in WordPress
3. Update the `.env` file with the new password
4. Ensure there are no extra spaces in the credentials

### Error: "Access forbidden"

**Cause**: The user doesn't have permission to edit posts

**Solution**:
1. Ensure the user has the Editor or Administrator role
2. Check whether the REST API is disabled by a security plugin
3. Temporarily disable security plugins and test

### Error: "OpenRouter API key invalid"

**Cause**: Invalid or missing OpenRouter API key

**Solution**:
1. Get an API key from https://openrouter.ai/
2. Update `OPENROUTER_API_KEY` in `.env`
3. Ensure there are no extra quotes or spaces

### Error: "No posts found"

**Cause**: No published posts, or an authentication issue

**Solution**:
1. Verify you have published posts in WordPress
2. Check that authentication is working (see WordPress Configuration)
3. Try `--limit 1` to test with a single post

### SEO Plugin Not Detected

**Cause**: Plugin not installed, or meta fields not exposed

**Solution**:
1. Verify Yoast SEO or Rank Math is installed and activated
2. Check whether custom code blocks meta field access
3. The scripts default to Yoast field names if detection fails

### AI Generation Fails

**Cause**: OpenRouter API error or rate limit

**Solution**:
1. Check that your OpenRouter account has credits
2. Try a different AI model (e.g. switch to the free Llama model)
3. Use the `--no-ai` flag for rule-based generation
4. Check the log files for specific error messages

## Cost Estimates

### OpenRouter API Costs

**Using Claude 3.5 Sonnet (Recommended):**
- Average post: ~2000 input tokens + ~200 output tokens
- Cost per post: ~$0.009
- **100 posts: ~$0.90**
- **1000 posts: ~$9.00**

**Using Free Models:**
- Llama 3.1 70B: **$0.00** (free tier)
- No cost for generation

**Rule-Based Generation:**
- No API costs
- Use the `--no-ai` flag
- Lower quality, but free

## File Structure

```
/Users/acid/Documents/seo/
├── .env                              # Your credentials (git-ignored)
├── .env.example                      # Example configuration
├── .gitignore                        # Git ignore rules
├── requirements.txt                  # Python dependencies
├── config.py                         # Configuration loader
├── seo_generator.py                  # SEO generation logic
├── fetch_posts_and_generate_seo.py   # Main fetching script
├── apply_approved_changes.py         # Application script
├── README.md                         # This file
└── output/                           # Generated files
    ├── seo_proposals_*.csv           # Generated proposals
    ├── rollback_*.csv                # Backup files
    └── application_log_*.txt         # Detailed logs
```

## Development Notes

### Testing

**Test with a small batch first:**

```bash
# Generate proposals for 5 posts
python fetch_posts_and_generate_seo.py --limit 5

# Review the CSV and approve changes

# Dry run to verify
python apply_approved_changes.py --input output/seo_proposals_*.csv --dry-run

# Apply to the 5 posts
python apply_approved_changes.py --input output/seo_proposals_*.csv
```

**Verify changes:**
1. Open the WordPress post editor
2. Check that the Yoast/Rank Math SEO box shows the updated title and description
3. View source on the frontend: check the `<title>` and `<meta name="description">` tags
4. Test the rollback CSV if needed

### Extending the Scripts

**Add custom validation:**
- Edit `seo_generator.py` → `validate_seo_title()` and `validate_meta_description()`

**Change AI model:**
- Edit `.env` → `AI_MODEL=openai/gpt-4-turbo`

**Customize prompts:**
- Edit `seo_generator.py` → the `_generate_with_ai()` method

**Add more meta fields:**
- Edit the scripts to include focus keywords, Open Graph tags, etc.

## Support

For issues or questions:
1. Check the troubleshooting section of this README
2. Review the log files in the `output/` directory
3. Test with `--dry-run` mode first
4. Start with `--limit 5` for testing

## License

This tool is provided as-is for WordPress SEO optimization. Use responsibly, and always back up your WordPress site before bulk updates.

## Changelog

### Version 1.0.0 (2026-02-15)
- Initial release
- AI-powered SEO generation via OpenRouter
- Support for Yoast SEO and Rank Math
- CSV review workflow
- Safety features (dry-run, rollback, logging)
- Auto-detection of SEO plugins
427
analytics_importer.py
Normal file
@@ -0,0 +1,427 @@
"""
|
||||||
|
Analytics data importer for SEO analysis.
|
||||||
|
Merges Google Analytics and Search Console data with WordPress posts.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import csv
|
||||||
|
import json
|
||||||
|
import argparse
|
||||||
|
from pathlib import Path
|
||||||
|
from urllib.parse import urlparse, parse_qs
|
||||||
|
from collections import defaultdict
|
||||||
|
from config import Config
|
||||||
|
|
||||||
|
|
||||||
|
class AnalyticsImporter:
|
||||||
|
"""Import and consolidate analytics data with WordPress posts."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
"""Initialize importer."""
|
||||||
|
self.config = Config
|
||||||
|
self.output_dir = self.config.OUTPUT_DIR
|
||||||
|
self.logs = []
|
||||||
|
self.unmatched_urls = []
|
||||||
|
|
||||||
|
def log(self, message):
|
||||||
|
"""Add message to log."""
|
||||||
|
self.logs.append(message)
|
||||||
|
print(message)
|
||||||
|
|
||||||
|
def normalize_url(self, url):
|
||||||
|
"""Normalize URL for matching."""
|
||||||
|
if not url:
|
||||||
|
return ""
|
||||||
|
# Remove trailing slash, protocol, www
|
||||||
|
url = url.rstrip('/')
|
||||||
|
if url.startswith('http'):
|
||||||
|
url = urlparse(url).path
|
||||||
|
url = url.replace('www.', '')
|
||||||
|
return url.lower()
|
||||||
|
|
||||||
|
def extract_post_slug_from_url(self, url):
|
||||||
|
"""Extract post slug from URL path."""
|
||||||
|
path = urlparse(url).path.rstrip('/')
|
||||||
|
parts = [p for p in path.split('/') if p]
|
||||||
|
if parts:
|
||||||
|
return parts[-1] # Last part is usually the slug
|
||||||
|
return None
|
||||||
|
|
||||||
|
def load_ga4_data(self, ga4_csv):
|
||||||
|
"""Load Google Analytics 4 data."""
|
||||||
|
ga_data = {}
|
||||||
|
if not ga4_csv.exists():
|
||||||
|
self.log(f"⚠️ GA4 file not found: {ga4_csv}")
|
||||||
|
return ga_data
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(ga4_csv, 'r', encoding='utf-8') as f:
|
||||||
|
# Skip comment lines at the top (lines starting with #)
|
||||||
|
lines = [line for line in f if not line.startswith('#')]
|
||||||
|
|
||||||
|
reader = csv.DictReader(lines)
|
||||||
|
for row in reader:
|
||||||
|
if not row:
|
||||||
|
continue
|
||||||
|
# Handle French and English column names
|
||||||
|
url = (row.get('Page path and screen class') or
|
||||||
|
row.get('Chemin de la page et classe de l\'écran') or
|
||||||
|
row.get('Page path') or
|
||||||
|
row.get('Page') or '')
|
||||||
|
if not url:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Normalize URL
|
||||||
|
normalized = self.normalize_url(url)
|
||||||
|
|
||||||
|
# Extract metrics (handle French and English column names)
|
||||||
|
try:
|
||||||
|
traffic = int(float(row.get('Screened Views', row.get('Views', row.get('Vues', '0'))) or 0))
|
||||||
|
users = int(float(row.get('Users', row.get('Utilisateurs actifs', '0')) or 0))
|
||||||
|
bounce_rate = float(row.get('Bounce rate', row.get('Taux de rebond', '0')) or 0)
|
||||||
|
avg_duration_str = (row.get('Average session duration',
|
||||||
|
row.get('Durée d\'engagement moyenne par utilisateur actif', '0')) or '0')
|
||||||
|
avg_duration = float(avg_duration_str.replace(',', '.'))
|
||||||
|
except (ValueError, TypeError):
|
||||||
|
traffic = users = 0
|
||||||
|
bounce_rate = avg_duration = 0
|
||||||
|
|
||||||
|
ga_data[normalized] = {
|
||||||
|
'traffic': traffic,
|
||||||
|
'users': users,
|
||||||
|
'bounce_rate': bounce_rate,
|
||||||
|
'avg_session_duration': avg_duration,
|
||||||
|
'ga_url': url
|
||||||
|
}
|
||||||
|
self.log(f"✓ Loaded {len(ga_data)} GA4 entries")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error reading GA4 file: {e}")
|
||||||
|
|
||||||
|
return ga_data
|
||||||
|
|
||||||
|
def load_gsc_data(self, gsc_csv):
|
||||||
|
"""Load Google Search Console data (Page-level or Query-level)."""
|
||||||
|
gsc_data = {}
|
||||||
|
if not gsc_csv.exists():
|
||||||
|
self.log(f"⚠️ GSC file not found: {gsc_csv}")
|
||||||
|
return gsc_data
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(gsc_csv, 'r', encoding='utf-8') as f:
|
||||||
|
reader = csv.DictReader(f)
|
||||||
|
for row in reader:
|
||||||
|
if not row:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Determine if this is page-level or query-level data
|
||||||
|
# Pages.csv has: "Pages les plus populaires", Queries.csv has: "Requêtes les plus fréquentes"
|
||||||
|
url = (row.get('Page') or
|
||||||
|
row.get('Pages les plus populaires') or
|
||||||
|
row.get('URL') or '')
|
||||||
|
|
||||||
|
query = row.get('Query') or row.get('Requêtes les plus fréquentes', '').strip()
|
||||||
|
|
||||||
|
# Skip rows without URLs (query-only data)
|
||||||
|
if not url:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Try to parse metrics with flexible column names
|
||||||
|
try:
|
||||||
|
# Handle different number formats (decimal separator, percentage signs)
|
||||||
|
clicks_str = row.get('Clics', row.get('Clicks', '0')) or '0'
|
||||||
|
impressions_str = row.get('Impressions', '0') or '0'
|
||||||
|
ctr_str = row.get('CTR', '0') or '0'
|
||||||
|
position_str = row.get('Position', '0') or '0'
|
||||||
|
|
||||||
|
clicks = int(float(clicks_str.replace(',', '.').rstrip('%')))
|
||||||
|
impressions = int(float(impressions_str.replace(',', '.')))
|
||||||
|
ctr = float(ctr_str.replace(',', '.').rstrip('%')) / 100
|
||||||
|
position = float(position_str.replace(',', '.'))
|
||||||
|
except (ValueError, TypeError, AttributeError):
|
||||||
|
clicks = impressions = 0
|
||||||
|
ctr = position = 0
|
||||||
|
|
||||||
|
normalized = self.normalize_url(url)
|
||||||
|
|
||||||
|
if normalized not in gsc_data:
|
||||||
|
gsc_data[normalized] = {
|
||||||
|
'impressions': 0,
|
||||||
|
'clicks': 0,
|
||||||
|
'avg_position': 0,
|
||||||
|
'ctr': 0,
|
||||||
|
'keywords': [],
|
||||||
|
'gsc_url': url
|
||||||
|
}
|
||||||
|
|
||||||
|
# Accumulate data (in case of multiple rows per URL)
|
||||||
|
gsc_data[normalized]['impressions'] += impressions
|
||||||
|
gsc_data[normalized]['clicks'] += clicks
|
||||||
|
|
||||||
|
# Store position
|
||||||
|
if position > 0:
|
||||||
|
gsc_data[normalized]['positions'] = gsc_data[normalized].get('positions', [])
|
||||||
|
gsc_data[normalized]['positions'].append(position)
|
||||||
|
|
||||||
|
if query and query not in gsc_data[normalized]['keywords']:
|
||||||
|
gsc_data[normalized]['keywords'].append(query)
|
||||||
|
|
||||||
|
# Calculate average positions and finalize
|
||||||
|
for data in gsc_data.values():
|
||||||
|
if data.get('positions'):
|
||||||
|
data['avg_position'] = sum(data['positions']) / len(data['positions'])
|
||||||
|
del data['positions']
|
||||||
|
# Recalculate CTR from totals
|
||||||
|
if data['impressions'] > 0:
|
||||||
|
data['ctr'] = data['clicks'] / data['impressions']
|
||||||
|
data['keywords_count'] = len(data.get('keywords', []))
|
||||||
|
|
||||||
|
self.log(f"✓ Loaded {len(gsc_data)} GSC entries")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error reading GSC file: {e}")
|
||||||
|
|
||||||
|
return gsc_data
|
||||||
|
|
||||||
|
def load_posts_csv(self, posts_csv):
|
||||||
|
"""Load existing WordPress posts CSV."""
|
||||||
|
posts = {}
|
||||||
|
if not posts_csv.exists():
|
||||||
|
self.log(f"⚠️ Posts file not found: {posts_csv}")
|
||||||
|
return posts
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(posts_csv, 'r', encoding='utf-8') as f:
|
||||||
|
reader = csv.DictReader(f)
|
||||||
|
for row in reader:
|
||||||
|
# Handle different column name variations
|
||||||
|
post_id = row.get('ID') or row.get('post_id')
|
||||||
|
post_url = row.get('URL') or row.get('Post URL') or row.get('post_url')
|
||||||
|
post_slug = row.get('Post Slug') or row.get('Slug') or row.get('post_slug')
|
||||||
|
post_title = row.get('Title') or row.get('post_title')
|
||||||
|
|
||||||
|
if not post_id:
|
||||||
|
continue
|
||||||
|
|
||||||
|
normalized = self.normalize_url(post_url) if post_url else ""
|
||||||
|
|
||||||
|
# Handle different SEO column names
|
||||||
|
seo_title = (row.get('SEO Title') or
|
||||||
|
row.get('proposed_seo_title') or
|
||||||
|
row.get('current_seo_title') or '')
|
||||||
|
meta_desc = (row.get('Meta Description') or
|
||||||
|
row.get('proposed_meta_description') or
|
||||||
|
row.get('current_meta_description') or '')
|
||||||
|
|
||||||
|
posts[post_id] = {
|
||||||
|
'title': post_title or '',
|
||||||
|
'url': post_url,
|
||||||
|
'slug': post_slug,
|
||||||
|
'normalized_url': normalized,
|
||||||
|
'seo_title': seo_title,
|
||||||
|
'meta_description': meta_desc,
|
||||||
|
**{k: v for k, v in row.items()
|
||||||
|
if k not in ['ID', 'post_id', 'Title', 'post_title', 'URL', 'Post URL', 'post_url',
|
||||||
|
'Post Slug', 'Slug', 'post_slug', 'SEO Title', 'proposed_seo_title',
|
||||||
|
'current_seo_title', 'Meta Description', 'proposed_meta_description',
|
||||||
|
'current_meta_description']}
|
||||||
|
}
|
||||||
|
|
||||||
|
self.log(f"✓ Loaded {len(posts)} posts from CSV")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error reading posts CSV: {e}")
|
||||||
|
|
||||||
|
return posts
|
||||||
|
|
||||||
|
def match_analytics_to_posts(self, posts, ga_data, gsc_data):
|
||||||
|
"""Match analytics data to posts with fuzzy matching."""
|
||||||
|
self.log("\n📊 Matching analytics data to posts...")
|
||||||
|
matched_count = 0
|
||||||
|
|
||||||
|
for post_id, post_info in posts.items():
|
||||||
|
slug = post_info.get('slug') or self.extract_post_slug_from_url(post_info.get('url', ''))
|
||||||
|
normalized_url = post_info.get('normalized_url', '')
|
||||||
|
|
||||||
|
# Try direct URL match first
|
||||||
|
if normalized_url in ga_data:
|
||||||
|
post_info['ga_data'] = ga_data[normalized_url]
|
||||||
|
matched_count += 1
|
||||||
|
else:
|
||||||
|
post_info['ga_data'] = {}
|
||||||
|
|
||||||
|
if normalized_url in gsc_data:
|
||||||
|
post_info['gsc_data'] = gsc_data[normalized_url]
|
||||||
|
matched_count += 1
|
||||||
|
else:
|
||||||
|
post_info['gsc_data'] = {}
|
||||||
|
|
||||||
|
# Try slug-based matching if URL didn't match
|
||||||
|
if not post_info.get('gsc_data') and slug:
|
||||||
|
for gsc_url, gsc_info in gsc_data.items():
|
||||||
|
if slug in gsc_url:
|
||||||
|
post_info['gsc_data'] = gsc_info
|
||||||
|
break
|
||||||
|
|
||||||
|
# Track unmatched GSC URLs
|
||||||
|
matched_gsc_urls = set()
|
||||||
|
for post in posts.values():
|
||||||
|
if post.get('gsc_data'):
|
||||||
|
matched_gsc_urls.add(id(post['gsc_data']))
|
||||||
|
|
||||||
|
for normalized_url, gsc_info in gsc_data.items():
|
||||||
|
if id(gsc_info) not in matched_gsc_urls and gsc_info.get('impressions', 0) > 0:
|
||||||
|
self.unmatched_urls.append({
|
||||||
|
'url': gsc_info.get('gsc_url', normalized_url),
|
||||||
|
'impressions': gsc_info.get('impressions', 0),
|
||||||
|
'clicks': gsc_info.get('clicks', 0),
|
||||||
|
'avg_position': gsc_info.get('avg_position', 0)
|
||||||
|
})
|
||||||
|
|
||||||
|
self.log(f"✓ Matched data to posts")
|
||||||
|
return posts
|
||||||
|
|
||||||
|
def enrich_posts_data(self, posts):
|
||||||
|
"""Enrich posts with calculated metrics."""
|
||||||
|
for post_info in posts.values():
|
||||||
|
ga = post_info.get('ga_data', {})
|
||||||
|
gsc = post_info.get('gsc_data', {})
|
||||||
|
|
||||||
|
# GA metrics
|
||||||
|
post_info['traffic'] = ga.get('traffic', 0)
|
||||||
|
post_info['users'] = ga.get('users', 0)
|
||||||
|
post_info['bounce_rate'] = ga.get('bounce_rate', 0)
|
||||||
|
post_info['avg_session_duration'] = ga.get('avg_session_duration', 0)
|
||||||
|
|
||||||
|
# GSC metrics
|
||||||
|
post_info['impressions'] = gsc.get('impressions', 0)
|
||||||
|
post_info['clicks'] = gsc.get('clicks', 0)
|
||||||
|
post_info['avg_position'] = gsc.get('avg_position', 0)
|
||||||
|
post_info['ctr'] = gsc.get('ctr', 0)
|
||||||
|
post_info['keywords_count'] = gsc.get('keywords_count', 0)
|
||||||
|
post_info['top_keywords'] = ','.join(gsc.get('keywords', [])[:5])
|
||||||
|
|
||||||
|
return posts
|
||||||
|
|
||||||
|
def export_enriched_csv(self, posts, output_csv):
|
||||||
|
"""Export enriched posts data to CSV."""
|
||||||
|
if not posts:
|
||||||
|
self.log("❌ No posts to export")
|
||||||
|
return
|
||||||
|
|
||||||
|
try:
|
||||||
|
fieldnames = [
|
||||||
|
'ID', 'Title', 'URL', 'SEO Title', 'Meta Description',
|
||||||
|
'traffic', 'users', 'bounce_rate', 'avg_session_duration',
|
||||||
|
'impressions', 'clicks', 'avg_position', 'ctr', 'keywords_count', 'top_keywords'
|
||||||
|
]
|
||||||
|
|
||||||
|
# Add any extra fields from original posts
|
||||||
|
all_keys = set()
|
||||||
|
for post in posts.values():
|
||||||
|
all_keys.update(post.keys())
|
||||||
|
|
||||||
|
extra_fields = [k for k in sorted(all_keys)
|
||||||
|
if k not in fieldnames and k not in ['ga_data', 'gsc_data', 'normalized_url', 'slug']]
|
||||||
|
fieldnames.extend(extra_fields)
|
||||||
|
|
||||||
|
with open(output_csv, 'w', newline='', encoding='utf-8') as f:
|
||||||
|
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
|
||||||
|
writer.writeheader()
|
||||||
|
|
||||||
|
for post_id, post_info in sorted(posts.items()):
|
||||||
|
row = {'ID': post_id}
|
||||||
|
row.update(post_info)
|
||||||
|
# Clean up nested dicts
|
||||||
|
for key in ['ga_data', 'gsc_data']:
|
||||||
|
row.pop(key, None)
|
||||||
|
writer.writerow(row)
|
||||||
|
|
||||||
|
self.log(f"✓ Exported {len(posts)} posts to {output_csv}")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error exporting CSV: {e}")
|
||||||
|
|
||||||
|
def export_log(self, log_file):
|
||||||
|
"""Export analysis log and unmatched URLs."""
|
||||||
|
try:
|
||||||
|
with open(log_file, 'w', encoding='utf-8') as f:
|
||||||
|
f.write("SEO Analytics Import Report\n")
|
||||||
|
f.write("=" * 60 + "\n\n")
|
||||||
|
|
||||||
|
f.write("Import Log:\n")
|
||||||
|
f.write("-" * 60 + "\n")
|
||||||
|
for log_msg in self.logs:
|
||||||
|
f.write(log_msg + "\n")
|
||||||
|
|
||||||
|
f.write("\n" + "=" * 60 + "\n")
|
||||||
|
f.write(f"Unmatched URLs ({len(self.unmatched_urls)} total):\n")
|
||||||
|
f.write("-" * 60 + "\n")
|
||||||
|
|
||||||
|
if self.unmatched_urls:
|
||||||
|
# Sort by impressions descending
|
||||||
|
for url_data in sorted(self.unmatched_urls,
|
||||||
|
key=lambda x: x['impressions'],
|
||||||
|
reverse=True):
|
||||||
|
f.write(f"\nURL: {url_data['url']}\n")
|
||||||
|
f.write(f" Impressions: {url_data['impressions']}\n")
|
||||||
|
f.write(f" Clicks: {url_data['clicks']}\n")
|
||||||
|
f.write(f" Avg Position: {url_data['avg_position']:.1f}\n")
|
||||||
|
else:
|
||||||
|
f.write("✓ All URLs matched successfully!\n")
|
||||||
|
|
||||||
|
self.log(f"✓ Exported log to {log_file}")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error exporting log: {e}")
|
||||||
|
|
||||||
|
def run(self, ga_csv, gsc_csv, posts_csv, output_csv):
|
||||||
|
"""Run complete import workflow."""
|
||||||
|
self.log("Starting analytics import...")
|
||||||
|
self.log(f"GA4 CSV: {ga_csv}")
|
||||||
|
self.log(f"GSC CSV: {gsc_csv}")
|
||||||
|
self.log(f"Posts CSV: {posts_csv}\n")
|
||||||
|
|
||||||
|
# Load data
|
||||||
|
ga_data = self.load_ga4_data(ga_csv)
|
||||||
|
gsc_data = self.load_gsc_data(gsc_csv)
|
||||||
|
posts = self.load_posts_csv(posts_csv)
|
||||||
|
|
||||||
|
if not posts:
|
||||||
|
self.log("❌ No posts found. Cannot proceed.")
|
||||||
|
return
|
||||||
|
|
||||||
|
# Match and merge
|
||||||
|
posts = self.match_analytics_to_posts(posts, ga_data, gsc_data)
|
||||||
|
posts = self.enrich_posts_data(posts)
|
||||||
|
|
||||||
|
# Export
|
||||||
|
self.export_enriched_csv(posts, output_csv)
|
||||||
|
|
||||||
|
# Export log
|
||||||
|
log_dir = self.output_dir / 'logs'
|
||||||
|
log_dir.mkdir(exist_ok=True)
|
||||||
|
log_file = log_dir / 'import_log.txt'
|
||||||
|
self.export_log(log_file)
|
||||||
|
|
||||||
|
self.log("\n✓ Analytics import complete!")
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
"""CLI entry point."""
|
||||||
|
parser = argparse.ArgumentParser(description='Import and merge analytics data')
|
||||||
|
parser.add_argument('--ga-export', type=Path,
|
||||||
|
default=Path('input/analytics/ga4_export.csv'),
|
||||||
|
help='GA4 export CSV path')
|
||||||
|
parser.add_argument('--gsc-export', type=Path,
|
||||||
|
default=Path('input/analytics/gsc/Pages.csv'),
|
||||||
|
help='Search Console export CSV path (Pages data)')
|
||||||
|
parser.add_argument('--posts-csv', type=Path,
|
||||||
|
default=Path('input/new-propositions.csv'),
|
||||||
|
help='Posts CSV path')
|
||||||
|
parser.add_argument('--output', type=Path,
|
||||||
|
default=Path('output/results/posts_with_analytics.csv'),
|
||||||
|
help='Output CSV path')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
importer = AnalyticsImporter()
|
||||||
|
importer.run(args.ga_export, args.gsc_export, args.posts_csv, args.output)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
71
config.py
Normal file
@@ -0,0 +1,71 @@
"""
|
||||||
|
Configuration module for WordPress SEO automation.
|
||||||
|
Loads and validates environment variables.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
from dotenv import load_dotenv
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
# Load environment variables from .env file
|
||||||
|
load_dotenv()
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
"""Configuration class for WordPress SEO automation."""
|
||||||
|
|
||||||
|
# WordPress Settings
|
||||||
|
WORDPRESS_URL = os.getenv('WORDPRESS_URL', '').rstrip('/')
|
||||||
|
WORDPRESS_USERNAME = os.getenv('WORDPRESS_USERNAME', '')
|
||||||
|
WORDPRESS_APP_PASSWORD = os.getenv('WORDPRESS_APP_PASSWORD', '')
|
||||||
|
|
||||||
|
# OpenRouter API Settings
|
||||||
|
OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY', '')
|
||||||
|
AI_MODEL = os.getenv('AI_MODEL', 'anthropic/claude-3.5-sonnet')
|
||||||
|
|
||||||
|
# Script Settings
|
||||||
|
BATCH_SIZE = int(os.getenv('BATCH_SIZE', '100'))
|
||||||
|
API_DELAY_SECONDS = float(os.getenv('API_DELAY_SECONDS', '0.5'))
|
||||||
|
|
||||||
|
# Analysis Settings
|
||||||
|
ANALYSIS_MIN_POSITION = int(os.getenv('ANALYSIS_MIN_POSITION', '11'))
|
||||||
|
ANALYSIS_MAX_POSITION = int(os.getenv('ANALYSIS_MAX_POSITION', '30'))
|
||||||
|
ANALYSIS_MIN_IMPRESSIONS = int(os.getenv('ANALYSIS_MIN_IMPRESSIONS', '50'))
|
||||||
|
ANALYSIS_TOP_N_POSTS = int(os.getenv('ANALYSIS_TOP_N_POSTS', '20'))
|
||||||
|
|
||||||
|
# Output directory
|
||||||
|
OUTPUT_DIR = Path(__file__).parent / 'output'
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def validate(cls):
|
||||||
|
"""Validate that all required configuration is present."""
|
||||||
|
errors = []
|
||||||
|
|
||||||
|
if not cls.WORDPRESS_URL:
|
||||||
|
errors.append("WORDPRESS_URL is required")
|
||||||
|
|
||||||
|
if not cls.WORDPRESS_USERNAME:
|
||||||
|
errors.append("WORDPRESS_USERNAME is required")
|
||||||
|
|
||||||
|
if not cls.WORDPRESS_APP_PASSWORD:
|
||||||
|
errors.append("WORDPRESS_APP_PASSWORD is required")
|
||||||
|
|
||||||
|
if not cls.OPENROUTER_API_KEY:
|
||||||
|
errors.append("OPENROUTER_API_KEY is required (get one from https://openrouter.ai/)")
|
||||||
|
|
||||||
|
if errors:
|
||||||
|
raise ValueError("Configuration errors:\n" + "\n".join(f" - {e}" for e in errors))
|
||||||
|
|
||||||
|
# Create output directory if it doesn't exist
|
||||||
|
cls.OUTPUT_DIR.mkdir(exist_ok=True)
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_wordpress_auth(cls):
|
||||||
|
"""Get WordPress authentication tuple."""
|
||||||
|
return (cls.WORDPRESS_USERNAME, cls.WORDPRESS_APP_PASSWORD)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def get_api_base_url(cls):
|
||||||
|
"""Get WordPress REST API base URL."""
|
||||||
|
return f"{cls.WORDPRESS_URL}/wp-json/wp/v2"
|
||||||
348
content_gap_analyzer.py
Normal file
@@ -0,0 +1,348 @@
"""
|
||||||
|
Content gap analyzer for SEO strategy.
|
||||||
|
Identifies missing topics and content opportunities using AI analysis.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import csv
|
||||||
|
import json
|
||||||
|
import argparse
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
from collections import defaultdict
|
||||||
|
from openai import OpenAI
|
||||||
|
from config import Config
|
||||||
|
|
||||||
|
|
||||||
|
class ContentGapAnalyzer:
|
||||||
|
"""Identify content gaps and opportunities."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
"""Initialize analyzer."""
|
||||||
|
self.config = Config
|
||||||
|
self.output_dir = self.config.OUTPUT_DIR
|
||||||
|
self.logs = []
|
||||||
|
self.client = None
|
||||||
|
|
||||||
|
if self.config.OPENROUTER_API_KEY:
|
||||||
|
self.client = OpenAI(
|
||||||
|
base_url="https://openrouter.ai/api/v1",
|
||||||
|
api_key=self.config.OPENROUTER_API_KEY,
|
||||||
|
)
|
||||||
|
|
||||||
|
def log(self, message):
|
||||||
|
"""Add message to log."""
|
||||||
|
self.logs.append(message)
|
||||||
|
print(message)
|
||||||
|
|
||||||
|
def load_posts(self, posts_csv):
|
||||||
|
"""Load post titles and data."""
|
||||||
|
posts = []
|
||||||
|
if not posts_csv.exists():
|
||||||
|
self.log(f"❌ File not found: {posts_csv}")
|
||||||
|
return posts
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(posts_csv, 'r', encoding='utf-8') as f:
|
||||||
|
reader = csv.DictReader(f)
|
||||||
|
for row in reader:
|
||||||
|
posts.append({
|
||||||
|
'id': row.get('ID', ''),
|
||||||
|
'title': row.get('Title', ''),
|
||||||
|
'url': row.get('URL', ''),
|
||||||
|
'traffic': int(row.get('traffic', 0) or 0),
|
||||||
|
'impressions': int(row.get('impressions', 0) or 0),
|
||||||
|
'top_keywords': row.get('top_keywords', '')
|
||||||
|
})
|
||||||
|
|
||||||
|
self.log(f"✓ Loaded {len(posts)} posts")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error reading posts: {e}")
|
||||||
|
|
||||||
|
return posts
|
||||||
|
|
||||||
|
def load_gsc_data(self, gsc_csv):
|
||||||
|
"""Load Search Console queries for gap analysis."""
|
||||||
|
queries = []
|
||||||
|
if not gsc_csv.exists():
|
||||||
|
self.log(f"⚠️ GSC file not found: {gsc_csv}")
|
||||||
|
return queries
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(gsc_csv, 'r', encoding='utf-8') as f:
|
||||||
|
reader = csv.DictReader(f)
|
||||||
|
for row in reader:
|
||||||
|
try:
|
||||||
|
query = row.get('Query', '').strip()
|
||||||
|
if not query:
|
||||||
|
continue
|
||||||
|
|
||||||
|
impressions = int(row.get('Impressions', 0) or 0)
|
||||||
|
clicks = int(row.get('Clicks', 0) or 0)
|
||||||
|
|
||||||
|
# Only include queries with impressions but low clicks
|
||||||
|
if impressions > 0 and (clicks / impressions < 0.05):
|
||||||
|
queries.append({
|
||||||
|
'query': query,
|
||||||
|
'impressions': impressions,
|
||||||
|
'clicks': clicks,
|
||||||
|
'ctr': clicks / impressions if impressions > 0 else 0
|
||||||
|
})
|
||||||
|
except (ValueError, TypeError):
|
||||||
|
continue
|
||||||
|
|
||||||
|
self.log(f"✓ Loaded {len(queries)} underperforming queries")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"⚠️ Error reading GSC file: {e}")
|
||||||
|
|
||||||
|
return queries
|
||||||
|
|
||||||
|
def extract_topics(self, posts):
|
||||||
|
"""Extract topic clusters from post titles using AI."""
|
||||||
|
if not self.client or len(posts) == 0:
|
||||||
|
self.log("⚠️ Cannot extract topics without AI client or posts")
|
||||||
|
return {}
|
||||||
|
|
||||||
|
try:
|
||||||
|
self.log("🤖 Extracting topic clusters from post titles...")
|
||||||
|
|
||||||
|
# Batch posts into groups
|
||||||
|
titles = [p['title'] for p in posts][:100] # Limit to first 100
|
||||||
|
|
||||||
|
prompt = f"""Analyze these {len(titles)} blog post titles and identify topic clusters:
|
||||||
|
|
||||||
|
Titles:
|
||||||
|
{chr(10).join(f'{i+1}. {t}' for i, t in enumerate(titles))}
|
||||||
|
|
||||||
|
Extract for each post:
|
||||||
|
1. Primary topic category
|
||||||
|
2. Subtopics covered
|
||||||
|
3. Content type (guide, tutorial, review, comparison, etc.)
|
||||||
|
|
||||||
|
Then identify:
|
||||||
|
1. Top 10 topic clusters with post counts
|
||||||
|
2. Most common subtopics
|
||||||
|
3. Over/under-represented topics
|
||||||
|
|
||||||
|
Return JSON:
|
||||||
|
{{
|
||||||
|
"post_topics": {{
|
||||||
|
"1": {{"primary": "...", "subtopics": ["..."], "type": "..."}},
|
||||||
|
...
|
||||||
|
}},
|
||||||
|
"topic_clusters": [
|
||||||
|
{{"cluster": "...", "post_count": 0, "importance": "high/medium/low"}}
|
||||||
|
],
|
||||||
|
"coverage_gaps": ["topic 1", "topic 2", ...],
|
||||||
|
"niche": "detected niche or industry"
|
||||||
|
}}"""
|
||||||
|
|
||||||
|
response = self.client.chat.completions.create(
|
||||||
|
model=self.config.AI_MODEL,
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
temperature=0.7,
|
||||||
|
max_tokens=1500
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
result_text = response.choices[0].message.content
|
||||||
|
start_idx = result_text.find('{')
|
||||||
|
end_idx = result_text.rfind('}') + 1
|
||||||
|
if start_idx >= 0 and end_idx > start_idx:
|
||||||
|
return json.loads(result_text[start_idx:end_idx])
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
self.log("⚠️ Could not parse topic extraction response")
|
||||||
|
return {}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"⚠️ Topic extraction failed: {e}")
|
||||||
|
return {}
|
||||||
|
|
||||||
|
def identify_content_gaps(self, topic_analysis, queries):
|
||||||
|
"""Use AI to identify content gaps and suggest new topics."""
|
||||||
|
if not self.client:
|
||||||
|
return []
|
||||||
|
|
||||||
|
try:
|
||||||
|
self.log("🤖 Identifying content gaps and opportunities...")
|
||||||
|
|
||||||
|
clusters = topic_analysis.get('topic_clusters', [])
|
||||||
|
gaps = topic_analysis.get('coverage_gaps', [])
|
||||||
|
niche = topic_analysis.get('niche', 'general')
|
||||||
|
|
||||||
|
# Prepare query analysis
|
||||||
|
top_queries = sorted(queries, key=lambda x: x['impressions'], reverse=True)[:20]
|
||||||
|
queries_str = '\n'.join([f"- {q['query']} ({q['impressions']} impr, {q['ctr']:.1%} CTR)"
|
||||||
|
for q in top_queries])
|
||||||
|
|
||||||
|
prompt = f"""Based on content analysis and search demand, identify content gaps:
|
||||||
|
|
||||||
|
Existing Topics: {', '.join([c.get('cluster', '') for c in clusters[:10]])}
|
||||||
|
Coverage Gaps: {', '.join(gaps[:5])}
|
||||||
|
Niche: {niche}
|
||||||
|
|
||||||
|
Top Underperforming Queries (low CTR despite impressions):
|
||||||
|
{queries_str}
|
||||||
|
|
||||||
|
Identify high-value missing topics that could:
|
||||||
|
1. Fill coverage gaps
|
||||||
|
2. Target underperforming queries (CTR improvement)
|
||||||
|
3. Capitalize on search demand
|
||||||
|
4. Complement existing content
|
||||||
|
|
||||||
|
For each suggestion:
|
||||||
|
- Topic title
|
||||||
|
- Why it's valuable (search demand + intent)
|
||||||
|
- Search volume estimate (high/medium/low)
|
||||||
|
- How it complements existing content
|
||||||
|
- Recommended content format
|
||||||
|
- Estimated traffic potential
|
||||||
|
|
||||||
|
Prioritize by traffic opportunity. Max 20 ideas.
|
||||||
|
|
||||||
|
Return JSON:
|
||||||
|
{{
|
||||||
|
"content_opportunities": [
|
||||||
|
{{
|
||||||
|
"title": "...",
|
||||||
|
"why_valuable": "...",
|
||||||
|
"search_volume": "high/medium/low",
|
||||||
|
"complements": "existing topic",
|
||||||
|
"format": "guide/tutorial/comparison/review/list",
|
||||||
|
"traffic_potential": number,
|
||||||
|
"priority": "high/medium/low"
|
||||||
|
}}
|
||||||
|
]
|
||||||
|
}}"""
|
||||||
|
|
||||||
|
response = self.client.chat.completions.create(
|
||||||
|
model=self.config.AI_MODEL,
|
||||||
|
messages=[{"role": "user", "content": prompt}],
|
||||||
|
temperature=0.7,
|
||||||
|
max_tokens=2000
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
result_text = response.choices[0].message.content
|
||||||
|
start_idx = result_text.find('{')
|
||||||
|
end_idx = result_text.rfind('}') + 1
|
||||||
|
if start_idx >= 0 and end_idx > start_idx:
|
||||||
|
result = json.loads(result_text[start_idx:end_idx])
|
||||||
|
return result.get('content_opportunities', [])
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
self.log("⚠️ Could not parse gap analysis response")
|
||||||
|
return []
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"⚠️ Gap analysis failed: {e}")
|
||||||
|
return []
|
||||||
|
|
||||||
|
def export_gaps_csv(self, gaps, output_csv):
|
||||||
|
"""Export content gaps to CSV."""
|
||||||
|
if not gaps:
|
||||||
|
self.log("⚠️ No gaps to export")
|
||||||
|
return
|
||||||
|
|
||||||
|
try:
|
||||||
|
fieldnames = [
|
||||||
|
'priority', 'title', 'why_valuable', 'search_volume',
|
||||||
|
'complements', 'format', 'traffic_potential'
|
||||||
|
]
|
||||||
|
|
||||||
|
with open(output_csv, 'w', newline='', encoding='utf-8') as f:
|
||||||
|
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
|
||||||
|
writer.writeheader()
|
||||||
|
|
||||||
|
for gap in sorted(gaps, key=lambda x: x.get('priority') == 'high', reverse=True):
|
||||||
|
writer.writerow(gap)
|
||||||
|
|
||||||
|
self.log(f"✓ Exported {len(gaps)} content gaps to {output_csv}")
|
||||||
|
except Exception as e:
|
||||||
|
self.log(f"❌ Error exporting CSV: {e}")
|
||||||
|
|
||||||
|
    def export_topic_clusters_json(self, topic_analysis, output_json):
        """Export topic analysis to JSON."""
        if not topic_analysis:
            return

        try:
            with open(output_json, 'w', encoding='utf-8') as f:
                json.dump(topic_analysis, f, indent=2)

            self.log(f"✓ Exported topic analysis to {output_json}")
        except Exception as e:
            self.log(f"❌ Error exporting JSON: {e}")

    def export_log(self, log_file):
        """Export analysis log."""
        try:
            with open(log_file, 'w', encoding='utf-8') as f:
                f.write("Content Gap Analysis Report\n")
                f.write("=" * 60 + "\n\n")

                for msg in self.logs:
                    f.write(msg + "\n")

            self.log(f"✓ Exported log to {log_file}")
        except Exception as e:
            self.log(f"❌ Error exporting log: {e}")

    def run(self, posts_csv, gsc_csv, output_csv):
        """Run complete analysis workflow."""
        self.log("📊 Starting content gap analysis...")
        self.log(f"Posts: {posts_csv}")
        self.log(f"GSC queries: {gsc_csv}\n")

        # Load data
        posts = self.load_posts(posts_csv)
        queries = self.load_gsc_data(gsc_csv)

        if not posts:
            return

        # Extract topics
        topic_analysis = self.extract_topics(posts)
        if topic_analysis:
            self.log(f"✓ Identified {len(topic_analysis.get('topic_clusters', []))} topic clusters")

        # Identify gaps
        gaps = self.identify_content_gaps(topic_analysis, queries)
        if gaps:
            self.log(f"✓ Identified {len(gaps)} content opportunities")

        # Export
        self.log("\n📁 Exporting results...")
        self.export_gaps_csv(gaps, output_csv)

        topic_json = self.output_dir / 'topic_clusters.json'
        self.export_topic_clusters_json(topic_analysis, topic_json)

        # Export log
        log_dir = self.output_dir / 'logs'
        log_dir.mkdir(exist_ok=True)
        log_file = log_dir / 'content_gap_analysis_log.txt'
        self.export_log(log_file)

        self.log("\n✓ Content gap analysis complete!")


def main():
    """CLI entry point."""
    parser = argparse.ArgumentParser(description='Analyze content gaps')
    parser.add_argument('--posts-csv', type=Path,
                        default=Path('output/results/posts_with_analytics.csv'),
                        help='Posts CSV')
    parser.add_argument('--gsc-queries', type=Path,
                        default=Path('input/analytics/gsc/Requêtes.csv'),
                        help='GSC queries CSV')
    parser.add_argument('--output', type=Path,
                        default=Path('output/results/content_gaps.csv'),
                        help='Output gaps CSV')

    args = parser.parse_args()

    analyzer = ContentGapAnalyzer()
    analyzer.run(args.posts_csv, args.gsc_queries, args.output)


if __name__ == '__main__':
    main()
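The CLI above can be exercised without touching the filesystem by passing an explicit argv to `parse_args`; a minimal sketch reusing the same flags and defaults as `main()` (the test file name is hypothetical):

```python
import argparse
from pathlib import Path

# Same parser shape as main() in the script above
parser = argparse.ArgumentParser(description='Analyze content gaps')
parser.add_argument('--posts-csv', type=Path,
                    default=Path('output/results/posts_with_analytics.csv'))
parser.add_argument('--gsc-queries', type=Path,
                    default=Path('input/analytics/gsc/Requêtes.csv'))
parser.add_argument('--output', type=Path,
                    default=Path('output/results/content_gaps.csv'))

# Parse a hypothetical invocation instead of sys.argv
args = parser.parse_args(['--output', 'output/results/gaps_test.csv'])
assert args.output == Path('output/results/gaps_test.csv')
assert args.posts_csv == Path('output/results/posts_with_analytics.csv')
```

Flags not given on the command line fall back to the `default=` paths, so the script runs with no arguments once the standard layout is in place.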
49
input/README.md
Normal file
@@ -0,0 +1,49 @@
# Input Directory

Place your source data files here before running the analysis pipeline.

## Required Files

### `new-propositions.csv`
WordPress posts export with SEO metadata.
- Columns: ID, post_id, Title, post_title, URL, post_url, SEO Title, Meta Description, etc.

### `analytics/ga4_export.csv`
Google Analytics 4 data export.
- Date range: Last 90 days
- Columns: Chemin de la page et classe de l'écran (Page path), Vues (Views), Utilisateurs actifs (Users), Durée d'engagement (Duration), etc.

### `analytics/gsc/Pages.csv`
Google Search Console Pages report.
- Date range: Last 90 days
- Columns: Pages les plus populaires (Page), Clics (Clicks), Impressions, CTR, Position

## Directory Structure

```
input/
├── new-propositions.csv   (WordPress posts)
└── analytics/
    ├── ga4_export.csv     (Google Analytics data)
    └── gsc/
        ├── Pages.csv      (GSC pages report)
        ├── Requêtes.csv   (GSC queries report - optional)
        └── [other GSC exports]
```

## How to Export Data

### Google Analytics 4
1. Go to Analytics > Reports > Engagement > Pages and Screens
2. Set date range to Last 90 days
3. Click Export > Download CSV
4. Save as: `input/analytics/ga4_export.csv`

### Google Search Console
1. Go to Performance
2. Set date range to Last 90 days
3. Click Export > Download CSV
4. Save as: `input/analytics/gsc/Pages.csv`

### WordPress Posts
Use your existing WordPress export or the SEO propositions CSV.
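The nested directories in the tree above have to exist before the exports are dropped in; a minimal sketch that creates them (paths taken from the structure shown):

```python
from pathlib import Path

# Create the deepest directory; parents=True also creates
# input/ and input/analytics/ along the way.
Path("input/analytics/gsc").mkdir(parents=True, exist_ok=True)
```

`exist_ok=True` makes the sketch safe to re-run on an already-populated layout.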
BIN
input/new-propositions.ods
Normal file
Binary file not shown.
347
opportunity_analyzer.py
Normal file
@@ -0,0 +1,347 @@
"""
Keyword opportunity analyzer for SEO optimization.
Identifies high-potential keywords ranking at positions 11-30.
"""

import csv
import json
import argparse
import time
from pathlib import Path
from openai import OpenAI
from config import Config


class OpportunityAnalyzer:
    """Analyze keyword opportunities for SEO optimization."""

    def __init__(self):
        """Initialize analyzer."""
        self.config = Config
        self.output_dir = self.config.OUTPUT_DIR
        self.logs = []
        self.client = None

        if self.config.OPENROUTER_API_KEY:
            self.client = OpenAI(
                base_url="https://openrouter.ai/api/v1",
                api_key=self.config.OPENROUTER_API_KEY,
            )

    def log(self, message):
        """Add message to log."""
        self.logs.append(message)
        print(message)

    def load_posts(self, posts_csv):
        """Load posts with analytics data."""
        posts = []
        if not posts_csv.exists():
            self.log(f"❌ File not found: {posts_csv}")
            return posts

        try:
            with open(posts_csv, 'r', encoding='utf-8') as f:
                reader = csv.DictReader(f)
                for row in reader:
                    try:
                        posts.append({
                            'id': row.get('ID', ''),
                            'title': row.get('Title', ''),
                            'url': row.get('URL', ''),
                            'impressions': int(row.get('impressions', 0) or 0),
                            'clicks': int(row.get('clicks', 0) or 0),
                            'avg_position': float(row.get('avg_position', 0) or 0),
                            'ctr': float(row.get('ctr', 0) or 0),
                            'traffic': int(row.get('traffic', 0) or 0),
                            'bounce_rate': float(row.get('bounce_rate', 0) or 0),
                            'keywords_count': int(row.get('keywords_count', 0) or 0),
                            'top_keywords': row.get('top_keywords', '')
                        })
                    except (ValueError, TypeError):
                        continue

            self.log(f"✓ Loaded {len(posts)} posts")
        except Exception as e:
            self.log(f"❌ Error reading posts: {e}")

        return posts

    def filter_opportunities(self, posts, min_pos, max_pos, min_impressions):
        """Filter posts with keywords in opportunity range or high traffic for optimization."""
        opportunities = []

        for post in posts:
            position = post.get('avg_position', 0)
            impressions = post.get('impressions', 0)
            traffic = post.get('traffic', 0)

            # Primary filter: position range (if data available)
            if position > 0:
                if min_pos <= position <= max_pos and impressions >= min_impressions:
                    opportunities.append(post)
            # Fallback: filter by traffic when position data unavailable
            # Include posts with any traffic for optimization analysis
            elif traffic > 0:
                opportunities.append(post)

        self.log(f"✓ Found {len(opportunities)} posts for optimization analysis")
        if opportunities:
            traffic_posts = [p for p in opportunities if p.get('traffic', 0) > 0]
            self.log(f"  ({len(traffic_posts)} have traffic data, {len(opportunities) - len(traffic_posts)} selected for analysis)")
        return opportunities

    def calculate_opportunity_score(self, post):
        """Calculate opportunity score (0-100) for a post."""
        position = post.get('avg_position', 50)
        impressions = post.get('impressions', 0)
        ctr = post.get('ctr', 0)
        traffic = post.get('traffic', 0)

        # Position score (35%): Closer to page 1 = higher
        # Position 11-30 range
        position_score = max(0, (30 - position) / 19 * 35)

        # Traffic potential (30%): Based on impressions
        # Normalize to 0-30
        traffic_potential = min(30, (impressions / 1000) * 30)

        # CTR improvement potential (20%): Gap between current and expected CTR
        # Expected CTR at position X
        expected_ctr_map = {
            11: 0.02, 12: 0.02, 13: 0.015, 14: 0.015, 15: 0.013,
            16: 0.012, 17: 0.011, 18: 0.01, 19: 0.009, 20: 0.008,
            21: 0.008, 22: 0.007, 23: 0.007, 24: 0.006, 25: 0.006,
            26: 0.006, 27: 0.005, 28: 0.005, 29: 0.005, 30: 0.004
        }
        expected_ctr = expected_ctr_map.get(int(position), 0.005)
        ctr_gap = max(0, expected_ctr - ctr)
        ctr_score = min(20, (ctr_gap / expected_ctr * 100 / 5) * 20)

        # Content quality (15%): Existing traffic and engagement
        quality_score = min(15, (traffic / 100) * 7.5 +
                            (100 - post.get('bounce_rate', 50)) / 100 * 7.5)

        return round(position_score + traffic_potential + ctr_score + quality_score, 1)

    def estimate_traffic_gain(self, post):
        """Estimate potential traffic gain from optimization."""
        position = post.get('avg_position', 50)
        impressions = post.get('impressions', 0)
        ctr = post.get('ctr', 0)

        # Estimate CTR improvement from moving one position up
        # Moving from position X to X-1 typically improves CTR by 20-30%
        current_traffic = impressions * ctr
        if position > 11:
            # Target position: 1 ahead
            improvement_factor = 1.25  # 25% improvement per position
            estimated_new_traffic = current_traffic * improvement_factor
            gain = estimated_new_traffic - current_traffic
        else:
            gain = 0

        return round(gain, 0)

    def generate_ai_recommendations(self, post):
        """Generate AI recommendations for top opportunities."""
        if not self.client:
            return None

        try:
            keywords = post.get('top_keywords', '').split(',')[:5]
            keywords_str = ', '.join([k.strip() for k in keywords if k.strip()])

            prompt = f"""Analyze keyword optimization opportunities for this blog post:

Post Title: {post['title']}
Current Position: {post['avg_position']:.1f}
Monthly Impressions: {post['impressions']}
Current CTR: {post['ctr']:.2%}
Top Keywords: {keywords_str}

Provide 2-3 specific, actionable recommendations to:
1. Improve the SEO title to increase CTR
2. Enhance the meta description
3. Target structural improvements (headers, content gaps)

Focus on moving this post from positions 11-20 to page 1 (positions 1-10).
Be specific and practical.

Return as JSON:
{{
    "title_recommendations": ["recommendation 1", "recommendation 2"],
    "description_recommendations": ["recommendation 1", "recommendation 2"],
    "content_recommendations": ["recommendation 1", "recommendation 2"],
    "estimated_effort_hours": number,
    "expected_position_improvement": number
}}"""

            response = self.client.chat.completions.create(
                model=self.config.AI_MODEL,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=500
            )

            try:
                result_text = response.choices[0].message.content
                # Extract JSON
                start_idx = result_text.find('{')
                end_idx = result_text.rfind('}') + 1
                if start_idx >= 0 and end_idx > start_idx:
                    return json.loads(result_text[start_idx:end_idx])
            except json.JSONDecodeError:
                self.log(f"⚠️ Could not parse AI response for {post['title']}")
            return None

        except Exception as e:
            self.log(f"⚠️ AI generation failed for {post['title']}: {e}")
            return None

    def export_opportunities_csv(self, opportunities, output_csv):
        """Export opportunities to CSV."""
        if not opportunities:
            self.log("⚠️ No opportunities to export")
            return

        try:
            fieldnames = [
                'ID', 'Title', 'URL', 'avg_position', 'impressions', 'clicks',
                'ctr', 'traffic', 'bounce_rate', 'keywords_count', 'top_keywords',
                'opportunity_score', 'estimated_traffic_gain',
                'title_recommendations', 'description_recommendations',
                'content_recommendations', 'estimated_effort_hours',
                'expected_position_improvement'
            ]

            with open(output_csv, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
                writer.writeheader()

                for opp in sorted(opportunities, key=lambda x: x['opportunity_score'], reverse=True):
                    row = {
                        'ID': opp['id'],
                        'Title': opp['title'],
                        'URL': opp['url'],
                        'avg_position': opp['avg_position'],
                        'impressions': opp['impressions'],
                        'clicks': opp['clicks'],
                        'ctr': f"{opp['ctr']:.2%}",
                        'traffic': opp['traffic'],
                        'bounce_rate': opp['bounce_rate'],
                        'keywords_count': opp['keywords_count'],
                        'top_keywords': opp['top_keywords'],
                        'opportunity_score': opp['opportunity_score'],
                        'estimated_traffic_gain': opp['estimated_traffic_gain'],
                        'title_recommendations': opp.get('title_recommendations_str', ''),
                        'description_recommendations': opp.get('description_recommendations_str', ''),
                        'content_recommendations': opp.get('content_recommendations_str', ''),
                        'estimated_effort_hours': opp.get('estimated_effort_hours', ''),
                        'expected_position_improvement': opp.get('expected_position_improvement', '')
                    }
                    writer.writerow(row)

            self.log(f"✓ Exported {len(opportunities)} opportunities to {output_csv}")
        except Exception as e:
            self.log(f"❌ Error exporting CSV: {e}")

    def export_log(self, log_file):
        """Export analysis log."""
        try:
            with open(log_file, 'w', encoding='utf-8') as f:
                f.write("SEO Opportunity Analysis Report\n")
                f.write("=" * 60 + "\n\n")

                for msg in self.logs:
                    f.write(msg + "\n")

            self.log(f"✓ Exported log to {log_file}")
        except Exception as e:
            self.log(f"❌ Error exporting log: {e}")

    def run(self, posts_csv, output_csv, min_position=11, max_position=30,
            min_impressions=50, top_n=20):
        """Run complete analysis workflow."""
        self.log("🔍 Starting keyword opportunity analysis...")
        self.log(f"Input: {posts_csv}")
        self.log(f"Position range: {min_position}-{max_position}")
        self.log(f"Min impressions: {min_impressions}")
        self.log(f"Top N for AI analysis: {top_n}\n")

        # Load posts
        posts = self.load_posts(posts_csv)
        if not posts:
            return

        # Filter opportunities
        opportunities = self.filter_opportunities(posts, min_position, max_position, min_impressions)
        if not opportunities:
            self.log("⚠️ No opportunities found in specified range")
            return

        # Calculate scores
        self.log("\n📊 Calculating opportunity scores...")
        for opp in opportunities:
            opp['opportunity_score'] = self.calculate_opportunity_score(opp)
            opp['estimated_traffic_gain'] = self.estimate_traffic_gain(opp)

        # Sort by score
        opportunities = sorted(opportunities, key=lambda x: x['opportunity_score'], reverse=True)

        # Get AI recommendations for top N
        self.log(f"\n🤖 Generating AI recommendations for top {min(top_n, len(opportunities))} opportunities...")
        for i, opp in enumerate(opportunities[:top_n]):
            self.log(f"  [{i+1}/{min(top_n, len(opportunities))}] {opp['title'][:50]}...")
            recommendations = self.generate_ai_recommendations(opp)

            if recommendations:
                opp['title_recommendations_str'] = '; '.join(recommendations.get('title_recommendations', []))
                opp['description_recommendations_str'] = '; '.join(recommendations.get('description_recommendations', []))
                opp['content_recommendations_str'] = '; '.join(recommendations.get('content_recommendations', []))
                opp['estimated_effort_hours'] = recommendations.get('estimated_effort_hours', '')
                opp['expected_position_improvement'] = recommendations.get('expected_position_improvement', '')

            time.sleep(0.2)  # Rate limiting

        # Export
        self.log("\n📁 Exporting results...")
        self.export_opportunities_csv(opportunities, output_csv)

        # Export log
        log_dir = self.output_dir / 'logs'
        log_dir.mkdir(exist_ok=True)
        log_file = log_dir / 'opportunity_analysis_log.txt'
        self.export_log(log_file)

        self.log(f"\n✓ Analysis complete! {len(opportunities)} opportunities identified.")
        self.log(f"  Top opportunity: {opportunities[0]['title'][:50]}... (score: {opportunities[0]['opportunity_score']})")


def main():
    """CLI entry point."""
    parser = argparse.ArgumentParser(description='Analyze keyword opportunities')
    parser.add_argument('--input', type=Path,
                        default=Path('output/results/posts_with_analytics.csv'),
                        help='Input posts CSV')
    parser.add_argument('--output', type=Path,
                        default=Path('output/results/keyword_opportunities.csv'),
                        help='Output opportunities CSV')
    parser.add_argument('--min-position', type=int, default=11,
                        help='Minimum position (start of range)')
    parser.add_argument('--max-position', type=int, default=30,
                        help='Maximum position (end of range)')
    parser.add_argument('--min-impressions', type=int, default=50,
                        help='Minimum impressions to consider')
    parser.add_argument('--top-n', type=int, default=20,
                        help='Top N for AI recommendations')

    args = parser.parse_args()

    analyzer = OpportunityAnalyzer()
    analyzer.run(args.input, args.output, args.min_position, args.max_position,
                 args.min_impressions, args.top_n)


if __name__ == '__main__':
    main()
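The weighted score in `calculate_opportunity_score` can be sanity-checked standalone; a minimal sketch of the same formula (weights: position 35%, impressions 30%, CTR gap 20%, quality 15%), with a coarse stand-in for the full expected-CTR map and a hypothetical sample post:

```python
def opportunity_score(post):
    """Sketch of calculate_opportunity_score from the script above."""
    position = post.get('avg_position', 50)
    impressions = post.get('impressions', 0)
    ctr = post.get('ctr', 0)
    traffic = post.get('traffic', 0)

    # Position (35%): linear over the 11-30 range, closer to page 1 scores higher
    position_score = max(0, (30 - position) / 19 * 35)
    # Traffic potential (30%): impressions normalized against 1,000/month
    traffic_potential = min(30, (impressions / 1000) * 30)
    # CTR gap (20%): coarse stand-in for the script's per-position CTR map
    expected_ctr = 0.02 if position <= 12 else 0.005
    ctr_gap = max(0, expected_ctr - ctr)
    ctr_score = min(20, (ctr_gap / expected_ctr * 100 / 5) * 20)
    # Quality (15%): existing traffic plus inverse bounce rate
    quality_score = min(15, (traffic / 100) * 7.5 +
                        (100 - post.get('bounce_rate', 50)) / 100 * 7.5)
    return round(position_score + traffic_potential + ctr_score + quality_score, 1)

# Hypothetical post near the top of page 2
sample = {'avg_position': 12.0, 'impressions': 800, 'ctr': 0.01,
          'traffic': 40, 'bounce_rate': 60}
score = opportunity_score(sample)
assert 0 <= score <= 100
```

The sample lands in the low 80s: position 12 is close to page 1 and 800 impressions with a below-expected CTR leaves plenty of room, which is exactly the profile the filter targets.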
436
report_generator.py
Normal file
@@ -0,0 +1,436 @@
"""
SEO optimization report generator.
Consolidates all analysis into a comprehensive markdown report and action plan.
"""

import csv
import json
import argparse
from pathlib import Path
from datetime import datetime
from config import Config


class ReportGenerator:
    """Generate comprehensive SEO optimization report."""

    def __init__(self):
        """Initialize generator."""
        self.config = Config
        self.output_dir = self.config.OUTPUT_DIR
        self.logs = []

    def log(self, message):
        """Add message to log."""
        self.logs.append(message)
        print(message)

    def load_posts_with_analytics(self, csv_path):
        """Load posts with all analytics data."""
        posts = {}
        if not csv_path.exists():
            self.log(f"❌ File not found: {csv_path}")
            return posts

        try:
            with open(csv_path, 'r', encoding='utf-8') as f:
                reader = csv.DictReader(f)
                for row in reader:
                    post_id = row.get('ID')
                    if not post_id:
                        continue

                    # Handle different title column names
                    title = (row.get('Title') or
                             row.get('title') or
                             row.get('post_title') or '')

                    posts[post_id] = {
                        'title': title,
                        'url': row.get('URL') or row.get('url') or row.get('post_url') or '',
                        'seo_title': row.get('SEO Title') or row.get('seo_title') or '',
                        'meta_description': row.get('Meta Description') or row.get('meta_description') or '',
                        'traffic': int(row.get('traffic', 0) or 0),
                        'users': int(row.get('users', 0) or 0),
                        'bounce_rate': float(row.get('bounce_rate', 0) or 0),
                        'impressions': int(row.get('impressions', 0) or 0),
                        'clicks': int(row.get('clicks', 0) or 0),
                        'avg_position': float(row.get('avg_position', 0) or 0),
                        'ctr': float(row.get('ctr', 0) or 0),
                        'keywords_count': int(row.get('keywords_count', 0) or 0),
                        'top_keywords': row.get('top_keywords', '')
                    }

            self.log(f"✓ Loaded {len(posts)} posts")
        except Exception as e:
            self.log(f"❌ Error reading posts: {e}")

        return posts

    def load_opportunities(self, csv_path):
        """Load keyword opportunities."""
        opportunities = {}
        if not csv_path.exists():
            self.log(f"⚠️ Opportunities file not found: {csv_path}")
            return opportunities

        try:
            with open(csv_path, 'r', encoding='utf-8') as f:
                reader = csv.DictReader(f)
                for row in reader:
                    post_id = row.get('ID')
                    if post_id:
                        try:
                            opportunities[post_id] = {
                                'opportunity_score': float(row.get('opportunity_score', 0) or 0),
                                'estimated_traffic_gain': int(float(row.get('estimated_traffic_gain', 0) or 0)),
                                'title_recommendations': row.get('title_recommendations', ''),
                                'description_recommendations': row.get('description_recommendations', ''),
                                'content_recommendations': row.get('content_recommendations', '')
                            }
                        except (ValueError, TypeError):
                            # Skip rows with parsing errors
                            continue

            self.log(f"✓ Loaded {len(opportunities)} opportunities")
        except Exception as e:
            self.log(f"⚠️ Error reading opportunities: {e}")

        return opportunities

    def load_content_gaps(self, csv_path):
        """Load content gap suggestions."""
        gaps = []
        if not csv_path.exists():
            self.log(f"⚠️ Content gaps file not found: {csv_path}")
            return gaps

        try:
            with open(csv_path, 'r', encoding='utf-8') as f:
                reader = csv.DictReader(f)
                for row in reader:
                    gaps.append({
                        'title': row.get('title', ''),
                        'why_valuable': row.get('why_valuable', ''),
                        'search_volume': row.get('search_volume', ''),
                        'format': row.get('format', ''),
                        'traffic_potential': int(row.get('traffic_potential', 0) or 0),
                        'priority': row.get('priority', 'medium')
                    })

            self.log(f"✓ Loaded {len(gaps)} content gap ideas")
        except Exception as e:
            self.log(f"⚠️ Error reading content gaps: {e}")

        return gaps

    def calculate_priority_score(self, post, opportunity=None):
        """Calculate comprehensive priority score (0-100)."""
        position = post.get('avg_position', 50)
        impressions = post.get('impressions', 0)
        ctr = post.get('ctr', 0)
        traffic = post.get('traffic', 0)

        # Position score (35%): Closer to page 1 = higher
        if position > 0 and position <= 30:
            position_score = max(0, (30 - position) / 29 * 35)
        else:
            position_score = 0

        # Traffic potential (30%): Based on impressions
        traffic_potential = min(30, (impressions / 1000) * 30)

        # CTR improvement (20%): Gap vs expected
        expected_ctr_map = {
            1: 0.30, 2: 0.16, 3: 0.11, 4: 0.08, 5: 0.07,
            6: 0.06, 7: 0.05, 8: 0.05, 9: 0.04, 10: 0.04,
            11: 0.02, 12: 0.02, 13: 0.015, 14: 0.015, 15: 0.013,
            16: 0.012, 17: 0.011, 18: 0.01, 19: 0.009, 20: 0.008
        }
        expected_ctr = expected_ctr_map.get(int(position), 0.005) if position > 0 else 0
        if expected_ctr > 0:
            ctr_gap = max(0, expected_ctr - ctr)
            ctr_score = min(20, (ctr_gap / expected_ctr * 100 / 5) * 20)
        else:
            ctr_score = 0

        # Content quality (15%): Existing traffic and engagement
        quality_score = min(15, (traffic / 100) * 7.5 +
                            (100 - post.get('bounce_rate', 50)) / 100 * 7.5)

        total = round(position_score + traffic_potential + ctr_score + quality_score, 1)
        return max(0, min(100, total))

    def generate_markdown_report(self, posts, opportunities, gaps, top_n=20):
        """Generate comprehensive markdown report."""
        report = []
        report.append("# SEO Optimization Strategy Report\n")
        report.append(f"*Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*\n\n")

        # Calculate metrics
        total_traffic = sum(p.get('traffic', 0) for p in posts.values())
        total_impressions = sum(p.get('impressions', 0) for p in posts.values())
        ranked = [p for p in posts.values() if p.get('avg_position', 0) > 0]
        avg_position = sum(p.get('avg_position', 50) for p in ranked) / max(1, len(ranked))

        # Executive Summary
        report.append("## Executive Summary\n")
        report.append(f"- **Total Posts Analyzed:** {len(posts)}\n")
        report.append(f"- **Current Monthly Traffic:** {total_traffic:,} visits\n")
        report.append(f"- **Total Impressions (90d):** {total_impressions:,}\n")
        report.append(f"- **Average Search Position:** {avg_position:.1f}\n")
        report.append(f"- **Optimization Opportunities:** {len(opportunities)}\n")
        report.append(f"- **Content Gap Ideas:** {len(gaps)}\n")
        report.append(f"- **Potential Traffic Gain (Phase 1):** +{sum(o.get('estimated_traffic_gain', 0) for o in opportunities.values()):,} visits/month\n\n")

        # Key Metrics
        report.append("### Quick Wins (Estimated Impact)\n\n")
        quick_wins = sorted(opportunities.values(),
                            key=lambda x: x.get('estimated_traffic_gain', 0),
                            reverse=True)[:5]
        total_quick_win_traffic = sum(w.get('estimated_traffic_gain', 0) for w in quick_wins)
        report.append(f"Top 5 opportunities could bring **+{total_quick_win_traffic:,} visits/month**\n\n")

        # Top 20 Posts to Optimize
        report.append("## Top 20 Posts to Optimize\n\n")
        report.append("Ranked by optimization potential (combination of position, traffic potential, and CTR improvement).\n\n")

        # Score all posts
        scored_posts = []
        for post_id, post in posts.items():
            opp = opportunities.get(post_id, {})
            score = self.calculate_priority_score(post, opp)
            scored_posts.append((post_id, post, opp, score))

        scored_posts = sorted(scored_posts, key=lambda x: x[3], reverse=True)

        for i, (post_id, post, opp, score) in enumerate(scored_posts[:top_n], 1):
            position = post.get('avg_position', 0)
            impressions = post.get('impressions', 0)
            traffic = post.get('traffic', 0)

            report.append(f"### {i}. {post['title']}\n\n")
            report.append(f"**Current Position:** {position:.1f} | **Impressions:** {impressions:,} | **Traffic:** {traffic} visits\n")
            report.append(f"**Priority Score:** {score:.1f}/100 | **Estimated Gain:** +{opp.get('estimated_traffic_gain', 0)} visits\n\n")

            if position > 0 and position <= 30:
                report.append(f"**Status:** Ranking on {'page 1' if position <= 10 else 'page 2-3'}\n\n")

            if opp.get('title_recommendations'):
                report.append("**Title Optimization:**\n")
                for rec in opp['title_recommendations'].split(';'):
                    rec = rec.strip()
                    if rec:
                        report.append(f"- {rec}\n")
                report.append("\n")

            if opp.get('description_recommendations'):
                report.append("**Meta Description:**\n")
                for rec in opp['description_recommendations'].split(';'):
                    rec = rec.strip()
                    if rec:
                        report.append(f"- {rec}\n")
                report.append("\n")

            if opp.get('content_recommendations'):
                report.append("**Content Improvements:**\n")
                for rec in opp['content_recommendations'].split(';'):
                    rec = rec.strip()
                    if rec:
                        report.append(f"- {rec}\n")
                report.append("\n")

            report.append("---\n\n")

        # Keyword Opportunities Summary
        report.append("## Keyword Opportunities Summary\n\n")
        opportunity_categories = {
            'page_2': [],
            'page_3': [],
            'ready_for_optimization': []
        }

        for opp_id, opp in opportunities.items():
            if any(opp_id == p[0] for p in scored_posts[:top_n]):
                score = opp.get('opportunity_score', 0)
                post = posts.get(opp_id, {})
                position = post.get('avg_position', 0)

                if 11 <= position <= 15:
                    opportunity_categories['page_2'].append((score, opp))
                elif 16 <= position <= 30:
                    opportunity_categories['page_3'].append((score, opp))

        report.append(f"**Page 2 (Positions 11-15):** {len(opportunity_categories['page_2'])} keywords ready for quick wins\n")
        report.append(f"**Page 3+ (Positions 16-30):** {len(opportunity_categories['page_3'])} keywords with medium effort\n\n")

        # Content Gap Analysis
        report.append("## Content Gap Analysis\n\n")
        report.append(f"Identified **{len(gaps)} high-value content opportunities** not currently covered:\n\n")

        for i, gap in enumerate(sorted(gaps, key=lambda x: x.get('priority') == 'high', reverse=True)[:15], 1):
            report.append(f"### {i}. {gap['title']}\n\n")
            report.append(f"**Priority:** {gap.get('priority', 'medium').upper()}\n")
            report.append(f"**Search Volume:** {gap.get('search_volume', 'medium')}\n")
            report.append(f"**Format:** {gap.get('format', 'guide')}\n")
            report.append(f"**Estimated Traffic Potential:** +{gap.get('traffic_potential', 50)} visits/month\n\n")

            if gap.get('why_valuable'):
                report.append(f"**Why valuable:** {gap['why_valuable']}\n\n")

        # 90-Day Action Plan
        report.append("## 90-Day Action Plan\n\n")
        report.append("### Week 1-2: Quick Wins (Estimated +100 visits/month)\n\n")
        report.append("Focus on posts with highest opportunity scores that are already ranking on page 2:\n\n")
        quick_wins_phase = sorted(scored_posts[:top_n], key=lambda x: x[3], reverse=True)[:5]
        for i, (post_id, post, opp, score) in enumerate(quick_wins_phase, 1):
            report.append(f"{i}. **{post['title'][:60]}**\n")
            report.append("   - Update SEO title and meta description\n")
            report.append("   - Estimated effort: 30-60 minutes\n")
            report.append(f"   - Expected gain: +{opp.get('estimated_traffic_gain', 50)} visits\n\n")

        report.append("### Week 3-4: Core Content Optimization (Estimated +150 visits/month)\n\n")
        report.append("Improve content structure and internal linking:\n\n")
        mid_phase = sorted(scored_posts[5:15], key=lambda x: x[3], reverse=True)[:5]
        for i, (post_id, post, opp, score) in enumerate(mid_phase, 1):
            report.append(f"{i}. **{post['title'][:60]}**\n")
            report.append("   - Add missing content sections\n")
            report.append("   - Improve header structure\n")
            report.append("   - Estimated effort: 2-3 hours\n\n")

        report.append("### Week 5-8: New Content Creation (Estimated +300 visits/month)\n\n")
        report.append("Create 3-5 pieces of new content targeting high-value gaps:\n\n")
        for i, gap in enumerate(sorted(gaps, key=lambda x: x.get('traffic_potential', 0), reverse=True)[:4], 1):
            report.append(f"{i}. **{gap['title']}** ({gap.get('format', 'guide').title()})\n")
            report.append("   - Estimated effort: 4-6 hours\n")
            report.append(f"   - Expected traffic: +{gap.get('traffic_potential', 50)} visits/month\n\n")

        report.append("### Week 9-12: Refinement & Analysis (Estimated +100 visits/month)\n\n")
        report.append("- Monitor ranking changes and CTR improvements\n")
        report.append("- Refine underperforming optimizations\n")
        report.append("- Re-run keyword analysis to identify new opportunities\n\n")

        report.append("**Total Estimated 90-Day Impact: +650 visits/month (+~7.8% growth)**\n\n")

        # Methodology
        report.append("## Methodology\n\n")
        report.append("### Priority Score Calculation\n\n")
        report.append("Each post is scored based on:\n")
        report.append("- **Position (35%):** Posts ranking 11-20 get highest scores (closest to page 1)\n")
|
||||||
|
report.append("- **Traffic Potential (30%):** Based on search impressions\n")
|
||||||
|
report.append("- **CTR Gap (20%):** Difference between current and expected CTR for position\n")
|
||||||
|
report.append("- **Content Quality (15%):** Existing traffic and bounce rate\n\n")
|
||||||
|
|
||||||
|
report.append("### Data Sources\n\n")
|
||||||
|
report.append("- **Google Analytics:** Traffic metrics (90-day window)\n")
|
||||||
|
report.append("- **Google Search Console:** Keyword data, impressions, clicks, positions\n")
|
||||||
|
report.append("- **WordPress REST API:** Current SEO metadata and content structure\n\n")
|
||||||
|
|
||||||
|
report.append("### Assumptions\n\n")
|
||||||
|
report.append("- Traffic estimates are based on historical CTR and position data\n")
|
||||||
|
report.append("- Moving one position up typically improves CTR by 20-30%\n")
|
||||||
|
report.append("- Page 1 rankings (positions 1-10) receive ~20-30% of total impressions\n")
|
||||||
|
report.append("- New content takes 4-8 weeks to gain significant traction\n\n")
|
||||||
|
|
||||||
|
return "\n".join(report)

    def export_report(self, report_text, output_md):
        """Export markdown report."""
        try:
            with open(output_md, 'w', encoding='utf-8') as f:
                f.write(report_text)

            self.log(f"✓ Exported report to {output_md}")
        except Exception as e:
            self.log(f"❌ Error exporting report: {e}")

    def export_prioritized_csv(self, posts, opportunities, output_csv):
        """Export all posts with priority scores."""
        try:
            scored_posts = []
            for post_id, post in posts.items():
                opp = opportunities.get(post_id, {})
                score = self.calculate_priority_score(post, opp)

                scored_posts.append({
                    'ID': post_id,
                    'Title': post.get('title', ''),
                    'URL': post.get('url', ''),
                    'Priority_Score': score,
                    'Estimated_Traffic_Gain': opp.get('estimated_traffic_gain', 0),
                    'Current_Position': post.get('avg_position', 0),
                    'Impressions': post.get('impressions', 0),
                    'Traffic': post.get('traffic', 0),
                    'CTR': f"{post.get('ctr', 0):.2%}",
                    'Keywords_Count': post.get('keywords_count', 0)
                })

            scored_posts = sorted(scored_posts, key=lambda x: x['Priority_Score'], reverse=True)

            fieldnames = ['ID', 'Title', 'URL', 'Priority_Score', 'Estimated_Traffic_Gain',
                          'Current_Position', 'Impressions', 'Traffic', 'CTR', 'Keywords_Count']

            with open(output_csv, 'w', newline='', encoding='utf-8') as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(scored_posts)

            self.log(f"✓ Exported {len(scored_posts)} prioritized posts to {output_csv}")
        except Exception as e:
            self.log(f"❌ Error exporting prioritized CSV: {e}")

    def run(self, posts_csv, opportunities_csv, gaps_csv, output_md, output_prioritized_csv, top_n=20):
        """Run complete report generation workflow."""
        self.log("📊 Generating SEO optimization report...")
        self.log("Input files: posts_with_analytics, opportunities, content_gaps\n")

        # Load data
        posts = self.load_posts_with_analytics(posts_csv)
        opportunities = self.load_opportunities(opportunities_csv)
        gaps = self.load_content_gaps(gaps_csv)

        if not posts:
            self.log("❌ No posts loaded. Cannot generate report.")
            return

        # Generate report
        self.log("\n📝 Generating markdown report...")
        report_text = self.generate_markdown_report(posts, opportunities, gaps, top_n)

        # Export report
        self.log("\n📁 Exporting files...")
        self.export_report(report_text, output_md)
        self.export_prioritized_csv(posts, opportunities, output_prioritized_csv)

        self.log("\n✓ Report generation complete!")
|
def main():
|
||||||
|
"""CLI entry point."""
|
||||||
|
parser = argparse.ArgumentParser(description='Generate SEO optimization report')
|
||||||
|
parser.add_argument('--posts-with-analytics', type=Path,
|
||||||
|
default=Path('output/results/posts_with_analytics.csv'),
|
||||||
|
help='Posts with analytics CSV')
|
||||||
|
parser.add_argument('--keyword-opportunities', type=Path,
|
||||||
|
default=Path('output/results/keyword_opportunities.csv'),
|
||||||
|
help='Keyword opportunities CSV')
|
||||||
|
parser.add_argument('--content-gaps', type=Path,
|
||||||
|
default=Path('output/results/content_gaps.csv'),
|
||||||
|
help='Content gaps CSV')
|
||||||
|
parser.add_argument('--output-report', type=Path,
|
||||||
|
default=Path('output/results/seo_optimization_report.md'),
|
||||||
|
help='Output markdown report')
|
||||||
|
parser.add_argument('--output-csv', type=Path,
|
||||||
|
default=Path('output/results/posts_prioritized.csv'),
|
||||||
|
help='Output prioritized posts CSV')
|
||||||
|
parser.add_argument('--top-n', type=int, default=20,
|
||||||
|
help='Number of top posts to detail')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
generator = ReportGenerator()
|
||||||
|
generator.run(args.posts_with_analytics, args.keyword_opportunities,
|
||||||
|
args.content_gaps, args.output_report, args.output_csv, args.top_n)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
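The Methodology section emitted above documents a weighted priority score: Position 35%, Traffic Potential 30%, CTR Gap 20%, Content Quality 15%. The actual `calculate_priority_score()` is defined earlier in `report_generator.py` and is not part of this hunk; a minimal sketch of the documented weighting, assuming each component has already been normalized to a 0-100 scale, might look like:

```python
# Hypothetical sketch of the weighted score described in the report's
# Methodology section; the repository's calculate_priority_score() may differ.

def priority_score(position_score, traffic_score, ctr_gap_score, quality_score):
    """Combine normalized 0-100 component scores using the documented weights."""
    return round(
        0.35 * position_score    # Position: ranks 11-20 score highest
        + 0.30 * traffic_score   # Traffic potential from impressions
        + 0.20 * ctr_gap_score   # Gap between actual and expected CTR
        + 0.15 * quality_score,  # Existing traffic / bounce rate
        1,
    )

print(priority_score(80, 60, 40, 50))  # → 61.5
```

Because the weights sum to 1.0, a post that maxes out every component scores exactly 100, matching the "all posts ranked 0-100" scale mentioned in `run_analysis.sh` below.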
5
requirements.txt
Normal file
@@ -0,0 +1,5 @@
requests>=2.31.0
pandas>=2.0.0
python-dotenv>=1.0.0
openai>=1.0.0
numpy>=1.24.0
73
run_analysis.sh
Executable file
@@ -0,0 +1,73 @@
#!/bin/bash
set -e

echo "╔════════════════════════════════════════════════════════════╗"
echo "║ SEO Analysis & Improvement System - Full Pipeline ║"
echo "╚════════════════════════════════════════════════════════════╝"
echo ""

# Check if venv exists
if [ ! -d "venv" ]; then
    echo "❌ Virtual environment not found. Please run: python3 -m venv venv"
    exit 1
fi

# Check if input files exist
if [ ! -f "input/new-propositions.csv" ]; then
    echo "❌ Missing input/new-propositions.csv"
    echo "Please place your WordPress posts CSV in input/ directory"
    exit 1
fi

if [ ! -f "input/analytics/ga4_export.csv" ]; then
    echo "❌ Missing input/analytics/ga4_export.csv"
    echo "Please export GA4 data and place it in input/analytics/"
    exit 1
fi

# Create output directories
mkdir -p output/results
mkdir -p output/logs

echo "📊 Step 1: Analytics Integration"
echo "   Merging GA4, Search Console, and WordPress data..."
./venv/bin/python analytics_importer.py
echo ""

echo "🔍 Step 2: Keyword Opportunity Analysis"
echo "   Identifying high-potential optimization opportunities..."
./venv/bin/python opportunity_analyzer.py \
    --input output/results/posts_with_analytics.csv \
    --output output/results/keyword_opportunities.csv \
    --min-position 11 \
    --max-position 30 \
    --min-impressions 50 \
    --top-n 20
echo ""

echo "📝 Step 3: Report Generation"
echo "   Creating comprehensive SEO optimization report..."
./venv/bin/python report_generator.py
echo ""

echo "╔════════════════════════════════════════════════════════════╗"
echo "║ ✅ Analysis Complete! ║"
echo "╚════════════════════════════════════════════════════════════╝"
echo ""
echo "📂 Results Location:"
echo "   └─ output/results/seo_optimization_report.md"
echo ""
echo "📊 Key Files:"
echo "   ├─ posts_prioritized.csv (all posts ranked 0-100)"
echo "   ├─ keyword_opportunities.csv (26 optimization opportunities)"
echo "   └─ posts_with_analytics.csv (enriched dataset)"
echo ""
echo "📋 Logs:"
echo "   └─ output/logs/"
echo ""
echo "🚀 Next Steps:"
echo "   1. Open: output/results/seo_optimization_report.md"
echo "   2. Review Top 20 Posts to Optimize"
echo "   3. Start with Quick Wins (positions 11-15)"
echo "   4. Follow 90-day action plan"
echo ""
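The "Expected gain" figures in the report this pipeline produces rest on the assumption, stated in the report's Methodology section, that moving up one position lifts CTR by roughly 20-30%. A back-of-envelope illustration of that arithmetic (a hypothetical helper, not part of the repository):

```python
# Illustrative traffic-gain estimate based on the report's stated assumption
# that a one-position improvement lifts CTR by ~20-30%; the pipeline's own
# estimator in opportunity_analyzer.py may differ.

def estimated_monthly_gain(impressions, current_ctr, ctr_lift=0.25):
    """Extra monthly clicks if CTR improves by ctr_lift (default: 25%)."""
    return round(impressions * current_ctr * ctr_lift)

# e.g. a post with 2,000 monthly impressions at a 2% CTR:
print(estimated_monthly_gain(2000, 0.02))  # → 10
```

This is why the action plan front-loads positions 11-15: impressions are already high there, so even a modest CTR lift converts directly into clicks.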