UK data job market insights v1

Serverless ETL built with Python on Azure Functions, plus SQL, Power Query, DAX & Power BI.


Project background

With this project, I wanted to both demonstrate a broad range of data and BI skills and gain practical insights into the job market to support my own career transition.


Data selection

Initially, I considered sample datasets from Kaggle or Hugging Face, but after researching publicly available options, the Reed.co.uk API stood out as a richer, more realistic source for a project centred on analytics, ETL, and reporting.


Architecture

Jobs Pipeline

┌─────────────────┐
│  Reed.co.uk API │  ← HTTP Basic Auth, 4-key rotation, rate-limit handling
└────────┬────────┘
         │ fetch (pagination & incremental filtering)
         ▼
┌─────────────────────────────────┐
│  landing.raw_jobs (JSONB raw)   │  ← Raw API responses, content hashing
│  Deduplication via composite PK │
└────────┬────────────────────────┘
         │ transform & enrich (Python ETL)
         ▼
┌──────────────────────────────────┐
│  staging.jobs_v1 (flattened)     │  ← Normalized schema, skill extraction
│  Enriched metadata & seniority   │
└────────┬───────────────────────┬─┘
         │                       │
         ▼                       ▼
┌──────────────────┐  ┌─────────────────────┐
│ staging.fact_jobs│  │ staging.job_skills  │  ← Many-to-many junction table
│ Materialized w/  │  │ 60+ canonical skills│
│ dimensional keys │  │ + categories        │
└────────┬─────────┘  └─────────────────────┘
         │
         ▼
┌─────────────────┐
│ Power BI/Fabric │  ← Interactive dashboards & analytics
│   Analytics     │
└─────────────────┘
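
The fetch step above can be sketched as a small key-rotating client. The endpoint and query parameters (resultsToTake, resultsToSkip) come from the Reed API and this project's configuration; the rotate-on-429 policy is an assumption about the pipeline's internals, not its exact code:

import time
from itertools import cycle

import requests

API_URL = "https://www.reed.co.uk/api/1.0/search"
KEYS = cycle(["key-1", "key-2", "key-3", "key-4"])  # 4-key rotation

def fetch_all(keywords: str, page_size: int = 100) -> list[dict]:
    jobs, skip, key = [], 0, next(KEYS)
    while True:
        resp = requests.get(
            API_URL,
            auth=(key, ""),  # Reed expects the API key as the Basic Auth username
            params={"keywords": keywords, "resultsToTake": page_size, "resultsToSkip": skip},
            timeout=30,
        )
        if resp.status_code == 429:  # rate-limited: rotate to the next key and retry
            key = next(KEYS)
            time.sleep(5)
            continue
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        jobs.extend(batch)
        if len(batch) < page_size:  # short page means the last page was reached
            return jobs
        skip += page_size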

Key Features & Capabilities

- Robust API Integration: HTTP Basic Auth, 4-key rotation, pagination, and rate-limit handling with incremental fetch filtering
- Smart Data Enrichment: skill extraction against 60+ canonical skills, plus seniority, location, and employment-type detection (see the sketch after this list)
- Intelligent Filtering & Quality: rule-based title include/exclude lists, an optional ML classifier, and data quality monitoring
- Salary Normalization & Analytics: salaries annualized into comparable min/max figures and grouped into dynamic salary bands
- Dimensional Data Warehouse: landing, staging, and analytics layers with a materialized fact table and 10+ dimension tables
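
A minimal sketch of the rule-based seniority detection mentioned above; the patterns and level names are illustrative assumptions, not the project's exact rule set:

import re

SENIORITY_RULES = [
    (r"\b(head of|director|principal|lead)\b", "lead"),
    (r"\b(senior|sr\.?)\b", "senior"),
    (r"\b(junior|jr\.?|graduate|entry[- ]level)\b", "junior"),
]

def detect_seniority(job_title: str) -> str:
    title = job_title.lower()
    for pattern, level in SENIORITY_RULES:  # first match wins, most senior first
        if re.search(pattern, title):
            return level
    return "mid"  # default when the title carries no seniority signal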


Technical Stack

Layer            Technology                   Purpose
Language         Python 3.x                   ETL logic, data transformations
Cloud Platform   Azure Functions              Serverless compute (timer-triggered)
Database         PostgreSQL                   Data warehouse (Azure Database for PostgreSQL)
API Client       requests                     HTTP client with authentication
Data Processing  psycopg2                     PostgreSQL driver with batch operations
ML (Optional)    scikit-learn                 TF-IDF + Naive Bayes job classifier
Visualization    Power BI / Microsoft Fabric  Interactive dashboards
Deployment       Azure                        Managed infrastructure
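
The optional classifier pairs TF-IDF features with Multinomial Naive Bayes. A minimal scikit-learn sketch of that combination (the training titles and threshold handling here are placeholders; the real model lives in job_classifier.py):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = ["Senior Data Analyst", "Power BI Developer", "Asbestos Surveyor"]
labels = [1, 1, 0]  # 1 = relevant data/BI role, 0 = excluded

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(titles, labels)

# Keep a posting only when P(relevant) clears ML_CLASSIFIER_THRESHOLD
prob = clf.predict_proba(["Junior BI Analyst"])[0][1]
keep = prob >= 0.7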

Project Structure

jobs-pipeline/
├── function_app.py              # Azure Functions entry point (timer trigger)
├── run_pipeline.py              # Local test runner for development
├── requirements.txt             # Python dependencies
├── local.settings.json          # Environment configuration (gitignored)
│
├── reed_ingest/                 # Main ETL module (2,500+ lines)
│   └── __init__.py              ├─ API client with 4-key rotation
│                                ├─ Skill extraction & categorization
│                                ├─ Title filtering (rule-based + ML)
│                                ├─ Salary annualization logic
│                                ├─ Seniority/location/employment detection
│                                ├─ Data quality monitoring
│                                └─ Database UPSERT operations
│
├── job_classifier.py            # TF-IDF + Naive Bayes ML classifier (optional)
│
├── sql/                         # Database schema definitions
│   ├── fact_jobs.sql            ├─ Materialized analytics table
│   ├── fact_job_skill.sql       ├─ Job-skill junction table
│   ├── dim_salaryband.sql       ├─ Dynamic salary bands
│   ├── dim_employer.sql         ├─ Employer dimension
│   ├── dim_location.sql         ├─ Location dimension
│   ├── dim_seniority.sql        ├─ Seniority levels
│   ├── dim_skill.sql            ├─ Canonical skills
│   └── ...                      └─ Other dimensions
│
├── docs/                        # Project documentation
│   ├── project_structure.md      ├─ Repository organization
│   ├── duplication_prevention.md ├─ Data quality strategies
│   └── recent_changes.md         └─ Change log
│
└── powerbi/                     # Power BI visualizations
    └── README.md

Data Flow

1. Extract (API Ingestion): pull paginated results from the Reed.co.uk API with key rotation, filter incrementally by posting date, and land the raw JSON in landing.raw_jobs with an MD5 content hash

2. Transform (Data Enrichment): flatten records into the staging schema, extract canonical skills, detect seniority/location/employment type, and annualize salaries

3. Load (Database Operations): UPSERT into staging.jobs_v1 and staging.job_skills with batched psycopg2 operations, then materialize staging.fact_jobs with dimensional keys

4. Analytics (Power BI): connect Power BI / Microsoft Fabric to the warehouse for interactive dashboards


Database Schema

Landing Layer

landing.raw_jobs - Raw API responses

CREATE TABLE landing.raw_jobs (
    source_name     TEXT,           -- 'reed'
    job_id          TEXT,           -- Reed job ID
    raw             JSONB,          -- Complete API JSON
    content_hash    TEXT,           -- MD5 for change detection
    posted_at       TIMESTAMPTZ,
    expires_at      TIMESTAMPTZ,
    ingested_at     TIMESTAMPTZ,
    PRIMARY KEY (source_name, job_id)
);
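
A sketch of the landing-layer write that pairs the MD5 change detection with an UPSERT on the composite key; the conflict policy shown (only rewrite the row when the hash changed) is an assumption about the pipeline's behaviour:

import hashlib
import json

def land_raw_job(conn, job: dict) -> None:
    # conn is an open psycopg2 connection; job["jobId"] is the Reed job identifier
    payload = json.dumps(job, sort_keys=True)
    content_hash = hashlib.md5(payload.encode()).hexdigest()
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO landing.raw_jobs (source_name, job_id, raw, content_hash, ingested_at)
            VALUES ('reed', %s, %s::jsonb, %s, now())
            ON CONFLICT (source_name, job_id) DO UPDATE
               SET raw = EXCLUDED.raw,
                   content_hash = EXCLUDED.content_hash,
                   ingested_at = EXCLUDED.ingested_at
             WHERE raw_jobs.content_hash <> EXCLUDED.content_hash
            """,
            (str(job["jobId"]), payload, content_hash),
        )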

Staging Layer

staging.jobs_v1 - Normalized job records

CREATE TABLE staging.jobs_v1 (
    staging_id          BIGSERIAL PRIMARY KEY,
    source_name         TEXT,
    job_id              TEXT,
    job_title           TEXT,
    employer_name       TEXT,
    location_name       TEXT,
    salary_min          NUMERIC,        -- Annualized
    salary_max          NUMERIC,        -- Annualized
    salary_type         TEXT,
    work_location_type  TEXT,           -- remote/hybrid/office
    seniority_level     TEXT,
    job_role_category   TEXT,
    contract_type       TEXT,
    full_time           BOOLEAN,
    part_time           BOOLEAN,
    job_description     TEXT,           -- Full enriched text
    posted_at           TIMESTAMPTZ,
    expires_at          TIMESTAMPTZ,
    UNIQUE (source_name, job_id)
);
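
The annualized salary_min/salary_max columns come from a normalization step. A minimal sketch, assuming conventional UK full-time multipliers (the exact factors in reed_ingest may differ):

ANNUAL_MULTIPLIERS = {
    "per annum": 1,
    "per month": 12,
    "per week": 52,
    "per day": 260,    # ~5 working days x 52 weeks
    "per hour": 2080,  # ~40 hours x 52 weeks
}

def annualize(amount: float | None, salary_type: str) -> float | None:
    if amount is None:
        return None
    return amount * ANNUAL_MULTIPLIERS.get(salary_type, 1)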

staging.job_skills - Job-to-skill mappings

CREATE TABLE staging.job_skills (
    id              BIGSERIAL PRIMARY KEY,
    source_name     TEXT,
    job_id          TEXT,
    skill           TEXT,               -- Canonical skill name
    category        TEXT,               -- programming_languages, databases, etc.
    matched_pattern TEXT,               -- Original variation matched
    UNIQUE (source_name, job_id, skill)
);
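
A sketch of how the skill extraction could populate this table: each variant pattern maps to a canonical skill and category, and the variation that matched is recorded. The dictionary below is illustrative, not the project's full 60+ skill catalogue:

import re

SKILL_PATTERNS = {
    "power bi": ("Power BI", "bi_tools"),
    "powerbi": ("Power BI", "bi_tools"),
    "postgres": ("PostgreSQL", "databases"),
    "postgresql": ("PostgreSQL", "databases"),
    "python": ("Python", "programming_languages"),
}

def extract_skills(description: str) -> list[tuple[str, str, str]]:
    text = description.lower()
    found: dict[str, tuple[str, str, str]] = {}
    for pattern, (skill, category) in SKILL_PATTERNS.items():
        if re.search(rf"\b{re.escape(pattern)}\b", text):
            # keep one row per canonical skill, matching the UNIQUE constraint
            found.setdefault(skill, (skill, category, pattern))
    return list(found.values())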

Analytics Layer

staging.fact_jobs - Materialized, indexed analytics table joining jobs to their dimensional keys

Dimension Tables (10+ tables) - dim_employer, dim_location, dim_seniority, dim_skill, dim_salaryband, and the remaining dimensions defined under sql/
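
A hedged sketch of how staging.fact_jobs might be materialized by resolving dimension keys through joins; all column names here are assumptions, since the real definitions live in sql/:

import psycopg2

REFRESH_FACT = """
INSERT INTO staging.fact_jobs (staging_id, employer_key, location_key, seniority_key, salary_min, salary_max)
SELECT j.staging_id, e.employer_key, l.location_key, s.seniority_key, j.salary_min, j.salary_max
FROM staging.jobs_v1 j
JOIN staging.dim_employer  e ON e.employer_name   = j.employer_name
JOIN staging.dim_location  l ON l.location_name   = j.location_name
JOIN staging.dim_seniority s ON s.seniority_level = j.seniority_level
ON CONFLICT (staging_id) DO NOTHING;
"""

# psycopg2 picks up the PG* environment variables listed under Configuration
with psycopg2.connect("") as conn, conn.cursor() as cur:
    cur.execute(REFRESH_FACT)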


Configuration

All settings are environment-driven via local.settings.json (note that Azure Functions requires every entry under Values to be a string):

{
  "Values": {
    "API_KEY": "your-reed-api-key",
    "API_KEY_BACKUP": "backup-key-1",
    "API_KEY_BACKUP_2": "backup-key-2",
    "API_KEY_BACKUP_3": "backup-key-3",
    "API_BASE_URL": "https://www.reed.co.uk/api/1.0/search",
    "SEARCH_KEYWORDS": "data,bi,analyst,microsoft fabric",
    "RESULTS_PER_PAGE": 100,
    "POSTED_BY_DAYS": 30,
    "MAX_RESULTS": 0,
    "JOB_TITLE_INCLUDE": "data,bi,analyst,microsoft fabric",
    "JOB_TITLE_EXCLUDE": "trainee,intern,apprentice,asbestos,...",
    "USE_ML_CLASSIFIER": "false",
    "ML_CLASSIFIER_THRESHOLD": "0.7",
    "PGHOST": "your-postgres-server.postgres.database.azure.com",
    "PGPORT": "5432",
    "PGDATABASE": "jobs_warehouse",
    "PGUSER": "admin",
    "PGPASSWORD": "****",
    "PGSSLMODE": "require"
  }
}
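
Locally, the Core Tools expose everything under Values as environment variables; in Azure, the same names come from the Function App's application settings. Settings can then be read uniformly, for example:

import os

RESULTS_PER_PAGE = int(os.environ.get("RESULTS_PER_PAGE", "100"))
POSTED_BY_DAYS = int(os.environ.get("POSTED_BY_DAYS", "30"))
USE_ML_CLASSIFIER = os.environ.get("USE_ML_CLASSIFIER", "false").lower() == "true"
SEARCH_KEYWORDS = [k.strip() for k in os.environ.get("SEARCH_KEYWORDS", "").split(",") if k.strip()]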

Running the Pipeline

Local Development

# Install dependencies
pip install -r requirements.txt

# Configure settings
cp local.settings.json.example local.settings.json
# Edit local.settings.json with your API keys and database credentials

# Run pipeline locally
python run_pipeline.py

Azure Deployment

# Deploy to Azure Functions
func azure functionapp publish <your-function-app-name>

# Scheduled execution (daily at midnight UTC)
# Configured in function_app.py: @app.timer_trigger(schedule="0 0 0 * * *")
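
For reference, the trigger wiring in function_app.py looks roughly like this; the schedule string is the one quoted above, while the function and entry-point names are assumptions:

import azure.functions as func

import reed_ingest

app = func.FunctionApp()

@app.timer_trigger(schedule="0 0 0 * * *", arg_name="timer")
def jobs_pipeline(timer: func.TimerRequest) -> None:
    # NCRONTAB fields: second minute hour day month day-of-week
    reed_ingest.run()  # hypothetical entry point into the ETL module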

Key Metrics & Performance

Current System Stats

Performance Optimizations

- Batched UPSERTs through psycopg2 cut database round trips (sketch below)
- MD5 content hashing skips reprocessing of unchanged postings
- Incremental fetch filtering (the POSTED_BY_DAYS window) limits API volume
- staging.fact_jobs is materialized and indexed for fast reporting queries
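
The batched UPSERTs lean on psycopg2's execute_values helper: one round trip per page of rows instead of one per row. A minimal sketch with an abbreviated column list:

from psycopg2.extras import execute_values

def load_jobs(conn, rows: list[tuple]) -> None:
    # rows: (source_name, job_id, job_title, salary_min, salary_max)
    with conn.cursor() as cur:
        execute_values(
            cur,
            """
            INSERT INTO staging.jobs_v1 (source_name, job_id, job_title, salary_min, salary_max)
            VALUES %s
            ON CONFLICT (source_name, job_id) DO UPDATE
               SET job_title  = EXCLUDED.job_title,
                   salary_min = EXCLUDED.salary_min,
                   salary_max = EXCLUDED.salary_max
            """,
            rows,
            page_size=500,
        )
    conn.commit()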


Skills Demonstrated

- Data Engineering: layered warehouse design (landing → staging → analytics), incremental ingestion, deduplication
- Python Development: API client design, text enrichment, batch processing, an optional scikit-learn classifier
- Database Engineering: PostgreSQL schema design, UPSERT patterns, dimensional modelling, indexing
- Cloud & DevOps: serverless Azure Functions, timer-triggered scheduling, environment-driven configuration
- Analytics & BI: Power BI / Microsoft Fabric dashboards, Power Query, DAX

Next Project

Migrating SAP BusinessObjects reporting to Power BI