Serverless ETL using Python Azure Functions, SQL, Power Query, DAX & Power BI.
With this project, I wanted to both demonstrate a broad range of data and BI skills and gain practical insights into the job market to support my own career transition.
Initially, I considered using sample datasets from Kaggle or Hugging Face, but after researching publicly available options, the Reed.co.uk API stood out as a richer, more realistic source for a project centred around analytics, ETL, and reporting.
┌─────────────────┐
│ Reed.co.uk API │ ← HTTP Basic Auth, 4-key rotation, rate-limit handling
└────────┬────────┘
│ fetch (pagination & incremental filtering)
▼
┌─────────────────────────────────┐
│ landing.raw_jobs (JSONB raw) │ ← Raw API responses, content hashing
│ Deduplication via composite PK │
└────────┬────────────────────────┘
│ transform & enrich (Python ETL)
▼
┌──────────────────────────────────┐
│ staging.jobs_v1 (flattened) │ ← Normalized schema, skill extraction
│ Enriched metadata & seniority │
└────────┬───────────────────────┬─┘
│ │
▼ ▼
┌──────────────────┐ ┌────────────────────┐
│ staging.fact_jobs│ │ staging.job_skills │ ← Many-to-many junction table
│ Materialized w/ │ │ 60+ canonical skills│
│ dimensional keys │ │ + categories │
└────────┬─────────┘ └────────────────────┘
│
▼
┌─────────────────┐
│ Power BI/Fabric │ ← Interactive dashboards & analytics
│ Analytics │
└─────────────────┘
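The ingestion stage at the top of the diagram can be sketched roughly as follows. This is a minimal illustration of the 4-key rotation and rate-limit handling, not the production client: fetch_page is a hypothetical helper, and the 429/exponential-backoff behaviour shown is an assumption.

import os
import time
import requests

# Reed's search API uses HTTP Basic Auth: the API key is the username, the password is blank.
API_KEYS = [k for k in (
    os.environ.get("API_KEY"),
    os.environ.get("API_KEY_BACKUP"),
    os.environ.get("API_KEY_BACKUP_2"),
    os.environ.get("API_KEY_BACKUP_3"),
) if k]

def fetch_page(base_url: str, params: dict, max_attempts: int = 4) -> dict:
    """Fetch one page of search results, rotating keys and backing off when rate limited."""
    for attempt in range(max_attempts):
        key = API_KEYS[attempt % len(API_KEYS)]
        resp = requests.get(base_url, params=params, auth=(key, ""), timeout=30)
        if resp.status_code == 429:          # rate limited: back off, then try the next key
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("All API keys exhausted while rate limited")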
Pipeline highlights:

- Skill normalization to canonical names (e.g. postgres → postgresql, k8s → kubernetes)
- Salary annualization: per week × 52 weeks/year, per day × 260 working days/year, per hour × 1,950 working hours/year (see the sketch after the tech stack table)
- *_old columns retained for an audit trail
- Materialized analytics table (fact_jobs) with pre-computed dimensional keys

Tech stack:

| Layer | Technology | Purpose |
|---|---|---|
| Language | Python 3.x | ETL logic, data transformations |
| Cloud Platform | Azure Functions | Serverless compute (timer-triggered) |
| Database | PostgreSQL | Data warehouse (Azure Database for PostgreSQL) |
| API Client | requests | HTTP client with authentication |
| Data Processing | psycopg2 | PostgreSQL driver with batch operations |
| ML (Optional) | scikit-learn | TF-IDF + Naive Bayes job classifier |
| Visualization | Power BI / Microsoft Fabric | Interactive dashboards |
| Deployment | Azure | Managed infrastructure |
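As a quick illustration of the annualization rule listed in the highlights, a helper along these lines converts quoted figures to annual equivalents. This is a sketch: the salary_type labels and the annualize function name are assumptions.

from typing import Optional

# Conversion factors from the highlights above.
ANNUALIZATION_FACTORS = {
    "per annum": 1,
    "per week": 52,       # 52 weeks/year
    "per day": 260,       # 260 working days/year
    "per hour": 1950,     # 1,950 working hours/year
}

def annualize(amount: Optional[float], salary_type: str) -> Optional[float]:
    """Convert a quoted salary figure to its annual equivalent."""
    if amount is None:
        return None
    return amount * ANNUALIZATION_FACTORS.get(salary_type, 1)

For example, a £450 per-day contract rate annualizes to 450 × 260 = £117,000.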
jobs-pipeline/
├── function_app.py # Azure Functions entry point (timer trigger)
├── run_pipeline.py # Local test runner for development
├── requirements.txt # Python dependencies
├── local.settings.json # Environment configuration (gitignored)
│
├── reed_ingest/ # Main ETL module (2,500+ lines)
│ └── __init__.py ├─ API client with 4-key rotation
│ ├─ Skill extraction & categorization
│ ├─ Title filtering (rule-based + ML)
│ ├─ Salary annualization logic
│ ├─ Seniority/location/employment detection
│ ├─ Data quality monitoring
│ └─ Database UPSERT operations
│
├── job_classifier.py          # TF-IDF + Naive Bayes ML classifier (optional; sketch below)
│
├── sql/ # Database schema definitions
│ ├── fact_jobs.sql ├─ Materialized analytics table
│ ├── fact_job_skill.sql ├─ Job-skill junction table
│ ├── dim_salaryband.sql ├─ Dynamic salary bands
│ ├── dim_employer.sql ├─ Employer dimension
│ ├── dim_location.sql ├─ Location dimension
│ ├── dim_seniority.sql ├─ Seniority levels
│ ├── dim_skill.sql ├─ Canonical skills
│ └── ... └─ Other dimensions
│
├── docs/ # Project documentation
│ ├── project_structure.md ├─ Repository organization
│   ├── duplication_prevention.md ├─ Data quality strategies
│ └── recent_changes.md └─ Change log
│
└── powerbi/ # Power BI visualizations
└── README.md
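The optional job_classifier.py mirrors the TF-IDF + Naive Bayes approach named in the tech stack. A minimal scikit-learn sketch with made-up training titles (the real model is trained on labelled postings; the 0.7 threshold mirrors ML_CLASSIFIER_THRESHOLD in the configuration below):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labelled examples: 1 = relevant data/BI role, 0 = not relevant.
titles = ["Data Analyst", "Power BI Developer", "Asbestos Surveyor", "Data Engineer"]
labels = [1, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(titles, labels)

# Keep a title only when the model is confident it is relevant.
relevance = clf.predict_proba(["Junior BI Analyst"])[0][1]
keep = relevance >= 0.7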
Pipeline steps:

1. Incremental fetch from the Reed API (postedByDays parameter)
2. Land raw responses in landing.raw_jobs with content hash for change detection (sketch after the table definition below)
3. Extract skills and normalize aliases (e.g. powerbi → power bi)
4. Upsert flattened records into staging.jobs_v1 via temp table pattern
5. Populate the staging.job_skills junction table
6. Materialize the analytics table (fact_jobs) with pre-computed dimensional keys

landing.raw_jobs - Raw API responses
CREATE TABLE landing.raw_jobs (
source_name TEXT, -- 'reed'
job_id TEXT, -- Reed job ID
raw JSONB, -- Complete API JSON
content_hash TEXT, -- MD5 for change detection
posted_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ,
ingested_at TIMESTAMPTZ,
PRIMARY KEY (source_name, job_id)
);
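One way to populate content_hash for change detection. The MD5 choice comes from the column comment above; hashing the canonicalized JSON payload and the helper name are assumptions.

import hashlib
import json

def content_hash(raw_job: dict) -> str:
    """MD5 of a canonical JSON serialization, so unchanged postings can be skipped on re-ingest."""
    canonical = json.dumps(raw_job, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()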
staging.jobs_v1 - Normalized job records
CREATE TABLE staging.jobs_v1 (
staging_id BIGSERIAL PRIMARY KEY,
source_name TEXT,
job_id TEXT,
job_title TEXT,
employer_name TEXT,
location_name TEXT,
salary_min NUMERIC, -- Annualized
salary_max NUMERIC, -- Annualized
salary_type TEXT,
work_location_type TEXT, -- remote/hybrid/office
seniority_level TEXT,
job_role_category TEXT,
contract_type TEXT,
full_time BOOLEAN,
part_time BOOLEAN,
job_description TEXT, -- Full enriched text
posted_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ,
UNIQUE (source_name, job_id)
);
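Columns such as work_location_type and seniority_level are derived from the posting text. A simplified rule-based sketch; the keyword lists and function names are illustrative, not the full production rules.

def detect_work_location_type(text: str) -> str:
    """Classify remote/hybrid/office from keywords in the title and description."""
    lowered = text.lower()
    if "hybrid" in lowered:
        return "hybrid"
    if "remote" in lowered or "work from home" in lowered:
        return "remote"
    return "office"

def detect_seniority(title: str) -> str:
    """Rough seniority detection from the job title."""
    lowered = title.lower()
    if any(k in lowered for k in ("head of", "director", "principal", "lead")):
        return "lead"
    if "senior" in lowered:
        return "senior"
    if any(k in lowered for k in ("junior", "graduate", "entry level")):
        return "junior"
    return "mid"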
staging.job_skills - Job-to-skill mappings
CREATE TABLE staging.job_skills (
id BIGSERIAL PRIMARY KEY,
source_name TEXT,
job_id TEXT,
skill TEXT, -- Canonical skill name
category TEXT, -- programming_languages, databases, etc.
matched_pattern TEXT, -- Original variation matched
UNIQUE (source_name, job_id, skill)
);
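Skill rows are produced by matching known variations and recording both the canonical name and the pattern that matched. A condensed sketch; the alias map and category labels here are a small illustrative subset of the 60+ canonical skills.

import re

# Illustrative subset: variation -> (canonical skill, category).
SKILL_ALIASES = {
    "power bi": ("power bi", "bi_tools"),
    "powerbi": ("power bi", "bi_tools"),
    "postgres": ("postgresql", "databases"),
    "postgresql": ("postgresql", "databases"),
    "k8s": ("kubernetes", "cloud_devops"),
}

def extract_skills(source_name, job_id, description):
    """Yield (source_name, job_id, skill, category, matched_pattern) rows for staging.job_skills."""
    seen = set()
    for pattern, (canonical, category) in SKILL_ALIASES.items():
        if canonical in seen:
            continue                                # respects UNIQUE (source_name, job_id, skill)
        if re.search(rf"\b{re.escape(pattern)}\b", description, flags=re.IGNORECASE):
            seen.add(canonical)
            yield (source_name, job_id, canonical, category, pattern)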
staging.fact_jobs - Materialized analytics table (indexed)
- Pre-computed measures: days_open, apps_per_day, is_active
- salaryband_key for fast filtering
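The salaryband_key can be derived directly from the annualized salary using the £10k-wide bands described under dim_salaryband below. A minimal sketch; the integer key encoding and function name are assumptions.

def salaryband_key(annual_salary: float, band_width: int = 10_000, cap: int = 550_000) -> int:
    """Map an annualized salary onto a £10k-wide band key (top band capped, per dim_salaryband)."""
    capped = min(max(annual_salary, 0), cap - 1)
    return int(capped // band_width)      # e.g. £37,500 -> key 3 (the £30,000-£39,999 band)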
Dimension Tables (10+ tables)

- dim_salaryband - Dynamic £10k-width bands (£0–£549,999, capped)
- dim_employer - Employer master data
- dim_location - Location + work type (remote/hybrid/office)
- dim_seniority - Seniority levels
- dim_contract - Contract types
- dim_skill - Canonical skill names
- dim_source - Data sources (Reed, etc.)
- dim_ageband - Job age bands
- dim_demandband - Application volume bands
- dim_jobtype - Role categories

All settings are environment-driven via local.settings.json:
{
"Values": {
"API_KEY": "your-reed-api-key",
"API_KEY_BACKUP": "backup-key-1",
"API_KEY_BACKUP_2": "backup-key-2",
"API_KEY_BACKUP_3": "backup-key-3",
"API_BASE_URL": "https://www.reed.co.uk/api/1.0/search",
"SEARCH_KEYWORDS": "data,bi,analyst,microsoft fabric",
"RESULTS_PER_PAGE": 100,
"POSTED_BY_DAYS": 30,
"MAX_RESULTS": 0,
"JOB_TITLE_INCLUDE": "data,bi,analyst,microsoft fabric",
"JOB_TITLE_EXCLUDE": "trainee,intern,apprentice,asbestos,...",
"USE_ML_CLASSIFIER": "false",
"ML_CLASSIFIER_THRESHOLD": "0.7",
"PGHOST": "your-postgres-server.postgres.database.azure.com",
"PGPORT": "5432",
"PGDATABASE": "jobs_warehouse",
"PGUSER": "admin",
"PGPASSWORD": "****",
"PGSSLMODE": "require"
}
}
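Azure Functions exposes the Values block as environment variables, so the pipeline reads them with os.environ. A sketch of how the comma-separated title filters might be parsed and applied; env_list and title_passes_filter are hypothetical helper names.

import os

def env_list(name: str, default: str = "") -> list:
    """Split a comma-separated setting (e.g. JOB_TITLE_INCLUDE) into lowercase terms."""
    return [t.strip().lower() for t in os.environ.get(name, default).split(",") if t.strip()]

INCLUDE_TERMS = env_list("JOB_TITLE_INCLUDE")
EXCLUDE_TERMS = env_list("JOB_TITLE_EXCLUDE")

def title_passes_filter(title: str) -> bool:
    """Rule-based gate: keep a title matching an include term and no exclude term."""
    lowered = title.lower()
    return any(t in lowered for t in INCLUDE_TERMS) and not any(t in lowered for t in EXCLUDE_TERMS)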
# Install dependencies
pip install -r requirements.txt
# Configure settings
cp local.settings.json.example local.settings.json
# Edit local.settings.json with your API keys and database credentials
# Run pipeline locally
python run_pipeline.py
# Deploy to Azure Functions
func azure functionapp publish <your-function-app-name>
# Scheduled execution (daily at midnight UTC)
# Configured in function_app.py: @app.timer_trigger(schedule="0 0 0 * * *")
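For context, the timer-triggered entry point referenced above follows the Azure Functions Python v2 programming model. A stripped-down sketch of function_app.py; the schedule string comes from the comment above, while the reed_ingest.run() entry function name is an assumption.

import logging
import azure.functions as func

import reed_ingest   # main ETL module; the entry function name below is an assumption

app = func.FunctionApp()

@app.timer_trigger(schedule="0 0 0 * * *", arg_name="timer", run_on_startup=False)
def daily_ingest(timer: func.TimerRequest) -> None:
    """Daily at midnight UTC: fetch, transform, and load Reed job postings."""
    logging.info("Jobs pipeline starting (past_due=%s)", timer.past_due)
    reed_ingest.run()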