NYC Data Job Market Pulse [In Progress]

Data pipeline to investigate the entry-level data job market in NYC

Overview

I launched this project to solve a personal and technical challenge: I wanted to gain real-time, data-driven insights into the exact entry-level data job market I am currently navigating in New York City, while simultaneously upskilling within the Modern Data Stack.

Instead of analyzing a static, historical CSV dataset, I am building an automated, live data pipeline. To keep the project from getting lost in a sea of raw data, the architecture is built around four specific investigative tracks:

  • The "Skill-to-Dollar" Correlation: Which technical tools appear in listings tagged "Entry Level" yet carry the highest salary floor? (e.g., Does dbt or Snowflake correlate with a 15% higher starting salary than Excel or Power BI in NYC?)
  • The Market Velocity & "Ghost Job" Tracker: How long do entry-level roles stay active in NYC before being removed? By keeping the pipeline stateful, I can track records over time to separate "Fresh" listings from stale ones.
  • The "Experience Inflation" Gap: Quantifying exactly how many "Entry Level" tagged jobs demand 3+ years of experience within the unstructured text of the job description.
  • The Borough & Industry Heatmap: Mapping whether the NYC market is strictly concentrated in Manhattan Finance, or if there is an active surge in tech and healthcare sectors across Brooklyn and Queens.
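
As a rough illustration of the stateful tracking behind the Market Velocity track, a minimal sketch might look like the following. Everything here is hypothetical and not part of the current codebase: the `ListingState` record, the function names, and the 30-day staleness threshold are assumptions chosen only to show the idea of recording first and last sightings per listing.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass
class ListingState:
    """Hypothetical per-listing state kept between pipeline runs."""
    first_seen: date
    last_seen: date


def record_sighting(state: Dict[str, ListingState], job_id: str, seen_on: date) -> None:
    """Widen the first/last sighting window for a listing observed on `seen_on`."""
    entry = state.get(job_id)
    if entry is None:
        state[job_id] = ListingState(first_seen=seen_on, last_seen=seen_on)
    else:
        entry.first_seen = min(entry.first_seen, seen_on)
        entry.last_seen = max(entry.last_seen, seen_on)


def days_active(entry: ListingState) -> int:
    """Days between the first and last sighting, counting both ends."""
    return (entry.last_seen - entry.first_seen).days + 1


def is_stale(entry: ListingState, today: date, threshold_days: int = 30) -> bool:
    """Flag a listing that is still live today but has been up longer than the threshold.

    The 30-day default is an arbitrary placeholder, not a measured cutoff.
    """
    still_live = entry.last_seen == today
    return still_live and days_active(entry) > threshold_days
```

Calling `record_sighting` once per listing on every scheduled run is enough to build the history; listings that stop appearing simply keep their last `last_seen` date, which gives the removal time.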

Architecture

*The architectural diagram for this pipeline is currently being finalized to reflect the latest staging transformations. In the meantime, here is how the data flows through the ecosystem:*

The NYC Data Job Market Pulse pipeline operates on a modular ELT (Extract, Load, Transform) architecture designed to minimize metered API usage while preserving data integrity.

  • Ingestion Layer: A custom Python web client acts as a two-pass crawler to extract unstructured job listings directly from targeted job boards. To conserve free-tier API tokens, the ingestion script deduplicates listings before any metered endpoint is called and before anything reaches the database.
  • Storage & Raw Layer: Data is loaded directly into a dedicated Snowflake RAW database tier, ensuring that the original, un-mutated source data is permanently preserved and tracked.
  • Transformation & AI Parsing Layer (planned): To handle the messiness of unstructured job descriptions, an LLM integration will read directly from the Snowflake RAW layer. It will parse the natural-language text to extract specific entities (required skills, tooling, years of experience, and salary bands) and structure them as clean JSON. From there, dbt (data build tool) will normalize schemas, handle missing values, and build the analytics views.
  • Analytics & Visualization (planned): The final dbt models will be surfaced in Tableau as a live dashboard that tracks skill demand, salary bands, and hiring velocity across the five boroughs.

Ingestion

The extraction and ingestion layer is fully operational and currently loading into production. Because I am leveraging free-tier tools, a major engineering hurdle was designing the pipeline to work around strict API pricing thresholds and rate limits without exhausting monthly tokens.

  • Multi-Source Extraction: A scheduled GitHub Actions cron job executes my Python data clients periodically to pull data from the JSearch API (limited to 200 monthly requests) and the TheirStack API (limited to 200 monthly credits). It simultaneously runs a custom, two-pass web scraper (crawler + page-parser) to extract unstructured text directly from Built In NYC.
  • The Pre-Ingestion Deduplication Trick: Identical job listings are often cross-posted on multiple job boards, and each duplicate would otherwise consume one of my limited metered API tokens. To avoid that, the pipeline runs a pre-pass check using TheirStack's free preview search layer. By programmatically cross-referencing the job title, location, and technology slugs, the script drops duplicate entries before hitting the metered token endpoints.
  • The Raw Warehouse Target: Structured JSON payloads and raw scrape files are automatically appended to a RAW database tier inside Snowflake over a secure Snowflake Python connection.
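
The deduplication pre-pass can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the script the pipeline actually runs: the field names `title`, `location`, and `tech_slugs` are hypothetical stand-ins for whatever the preview results contain, and the normalization rules are one plausible choice among many.

```python
import hashlib
import re
from typing import Dict, List, Set


def _normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def dedup_key(title: str, location: str, tech_slugs: List[str]) -> str:
    """Build a stable fingerprint from the three fields used for cross-referencing."""
    slugs = ",".join(sorted(slug.strip().lower() for slug in tech_slugs))
    joined = "|".join([_normalize(title), _normalize(location), slugs])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def filter_new(previews: List[Dict], seen_keys: Set[str]) -> List[Dict]:
    """Return only previews whose fingerprint is unseen, updating `seen_keys` in place.

    Only the returned listings would go on to the metered endpoint.
    """
    fresh = []
    for job in previews:
        # Field names are assumptions for illustration, not a documented schema.
        key = dedup_key(job["title"], job["location"], job.get("tech_slugs", []))
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(job)
    return fresh
```

Persisting `seen_keys` between runs (for example, by loading previously stored fingerprints before each run) would extend the check from a single batch to the whole history. Exact-match fingerprints miss near-duplicates with reworded titles, so this is a floor on deduplication rather than a guarantee.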

Next Steps

With the ingestion pipeline successfully writing to the cloud warehouse, the next phase of development focuses on the transformation and modeling layers:

  • dbt (data build tool) Modeling: I will implement dbt to handle modular SQL transformations, moving the data from the RAW schema into Staging and then building clean, analytics-ready tables in a production Data Mart.
  • Programmatic AI Agent Integration: Rather than relying on fragile, complex regular expressions (regex) in standard SQL to parse messy, unstructured job descriptions, I plan to integrate an LLM/AI agent directly into the transformation layer. An automated text-parsing client should make it possible to extract specific technical requirements and years-of-experience figures more reliably than pattern matching alone.
  • Analytics Layer: Connecting the final Snowflake analytics mart to Tableau to build a live dashboard of the NYC hiring landscape.
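
Since the LLM parsing step is still planned, the following is only a sketch of one guardrail it will likely need: validating the model's JSON output before it reaches the staging models. The `ParsedListing` fields, the function names, and the 3-year threshold are all hypothetical; no particular LLM provider or response format is assumed beyond "a string that should contain a JSON object".

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedListing:
    """Hypothetical target shape for entities extracted from a job description."""
    skills: List[str] = field(default_factory=list)
    min_years_experience: Optional[int] = None
    salary_min: Optional[int] = None
    salary_max: Optional[int] = None


def parse_llm_output(raw: str) -> Optional[ParsedListing]:
    """Return a validated ParsedListing, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    skills = data.get("skills", [])
    if not isinstance(skills, list) or not all(isinstance(s, str) for s in skills):
        return None

    numbers = {}
    for name in ("min_years_experience", "salary_min", "salary_max"):
        value = data.get(name)
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, int) or value < 0
        ):
            return None
        numbers[name] = value

    low, high = numbers["salary_min"], numbers["salary_max"]
    if low is not None and high is not None and low > high:
        return None

    return ParsedListing(skills=[s.strip().lower() for s in skills], **numbers)


def shows_experience_inflation(parsed: ParsedListing, threshold_years: int = 3) -> bool:
    """True when an entry-level listing asks for at least `threshold_years` of experience."""
    years = parsed.min_years_experience
    return years is not None and years >= threshold_years
```

Rejecting malformed output outright, rather than trying to repair it, keeps bad rows out of the mart; rejected descriptions can be retried or logged for review.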