NYC Data Job Market Pulse

Data pipeline to investigate the early-career data job market in NYC

Overview

After finishing the NHL project, I started actively job searching, and kept seeing the same thing in postings: mentions of the "Modern Data Stack" — dbt, Snowflake, and tools I hadn't touched yet. I figured my next project should put those to use directly. At the same time, I noticed there were several different roles within the data space — Data Analyst, Data Engineer, Analytics Engineer, Data Scientist — each a real option for me to pursue as a candidate, but with blurry, overlapping lines between them. I got curious about what actually separates them: their responsibilities, requirements, and salaries. Rather than guess, I figured the best way to find out was to build something that could answer the question directly — and to point it at the market I'm actually living in and applying to right now, New York City.

The project ingests job postings from three sources (two APIs and one scraped site) across those four roles. It runs on an ELT architecture: raw payloads land untouched in Snowflake VARIANT columns, get enriched by an LLM that extracts structured metadata from each posting's freeform description, and are transformed through three dbt layers (staging, intermediate, mart) before reaching a Streamlit dashboard.

The Investigation

The central question isn't "which role pays more." It's something harder:

Are these four roles actually distinct? And if so, where are the real boundaries — and are they moving?

To get at that, I needed more than job titles and salary ranges. I needed to know what the postings were actually asking for underneath the title. So the pipeline is designed around two layers of truth: what the market says a role is (the listed title), and what it actually is (extracted by an LLM from the full description). That gap — between what a job is called and what it actually requires — is where the interesting stuff lives.

The dashboard is organized around three questions:

The Landscape: What does the market look like right now? How many postings are there per role type, how are they distributed across sources, and how is volume changing over time as the pipeline accumulates data?
Under the Hood: What are these roles actually asking for? Where do the tech stacks and paradigms overlap across role types? How often does the LLM's classification of a job diverge from its listed title — and what does that tell us about how accurately the market is labeling its own roles? Which roles acknowledge AI, and which don't?
Job Explorer: Given all of this, what does a specific posting actually require? The explorer surfaces the full LLM-extracted metadata alongside the raw description so you can see both layers at once — what the job is called, and what it's actually asking for.

Early Signals

The dataset spans all four role types under investigation — Data Analyst, Data Engineer, Analytics Engineer, and Data Scientist — with 448 postings from three sources (Built In NYC, TheirStack, JSearch), spanning late May through late June 2026. These are early observations from an accumulating dataset, not final conclusions.

Three of the four roles are almost perfectly self-consistent. Analytics Engineer is the one exception.

Comparing each posting's listed title against the LLM's independent read of the job description, Data Scientist (98%), Data Engineer (97%), and Data Analyst (95%) all show near-total agreement between what a job is called and what it actually is. Analytics Engineer is the outlier at 85%: still a clear majority, but the only role where a meaningful share of postings (15%) describe work the LLM reads as a different role, most often Data Engineer.
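The comparison above reduces to a per-role agreement rate. A minimal sketch in Python, assuming each posting carries a listed_role and an llm_role field (both names are hypothetical, not taken from the pipeline):

```python
from collections import defaultdict


def agreement_by_role(postings):
    """Return {listed_role: share of postings where the LLM's role matches}.

    `listed_role` and `llm_role` are hypothetical field names.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for p in postings:
        totals[p["listed_role"]] += 1
        if p["llm_role"] == p["listed_role"]:
            matches[p["listed_role"]] += 1
    return {role: matches[role] / totals[role] for role in totals}


# Toy data for illustration only, not the project's dataset.
sample = [
    {"listed_role": "Analytics Engineer", "llm_role": "Analytics Engineer"},
    {"listed_role": "Analytics Engineer", "llm_role": "Data Engineer"},
    {"listed_role": "Data Scientist", "llm_role": "Data Scientist"},
]
rates = agreement_by_role(sample)
```

The rate is grouped by the listed title rather than the LLM's label, since the question is how well each title holds up.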

The three more technical roles pay noticeably more than Data Analyst — and the gap between them is comparatively small.

Median salaries where disclosed: Data Scientist $143k, Analytics Engineer $136k, Data Engineer $132k, Data Analyst $93k. The three technical roles sit within an $11k band of each other, while Data Analyst sits $39k to $50k below them. The real divide in this market isn't between any two of the technical roles; it's between Data Analyst and everything else.
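The band and the gap follow directly from the four medians. A short check in Python, using only the figures quoted above:

```python
# Median disclosed salaries in $k, copied from the figures above.
medians = {
    "Data Scientist": 143,
    "Analytics Engineer": 136,
    "Data Engineer": 132,
    "Data Analyst": 93,
}

technical = {role: pay for role, pay in medians.items() if role != "Data Analyst"}

# Width of the band the three technical roles fall within.
band = max(technical.values()) - min(technical.values())

# Gap between each technical role and Data Analyst.
gaps = {role: pay - medians["Data Analyst"] for role, pay in technical.items()}

print(band)  # 11
print(gaps)  # {'Data Scientist': 50, 'Analytics Engineer': 43, 'Data Engineer': 39}
```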

What a posting calls its seniority and what it actually requires often don't match — especially at "mid-level."

Comparing each posting's source-labeled seniority against the LLM's independent read of the actual requirements: postings labeled "junior" agree with the LLM's classification 77% of the time, but 21% get read as genuinely entry-level. The bigger gap is at "mid-level": only 53% of mid-level-labeled postings actually read as mid-level, while 22% read as junior and 8% read as entry-level. Nearly half of the "mid-level" postings in this dataset don't match their label, and 30% are arguably overclaiming their seniority relative to what they're actually asking for.
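Percentages like these come from a row-normalized cross-tabulation of listed seniority against the LLM's read. A minimal sketch, assuming hypothetical field names listed_seniority and llm_seniority and an assumed ordering of levels:

```python
from collections import Counter, defaultdict

# Assumed ordering of levels, lowest first (labels are illustrative).
LEVEL_ORDER = ("entry", "junior", "mid", "senior")


def llm_read_distribution(postings):
    """For each listed label, return the share of postings the LLM read at each level."""
    counts = defaultdict(Counter)
    for p in postings:
        counts[p["listed_seniority"]][p["llm_seniority"]] += 1
    return {
        listed: {level: n / sum(reads.values()) for level, n in reads.items()}
        for listed, reads in counts.items()
    }


def share_read_below(distribution, listed):
    """Share of postings with this listed label that the LLM read as a lower level."""
    rank = {level: i for i, level in enumerate(LEVEL_ORDER)}
    return sum(
        share
        for level, share in distribution[listed].items()
        if rank[level] < rank[listed]
    )


# Toy data for illustration only.
sample = [
    {"listed_seniority": "mid", "llm_seniority": "mid"},
    {"listed_seniority": "mid", "llm_seniority": "mid"},
    {"listed_seniority": "mid", "llm_seniority": "junior"},
    {"listed_seniority": "mid", "llm_seniority": "entry"},
]
dist = llm_read_distribution(sample)
below = share_read_below(dist, "mid")
```

Applied to the mid-level figures above, the share read below the label is the sum of the junior and entry-level rows (22% + 8%).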

This is also why the dashboard's early-career salary comparisons exclude JSearch, the one source with no listed seniority field: years-required data shows "junior" and "mid-level" postings overlap heavily (junior: 0–3 years; mid-level: 0–10 years), so a structured seniority label is necessary to draw a defensible line. A years-of-experience cutoff alone can't cleanly separate the two.

AI awareness is near-universal for Data Scientist, and a real gradient for everyone else.

90% of Data Scientist postings explicitly mention AI or LLMs, compared to 62% for Analytics Engineer, 46% for Data Engineer, and 28% for Data Analyst. If you're hiring or applying for a data scientist role in this market right now, the posting almost certainly mentions AI in some form. The other three roles show a real gradient rather than a uniform blind spot: a majority of Analytics Engineer postings acknowledge it too, while Data Engineer and Data Analyst lag furthest behind.

Data Scientist also stands alone on degree requirements.

25% of Data Scientist postings require a master's degree — meaningfully higher than Data Engineer (5%), Data Analyst (4%), or Analytics Engineer (0%). No other role shows a comparable credential bar.

Architecture

The NYC Data Job Market Pulse pipeline operates on a modular ELT (Extract, Load, Transform) architecture across four layers. Original source payloads are preserved unmodified at every stage: nothing is overwritten, and every transformation is independently rerunnable without re-hitting the APIs.

Ingestion Layer: A GitHub Actions cron job fires every Monday and Thursday at 9am EDT, executing a Python ingestion client that pulls from three sources simultaneously: the JSearch API (RapidAPI), the TheirStack API, and a custom Built In NYC scraper built with BeautifulSoup using a two-pass approach — search page crawl followed by individual listing parsing. A pre-ingestion deduplication check runs before any metered API call is made, dropping duplicate postings before they consume credits.
Raw Storage Layer: All three sources write structured JSON payloads into a dedicated Snowflake RAW database, with one schema per source (RAW.JSEARCH, RAW.THEIRSTACK, RAW.BUILTIN). A pipeline tracking schema (RAW.PIPELINE) records every run with timestamps, row counts, and per-source API credit consumption.
LLM Enrichment & Transformation Layer: Rather than relying on fragile regex to extract entities from unstructured job descriptions, the pipeline routes each posting through GPT-4o-mini with a structured extraction prompt. The enrichment layer outputs a standardized JSON schema per posting — role archetype, tech stack arrays, paradigms, inferred seniority, years of experience required, degree requirement, and whether the posting acknowledges AI. Salary is a clear example of where this earns its keep: many postings state a range only in prose rather than in a structured field, and parsing that out raises usable salary coverage from 35.7% to 61.6% of postings. From there, dbt handles all schema normalization and cross-source unification across three tiers: staging views, ephemeral intermediates, and a production mart table (FCT_JOB_POSTINGS).
Analytics & Visualization Layer: The mart table is surfaced into Streamlit across four pages. Data loads with a one-hour cache and all charts pull dynamically from the live mart — as new postings are ingested and transformed, the dashboard updates automatically.
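The standardized JSON schema produced by the enrichment layer can be sketched as a typed record plus a validation step. The field names and types below are assumptions based on the attributes listed above, not the pipeline's actual schema:

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Enrichment:
    # Hypothetical field names mirroring the attributes described above.
    role_archetype: str
    tech_stack: List[str]
    paradigms: List[str]
    inferred_seniority: Optional[str]
    years_experience_required: Optional[float]
    degree_requirement: Optional[str]
    acknowledges_ai: bool


def parse_enrichment(raw_json: str) -> Enrichment:
    """Parse one LLM response, raising ValueError on a malformed payload."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    names = [f.name for f in fields(Enrichment)]
    missing = [name for name in names if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(data["acknowledges_ai"], bool):
        raise ValueError("acknowledges_ai must be a boolean")
    for key in ("tech_stack", "paradigms"):
        if not isinstance(data[key], list):
            raise ValueError(f"{key} must be a list")
    # Extra keys in the response are ignored rather than rejected.
    return Enrichment(**{name: data[name] for name in names})
```

Validating every response against one fixed shape is what lets downstream models treat enrichment output from all three sources identically.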

Engineering Decisions

The interesting parts of this project aren't the tools — they're the tradeoffs. Here are the decisions that shaped the architecture and why.

ELT over ETL

Data lands in Snowflake raw before any transformation touches it. This preserves source fidelity and makes the transformation layer independently rerunnable without re-hitting the APIs. If the dbt logic changes, I can reprocess the entire history from the raw layer without a single new API call.

VARIANT Columns for Raw Payloads

Source JSON is stored as Snowflake VARIANT rather than flattened at ingest time. This decouples the ingestion layer from schema changes — if TheirStack adds a field tomorrow, no ingestion code changes. The staging layer handles extraction, so schema evolution is isolated to one place.
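A minimal sketch of what landing a payload in a VARIANT column can look like from Python. The table name JOB_POSTINGS and the columns payload and loaded_at are hypothetical; the cursor is assumed to be a DB-API style cursor that accepts %s parameters, as the Snowflake Python connector does by default:

```python
import json

# Table and column names are hypothetical; only the RAW.THEIRSTACK schema
# comes from the architecture description.
# PARSE_JSON goes in a SELECT rather than a VALUES clause.
INSERT_SQL = (
    "INSERT INTO RAW.THEIRSTACK.JOB_POSTINGS (payload, loaded_at) "
    "SELECT PARSE_JSON(%s), CURRENT_TIMESTAMP()"
)


def load_raw_payload(cursor, payload: dict) -> None:
    """Store one source payload verbatim; no fields are picked out at ingest."""
    cursor.execute(INSERT_SQL, (json.dumps(payload),))
```

Because the function never names an individual field, a new key in the source response passes straight through. Extraction would then happen in staging, for example with Snowflake path syntax such as payload:title::string.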

LLM Enrichment as a Separate Layer

Running GPT-4o-mini against the raw layer rather than inline during ingestion means enrichment can be rerun independently, the prompt can be iterated without touching the pipeline, and partial failures are isolated to individual postings rather than aborting the run. A posting that fails enrichment doesn't take down the rest of the batch.
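The failure isolation described above comes down to catching errors per posting instead of per batch. A minimal sketch, where extract is a hypothetical callable wrapping the LLM call and job_id is an assumed key:

```python
def enrich_batch(postings, extract):
    """Run `extract` on each posting, isolating failures to that posting.

    `extract` is a hypothetical callable: posting dict in, metadata dict out.
    Returns (enriched, failed), both keyed by the assumed `job_id` field.
    """
    enriched, failed = {}, {}
    for posting in postings:
        job_id = posting["job_id"]
        try:
            enriched[job_id] = extract(posting)
        except Exception as exc:  # one bad posting must not abort the batch
            failed[job_id] = repr(exc)
    return enriched, failed
```

Returning the failures alongside the successes means a later run can retry only the postings that failed.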

Pipeline-Level Observability

API credit tracking was a first-class concern, not an afterthought. Every run writes credit consumption per source to a tracking table, which feeds a burn rate forecast in the dashboard. On free-tier APIs, running out of credits mid-month silently kills the pipeline — the forecast makes that visible before it happens.
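The forecasting method isn't spelled out here, so the following is only one plausible version: a linear projection that assumes the recent average burn per run continues. All function and parameter names are illustrative:

```python
import math
from datetime import date, timedelta
from typing import Optional, Sequence


def forecast_exhaustion(
    credits_remaining: float,
    recent_run_credits: Sequence[float],
    runs_per_week: int,
    today: date,
) -> Optional[date]:
    """Project the date credits run out, or None if nothing is being burned."""
    if not recent_run_credits or runs_per_week <= 0:
        return None
    avg_per_run = sum(recent_run_credits) / len(recent_run_credits)
    weekly_burn = avg_per_run * runs_per_week
    if weekly_burn <= 0:
        return None
    # Whole days of runway at the current weekly burn rate.
    days_left = math.floor(credits_remaining * 7 / weekly_burn)
    return today + timedelta(days=days_left)
```

With two runs per week (the Monday and Thursday schedule), 100 remaining credits and recent runs costing 8 and 12 credits project to 35 days of runway.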

Cross-Source Deduplication with Source Priority

The same job posting frequently appears across all three sources. Rather than a naive deduplication on job title and company, the dbt layer applies a source priority hierarchy — Built In NYC > TheirStack > JSearch — so that when duplicates are detected, the richest version of the record is kept. This preserves salary and metadata fields that some sources carry and others don't.
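The real logic lives in dbt, but the priority rule is easy to illustrate in Python. This sketch assumes each posting already has a dedup_key identifying the underlying job (how that key is built is not specified here) and uses hypothetical source labels:

```python
# Lower number wins: Built In NYC > TheirStack > JSearch.
SOURCE_PRIORITY = {"builtin": 0, "theirstack": 1, "jsearch": 2}


def dedupe_by_priority(postings):
    """Keep one posting per dedup_key, preferring the highest-priority source."""
    best = {}
    for p in postings:
        current = best.get(p["dedup_key"])
        if (
            current is None
            or SOURCE_PRIORITY[p["source"]] < SOURCE_PRIORITY[current["source"]]
        ):
            best[p["dedup_key"]] = p
    return list(best.values())
```

In SQL the same rule is commonly written as a ROW_NUMBER() window over the key, ordered by source priority, keeping the first row.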

Two-Database Snowflake Setup

Dev and prod are isolated into separate Snowflake databases (ANALYTICS_DEV and ANALYTICS_PROD). The GitHub Actions workflow runs dbt run --target prod on every pipeline execution, while local development targets dev. The dashboard always reads from prod.

Notebooks as a Working Layer

Every stage of the pipeline — ingestion, enrichment, transformation, presentation — has its own notebooks folder alongside the production code. Before anything becomes a script that runs on a schedule, it gets worked out in a notebook first: querying real data, testing an assumption, figuring out what's actually going on before committing to a production version. That same loose, investigative space is also where I go back when something breaks. It's the habit that surfaced most of what's in the next section.

Problems I Found In My Own Data

The "Encourages Applicants" Field Was Too Generous

The enrichment schema includes a boolean flag for whether a posting explicitly encourages candidates to apply even if they don't meet every listed qualification. Early results showed roughly 40% of postings flagged true — high enough that it didn't pass a gut check. Pulling the flagged postings into a notebook and reading the actual source language showed why: the prompt was matching any inclusive or welcoming phrasing — invitations for "people excited about the opportunity" to apply, or general DEI language — rather than the specific pattern of "don't worry if you don't meet every requirement." After tightening the prompt to target that narrower pattern and validating against the same flagged set, the enrichment was re-run against the full dataset.

Saving API Credits on TheirStack

TheirStack charges a credit per job returned, rather than per page like JSearch, so credit efficiency mattered more there than anywhere else in the pipeline. TheirStack offers a free, blurred preview of each result that withholds most fields but still exposes title, location, and the technology-stack slugs — enough to fingerprint a posting without paying for it. That mattered because TheirStack's advertised deduplication wasn't reliable in practice: testing showed the same posting reappearing under different source sites, which would have meant paying multiple credits for one job. The fix runs deduplication against the free preview first, and only spends paid credits fetching the postings confirmed to be genuinely unique.
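A minimal sketch of fingerprinting on the free preview fields. The keys title, location, and technology_slugs are placeholder names, not TheirStack's actual response keys, and the hashing scheme is an assumption:

```python
import hashlib


def preview_fingerprint(preview):
    """Hash the fields the free preview exposes into a stable identifier."""
    parts = [
        preview["title"].strip().lower(),
        preview["location"].strip().lower(),
        # Sort slugs so ordering differences do not change the fingerprint.
        ",".join(sorted(slug.lower() for slug in preview["technology_slugs"])),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def select_unique_previews(previews, seen_fingerprints=()):
    """Return only previews whose fingerprint has not been seen before."""
    seen = set(seen_fingerprints)
    unique = []
    for preview in previews:
        fp = preview_fingerprint(preview)
        if fp not in seen:
            seen.add(fp)
            unique.append(preview)
    return unique
```

Only the previews returned by select_unique_previews would then be fetched with paid credits.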

Reconciling Seniority Across Inconsistent Sources

The dashboard's seniority comparisons needed to reflect what a posting actually says about itself, not just the LLM's independent read of the requirements — the whole point of this project is measuring how well the market's own labels hold up, so the listed seniority had to come from the source, not be re-inferred. But only two of three sources provide a listed seniority field at all, and the two that do use incompatible granularity: TheirStack offers only "junior" and "mid-level," while Built In NYC splits further into "entry," "junior," and "mid." Compared directly, "junior" in one source and "junior" in the other don't mean the same population of postings. The fix was early_career_tier, a field that collapses entry and junior together into a single early-career bucket for aggregate charts, while the raw listed-seniority value (including null, for the source that doesn't provide one) is preserved as-is in the Job Explorer for anyone inspecting individual postings.
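The collapsing rule behind early_career_tier can be sketched as a small mapping. The return values and the handling of labels other than entry and junior are assumptions, since only the entry/junior merge is described above:

```python
def early_career_tier(listed_seniority):
    """Collapse source-specific seniority labels into comparable tiers.

    Returns None when the source provides no listed seniority.
    """
    if listed_seniority is None:
        return None
    label = listed_seniority.strip().lower()
    if label in {"entry", "junior"}:
        return "early_career"  # entry and junior merged into one bucket
    if label in {"mid", "mid-level"}:
        return "mid"  # assumed: mid-level kept as its own tier
    return "other"  # assumed: any label outside the tiers above
```

The raw label is left untouched; the tier is an additional derived field used only for aggregate charts.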

Conclusion

The biggest shift in this project wasn't a tool or a technique — it was moving from analyzing a fixed, static dataset to building something that's alive. The NHL project worked with a closed set of drafts that would never change. This one doesn't sit still: new postings arrive twice a week, and the data itself evolves underneath the analysis. That changes what "fixing a problem" means. It's not enough to patch the code going forward — when the explicitly_encourages_applicants prompt was over-triggering, or ingestion_query turned out to be the wrong signal for role classification, the fix had to be applied twice: once so future postings come in correctly, and once to re-run against everything already sitting in the warehouse. A live pipeline doesn't let you just move on from a mistake; it makes you go back and clean up after it too.

There's a pattern underneath those fixes worth naming directly: this project's whole premise is that what a job posting is called and what it actually is often don't match, and that same gap kept showing up inside my own pipeline. ingestion_query claimed to identify a posting's role; it didn't. TheirStack claimed to deduplicate results; it didn't, reliably. A boolean flag claimed to detect encouragement toward underqualified applicants; it was matching boilerplate instead. The project's thesis turned out to be true recursively: not just about the job market, but about the tools and labels I was trusting to describe my own data. Catching that early, rather than after the fact, is most of what the notebook-first habit described above is actually for.

Beyond the findings themselves, this project gave me hands-on experience with a fuller slice of the data stack than the NHL project did: architecting ingestion against real API constraints and cost limits; designing an LLM enrichment layer for a task LLMs are genuinely well-suited to, pulling structured signal out of messy human-written text; standardizing that signal across sources that don't agree with each other; and building a pipeline that has to keep working as the world underneath it keeps changing. That combination, being comfortable across ingestion, transformation, and analysis rather than just one layer of it, is the kind of range I'm hoping to bring to a data role in this market.