Methodology

How We Rewrite Job Descriptions

Agentic Jobs does not republish employer text verbatim. Every listing passes through an enrichment pipeline that extracts structured metadata, rewrites long descriptions into concise role briefs, and scores each posting for source quality and freshness. This page explains exactly how that works.

The Problem With Raw Job Descriptions

Employer-written job descriptions are optimized for breadth and legal coverage, not for candidate decision speed. A typical posting contains four to six sections: company marketing copy, a role summary, a responsibilities list, required qualifications, preferred qualifications, and a legal boilerplate block. The candidate-relevant signal (what the role actually does, which skills matter most, what seniority level is realistic) is compressed into roughly 20 to 30% of the text. The rest is noise.

The enrichment layer exists to invert that ratio. A processed listing should answer three questions in under 30 seconds: what this role actually does day-to-day, which skills and technologies are non-negotiable, and what angle a candidate should emphasize on their resume to signal fit clearly.

Stage 1: Description Parsing and Cleaning

Raw descriptions arrive from sources in mixed formats: plain text, HTML, Markdown, or structured JSON fields from ATS APIs. The first stage normalizes these into three canonical variants that the rest of the pipeline and the UI consume:

  • description_plain: Clean UTF-8 text with whitespace normalized, HTML tags stripped, and encoding artifacts resolved. Used for NLP processing and as the fallback display format.
  • description_markdown: A structured Markdown rendering that preserves heading hierarchy, bullet structure, and emphasis from the original HTML, without the formatting noise. Used in the listing detail panel.
  • description_html: Sanitized HTML retained for listings where rich formatting significantly improves readability.

After normalization, the plain text is cleaned further: runs of three or more blank lines are collapsed, inline tab/space artifacts are removed, and the text is checked against a set of synthetic-content markers that indicate preview-only or auto-generated descriptions rather than the full employer posting.
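The cleaning pass described above can be sketched in a few lines. This is a minimal illustration, not the production code; the SYNTHETIC_MARKERS tuple below is a hypothetical stand-in for the real marker set.

```python
import re

# Hypothetical examples of synthetic-content markers; the real set is curated.
SYNTHETIC_MARKERS = ("click to see full description", "apply on company site")

def clean_plain_text(text: str) -> tuple[str, bool]:
    """Normalize whitespace and flag preview-only/auto-generated descriptions."""
    # Collapse runs of three or more blank lines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove inline tab/space artifacts.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = text.strip()
    # Check against markers that indicate a preview rather than the full posting.
    is_synthetic = any(m in text.lower() for m in SYNTHETIC_MARKERS)
    return text, is_synthetic
```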

Detail page enrichment

Aggregator listings frequently contain truncated descriptions: 2 to 4 sentences that serve as a preview rather than the full posting. When the normalized plain text falls below a configurable character threshold, the pipeline fetches the full description from the job detail page directly. This fetch runs concurrently across eligible listings using a thread pool, with a per-request timeout to prevent stale connections from blocking the batch. The fetched content replaces the truncated version before any NLP processing begins.
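A sketch of the concurrent fetch, under stated assumptions: the threshold and timeout values are illustrative, and the `fetch` callable stands in for the real HTTP request to the detail page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MIN_DESCRIPTION_CHARS = 400   # hypothetical threshold; configurable in practice
FETCH_TIMEOUT_SECONDS = 10    # per-request timeout

def enrich_truncated(listings: list[dict], fetch, max_workers: int = 8) -> None:
    """Fetch full descriptions for listings below the length threshold."""
    eligible = [l for l in listings
                if len(l.get("description_plain", "")) < MIN_DESCRIPTION_CHARS]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, l["url"]): l for l in eligible}
        for fut in as_completed(futures):
            listing = futures[fut]
            try:
                # Replace the truncated preview with the full description.
                listing["description_plain"] = fut.result(timeout=FETCH_TIMEOUT_SECONDS)
            except Exception:
                pass  # keep the truncated preview if the fetch fails
```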

Stage 2: Skill Extraction

Skills are extracted by matching the normalized description against a curated taxonomy of technical and domain skills. The taxonomy is organized by category (programming languages, cloud platforms, data tools, frameworks, soft skills, and domain areas), and matching is performed with title-weighted scoring: terms appearing in both the job title and the description receive higher confidence than description-only mentions.

Skill Category        | Example Skills Extracted                                   | Weighting
Programming Languages | Python, Go, Java, TypeScript, Rust, SQL                    | High (core technical filter signal)
Cloud Platforms       | AWS, GCP, Azure, Snowflake, Databricks                     | High (stack compatibility signal)
Data Tools            | dbt, Airflow, Spark, Kafka, Flink, Redshift                | High (domain-specific signal)
Frameworks / Runtimes | FastAPI, Django, React, Node.js, Spring                    | Medium (role-type indicator)
DevOps / Infra        | Kubernetes, Docker, Terraform, GitHub Actions              | Medium (environment signal)
Domain / Soft Skills  | ML pipelines, data contracts, observability, system design | Medium (seniority and scope signal)

The extracted skills list is stored as a comma-separated field and surfaced in the listing card as skill tags. These tags also drive skill-based filtering on the dashboard, so a search filtered to "dbt" will surface all listings where dbt was extracted as a skill, even if the term doesn't appear in the title.
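Title-weighted matching can be sketched as follows. The taxonomy slice and the confidence values (1.0 for title-plus-description, 0.6 for description-only) are illustrative assumptions, not the production configuration.

```python
import re

# Illustrative taxonomy slice; the real taxonomy is larger and curated by category.
TAXONOMY = {
    "languages": ["python", "go", "sql"],
    "data_tools": ["dbt", "airflow", "spark"],
}

def extract_skills(title: str, description: str) -> list[tuple[str, float]]:
    """Match taxonomy terms against the description, weighting title mentions."""
    title_l, desc_l = title.lower(), description.lower()
    found = []
    for terms in TAXONOMY.values():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, desc_l):
                # Terms that also appear in the title get higher confidence.
                confidence = 1.0 if re.search(pattern, title_l) else 0.6
                found.append((term, confidence))
    return sorted(found, key=lambda t: -t[1])
```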

Stage 3: Experience Level and Work Mode Inference

Experience level

Experience level (Entry, Mid, Senior, Staff/Principal) is inferred from a combination of title signals and description language. Title signals are the primary driver: "Junior," "Associate," "Senior," "Staff," "Principal," and "Lead" map directly to a level. When the title is ambiguous (e.g., plain "Software Engineer"), the description is analyzed for corroborating signals: years-of-experience requirements, team leadership language, scope of ownership described, and seniority-specific keywords ("mentoring," "defining standards," "driving cross-team alignment").
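A minimal sketch of the inference order, assuming a hypothetical keyword table and a five-year cutoff for the years-of-experience fallback (both are illustrative, not the production values):

```python
import re

# Hypothetical title keyword table; checked in priority order.
TITLE_LEVELS = [
    (("staff", "principal"), "Staff/Principal"),
    (("senior", "sr.", "lead"), "Senior"),
    (("junior", "associate", "intern"), "Entry"),
]
# Seniority-specific description keywords used as a fallback signal.
DESC_SENIOR_SIGNALS = ("mentoring", "defining standards", "driving cross-team")

def infer_experience_level(title: str, description: str) -> str:
    title_l = title.lower()
    # Title signals are the primary driver.
    for keywords, level in TITLE_LEVELS:
        if any(k in title_l for k in keywords):
            return level
    # Ambiguous title: look for corroborating description signals.
    desc_l = description.lower()
    m = re.search(r"(\d+)\+?\s*years", desc_l)
    if m and int(m.group(1)) >= 5:
        return "Senior"
    if any(s in desc_l for s in DESC_SENIOR_SIGNALS):
        return "Senior"
    return "Mid"
```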

Work mode

Work mode (Remote, Hybrid, On-Site) is parsed from both the structured location field and the description body. The location field takes precedence for structured values. When the location field contains an ambiguous value ("Multiple Locations," "United States," or just a city name without modifiers), the description is scanned for remote policy language: "fully remote," "hybrid 3 days on-site," "in-office required," "remote within [region]," and similar patterns. The inferred value is stored separately from the raw location string so it can be used as a reliable filter regardless of source formatting variation.
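The precedence logic can be sketched like this. The pattern lists are illustrative, hybrid language is checked before remote language so "hybrid, otherwise remote" phrasing resolves to Hybrid, and the On-Site default for unmatched text is an assumption of this sketch.

```python
import re

REMOTE_PATTERNS = [r"fully remote", r"remote within", r"100% remote"]
HYBRID_PATTERNS = [r"hybrid", r"\d+\s*days?\s*(on-?site|in[- ]office)"]
ONSITE_PATTERNS = [r"on-?site required", r"in-?office required"]

def infer_work_mode(location: str, description: str) -> str:
    loc = location.lower()
    # Structured location field takes precedence when unambiguous.
    if "remote" in loc:
        return "Remote"
    if "hybrid" in loc:
        return "Hybrid"
    # Ambiguous location: scan the description for remote policy language.
    desc = description.lower()
    if any(re.search(p, desc) for p in HYBRID_PATTERNS):
        return "Hybrid"
    if any(re.search(p, desc) for p in REMOTE_PATTERNS):
        return "Remote"
    if any(re.search(p, desc) for p in ONSITE_PATTERNS):
        return "On-Site"
    return "On-Site"  # conservative default when nothing matches
```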

Stage 4: Salary Extraction

Salary is extracted from a structured field when available (some ATS systems provide it directly). When absent, a pattern-based extractor scans the description for compensation language. The extractor handles common formats:

  • Annual salary ranges: $120,000 to $160,000, $120K to $160K per year
  • Hourly rates: $45 to $65/hour, $50 per hour
  • Base + equity formats: $150K base + equity
  • OTE (On-Target Earnings) for sales-adjacent roles

When multiple salary mentions appear (e.g., a base range and a total comp range), all extracted values are stored and the most informative one surfaces in the UI. Listings where no salary could be extracted are flagged with is_salary_missing: true. This flag feeds into the trust scoring calculation as a mild negative signal, since salary disclosure correlates with posting quality and active hiring intent.
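A simplified version of the pattern-based extractor handles dollar ranges with optional "K" shorthand; the production extractor covers more formats (hourly rates, OTE, base-plus-equity) than this sketch.

```python
import re

# Matches "$120,000 to $160,000", "$120K - $160K", or a single "$50" amount.
SALARY_PATTERN = re.compile(
    r"\$(\d{1,3}(?:,\d{3})*|\d+)\s*(K)?"
    r"(?:\s*(?:-|to)\s*\$?(\d{1,3}(?:,\d{3})*|\d+)\s*(K)?)?",
    re.IGNORECASE,
)

def extract_salary(text: str):
    """Return a (low, high) tuple of dollar amounts, or None if no match."""
    m = SALARY_PATTERN.search(text)
    if not m:
        return None
    def to_number(raw: str, k_flag) -> int:
        n = int(raw.replace(",", ""))
        return n * 1000 if k_flag else n  # "$120K" -> 120000
    low = to_number(m.group(1), m.group(2))
    high = to_number(m.group(3), m.group(4)) if m.group(3) else low
    return (low, high)
```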

Stage 5: Deduplication

The same job listing routinely appears across multiple sources. Without deduplication, a listing from a company's Greenhouse page also appears as a LinkedIn mirror, an Indeed mirror, a JSearch result, and possibly a Jooble result: five records for one real opening. This inflates apparent volume and degrades ranking quality.

Deduplication operates in two passes:

  1. Company grouping: Listings are grouped by normalized company name (lowercased, punctuation stripped, common suffixes like "Inc.", "LLC", "Corp" removed). This creates the candidate set for within-company comparison.
  2. Title fuzzy matching: Within each company group, listing titles are compared using a similarity algorithm (Levenshtein-based, threshold configurable, default 0.80). Titles that match above the threshold are linked by a shared dup_group_key. The group metadata (which sources contributed matching records, and whether any source is a direct ATS) is used downstream in trust scoring.

Deduplication does not remove listings; it links them. Both the ATS record and the aggregator mirror remain in the index, but the trust scorer uses group membership to reward the ATS record and discount the mirror. The UI shows the highest-trust record for a given role by default.
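The two normalization steps can be sketched as below. Note one substitution: this sketch uses the standard library's SequenceMatcher ratio as a stand-in for the Levenshtein-based similarity the pipeline actually uses, with the same 0.80 threshold.

```python
import re
from difflib import SequenceMatcher

# Common corporate suffixes stripped during company-name normalization.
COMPANY_SUFFIXES = re.compile(r"\b(inc|llc|corp|ltd)\b\.?", re.IGNORECASE)
TITLE_SIMILARITY_THRESHOLD = 0.80  # default, configurable

def normalize_company(name: str) -> str:
    """Lowercase, drop suffixes, strip punctuation for within-company grouping."""
    name = COMPANY_SUFFIXES.sub("", name.lower())
    return re.sub(r"[^\w\s]", "", name).strip()

def titles_match(a: str, b: str) -> bool:
    """SequenceMatcher stands in here for the Levenshtein-based comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= TITLE_SIMILARITY_THRESHOLD
```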

Stage 6: Trust Scoring

Every listing receives a trust score between 0.0 and 1.0, displayed as a labeled tier (High / Medium / Low) on each card. The score is a weighted combination of observable quality signals:

Signal                                           | Direction                      | Rationale
Source type (direct ATS vs. aggregator)          | ↑ ATS, ↓ aggregator-only       | ATS records are published by the employer directly; aggregators introduce lag and metadata errors
Posting freshness                                | ↑ recent, ↓ stale              | Stale listings are more likely to represent frozen requisitions or pipeline collection
Salary present                                   | ↑ present                      | Salary disclosure correlates with recruiter engagement and posting completeness
Description completeness                         | ↑ longer/richer, ↓ below threshold | Short descriptions often indicate preview-only or auto-syndicated records
Deduplication group has ATS member               | ↑ for all records in group     | A matching ATS record confirms the listing is real and employer-published
Metadata completeness (location, title, company) | ↑ complete, ↓ missing fields   | Missing required fields indicate low-quality source records
Official ATS flag                                | ↑ official                     | Listings from official company ATS subdomains receive a direct quality bonus

The trust score is not a prediction of hiring outcomes. A High score means the observable data quality is strong, not that the company is guaranteed to be interviewing this week. It is a prioritization signal, not a guarantee.
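The weighted combination can be sketched as below. Every weight, threshold, and tier boundary in this sketch is an illustrative assumption; the production values are configurable and not stated on this page.

```python
def trust_score(listing: dict) -> float:
    """Combine observable quality signals into a 0.0-1.0 score (weights illustrative)."""
    score = 0.5  # neutral baseline
    if listing.get("is_direct_ats"):
        score += 0.20        # employer-published ATS record
    else:
        score -= 0.10        # aggregator-only record
    days = listing.get("days_since_posted", 0)
    if days <= 14:
        score += 0.10        # fresh posting
    elif days > 45:
        score -= 0.10        # likely stale requisition
    if not listing.get("is_salary_missing", True):
        score += 0.05        # salary disclosed
    if listing.get("description_chars", 0) >= 400:
        score += 0.05        # description above completeness threshold
    if listing.get("dup_group_has_ats"):
        score += 0.10        # dedup group confirmed by an ATS member
    return round(max(0.0, min(1.0, score)), 2)

def trust_tier(score: float) -> str:
    """Map the numeric score to the labeled tier shown on each card."""
    return "High" if score >= 0.7 else "Medium" if score >= 0.4 else "Low"
```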

Stage 7: Summary Generation

Every listing with a sufficient description receives a rule-based summary. The summary is generated by the following steps:

  1. Boilerplate removal: Repeated legal blocks, equal opportunity statements, and privacy boilerplate are stripped from the description before summarization begins.
  2. Role scope extraction: The first substantive content block, usually the role overview paragraph, is identified and rewritten into 1 to 2 plain-language sentences describing what the role does.
  3. Top skills identification: The 3 to 5 highest-frequency technical or domain skills from the extraction stage are pulled into the summary as an explicit "key skills" component.
  4. Resume angle generation: A short directive is constructed from the top skills, e.g., "Emphasize dbt modeling experience and Snowflake query optimization in your resume", to give the candidate an immediate application direction without requiring them to read the full description.
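The four steps above can be sketched end to end. This is a deliberately simplified version: the boilerplate markers and the two-sentence scope trim are illustrative stand-ins for the production rules.

```python
# Hypothetical markers for legal/boilerplate paragraphs.
BOILERPLATE_MARKERS = ("equal opportunity", "privacy policy", "e-verify")

def generate_summary(description: str, top_skills: list[str]) -> dict:
    # 1. Boilerplate removal: drop paragraphs containing legal markers.
    paragraphs = [p.strip() for p in description.split("\n\n") if p.strip()]
    substantive = [p for p in paragraphs
                   if not any(m in p.lower() for m in BOILERPLATE_MARKERS)]
    # 2. Role scope: first substantive block, trimmed to two sentences.
    scope = ". ".join(substantive[0].split(". ")[:2]) if substantive else ""
    # 3. Top skills: highest-confidence skills from the extraction stage.
    skills = top_skills[:5]
    # 4. Resume angle: a directive built from the leading skills.
    angle = (f"Emphasize {' and '.join(skills[:2])} experience in your resume."
             if skills else "")
    return {"scope": scope, "key_skills": skills, "resume_angle": angle}
```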

AI-assisted summaries

For listings that meet a minimum description length and trust score threshold, an AI-assisted summary is generated as a supplementary view. The AI summary provides more natural language rewriting of the role scope and can surface nuance that the rule-based system misses: context about team structure, business goals, or technical depth that appears in the description but not in skill tags or metadata fields.

AI summaries are explicitly labeled in the UI and are generated as a reading aid, not as a replacement for the source description. Candidates should always verify details on the original employer ATS page before applying.

What We Never Invent

The enrichment pipeline only surfaces information that can be grounded in the source posting. It does not invent compensation ranges, visa policies, benefits, or any other detail not mentioned in the description. When a field cannot be reliably extracted, it is marked as missing rather than filled with a placeholder. This conservative approach produces more blank fields than some competing platforms, but it also means the fields that are populated can be trusted.

A Note on Accuracy

Extraction quality varies by description quality. A well-structured 800-word description from a Greenhouse ATS page produces significantly more accurate enrichment than a 90-word Indeed snippet. We surface a description completeness signal in the trust score so you can calibrate how much to rely on extracted metadata for any given listing.

Verification Reminder

Every listing on Agentic Jobs links to the original source posting. The enrichment layer is a decision aid, not the authoritative record. Always click through to the employer's ATS page before applying to confirm the listing is still live, verify the current requirements, and submit your application through the official channel.

See the Pipeline in Action

Browse listings with trust scores, extracted skills, and AI-rewritten summaries applied to real, live postings.

Open Job Dashboard →