Stop Writing Regex for Data You Should Be Describing in English
Source: Dev.to
Remote‑Friendly, Senior‑Level, Salary‑Disclosed Job Screening
You have a spreadsheet of job postings and need to filter it down to roles that are remote‑friendly, senior‑level, and have a disclosed salary. The data looks like this:
| company | post |
|---|---|
| Airtable | Async‑first team, 8+ yrs exp, $185‑220K base |
| Vercel | Lead our NYC team. Competitive comp, DOE |
| Notion | In‑office SF. Staff eng, $200K + equity |
| Linear | Bootcamp grads welcome! $85K, remote‑friendly |
| Descript | Work from anywhere. Principal architect, $250K |
Deterministic rules (in plain English)
- Remote‑friendly – contains “remote”, “work from anywhere”, “async‑first”, or implied by the absence of an office location.
- Senior‑level – contains “8+ yrs”, “Staff”, “Principal”, or “Lead” (note: “Lead” can sometimes be junior).
- Salary disclosed – contains an actual number (e.g., “$85K”, “$185‑220K”), not “Competitive comp” or “DOE”.
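For contrast, here is roughly what a regex version of these rules looks like. This is a sketch of my own, with deliberately incomplete pattern lists: every new phrasing ("fully distributed", "10 yrs+") means another pattern, "Lead" still matches junior postings, and the "absence of an office location implies remote" rule has no regex at all.

```python
import re

# Toy regex version of the three rules. Pattern lists are illustrative
# and incomplete -- which is exactly the problem with this approach.
REMOTE = re.compile(r"remote|work from anywhere|async-first", re.I)
SENIOR = re.compile(r"\d+\+\s*yrs|staff|principal|lead", re.I)
SALARY = re.compile(r"\$\d")  # misses "120k USD", can't tell a range from a perk

posts = {
    "Airtable": "Async-first team, 8+ yrs exp, $185-220K base",
    "Vercel": "Lead our NYC team. Competitive comp, DOE",
    "Notion": "In-office SF. Staff eng, $200K + equity",
    "Linear": "Bootcamp grads welcome! $85K, remote-friendly",
    "Descript": "Work from anywhere. Principal architect, $250K",
}

results = {
    company: bool(REMOTE.search(p) and SENIOR.search(p) and SALARY.search(p))
    for company, p in posts.items()
}
```

It happens to get these five rows right, but only because the data cooperates; the sixth posting you scrape will phrase things differently.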
Using everyrow to express the logic in natural language
everyrow lets you define fuzzy, qualitative logic in plain English and apply it to every row of a dataframe. The SDK handles LLM orchestration, structured outputs, and scaling.
```python
import asyncio

import pandas as pd
from pydantic import BaseModel, Field

from everyrow.ops import screen

jobs = pd.DataFrame([
    {"company": "Airtable", "post": "Async-first team, 8+ yrs exp, $185-220K base"},
    {"company": "Vercel", "post": "Lead our NYC team. Competitive comp, DOE"},
    {"company": "Notion", "post": "In-office SF. Staff eng, $200K + equity"},
    {"company": "Linear", "post": "Bootcamp grads welcome! $85K, remote-friendly"},
    {"company": "Descript", "post": "Work from anywhere. Principal architect, $250K"},
])

class JobScreenResult(BaseModel):
    qualifies: bool = Field(description="True if meets ALL criteria")

async def main():
    result = await screen(
        task="""
        Qualifies if ALL THREE are met:
        1. Remote-friendly
        2. Senior-level (5+ yrs exp OR Senior/Staff/Principal in title)
        3. Salary disclosed (specific numbers, not "competitive" or "DOE")
        """,
        input=jobs,
        response_model=JobScreenResult,
    )
    print(result.data)

asyncio.run(main())
```
Result
| company | qualifies |
|---|---|
| Airtable | True |
| Vercel | False |
| Notion | False |
| Linear | False |
| Descript | True |
- Airtable qualifies: “async‑first” (remote‑friendly), “8+ yrs exp” (senior), “$185‑220K” (salary disclosed).
- Descript qualifies: “work from anywhere” (remote), “principal architect” (senior), “$250K” (salary disclosed).
The other rows fail at least one criterion (no real salary, in‑office location, or not senior enough).
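A single boolean is all this example needs, but since `response_model` is plain Pydantic, you can ask for a per‑criterion breakdown instead. The field names below are my own, not part of the everyrow API:

```python
from pydantic import BaseModel, Field

class JobScreenDetail(BaseModel):
    remote_friendly: bool = Field(description="Remote, work-from-anywhere, or async-first")
    senior_level: bool = Field(description="5+ yrs exp or Senior/Staff/Principal title")
    salary_disclosed: bool = Field(description="Specific numbers, not 'competitive' or 'DOE'")
    qualifies: bool = Field(description="True only if all three are True")

# The SDK would populate one instance per row; constructed by hand here
# to show the shape a Vercel-like row would come back with:
vercel = JobScreenDetail(
    remote_friendly=False,
    senior_level=True,
    salary_disclosed=False,
    qualifies=False,
)
```

Per‑criterion fields make failures debuggable: you can see *which* rule a rejected row tripped on.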
Sessions: Track Everything in a Dashboard
Every operation runs inside a session: a grouping of related operations that appears in the everyrow.io web UI. Sessions are created automatically, but for multi‑step pipelines you’ll want to create one explicitly:
```python
from everyrow import create_session
from everyrow.ops import screen, rank

async with create_session(name="Lead Qualification") as session:
    print(f"View at: {session.get_url()}")

    screened = await screen(
        session=session,
        task="Has a company email domain (not gmail, yahoo, etc.)",
        input=leads,
        response_model=ScreenResult,
    )
    ranked = await rank(
        session=session,
        task="Score by likelihood to convert",
        input=screened.data,
        field_name="conversion_score",
    )
```
The session URL gives you a live dashboard where you can monitor progress and inspect results while your script runs.
Background Jobs for Large Datasets
The operations above are already async, but they block until the result is ready. The `_async` variants are fire‑and‑forget: they submit work to the server and return immediately so your script can continue.
```python
from everyrow.ops import screen_async

async with create_session(name="Background Screening") as session:
    task = await screen_async(
        session=session,
        task="Remote-friendly, senior-level, salary disclosed",
        input=large_dataframe,
    )
    print(f"Task ID: {task.task_id}")
    # do other work...
    result = await task.await_result()
```
If your script crashes, recover the result later using the task ID:
```python
from everyrow import fetch_task_data

df = await fetch_task_data("12345678-1234-1234-1234-123456789abc")
```
Beyond Screening: Other Operations
| Operation | What it does |
|---|---|
| Screen | Filter rows by criteria that require judgment |
| Rank | Score rows by qualitative factors |
| Dedupe | Deduplicate when fuzzy string matching isn’t enough |
| Merge | Join tables when keys don’t match exactly |
| Research | Run web agents to research each row |
Each operation takes a natural‑language task description and a dataframe, and returns structured results. Same pattern, different capability.
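To see why an LLM‑backed `Merge` exists at all, note what an exact‑key pandas join does with cosmetically different names. The toy data here is my own:

```python
import pandas as pd

left = pd.DataFrame({"company": ["Vercel Inc.", "Notion Labs"], "deal": [1, 2]})
right = pd.DataFrame({"company": ["Vercel", "Notion"], "hq": ["NYC", "SF"]})

# An exact join finds nothing: "Vercel Inc." != "Vercel", so every
# row is silently dropped even though the entities clearly match.
exact = left.merge(right, on="company", how="inner")
```

Fuzzy string matching recovers some of these, but breaks down on cases like "Meta" vs "Facebook", where the match requires world knowledge rather than edit distance.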
When to Use (and When Not To)
everyrow shines for cases where the logic is easy to describe but hard to code: screening, ranking, deduplication, and enrichment tasks where the criteria require judgment, world knowledge, or fuzzy matching.
It is not a replacement for deterministic transformations. If you can write a reliable pandas filter like `df[df["salary"] > 100_000]`, you should. Use everyrow for columns that contain natural‑language, inconsistent, or otherwise ambiguous values.
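The distinction in practice (the structured `offers` frame is my own toy data; the text rows come from the job‑screening example above):

```python
import pandas as pd

# Structured column: a plain comparison is the right tool --
# deterministic, free, and instant.
offers = pd.DataFrame({"company": ["A", "B"], "salary": [95_000, 140_000]})
high = offers[offers["salary"] > 100_000]

# Unstructured column: there is no "salary" column to compare. A regex
# can flag that *some* dollar figure appears, but reliably interpreting
# "$185-220K" vs "DOE" is where judgment-based screening comes in.
posts = pd.DataFrame({"post": ["$185-220K base", "Competitive comp, DOE"]})
has_salary = posts["post"].str.contains(r"\$\d", regex=True)
```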
Trade‑offs: LLM‑based operations introduce latency and cost. Use them judiciously for the parts of your pipeline that truly need human‑like reasoning.
Scaling note – In the job‑screening example above, processing 5 rows takes a few seconds and costs a fraction of a cent. For 10,000 rows you’ll want the async variants and should expect minutes rather than milliseconds. The Getting Started docs cover scaling patterns for larger datasets.
Get Started
```bash
pip install everyrow
export EVERYROW_API_KEY=your_key_here
```
Get a free API key at everyrow.io/api-key – it comes with $20 free credit.
Full docs and more examples: everyrow.io/docs/getting-started