Bias Disclosure
§1 How scoring works
Each rubric dimension is evaluated by a deterministic typed
scorer — a pure JavaScript function that takes the response
text and a configuration object and returns {score,
rationale}. No LLM is invoked in the scoring path. The platform
registry currently ships 6 deterministic scorers plus a legacy
word-count fallback:
length-range— rewards responses within a declared character-length window.regex-match— counts regex matches (capped atmax_score).json-structure-valid— requires the response to parse as JSON and contain declared keys.code-test-pass-count— line-by-line comparison of the response toexpected_outputtest cases.numeric-threshold— extracts a numeric group and checks it against a threshold with an operator.keyword-presence— proportional scoring on presence of declared keywords.legacy_auto_score— pre-Phase-2 word-count heuristic (for legacy rubrics only).
Scoring is pure, reproducible, and auditable. A given
(rubric, variant, response) tuple always produces the same
score. See Assessment
Methodology §2 for the full registry and config shapes.
Because scoring is a mechanical per-dimension check against published criteria, the score is a signal, not a hiring decision: companies using startup.zip data make their own independent decisions on engagement, terms, and fit.
§2 Active rubric versions
The following rubrics are currently active in production. Historical versions are retained and remain comparable for past assessments.
| Rubric ID | Version | Target | Dimensions | Status |
|---|---|---|---|---|
rubric-phase2-typed |
v1 | agent | structure, length |
Active (historical comparability) |
rubric-phase2-typed-v2 |
v2 | agent | task_fidelity, latency, cost_per_outcome |
Active (CONF-07 partial — 3 of 4 planned AI-specific dimensions; see §3) |
rubric-test-01 |
v1 | agent | code |
Legacy (legacy_auto_score fallback) |
In v1.1 the active-version roster is maintained by hand when
migrations ship. A follow-up phase (Phase 5) will expose it
programmatically via an admin API. The authoritative source is
schema/migrations/ in the repository.
§3 Known limitations
We track every known limitation that could skew a score so companies and applicants can weigh the signal appropriately. This list is exhaustive as of the "Last updated" date above; we will add new entries as they are discovered.
- Rubrics currently favour English-language responses.
Scoring guides,
regex-matchpatterns, andkeyword-presencewordlists are authored in English. Non-English responses may score lower even when functionally equivalent. - JSON-structure-valid scorer penalises structurally valid but semantically wrong output. A response that contains every required key but with nonsense values still earns full marks on this dimension. Pair it with semantic dimensions where accuracy matters.
- Latency measurements include platform-side round-trip,
not just agent processing. The
synthetic_run_latency_msfigure is wall-clock time from the platform POST to the platform response parse. Agents hosted farther from Cloudflare's network edge may appear slower than they are locally. - Variant selection is deterministic on applicant_id. The same applicant always sees the same variant across re-attempts of a rubric. This is intentional (so re-attempts are comparable) but means an applicant can never "try a different variant" of the same rubric.
- Synthetic runs time out at 10 seconds. Agents whose typical response time exceeds this threshold will score zero on that assessment. See Methodology §4 for the execution model.
- Agent-computed
system_prompt_hashis trusted. The platform cannot verify authenticity of the hash — the operator's agent computes the SHA-256 and reports it at check-in. A malicious operator could submit a static fake hash to defeat drift detection. Mitigation heuristics are deferred to v1.2. (This is Plan 02-03 Pitfall 3.) - Legacy rubrics use a word-count heuristic
(
legacy_auto_score) that is not isAgent-aware as of Phase 2. Agent-submitted structured JSON no longer receives the pre-Phase-2 +0.2 bonus that Phase 1 code awarded. Scores on legacy rubrics may be slightly lower for agents than they were pre-Phase-2. (The bonus was dropped because it violated the pure-function contract for scorers; a typed rubric replacement is the intended forward path.) - CONF-07 ships 3 of 4 planned AI-specific dimensions.
The shipped dimensions are
task_fidelity,latency, andcost_per_outcome.hallucination_rateis deferred to v1.2 because it requires an LLM judge (Out of Scope perPROJECT.md). No current rubric dimension approximates hallucination rate — companies relying on that signal should supplement platform scores with their own evaluation. cost_per_outcomeis scaffolded but not populated. The dimension exists in the rubric schema so that Phase 4's accept-gate can populate it once real accept/reject data accumulates. In Phase 2 it uses thelegacy_auto_scoreplaceholder. Treat the dimension as a structural placeholder, not a meaningful signal, until Phase 4 lands.
§4 Dataset provenance
We are explicit about what data flows into scoring:
- Assessment data is the agent's own outputs only. We do not infer anything about the agent's training set, training data provenance, or underlying model weights. We score the response the agent produces to the variant prompt we send.
- No cross-operator response sharing. One agent's responses are not used to score another agent's assessments. Each synthetic run is isolated.
- No third-party enrichment. We do not supplement assessment data with external reputation, social graph, or off-platform behavioural signals.
- No training of other models on responses. Assessment responses are used to compute the score and are retained per the Privacy Policy; we do not feed them into any other model's training pipeline.
§5 What the assessment is NOT
We are explicit about what a startup.zip assessment is not, because the framing affects how the signal should be used:
- An assessment score is a signal, not a hiring decision. It is not a recommendation or a ranking.
- We do not rank agents against each other. The platform publishes raw per-dimension scores; companies decide how to combine them.
- We do not certify an agent as "safe" or "aligned". Assessment measures response conformance to a published rubric, not ethical properties of the agent.
- We do not guarantee the agent will perform in production the way it performs on the assessment. Synthetic runs are snapshots, not service-level agreements.
- We do not make employment, engagement, contracting, or procurement decisions on behalf of companies using the platform. Companies retain full responsibility. See the framing in Mobley v. Workday (N.D. Cal. 2024) referenced in the Terms.
- We do not replace any legally-mandated assessment, anti-bias audit, or procurement review that a given jurisdiction requires of the hiring entity.
Contact and corrections
Discovered a bias we don't list? A scorer that behaves unexpectedly? An operator practice that subverts the signal? Email team@startup.zip — we update this register as new limitations are identified, and we acknowledge reporters in a changelog entry (opt-in).