Netzer Epstein

I work on Microsoft's Excel Copilot team and research AI safety: how language models behave under evaluation, and how to make evaluation itself trustworthy and verifiable. I build the experiments, benchmarks, and open datasets that make claims about model reliability testable.

Now: Microsoft · Excel Copilot
Research: SPAR & Heron AI Security fellow
Based in: Tel Aviv, Israel

See research CV

Netzer Epstein, AI Safety Researcher · Research Engineer

About

I'm an AI safety researcher and research engineer focused on LLM evaluation, model behavior, and verifiable machine learning. My work is empirical and evals-first: I design protocols, run them across frontier models, analyze the results with proper statistics, and ship reproducible datasets.

At Microsoft I work on the Excel Copilot team, where my day-to-day is effectively research engineering: designing and optimizing prompt-engineering strategies and building the evaluation benchmarks that measure model accuracy, efficiency, and relevance across native and web clients.

Through the SPAR and Heron AI Security fellowships I run empirical safety research: causally attributing evaluation awareness in LLMs to specific surface cues (with a 22.7k-transcript public dataset), and benchmarking how zero-knowledge ML systems diverge from the protocols they claim to implement. I care about rigor, measurable results, and shipping work others can build on.

Focus areas

Evaluation awareness · LLM evals & benchmarking · Deception & sandbagging · Multi-agent interaction · Cooperation & coordination · Verifiable / zero-knowledge ML

Research

Two parallel fellowship projects on different sides of trustworthy AI: evaluation behavior and cryptographic verifiability.

SPAR Spring 2026 · 2026

What tips models off? Causal attribution of LLM evaluation awareness

Research Fellow · Advised by Qiyao Wei

Under review

LLMs can tell when they're being evaluated. I built a causal protocol that identifies which surface cues tip them off, transplanting evaluation 'tells' between transcripts under LLM-as-judge quality control, then released a 22.7k-transcript dataset to study it.

: flip rate from injected eval scaffolding metadata
: significant effect across all 5 tell templates
: cross-model effect correlation (gpt-5-mini vs gemini-3-flash)
: causal transcripts published on Hugging Face

Evaluation awareness · LLM judges · Causal analysis · Benchmarks

Submitted to EMNLP 2026 · eval-awareness-tells dataset · Code

Heron AI Security Fellowship (first cohort) · 2026

Auditing zero-knowledge ML: benchmarking theory-to-implementation gaps

Research Fellow · Advised by Daniel Kang

Under review

ZKML papers report large speedups for verifiable inference, but some gains come from omitting cryptographic operations the proofs depend on. I built a 56-artifact audit benchmark and a multi-agent inspector that quantify and automatically catch these protocol-to-implementation gaps.

56: expert-authored vulnerability artifacts, 6 crypto categories
: hidden zkLLM prover-time inflation (≈911× proof size)
: inspector recall at 71.1% precision (F1 +5pp vs baseline)

ZKML · Verifiable inference · Multi-agent · Cryptography

Contributes to a NeurIPS 2026 submission · zkml-inspector · zkml-audit-benchmark · Dataset on Hugging Face

Projects

Open-source research tooling and experiments, mostly LLM evaluation, deception/sandbagging studies, and zero-knowledge ML auditing.

zkml-inspector

A four-agent pipeline (orchestrator, paper analyst, code inspector, report writer) that reads a ZKML paper and codebase and flags soundness violations, beating a single-agent baseline by ~5 F1 points. My flagship Heron-fellowship project.

Python · Multi-agent · ZKML · Auditing

zkML-inspector-benchmark

An extensible audit benchmark pairing four frozen, peer-reviewed ZKML codebases with 56 expert-authored vulnerability artifacts across six cryptographic categories. Published on Hugging Face.

Python · Benchmark · ZKML · Dataset

eval_awareness_tells

The SPAR evaluation-awareness study: a five-phase tell-transplantation protocol that causally attributes when LLMs notice they're being tested, plus a 22.7k-transcript public dataset.

Python · Evals · Causal analysis · Dataset
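A minimal sketch of how a flip rate and its significance could be computed from paired judge verdicts. The definitions here are assumptions for illustration, not the study's actual metric: `flip_rate` counts transcripts judged "not an evaluation" before a tell is transplanted and "evaluation" afterwards, and `mcnemar_exact_p` is a standard exact McNemar test on the discordant pairs. The function names and the toy data are hypothetical.

```python
from math import comb


def flip_rate(before: list[bool], after: list[bool]) -> float:
    """Share of baseline-negative transcripts judged 'evaluation' after the edit.

    Assumed definition: True means the judge labelled the transcript as an
    evaluation. Only transcripts that were False at baseline can flip.
    """
    eligible = [a for b, a in zip(before, after) if not b]
    if not eligible:
        raise ValueError("no baseline-negative transcripts to flip")
    return sum(eligible) / len(eligible)


def mcnemar_exact_p(before: list[bool], after: list[bool]) -> float:
    """Two-sided exact McNemar test on the discordant pairs."""
    up = sum(1 for b, a in zip(before, after) if not b and a)
    down = sum(1 for b, a in zip(before, after) if b and not a)
    n = up + down
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of any effect
    k = min(up, down)
    # Under the null, each discordant pair goes either way with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy verdicts (made up): 10 transcripts, none flagged before, 8 flagged after.
before = [False] * 10
after = [True] * 8 + [False] * 2
rate = flip_rate(before, after)
p_value = mcnemar_exact_p(before, after)
```

Pairing each transcript with its own edited copy is what lets the comparison isolate the transplanted cue: everything else in the transcript is held fixed.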

zkllm-ccs2024

An audit and repair of a CUDA zero-knowledge proof system for 7–13B LLM inference, adding a real SHA3-256 Fiat–Shamir transcript and per-stage commitment chain that surfaced 15 soundness issues.

CUDA · Cryptography · Soundness · Systems
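For readers unfamiliar with the idea, here is a toy sketch of a hash-chained Fiat-Shamir transcript using SHA3-256 from Python's standard library. It is not the project's CUDA code: the `Transcript` class, its labels, and the modulus are hypothetical. It only illustrates the general principle that each stage's commitment is absorbed into a running hash, so every later challenge depends on everything the prover has committed to so far.

```python
import hashlib


class Transcript:
    """Illustrative Fiat-Shamir transcript: a running SHA3-256 state."""

    def __init__(self, domain: bytes) -> None:
        # Domain separation keeps transcripts of different protocols apart.
        self._state = hashlib.sha3_256(b"domain:" + domain).digest()

    def absorb(self, label: bytes, message: bytes) -> None:
        # Length-prefix each part so different (label, message) splits
        # can never hash to the same byte string.
        h = hashlib.sha3_256()
        h.update(self._state)
        for part in (label, message):
            h.update(len(part).to_bytes(8, "big"))
            h.update(part)
        self._state = h.digest()

    def challenge(self, label: bytes, modulus: int) -> int:
        # Deriving a challenge also advances the state, so two challenges
        # drawn in a row differ. The modular reduction has a small bias
        # that a real implementation would need to handle.
        self.absorb(label, b"challenge")
        return int.from_bytes(self._state, "big") % modulus


MODULUS = 2**61 - 1  # toy field size, chosen only for the example

prover = Transcript(b"demo-protocol")
prover.absorb(b"stage-1-commitment", b"\x01")
c_prover = prover.challenge(b"r1", MODULUS)

verifier = Transcript(b"demo-protocol")
verifier.absorb(b"stage-1-commitment", b"\x01")
c_verifier = verifier.challenge(b"r1", MODULUS)

tampered = Transcript(b"demo-protocol")
tampered.absorb(b"stage-1-commitment", b"\x02")
c_tampered = tampered.challenge(b"r1", MODULUS)
```

The verifier recomputes the same challenge from the same commitments, while a changed commitment yields a different one. If challenges are derived without binding them to the commitments, a prover can pick commitments after seeing the challenge, which is the kind of soundness gap an audit looks for.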

OctosquidAISandbaggingGame

A two-LLM game for studying deception and sandbagging: a judge interrogates a subject that is secretly free to lie, with a modular runner, constraint enforcement, and full transcript artifacts.

Python · Deception · Experiment infra

HeronTestProject

A benchmark testing whether a misaligned LLM agent will sandbag (intentionally underperform) when reporting model evaluations, scored across three adversarial scenarios.

Python · AI control · Sandbagging

Industry

2024–Present · Tel Aviv, Israel
Research Engineer, Excel Copilot
Microsoft, Israel R&D Center
Building LLM-powered features for Excel Copilot: formula suggestions, prompt-engineering strategies, and the evaluation benchmarks that measure model accuracy, efficiency, and user relevance across native and web clients.
TypeScript · C++ · OpenAI APIs · Azure DevOps · LLM evals
2023–2024 · Tel Aviv, Israel
Software Engineer, Excel Online
Microsoft, Israel R&D Center
Led smart-suggestions work in Excel Online and shipped full-stack features across Excel Desktop and Excel Online.
TypeScript · Node.js · React · React Native · C++ · KQL
2021–2023 · Tel Aviv, Israel
Software Engineer (Student position), Excel Online
Microsoft, Israel R&D Center
Owned slices of Excel Online infrastructure, telemetry, and complex build/bundling pipelines.
C# · Node.js · Webpack

Fellowships & training

2026
SPAR Spring Fellow, Evaluation Awareness
SPAR (Supervised Program for Alignment Research)
Researching evaluation awareness in LLMs: how models may alter their behavior when they detect they are being benchmarked. Advised by Qiyao Wei.
LLM evals · Mechanistic probes · Causal interventions
2025–2026
Heron AI Security Research Fellow (first cohort)
Heron AI Security Initiative
Built zkml-inspector and zkml-audit-benchmark: a multi-agent auditor and 56-artifact dataset for catching soundness gaps in zero-knowledge ML implementations. Advised by Daniel Kang.
ZKML · Multi-agent systems · Python
2025
BlueDot Impact, Technical AI Safety
BlueDot Impact
Completed a technical curriculum covering mechanistic interpretability, RLHF, and threat modeling for transformative AI.
June 2026
ARBOx4 Fellow, Oxford AI Safety Initiative
OAISI (University of Oxford)
Alignment research bootcamp covering core technical AI safety methods and hands-on research practice.

Toolbox

The stack I reach for across research engineering, evaluation, and safety work. When a project needs a tool I don't yet know, I learn it, as I have done repeatedly.

Languages: Python · TypeScript · C++ · CUDA · C# · KQL · JavaScript
AI / ML & evals: LLM evaluation & benchmark design · Inspect AI · Prompt engineering · OpenAI / Anthropic / Gemini APIs · Azure OpenAI · Multi-agent systems · Statistical analysis (scipy) · Hugging Face datasets
AI safety: Evaluation awareness · Deception & sandbagging · Evals & red-teaming · AI control · Verifiable / zero-knowledge ML · Adversarial robustness · Interpretability
Engineering: React / React Native · Node.js · Full-stack · Azure DevOps · Webpack · Telemetry

Writing

I write occasionally about LLM internals, evaluation, and AI safety, working through ideas in public. Posts here sync automatically from my Substack.

Hacking Context Representation
Deceiving AI models into thinking that a carrot is not a bomb
Dec 28, 2025

Games & stories

Tabletop RPGs, narrative games, and designing my own. I'm interested in games and gamification as a tool for studying how humans and AI behave under different incentives.

TTRPGs · Game design · Storytelling games

Reading & watching

Sci-fi and fantasy, plus animated series (not anime). Recently in rotation:

Dungeon Crawler Carl · Between Two Fires · Malazan · There Is No Antimemetics Division · Powder Mage · Invincible · The Legend of Vox Machina · Gravity Falls

Outdoors & art

Hiking, jogging, and watercolor painting: the things that pull me away from a screen.

Hiking · Jogging · Watercolor

Get in touch

Always happy to discuss research, especially evaluations, model behavior, and verifiable ML, or to trade notes and feedback on ideas. Reach out anytime.

netzerep@gmail.com

GitHub · LinkedIn · Hugging Face · Substack · Email

Netzer Epstein

Engineering in service of safety research.
