Netzer Epstein
I work on Microsoft's Excel Copilot team and research AI safety: how language models behave under evaluation, and how to make evaluation itself trustworthy and verifiable. I build the experiments, benchmarks, and open datasets that make claims about model reliability testable.
- Now: Microsoft · Excel Copilot
- Research: SPAR & Heron AI Security fellow
- Based in: Tel Aviv, Israel

About
I'm an AI safety researcher and research engineer focused on LLM evaluation, model behavior, and verifiable machine learning. My work is empirical and evals-first: I design protocols, run them across frontier models, analyze the results with proper statistics, and ship reproducible datasets.
At Microsoft I work on the Excel Copilot team, where my day-to-day is effectively research engineering: designing and optimizing prompt-engineering strategies and building the evaluation benchmarks that measure model accuracy, efficiency, and relevance across native and web clients.
Through the SPAR and Heron AI Security fellowships I run empirical safety research: causally attributing evaluation awareness in LLMs to specific surface cues (with a 22.7k-transcript public dataset), and benchmarking how zero-knowledge ML systems diverge from the protocols they claim to implement. I care about rigor, measurable results, and shipping work others can build on.
Focus areas
Research
Two parallel fellowship projects on different sides of trustworthy AI: evaluation behavior and cryptographic verifiability.
Projects
Open-source research tooling and experiments, mostly LLM evaluation, deception/sandbagging studies, and zero-knowledge ML auditing.
zkml-inspector
A four-agent pipeline (orchestrator, paper analyst, code inspector, report writer) that reads a ZKML paper and codebase and flags soundness violations, beating a single-agent baseline by ~5 F1 points. My flagship Heron-fellowship project.
zkML-inspector-benchmark
An extensible audit benchmark pairing four frozen, peer-reviewed ZKML codebases with 56 expert-authored vulnerability artifacts across six cryptographic categories. Published on Hugging Face.
eval_awareness_tells
The SPAR evaluation-awareness study: a five-phase tell-transplantation protocol that causally attributes LLMs' awareness of being tested to specific surface cues, plus a 22.7k-transcript public dataset.
zkllm-ccs2024
An audit and repair of a CUDA zero-knowledge proof system for 7–13B LLM inference, adding a real SHA3-256 Fiat–Shamir transcript and a per-stage commitment chain that surfaced 15 soundness issues.
OctosquidAISandbaggingGame
A two-LLM game for studying deception and sandbagging: a judge interrogates a subject that is secretly free to lie, with a modular runner, constraint enforcement, and full transcript artifacts.
HeronTestProject
A benchmark testing whether a misaligned LLM agent will sandbag (intentionally underperform) when reporting model evaluations, scored across three adversarial scenarios.
Experience
Five years at Microsoft on Excel, most recently on Excel Copilot, alongside research fellowships in AI safety.
Industry
2024–Present · Tel Aviv, Israel
Research Engineer, Excel Copilot
Microsoft, Israel R&D Center
Building LLM-powered features for Excel Copilot: formula suggestions, prompt-engineering strategies, and the evaluation benchmarks that measure model accuracy, efficiency, and user relevance across native and web clients.
TypeScript · C++ · OpenAI APIs · Azure DevOps · LLM evals
2023–2024 · Tel Aviv, Israel
Software Engineer, Excel Online
Microsoft, Israel R&D Center
Led smart-suggestions work in Excel Online and shipped full-stack features across Excel Desktop and Excel Online.
TypeScript · Node.js · React · React Native · C++ · KQL
2021–2023 · Tel Aviv, Israel
Software Engineer (Student position), Excel Online
Microsoft, Israel R&D Center
Owned slices of Excel Online infrastructure, telemetry, and complex build/bundling pipelines.
C# · Node.js · Webpack
Fellowships & training
2026
SPAR Spring Fellow, Evaluation Awareness
SPAR (Supervised Program for Alignment Research)
Researching evaluation awareness in LLMs: how models may alter their behavior when they detect they are being benchmarked. Advised by Qiyao Wei.
LLM evals · Mechanistic probes · Causal interventions
2025–2026
Heron AI Security Research Fellow (first cohort)
Heron AI Security Initiative
Built zkml-inspector and zkml-audit-benchmark: a multi-agent auditor and 56-artifact dataset for catching soundness gaps in zero-knowledge ML implementations. Advised by Daniel Kang.
ZKML · Multi-agent systems · Python
2025
BlueDot Impact, Technical AI Safety
BlueDot Impact
Completed a technical curriculum covering mechanistic interpretability, RLHF, and threat modeling for transformative AI.
June 2026
ARBOx4 Fellow, Oxford AI Safety Initiative
OAISI (University of Oxford)
Alignment research bootcamp covering core technical AI safety methods and hands-on research practice.
Toolbox
The stack I reach for across research engineering, evaluation, and safety work. When a project needs a tool I don't yet know, I learn it, as I have done repeatedly.
- Languages: Python · TypeScript · C++ · CUDA · C# · KQL · JavaScript
- AI / ML & evals: LLM evaluation & benchmark design · Inspect AI · Prompt engineering · OpenAI / Anthropic / Gemini APIs · Azure OpenAI · Multi-agent systems · Statistical analysis (scipy) · Hugging Face datasets
- AI safety: Evaluation awareness · Deception & sandbagging · Evals & red-teaming · AI control · Verifiable / zero-knowledge ML · Adversarial robustness · Interpretability
- Engineering: React / React Native · Node.js · Full-stack · Azure DevOps · Webpack · Telemetry
Writing
I write occasionally about LLM internals, evaluation, and AI safety, working through ideas in public. Posts here sync automatically from my Substack.
Education
Off the clock
A few of the things that occupy me away from a screen, and the occasional conversation starter.
Games & stories
Tabletop RPGs, narrative games, and designing my own. I'm interested in games and gamification as a tool for studying how humans and AI behave under different incentives.
Reading & watching
Sci-fi and fantasy, plus animated series (not anime).
Outdoors & art
Hiking, jogging, and watercolor painting: the things that pull me away from a screen.
Get in touch
Always happy to discuss research, especially evaluations, model behavior, and verifiable ML, or to trade notes and feedback on ideas. Reach out anytime.