Introducing Printing Machines: Benchmarking model masterpieces


TL;DR: We developed a benchmark for AI agents with access to creative tools, challenging agents to take multiple actions to complete an existing drawing ("inpainting") and to re-create an input image ("replication"). Despite the simplicity of the setting, current frontier models failed miserably and in surprising ways. We release our benchmark, current tasks, and a system for contributions. In future work, we will extend to additional task regimes for 2D and 3D creation and study multi-modal training for multi-step creation.

Drawing as a testbed for agents#

Frontier vision-language models have advanced rapidly as agents, performing well at extracting relevant details from images and integrating them into multi-step decision making or problem solving workflows [1, 2, 3].


Drawing provides a minimal yet demanding setting in which multiple core components of agentic capability must be exercised simultaneously. Perception, planning, execution, and memory are all tightly coupled, while the task remains simple enough that failures can be directly observed and attributed. Small errors in perception or execution compound over time, leading to drift, broken constraints, or global misalignment from the intended goal.


These dynamics mirror those found in high-impact domains where models have historically struggled, including design and CAD tools, UI automation, and robotic manipulation [4, 5, 6, 7, 8]. In practical creative workflows, success rarely comes from a single pass: agents must detect errors, backtrack, and iteratively correct their work under persistent constraints [9, 10, 11, 12, 13].


Drawing exposes three distinct stages relevant to general visual reasoning and agentic capability.


The first is perception. Models must reliably recognize fine-grained visual details and track what is present, including precise object structure and cardinality. Today's models already struggle here: even on static images, fine distinctions and exact counts are often unreliable.


[Figure: Counting is hard (for frontier language models).]


The second is planning. Given a target and the current state of the drawing, models must generate one or more actionable steps to minimize the gap. This requires translating latent visual understanding into concrete, incremental plans; for non-trivial drawings, this process is iterative, continually updated based on newly observed state.


The third is execution. Plans must be realized through fine-grained, precise tool control. While many models can articulate plausible objectives or strategies, accurately executing the corresponding actions remains challenging. Small execution errors compound quickly and are difficult to recover from without robust state tracking.


Together, these form a tightly coupled loop where errors introduced at any stage alter the visual state and constrain what is possible downstream.


The results#

To evaluate this, we built a simple harness consisting of a small suite of procedural drawing tasks and a standardized tool interface:

  • draw_line(x1, y1, x2, y2)
  • draw_spline(points[])
  • draw_polyline(points[])
  • draw_curve(x1, y1, cx1, cy1, cx2, cy2, x2, y2)
  • update_notes(notes)
  • undo()
  • stop()
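The tool interface above can be sketched as a minimal canvas object — a hedged illustration, not the released harness; the internal representation (an action list plus an append-only notes string) is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class Canvas:
    """Minimal sketch of the drawing tool interface; action names match
    the harness description, internals are illustrative assumptions."""
    actions: list = field(default_factory=list)  # committed drawing actions
    notes: str = ""                              # append-only scratchpad

    def draw_line(self, x1, y1, x2, y2):
        self.actions.append(("line", (x1, y1, x2, y2)))

    def draw_polyline(self, points):
        self.actions.append(("polyline", tuple(points)))

    def draw_spline(self, points):
        self.actions.append(("spline", tuple(points)))

    def draw_curve(self, x1, y1, cx1, cy1, cx2, cy2, x2, y2):
        # cubic Bezier: two endpoints plus two control points
        self.actions.append(("curve", (x1, y1, cx1, cy1, cx2, cy2, x2, y2)))

    def update_notes(self, notes):
        self.notes += notes  # append-only: prior notes are never rewritten

    def undo(self):
        if self.actions:
            self.actions.pop()
```

Rendering the action list to pixels is deliberately out of scope here; what matters for the agent is that every tool call mutates a single shared state that persists across steps.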

The harness provides models with an optional append-only scratchpad with a maximum length of 20,000 tokens, which can be accessed via the update_notes tool to track state and plan across steps. Each model was asked to iteratively observe, plan, and act, with access only to the current state of the canvas, the target, and its own prior actions.
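The observe-plan-act loop can be sketched as follows — an assumption about the control flow, not the released harness; `model_step`, `render`, and `MAX_STEPS` are hypothetical names:

```python
MAX_STEPS = 50  # hypothetical step budget

def run_episode(model_step, target, render):
    """Illustrative sketch of the iterative observe-plan-act loop."""
    state = {"actions": [], "notes": ""}  # canvas history + scratchpad
    for _ in range(MAX_STEPS):
        # the model sees only the current canvas, the target, and its history
        obs = {
            "canvas": render(state["actions"]),
            "target": target,
            "history": list(state["actions"]),
        }
        name, args = model_step(obs, state["notes"])
        if name == "stop":
            break
        if name == "update_notes":
            state["notes"] += args  # append-only (capped at 20k tokens)
        elif name == "undo":
            if state["actions"]:
                state["actions"].pop()
        else:  # a drawing primitive such as draw_line or draw_curve
            state["actions"].append((name, args))
    return state
```

The key property this loop enforces is statefulness: each decision is conditioned on the consequences of all prior actions, so early mistakes propagate unless the model notices and corrects them.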


We expected tasks at this level of complexity to be well within the capabilities of frontier models. They were not.


[Interactive figure: sample runs. Switch models to compare action traces and reasoning, with previous/current canvas snapshots, the target, per-step actions, and logged notes.]

Across the tasks we tested, many frontier models struggled to produce legible or consistent drawings, often diverging after only a small number of steps. Common failure modes included repeated spiraling behaviors (where models redraw the same structure without convergence), misidentification of specific objects or subcomponents, and a general lack of geometric precision. These failures were rarely caused by a single catastrophic error; instead, small perceptual or execution mistakes compounded over time, quickly invalidating otherwise reasonable plans.
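One hedged way to quantify the drift described above is to track a per-step distance between the rendered canvas and the target; the metric and window heuristic here are illustrative assumptions, not the benchmark's scoring rule. A converging run should show the distance shrinking, while spiraling runs plateau or grow:

```python
def pixel_distance(canvas_img, target_img):
    """Mean absolute pixel difference between two same-sized 2D images."""
    flat_c = [p for row in canvas_img for p in row]
    flat_t = [p for row in target_img for p in row]
    return sum(abs(c - t) for c, t in zip(flat_c, flat_t)) / len(flat_t)

def is_diverging(distances, window=3):
    """Flag runs whose distance stopped improving over the last `window` steps."""
    if len(distances) <= window:
        return False
    return min(distances[-window:]) >= distances[-window - 1]
```

A trace like `[5, 4, 4, 4, 4]` would be flagged as diverging with `window=3`, while `[5, 4, 3, 2, 1]` would not.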


What this tells us about current models#

The results highlight a region of the design space that has been comparatively underexplored by existing benchmarks. Performance on static or text-mediated visual tasks does not yet imply robustness under incremental interaction, and closing that gap will likely require both new training signals and new evaluation methodologies.


[Figure: variance in final image across runs.]


Future work#

We plan to release an open-source version of this benchmark, including the task suite, difficulty methodology, evaluation protocol, agent harness, and empirical results. Our goal is to provide a simple, reproducible way to probe stateful visual interaction. We see this less as a final benchmark and more as a starting point for a broader class of evaluations that stress interaction, persistence, and control under evolving state.


References#

  1. "ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use"
  2. "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos"
  3. "CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs"
  4. "Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes"
  5. "iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs"
  6. "PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly"
  7. "DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation"
  8. "UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction"
  9. "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks In Command Line Interfaces"
  10. "SketchAgent: Language-Driven Sequential Sketch Generation"
  11. "PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents"
  12. "GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing"
  13. "MIRA: Multimodal Iterative Reasoning Agent for Image Editing"