How geeni was built

A transparent look at the data, benchmarks, and coaching loops behind an Earth Engine expert agent.

Test fixtures

Every answer geeni gives is measured against curated test cases. Two complementary sets cover the full range of Earth Engine tasks.

103 Golden Fixtures

Curated through iterative development and testing. 15 easy, 51 medium, and 37 hard tasks covering vegetation, floods, ML, SAR, time series, and more.

42 EEFA Fixtures

Auto-extracted from the EEFA textbook companion code. Real-world exercises from the definitive Earth Engine learning resource.

103 Golden · 42 EEFA · 145 Total
{
  "id": "gee-golden-042",
  "question": "Calculate NDVI for Sentinel-2...",
  "evaluation_criteria": {
    "must_use": ["normalizedDifference", "B8", "B4"],
    "must_not_use": [".getInfo()"],
    "difficulty": "medium",
    "topics": ["band_math", "vegetation", "sentinel2"]
  }
}

Fixtures span 30+ topic categories including band_math, machine_learning, SAR, flood_mapping, time_series, classification, cloud_masking, export, and many more.
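A minimal sketch of how a fixture's `must_use` / `must_not_use` criteria could be checked against generated code — simple substring matching, with the function name and inputs invented for illustration:

```python
# Hypothetical checker for the two simplest criteria kinds:
# every must_use token appears, no must_not_use token does.
def check_fixture(code: str, criteria: dict) -> bool:
    must_use = criteria.get("must_use", [])
    must_not_use = criteria.get("must_not_use", [])
    return (all(tok in code for tok in must_use)
            and not any(tok in code for tok in must_not_use))

fixture = {
    "must_use": ["normalizedDifference", "B8", "B4"],
    "must_not_use": [".getInfo()"],
}
good = "var ndvi = img.normalizedDifference(['B8', 'B4']);"
```

The same idea generalizes to the richer matching types described below.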

Evaluation pipeline

Each benchmark run follows a deterministic pipeline: load fixtures, submit to the LLM, parse the response, check every criterion, and compute a weighted score.

Load Fixtures → Submit to LLM → Parse Response → Check Criteria → Score
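The five steps can be sketched as a single loop; `submit_to_llm` and `check_criteria` are hypothetical stand-ins for the real components, and the response format is assumed to be JSON with a `code` field:

```python
import json

def run_benchmark(fixtures, submit_to_llm, check_criteria):
    """Deterministic pipeline: load, submit, parse, check, score."""
    results = []
    for fx in fixtures:                                  # 1. load fixtures
        raw = submit_to_llm(fx["question"])              # 2. submit to LLM
        code = json.loads(raw)["code"]                   # 3. parse response
        ok = check_criteria(code, fx["evaluation_criteria"])  # 4. check criteria
        results.append((fx["id"], ok))
    score = sum(ok for _, ok in results) / len(results)  # 5. score (unweighted)
    return results, score
```

The real scorer applies difficulty weights, as described below, rather than this flat average.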

Criteria matching types

  • must_use: token must appear in the code
  • must_not_use: banned token must not appear in the code
  • must_mention: concept must appear in the explanation
  • regex: pattern match against the code
  • AND / OR / NOT: composable logic over other criteria
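One way the composable criteria could be evaluated is as a small recursive tree; the node shapes here are illustrative, not geeni's actual fixture schema:

```python
import re

def eval_criterion(node, code, explanation=""):
    """Recursively evaluate a criterion tree against code and explanation."""
    kind = node["type"]
    if kind == "must_use":
        return node["token"] in code
    if kind == "must_not_use":
        return node["token"] not in code
    if kind == "must_mention":
        return node["concept"].lower() in explanation.lower()
    if kind == "regex":
        return re.search(node["pattern"], code) is not None
    if kind == "and":
        return all(eval_criterion(c, code, explanation) for c in node["children"])
    if kind == "or":
        return any(eval_criterion(c, code, explanation) for c in node["children"])
    if kind == "not":
        return not eval_criterion(node["child"], code, explanation)
    raise ValueError(f"unknown criterion type: {kind}")
```

Leaf matchers handle the token, mention, and regex cases; AND/OR/NOT nodes combine them.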

Weighted scoring formula

weighted = (easy × 1 + medium × 2 + hard × 3) / total_weight

// Example: 15/15 easy, 50/51 medium, and 35/37 hard passed
(15×1 + 50×2 + 35×3) / (15×1 + 51×2 + 37×3)
= 220 / 228 ≈ 96.5%
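The formula as a function — `passed` counts passing fixtures per difficulty, `totals` the fixture counts (15 easy, 51 medium, 37 hard):

```python
# Difficulty weights from the formula above.
WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}

def weighted_score(passed: dict, totals: dict) -> float:
    """Weighted pass rate: earned weight over maximum achievable weight."""
    earned = sum(passed[d] * w for d, w in WEIGHTS.items())
    max_weight = sum(totals[d] * w for d, w in WEIGHTS.items())
    return earned / max_weight

score = weighted_score({"easy": 15, "medium": 50, "hard": 35},
                       {"easy": 15, "medium": 51, "hard": 37})
# → 220 / 228 ≈ 0.965
```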

Live execution

Beyond criteria matching, generated code is executed against the real Earth Engine API. If it runs without errors and produces valid output, it passes.

Generate (LLM code) → Sandbox (isolated runtime) → Execute (EE API call) → Validate (check output)

Python Executor

Subprocess isolation with 120-second timeout. Authenticated via gcloud application default credentials.
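A sketch of subprocess isolation with a timeout, in the spirit of the executor described above; real Earth Engine authentication and output validation are omitted, and `run_snippet` is a hypothetical name:

```python
import subprocess
import sys

def run_snippet(code: str, timeout_s: int = 120):
    """Run generated code in a child interpreter; return (passed, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Running in a separate process means a crash, hang, or runaway loop in generated code cannot take down the benchmark harness itself.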

JavaScript Executor

Node.js VM sandbox with Code Editor API mocks. 60-second timeout. Service account authentication.

Sanity checks validate band value ranges, collection sizes, class distribution balance, and projection consistency.
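Illustrative sanity checks like those described above; the field names and thresholds are assumptions, not geeni's actual output schema:

```python
def sanity_check(output: dict) -> list:
    """Return a list of problems found in executed-code output (empty = pass)."""
    problems = []
    lo, hi = output.get("band_range", (0, 0))
    if not (-1.0 <= lo <= hi <= 1.0):        # e.g. NDVI must stay within [-1, 1]
        problems.append("band values out of expected range")
    if output.get("collection_size", 1) == 0:
        problems.append("empty image collection")
    shares = output.get("class_shares", [])
    if shares and max(shares) > 0.95:        # one class dominating the output
        problems.append("class distribution badly imbalanced")
    return problems
```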

Results dashboard

Iteration 45 benchmark snapshot. Both models exceed the 95% certification threshold on criteria matching.

Gemini Flash: 100/103 · Claude Haiku: 98/103 · EE Execution: 38/44

Difficulty breakdown (Gemini Flash)

Easy: 15/15 · Medium: 50/51 · Hard: 35/37
Certification threshold: 95% — both models certifiable

Coaching loop

geeni improves through a structured coaching cycle. Poor traces are identified, practiced against, and patched until the worker re-certifies.

1. Screen Traces: flag low-quality responses below thresholds
2. Start Practice: re-run failing fixtures in a practice session
3. Evaluate: score each response against criteria
4. Patch Prompts: generate targeted prompt improvements
5. Certify: verify the worker meets quality thresholds

Continuous coaching cycle
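The cycle can be sketched as a loop; `evaluate` and `patch_prompt` are hypothetical hooks for the real scorer and prompt patcher, and the 0.95 default mirrors the certification threshold:

```python
def coach(fixtures, answer, evaluate, patch_prompt, threshold=0.95, max_rounds=5):
    """Screen, practice, and patch until every fixture clears the threshold."""
    scores = {}
    for _ in range(max_rounds):
        scores = {fx["id"]: evaluate(answer(fx)) for fx in fixtures}  # screen traces
        failing = [fx for fx in fixtures if scores[fx["id"]] < threshold]
        if not failing:
            return True, scores            # certified
        for fx in failing:                 # practice session + prompt patch
            patch_prompt(fx)
    return False, scores                   # still below threshold after max_rounds
```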

Banned anti-patterns

  • .getInfo()
  • client-side for loops
  • deprecated datasets
  • missing scale param
  • ee.List.map()
  • reduceRegion without scale
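A simple scanner for a few of the bans above. Literal tokens use substring matching; the `reduceRegion`-without-scale check uses a rough regex that is my assumption (it ignores nested parentheses), not geeni's actual detector:

```python
import re

BANNED_TOKENS = [".getInfo()", "ee.List.map()"]
# A reduceRegion call with no "scale" before the first closing paren (rough).
NO_SCALE = re.compile(r"reduceRegion\([^)]*\)")

def find_violations(code: str) -> list:
    hits = [tok for tok in BANNED_TOKENS if tok in code]
    for m in NO_SCALE.finditer(code):
        if "scale" not in m.group(0):
            hits.append("reduceRegion without scale")
    return hits
```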

Iteration timeline

Early iterations

Basic prompt engineering and fixture creation. ~60% pass rate.

RAG integration

4,000+ knowledge chunks from docs and forums. Jump to ~85%.

Anti-pattern bans

9 absolute bans for common EE mistakes. Reached ~92%.

Cross-language

EEFA fixtures, JS executor, dual-language validation.

Certification

Both Gemini Flash (97.1%) and Claude Haiku (95.2%) certified.

RAG pipeline

Every question is enriched with relevant knowledge before reaching the LLM. TF-IDF matching retrieves the most relevant chunks from 4,000+ curated entries.

Input (user question) → Match (TF-IDF search) → Assemble (build prompt) → Generate (LLM response)
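A minimal, hand-rolled TF-IDF retrieval sketch with no external dependencies; the real pipeline's tokenization, weighting, and ranking details may well differ:

```python
import math
from collections import Counter

def tfidf_rank(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query by TF-IDF cosine."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))          # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}    # tf × idf weights

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(query.lower().split())
    return sorted(chunks, key=lambda c: cosine(q, vec(c.lower().split())),
                  reverse=True)[:k]
```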

Knowledge base

  • API documentation and developer guides
  • EEFA textbook chapters and examples
  • High-quality forum answers
  • 4,000+ curated chunks total
// Prompt architecture
system_prompt:      "You are an Earth Engine expert..."
band_reference:     Landsat 8: B1-B7 + indices
absolute_bans:      9 anti-patterns (getInfo, ...)
retrieved_context:  top-k TF-IDF matched chunks
user_question:      "How do I compute NDVI?"
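A hypothetical sketch of assembling those sections into the final prompt string; the section bodies are placeholders and the `##` delimiters are an assumption, not geeni's actual format:

```python
def build_prompt(question: str, context_chunks: list) -> str:
    """Join the prompt sections, in order, into one string for the LLM."""
    sections = [
        ("system_prompt", "You are an Earth Engine expert..."),
        ("band_reference", "Landsat 8: B1-B7 + indices"),
        ("absolute_bans", "Never use .getInfo(), client-side loops, ..."),
        ("retrieved_context", "\n".join(context_chunks)),
        ("user_question", question),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)
```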

12 validated prompt patches from coaching have been applied to improve responses on specific failure modes.

8 languages, one experience

geeni's entire interface is translated into 8 languages. Every page, every label, every test case description.

🇬🇧 English 🇪🇸 Español 🇨🇳 中文 🇫🇷 Français 🇰🇷 한국어 🇧🇷 Português 🇮🇳 हिन्दी 🇸🇦 العربية

How it works

  • data-i18n — attributes on every translatable element
  • i18n.js — client-side translation engine
  • ?lang= — URL parameter override
  • localStorage — persists language preference
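The lookup order above (URL parameter, then stored preference, then default) can be sketched as follows — in Python for consistency with the other examples, though the real engine is client-side JavaScript:

```python
from typing import Optional

# The 8 supported language codes (assumed ISO 639-1 forms).
SUPPORTED = {"en", "es", "zh", "fr", "ko", "pt", "hi", "ar"}

def resolve_lang(url_param: Optional[str], stored: Optional[str],
                 default: str = "en") -> str:
    """First valid candidate wins: ?lang= override, then saved preference."""
    for candidate in (url_param, stored, default):
        if candidate in SUPPORTED:
            return candidate
    return default
```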

What's next

geeni keeps evolving. Here's what's on the roadmap.

RTL layout support

Full right-to-left layout for Arabic and Hebrew speakers.

More test fixtures

Expand beyond 145 fixtures with community contributions and new EE API coverage.

Auto-regression tracking

Continuous benchmark runs that detect quality regressions before they reach users.

Agent ecosystem

Deeper integration with the grove agent ecosystem for multi-agent workflows.

Start a conversation

Now you know how geeni works under the hood. Try it yourself and see an Earth Engine expert in action.