How geeni was built

A transparent look at the data, benchmarks, and coaching loops behind an Earth Engine expert agent.

Test fixtures

Every answer geeni gives is measured against curated test cases. Two complementary sets cover the full range of Earth Engine tasks.

103 Golden Fixtures

Curated through iterative development and testing. 15 easy, 51 medium, and 37 hard tasks covering vegetation, floods, ML, SAR, time series, and more.

42 EEFA Fixtures

Auto-extracted from the EEFA textbook companion code. Real-world exercises from the definitive Earth Engine learning resource.

103 Golden · 42 EEFA · 145 Total
{
  "id": "gee-golden-042",
  "question": "Calculate NDVI for Sentinel-2...",
  "evaluation_criteria": {
    "must_use": ["normalizedDifference", "B8", "B4"],
    "must_not_use": [".getInfo()"],
    "difficulty": "medium",
    "topics": ["band_math", "vegetation", "sentinel2"]
  }
}

Fixtures span 30+ topic categories including band_math, machine_learning, SAR, flood_mapping, time_series, classification, cloud_masking, export, and many more.
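A minimal sketch of how a fixture's `must_use` / `must_not_use` criteria could be checked against generated code — simple substring matching, with the function name and inputs invented for illustration:

```python
# Hypothetical checker for the two simplest criteria kinds:
# every must_use token appears, no must_not_use token does.
def check_fixture(code: str, criteria: dict) -> bool:
    must_use = criteria.get("must_use", [])
    must_not_use = criteria.get("must_not_use", [])
    return (all(tok in code for tok in must_use)
            and not any(tok in code for tok in must_not_use))

fixture = {
    "must_use": ["normalizedDifference", "B8", "B4"],
    "must_not_use": [".getInfo()"],
}
good = "var ndvi = img.normalizedDifference(['B8', 'B4']);"
```

The same idea generalizes to the richer matching types described below.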

Evaluation pipeline

Each benchmark run follows a deterministic pipeline: load fixtures, submit to the LLM, parse the response, check every criterion, and compute a weighted score.

Load Fixtures → Submit to LLM → Parse Response → Check Criteria → Score
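The five steps can be sketched as a single loop; `submit_to_llm` and `check_criteria` are hypothetical stand-ins for the real components, and the response format is assumed to be JSON with a `code` field:

```python
import json

def run_benchmark(fixtures, submit_to_llm, check_criteria):
    """Deterministic pipeline: load, submit, parse, check, score."""
    results = []
    for fx in fixtures:                                  # 1. load fixtures
        raw = submit_to_llm(fx["question"])              # 2. submit to LLM
        code = json.loads(raw)["code"]                   # 3. parse response
        ok = check_criteria(code, fx["evaluation_criteria"])  # 4. check criteria
        results.append((fx["id"], ok))
    score = sum(ok for _, ok in results) / len(results)  # 5. score (unweighted)
    return results, score
```

The real scorer applies difficulty weights, as described below, rather than this flat average.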

Criteria matching types

  • must_use: token must appear in the code
  • must_not_use: banned token must not appear in the code
  • must_mention: concept must appear in the explanation
  • regex: pattern match against the code
  • AND / OR / NOT: composable logic over other criteria
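One way the composable criteria could be evaluated is as a small recursive tree; the node shapes here are illustrative, not geeni's actual fixture schema:

```python
import re

def eval_criterion(node, code, explanation=""):
    """Recursively evaluate a criterion tree against code and explanation."""
    kind = node["type"]
    if kind == "must_use":
        return node["token"] in code
    if kind == "must_not_use":
        return node["token"] not in code
    if kind == "must_mention":
        return node["concept"].lower() in explanation.lower()
    if kind == "regex":
        return re.search(node["pattern"], code) is not None
    if kind == "and":
        return all(eval_criterion(c, code, explanation) for c in node["children"])
    if kind == "or":
        return any(eval_criterion(c, code, explanation) for c in node["children"])
    if kind == "not":
        return not eval_criterion(node["child"], code, explanation)
    raise ValueError(f"unknown criterion type: {kind}")
```

Leaf matchers handle the token, mention, and regex cases; AND/OR/NOT nodes combine them.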

Weighted scoring formula

weighted = (easy × 1 + medium × 2 + hard × 3) / total_weight

// Example: 15/15 easy, 50/51 medium, and 35/37 hard passed
(15×1 + 50×2 + 35×3) / (15×1 + 51×2 + 37×3)
= 220 / 228 ≈ 96.5%
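The formula as a function — `passed` counts passing fixtures per difficulty, `totals` the fixture counts (15 easy, 51 medium, 37 hard):

```python
# Difficulty weights from the formula above.
WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}

def weighted_score(passed: dict, totals: dict) -> float:
    """Weighted pass rate: earned weight over maximum achievable weight."""
    earned = sum(passed[d] * w for d, w in WEIGHTS.items())
    max_weight = sum(totals[d] * w for d, w in WEIGHTS.items())
    return earned / max_weight

score = weighted_score({"easy": 15, "medium": 50, "hard": 35},
                       {"easy": 15, "medium": 51, "hard": 37})
# → 220 / 228 ≈ 0.965
```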

Live execution

Beyond criteria matching, generated code is executed against the real Earth Engine API. If it runs without errors and produces valid output, it passes.

Generate (LLM code) → Sandbox (isolated runtime) → Execute (EE API call) → Validate (check output)

Python Executor

Subprocess isolation with 120-second timeout. Authenticated via gcloud application default credentials.
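A sketch of subprocess isolation with a timeout, in the spirit of the executor described above; real Earth Engine authentication and output validation are omitted, and `run_snippet` is a hypothetical name:

```python
import subprocess
import sys

def run_snippet(code: str, timeout_s: int = 120):
    """Run generated code in a child interpreter; return (passed, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Running in a separate process means a crash, hang, or runaway loop in generated code cannot take down the benchmark harness itself.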

JavaScript Executor

Node.js VM sandbox with Code Editor API mocks. 60-second timeout. Service account authentication.

Sanity checks validate band value ranges, collection sizes, class distribution balance, and projection consistency.
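Illustrative sanity checks like those described above; the field names and thresholds are assumptions, not geeni's actual output schema:

```python
def sanity_check(output: dict) -> list:
    """Return a list of problems found in executed-code output (empty = pass)."""
    problems = []
    lo, hi = output.get("band_range", (0, 0))
    if not (-1.0 <= lo <= hi <= 1.0):        # e.g. NDVI must stay within [-1, 1]
        problems.append("band values out of expected range")
    if output.get("collection_size", 1) == 0:
        problems.append("empty image collection")
    shares = output.get("class_shares", [])
    if shares and max(shares) > 0.95:        # one class dominating the output
        problems.append("class distribution badly imbalanced")
    return problems
```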

Results dashboard

Iteration 45 benchmark snapshot. Both models exceed the 95% certification threshold on criteria matching.

Gemini Flash: 100/103 · Claude Haiku: 98/103 · EE Execution: 38/44

Difficulty breakdown (Gemini Flash)

Easy: 15/15 · Medium: 50/51 · Hard: 35/37
Certification threshold: 95% — both models certifiable

Coaching loop

geeni improves through a structured coaching cycle. Poor traces are identified, practiced against, and patched until the worker re-certifies.

1. Screen Traces: flag low-quality responses below thresholds
2. Start Practice: re-run failing fixtures in a practice session
3. Evaluate: score each response against criteria
4. Patch Prompts: generate targeted prompt improvements
5. Certify: verify the worker meets quality thresholds

Continuous coaching cycle
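The cycle can be sketched as a loop; `evaluate` and `patch_prompt` are hypothetical hooks for the real scorer and prompt patcher, and the 0.95 default mirrors the certification threshold:

```python
def coach(fixtures, answer, evaluate, patch_prompt, threshold=0.95, max_rounds=5):
    """Screen, practice, and patch until every fixture clears the threshold."""
    scores = {}
    for _ in range(max_rounds):
        scores = {fx["id"]: evaluate(answer(fx)) for fx in fixtures}  # screen traces
        failing = [fx for fx in fixtures if scores[fx["id"]] < threshold]
        if not failing:
            return True, scores            # certified
        for fx in failing:                 # practice session + prompt patch
            patch_prompt(fx)
    return False, scores                   # still below threshold after max_rounds
```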

Banned anti-patterns

  • .getInfo()
  • client-side for loops
  • deprecated datasets
  • missing scale param
  • ee.List.map()
  • reduceRegion without scale
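A simple scanner for a few of the bans above. Literal tokens use substring matching; the `reduceRegion`-without-scale check uses a rough regex that is my assumption (it ignores nested parentheses), not geeni's actual detector:

```python
import re

BANNED_TOKENS = [".getInfo()", "ee.List.map()"]
# A reduceRegion call with no "scale" before the first closing paren (rough).
NO_SCALE = re.compile(r"reduceRegion\([^)]*\)")

def find_violations(code: str) -> list:
    hits = [tok for tok in BANNED_TOKENS if tok in code]
    for m in NO_SCALE.finditer(code):
        if "scale" not in m.group(0):
            hits.append("reduceRegion without scale")
    return hits
```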

Iteration timeline

Early iterations

Basic prompt engineering and fixture creation. ~60% pass rate.

RAG integration

4,000+ knowledge chunks from docs and forums. Jump to ~85%.

Anti-pattern bans

9 absolute bans for common EE mistakes. Reached ~92%.

Cross-language

EEFA fixtures, JS executor, dual-language validation.

Certification

Both Gemini Flash (97.1%) and Claude Haiku (95.2%) certified.

RAG pipeline

Every question is enriched with relevant knowledge before reaching the LLM. TF-IDF matching retrieves the most relevant chunks from 4,000+ curated entries.

Input (user question) → Match (TF-IDF search) → Assemble (build prompt) → Generate (LLM response)
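A minimal, hand-rolled TF-IDF retrieval sketch with no external dependencies; the real pipeline's tokenization, weighting, and ranking details may well differ:

```python
import math
from collections import Counter

def tfidf_rank(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query by TF-IDF cosine."""
    docs = [c.lower().split() for c in chunks]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))          # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}    # tf × idf weights

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(query.lower().split())
    return sorted(chunks, key=lambda c: cosine(q, vec(c.lower().split())),
                  reverse=True)[:k]
```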

Knowledge base

  • API documentation and developer guides
  • EEFA textbook chapters and examples
  • High-quality forum answers
  • 4,000+ curated chunks total
// Prompt architecture
system_prompt:      "You are an Earth Engine expert..."
band_reference:     Landsat 8: B1-B7 + indices
absolute_bans:      9 anti-patterns (getInfo, ...)
retrieved_context:  top-k TF-IDF matched chunks
user_question:      "How do I compute NDVI?"
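A hypothetical sketch of assembling those sections into the final prompt string; the section bodies are placeholders and the `##` delimiters are an assumption, not geeni's actual format:

```python
def build_prompt(question: str, context_chunks: list) -> str:
    """Join the prompt sections, in order, into one string for the LLM."""
    sections = [
        ("system_prompt", "You are an Earth Engine expert..."),
        ("band_reference", "Landsat 8: B1-B7 + indices"),
        ("absolute_bans", "Never use .getInfo(), client-side loops, ..."),
        ("retrieved_context", "\n".join(context_chunks)),
        ("user_question", question),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)
```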

12 validated prompt patches from coaching have been applied to improve responses on specific failure modes.

8 languages, one experience

geeni's entire interface is translated into 8 languages. Every page, every label, every test case description.

🇬🇧 English 🇪🇸 Español 🇨🇳 中文 🇫🇷 Français 🇰🇷 한국어 🇧🇷 Português 🇮🇳 हिन्दी 🇸🇦 العربية

How it works

  • data-i18n — attributes on every translatable element
  • i18n.js — client-side translation engine
  • ?lang= — URL parameter override
  • localStorage — persists language preference
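The lookup order above (URL parameter, then stored preference, then default) can be sketched as follows — in Python for consistency with the other examples, though the real engine is client-side JavaScript:

```python
from typing import Optional

# The 8 supported language codes (assumed ISO 639-1 forms).
SUPPORTED = {"en", "es", "zh", "fr", "ko", "pt", "hi", "ar"}

def resolve_lang(url_param: Optional[str], stored: Optional[str],
                 default: str = "en") -> str:
    """First valid candidate wins: ?lang= override, then saved preference."""
    for candidate in (url_param, stored, default):
        if candidate in SUPPORTED:
            return candidate
    return default
```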

What's next

geeni keeps evolving. Here's what's on the roadmap.

RTL layout support

Full right-to-left layout for Arabic and Hebrew speakers.

More test fixtures

Expand beyond 145 fixtures with community contributions and new EE API coverage.

Auto-regression tracking

Continuous benchmark runs that detect quality regressions before they reach users.

Agent ecosystem

Deeper integration with the grove agent ecosystem for multi-agent workflows.

Start a conversation

Now you know how geeni works under the hood. Try it yourself and see an Earth Engine expert in action.