How geeni was built
A transparent look at the data, benchmarks, and coaching loops behind an Earth Engine expert agent.
Test fixtures
Every answer geeni gives is measured against curated test cases. Two complementary sets cover the full range of Earth Engine tasks.
103 Golden Fixtures
Curated through iterative development and testing. 15 easy, 51 medium, and 37 hard tasks covering vegetation, floods, ML, SAR, time series, and more.
42 EEFA Fixtures
Auto-extracted from the EEFA textbook companion code. Real-world exercises from the definitive Earth Engine learning resource.
{
  "id": "gee-golden-042",
  "question": "Calculate NDVI for Sentinel-2...",
  "evaluation_criteria": {
    "must_use": ["normalizedDifference", "B8", "B4"],
    "must_not_use": [".getInfo()"],
    "difficulty": "medium",
    "topics": ["band_math", "vegetation", "sentinel2"]
  }
}
Fixtures span 30+ topic categories including band_math, machine_learning, SAR, flood_mapping, time_series, classification, cloud_masking, export, and many more.
Evaluation pipeline
Each benchmark run follows a deterministic pipeline: load fixtures, submit to the LLM, parse the response, check every criterion, and compute a weighted score.
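The pipeline above can be sketched in a few lines. This is an illustrative outline, not geeni's actual harness: the fixture layout follows the JSON example earlier, the `llm` callable is a stand-in for the model under test, and only the two token-matching criterion types are shown.

```python
# Hypothetical sketch of the benchmark pipeline: load fixtures, submit
# to the LLM, check criteria, compute a weighted score.
import json
from pathlib import Path

WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}

def run_benchmark(fixture_dir, llm):
    """Score an LLM against every fixture JSON in a directory."""
    earned, possible = 0, 0
    for path in sorted(Path(fixture_dir).glob("*.json")):
        fixture = json.loads(path.read_text())
        criteria = fixture["evaluation_criteria"]
        weight = WEIGHTS[criteria["difficulty"]]
        answer = llm(fixture["question"])  # submit to the LLM
        # check every criterion (token-matching subset shown here)
        ok = all(tok in answer for tok in criteria.get("must_use", []))
        ok = ok and not any(tok in answer
                            for tok in criteria.get("must_not_use", []))
        earned += weight if ok else 0
        possible += weight
    return earned / possible
```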
Criteria matching types
- must_use — token must appear in code
- must_not_use — banned token
- must_mention — concept in explanation
- regex — pattern match
- AND / OR / NOT — composable logic
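A minimal sketch of how these matcher types can compose. The tuple encoding and the `check` function are assumptions for illustration, not geeni's actual evaluator:

```python
# Recursive criterion matcher: leaf checks plus AND / OR / NOT logic.
import re

def check(criterion, code, explanation=""):
    """Evaluate one (possibly nested) criterion against a response."""
    kind, arg = criterion[0], criterion[1]
    if kind == "must_use":        # token must appear in code
        return arg in code
    if kind == "must_not_use":    # banned token
        return arg not in code
    if kind == "must_mention":    # concept in explanation
        return arg.lower() in explanation.lower()
    if kind == "regex":           # pattern match
        return re.search(arg, code) is not None
    if kind == "AND":
        return all(check(c, code, explanation) for c in criterion[1:])
    if kind == "OR":
        return any(check(c, code, explanation) for c in criterion[1:])
    if kind == "NOT":
        return not check(arg, code, explanation)
    raise ValueError(f"unknown criterion: {kind}")
```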
Weighted scoring formula
// Easy ×1, medium ×2, hard ×3; numerator counts passed
// fixtures, denominator all 103 (15 easy, 51 medium, 37 hard)
(15×1 + 50×2 + 35×3) / (15×1 + 51×2 + 37×3)
= 220 / 228 = 96.5%
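The same arithmetic, written out as a function. The pass counts (15/15 easy, 50/51 medium, 35/37 hard) are read off the example above; the 1/2/3 weights come from the difficulty breakdown:

```python
# Weighted scoring: harder fixtures count more toward the final score.
WEIGHTS = {"easy": 1, "medium": 2, "hard": 3}

def weighted_score(passed, total):
    """passed/total map difficulty -> fixture counts; returns a percentage."""
    earned = sum(WEIGHTS[d] * n for d, n in passed.items())
    maximum = sum(WEIGHTS[d] * n for d, n in total.items())
    return 100 * earned / maximum

score = weighted_score(
    passed={"easy": 15, "medium": 50, "hard": 35},
    total={"easy": 15, "medium": 51, "hard": 37},
)
print(round(score, 1))  # 96.5
```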
Live execution
Beyond criteria matching, generated code is executed against the real Earth Engine API. If it runs without errors and produces valid output, it passes.
Python Executor
Subprocess isolation with 120-second timeout. Authenticated via gcloud application default credentials.
JavaScript Executor
Node.js VM sandbox with Code Editor API mocks. 60-second timeout. Service account authentication.
Sanity checks validate band value ranges, collection sizes, class distribution balance, and projection consistency.
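The Python-executor step can be sketched as below, assuming subprocess isolation and the 120-second timeout described above. The function name and result shape are illustrative, and real runs would additionally need Earth Engine authentication:

```python
# Run generated code in a fresh interpreter process; a non-zero exit
# code or a timeout counts as a failure.
import subprocess
import sys
import tempfile

def execute_generated_code(code: str, timeout: int = 120) -> dict:
    """Execute a code string in subprocess isolation and report the outcome."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],  # fresh interpreter = isolation
            capture_output=True, text=True, timeout=timeout,
        )
        return {"passed": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"passed": False, "stdout": "", "stderr": "timeout"}
```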
Results dashboard
Iteration 45 benchmark snapshot. Both models exceed the 95% certification threshold on criteria matching.
Difficulty breakdown (Gemini Flash)
Coaching loop
geeni improves through a structured coaching cycle. Poor traces are identified, practiced against, and patched until the worker re-certifies.
Screen Traces
Flag low-quality responses below thresholds
Start Practice
Re-run failing fixtures in a practice session
Evaluate
Score each response against criteria
Patch Prompts
Generate targeted prompt improvements
Certify
Verify worker meets quality thresholds
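The five stages above can be sketched as a loop. The `worker` interface (`score`, `answer`, `patch_prompt`) is a hypothetical stand-in for whatever geeni runs at each stage, and the 0.95 threshold mirrors the certification bar mentioned earlier:

```python
# Coaching cycle: screen -> practice -> evaluate -> patch -> certify.
def coaching_loop(worker, fixtures, threshold=0.95, max_rounds=10):
    """Repeat the cycle until the worker re-certifies or rounds run out."""
    for _ in range(max_rounds):
        # screen: flag fixtures where responses fall below the threshold
        failing = [f for f in fixtures if worker.score(f) < threshold]
        if not failing:
            return True                # certify: quality thresholds met
        for fixture in failing:        # practice: re-run failing fixtures
            response = worker.answer(fixture)
            grade = worker.score(fixture)  # evaluate against criteria
            if grade < threshold:
                # patch: generate a targeted prompt improvement
                worker.patch_prompt(fixture, response)
    return False                       # worker failed to re-certify
```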
Banned anti-patterns
Iteration timeline
Early iterations
Basic prompt engineering and fixture creation. ~60% pass rate.
RAG integration
4,000+ knowledge chunks from docs and forums. Jump to ~85%.
Anti-pattern bans
9 absolute bans for common EE mistakes. Reached ~92%.
Cross-language
EEFA fixtures, JS executor, dual-language validation.
Certification
Both Gemini Flash (97.1%) and Claude Haiku (95.2%) certified.
RAG pipeline
Every question is enriched with relevant knowledge before reaching the LLM. TF-IDF matching retrieves the most relevant chunks from 4,000+ curated entries.
Knowledge base
- API documentation and developer guides
- EEFA textbook chapters and examples
- High-quality forum answers
- 4,000+ curated chunks total
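TF-IDF retrieval as described above can be sketched in pure Python. The chunk texts here are invented examples (a real knowledge base holds 4,000+ entries), and geeni's actual retriever may differ in tokenization and weighting:

```python
# Rank knowledge chunks against a question by cosine similarity of
# TF-IDF vectors, returning the top-k matches.
import math
from collections import Counter

def tfidf_rank(query, chunks, k=3):
    """Return the k chunks most similar to the query under TF-IDF."""
    docs = [c.lower().split() for c in chunks + [query]]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    idf = {w: math.log(n / df[w]) for w in df}

    def vec(doc):
        tf = Counter(doc)
        return {w: tf[w] / len(doc) * idf[w] for w in tf}

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qv = vec(docs[-1])
    ranked = sorted(chunks, reverse=True,
                    key=lambda c: cosine(vec(c.lower().split()), qv))
    return ranked[:k]
```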
// Prompt architecture
system_prompt      "You are an Earth Engine expert..."
band_reference     Landsat 8: B1-B7 + indices
absolute_bans      9 anti-patterns (getInfo, ...)
retrieved_context  Top-k TF-IDF matched chunks
user_question      "How do I compute NDVI?"
12 validated prompt patches from coaching have been applied to improve responses on specific failure modes.
8 languages, one experience
geeni's entire interface is translated into 8 languages. Every page, every label, every test case description.
How it works
- data-i18n — attributes on every translatable element
- i18n.js — client-side translation engine
- ?lang= — URL parameter override
- localStorage — persists language preference
What's next
geeni keeps evolving. Here's what's on the roadmap.
RTL layout support
Full right-to-left layout for Arabic and Hebrew speakers.
More test fixtures
Expand beyond 145 fixtures with community contributions and new EE API coverage.
Auto-regression tracking
Continuous benchmark runs that detect quality regressions before they reach users.
Agent ecosystem
Deeper integration with the grove agent ecosystem for multi-agent workflows.
Start a conversation
Now you know how geeni works under the hood. Try it yourself and see an Earth Engine expert in action.