# Turning Unstructured Text into Organized Insights with Google’s LangExtract
- Revanth Reddy Tondapu
- Aug 4
- 4 min read

1. Introduction
Imagine you have a pile of messy text—maybe a play script, long medical notes, or hundreds of pages of research—and wish you could instantly know who’s speaking, how they feel, and who they’re connected to. LangExtract is a Python library that uses Google’s Gemini AI to transform unstructured text into neat, structured information in seconds. No manual tagging, no scrolling through endless paragraphs—just clean data you can act on.
2. Key Concepts Explained
What Is LangExtract?
LangExtract is a simple Python package that:
• Connects to a powerful AI model (Gemini) to read and analyze text.
• Identifies entities like characters, emotions, and relationships.
• Marks exactly where each piece of information comes from in the original text.
Why Structure Matters
• Ease of Use: Instead of hunting through paragraphs, you get clear entries such as:
  • Character: Romeo
  • Emotion: Gentle awe
  • Relationship: Juliet → Sun (metaphor)
• Traceability: Each extraction links back to the exact source location for verification.
• Speed: Processes thousands of words in seconds, even on a laptop.
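The traceability point is worth picturing. The sketch below uses a plain dataclass (an illustration only, not LangExtract's own data model) to show what one structured entry with a verifiable source span looks like:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    entry_class: str  # e.g. "character", "emotion", "relationship"
    text: str         # exact span copied from the source
    start: int        # character offset where the span begins
    end: int          # character offset where the span ends

source = "It is the east, and Juliet is the sun."
entry = Entry("relationship", "Juliet is the sun", 20, 37)

# Traceability: the offsets point back to the exact source location.
assert source[entry.start:entry.end] == entry.text
```

Because every entry carries its offsets, any downstream consumer can jump back to the original passage and verify the extraction instead of trusting it blindly.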
3. How It Works
Step 1: Install and Set Up
1. Install the library and its system dependency:

```bash
pip install langextract
brew install libmagic  # macOS only
```

2. Supply your Gemini API key as an environment variable:

```bash
export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"
```
Step 2: Define Your Extraction Task
Create a prompt explaining what you want to extract, then give a few examples so the AI understands your goal:
```python
import textwrap
import langextract as lx

prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extraction_text. Do not paraphrase or overlap spans.
    Provide meaningful attributes for each entity for context.""")

examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? "
             "It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO",
                               attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!",
                               attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun",
                               attributes={"type": "metaphor"}),
        ],
    )
]
```
Step 3: Run the Extraction
Supply your text and let LangExtract do the rest:
```python
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
```
Step 4: Save and Visualize
Store the AI’s output in a JSONL file and generate an interactive HTML report:
```python
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl")

html = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    f.write(html)
```
Open visualization.html in your browser to see highlights of characters, emotions, and relationships in the original text.
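The JSONL output is also easy to post-process yourself: each line is one annotated document. Assuming a record shape like the one sketched below (the field names are what the saved files typically contain, but check your own output), the standard `json` module is all you need:

```python
import json

# One (assumed) JSONL record: a document with its extractions, each
# carrying the exact text span and its character offsets.
line = json.dumps({
    "extractions": [
        {"extraction_class": "character",
         "extraction_text": "ROMEO",
         "char_interval": {"start_pos": 0, "end_pos": 5},
         "attributes": {"emotional_state": "wonder"}}
    ]
})

doc = json.loads(line)
for ex in doc["extractions"]:
    span = ex["char_interval"]
    print(f'{ex["extraction_class"]}: "{ex["extraction_text"]}" '
          f'(chars {span["start_pos"]}-{span["end_pos"]})')
```

This is handy when you want the extractions in a spreadsheet or database rather than (or in addition to) the HTML report.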
Step 5: Scale Up to Long Documents
Process entire plays or reports—25,000+ words—with parallel workers:
```python
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # re-read the text to catch entities missed on earlier passes
    max_workers=20,        # process chunks in parallel
    max_char_buffer=1000,  # approximate chunk size, in characters
)
```
In minutes, you’ll have structured data for hundreds of entities and an HTML visualization that handles large results seamlessly.
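The `max_char_buffer` setting controls roughly how much text each worker sees at once. The sketch below is not LangExtract's actual chunker, just a naive illustration of the idea: split on whitespace so no word is cut in half and no chunk exceeds the limit.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Naive chunker: greedily pack whole words into chunks of at most
    max_chars characters. Illustrative only."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space
        if length + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

pieces = chunk_text("word " * 5000, max_chars=1000)
assert all(len(p) <= 1000 for p in pieces)
```

Smaller chunks keep each model call well inside the context window and improve span alignment; the trade-off is more calls, which is why raising `max_workers` matters for long documents.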
4. Advanced Analysis: Full Play Extraction & Insights
When you’re ready to go beyond basic extraction, you can process an entire play script, then summarize character mentions and entity distributions:
```python
from collections import Counter
import textwrap

import langextract as lx

# Advanced prompt with detailed instructions
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships from the given text.
    Provide attributes for every entity. Use exact text spans; no overlaps.
    In play scripts, speaker names appear in ALL-CAPS followed by a period.""")

# Few-shot examples for clarity
examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""\
            ROMEO. But soft! What light through yonder window breaks?
            It is the east, and Juliet is the sun.
            JULIET. O Romeo, Romeo! Wherefore art thou Romeo?"""),
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO",
                               attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!",
                               attributes={"feeling": "gentle awe", "character": "Romeo"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun",
                               attributes={"type": "metaphor", "character_1": "Romeo",
                                           "character_2": "Juliet"}),
            lx.data.Extraction(extraction_class="character", extraction_text="JULIET",
                               attributes={"emotional_state": "yearning"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="Wherefore art thou Romeo?",
                               attributes={"feeling": "longing question", "character": "Juliet"}),
        ],
    )
]

# Extract from a Project Gutenberg URL
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,
    max_workers=20,
    max_char_buffer=1000,
)

# Save and visualize
lx.io.save_annotated_documents([result], output_name="romeo_juliet_extractions.jsonl")
with open("romeo_juliet_visualization.html", "w") as f:
    f.write(lx.visualize("romeo_juliet_extractions.jsonl"))

# Summarize the most frequently mentioned characters
char_counts = Counter(
    e.extraction_text for e in result.extractions if e.extraction_class == "character"
)
print("Top 10 Characters by Mentions:")
for char, count in char_counts.most_common(10):
    print(f"{char}: {count} mentions")

# Entity type distribution
entity_counts = Counter(e.extraction_class for e in result.extractions)
print("\nEntity Type Breakdown:")
for et, cnt in entity_counts.items():
    pct = cnt / len(result.extractions) * 100
    print(f"{et}: {cnt} ({pct:.1f}%)")
```
• Character Summary: See which characters appear most frequently.
• Entity Breakdown: Understand what portion of extractions are emotions vs. relationships vs. characters.
• Interactive Exploration: The HTML file lets you click on any extraction to see its context in the play.
5. Tips & Takeaways
• Provide Clear Examples: High-quality few-shot examples greatly improve accuracy.
• Use Parallel Processing: For large texts, increase `max_workers` to speed up extraction.
• Customize Prompts: Tailor prompts to your domain—medical terms, legal clauses, or literary nuances.
• Review Interactively: HTML visualizations make it easy to verify and explore results.
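As an example of prompt customization, a clinical variant might target medications instead of literary characters. Everything below is illustrative: the entity classes and attribute names are assumptions you would adapt to your own data, not a schema LangExtract prescribes.

```python
import textwrap

# Hypothetical domain-specific prompt: medication extraction.
clinical_prompt = textwrap.dedent("""\
    Extract medications, dosages, and administration routes.
    Use exact text spans from the note; do not paraphrase.
    Attach attributes such as frequency where the note states them.""")

# A matching few-shot example, expressed as plain data for clarity.
clinical_example = {
    "text": "Patient was given 250 mg IV Cefazolin TID for infection.",
    "extractions": [
        {"class": "medication", "text": "Cefazolin",
         "attributes": {"dosage": "250 mg", "route": "IV", "frequency": "TID"}},
    ],
}

assert clinical_example["extractions"][0]["text"] in clinical_example["text"]
```

The pattern is the same as the literary examples above: state the entity classes, insist on exact spans, and demonstrate the attributes you expect; only the domain vocabulary changes.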
6. Summary
LangExtract makes transforming unstructured text into structured, actionable insights effortless. With just a few lines of Python and clear prompts, you’ll extract entities and analyze large documents, gaining data you can trust and explore interactively. Whether you’re working on literature analysis, clinical reports, legal documents, or research papers, LangExtract delivers clarity and speed—so you can focus on the insights that matter.