Setup & Understanding Your Corpus

Initialize Literature Mapper, process PDFs, and explore the database structure.

Prerequisites

Requirement	Details
Python	3.10 or newer
API Key	`GEMINI_API_KEY` environment variable
Corpus	A folder containing PDF files
Install	`pip install literature-mapper`
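The prerequisites above can be verified before a run. The following is a minimal sketch using only the standard library; `check_prerequisites` is a hypothetical helper, not part of Literature Mapper.

```python
import os
import sys
from pathlib import Path


def check_prerequisites(corpus_path, env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration, not part of Literature Mapper.
    `env` and `version` can be overridden to make the check testable.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []

    # Python 3.10 or newer is required
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")

    # The API key must be present and non-empty
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")

    # The corpus must be a folder containing at least one PDF
    corpus = Path(corpus_path)
    if not corpus.is_dir():
        problems.append(f"Corpus folder not found: {corpus}")
    elif not any(p.suffix.lower() == ".pdf" for p in corpus.rglob("*") if p.is_file()):
        problems.append(f"No PDF files found under {corpus}")

    return problems
```

Calling `check_prerequisites("./my_research")` and printing the returned list gives a quick readiness report before any API calls are made.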

Cost Estimate

Processing costs approximately $0.50 USD per 50 papers via the Gemini API, or about $0.01 per paper.
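To budget for a larger corpus, the quoted figure can be extrapolated linearly. This is a rough sketch only: it assumes cost scales with paper count, while actual cost depends on paper length and current model pricing. `estimate_cost_usd` is a hypothetical helper.

```python
COST_PER_BATCH_USD = 0.50  # figure quoted in this guide
BATCH_SIZE = 50            # number of papers that figure refers to


def estimate_cost_usd(n_papers):
    """Linear extrapolation of the quoted estimate (an assumption, not a quote)."""
    if n_papers < 0:
        raise ValueError("n_papers must be non-negative")
    return round(n_papers * COST_PER_BATCH_USD / BATCH_SIZE, 2)


for n in (50, 200, 1000):
    print(f"{n:>5} papers: about ${estimate_cost_usd(n):.2f}")
```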

Initialization

import os
from pathlib import Path
from literature_mapper import LiteratureMapper

# Verify API key is set
if not os.getenv('GEMINI_API_KEY'):
    raise EnvironmentError("GEMINI_API_KEY not found")

# Define your corpus location
CORPUS_PATH = Path("./my_research")

# Initialize (creates corpus.db if needed)
mapper = LiteratureMapper(
    corpus_path=str(CORPUS_PATH),
    model_name="gemini-3-flash-preview"
)

Processing PDFs

Processing is incremental: previously processed PDFs are skipped automatically.

result = mapper.process_new_papers(recursive=True)

print(f"Processed: {result.processed}")
print(f"Skipped (already in DB): {result.skipped}")
print(f"Failed: {result.failed}")
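To illustrate what incremental processing means, here is a minimal sketch of one way to separate new files from already-seen ones. This guide does not say how Literature Mapper identifies processed PDFs. The content-hash approach and the names `file_digest` and `split_new_and_seen` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file contents, so a renamed copy still counts as seen."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def split_new_and_seen(corpus_path, seen_digests, recursive=True):
    """Partition PDFs under corpus_path into (new, skipped) lists of paths.

    `seen_digests` is a set of digests for files processed in earlier runs.
    """
    root = Path(corpus_path)
    candidates = root.rglob("*.pdf") if recursive else root.glob("*.pdf")
    new, skipped = [], []
    for pdf in sorted(candidates):
        (skipped if file_digest(pdf) in seen_digests else new).append(pdf)
    return new, skipped
```

Hashing contents rather than comparing file names means moving or renaming a PDF does not trigger reprocessing, at the cost of reading every file once per run.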

Common Issues

Scanned PDFs without OCR → No extractable text
Password-protected files → Extraction fails
Corrupted PDF structure → Partial or no data
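Some of the issues listed above can be caught before spending API calls. The sketch below is a rough, standard-library-only heuristic based on raw byte markers; `quick_pdf_check` is a hypothetical helper, not part of Literature Mapper. It cannot detect scanned PDFs without OCR, which requires actually attempting text extraction with a PDF library.

```python
from pathlib import Path


def quick_pdf_check(path):
    """Return a list of warnings for one file (heuristic, may miss cases)."""
    data = Path(path).read_bytes()
    warnings = []

    # A well-formed PDF starts with a "%PDF-" version header
    if not data.startswith(b"%PDF-"):
        warnings.append("missing %PDF- header (corrupted or not a PDF)")

    # ...and ends with an "%%EOF" marker near the end of the file
    if b"%%EOF" not in data[-1024:]:
        warnings.append("missing %%EOF trailer (file may be truncated)")

    # Encrypted documents carry an /Encrypt entry
    if b"/Encrypt" in data:
        warnings.append("contains /Encrypt (likely password-protected)")

    return warnings
```

An empty list does not guarantee that extraction will succeed; it only rules out the most obvious structural problems.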

CLI Equivalent

literature-mapper process ./my_research --recursive

Database Schema

Literature Mapper stores everything in a SQLite database (`corpus.db`). Key tables:

Table	Description	Key Columns
`papers`	Core metadata	`id`, `title`, `year`, `core_argument`, `methodology`
`authors`	Unique author names	`id`, `name`, `canonical_name`
`concepts`	Extracted key terms	`id`, `name`, `canonical_name`
`paper_authors`	Many-to-many link	`paper_id`, `author_id`
`paper_concepts`	Many-to-many link	`paper_id`, `concept_id`
`kg_nodes`	Knowledge graph nodes	`id`, `type`, `label`, `vector` (embedding)
`kg_edges`	Relationships	`source_id`, `target_id`, `relation`
`citations`	Citation data from OpenAlex	`paper_id`, `cited_doi`, `cited_title`
`intellectual_edges`	Genealogy relationships	`source_paper_id`, `target_paper_id`, `relation_type`
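Because the store is plain SQLite, the link tables can be queried directly with Python's built-in `sqlite3` module. The sketch below builds a small in-memory stand-in using only table and column names from the table above. The column types and sample rows are invented for illustration, and the real `corpus.db` may contain additional columns and constraints.

```python
import sqlite3

# In-memory stand-in for corpus.db; swap in the real file path to query a corpus
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paper_authors (paper_id INTEGER, author_id INTEGER);
    INSERT INTO papers VALUES (1, 'Example Paper A', 2020), (2, 'Example Paper B', 2022);
    INSERT INTO authors VALUES (1, 'Author One'), (2, 'Author Two');
    INSERT INTO paper_authors VALUES (1, 1), (1, 2), (2, 2);
""")

# Resolve the many-to-many link: one row per paper with its authors joined
rows = conn.execute("""
    SELECT p.title, p.year, GROUP_CONCAT(a.name, '; ') AS authors
    FROM papers AS p
    JOIN paper_authors AS pa ON pa.paper_id = p.id
    JOIN authors AS a ON a.id = pa.author_id
    GROUP BY p.id
    ORDER BY p.year
""").fetchall()

for title, year, authors in rows:
    print(f"{year}  {title}  ({authors})")
```

The same join pattern applies to `paper_concepts` and `concepts`.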

Corpus Statistics

stats = mapper.get_statistics()

print(f"Papers:   {stats.total_papers}")
print(f"Authors:  {stats.total_authors}")
print(f"Concepts: {stats.total_concepts}")

Viewing All Papers

import pandas as pd

papers_df = mapper.get_all_analyses()
papers_df[['title', 'year', 'authors', 'journal']].head()

Knowledge Graph Structure

The KG contains typed nodes with semantic edges:

Node Type	Description
`paper`	The paper itself
`author`	Paper authors
`finding`	Key results or claims
`method`	Research methods
`concept`	Important terms
`limitation`	Acknowledged weaknesses
`hypothesis`	Proposed theories
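As a mental model, the typed-node structure above can be sketched with plain dataclasses. This is an illustration only and not how Literature Mapper represents the graph internally. The `Node` and `Edge` classes, the sample labels, and the relation names (`uses`, `reports`, `acknowledges`) are all invented for this example.

```python
from collections import Counter
from dataclasses import dataclass

# Node types listed in the table above
NODE_TYPES = {"paper", "author", "finding", "method", "concept", "limitation", "hypothesis"}


@dataclass(frozen=True)
class Node:
    id: int
    type: str
    label: str

    def __post_init__(self):
        # Reject anything outside the documented node types
        if self.type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.type}")


@dataclass(frozen=True)
class Edge:
    source_id: int
    target_id: int
    relation: str  # relation names used below are hypothetical


nodes = [
    Node(1, "paper", "Example Paper A"),
    Node(2, "method", "survey"),
    Node(3, "finding", "example finding"),
    Node(4, "limitation", "small sample"),
]
edges = [Edge(1, 2, "uses"), Edge(1, 3, "reports"), Edge(1, 4, "acknowledges")]

# Count nodes per type
type_counts = Counter(node.type for node in nodes)
print(type_counts)
```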

Inspecting Node Distribution

from literature_mapper.database import get_db_session, KGNode, KGEdge
from sqlalchemy import func

with get_db_session(CORPUS_PATH) as session:
    node_counts = (
        session.query(KGNode.type, func.count(KGNode.id))
        .group_by(KGNode.type)
        .order_by(func.count(KGNode.id).desc())
        .all()
    )
    edge_count = session.query(KGEdge).count()

print("Node Types:")
for node_type, count in node_counts:
    print(f"  {node_type:15s} {count:>5}")
print(f"\nTotal edges: {edge_count:,}")

Temporal Distribution

Visualize when papers in your corpus were published:

import matplotlib.pyplot as plt
from literature_mapper.analysis import CorpusAnalyzer

analyzer = CorpusAnalyzer(CORPUS_PATH)
year_dist = analyzer.get_year_distribution()

if not year_dist.empty:
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.bar(year_dist['year'], year_dist['count'], color='steelblue')
    ax.set_xlabel('Year')
    ax.set_ylabel('Papers')
    ax.set_title('Publication Timeline')
    plt.show()
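Where matplotlib is unavailable (for example, in a terminal session), a similar overview can be printed as text. The sketch below does not use `CorpusAnalyzer`. It works on a plain list of publication years, and both helper functions and the sample data are hypothetical.

```python
from collections import Counter


def year_distribution(years):
    """Return (year, count) pairs sorted by year, ignoring missing values."""
    counts = Counter(y for y in years if y is not None)
    return sorted(counts.items())


def text_histogram(distribution, width=40):
    """Render one bar per year, scaled so the largest count spans `width` characters."""
    if not distribution:
        return ""
    peak = max(count for _, count in distribution)
    lines = []
    for year, count in distribution:
        bar = "#" * max(1, round(width * count / peak))
        lines.append(f"{year}  {bar} {count}")
    return "\n".join(lines)


sample_years = [2019, 2021, 2021, None, 2023, 2021, 2019]  # made-up data
print(text_histogram(year_distribution(sample_years)))
```

Missing years are dropped rather than counted, since papers without a detected publication year would otherwise distort the timeline.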