scitex.scholar API Reference

SciTeX Scholar – scientific paper search, enrichment, and management.

Quick Start:

from scitex_scholar import Scholar, Paper, Papers

scholar = Scholar()
papers = scholar.search("deep learning")
papers.save("results.bib")

Installation:

pip install scitex-scholar

This module uses a PEP 562 module-level __getattr__ so that import scitex_scholar stays under 500 ms at cold start. Submodules are imported only on first attribute access.
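The lazy-loading idea can be sketched with the standard library alone. This is a minimal illustration of the PEP 562 mechanism, not the package's actual code: the module name lazy_pkg, the helper make_lazy_module, and the name mapping are hypothetical.

```python
import importlib
import types

# Hypothetical mapping: public name -> (module that defines it, attribute name).
LAZY_NAMES = {"dumps": ("json", "dumps"), "sqrt": ("math", "sqrt")}


def make_lazy_module(name, lazy_map):
    """Build a module whose attributes are imported on first access."""
    mod = types.ModuleType(name)

    def _lazy_getattr(attr):
        # Called only when normal attribute lookup on the module fails.
        try:
            target_module, target_attr = lazy_map[attr]
        except KeyError:
            raise AttributeError(
                f"module {name!r} has no attribute {attr!r}"
            ) from None
        value = getattr(importlib.import_module(target_module), target_attr)
        setattr(mod, attr, value)  # cache, so later lookups skip this hook
        return value

    # PEP 562: a function named __getattr__ in the module namespace
    # is used as the fallback for missing attributes.
    mod.__getattr__ = _lazy_getattr
    return mod


lazy_pkg = make_lazy_module("lazy_pkg", LAZY_NAMES)
```

Nothing is resolved until an attribute such as lazy_pkg.dumps is first read; after that the value is cached in the module namespace.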

class scitex.scholar.Scholar(config=None, project=None, project_description=None, browser_mode=None)[source]

Bases: EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin

Main interface for SciTeX Scholar - scientific literature management made simple.

By default, papers are automatically enriched with:

Journal impact factors from impact_factor package (2024 JCR data)
Citation counts from Semantic Scholar (via DOI/title matching)

Examples

Basic search with automatic enrichment:

scholar = Scholar()
papers = scholar.search("deep learning neuroscience")
# Papers now have impact_factor and citation_count populated
papers.save("my_pac.bib")

Disable automatic enrichment if needed:

config = ScholarConfig(enable_auto_enrich=False)
scholar = Scholar(config=config)

Search a specific source:

papers = scholar.search("transformer models", sources='arxiv')

Advanced workflow:

papers = (
    scholar.search("transformer models", year_min=2020)
           .filter(min_citations=50)
           .sort_by("impact_factor")
)
# save() returns None, so call it separately to keep the collection
papers.save("transformers.bib")

Local library:

scholar._index_local_pdfs("./my_papers")
local_papers = scholar.search_local("attention mechanism")

property name: Class name for logging.

__init__(config=None, project=None, project_description=None, browser_mode=None)[source]

Initialize Scholar with configuration.

Parameters:

config (Union[ScholarConfig, str, Path, None]) –
One of:
- ScholarConfig instance
- Path to YAML config file (str or Path)
- None (uses ScholarConfig.load() to find config)
project (Optional[str]) – Default project name for operations.
project_description (Optional[str]) – Optional description for the project.
browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').

class scitex.scholar.Paper(**data)[source]

Bases: BaseModel

Complete paper with metadata and container.

metadata: PaperMetadataStructure

container: ContainerMetadata

model_config: ClassVar[ConfigDict] = {'populate_by_name': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_dump(**kwargs)[source]

Custom serialization to ensure all nested models use aliases.

Return type:: Dict[str, Any]

classmethod from_dict(data)[source]

Create from dictionary (for loading from JSON).

Uses Pydantic's model_validate, which handles:
- Type validation
- Type coercion (e.g., "2024" -> 2024)
- Field aliases (e.g., "2025" -> y2025)

Return type:: Paper

to_dict()[source]

Convert to dictionary for JSON serialization.

Alias for model_dump() for backward compatibility.

Return type:: Dict[str, Any]

detect_open_access(use_unpaywall=False, update_metadata=True)[source]

Detect open access status for this paper.

Uses identifiers (DOI, arXiv ID, PMCID) and known OA sources to determine if the paper is freely available.

Parameters:

use_unpaywall (bool) – If True, query Unpaywall API for uncertain cases
update_metadata (bool) – If True, update self.metadata.access with results

Return type:

OAResult

Returns:

OAResult with detection results

property is_open_access: bool: Check if paper is open access (quick check without API calls).

class scitex.scholar.Papers(papers=None, project=None, config=None)[source]

Bases: object

A simple collection of Paper objects.

This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.

Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.

__init__(papers=None, project=None, config=None)[source]

Initialize Papers collection.

Parameters:

papers (Union[List[Paper], List[Dict], None]) – List of Paper objects or dicts to convert to Papers
project (Optional[str]) – Project name for organizing papers
config (Optional[ScholarConfig]) – Scholar configuration

__len__()[source]

Number of papers in collection.

Return type:: int

__iter__()[source]

Iterate over papers.

Return type:: Iterator[Paper]

__getitem__(index)[source]

Get paper(s) by index or slice.

Parameters:: index (Union[int, slice]) – Integer index or slice
Return type:: Union[Paper, Papers]
Returns:: Single Paper if integer index, Papers collection if slice
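The int-versus-slice behaviour described above can be sketched as follows. MiniPapers is a hypothetical stand-in written for illustration; it is not the Papers implementation.

```python
class MiniPapers:
    """Hypothetical minimal collection mirroring the indexing contract."""

    def __init__(self, papers=None):
        self._papers = list(papers or [])

    def __len__(self):
        return len(self._papers)

    def __iter__(self):
        return iter(self._papers)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice yields a new collection of the same type.
            return MiniPapers(self._papers[index])
        # An integer yields the single stored item.
        return self._papers[index]


collection = MiniPapers(["paper-a", "paper-b", "paper-c"])
first = collection[0]
tail = collection[1:]
```

Returning the collection type from a slice keeps chained calls available on the result, while integer access hands back the element itself.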

__repr__()[source]

String representation.

Return type:: str

__str__()[source]

Human-readable string.

Return type:: str

__dir__()[source]

Custom dir for better discoverability.

Return type:: List[str]

property papers: List[Paper]: Get the underlying papers list.

append(paper)[source]

Add a paper to the collection.

Parameters:: paper (Paper) – Paper to add
Return type:: None

extend(papers)[source]

Add multiple papers to the collection.

Parameters:: papers (Union[List[Paper], Papers]) – List of papers or another Papers collection
Return type:: None

to_list()[source]

Get papers as a list.

Return type:: List[Paper]
Returns:: List of Paper objects

filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)[source]

Filter papers by condition or criteria.

Parameters:

condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.
year_min (Optional[int]) – Minimum year.
year_max (Optional[int]) – Maximum year.
has_doi (Optional[bool]) – Filter papers with/without DOI.
has_abstract (Optional[bool]) – Filter papers with/without abstract.
has_pdf (Optional[bool]) – Filter papers with/without PDF URL.
min_citations (Optional[int]) – Minimum citation count.
max_citations (Optional[int]) – Maximum citation count.
min_impact_factor (Optional[float]) – Minimum journal impact factor.
max_impact_factor (Optional[float]) – Maximum journal impact factor.
journal (Optional[str]) – Journal name (partial match).
author (Optional[str]) – Author name (partial match).
keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).
publisher (Optional[str]) – Publisher name (partial match).
**kwargs – Additional keyword arguments for backward compatibility.

Returns:

New Papers collection with filtered papers.

Return type:

Papers

Examples

Filter using a lambda condition:

high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
recent = papers.filter(lambda p: p.year and p.year >= 2020)

Filter using built-in parameters:

high_impact_v2 = papers.filter(min_impact_factor=10.0)
highly_cited_v2 = papers.filter(min_citations=500)
recent_v2 = papers.filter(year_min=2020)

Combine multiple parameters:

filtered = papers.filter(
    min_impact_factor=5.0,
    min_citations=100,
    year_min=2015,
    year_max=2023,
    journal="Nature",
    has_doi=True,
)

Chain filters for AND logic:

elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)
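How several criteria combine can be sketched on plain dicts. The function below is a hypothetical illustration covering only a few of the parameters; it assumes that a paper with a missing value fails any criterion on that field, which the reference above does not state.

```python
def filter_papers(papers, condition=None, year_min=None, year_max=None,
                  min_citations=None, journal=None):
    """Keep papers that satisfy every supplied criterion (AND logic)."""

    def keep(paper):
        if condition is not None and not condition(paper):
            return False
        year = paper.get("year")
        if year_min is not None and (year is None or year < year_min):
            return False
        if year_max is not None and (year is None or year > year_max):
            return False
        citations = paper.get("citation_count")
        if min_citations is not None and (citations is None or citations < min_citations):
            return False
        if journal is not None:
            # Case-insensitive partial match on the journal name.
            if journal.lower() not in (paper.get("journal") or "").lower():
                return False
        return True

    return [paper for paper in papers if keep(paper)]


sample = [
    {"title": "A", "year": 2021, "citation_count": 120, "journal": "Nature Methods"},
    {"title": "B", "year": 2018, "citation_count": 900, "journal": "Nature"},
    {"title": "C", "year": 2022, "citation_count": None, "journal": "PLOS ONE"},
]
recent_cited = filter_papers(sample, year_min=2020, min_citations=100)
```

Because every criterion must hold, passing two parameters in one call gives the same result as chaining two single-parameter calls.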

sort_by(*criteria, reverse=False, **kwargs)[source]

Sort papers by criteria.

Parameters:

*criteria – Field names (as strings) or lambda functions to sort by.
reverse (bool) – Sort in descending order (default: False).
**kwargs – Additional options.

Returns:

New sorted Papers collection.

Return type:

Papers

Notes

Available Paper fields for sorting:

title – Paper title
year – Publication year
citation_count – Number of citations
journal_impact_factor – Journal impact factor
journal – Journal name
publisher – Publisher name
doi – Digital Object Identifier
created_at – When record was created
updated_at – When record was last updated

Examples

Sort by a single field:

by_year = papers.sort_by('year')
by_citations_desc = papers.sort_by('citation_count', reverse=True)

Sort by multiple fields (primary, secondary, etc.):

by_year_then_citations = papers.sort_by('year', 'citation_count')

Sort using a lambda function:

by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)

Sort by a computed value:

by_citation_per_year = papers.sort_by(
    # max(1, ...) avoids division by zero for papers published in 2024 or later
    lambda p: (p.citation_count or 0) / max(1, 2024 - p.year) if p.year else 0,
    reverse=True,
)
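Multi-criteria sorting with missing values can be sketched on plain dicts. The helper below is hypothetical; in particular, placing missing values last in ascending order is an assumption made for the example, not documented behaviour.

```python
def sort_papers(papers, *criteria, reverse=False):
    """Sort dicts by field names or callables, in priority order."""

    def key(paper):
        parts = []
        for criterion in criteria:
            value = criterion(paper) if callable(criterion) else paper.get(criterion)
            # The leading flag groups missing values together, so None is
            # never compared against a real value.
            parts.append((value is None, 0 if value is None else value))
        return tuple(parts)

    # With reverse=True the whole key is reversed, so missing values come first.
    return sorted(papers, key=key, reverse=reverse)


sample = [
    {"title": "A", "year": 2020, "citation_count": 5},
    {"title": "B", "year": 2019, "citation_count": 50},
    {"title": "C", "year": None, "citation_count": 7},
    {"title": "D", "year": 2020, "citation_count": 1},
]
by_year_then_citations = sort_papers(sample, "year", "citation_count")
```

The first criterion decides the order and later criteria only break ties, which is why the two 2020 papers are ordered by citation count.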

classmethod from_bibtex(bibtex_input)[source]

Load papers from BibTeX.

DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.

Parameters:: bibtex_input (Union[str, Path]) – Path to BibTeX file or BibTeX string
Return type:: Papers
Returns:: Papers collection

save(output_path, format='auto', **kwargs)[source]

Save papers to file.

DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.

Parameters:

output_path (Union[str, Path]) – Path to save file
format (Optional[str]) – Output format (auto, bibtex, json, csv)
**kwargs – Additional options

Return type:

None

to_dict()[source]

Convert to dictionary.

DEPRECATED: Use papers_utils.papers_to_dict() for new code.

Return type:: List[Dict[str, Any]]
Returns:: Dictionary representation

to_dataframe()[source]

Convert to pandas DataFrame.

DEPRECATED: Use papers_utils.papers_to_dataframe() for new code.

Return type:: Any
Returns:: DataFrame with papers data

summary()[source]

Get summary statistics.

DEPRECATED: Use papers_utils.papers_statistics() for new code.

Return type:: Dict[str, Any]
Returns:: Dictionary with statistics

class scitex.scholar.ScholarConfig(config_path=None, scholar_dir=None)[source]

Bases: object

__init__(config_path=None, scholar_dir=None)[source]

Initialize ScholarConfig.

Parameters:

config_path (Union[str, Path, None]) – Path to custom config YAML file
scholar_dir (Union[str, Path, None]) – Direct path to the scholar directory (e.g., /data/users/alice/.scitex). This bypasses the SCITEX_DIR env var for thread-safe multi-user usage. Use this in Django/multi-user environments to avoid race conditions.

__getattr__(name)[source]: Delegate all get_* methods to path_manager.

__dir__()[source]: Include path_manager’s get_* methods in dir() output.

resolve(key, direct_val=None, default=None, type=<class 'str'>, mask=None)[source]: Resolve configuration value with precedence: direct → config → env → default
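The precedence direct → config → env → default can be sketched as below. This is an illustration only: the function resolve_value, the DEMO_ environment prefix, the cast parameter, and the returned source label are all hypothetical.

```python
import os


def resolve_value(key, direct_val=None, config=None, env=None,
                  default=None, cast=str, prefix="DEMO_"):
    """Return (value, source) using direct -> config -> env -> default."""
    config = config or {}
    env = os.environ if env is None else env
    env_key = prefix + key.upper()  # hypothetical naming scheme

    if direct_val is not None:
        value, source = direct_val, "direct"
    elif key in config:
        value, source = config[key], "config"
    elif env_key in env:
        value, source = env[env_key], "env"
    else:
        value, source = default, "default"

    # Environment values arrive as strings, so cast to the requested type.
    return (cast(value) if value is not None else None), source


timeout, origin = resolve_value(
    "timeout", config={"timeout": "10"}, env={"DEMO_TIMEOUT": "20"}, cast=int
)
```

Returning the source alongside the value is one way to support a report of how each setting was resolved.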

get(key)[source]: Get value from config dict only

print()[source]: Print how each config was resolved

clear_log()[source]: Clear resolution log

load_yaml(path)[source]

Return type:: dict

classmethod load(path=None)[source]

property paths: Access to path manager for organized directory structure

class scitex.scholar.ScholarAuthManager(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]

Bases: object

Manages multiple authentication providers.

This class coordinates between different authentication methods (OpenAthens, Lean Library, etc.) and provides a unified interface.

__init__(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]

Initialize the authentication manager.

Parameters:

email_openathens (Optional[str]) – User’s institutional email for OpenAthens authentication
email_ezproxy (Optional[str]) – User’s institutional email for EZProxy authentication
email_shibboleth (Optional[str]) – User’s institutional email for Shibboleth authentication
config (Optional[ScholarConfig]) – ScholarConfig instance (creates new if None)

async ensure_authenticate_async(provider_name=None, verify_live=True, **kwargs)[source]

Return type:: bool

async is_authenticate_async(verify_live=True)[source]

Check if authenticated with any provider.

Return type:: bool

async authenticate_async(provider_name=None, **kwargs)[source]

Authenticate with specified or active provider.

Return type:: dict

async get_auth_headers_async()[source]

Get authentication headers from active provider.

Return type:: Dict[str, str]

async get_auth_options()[source]

Return type:: dict

async get_auth_cookies_async(essential_only=True)[source]

Get authentication cookies from active provider.

Return type:: List[Dict[str, Any]]

set_active_provider(name)[source]

Set the active authentication provider.

Return type:: None

get_active_provider()[source]

Get the currently active provider.

Return type:: Optional[BaseAuthenticator]

async logout_async()[source]

Log out from all providers.

Return type:: None

list_providers()[source]

List all registered providers.

Return type:: List[str]

class scitex.scholar.ScholarBrowserManager(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]

Bases: BrowserMixin

Manages a local browser instance with stealth enhancements and invisible mode.

__init__(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]

Initialize ScholarBrowserManager with invisible browser capabilities.

Parameters:

auth_manager – Authentication manager instance
config (ScholarConfig) – Scholar configuration instance

async get_authenticated_browser_and_context_async(**context_options)[source]

Get browser context with authentication cookies and extensions loaded.

Return type:: tuple[Browser, BrowserContext]

async take_screenshot_async(page, path, timeout_sec=30.0, timeout_after_sec=30.0, full_page=False)[source]: Take screenshot without viewport changes.

async start_periodic_screenshots_async(page, output_dir, prefix='periodic', interval_seconds=1, duration_seconds=10, verbose=False)[source]

Start taking periodic screenshots in the background.

Parameters:

page – The page to screenshot
prefix (str) – Prefix for screenshot filenames
interval_seconds (int) – Seconds between screenshots
duration_seconds (int) – Total duration to take screenshots (0 = infinite)
verbose (bool) – Whether to log each screenshot

Returns:

asyncio.Task that can be cancelled to stop screenshots
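The background-task pattern (start, run at an interval, cancel to stop) can be sketched with asyncio alone. The helpers below are hypothetical and call a plain function in place of a screenshot; treating a duration of 0 as "run until cancelled" follows the parameter description above.

```python
import asyncio


async def _periodic(action, interval_seconds, duration_seconds):
    loop = asyncio.get_running_loop()
    start = loop.time()
    # duration_seconds == 0 means "run until cancelled".
    while duration_seconds == 0 or loop.time() - start < duration_seconds:
        action()
        await asyncio.sleep(interval_seconds)


def start_periodic(action, interval_seconds=1.0, duration_seconds=10.0):
    """Schedule the loop in the background; must be called from a coroutine."""
    return asyncio.create_task(_periodic(action, interval_seconds, duration_seconds))


async def stop_periodic(task):
    """Cancel the task and wait until the cancellation has taken effect."""
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


async def demo():
    ticks = []
    task = start_periodic(lambda: ticks.append(1),
                          interval_seconds=0.01, duration_seconds=0)
    await asyncio.sleep(0.05)  # other work happens here
    await stop_periodic(task)
    return len(ticks), task.cancelled()


count, cancelled = asyncio.run(demo())
```

Awaiting the task after cancel() ensures the loop has actually stopped before the caller continues, for example before closing the page.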

async stop_periodic_screenshots_async(task)[source]: Stop periodic screenshots task.

async close()[source]: Close browser while preserving authentication and extension data.

class scitex.scholar.ScholarURLFinder(context, config=None)[source]

Bases: object

Find PDF URLs from web pages.

Simple, focused responsibility:
- Input: Page or URL string
- Output: List of PDF URLs

Authentication/DOI resolution should be handled BEFORE calling this.

PAGE_LOAD_TIMEOUT = 30000

async find_pdf_urls(page_or_url, base_url=None)[source]

Find PDF URLs from page or URL string.

Parameters:

page_or_url (Union[Page, str]) – Playwright Page object or URL string
base_url (Optional[str]) – Optional base URL for the page

Returns:

List of PDF URL dicts, e.g. [{"url": "…", "source": "zotero_translator"}]

Return type:

list of dict
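One simple way to produce a result of this shape is to scan anchor tags for links ending in .pdf. The sketch below is an assumption about a possible strategy, not the finder's actual logic; the function find_pdf_links and the source label "href_scan" are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags whose path ends in .pdf."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension.
        if href and href.split("?")[0].lower().endswith(".pdf"):
            self.hrefs.append(href)


def find_pdf_links(html, base_url):
    parser = _PdfLinkParser()
    parser.feed(html)
    seen, results = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if url not in seen:
            seen.add(url)
            results.append({"url": url, "source": "href_scan"})
    return results


page = '<a href="/files/paper.pdf">PDF</a> <a href="about.html">About</a>'
links = find_pdf_links(page, "https://example.org/article/1")
```

Recording a source label per URL lets a caller prefer results from more reliable strategies when several are combined.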

class scitex.scholar.CitationGraphBuilder(db_path=None, api_url=None)[source]

Bases: object

Build citation network graphs for academic papers.

Auto-detects backend via crossref_local.Config (DB → HTTP).

Example (auto-detect):

>>> builder = CitationGraphBuilder()
>>> graph = builder.build("10.1038/s41586-020-2008-3", top_n=20)

Example (explicit SQLite):

>>> builder = CitationGraphBuilder(db_path="/path/to/crossref.db")

Example (explicit HTTP):

>>> builder = CitationGraphBuilder(api_url="http://localhost:31291")

__init__(db_path=None, api_url=None)[source]

Initialize builder with database path, HTTP API URL, or auto-detect.

When no arguments are given, delegates to crossref_local.Config for auto-detection, in this order:
1. CROSSREF_LOCAL_MODE env var (explicit "db" or "http")
2. CROSSREF_LOCAL_API_URL env var → HTTP mode
3. Local DB file existence → DB mode
4. Fallback to HTTP mode

Parameters:

db_path (str) – Path to CrossRef SQLite database (local mode)
api_url (str) – URL of crossref-local HTTP API (HTTP mode)
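The auto-detection order described for the no-argument case can be sketched as a small function. The environment variable names come from the description above; the function detect_mode and its arguments are hypothetical and do not reproduce crossref_local.Config.

```python
import os
from pathlib import Path


def detect_mode(env=None, db_path=None):
    """Return "db" or "http" following the four-step order."""
    env = os.environ if env is None else env

    # 1. An explicit mode wins.
    explicit = env.get("CROSSREF_LOCAL_MODE", "").strip().lower()
    if explicit in ("db", "http"):
        return explicit
    # 2. A configured API URL implies HTTP mode.
    if env.get("CROSSREF_LOCAL_API_URL"):
        return "http"
    # 3. An existing local database file implies DB mode.
    if db_path is not None and Path(db_path).is_file():
        return "db"
    # 4. Otherwise fall back to HTTP mode.
    return "http"


mode = detect_mode({"CROSSREF_LOCAL_API_URL": "http://localhost:31291"})
```

Passing the environment as a dict keeps the function easy to test without modifying the real process environment.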

build(seed_doi, top_n=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]

Build citation network around a seed paper.

Parameters:

seed_doi (str) – DOI of the seed paper
top_n (int) – Number of most similar papers to include
weight_coupling (float) – Weight for bibliographic coupling
weight_cocitation (float) – Weight for co-citation
weight_direct (float) – Weight for direct citations

Return type:

CitationGraph

Returns:

CitationGraph object with nodes and edges
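The role of the three weights can be illustrated with a toy scoring function. The linear combination below is an assumption about how such weights are commonly applied, not the builder's documented formula; the functions similarity and top_related and the sample data are hypothetical.

```python
def similarity(seed, other, refs,
               weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0):
    """Score `other` against `seed`; refs maps paper -> set of papers it cites."""
    seed_refs = refs.get(seed, set())
    other_refs = refs.get(other, set())
    # Bibliographic coupling: references the two papers share.
    coupling = len(seed_refs & other_refs)
    # Co-citation: papers that cite both.
    cocitation = sum(1 for cited in refs.values() if seed in cited and other in cited)
    # Direct citation in either direction.
    direct = 1 if (other in seed_refs or seed in other_refs) else 0
    return (weight_coupling * coupling
            + weight_cocitation * cocitation
            + weight_direct * direct)


def top_related(seed, refs, top_n=20, **weights):
    candidates = (set(refs) | {p for cited in refs.values() for p in cited}) - {seed}
    scored = [(p, similarity(seed, p, refs, **weights)) for p in candidates]
    scored.sort(key=lambda item: (-item[1], item[0]))  # best first, ties by name
    return scored[:top_n]


refs = {
    "A": {"X", "Y", "B"},
    "B": {"X", "Z"},
    "C": {"A", "B"},
    "D": {"A", "B"},
}
related = top_related("A", refs, top_n=3)
```

In this toy graph, B shares one reference with A, is co-cited with A by two papers, and is cited by A directly, so it ranks first.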

build_from_dois(dois, num_related_per_doi=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]

Build citation network from multiple seed DOIs.

Combines similarity scores from all seeds to find papers related to the entire set, producing a richer connected graph.

Parameters:

dois (List[str]) – List of seed DOIs
num_related_per_doi (int) – Number of related papers to discover per DOI
weight_coupling (float) – Weight for bibliographic coupling
weight_cocitation (float) – Weight for co-citation
weight_direct (float) – Weight for direct citations

Return type:

CitationGraph

Returns:

CitationGraph with all seeds + related papers + edges

build_from_query(query, num_related_per_doi=20, search_limit=10, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]

Build citation network from a text query.

Searches local databases, extracts DOIs from results, then delegates to build_from_dois().

Parameters:

query (str) – Search query (e.g. “hippocampal sharp wave ripples”)
num_related_per_doi (int) – Related papers per seed DOI
search_limit (int) – Max papers to fetch from search
weight_coupling (float) – Weight for bibliographic coupling
weight_cocitation (float) – Weight for co-citation
weight_direct (float) – Weight for direct citations

Return type:

CitationGraph

Returns:

CitationGraph with search-discovered seeds + related papers

export_json(graph, output_path)[source]

Export graph to JSON file for visualization.

Parameters:

graph (CitationGraph) – CitationGraph to export
output_path (str) – Path to output JSON file

get_paper_summary(doi)[source]

Get summary information for a paper.

Parameters:: doi (str) – DOI of the paper
Return type:: Optional[dict]
Returns:: Dictionary with paper summary

scitex.scholar.plot_citation_graph(graph, backend='auto', output=None, **kwargs)[source]

Visualize a citation graph with pluggable backends.

Parameters:

graph (CitationGraph or networkx.DiGraph) – Citation network to visualize. CitationGraph is auto-converted via to_networkx().
backend (str) – Rendering backend: 'auto', 'figrecipe', 'scitex.plt', 'matplotlib', or 'pyvis'. Default 'auto' picks the best available.
output (str, optional) – Output file path. Required for 'pyvis' backend (HTML). For static backends, saves the figure to this path.
**kwargs – Backend-specific keyword arguments (layout, seed, figsize, etc.).

Returns:

Backend-specific result. Static backends return {'fig', 'ax', 'pos', 'backend'}. Pyvis returns {'output', 'backend'}.

Return type:

dict

scitex.scholar.to_bibtex(paper)[source]

Format a standard paper dict as a BibTeX entry.

Return type:: str

scitex.scholar.to_ris(paper)[source]

Format a standard paper dict as a RIS entry.

Return type:: str
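For orientation, a RIS record is a sequence of two-letter tags followed by two spaces, a hyphen, and a space, ending with an ER line. The sketch below shows that layout; the function paper_to_ris and the dict keys it reads (authors, title, journal, year, doi) are assumptions, since the layout of the standard paper dict is not described here.

```python
def paper_to_ris(paper):
    """Format a dict as a single RIS journal-article record."""
    lines = ["TY  - JOUR"]  # record type: journal article
    for author in paper.get("authors", []):
        lines.append(f"AU  - {author}")
    for tag, key in (("TI", "title"), ("JO", "journal"), ("PY", "year"), ("DO", "doi")):
        if paper.get(key):  # skip empty or missing fields
            lines.append(f"{tag}  - {paper[key]}")
    lines.append("ER  - ")  # end of record
    return "\n".join(lines)


record = paper_to_ris({
    "authors": ["Smith, Jane", "Doe, John"],
    "title": "An Example Title",
    "year": 2024,
})
```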

scitex.scholar.to_endnote(paper)[source]

Format a standard paper dict as an EndNote entry.

Return type:: str

scitex.scholar.to_text_citation(paper, style='apa', doc_type='article')[source]

Format a paper dict as a text citation in the given style.

Parameters:

paper (dict) – Standard paper dict.
style (str) – One of apa, mla, chicago, vancouver.
doc_type (str) – One of article, dataset.

Returns:

Formatted citation string.

Return type:

str

scitex.scholar.papers_to_format(papers, fmt)[source]

Format a list of paper dicts to the given format string.

Return type:: str

scitex.scholar.generate_cite_key(paper)[source]

Generate a BibTeX citation key from a paper dict.

Return type:: str

scitex.scholar.make_citation_key(last_name, year=None)[source]

Generate a citation key from author last name and year.

Parameters:

last_name (str) – Author last name (special chars stripped).
year – Publication year (optional).

Return type:

str

Returns:

Citation key string, e.g. smith2024.
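A key of the form smith2024 can be produced as sketched below. Lowercasing the name and keeping only ASCII letters and digits are assumptions made for the example; the helper name simple_citation_key is hypothetical.

```python
import re


def simple_citation_key(last_name, year=None):
    """Build a key such as smith2024 from a last name and optional year."""
    # Drop everything except ASCII letters and digits, then lowercase.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", last_name).lower()
    return f"{cleaned}{year}" if year is not None else cleaned


key = simple_citation_key("Smith", 2024)
```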

scitex.scholar.from_connected_papers(paper_id, *, cp_api_key=None, s2_api_key=None, output_format='citation_graph', dry_run=False)[source]

Import a Connected Papers graph into scitex.

Parameters:

paper_id (str) – Semantic Scholar paper ID (40-char SHA) for the seed paper.
cp_api_key (str, optional) – Connected Papers API key.
s2_api_key (str, optional) – Semantic Scholar API key for DOI resolution.
output_format (str) – "citation_graph" returns CitationGraph, "papers" returns Papers.
dry_run (bool) – If True, fetch and report stats without creating objects.

Returns:

{success: True, graph/papers, stats, warnings} or {success: False, error: str}.

Return type:

dict

scitex.scholar.to_connected_papers(graph, *, output=None)[source]

Export a CitationGraph as BibTeX/JSON for Connected Papers.

Parameters:

graph (CitationGraph) – Citation graph to export.
output (str or Path, optional) – Output directory. Defaults to current directory.

Returns:

{success, bibtex_path, json_path, paper_count} or {success: False, error}.

Return type:

dict

scitex.scholar.apply_filters(papers, filters=None, parsed_operators=None)[source]

Filter a list of paper dicts by various criteria.

Parameters:

papers (List[Dict[str, Any]]) – List of paper dicts. Each dict should contain the keys described in the module docstring; missing keys are treated as empty / zero values.
filters (Optional[Dict[str, Any]]) –
Dict of filter criteria extracted from a search form or URL parameters. Supported keys:
- year_from, year_to – year range (int)
- min_citations, max_citations – citation range (int)
- min_impact_factor – minimum IF (float)
- max_impact_factor – maximum IF (float)
- authors – list of author name strings (legacy)
- journal – journal name substring (legacy, str)
- open_access – bool
- doc_type – "review" | "preprint" | other
- language – language string ("english" passes)
parsed_operators (Optional[Dict[str, Any]]) –
Dict produced by SearchQueryParser.from_shell_syntax() or the equivalent parse_query_operators() function from scitex-cloud. Supported keys:
- title_includes, title_excludes – list[str]
- author_includes, author_excludes – list[str]
- journal_includes, journal_excludes – list[str]
- year_min, year_max – int
- citations_min, citations_max – int
- impact_factor_min, impact_factor_max – float

Returns:

Filtered list of paper dicts (same objects, not copies).

Return type:

list of dict
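A subset of the parsed_operators keys can be sketched on plain dicts. The function below is hypothetical and handles only the title and year keys; it assumes that every include term must match and that matching is case-insensitive. Missing keys are treated as empty or zero, as stated above.

```python
def apply_operators(papers, parsed_operators):
    """Filter paper dicts by title_includes/title_excludes and year_min/year_max."""
    includes = [t.lower() for t in parsed_operators.get("title_includes", [])]
    excludes = [t.lower() for t in parsed_operators.get("title_excludes", [])]
    year_min = parsed_operators.get("year_min")
    year_max = parsed_operators.get("year_max")

    kept = []
    for paper in papers:
        title = (paper.get("title") or "").lower()  # missing title -> empty
        year = paper.get("year") or 0               # missing year -> zero
        if any(term not in title for term in includes):
            continue
        if any(term in title for term in excludes):
            continue
        if year_min is not None and year < year_min:
            continue
        if year_max is not None and year > year_max:
            continue
        kept.append(paper)  # same object, not a copy
    return kept


sample = [
    {"title": "Deep Learning for Sleep Staging", "year": 2021},
    {"title": "Deep Learning: A Review", "year": 2019},
    {"title": "Sharp Wave Ripples"},
]
selected = apply_operators(
    sample,
    {"title_includes": ["deep learning"], "title_excludes": ["review"], "year_min": 2020},
)
```

Because the returned list holds the original dicts, changes made to a result are visible in the input list as well.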