scitex.scholar API Reference
SciTeX Scholar – scientific paper search, enrichment, and management.
- Quick Start:
from scitex_scholar import Scholar, Paper, Papers
scholar = Scholar()
papers = scholar.search("deep learning")
papers.save("results.bib")
- Installation:
pip install scitex-scholar
This module uses PEP 562 lazy __getattr__ so import scitex_scholar stays under 500ms cold-start. Submodules are imported on first attribute access only.
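The lazy-loading pattern described above can be sketched in a few lines. This is a minimal illustration of the PEP 562 mechanism, not the package's actual code: the `LAZY_ATTRS` mapping, the `make_lazy_module` helper, and the standard-library targets are all hypothetical stand-ins for the real submodules.

```python
import importlib
import types

# Hypothetical mapping for illustration: public name -> module that defines it.
# Standard-library targets stand in for the package's real submodules.
LAZY_ATTRS = {"OrderedDict": "collections", "JSONDecoder": "json"}


def make_lazy_module(name, lazy_attrs):
    """Build a module object whose attributes are imported on first access."""
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr not in lazy_attrs:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(lazy_attrs[attr]), attr)
        setattr(module, attr, value)  # cache so later lookups skip __getattr__
        return value

    module.__getattr__ = __getattr__  # PEP 562 hook lives in the module namespace
    return module


demo = make_lazy_module("lazy_demo", LAZY_ATTRS)
```

In a real package the same `__getattr__` function would simply be defined at the top level of `__init__.py`; the factory here only exists to keep the example self-contained.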
- class scitex.scholar.Scholar(config=None, project=None, project_description=None, browser_mode=None)[source]
Bases:
EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin
Main interface for SciTeX Scholar - scientific literature management made simple.
By default, papers are automatically enriched with:
Journal impact factors from impact_factor package (2024 JCR data)
Citation counts from Semantic Scholar (via DOI/title matching)
Examples
Basic search with automatic enrichment:
scholar = Scholar()
papers = scholar.search("deep learning neuroscience")
# Papers now have impact_factor and citation_count populated
papers.save("my_pac.bib")
Disable automatic enrichment if needed:
config = ScholarConfig(enable_auto_enrich=False)
scholar = Scholar(config=config)
Search a specific source:
papers = scholar.search("transformer models", sources='arxiv')
Advanced workflow:
papers = (
    scholar.search("transformer models", year_min=2020)
    .filter(min_citations=50)
    .sort_by("impact_factor")
    .save("transformers.bib")
)
Local library:
scholar._index_local_pdfs("./my_papers")
local_papers = scholar.search_local("attention mechanism")
- property name
Class name for logging.
- __init__(config=None, project=None, project_description=None, browser_mode=None)[source]
Initialize Scholar with configuration.
- Parameters:
config (Union[ScholarConfig, str, Path, None]) – One of:
  - ScholarConfig instance
  - Path to YAML config file (str or Path)
  - None (uses ScholarConfig.load() to find config)
project (Optional[str]) – Default project name for operations.
project_description (Optional[str]) – Optional description for the project.
browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').
- class scitex.scholar.Paper(**data)[source]
Bases:
BaseModel
Complete paper with metadata and container.
- metadata: PaperMetadataStructure
- container: ContainerMetadata
- model_config: ClassVar[ConfigDict] = {'populate_by_name': True, 'validate_assignment': True, 'validate_by_alias': True, 'validate_by_name': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- classmethod from_dict(data)[source]
Create from dictionary (for loading from JSON).
Uses Pydantic's model_validate, which handles:
  - Type validation
  - Type coercion (e.g., "2024" -> 2024)
  - Field aliases (e.g., "2025" -> y2025)
- Return type:
- to_dict()[source]
Convert to dictionary for JSON serialization.
Alias for model_dump() for backward compatibility.
- class scitex.scholar.Papers(papers=None, project=None, config=None)[source]
Bases:
object
A simple collection of Paper objects.
This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.
Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.
- filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)[source]
Filter papers by condition or criteria.
- Parameters:
condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.
has_abstract (Optional[bool]) – Filter papers with/without abstract.
has_pdf (Optional[bool]) – Filter papers with/without PDF URL.
min_impact_factor (Optional[float]) – Minimum journal impact factor.
max_impact_factor (Optional[float]) – Maximum journal impact factor.
keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).
**kwargs – Additional keyword arguments for backward compatibility.
- Returns:
New Papers collection with filtered papers.
- Return type:
Papers
Examples
Filter using a lambda condition:
high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
recent = papers.filter(lambda p: p.year and p.year >= 2020)
Filter using built-in parameters:
high_impact_v2 = papers.filter(min_impact_factor=10.0)
highly_cited_v2 = papers.filter(min_citations=500)
recent_v2 = papers.filter(year_min=2020)
Combine multiple parameters:
filtered = papers.filter(
    min_impact_factor=5.0,
    min_citations=100,
    year_min=2015,
    year_max=2023,
    journal="Nature",
    has_doi=True,
)
Chain filters for AND logic:
elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)
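To make the AND semantics concrete, here is a small standalone sketch that mimics combined and chained filtering on plain objects. `PaperStub` and `filter_stubs` are hypothetical names introduced only for this illustration, and the rule that a missing value fails a bound is an assumption rather than documented behaviour of `Papers.filter`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PaperStub:
    """Hypothetical stand-in for Paper with only the fields used here."""

    title: str
    year: Optional[int] = None
    citation_count: Optional[int] = None


def filter_stubs(
    papers: List[PaperStub],
    condition: Optional[Callable[[PaperStub], bool]] = None,
    year_min: Optional[int] = None,
    min_citations: Optional[int] = None,
) -> List[PaperStub]:
    """Keep papers that satisfy every supplied criterion (AND logic)."""
    kept = []
    for paper in papers:
        if condition is not None and not condition(paper):
            continue
        # Assumption: a missing value fails a bound rather than passing it.
        if year_min is not None and (paper.year is None or paper.year < year_min):
            continue
        if min_citations is not None and (
            paper.citation_count is None or paper.citation_count < min_citations
        ):
            continue
        kept.append(paper)
    return kept  # new list; the input is left untouched


stubs = [
    PaperStub("A", year=2021, citation_count=600),
    PaperStub("B", year=2018, citation_count=900),
    PaperStub("C", year=2022, citation_count=None),
]
combined = filter_stubs(stubs, year_min=2020, min_citations=500)
chained = filter_stubs(filter_stubs(stubs, year_min=2020), min_citations=500)
```

Under these assumptions, passing several criteria in one call and chaining single-criterion calls select the same papers.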
- sort_by(*criteria, reverse=False, **kwargs)[source]
Sort papers by criteria.
- Parameters:
*criteria – Field names (as strings) or lambda functions to sort by.
reverse (bool) – Sort in descending order (default: False).
**kwargs – Additional options.
- Returns:
New sorted Papers collection.
- Return type:
Papers
Notes
Available Paper fields for sorting:
  - title – Paper title
  - year – Publication year
  - citation_count – Number of citations
  - journal_impact_factor – Journal impact factor
  - journal – Journal name
  - publisher – Publisher name
  - doi – Digital Object Identifier
  - created_at – When record was created
  - updated_at – When record was last updated
Examples
Sort by a single field:
by_year = papers.sort_by('year')
by_citations_desc = papers.sort_by('citation_count', reverse=True)
Sort by multiple fields (primary, secondary, etc.):
by_year_then_citations = papers.sort_by('year', 'citation_count')
Sort using a lambda function:
by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)
Sort by a computed value:
# max(..., 1) avoids division by zero for papers published in 2024 or later
by_citation_per_year = papers.sort_by(
    lambda p: (p.citation_count or 0) / max(2024 - p.year, 1) if p.year else 0,
    reverse=True,
)
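Multi-key sorting over records with missing values is easy to get wrong, because comparing None with a number raises TypeError. The sketch below shows one way to build a None-safe composite key from field names or key functions. `sort_records` is a hypothetical helper, and placing missing values first is an assumption, not the documented ordering of `sort_by`.

```python
from types import SimpleNamespace


def sort_records(records, *criteria, reverse=False):
    """Return a new list sorted by field names or key functions, in order."""

    def key(record):
        parts = []
        for criterion in criteria:
            if callable(criterion):
                value = criterion(record)
            else:
                value = getattr(record, criterion, None)
            # Assumption: None sorts before any real value. The leading flag
            # means None is never compared directly with a number or string.
            parts.append((value is not None, value if value is not None else 0))
        return tuple(parts)

    return sorted(records, key=key, reverse=reverse)


records = [
    SimpleNamespace(title="A", year=2021, citation_count=10),
    SimpleNamespace(title="B", year=2019, citation_count=50),
    SimpleNamespace(title="C", year=2021, citation_count=None),
    SimpleNamespace(title="D", year=None, citation_count=5),
]
by_year_then_citations = [r.title for r in sort_records(records, "year", "citation_count")]
by_citations_desc = [
    r.title for r in sort_records(records, lambda r: r.citation_count or 0, reverse=True)
]
```

The first criterion is the primary key and later criteria only break ties, which matches the "primary, secondary" wording above.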
- classmethod from_bibtex(bibtex_input)[source]
Load papers from BibTeX.
DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.
- save(output_path, format='auto', **kwargs)[source]
Save papers to file.
DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.
- to_dict()[source]
Convert to dictionary.
DEPRECATED: Use papers_utils.papers_to_dict() for new code.
- class scitex.scholar.ScholarConfig(config_path=None, scholar_dir=None)[source]
Bases:
object
- __init__(config_path=None, scholar_dir=None)[source]
Initialize ScholarConfig.
- Parameters:
config_path (Union[str, Path, None]) – Path to custom config YAML file.
scholar_dir (Union[str, Path, None]) – Direct path to scholar directory (e.g., /data/users/alice/.scitex). This bypasses the SCITEX_DIR env var for thread-safe multi-user usage. Use this in Django/multi-user environments to avoid race conditions.
- resolve(key, direct_val=None, default=None, type=<class 'str'>, mask=None)[source]
Resolve configuration value with precedence: direct → config → env → default
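The precedence chain can be illustrated with a standalone function. This is a sketch under stated assumptions, not the method's implementation: `resolve_value`, the `cast` argument (standing in for `type`), and the `SCITEX_SCHOLAR_` environment prefix are hypothetical, and the `mask` parameter is left out.

```python
import os


def resolve_value(key, direct_val=None, config=None, env=None,
                  default=None, cast=str, env_prefix="SCITEX_SCHOLAR_"):
    """Return the first value found in the order direct, config, env, default."""
    config = config or {}
    env = os.environ if env is None else env

    if direct_val is not None:            # 1. value passed explicitly by the caller
        return cast(direct_val)
    if config.get(key) is not None:       # 2. value from the loaded config mapping
        return cast(config[key])
    env_val = env.get(env_prefix + key.upper())
    if env_val is not None:               # 3. value from the environment (prefix assumed)
        return cast(env_val)
    return default                        # 4. fallback, returned unchanged


cfg = {"timeout": "20"}
fake_env = {"SCITEX_SCHOLAR_TIMEOUT": "30", "SCITEX_SCHOLAR_RETRIES": "3"}
```

With these inputs, `resolve_value("timeout", 10, cfg, fake_env, cast=int)` takes the direct value, while omitting the direct value falls through to the config entry and then to the environment.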
- property paths
Access to path manager for organized directory structure
- class scitex.scholar.ScholarAuthManager(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]
Bases:
object
Manages multiple authentication providers.
This class coordinates between different authentication methods (OpenAthens, Lean Library, etc.) and provides a unified interface.
- __init__(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]
Initialize the authentication manager.
- Parameters:
email_openathens (Optional[str]) – User's institutional email for OpenAthens authentication.
email_ezproxy (Optional[str]) – User's institutional email for EZProxy authentication.
email_shibboleth (Optional[str]) – User's institutional email for Shibboleth authentication.
config (Optional[ScholarConfig]) – ScholarConfig instance (creates new if None).
- async ensure_authenticate_async(provider_name=None, verify_live=True, **kwargs)[source]
- Return type:
- async is_authenticate_async(verify_live=True)[source]
Check whether any provider is currently authenticated.
- Return type:
- async authenticate_async(provider_name=None, **kwargs)[source]
Authenticate with specified or active provider.
- Return type:
- async get_auth_cookies_async(essential_only=True)[source]
Get authentication cookies from active provider.
- class scitex.scholar.ScholarBrowserManager(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]
Bases:
BrowserMixin
Manages a local browser instance with stealth enhancements and invisible mode.
- __init__(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]
Initialize ScholarBrowserManager with invisible browser capabilities.
- Parameters:
auth_manager – Authentication manager instance
config (ScholarConfig) – Scholar configuration instance
- async get_authenticated_browser_and_context_async(**context_options)[source]
Get browser context with authentication cookies and extensions loaded.
- Return type:
tuple[Browser,BrowserContext]
- async take_screenshot_async(page, path, timeout_sec=30.0, timeout_after_sec=30.0, full_page=False)[source]
Take screenshot without viewport changes.
- class scitex.scholar.ScholarURLFinder(context, config=None)[source]
Bases:
object
Find PDF URLs from web pages.
Simple, focused responsibility:
  - Input: Page or URL string
  - Output: List of PDF URLs
Authentication/DOI resolution should be handled BEFORE calling this.
- PAGE_LOAD_TIMEOUT = 30000
- class scitex.scholar.CitationGraphBuilder(db_path=None, api_url=None)[source]
Bases:
object
Build citation network graphs for academic papers.
Auto-detects backend via crossref_local.Config (DB → HTTP).
- Example (auto-detect):
>>> builder = CitationGraphBuilder()
>>> graph = builder.build("10.1038/s41586-020-2008-3", top_n=20)
- Example (explicit SQLite):
>>> builder = CitationGraphBuilder(db_path="/path/to/crossref.db")
- Example (explicit HTTP):
>>> builder = CitationGraphBuilder(api_url="http://localhost:31291")
- __init__(db_path=None, api_url=None)[source]
Initialize builder with database path, HTTP API URL, or auto-detect.
When no args given, delegates to crossref_local.Config for auto-detection:
  1. CROSSREF_LOCAL_MODE env var (explicit "db" or "http")
  2. CROSSREF_LOCAL_API_URL env var → HTTP mode
  3. Local DB file existence → DB mode
  4. Fallback to HTTP mode
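The four-step detection order can be written as a short decision function. This is only a sketch of the order listed above: `detect_backend` and the default `db_path` are hypothetical, and the real logic lives in crossref_local.Config, whose internals are not shown in this reference.

```python
import os
from pathlib import Path


def detect_backend(env=None, db_path="crossref.db"):
    """Return "db" or "http" following the documented detection order."""
    env = os.environ if env is None else env

    # 1. An explicit mode wins when it names a known backend.
    mode = env.get("CROSSREF_LOCAL_MODE", "").strip().lower()
    if mode in ("db", "http"):
        return mode
    # 2. A configured API URL implies HTTP mode.
    if env.get("CROSSREF_LOCAL_API_URL"):
        return "http"
    # 3. An existing local database file implies DB mode (path is an assumption).
    if Path(db_path).is_file():
        return "db"
    # 4. Otherwise fall back to HTTP mode.
    return "http"
```

Passing the environment as a plain mapping keeps the function easy to exercise without touching the real process environment.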
- build(seed_doi, top_n=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]
Build citation network around a seed paper.
- Parameters:
- Return type:
CitationGraph
- Returns:
CitationGraph object with nodes and edges
- build_from_dois(dois, num_related_per_doi=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]
Build citation network from multiple seed DOIs.
Combines similarity scores from all seeds to find papers related to the entire set, producing a richer connected graph.
- Parameters:
- Return type:
CitationGraph
- Returns:
CitationGraph with all seeds + related papers + edges
- build_from_query(query, num_related_per_doi=20, search_limit=10, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]
Build citation network from a text query.
Searches local databases, extracts DOIs from results, then delegates to build_from_dois().
- Parameters:
query (str) – Search query (e.g. "hippocampal sharp wave ripples")
num_related_per_doi (int) – Related papers per seed DOI
search_limit (int) – Max papers to fetch from search
weight_coupling (float) – Weight for bibliographic coupling
weight_cocitation (float) – Weight for co-citation
weight_direct (float) – Weight for direct citations
- Return type:
CitationGraph
- Returns:
CitationGraph with search-discovered seeds + related papers
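The three weight_* parameters shared by build(), build_from_dois(), and build_from_query() suggest that related papers are ranked by a weighted combination of similarity signals. This reference does not give the formula, so the following is only a sketch that assumes a plain weighted sum, summed again across seeds. `combined_score`, `rank_related`, the signal tuples, and the placeholder DOIs are all hypothetical.

```python
from collections import defaultdict


def combined_score(coupling, cocitation, direct,
                   weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0):
    """Weighted sum of the three signals (assumed formula, for illustration)."""
    return (weight_coupling * coupling
            + weight_cocitation * cocitation
            + weight_direct * direct)


def rank_related(per_seed_signals, top_n=20, **weights):
    """Sum each candidate's score over all seeds and keep the top_n candidates."""
    totals = defaultdict(float)
    for signals in per_seed_signals.values():
        for doi, (coupling, cocitation, direct) in signals.items():
            totals[doi] += combined_score(coupling, cocitation, direct, **weights)
    # Highest score first; ties are broken by DOI so the order is deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Placeholder data: seed -> candidate DOI -> (coupling, cocitation, direct) counts.
signals = {
    "seed-1": {"10.0000/a": (3, 1, 0), "10.0000/b": (0, 2, 1)},
    "seed-2": {"10.0000/b": (2, 0, 1)},
}
ranking = rank_related(signals, top_n=2)
```

A candidate that is moderately related to several seeds can outrank one that is strongly related to a single seed, which is the intuition behind combining scores over the whole seed set.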
- scitex.scholar.plot_citation_graph(graph, backend='auto', output=None, **kwargs)[source]
Visualize a citation graph with pluggable backends.
- Parameters:
graph (CitationGraph or networkx.DiGraph) – Citation network to visualize. CitationGraph is auto-converted via to_networkx().
backend (str) – Rendering backend: 'auto', 'figrecipe', 'scitex.plt', 'matplotlib', or 'pyvis'. Default 'auto' picks the best available.
output (str, optional) – Output file path. Required for 'pyvis' backend (HTML). For static backends, saves the figure to this path.
**kwargs – Backend-specific keyword arguments (layout, seed, figsize, etc.).
- Returns:
Backend-specific result. Static backends return {'fig', 'ax', 'pos', 'backend'}. Pyvis returns {'output', 'backend'}.
- Return type:
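One plausible way for an 'auto' backend to pick "the best available" renderer is to probe for installed modules in a preference order. The sketch below assumes exactly that. `pick_backend`, `is_importable`, and the ordering in `BACKENDS` are hypothetical, since this reference lists the backend names but not their priority.

```python
import importlib.util

# Assumed preference order; backend names double as the module names to probe.
BACKENDS = ["figrecipe", "scitex.plt", "matplotlib", "pyvis"]


def is_importable(module_name):
    """Report whether a module can be found without importing the module itself."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises for a dotted name whose parent package is missing.
        return False


def pick_backend(requested="auto", available=is_importable):
    """Return the requested backend, or the first available one for 'auto'."""
    if requested != "auto":
        if requested not in BACKENDS:
            raise ValueError(f"unknown backend: {requested!r}")
        return requested
    for name in BACKENDS:
        if available(name):
            return name
    raise RuntimeError("no plotting backend is installed")
```

The `available` argument is injectable so the selection logic can be exercised without installing any of the plotting libraries.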
- scitex.scholar.to_bibtex(paper)[source]
Format a standard paper dict as a BibTeX entry.
- Return type:
- scitex.scholar.to_endnote(paper)[source]
Format a standard paper dict as an EndNote entry.
- Return type:
- scitex.scholar.to_text_citation(paper, style='apa', doc_type='article')[source]
Format a paper dict as a text citation in the given style.
- scitex.scholar.papers_to_format(papers, fmt)[source]
Format a list of paper dicts to the given format string.
- Return type:
- scitex.scholar.generate_cite_key(paper)[source]
Generate a BibTeX citation key from a paper dict.
- Return type:
- scitex.scholar.make_citation_key(last_name, year=None)[source]
Generate a citation key from author last name and year.
- scitex.scholar.from_connected_papers(paper_id, *, cp_api_key=None, s2_api_key=None, output_format='citation_graph', dry_run=False)[source]
Import a Connected Papers graph into scitex.
- Parameters:
paper_id (str) – Semantic Scholar paper ID (40-char SHA) for the seed paper.
cp_api_key (str, optional) – Connected Papers API key.
s2_api_key (str, optional) – Semantic Scholar API key for DOI resolution.
output_format (str) – “citation_graph” returns CitationGraph, “papers” returns Papers.
dry_run (bool) – If True, fetch and report stats without creating objects.
- Returns:
{success: True, graph/papers, stats, warnings} or {success: False, error: str}.
- Return type:
- scitex.scholar.to_connected_papers(graph, *, output=None)[source]
Export a CitationGraph as BibTeX/JSON for Connected Papers.
- scitex.scholar.apply_filters(papers, filters=None, parsed_operators=None)[source]
Filter a list of paper dicts by various criteria.
- Parameters:
papers (List[Dict[str, Any]]) – List of paper dicts. Each dict should contain the keys described in the module docstring; missing keys are treated as empty / zero values.
filters (Optional[Dict[str, Any]]) – Dict of filter criteria extracted from a search form or URL parameters. Supported keys:
  - year_from, year_to – year range (int)
  - min_citations, max_citations – citation range (int)
  - min_impact_factor – minimum IF (float)
  - max_impact_factor – maximum IF (float)
  - authors – list of author name strings (legacy)
  - journal – journal name substring (legacy, str)
  - open_access – bool
  - doc_type – "review" | "preprint" | other
  - language – language string ("english" passes)
parsed_operators (Optional[Dict[str, Any]]) – Dict produced by SearchQueryParser.from_shell_syntax() or the equivalent parse_query_operators() function from scitex-cloud. Supported keys:
  - title_includes, title_excludes – list[str]
  - author_includes, author_excludes – list[str]
  - journal_includes, journal_excludes – list[str]
  - year_min, year_max – int
  - citations_min, citations_max – int
  - impact_factor_min, impact_factor_max – float
- Returns:
Filtered list of paper dicts (same objects, not copies).
- Return type:
List[Dict[str, Any]]
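A reduced version of this kind of dict-based filtering might look like the sketch below. It covers only a few of the `filters` keys, and the paper-dict keys it reads ("year", "citation_count", "journal") are assumptions, because the module docstring that defines them is not reproduced here. `filter_paper_dicts` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def filter_paper_dicts(
    papers: List[Dict[str, Any]],
    filters: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Return the paper dicts that pass every supplied filter (same objects)."""
    filters = filters or {}
    year_from = filters.get("year_from")
    year_to = filters.get("year_to")
    min_citations = filters.get("min_citations")
    journal = (filters.get("journal") or "").lower()

    kept = []
    for paper in papers:
        # Missing keys are treated as zero / empty, as described above.
        year = paper.get("year") or 0
        citations = paper.get("citation_count") or 0
        if year_from is not None and year < year_from:
            continue
        if year_to is not None and year > year_to:
            continue
        if min_citations is not None and citations < min_citations:
            continue
        if journal and journal not in (paper.get("journal") or "").lower():
            continue
        kept.append(paper)  # append the original dict, not a copy
    return kept


sample = [
    {"title": "A", "year": 2021, "citation_count": 120, "journal": "Nature Neuroscience"},
    {"title": "B", "year": 2016, "citation_count": 300, "journal": "Nature"},
    {"title": "C", "journal": "Nature"},
]
selected = filter_paper_dicts(
    sample, {"year_from": 2020, "min_citations": 100, "journal": "nature"}
)
```

Because the returned list holds the original dicts, mutating an item in the result also changes the corresponding item in the input list.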