msword Module (stx.msword)
MS Word (DOCX) import/export utilities for SciTeX.
This module provides high-level functions to convert between MS Word .docx files and SciTeX’s internal writer document model.
Strategy:
Word users write text only (paragraphs, minimal formatting)
SciTeX handles: figures, tables, references, LaTeX generation
SciTeX JSON is the “source of truth”, Word is just a view/edit layer
Typical usage:
from scitex_msword import load_docx, save_docx, list_profiles
# Import from Word
doc = load_docx("input.docx", profile="generic")
# Manipulate via scitex.writer…
# doc.normalize()
# Export to Word (different journal template)
save_docx(doc, "output.docx", profile="mdpi-ijerph")
Available profiles:
generic: Standard Word with Heading 1/2/3
mdpi-ijerph: MDPI IJERPH journal template
resna-2025: RESNA 2025 scientific paper template
iop-double-anonymous: IOP double-anonymous template
- scitex.msword.load_docx(path, profile=None, extract_images=True)[source]
Load a DOCX file and convert it into a SciTeX writer document.
- Parameters:
- Returns:
A SciTeX writer document structure containing:
- blocks: List of document blocks (headings, paragraphs, captions, etc.)
- metadata: Profile and source file information
- images: Extracted image references (if extract_images=True)
- references: Parsed reference entries
- Return type:
Examples
>>> from scitex.msword import load_docx
>>> doc = load_docx("manuscript.docx", profile="mdpi-ijerph")
>>> print(doc["metadata"]["profile"])
mdpi-ijerph
- scitex.msword.save_docx(writer_doc, path, profile=None, overwrite=True, template_path=None)[source]
Save a SciTeX writer document as a DOCX file.
- Parameters:
writer_doc (dict | Any) – SciTeX writer document instance to export.
path (str | Path) – Output path for the .docx file.
profile (str | None) – Optional profile name that controls how sections, headings, figures, tables and references are mapped to Word styles. If None, “generic” is used.
overwrite (bool) – If False and the file already exists, raises FileExistsError.
template_path (str | Path | None) – Optional path to a Word template (.dotx/.docx) to use as base. This allows using journal-specific formatting.
- Returns:
The path to the written .docx file.
- Return type:
Path
Examples
>>> from scitex.msword import save_docx
>>> save_docx(doc, "submission_resna_2025.docx", profile="resna-2025")
PosixPath('submission_resna_2025.docx')
- scitex.msword.convert_docx_to_tex(input_path, output_path, profile=None, *, image_dir=None, link_images=True, link_mode='by-number', normalize_headings=True, validate=True)[source]
Convert a DOCX file directly to LaTeX.
This is a convenience function that:
1. Loads the DOCX file into SciTeX intermediate format
2. (Optionally) normalizes headings
3. (Optionally) links figure captions to images
4. (Optionally) validates the document and adds warnings
5. Exports to LaTeX (including figures via image_dir)
- Parameters:
input_path (str | Path) – Path to the input .docx file.
output_path (str | Path) – Path for the output .tex file.
profile (str | None) – Word profile for interpreting styles (e.g., “resna-2025”, “iop-double-anonymous”).
image_dir (str | Path | None, optional) – Directory where extracted figure image files will be saved. If None, the LaTeX exporter will create “<tex_stem>_figures” next to output_path.
link_images (bool, default True) – Whether to link figure captions to extracted images so that the LaTeX exporter can generate \includegraphics inside figure environments.
link_mode ({"by-number", "by-proximity"}, default "by-number") – Strategy for linking captions to images:
- “by-number”: Figure 1 -> first image, Figure 2 -> second image…
- “by-proximity”: assign images in document order, useful when figure numbers and image order don’t match.
normalize_headings (bool, default True) – If True, apply common heading normalizations (e.g., “intro” -> “Introduction”).
validate (bool, default True) – If True, run basic structural checks and populate doc[“warnings”] with any issues.
- Returns:
The path to the written .tex file.
- Return type:
Path
Examples
>>> from scitex.msword import convert_docx_to_tex
>>> convert_docx_to_tex(
...     "RESNA 2025 Scientific Paper Template.docx",
...     "manuscript.tex",
...     profile="resna-2025",
...     image_dir="figures",
... )
PosixPath('manuscript.tex')
- scitex.msword.list_profiles()[source]
List available MS Word profiles.
Examples
>>> from scitex.msword import list_profiles
>>> profiles = list_profiles()
>>> "generic" in profiles
True
- scitex.msword.get_profile(name)[source]
Get a Word profile by name.
- Parameters:
name (str | None) – Profile name. If None, “generic” is used.
- Returns:
The requested profile.
- Return type:
- Raises:
KeyError – If the profile name is unknown.
Examples
>>> from scitex.msword import get_profile
>>> profile = get_profile("mdpi-ijerph")
>>> profile.columns
1
- scitex.msword.register_profile(profile)[source]
Register a custom Word profile.
- Parameters:
profile (BaseWordProfile) – The profile to register.
- Return type:
Examples
>>> from scitex.msword import BaseWordProfile, register_profile
>>> custom = BaseWordProfile(
...     name="my-journal",
...     description="My custom journal template",
...     heading_styles={1: "Title", 2: "Subtitle"},
... )
>>> register_profile(custom)
>>> "my-journal" in list_profiles()
True
- class scitex.msword.BaseWordProfile(name, description, heading_styles=<factory>, caption_style='Caption', normal_style='Normal', reference_section_titles=<factory>, figure_caption_prefixes=<factory>, table_caption_prefixes=<factory>, list_styles=<factory>, equation_style=None, columns=1, double_anonymous=False, body_font=None, body_font_size_pt=None, heading_background_hex=None, line_spacing=None, post_import_hooks=<factory>, pre_export_hooks=<factory>)[source]
Bases: object
Base configuration for mapping between DOCX and SciTeX writer documents.
- heading_styles
Mapping from section depth (1, 2, 3…) to Word style names (e.g., {1: “Heading 1”, 2: “Heading 2”}).
- figure_caption_prefixes
Prefixes that identify figure captions (e.g., [“Figure”, “Fig.”]).
- class scitex.msword.WordReader(profile, extract_images=True)[source]
Bases: object
Read a DOCX file and convert it into a SciTeX writer document.
This reader focuses on:
- Sections (via heading styles)
- Plain paragraphs
- Figure/table captions (via caption style)
- Embedded images extraction
- References section boundary detection
- Basic formatting (bold, italic)
The output is a structured intermediate representation that can be easily fed into scitex.writer or exported to LaTeX/other formats.
- __init__(profile, extract_images=True)[source]
- Parameters:
profile (BaseWordProfile) – Mapping between Word styles and SciTeX writer semantics.
extract_images (bool) – Whether to extract embedded images from the document.
- read(path)[source]
Read a DOCX file and return a SciTeX writer document.
- Parameters:
path (Path) – Path to the DOCX file.
- Returns:
SciTeX writer document structure with keys:
- blocks: List of document blocks
- metadata: Profile and source information
- images: Extracted image data (if extract_images=True)
- references: Parsed reference entries
- warnings: List of conversion warnings
- Return type:
- class scitex.msword.WordWriter(profile, template_path=None)[source]
Bases: object
Export a SciTeX writer document to a DOCX file.
This writer handles:
- Section headings with proper styles
- Paragraphs with formatting
- Figure and table captions
- References section
- Image embedding
- Journal-specific template application
- __init__(profile, template_path=None)[source]
- Parameters:
profile (BaseWordProfile) – Mapping from writer structures to Word styles.
template_path (Path | None) – Optional path to a Word template (.dotx/.docx) to use as base.
- scitex.msword.link_captions_to_images(doc)[source]
Link figure captions to images by matching order.
This function pairs figure captions with images based on their sequential order in the document. Each figure caption is assigned an image_hash that corresponds to the image at the same position.
- Parameters:
doc (dict) – SciTeX writer document with ‘blocks’ and ‘images’ keys.
- Returns:
The same document with image_hash added to figure captions.
- Return type:
Examples
>>> from scitex.msword import load_docx
>>> from scitex.msword.utils import link_captions_to_images
>>> doc = load_docx("manuscript.docx")
>>> doc = link_captions_to_images(doc)
>>> # Now captions have image_hash for LaTeX export
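The order-based pairing described above can be sketched on a plain dict. This is a minimal illustration, not the library's implementation: the function name link_by_order and the block keys "type", "caption_type" and "hash" are assumptions made for the example; only 'blocks', 'images' and image_hash come from the description above.

```python
def link_by_order(doc):
    """Give the n-th figure caption the hash of the n-th image."""
    # Assumed block shape: {"type": "caption", "caption_type": "figure", ...}
    figure_captions = [
        block
        for block in doc.get("blocks", [])
        if block.get("type") == "caption" and block.get("caption_type") == "figure"
    ]
    # zip stops at the shorter list, so surplus captions stay unlinked.
    for caption, image in zip(figure_captions, doc.get("images", [])):
        caption["image_hash"] = image["hash"]
    return doc


doc = {
    "blocks": [
        {"type": "paragraph", "text": "See Figure 1."},
        {"type": "caption", "caption_type": "figure", "text": "Figure 1. Setup."},
        {"type": "caption", "caption_type": "table", "text": "Table 1. Data."},
        {"type": "caption", "caption_type": "figure", "text": "Figure 2. Results."},
    ],
    "images": [{"hash": "aaa111"}, {"hash": "bbb222"}],
}
linked = link_by_order(doc)
```

Table captions and paragraphs are skipped, so only the two figure captions receive an image_hash.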
- scitex.msword.link_captions_to_images_by_proximity(doc)[source]
Link figure captions to images by document proximity.
This function uses the image blocks (type=”image”) that are inserted at their actual positions in the document body. It finds the nearest unlinked image block to each figure caption.
- scitex.msword.normalize_section_headings(doc)[source]
Normalize section headings for consistency.
Converts common section titles to standard academic format:
- “intro” -> “Introduction”
- “method” -> “Methods”
- etc.
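A minimal sketch of this kind of normalization, assuming heading blocks look like {"type": "heading", "text": ...}. The alias table and the name normalize_headings are hypothetical; the actual mapping used by normalize_section_headings is not listed here beyond the two examples above.

```python
# Hypothetical alias table; only the "intro" and "method" entries are
# taken from the description above.
HEADING_ALIASES = {
    "intro": "Introduction",
    "method": "Methods",
}


def normalize_headings(doc):
    """Rewrite known heading aliases in place and return the document."""
    for block in doc.get("blocks", []):
        if block.get("type") != "heading":  # assumed block type name
            continue
        key = block.get("text", "").strip().lower()
        if key in HEADING_ALIASES:
            block["text"] = HEADING_ALIASES[key]
    return doc


doc = normalize_headings(
    {
        "blocks": [
            {"type": "heading", "text": " Intro "},
            {"type": "paragraph", "text": "intro"},
            {"type": "heading", "text": "Discussion"},
        ]
    }
)
```

Paragraph text that happens to equal an alias is left alone, because only heading blocks are inspected.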
- scitex.msword.validate_document(doc)[source]
Validate document structure and add warnings.
Checks for common issues:
- Missing required sections
- Unmatched caption numbers
- Empty references section
- Duplicate figure numbers
- scitex.msword.create_post_import_hook(*functions)[source]
Create a composite post_import_hook from multiple functions.
- Parameters:
*functions (callable) – Functions to apply in sequence.
- Returns:
A single hook that applies all functions.
- Return type:
callable
Examples
>>> from scitex.msword.utils import (
...     link_captions_to_images,
...     normalize_section_headings,
...     create_post_import_hook,
... )
>>> hook = create_post_import_hook(
...     link_captions_to_images,
...     normalize_section_headings,
... )
>>> # Use with custom profile
>>> profile.post_import_hooks = [hook]
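Conceptually, a composite hook is plain function composition. The sketch below assumes each hook takes a document dict and returns it, as link_captions_to_images does; compose_hooks and the two demo functions are illustrative names, not part of the module.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]


def compose_hooks(*functions: Callable[[Doc], Doc]) -> Callable[[Doc], Doc]:
    """Return one hook that applies each function in the given order."""

    def hook(doc: Doc) -> Doc:
        for function in functions:
            doc = function(doc)
        return doc

    return hook


def add_warning(doc: Doc) -> Doc:
    # Demo hook: record that the document was checked.
    doc.setdefault("warnings", []).append("checked")
    return doc


def upper_title(doc: Doc) -> Doc:
    # Demo hook: upper-case a (hypothetical) title field.
    doc["title"] = doc.get("title", "").upper()
    return doc


hook = compose_hooks(add_warning, upper_title)
result = hook({"title": "draft"})
```

Order matters: a hook that validates captions, for example, would normally run after the hook that links them.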
- scitex.msword.diff_docx(a, b, *, include_run_diff=True)[source]
Compute paragraph-level diff between two DOCX documents.
- Parameters:
a (str | Path | docx.Document) – Inputs to compare. May be paths or already-loaded Documents.
b (str | Path | docx.Document) – Inputs to compare. May be paths or already-loaded Documents.
include_run_diff (bool, default True) – If True, modify operations include a runs_changed field listing the run-level formatting deltas.
- Returns:
Each entry is one of:
{"op": "equal", "index": int, "text_a": str, "text_b": str}
{"op": "insert", "index": int, "text_a": None, "text_b": str}
{"op": "delete", "index": int, "text_a": str, "text_b": None}
{"op": "modify", "index": int, "text_a": str, "text_b": str, "runs_changed": [...]}
index refers to the paragraph index in document b for equal/insert/modify operations, and to the paragraph index in document a for delete operations.
- Return type:
Examples
>>> from scitex_msword.diff import diff_docx
>>> ops = diff_docx("v15.docx", "v16.docx")
>>> changes = [o for o in ops if o["op"] != "equal"]
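To make the operation shapes and the index convention concrete, here is a sketch that produces the same kind of list from two lists of paragraph strings using the standard library's difflib. It is an illustration under assumptions, not how diff_docx is implemented: diff_paragraphs is a hypothetical name, and the runs_changed field is omitted because plain strings carry no run formatting.

```python
from difflib import SequenceMatcher


def diff_paragraphs(a, b):
    """Paragraph-level diff of two lists of strings."""
    ops = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                ops.append({"op": "equal", "index": j1 + k,
                            "text_a": a[i1 + k], "text_b": b[j1 + k]})
            continue
        # For "replace", pair paragraphs one-to-one as modifications.
        paired = min(i2 - i1, j2 - j1) if tag == "replace" else 0
        for k in range(paired):
            ops.append({"op": "modify", "index": j1 + k,
                        "text_a": a[i1 + k], "text_b": b[j1 + k]})
        # Leftover paragraphs in a are deletions (indexed in a).
        for i in range(i1 + paired, i2):
            ops.append({"op": "delete", "index": i,
                        "text_a": a[i], "text_b": None})
        # Leftover paragraphs in b are insertions (indexed in b).
        for j in range(j1 + paired, j2):
            ops.append({"op": "insert", "index": j,
                        "text_a": None, "text_b": b[j]})
    return ops


old = ["Intro", "Old method", "Results"]
new = ["Intro", "New method", "Results", "Thanks"]
ops = diff_paragraphs(old, new)
changes = [o for o in ops if o["op"] != "equal"]
```

Here changes holds one modify entry at index 1 and one insert entry at index 3, both indexed in the second list.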
- scitex.msword.mark_additions(document, runs, color='turquoise')[source]
Highlight the runs that the operator (or agent) added to the document.
- Parameters:
document (docx.Document) – The Document to mutate in place.
runs (iterable of (paragraph_idx, run_idx)) – Targets to highlight. Out-of-range indices are skipped silently.
color (str, default "turquoise") – Color name. See module docstring for the supported palette.
- Returns:
The same Document object, mutated.
- Return type:
docx.Document
Examples
>>> from scitex_msword.highlights import mark_additions
>>> doc = mark_additions(doc, [(3, 0), (5, 2)])  # default turquoise
- scitex.msword.mark_modifications(document, runs, color='magenta')[source]
Highlight the runs that the operator (or agent) modified in the document.
- Parameters:
document (docx.Document) – The Document to mutate in place.
runs (iterable of (paragraph_idx, run_idx)) – Targets to highlight. Out-of-range indices are skipped silently.
color (str, default "magenta") – Color name. See module docstring for the supported palette.
- Returns:
The same Document object, mutated.
- Return type:
docx.Document
Examples
>>> from scitex_msword.highlights import mark_modifications
>>> doc = mark_modifications(doc, [(7, 1)])  # default magenta
- scitex.msword.extract_highlights(document, by_color=True)[source]
Extract highlighted runs from a document, grouped by color.
- Parameters:
document (docx.Document) – The document to scan.
by_color (bool, default True) – If True (default), return {color_name: [run_info, ...]}. If False, return a single {"all": [run_info, ...]} bucket with each entry’s color field populated.
- Returns:
Mapping from color name to a list of run info dicts of shape:
{"paragraph": int, "run": int, "text": str, "color": str}
- Return type:
Examples
>>> from scitex_msword.highlights import extract_highlights
>>> by_color = extract_highlights(doc)
>>> by_color.get("turquoise", [])
[{'paragraph': 3, 'run': 0, 'text': '...', 'color': 'turquoise'}]
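The two return shapes selected by by_color can be illustrated on a list of run info dicts that has already been collected. group_highlights is a hypothetical helper written for this sketch; the scanning of the document itself is left out.

```python
from collections import defaultdict


def group_highlights(run_infos, by_color=True):
    """Bucket run info dicts by color, or into a single "all" bucket."""
    if not by_color:
        return {"all": list(run_infos)}
    grouped = defaultdict(list)
    for info in run_infos:
        grouped[info["color"]].append(info)
    return dict(grouped)


runs = [
    {"paragraph": 3, "run": 0, "text": "new text", "color": "turquoise"},
    {"paragraph": 7, "run": 1, "text": "edited", "color": "magenta"},
    {"paragraph": 5, "run": 2, "text": "more", "color": "turquoise"},
]
by_color = group_highlights(runs)
flat = group_highlights(runs, by_color=False)
```

With the default colors of mark_additions and mark_modifications, the turquoise bucket corresponds to additions and the magenta bucket to modifications.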
- scitex.msword.clear_highlights(document, colors=None)[source]
Remove highlights from all runs (optionally only for the listed colors).
- Parameters:
document (docx.Document) – Document to mutate in place.
colors (iterable of str, optional) – If provided, only runs with one of these highlight colors are cleared. If None (default), every highlighted run is cleared.
- Returns:
The same Document object, mutated.
- Return type:
docx.Document
- scitex.msword.preserve_bold_tokens(document, tokens, *, font_name='MS Gothic', case_sensitive=True)[source]
Walk every paragraph in document and bold-emphasize each token hit.
Wherever a token appears inside paragraph text, the paragraph’s runs are split so that the token sits in its own run with bold=True and font.name = font_name (Latin + East-Asian + complex script slots are all set so Japanese text picks up MS Gothic in Word).
- Parameters:
document (docx.Document) – The Document to mutate in place.
tokens (sequence of str) – Strings to emphasize. Empty strings are ignored. Longer tokens take precedence on overlapping matches.
font_name (str, default "MS Gothic") – Font face applied to matched tokens.
case_sensitive (bool, default True) – If False, matching is case-insensitive.
- Returns:
The same Document object, mutated.
- Return type:
docx.Document
Notes
This implementation rewrites all runs of a paragraph when at least one token hits; paragraphs without hits are left untouched. The original surrounding format (italic, underline, size, highlight) is captured from the first run before rebuilding; if you need finer-grained preservation, run preserve_bold_tokens() before other run-level edits.
Examples
>>> from scitex_msword.bold import preserve_bold_tokens
>>> preserve_bold_tokens(doc, tokens=["BOOST", "JST"])
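The run-splitting step can be pictured as cutting the paragraph text into (text, is_token) segments, each of which would then become its own run. The sketch below is one way to do that with a regular expression, assuming that "longer tokens take precedence" means the longer token wins when two candidates start at the same position; split_on_tokens is a hypothetical name, and font handling is not shown.

```python
import re


def split_on_tokens(text, tokens, case_sensitive=True):
    """Split text into (segment, is_token) pairs."""
    tokens = [t for t in tokens if t]  # empty strings are ignored
    if not tokens:
        return [(text, False)] if text else []
    # Longest first, so the alternation prefers the longer token
    # when several tokens match at the same position.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(t) for t in ordered)
    flags = 0 if case_sensitive else re.IGNORECASE
    segments = []
    position = 0
    for match in re.finditer(pattern, text, flags):
        if match.start() > position:
            segments.append((text[position:match.start()], False))
        segments.append((match.group(0), True))
        position = match.end()
    if position < len(text):
        segments.append((text[position:], False))
    return segments


segments = split_on_tokens("BOOST and JST-BOOST", ["JST", "BOOST", "JST-BOOST"])
```

In this example "JST-BOOST" ends up as a single token segment instead of being split into "JST" and "BOOST".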
- scitex.msword.extract_comments(document)[source]
Extract Word comments from a .docx file or open Document.
- Parameters:
document (str | Path | docx.Document) – Path to the .docx or an already-open Document.
- Returns:
One entry per comment:
{"id": int | str,
 "author": str,
 "date": str,  # ISO timestamp string, may be empty
 "text": str,  # comment body
 "anchor_text": str,  # text the comment is anchored to
 "paragraph_range": [start, end]}
anchor_text and paragraph_range default to "" and [None, None] when no in-document anchor can be located.
- Return type:
Examples
>>> from scitex_msword.comments import extract_comments
>>> comments = extract_comments("boost-v16.docx")
>>> [c["text"] for c in comments]
['Please rephrase this', 'REPLACE: Use the new wording']
- scitex.msword.apply_comments_as_edits(document, *, comments=None, grammar='replace')[source]
Apply comments to the document body using a narrow grammar.
Only the REPLACE: grammar is currently supported, i.e. a comment whose body matches r"^\s*REPLACE\s*:\s*(.+?)\s*$" is interpreted as “replace this comment’s anchor text with the trailing payload”. Other comments are ignored.
- Parameters:
document (docx.Document) – The Document to mutate in place.
comments (list[dict], optional) – Pre-extracted comments (as returned by extract_comments()). If None, the comments are read from document directly.
grammar (str, default "replace") – Reserved for future expansion. Currently only "replace" is recognised.
- Returns:
Summary: {"applied": int, "skipped": int, "details": [...]}.
- Return type:
Examples
>>> from scitex_msword.comments import apply_comments_as_edits
>>> summary = apply_comments_as_edits(doc)
>>> summary["applied"]
2
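The REPLACE: grammar can be tried out on plain text. The sketch below reuses the regular expression quoted above and the comment dict shape returned by extract_comments(); everything else is an assumption for illustration, including the name apply_replace_comments, the choice to count non-matching comments as skipped, and the layout of the details entries.

```python
import re

# Pattern quoted in the description above.
REPLACE_RE = re.compile(r"^\s*REPLACE\s*:\s*(.+?)\s*$")


def apply_replace_comments(text, comments):
    """Apply REPLACE: comments to a plain string; return (text, summary)."""
    applied = 0
    skipped = 0
    details = []
    for comment in comments:
        match = REPLACE_RE.match(comment["text"])
        anchor = comment.get("anchor_text", "")
        if match is None or not anchor or anchor not in text:
            # Assumption: anything that cannot be applied counts as skipped.
            skipped += 1
            details.append({"id": comment["id"], "status": "skipped"})
            continue
        # Replace only the first occurrence of the anchor text.
        text = text.replace(anchor, match.group(1), 1)
        applied += 1
        details.append({"id": comment["id"], "status": "applied"})
    return text, {"applied": applied, "skipped": skipped, "details": details}


comments = [
    {"id": 0, "text": "Please rephrase this", "anchor_text": "stays"},
    {"id": 1, "text": "  REPLACE : new wording ", "anchor_text": "old wording"},
]
new_text, summary = apply_replace_comments("The old wording stays.", comments)
```

Whitespace around the keyword, the colon and the payload is tolerated by the pattern, so the second comment still applies.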
- scitex.msword.enable_track_changes(document, enabled=True)[source]
Toggle Word’s “Track Changes” switch on the document.
Inserts <w:trackChanges/> into word/settings.xml when enabled=True (idempotent) or removes it when enabled=False.
- Parameters:
document (docx.Document) – The Document to mutate in place.
enabled (bool, default True) – True keeps a single <w:trackChanges/> element present; False removes any such elements.
- Returns:
The same Document object (chainable).
- Return type:
docx.Document
- scitex.msword.is_track_changes_enabled(document)[source]
Return True iff <w:trackChanges/> is present in settings.xml.
- Return type:
- scitex.msword.wrap_as_tracked_insertion(paragraph, runs, author='agent', date=None, w_id=None)[source]
Wrap the given runs of paragraph in <w:ins> revision blocks.
Word renders the wrapped content as “inserted by <author>” and surfaces it as an accept/reject-able revision.
- Parameters:
paragraph (docx.text.paragraph.Paragraph) – Paragraph that owns the runs to wrap.
runs (sequence of Run or int) – Runs to wrap, by Run object or by 0-based index.
author (str, default "agent") – Recorded in w:author.
date (str, optional) – ISO-8601 string for w:date; defaults to now (UTC).
w_id (int, optional) – Explicit revision id; auto-assigned (max+1) when None.
- Returns:
Newly created <w:ins> lxml elements.
- Return type:
- scitex.msword.wrap_as_tracked_deletion(paragraph, runs, author='agent', date=None, w_id=None)[source]
Wrap the given runs of paragraph in <w:del> revision blocks.
Each wrapped run’s <w:t> children are also retagged as <w:delText> so Word renders the deletion with strike-through.
- Parameters:
paragraph (docx.text.paragraph.Paragraph) – Paragraph that owns the runs to wrap.
runs (sequence of Run or int) – Runs to wrap, by Run object or by 0-based index.
author (str, default "agent") – Recorded in w:author.
date (str, optional) – ISO-8601 string for w:date; defaults to now (UTC).
w_id (int, optional) – Explicit revision id; auto-assigned (max+1) when None.
- Returns:
Newly created <w:del> lxml elements.
- Return type:
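The XML transformation behind a tracked deletion can be shown with the standard library alone. The sketch below works on an xml.etree.ElementTree paragraph rather than a python-docx Paragraph, handles a single run, and takes the revision id as a plain argument; wrap_run_as_deletion and the helper w are hypothetical names, not the module's API.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Return the namespace-qualified name of a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def wrap_run_as_deletion(paragraph, run, author="agent", date=None, w_id=1):
    """Move run into a new <w:del> wrapper at the same position."""
    if date is None:
        date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    wrapper = ET.Element(
        w("del"), {w("id"): str(w_id), w("author"): author, w("date"): date}
    )
    position = list(paragraph).index(run)
    paragraph.remove(run)
    # Deleted text is stored in <w:delText> instead of <w:t>.
    for text_el in list(run.iter(w("t"))):
        text_el.tag = w("delText")
    wrapper.append(run)
    paragraph.insert(position, wrapper)
    return wrapper


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r><w:t>keep </w:t></w:r>"
    "<w:r><w:t>drop</w:t></w:r>"
    "</w:p>"
)
second_run = paragraph.findall(w("r"))[1]
deletion = wrap_run_as_deletion(paragraph, second_run, date="2025-01-01T00:00:00Z")
```

A tracked insertion follows the same pattern with a <w:ins> wrapper and without the retagging step.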
- scitex.msword.extract_tracked_changes(document)[source]
Return every <w:ins>/<w:del> revision as a structured dict.
- scitex.msword.accept_all_tracked_changes(document)[source]
Accept all tracked changes, equivalent to Word’s “Accept All”.
<w:ins> wrappers are unwrapped (content remains); <w:del> wrappers and their contents are removed.
- Parameters:
document (docx.Document) – The Document to mutate in place.
- Returns:
The same Document, mutated.
- Return type:
docx.Document
- scitex.msword.reject_all_tracked_changes(document)[source]
Reject all tracked changes, equivalent to Word’s “Reject All”.
<w:ins> wrappers are removed along with their contents; <w:del> wrappers are unwrapped and their <w:delText> children retagged back to <w:t> so the original text is restored.
- Parameters:
document (docx.Document) – The Document to mutate in place.
- Returns:
The same Document, mutated.
- Return type:
docx.Document
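The accept and reject rules described above are mirror images of each other, which the following standard-library sketch makes explicit. It operates on an xml.etree.ElementTree tree instead of a python-docx Document and only looks at wrappers that are direct children of a paragraph; accept_all, reject_all and the helpers are hypothetical names, not the module's functions.

```python
import copy
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Return the namespace-qualified name of a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def _resolve(root, unwrap_tag, drop_tag):
    """Unwrap unwrap_tag wrappers and drop drop_tag wrappers per paragraph."""
    for paragraph in list(root.iter(w("p"))):
        for child in list(paragraph):
            if child.tag == w(drop_tag):
                paragraph.remove(child)
            elif child.tag == w(unwrap_tag):
                position = list(paragraph).index(child)
                paragraph.remove(child)
                for offset, inner in enumerate(list(child)):
                    # Only relevant when unwrapping <w:del>: restore <w:t>.
                    for text_el in list(inner.iter(w("delText"))):
                        text_el.tag = w("t")
                    paragraph.insert(position + offset, inner)
    return root


def accept_all(root):
    return _resolve(root, unwrap_tag="ins", drop_tag="del")


def reject_all(root):
    return _resolve(root, unwrap_tag="del", drop_tag="ins")


def visible_text(root):
    return "".join(t.text or "" for t in root.iter(w("t")))


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W_NS}">'
    "<w:r><w:t>A</w:t></w:r>"
    '<w:ins w:id="1" w:author="agent"><w:r><w:t>B</w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="agent"><w:r><w:delText>C</w:delText></w:r></w:del>'
    "</w:p>"
)
accepted = accept_all(copy.deepcopy(paragraph))
rejected = reject_all(copy.deepcopy(paragraph))
```

Starting from "A", an inserted "B" and a deleted "C", accepting leaves "AB" and rejecting leaves "AC".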