Module manubot.pandoc.cite_filter
This module defines a pandoc filter for manubot cite functionality.
The filter can be called with the `pandoc-manubot-cite` command.
Options
Configuration is provided via Pandoc metadata fields.
- `bibliography` (sequence of strings): Use to define reference metadata manually. Pandoc supports specifying multiple external bibliography files. When bibliography files are specified, this filter will read them instead of pandoc. Behavior should be similar to Pandoc, with format inferred by extension: .json for CSL JSON, .yaml for CSL YAML, .bib for BibLaTeX.
- `references` (sequence of CSL-Item mappings): Same as Pandoc's references metadata field.
- `citekey-aliases` (mapping: string -> string): Used to define aliases (tags) for cite-by-id citations. Useful when a citation is used many times or contains invalid characters. Aliases can also be defined in markdown with link reference syntax.
- `manubot-infer-citekey-prefixes` (boolean): Attempt to infer the prefix for citekeys without a known prefix. For example, allow '@10.1371/journal.pcbi.1007128' with a 'doi:' prefix. Default is true.
- `manubot-output-citekeys` (string): Path to write a TSV table of citekeys showing their transformation from input_id to short_id.
- `manubot-bibliography-cache` (string): Path to read and write bibliographic metadata as CSL JSON/YAML. Intended as a human-editable cache of the bibliography data, for situations where this filter is run multiple times. This is similar to specifying bibliography=FILE and manubot-output-bibliography=FILE in a single argument, but will not error trying to read a bibliography file that does not yet exist.
- `manubot-output-bibliography` (string): Path to write the generated CSL JSON bibliography. If specified in addition to `manubot-bibliography-cache`, two output bibliographies will be written (with the same references).
- `manubot-requests-cache-path` (string): Enable caching HTTP requests to this path (minus the extension) using requests-cache. For example, setting to `.cache/requests-cache` will cache requests to `.cache/requests-cache.sqlite`.
- `manubot-clear-requests-cache` (boolean): If true, clear the requests cache at `manubot-requests-cache-path`.
- `manubot-fail-on-errors` (boolean): If true, return a nonzero exit status if any errors are logged. Default is false, which allows Pandoc to proceed when some citations could not be processed.
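As a rough illustration of how these options fit together, the sketch below builds a metadata mapping and writes it to a JSON file. Only the field names come from the list above. The file names, the alias `plos-paper`, and the helpers `build_metadata` and `write_metadata` are hypothetical, and the alias value assumes that an alias maps a tag to a prefixed citekey.

```python
import json
import tempfile
from pathlib import Path


def build_metadata() -> dict:
    """Return example Pandoc metadata using the documented field names."""
    return {
        # hypothetical paths and alias, chosen only for illustration
        "bibliography": ["manual-references.json"],
        "citekey-aliases": {"plos-paper": "doi:10.1371/journal.pcbi.1007128"},
        "manubot-infer-citekey-prefixes": True,
        "manubot-output-citekeys": "output/citekeys.tsv",
        "manubot-bibliography-cache": "output/bibliography-cache.json",
        "manubot-requests-cache-path": ".cache/requests-cache",
        "manubot-clear-requests-cache": False,
        "manubot-fail-on-errors": True,
    }


def write_metadata(path: Path) -> Path:
    """Serialize the metadata mapping to a JSON file and return its path."""
    path.write_text(json.dumps(build_metadata(), indent=2), encoding="utf-8")
    return path


# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as directory:
    written = write_metadata(Path(directory) / "metadata.json")
    round_tripped = json.loads(written.read_text(encoding="utf-8"))
    print(sorted(round_tripped))
```

Assuming a Pandoc version that supports `--metadata-file`, a file like this could be supplied alongside `--filter=pandoc-manubot-cite`. The same fields can equally be set in a document's YAML metadata block.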
development commands
# export to plain text (with pandoc < 2.11)
pandoc --to=plain --standalone --filter=pandoc-manubot-cite --filter=pandoc-citeproc manubot/pandoc/tests/test_cite_filter/input.md
# call the filter manually using pandoc JSON output
pandoc --to=json manubot/pandoc/tests/test_cite_filter/input.md | python manubot/pandoc/test_cite.py markdown
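Related resources on pandoc filters:
- [python pandocfilters package](https://github.com/jgm/pandocfilters)
- [python panflute package](https://github.com/sergiocorreia/panflute)
- [panflute Citation class](http://scorreia.com/software/panflute/code.html#panflute.elements.Citation)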
View Source
"""
This module defines a pandoc filter for manubot cite functionality.
The filter can be called with the `pandoc-manubot-cite` command.
## Options
Configuration is provided via Pandoc metadata fields.
- `bibliography` (sequence of strings):
Use to define reference metadata manually.
Pandoc supports specifying multiple external bibliography files.
When bibliography files are specified,
this filter will read them instead of pandoc.
Behavior should be similar to Pandoc,
with format inferred by extension:
.json for CSL JSON,
.yaml for CSL YAML,
.bib for BibLaTeX.
- `references` (sequence of CSL-Item mappings):
Same as Pandoc's references metadata field.
- `citekey-aliases` (mapping: string -> string):
Used to define aliases (tags) for cite-by-id citations.
Useful when a citation is used many times or contains invalid characters.
Aliases can also be defined in markdown with link reference syntax.
- `manubot-infer-citekey-prefixes` (boolean):
Attempt to infer the prefix for citekeys without a known prefix.
For example, allow '@10.1371/journal.pcbi.1007128' with a 'doi:' prefix.
Default is true.
- `manubot-output-citekeys` (string):
path to write TSV table of citekeys
showing their transformation from input_id to short_id.
- `manubot-bibliography-cache` (string):
Path to read and write bibliographic metadata as CSL JSON/YAML.
Intended as a human-editable cache of the bibliography data,
for situations where this filter is run multiple times.
This is similar to specifying bibliography=FILE and manubot-output-bibliography=FILE in a single argument,
but will not error trying to read a bibliography file that does not yet exist.
- `manubot-output-bibliography` (string):
path to write generated CSL JSON bibliography.
If specified in addition to `manubot-bibliography-cache`,
two output bibliographies will be written (with the same references).
- `manubot-requests-cache-path` (string):
Enable caching HTTP requests to this path (minus the extension) using [requests-cache](https://github.com/reclosedev/requests-cache).
For example, setting to `.cache/requests-cache` will cache requests to `.cache/requests-cache.sqlite`.
- `manubot-clear-requests-cache` (boolean):
If true, clear the requests cache at `manubot-requests-cache-path`.
- `manubot-fail-on-errors` (boolean):
If true, return a nonzero exit status if any errors are logged.
Default is false, which allows Pandoc to proceed when some citations could not be processed.
## development commands
```shell
# export to plain text (with pandoc < 2.11)
pandoc \
--to=plain \
--standalone \
--filter=pandoc-manubot-cite \
--filter=pandoc-citeproc \
manubot/pandoc/tests/test_cite_filter/input.md
# call the filter manually using pandoc JSON output
pandoc \
--to=json \
manubot/pandoc/tests/test_cite_filter/input.md \
| python manubot/pandoc/test_cite.py markdown
```
Related resources on pandoc filters:
- [python pandocfilters package](https://github.com/jgm/pandocfilters)
- [python panflute package](https://github.com/sergiocorreia/panflute)
- [panflute Citation class](http://scorreia.com/software/panflute/code.html#panflute.elements.Citation)
"""
import argparse
import logging
import os
from typing import Any, Dict
import panflute as pf
from manubot.cite.citations import Citations
def parse_args() -> argparse.Namespace:
    """
    Read command line arguments
    """
    parser = argparse.ArgumentParser(
        description="Pandoc filter for citation by persistent identifier. "
        "Filters are command-line programs that read and write a JSON-encoded abstract syntax tree for Pandoc. "
        "Unless you are debugging, run this filter as part of a pandoc command by specifying --filter=pandoc-manubot-cite."
    )
    parser.add_argument(
        "target_format",
        help="output format of the pandoc command, as per Pandoc's --to option",
    )
    parser.add_argument(
        "--input",
        nargs="?",
        type=argparse.FileType("r", encoding="utf-8"),
        help="path to read JSON input (defaults to stdin)",
    )
    parser.add_argument(
        "--output",
        nargs="?",
        type=argparse.FileType("w", encoding="utf-8"),
        help="path to write JSON output (defaults to stdout)",
    )
    args = parser.parse_args()
    return args
def _get_citekeys_action(elem: pf.Element, doc: pf.Doc) -> None:
    """
    Panflute action to extract citationId from all Citations in the AST.
    """
    if not isinstance(elem, pf.Citation):
        return None
    manuscript_citekeys = doc.manubot["manuscript_citekeys"]
    manuscript_citekeys.append(elem.id)
    return None
def _citation_to_id_action(elem: pf.Element, doc: pf.Doc) -> None:
    """
    Panflute action to update the citationId of Citations in the AST
    with their manubot-created keys.
    """
    if not isinstance(elem, pf.Citation):
        return None
    mapper = doc.manubot["citekey_shortener"]
    if elem.id in mapper:
        elem.id = mapper[elem.id]
    return None
def _get_reference_link_citekey_aliases(elem: pf.Element, doc: pf.Doc) -> None:
    """
    Extract citekey aliases from the document that were defined
    using markdown's link reference syntax.
    https://spec.commonmark.org/0.29/#link-reference-definitions

    Based on pandoc-url2cite implementation by phiresky at
    https://github.com/phiresky/pandoc-url2cite/blob/b28374a9a037a5ce1747b8567160d8dffd64177e/index.ts#L118-L152
    """
    if type(elem) != pf.Para:
        # require link reference definitions to be in their own paragraph
        return
    while (
        len(elem.content) >= 3
        and type(elem.content[0]) == pf.Cite
        and len(elem.content[0].citations) == 1
        and type(elem.content[1]) == pf.Str
        and elem.content[1].text == ":"
    ):
        # paragraph consists of at least a Cite (with one Citation),
        # a Str (equal to ":"), and additional elements, such as a
        # link destination and possibly more link-reference definitions.
        dest_index = 3 if type(elem.content[2]) in {pf.Space, pf.SoftBreak} else 2
        destination = elem.content[dest_index]
        if type(destination) != pf.Str:
            # prevent infinite loop as per https://github.com/manubot/manubot/pull/302#issuecomment-906743300
            logging.warning(
                f"Unsupported destination type {type(destination)} in link reference syntax:\n{elem.to_json()}"
            )
            break
        # paragraph starts with `[@something]: something`
        # save info to citekeys and remove from paragraph
        citekey = elem.content[0].citations[0].id
        citekey_aliases = doc.manubot["citekey_aliases"]
        if citekey in citekey_aliases and citekey_aliases[citekey] != destination.text:
            logging.warning(f"multiple aliases defined for @{citekey}")
        citekey_aliases[citekey] = destination.text
        # found citation, add it to citekeys and remove it from document
        elem.content = elem.content[dest_index + 1 :]
        # remove leading SoftBreak, before continuing
        if len(elem.content) > 0 and type(elem.content[0]) == pf.SoftBreak:
            elem.content.pop(0)
def _get_load_manual_references_kwargs(doc: pf.Doc) -> Dict[str, Any]:
    """
    Return keyword arguments for Citations.load_manual_references.
    """
    manual_refs = doc.get_metadata("references", default=[])
    bibliography_paths = doc.get_metadata("bibliography", default=[])
    if not isinstance(bibliography_paths, list):
        bibliography_paths = [bibliography_paths]
    bibliography_cache_path = doc.manubot["bibliography_cache"]
    if (
        bibliography_cache_path
        and bibliography_cache_path not in bibliography_paths
        and os.path.exists(bibliography_cache_path)
    ):
        bibliography_paths.append(bibliography_cache_path)
    return {
        "paths": bibliography_paths,
        "extra_csl_items": manual_refs,
    }
def process_citations(doc: pf.Doc) -> None:
    """
    Apply citation-by-identifier to a Python object representation of
    Pandoc's Abstract Syntax Tree.
    """
    # process metadata.manubot-bibliography-cache
    bib_cache = doc.get_metadata(key="manubot-bibliography-cache")
    if not (bib_cache is None or isinstance(bib_cache, str)):
        logging.warning(
            f"Expected metadata.manubot-bibliography-cache to be a string or null (None), "
            f"but received a {bib_cache.__class__.__name__}. Setting to None."
        )
        bib_cache = None
    doc.manubot["bibliography_cache"] = bib_cache
    # process metadata.citekey-aliases
    citekey_aliases = doc.get_metadata("citekey-aliases", default={})
    if not isinstance(citekey_aliases, dict):
        logging.warning(
            f"Expected metadata.citekey-aliases to be a dict, "
            f"but received a {citekey_aliases.__class__.__name__}. Disregarding."
        )
        citekey_aliases = {}
    doc.manubot["citekey_aliases"] = citekey_aliases
    doc.walk(_get_reference_link_citekey_aliases)
    doc.walk(_get_citekeys_action)
    manuscript_citekeys = doc.manubot["manuscript_citekeys"]
    citations = Citations(
        input_ids=manuscript_citekeys,
        aliases=citekey_aliases,
        infer_citekey_prefixes=doc.get_metadata(
            "manubot-infer-citekey-prefixes", default=True
        ),
    )
    citations.csl_item_failure_log_level = "ERROR"

    requests_cache_path = doc.get_metadata("manubot-requests-cache-path")
    if requests_cache_path:
        from manubot.process.requests_cache import RequestsCache

        req_cache = RequestsCache(requests_cache_path)
        req_cache.mkdir()
        req_cache.install()
        if doc.get_metadata("manubot-clear-requests-cache", default=False):
            req_cache.clear()

    citations.filter_pandoc_xnos()
    citations.load_manual_references(**_get_load_manual_references_kwargs(doc))
    citations.inspect(log_level="WARNING")
    citations.get_csl_items()
    doc.manubot["citekey_shortener"] = citations.input_to_csl_id
    doc.walk(_citation_to_id_action)

    if requests_cache_path:
        req_cache.close()

    citations.write_citekeys_tsv(path=doc.get_metadata("manubot-output-citekeys"))
    citations.write_csl_items(path=doc.get_metadata("manubot-output-bibliography"))
    citations.write_csl_items(path=doc.manubot["bibliography_cache"])
    # Update pandoc metadata with fields that this filter
    # has either consumed, created, or modified.
    doc.metadata["bibliography"] = []
    doc.metadata["references"] = citations.csl_items
    doc.metadata["citekey_aliases"] = citekey_aliases
def main() -> None:
    from manubot.command import (
        exit_if_error_handler_fired,
        setup_logging_and_errors,
    )

    diagnostics = setup_logging_and_errors()
    args = parse_args()
    # Let panflute handle io to sys.stdout / sys.stdin to set utf-8 encoding.
    # args.input=None for stdin, args.output=None for stdout
    doc = pf.load(input_stream=args.input)
    log_level = doc.get_metadata("manubot-log-level", "WARNING")
    diagnostics["logger"].setLevel(getattr(logging, log_level))
    logging.debug(f"Input Pandoc metadata:\n{doc.get_metadata()}")
    doc.manubot = {"manuscript_citekeys": []}
    process_citations(doc)
    pf.dump(doc, output_stream=args.output)
    if doc.get_metadata("manubot-fail-on-errors", False):
        exit_if_error_handler_fired(diagnostics["error_handler"])


if __name__ == "__main__":
    main()
Functions
main
def main() -> None
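The `manubot-fail-on-errors` behavior in `main` can be pictured with the standard library alone. The sketch below is not manubot's implementation of `setup_logging_and_errors` or `exit_if_error_handler_fired`. It is a minimal stand-in, with a hypothetical `ErrorFlagHandler` class and `exit_status` helper, showing how a logging handler can record that an error was logged so the exit status can be decided afterwards.

```python
import logging


class ErrorFlagHandler(logging.Handler):
    """Hypothetical handler that remembers whether an ERROR (or worse) was logged."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.fired = False

    def emit(self, record: logging.LogRecord) -> None:
        self.fired = True


def exit_status(handler: ErrorFlagHandler, fail_on_errors: bool) -> int:
    """Return 1 only when failing is requested and an error was logged."""
    return 1 if fail_on_errors and handler.fired else 0


logger = logging.getLogger("cite_filter_sketch")
logger.propagate = False  # keep the demo output quiet
handler = ErrorFlagHandler()
logger.addHandler(handler)

logger.warning("citation metadata incomplete")  # below ERROR: flag stays unset
status_after_warning = exit_status(handler, fail_on_errors=True)

logger.error("could not retrieve metadata for a citekey")  # sets the flag
status_after_error = exit_status(handler, fail_on_errors=True)
status_when_disabled = exit_status(handler, fail_on_errors=False)

print(status_after_warning, status_after_error, status_when_disabled)  # 0 1 0
```

With the option left at its default of false, the status stays zero even after an error, which matches the documented intent of letting Pandoc proceed when some citations could not be processed.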
parse_args
def parse_args() -> argparse.Namespace
Read command line arguments
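To see what the parsed namespace looks like when Pandoc invokes the filter, the sketch below rebuilds a simplified parser with the same three arguments and parses an explicit argument list. The `build_parser` helper is hypothetical and omits the help strings. It assumes Pandoc passes only the target format, so both streams fall back to `None`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Simplified stand-in for the parser constructed in parse_args."""
    parser = argparse.ArgumentParser(description="sketch of the filter's CLI")
    parser.add_argument("target_format")
    parser.add_argument(
        "--input", nargs="?", type=argparse.FileType("r", encoding="utf-8")
    )
    parser.add_argument(
        "--output", nargs="?", type=argparse.FileType("w", encoding="utf-8")
    )
    return parser


# Only the positional target format is given, as when run via --filter.
args = build_parser().parse_args(["markdown"])
print(args.target_format, args.input, args.output)  # markdown None None
```

Leaving `--input` and `--output` as `None` is what lets panflute fall back to stdin and stdout, as the comment in `main` notes.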
process_citations
def process_citations(doc: panflute.elements.Doc) -> None
Apply citation-by-identifier to a Python object representation of
Pandoc's Abstract Syntax Tree.
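At its core, `process_citations` makes two passes over the citations: one to collect the citekeys and one to replace each with its shortened identifier. The sketch below mimics that flow without panflute or manubot. The `Citation` dataclass, both helper functions, and the `ref1` identifier are hypothetical stand-ins. In the filter itself the mapping comes from `citations.input_to_csl_id`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    """Minimal stand-in for a panflute Citation (only the id field)."""

    id: str


def collect_citekeys(citations: List[Citation]) -> List[str]:
    """First pass: gather every citation id in document order."""
    return [citation.id for citation in citations]


def rewrite_citekeys(citations: List[Citation], mapper: Dict[str, str]) -> None:
    """Second pass: swap ids found in the mapper, leaving others untouched."""
    for citation in citations:
        if citation.id in mapper:
            citation.id = mapper[citation.id]


document = [Citation("doi:10.1371/journal.pcbi.1007128"), Citation("fig:overview")]
keys = collect_citekeys(document)

# Hypothetical mapping that covers only the first key.
mapper = {"doi:10.1371/journal.pcbi.1007128": "ref1"}
rewrite_citekeys(document, mapper)

print(keys)
print([citation.id for citation in document])  # ['ref1', 'fig:overview']
```

Keys absent from the mapping pass through unchanged, mirroring the `if elem.id in mapper` guard in `_citation_to_id_action`.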