strip-tags by simonw
Strip tags from HTML, optionally from areas identified by CSS selectors
See llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs for more on this project.
Install this tool using pip:
pip install strip-tags
Pipe content into this tool to strip tags from it:
cat input.html | strip-tags > output.txt
Or pass a filename:
strip-tags -i input.html > output.txt
To run against just specific areas identified by CSS selectors:
strip-tags '.content' -i input.html > output.txt
This can be called with multiple selectors:
cat input.html | strip-tags '.content' '.sidebar' > output.txt
To return just the first element on the page that matches one of the selectors, use --first:
cat input.html | strip-tags .content --first > output.txt
To remove content contained by specific selectors - e.g. the <nav> section of a page, use -r or --remove:
cat input.html | strip-tags -r nav > output.txt
To minify whitespace - reducing multiple space and tab characters to a single space, removing any remaining blank lines - add -m or --minify:
cat input.html | strip-tags -m > output.txt
You can also run this command using python -m like this:
python -m strip_tags --help
When passing content to a language model, it can sometimes be useful to leave in a subset of HTML tags - <h1>This is the heading</h1> for example - to provide extra hints to the model.
The -t/--keep-tag option can be passed multiple times to specify tags that should be kept.
This example looks at the <header> section of https://datasette.io/ and keeps the tags around the list items and <h1> elements:
curl -s https://datasette.io/ | strip-tags header -t h1 -t li
<li>Uses</li>
<li>Documentation Docs</li>
<li>Tutorials</li>
<li>Examples</li>
<li>Plugins</li>
<li>Tools</li>
<li>News</li>
<h1>
Datasette
</h1>
Find stories in data
All attributes will be removed from the tags, except for the id= and class= attributes, since those may provide further useful hints to the language model.
The href attribute on links, the alt attribute on images and the name and value attributes on meta tags are kept as well.
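The attribute rules above can be sketched as a small filter. This is an illustration only, not the tool's own code: the helper name filter_attrs and both lookup tables are hypothetical, and it assumes "links" means <a> tags and "images" means <img> tags.

```python
# Attributes kept on every retained tag, per the description above.
GLOBAL_ATTRS = {"id", "class"}

# Extra attributes kept only on specific tags.
# Assumption: "links" are <a> tags and "images" are <img> tags.
PER_TAG_ATTRS = {
    "a": {"href"},
    "img": {"alt"},
    "meta": {"name", "value"},
}


def filter_attrs(tag, attrs, all_attrs=False):
    """Hypothetical helper: return the (name, value) pairs that survive."""
    if all_attrs:
        # Mirrors the idea behind --all-attrs: keep everything.
        return list(attrs)
    allowed = GLOBAL_ATTRS | PER_TAG_ATTRS.get(tag.lower(), set())
    return [(name, value) for name, value in attrs if name.lower() in allowed]


link_attrs = [("href", "/docs"), ("style", "color: red"), ("class", "nav")]
print(filter_attrs("a", link_attrs))
# [('href', '/docs'), ('class', 'nav')]
print(filter_attrs("p", link_attrs))
# [('class', 'nav')]
```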
You can also specify a bundle of tags. For example, strip-tags -t hs will keep the tag markup for all levels of headings.
The following bundles can be used:
- -t hs: <h1>, <h2>, <h3>, <h4>, <h5>, <h6>
- -t metadata: <title>, <meta>
- -t structure: <header>, <nav>, <main>, <article>, <section>, <aside>, <footer>
- -t tables: <table>, <tr>, <td>, <th>, <thead>, <tbody>, <tfoot>, <caption>, <colgroup>, <col>
- -t lists: <ul>, <ol>, <li>, <dl>, <dd>, <dt>
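One way to picture how bundle names and plain tag names could be mixed in a single list is the sketch below. The BUNDLES table copies the list above; the function expand_keep_tags is hypothetical and is not claimed to match how the tool resolves bundles internally.

```python
from typing import Iterable, List

# Bundle names and their tags, copied from the list above.
BUNDLES = {
    "hs": ["h1", "h2", "h3", "h4", "h5", "h6"],
    "metadata": ["title", "meta"],
    "structure": ["header", "nav", "main", "article", "section", "aside", "footer"],
    "tables": ["table", "tr", "td", "th", "thead", "tbody", "tfoot",
               "caption", "colgroup", "col"],
    "lists": ["ul", "ol", "li", "dl", "dd", "dt"],
}


def expand_keep_tags(keep_tags: Iterable[str]) -> List[str]:
    """Hypothetical helper: expand bundle names, pass other names through."""
    expanded: List[str] = []
    for name in keep_tags:
        # A name that is not a bundle is treated as a single tag name.
        for tag in BUNDLES.get(name, [name]):
            if tag not in expanded:
                expanded.append(tag)
    return expanded


print(expand_keep_tags(["hs", "li"]))
# ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']
```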
You can use strip-tags from Python code too. The function signature looks like this:
def strip_tags(
    input: str,
    selectors: Optional[Iterable[str]]=None,
    *,
    removes: Optional[Iterable[str]]=None,
    minify: bool=False,
    remove_blank_lines: bool=False,
    first: bool=False,
    keep_tags: Optional[Iterable[str]]=None,
    all_attrs: bool=False
) -> str:
Here's an example:
from strip_tags import strip_tags
html = """
<div>
<h1>This has tags</h1>
<p>And whitespace too</p>
</div>
Ignore this bit.
"""
stripped = strip_tags(html, ["div"], minify=True, keep_tags=["h1"])
print(stripped)
Output:
<h1>This has tags</h1>
And whitespace too
Use remove_blank_lines=True to remove any remaining blank lines from the output.
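To make the difference between the two whitespace options concrete, here is a rough standard-library sketch. It is not the library's implementation: the function minify_whitespace is hypothetical, and it assumes that minifying on its own leaves blank lines in place until remove_blank_lines is requested.

```python
import re


def minify_whitespace(text: str, remove_blank_lines: bool = False) -> str:
    """Hypothetical helper illustrating the two whitespace options."""
    # Collapse each run of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim the whitespace left at the start or end of each line.
    lines = [line.strip() for line in text.split("\n")]
    if remove_blank_lines:
        # Drop lines that are now empty.
        lines = [line for line in lines if line]
    return "\n".join(lines).strip()


sample = "  Heading \t here\n\n\n   Body   text\t\tgoes here  \n"
print(repr(minify_whitespace(sample)))
# 'Heading here\n\n\nBody text goes here'
print(repr(minify_whitespace(sample, remove_blank_lines=True)))
# 'Heading here\nBody text goes here'
```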
Usage: strip-tags [OPTIONS] [SELECTORS]...
Strip tags from HTML, optionally from areas identified by CSS selectors
Example usage:
cat input.html | strip-tags > output.txt
To run against just specific areas identified by CSS selectors:
cat input.html | strip-tags .entry .footer > output.txt
Options:
  --version             Show the version and exit.
  -r, --remove TEXT     Remove content in these selectors
  -i, --input FILENAME  Input file
  -m, --minify          Minify whitespace
  -t, --keep-tag TEXT   Keep these <tags>
  --all-attrs           Include all attributes on kept tags
  --first               First element matching the selectors
  --help                Show this message and exit.
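When the command line tool is driven from a script, the options above can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The function build_strip_tags_argv is hypothetical, and it assumes -r can be repeated in the same way as -t.

```python
import shlex
from typing import Iterable, List, Optional


def build_strip_tags_argv(
    selectors: Iterable[str] = (),
    *,
    removes: Iterable[str] = (),
    keep_tags: Iterable[str] = (),
    input_path: Optional[str] = None,
    minify: bool = False,
    all_attrs: bool = False,
    first: bool = False,
) -> List[str]:
    """Hypothetical helper: build a strip-tags command from the documented options."""
    argv = ["strip-tags"]
    for selector in removes:
        # Assumption: -r may be given more than once.
        argv += ["-r", selector]
    for tag in keep_tags:
        argv += ["-t", tag]
    if input_path is not None:
        argv += ["-i", input_path]
    if minify:
        argv.append("-m")
    if all_attrs:
        argv.append("--all-attrs")
    if first:
        argv.append("--first")
    # Positional CSS selectors go last.
    argv.extend(selectors)
    return argv


argv = build_strip_tags_argv(["header"], keep_tags=["h1", "li"], minify=True)
print(shlex.join(argv))
# strip-tags -t h1 -t li -m header
```

With strip-tags installed, a list like this could be handed to subprocess.run along with the HTML as standard input. Passing a list rather than a shell string avoids quoting problems with selectors such as '.content'.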
To contribute to this tool, first check out the code. Then create a new virtual environment:
cd strip-tags
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest