Following the lxml Etree tutorial
I’m interested in the general problem of parsing documents to prepare them to be used in semantic search systems. The problem is that documents can exist in many forms. My own documents exist as: a folder of local markdown files, and a Hugo blog, where markdown files are rendered as HTML files. Other document formats include PDFs, DOCX files, ePub files,…
HTML, DOCX, and ePub files all have some relationship to XML.
What is XML?
“Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.”1
XML is self-descriptive
XML is intended to be self-descriptive. An example:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The XML above says what the thing is (a note), who it’s sent from, who it’s sent to, what its heading is, and what its body is. It doesn’t do anything though. To do anything it would need to be consumed by software.
The difference between XML and HTML
XML is designed to carry data. HTML is designed to display data — HTML does something. HTML tags are predefined. XML tags are not predefined.
What are XML namespaces?
(Paraphrased response from Claude): XML namespaces solve the problem of name collisions when
combining vocabularies from different sources. Imagine you’re creating a document that mixes two XML
vocabularies, one for describing books and another for describing furniture. Both vocabularies have
a <table> element. Namespaces provide a way to distinguish the two uses of <table> elements.
Namespaces use URIs (usually URLs) as unique identifiers. The URIs don’t have to point to anything real — they are just identifiers.
<root xmlns:book="http://example.com/book"
xmlns:furniture="http://example.com/furniture">
<book:table>
<book:row>Author data...</book:row>
</book:table>
<furniture:table>
<furniture:leg>Oak wood</furniture:leg>
</furniture:table>
</root>
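As a rough illustration of how a parser sees the two table elements, the sketch below parses the document above with the standard library's xml.etree.ElementTree module (covered later in this post). The variable names (doc, tags, ns, leg) are my own and are only there for the example.

```python
import xml.etree.ElementTree as ET

doc = """
<root xmlns:book="http://example.com/book"
      xmlns:furniture="http://example.com/furniture">
  <book:table>
    <book:row>Author data...</book:row>
  </book:table>
  <furniture:table>
    <furniture:leg>Oak wood</furniture:leg>
  </furniture:table>
</root>
"""

root = ET.fromstring(doc.strip())

# The parser replaces each prefix with its URI, written as {uri}localname,
# so the two <table> elements end up with different tag names.
tags = [child.tag for child in root]
print(tags)

# A prefix-to-URI mapping (the name "ns" is arbitrary) lets find() use prefixes.
ns = {
    "book": "http://example.com/book",
    "furniture": "http://example.com/furniture",
}
leg = root.find("furniture:table/furniture:leg", ns)
print(leg.text)
```

The prefixes themselves carry no meaning; only the URIs they map to are compared.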
Namespaces in HTML
Namespaces aren’t used in HTML(5) in practice as it’s got a fixed vocabulary. XHTML (the
XML-serialized version of HTML) does use namespaces — typically http://www.w3.org/1999/xhtml. You
might also encounter namespaces in HTML when:
- SVG is embedded (it has a namespace, e.g. <svg xmlns="http://www.w3.org/2000/svg">)
- MathML is embedded
- other XML vocabularies are mixed into the HTML
For more details about namespaces, see the Mozilla developer docs Namespaces crash course. (Interestingly, that’s in their SVG guides section.)
The Python xml.etree.ElementTree module
The xml.etree.ElementTree module (part of the Python standard library) implements an API for parsing and creating XML data.2
XML is a hierarchical data format — it’s naturally represented with a tree. The ElementTree module has two classes for this purpose:
The ElementTree module classes
- ElementTree represents the whole XML document as a tree
- Element represents a single node of the tree
The Python docs ElementTree tutorial is worth going through.
It uses this example XML (country_data.xml):
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
import xml.etree.ElementTree as ET
tree = ET.parse("country_data.xml")
root = tree.getroot()
Using IPython for convenience:
In [12]: print(type(tree))
<class 'xml.etree.ElementTree.ElementTree'> # tree is an ElementTree
In [13]: print(type(root))
<class 'xml.etree.ElementTree.Element'> # root is an Element
Basic Element class attributes (tag and attrib)
The root Element has a tag:
In [14]: root.tag
Out[14]: 'data'
Each element has a dictionary of attributes (empty for the "data" element in this case):
In [15]: root.attrib
Out[15]: {}
Child elements
Elements can have children:
In [18]: for child in root:
    ...:     print(child.tag, child.attrib)
    ...:
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
The children of children can have children. (It’s a tree):
In [19]: for child in root[0]:
    ...:     print(child.tag, child.attrib, child.text)
    ...:
rank {} 1
year {} 2008
gdppc {} 141100
neighbor {'name': 'Austria', 'direction': 'E'} None
neighbor {'name': 'Switzerland', 'direction': 'W'} None
Accessing child elements by index
Elements can be accessed by index:
In [20]: root[1][1].text
Out[20]: '2011'
Finding an element’s children
Finding elements with Element.iter():
In [22]: for neighbor in root.iter('neighbor'):
    ...:     print(neighbor.attrib)
    ...:
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
Finding elements with Element.findall(), Element.find(), and getting attributes with
Element.get():
In [23]: for country in root.findall('country'):
    ...:     rank = country.find('rank').text
    ...:     name = country.get('name')
    ...:     print(name, rank)
    ...:
Liechtenstein 1
Singapore 4
Panama 68
Modifying elements and updating the ElementTree
Modifying elements with Element.set() and updating an element tree with ElementTree.write():
In [24]: for rank in root.iter('rank'):
    ...:     new_rank = int(rank.text) + 1
    ...:     rank.text = str(new_rank)
    ...:     rank.set('updated', 'yes')
    ...:
In [25]: tree.write('output.xml') # write to a new file
In [26]: tree.write('country_data.xml') # or overwrite the original file
Removing elements from the ElementTree
Removing elements with Element.remove():
In [27]: for country in root.findall('country'): # note the use of root.findall() as opposed to root.iter('country')
    ...:     rank = int(country.find('rank').text)
    ...:     if rank > 50:
    ...:         root.remove(country)
    ...:
In [28]: tree.write('output.xml')
The modification while iterating issue mentioned in the comment is sure to get me at some point.
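A minimal sketch of why findall() is the safe choice here: it returns a plain list, so the loop walks a snapshot while remove() changes the live tree. The three-country document is made up for the example and uses only the standard library.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country name='A'><rank>60</rank></country>"
    "<country name='B'><rank>70</rank></country>"
    "<country name='C'><rank>10</rank></country>"
    "</data>"
)

# findall() builds the full list of matches before the loop starts.
matches = root.findall("country")
print(type(matches).__name__)

for country in matches:
    if int(country.find("rank").text) > 50:
        root.remove(country)  # changes the tree, not the list being looped over

remaining = [country.get("name") for country in root]
print(remaining)
```

Looping over root.iter('country') instead would walk the tree lazily while it is being changed, which the Python docs warn against.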
Building XML documents
Note the use of the SubElement() function for creating sub elements of a given element:
In [30]: a = ET.Element('a')
In [31]: b = ET.SubElement(a, 'b')
In [32]: c = ET.SubElement(a, 'c')
In [33]: d = ET.SubElement(c, 'd')
In [34]: ET.dump(a)
<a><b /><c><d /></c></a>
ElementTree XPath support
The ElementTree module provides limited support for using XPath expressions for locating elements in a tree. See the docs for details about the supported XPath syntax.
In [38]: root.findall(".")
Out[38]: [<Element 'data' at 0x7f096413b8d0>]
In [39]: root.findall("./country/neighbor")
Out[39]:
[<Element 'neighbor' at 0x7f0964139670>,
<Element 'neighbor' at 0x7f0964139760>,
<Element 'neighbor' at 0x7f09640da980>,
<Element 'neighbor' at 0x7f096408b740>,
<Element 'neighbor' at 0x7f096408a250>]
In [40]: root.findall(".//year/..[@name='Singapore']")
Out[40]: [<Element 'country' at 0x7f0964139ad0>]
In [41]: root.findall(".//*[@name='Singapore']")
Out[41]: [<Element 'country' at 0x7f0964139ad0>]
The lxml library
GitHub: https://github.com/lxml/lxml
lxml etree tutorial: https://lxml.de/tutorial.html
Note that what follows is just me working through the (very good) tutorial. Follow the original tutorial instead of what’s written below.
Installing lxml in a virtual env
pip install lxml
In [1]: from lxml import etree
lxml stubs
To get the language server I’m using on Neovim (ruff) to not give the warning "etree" is unknown import symbol, I needed to install lxml-stubs with
pip install lxml-stubs
The issue is mentioned here “etree” is unknown import symbol.
The lxml Element class
An Element is the main container object for the ElementTree (etree) API. Create elements with
etree.Element("element_name"). Append elements to existing elements with the append() method, or
use the etree.SubElement(parent, tag) factory function:
In [2]: root = etree.Element("root")
In [3]: print(root.tag)
root
In [4]: root.append(etree.Element("foo"))
In [5]: etree.tostring(root)
Out[5]: b'<root><foo/></root>'
In [6]: bar = etree.SubElement(root, "bar")
In [7]: etree.tostring(root)
Out[7]: b'<root><foo/><bar/></root>'
A helper function to print XML:
In [8]: def prettyprint(element, **kwargs):
   ...:     xml = etree.tostring(element, pretty_print=True, **kwargs)
   ...:     print(xml.decode(), end='')
   ...:
In [9]: prettyprint(root)
<root>
<foo/>
<bar/>
</root>
lxml elements are Python lists
Technically, lxml elements implement a lot of the behavior of Python lists:
In [13]: child = root[0]
In [14]: child.tag
Out[14]: 'foo'
In [15]: root.index(root[0])
Out[15]: 0
In [16]: root.index(root[1])
Out[16]: 1
In [17]: root.insert(0, etree.Element("foobar"))
In [18]: prettyprint(root)
<root>
<foobar/>
<foo/>
<bar/>
</root>
Moving elements
Elements can be moved to a new position in the list using indices:
In [19]: root[0] = root[-1]
In [20]: prettyprint(root)
<root>
<bar/>
<foo/>
</root>
Note how the behavior differs from Python lists. Assigning a list item to another index copies the reference, so the same object appears twice; in lxml the element is moved. This behavior also (I think) differs from how the Python ElementTree module deals with XML trees. The reason is that each lxml element has exactly one parent.
In [1]: l = [0, 1, 2, 3]
In [2]: l[0] = l[-1]
In [3]: l
Out[3]: [3, 1, 2, 3]
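To check the "moved versus copied" difference side by side, here is a small sketch that runs the same assignment against lxml and the standard library's ElementTree. The three-child document and the variable names are made up for the example.

```python
import copy
import xml.etree.ElementTree as ET

from lxml import etree

xml = "<root><a/><b/><c/></root>"

# lxml: assigning an existing child to another index moves it.
lxml_root = etree.XML(xml)
lxml_root[0] = lxml_root[-1]
lxml_tags = [child.tag for child in lxml_root]
print(lxml_tags)

# Standard library: the same Element object ends up in two positions.
std_root = ET.fromstring(xml)
std_root[0] = std_root[-1]
std_tags = [child.tag for child in std_root]
print(std_tags)
print(std_root[0] is std_root[2])

# To get list-like copying in lxml, copy the element explicitly.
copy_root = etree.XML(xml)
copy_root[0] = copy.deepcopy(copy_root[-1])
copy_tags = [child.tag for child in copy_root]
print(copy_tags)
```

The explicit deepcopy is the usual way to place an equivalent element in a second position without detaching the original.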
Each element has a parent
In [24]: root[1].getparent().tag
Out[24]: 'root'
Elements can have siblings
Element siblings (or neighbors) are accessed with getprevious() and getnext(). Note that the
Python is operator checks if two variables or expressions point to the same object in memory:
In [25]: root[0] is root[1].getprevious()
Out[25]: True
In [26]: root[1] is root[0].getnext()
Out[26]: True
Element attributes are dicts
Element attributes can be created with the Element factory function:
In [30]: root = etree.Element("root", foo="bar")
In [31]: etree.tostring(root)
Out[31]: b'<root foo="bar"/>'
In [32]: root.get("foo")
Out[32]: 'bar'
In [33]: root.get("doesntexist")
In [34]: root.set("hello", "hi")
In [35]: etree.tostring(root)
Out[35]: b'<root foo="bar" hello="hi"/>'
In [36]: sorted(root.keys())
Out[36]: ['foo', 'hello']
Elements can contain text
In [37]: root = etree.Element("root")
In [38]: root.text = "FOO BAR"
In [39]: etree.tostring(root)
Out[39]: b'<root>FOO BAR</root>'
Document-style or mixed-text content XML
In data-centric XML documents, text is only found inside leaf elements at the bottom of the tree hierarchy.
In document-style XML, e.g. (X)HTML, text can also appear between different elements in the middle of the tree:
<html>
<body>
Hello<br />World
</body>
</html>
lxml supports document-style XML through the element tail property:
In [43]: html = etree.Element("html")
In [44]: body = etree.SubElement(html, "body")
In [45]: body.text = "TEXT"
In [46]: etree.tostring(html)
Out[46]: b'<html><body>TEXT</body></html>'
In [47]: br = etree.SubElement(body, "br")
In [48]: etree.tostring(html)
Out[48]: b'<html><body>TEXT<br/></body></html>'
In [49]: br.tail = "TAIL"
In [50]: etree.tostring(html)
Out[50]: b'<html><body>TEXT<br/>TAIL</body></html>'
The Element text and tail properties are enough to represent any text content in an XML
document.
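To convince myself that text and tail really cover everything, here is a small sketch that rebuilds the text of a mixed-content element in document order using only those two properties. The collect_text helper and the sample markup are my own (hypothetical); the standard library's itertext() is used as a cross-check.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Return the text chunks under element, in document order."""
    parts = []
    if element.text:
        parts.append(element.text)  # text before the first child
    for child in element:
        parts.extend(collect_text(child))  # text inside the child
        if child.tail:
            parts.append(child.tail)  # text after the child's closing tag
    return parts


body = ET.fromstring("<body>Hello<br/>World<p>again<b>bold</b>!</p>end</body>")
chunks = collect_text(body)
print(chunks)

# itertext() walks the same text and tail values.
print("".join(chunks) == "".join(body.itertext()))
```

The element's own tail is left out on purpose, since it belongs to the parent's content rather than the element's.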
Removing the tail property when serializing an element
Using the br element from above, serialize the element with tostring with and without the tail
text:
In [51]: etree.tostring(br)
Out[51]: b'<br/>TAIL'
In [52]: etree.tostring(br, with_tail=False)
Out[52]: b'<br/>'
Using XPath with lxml to find text
In [55]: print(html.xpath("string()"))
TEXTTAIL
In [57]: print(html.xpath("//text()"))
['TEXT', 'TAIL']
Remember that a leading // selects matching nodes at every level of the whole document, starting from the document root. Use .// to search only beneath the current element.
The tutorial gives the example of wrapping the call to //text() in a function.
In [58]: build_text_list = etree.XPath("//text()")
In [59]: print(build_text_list(html))
['TEXT', 'TAIL']
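Why does this work? etree.XPath() compiles the expression into a callable object. Calling that object with an element evaluates the expression against it, so build_text_list(html) gives the same result as html.xpath("//text()").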
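A quick sketch to check that reading, and to see the difference between a leading // and .// when the expression is evaluated from a nested element. It rebuilds the same html tree as above; the variable names are my own.

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL</body></html>")
br = html[0][0]

# A compiled XPath object is called with the element to evaluate against.
build_text_list = etree.XPath("//text()")
same_result = build_text_list(html) == html.xpath("//text()")
print(same_result)

# A leading // starts at the document root, even when called on <br/>.
from_root = br.xpath("//text()")
print(from_root)

# .// only looks beneath <br/>, which has no text of its own
# (its tail belongs to the content of <body>).
beneath_br = br.xpath(".//text()")
print(beneath_br)
```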
Strings returned from XPath know about their origins
Strings returned from the XPath text() function know about their origins in a similar way to elements: they have a getparent() method. (I’m unsure about the exceptions
to that.)
In [64]: texts = build_text_list(html)
In [65]: print(texts[0])
TEXT
In [66]: parent = texts[0].getparent()
In [67]: print(parent.tag)
body
In [69]: type(texts[0])
Out[69]: lxml.etree._ElementUnicodeResult
Strings returned from the XPath string() and concat() functions do not know about their origin
(despite being of the same type as strings returned from the text() function). For them, getparent() returns None.
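The sketch below puts the two cases next to each other. It rebuilds the html tree from earlier; is_text and is_tail are attributes lxml sets on these result strings, and the variable names are my own.

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL</body></html>")

texts = html.xpath("//text()")

# 'TEXT' is the .text of <body>.
first_parent = texts[0].getparent().tag
print(first_parent, texts[0].is_text)

# 'TAIL' is the .tail of <br/>, so <br/> is reported as its parent.
second_parent = texts[1].getparent().tag
print(second_parent, texts[1].is_tail)

# string() builds a new string, so there is no element to point back to.
joined = html.xpath("string()")
print(joined, joined.getparent())
```

For a search pipeline this could be handy: a text chunk can be traced back to the element it came from.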
lxml tree iteration
The lxml etree Element class (the class is technically lxml.etree._Element) has a tree iterator
method (iter()):
In [12]: root = etree.Element("root")
In [13]: etree.SubElement(root, "child").text = "Child 1"
In [14]: etree.SubElement(root, "child").text = "Child 2"
In [15]: etree.SubElement(root, "another").text = "Child 3"
In [16]: prettyprint(root)
<root>
<child>Child 1</child>
<child>Child 2</child>
<another>Child 3</another>
</root>
In [19]: for element in root.iter():
    ...:     print(f"{element.tag} - {element.text}")
    ...:
root - None # note that the root element is returned as well
child - Child 1
child - Child 2
another - Child 3
The iter() method accepts tag name arguments that can be used to filter for tags by name:
In [21]: for element in root.iter("child"):
    ...:     print(element.tag)
    ...:
child
child
In [22]: for element in root.iter("child", "another"):
    ...:     print(element.tag)
    ...:
child
child
another
The iter() method yields all nodes by default:
In [23]: root.append(etree.Entity("#234"))
In [24]: root.append(etree.Comment("this is a comment"))
In [25]: prettyprint(root)
<root><child>Child 1</child><child>Child 2</child><another>Child 3</another>ê<!--this is a comment--></root>
In [26]: for element in root.iter():
    ...:     if isinstance(element.tag, str):
    ...:         print(f"{element.tag} - {element.text}")
    ...:     else:
    ...:         print(f"SPECIAL: {element} - {element.text}")
    ...:
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: ê - ê
SPECIAL: <!--this is a comment--> - this is a comment
To only return Element objects, pass the “Element factory” as the tag parameter:
In [27]: for element in root.iter(tag=etree.Element):
    ...:     print(f"{element.tag} - {element.text}")
    ...:
root - None
child - Child 1
child - Child 2
another - Child 3
It’s also possible to select for only Element nodes by passing a wildcard tag name ("*"):
In [32]: for element in root.iter("*"):
    ...:     print(element.tag)
    ...:
root
child
child
another
To only return Entity elements, pass the Entity factory:
In [29]: for element in root.iter(tag=etree.Entity):
    ...:     print(element.text)
    ...:
ê
lxml Elements have more iterators: iterchildren(), iterchildren(reversed=True),
itersiblings(), and others. See the lxml API documentation for
details.
lxml serialization
The tostring() function returns the serialized tree as bytes by default (pass encoding="unicode"
to get a str). The write() method writes to a file or file-like object
(e.g., to a URL via FTP, or HTTP PUT or POST request). Both accept keyword arguments. The
pretty_print (boolean) argument produces formatted output. Note that pretty_print appends a newline to
the end of the output. Pass end='' to print() to stop it from adding a second newline (end is a
keyword argument of Python's print function, see below).
In [33]: root = etree.XML('<root><a><b/></a></root>')
In [34]: etree.tostring(root)
Out[34]: b'<root><a><b/></a></root>'
In [35]: xml_string = etree.tostring(root, xml_declaration=True)
In [36]: print(xml_string.decode(), end='')
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>
More serialization examples:
In [37]: etree.tostring(root)
Out[37]: b'<root><a><b/></a></root>'
In [38]: etree.tostring(root, pretty_print=True)
Out[38]: b'<root>\n <a>\n <b/>\n </a>\n</root>\n'
In [39]: etree.tostring(root, pretty_print=True).decode()
Out[39]: '<root>\n <a>\n <b/>\n </a>\n</root>\n'
In [40]: print(etree.tostring(root, pretty_print=True).decode())
<root>
<a>
<b/>
</a>
</root>
In [41]: print(etree.tostring(root, pretty_print=True).decode(), end='')
<root>
<a>
<b/>
</a>
</root>
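A short sketch of the bytes-versus-str distinction and of write() with a file-like object. An in-memory BytesIO buffer stands in for a real file here; that substitution and the variable names are my own.

```python
from io import BytesIO

from lxml import etree

root = etree.XML("<root><a><b/></a></root>")

# Default: bytes. With encoding="unicode": a Python str.
as_bytes = etree.tostring(root)
as_text = etree.tostring(root, encoding="unicode")
print(type(as_bytes).__name__, type(as_text).__name__)

# write() is a method of ElementTree, so wrap the root element first.
buffer = BytesIO()
etree.ElementTree(root).write(buffer, xml_declaration=True, encoding="UTF-8")
written = buffer.getvalue()
print(written.decode())
```

Getting a str directly saves the decode() call used in the prettyprint helper, though an XML declaration cannot be requested together with encoding="unicode".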
Python’s print function accepts an end keyword argument
In [42]: print("foo") # appends a newline
foo
In [43]: print("foo", end="") # doesn't append a newline
foo
Adding indentation to serialization
Add whitespace indentation to serialization with the indent() method:
In [44]: etree.indent(root)
In [45]: print(etree.tostring(root).decode())
<root>
<a>
<b/>
</a>
</root>
In [46]: etree.indent(root, space=" ")
In [47]: print(etree.tostring(root).decode())
<root>
<a>
<b/>
</a>
</root>
indent() works by modifying the tree itself: the whitespace is stored in the elements’ text (and tail):
In [48]: root.text
Out[48]: '\n '
In [49]: root[0].text
Out[49]: '\n '
In [50]: root[0].tag
Out[50]: 'a'
In [51]: root.tag
Out[51]: 'root'
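To see exactly where the whitespace goes, this sketch prints the text and tail values with repr() after a default indent() call (two spaces per level). It assumes lxml 4.5 or later, where etree.indent() is available.

```python
from lxml import etree

root = etree.XML("<root><a><b/></a></root>")
etree.indent(root)  # default: two spaces per level

a = root[0]
b = a[0]

# Whitespace before a first child is stored in the parent's text.
print(repr(root.text))
print(repr(a.text))

# Whitespace after an element is stored in that element's tail.
print(repr(b.tail))
print(repr(a.tail))

serialized = etree.tostring(root)
print(serialized)
```

Since the tree is changed in place, text extracted after indent() will include that whitespace.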
Serialization to HTML and text
Use the method keyword argument to set the serialization method. Note that elements are serialized
to xml by default. (Use the encoding keyword argument to set the encoding.)
In [52]: root = etree.XML(
...: '<html><head/><body><p>This is a<br/>paragraph.</p></body></html>')
In [53]: etree.tostring(root)
Out[53]: b'<html><head/><body><p>This is a<br/>paragraph.</p></body></html>'
In [54]: etree.tostring(root, method='xml')
Out[54]: b'<html><head/><body><p>This is a<br/>paragraph.</p></body></html>'
In [55]: etree.tostring(root, method='html')
Out[55]: b'<html><head></head><body><p>This is a<br>paragraph.</p></body></html>'
In [56]: prettyprint(root, method="html")
<html>
<head></head>
<body><p>This is a<br>paragraph.</p></body>
</html>
In [57]: etree.tostring(root, method='text')
Out[57]: b'This is aparagraph.'
lxml ElementTree class
An ElementTree is a document wrapper around a tree with a root node. It provides a few methods (see the documentation for more):
In [59]: root = etree.XML('''\
...: <?xml version="1.0"?>
...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY foo "bar"> ]>
...: <root>
...: <a>&foo;</a>
...: </root>
...: ''')
In [60]: tree = etree.ElementTree(root)
In [61]: print(tree.docinfo.xml_version)
1.0
In [62]: print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">
When an ElementTree is serialized it returns a complete document, not just the root element:
In [63]: prettyprint(tree)
<!DOCTYPE root SYSTEM "test" [
<!ENTITY foo "bar">
]>
<root>
<a>bar</a>
</root>
An ElementTree is returned from the parse() method. See below.
lxml parsing strings and files
Strings, files, URLs (http/ftp) and file-like objects can be parsed. The main methods are
fromstring() and parse():
In [64]: xml_data = "<root>data</root>"
In [65]: root = etree.fromstring(xml_data)
In [66]: print(root.tag)
root
In [67]: etree.tostring(root)
Out[67]: b'<root>data</root>'
The XML() function behaves like fromstring(). It’s commonly used to write XML literals directly into
Python source code:
In [68]: root = etree.XML(xml_data)
The HTML() function does the same for HTML literals. Note that the parser adds the missing html and body elements:
In [70]: root = etree.HTML("<p>this is a test</p>")
In [71]: etree.tostring(root)
Out[71]: b'<html><body><p>this is a test</p></body></html>'
The parse() function parses files and file-like objects. parse() returns an ElementTree object,
not an Element object.
In [76]: tree = etree.parse("/home/scossar/projects/python/hello_lxml/country_data.xml")
In [77]: etree.tostring(tree)
Out[77]: b'<data>\n <country name="Liechtenstein">\n <rank updated="yes">2</rank>\n <year>2008</year>\n <gdppc>141100</gdppc>\n <neighbor name="Austria" direction="E"/>\n <neighbor name="Switzerland" direction="W"/>\n </country>\n <country name="Singapore">\n <rank updated="yes">5</rank>\n <year>2011</year>\n <gdppc>59900</gdppc>\n <neighbor name="Malaysia" direction="N"/>\n </country>\n <country name="Panama">\n <rank updated="yes">69</rank>\n <year>2011</year>\n <gdppc>13600</gdppc>\n <neighbor name="Costa Rica" direction="W"/>\n <neighbor name="Colombia" direction="E"/>\n </country>\n</data>'
In [78]: prettyprint(tree)
<data>
<country name="Liechtenstein">
<rank updated="yes">2</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank updated="yes">5</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank updated="yes">69</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>
In [79]: type(tree)
Out[79]: lxml.etree._ElementTree
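For the file-like case, a minimal sketch using an in-memory BytesIO object in place of a file on disk. The one-country document is a cut-down, made-up version of country_data.xml.

```python
from io import BytesIO

from lxml import etree

source = BytesIO(
    b"<data><country name='Panama'><rank>68</rank></country></data>"
)

# parse() returns an ElementTree; getroot() gives the root Element.
tree = etree.parse(source)
root = tree.getroot()

print(type(tree).__name__)
print(root.tag)

# ElementTree objects support the find methods too, relative to the root.
rank = tree.findtext("country/rank")
print(rank)
```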
Namespaces and lxml
Summarized from Claude: When working with lxml you’ll sometimes need to account for namespaces in
your XPath queries. E.g., if an element is in a namespace, you can’t just search for //div. You
need to specify the namespace (with a prefix mapping) or match on the local name with a wildcard,
like //*[local-name()='div'].
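A small sketch of both options against a made-up XHTML fragment. The prefix "x" and the variable names are arbitrary choices for the example; only the namespace URI has to match the document.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

doc = etree.XML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>hello</div></body></html>"
)

# An unprefixed name only matches elements in no namespace.
plain = doc.xpath("//div")
print(plain)

# Option 1: map a prefix of your choosing to the namespace URI.
prefixed = doc.xpath("//x:div", namespaces={"x": XHTML_NS})
print(prefixed[0].text)

# Option 2: ignore the namespace and compare only the local name.
by_local_name = doc.xpath("//*[local-name()='div']")
print(by_local_name[0].text)
```

The local-name() form is convenient for exploration, but it will also match div elements from any other namespace.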
From the lxml docs: The ElementTree API avoids namespace prefixes whenever possible. It uses the real namespace (URI) instead:
In [80]: xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")
In [81]: body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")
In [82]: body.text = "this is a test"
In [83]: prettyprint(xhtml)
<html:html xmlns:html="http://www.w3.org/1999/xhtml">
<html:body>this is a test</html:body>
</html:html>
Typing like this is obviously error prone. The workaround is to store a namespace URI in a global variable:
In [88]: XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
In [89]: XHTML = "{%s}" % XHTML_NAMESPACE
In [90]: XHTML
Out[90]: '{http://www.w3.org/1999/xhtml}'
In [91]: NSMAP = {None : XHTML_NAMESPACE}  # None maps the default (unprefixed) namespace
In [93]: xhtml = etree.Element(XHTML + "html", nsmap=NSMAP)
In [94]: xhtml
Out[94]: <Element {http://www.w3.org/1999/xhtml}html at 0x7f49a058c540>
In [96]: body = etree.SubElement(xhtml, XHTML + "body")
In [97]: body.text = "this is a test"
In [98]: prettyprint(xhtml)
<html xmlns="http://www.w3.org/1999/xhtml">
<body>this is a test</body>
</html>
Use the QName helper class to build or split qualified tag names:
In [99]: tag = etree.QName("http://www.w3.org/1999/xhtml", "html")
In [100]: tag.localname
Out[100]: 'html'
In [101]: tag.namespace
Out[101]: 'http://www.w3.org/1999/xhtml'
Generating HTML (and XML) with lxml E-factory
The E factory (see What is a factory) provides syntax for generating HTML and XML.
from lxml import etree
from lxml.builder import E # pyright: ignore
def prettyprint(element, **kwargs):
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end="")
def CLASS(*args):
    return {"class": " ".join(args)}
html = page = E.html(
    E.head(E.title("This is a test")),
    E.body(
        E.h1("Testing", CLASS("title")),
        E.p("This is a test with some ", E.bold("bold"), " text."),
        E.p("This is another paragraph", CLASS("foo")),
    ),
)
prettyprint(page)
# <html>
# <head>
# <title>This is a test</title>
# </head>
# <body>
# <h1 class="title">Testing</h1>
# <p>This is a test with some <bold>bold</bold> text.</p>
# <p class="foo">This is another paragraph</p>
# </body>
# </html>
E factory Element creation is based on attribute access. This makes it possible to create a
vocabulary for an XML language. This is what’s used by lxml.html.builder. An example will make it
clearer:
from lxml import etree
from lxml.builder import ElementMaker # pyright: ignore
def prettyprint(element, **kwargs):
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end="")
E = ElementMaker(
    namespace="http://zalgorithm/default/namespace",
    nsmap={"p": "http://zalgorithm/default/namespace"},
)
DOC = E.doc
TITLE = E.title
SECTION = E.section
PAR = E.par
test_doc = DOC(
    TITLE("This is a test"),
    SECTION(
        TITLE("Test section"),
        PAR("Once upon a time in a land far far away..."),
        PAR("There lived a..."),
    ),
)
prettyprint(test_doc)
# <p:doc xmlns:p="http://zalgorithm/default/namespace">
# <p:title>This is a test</p:title>
# <p:section>
# <p:title>Test section</p:title>
# <p:par>Once upon a time in a land far far away...</p:par>
# <p:par>There lived a...</p:par>
# </p:section>
# </p:doc>
The lxml.html.builder module uses the above approach.
lxml ElementPath
The (Python standard library) ElementTree library has a simple XPath-like language called ElementPath. It doesn’t support advanced features like value comparisons and functions.
lxml has a full XPath implementation (see: lxml XPath). It
also has four ElementPath methods, available on both Elements and ElementTrees (see the
getelementpath() example below):
- iterfind(): iterates over all Elements that match a path expression
- findall(): returns a list of matching Elements
- find(): returns the first match
- findtext(): returns the .text content of the first match
In [102]: root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
# find a child of an Element:
In [103]: print(root.find("b"))
None
In [104]: print(root.find("a").tag)
a
# find an Element anywhere in the tree:
In [105]: print(root.find(".//b").tag)
b
In [106]: [b.tag for b in root.iterfind(".//b")]
Out[106]: ['b', 'b']
Find elements with a given attribute:
In [107]: print(root.findall(".//a[@x]")[0].tag)
a
In [108]: print(root.findall(".//a[@y]")) # there are no a tags with a y attribute
[]
Generate an ElementPath expression for an Element (from the ElementTree):
In [109]: tree = etree.ElementTree(root)
In [110]: a = root[0]
In [111]: a
Out[111]: <Element a at 0x7f49a0d5d8c0>
In [112]: print(tree.getelementpath(a[0]))
a/b[1]
In [113]: print(tree.getelementpath(a[1]))
a/c
In [114]: print(tree.getelementpath(a[2]))
a/b[2]
In [115]: tree.find(tree.getelementpath(a[2])) == a[2]
Out[115]: True
As long as the ElementTree isn’t modified, the path expression returned from getelementpath()
represents an identifier for a given Element. It can be used in a call to find().
The iter() method is a special case. It only finds specific tags in the tree by their name, not by
their path. The following are all equivalent. Note the use of next in the calls to iterfind()
and iter() below:
In [116]: print(root.find(".//b").tag)
b
In [117]: print(next(root.iterfind(".//b")).tag)
b
In [118]: print(next(root.iter("b")).tag)
b
References
Behnel, Stefan. “The lxml.etree Tutorial”. https://lxml.de/tutorial.html.
Wikipedia contributors. “XML.” Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=XML&oldid=1327771438. Accessed December 22, 2025.
W3 Schools. “XML What Is”. W3 Schools. https://www.w3schools.com/xml/xml_whatis.asp.
Mozilla Developers. “Namespaces crash course”. https://developer.mozilla.org/en-US/docs/Web/SVG/Guides/Namespaces_crash_course.
1. Wikipedia contributors, “XML,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=XML&oldid=1327771438 (accessed December 22, 2025).
2. “xml.etree.ElementTree — The ElementTree XML API”, https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree.