Zalgorithm

Following the lxml Etree tutorial

I’m interested in the general problem of parsing documents so they can be used in semantic search systems. The difficulty is that documents exist in many forms. My own documents are a folder of local markdown files and a Hugo blog, where markdown files are rendered as HTML files. Other document formats include PDFs, DOCX files, and ePub files.

HTML, DOCX, and ePub files all have some relationship to XML.

What is XML?

“Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.”1

XML is self-descriptive

XML is intended to be self-descriptive. An example:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

The XML above says what the thing is (a note), who it’s sent from, who it’s sent to, what its heading is, and what its body is. It doesn’t do anything though. To do anything it would need to be consumed by software.

The difference between XML and HTML

XML is designed to carry data. HTML is designed to display data — HTML does something. HTML tags are predefined. XML tags are not predefined.

What are XML namespaces?

(Paraphrased response from Claude): XML namespaces solve the problem of name collisions when combining vocabularies from different sources. Imagine you’re creating a document that mixes two XML vocabularies, one for describing books and another for describing furniture. Both vocabularies have a <table> element. Namespaces provide a way to distinguish the two uses of <table> elements.

Namespaces use URIs (usually URLs) as unique identifiers. The URIs don’t have to point to anything real — they are just identifiers.

<root xmlns:book="http://example.com/book"
      xmlns:furniture="http://example.com/furniture">
  <book:table>
    <book:row>Author data...</book:row>
  </book:table>
  <furniture:table>
    <furniture:leg>Oak wood</furniture:leg>
  </furniture:table>
</root>
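As a rough sketch of how a parser sees the example above, the snippet below feeds it to the standard library's xml.etree.ElementTree module (covered in a later section). The variable names and the prefix map `ns` are my own choices for illustration:

```python
import xml.etree.ElementTree as ET

doc = """\
<root xmlns:book="http://example.com/book"
      xmlns:furniture="http://example.com/furniture">
  <book:table>
    <book:row>Author data...</book:row>
  </book:table>
  <furniture:table>
    <furniture:leg>Oak wood</furniture:leg>
  </furniture:table>
</root>"""

root = ET.fromstring(doc)

# Prefixes are expanded to {namespace-uri}localname, so the two tables differ
tags = [child.tag for child in root]
print(tags)

# A prefix map lets a search use short prefixes instead of full URIs
ns = {"book": "http://example.com/book"}
rows = root.findall("book:table/book:row", ns)
print(rows[0].text)
```

The prefixes in `ns` only need to map to the right URIs. They don't have to match the prefixes used in the document.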

Namespaces in HTML

Namespaces aren’t used in HTML(5) in practice because it has a fixed vocabulary. XHTML (the XML-serialized version of HTML) does use namespaces, typically http://www.w3.org/1999/xhtml. You might also encounter namespaces in HTML when other XML vocabularies, such as SVG or MathML, are embedded in a page.

For more details about namespaces, see the Mozilla developer docs Namespaces crash course. (Interestingly, that’s in their SVG guides section.)

The Python xml.etree.ElementTree module

The xml.etree.ElementTree module (part of the Python standard library) implements an API for parsing and creating XML data.2

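XML is a hierarchical data format, so it’s naturally represented with a tree. The ElementTree module has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in that tree.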

The ElementTree module classes

The Python docs ElementTree tutorial is worth going through. It uses this example XML (country_data.xml):

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

import xml.etree.ElementTree as ET
tree = ET.parse("country_data.xml")
root = tree.getroot()

Using IPython for convenience:

In [12]: print(type(tree))
<class 'xml.etree.ElementTree.ElementTree'>  # tree is an ElementTree

In [13]: print(type(root))
<class 'xml.etree.ElementTree.Element'>  # root is an Element

Basic Element class attributes (tag and attrib)

The root Element has a tag:

In [14]: root.tag
Out[14]: 'data'

Each element has a dictionary of attributes (empty for the "data" element in this case):

In [15]: root.attrib
Out[15]: {}

Child elements

Elements can have children:

In [18]: for child in root:
    ...:     print(child.tag, child.attrib)
    ...:
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}

The children of children can have children. (It’s a tree):

In [19]: for child in root[0]:
    ...:     print(child.tag, child.attrib, child.text)
    ...:
rank {} 1
year {} 2008
gdppc {} 141100
neighbor {'name': 'Austria', 'direction': 'E'} None
neighbor {'name': 'Switzerland', 'direction': 'W'} None

Accessing child elements by index

Elements can be accessed by index:

In [20]: root[1][1].text
Out[20]: '2011'

Finding an element’s children

Finding elements with Element.iter():

In [22]: for neighbor in root.iter('neighbor'):
    ...:     print(neighbor.attrib)
    ...:
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}

Finding elements with Element.findall(), Element.find(), and getting attributes with Element.get():

In [23]: for country in root.findall('country'):
    ...:     rank = country.find('rank').text
    ...:     name = country.get('name')
    ...:     print(name, rank)
    ...:
Liechtenstein 1
Singapore 4
Panama 68

Modifying elements and updating the ElementTree

Modifying elements with Element.set() and updating an element tree with ElementTree.write():

In [24]: for rank in root.iter('rank'):
    ...:     new_rank = int(rank.text) + 1
    ...:     rank.text = str(new_rank)
    ...:     rank.set('updated', 'yes')
    ...:

In [25]: tree.write('output.xml')  # write to a new file

In [26]: tree.write('country_data.xml')  # or overwrite the original file

Removing elements from the ElementTree

Removing elements with Element.remove():

In [27]: for country in root.findall('country'):  # note the use of root.findall() as opposed to root.iter('country')
    ...:     rank = int(country.find('rank').text)
    ...:     if rank > 50:
    ...:         root.remove(country)
    ...:

In [28]: tree.write('output.xml')

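The comment refers to modifying a tree while iterating over it: findall() collects the matching elements into a list first, so removing elements inside the loop is safe, whereas removing elements while iterating lazily over the tree can lead to problems. This issue is sure to get me at some point.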
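A minimal sketch of the safe pattern, using a small inline document instead of country_data.xml (the country names A, B, and C are made up for the example):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<data>"
    "<country name='A'><rank>60</rank></country>"
    "<country name='B'><rank>70</rank></country>"
    "<country name='C'><rank>1</rank></country>"
    "</data>"
)

# findall() returns a list, so removing from root does not disturb the loop
for country in root.findall("country"):
    if int(country.find("rank").text) > 50:
        root.remove(country)

remaining = [country.get("name") for country in root]
print(remaining)  # ['C']
```

Iterating over `list(root)` would work the same way, since it also takes a snapshot of the children before anything is removed.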

Building XML documents

Note the use of the SubElement() function for creating sub elements of a given element:

In [30]: a = ET.Element('a')

In [31]: b = ET.SubElement(a, 'b')

In [32]: c = ET.SubElement(a, 'c')

In [33]: d = ET.SubElement(c, 'd')

In [34]: ET.dump(a)
<a><b /><c><d /></c></a>
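The same functions can also set attributes and text while building a tree. A small sketch that rebuilds part of the note example from earlier in this post (the `lang` attribute is something I added for illustration):

```python
import xml.etree.ElementTree as ET

note = ET.Element("note", {"lang": "en"})  # attributes can be passed as a dict
to = ET.SubElement(note, "to")
to.text = "Tove"
body = ET.SubElement(note, "body")
body.text = "Don't forget me this weekend!"

# encoding="unicode" returns a str instead of bytes
xml_text = ET.tostring(note, encoding="unicode")
print(xml_text)
```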

ElementTree XPath support

The ElementTree module provides limited support for using XPath expressions for locating elements in a tree. See the docs for details about the supported XPath syntax.

In [38]: root.findall(".")
Out[38]: [<Element 'data' at 0x7f096413b8d0>]

In [39]: root.findall("./country/neighbor")
Out[39]:
[<Element 'neighbor' at 0x7f0964139670>,
 <Element 'neighbor' at 0x7f0964139760>,
 <Element 'neighbor' at 0x7f09640da980>,
 <Element 'neighbor' at 0x7f096408b740>,
 <Element 'neighbor' at 0x7f096408a250>]

In [40]: root.findall(".//year/..[@name='Singapore']")
Out[40]: [<Element 'country' at 0x7f0964139ad0>]

In [41]: root.findall(".//*[@name='Singapore']")
Out[41]: [<Element 'country' at 0x7f0964139ad0>]
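Two other predicate forms from the supported syntax are worth a quick sketch: filtering on an attribute value and filtering on the text of a child element. The data below is a trimmed copy of country_data.xml so the example runs on its own:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""\
<data>
  <country name="Liechtenstein">
    <rank>1</rank>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
    <rank>4</rank>
    <neighbor name="Malaysia" direction="N"/>
  </country>
</data>""")

# [@attrib='value'] keeps only elements whose attribute has that value
west = [n.get("name") for n in root.findall("./country/neighbor[@direction='W']")]

# [tag='text'] keeps elements that have a child <tag> with that exact text
ranked_fourth = [c.get("name") for c in root.findall("./country[rank='4']")]

# findtext() returns the text of the first matching element
first_rank = root.findtext("./country/rank")

print(west, ranked_fourth, first_rank)
```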

The lxml library

GitHub: https://github.com/lxml/lxml

lxml etree tutorial: https://lxml.de/tutorial.html

Note that what follows is just me working through the (very good) tutorial. Follow the original tutorial instead of what’s written below.

Installing lxml in a virtual env

pip install lxml

In [1]: from lxml import etree

lxml stubs

To get the language server I’m using on Neovim (ruff) to not give the warning "etree" is unknown import symbol, I needed to install lxml-stubs with

pip install lxml-stubs

The issue is discussed here: “etree” is unknown import symbol.

The lxml Element class

An Element is the main container object for the ElementTree (etree) API. Create elements with etree.Element("element_name"). Append elements to existing elements with the append() method, or use the etree.SubElement(parent, tag) function:

In [2]: root = etree.Element("root")

In [3]: print(root.tag)
root

In [4]: root.append(etree.Element("foo"))

In [5]: etree.tostring(root)
Out[5]: b'<root><foo/></root>'

In [6]: bar = etree.SubElement(root, "bar")

In [7]: etree.tostring(root)
Out[7]: b'<root><foo/><bar/></root>'

A helper function to print XML:

In [8]: def prettyprint(element, **kwargs):
   ...:     xml = etree.tostring(element, pretty_print=True, **kwargs)
   ...:     print(xml.decode(), end='')
   ...:

In [9]: prettyprint(root)
<root>
  <foo/>
  <bar/>
</root>

lxml elements are Python lists

Technically, lxml elements implement a lot of the behavior of Python lists:

In [13]: child = root[0]

In [14]: child.tag
Out[14]: 'foo'

In [15]: root.index(root[0])
Out[15]: 0

In [16]: root.index(root[1])
Out[16]: 1

In [17]: root.insert(0, etree.Element("foobar"))

In [18]: prettyprint(root)
<root>
  <foobar/>
  <foo/>
  <bar/>
</root>

Moving elements

Elements can be moved to a new position in the list using indices:

In [19]: root[0] = root[-1]

In [20]: prettyprint(root)
<root>
  <bar/>
  <foo/>
</root>

Note how this behavior differs from Python lists. In a Python list, the assignment copies a reference, so the same object ends up at both positions; in lxml the element is moved. This behavior also (I think) differs from how the Python ElementTree module deals with XML trees. The reason is that in lxml each element has exactly one parent.

In [1]: l = [0, 1, 2, 3]

In [2]: l[0] = l[-1]

In [3]: l
Out[3]: [3, 1, 2, 3]
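If a copy is what you want in lxml, one option is to copy the element explicitly with copy.deepcopy() before assigning it. A minimal sketch, starting from the three-element tree used above:

```python
from copy import deepcopy

from lxml import etree

root = etree.XML("<root><foobar/><foo/><bar/></root>")

# Assigning a deep copy replaces root[0] and leaves the original <bar/> in place
root[0] = deepcopy(root[-1])

tags = [child.tag for child in root]
print(tags)  # ['bar', 'foo', 'bar']
```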

Each element has a parent
In [24]: root[1].getparent().tag
Out[24]: 'root'

Elements can have siblings

Element siblings (or neighbors) are accessed with getprevious() and getnext(). Note that the Python is operator checks if two variables or expressions point to the same object in memory:

In [25]: root[0] is root[1].getprevious()
Out[25]: True

In [26]: root[1] is root[0].getnext()
Out[26]: True

Element attributes are dicts

Element attributes can be created with the Element factory function:

In [30]: root = etree.Element("root", foo="bar")

In [31]: etree.tostring(root)
Out[31]: b'<root foo="bar"/>'

In [32]: root.get("foo")
Out[32]: 'bar'

In [33]: root.get("doesntexist")

In [34]: root.set("hello", "hi")

In [35]: etree.tostring(root)
Out[35]: b'<root foo="bar" hello="hi"/>'

In [36]: sorted(root.keys())
Out[36]: ['foo', 'hello']

Elements can contain text

In [37]: root = etree.Element("root")

In [38]: root.text = "FOO BAR"

In [39]: etree.tostring(root)
Out[39]: b'<root>FOO BAR</root>'

Document-style or mixed-text content XML

In data-centric XML documents, text is usually only found inside leaf elements at the bottom of the tree hierarchy.

In document-style XML, e.g. (X)HTML, text can also appear between different elements in the middle of the tree:

<html>
  <body>
    Hello<br />World
  </body>
</html>

lxml supports document-style XML through the element tail property:

In [43]: html = etree.Element("html")

In [44]: body = etree.SubElement(html, "body")

In [45]: body.text = "TEXT"

In [46]: etree.tostring(html)
Out[46]: b'<html><body>TEXT</body></html>'

In [47]: br = etree.SubElement(body, "br")

In [48]: etree.tostring(html)
Out[48]: b'<html><body>TEXT<br/></body></html>'

In [49]: br.tail = "TAIL"

In [50]: etree.tostring(html)
Out[50]: b'<html><body>TEXT<br/>TAIL</body></html>'

The Element text and tail properties are enough to represent any text content in an XML document.
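To convince myself of that, here is a rough sketch of a helper that walks a tree and collects every piece of text using only text and tail. The function name `collect_text` is hypothetical; lxml's built-in itertext() method does the same job:

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL<p>more</p>end</body></html>")


def collect_text(element):
    """Return the text chunks under element, in document order."""
    chunks = []
    if element.text:
        chunks.append(element.text)
    for child in element:
        # Text inside the child comes first, then the text that follows it
        chunks.extend(collect_text(child))
        if child.tail:
            chunks.append(child.tail)
    return chunks


chunks = collect_text(html)
print(chunks)  # ['TEXT', 'TAIL', 'more', 'end']

# The built-in iterator yields the same chunks
print(list(html.itertext()))
```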

Removing the tail property when serializing an element

Using the br element from above, serialize the element with tostring with and without the tail text:

In [51]: etree.tostring(br)
Out[51]: b'<br/>TAIL'

In [52]: etree.tostring(br, with_tail=False)
Out[52]: b'<br/>'

Using XPath with lxml to find text

In [55]: print(html.xpath("string()"))
TEXTTAIL

In [57]: print(html.xpath("//text()"))
['TEXT', 'TAIL']

Remember that a leading // selects matching nodes at every level of the document, starting from the document root rather than from the current element.

The tutorial gives the example of wrapping the call to //text() in a function.

In [58]: build_text_list = etree.XPath("//text()")

In [59]: print(build_text_list(html))
['TEXT', 'TAIL']

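Why does this work? etree.XPath compiles the expression into a callable object. Calling it with an element evaluates the expression against that element, much like calling html.xpath("//text()").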

Strings returned from XPath know about their origins

Strings returned from the XPath text() function know about their origins, in a similar way to elements: they have a getparent() method. (I’m unsure about the exceptions to that.)

In [64]: texts = build_text_list(html)

In [65]: print(texts[0])
TEXT

In [66]: parent = texts[0].getparent()

In [67]: print(parent.tag)
body

In [69]: type(texts[0])
Out[69]: lxml.etree._ElementUnicodeResult

Strings returned from the XPath string() and concat() functions do not know about their origin (despite, it seems, being of the same type as strings returned from the text() function). Calling getparent() on them returns None.
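A short sketch of what these "smart" strings expose, rebuilding the html element from above. Besides getparent(), the results carry is_text and is_tail flags that say which property the string came from:

```python
from lxml import etree

html = etree.XML("<html><body>TEXT<br/>TAIL</body></html>")

first, second = html.xpath("//text()")

# "TEXT" is the text of <body>; "TAIL" is the tail of <br/>
first_origin = (first.getparent().tag, first.is_text, first.is_tail)
second_origin = (second.getparent().tag, second.is_text, second.is_tail)
print(first_origin)   # ('body', True, False)
print(second_origin)  # ('br', False, True)

# A string() result is assembled from several nodes, so it has no single parent
whole = html.xpath("string()")
print(whole, whole.getparent())  # TEXTTAIL None
```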

lxml tree iteration

The lxml etree Element class (the class is technically lxml.etree._Element) has a tree iterator method (iter()):

In [12]: root = etree.Element("root")

In [13]: etree.SubElement(root, "child").text = "Child 1"

In [14]: etree.SubElement(root, "child").text = "Child 2"

In [15]: etree.SubElement(root, "another").text = "Child 3"

In [16]: prettyprint(root)
<root>
  <child>Child 1</child>
  <child>Child 2</child>
  <another>Child 3</another>
</root>

In [19]: for element in root.iter():
    ...:     print(f"{element.tag} - {element.text}")
    ...:
root - None  # note that the root element is returned as well
child - Child 1
child - Child 2
another - Child 3

The iter() method accepts tag name arguments that can be used to filter for tags by name:

In [21]: for element in root.iter("child"):
    ...:     print(element.tag)
    ...:
child
child

In [22]: for element in root.iter("child", "another"):
    ...:     print(element.tag)
    ...:
child
child
another

The iter() method yields all nodes by default:

In [23]: root.append(etree.Entity("#234"))

In [24]: root.append(etree.Comment("this is a comment"))

In [25]: prettyprint(root)
<root><child>Child 1</child><child>Child 2</child><another>Child 3</another>&#234;<!--this is a comment--></root>

In [26]: for element in root.iter():
    ...:     if isinstance(element.tag, str):
    ...:         print(f"{element.tag} - {element.text}")
    ...:     else:
    ...:         print(f"SPECIAL: {element} - {element.text}")
    ...:
root - None
child - Child 1
child - Child 2
another - Child 3
SPECIAL: &#234; - &#234;
SPECIAL: <!--this is a comment--> - this is a comment

To only return Element objects, pass the “Element factory” as the tag parameter:

In [27]: for element in root.iter(tag=etree.Element):
    ...:     print(f"{element.tag} - {element.text}")
    ...:
root - None
child - Child 1
child - Child 2
another - Child 3

It’s also possible to select for only Element nodes by passing a wildcard tag name ("*"):

In [32]: for element in root.iter("*"):
    ...:     print(element.tag)
    ...:
root
child
child
another

To only return Entity elements, pass the Entity factory:

In [29]: for element in root.iter(tag=etree.Entity):
    ...:     print(element.text)
    ...:
&#234;

lxml Elements have more iterators: iterchildren(), iterchildren(reversed=True), itersiblings(), and others. See the lxml API documentation for details.
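A quick sketch of a few of these iterators on a small made-up tree. iterancestors() isn't mentioned above, but it belongs to the same family of methods:

```python
from lxml import etree

root = etree.XML("<root><a><a1/></a><b/><c/></root>")

# Direct children only, last to first
children_reversed = [el.tag for el in root.iterchildren(reversed=True)]

# Siblings that follow <a> in document order
following = [el.tag for el in root[0].itersiblings()]

# Ancestors of <a1>, starting with its parent and walking up to the root
ancestors = [el.tag for el in root[0][0].iterancestors()]

print(children_reversed)  # ['c', 'b', 'a']
print(following)          # ['b', 'c']
print(ancestors)          # ['a', 'root']
```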

lxml serialization

The tostring() function returns the serialized XML (as bytes by default). The write() method writes to a file or file-like object (e.g., to a URL via FTP, or an HTTP PUT or POST request). Both accept keyword arguments. The pretty_print (boolean) argument produces formatted output. Note that pretty_print appends a newline to the end of the output, so pass end='' to print() to avoid printing a second newline. (end is a keyword argument of the Python print function, see below.)

In [33]: root = etree.XML('<root><a><b/></a></root>')

In [34]: etree.tostring(root)
Out[34]: b'<root><a><b/></a></root>'

In [35]: xml_string = etree.tostring(root, xml_declaration=True)

In [36]: print(xml_string.decode(), end='')
<?xml version='1.0' encoding='ASCII'?>
<root><a><b/></a></root>

More serialization examples:

In [37]: etree.tostring(root)
Out[37]: b'<root><a><b/></a></root>'

In [38]: etree.tostring(root, pretty_print=True)
Out[38]: b'<root>\n  <a>\n    <b/>\n  </a>\n</root>\n'

In [39]: etree.tostring(root, pretty_print=True).decode()
Out[39]: '<root>\n  <a>\n    <b/>\n  </a>\n</root>\n'

In [40]: print(etree.tostring(root, pretty_print=True).decode())
<root>
  <a>
    <b/>
  </a>
</root>


In [41]: print(etree.tostring(root, pretty_print=True).decode(), end='')
<root>
  <a>
    <b/>
  </a>
</root>
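The calls to decode() above are needed because tostring() returns bytes by default. As a small sketch of an alternative, passing encoding="unicode" asks lxml for a Python str directly:

```python
from lxml import etree

root = etree.XML("<root><a><b/></a></root>")

as_bytes = etree.tostring(root)                     # bytes (the default)
as_text = etree.tostring(root, encoding="unicode")  # str, no decode() needed
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)  # bytes str
print(pretty, end="")
```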

Python’s print function accepts an end keyword argument

In [42]: print("foo")  # appends a newline
foo

In [43]: print("foo", end="")  # doesn't append a newline
foo

Adding indentation to serialization

Add whitespace indentation to a tree before serializing it with the indent() function:

In [44]: etree.indent(root)

In [45]: print(etree.tostring(root).decode())
<root>
  <a>
    <b/>
  </a>
</root>

In [46]: etree.indent(root, space="    ")

In [47]: print(etree.tostring(root).decode())
<root>
    <a>
        <b/>
    </a>
</root>

The indentation whitespace is stored in the elements’ text (and tail) properties:

In [48]: root.text
Out[48]: '\n    '

In [49]: root[0].text
Out[49]: '\n        '

In [50]: root[0].tag
Out[50]: 'a'

In [51]: root.tag
Out[51]: 'root'
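A sketch that makes the stored whitespace visible, using the default two-space indentation. The whitespace before an element's first child lands in its text, and the whitespace after an element lands in that element's tail:

```python
from lxml import etree

root = etree.XML("<root><a><b/></a></root>")
etree.indent(root, space="  ")

a = root[0]
b = a[0]

print(repr(root.text))  # '\n  '    newline + indent before <a>
print(repr(a.text))     # '\n    '  newline + indent before <b/>
print(repr(b.tail))     # '\n  '    dedent before </a>
print(repr(a.tail))     # '\n'      dedent before </root>
```

This also means indent() modifies the tree in place, so the whitespace stays there for any later processing.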

Serialization to HTML and text

Use the method keyword argument to set the serialization method. Note that elements are serialized to xml by default. (Use the encoding keyword argument to set the encoding.)

In [52]: root = etree.XML(
    ...: '<html><head/><body><p>This is a<br/>paragraph.</p></body></html>')

In [53]: etree.tostring(root)
Out[53]: b'<html><head/><body><p>This is a<br/>paragraph.</p></body></html>'

In [54]: etree.tostring(root, method='xml')
Out[54]: b'<html><head/><body><p>This is a<br/>paragraph.</p></body></html>'

In [55]: etree.tostring(root, method='html')
Out[55]: b'<html><head></head><body><p>This is a<br>paragraph.</p></body></html>'

In [56]: prettyprint(root, method="html")
<html>
<head></head>
<body><p>This is a<br>paragraph.</p></body>
</html>

In [57]: etree.tostring(root, method='text')
Out[57]: b'This is aparagraph.'

lxml ElementTree class

An ElementTree is a document wrapper around a tree with a root node. It provides a few methods (see the documentation for more):

In [59]: root = etree.XML('''\
    ...: <?xml version="1.0"?>
    ...: <!DOCTYPE root SYSTEM "test" [ <!ENTITY foo "bar"> ]>
    ...: <root>
    ...: <a>&foo;</a>
    ...: </root>
    ...: ''')

In [60]: tree = etree.ElementTree(root)

In [61]: print(tree.docinfo.xml_version)
1.0

In [62]: print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">

When an ElementTree is serialized it returns a complete document, not just the root element:

In [63]: prettyprint(tree)
<!DOCTYPE root SYSTEM "test" [
<!ENTITY foo "bar">
]>
<root>
<a>bar</a>
</root>

An ElementTree is returned from the parse() method. See below.

lxml parsing strings and files

Strings, files, URLs (http/ftp) and file-like objects can be parsed. The main methods are fromstring() and parse():

In [64]: xml_data = "<root>data</root>"

In [65]: root = etree.fromstring(xml_data)

In [66]: print(root.tag)
root

In [67]: etree.tostring(root)
Out[67]: b'<root>data</root>'

The XML() function behaves like fromstring(). It’s commonly used to write XML literals directly into Python source code:

In [68]: root = etree.XML(xml_data)

The HTML() function parses HTML literals (it’s easier to see what’s going on here):

In [70]: root = etree.HTML("<p>this is a test</p>")

In [71]: etree.tostring(root)
Out[71]: b'<html><body><p>this is a test</p></body></html>'

The parse() function parses files and file-like objects. parse() returns an ElementTree object, not an Element object.

In [76]: tree = etree.parse("/home/scossar/projects/python/hello_lxml/country_data.xml")

In [77]: etree.tostring(tree)
Out[77]: b'<data>\n    <country name="Liechtenstein">\n        <rank updated="yes">2</rank>\n        <year>2008</year>\n        <gdppc>141100</gdppc>\n        <neighbor name="Austria" direction="E"/>\n        <neighbor name="Switzerland" direction="W"/>\n    </country>\n    <country name="Singapore">\n        <rank updated="yes">5</rank>\n        <year>2011</year>\n        <gdppc>59900</gdppc>\n        <neighbor name="Malaysia" direction="N"/>\n    </country>\n    <country name="Panama">\n        <rank updated="yes">69</rank>\n        <year>2011</year>\n        <gdppc>13600</gdppc>\n        <neighbor name="Costa Rica" direction="W"/>\n        <neighbor name="Colombia" direction="E"/>\n    </country>\n</data>'

In [78]: prettyprint(tree)
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

In [79]: type(tree)
Out[79]: lxml.etree._ElementTree
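parse() also accepts file-like objects, which is handy for trying things out without a file on disk. A minimal sketch using io.BytesIO with a made-up one-country document:

```python
from io import BytesIO

from lxml import etree

source = BytesIO(b"<data><country name='Panama'/></data>")

tree = etree.parse(source)  # an ElementTree, not an Element
root = tree.getroot()       # the root Element of that tree

print(type(tree).__name__)               # _ElementTree
print(root.tag, root[0].get("name"))     # data Panama
```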

Namespaces and lxml

Summarized from Claude: When working with lxml you’ll sometimes need to account for namespaces in your XPath queries. E.g., if an element is in a namespace, you can’t just search for //div. You need to specify the namespace or match on the local name with an expression like //*[local-name()='div'].
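A small sketch of both options against a made-up XHTML fragment. The prefix "x" in the `ns` mapping is arbitrary; it only has to map to the right URI:

```python
from lxml import etree

doc = etree.XML(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>hello</div></body></html>"
)

# An unprefixed name in XPath only matches elements in no namespace
plain = doc.xpath("//div")

# Option 1: ignore the namespace and compare the local name
by_local_name = doc.xpath("//*[local-name()='div']")

# Option 2: bind a prefix of your choice to the namespace URI
ns = {"x": "http://www.w3.org/1999/xhtml"}
by_prefix = doc.xpath("//x:div", namespaces=ns)

print(plain)                  # []
print(by_local_name[0].text)  # hello
print(by_prefix[0].text)      # hello
```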

From the lxml docs: The ElementTree API avoids namespace prefixes whenever possible. It uses the real namespace (URI) instead:

In [80]: xhtml = etree.Element("{http://www.w3.org/1999/xhtml}html")

In [81]: body = etree.SubElement(xhtml, "{http://www.w3.org/1999/xhtml}body")

In [82]: body.text = "this is a test"

In [83]: prettyprint(xhtml)
<html:html xmlns:html="http://www.w3.org/1999/xhtml">
  <html:body>this is a test</html:body>
</html:html>

Typing like this is obviously error prone. The workaround is to store a namespace URI in a global variable:

In [88]: XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"

In [89]: XHTML = "{%s}" % XHTML_NAMESPACE

In [90]: XHTML
Out[90]: '{http://www.w3.org/1999/xhtml}'

In [91]: NSMAP = {None : XHTML_NAMESPACE}

In [93]: xhtml = etree.Element(XHTML + "html", nsmap=NSMAP)

In [94]: xhtml
Out[94]: <Element {http://www.w3.org/1999/xhtml}html at 0x7f49a058c540>

In [95]: xhtml = etree.Element(XHTML + "html", nsmap=NSMAP)

In [96]: body = etree.SubElement(xhtml, XHTML + "body")

In [97]: body.text = "this is a test"

In [98]: prettyprint(xhtml)
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>this is a test</body>
</html>

Use the QName helper class to build or split qualified tag names:

In [99]: tag = etree.QName("http://www.w3.org/1999/xhtml", "html")

In [100]: tag.localname
Out[100]: 'html'

In [101]: tag.namespace
Out[101]: 'http://www.w3.org/1999/xhtml'
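QName can also be built from an existing element, which is a convenient way to split a namespaced tag. A sketch, assuming the same XHTML namespace as above:

```python
from lxml import etree

XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml"

# QName.text is the full "{namespace}localname" form expected by Element()
html_tag = etree.QName(XHTML_NAMESPACE, "html").text
xhtml = etree.Element(html_tag, nsmap={None: XHTML_NAMESPACE})

# Passing an element to QName splits its tag back into the two parts
qname = etree.QName(xhtml)
print(qname.namespace)  # http://www.w3.org/1999/xhtml
print(qname.localname)  # html
print(qname.text)       # {http://www.w3.org/1999/xhtml}html
```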

Generating HTML (and XML) with lxml E-factory

The E factory (see What is a factory) provides syntax for generating HTML and XML.

from lxml import etree
from lxml.builder import E  # pyright: ignore


def prettyprint(element, **kwargs):
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end="")


def CLASS(*args):
    return {"class": " ".join(args)}


html = page = E.html(
    E.head(E.title("This is a test")),
    E.body(
        E.h1("Testing", CLASS("title")),
        E.p("This is a test with some ", E.bold("bold"), " text."),
        E.p("This is another paragraph", CLASS("foo")),
    ),
)

prettyprint(page)
# <html>
#   <head>
#     <title>This is a test</title>
#   </head>
#   <body>
#     <h1 class="title">Testing</h1>
#     <p>This is a test with some <bold>bold</bold> text.</p>
#     <p class="foo">This is another paragraph</p>
#   </body>
# </html>

E factory Element creation is based on attribute access. This makes it possible to create a vocabulary for an XML language. This is what’s used by lxml.html.builder. An example will make it clearer:

from lxml import etree
from lxml.builder import ElementMaker  # pyright: ignore


def prettyprint(element, **kwargs):
    xml = etree.tostring(element, pretty_print=True, **kwargs)
    print(xml.decode(), end="")


E = ElementMaker(
    namespace="http://zalgorithm/default/namespace",
    nsmap={"p": "http://zalgorithm/default/namespace"},
)

DOC = E.doc
TITLE = E.title
SECTION = E.section
PAR = E.par

test_doc = DOC(
    TITLE("This is a test"),
    SECTION(
        TITLE("Test section"),
        PAR("Once upon a time in a land far far away..."),
        PAR("There lived a..."),
    ),
)

prettyprint(test_doc)
# <p:doc xmlns:p="http://zalgorithm/default/namespace">
#   <p:title>This is a test</p:title>
#   <p:section>
#     <p:title>Test section</p:title>
#     <p:par>Once upon a time in a land far far away...</p:par>
#     <p:par>There lived a...</p:par>
#   </p:section>
# </p:doc>

The lxml.html.builder module uses the above approach.

lxml ElementPath

The (Python standard library) ElementTree library has a simple XPath-like language called ElementPath. It doesn’t support advanced features like value comparison and functions.

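lxml has a full XPath implementation (see: lxml XPath). It also supports ElementPath through four methods that are available on both Elements and ElementTrees (see the getelementpath() examples below for a call on an ElementTree): iterfind(), findall(), find(), and findtext().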

In [102]: root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# find a child of an Element:
In [103]: print(root.find("b"))
None

In [104]: print(root.find("a").tag)
a

# find an Element anywhere in the tree:
In [105]: print(root.find(".//b").tag)
b

In [106]: [b.tag for b in root.iterfind(".//b")]
Out[106]: ['b', 'b']

Find elements with a given attribute:

In [107]: print(root.findall(".//a[@x]")[0].tag)
a

In [108]: print(root.findall(".//a[@y]"))  # there are no a tags with a y attribute
[]
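findtext() doesn't appear in the session above, so here is a minimal sketch of it on the same document. The "missing" path and the "n/a" default are placeholders I made up:

```python
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# findtext() returns the .text of the first element matching the path
a_text = root.findtext("a")

# When nothing matches, the default is returned instead (None if not given)
fallback = root.findtext("missing", default="n/a")

print(a_text)    # aText
print(fallback)  # n/a
```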

Generate an ElementPath expression for an Element (from the ElementTree):

In [109]: tree = etree.ElementTree(root)

In [110]: a = root[0]

In [111]: a
Out[111]: <Element a at 0x7f49a0d5d8c0>

In [112]: print(tree.getelementpath(a[0]))
a/b[1]

In [113]: print(tree.getelementpath(a[1]))
a/c

In [114]: print(tree.getelementpath(a[2]))
a/b[2]

In [115]: tree.find(tree.getelementpath(a[2])) == a[2]
Out[115]: True

As long as the ElementTree isn’t modified, the path expression returned from getelementpath() acts as an identifier for a given Element. It can be passed to find() later to retrieve the same Element.

The iter() method is a special case. It only finds specific tags in the tree by their name, not by their path. The following are all equivalent. Note the use of next in the calls to iterfind() and iter() below:

In [116]: print(root.find(".//b").tag)
b

In [117]: print(next(root.iterfind(".//b")).tag)
b

In [118]: print(next(root.iter("b")).tag)
b

References

Behnel, Stefan. “The lxml.etree Tutorial”. https://lxml.de/tutorial.html.

Wikipedia contributors. “XML.” Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=XML&oldid=1327771438. Accessed December 22, 2025.

W3 Schools. “XML What Is”. W3 Schools. https://www.w3schools.com/xml/xml_whatis.asp.

Mozilla Developers. “Namespaces crash course”. https://developer.mozilla.org/en-US/docs/Web/SVG/Guides/Namespaces_crash_course.


  1. Wikipedia contributors, “XML,” Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=XML&oldid=1327771438 (accessed December 22, 2025).

  2. “xml.etree.ElementTree — The ElementTree XML API,” https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree.