Open source natural language processing for Legal Texts


Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from the Incorporated Council of Law Reporting for England and Wales' research lab, ICLR&D.



The past several years have seen a surge in activity at the intersection of law and technology. However, in the United Kingdom, the overwhelming bulk of that activity has taken place in law firms and other commercial contexts. As a result, notwithstanding the never-ending flurry of development in the legal-informatics space, almost none of the research is made available on an open-source basis.

Moreover, the majority of research in the UK legal-informatics domain (whether open or closed) has focussed on the development of NLP applications for automating contracts and other legal documents that are transactional in nature. This is understandable, because the principal benefactors of legal NLP research in the UK are law firms, and law firms tend not to find it difficult to get their hands on transactional documentation that can be harnessed as training data.

The problem, as we see it, is that legal NLP research in the UK has become over-concentrated on commercial applications, and that it is worth investing in legal NLP research that is made available for other kinds of legal text, such as judgments, scholarly articles, skeleton arguments and pleadings.


  • So far as we are aware, Blackstone is the first open source model trained for use on long-form texts containing common law entities and concepts.

  • Blackstone is built on spaCy, which makes it easy to pick up and apply to your own data.

  • Blackstone has been trained on data spanning a considerable temporal period (as early as texts drafted in the 1860s). This is useful because an interesting quirk of the common law is that older writings (particularly, judgments) go on to remain relevant for many, many years.

  • It is free and open source.

  • It is imperfect and makes no attempt to hide that fact from you.


Note: It is strongly recommended that you install Blackstone into a virtual environment! See here for more on virtual environments. Blackstone should be compatible with Python 3.6 and higher.

To install Blackstone follow these steps:

  1. Install the library

     The first step is to install the library, which at present contains a handful of custom spaCy components. Install the library like so:

    pip install blackstone

  2. Install the Blackstone model

     The second step is to install the spaCy model. Install the model like so:

    pip install
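
Once both steps have been run, a quick sanity check along the following lines may be useful. This is only a sketch: it assumes the library is importable as `blackstone` (as in the import statements used later in this README) and that the model is installed as a package named `en_blackstone_proto` (the name passed to `spacy.load()` below). The helper `check_blackstone_install` is a hypothetical name, not part of Blackstone.

```python
import importlib.util
import sys


def check_blackstone_install():
    """Report whether the interpreter and both packages look usable."""
    return {
        # Blackstone should be compatible with Python 3.6 and higher.
        "python_ok": sys.version_info >= (3, 6),
        # Assumption: the library is importable as `blackstone`.
        "library_installed": importlib.util.find_spec("blackstone") is not None,
        # Assumption: the model is installed as a package with this name.
        "model_installed": importlib.util.find_spec("en_blackstone_proto") is not None,
    }


if __name__ == "__main__":
    for check, passed in check_blackstone_install().items():
        print(f"{check}: {'yes' if passed else 'no'}")
```

If either package is reported as missing, check that the virtual environment you installed into is the one currently active.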


The NER component of the Blackstone model has been trained to detect the following entity types.

Entity Label | Description | Example
-------------|-------------|--------
CASENAME | Case names | Smith v Jones, In re Jones, In Jones' case
CITATION | Citations (unique identifiers for reported and unreported cases) | (2002) 2 Cr App R 123
INSTRUMENT | Written legal instruments | Theft Act 1968, European Convention on Human Rights, CPR
PROVISION | Unit within a written legal instrument | section 1, art 2(3)
COURT | Court or tribunal | Court of Appeal, Upper Tribunal
JUDGE | References to judges | Eady J, Lord Bingham of Cornhill

Applying the NER model

Example of how the model is applied to some text taken from para 31 of the Divisional Court's judgment in R (Miller) v Secretary of State for Exiting the European Union (Birnie intervening) [2017] UKSC 5; [2018] AC 61
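
As a rough sketch of what working with the NER output might look like, the helper below groups entity texts by label. It relies only on spaCy's standard `doc.ents`, `ent.text` and `ent.label_` attributes. The function name `entities_by_label` is hypothetical, and the stand-in object exists purely so the snippet runs without the model installed: it is not real model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(doc):
    """Group entity texts by label for any object exposing spaCy-style `ents`."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# With the model installed, `doc` would instead come from:
#   import spacy
#   nlp = spacy.load("en_blackstone_proto")
#   doc = nlp(text)
# The stand-in below only imitates the `text` and `label_` attributes of a
# spaCy Span. The entities are taken from the table above, NOT from the model.
stand_in_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Theft Act 1968", label_="INSTRUMENT"),
    SimpleNamespace(text="section 1", label_="PROVISION"),
    SimpleNamespace(text="Court of Appeal", label_="COURT"),
    SimpleNamespace(text="art 2(3)", label_="PROVISION"),
])

for label, texts in entities_by_label(stand_in_doc).items():
    print(label, texts)
```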


This release of Blackstone also comes with a text categoriser. In contrast with the NER component (which has been trained to identify tokens and series of tokens of interest), the text categoriser classifies longer spans of text, such as sentences. The text categoriser has been trained to classify text according to one of five mutually exclusive categories, which are as follows:

Category | Description
---------|------------
AXIOM | The text appears to postulate a well-established principle
CONCLUSION | The text appears to make a finding, holding, determination or conclusion
ISSUE | The text appears to discuss an issue or question
LEGAL_TEST | The text appears to discuss a legal test
UNCAT | The text does not fall into one of the four categories above

Applying the Textcat model

Blackstone's text categoriser generates a predicted categorisation for a Doc. The textcat pipeline component has been designed to be applied to individual sentences rather than a single document consisting of many sentences.

import spacy

# Load the model
nlp = spacy.load("en_blackstone_proto")

def get_top_cat(doc):
    """
    Function to identify the highest scoring category
    prediction generated by the text categoriser.
    """
    cats = doc.cats
    max_score = max(cats.values()) 
    max_cats = [k for k, v in cats.items() if v == max_score]
    max_cat = max_cats[0]
    return (max_cat, max_score)

text = """
It is a well-established principle of law that the transactions of independent states between each other are governed by other laws than those which municipal courts administer. \
It is, however, in my judgment, insufficient to react to the danger of over-formalisation and “judicialisation” simply by emphasising flexibility and context-sensitivity. \
The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A).
"""

# Apply the model to the text
doc = nlp(text)

# Get the sentences in the passage of text
sentences = [sent.text for sent in doc.sents]

# Print the sentence and the corresponding predicted category.
for sentence in sentences:
    doc = nlp(sentence)
    top_category = get_top_cat(doc)
    print (f"\"{sentence}\" {top_category}\n")

>>> "In my judgment, it is patently obvious that cats are a type of dog." ('CONCLUSION', 0.9990500807762146)
>>> "It is a well settled principle that theft is wrong." ('AXIOM', 0.556410014629364)
>>> "The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A)." ('ISSUE', 0.5040785074234009)
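
Two of the scores above are only a little over 0.5, so it may be sensible to treat weak predictions with caution. The sketch below is one possible approach and not part of Blackstone: it falls back to UNCAT when the best score is under a cut-off. The function name `categorise`, the 0.6 threshold and the example scores are all illustrative assumptions.

```python
def categorise(cats, threshold=0.6, fallback="UNCAT"):
    """Return (category, score), using `fallback` when the top score is weak.

    `cats` is a mapping of category name to score, shaped like `doc.cats`.
    The default threshold is an arbitrary illustrative value.
    """
    top_cat = max(cats, key=cats.get)
    score = cats[top_cat]
    if score < threshold:
        return (fallback, score)
    return (top_cat, score)


# Made-up scores for illustration only; real values come from `doc.cats`.
confident = {"CONCLUSION": 0.99, "ISSUE": 0.01}
borderline = {"AXIOM": 0.55, "ISSUE": 0.3}

print(categorise(confident))
print(categorise(borderline))
```

A suitable threshold depends on how costly a wrong label is for your use case, so it is worth tuning against your own data.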


In addition to the core model, this proto release of Blackstone comes with three custom components.

  • Abbreviation detection - this is heavily based on the AbbreviationDetector() component in scispaCy and resolves an abbreviated form to its long form definition, e.g. ECtHR -> European Court of Human Rights.

  • Legislation linker - this is an alpha component that attempts to resolve references to provisions to their parent instrument (more on this further down the README).

  • Compound case reference detection - again, this is an alpha component that attempts to identify CASENAME and CITATION pairs, enabling the merging of a CITATION with its parent CASENAME.

Compound Case References

The compound case reference detection component in Blackstone is designed to marry up CITATION entities with their parent CASENAME entities.

Common law jurisdictions typically refer to cases through a coupling of a name (typically derived from the names of the parties in the case) and some unique citation to identify where the case has been reported, like so:

Regina v Horncastle [2010] 2 AC 373

Blackstone's NER model separately attempts to identify the CASENAME and CITATION entities. However, it is potentially useful (particularly in the context of information extraction) to pull these entities out as pairs.

CompoundCases() applies a custom pipe after the NER and identifies CASENAME/CITATION pairs in two scenarios:

  • The standard scenario: Gelmini v Moriggia [1913] 2 KB 549

  • The possessive scenario (which is a little antiquated): Jones' case [1915] 1 KB 45

import spacy
from blackstone.compound_cases import CompoundCases

nlp = spacy.load("en_blackstone_proto")

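compound_pipe = CompoundCases(nlp)
# Add the compound case pipe to the pipeline so that it runs after the NER
nlp.add_pipe(compound_pipe)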

text = "As I have indicated, this was the central issue before the judge. On this issue the defendants relied (successfully below) on the decision of the High Court in Gelmini v Moriggia [1913] 2 KB 549. In Jone's case [1915] 1 KB 45, the defendant wore a hat."
doc = nlp(text)

for compound_ref in doc._.compound_cases:
    print(compound_ref)

>>> Gelmini v Moriggia [1913] 2 KB 549
>>> Jone's case [1915] 1 KB 45
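
To illustrate the general idea of pairing, the sketch below couples a CASENAME with a CITATION that immediately follows it. It is a simplified stand-in and not how CompoundCases() is implemented: the `Ent` tuple, the `pair_case_references` function, the `max_gap` parameter and the token offsets are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Ent(NamedTuple):
    label: str
    start: int  # index of the first token
    end: int    # index one past the last token
    text: str


def pair_case_references(ents: List[Ent], max_gap: int = 0) -> List[Tuple[Ent, Ent]]:
    """Pair a CASENAME with the next entity if it is a CITATION starting
    within `max_gap` tokens of the end of the name."""
    ordered = sorted(ents, key=lambda ent: ent.start)
    pairs = []
    for current, following in zip(ordered, ordered[1:]):
        gap = following.start - current.end
        if (current.label == "CASENAME"
                and following.label == "CITATION"
                and 0 <= gap <= max_gap):
            pairs.append((current, following))
    return pairs


# Hypothetical entities and token offsets, for illustration only.
ents = [
    Ent("COURT", 4, 6, "High Court"),
    Ent("CASENAME", 7, 10, "Gelmini v Moriggia"),
    Ent("CITATION", 10, 16, "[1913] 2 KB 549"),
]

for name, citation in pair_case_references(ents):
    print(f"{name.text} {citation.text}")
```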

Sentence Boundary Detection

Blackstone ships with a custom rule-based sentence segmenter that addresses a range of characteristics inherent in legal texts that have a tendency to baffle out-of-the-box sentence segmentation rules.

import spacy
from blackstone.segmenter import sentence_segmenter

nlp = spacy.load("en_blackstone_proto")

# remove the default spaCy sentencizer from the model pipeline
if "sentencizer" in nlp.pipe_names:
    nlp.remove_pipe("sentencizer")

# add the Blackstone sentence_segmenter to the pipeline before the parser
nlp.add_pipe(sentence_segmenter, before="parser")

doc = nlp("Some more legal text goes here. And a little bit more legal text goes here")

for sent in doc.sents:
    print (sent.text)

Abbreviation Detection

It is not uncommon for authors of legal documents to define an abbreviation for a long-winded term and then use it in place of the long form throughout the rest of the document. For example,

The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").

The abbreviation detection component in Blackstone seeks to address this by implementing an ever so slightly modified version of scispaCy's AbbreviationDetector() (which is itself an implementation of the approach set out in this paper). Our implementation is still rough around the edges.

import spacy
from blackstone.abbreviations import AbbreviationDetector

nlp = spacy.load("en_blackstone_proto")

# Add the abbreviation pipe to the spaCy pipeline.
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

doc = nlp('The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").')

print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

>>> "ECtHR"          (7, 10) European Court of Human Rights
>>> "ECHR"   (25, 28) European Convention on Human Rights
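
For readers curious about how a short form can be matched to its long form, here is a plain-Python sketch of one well-known approach: scan the text before a parenthesised short form from right to left, matching each short-form character in turn, and require the first character to start a word. It is an assumption that this resembles what the component does internally. The sketch works on raw strings rather than spaCy tokens, and the names `SHORT_FORM`, `best_long_form` and `find_abbreviations` are hypothetical.

```python
import re

# Matches a parenthesised short form such as ("ECtHR") or (ECHR).
SHORT_FORM = re.compile(r'\(\s*["“]?([A-Za-z][A-Za-z0-9]{1,9})["”]?\s*\)')


def best_long_form(short_form, candidate):
    """Match short-form characters right to left within the candidate text."""
    s = len(short_form) - 1
    pos = len(candidate) - 1
    while s >= 0:
        char = short_form[s].lower()
        # Move left until the character matches. The first short-form
        # character must additionally fall at the start of a word.
        while pos >= 0 and (
            candidate[pos].lower() != char
            or (s == 0 and pos > 0 and candidate[pos - 1].isalnum())
        ):
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
        s -= 1
    # Trim the candidate so it begins at the word holding the first match.
    start = candidate.rfind(" ", 0, pos + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return a {short form: long form} mapping for parenthesised short forms."""
    found = {}
    for match in SHORT_FORM.finditer(text):
        short_form = match.group(1)
        words = text[:match.start()].split()
        # Only look back a limited number of words before the parenthesis.
        window = min(len(short_form) + 5, len(short_form) * 2)
        long_form = best_long_form(short_form, " ".join(words[-window:]))
        if long_form:
            found[short_form] = long_form
    return found


example = ('The European Court of Human Rights ("ECtHR") is the court ultimately '
           'responsible for applying the European Convention on Human Rights ("ECHR").')

for short_form, long_form in find_abbreviations(example).items():
    print(f"{short_form}\t{long_form}")
```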

Legislation Linker

Blackstone's Legislation Linker attempts to couple a reference to a PROVISION to its parent INSTRUMENT by using the NER model to identify the presence of an INSTRUMENT and then navigating the dependency tree to identify the child provision.

Once Blackstone has identified a PROVISION:INSTRUMENT pair, it will attempt to generate target URLs to both the provision and the instrument on

import spacy
from blackstone.legislation_linker import extract_legislation_relations
nlp = spacy.load("en_blackstone_proto")

text = "The Secretary of State was at pains to emphasise that, if a withdrawal agreement is made, it is very likely to be a treaty requiring ratification and as such would have to be submitted for review by Parliament, acting separately, under the negative resolution procedure set out in section 20 of the Constitutional Reform and Governance Act 2010. Theft is defined in section 1 of the Theft Act 1968"

doc = nlp(text) 
relations = extract_legislation_relations(doc)
for provision, provision_url, instrument, instrument_url in relations:
    print(f"{provision}\t{instrument}")

>>> section 20  Constitutional Reform and Governance Act 2010

>>> section 1   Theft Act 1968
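
To make the shape of a provision and instrument pair concrete, the sketch below pulls out pairs from phrases of the form "section N of the X Act YYYY" using a regular expression. This is deliberately not the approach described above: it uses neither the NER model nor the dependency tree, it generates no URLs, and it will miss any other phrasing. The names `PROVISION_OF_ACT` and `link_provisions` are hypothetical.

```python
import re

# Matches e.g. "section 20 of the Constitutional Reform and Governance Act 2010".
PROVISION_OF_ACT = re.compile(
    r"(?P<provision>section\s+\d+[A-Z]?)"
    r"\s+of\s+the\s+"
    r"(?P<instrument>[A-Z][A-Za-z]*(?:\s+(?:[A-Z][A-Za-z]*|and|of))*\s+Act\s+\d{4})"
)


def link_provisions(text):
    """Return (provision, instrument) pairs found by the pattern above."""
    return [
        (match.group("provision"), match.group("instrument"))
        for match in PROVISION_OF_ACT.finditer(text)
    ]


sample = (
    "The procedure is set out in section 20 of the Constitutional Reform and "
    "Governance Act 2010. Theft is defined in section 1 of the Theft Act 1968."
)

for provision, instrument in link_provisions(sample):
    print(f"{provision}\t{instrument}")
```

A pattern like this breaks down quickly on real judgments (for example "s 1", "sections 1 to 3", or an instrument named several sentences earlier), which is the kind of case the dependency-based approach is meant to handle.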