BLACKSTONE
Open source natural language processing for Legal Texts
Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from the Incorporated Council of Law Reporting for England and Wales' research lab, ICLR&D.
NEWS
23 SEPTEMBER 2019 | Sentence boundary detection pipeline component released |
20 AUGUST 2019 | Blackstone concept extractor pipeline component released |
9 AUGUST 2019 | Blackstone proto model and library goes live |
WHY ARE WE BUILDING BLACKSTONE?
The past several years have seen a surge in activity at the intersection of law and technology. However, in the United Kingdom, the overwhelming bulk of that activity has taken place in law firms and other commercial contexts. The consequence is that, notwithstanding the never-ending flurry of development in the legal-informatics space, almost none of the research is made available on an open-source basis.
Moreover, the majority of research in the UK legal-informatics domain (whether open or closed) has focussed on the development of NLP applications for automating contracts and other legal documents that are transactional in nature. This is understandable, because the principal benefactors of legal NLP research in the UK are law firms, and law firms tend not to find it difficult to get their hands on transactional documentation that can be harnessed as training data.
The problem, as we see it, is that legal NLP research in the UK has become over-concentrated on commercial applications. We think it is worth investing in developing legal NLP research, and making it available, for other legal texts, such as judgments, scholarly articles, skeleton arguments and pleadings.
WHAT'S SPECIAL ABOUT BLACKSTONE?
So far as we are aware, Blackstone is the first open source model trained for use on long-form texts containing common law entities and concepts.
Blackstone is built on spaCy, which makes it easy to pick up and apply to your own data.
Blackstone has been trained on data spanning a considerable temporal period (as early as texts drafted in the 1860s). This is useful because an interesting quirk of the common law is that older writings (particularly, judgments) go on to remain relevant for many, many years.
It is free and open source.
It is imperfect and makes no attempt to hide that fact from you.
INSTALLATION
Note: It is strongly recommended that you install Blackstone into a virtual environment! See here for more on virtual environments. Blackstone should be compatible with Python 3.6 and higher.
To install Blackstone, follow these steps:
- Install the library
- Install the Blackstone model
The first step is to install the library, which at present contains a handful of custom spaCy components. Install the library like so:
pip install blackstone
The second step is to install the spaCy model. Install the model like so:
pip install https://blackstone-model.s3-eu-west-1.amazonaws.com/en_blackstone_proto-0.0.1.tar.gz
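Once both installs have finished, a quick check that the packages can be found may save confusion later. The sketch below uses only the standard library. The helper name `check_installed` is ours, and it assumes the library is importable as `blackstone` and the model package as `en_blackstone_proto`, which are the names used in the examples in this README.

```python
from importlib.util import find_spec


def check_installed(package_names):
    """Map each top-level package name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in package_names}


if __name__ == "__main__":
    # Assumed import names for the library and the model package.
    status = check_installed(["blackstone", "en_blackstone_proto"])
    for name, found in status.items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If either name is reported as missing, the most likely cause is that `pip` installed into a different environment from the one running the check.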
NAMED-ENTITY RECOGNITION
The NER component of the Blackstone model has been trained to detect the following entity types.
Entity Label | Description | Example |
---|---|---|
CASENAME | Case names | Smith v Jones, In re Jones, In Jones' case |
CITATION | Citations (unique identifiers for reported and unreported cases) | (2002) 2 Cr App R 123 |
INSTRUMENT | Written legal instruments | Theft Act 1968, European Convention on Human Rights, CPR |
PROVISION | Unit within a written legal instrument | section 1, art 2(3) |
COURT | Court or tribunal | Court of Appeal, Upper Tribunal |
JUDGE | References to judges | Eady J, Lord Bingham of Cornhill |
Applying the NER model
Example of how the model is applied to some text taken from para 31 of the Divisional Court's judgment in R (Miller) v Secretary of State for Exiting the European Union (Birnie intervening) [2017] UKSC 5; [2018] AC 61
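As a minimal sketch of working with the NER output: spaCy exposes detected entities on `doc.ents`, and each entity has `.text` and `.label_` attributes. The helper `entities_by_label` below is a hypothetical convenience function, not part of Blackstone, that groups entity texts by label. It is demonstrated on stand-in objects so that it runs without the model; the sample entities are made up for illustration and are not real model output.

```python
from collections import defaultdict
from types import SimpleNamespace


def entities_by_label(ents):
    """Group entity texts by label, skipping repeats of the same text."""
    grouped = defaultdict(list)
    for ent in ents:
        if ent.text not in grouped[ent.label_]:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Stand-ins for spaCy spans. With the model installed you would instead pass
# nlp(text).ents, where nlp = spacy.load("en_blackstone_proto").
sample_ents = [
    SimpleNamespace(text="Theft Act 1968", label_="INSTRUMENT"),
    SimpleNamespace(text="section 1", label_="PROVISION"),
    SimpleNamespace(text="Theft Act 1968", label_="INSTRUMENT"),
]

for label, texts in entities_by_label(sample_ents).items():
    print(label, texts)
```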
TEXT CLASSIFICATION
This release of Blackstone also comes with a text categoriser. In contrast with the NER component (which has been trained to identify tokens and series of tokens of interest), the text categoriser classifies longer spans of text, such as sentences. The text categoriser has been trained to classify text according to one of five mutually exclusive categories, which are as follows:
Category | Description |
---|---|
CONCLUSION | The text appears to make a finding, holding, determination or conclusion |
ISSUE | The text appears to discuss an issue or question |
LEGAL_TEST | The text appears to discuss a legal test |
AXIOM | The text appears to postulate a well-established principle |
UNCAT | The text does not fall into one of the four categories above |
Applying the Textcat model
Blackstone's text categoriser generates a predicted categorisation for a Doc. The textcat pipeline component has been designed to be applied to individual sentences rather than a single document consisting of many sentences.
import spacy
# Load the model
nlp = spacy.load("en_blackstone_proto")
def get_top_cat(doc):
    """
    Function to identify the highest scoring category
    prediction generated by the text categoriser.
    """
    cats = doc.cats
    max_score = max(cats.values())
    max_cats = [k for k, v in cats.items() if v == max_score]
    max_cat = max_cats[0]
    return (max_cat, max_score)
text = """
It is a well-established principle of law that the transactions of independent states between each other are governed by other laws than those which municipal courts administer. \
It is, however, in my judgment, insufficient to react to the danger of over-formalisation and “judicialisation” simply by emphasising flexibility and context-sensitivity. \
The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A).
"""
# Apply the model to the text
doc = nlp(text)
# Get the sentences in the passage of text
sentences = [sent.text for sent in doc.sents]
# Print the sentence and the corresponding predicted category.
for sentence in sentences:
    doc = nlp(sentence)
    top_category = get_top_cat(doc)
    print(f"\"{sentence}\" {top_category}\n")
>>> "In my judgment, it is patently obvious that cats are a type of dog." ('CONCLUSION', 0.9990500807762146)
>>> "It is a well settled principle that theft is wrong." ('AXIOM', 0.556410014629364)
>>> "The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A)." ('ISSUE', 0.5040785074234009)
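Scores close to 0.5, like the last two predictions above, suggest that the categoriser is not very sure of its answer. One option is to treat low-scoring predictions as uncategorised. The sketch below is an illustration under stated assumptions: the helper name `top_cat_or_uncat` and the 0.6 threshold are ours, not part of Blackstone, and the function works on any dict shaped like `doc.cats`.

```python
def top_cat_or_uncat(cats, threshold=0.6, fallback="UNCAT"):
    """Return (label, score) for the best category, or (fallback, score)
    when the best score falls below the threshold. The default threshold
    is arbitrary and should be tuned on your own data."""
    if not cats:
        return (fallback, 0.0)
    label, score = max(cats.items(), key=lambda item: item[1])
    if score < threshold:
        return (fallback, score)
    return (label, score)


# A dict shaped like doc.cats; the scores are invented for illustration.
example_cats = {
    "AXIOM": 0.55,
    "CONCLUSION": 0.20,
    "ISSUE": 0.15,
    "LEGAL_TEST": 0.05,
    "UNCAT": 0.05,
}
print(top_cat_or_uncat(example_cats))
print(top_cat_or_uncat(example_cats, threshold=0.5))
```

With the model loaded, the same function could be called as `top_cat_or_uncat(doc.cats)` on a sentence-level `Doc`.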
CUSTOM PIPELINE COMPONENTS
In addition to the core model, this proto release of Blackstone comes with three custom components.
Abbreviation detection - this is heavily based on the AbbreviationDetector() component in scispaCy and resolves an abbreviated form to its long-form definition, e.g. ECtHR -> European Court of Human Rights.
Legislation linker - this is an alpha component that attempts to resolve references to provisions to their parent instrument (more on this further down the README).
Compound case reference detection - again, this is an alpha component that attempts to identify CASENAME and CITATION pairs, enabling the merging of a CITATION with its parent CASENAME.
Compound Case References
The compound case reference detection component in Blackstone is designed to marry up CITATION entities with their parent CASENAME entities.
Common law jurisdictions typically refer to cases through a coupling of a name (typically derived from the names of the parties in the case) and some unique citation to identify where the case has been reported, like so:
Regina v Horncastle [2010] 2 AC 373
Blackstone's NER model separately attempts to identify the CASENAME and CITATION entities. However, it is potentially useful (particularly in the context of information extraction) to pull these entities out as pairs.
CompoundCases() applies a custom pipe after the NER and identifies CASENAME/CITATION pairs in two scenarios:
The standard scenario:
Gelmini v Moriggia [1913] 2 KB 549
The possessive scenario (which is a little antiquated):
Jones' case [1915] 1 KB 45
import spacy
from blackstone.compound_cases import CompoundCases
nlp = spacy.load("en_blackstone_proto")
compound_pipe = CompoundCases(nlp)
nlp.add_pipe(compound_pipe)
text = "As I have indicated, this was the central issue before the judge. On this issue the defendants relied (successfully below) on the decision of the High Court in Gelmini v Moriggia [1913] 2 KB 549. In Jone's case [1915] 1 KB 45, the defendant wore a hat."
doc = nlp(text)
for compound_ref in doc._.compound_cases:
    print(compound_ref)
>>> Gelmini v Moriggia [1913] 2 KB 549
>>> Jone's case [1915] 1 KB 45
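To make the pairing idea concrete, here is a minimal sketch of the standard scenario only: a CASENAME followed immediately by a CITATION. It is not the logic used by `CompoundCases()`. The function name `pair_adjacent` and the `(label, start, end)` tuple format are assumptions made for illustration, with `start` and `end` as token offsets in the style of spaCy spans.

```python
def pair_adjacent(entities, max_gap=0):
    """Pair each CASENAME with a CITATION that starts within max_gap tokens
    of the case name's end. Entities are (label, start, end) tuples with an
    exclusive end offset, sorted by start."""
    pairs = []
    for current, following in zip(entities, entities[1:]):
        cur_label, _, cur_end = current
        next_label, next_start, _ = following
        gap = next_start - cur_end
        is_pair = cur_label == "CASENAME" and next_label == "CITATION"
        if is_pair and 0 <= gap <= max_gap:
            pairs.append((current, following))
    return pairs


# Hypothetical token offsets for a passage containing a case name, its
# citation and, a few tokens later, the name of a court.
entities = [
    ("CASENAME", 10, 13),
    ("CITATION", 13, 19),
    ("COURT", 25, 28),
]
print(pair_adjacent(entities))
```

The `max_gap` parameter allows for a token or two between the name and the citation, such as a comma, at the cost of more false pairings.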
Sentence Boundary Detection
Blackstone ships with a custom rule-based sentence segmenter that addresses a range of characteristics inherent in legal texts that have a tendency to baffle out-of-the-box sentence segmentation rules.
import spacy
from blackstone.segmenter import sentence_segmenter
nlp = spacy.load("en_blackstone_proto")
# remove the default spaCy sentencizer from the model pipeline
if "sentencizer" in nlp.pipe_names:
    nlp.remove_pipe("sentencizer")
# add the Blackstone sentence_segmenter to the pipeline before the parser
nlp.add_pipe(sentence_segmenter, before="parser")
doc = nlp("Some more legal text goes here. And a little bit more legal text goes here")
for sent in doc.sents:
    print(sent.text)
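To illustrate the kind of problem a legal sentence segmenter has to deal with, the sketch below compares a naive split on sentence-final punctuation with a variant that refuses to split after a small set of legal abbreviations. It is a toy illustration, not Blackstone's rule set; the function names and the abbreviation list are assumptions.

```python
import re

# Illustrative only; Blackstone's own rules are not reproduced here.
LEGAL_ABBREVIATIONS = {"para.", "paras.", "s.", "ss.", "art.", "v."}


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text.strip())


def abbreviation_aware_split(text):
    """Like naive_split, but re-join a chunk that ends in a known
    abbreviation with the chunk that follows it."""
    sentences = []
    carry = ""
    for chunk in naive_split(text):
        if not chunk:
            continue
        candidate = f"{carry} {chunk}" if carry else chunk
        last_word = candidate.split()[-1].lower()
        if last_word in LEGAL_ABBREVIATIONS:
            carry = candidate  # hold on and merge with the next chunk
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


text = "See para. 31 of the judgment. The appeal was dismissed."
print(naive_split(text))
print(abbreviation_aware_split(text))
```

The naive version breaks the first sentence in two after "para.", while the abbreviation-aware version keeps it whole.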
Abbreviation Detection
It is not uncommon for authors of legal documents to define an abbreviation for a long-winded term and then use the abbreviation in place of the long form throughout the rest of the document. For example,
The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").
The abbreviation detection component in Blackstone seeks to address this by implementing an ever so slightly modified version of scispaCy's AbbreviationDetector() (which is itself an implementation of the approach set out in this paper: https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf). Our implementation is still rough around the edges.
import spacy
from blackstone.abbreviations import AbbreviationDetector
nlp = spacy.load("en_blackstone_proto")
# Add the abbreviation pipe to the spacy pipeline.
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)
doc = nlp('The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").')
print("Abbreviation", "\t", "Definition")
for abrv in doc._.abbreviations:
    print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")
>>> "ECtHR" (7, 10) European Court of Human Rights
>>> "ECHR" (25, 28) European Convention on Human Rights
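In the output above, the detected short forms include their surrounding quotation marks. If a plain lookup table from short form to long form is wanted, something like the following sketch may help. The helper name `abbreviation_map` is hypothetical, and the input is assumed to be `(short_form, long_form)` string pairs.

```python
def abbreviation_map(pairs):
    """Build {short_form: long_form} from (short, long) string pairs,
    stripping brackets and straight or curly quotes around the short form.
    The first definition seen for a short form wins."""
    mapping = {}
    for short, long_form in pairs:
        key = short.strip(" \"'()“”‘’")
        if key and key not in mapping:
            mapping[key] = long_form.strip()
    return mapping


# Pairs as they might be collected with
# [(str(a), str(a._.long_form)) for a in doc._.abbreviations]
pairs = [
    ('"ECtHR"', "European Court of Human Rights"),
    ('"ECHR"', "European Convention on Human Rights"),
    ('"ECHR"', "European Convention on Human Rights"),
]
print(abbreviation_map(pairs))
```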
Legislation Linker
Blackstone's Legislation Linker attempts to couple a reference to a PROVISION to its parent INSTRUMENT by using the NER model to identify the presence of an INSTRUMENT and then navigating the dependency tree to identify the child provision.
Once Blackstone has identified a PROVISION:INSTRUMENT pair, it will attempt to generate target URLs to both the provision and the instrument on legislation.gov.uk.
import spacy
from blackstone.legislation_linker import extract_legislation_relations
nlp = spacy.load("en_blackstone_proto")
text = "The Secretary of State was at pains to emphasise that, if a withdrawal agreement is made, it is very likely to be a treaty requiring ratification and as such would have to be submitted for review by Parliament, acting separately, under the negative resolution procedure set out in section 20 of the Constitutional Reform and Governance Act 2010. Theft is defined in section 1 of the Theft Act 1968"
doc = nlp(text)
relations = extract_legislation_relations(doc)
for provision, provision_url, instrument, instrument_url in relations:
    print(f"\n{provision}\t{provision_url}\t{instrument}\t{instrument_url}")
>>> section 20 http://www.legislation.gov.uk/ukpga/2010/25/section/20 Constitutional Reform and Governance Act 2010 http://www.legislation.gov.uk/ukpga/2010/25/contents
>>> section 1 http://www.legislation.gov.uk/ukpga/1968/60/section/1 Theft Act 1968 http://www.legislation.gov.uk/ukpga/1968/60/contents
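The sample output suggests a simple relationship between the two URLs: the instrument URL ends in `/contents`, and the provision URL replaces that suffix with `/section/<number>`. The sketch below reproduces that pattern for plain `section N` references only. It is inferred from the output above rather than taken from Blackstone's source, and the function name `section_url` is hypothetical.

```python
import re


def section_url(instrument_url, provision):
    """Derive a section URL from an instrument's '/contents' URL.

    Returns None when the provision is not a simple 'section N' reference
    or when the instrument URL does not end in '/contents'."""
    match = re.fullmatch(
        r"(?:section|s\.?)\s*(\d+[A-Za-z]*)",
        provision.strip(),
        flags=re.IGNORECASE,
    )
    if match is None or not instrument_url.endswith("/contents"):
        return None
    base = instrument_url[: -len("/contents")]
    return f"{base}/section/{match.group(1)}"


print(section_url("http://www.legislation.gov.uk/ukpga/1968/60/contents", "section 1"))
```

Provisions in other forms, such as `art 2(3)`, are deliberately left unresolved here and return `None`.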