Veronika Fikfak is an Associate Professor at the University of Copenhagen, where she leads the ERC Human Rights Nudge project (ERC 803891), which looks at how and when states change their behaviour in response to ECtHR judgments. The team uses computational methods to analyse large datasets of ECtHR case law and follow-up processes before the Committee of Ministers, and applies behavioural insights to understand the behaviour of individuals, institutions and states.
The President of the European Court of Human Rights, Robert Spano, recently announced that the Court was looking into introducing algorithmic decision-making to help with new applications coming to the Court. In a speech to the German Marshall Fund of the United States, Spano said:
We have throughout the last decade been introducing reforms and one of them, certainly, is the use of information technology. We are now in a phase where we are looking at to what extent we can, for example, at the registration phase introduce algorithmic or automated decision making so as to try and reduce the extent to which this classical registration phase has to all be done manually. … When it is done, we can use the data introduced into the system in a more effective manner. But I do think moving to the future a mass, a bulk case court like ours will slowly start introducing algorithmic tools to facilitate its tasks. (see from min. 16 onwards)
The statement is made in the context of a discussion of the extensive workload under which the Court finds itself. At this point, there are more than 65,000 applications pending before the Court, many of which have been waiting for years. It is therefore understandable that the President would be enthusiastic about exploring all the options available to the Court to filter the cases coming through the pipeline. Whilst Spano acknowledged in his speech that this move will require research and an understanding of the various modalities, the indication was clear: automated decision-making will soon form part of the process before the European Court of Human Rights.
In this blog, I wish to show how problematic any introduction of automated decision-making could be. I raise questions that the Court will have to address when introducing automated decision-making, concerning the dataset from which the algorithm will ‘learn’, the potentially discriminatory outcomes it could produce, and the biases it might reinforce if automated decision-making is ‘only’ to form part of an interaction between the Registry and the machine. I start, however, with the preliminary question of what the algorithm is for.
The purpose of automated decision-making
It is unclear at this point what purpose – beyond reducing the workload – automated decision-making would serve. For example, would it – as Spano initially suggested – help at the registration phase, where initial information about the victim, their lawyer, the respondent state, and the alleged violations is recorded? Or would it also assist in determining the admissibility of cases, e.g. automatically rejecting applications brought by the same victim, out-of-time claims, or incomplete applications? Or could the Court go further and use the algorithm to assess the evidence presented and the credibility of the allegations made?
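To make the simplest of these scenarios concrete, here is a minimal sketch of what a rule-based registration filter could look like. The field names, the time limit, and the flagging rules are assumptions for illustration only, not the Court's actual registration procedure.

```python
from datetime import date, timedelta

# Minimal rule-based registration filter. Field names, the assumed
# four-month time limit, and the flagging rules are hypothetical.
TIME_LIMIT = timedelta(days=120)

def flag_application(app, seen_applicants):
    flags = []
    # duplicate check: has this applicant already lodged an application?
    if app["applicant_id"] in seen_applicants:
        flags.append("possible duplicate applicant")
    # time-limit check against the final domestic decision
    if app["lodged"] - app["final_domestic_decision"] > TIME_LIMIT:
        flags.append("out of time")
    # completeness check on a few required fields
    if not all(app.get(f) for f in ("applicant_id", "respondent_state", "complaint")):
        flags.append("incomplete")
    return flags

application = {
    "applicant_id": "A-001",
    "respondent_state": "X",
    "complaint": "Article 3",
    "final_domestic_decision": date(2021, 1, 1),
    "lodged": date(2021, 9, 1),
}
flags = flag_application(application, seen_applicants=set())  # -> ["out of time"]
```

Even a filter this crude shows where discretion hides: each threshold and required field encodes a policy choice that someone has to make and justify.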
These are not unimaginable scenarios. Machine learning algorithms have been used extensively in adjudication and administrative decision-making for almost a decade. In Canada, where administrative tribunals decide about the immigration status of refugees and asylum seekers, algorithms are used to ‘replace’ or ‘augment’ administrative decision-making by ‘automating certain activities conducted by immigration officials and supporting the evaluation of some applications’, including those made by people fleeing war and violence and seeking asylum on humanitarian or compassionate grounds (see Bots at the Gate). In these cases, the algorithm is used to classify cases and individuals in terms of risk, priority and complexity; it is used to make probability assessments about whether the refugee’s story is likely to be ‘true’ or their marriage ‘genuine’; it is used to identify or flag certain cases for review and finally to provide overall recommendations about whether an application should be approved. In that sense, the algorithm can be used for a wide range of purposes: from the most simple of categorisations, to assessing whether an individual is to be believed. The researchers analysing the Canadian government’s decision to use machine learning algorithms have found that in assessing whether a refugee’s story could be true, the algorithm was allowed to access DNA information from the website ancestry.com to establish the nationality of individuals subject to potential deportation. Individuals’ claims to nationality or origin were therefore assessed against this evidence and on this basis, their stories were evaluated as credible or not.
Algorithms have been used in similar ways elsewhere: in the UK, for example, in the context of Brexit-related settled and pre-settled status and payments of universal credit; or in Australia, to raise welfare debts against the unemployed, students, and particularly vulnerable groups. In each of these cases, the algorithm was adopted to decrease the institution’s workload and facilitate the resolution of repetitive claims by drawing on large amounts of data.
In all the cases enumerated above, the schemes aimed to replace formerly manual systems with automated data-matching. A good example is Australia’s Robodebt, the process for calculating overpayments and issuing debt notices. To achieve this, the algorithms had to be allowed to draw on different sources of data – information about individuals held by the welfare agency, the tax office, and so on – to assemble a picture of an individual and to determine whether, on the basis of all of this data, the individual was entitled to a certain status or payment, or whether they could instead be deprived of it. Each of these schemes has been heavily criticised for issuing false or incorrectly calculated decisions and for failing to take account of the specific situation of the applicant. Australia’s Robodebt scandal carries perhaps the most important lessons: algorithmic notices were issued to people who were already deceased and to disabled and vulnerable people who should not have received them. The government recently recognised that the scheme had been unlawful.
In the next sections, I show what considerations the Court must keep in mind to avoid problems raised by the above-mentioned schemes.
The quality of the ‘learning’ dataset
1. Question of balance
In order for an algorithm to make categorisations, assessments, recommendations or even decisions, the machine has to be ‘fed’ data. The first question therefore is what data will be used for the algorithm to ‘learn’ from.
As early as 2016, computer scientists built a machine-learning algorithm able to predict with 79% accuracy how the European Court of Human Rights would decide a case on the merits. The algorithm was trained on 600 judgments of the European Court and then tested on a sample of a further 100 judgments to determine whether and how accurately it could predict when there would be a violation of the Convention; it predicted the Court’s decision correctly in about 79% of cases.
Building on the patterns found in the Court’s judgments, the algorithm identified certain words as key predictors of a violation. For Article 3, for example, these were: “injury, protection, ordered, damage, civil, caused, failed, claim, course, connection, region, effective, quashed, claimed, suffered, suspended, carry, compensation, pecuniary, Ukraine”. Among these are familiar words from decisions of the ECtHR, such as ‘damage’, ‘compensation’ and ‘pecuniary’ – terms unlikely to be mentioned if no violation ensues. It is therefore understandable why a machine would link them to the outcome of a violation. More problematic, however, is the term ‘Ukraine’ as a predictor of a violation. For those familiar with the ECtHR case law, Ukraine is of course amongst the most frequent violators of human rights: everyone will remember the thousands of cases relating to non-enforcement of domestic decisions in Ivanov and Burmych. But when it comes to Article 3, other states are considerably more frequent violators: Russia, Turkey and Romania all have significantly more cases, and more of those cases end in a violation. A machine could conclude that the likelihood of a violation is higher when Ukraine is the respondent party if the term ‘Ukraine’ appeared more frequently in the dataset from which it studied. But looking at all 2,868 cases invoking Article 3 in HUDOC (and not only at the selection), it is clear that the algorithm is not correct: 90% of cases invoking Article 3 in relation to Russia end in a violation, 94% of cases in relation to Romania, and ‘only’ 87% in relation to Ukraine. This closer look at the case law raises obvious questions about what the algorithm in the study was being ‘fed’, i.e. what decisions it was learning from in order to establish the relevant predictors, how those decisions were selected, and whether cases against Ukraine were over-represented.
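To illustrate how a country name can surface as a ‘predictor’, here is a toy sketch of the kind of word-frequency reasoning such models rely on. The mini ‘judgments’ below are invented placeholders, not real ECtHR text, and the scoring rule is a deliberately naive stand-in for the study’s actual model.

```python
from collections import Counter

# Toy 'judgments': invented placeholder text, not real ECtHR case law.
violation_docs = [
    "applicant suffered injury and claimed pecuniary damage ukraine",
    "applicant suffered injury compensation pecuniary damage ukraine",
]
no_violation_docs = [
    "complaint manifestly ill-founded no damage established",
    "no effective remedy required complaint rejected as inadmissible",
]

def word_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

pos, neg = word_counts(violation_docs), word_counts(no_violation_docs)

# Score each word by how much more often it appears in violation judgments
# (add-one smoothing avoids division by zero for unseen words).
scores = {w: (pos[w] + 1) / (neg[w] + 1) for w in set(pos) | set(neg)}
predictors = sorted(scores, key=scores.get, reverse=True)[:5]
# 'ukraine' surfaces among the top 'predictors' purely because it
# co-occurs with violations in this (unbalanced) toy sample.
```

The word ‘ukraine’ scores as highly as genuinely outcome-related terms like ‘pecuniary’ – the machine has no way of telling a legal concept apart from a respondent state that simply happens to be over-represented.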
In many regards, this example clearly highlights the issues that the use of algorithms raises. We do not know the extent to which the database (or documents) from which an algorithm learns is complete or balanced. Even if we were to feed the machine the entirety of the ECtHR case law, there are strong concerns that the algorithm would ultimately skew heavily against frequent violators. In that sense, the utility, predictability, and helpfulness of the algorithm, as well as its neutrality, depend on the data we give it. If the data that forms the basis for the machine learning is either unrepresentative or over-representative, then the algorithm we receive will be equally so. The Court will therefore have to exercise extreme caution about what information it makes available for machine learning purposes, so that the categorisations, recommendations, and decisions made are not discriminatory towards certain countries or specific groups of petitioners.
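One concrete safeguard is to audit the training set before any learning takes place: for each respondent state, count its share of the dataset and its violation rate, and compare those shares against the full case law. A minimal sketch, with invented case records rather than real HUDOC statistics:

```python
from collections import defaultdict

# Hypothetical audit of a training set; the records below are invented
# for illustration and do not reflect real HUDOC figures.
cases = [
    {"state": "Russia", "violation": True},
    {"state": "Russia", "violation": True},
    {"state": "Ukraine", "violation": True},
    {"state": "Ukraine", "violation": False},
    {"state": "Romania", "violation": True},
]

tally = defaultdict(lambda: [0, 0])  # state -> [number of cases, violations]
for case in cases:
    tally[case["state"]][0] += 1
    tally[case["state"]][1] += case["violation"]

for state, (n, v) in sorted(tally.items()):
    share, rate = n / len(cases), v / n
    print(f"{state}: {share:.0%} of training set, violation rate {rate:.0%}")
```

An audit like this would have exposed the over-representation problem in the 2016 study before a single model was trained.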
2. Question of bias
The quality of the training data impacts the quality of the output data. As Integrate states in their report on Responsible AI, ‘[i]n standard practice, machine learning assumes the future will look like the past. When the past is unfair or biased, machine learning will propagate these biases and enhance them through feedback loops.’
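The feedback loop the report describes can be shown with a toy simulation: a ‘model’ estimates a group’s rejection rate from past decisions, and its own decisions are then fed back into the record it learns from. All numbers here are invented.

```python
# Toy simulation of a feedback loop: the 'model' estimates a group's
# rejection rate from past decisions, and its own (deterministic)
# decisions are appended to the record it learns from.
def decide(estimated_rejection_rate):
    # naive rule: reject whenever the estimated rate exceeds 50%
    return estimated_rejection_rate > 0.5

history = [True, True, False]  # past rejections for some group: 2 out of 3
rates = []
for _ in range(5):
    rate = sum(history) / len(history)
    rates.append(rate)
    history.append(decide(rate))  # the model's output becomes new 'data'

# the estimated rejection rate only drifts upwards: 0.67, 0.75, 0.80, ...
```

Starting from a mild historical skew, the estimated rate ratchets up with every iteration – exactly the ‘propagate and enhance’ dynamic the report warns about.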
In the Canadian case mentioned above, concerns were raised about, for example, the data collection and information sharing between law enforcement and the immigration and refugee system. The law enforcement service had faced heavy criticism ‘for engaging in religious and ethnic profiling of migrants’, including questioning relying on Islamophobic stereotypes and seeking information about social values, political beliefs, and religious practices, including the individual’s perception of women wearing the hijab. Such questionnaires, which clearly targeted only Muslim individuals crossing the border, were shared with security partners and may have been fed into the algorithm used to assess individuals for protected status. This clearly raises concerns about the extent to which explicitly or implicitly biased information could shape the recommendations provided by the algorithm, and underlines that future decisions – even if made by a supposedly impartial algorithm – could effectively perpetuate or indeed exaggerate this bias. As the Canadian study concluded:
‘Case studies, empirical evidence, and investigative journalists have demonstrated on repeated occasions that algorithms can be, and often are, biased, in that they result in unfair outcomes or otherwise in consequences that disproportionately impact particular groups, often those already facing systemic bias in broader societal contexts (in which these algorithms are situated and cannot be separated from). In these cases, the resulting decisions likely cannot be considered to be fair or impartial.’
Since the question of how bias shapes judicial decisions is a relatively recent area of study, we know little about whether and how the ECtHR case law is biased. Some publications have suggested, for example, that countries that joined the Convention after the 1990s are more likely to be found in violation of the Convention; others have argued that the principle of the margin of appreciation is applied disproportionately to bigger and more powerful states (see e.g. Madsen). But beyond these initial suggestions, it is important to underline that questions of bias arise at every stage of proceedings – admissibility, merits, and just satisfaction. For example, we have not yet studied whether certain advocates appearing before the Court are more or less likely to have their case declared admissible, whether this depends on how frequently they appear before the Court (they could unconsciously be perceived as ‘unreasonable’ or a nuisance, or, on the contrary, welcomed as repeat players who know the system), or whether certain groups of victims (terrorists, prisoners, political activists, judges, children, Roma, etc.) are more or less likely to reach the merits stage, win, or be awarded damages.
While observers and researchers may vigilantly supervise the admissibility, merits and just satisfaction outcomes of a particular algorithm to determine bias, studies suggest that ‘there may be some cases in which a bias becomes sufficiently clear only if one sees the particular data used to train the algorithm, as well as specifically what the algorithm is programmed to do with that data, such as assigning different weights to different factors in determining applications’ (Bots at the Gate, p. 50). It is important, therefore, that the Court is transparent both about what data will be fed into the algorithm and about the inner workings of the algorithm. A concern here is that these ‘inner workings’ could be subject to proprietary terms and considered trade secrets, which would completely prevent researchers from exercising any meaningful scrutiny of automated recommendations. As noted again in Bots at the Gate, ‘[t]his opacity is a major challenge and obstacle to ensuring impartiality’ and it runs counter to principles of human rights and public law, ‘including those of natural justice.’
The issue of the interaction of the human and the machine
The algorithm introduced by the Court will inevitably be used to facilitate the work of lawyers in the Registry. In this regard, it is likely that the ‘human officer’ will remain the primary decision-maker and have the final say. But their decision will be based on the recommendation of the algorithm, which they will confirm or reject. Such a ‘human-in-the-loop’ approach to automated decision-making is increasingly becoming the norm, as it arguably fulfils the requirement that public law (or human rights) decisions ought to be made by humans and with respect for human dignity.
However, even if a ‘human officer’ makes the final decision, the reality is that once they have received the recommendation or suggestion from the algorithm, they will likely proceed on the assumption that the algorithm is unlikely to make a mistake and that its assessment can be trusted. The algorithm thereby becomes ‘a proxy or subordinate decision-maker’. This reliance on automated systems is a documented human response known as automation bias: the tendency to over-trust a machine’s assessment even when it is wrong. Automation bias is often unavoidable and has been confirmed to operate in a number of situations in which humans work with machines, from drone operations to health care. The challenge in this context is assessing the degree to which the human officer has ‘delegated’ their judgment to the algorithm: there is a fine line between the Registry lawyer being assisted by an algorithmic tool and their judgment being replaced or unduly influenced by its recommendations.
To avoid automatic reliance on the machine’s assessment, humans have to be trained to recognise that algorithms are not always correct and to actively inspect and question the recommendations they receive. If such algorithms are to facilitate rather than replace the work of the Registry, as one supposes will be the case, lawyers at the Court will have to be trained to resist automation bias and to evaluate the recommendations given. They will have to understand how the algorithm operates and maintain a healthy scepticism about the correctness of its conclusions. This is a new challenge – not only for the Court, but for any institution or organisation that seeks to entrust parts of its decision-making to computers.
Many reforms have been tried and tested to decrease the workload and the delays at the Court. The latest suggestion, to introduce automated decision-making to help sift through the applications coming to the Court, promises to revolutionise how cases progress through the Court. But it also opens up new concerns about the interaction between humans and machines, about the ECtHR case law as a database, and about the need to understand and control the biases inherent in the Court’s jurisprudence. It also raises questions about what training will be provided to those who will interact with the algorithm, so that automation bias may be minimised. Ultimately, the President’s decision to be transparent about introducing automated decision-making is to be welcomed. In the future, the hope is that the Court will also seek and ensure transparency about the data that informs the algorithm as well as its inner workings, so that the Court’s approach can be studied and supervised in order to help it improve its practice and preserve its role and future as a human rights court.
An objection to training an algorithm on judgments of the ECtHR is that the Court is known for self-referring and for reusing the same or similar wording – even entire paragraphs – in cases against the same respondent state or alleging the same violation. From the perspective of a machine trying to detect patterns in ‘natural language’, these repetitions are highly likely to stand out. It is therefore questionable whether, or at least to what extent, ECtHR judgments can be treated as ‘natural language’, as the study does.
Special care must be taken to ensure that the learning database is balanced. In the ECtHR context, using the full dataset would be as problematic as the sampling undertaken by the previously mentioned study, because the ECtHR case law is not balanced. There are frequent and repetitive violators – e.g. Turkey, Russia, Ukraine, Romania, Poland and Italy – which generate thousands of applications and have consequently been held in breach of the Convention more often than other states. A machine learning from such a dataset would inevitably produce an algorithm that gives more weight to the characteristics of cases coming from these countries.
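One possible mitigation, among several, would be to cap the number of cases per respondent state when assembling the training set, so that frequent respondents cannot dominate it. A sketch, with invented case counts and an arbitrary cap:

```python
import random

# Sketch of one possible mitigation: cap the number of cases per
# respondent state before training. The case counts and the cap are
# arbitrary assumptions for illustration.
random.seed(0)
cases = (
    [{"state": "Turkey"}] * 50
    + [{"state": "Russia"}] * 40
    + [{"state": "Iceland"}] * 3
)

def cap_per_state(cases, cap):
    by_state = {}
    for case in cases:
        by_state.setdefault(case["state"], []).append(case)
    balanced = []
    for group in by_state.values():
        # keep at most `cap` cases per state, chosen at random
        balanced.extend(random.sample(group, min(cap, len(group))))
    return balanced

balanced = cap_per_state(cases, cap=10)  # 10 + 10 + 3 = 23 cases remain
```

Capping is itself a policy choice: it trades fidelity to the real distribution of violations for protection against country-name shortcuts, which is precisely the kind of trade-off the Court would need to make explicitly and transparently.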