Algorithms for Fair Team Formation in Online Labour Marketplaces

As freelance work keeps growing almost everywhere, driven by a sharp decrease in communication costs and by the spread of Internet-based labour marketplaces (e.g., guru.com, freelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing. Since employers often use these platforms to find a group of workers to complete a specific task, researchers have focused their efforts on the study of team formation and matching algorithms and on the design of effective incentive schemes. Recently, however, several concerns have been raised about possibly unfair biases introduced by the algorithms used to carry out these selection and matching procedures. For this reason, researchers have started studying the fairness of algorithms related to these online marketplaces, looking for intelligent ways to overcome the algorithmic bias that frequently arises. Broadly speaking, the aim is to guarantee that, for example, the process of hiring workers through the use of machine learning and algorithmic data analysis tools does not discriminate, even unintentionally, on grounds of nationality or gender. In this short paper, we define the Fair Team Formation problem in the following way: given an online labour marketplace where each worker possesses one or more skills, and where all workers are divided into two or more non-overlapping classes (for example, men and women), we want to design an algorithm that is able to find a team with all the skills needed to complete a given task, and that has the same number of people from all classes. We provide inapproximability results for the Fair Team Formation problem together with four algorithms for the problem itself. We also tested the effectiveness of our algorithmic solutions by performing experiments using real data from an online labour marketplace.


INTRODUCTION
An online labour marketplace is defined as a web application where workers can sell their services and skills in a fluid and delocalised fashion. Usually, employers pay workers hourly to complete a specific task without offering them any long-term employment arrangement. OECD data on self-employment indicate that between 10% and 20% of workers in developed countries are self-employed, while it is estimated that in 2020 a full 40% of the US workforce will be freelancers [18]. While crowdsourcing adoption was driven, at least in part, by the assumption that problems can be decomposed into parts that can be addressed separately by independent workers, recent work suggests that crowdsourcing results can be improved by allowing some degree of collaboration among them [21, 26]. The idea of combining collaboration with crowdsourcing has led to research on Team Formation [1-3, 7, 9, 11, 15, 17, 20, 22, 28], in which a common thread is the need for complementary skills, and definitions differ in aspects such as objectives (e.g., load balancing and/or compatibility), constraints (e.g., worker capacity), and algorithmic set-up (online or offline). As previously mentioned, these online marketplaces are largely managed through automatic algorithms designed to match supply and demand. However, the objective of optimising a given task, on which these algorithms are usually based, openly conflicts with the need to ensure fairness and diversity, for example in the composition of groups. We define unfair discrimination as treating someone differently on the basis of their group membership rather than their merit. Since algorithms are "black boxes" usually protected by industrial secrecy, legal protections and even intentional obfuscation, most of the time discrimination becomes invisible and mitigation impossible [12].
For this reason, data scientists and researchers have developed the disparate impact theory [8], whose aim is to spot unintended discrimination in algorithmic outcomes. Among the many different sources of bias on the Web, the one that directly concerns us in the search for a solution to the Fair Team Formation problem is algorithmic bias, which occurs when the bias is added by the algorithm itself or by the way the algorithm handles the bias present in the data it processes.
Overview of problem setting and assumptions. In our framework, both workers and tasks are represented by sets of skills. Each skill of the task is possessed by at least one worker, while each worker has a defined cost and belongs to exactly one of two classes. In this setting, we consider the problem of finding the cheapest team of workers that together have all the necessary skills to complete the task, and that is made up of the same number of workers from both classes (fairness constraints). We call this general problem Fair Team Formation, which we formally define in Section 2 and solve in Sections 3-4.
Algorithmic techniques. To the best of our knowledge, we are the first to consider the Fair Team Formation problem, namely the Weighted Set Cover problem with some fairness constraints imposed. As shown in Section 3, the Fair Team Formation problem is NP-hard and inapproximable; for this reason, we looked for algorithms that function well in practical situations. Considering that our problem is closely related to the Set Cover problem [6, Chapter 35], it seemed natural to start from a reasoning similar to the one behind the Greedy Set Cover algorithm [27]. In the next sections, we present four algorithms we developed to solve the Fair Team Formation problem: the first three are partially based on the Greedy Set Cover algorithm, while the fourth is a rounding algorithm based on the linear programming formulation of the Fair Team Formation problem. Furthermore, since we are not able to calculate the value of the optimal solution in reasonable time, we have built a lower-bound for the cost of the optimal solution of the Fair Team Formation problem by solving the relaxed Linear Programming formulation of our problem. This lower-bound came in handy when we had to evaluate our algorithms' performance.

Contributions.
The key contributions of our work are:
• We formalise the Fair Team Formation problem, which is the problem of finding the cheapest team that can complete the task and, at the same time, counts the same number of people from two non-overlapping classes.
• We design four algorithms for solving our problem.
• We experiment on real data based on actual task requirements and worker skills from one of the largest online labour marketplaces, testing algorithms under a broad range of conditions.

PRELIMINARIES
In this section, we formally describe our setting and problem, and provide some necessary background.

Notation and Setting
Skills. We consider a set S of skills with |S| = m. Skills can be any kind of qualification a worker can have or a task may require, such as video editing, technical writing, or project management.
Tasks. We consider a set 𝒥 of tasks (or jobs). Each task J ∈ 𝒥 is independent and requires a set of skills from S; therefore, J ⊆ S. In our setting we do not consider a stream of tasks, but rather we take each task as a single instance of the problem.
Workers. Throughout, we assume that we have a set W of n workers: W = {W_r : r = 1, ..., n}. Every worker r possesses a set of skills (W_r ⊆ S). Similarly to the tasks, we use W_r to denote both the worker and his/her skills. Moreover, each worker has a hiring cost and belongs to exactly one of two non-overlapping classes.
Classes. The workforce is split into two non-overlapping classes C = {class1, class2}, for example women and men.
Coverage of tasks. Whenever a task J ⊆ S arrives, an algorithm has to assign one or more workers to it, i.e., a team. We say that J can be completed or covered by a team Q ⊆ W if for every skill required by J there exists at least one worker in Q who possesses this skill: J ⊆ ∪_{W_r ∈ Q} W_r. We assume that for every skill in the incoming task there is at least one worker possessing that skill, so all tasks can be covered.
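The coverage condition can be illustrated with a minimal Python sketch. The function name `covers` and the toy skill sets are hypothetical and are not part of the paper; each worker is identified with their skill set, as in the notation above.

```python
from typing import FrozenSet, Iterable


def covers(task: FrozenSet[str], team: Iterable[FrozenSet[str]]) -> bool:
    """Return True if every skill in `task` is held by at least one worker in `team`."""
    pooled = set().union(*team)  # union of the team members' skill sets
    return task <= pooled


# Hypothetical toy instance (illustration only).
task = frozenset({"video editing", "technical writing"})
team = [
    frozenset({"video editing"}),
    frozenset({"technical writing", "project management"}),
]
```

With these values, `covers(task, team)` holds, while dropping the second worker leaves "technical writing" uncovered.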

Problem Definition
We now define the problem that we study. There exists a set of skills S. We have a pool of workers W, where each worker W_r ∈ W is characterised by a subset of skills W_r ⊆ S and a hiring cost c_r ∈ ℝ_{≥0}, and belongs to exactly one of two non-overlapping classes, C = {class1, class2}. Given a task J from the set of tasks 𝒥, the goal is to design an algorithm that, when task J arrives, decides which workers to hire such that the task is covered by the workers who are hired, the total cost paid is minimised, and the team formed is made up of the same number of workers from both classes.
One special case of the Fair Team Formation problem, where no fairness constraints are imposed, is the Weighted Set Cover problem. This problem can be effectively addressed through a greedy approach (see [30, Chapter 2]). As shown by Slavik [27], this greedy algorithm has an approximation ratio of log n − log log n + Θ(1). Unfortunately, this result does not hold for the Fair Set Cover problem, which is the algorithmic core behind the Fair Team Formation problem.
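To make the objective and the two constraints concrete, the following Python sketch solves the problem exactly by exhaustive search on a tiny instance. It is only an illustration: the running time is exponential, and the `Worker` type, the function names and the toy data are assumptions introduced here, not part of the paper.

```python
from itertools import combinations
from typing import FrozenSet, NamedTuple, Optional, Sequence, Tuple


class Worker(NamedTuple):
    skills: FrozenSet[str]
    cost: float
    cls: int  # class label, 1 or 2


def is_fair_cover(task: FrozenSet[str], team: Sequence[Worker]) -> bool:
    """Check coverage of the task and equal head-count of the two classes."""
    covered = set().union(*(w.skills for w in team))
    n1 = sum(1 for w in team if w.cls == 1)
    return task <= covered and n1 == len(team) - n1


def brute_force_fair_team(
    task: FrozenSet[str], workers: Sequence[Worker]
) -> Optional[Tuple[float, Tuple[Worker, ...]]]:
    """Return (cost, team) of a cheapest fair cover, or None if none exists."""
    best = None
    # A fair team always has even size, so odd sizes are skipped.
    for size in range(0, len(workers) + 1, 2):
        for team in combinations(workers, size):
            if is_fair_cover(task, team):
                cost = sum(w.cost for w in team)
                if best is None or cost < best[0]:
                    best = (cost, team)
    return best


# Hypothetical toy instance (illustration only).
WORKERS = [
    Worker(frozenset({"py"}), 3.0, 1),
    Worker(frozenset({"sql"}), 2.0, 1),
    Worker(frozenset({"py", "sql"}), 10.0, 2),
    Worker(frozenset({"design"}), 1.0, 2),
    Worker(frozenset({"sql"}), 4.0, 2),
]
TASK = frozenset({"py", "sql"})
```

On this instance the cheapest fair team pairs the first worker (class 1) with the last one (class 2), for a total cost of 7.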
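For reference, the greedy approach to the unconstrained Weighted Set Cover problem can be sketched in Python as follows. This is the textbook cost-per-newly-covered-skill rule; the `Worker` type, the function name and the toy data are assumptions made for illustration.

```python
from typing import FrozenSet, List, NamedTuple, Sequence


class Worker(NamedTuple):
    skills: FrozenSet[str]
    cost: float
    cls: int  # class label, 1 or 2 (ignored by the unconstrained greedy)


def greedy_set_cover(task: FrozenSet[str], workers: Sequence[Worker]) -> List[int]:
    """Return indices of workers chosen by the greedy weighted set cover rule."""
    uncovered, team = set(task), []
    while uncovered:
        # Only workers that cover at least one still-uncovered skill are useful.
        useful = [i for i, w in enumerate(workers) if w.skills & uncovered]
        if not useful:
            raise ValueError("task cannot be covered by the given workers")
        # Pick the worker with the lowest cost per newly covered skill.
        best = min(
            useful,
            key=lambda i: workers[i].cost / len(workers[i].skills & uncovered),
        )
        team.append(best)
        uncovered -= workers[best].skills
    return team


# Hypothetical toy instance (illustration only).
WORKERS = [
    Worker(frozenset({"py"}), 3.0, 1),
    Worker(frozenset({"sql"}), 2.0, 1),
    Worker(frozenset({"py", "sql"}), 10.0, 2),
    Worker(frozenset({"design"}), 1.0, 2),
    Worker(frozenset({"sql"}), 4.0, 2),
]
TASK = frozenset({"py", "sql"})
```

On the toy instance the greedy rule hires the two cheapest single-skill workers, both from class 1, which shows why the unconstrained solution is generally not fair.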
First, we will show that the Fair Team Formation problem (i.e. the Fair Weighted Set Cover problem) is inapproximable, then we will present two different lower-bounds that we can easily calculate, and use later on to evaluate the quality of the solutions found by our four algorithms.

Inapproximability of the Fair Set Cover Problem
Cover problems on hyper-graphs H(V, E, w) aim to find a subset S ⊂ E such that v ∈ ∪_{S_i ∈ S} S_i for every v ∈ V and w(S) is minimised. The vertex cover problem is a special case where we are given a graph G(V′, E′) and aim to find a subset S′ ⊂ V′ such that every edge e ∈ E′ is incident to at least one node of S′. In terms of hyper-graphs, V corresponds to E′ and each hyper-edge h ∈ E corresponds to the set of edges of E′ incident to a vertex of V′. Given a coloring c : V → {red, blue} of the vertexes of a graph G(V, E), we consider a set of vertexes S ⊂ V to be fair if |S ∩ RED| = |S ∩ BLUE|. The fair vertex cover problem consists of finding a minimum vertex cover under the constraint that it is fair. Note that, unlike in the unconstrained vertex cover problem, such a set may not exist in general. Similarly, given a coloring of the sets c : E → {red, blue}, the fair set cover problem consists of finding a minimum set cover S ⊂ E such that |S ∩ RED| = |S ∩ BLUE|. Again, fair covers need not exist in general. This feature will allow us to show the following impossibility result.
Computing any finite approximation of the fair vertex cover problem is NP-hard.

Proof. Let G(V, E) be a graph, where we consider every vertex of V to be red. Given an integer k, it is NP-hard to determine whether there exists a vertex cover of size at most k [10]. We add a set V′ of k blue vertexes. If there exists a fair vertex cover in G′(V ∪ V′, E), then it contains at most k blue vertexes, and hence at most k red ones, which form a vertex cover of G of size at most k. Conversely, any vertex cover of G of size j ≤ k becomes fair once j blue vertexes are added to it. Since any finite approximation algorithm for the fair vertex cover problem in particular determines the existence of a fair vertex cover, it also solves the decision problem of vertex cover. Hence, computing any finite approximation of the fair vertex cover is NP-hard.

3.2. Computing any finite approximation of the fair set cover problem or the fair group Steiner tree problem is NP-hard.

Proof. Both problems contain the vertex cover problem as a special case [10, 24].
Finally, it is worth noting that for the unweighted version of the Fair Set Cover problem (i.e. all workers have the same cost), and under the enough workers assumption, we can build a simple algorithm whose approximation factor is equal to |C |H (|T |).

Lower-bound
When trying to solve an instance of the Fair Set Cover problem, we are often unable to calculate the value of the optimal solution in reasonable time; therefore, we are forced to use algorithms that find only a suboptimal solution to the problem. For this reason, it is important to have a lower-bound below which the value of the optimum is guaranteed never to fall. A first, trivial lower-bound (TLB) is given by the cost of the solution we obtain when the Greedy Set Cover algorithm is applied to the Fair Set Cover instance (after eliminating the fairness constraints), divided by its approximation factor; namely:
\[ \mathrm{TLB} = \frac{\mathrm{Cost}(\mathrm{GreedySetCoverSolution})}{\log(n) - \log(\log(n)) + 3 + \log(\log(32)) - \log(32)} \qquad (1) \]
A Lower-Bound from the Relaxed LP formulation of the Fair Set Cover problem. A computationally feasible and mathematically elegant way to calculate a better lower-bound for the Fair Set Cover problem is to solve its relaxed Linear Programming formulation. In a nutshell, we formulate the Fair Team Formation problem as an Integer Linear Programming problem, and then we relax its integrality constraints. In this way, we obtain a Linear Programming problem that is solvable in polynomial time and whose optimal value is never greater than the cost of the optimal solution of the Integer Linear Programming problem. In the formulation, x_i takes value 1 or 0 in the integer program, depending on whether the i-th worker is hired or not, and ranges over the interval [0, 1] in the relaxation; c_i is the hiring cost of worker i; and k_i is equal to -1 if worker i belongs to class1, or to 1 if he or she belongs to class2.
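The linear program itself is not reproduced in the text. A reconstruction consistent with the definitions of x_i, c_i and k_i, and with the coverage requirement of Section 2, might read as follows; the symbol J for the task and the indexing of the covering constraints by skills s are assumptions.

```latex
\begin{align*}
\min \quad & \sum_{i=1}^{n} c_i x_i \\
\text{s.t.} \quad & \sum_{i \,:\, s \in W_i} x_i \ge 1 && \forall s \in J \\
& \sum_{i=1}^{n} k_i x_i = 0 \\
& 0 \le x_i \le 1 && i = 1, \dots, n
\end{align*}
```

The first family of constraints asks that every required skill be covered, and the equality with the coefficients k_i forces the (fractional) head-counts of the two classes to coincide. Replacing the last line with x_i ∈ {0, 1} would give back the integer program.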

THE FAIR TEAM FORMATION PROBLEM
Given the previous restrictive result, in this section we provide four heuristic algorithms for the Fair Team Formation problem. Considering that the Fair Team Formation problem has a lot in common with the Set Cover problem, it seemed natural to start from a reasoning similar to the one behind the Greedy Set Cover algorithm. Therefore, Algorithms 1, 2 and 3 are partially based on the Greedy Set Cover algorithm, while Algorithm 4 is a rounding algorithm based on the linear programming formulation of the Fair Team Formation problem. The only assumption we make is that there is always a team of workers that together have all the necessary skills to complete the task we are handling. In other words, the task is always coverable.
Fair Padding Greedy Set Cover algorithm. The first algorithm we came up with is a simple extension of the Greedy Set Cover algorithm where the cheapest workers of the class whose cardinality is lower are added to make the team fair. Algorithm 1 shows its pseudocode.
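Since the pseudocode of Algorithm 1 is not reproduced here, the following Python sketch shows one possible reading of the padding idea described above: run the greedy set cover, then add the cheapest unused workers of the under-represented class until the team is balanced. The `Worker` type, function names and toy data are assumptions, not the authors' implementation.

```python
from typing import FrozenSet, List, NamedTuple, Sequence


class Worker(NamedTuple):
    skills: FrozenSet[str]
    cost: float
    cls: int  # class label, 1 or 2


def greedy_cover(task: FrozenSet[str], workers: Sequence[Worker]) -> List[int]:
    """Greedy weighted set cover; returns indices of the hired workers."""
    uncovered, team = set(task), []
    while uncovered:
        useful = [i for i, w in enumerate(workers) if w.skills & uncovered]
        if not useful:
            raise ValueError("task cannot be covered")
        best = min(
            useful,
            key=lambda i: workers[i].cost / len(workers[i].skills & uncovered),
        )
        team.append(best)
        uncovered -= workers[best].skills
    return team


def fair_padding_greedy(task: FrozenSet[str], workers: Sequence[Worker]) -> List[int]:
    """Greedy cover, then pad with the cheapest workers of the smaller class."""
    team = greedy_cover(task, workers)
    n1 = sum(1 for i in team if workers[i].cls == 1)
    n2 = len(team) - n1
    minority = 1 if n1 < n2 else 2
    deficit = abs(n1 - n2)
    # Unused workers of the under-represented class, cheapest first.
    spare = sorted(
        (i for i, w in enumerate(workers) if w.cls == minority and i not in team),
        key=lambda i: workers[i].cost,
    )
    if len(spare) < deficit:
        raise ValueError("not enough workers in the under-represented class")
    return team + spare[:deficit]


# Hypothetical toy instance (illustration only).
WORKERS = [
    Worker(frozenset({"py"}), 3.0, 1),
    Worker(frozenset({"sql"}), 2.0, 1),
    Worker(frozenset({"py", "sql"}), 10.0, 2),
    Worker(frozenset({"design"}), 1.0, 2),
    Worker(frozenset({"sql"}), 4.0, 2),
]
TASK = frozenset({"py", "sql"})
```

On the toy instance the greedy step hires two class-1 workers, so two class-2 workers are added purely for balance, which illustrates how padding can inflate the cost.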
The time complexity of this algorithm is equal to the time complexity of the Greedy Set Cover algorithm, namely O(|W| |J|^2).
Fair Alternating Greedy Set Cover algorithm. Let us start by defining the marginal utility of each worker (WMU). Heuristically, at each stage, the FairAlternatingGreedySetCoverAlgorithm chooses the worker with the lowest marginal utility, alternating the class of workers from which it picks. Algorithm 2 shows its pseudocode.
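The alternating strategy can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the paper: the marginal utility of a worker is taken to be the hiring cost divided by the number of still-uncovered skills the worker adds (the usual greedy set cover ratio), and when no worker of the current class adds a new skill, the cheapest one is hired to preserve the alternation. All names and data are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Sequence


class Worker(NamedTuple):
    skills: FrozenSet[str]
    cost: float
    cls: int  # class label, 1 or 2


def fair_alternating_greedy(
    task: FrozenSet[str], workers: Sequence[Worker], first_cls: int = 1
) -> List[int]:
    """Alternate between the two classes, hiring the lowest-WMU worker each time."""
    uncovered, team = set(task), []
    counts = {1: 0, 2: 0}
    cls = first_cls
    # Stop only when the task is covered and the two classes are balanced.
    while uncovered or counts[1] != counts[2]:
        pool = [i for i, w in enumerate(workers) if w.cls == cls and i not in team]
        if not pool:
            raise ValueError("ran out of workers in class %d" % cls)

        def wmu(i: int) -> float:
            # Assumed marginal utility: cost per newly covered skill.
            gain = len(workers[i].skills & uncovered)
            return workers[i].cost / gain if gain else float("inf")

        # Ties (including workers adding no skill) are broken by lower cost.
        best = min(pool, key=lambda i: (wmu(i), workers[i].cost))
        team.append(best)
        counts[cls] += 1
        uncovered -= workers[best].skills
        cls = 3 - cls  # switch class: 1 -> 2, 2 -> 1
    return team


# Hypothetical toy instance (illustration only).
WORKERS = [
    Worker(frozenset({"py"}), 3.0, 1),
    Worker(frozenset({"sql"}), 2.0, 1),
    Worker(frozenset({"py", "sql"}), 10.0, 2),
    Worker(frozenset({"design"}), 1.0, 2),
    Worker(frozenset({"sql"}), 4.0, 2),
]
TASK = frozenset({"py", "sql"})
```

Because the classes strictly alternate, the two head-counts never differ by more than one, so the loop ends as soon as the task is covered after an even number of hires.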

Algorithm 2 FairAlternatingGreedySetCoverAlgorithm
Input: (W, J). Output: FairTeam ⊆ W.
Also in this case, the time complexity is O(|W| |J|^2).
Fair Pairs Greedy Set Cover algorithm. Algorithm 3 is particularly simple and intuitive. Essentially, it is the application of the Greedy Set Cover algorithm to all possible pairs of workers. This idea has been suggested by [5]. Algorithm 3 shows its pseudocode.
W_Pairs ← PairsGenerator(W)
W_0 ← CoupleGreedySetCover(W_Pairs, J)
Return W_0
Unlike the previous two algorithms, in this case the time complexity is O(|W|^2 |J|^2). The |W|^2 factor is due to the fact that the greedy algorithm for the set cover problem takes as input the set of all unordered pairs of workers.
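A possible Python reading of the pairs idea is sketched below. Two choices are assumptions of this sketch and are not stated in the text: pairs are restricted to one worker from each class, so that any union of pairs is automatically fair, and a worker already hired is not reused in a later pair. The function and variable names are hypothetical and do not correspond to PairsGenerator or CoupleGreedySetCover as implemented by the authors.

```python
from itertools import product
from typing import FrozenSet, List, NamedTuple, Sequence, Tuple


class Worker(NamedTuple):
    skills: FrozenSet[str]
    cost: float
    cls: int  # class label, 1 or 2


def fair_pairs_greedy(task: FrozenSet[str], workers: Sequence[Worker]) -> List[int]:
    """Greedy set cover over mixed-class pairs of workers."""
    class1 = [i for i, w in enumerate(workers) if w.cls == 1]
    class2 = [i for i, w in enumerate(workers) if w.cls == 2]
    pairs: List[Tuple[int, int]] = list(product(class1, class2))
    uncovered, team = set(task), []

    def pair_skills(p: Tuple[int, int]) -> FrozenSet[str]:
        return workers[p[0]].skills | workers[p[1]].skills

    def pair_cost(p: Tuple[int, int]) -> float:
        return workers[p[0]].cost + workers[p[1]].cost

    while uncovered:
        used = set(team)
        # Keep pairs made of unused workers that cover something new.
        candidates = [
            p for p in pairs
            if not used & set(p) and pair_skills(p) & uncovered
        ]
        if not candidates:
            raise ValueError("no fair team found by the pairs heuristic")
        # Lowest joint cost per newly covered skill.
        best = min(
            candidates,
            key=lambda p: pair_cost(p) / len(pair_skills(p) & uncovered),
        )
        team.extend(best)
        uncovered -= pair_skills(best)
    return team


# Hypothetical toy instance (illustration only).
WORKERS = [
    Worker(frozenset({"py"}), 3.0, 1),
    Worker(frozenset({"sql"}), 2.0, 1),
    Worker(frozenset({"py", "sql"}), 10.0, 2),
    Worker(frozenset({"design"}), 1.0, 2),
    Worker(frozenset({"sql"}), 4.0, 2),
]
TASK = frozenset({"py", "sql"})
```

The number of candidate pairs grows quadratically with the number of workers, which matches the |W|^2 factor in the complexity stated above.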
Relaxed Fair Set Cover Rounding algorithm. Algorithm 4 solves the relaxed linear programming formulation of the Fair Team Formation problem assigning to each worker a real number between 0 and 1: this number could be interpreted as the worker's probability to be hired. Then, it continues by creating random teams of workers using these probabilities until it finds a team that is both fair and able to complete the task.

Algorithm 4 RelaxedFairSetCoverRoundingAlgorithm
Input: (W, J). Output: FairTeam ⊆ W.
for w ∈ W_Sorted do
    add w to the team with probability equal to HiringProbabilityVector(w)
    if (the team is balanced ∧ task skills are all covered) then
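The rounding loop can be sketched in Python as follows. The sketch assumes the fractional LP solution is already available as a list `probs` (here chosen by hand rather than computed by a solver), that workers are scanned in order of decreasing probability, and that the procedure gives up after `max_tries` attempts; none of these details are given in the surviving fragment of Algorithm 4, and all names are hypothetical.

```python
import random
from typing import FrozenSet, List, NamedTuple, Optional, Sequence


class Worker(NamedTuple):
    skills: FrozenSet[str]
    cost: float
    cls: int  # class label, 1 or 2


def round_lp_solution(
    task: FrozenSet[str],
    workers: Sequence[Worker],
    probs: Sequence[float],
    max_tries: int = 10_000,
    seed: int = 0,
) -> Optional[List[int]]:
    """Sample teams from the hiring probabilities until one is fair and covers the task."""
    rng = random.Random(seed)
    # Assumed ordering: most likely hires first.
    order = sorted(range(len(workers)), key=lambda i: -probs[i])
    for _ in range(max_tries):
        team, covered, balance = [], set(), 0
        for i in order:
            if rng.random() < probs[i]:  # hire worker i with probability probs[i]
                team.append(i)
                covered |= workers[i].skills
                balance += 1 if workers[i].cls == 1 else -1
                if balance == 0 and task <= covered:
                    return team
    return None  # no fair covering team sampled within max_tries


# Hypothetical toy instance (illustration only).
WORKERS = [
    Worker(frozenset({"py"}), 3.0, 1),
    Worker(frozenset({"sql"}), 2.0, 1),
    Worker(frozenset({"py", "sql"}), 10.0, 2),
    Worker(frozenset({"design"}), 1.0, 2),
    Worker(frozenset({"sql"}), 4.0, 2),
]
TASK = frozenset({"py", "sql"})
# Hand-picked stand-in for an LP solution: hire worker 0 and worker 4.
PROBS = [1.0, 0.0, 0.0, 0.0, 1.0]
```

With the integral vector `PROBS` the first sample already yields the team made of workers 0 and 4; with genuinely fractional probabilities the returned team depends on the random seed.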

EXPERIMENTS
In this section, we will present some experiments that we ran on a real dataset to evaluate the algorithms' performance by comparing their cost.

The Freelancer dataset
To create the large pool of tasks and workers needed to test the algorithms, we decided to use a dataset obtained from Freelancer.com, the largest online marketplace for outsourcing in its category according to data from Alexa (Feb. 2018). The input data that we obtained contain anonymised profiles of people registered as freelancers in this marketplace. This includes their self-declared sets of skills, as well as the average rate that they charge for their services. Data have been cleaned to remove skills that were not possessed by any worker, and skills that were never required by any task. Concerning tasks, we had access to a large sample of tasks commissioned by buyers in the marketplace. Some relevant characteristics of our data are summarised in Table 2. Our dataset contains 992 tasks, but since many of them require exactly the same set of skills, we decided to take into account only the 600 distinct tasks. The average number of skills per worker is 2.86 and the maximum is 6 skills.
Experiments Design. First, we split the 1211 Freelancer workers into two different classes, and we considered six different compositions of the two groups. In brief, we used a random procedure to select respectively 10%, 30%, and 50% of all workers, and we assigned these workers to one of the two classes and the remaining workers to the other. After that, for each of these configurations, we ran the four algorithms we designed to solve the Fair Team Formation problem, obtaining fair teams to complete each of the 600 tasks.

Experiments
As shown in Figure 1, we observe a shift to the left in the distribution as the workforce becomes more balanced. In most cases the price of the fair team is no more than four times the value of the best lower bound (LB), although for a few tasks the FairAlternatingGreedySetCoverAlgorithm finds solutions that are even eight times the value of the lower bound. It is also worth noting that the progressive balancing of workers' colours has a significant effect on all algorithms except the RelaxedFairSetCoverRoundingAlgorithm, whose cost (cost_RLP) distribution remains more or less consistent as the workforce changes. Moreover, from Figure 1 we can see that all distributions are concentrated around a value of 2, indicating that our algorithms have a heuristic approximation ratio of 2, at least on this specific dataset. The histograms in Figure 2 give us some important information about the overall performance of the algorithms, obtained by choosing the least expensive fair team among the four on a case-by-case basis. First, the balance between the two classes of workers does not influence the cost distributions, suggesting that some algorithms are able to efficiently address the problem of strong imbalances between the two groups of workers; second, we can observe that the best solution cost is never more than four times the value of its best lower bound, and it rarely exceeds a factor of two. To conclude, the RelaxedFairSetCoverRoundingAlgorithm beats them all: it was able to find a team whose cost is equal to the best solution cost in no less than 66% of cases, and with an average success rate of 85% (all configurations of colours considered). On the contrary, the FairPaddingGreedySetCoverAlgorithm always had the worst overall performance, never reaching a success rate higher than 70%.
[Figure: histogram of cost_best_algo/bestLB (x-axis) against relative frequency (y-axis), case: 50%, 50%]