IBM Research has released the ‘Diversity in Faces’ (DiF) dataset, which is intended to help build fairer and more accurate facial recognition systems. DiF provides annotations of 1 million human facial images and was built using publicly available images from the YFCC-100M Creative Commons dataset.
Building facial recognition systems that meet fairness expectations has been a long-standing goal for AI researchers. Most AI systems learn from datasets, so if they are not trained on robust and diverse data, both accuracy and fairness are at risk. For that reason, AI developers and the research community need to be thoughtful about what data they use for training. With DiF, IBM researchers are working toward a dataset that is robust, fair, and diverse.
The DiF dataset goes beyond describing faces by age, gender, and skin tone. It also covers other intrinsic facial features, including craniofacial distances, areas and ratios, facial symmetry and contrast, subjective annotations, and pose and resolution.
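To make the idea of craniofacial distances, ratios, and symmetry more concrete, here is a minimal Python sketch. It is not IBM's implementation: the landmark names, the coordinates, and the particular ratio and asymmetry formulas are all assumptions chosen for illustration.

```python
import math

# Hypothetical 2D landmark positions in pixels (illustrative values only,
# not taken from DiF or from any of its coding schemes).
landmarks = {
    "left_eye": (30.0, 40.0),
    "right_eye": (70.0, 40.0),
    "nose_tip": (50.0, 60.0),
    "forehead": (50.0, 10.0),
    "chin": (50.0, 100.0),
}


def craniofacial_features(lm):
    """Compute a few simple distance, ratio, and symmetry measures."""
    # Distances between pairs of landmarks (Euclidean, in pixels).
    eye_distance = math.dist(lm["left_eye"], lm["right_eye"])
    face_height = math.dist(lm["forehead"], lm["chin"])

    # A ratio of two distances does not depend on image scale.
    eye_to_height_ratio = eye_distance / face_height

    # Assumed asymmetry measure: relative difference between the distances
    # from each eye to the nose tip (0.0 means these distances are equal).
    left = math.dist(lm["left_eye"], lm["nose_tip"])
    right = math.dist(lm["right_eye"], lm["nose_tip"])
    asymmetry = abs(left - right) / max(left, right)

    return {
        "eye_distance": eye_distance,
        "face_height": face_height,
        "eye_to_height_ratio": eye_to_height_ratio,
        "asymmetry": asymmetry,
    }


features = craniofacial_features(landmarks)
print(features)
```

Ratios are often preferred over raw distances for this kind of description because they stay the same when the image is resized.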
IBM annotated the faces using 10 well-established and independent coding schemes from the scientific literature. These 10 coding schemes were selected based on their strong scientific basis, computational feasibility, numerical representation, and interpretability.
Through thorough statistical analysis, IBM researchers found that the DiF dataset provided a more balanced distribution and broader coverage of facial images compared to previous datasets. Their analysis of the 10 initial coding schemes also provided them with an understanding of what is important for characterizing human faces.
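The article does not say which statistics were used, but one common way to quantify how "balanced" a distribution is over categories is normalized Shannon entropy (evenness). The sketch below is an assumption-laden illustration with made-up labels, not a reproduction of IBM's analysis.

```python
import math
from collections import Counter


def shannon_evenness(labels):
    """Return normalized Shannon entropy in [0, 1] over the observed categories.

    1.0 means all observed categories are equally frequent; values near 0
    mean one category dominates. Categories that never occur are not counted.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # a single category carries no diversity
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical attribute labels for two toy datasets (illustrative only).
balanced = ["a", "b", "c", "d"] * 25
skewed = ["a"] * 97 + ["b", "c", "d"]

balanced_score = shannon_evenness(balanced)
skewed_score = shannon_evenness(skewed)
print(balanced_score, skewed_score)
```

A dataset with broader and more even coverage of an attribute would score closer to 1 on a measure like this than one dominated by a single group.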
In the future, the researchers plan to use Generative Adversarial Networks (GANs), possibly to generate faces of any variety, so that training data can be synthesized as needed. They also intend to improve on the initial 10 coding schemes and add new ones, and they encourage others to do the same.