Just an idea. Today, machine learning, and deep learning in particular, relies largely on data, on the sheer volume of data. If we have millions of images of a single object, we can train a model that eventually approximates a function mapping each image to the object. The resulting model can be complicated, requiring many layers of neurons and days to months of training.
What if the complexity of modeling and training is caused by incomplete data? By this I do not mean that millions of images of a single object are too few. I mean that an image of an object may itself be incomplete. For example, when a human sees a dog running on the ground, we may draw on additional information to recognize that it is a dog, such as sound, e.g., the dog barking.
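This intuition, that other modalities supply evidence the image alone lacks, is roughly what multimodal "late fusion" models do: embed each modality separately, then combine the embeddings before classification. A minimal sketch, where the function names (`image_features`, `audio_features`) are hypothetical and fixed random projections stand in for trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature extractors: in practice these would be trained
# networks; here fixed random projections stand in for them.
def image_features(image):          # image: flat pixel vector
    W = rng.standard_normal((image.size, 8))
    return image @ W                # 8-dim image embedding

def audio_features(audio):          # audio: raw waveform samples
    W = rng.standard_normal((audio.size, 4))
    return audio @ W                # 4-dim audio embedding

# Late fusion: concatenate the per-modality embeddings so a downstream
# classifier can weigh visual and acoustic evidence together.
def fused_representation(image, audio):
    return np.concatenate([image_features(image), audio_features(audio)])

image = rng.standard_normal(64)     # toy "image" (64 pixels)
audio = rng.standard_normal(16)     # toy "bark" (16 samples)
fused = fused_representation(image, audio)
print(fused.shape)                  # (12,) — 8 image dims + 4 audio dims
```

The hope expressed above would then be that the fused, more complete input lets a simpler classifier succeed where an image-only model needs great depth and data.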