“Currently, the state-of-the-art similarity metrics are only implemented in R. We want to port them to Python and implement them in frameworks like TensorFlow.”
Context of the internship
The Wasserstein distance has been around for decades, but it has recently been causing a furore in ML. In essence, it measures how different two distributions are, and the result is a number between 0 and +inf, where 0 means the distributions are identical.
Now, we can use the Wasserstein distance as a metric to quantify the difference between two probability distributions, but on real-life data we only have samples, so we have to use an empirical (sample-based) version of it to estimate the actual Wasserstein distance between the two underlying distributions.
The question that pops up is: how do we decide when two distributions are different using the Wasserstein distance? How do we go about hypothesis testing? 🤔
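To make the idea of estimating the distance from data concrete, here is a minimal sketch using SciPy's existing one-dimensional `scipy.stats.wasserstein_distance`. The sample sizes, the seed and the two normal distributions are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Illustrative data (an assumption): two samples from normal
# distributions whose means differ by 0.5.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=2000)
sample_b = rng.normal(loc=0.5, scale=1.0, size=2000)

# Empirical 1-Wasserstein distance between the two samples.
# It estimates the distance between the underlying distributions.
d_shift = wasserstein_distance(sample_a, sample_b)

# A sample compared with itself has distance 0.
d_same = wasserstein_distance(sample_a, sample_a)

print(f"distance between shifted samples: {d_shift:.3f}")
print(f"distance of a sample to itself:   {d_same:.3f}")
```

For two distributions that differ only by a shift, the true 1-Wasserstein distance equals the size of the shift, so the first printed value should land near 0.5, up to sampling noise.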
We are not the first ones to think about this. Schefzik et al. have come up with a way to test this and implemented it in R.
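To give a feel for what hypothesis testing with the Wasserstein distance can look like, below is a rough sketch of a plain permutation test. This is not the procedure of Schefzik et al. and not their R implementation; it is a generic, simpler stand-in, and the function name `wasserstein_permutation_test` and its parameters are hypothetical.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def wasserstein_permutation_test(x, y, n_permutations=199, seed=0):
    """Hypothetical helper: permutation test with the empirical
    1-Wasserstein distance as the test statistic.

    Null hypothesis: x and y come from the same distribution.
    Returns the observed distance and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    observed = wasserstein_distance(x, y)
    pooled = np.concatenate([x, y])
    n_x = len(x)

    # Under the null the group labels are exchangeable, so we shuffle
    # the pooled data and recompute the distance for each random split.
    n_extreme = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        d = wasserstein_distance(shuffled[:n_x], shuffled[n_x:])
        if d >= observed:
            n_extreme += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (n_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Illustrative usage on made-up data (an assumption, not real results).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(2.0, 1.0, size=200)
distance, p_value = wasserstein_permutation_test(x, y)
print(f"distance = {distance:.3f}, p-value = {p_value:.3f}")
```

A small p-value suggests the two samples are unlikely to come from the same distribution. The permutation loop is simple but slow for large samples, which is one reason to look for faster approximations.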
So... we want to make this test available in Python and add it to SciPy and TensorFlow Data Validation.
What you’ll learn
State-of-the-art metrics used extensively in GANs, similarity search, clustering, anomaly detection and pattern discovery;
How to make an impactful contribution to the open-source community and write a blog post about it;
How a top-notch AI consultancy firm operates internally.