Going ‘Serverless’ for ML on Big Data

Pavan B Govindaraju
2 min read · Mar 21, 2020

Serverless — Running code without worrying about servers

In the past, working with large datasets meant thinking about resource scaling and infrastructure maintenance. Cloud service providers have largely eliminated that need, with elastic infrastructure becoming the norm.

However, one still needs to set up environments on that elastic infrastructure to run one's code. Serverless eliminates that step by asking only for the code.


Key players in this space are AWS Lambda, Google Cloud Functions, and Microsoft’s Azure Functions.

The remaining hassle is scaling one's code to utilize this nebulous and seemingly infinite pool of hardware.

Historically, novel hardware has been leveraged for various applications by first targeting fundamental linear algebra operations, and serverless platforms are no different.

This approach makes sense: the problems come from many different domains, but the same efficient implementation can serve as the foundation for all of them.

Within machine learning, the matrix-matrix product (GEMM) is a ubiquitous example: it appears in backpropagation for deep learning, in covariance estimation, and even in convolutional neural networks.

Using Python, multiplying two matrices on a desktop is just one line of code. What if one could use a similar abstraction to call a serverless GEMM, and one that actually scales?
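On a single machine, that one line is just NumPy's matrix-multiply operator:

```python
import numpy as np

# Two random matrices; on a desktop the product is a single expression.
A = np.random.rand(512, 512)
B = np.random.rand(512, 512)

C = A @ B  # one line, backed by an optimized BLAS GEMM under the hood
```

The goal is to keep exactly this level of abstraction while the actual multiplication runs on serverless infrastructure.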

Note that this isn’t the same as running “Hello World” on a million instances using AWS Lambda by just writing print("Hello World").

The key difference is that information (in this case, the matrices themselves) must be shared between instances, and the work must be split evenly among them.
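To make that split concrete, here is a local simulation (my own sketch, not numpywren's implementation): the output matrix is partitioned into independent tiles, each task stands in for one serverless invocation and touches only the block-row and block-column it needs, and the results are assembled at the end.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def tiled_gemm(A, B, tile=256, workers=4):
    """Compute A @ B by splitting the output into independent tiles.

    Each task mirrors one serverless invocation: it receives only its
    tile coordinates, reads the rows/columns it needs (which would live
    in shared storage such as S3), and returns its block of the result.
    """
    m, n = A.shape[0], B.shape[1]
    C = np.zeros((m, n))

    def task(i, j):
        # One "invocation": a block-row of A times a block-column of B.
        return i, j, A[i:i + tile, :] @ B[:, j:j + tile]

    coords = [(i, j) for i in range(0, m, tile) for j in range(0, n, tile)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for i, j, block in ex.map(lambda ij: task(*ij), coords):
            C[i:i + tile, j:j + tile] = block
    return C
```

Threads here are just a stand-in; the point is that each tile is an independent unit of work, which is exactly the shape a serverless platform wants.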

One cool library in this direction is numpywren, which provides a NumPy-like abstraction for leveraging serverless platforms.

Geared towards Amazon (it uses S3 and Lambda behind the scenes), it is easy to set up with the AWS CLI. Also, a quick shout-out to pywren, which basically wraps ordinary Python functions so they run serverless on AWS.
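Pywren's core pattern, as I understand it, is an executor whose `map` ships a plain Python function off to Lambda and hands back futures. A local stand-in for that interface (the `LocalExecutor` class below is hypothetical, my own; the real library dispatches to AWS rather than local threads) looks like:

```python
from concurrent.futures import ThreadPoolExecutor

class LocalExecutor:
    """Hypothetical stand-in for a pywren-style executor: map a function
    over inputs, get futures back, collect results later. The real
    library runs each call as a Lambda invocation instead of a thread."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def map(self, func, iterable):
        return [self._pool.submit(func, x) for x in iterable]

def get_all_results(futures):
    return [f.result() for f in futures]

# The calling code never manages servers, only the function and its inputs.
pwex = LocalExecutor()
futures = pwex.map(lambda x: x * x, range(10))
results = get_all_results(futures)
```

The appeal is that the caller's code stays this small whether the map fans out to four threads or four thousand Lambda instances.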

And that’s how it’s done!

The framework is still a work in progress, with the desired level of abstraction some way off. Even getting it to work on my side took a fair bit of wading through the repo.

Two things I haven't discussed, but which their README and paper cover very well:

  • Cost: S3 is high-bandwidth storage and costs $0.025 per GB-month, which is extremely cheap if your storage is ephemeral. Lambda invocations are also very economical, at $0.20 per million requests and $0.00001667 per GB-second of allocated memory.
  • Scalability: basic linear algebra operations have been demonstrated to scale around 2.4x while providing fault tolerance.
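Those rates make back-of-the-envelope estimates easy. A small worked example, for a hypothetical job of 1,000 invocations, each running 10 seconds with 3 GB allocated (the job parameters are made up; the rates are the ones quoted above):

```python
# Rates quoted above
PER_MILLION_REQUESTS = 0.20    # USD per 1M Lambda invocations
PER_GB_SECOND = 0.00001667     # USD per GB-second of allocated memory

def lambda_cost(invocations, seconds_each, gb_allocated):
    """Request cost plus compute cost for a batch of Lambda invocations."""
    request_cost = invocations / 1_000_000 * PER_MILLION_REQUESTS
    compute_cost = invocations * seconds_each * gb_allocated * PER_GB_SECOND
    return request_cost + compute_cost

# Hypothetical job: 1,000 invocations x 10 s x 3 GB
cost = lambda_cost(1_000, 10, 3)  # roughly half a dollar
```

At these prices, the compute term dominates and the per-request charge is almost noise.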

Once this framework matures, scalable computation can be democratized: one won't need access to a supercomputer to run the same workloads.
