Beyond SDK, master Pragmas

A Jupyter Kernel to rule them all

May 16, 2019, by Nicolas Narbais

Writing a machine learning algorithm is just one step towards providing value to the business. It is the step that requires the most fine-tuning and expertise. Nevertheless, the experts responsible for developing algorithms, data scientists, are surrounded by time-consuming and repetitive tasks, such as preparing and cleaning the data before developing the algorithm, or deploying the training of the model on a large-scale infrastructure.

In recent years, there has been a lot of innovation to help data scientists deploy their algorithms into various environments. This article describes Activeeon’s approach to deployment.

If you study the most common tools used in the field, you quickly see that Jupyter Notebook is the standard. It can run directly on your laptop or on a remote server, enables fast iteration, and more. From this environment, at Activeeon, we wanted to give data scientists the ability to distribute any algorithm and easily access more powerful machines. The traditional approach is to use APIs or SDKs, which require you to edit your original code to get started.

In a world where AGILE development is the new standard, SDKs reduce data scientists’ flexibility and force them to write SDK-specific code. An algorithm developed with the SDK can only run with the SDK:

  • There is usually no simple way to run the code locally
  • You have to read the documentation even for a “get started” submission
  • It ties the code to a specific library or resource type

Go pragmas!

At Activeeon, we’ve decided to go one step further and get closer to the AGILE principles. We have of course developed a Python SDK, but we also went on to develop a Jupyter Kernel. It is now possible to use pragmas. In a few words, they are Python comments that our Activeeon kernel interprets to make the relevant requests to our SDK.
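To make the idea of a pragma more concrete, here is a minimal sketch of how a kernel could separate pragma comments from ordinary code in a cell. The `#%name(key=value)` form, the `PRAGMA_RE` pattern and the `parse_pragmas` helper are assumptions made for illustration only; they are not taken from the Activeeon kernel's source, and list-valued options are deliberately not handled.

```python
import re

# Hypothetical pragma form, assumed for illustration: "#%name(key=value, ...)"
PRAGMA_RE = re.compile(r"^\s*#%(\w+)\((.*)\)\s*$")


def parse_pragmas(cell_source):
    """Split a cell into (pragmas, code).

    pragmas is a list of (name, options) tuples; code is the cell text
    with the pragma lines removed. Options are split on commas, so
    list-valued options are out of scope for this sketch.
    """
    pragmas, code_lines = [], []
    for line in cell_source.splitlines():
        match = PRAGMA_RE.match(line)
        if match is None:
            code_lines.append(line)  # ordinary code or ordinary comment
            continue
        name, raw_options = match.groups()
        options = {}
        for pair in filter(None, (p.strip() for p in raw_options.split(","))):
            key, _, value = pair.partition("=")
            options[key.strip()] = value.strip()
        pragmas.append((name, options))
    return pragmas, "\n".join(code_lines)


cell = "#%task(name=train)\nscore = 0.5 * 2\n"
pragmas, code = parse_pragmas(cell)
print(pragmas)  # [('task', {'name': 'train'})]
print(code)     # score = 0.5 * 2
```

In such a design, the kernel would translate each `(name, options)` pair into an SDK call and send the remaining code to the remote task.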

Now, you may wonder what this actually means for data scientists and what the benefits are. Below are a few concepts we applied.

Run locally as well as remote

Pragmas are interpreted by the Jupyter Kernel. As mentioned above, this means that when the Activeeon Kernel is selected, the pragmas are read to call the relevant Python SDK functions. It also means that when a standard kernel is selected, the code runs seamlessly on your local computer and the pragmas are ignored, just like any other comment.
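The "ignored like comments" behaviour can be checked with plain Python. The short sketch below runs a cell containing a pragma-style line the way a standard interpreter would; the `#%task(...)` line uses an assumed syntax and is only there to show that it has no effect on local execution.

```python
# A cell as it might appear in a notebook. The "#%..." line uses a
# hypothetical pragma syntax; to standard Python it is just a comment.
cell = """
#%task(name=square)
values = [n * n for n in range(4)]
total = sum(values)
"""

namespace = {}
# Compile and execute the cell source, as a standard kernel effectively does
exec(compile(cell, "<cell>", "exec"), namespace)
print(namespace["total"])  # 14
```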

In seconds, you can now switch between executing your code on your local machine and on a remote server. Develop the concepts of your training algorithm locally with a sample of data. Then, when you are ready, scale up to training on the complete dataset, with access to an elastic resource pool of more powerful and specialized machines.

Learn progressively

Implementing a Jupyter Kernel also brings additional UX challenges. The code that runs on a local machine also needs to run on the remote server with minimal effort.

When you get started with the new Activeeon Kernel, the only action required is to add a line in the first block to connect to the remote server and a line at the end to submit the code. In a second, you can then run your algorithm on more powerful machines. If you want to go further, you can add dependencies between named tasks/blocks to create a more relevant structure with controlled parallelism.

Note: if you want to learn more, we also implemented a “help” pragma that lists all the available pragmas.

Split algorithm errors from platform errors

Like any developer, data scientists want to debug fast and identify errors quickly. The ability to switch kernels is essential here. First, run your code locally to surface any errors in your algorithm. Once it works, change the kernel and execute remotely.

This way, you can tell algorithm errors apart from platform errors.

The solution also supports Docker containers, so you can create environments that suit your execution. No need to worry about the underlying server anymore!

Features

So far we have covered why we took this approach, but you may be interested in learning more about some of the features provided by this Kernel.

Create distribution graph / pipeline

By default, each block is what we call a task within ProActive. Unless dependencies are declared, all tasks run in parallel.

With the dep option of the task pragma, we give data scientists a way to structure their execution. They can create dependencies between named tasks; the remote scheduler server interprets them and optimizes execution time by parallelizing the workload whenever possible.
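As an illustration of how dependencies between named tasks translate into parallelism, the sketch below groups tasks into batches that could run at the same time. It uses the standard library's `graphlib` (Python 3.9+) and is not the ProActive scheduler's algorithm; the task names and the `parallel_batches` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names, each mapped to the set of tasks it depends on
dependencies = {
    "load": set(),
    "clean": {"load"},
    "train_a": {"clean"},
    "train_b": {"clean"},
    "compare": {"train_a", "train_b"},
}


def parallel_batches(deps):
    """Group tasks into successive batches; tasks in one batch are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(parallel_batches(dependencies))
# [['load'], ['clean'], ['train_a', 'train_b'], ['compare']]
```

Here `train_a` and `train_b` land in the same batch because both depend only on `clean`, which is the kind of opportunity a scheduler can exploit.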

Before submitting it, you can visualize the graph of dependencies with the pragma draw_job.

Ensure variable and file transfer

Since we are in a distributed environment, variable and file transfers can be a challenge. The task options import and export tell the SDK to save those variables in the workflow scope. This ensures variables can be used later on within any dependent task.

Watch out: if the tasks do not depend on each other, variables set earlier cannot be retrieved.
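The sketch below simulates this behaviour locally, assuming each task runs in its own isolated namespace and only sees the variables it explicitly receives. The `workflow_scope` dictionary and the `run_task` helper are hypothetical stand-ins, not part of the Activeeon SDK.

```python
workflow_scope = {}  # stand-in for the variables saved in the workflow scope


def run_task(source, imports=(), exports=()):
    """Run one task in a fresh namespace, as a separate remote process would."""
    local_vars = {name: workflow_scope[name] for name in imports}
    exec(source, {}, local_vars)
    for name in exports:
        workflow_scope[name] = local_vars[name]  # publish for later tasks
    return local_vars


# The first task exports a variable; a later task imports it
run_task("threshold = 0.8", exports=["threshold"])
result = run_task("accepted = 0.9 > threshold", imports=["threshold"])
print(result["accepted"])  # True

# A task that does not receive the variable cannot see it
try:
    run_task("accepted = 0.9 > threshold")
    isolated_task_succeeded = True
except NameError:
    isolated_task_succeeded = False
print(isolated_task_succeeded)  # False
```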

Visualize progress

When you launch an execution on a remote server, you want to follow its progress. Don’t worry, we thought about it. We give you a direct link to the scheduler portal, which is responsible for managing the execution, and show a progress bar until completion.

Once completed, you’ll receive a summary of the execution with the overall processing time, the total execution time of all the tasks, the number of errors, etc.

Collect visuals

As you may know, coding machine learning models requires regular visualization. This is particularly useful to understand and analyze the incoming data or the actual results from your algorithm.

Store pipelines for utilization within templates (Auto ML, Incremental learning, etc.)

In addition to all the above features, the Machine Learning Open Studio (MLOS) from Activeeon includes templates for standard use cases. For instance, one template performs automated machine learning at scale; another can be used to ease the deployment of a model within a container.

Reuse code

Finally, the MLOS solution includes a catalog system to store and reuse code. A pragma is available to consult the catalog, commit changes and more.

Conclusion

In conclusion, data scientists can quickly leverage the Jupyter Kernel offered by Activeeon to run their code locally and in the cloud with just a few clicks. They then get quick access to larger and more specialized infrastructure and to large-scale parallelization mechanisms.

Try it on our try.activeeon.com platform; the kernel is open source.

If you are interested in how Activeeon handles elasticity, do not hesitate to check the video about CNES which presents this specific feature.


With the contribution of our team: Andrews Sobral, Mohamed Khalil Labidi

