Databricks, a unified data platform for accelerating innovation across Data Science, Data Engineering, and Business Analytics, leverages Apache Spark for its computational capabilities and supports several programming languages, including Python, R, Scala, and SQL. Databricks continues to grow in popularity for big data processing and is being adopted in almost every industry. To cater to a broad audience, Databricks is available in two editions: Databricks Community Edition (the free version) and the Databricks Platform (the paid version).
At Topcoder, we expect most members participating in Databricks work will use the Community Edition. In this post, we’ll be sharing how to work within this edition and what you can expect.
Databricks Community Edition can be used on an ongoing basis (your access doesn’t expire), and users have access to:
a 15 GB micro-cluster,
a cluster manager,
a notebook environment for prototyping simple applications, and
JDBC/ODBC integrations for BI analysis.
This is in contrast to the paid version, the Databricks Platform, which offers production-grade functionality such as:
an unlimited number of clusters that can be scaled with ease,
a job launcher and collaboration features,
advanced security controls, and
expert support.
It helps users process data at scale and build Apache Spark applications in a team setting. Even with these differences, there is a lot you can do with the Community Edition of Databricks.
Since its creation in 2009, Apache Spark adoption has grown rapidly, and it has quickly become the largest open source community in big data. As the founders of Apache Spark and the largest contributor to the Spark community, Databricks aimed to open doors for IT professionals through its Community Edition, making it easy to tap the power of Apache Spark and Databricks’ other proprietary functionality.
The Databricks Community Edition, released in 2016, is a free version of the cloud-based big data platform that, as already mentioned, gives users access to a micro-cluster as well as a cluster manager and notebook environment, making it ideal for developers, data scientists, data engineers, and other IT professionals to learn Spark as well as share their notebooks and host them for free. It comes with a rich portfolio of Spark training resources that has grown substantially over the years. Interestingly, the Databricks Community Edition is hosted on Amazon Web Services (AWS); however, users do not incur any AWS operational costs while using it.
Worried about computational capacity?
Databricks Community Edition users can request more capacity and gain production-grade functionality by upgrading their existing subscription to the full Databricks Platform.
Amazing, right? Try it out now!
To upgrade your subscription, sign up for a 14-day free trial using the following link: https://accounts.cloud.databricks.com/registration.html#signup
Databricks is a single cloud platform for massive-scale data engineering and collaborative data science. The three major components of the Databricks Platform are:
The Data Science Workspace
Unified Data Services
Enterprise Cloud Services
Figure 1: Databricks Unified Analytics Platform diagram
You can read more about each of these in our previous THRIVE post. For now, let’s explore more about ‘The Data Science Workspace’ you’ll have access to in the Community Edition:
From data ingestion to the production of reports, the Data Science Workspace is a shared environment where the data science team can collaborate and work together. Depending on the role you play in the ecosystem, you will use different functionality within the workspace. From democratizing access to all of your data, to increasing data science teams’ productivity, to standardizing the full machine learning lifecycle, the Data Science Workspace is a one-stop solution.
Figure 2: This is what the Databricks Workspace looks like.
Isn’t it fascinating to picture how the Data Science Workspace functions behind the scenes? Here’s how it works.
The Workspace is divided into three categories to support its efficient functioning:
Collaborative Notebooks:
Data science professionals and enthusiasts alike can perform quick, exploratory data science work or build machine learning models using collaborative notebooks. These notebooks support multiple languages, built-in data visualizations, automatic versioning, and operationalization with jobs. The principal benefits of the notebooks include working together, sharing insights easily, and operationalizing at scale. Add to those data access, multi-language support, real-time co-authoring, Git versioning, and workflow creation, and you get a clearer picture of just some of the leading features of the notebooks workspace.
Shared and interactive notebooks, experiments, and extended file support allow data science teams to organize, share, and manage complex data science projects more effectively throughout the lifecycle. APIs and the Job Scheduler allow data engineering teams to quickly automate complex pipelines (see the sketch below), while business analysts can directly access results via interactive dashboards.
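As a small illustration of that kind of automation, here is a minimal sketch that runs one notebook from another with dbutils.notebook.run, a Databricks notebook utility. The notebook path and arguments are hypothetical placeholders (the Job Scheduler itself is a paid-platform feature, but chaining notebooks like this works inside the notebook environment):

```python
# Run a child notebook as one step of a pipeline.
# The path and the "date" argument below are hypothetical examples.
result = dbutils.notebook.run(
    "/Users/you@example.com/etl_step",  # path to another notebook in your workspace
    600,                                # timeout in seconds
    {"date": "2021-01-01"},             # parameters passed to the child notebook
)

# `result` is whatever string the child notebook returned
# via dbutils.notebook.exit(...).
print(result)
```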
Figure 3: Step 1 of notebook creation
Figure 4: Step 2 of notebook creation
Figure 5: Congratulations! You can start writing your code here.
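Once your notebook is attached to a running cluster, your first cell can use Spark right away. Here is a minimal sketch of what that might look like; the sample data is made up for illustration, and spark (a preconfigured SparkSession) and display() are provided by the Databricks notebook environment:

```python
# In a Databricks notebook, `spark` and `display()` are predefined.
data = [("Alice", 34), ("Bob", 29), ("Carol", 41)]
df = spark.createDataFrame(data, ["name", "age"])

display(df)                    # rich, interactive table rendering in the notebook
df.filter(df.age > 30).show()  # plain-text output, works in any Spark environment
```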
ML Runtime:
The Machine Learning Runtime provides data scientists and ML practitioners with scalable clusters that include popular frameworks, built-in AutoML, and optimizations for unmatched performance. From automated experiment tracking, to an optimized TensorFlow build for simplified scaling, to automated hyperparameter tuning for single-node machine learning, the ML Runtime comes with a plethora of functionality for its users. Some of its leading benefits are as follows:
Figure 6: Benefits of Databricks’ ML Runtime
The Machine Learning Runtime is built on top of the Databricks Runtime and is updated with every Databricks Runtime release. It is available across all Databricks product offerings, including Databricks on AWS and Azure Databricks, on both GPU and CPU clusters. To use the ML Runtime, simply select the ML version of the runtime when you create your cluster.
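To make the hyperparameter-tuning point concrete, here is a minimal sketch using Hyperopt, one of the tuning libraries bundled with the ML Runtime (it can also be installed with pip elsewhere). The objective function and search space are toy examples, not part of any Databricks API:

```python
from hyperopt import fmin, tpe, hp, Trials

# Toy objective: pretend "x" is a hyperparameter and lower loss is better.
def objective(x):
    return (x - 3) ** 2

search_space = hp.uniform("x", -10, 10)

trials = Trials()
best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,  # Tree-structured Parzen Estimator
    max_evals=50,
    trials=trials,
)
print(best)  # e.g. {'x': 2.98...}, close to the true optimum of 3
```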
Managed MLflow:
Managed MLflow is a proprietary tool for managing the complete machine learning lifecycle at scale, with enterprise reliability and security. It is built on top of the open source project simply called MLflow. From experiment tracking to model management to model deployment, Managed MLflow assists throughout. With Managed MLflow, data scientists work more productively because they can focus on designing and improving their models instead of doing the bookkeeping manually (like keeping track of everything in a spreadsheet).
MLflow provides a lightweight set of APIs and user interfaces that can be used with any ML framework throughout the Machine Learning workflow. It is made up of four components:
MLflow Tracking: Record and query experiments: code, data, config, and results.
MLflow Projects: Packaging format for reproducible runs on any platform.
MLflow Models: General format for sending models to diverse deployment tools.
MLflow Model Registry: Centralized repository to collaboratively manage MLflow models throughout the full lifecycle.
Managed MLflow on Databricks is a fully managed version of MLflow, providing practitioners with reproducibility and experiment management across Databricks Notebooks, Jobs, and data stores with assured reliability, security, and scalability of the Unified Data Analytics Platform.
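As a quick taste of the tracking component, here is a minimal sketch that logs a parameter, a metric, and a file artifact using the open source MLflow APIs; the names and values are hypothetical. In a Databricks notebook, a run like this shows up in the workspace’s experiment tracking UI:

```python
import mlflow

# Hypothetical parameter and metric names, purely for illustration.
with mlflow.start_run(run_name="example-run"):
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.87)

    # Artifacts can be any file: plots, serialized models, notes, etc.
    with open("notes.txt", "w") as f:
        f.write("trained with alpha=0.5")
    mlflow.log_artifact("notes.txt")
```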
Databricks is one of the fastest-growing data services on AWS and Azure, with 5,000+ customers and 450+ partners across the globe.
In our next few THRIVE posts, we’ll be exploring the “bricks” that make up Databricks by going a bit deeper into Apache Spark, Delta Lake, TensorFlow, MLflow, and Redash.
For now, we recommend you sign up for the Community Edition of Databricks and check out our ongoing series of Skill-Builder Challenges, available to Topcoder members.