Databricks 2024 Developments and Announcements

The global Databricks Data + AI conference, held two weeks ago in San Francisco, featured a long list of innovations and announcements. Unlike last year, Snowflake and Databricks held their conferences this year in the same venue (Snowflake first, Databricks a week later), giving data professionals and enthusiasts the opportunity to catch up, learn, and enjoy an action-packed couple of weeks of announcements, expert presentations and networking.

Some figures from the conference:

  • 16,000 visitors
  • 600 professional sessions
  • Representatives from 140 countries
  • 200 data teams featuring different customer success stories
  • One Nvidia CEO in a leather jacket on stage

The conference included a long list of updates on a variety of topics, and it’s practically impossible to cover them all in a single post; however, all sessions were recorded and are available for viewing on the conference website. We will try to give you a small taste of the conference and briefly review the updates we consider most significant:

 

1. Changing format rules

In early June, Databricks announced the acquisition of Tabular, whose founders created the Iceberg table format. It comes as no surprise that, with this acquisition, Databricks wants to develop its products around the two most common open-source table formats: Iceberg and Delta Lake. Alongside the acquisition, Databricks announced Delta Lake UniForm, which provides full and transparent support for both formats (as well as Apache Hudi), enabling data teams to choose their preferred format without limitations or the need for conversions.
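As a rough sketch of what this looks like in practice (the table name and columns here are our own illustration, and the exact property names may vary by runtime version), UniForm is enabled per table via Delta table properties, after which Iceberg clients can read the same underlying data:

```sql
-- Hypothetical example: a Delta table that is also readable as Iceberg via UniForm
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE,
  order_ts TIMESTAMP
)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2'          = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

The data itself is written once as Parquet; UniForm generates the additional format metadata, which is what makes the "no conversions" promise possible.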

 

2. Serverless Everything

Until recently, the main service available in a serverless configuration within the Databricks ecosystem was SQL Warehouse. Serverless compute consists of “warm” servers that Databricks manages; when a customer chooses to use them, they are allocated on the customer’s behalf, with the advantage that the compute is immediately available. At the conference, Databricks announced that all of its services are now available in a serverless configuration, so that provisioning and spin-up time, if you choose to work this way, is significantly shortened to a few seconds.

 

3. Pack me some models on sale

In July 2023, Databricks acquired MosaicML, which specializes in training large language models (LLMs). Since the acquisition, the folks at Databricks have been working diligently to implement Mosaic’s capabilities throughout the platform. During the conference, Databricks introduced the principle of Compound AI Systems. What does this actually mean? We are all trying these days to leverage and implement GenAI capabilities in various work processes; according to Databricks, the right way to do this is to connect several different models into a unified flow that combines a variety of capabilities to solve business challenges. To illustrate this, a demo of a business scenario was presented: creating a personalized Instagram campaign for the marketing team based on a sequence of GenAI models connected to integrated functions and data sources.

We highly recommend watching the recording of this demo. For one thing, the excellent product manager on stage debugs the demo in real time in front of a hall full of spectators; but beyond that, in our view there is nothing less than a revolution in the operating concept of LLM-based systems ahead of us, particularly in the way traditional work processes can be transformed end to end and deliver real, tangible ROI to the organization (and not just hype).

Additional innovations from Mosaic presented at the conference:

  • Built-in ability to fine-tune an open-source or commercial model within Databricks by training it on the customer’s internal data.
  • Shutterstock ImageAI – the world’s largest stock image marketplace has partnered with Databricks to make a text-to-image model accessible. The model generates custom images for the needs of business users.
  • Mosaic AI Agent Framework – built-in ability to package organizational information in a RAG pipeline and serve it to LLMs, with governance integrated via Unity Catalog.
  • Mosaic AI Tool Catalog – a built-in set of tools for common LLM tasks within Databricks. In this way, data, functions, models and more can be packaged for reuse and assembled like puzzle pieces, as part of the Compound AI Systems concept.

 

4. Let’s not forget the DWH

In the data warehousing space, Databricks released significant updates, with the understanding that this is an essential core capability for organizations, and a particularly expensive one. In an effort to reduce the complexity of migration processes, Databricks is completing a broad set of functional capabilities, effectively providing support for all the goodness we’re used to from traditional data warehouses. Did anyone ask for materialized views? Now you have them. SQL and Python UDFs? Yes, those as well; you get the idea…
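To make this concrete, here is a minimal sketch of both features in Databricks SQL (the table, view and function names are hypothetical, invented for illustration):

```sql
-- A materialized view that pre-aggregates order data
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;

-- A SQL UDF that strips sales tax out of a gross amount
CREATE FUNCTION net_amount(gross DOUBLE, tax_rate DOUBLE)
RETURNS DOUBLE
RETURN gross / (1 + tax_rate);

SELECT order_date, net_amount(revenue, 0.17) AS net_revenue
FROM daily_revenue;
```

These are exactly the ergonomics that teams migrating from a traditional DWH expect to find waiting for them.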

In addition, Databricks harnesses the Compound AI Systems approach for its own platform, effectively running AI-based optimization engines behind the scenes. How does this manifest in practice? Here are some examples:

  • There is no need to define partitioning or clustering for a table: Databricks learns usage patterns on its own from the queries that run, applies clustering automatically, and does the same for indexes.
  • Databricks claims that through a combination of AI optimization mechanisms, they were able to achieve a 73% improvement in enterprise query runtimes over the past two years.

Additional updates in the DWH space include built-in LLM-assisted query writing, the ability to call AI functions from SQL code, and data retrieval from vector databases.
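As a taste of the AI-functions-in-SQL capability, here is a sketch using Databricks' built-in `ai_analyze_sentiment` and `ai_summarize` functions (the `product_reviews` table and its columns are hypothetical):

```sql
-- Calling built-in AI functions directly from a SQL query
SELECT
  review_id,
  ai_analyze_sentiment(review_text)  AS sentiment,
  ai_summarize(review_text, 30)      AS short_summary
FROM product_reviews
WHERE review_date >= current_date() - INTERVAL 7 DAYS;
```

Each row is sent to a Databricks-hosted model behind the scenes, so analysts get LLM output as just another column, without leaving SQL.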

5. Add some AI on top of my BI

Databricks does not neglect analysts or data consumers, and has launched a completely new service called AI/BI. It is, in fact, a combination of two capabilities. The first is on-the-fly creation of dashboards adapted to the user’s needs based on free-text requests. The second, called Genie, is an LLM-based chat interface that allows you to ask questions about the data directly or from dashboards.

According to Databricks, one of the challenges organizations face is their business terminology, which is not generic, so an LLM finds it difficult to give reliable answers in these situations. Genie addresses this problem as well: the model can be taught the definition of, for example, “churn ratio” or any other business term, and from that point on it will be able to calculate and present this figure in response to analysts’ questions. How does Genie achieve this? Once again, by incorporating Compound AI as a core component of the platform, this time to help Genie familiarize itself with all the different data assets in order to give more accurate answers.

 

6. Cross-org data catalog

One of the hottest topics at the conference was governance, and data cataloging as part of it, of course. Databricks has invested heavily over the past year in Unity Catalog, its data catalog built into the platform. Development continues with a number of interesting announcements promoting Unity as the main catalog for organizations, thanks to its integration with the entire data stack.

First, Databricks is opening up Unity Catalog as an open-source component, which means that even customers who do not currently use Databricks can make use of Unity. This opens up the opportunity to build integrations and connectors for Unity. On top of that, Databricks already provides some of these connectors today through a capability called Lakehouse Federation, which means you can already connect additional systems to Unity, such as Redshift, Power BI and even Snowflake.

 

Another interesting update is the ability to manage business KPIs in Unity Catalog: Databricks announced an object called Metrics, along with integration with common third-party systems for managing business metrics. This was accompanied by announcements of new Data Sharing and Data Clean Rooms capabilities, which we will expand on in a separate post.

 

7. Data flows in the palms of your hands

One of the most important announcements was saved for the end of the conference. Databricks announced a new product called LakeFlow, an ETL tool managed within the platform that addresses ingestion, transformation processes and business logic, as well as orchestration of these processes. The capability builds on Databricks’ earlier acquisition of Arcion and includes built-in connectors for common sources, CDC capability, a graphical interface for transformations and more. This announcement challenges the ecosystem of Databricks partners in these areas (such as Fivetran, dbt and Airflow), and it will be interesting to see whether customers choose to put all their eggs in one basket and migrate such processes to Databricks.

 

In conclusion: Databricks continues to establish its status as a unified organizational data infrastructure that provides a solution for all data assets, all data workloads, and all the roles involved in an organization’s various data processes. In this respect, the conference was another demonstration of capabilities in the battle with Snowflake for market control, and a clear signal to Microsoft’s Fabric that the capability gap relative to Databricks remains significant and continues to grow. One thing is certain: the cloud data space is one to watch over the next few years.

 
