A talk with Data Crunch Corp
Recently, Edward, CEO of Graviti, was invited to talk about unstructured data on Data Crunch Corp, a popular podcast in the U.S. Edward talked about how and why he started Graviti and why the management of unstructured data is necessary. Here is the recording. Hope you enjoy it.
Here is the transcript in case you need it.
Edward Cui: The amount of data we collect in a self-driving car division for a week is equal to the data we collect for the entire restaurant business globally for the entire year.
Ginette: I’m Ginette,
Curtis: and I’m Curtis,
Ginette: and you are listening to Data Crunch,
Curtis: a podcast about how applied data science, machine learning, and artificial intelligence are changing the world.
Ginette: Data Crunch is produced by the Data Crunch Corporation, an analytics, training, and consulting company.
Curtis: Welcome, everyone, back to the Data Crunch Podcast. Today, we have a really, really interesting guest. His name is Edward Cui. And he is heading up Graviti, a company that’s doing some really interesting stuff with unstructured data.
So, unstructured data has, kind of, taken over the world. It’s all over the place. It’s really hard to manage. But it’s also really valuable. So without any further ado, I want to turn it over to Edward, because he, of anyone, has a really big background in this. He used to work for Uber, the self-driving car division. So Edward, we want to hear a little bit about you and how you got into this space. And then let’s jump into unstructured data, what it is, why it’s important and what you’re doing with it.
Edward: At UPenn, I took my first machine learning course and I found that it would change the world in the future. Thus, I took a lot of computer science classes and switched to the ML track, and after graduation, I did a little bit of research at the University of Pennsylvania, working on reinforcement learning. So in 2015 I received a call from my former boss Jeff Snyder. He had just joined Uber.
And then he told me they were working on a secret project, and he really wanted me to join the team. I was kind of like, “What are you guys working on? Could you tell me more?”
And he told me that, “You know what, like, we can’t talk to you on the phone. We have to fly you over to Pittsburgh, and then you can take a look for yourself.”
And then they did fly me over, and I found the car, the very first two prototypes of a self-driving car in an abandoned garage. It’s a very sketchy place. There’s actually no road, it’s like all paved stones, and you have two cars hanging in the garage, and they put all the fancy sensors on top of that. At that moment, I kind of decided, “Hey, this is really cool, this is what I really want to do after graduation.” And then that’s how my career started, working for the self-driving car industry.
So Uber, definitely, is one of the first few companies working on the problems. At the very beginning, it was kind of hard because, at that time, people didn’t even use big machine learning models. We didn’t have a lot of compute power to train the models. But we collected tons of data every day, and during that time, we built our own data center, but because the amount of data was so huge, we tended to fill up that data center in like three months. And we had a statistic that told us the amount of data we collect in a self-driving car division for a week is equal to the data we collect for the entire restaurant business globally for the entire year.
So that’s the comparison between the structured data and unstructured data. The data we collect for self-driving cars are all unstructured data. For example, they are images. They are videos. They are LIDAR points. All that type of data from real-world sensors.
Those data are unstructured. But for the data collected by the restaurant business, they are pretty much all structured.
So that comes to the question of what unstructured data is. In a really simple way, you can think of unstructured data as all the sensory data, for example, the images, the videos, the audio. But in a more accurate definition, unstructured data basically means the data that’s captured by sensors, not actually produced by the machines. So a lot of data are produced by machines, people writing software that produces data. For example, that data has integers, floating-point numbers, and strings. Those data are structured, but the data that capture the real world are all unstructured. Even the comments you put on social media, they are unstructured data as well, because the machine cannot control what you write there. So those data are unstructured. There are tons of unstructured data accumulated every single day, and that creates big problems on how to store that data, how to manage that data. And at the end, how to use that data to actually generate value for different organizations.
Curtis: Got it. And managing that, like you said, just to give people a sense of the scale- I mean, one week of Uber back in 2015 was like the entire restaurant industry for a year. And now it’s, I mean, it’s exponentially larger than that, I’m assuming. So you’ve dealt with this problem in the past and what kinds of things do you have to do to be able to manage that immense size of data? I mean, that’s just, it’s enormous.
Edward: Yeah. So when you accumulate that amount of data, you can’t put them onto a single machine. For example, the machine we use at home, the PCs, or even the Macs, will only have 200 gigabytes of storage. But the data we accumulate at Uber, or in different organizations, could be hundreds of terabytes or even hundreds of petabytes.
That cannot be stored in a single machine. And the only way we can have that amount of storage is to put that data onto the cloud. I mean, like you can still build a data center, but building a data center, finding a place to build a data center, getting all the power you need, setting up the server racks, buying the machines, hiring people to work on the network, hiring people to do the SRE, they all cost money.
So the really easy way for people to accumulate that amount of data and to use that amount of data is basically through the public cloud. So nowadays a lot of companies choose to use a cheap storage solution on the public cloud to store the data. For example, AWS, they have their S3, and then Microsoft Azure, they have their similar object storage solutions.
GCP, which is the Google cloud, has GCS. Those solutions are designed for scalability, and those solutions are relatively cheap for storing that amount of data. And that’s what companies nowadays should really choose to store their data.
So we also build our software on top of the cloud storage because it's cheap and scalable. And it's also relatively easy to use.
Curtis: Got it. And I think there are very few companies nowadays that are probably trying to build their own data center. Like everything is moving to the cloud, which is great. But even in the cloud, you still have this problem of like, how do you process that? Or how do you filter through it?
How do you hit that with an algorithm and come up with something useful? I mean, when you’re talking about AI or machine learning, how do you approach that in a way that is efficient?
Edward: Yep. So that’s a very good question. Processing a small amount of data is easy. Let me just give you an example. We probably capture a thousand pictures on our cell phones every day. And then imagine a scenario where you just want to find that specific picture you want to send to your friend. You pull out your phone, you open the gallery app, and you kind of scroll over the pictures. You find that one and then send that picture to your friends through WhatsApp or Facebook Messenger. But think about when you not only have a thousand images, when you have hundreds of millions of images and you want to select a thousand images out of those hundred million images, and how to do that.
You can’t scroll the entire phone. And then find that thousand images through a million images. And that’s pretty much what AI engineers are facing every day. Finding the specific data they want to use in their daily job through a huge amount of data, and that can no longer be done by going through every single image.
So there has to be a way where we can organize a group of compute resources to process the data and automatically tag the data. And then we can write customized code to query specific data under the specific conditions we want to use, and we need to be able to schedule a lot of machines to do that in parallel so we can get a return in a manageable time. You don’t want to wait forever to find that thousand images. You still want to find those images in minutes, and you have to have a system that can automatically spin up machines and write the query code for you, and then find those images in that huge pool of images. So as I describe it, that's hard. Even though on the cloud you have machines available to you, you're still going to have to hire a ton of engineers to build that system for you.
For example, at Uber, we have fifty to a hundred engineers hired to solve that problem. That pretty much means 10 million to 20 million U.S. dollars spent each year just hiring those people to solve those problems. And that could create a big problem for every other enterprise. Like not every company has 20 million U.S. dollars a year in their budget to hire the people to maintain a system they use to manage the data, to query some of the data. And that’s the problem we’re trying to solve: building a system that can be used by a company of any size to manage the data and query through the data.
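The tag-in-parallel-then-query idea Edward describes can be illustrated with a very small Python sketch. This is only an illustration under stated assumptions, not Graviti's or Uber's system: `tag_image` is a hypothetical stand-in for a real model, the tag names are made up, and a thread pool on one machine stands in for a fleet of cloud machines.

```python
from concurrent.futures import ThreadPoolExecutor


def tag_image(image_id):
    """Hypothetical tagger: a real system would run an ML model on the image."""
    tags = set()
    if image_id % 2 == 0:
        tags.add("pedestrian")  # made-up rule standing in for a model prediction
    if image_id % 3 == 0:
        tags.add("night")
    return image_id, tags


def tag_all(image_ids, workers=8):
    """Tag many images concurrently; workers stand in for separate machines."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(tag_image, image_ids))


def query(tagged, required):
    """Return the ids of images whose tags include every required tag."""
    return sorted(i for i, tags in tagged.items() if required <= tags)


tagged = tag_all(range(1, 13))
matches = query(tagged, {"pedestrian", "night"})
print(matches)
```

Once the tags exist, the query never has to look at the images themselves, which is what makes finding a thousand images among millions feasible in minutes rather than by scrolling.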
Curtis: Got it. Help us understand. So when you say this, if we were to bring this down, I mean, are we talking like a giant catalog, so you can find what you need, or like the Google search of unstructured data to kind of build data sets that you want to be able to train models on? Is this something big enterprises would use or more of an open source where anyone can use it? Help us understand some of those parameters.
Edward: Yeah. Well, think about the Google search. Like the Google search really is searching what’s already there. You have keywords in your articles, and Google search will help you to find articles that have those keywords. But think about searching images. There are no keywords in the images at all.
To computers, an image is just a bunch of zeros and ones. You need to have some way to extract information from the images. You need to have a machine learning model or any other models that help you to extract the information. And then after extracting that information, you have to index the information so you can quickly find the result you want to find in a limited time.
And then building that system, hooking up that system together, you just need to hire really good talent, really good software engineers to solve that problem for you. That’s not easy, and it’s not like the existing tools or frameworks can solve those problems.
You definitely need to build new tools, new types of tools to solve those specific problems for you.
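The extract-then-index step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Graviti architecture: `extract_labels` is a hypothetical placeholder for a machine learning model, and the file names and labels are invented for the example.

```python
from collections import defaultdict


def extract_labels(image_name):
    """Hypothetical extractor; a real system would run a model on the pixels."""
    fake_model_output = {
        "img_001.jpg": ["car", "rain"],
        "img_002.jpg": ["car", "pedestrian"],
        "img_003.jpg": ["bicycle"],
    }
    return fake_model_output.get(image_name, [])


def build_index(image_names):
    """Build an inverted index mapping each label to the images containing it."""
    index = defaultdict(set)
    for name in image_names:
        for label in extract_labels(name):
            index[label].add(name)
    return index


def search(index, *labels):
    """Return images that carry all of the given labels."""
    sets = [index.get(label, set()) for label in labels]
    return sorted(set.intersection(*sets)) if sets else []


index = build_index(["img_001.jpg", "img_002.jpg", "img_003.jpg"])
cars = search(index, "car")
cars_with_pedestrians = search(index, "car", "pedestrian")
print(cars, cars_with_pedestrians)
```

The index plays the role that keywords play for text search: the expensive extraction happens once per image, and each later lookup is a cheap set intersection.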
Curtis: And we’re talking about only the management of the data, but I mean, essentially you’re using machine learning and AI to find what you need to do more machine learning and AI.
That’s the level of complication we’re talking about.
Edward: Yeah. So management is just the first step. Like the reason you want to manage that data is one day you should be able to use that data to create value. The value creation is either, you can analyze the data and get some very special insights that can help you to make very important business decisions, or you can train machine learning models from the data that will massively improve the user experience of your own product, and managing the data is the first step.
And then you have to be able to use the data later. But you don’t always want to use all the data. Like you need to pick some of the data from the data you accumulated, and picking that data you have to do through some kind of query, some sort of search, but that search really isn’t similar to what Google search provides. You have to have a very dedicated, complex system for you to do that type of search and to integrate your machine learning model to constantly extract information from unstructured data, to organize that extracted information and be able to search through that information.
Curtis: Yeah. Let’s talk a little bit. I mean, ‘cause what we’re talking about here is, like you said, you don’t have to hire a hundred engineers because you’re building a system that can do this and that, I mean, that’s an order of magnitude increase in efficiency. And anytime you have something like that, it just opens up new doors.
So tell me, in your vision here, like this problem being solved, what does that mean? What new abilities does that give companies or people interested in this space? What can they do now with this problem solved? What’s the benefit?
Edward: Yeah, that’s a really good question. A lot of the companies are talking about machine learning, they’re thinking about applying machine learning. There are two problems they’re facing. First of all because they don’t even know what machine learning is, they haven’t accumulated any data when they try to apply machine learning.
Even when they hire an expert who is good at machine learning, when that expert joins, they can do nothing. Like the model has to be trained using the data. If an enterprise, if an organization, hasn't accumulated any data when they hire that engineer, the engineer will basically be sitting there, and the organization thinks, “We should start accumulating data,” but for the first several months, the engineer can’t do anything.
He’s basically waiting for the data to be accumulated. And that’s either time-consuming or that will cost a lot of money. But you have to pay that engineer who is sitting there doing nothing while the data is collected. So a message to a lot of organizations is to start to collect your data today, start to manage your data today, because what we saw from a lot of organizations is, they collected some of the unstructured data, but because they don’t know how to use it, it’s just sort of thrown away. But instead, they should have a system in place where when they collect those data, they put those data into the system so that system can help them to manage the data. One day when they’re ready to apply machine learning, they hire someone to join the company and that person can get up to speed immediately.
So the other problem those organizations have is oftentimes, even though they have accumulated some of the data and put the data somewhere, when they hire a machine learning engineer, they think, “Hey, when you join the company, you should immediately train the model. You should immediately make the machine learning solution come to reality, and you instantly create value,” but that’s not what happens in a lot of organizations. When they hire a machine learning engineer, that engineer actually spends a lot of his time dealing with the data, cleaning the data and writing a lot of scripts to query through the data. And he has to spend time setting formats and doing data transformations, and all of those things consume a lot of the engineer's time.
Several different reports came out in the last few years basically saying, “Hey, the machine learning engineers in different organizations, they spend 80% of their time on data cleaning, on data identification, not on building the machine learning models. Only like 15% of their time, they’re really working on the machine learning part of the problems.”
So that creates a lot of problems for the company who hires those engineers because they’re not working as efficiently as they can. And they often spend a lot of time on the things they aren't really good at. Like they are really good at training models. They are really good at writing papers and finding state-of-the-art models. They’re really good at framing the problems into the machine learning space, but they’re not good at cleaning the data, writing scripts, or doing visualization. They’re not good at doing that dirty work. So that’s why they need a tool to help them to do all that work.
They need automation to automate all of that. So they can just write code once and that code can be executed by the machine automatically. And that will save a lot of time for those machine learning engineers. And instead of spending only 15% of their time, they can now spend like 80% of their time working on solving the real problems for those companies. I think that’s going to be really valuable for those companies. And only in that way can those companies bring AI to their organizations as fast as possible.
Curtis: Yeah. I mean, that’s 80% of your time unlocked to do more high value activities. That’s a huge efficiency gain. And this is, I mean, like you’ve said this is a massive problem. The 80/20% rule, like everyone has heard about that. And engineers don’t like spending their time in data prep, like nobody likes doing that. So unlocking that is huge.
And I got it, so a problem of this size, I gotta imagine there’s people thinking about this, trying to solve it. What at Graviti are you doing that you feel is different or like a better approach that’s going to actually solve it? Because we’ve been talking about this since I started the podcast.
How do we solve this problem? So, what are you guys doing that you feel is gonna get it?
Edward: Yeah, but there’s not many. Well, even though people are talking about it, there are not many solutions on the market people can choose from. So I guess cloud providers are probably the first solution providers a lot of companies go to at the very beginning.
But the problem with the public cloud providers is, they make very general products, and for learning their products and combining their products together into a system that can solve these problems, you always need to hire really good software engineers, you really need to hire infrastructure engineers. But you will compete with the companies in Silicon Valley to hire the talent. That’s going to be really hard for a lot of organizations, for a lot of companies. So there’s a certain bar to using public cloud solutions. Public cloud solutions basically give you a bunch of building blocks. You have to hook them together, and you have to write tons of code to do so.
So in my view, organizations often need an out-of-the-box solution. When they use the solution, it really lowers the bar. They can use such solutions. They can just write a minimal amount of code to start collecting data.
And then we kind of provide a lot of operations that can be done through a web interface rather than writing tons of code. We simplify the way people can interact with the data, with the data stored on the platform through better web interfaces. That’s kind of lowering the bar of using the product.
And on the other hand, we have a pricing model where we charge companies based on their consumption. So at the beginning, they don’t have to worry about, “Hey, if we don’t have a lot of data, will this product cost us a lot of money?” You don’t have to worry about that. We will charge our customers based on how much data we help them to manage, how much data they used, and how much compute power, how many hours of CPU power they actually used. So they don’t have to worry about being overcharged if they do not use the product at all.
The other players on the market, they are actually moving over from the structured world. A lot of the companies that are building solutions for dealing with structured data are thinking about adding more capabilities to their solutions by adding support to manage unstructured data.
And we are different from them. We understand the unstructured problems are very different from the structured problems in terms of the scale. So we re-architected the software specifically to solve unstructured problems, to handle data at that scale. So that’s how we differentiate from those players.
We don’t believe you can share the structure. You can’t share the architecture to do both. You have to have a dedicated software architecture for unstructured data, because it’s just growing so rapidly. You need to have a separate solution.
Curtis: Got it. So the dedicated no-code solution to unstructured data, that’s good. So, and then, I know we’re running out of time here, but one more question, if you have just a second. And I like this question: like five, ten years down the road, where do you see Graviti?
And where do you see the AI industry in general? Like new developments that you’re doing, where’s that going to take us? What’s your vision?
Edward: Yeah, that’s a really good question. I think we are still at a very early stage of AI. There’s a lot of potential that has not been unlocked yet. I think with more time, maybe another five or ten years, we are at the very beginning of accumulating unstructured data.
Within five or ten years, Gartner, they have a report. They believe maybe by the end of ’28, but like in five or ten years, 80% of the data will be unstructured data, and humanity as a whole is going to accumulate 400 zettabytes of data in total. That’s going to be a huge deal. So we think we are at the very beginning of the AI industry.
The technologies are currently only applied in a few high-tech companies in Silicon Valley. We want AI technologies to be more useful and, we want to lower the bar of how AI technologies can be used by different organizations. We’ll definitely see lots of different applications in lots of different scenarios and more and more companies are gonna start applying AI in their daily routines.
And we are also gonna see a huge spike in terms of the speed or the volume of the data accumulated in those five or ten years. And then I think Graviti will grow with our customers. We want Graviti to help our customers grow; when their data grows, we will always be there helping them, working on the problems they really care about, and managing the scale of data for them. And we want Graviti to be their partner through this process.
Curtis: Got it. That’s exciting stuff. So, and then just to close off. How can people take advantage of this? Or is this mostly like currently enterprise companies, maybe students that are going through school and learning AI? Can they take advantage of this in any way? Just how can the listeners, what can they do?
Edward: Yeah, we have free accounts, like anyone can just go to graviti.com and register an account to try our product. We opened advanced functionalities for the enterprise users, but anyone can go there and can start to try out the product.
We have a fair amount of documentation. And we have always been there. We are happy to support anyone. We are always there to collect feedback, feel free to give us feedback, try the product and let us know whether you like it or not. And let us know what you really need. And we will work on it, we will work on it.
Curtis: Awesome. All right, everybody. Well, Graviti, check them out, if you’re working on AI in an enterprise and you want to solve this. Edward, thank you so much for sharing your knowledge and your vision for what’s going on, what’s happening. As someone that’s been in the space since the beginning of Uber, it’s really great to have you on the show to have your perspective.