Nick Burch, CTO at Quanticate, discussed Big Data in Clinical Trials at the 4th Annual Clinical Data Integration and Management conference this year in Princeton, NJ. His presentation was titled 'The Myth of the Big Data Silver Bullet - Why Requirements Still Matter'.
We've all heard the hype - Big Data will solve all your storage, processing and analytic problems effortlessly! Some who move beyond the buzzwords find things really do work well, but others rapidly run into issues. The difference usually isn't the technologies or the vendors per se, but their appropriateness to the requirements, which aren't always clear up-front.
Big Data, and the related area of NoSQL, actually cover a broad range of technologies, solutions and approaches, with varying levels of overlap. Sadly, it's not enough just to pick "a" Big Data solution; it needs to be the right one for your requirements. In this talk, we'll first do a whistle-stop tour of the broad areas and approaches of the Big Data space. Then we'll look at how Quanticate selected and built our Big Data platform for clinical data, driven by our needs and requirements. We won't tell you what Big Data platform you yourself need, but will instead try to help you with the questions you need to answer to derive your own requirements and approach, from which your successful Big Data in clinical trials solution can emerge!
So, we've heard a little bit already over the last couple of days about big data, but can I just get a sense: who here thinks they know what big data is? Who here has heard of it? Who here had heard of it two years ago?
Few people, oh dear. (Go away, Acrobat pop-up.) Right, so, setting the scene a little bit: as we heard, I'm the CTO at Quanticate. Quanticate is not a big data vendor; we're a data-focused CRO, but we're starting to use big data internally to deliver our normal work. I've been speaking about big data at various conferences for a few years now, so even though at Quanticate we've only just started using it, it's something I've been aware of for some time, and hopefully I can share a little of that experience.
The definitions are a little bit fuzzy depending on who you ask. I've been at one event where someone said big data was more than they could fit in Excel, and someone else said big data doesn't start until your data centres are on three continents. But the general thing that everyone can agree on is that it's when you're working with more data than you can process on a handful of machines, with the kinds of systems and processes that you've been used to for the last 10 or 15 years.
Typically, but not always, it's going to mean a lower cost of storing the data you've got, and typically you're going to be able to scale better. But you're going to have to make some trade-offs up-front, and think a little bit about what's going on before you pick the right solution, otherwise it can still go wrong.
Why are we seeing more and more hands going up when I ask that question? VC funding for big data seems to be at an all-time high. There's a lot of money floating around going into these solutions, which means there are a lot of salesmen calling you up telling you about them who don't necessarily know all that much, because they're quite new to it.
And some of the amounts of money floating around can make those of you who are aware of the early-stage drug evaluation figures look a bit green with envy. But also, with all of that work and all of that time, we're starting to see it move from simple, pure technology building blocks up into business-focused offerings you can almost take off the shelf and make use of.
We're starting to see clearer winners showing up. If you went back three or four years and said "oh, I need a column store", there would be five or six different big data solutions to pick from, and it'd be hard to work out which one was going to survive, because you don't really want to pick the one that's not going to make it. Today we're starting to see a few clear winners coming through.
So it's good to know that there are some clear winners coming, and we can pick the ones that seem to be best of breed rather than having to take a punt and go with something early that may not last. We're also starting to see more support and consultancy available, which is going to be important because most of the people in the room are not technologists.
You don't want to have to be diving down into the code yourselves. You want to be able to find someone out there who's going to be independent and who's going to be able to help you, and that's starting to be available.
Our industry has been grappling with big data in clinical trials for some time, but generally only in drug discovery; the people doing the DNA sequencing have been fighting with big data for 10 years or so.
People doing all the crystal structures and that kind of simulation work have been working with big data for a long time too. But for us, looking at a single trial or a handful of trials, we've been able to get by without big data solutions. Until today, that is: the new trial designs mean you're getting more data for a given trial, and with all the wearables coming in we're getting, or potentially going to be getting, orders of magnitude more data. If we want to start doing more of the interesting cross-trial analysis, again we're adding orders of magnitude more data. The system that used to work when we were dealing with one trial and a couple of endpoints is creaking at the seams; you're having to pay a lot of money, your IT teams are getting increasingly grumpy, and so we're all starting to think about new ways of doing it.
What I can't do, though, is tell you the ideal big data solution for everyone. There is no silver bullet, despite what some of the salespeople may suggest. I can't tell you how to pick a solution; it's a process. And I'm also not going to try to tell you that you should be hiring Quanticate to do it, because that's not what we do. We're just going to try to share a bit of the experience we've had.
A few key things to think about, if you take one slide away from this: there are lots of different kinds of big data systems available, and you're going to have to ask yourself some questions, do some thinking, come up with some answers, and then take the right questions along when you sit down with the vendors, to make sure you're getting the right solution.
So, in terms of the kinds of big data solutions: if you start looking on something like Wikipedia, you're going to see descriptions of the low-level things the computer scientists are keen on, the distributed block stores, the locks, the consensus algorithms. None of those should interest people here; they're dumb, but they solve the problem.
The job scheduling stuff starts to get interesting. Who here's heard of Hadoop, Apache Hadoop? A few hands going up. There are other ones out there as well that take your jobs, schedule them, get them to run near the data, and get the results back. The data tracking and data workflow stuff starts getting interesting.
Being able to set up something where you say: take that terabyte of data over there, run these operations on it, make sure they work, run these visualizations, put the results over here, and let me know when it's ready. Getting all that scheduled to work at big data scale, whereas previously you might have had a SAS program that calls lots of different macros in a row, is a harder problem when you've got several data centres' worth of data, but it's starting to be solved.
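To make that concrete, here is a minimal sketch of the idea in plain Python. The step names and paths are hypothetical, and a real deployment would hand this orchestration to a dedicated workflow engine running across the cluster rather than a single script; the shape is what matters: run each operation, check it worked, and send a notification when the results are ready.

```python
# Illustrative data-workflow sketch: run a chain of processing steps over a
# dataset, track which steps completed, and notify when results are ready.
# All step names and paths below are made up for the example.

def load_raw_data(path):
    # Stand-in for "take that terabyte of data over there".
    print(f"loading {path}")
    return ["record-1", "record-2"]  # placeholder records

def clean(records):
    # Stand-in for the transformation / derivation operations.
    return [r.upper() for r in records]

def visualise(records, out_path):
    # Stand-in for rendering the visualizations of the results.
    print(f"writing {len(records)} results to {out_path}")

def notify(message):
    # Stand-in for "let me know when it's ready".
    print(f"NOTIFY: {message}")

def run_pipeline():
    steps_done = []
    records = load_raw_data("/data/trial_xyz/raw")   # hypothetical path
    steps_done.append("load")
    records = clean(records)
    steps_done.append("clean")
    visualise(records, "/data/trial_xyz/report")     # hypothetical path
    steps_done.append("visualise")
    notify(f"pipeline finished, steps run: {steps_done}")

if __name__ == "__main__":
    run_pipeline()
```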
Obviously for us, things around security, auditability and identity are very important, and we need to start thinking about those and making sure we're picking systems that support them, because a lot of the early big data systems were built around sales data, tracking data, that sort of thing, for industries not like ours.
Amazon, if they're trying to work out what to recommend to you: their constraints and their security model are going to be very different from yours if you're holding all sorts of patient-identifying information. The more interesting broad classes that you'll find if you're digging away on Wikipedia are the column stores, document stores and object databases. These are all different technologies and ways of storing data. What you're going to need to do is start thinking about what kind of data you have, then start looking through this list and seeing which ones fit the kind of data you've got, and that can help you start selecting some vendors.
If you have lots of very rich, complex data sets, it's no good saying "oh, this key-value store is really, really fast, and it's getting excellent rave reviews for how fast it is at retrieving data". If your data is not just a simple key-value pair, if you've got a complex model, the two are not going to meet. Even though one's really fast, it's the wrong system for you, and you shouldn't be picking it.
Where it starts getting interesting is: do we need transactions? For those of you who've worked with relational databases before, transactions are a standard thing that you get.
With a lot of the big data systems you'd use for clinical trials, it's something you can have, or not have. You need to start thinking about eventual consistency. This gets into the computer science end, but: if you put a piece of data in, how quickly can you query it and get that same piece of data back? If you've got 50 servers and you're pushing data into them, it can potentially take a long time for all those servers to agree that that piece of data has gone in.
If you're doing a bulk load and then you're going to query two weeks later, it's probably fine that it takes 20 minutes until all the servers in your cluster and your data centre agree that a piece of data is definitely there, definitely stored. If you're trying to do some real-time monitoring, maybe you've got a medical device that's doing real-time monitoring and looking for an adverse event, then 20 minutes might be too long if you're trying to work out whether you need to stop the thing immediately.
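As an illustration, here is a toy simulation of that behaviour in Python. Everything here is made up for the example; real distributed stores expose this as tunable consistency levels per read or write, rather than a fixed replication delay.

```python
# Toy model of eventual consistency: writes land on one replica immediately
# and propagate to the others only after a replication lag, so a read from
# another replica can miss data that was just written.
import time

class Replica:
    def __init__(self):
        self.store = {}

class EventuallyConsistentCluster:
    """Writes hit replica 0 at once; the rest catch up after `lag` seconds."""

    def __init__(self, replica_count=3, lag=2.0):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.lag = lag
        self.pending = []  # (time the write becomes visible everywhere, key, value)

    def write(self, key, value):
        self.replicas[0].store[key] = value
        self.pending.append((time.time() + self.lag, key, value))

    def _propagate(self):
        # Apply any writes whose replication delay has elapsed.
        now = time.time()
        for entry in [p for p in self.pending if p[0] <= now]:
            _, key, value = entry
            for replica in self.replicas[1:]:
                replica.store[key] = value
            self.pending.remove(entry)

    def read(self, key, replica_index=0):
        self._propagate()
        return self.replicas[replica_index].store.get(key)

cluster = EventuallyConsistentCluster(lag=2.0)
cluster.write("patient-42/heart-rate", 148)
print(cluster.read("patient-42/heart-rate", replica_index=2))  # None: not visible yet
time.sleep(2.1)
print(cluster.read("patient-42/heart-rate", replica_index=2))  # 148: replicas agree
```

For a monthly bulk load the lag is harmless; for real-time adverse-event monitoring, that first `None` is exactly the problem the speaker describes.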
You need to start looking at which parts need a very fast turnaround on the data. Some of the systems are optimized for read performance, some for write performance. If you've got lots and lots of data coming in, maybe you've signed up half of the state who are all going to be using a particular medical device and you're capturing the data, then you need to optimize for lots of data going in and writing very fast. But if you're going to bulk-load all of your data from your data warehouse once a month, and then all the interesting stuff is the queries and visualizations you run, it doesn't really matter how fast you put the data in. Whether you're going to stream process or batch process actually starts being very important in terms of which vendors you select and which solutions you're working with, and again, that tends to come back to: when I start running a query, how quickly do I get the result back?
With the batch processing systems, there's a minimum time until they can give any answer, even for a very simple question. If at the moment you're used to running a SAS program, going off and making a cup of tea, and coming back 20 minutes later (maybe there's an answer, maybe there's not), you might say, hey, a 30-second start-up time is fine. But if you're going to be running lots and lots of queries, with lots of data coming in, and it needs to be doing point-in-time analysis, you might say 30 seconds is too long when it's going to be 30 seconds times a million different queries, and you might pick something that does stream processing and can turn the questions around quickly.
There are a lot of solutions tailored to different problems, and generally the big-name big data systems out there solve a particular problem really well, but they don't necessarily solve your problem really well. What's right for Amazon is probably not right for you. Amazon have a rule that if two conflicting pieces of data come in from their data centres, say you've got two tabs open and you click "add to basket" in two different tabs, then if in doubt they put them both in.
Start explaining to a regulator that you've picked a system that may duplicate some data if it's not sure, and they're probably not going to be very happy with that. So it's no good saying Amazon use it, it's great, and they validated it, because they validated it for a different kind of problem. It's no good just saying "it's best of breed, I'm going to take it" if they're solving different problems, if they've got different business constraints around the availability of the data, the consistency of the data, all that kind of thing.
So, some requirements. It all used to be simple: in the very, very early days of computers and computer storage everything was custom and everything was really hard, and then relational databases came out, and they used SQL, and it was really easy.
You'd have a DBA who knew all about relational data, who would help you model your data into a relational structure. You'd sit down with your requirements, and they'd be about the cost, the scalability, who was going to support it. But you knew that everything you were dealing with was going to be relational and was going to work with SQL.
And it made life easy. Then the NoSQL movement came along. Who's heard of the term NoSQL? Hands up if you have. It's a label used to cover a subset of the big data systems which are not relational and don't use SQL for querying. Pretty much all NoSQL systems are big data, but only some big data systems are NoSQL, because there are other kinds of big data systems, around processing, where you wouldn't be doing the querying.
It's a general label; it got very, very trendy four or five years ago, but since then SQL has been coming back. Not relational databases, but SQL itself as a query language. I was at a conference a few years ago where the architects and authors of the SQL standard, who on the whole have huge luxuriant beards and look like the stereotypical computer scientists, asked questions of some of the big NoSQL vendors. They'd say, "wouldn't it be good if SQL could do this?"
And the big data vendor would be like, yes, yes, it would be really good if it did that. And they'd go, "Bob, didn't you put that in the standard 15 years ago?" And another guy would go, "yeah, I think I did." Everyone had got so confused thinking that SQL and relational had to go together, had to be the same, that they stopped thinking about the other aspects of SQL, and stopped thinking about the fact that we have generations of business analysts and data managers who understand SQL, know how to write SQL, know how to query their data using SQL. And now you're getting the big data vendors putting SQL support for querying back in, because that's what all your business intelligence tools use, that's what your visualization tools taught you to use.
That's what you yourselves all understand and use. So we're starting to see SQL coming back in, and it starts being important when you're picking a big data solution to be able to say: I've got this tool here, it works with Oracle, it works with Rave, it works with all of these through SQL; when I put in my big data system, I still want to have SQL.
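As a small illustration of what that looks like in practice, here is a sketch using Apache Spark's Python API, one of the big data engines that added SQL querying back. It assumes PySpark is installed, and the file name and column names are hypothetical; the point is that the query itself is the plain SQL your analysts and existing tools already understand.

```python
# SQL-on-big-data sketch: load a (hypothetical) lab results extract into
# Spark, register it as a table, and query it with ordinary SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lab-analysis").getOrCreate()

# Read the extract; Spark distributes the work across the cluster.
labs = (spark.read
        .option("header", True)
        .option("inferSchema", True)
        .csv("lab_results.csv"))  # hypothetical file

# Register it so it can be queried exactly like a relational table.
labs.createOrReplaceTempView("lab_results")

# Familiar SQL, running over a distributed big data engine.
summary = spark.sql("""
    SELECT subject_id, test_code, AVG(result_value) AS mean_result
    FROM lab_results
    GROUP BY subject_id, test_code
""")
summary.show()
```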
That may not be the case for all of you, but for a lot of you, keeping that compatibility with your existing tools and all your existing knowledge becomes really important. Because if you try to change too much, your organization won't go with you, the culture won't go with you, and it's all well and good having five racks with the world's best big data system on them, but if none of your tools work with it and none of your staff are willing to try it, what was the point, what did you spend the money on?
So, in terms of our requirements, as I've alluded to before, we need to think about how often we're loading the data in, and how much. The volumes that the big data centres cope with are a lot larger, but there are certain fundamental physical limits: the speed of light, the amount of data you can fit on a disc. It's all very well saying we've got this big data system in Europe with twenty racks of machines and we want to query it from our staff in America, but when you start working out how much data you're going to have to shift around, the loading starts becoming important. I've heard of a few people who've ended up having to charter planes, put servers on the planes, and fly them to the other side of the world to do their queries, because they had all the data captured in one place.
And that was all great. And then when they wanted to query it in another place, they couldn't buy the bandwidth to move the data and query it, and ended up just copying it onto hard disks and flying it to the other side of the world to query it there. That's probably going to put a bit of a dent in your operations budget.
You're going to need to think about these things up-front; it's no good just saying "well, we will load in France, and then we will query in America" and then finding that doesn't work. So the volume does matter, and so does the kind of querying you're going to do. Are you going to be doing real-time queries? Are you going to be querying small subsets, or large amounts?
Some of the systems are really good at picking through and getting tiny fractions of the set, the interesting data, out; some of them are really good at streaming the whole set of data through and calculating running averages on it. If you pick a system that's optimized for picking out small interesting bits, and then you try to run a moving average across the whole data set, it's not going to be a good fit.
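Here is a small Python sketch contrasting the two access patterns, using made-up readings. The point-lookup style answers its question with one cheap indexed fetch, while the moving average forces every record through, which is what the streaming-oriented systems are built for.

```python
# Two access patterns over the same (synthetic) data: an indexed point lookup
# versus a full-scan moving average. Data and key names are invented.
from collections import deque

readings = [(f"subject-{i}", float(60 + i % 40)) for i in range(100_000)]
by_key = dict(readings)  # index for point lookups

# Pattern 1: pick a tiny, interesting fraction out of the set.
print(by_key["subject-12345"])  # one cheap indexed lookup

# Pattern 2: stream the whole set through, keeping a windowed running average.
def moving_average(values, window=1000):
    buf, total = deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()
        yield total / len(buf)

last = None
for avg in moving_average(v for _, v in readings):
    last = avg  # every record is touched; no index can avoid the full scan
print(last)
```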
We have to think a little bit about availability: if the system goes down, is that the end of the world, and how much support are you going to get from IT to bring it back up? There are some systems that are self-healing, where you can add another machine into the cluster and it will join and start sharing the workload, but they can be a bit more complex to manage, and your IT team may get a bit scared, worrying that Skynet's coming, when they add another machine and it magically comes online and starts helping out with the workload.
The key bit for us, though, is reproducibility and data integrity, and it's going to be different from some of the other bits of your business. The people doing crystal structures may have a different view on how much of the data still needs to be there, how much you can afford to lose, how fuzzy it can be. Whereas if the patient vitals we're holding change, we're all in big trouble. So you maybe can't just piggyback on the existing system your company has put in place, because DNA and crystal structures are going to be very different to lab vitals.
Another thing you're going to need to consider is how similar all of your data is. It used to be, with SQL, that you'd just model a relational structure for the blood work and another relational structure for the patients' ages, and it was all relational and that was fine. Today that's not the case. And then there's consistency: how similar is your data between trials?
You're not going to put in a different big data system for every trial you run; you want the same kind of system for all the trials. So if your data is in different structures each time, if it's not coded the same, you're going to need to do a lot of post-processing, and you're going to need a system that supports that processing and filtering before you get the answers out.
Whereas if you're going to do all that coding and so on before you load the data in, maybe you can have a simpler system. But we are going to have to accept that our data is not the same between every trial. For us as a CRO, we have to accept that the data is completely different between each of the different sponsors in the room, that it's going to have to get mapped together, and that we need to think about where we're going to do that mapping, where it makes the most sense; and that also depends on the volume and the changes happening to it.
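To illustrate the kind of mapping step involved, here is a minimal Python sketch. The sponsors, field names and code lists are all invented for the example: two sources code the same measurement differently, and each record has to be normalized into one common shape before it lands in a shared system.

```python
# Normalizing differently-coded source data into one common schema.
# Sponsor A codes sex as M/F in a field called "sex";
# Sponsor B codes it as 1/2 in a field called "gender_cd".
# All names and codes here are hypothetical.
SPONSOR_MAPPINGS = {
    "sponsor_a": {"field": "sex",       "codes": {"M": "MALE", "F": "FEMALE"}},
    "sponsor_b": {"field": "gender_cd", "codes": {"1": "MALE", "2": "FEMALE"}},
}

def normalize(record, sponsor):
    """Map one raw record into the common schema used across all trials."""
    mapping = SPONSOR_MAPPINGS[sponsor]
    raw = str(record[mapping["field"]])
    return {
        "subject_id": record["subject_id"],
        "sex": mapping["codes"].get(raw, "UNKNOWN"),  # flag unmapped codes
    }

print(normalize({"subject_id": "A-001", "sex": "F"}, "sponsor_a"))
print(normalize({"subject_id": "B-017", "gender_cd": 1}, "sponsor_b"))
# -> {'subject_id': 'A-001', 'sex': 'FEMALE'}
# -> {'subject_id': 'B-017', 'sex': 'MALE'}
```

Whether this mapping runs before the load or inside the big data system is exactly the "where it makes the most sense" decision described above.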
Structured versus unstructured also gets important. We heard a little bit yesterday about doing reporting on tweets and things like that. A system that's going to be storing tweets, photos and traditional lab data is a very different kind of system to pick from one where you say all we're interested in is doing lab data analysis really, really fast.
The system we've picked at Quanticate not only solves the problem we've got today, but also solves three of the problems we think we're likely to have in a year's time; we hope, at least, if we've got the requirements right.
But we've had to look forward when making the pick, because the best system for the problem we've got right now is not the best one for how we think things are going to change. And IT support is a key one. You're going to have to get these systems installed; they're going to have to be in a data centre somewhere, maybe yours, maybe one you've rented, and they're going to have to get set up.
A lot of the big data systems are based around Linux. If you've got an IT team that's mostly based around Windows, they might get a bit worried, and maybe you even need to pick a big data system that works just as well on Windows, because that way your IT team are going to buy in and it will all work. We have to think a little about the people as well as the technology.
All the big data solutions for clinical trials that you'll be thinking about will have been tested before they were released. As we hopefully know, tested is not the same as validated as a clinical system. The FDA have not certified any big data solutions, and they have no interest in certifying them. As I understand it, they are certifying a few themselves for internal use, but they're not saying "Jim's Hadoop 4.7 is FDA approved"; they just might happen to use it internally.
So you're going to have to find a vendor that's going to help you do the CSV on this, and you need to find a vendor who understands our industry. I'm sure there are some people in the room who've had issues with vendors before, who said "yeah, we've done validation for banking, we'll help you with your clinical systems validation", and they come in, they set it up, they give you the paperwork, and you go: did you read our SOP on validation? Because this is no match at all. And they're like: but it's validation, you said you wanted validation, we validated it.
It's all good, isn't it?
No.
So you need to work with a vendor who understands our industry, and you need to be aware that a lot of the people who've already adopted these systems and wanted validation tend to be in the banking areas, which have different rules about validation; what a salesman may think he is selling you and what you need to buy are not the same thing.
So yeah, we're special. Some questions for the vendors: ask them how they're going to solve the problems you've got today, and then ask them again about the problems you think are coming up, and make sure you're picking a solution that's going to work for both. They've also got to answer on validation: how are you going to validate it?
You need to ask them how it's going to work with the structured data you've got today, which is hopefully easy enough, and then you've got to ask them how it's going to cope with the unstructured data that we're increasingly saving. You need to ask the questions about how they're going to work with your IT team. And an important one: if the VC money turns off tomorrow, if they don't get the next funding round, who is going to support the system if that particular vendor goes under?
Most of the big data systems are built on open source, so the technology is still going to be there, it's still going to be open. But if you've picked a vendor that seems perfect for the particular solution you want, and they're the only vendor for that solution in the whole of the US while everyone else is in Europe, then if that vendor suddenly pivots into a new area or doesn't get the next funding round, suddenly everyone who can support you is on the other side of the Atlantic, and that's potentially going to be problematic.
Whereas if you pick a solution where there are three different vendors working on that same open-source system in the same town as you, well, you know you can lose a vendor and you've not lost your whole system.
We have a couple of minutes left. So, for those of you who are interested in papers, and especially in computer science papers: these have unfortunately fallen off the end, but the slides are going to be distributed later and you can have a read of them.
If you're interested in data storage and data storage technologies, some of these papers can be interesting. You maybe won't understand all of them, but it can be good to think about the challenges their authors were facing when they came up with these systems. Make sure that the goal they had to start with matches the goal you have now.
If you want to know a little bit more, here are some conferences coming up that would help. ApacheCon is taking place in Austin, Texas in a few weeks’ time. Most of the key open-source, big data solutions are based on projects at the Apache Software Foundation so most of them are being discussed there.
For any Europeans in the room, if you don't fancy trekking to Texas, Berlin Buzzwords in Germany is going to be a really, really good conference this year, all about big data and increasingly it's about the business solutions built on top of it, rather than five years ago when it was about the underlying technologies.
The Strata + Hadoop World events are dotted around the place; they tend to be very good, with some very good talks happening there, and there are lots of them. And finally, if anyone wants to talk some more about this, since we're out of time, we're going to have a big data round table at lunch. We'll put up a sign on the table saying "big data", and if you're interested, come and sit around at lunch and we'll have a chat about it.
So, in your view and based on your experience, what's the biggest value that big data has already delivered to the healthcare industry?
To the healthcare industry? It's a tough one; it's probably the DNA sequencing work. It's not directly in our area, but it is within our industry. The ability to sequence DNA at the chemical level was a big challenge, but as soon as you can sequence more than one person's DNA, you have a big issue with storing it and doing the analysis.
And some of the work going on finding particular DNA sequences, certain genes, that are going to have a big impact on drugs and drug development: I think that's probably the biggest win so far for our industry from big data. But now it's moving up the value stack, and we're starting to see it coming to other areas as well, and hopefully delivering similarly big improvements for us all.
Questions from the audience?
You mentioned earlier how fast data goes into databases, and so on. In my view, how fast data needs to go in depends on how important it is to do real-time analysis.
So a lot of the retrospective analysis that people want to do, it doesn't really matter how fresh the data is.
Yeah.
So can you comment a little bit on that differential, and how big data solutions can enable one versus the other?
If you're doing the sort of looking-back, retrospective analysis, you can take one of the big data warehouse systems, which will just do a bulk load, and you say: right, here's all the data, here are all the import rules, start, send me an email when it's done.
It might take an hour; it might take the weekend, that's fine. Just put it all in in one go and it's nice and simple.
If you've got data coming in from, say, wearables, where they're feeding the data in every second, you don't want to be buffering that up in one system and then doing a weekend load into the next system to process it, because that's all added complexity; you then also have to do the integrity checking and make sure you've not lost any data in the process. If you can come up with a system that can handle it in real time and accept it in real time into the validated system, that's going to be a lot easier.
But if all you're interested in is loading in the data once a month when the data dump comes in, looking at it, and doing the endpoint analysis and the outlier analysis, you wouldn't necessarily want the complexity of a system that's able to accept data from 50,000 wearables once a second.
Okay. Thank you.
Questions from the audience? Anybody thought of something?
I'm just wondering, how do you connect or correlate the big data with the real-world data? Is it kind of complementing each other, or kind of standalone?
I think, increasingly, what we'll see is that where you used to have a single relational database that you're querying, the kind of thing you're used to today, you'll be moving that into a big data system and running the same analysis there, on the data you're already capturing, alongside the new classes of data that we're also starting to see come in.
I think in some of the early deployments you might run both of them in parallel, so that you can validate them and reassure everyone. But I think, increasingly, you'll be throwing away that really, really expensive Oracle relational database you've got and putting everything into the big data system along with all the new things, rather than trying to run them both in parallel, keeping all the servers for both going and paying two sets of expensive licence fees. I think we will see everyone moving to put everything in big data, including the old stuff.
So, unless there are any other burning questions, let’s thank Nick Burch and move to the next talk.
Thank you.