Today, Big Data is one of the hot topics within almost every industry, especially in clinical trials. May saw Berlin Buzzwords, the biggest ever European conference for technologists on the subject, while the likes of O'Reilly's Strata conference pull in huge numbers of attendees keen to learn how to adapt to this new world.
Despite all of this interest, a great deal of confusion remains around Big Data. Not only are there never-ending debates about what Big Data actually is, there's also a huge range of possible Big Data solutions out there to choose from, only a few of which will be appropriate for any given situation or problem. If you speak to some technologists from big Silicon Valley firms, they'll swear blind that you don't have Big Data until you have entire data centers on three continents. Many of the recent big VC-backed Big Data splashes have been targeting the "few racks" sized problem space. Hang out at the right tech events, and you'll see various groups demoing their Big Data solutions on collections of machines that'll fit comfortably in a shoebox. At least one MBA student has been heard declaring that they have a Big Data problem, as their data won't fit in Excel...
On the other hand, there's a small backlash against the Big Data movement, with some explicitly saying they have a Small Data problem, and being proud of it. Many in this camp stress the importance of being able to process everything on one machine, ensuring that processing is available to all, and not just to those with large budgets. That can be countered, though, through the use of on-demand cloud systems from the likes of Amazon, which allow anyone with a few dollars to spare on their credit card to spin up their own temporary Big Data system for an hour to do their processing.
Where does this leave us mere mortals, though, starting out on our use of Big Data in clinical trials? When we look at potential solutions, potential systems, potential frameworks, how do we know if they are right for us? When the suave salesperson from one of the Big Data startups phones, or worse, one from a large and expensive legacy provider, how do we know that what they're pitching is of the right scale for our needs? After all, something that scales to a handful of machines won't work when holding medical information for a whole country, while another that works best with data centers on three continents will be expensive overkill for those with low tens of terabytes of data to process. Both problems are Big Data, but what's right for one won't suit the other.
In the past, estimating how much data would be in your clinical trial was easy, and the number was small. Today, the number is much higher, sometimes into Big Data sizes, and it varies more. One must consider questions like: how many patients are being enrolled onto the trial, how many visits are scheduled in the protocol, how long does the trial last, and by what methods is the data being captured? Increasingly, we see trials making use of new technologies, such as medical devices which generate subject readings every minute, or perhaps even every second. The difference in data volume between a 12-lead ECG once a week and a 2-lead ECG every second (plus 12-lead readings for calibration) is very dramatic! The introduction of changes like this into the study design can suddenly catapult a trial's data from "small" into Big Data in one step.
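To make that contrast concrete, here's a rough back-of-envelope calculation (a sketch in Python, with illustrative assumptions: one value per lead per capture, per-second sampling, and no allowance for metadata or compression):

```python
# Back-of-envelope ECG data volumes per subject per week.
# All figures are illustrative assumptions, not from any real protocol.

SECONDS_PER_WEEK = 60 * 60 * 24 * 7  # 604,800

# One 12-lead ECG per week: one value per lead.
weekly_12_lead = 12

# Continuous 2-lead ECG sampled once per second,
# plus a weekly 12-lead reading for calibration.
continuous_2_lead = 2 * SECONDS_PER_WEEK + 12

print(f"12-lead weekly:      {weekly_12_lead:>9,} readings/subject/week")
print(f"2-lead every second: {continuous_2_lead:>9,} readings/subject/week")
print(f"Growth factor:       {continuous_2_lead / weekly_12_lead:,.0f}x")
```

Even counting just a single value per lead, that's roughly five orders of magnitude more readings per subject per week.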
Considering such a range of situations, we have to ask ourselves: despite the hype, the VC funding, the marketing, the buzz, is the use of a single label to cover the whole space becoming a problem? Is the term "Big Data" as a catch-all still useful?
At this point, let us allow ourselves a brief diversion. Where else have we come across multiple different words for “Big”? For those living in places well served by a certain international coffee chain, the answer is every morning! For those either off the beaten track, or in a town with a strong independent coffee scene, the answer may be more elusive. Either way, what can we learn from the use of “Tall”, “Grande” or “Venti” as different measures of big?
Well, sticking with coffee, in many places no one likes to order a "small." Just as few people wish to stand up in a senior management meeting and say "actually, we don't have Big Data after all," so there's a certain reluctance to ordering a "small" coffee. People still want different sizes, but the naming is important!
On the academic side, Google have released a number of seminal papers on Big Data. Whether we're considering their paper on MapReduce, which led Doug Cutting and friends to re-architect Nutch along those lines (work which eventually grew into Hadoop), or their more recent paper which relies on known error bars to allow provably correct distributed handling of "what happened first," we see great leaps forward. The computer scientist in me is excited by the prospect of what can be done, and the elegance of what is possible. (If you're at all technically interested in distributed systems, the papers from the likes of Google and Amazon will give you hours of intellectual joy; go read them now!) The pragmatist in me wants to know how we can solve last Tuesday's customer issue without committing to another rack in the data center. While some solve globally distributed problems, many of us face short-term multi-machine problems. Many of us foresee larger challenges ahead, but not ones that are many orders of magnitude bigger. Talk to a VC over a beer, or to certain researchers, and you will hear of the huge Big Data challenges that exist out there, and the innovative giant projects that help solve them. Compared to what many of us face, it seems a different world. Yet all of us sit within the "Big Data" space. Faced with these divergent needs, can we really say we are all "Big Data"?
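For a flavour of why the MapReduce paper proved so influential, here is a minimal, single-machine sketch of the map/shuffle/reduce pattern it describes (illustrative Python only; a real framework such as Hadoop distributes each phase across many machines and handles failures along the way):

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs; here, a count of 1 per word in a line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between the two phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Collapse all the values for one key into a single result.
    return (key, sum(values))

records = ["big data big hype", "big data small data"]
pairs = (pair for record in records for pair in map_phase(record))
counts = [reduce_phase(key, values) for key, values in shuffle(pairs).items()]
print(sorted(counts))  # [('big', 3), ('data', 3), ('hype', 1), ('small', 1)]
```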
Given this range, how come one term has tended to stick? How much can be explained by the desire not to have a "small" problem (just as many prefer not to order a "small" coffee), and how much can be pinned on people's desire to follow the buzzword and marketing effects of "Big Data"? Another challenge we face is fluidity, as new systems and products are developed. If you look at talks from Big Data events of 2-3 years ago, it's striking how much bespoke functionality and hard coding they describe, which is now available as standard in the latest tools. (Not everyone is as quick to update, though. I recently received information from a big vendor about their solution, which fixes an issue with Hadoop that Apache Hadoop itself fixed a year ago!) In some areas, what's hard or big today won't be next year, while other challenges remain. A new release might make enforcing security permissions easier, or allow new statistics to be run as standard, but the speed of light remains constant!
Despite this, a problem remains: how can someone new to this field work out which kinds of Big Data problems they have, and identify the right kinds of solutions? Plenty of companies (large and small) claim they have what you need, but how can you check before handing over large sums or spending lots of time? The boring and un-sexy answer is, in part, Requirements: a clear identification of what your problem is and what is needed. If anything, the growth of Big Data has made the up-front gathering of requirements more important, not less - sorry about that. You need to think about where your source data is, and what form it is in. Consider how spread out it is, and how easy it is to run the data processing / analysis near to where it lives. Think about whether you can work directly on the source data, or whether it needs pre-processing. Think about how fast the data is growing, and how quickly you need to include new data in the results. Work out the complexity of your calculations, where the outputs will go, and to what use they'll be put. Decide whether 10 machines for 10 minutes, with the complexity that brings, is better or worse than two machines and simplicity for an hour - a trade-off sketched below. As best you can, identify the problem, then pick the solution, not the other way around!
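As a crude illustration of that last trade-off, here's a sketch with hypothetical numbers (the hourly rate and run times are assumptions, not quotes; plug in your own requirements and prices):

```python
# Compare two ways of running the same job - purely illustrative figures.

HOURLY_RATE = 0.50  # assumed on-demand price per machine-hour (USD)

options = [
    # (description, machines, wall-clock minutes)
    ("10 machines, complex distributed job", 10, 10),
    ("2 machines, simple job", 2, 60),
]

for name, machines, minutes in options:
    machine_hours = machines * minutes / 60
    cost = machine_hours * HOURLY_RATE
    print(f"{name}: {minutes} min wall-clock, "
          f"{machine_hours:.1f} machine-hours, ${cost:.2f}")
```

On these made-up numbers the bigger cluster is both faster and slightly cheaper, but factor in the engineering cost of the extra complexity and the answer may flip - exactly the kind of question your requirements should settle.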
Once you have gathered your requirements, can you then group yourself into a "kind" of Big Data, to help you in your search for the right solution? Could you, in effect, say, "I've got 1TB of new data each day to summarize, so I've got a Grande Data problem"? Can we, as an industry, agree on classifications for the different kinds of Big Data? I'm not sure, but I do know that I need two different kinds of Big Data system to solve the challenges we face at Quanticate, whatever they're called!