
The Data Science of IT and Computer Science is Not the Only Data Science


Originally published in Predictive Analytics Times (www.PATimes.com) under the same title, 02/02/2018: https://www.predictiveanalyticsworld.com/patimes/data-science-computer-science-not-data-science/9261/

For reasons perhaps having to do with “siloed thinking” or “not-invented-here” syndrome (things that can affect all teams from time to time), many participants on the Big Data Management (BDM) and Big Data Analysis (BDA) scene have become convinced that there is only one kind of Data Science: that which is run in their corporate or academic shops. You know. Data Science. The only Data Science.

This misapprehension has given rise to confusion, even among those of us who call ourselves data scientists, about what to call ourselves. We learned about data and analysis in a graduate program (MSc or PhD, or both) in physics or economics or biostatistics or Ops Research, or…some other STEM, or life, or historical, or social science (the list goes on…). Some of us have recently ended up calling ourselves data scientists because that was the direction market forces were pushing the people who make up names for occupations. Can’t get tagged on a recruiter’s search engine if you don’t have “data scientist” on your resume, right? The main effect of this positive feedback loop in the labor market is to broaden the definitions of data science developed over the last decade.

Others may quail at the thought of all the time wasted—again—debating what to best call ourselves, but I think the current flux is a good thing. There are just so many kinds of data sciences and scientists. That’s the main point. But you say, again, what about Data Science, you know, where all the CS and IT nerds are? The only Data Science?

Well, it’s one branch of The Data Sciences (plural and capitalized, for the sake of argument). It could be called, quite correctly, Computer Science (CS)-IT Data Science.

But it’s not the only Data Science. I argue that understanding the impact of data and analysis over the last 25 years demands that we broaden our focus from the branch of data science in which CS and IT thinking, training, and experience dominate the praxis to the entire domain of data and the myriad sciences that use computerized tools to analyze it.

From this point of view, there are hundreds of Data Sciences. These are systematic investigative efforts that collect and analyze data to solve problems for society and in the service of advancing an empirical and theoretical mission. (And I digress, but not by much; you need as much scientific methodology as you can possibly afford. That means properly trained people. Otherwise, you are throwing money away.)

Under the Data Sciences, then, we have, for example, Biological and Biostatistical Data Sciences, Demo-Geographic Data Sciences, Sociological Data Sciences (sociometrics), Psychological (psychometric) Data Sciences, Statistical (mathematical) Data Sciences, even Historical Data Sciences (anthropology, paleontology, archaeology, “natural history” or evolutionary biology). And so on, in growing numbers, and in no particular order or hierarchy.

These are but a few examples. I didn’t mention econometrics, an applied statistical-disciplinary bias of my own. In full disclosure, I do so mention it. And I’ll stop capitalizing now that the point is made.

This is a proposal for one way to sort out some of the occupational confusion I referred to earlier. The reader should understand that the categories raised are map boundaries, not the kind of 20th century scientific disciplinary barriers that are often so unhelpful. You can think of them as heuristic lines of demarcation we use to better understand the trajectories of 21st century sciences.

And CS-IT data science’s status is unchanged. It is right in the middle of everything hot that is tech. That means many elements of it will remain indispensable. The most obvious example is its integral involvement in AI. In a TEDx talk, I heard that AI won’t run without Big Data, and Big Data is not interpretable without AI. So Big Data will have to be muscled around on both ends by the CS-IT data scientists. And they will always provide the computational engines for Data Scientists from any given discipline, i.e., advancing data management and analytic technology in the service of storing, maintaining, and retrieving data that has been observationally or experimentally collected for some purpose (or none at all).

But, again, it is not the only Data Science. Clearly, it is one of many.

We need—all of us who work with data—to better understand that the rubric encompasses an often-bewildering thicket of beyond-rapid technological changes in science, industry, labor markets, and ultimately, the behavior of consumer markets themselves. This line of thinking moves us toward a consistent nomenclature for what each of us does in data science, now and in the future.


Data Reliability and Validity, Redux: Do Your CIO and Data Curators Really Understand the Concepts?


Originally published in Predictive Analytics Times (www.PATimes.com) under the same title, 09/07/2018: https://www.predictiveanalyticsworld.com/patimes/data-reliability-and-validity-redux-do-your-cio-and-data-curators-really-understand-the-concepts/9684/

Here are two recent entries on the big but neglected issue of data reliability and analytic validity (DR&AV), from the vast commentariat that is LinkedIn:

One of my complaints with #bigdata is there isn’t enough focus on getting the right kind of data. We deal with this in healthcare all the time. Much of our transactional data is just outcomes. There’s a push in the industry to change health behaviors (generally: population health). But if we’re not collecting meaningful behavioral data (could be secondary data like credit card purchases or primary like surveys about health attitudes), we can’t determine what behaviors are driving the outcomes! — Biostatistical Data Scientist, LinkedIn commenter, June 2016.

A potential problem is that people know less and less as to how to conduct surveys well. Conducting a survey is easier than ever, but the same technologies that make surveys easier are also making response bias easier to creep into the results as well. I suspect that we are headed to a disaster of Literary Digest proportions, for many of the same reasons. Of course, the data we have is very huge. But, at least for the problem that we want to analyze, the data is all wrong. Yet, there seems to be a big resistance to cleverly trying to address these problems instead of worshipping blindly at the altar of technology. —Sociological Data Scientist, LinkedIn commenter, September 2017.

Note that neither of these commenters mentions the words reliability or validity. But that’s what they’re talking about.

In the first, the biostatistical data scientist asks a basic question about data validity, i.e., the absence of “meaningful behavioral data” relevant to answering questions about what is “driving the outcomes.” Apparently, they have a lot of data, but not the right kind for their purposes. In effect, she is saying that all their non-behavioral data is invalid because it does not measure what they need to know. From that standpoint, she might as well not have any data at all.

This is a teachable moment for those who insist that Big Data solves all problems. Her example shows you can have all the data you can possibly hoover up into a warehouse or a lake, or enough to fill the Marianas Trench, and have it all be invalid. This means, of course, invalid for the purposes of any one team, or any number of teams, of analysts and data scientists looking for answers to questions posed prior to collecting data and commencing analytics. These were specific questions they wanted to ask of the data, in terms of variables to investigate, that apparently were not communicated to those responsible for data collection, selection, or construction. (Another reason why Statistical Data Scientists, along with the CS-IT Data Scientists in their Data Manager role, should be in the lead with specifications for data needs or requirements for analytics, before data collection. But that’s another blog post.)

The second comment addresses both issues of data reliability and validity stemming generally from watered-down skill sets and lowered levels of theoretical (mathematical and logical) understanding of survey research, among survey researchers themselves. He states that it is much easier to conduct a survey now than ever before, in terms of a one-day, one-question pop-up. But to conduct a survey properly, so that data is not rendered unreliable, and findings invalidated by response or other types of bias, is and has remained a painstaking process. (See “Bias in Survey Sampling.”)

I’ll put it this way: my inspection of Survey Monkey did not show a capacity for tests of reliability and validity on the data collected, nor item analysis, nor other diagnostics that exist for survey data. (Maybe the capacity is there, in which case I stand corrected.) And the second commenter raises the example of the ultimate triumph of bad (unreliable) data from survey research, the Literary Digest scandal in 1936, and suggests that because we are not paying attention to the basics of DR&AV, we are heading in the same direction now.

This leads me to think in terms of the reliability of a measuring instrument, e.g., a questionnaire (survey instrument) administered to gauge employees’ job satisfaction. This is an evergreen example. But some statistical data scientists (statisticians, and their more applied sisters and brothers in economics and the social sciences) themselves do not necessarily appreciate that reliability applies not just to the numbers that result from the measurement. It applies to the way in which a survey question that produces the numbers is worded or phrased.

The wording of the questions must be as unambiguous as possible, or it will trash the survey. Suppose you are taking a survey of whether employees wear jeans to work (yes or no), and what “kind” of jeans, as part of a marketing study. Responses are limited to a set of multiple choices. But the word “kind” can refer to a brand of jeans (Levi’s, Lee’s, etc.), or a style of jeans—skinny, boot cut, relaxed fit, and so on. You want to know what style, but you ask respondents to specify, again, the kind of jeans, and give them a choice among several brands.

Your own confusion about what you say you want, and what you will get from the survey, enters the heads of some respondents. Some of them think “kind” means style, like you, and are puzzled that they are given brand names from which to choose. Some of them think it means brand names, and are just fine specifying Levi’s or Lee’s or another brand. But if we were to poll the respondents on that question, asking whether they thought “kind” meant style or brand, we would get varying responses reflecting what they believed they were being asked, rather than, necessarily, what you wanted to know. And you were not sure yourself. The result? Increased noise in the survey data, and less reliability.
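To make that noise concrete, here is a minimal simulation sketch in Python (my own hypothetical numbers, not data from any real survey). It assumes some respondents read “kind” as brand rather than style, possibly flipping interpretations from one administration to the next, and it checks how often the same person gives the same answer twice:

```python
import numpy as np

rng = np.random.default_rng(0)

def test_retest_agreement(n_respondents=1000, p_misread=0.0):
    """Two administrations of the ambiguous 'kind of jeans' question.

    Each respondent has a stable preferred style (coded 0-3). With probability
    p_misread they interpret 'kind' as *brand* rather than *style* on a given
    administration and answer from an unrelated brand preference instead.
    Returns the share of respondents giving the same answer both times.
    """
    true_style = rng.integers(0, 4, n_respondents)
    answers = []
    for _ in range(2):  # ask the same question twice
        misread = rng.random(n_respondents) < p_misread
        unrelated_brand = rng.integers(0, 4, n_respondents)
        answers.append(np.where(misread, unrelated_brand, true_style))
    return np.mean(answers[0] == answers[1])

print("Unambiguous wording:", test_retest_agreement(p_misread=0.0))  # ~1.00
print("Ambiguous wording:  ", test_retest_agreement(p_misread=0.5))  # ~0.44
```

The particular numbers don’t matter; the point is that ambiguity in the wording shows up directly as lost repeatability.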

Another example illustrates how to measure data reliability in a different context and with a different tool. But the principles are the same: consistency and repeatability, with the former implying the latter, are reliability’s defining characteristics. In the employ of a large defense contractor a few years ago, I sat in conference at one of our sites in Northern Virginia with 35 engineers of various stripes doing preliminary requirements specifications for a proposed weapons system. We spent a full day brainstorming what those requirements would be.

At the end of the first day, I and an assistant scripted a questionnaire designed to capture data on the consistency and repeatability of the participants’ understandings of requirements they identified and named as critical to a successful system. We did not have time or space on a short survey instrument to ask each person to state what was meant by the terminology in Requirement 1, Requirement 2, and so on. But we got to the question of internal consistency by lowering the information requirements of the survey, and going through a logical back door: we asked the respondents to rank their top 20 requirements by importance (1 = highest importance, 20 = lowest).

This isn’t to say we expected each respondent to rank the requirements identically. That happens rarely, if ever. And that wasn’t the point. If there were consistent and repeatable understandings, i.e., reliable understandings, respondents, more or less, would all be ranking the same list. Said another way, everyone responding to the survey would be ranking the same definitions of each requirement. To test for this, I used a statistic known as Cronbach’s Alpha (α), which correlates the ranking of each requirement with every other one and averages the correlations. α is bounded by 0 and 1. It’s a correlation measure for all the data collected by the questionnaire or survey. In this instance, if the same lists were being ranked, and thus the same requirement definitions, α would approach 0.5 or better, telling us that about 50 percent or more of the time that was the case.
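For the curious, here is a minimal sketch of the computation in Python, with made-up numbers. It uses the standard variance-based formula for α rather than reproducing the exact averaging procedure described above, and it illustrates with importance-type scores rather than strict rankings (a forced ranking makes every respondent’s total identical, which breaks this particular formula):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) matrix of scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items (requirements)
    item_vars = scores.var(axis=0, ddof=1)       # each item's variance across respondents
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of each respondent's total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: five respondents scoring four requirements for importance.
scores = [
    [5, 4, 4, 2],
    [4, 4, 3, 2],
    [5, 5, 4, 3],
    [3, 3, 2, 1],
    [4, 5, 3, 2],
]
print(round(cronbach_alpha(scores), 2))  # ~0.94 for this internally consistent, made-up data
```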

In our tests, α averaged 0.15, indicating an absence of consistency and repeatability in participants’ terminological understandings of each requirement. To me, the findings indicated strongly that the following kind of situation likely prevailed in the data: for Requirement 1, Respondent 1 had one understanding, Respondent 2 another, and Respondent 3 the same as Respondent 1, but different from Respondent 4, and so on. In short, because the respondents had different understandings of each requirement, most of the time each person was ranking a different list. This generated much noise and not much signal, in an effort that demanded every engineer associated with the project be on the same page throughout its execution.

The engineers had to define and refine their terms to eliminate the ambiguity in the wording or phrasing of requirements terminology. It was crucial that every engineer knew they were all talking about the same thing when they discussed a given specification. This is data reliability in a different context, and a critical step that should occur in any organizational effort that analyzes data: at the beginning.

In case you lost track, that’s my fault. So, one more time: the point is that measures fail reliability tests as their reliability coefficient R_XX (I discussed this back in July of this year, in “Data Reliability and Analytic Validity for Non-Dummies”), their Cronbach’s Alpha score, or other diagnostic statistics fall. And the lower the reliability, the lower the correlation with any other measure in the dataset. Low correlations mean that for any two or more variables (all considered pairwise), fewer data points or observations move positively or negatively in tandem. And false positives (Type I error), or in this case, false negatives (Type II error), result from degraded consistency and repeatability in the measurement, masking more reliable relationships between variables that are better correlated.
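One way to see that masking is the classic attenuation result: the observed correlation between two measures shrinks toward zero as their reliability falls, roughly as r_observed ≈ r_true × √(R_XX · R_YY). A minimal simulation sketch in Python (hypothetical numbers, not from any dataset discussed here) makes the point:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Two underlying traits with a true correlation of about 0.6.
true_x = rng.normal(size=n)
true_y = 0.6 * true_x + np.sqrt(1 - 0.6**2) * rng.normal(size=n)

def observe(true_scores, reliability):
    """Add measurement error so the observed score has the given reliability
    (reliability = true-score variance / observed-score variance)."""
    error_var = (1 - reliability) / reliability  # true-score variance is 1 here
    return true_scores + rng.normal(scale=np.sqrt(error_var), size=true_scores.size)

for rxx in (1.0, 0.8, 0.5, 0.2):
    x, y = observe(true_x, rxx), observe(true_y, rxx)
    print(f"reliability {rxx:.1f}: observed correlation ~ {np.corrcoef(x, y)[0, 1]:.2f}")
# The observed correlation shrinks from about 0.6 toward about 0.1 as reliability
# falls, even though the underlying relationship never changed.
```

Unreliable measurement doesn’t just add noise; it hides relationships that are really there.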

It’s a hot mess, isn’t it? Go ask your CIO and your Data Curators: what do you know about all this? How reliable is our data? And is it valid with respect to the needs of analysts and the various stripes of data scientists in our employ? If you’re one of those who need valid data, of course, you’ll know just what to say.