The Data Science of IT and Computer Science is Not the Only Data Science

Originally published in Predictive Analytics Times (www.PATimes.com) under the same title. 02/02/2018.  https://www.predictiveanalyticsworld.com/patimes/data-science-computer-science-not-data-science/9261/ 

For reasons perhaps having to do with “siloed thinking” or “not-invented-here” syndrome (things that can affect all teams from time to time), many participants on the Big Data Management (BDM) and Big Data Analysis (BDA) scene have become convinced that there is only one kind of Data Science: that which is run in their corporate or academic shops. You know. Data Science. The only Data Science.

This misapprehension has given way to confusion, even among those of us calling ourselves data scientists, about what to call ourselves. We learned about data and analysis in a graduate program (MSc or PhD, or both) in physics or economics or biostatistics or Ops Research, or…some other STEM, or life, or historical, or social science (the list goes on…). Some of us have recently ended up calling ourselves data scientists because that was the direction market forces were pushing the people who make up names for occupations. Can’t get tagged on a recruiter’s search engine if you don’t have “data scientist” on your resume, right? The main effect of this positive feedback loop in the labor market is to broaden the definitions of data science developed over the last decade.

Others may quail at the thought of all the time wasted—again—debating what to best call ourselves, but I think the current flux is a good thing. There are just so many kinds of data sciences and scientists. That’s the main point. But you say, again, what about Data Science, you know, where all the CS and IT nerds are? The only Data Science?

Well, it’s one branch of The Data Sciences (plural and capitalized, for the sake of argument). It could be called, quite correctly, Computer Science (CS)-IT Data Science.

But it’s not the only Data Science. I argue that understanding the impact of data and analysis in the last 25 years demands that we broaden our focus from the branch of data science in which CS and IT thinking, training, and experience dominate the praxis, to the entire domain of data and the myriad sciences that use computerized tools to analyze it.

From this point of view, there are hundreds of Data Sciences. These are systematic investigative efforts that collect and analyze data to solve problems for society and in the service of advancing an empirical and theoretical mission. (And I digress, but not by much; you need as much scientific methodology as you can possibly afford. That means properly trained people. Otherwise, you are throwing money away.)

Under the Data Sciences, then, we have, for example, Biological and Biostatistical Data Sciences, Demo-Geographic Data Sciences, Sociological Data Sciences (sociometrics), Psychological (psychometric) Data Sciences, Statistical (mathematical) Data Sciences, even Historical Data Sciences (anthropology, paleontology, archaeology, “natural history” or evolutionary biology). And so on, in growing numbers, and in no particular order or hierarchy.

These are but a few examples. I didn’t mention econometrics, an applied statistical-disciplinary bias of my own. In full disclosure, I do so mention it. And I’ll stop capitalizing now that the point is made.

This is a proposal for one way to sort out some of the occupational confusion I referred to earlier. The reader should understand that the categories raised are map boundaries, not the kind of 20th century scientific disciplinary barriers that are often so unhelpful. You can think of them as heuristic lines of demarcation we use to better understand the trajectories of 21st century sciences.

And CS-IT data science’s status is unchanged. It is right in the middle of everything hot that is tech. That means many elements of it will remain indispensable. The most obvious example is its integral involvement in AI. In a TEDx talk, I heard that AI won’t run without Big Data, and Big Data is not interpretable without AI. So Big Data will have to be muscled around on both ends by the CS-IT data scientists. And they will always provide the computational engines for Data Scientists from any given discipline, i.e., advancing data management and analytic technology in the service of storing, maintaining, and retrieving data that has been observationally or experimentally collected for some purpose (or none at all).

But, again, it is not the only Data Science. Clearly, it is one of many.

All of us who work with data need to better understand that the rubric encompasses an often-bewildering thicket of rapid technological changes in science, industry, labor markets, and ultimately, the behavior of consumer markets themselves. This line of thinking moves toward a consistent nomenclature about what each of us does in data science, now and in the future.


Data Reliability and Validity, Redux: Do Your CIO and Data Curators Really Understand the Concepts?

Originally published in Predictive Analytics Times (www.PATimes.com) under the same title. 09/07/2018  https://www.predictiveanalyticsworld.com/patimes/data-reliability-and-validity-redux-do-your-cio-and-data-curators-really-understand-the-concepts/9684/

Here are two recent entries on the big but neglected issue of data reliability and analytic validity (DR&AV), from the vast commentariat that is LinkedIn:

One of my complaints with #bigdata, is there isn’t enough focus on getting the right kind of data. We deal with this in healthcare all the time. Much of our transactional data is just outcomes. There’s a push in the industry to change health behaviors (generally: population health). But if we’re not collecting meaningful behavioral data (could be secondary data like credit card purchases or primary like surveys about health attitudes), we can’t determine what behaviors are driving the outcomes! – Biostatistical Data Scientist, LinkedIn commenter, June 2016.

A potential problem is that people know less and less as to how to conduct surveys well. Conducting a survey is easier than ever, but the same technologies that make surveys easier are also making response bias easier to creep into the results as well. I suspect that we are headed to a disaster of Literary Digest proportions, for many of the same reasons. Of course, the data we have is very huge. But, at least for the problem that we want to analyze, the data is all wrong. Yet, there seems to be a big resistance to cleverly trying to address these problems instead of worshipping blindly at the altar of technology. —Sociological Data Scientist, LinkedIn commenter, September 2017.

Note that neither of these commenters mentions the words reliability or validity. But that’s what they’re talking about.

In the first, the biostatistical data scientist asks a basic question about data validity, i.e., the absence of “meaningful behavioral data” relevant to answering questions about factors that are “driving medical outcomes.”  Apparently, they have a lot of data, but not the right kind, for their purposes. In effect, she is saying that all their non-behavioral data is invalid because it does not measure what they need to know. From that standpoint, she might as well not have any data at all.

This is a teachable moment for those who insist that Big Data solves all problems. Her example shows you can have all the data you can possibly hoover up into a warehouse or a lake, or to fill the Mariana Trench, and have it all be invalid. This means, of course, invalid for the purposes of any one or a number of teams of analysts and data scientists looking for answers to questions posed before collecting data and commencing analytics. These were specific questions they wanted to ask of the data, in terms of variables to investigate, that apparently were not communicated to those responsible for data collection, selection, or construction. (Another reason why Statistical Data Scientists, along with the CS-IT Data Scientists, in their Data Manager role, should be in the lead with specifications for data needs or requirements for analytics, before data collection. But that’s another blog post.)

The second comment addresses issues of both data reliability and validity, stemming generally from watered-down skill sets and lowered levels of theoretical (mathematical and logical) understanding of survey research, among survey researchers themselves. He states that it is much easier to conduct a survey now than ever before, in terms of a one-day, one-question pop-up. But to conduct a survey properly, so that data is not rendered unreliable, and findings invalidated by response or other types of bias, is and has remained a painstaking process. (See “Bias in Survey Sampling.”)

I’ll put it this way: my inspection of SurveyMonkey did not show a capacity for tests of reliability and validity on the data collected, nor item analysis, nor other diagnostics that exist for survey data. (Maybe the capacity is there, in which case I stand corrected.) And the second commenter raises the example of the ultimate triumph of bad (unreliable) data from survey research, the Literary Digest scandal in 1936, and suggests that because we are not paying attention to the basics of DR&AV, we are heading in the same direction now.

This leads me to think in terms of the reliability of a measuring instrument, e.g., a questionnaire (survey instrument) administered to gauge employees’ job satisfaction. This is an evergreen example. But some statistical data scientists (statisticians, and their more applied sisters and brothers in economics and the social sciences) themselves do not necessarily appreciate that reliability applies not just to the numbers that result from the measurement. It applies to the way in which a survey question that produces the numbers is worded or phrased.

The wording of the questions must be as unambiguous as possible, or it will trash the survey. Suppose you are taking a survey of whether employees wear jeans to work (yes or no), and what “kind” of jeans, as part of a marketing study. Responses are limited to a set of multiple choices. But the word “kind” can refer to a brand of jeans (Levi’s, Lee’s, etc.), or a style of jeans (skinny, boot cut, relaxed fit, and so on). You want to know what style, but you ask respondents to specify, again, the kind of jeans, and give them a choice between several brands.

Your own confusion about what you say you want, and what you will get from the survey, enters the heads of some respondents. Some of them think “kind” means style, like you, and are puzzled that they are given brand names from which to choose. Some of them think it means brand names, and are just fine in specifying Levi’s or Lee’s or another brand. But if we were to poll the respondents on that question, asking whether they thought “kind” meant style or brand, there would be varying responses relative to what they believed they were being asked, rather than, necessarily, what you wanted to know. And you were not sure either. The result? Increased noise in the survey data, and less reliability.

Another example illustrates how to measure data reliability in a different context and with a different tool. But the principles of consistency and repeatability, reliability’s defining characteristics, are the same, with the former implying the latter. In the employ of a large defense contractor a few years ago, I sat in conference at one of our sites in Northern Virginia with 35 engineers of various stripes doing preliminary requirements specifications for a proposed weapons system. We spent a full day brainstorming what they would be.

At the end of the first day, an assistant and I scripted a questionnaire designed to capture data on the consistency and repeatability of the participants’ understandings of requirements they identified and named as critical to a successful system. We did not have time or space on a short survey instrument to ask each person to state what was meant by the terminology in Requirement 1, Requirement 2, and so on. But we got to the question of internal consistency by lowering the information requirements of the survey, and going through a logical back door: we asked the respondents to rank their top 20 requirements by importance (1 = highest importance, 20 = lowest).

This isn’t saying we expected each respondent to rank the requirements identically. It happens rarely, if ever. And that wasn’t the point. If there were consistent and repeatable understandings, i.e., reliable understandings, respondents, more or less, would all be ranking the same list. Said another way, everyone responding to the survey would be ranking the same definitions of each requirement. To test for this, I used a statistic known as Cronbach’s Alpha (α) that correlated the ranking of each requirement with every other one, and averaged the correlations. In practice α usually falls between 0 and 1, although it can dip below 0 when the items are negatively correlated on average. It’s a correlation measure for all the data collected by the questionnaire or survey. In this instance, if the same lists were being ranked, and thus the same requirement definitions, α would approach 0.5 or better, telling us that about 50 percent or more of the time that was the case.

In our tests, α averaged 0.15, indicating an absence of consistency and repeatability in participants’ terminological understandings of each requirement. To me, the findings indicated strongly that the following kind of situation likely prevailed in the data: for Requirement 1, Respondent 1 had one understanding, Respondent 2 another, and Respondent 3 the same as Respondent 1, but different from Respondent 4, etc. In short, because the respondents had different understandings of each requirement, most of the time each person was ranking a different list. This generated much noise and not much signal, in an effort that demanded every engineer associated with the project be on the same page throughout its execution.
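For readers who want to see the arithmetic behind a statistic like Cronbach’s Alpha, here is a minimal sketch in Python. It uses the standard textbook variance form of the statistic, which is not necessarily the exact computation run on the project described above. The function name and both small score tables are made up purely for illustration; they are not the engineers’ data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per respondent, one column per item.
    Uses the variance form:
        alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    k = len(scores[0])                      # number of items (columns)
    items = list(zip(*scores))              # transpose: one tuple per item
    item_variances = [variance(col) for col in items]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: every respondent scores the three items the same way,
# so the items move together perfectly across respondents.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]

# Hypothetical data: the same scores shuffled so the items no longer
# move together from one respondent to the next.
scrambled = [[1, 4, 2], [2, 1, 4], [3, 3, 1], [4, 2, 3]]

print("consistent:", cronbach_alpha(consistent))
print("scrambled: ", cronbach_alpha(scrambled))
```

The first table gives an alpha at the top of the scale; the second gives a value far below it. That is the pattern to look for when respondents are, in effect, answering different questions.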

The engineers had to define and refine their terms to eliminate the ambiguity in wording or phrasing of requirements terminology. It was crucial that every engineer knew they were all talking about the same thing when they discussed a given specification. This is data reliability in a different context, and a critical element that should occur in any organizational effort that analyzes data: at the beginning.

In case you lost track, that’s my fault. So, one more time: The point is that the lower a measure’s reliability coefficient RXX (I discussed this back in July of this year, “Data Reliability and Analytic Validity for Non-Dummies”), or its Cronbach’s Alpha score, or other diagnostic statistics, the more likely it is to fail reliability tests. And the lower its correlation with any other measure in the dataset will be. Low correlations mean that for any two or more variables (all considered pairwise), fewer data points or observations may move positively or negatively in tandem. And false positives (Type I Error), or in this case, false negatives (Type II Error), are the result of degraded consistency and repeatability properties of the measurement, masking more reliable relationships between variables that are better correlated.
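The link between low reliability and low correlations can be put into numbers with the classical attenuation formula from test theory: the correlation you observe between two measures is the underlying correlation shrunk by the square root of the product of their reliabilities. The sketch below assumes that classical model (measurement errors uncorrelated with each other and with the true scores). The function name and the example values are invented for illustration.

```python
import math


def observed_correlation(true_r, rxx, ryy):
    """Expected observed correlation under classical test theory.

    true_r: correlation between the error-free versions of X and Y
    rxx, ryy: reliability coefficients of the X and Y measures (0 to 1)
    """
    return true_r * math.sqrt(rxx * ryy)


# Hypothetical case: a genuine relationship of 0.6 between two variables.
perfect = observed_correlation(0.6, rxx=1.0, ryy=1.0)   # no measurement noise
noisy = observed_correlation(0.6, rxx=0.25, ryy=1.0)    # X measured unreliably

print("perfectly reliable measures:", perfect)
print("unreliable X measure:       ", noisy)
```

With an unreliable X, half of the real relationship disappears from view, which is how a true effect turns into a false negative.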

It’s a hot mess, isn’t it? Go ask your CIO and your Data Curators: what do you know about all this? How reliable is our data? And is it valid with respect to the needs of analysts and the various stripes of data scientists in our employ? If you’re one of those who need valid data, of course, you’ll know just what to say.


A little bit more on Anatoly Karlin’s IQ thesis

In Anatoly Karlin’s world, average IQ explains almost all economic success. In my world, that of a professional economist–albeit of a somewhat heterodox stance–I could say with certainty that economic success probably explains as much of the variation in average IQ across countries. We can build an entire regional (i.e., general to all geographic levels) endogenous growth model that I think, if specified correctly in economic developmental parameters like per capita clean water, calorie, and protein intake, might even wash the IQ variable right out of the equation(s). I elaborated on this a few days ago in this forum but Anatoly is too busy bloviating on his discovery to read the replies to his blogging. Adding to my “reverse causation” critique, Karlin’s cosa nostra is a socio-economically retrograde, and ultimately racist and colonialist, theory of economic development, especially when we apply it to the actual history of the conquest of the Americas, over the last 500 years (or even the last 100).

Bill Luker Jr




A call for a new movement: We want our fucking money back

Quoting Slavoj Zizek, extensively. And although very well said, if you have the vocabulary and terminology, nothing he says is new. I said it in 1986, in my paper entitled “Privatism and Sunbelt Development in the US” [eventually published in Challenge Magazine in the early 1990s] and there were only a few, maybe one or two, who said it with me. BWA-HAHA BWA-HAHAHAAAAA. I will have my vindication when all those who misunderestimated me [indeed, George W. Bush’s great contribution to the English language] bow at my feet.

Zizek, 09/27/2018, RT

“Liberalism and its great opponent, classical Marxism, both tend to reduce the state to a secondary mechanism which obeys the needs of the reproduction of capital. So, they both thereby underestimate the active role played by state apparatuses in economic processes. Today (perhaps more than ever) one should not fetishize capitalism as the Big Bad Wolf that is controlling states: state apparatuses are active in the very heart of economic processes, doing much more than just guaranteeing legal and other (educational, ecological…) conditions of the reproduction of capital.
In many different forms, the state is more active as a direct economic agent – it helps failing banks, it supports selected industries, it orders defense and other equipment – in the US today than ever before. Around 50 percent of production is mediated by the state, while a century ago, this percentage was between five percent and 10 percent.

Old Rope

One has to be more specific here: the digital network that sustains the functioning of our societies as well as their control mechanisms is the ultimate figure of the technical grid that sustains power today – and does this not confer a new power to the old Trotsky idea that the key to the State lies, not in its political and secretarial organizations, but in its technical services? Consequently, in the same way that, for Trotsky, taking control of the post, electricity, railways, etc., was the key moment of the revolutionary seizure of power, is it not that today, the ‘occupation’ of the digital grid is absolutely crucial if we are to break the power of the state and capital?

In the same way Trotsky required the mobilization of a narrow, well-trained “storming party, of technical experts and gangs of armed men led by engineers” to resolve this “question of technique,” the lesson of the last decades is that neither massive grassroots protests (as we have seen in Spain and Greece) nor well-organized political movements (parties with elaborate political visions) are enough. Instead, we also need a narrow strike force of dedicated “engineers” (hackers, whistle-blowers…) organized as a disciplined conspiratorial group. Its task will be to “take over” the digital grid, and to rip it from the hands of corporations and state agencies which now de facto control it.

WikiLeaks was just the beginning, and our motto should be a Maoist one: let a hundred of WikiLeaks blossom. The panic and fury with which those in power, those who control our digital commons, reacted to Assange is a proof that such an activity hits the nerve.

There will be many blows below the belt in this fight – our side will be accused of playing the enemy’s hands (like the campaign against Assange for being in the service of Putin), but we should get used to it and learn to strike back with interest, ruthlessly playing one side against the other in order to bring them all down.

Were Lenin and Trotsky also not accused of being paid by Germans and/or by the Jewish bankers? As for the scare that such an activity will disturb the functioning of our societies and thus threaten millions of lives, we should bear in mind that it is those in power who are ready to selectively shut down the digital grid to isolate and contain protests. Indeed, when massive public dissatisfaction explodes, the first move is always to disconnect the internet and cell phones.

Or, to put it in the well-known terms from 1968, in order for its key legacy to survive, liberalism needs the brotherly help of the radical Left.”


The Idiocy of IQ Reductionism and Determinism, or, Anatoly Karlin Screws the Pooch

My comments are based on an article by Anatoly Karlin, a right-wing Russian economist, entitled The Idiocy of the Average – and Why It Matters (Unz Review, September 04/18.) He is making a simple but basic mistake, made by many: Failing to understand that statistically significant determining effects in a multiple linear regression equation(s) (from which Karlin is reporting his findings) are as likely in the material world to move from Y to X as they are from X to Y. This is the fallacy of reverse causation. What I mean is that there is an equally plausible alternative analysis, another multiple linear regression equation that says differences in GDP per capita and other measures of general socio-economic welfare, including pre- and post-natal caloric and protein intake (presumably), are the forces that determine variations in IQ across countries. IQ is the dependent variable, not an independent variable that determines economic growth and welfare. Growth and welfare instead determine IQ. Not only can I envision such an empirical result, but I can discuss the theoretical reasons–in terms of endogenous economic growth theory–why this could be the case. The mistake is the same as in orthodox Western economists’ unquestioned maxim, that consuming higher levels of education leads to higher levels of income. The reverse is as likely to be true. The empirical data supports it (Sam Bowles and Herb Gintis, Schooling in Capitalist America, 1976). Higher income levels are positively and statistically significantly associated with higher levels of education, with the direction of the effect moving from income to education, not the other way ’round.
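The statistical side of the reverse causation point is easy to check in the simplest case, a regression with one explanatory variable. The sketch below fits the line both ways, Y on X and then X on Y, and shows that the fit is exactly as good in either direction, so the regression by itself cannot say which way the effect runs. This is a two-variable simplification of the multiple regression case discussed above. The function name is made up, and the numbers are toy values, not IQ or GDP data.

```python
from statistics import mean


def slope_and_r2(x, y):
    """Ordinary least squares of y on x: returns (slope, R-squared)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)   # squared correlation, symmetric in x and y
    return slope, r2


# Toy numbers, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

slope_yx, r2_yx = slope_and_r2(x, y)   # "X determines Y"
slope_xy, r2_xy = slope_and_r2(y, x)   # "Y determines X"

print("Y on X: slope", slope_yx, "R2", r2_yx)
print("X on Y: slope", slope_xy, "R2", r2_xy)
```

Both runs report the same R-squared. Choosing the direction of the arrow takes theory and evidence from outside the regression.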


Intersection and Flyover, in D; meditation on left hand entrance ramps

What’s the deal with the intersection of I35 and US 77—Dallas Drive—going southbound, at or about the Pep Boys and Calloway’s Nursery, on the north side of the intersection?

TxDoT (the Texas Department of Transportation), that two-syllable acronym that has come to mean so much to all of us, over these last ten years of the I-35 multi-mega Corridor Project from IH 635 (the Great Lyndon Baines Johnson Expressway) all the way to US 380 (Phase One, no less), the road that runs through Denton to Decatur and to US 287, and into the vast Northwest Texas Beyond, seems to have made a mistake. We don’t know if it’s a massive mistake. But it could be. And it’s certainly a major local disappointment.

But no one is talking about it. Not in the papers, and I haven’t seen it on social media.

People on the street, they’re talking about it. Especially those forced to negotiate daily loop-de-loops and 4 lights to turn left on Teasley from Dallas Drive, to escape the trap of a completely messed-up traffic pattern that this mistake has caused. We need to insert this discussion into the ongoing media dialogue Denton has with itself in real time, on social media and other cyber spaces, and get TxDoT to tell us what they are going to do about it. Because something will have to be done. Absolutely.

As I said, we have all become familiar with I35’s grandiose but necessary re-do, that 48 km stretch of much more complicated road construction than TxDoT has ever attempted before, with all the electronics embedded in the concrete, necessary for more toll congestion pricing, etc. But that is another story. There are associated physical issues (driving on bad roads elevates blood pressure, and a crash can even kill you) and metaphysical ones. As to the latter, I know a person who listened to lectures on Immanuel Kant–you know, one of those Great Courses advertised in reputedly highbrow rags like Texas Monthly (not so highbrow; University of Texas highbrow so not high) and the New York Times Book Review (higher and higher brow, but not the New York Review of Books), from the Great Professors at the Great Universities, regularly $299.00 discounted to $39.99 (“Truly,” you’re thinking, “when and how long was it EVER for $299.00? Sort of like the singularity at the big bang: femtoseconds.”)–to while away the time sitting in traffic, trying to get to work. Much to the betterment of her understanding of mid-18th-century moral philosophy in Germany and Europe, I might add.

So they have closed an underpass again, leading again to the same “confunction at the junction” they were trying to eliminate. I know that’s a hick expression but we have to say it before someone else captures it and I won’t be able to use it again.

Unfortunately for my professional and moral development I have lived here too long, but strangely long enough to know when someone has torn something down, and it becomes a matter of interest and sort of happiness that something new will solve some problem or another, but then to have hope for solution dashed as they build it back exactly the way it was before! Many of us old Dentonites had endured, with much wailing and gnashing of teeth, for decades, this traffic design disaster, which would eventually have to be fixed, so that cars coming from the left are not merging into the far left lane of the freeway traffic as it is coming around a curve in its left lane, and freeway traffic is not forced to simultaneously negotiate with local traffic immediately merging in from their left. Comprendo?

Put another way, successfully negotiating this interchange means to avoid getting sideswiped by them, or them by you. It’s the Hobbesian realization of bad traffic design, “the warre of carre against carre,”[1] unregulated except by the Leviathan. And we are the Leviathan. It must be fixed, and responsible parties must be held accountable. So I will take the case. More on this as the investigation develops.

Since the contractors and TxDoT have again closed the entrance to the ramp that leads to the freeway, I presume they have noticed the problem and are trying to see what to do about it. Or maybe not. But the mess is spreading into our neighborhoods, making this like Long Island. And you know what that means.

Update: Talked with TxDoT. People, we are going to visit the site.

[1] Quote from B. Luker Jr. 2159 CDT Tuesday Apr 24.



Now for a New Beginning–The One With No Mistakes

For various reasons, Facebook is dying. So I’m getting out while the getting is good. I have now embarked on a career as a writer of poetry, non-fiction, and fictive prose. The latter is a term I just made up, although it’s kind of pretty and I run the risk of someone else claiming it. As for what it means, some people call it memoir, but that’s for me to know and you to find out. Heh. So, going forward, I’ll be posting on this, my new website, billluker.com. You’ll find the same wacky but ultimately very black, bilious mix of vicious personal attacks on people with whom I will never really be acquainted, poetry that might be good if I hit one of those days when I’m feeling OK, and maybe the occasional kernel of rationality on matters of politics, economics from the bottom up, and international affairs (i.e., in these times, wars, rumors of wars, and more wars and rumors of wars). At any rate, my work, judged “highly likely” by the British Secret Service (MI6) to be forever obscure, will be appearing here, at billluker.com. I will have an assistant, name of neptoonish, who will act as editor, spokesperson, and sometime amanuensis for works that will require me to cut myself, bleed sufficiently, and weep copiously for several hours.

Be seeing you–and may your days be filled with grace, love, light, and peace.

Bill (drblukerjr@gmail.com)

+1 940-435-2028