Data Knowledge

Data Knowledge, a heretofore uncoined term, is a missing concept that is fundamental to understanding, assessing, discussing, and thinking critically about artificial intelligence and AI systems. This post defines data knowledge. Subsequent posts will illustrate use cases of data knowledge for data scientists, systems architects, AI ethicists, and the lay public.

What is data knowledge?

Data Knowledge is the physical representation of past learning by an AI system, as it is stored. The term combines two ideas that traditionally conflict in information science:

  1. data
  2. knowledge

These terms are most often understood as resting along the Data-Information-Knowledge-Wisdom continuum, with data as the most basic form of information and wisdom as the most complex.

Data-Information-Knowledge-Wisdom Pyramid

The data in data knowledge

In information science, data represents raw signal. For humans, this would be anything picked up and processed by the nervous system: the smell of fresh-cut grass, the sound of rain, the sight of a sunset. In the computing world, data is the representation of meaning on disk. We could follow the rabbit hole all the way down to the binary representation of meaning, but this is usually unnecessary for artificial intelligence (though it may be necessary in certain low-resource environments).

In the AI world, it is sufficient for us to think about data as the format in which we store a system's learning on disk. We might do this in a pickle format, as JSON, as HDF5, or in some proprietary or otherwise specialized format. There are obvious considerations that come into play here, including:

  1. Which programs know how to modify this data?
  2. How big or small is the data?
  3. How human-readable/inspectable is the data?
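
As a rough illustration of these trade-offs, the sketch below stores the same made-up model weights in two of the formats mentioned above, using only the Python standard library. The `weights` dictionary and its values are hypothetical, chosen purely for the example.

```python
import json
import pickle

# Hypothetical weights for a tiny linear model (illustrative values only).
weights = {"bias": 0.5, "coefficients": [1.25, -2.0, 0.75]}

# JSON: plain text that most languages and tools can read and modify.
as_json = json.dumps(weights).encode("utf-8")

# pickle: a binary format specific to Python.
as_pickle = pickle.dumps(weights)

def roundtrip_ok():
    """Check that both formats restore exactly the same weights."""
    return (json.loads(as_json) == weights
            and pickle.loads(as_pickle) == weights)

print(as_json.decode("utf-8"))        # inspectable in any text editor
print(len(as_json), len(as_pickle))   # size of each representation in bytes
print(roundtrip_ok())
```

Both formats hold the same content. What differs is which programs can open them, how many bytes they take, and whether a person can read them directly.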

Ultimately, it is important to think of the data component in data knowledge because AI systems are not only consumers of data: they also produce it and depend on it for inference.

The knowledge in data knowledge

In information science, knowledge represents the first applied stage of meaning, defined by application, processes, rules, and doing. Colloquially, we might think of it as "know-how". If we imagine baking a cake, then:

  • a person who can bake a cake has knowledge about cake baking;
  • a person who has read a recipe may have information about cake baking;
  • the recipe itself is data about cake making.

Knowledge is a higher level of expertise than information and data. Indeed, along the continuum it is said that data is collected into information, and information in turn into knowledge.

For AI, knowledge is the system's ability to do things: the system's coverage and performance. For a question-answering system, this would be the system's ability to answer questions. What types of questions can it respond to? How good are the responses? For an object detection system, this would be the number of objects the system can detect, the types of images it can detect objects in, and the system's effectiveness at detecting them.

Critically, information scientists argue that all knowledge is tacit; that is, it cannot be properly encoded into information. We cannot transfer knowledge to another person through words or symbols in any form. We can give them information, but they must come by the knowledge themselves.

Data Knowledge

Data knowledge is the combination of these two concepts: the doing of knowledge and the tangibility of data. AI systems act only by following rules (knowledge) encoded as weights, probabilities, and metrics, all of which are data.

This is true for all manner of AI systems:

  • Linear systems store linear weights
  • Deep learning systems store weights
  • Nearest neighbor systems store neighbors and properties
  • Decision trees/forests store rules
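
To make the list above concrete, here is a minimal sketch of two toy systems, written from scratch for illustration rather than taken from any library. The class names and example values are assumptions. In each case, everything the system "knows" lives in the attributes it stores.

```python
class LinearModel:
    """A linear classifier: its data knowledge is a list of weights and a bias."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def predict(self, x):
        score = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return 1 if score > 0 else 0


class NearestNeighbor:
    """A 1-nearest-neighbor classifier: its data knowledge is the stored examples."""

    def __init__(self, examples):
        self.examples = examples  # list of (point, label) pairs

    def predict(self, x):
        def squared_distance(point):
            return sum((a - b) ** 2 for a, b in zip(point, x))

        # Return the label of the closest stored example.
        _, label = min(self.examples, key=lambda ex: squared_distance(ex[0]))
        return label


# Hypothetical stored data for each system.
linear = LinearModel(weights=[1.0, -1.0], bias=0.0)
neighbors = NearestNeighbor(examples=[((0.0, 1.0), 0), ((1.0, 0.0), 1)])

print(linear.predict([2.0, 1.0]))
print(neighbors.predict((0.9, 0.1)))
```

The prediction code is fixed in both classes. What differs between one trained system and another is only the stored numbers or examples.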

AI systems can be updated, improved, or attacked by modifying the data and hence changing the behavior (knowledge) of the system.
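
A small sketch of this point, assuming a hypothetical two-weight linear classifier: changing one stored number changes what the system does with the same input, even though no code is touched.

```python
def predict(weights, bias, x):
    """Classify x as 1 if the weighted sum plus bias is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


# Illustrative values only.
weights, bias = [2.0, -1.0], 0.0
x = [1.0, 1.0]

before = predict(weights, bias, x)   # score = 2.0 - 1.0 = 1.0

# Modify a single stored weight, as an update or an attack might.
tampered = list(weights)
tampered[0] = -2.0

after = predict(tampered, bias, x)   # score = -2.0 - 1.0 = -3.0

print(before, after)
```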

Importantly, because AI systems use data knowledge, they are also reproducible. Whereas humans must transmit knowledge through information, AI systems use data knowledge: a physically copyable representation of what they can do. AI systems, then, can be perfectly replicated by sharing data knowledge.
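
The replication claim can be sketched as follows, again with a hypothetical toy model and made-up weights. Writing the stored values out as JSON and reading them back yields a second system whose outputs match the first on every input tried here.

```python
import json


class LinearModel:
    """Toy linear scorer whose entire state is its weights and bias."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def score(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x)) + self.bias

    def to_json(self):
        # Serialize the data knowledge, not the code.
        return json.dumps({"weights": self.weights, "bias": self.bias})

    @classmethod
    def from_json(cls, text):
        state = json.loads(text)
        return cls(state["weights"], state["bias"])


original = LinearModel([0.5, -1.5, 2.0], 0.25)
replica = LinearModel.from_json(original.to_json())

inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
same = all(original.score(x) == replica.score(x) for x in inputs)
print(same)
```

Nothing comparable exists for a human baker: handing over the recipe transfers information, not the ability to bake.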

Why not information?

Why do we not discuss information? It would seem more intuitive that we map the parts of an AI system along the existing data, information, knowledge continuum. However, for AI practitioners the answer to this will be apparent: the data used by AI systems rarely rises to the "data with meaning" threshold for information.

Information is about facts and ideas. Anyone who has ever looked at the weights of a deep learning network would be hard pressed to enumerate the ideas expressed therein. Similarly, we might consider a word vector.

Example word embeddings, by Google

A word vector is a sequence of weights associated with some token, designed to be processed by a deep learning system. Do these weights in and of themselves represent information? Do they describe something? Hardly. It is not even clear that all subsets of tokens, to say nothing of weights, would be sufficient for a system to put the information into practice. No: word vectors are data.

Visualizations of words in vector space, by Google.
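
For readers who have not seen a word vector, the sketch below uses made-up four-dimensional vectors (real embeddings typically have hundreds of dimensions, and these values are not taken from any trained model). Reading the individual numbers tells you nothing. Only when a program processes them, here with cosine similarity, does any structure appear.

```python
import math

# Hypothetical vectors for illustration; not from a real embedding model.
vectors = {
    "king": [0.8, 0.3, 0.1, 0.9],
    "queen": [0.7, 0.4, 0.2, 0.9],
    "apple": [0.1, 0.9, 0.8, 0.0],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal = cosine(vectors["king"], vectors["queen"])
fruit = cosine(vectors["king"], vectors["apple"])
print(royal > fruit)
```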

Implications of data knowledge

Data knowledge is an important concept for understanding AI. It helps us differentiate AI data knowledge from human knowledge, which is based in information. In subsequent posts, we will look at the implications of this idea and how we can apply it to improve AI systems development, AI ethics, and AI policy.



