Data Husbandry and Edge Compute
The choice of edge compute over a pure hyperscaler cloud service provider model is primarily a function of:
- comparative costs (unit cost and total cost; first cost and recurring operational costs) of the four primary cost elements
- service level requirements, and which of the four elements (compute, data, storage, or network transit) would result in violation of a service level agreement (SLA)
- system security, data privacy, and compliance requirements
Each of these aspects offers a number of metrics, and any given system is likely to have established targets and policies with which operations must comply.
Data as an Asset
At the ‘board level’, data has generally been seen as a source of cost, as a source of financial risk, and/or as a risk to both organisational and personal reputation. Tales of failed compliance audits, data breaches, fines levied, and proprietary data exposure create a fear-based mindset, and consideration of ‘data’ as primarily a cost centre.
Rarely has that mindset included the consideration of data as an asset, or even less often, data as a monetizable asset. Yet, enterprises' increasing reliance on data is clearly evident. In 2022, the average enterprise in modern economies produced almost seven times the data it did in 2016 and consumed from outside sources almost ten times the amount reported in 2017.
A Data Product is a software application or tool that leverages data to deliver value to its users. In a data-oriented company, a data product often involves the integration of various data sources and the application of advanced analytics techniques to derive insights and make data-driven decisions.
The goal of a data product is to provide a user experience that leverages data to drive business outcomes. A data product is designed to align with the principles of data husbandry, ensuring that
- data used as inputs to the production processes are of the optimal quality and appropriate for the processing of a data product,
- data generated as outputs of the data product are of optimal value and usefulness to the data consumer, and
- compliance with regulation, corporate policy, and licensing terms is maintained through a chain of accountability and responsibility.
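The three principles above can be sketched as a thin wrapper around a data product's transformation step. This is a minimal, hypothetical illustration; the names (`DataProduct`, `input_checks`, and so on) are invented for this sketch and do not come from any particular library.

```python
# Hypothetical sketch: a data product that enforces input quality,
# output usefulness, and a chain of accountability. Illustrative only.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataProduct:
    name: str
    transform: Callable[[list[dict]], list[dict]]
    input_checks: list[Callable[[list[dict]], bool]] = field(default_factory=list)
    output_checks: list[Callable[[list[dict]], bool]] = field(default_factory=list)
    audit_log: list[str] = field(default_factory=list)  # chain of accountability

    def run(self, records: list[dict]) -> list[dict]:
        # 1. Inputs must be of appropriate quality for this product.
        for check in self.input_checks:
            if not check(records):
                raise ValueError(f"{self.name}: input quality check failed")
        result = self.transform(records)
        # 2. Outputs must be of value and usefulness to the consumer.
        for check in self.output_checks:
            if not check(result):
                raise ValueError(f"{self.name}: output check failed")
        # 3. Record what was done, preserving accountability.
        self.audit_log.append(
            f"{self.name}: processed {len(records)} -> {len(result)} records")
        return result

product = DataProduct(
    name="dedupe",
    transform=lambda rs: list({r["id"]: r for r in rs}.values()),
    input_checks=[lambda rs: all("id" in r for r in rs)],
    output_checks=[lambda rs: len(rs) > 0],
)
out = product.run([{"id": 1}, {"id": 1}, {"id": 2}])
print(len(out))  # 2
```

A failed check raises rather than silently passing bad data downstream, which is the behaviour the SLA and compliance discussion above implies.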
Data Husbandry refers to the practice of managing and caring for data in order to maximize its value and usefulness. This can involve tasks such as cleansing and organizing data, selecting and combining datasets for a specific purpose, providing appropriate storage and access to the data, and monitoring the quality and integrity of the data over time. In this sense, data husbandry can be seen as a way of ensuring that data is well-maintained and optimized for use in analytic, research and other applications.
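Two of the husbandry tasks just listed, cleansing records and combining datasets for a specific purpose, can be sketched in a few lines. The field names (`email`, `order_id`) and data are purely illustrative assumptions.

```python
# Illustrative husbandry tasks: cleanse a dataset, then combine it
# with another for a specific purpose. All names are hypothetical.
def cleanse(records):
    """Drop records missing the key field; normalise whitespace and casing."""
    cleaned = []
    for r in records:
        if r.get("email"):
            cleaned.append({**r, "email": r["email"].strip().lower()})
    return cleaned

def combine(customers, orders):
    """Join orders onto customers by email (a simple hash join)."""
    by_email = {c["email"]: c for c in customers}
    return [
        {**by_email[o["email"]], "order_id": o["order_id"]}
        for o in orders if o["email"] in by_email
    ]

customers = cleanse([{"email": " Ana@Example.COM "}, {"email": None}])
orders = [{"email": "ana@example.com", "order_id": 7}]
print(combine(customers, orders))
# [{'email': 'ana@example.com', 'order_id': 7}]
```

In practice these steps run inside a pipeline framework, but the logic of quality-before-combination is the same.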
The goal is to enhance compliance whilst at the same time turning data into a sellable, usually anonymised, offering.
The term data lineage has been in use for more than two decades but has received the most intense attention during the past 10–12 years with the emergence of data engineering. Data lineage information is a form of metadata; that is, data which describes some aspect of the primary data. It is, truly, ‘data about data’ that, ideally, represents the lifecycle of data from its inception to its consumption or exhaustion.
Data lineage metadata is lifecycle data about a dataset. In particular, it includes information about the creation of data (when, what, who), subsequent updates (when, what, who), transformation, versioning, summarization, migration, and replication. It provides visibility into the processing workflow and simplifies tracing errors back to their sources.
Data lineage metadata describes a form of relationship of a dataset with its antecedents, since each data transformation, migration, and replication implies dependency on the same original data. For example, when wrong data is found in one document, changes need to be made not only to that document, but also
- to all prior documents from which the document was derived (upstream dependencies)
- to all existing documents that are, themselves, derived from and dependent on the document just identified as in error (downstream dependencies).
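Tracing both directions is a graph traversal over the derivation relationships. The sketch below assumes a simple `child -> parents` mapping with hypothetical document names; real lineage stores hold the same structure, just at larger scale.

```python
# Illustrative sketch: given a mapping from each document to the documents
# it was derived from, find the upstream and downstream dependencies of a
# document found to contain an error. Names and data are hypothetical.
from collections import deque

derived_from = {            # child -> parents it was derived from
    "report": ["summary"],
    "summary": ["raw_a", "raw_b"],
    "dashboard": ["report"],
}

def upstream(doc):
    """All antecedent documents the error may originate from."""
    seen, queue = set(), deque(derived_from.get(doc, []))
    while queue:
        d = queue.popleft()
        if d not in seen:
            seen.add(d)
            queue.extend(derived_from.get(d, []))
    return seen

def downstream(doc):
    """All documents derived, directly or transitively, from doc."""
    children = {}
    for child, parents in derived_from.items():
        for p in parents:
            children.setdefault(p, []).append(child)
    seen, queue = set(), deque(children.get(doc, []))
    while queue:
        d = queue.popleft()
        if d not in seen:
            seen.add(d)
            queue.extend(children.get(d, []))
    return seen

print(upstream("report"))    # {'summary', 'raw_a', 'raw_b'}
print(downstream("report"))  # {'dashboard'}
```

An error found in `report` thus implicates `summary` and the two raw feeds upstream, and invalidates `dashboard` downstream.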
Data Provenance has been defined in several ways over the past two decades, and many of these definitions are incompatible with one another. For some, there is no clear distinction between data provenance and data lineage.
For our purposes, the term Data Provenance refers to the same dataset lifecycle that is described by data lineage, with the difference that data provenance is meant to record the context within which a specific action (such as a transformation, an erasure, etc.) has occurred. For example, provenance might include
- the record of data ownership, change in ownership and delegation to a data steward,
- a determination of responsibility and accountability (which organisation was responsible for the maintenance of a dataset? in the event of an error or compliance violation, what is the nature of liability?), and
- a reference to the applicable policy, regulation or license which dictates appropriate actions and defines unacceptable, non-compliant action.
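The three contextual elements above suggest what a provenance record holds alongside the lineage event itself. The following sketch uses assumed field names; only the GDPR Article 17 (right to erasure) reference is a real regulatory citation.

```python
# A hedged sketch of a provenance record capturing ownership,
# responsibility, and the applicable policy, as listed above.
# Field names are illustrative, not drawn from any standard.
from dataclasses import dataclass

@dataclass
class ProvenanceRecord:
    dataset_id: str
    action: str               # e.g. "transformation", "erasure"
    owner: str                # current data owner
    steward: str              # delegated data steward
    responsible_org: str      # organisation accountable for maintenance
    policy_ref: str           # regulation or licence governing the action

record = ProvenanceRecord(
    dataset_id="sales_2024_q1",
    action="erasure",
    owner="finance-dept",
    steward="data-office",
    responsible_org="ACME GmbH",
    policy_ref="GDPR Art. 17",
)
print(record.policy_ref)  # GDPR Art. 17
```

Pairing such a record with each lineage event is what turns a bare processing history into an auditable account of who was accountable, and under which rules.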