Here is a glossary of key data science terms to help you on your road to deepening your knowledge of the field and choosing the solution that is right for you!
A test applied to data for atomicity, consistency, isolation, and durability
A process of searching, gathering and presenting data
A mathematical formula placed in software that performs an analysis on a set of data.
The severing of links between people in a database and their records to prevent the discovery of the source of the records.
Developing intelligent machines and software that are capable of perceiving the environment, taking corresponding action when required, and even learning from those actions.
Automatic identification and capture (AIDC)
Any method of automatically identifying and collecting data on items, and then storing the data in a computer system. For example, a scanner might collect data about a product being shipped via an RFID chip.
Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
Using data about people’s behavior to understand intent and predict future actions.
Big Data Scientist
Someone who is able to develop the algorithms to make sense out of big data.
Business Intelligence (BI)
The general term used for the identification, extraction, and analysis of data.
Cascading provides a higher level of abstraction for Hadoop, allowing developers to create complex jobs quickly, easily, and in several different languages that run in the JVM, including Ruby, Scala, and more. In effect, this has shattered the skills barrier, enabling Twitter to use Hadoop more broadly.
Call Detail Record (CDR) analysis
CDRs contain data that a telecommunications company collects about phone calls, such as time and length of call. This data can be used in any number of analytical applications.
Cassandra is a distributed, open source database designed to handle large amounts of data spread across commodity servers while providing a highly available service. It is a NoSQL solution that was initially developed by Facebook, and it organizes data in key-value form.
Cell phone data
Cell phones generate a tremendous amount of data, and much of it is available for use with analytical applications.
The analysis of users’ Web activity through the items they click on a page.
A systematic process for obtaining important and relevant information about data, also called metadata: data about data.
A distributed computing system over a network used for storing data off-premises
The process of identifying objects that are similar to each other and clustering them in order to understand the differences as well as the similarities within the data.
Cold data storage
Storing old data that is rarely used on low-power servers. Retrieving the data will take longer.
It ensures a step-by-step procedure of comparisons and calculations to detect patterns within very large data sets.
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying monitoring and analyzing results, in order to make the best use of this collected data.
Clojure is a dynamic programming language based on LISP that uses the Java Virtual Machine (JVM). It is well suited for parallel data processing.
A broad term that refers to any Internet-based application or service that is hosted remotely.
Columnar database or column-oriented database
A database that stores data by column rather than by row. In a row-based database, a row might contain a name, address, and phone number. In a column-oriented database, all names are in one column, addresses in another, and so on. A key advantage of a columnar database is faster hard disk access.
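The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular database stores data on disk; the sample records and the `to_columns` helper are hypothetical.

```python
# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"name": "Ada", "address": "1 Main St", "phone": "555-0100"},
    {"name": "Grace", "address": "2 Oak Ave", "phone": "555-0101"},
    {"name": "Alan", "address": "3 Elm Rd", "phone": "555-0102"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: reading every name means visiting every full record.
names_from_rows = [record["name"] for record in rows]

# Column layout: all names are already stored together.
names_from_columns = columns["name"]
```

A query that needs only one field reads a single contiguous list in the column layout, which is the intuition behind the faster disk access mentioned above.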
You may compare your keys in two ways: by implementing the interface or by implementing the RawComparator interface. In the former approach, you compare (deserialized) objects; in the latter, you compare the keys using their corresponding raw bytes.
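The two approaches can be contrasted with a small Python sketch. It does not use the Hadoop API; it assumes keys are non-negative integers serialized as 4 big-endian bytes, an encoding chosen here because its byte order matches numeric order. All function names are hypothetical.

```python
import struct


def serialize(key: int) -> bytes:
    """Encode a non-negative integer as 4 big-endian bytes."""
    return struct.pack(">I", key)


def deserialize(raw: bytes) -> int:
    """Decode 4 big-endian bytes back into an integer."""
    return struct.unpack(">I", raw)[0]


def compare_objects(raw_a: bytes, raw_b: bytes) -> int:
    """Deserialize both keys first, then compare the resulting objects."""
    a, b = deserialize(raw_a), deserialize(raw_b)
    return (a > b) - (a < b)


def compare_raw(raw_a: bytes, raw_b: bytes) -> int:
    """Compare the serialized bytes directly, skipping deserialization."""
    return (raw_a > raw_b) - (raw_a < raw_b)


first, second = serialize(7), serialize(300)
same_answer = compare_objects(first, second) == compare_raw(first, second)
```

Comparing raw bytes avoids creating objects for every comparison, but it only gives the right ordering when the serialized form sorts the same way as the values it encodes.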
Complex event processing (CEP)
CEP is the process of monitoring and analyzing all events across an organization’s systems and acting on them when necessary in real time.
The act of making an intuition-based decision appear to be data-based.
Analysis that can attribute sales, show average order value, or the lifetime value.
The act or method of viewing or retrieving stored data.
A graphical representation of the analyses performed by the algorithms
The act of collecting data from multiple sources for the purpose of reporting or analysis.
Data architecture and design
How enterprise data is structured. The actual structure or design varies depending on the eventual end result required. Data architecture has three stages or processes: the conceptual representation of business entities, the logical representation of the relationships among those entities, and the physical construction of the system to support the functionality.
A digital collection of data and the structure around which the data is organized. The data is typically entered into and accessed via a database management system (DBMS).
Database administrator (DBA)
A person, often certified, who is responsible for supporting and maintaining the integrity of the structure and content of a database.
Database as a service (DaaS)
A database hosted in the cloud and sold on a metered basis. Examples include Heroku Postgres and Amazon Relational Database Service.
Database management system (DBMS)
Software that collects and provides access to data in a structured format.
A physical facility that houses a large number of servers and data storage devices. Data centers might belong to a single organization or sell their services to many organizations.
The act of reviewing and revising data to remove duplicate entries, correct misspellings, add missing data, and provide more consistency.
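A minimal sketch of such a cleansing pass, assuming records are simple dictionaries with hypothetical `name` and `city` fields: it normalizes whitespace and capitalization, fills in a missing city with a placeholder, and drops duplicates.

```python
def cleanse(records):
    """Normalize names and cities, fill missing cities, drop duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Collapse repeated whitespace and apply consistent capitalization.
        name = " ".join((record.get("name") or "").split()).title()
        # Fill in a placeholder when the city is missing or blank.
        city = (record.get("city") or "").strip().title() or "Unknown"
        key = (name, city)
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


raw_records = [
    {"name": "  ada   lovelace", "city": "london"},
    {"name": "Ada Lovelace", "city": "London "},
    {"name": "alan turing", "city": None},
]
clean_records = cleanse(raw_records)
```

Real cleansing rules depend heavily on the data set; spelling correction, for instance, usually needs a reference list and is left out here.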
Any process that captures any type of data.
A person responsible for the database structure and the technical environment, including the storage of data.
Data-directed decision making
Using data to support making crucial decisions.
The data that a person creates as a byproduct of a common activity–for example, a cell call log or web search history.
A means for a person to receive a stream of data. Examples of data feed mechanisms include RSS or Twitter.
A set of processes or rules that ensure the integrity of the data and that data management best practices are met.
The process of combining data from different sources and presenting it in a single view.
The measure of trust an organization has in the accuracy, completeness, timeliness, and validity of the data.
The access layer of a data warehouse used to provide data to users.
The process of moving data between different storage types or formats, or between different computer systems.
The process of deriving patterns or knowledge from large data sets.
Data model, data modeling
A data model defines the structure of the data for the purpose of communicating between functional and technical people to show data needed for business processes, or for communicating a plan to develop how data is stored and accessed among application development team members.
An individual item on a graph or a chart.
The process of collecting statistics and information about data in an existing source.
The measure of data to determine its worthiness for decision making, planning, or operations.
The process of sharing information to ensure consistency between redundant sources.
The location of permanently stored data.
A recent term that has multiple definitions, but is generally accepted as a discipline that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.
A practitioner of data science.
The practice of protecting data from destruction or unauthorized access.
A collection of data, typically in tabular form.
Any provider of data–for example, a database or a data stream.
A person responsible for data stored in a data field.
A specific way of storing and organizing data.
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.
A place to store data for the purpose of reporting and analysis.
The act of removing all data that links a person to a particular piece of information.
Data relating to the characteristics of a human population.
IBM’s weather prediction service that provides weather data to organizations such as utilities, which use the data to optimize energy distribution.
A data cache that is spread across multiple systems but works as one. It is used to improve performance.
A software module designed to work with other distributed objects stored on other computers.
The execution of a process across multiple computers connected by a computer network.
Distributed File System
Systems that offer simplified, highly available access to storing, analysing and processing data
Document Store Databases
A document-oriented database that is especially designed to store, manage and retrieve documents, also known as semi structured data.
The practice of tracking and storing electronic documents and scanned images of paper documents.
An open source distributed system for performing interactive analysis on large-scale datasets. It is similar to Google’s Dremel, and is managed by Apache.
An open source search engine built on Apache Lucene.
Shows the series of steps that led to an action.
One million terabytes, or 1 billion gigabytes of information.
Data that exists outside of a system.
Extract, transform, and load (ETL)
A process used in data warehousing to prepare data for use in reporting or analytics.
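The three steps can be sketched with Python's standard library. This is an illustrative toy, assuming a small in-line CSV as the source and an in-memory SQLite table as the target; the `orders` table and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the last row is deliberately malformed.
RAW_CSV = "order_id,amount\n1, 19.99\n2,5.00\n3,not-a-number\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and skip rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), float(row["amount"])))
        except ValueError:
            continue
    return clean


def load(rows, conn):
    """Load: write the prepared rows into the reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()
```

Keeping the three stages as separate functions makes it easy to swap the source or the target without touching the transformation rules.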
Finding patterns within data without standard procedures or methods. It is a means of exploring the data and finding the data set's main characteristics.
The automatic switching to another computer or node should one fail.
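As a rough sketch of the idea, the hypothetical helper below tries each node in order and switches to the next one when a call fails. Nodes are modeled as plain Python callables, and a `ConnectionError` stands in for a node failure; real failover systems also handle health checks and state replication, which are out of scope here.

```python
def call_with_failover(nodes, request):
    """Send the request to the first node that responds successfully."""
    last_error = None
    for node in nodes:
        try:
            return node(request)
        except ConnectionError as error:
            last_error = error  # this node failed; try the next one
    raise RuntimeError("all nodes failed") from last_error


def primary(request):
    # Simulated outage on the primary node.
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


response = call_with_failover([primary, standby], "GET /status")
```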
Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
The performing of computing functions using resources from multiple distributed systems. Grid computing typically involves large files and is most often used for multiple applications. The systems that comprise a grid computing network do not have to be similar in design or in the same geographic location.
Databases that use graph structures (a finite set of ordered pairs or certain entities), with edges, properties and nodes for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighbouring element.
An open source software library project administered by the Apache Software Foundation. Apache defines Hadoop as “a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.”
Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms. It’s a Top Level Project under the Apache Software Foundation.
A software/hardware in-memory computing platform from SAP designed for high-volume transactions and real-time analytics.
HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.
HDFS (Hadoop Distributed File System)
HDFS (Hadoop Distributed File System), the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
Hue (Hadoop User Experience) is an open source web-based interface for making it easier to use Apache Hadoop. It features a file browser for HDFS, an Oozie application for creating workflows and coordinators, a job designer/browser for MapReduce, a Hive and Impala UI, a shell, a collection of Hadoop APIs and more.
Impala (By Cloudera) provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase using the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The integration of data analytics into the data warehouse.
Any database system that relies on memory for data storage.
In-memory data grid (IMDG)
The storage of data in memory across multiple servers for the purpose of greater scalability and faster access or analytics.
Internet of Things
Ordinary devices that are connected to the internet anytime and anywhere via sensors.
Kafka (developed by LinkedIn) is a distributed publish-subscribe messaging system capable of handling all the data-flow activity of a consumer website and processing that data. This type of data (page views, searches, and other user actions) is a key ingredient in the current social web.
Key Value Stores
Key value stores allow the application to store its data in a schema-less way. The data could be stored in a datatype of a programming language or an object. Because of this, there is no need for a fixed data model.
They store data with a primary key, a unique identifier for each record, which makes it easy and fast to look up. The data stored in a key-value store is normally some kind of primitive of the programming language.
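The access pattern can be sketched with a thin wrapper around a Python dictionary. The `KeyValueStore` class and its method names are hypothetical, and the sketch ignores persistence and distribution, which real key-value stores provide.

```python
class KeyValueStore:
    """A toy in-memory key-value store with no fixed schema."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store any value under a unique key, replacing an older value."""
        self._data[key] = value

    def get(self, key, default=None):
        """Look up a value directly by its key."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key if it exists."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42", {"name": "Ada", "plan": "pro"})  # a structured value
store.put("visits:42", 17)                            # a primitive value
plan = store.get("user:42")["plan"]
```

Because values are opaque to the store, two keys can hold values of entirely different shapes, which is what "schema-less" means in practice.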
Any delay in a response or delivery of data from one point to another.
As described by World Wide Web inventor Tim Berners-Lee, “Cherry-picking common attributes or languages to identify connections or relationships between disparate sources of data.”
The process of distributing workload across a computer network or computer cluster to optimize performance.
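One simple distribution strategy is round-robin, sketched below. Round-robin is only one of several possible policies and is chosen here for brevity; the server names and the `assign` helper are hypothetical.

```python
from itertools import cycle


def assign(requests, servers):
    """Assign each request to the next server in a repeating rotation."""
    rotation = cycle(servers)
    return [(request, next(rotation)) for request in requests]


servers = ["server-a", "server-b", "server-c"]
requests = ["r1", "r2", "r3", "r4", "r5"]
assignments = assign(requests, servers)
```

Round-robin spreads requests evenly by count, but it does not account for how expensive each request is; policies such as least-connections address that at the cost of tracking server state.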
Location analytics brings mapping and map-driven analytics to enterprise business systems and data warehouses. It allows you to associate geospatial information with datasets.
Data that describes a geographic location.
A file that a computer, network, or application creates automatically to record events that occur during operation–for example, the time a file is accessed.
Any data that is automatically created from a computer process, application, or other non-human source.
Two or more machines that are communicating with each other
The use of algorithms to allow a computer to analyze data for the purpose of “learning” what action to take when a specific pattern or event occurs.
MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The “Map” function divides a query into multiple parts and processes data at the node level. The “Reduce” function aggregates the results of the “Map” function to determine the “answer” to the query.
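The two phases can be illustrated with the classic word-count example in plain Python. This is a single-process sketch of the programming model, not of Hadoop itself: the chunks stand in for data held on different nodes, and the `shuffle` step (grouping values by key between the two phases) is made explicit.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of data processed on a separate node.
chunks = ["big data big ideas", "data beats opinions"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
```

Because each chunk is mapped independently, the map step can run on many machines at once; only the grouped results need to be brought together for the reduce step.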
The process of combining different datasets within a single application to enhance output–for example, combining demographic data with real estate listings.
Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model.
Data about data; gives information about what the data is about.
MongoDB is an open source, document-oriented NoSQL database. It stores data structures as JSON-like documents with a dynamic schema (a format MongoDB calls BSON), which makes integrating the data into certain applications easier and faster.
A database optimized to work in a massively parallel processing environment.
A database optimized for data online analytical processing (OLAP) applications and for data warehousing.
They are a type of NoSQL, multidimensional database that understands three-dimensional data directly. They are primarily giant strings that are well suited to manipulating HTML and XML strings directly.
Viewing relationships among the nodes in terms of the network or graph theory, meaning analysing connections between nodes in a network and the strength of the ties.
An elegant, well-defined database system that is easier to learn and better than SQL. It is even newer than NoSQL
NoSQL (commonly interpreted as “not only SQL“) is a broad class of database management systems identified by non-adherence to the widely used relational database management system model. NoSQL databases are not built primarily on tables, and generally do not use SQL for data manipulation.
They store data in the form of objects, as used by object-oriented programming. They are different from relational or graph databases, and most of them offer a query language that allows objects to be found with a declarative programming approach.
Object-based Image Analysis
Analysing digital images can be performed with data from individual pixels, whereas object-based image analysis uses data from a selection of related pixels, called objects or image objects.
Online analytical processing (OLAP)
The process of analyzing multidimensional data using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability for users to select subsets and view them from different perspectives).
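The three operations can be sketched on a tiny, made-up fact table. The sales figures, the dimension order (region, product, quarter) and the helper names are all hypothetical; real OLAP engines precompute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical facts: (region, product, quarter, amount).
REGION, PRODUCT, QUARTER, AMOUNT = 0, 1, 2, 3
sales = [
    ("North", "Widget", "Q1", 100),
    ("North", "Gadget", "Q1", 150),
    ("North", "Widget", "Q2", 120),
    ("South", "Widget", "Q1", 80),
    ("South", "Gadget", "Q2", 60),
]


def consolidate(facts, dimension):
    """Aggregate amounts along one dimension."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[AMOUNT]
    return dict(totals)


def select(facts, dimension, value):
    """Keep only the facts that match one value of a dimension."""
    return [fact for fact in facts if fact[dimension] == value]


# Consolidation: total sales per region.
by_region = consolidate(sales, REGION)

# Drill-down: open up the "North" total to see the products behind it.
north_by_product = consolidate(select(sales, REGION, "North"), PRODUCT)

# Slice: restrict to one quarter, then view that subset by region.
q1_by_region = consolidate(select(sales, QUARTER, "Q1"), REGION)
```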
Online transactional processing (OLTP)
The process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it.
The open source version of Google’s BigQuery Java code. It is being integrated with Apache Drill.
Open Data Center Alliance (ODCA)
A consortium of global IT organizations whose goal is to speed the migration of cloud computing.
Operational data store (ODS)
A location to gather and store data from multiple sources so that more operations can be performed on it before sending to the data warehouse for reporting.
Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as Map Reduce, Pig and Hive — then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
Parallel data analysis
Breaking up an analytical problem into smaller components and running algorithms on each of those components at the same time. Parallel data analysis can occur within the same system or across multiple systems.
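A minimal sketch of the pattern within a single system, using Python's standard `concurrent.futures` module: the data is split into chunks, each chunk is summed by a separate worker, and the partial results are combined. The chunk size and worker count are arbitrary choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """The per-component analysis: here, simply a sum."""
    return sum(chunk)


values = list(range(1, 101))

# Run the same analysis on every chunk concurrently, then combine.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunked(values, 25)))

total = sum(partials)
```

Threads keep the sketch simple, but in CPython they do not speed up CPU-bound work; a process pool or a cluster framework would be the usual choice for real workloads.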
Parallel method invocation (PMI)
Allows programming code to call multiple functions in parallel.
The ability to execute multiple tasks at the same time.
A query that is executed over multiple system threads for faster performance.
The classification or labeling of an identified pattern in the machine learning process.
Pentaho offers a suite of open source Business Intelligence (BI) products called Pentaho Business Analytics providing data integration, OLAP services, reporting, dashboarding, data mining and ETL capabilities
Approximately one million gigabytes: 1,000 terabytes in decimal units, or 1,024 terabytes in binary units.
Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
Using statistical functions on one or more datasets to predict trends or future events.
The process of developing a model that will most likely predict a trend or outcome.
Public information or data sets that were created with public funding
Asking for information to answer a certain question
The process of analyzing a search query for the purpose of optimizing it for the best possible result.
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.
Combining several data sets to find a certain person within anonymized data
Data that is created, processed, stored, analysed and visualized within milliseconds
An algorithm that analyzes a customer’s purchases and actions on an e-commerce site and then uses that data to recommend complementary products.
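One very simple way to do this is to count which products appear together in past orders. The sketch below uses that co-occurrence approach purely as an illustration; production systems typically use more sophisticated models. The order data and the `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical purchase history: each set is one customer's order.
orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop", "mouse"},
    {"phone", "case"},
]


def recommend(item, baskets, top_n=2):
    """Suggest the products most often bought together with `item`."""
    bought_with = Counter()
    for basket in baskets:
        if item in basket:
            bought_with.update(basket - {item})
    return [product for product, _ in bought_with.most_common(top_n)]


suggestions = recommend("laptop", orders)
```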
Data that describes an object and its properties. The object may be physical or virtual.
The application of statistical methods on one or more datasets to determine the likely risk of a project, action, or decision.
The process of determining the main cause of an event or problem.
Finding the optimized routing using many different variables for a certain means of transport in order to decrease fuel costs and increase efficiency.
The ability of a system or process to maintain acceptable performance levels as workload or scope increases.
The structure that defines the organization of data in a database system.
Aggregated data about search terms used over time.
Data that is not structured by a formal data model, but provides other means of describing the data and hierarchies.
The application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.
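A deliberately naive sketch of the idea scores a comment by counting words from small positive and negative word lists. The word lists are made up for illustration, and real sentiment analysis relies on much larger lexicons or trained models that handle negation and context.

```python
# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"bad", "hate", "terrible"}


def sentiment_score(comment):
    """Return positive-word count minus negative-word count."""
    words = [word.strip(".,!?").lower() for word in comment.split()]
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)


scores = [
    sentiment_score("I love this great phone!"),
    sentiment_score("Terrible battery, bad screen."),
    sentiment_score("It arrived on Tuesday."),
]
```

A score above zero suggests a favorable comment, below zero an unfavorable one, and zero a neutral or unrecognized one.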
A physical or virtual computer that serves requests for a software application and delivers those requests over a network.
It refers to analysing spatial data, such as geographic data or topological data, to identify and understand patterns and regularities within data distributed in geographic space.
A programming language for retrieving data from a relational database
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
Storm is a free, open source system for distributed real-time computation that originated at Twitter. Storm makes it easy to reliably process streams of unstructured data, doing for real-time processing what Hadoop did for batch processing.
Software as a service (SaaS)
Application software that is used over the web by a thin client or web browser. Salesforce is a well-known example of SaaS.
Any means of storing data persistently.
An open-source distributed computation system designed for processing multiple data streams in real time.
Data that is organized by a predetermined structure.
Structured Query Language (SQL)
A programming language designed specifically to manage and retrieve data from a relational database system.
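As a short illustration, the sketch below runs a few SQL statements against an in-memory SQLite database through Python's standard `sqlite3` module. The `customers` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Manage data: define a table and insert rows.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
)

# Retrieve data: declare what is wanted, not how to find it.
cursor = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("London",)
)
london_customers = [row[0] for row in cursor]
```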
The application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.
Data that changes unpredictably. Examples include accounts payable and receivable data, or data about product shipments.
“Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.”
Data that has no identifiable structure–for example, the text of email messages.
All that available data will create a lot of value for organizations, societies and consumers. Big data means big business and every industry will reap the benefits from big data.
The amount of data, ranging from megabytes to brontobytes
A visual abstraction of data designed for the purpose of deriving meaning or communicating information more effectively.
WebHDFS
Apache Hadoop provides native libraries for accessing HDFS. However, many users prefer to access HDFS remotely without the heavy client-side native libraries. For example, some applications need to load data in and out of the cluster, or to interact with the HDFS data externally. WebHDFS addresses these issues by providing a fully functional HTTP REST API to access HDFS.
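Because the API is plain HTTP, a request is just a URL. The sketch below only builds such a URL and makes no network call. It follows the documented `/webhdfs/v1/<path>?op=<operation>` layout; the host name, the port, the file path and the `webhdfs_url` helper are assumptions made for illustration.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one operation on an HDFS path."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and file path.
open_url = webhdfs_url("namenode.example.com", 9870, "/data/logs/app.log", "OPEN")
list_url = webhdfs_url("namenode.example.com", 9870, "/data/logs", "LISTSTATUS")
```

Any HTTP client can then issue the request, which is why no Hadoop client libraries are needed on the calling machine.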
Real-time weather data is now widely available for organizations to use in a variety of ways. For example, a logistics company can monitor local weather conditions to optimize the transport of goods. A utility company can adjust energy distribution in real time.
XML Databases allow data to be stored in XML format. XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported and serialized into any format needed.
ZooKeeper is a software project of the Apache Software Foundation: an open source service that provides centralized configuration and name registration for large distributed systems. ZooKeeper is a subproject of Hadoop.